KiCad PCB EDA Suite
UTF8 Class Reference

Class UTF8 is an 8-bit std::string that is assuredly encoded in UTF8. It supplies special conversion support to and from wxString, and iteration over Unicode characters.

#include <utf8.h>

Classes

class  uni_iter
 class uni_iter is a non-mutating iterator that walks through Unicode code points in the UTF8 encoded string.
 

Public Member Functions

 UTF8 (const wxString &o)
 
 UTF8 (const char *txt)
 This is a constructor for which you could end up with non-UTF8 encoding, but that would be your fault.
 
 UTF8 (const wchar_t *txt)
 For use with _() function on wx 2.8.
 
 UTF8 (const std::string &o)
 
 UTF8 ()
 
 ~UTF8 ()
 
UTF8 & operator= (const wxString &o)
 
UTF8 & operator= (const std::string &o)
 
UTF8 & operator= (const char *s)
 
UTF8 & operator= (char c)
 
UTF8 substr (size_t pos=0, size_t len=npos) const
 
 operator wxString () const
 
 operator char * () const
 This one is not in std::string, and one wonders why...
 
uni_iter ubegin () const
 Function ubegin returns a uni_iter initialized to the start of "this" UTF8 byte sequence.
 
uni_iter uend () const
 Function uend returns a uni_iter initialized to the end of "this" UTF8 byte sequence.
 

Static Public Member Functions

static int uni_forward (const unsigned char *aSequence, unsigned *aResult=NULL)
 Function uni_forward advances over a single UTF8 encoded multibyte character, capturing the unicode character as it goes, and returning the number of bytes consumed.
 

Detailed Description

Class UTF8 is an 8-bit std::string that is assuredly encoded in UTF8. It supplies special conversion support to and from wxString, and iteration over Unicode characters.

I've been careful to supply only conversion facilities and not to try to duplicate wxString with many member functions. In the end it is meant to be a std::string. There are multiple ways to build text in a std::string without needing many member functions.
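As a rough illustration of that point, the sketch below builds text using only standard std::string facilities (std::ostringstream, append(), operator+=). The helper names build_label and join_parts are hypothetical and are not part of KiCad; the resulting std::string could then be passed to the UTF8( const std::string& ) constructor documented on this page.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats "name=value" using stream insertion.
std::string build_label( const std::string& name, int value )
{
    std::ostringstream text;
    text << name << '=' << value;
    return text.str();
}

// Hypothetical helper: joins two parts using append() and operator+=.
std::string join_parts( const std::string& a, const std::string& b )
{
    std::string out;
    out.append( a );
    out += '/';
    out.append( b );
    return out;
}
```

Neither helper checks encoding: the result is valid UTF8 only if the inputs already are.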

Because this class uses no virtual functions, it should be possible to cast any std::string to a UTF8 with a reference cast, (UTF8 &), without the cast causing any construction or copying. Be sure the source std::string holds UTF8 encoded text before you do that.

Author
Dick Hollenbeck

Definition at line 53 of file utf8.h.

Constructor & Destructor Documentation

UTF8::UTF8 ( const wxString &  o)

Definition at line 40 of file utf8.cpp.

UTF8::UTF8( const wxString& o ) :
    std::string( (const char*) o.utf8_str() )
{
}
UTF8::UTF8 ( const char *  txt)
inline

This is a constructor for which you could end up with non-UTF8 encoding, but that would be your fault.

Definition at line 61 of file utf8.h.

UTF8( const char* txt ) :
    std::string( txt )
{
}
UTF8::UTF8 ( const wchar_t *  txt)

For use with _() function on wx 2.8.

Note that _() on wx >= 2.9 returns wxString, not wchar_t* as on 2.8.

Definition at line 166 of file utf8.cpp.

UTF8::UTF8( const wchar_t* txt ) :
    // size initial string safely large enough, then shrink to known size later.
    std::string( wcslen( txt ) * 4, 0 )
{
    /*

        "this" string was sized to hold the worst case UTF8 encoded byte
        sequence, and was initialized with all nul bytes. Overwrite some of
        those nuls, then resize, shrinking down to actual size.

        Use the wx 2.8 function, not new FromWChar(). It knows about wchar_t
        possibly being 16 bits wide on Windows and holding UTF16 input.

    */

    int sz = wxConvUTF8.WC2MB( (char*) data(), txt, size() );

    resize( sz );
}
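The size-then-shrink idiom used by this constructor can be sketched without wxWidgets. The example below is not the KiCad implementation: it replaces wxConvUTF8.WC2MB() with a hand-written encoder over a std::u32string, and the name encode_utf8 is hypothetical. It is meant only to show the allocation pattern: reserve 4 bytes per input character, write into the buffer, then resize() down to the bytes actually used.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the conversion step: encodes UTF-32 code points
// as UTF-8. Assumes every input value is a valid code point <= 0x10FFFF.
std::string encode_utf8( const std::u32string& in )
{
    // Worst case is 4 bytes per code point; fill with nul bytes for now.
    std::string out( in.size() * 4, '\0' );
    std::size_t n = 0;

    for( char32_t ch : in )
    {
        if( ch < 0x80 )
        {
            out[n++] = static_cast<char>( ch );
        }
        else if( ch < 0x800 )
        {
            out[n++] = static_cast<char>( 0xC0 | ( ch >> 6 ) );
            out[n++] = static_cast<char>( 0x80 | ( ch & 0x3F ) );
        }
        else if( ch < 0x10000 )
        {
            out[n++] = static_cast<char>( 0xE0 | ( ch >> 12 ) );
            out[n++] = static_cast<char>( 0x80 | ( ( ch >> 6 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( ch & 0x3F ) );
        }
        else
        {
            out[n++] = static_cast<char>( 0xF0 | ( ch >> 18 ) );
            out[n++] = static_cast<char>( 0x80 | ( ( ch >> 12 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( ( ch >> 6 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( ch & 0x3F ) );
        }
    }

    out.resize( n );    // shrink down to the bytes actually written
    return out;
}
```

Sizing for the worst case up front avoids a separate pass to measure the output, at the cost of a temporarily oversized buffer.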
UTF8::UTF8 ( const std::string &  o)
inline

Definition at line 70 of file utf8.h.

UTF8( const std::string& o ) :
    std::string( o )
{
}
UTF8::UTF8 ( )
inline

Definition at line 75 of file utf8.h.

UTF8() :
    std::string()
{
}
UTF8::~UTF8 ( )
inline

Definition at line 80 of file utf8.h.

~UTF8()
{
}

Member Function Documentation

UTF8::operator char * ( ) const
inline

std::string does not provide this conversion, and one wonders why. Whatever the reason, it may be solid enough to justify removing this operator as well.

Definition at line 113 of file utf8.h.

operator char* () const
{
    return (char*) c_str();
}
UTF8::operator wxString ( ) const

Definition at line 46 of file utf8.cpp.

UTF8::operator wxString() const
{
    return wxString( c_str(), wxConvUTF8 );
}
UTF8 & UTF8::operator= ( const wxString &  o)

Definition at line 52 of file utf8.cpp.

UTF8& UTF8::operator=( const wxString& o )
{
    std::string::operator=( (const char*) o.utf8_str() );
    return *this;
}
UTF8& UTF8::operator= ( const std::string &  o)
inline

Definition at line 86 of file utf8.h.

UTF8& operator=( const std::string& o )
{
    std::string::operator=( o );
    return *this;
}
UTF8& UTF8::operator= ( const char *  s)
inline

Definition at line 92 of file utf8.h.

UTF8& operator=( const char* s )
{
    std::string::operator=( s );
    return *this;
}
UTF8& UTF8::operator= ( char  c)
inline

Definition at line 98 of file utf8.h.

UTF8& operator=( char c )
{
    std::string::operator=( c );
    return *this;
}
UTF8 UTF8::substr ( size_t  pos = 0, size_t  len = npos ) const
inline

Definition at line 104 of file utf8.h.

Referenced by KIGFX::STROKE_FONT::Draw(), LIB_ID::Parse(), and LIB_ID::SetLibItemName().

UTF8 substr( size_t pos = 0, size_t len = npos ) const
{
    return std::string::substr( pos, len );
}
uni_iter UTF8::ubegin ( ) const
inline

Function ubegin returns a uni_iter initialized to the start of "this" UTF8 byte sequence.

Definition at line 216 of file utf8.h.

Referenced by KIGFX::STROKE_FONT::ComputeStringBoundaryLimits(), and KIGFX::STROKE_FONT::drawSingleLineText().

uni_iter ubegin() const
{
    return uni_iter( data() );
}
uni_iter UTF8::uend ( ) const
inline

Function uend returns a uni_iter initialized to the end of "this" UTF8 byte sequence.

Definition at line 225 of file utf8.h.

Referenced by KIGFX::STROKE_FONT::ComputeStringBoundaryLimits(), and KIGFX::STROKE_FONT::drawSingleLineText().

uni_iter uend() const
{
    return uni_iter( data() + size() );
}
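ubegin() and uend() delimit a half-open range that is walked with the iterator's operator*() and operator++(), both of which are cross-referenced from uni_forward on this page. The sketch below illustrates that pattern with a hypothetical stand-in, CodePointIter, so that it compiles without the KiCad headers. It assumes well-formed UTF8 and performs no validation, unlike the real decoder.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a code point iterator. Assumes well-formed UTF-8
// and performs no validation.
struct CodePointIter
{
    const unsigned char* p;

    // Sequence length implied by the lead byte (1 to 4).
    static int length( unsigned char lead )
    {
        if( lead < 0x80 ) return 1;
        if( lead < 0xE0 ) return 2;
        if( lead < 0xF0 ) return 3;
        return 4;
    }

    // Decode the code point at the current position.
    unsigned operator*() const
    {
        int len = length( *p );

        if( len == 1 )
            return *p;

        // Keep the payload bits of the lead byte, then add 6 bits per
        // continuation byte.
        unsigned ch = *p & ( 0xFF >> ( len + 1 ) );

        for( int i = 1; i < len; ++i )
            ch = ( ch << 6 ) | ( p[i] & 0x3F );

        return ch;
    }

    CodePointIter& operator++() { p += length( *p ); return *this; }
    bool operator!=( const CodePointIter& o ) const { return p != o.p; }
};

// Collect every code point in the half-open range [start, end).
std::vector<unsigned> code_points( const std::string& s )
{
    const unsigned char* base = reinterpret_cast<const unsigned char*>( s.data() );
    std::vector<unsigned> out;

    for( CodePointIter it{ base }, end{ base + s.size() }; it != end; ++it )
        out.push_back( *it );

    return out;
}
```

The end iterator points one byte past the last encoded character, so the loop terminates exactly when the whole byte sequence has been consumed.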
int UTF8::uni_forward ( const unsigned char *  aSequence, unsigned *  aResult = NULL )
static

Function uni_forward advances over a single UTF8 encoded multibyte character, capturing the unicode character as it goes, and returning the number of bytes consumed.

Parameters
aSequence  is the UTF8 byte sequence, must be aligned on start of character.
aResult  is where to put the unicode character, and may be NULL if no interest.
Returns
int - the count of bytes consumed.
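The calling convention described above (bytes consumed as the return value, an optional output pointer, an exception on malformed input) can be sketched as follows. forward_sketch and count_chars are hypothetical names, and std::runtime_error stands in for THROW_IO_ERROR. This is a simplified sketch, not the KiCad implementation: it checks start and continuation bytes but not the overlong or out-of-range second-byte cases.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in with the same calling convention as uni_forward:
// returns the bytes consumed and stores the code point in *result if non-null.
int forward_sketch( const unsigned char* seq, unsigned* result = nullptr )
{
    unsigned ch = seq[0];
    int      len;

    if( ch < 0x80 )                     { len = 1; }
    else if( ch >= 0xC2 && ch <= 0xDF ) { len = 2; ch &= 0x1F; }
    else if( ch >= 0xE0 && ch <= 0xEF ) { len = 3; ch &= 0x0F; }
    else if( ch >= 0xF0 && ch <= 0xF4 ) { len = 4; ch &= 0x07; }
    else
        throw std::runtime_error( "invalid start byte" );

    for( int i = 1; i < len; ++i )
    {
        if( ( seq[i] & 0xC0 ) != 0x80 )
            throw std::runtime_error( "invalid continuation byte" );

        ch = ( ch << 6 ) | ( seq[i] & 0x3F );
    }

    if( result )
        *result = ch;

    return len;
}

// Count characters by advancing by the number of bytes each call consumes.
int count_chars( const std::string& s )
{
    const unsigned char* p   = reinterpret_cast<const unsigned char*>( s.data() );
    const unsigned char* end = p + s.size();
    int                  n   = 0;

    while( p < end )
    {
        p += forward_sketch( p );   // result pointer omitted: no interest
        ++n;
    }

    return n;
}
```

A truncated sequence at the end of a std::string is reported as an invalid continuation byte, because the byte read after the last character is the string's nul terminator.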

Definition at line 66 of file utf8.cpp.

References THROW_IO_ERROR.

Referenced by UTF8::uni_iter::operator*(), and UTF8::uni_iter::operator++().

int UTF8::uni_forward( const unsigned char* aSequence, unsigned* aResult )
{
    unsigned ch = *aSequence;

    if( ch < 0x80 )
    {
        if( aResult )
            *aResult = ch;
        return 1;
    }

    const unsigned char* s = aSequence;

    static const unsigned char utf8_len[] = {
        // Map encoded prefix byte to sequence length. Zero means
        // illegal prefix. See RFC 3629 for details
        /*
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 00-0F
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 70-7F
        */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 80-8F
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // B0-BF
        0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // C0-C1 + C2-CF
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // D0-DF
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // E0-EF
        4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0  // F0-F4 + F5-FF
    };

    int len = utf8_len[ *s - 0x80 /* top half of table is missing */ ];

    switch( len )
    {
    default:
    case 0:
        THROW_IO_ERROR( "invalid start byte" );
        break;

    case 2:
        if( ( s[1] & 0xc0 ) != 0x80 )
        {
            THROW_IO_ERROR( "invalid continuation byte" );
        }

        ch = ((s[0] & 0x1f) << 6) +
             ((s[1] & 0x3f) << 0);

        assert( ch > 0x007F && ch <= 0x07FF );
        break;

    case 3:
        if( (s[1] & 0xc0) != 0x80 ||
            (s[2] & 0xc0) != 0x80 ||
            (s[0] == 0xE0 && s[1] < 0xA0)
            // || (s[0] == 0xED && s[1] > 0x9F)
          )
        {
            THROW_IO_ERROR( "invalid continuation byte" );
        }

        ch = ((s[0] & 0x0f) << 12) +
             ((s[1] & 0x3f) << 6 ) +
             ((s[2] & 0x3f) << 0 );

        assert( ch > 0x07FF && ch <= 0xFFFF );
        break;

    case 4:
        if( (s[1] & 0xc0) != 0x80 ||
            (s[2] & 0xc0) != 0x80 ||
            (s[3] & 0xc0) != 0x80 ||
            (s[0] == 0xF0 && s[1] < 0x90) ||
            (s[0] == 0xF4 && s[1] > 0x8F) )
        {
            THROW_IO_ERROR( "invalid continuation byte" );
        }

        ch = ((s[0] & 0x7) << 18) +
             ((s[1] & 0x3f) << 12) +
             ((s[2] & 0x3f) << 6 ) +
             ((s[3] & 0x3f) << 0 );

        assert( ch > 0xFFFF && ch <= 0x10ffff );
        break;
    }

    if( aResult )
        *aResult = ch;

    return len;
}

THROW_IO_ERROR(x) is a macro defined at line 60 of file utf8.cpp.

The documentation for this class was generated from the following files: