KiCad PCB EDA Suite
UTF8 Class Reference

Class UTF8 is an 8 bit string that is assuredly encoded in UTF8, and supplies special conversion support to and from wxString, to and from std::string, and has non-mutating iteration over unicode characters.

#include <utf8.h>

Classes

class  uni_iter
 class uni_iter is a non-mutating iterator that walks through unicode code points in the UTF8 encoded string.
 

Public Member Functions

 UTF8 (const wxString &o)
 
 UTF8 (const char *txt)
 This is a constructor for which you could end up with non-UTF8 encoding, but that would be your fault.
 
 UTF8 (const wchar_t *txt)
 For use with _() function on wx 2.8.
 
 UTF8 (const std::string &o)
 
 UTF8 ()
 
 ~UTF8 ()
 
const char * c_str () const
 
bool empty () const
 
std::string::size_type find (char c) const
 
std::string::size_type find (char c, size_t &s) const
 
void clear ()
 
std::string::size_type length () const
 
std::string::size_type size () const
 
int compare (const std::string &s) const
 
bool operator== (const UTF8 &rhs) const
 
bool operator== (const std::string &rhs) const
 
bool operator== (const char *s) const
 
std::string::size_type find_first_of (const std::string &str, std::string::size_type pos=0) const
 
UTF8 & operator+= (const UTF8 &str)
 
UTF8 & operator+= (char ch)
 
UTF8 & operator+= (const char *s)
 
UTF8 & operator= (const wxString &o)
 
UTF8 & operator= (const std::string &o)
 
UTF8 & operator= (const char *s)
 
UTF8 & operator= (char c)
 
std::string substr (size_t pos=0, size_t len=npos) const
 
 operator const std::string & () const
 
wxString wx_str () const
 
 operator wxString () const
 
std::string::const_iterator begin () const
 
std::string::const_iterator end () const
 
uni_iter ubegin () const
 Function ubegin returns a uni_iter initialized to the start of "this" UTF8 byte sequence.
 
uni_iter uend () const
 Function uend returns a uni_iter initialized to the end of "this" UTF8 byte sequence.
 

Static Public Member Functions

static int uni_forward (const unsigned char *aSequence, unsigned *aResult=NULL)
 Function uni_forward advances over a single UTF8 encoded multibyte character, capturing the unicode character as it goes, and returning the number of bytes consumed.
 

Static Public Attributes

static const std::string::size_type npos = std::string::npos
 

Protected Attributes

std::string m_s
 

Detailed Description

Class UTF8 is an 8 bit string that is assuredly encoded in UTF8, and supplies special conversion support to and from wxString, to and from std::string, and has non-mutating iteration over unicode characters.

I've been careful to supply only conversion facilities and not to duplicate wxString with many member functions. Text can be built into a std::string in several other ways, so this class does not need many member functions of its own.
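One standard way to build such text is std::ostringstream. The following is a minimal sketch under that assumption; makePinLabel is a hypothetical helper used only for illustration and is not part of KiCad.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper, not part of KiCad: builds "<name>-<number>" text with
// std::ostringstream. The resulting std::string could then be passed to the
// UTF8( const std::string& ) constructor.
std::string makePinLabel( const std::string& aName, int aNumber )
{
    assert( aNumber >= 0 );     // illustration only: numbers are non-negative here

    std::ostringstream os;
    os << aName << '-' << aNumber;
    return os.str();
}
```

The stream never re-encodes the bytes of aName, so UTF8 encoded input stays UTF8 encoded in the result.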

Because this class uses no virtuals, it should be possible to cast any std::string into a UTF8 with a cast of the form (UTF8 &), without the cast causing any construction or copying. Be sure the source std::string holds UTF8 encoded text before you do that.

Author
Dick Hollenbeck

Definition at line 73 of file utf8.h.

Constructor & Destructor Documentation

UTF8::UTF8 ( const wxString &  o)

Definition at line 42 of file utf8.cpp.

42  :
43  m_s( (const char*) o.utf8_str() )
44 {
45 }
std::string m_s
Definition: utf8.h:305
UTF8::UTF8 ( const char *  txt)
inline

This is a constructor for which you could end up with non-UTF8 encoding, but that would be your fault.

Definition at line 81 of file utf8.h.

References c_str(), and MAYBE_VERIFY_UTF8.

81  :
82  m_s( txt )
83  {
84  MAYBE_VERIFY_UTF8( c_str() );
85  }
const char * c_str() const
Definition: utf8.h:107
#define MAYBE_VERIFY_UTF8(x)
Definition: utf8.h:47
UTF8::UTF8 ( const wchar_t *  txt)

For use with _() function on wx 2.8.

Note that on wx >= 2.9, _() returns a wxString, not a wchar_t* as it does on 2.8.

Definition at line 200 of file utf8.cpp.

References m_s.

200  :
201  // size initial string safely large enough, then shrink to known size later.
202  m_s( wcslen( txt ) * 4, 0 )
203 {
204  /*
205 
206  "this" string was sized to hold the worst case UTF8 encoded byte
207  sequence, and was initialized with all nul bytes. Overwrite some of
208  those nuls, then resize, shrinking down to actual size.
209 
210  Use the wx 2.8 function, not new FromWChar(). It knows about wchar_t
211  possibly being 16 bits wide on Windows and holding UTF16 input.
212 
213  */
214 
215  int sz = wxConvUTF8.WC2MB( (char*) m_s.data(), txt, m_s.size() );
216 
217  m_s.resize( sz );
218 }
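The size-then-shrink pattern in this constructor works because UTF8 needs at most 4 bytes per code point. Below is a minimal stand-alone sketch of the same idea. It assumes UTF-32 input in a std::u32string rather than wchar_t, and uses a hand-written encoder in place of wxConvUTF8; toUtf8 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the wchar_t constructor: encodes UTF-32 text as UTF8.
std::string toUtf8( const std::u32string& aText )
{
    // Worst case is 4 bytes per code point; allocate that, all nul bytes.
    std::string out( aText.size() * 4, '\0' );
    std::size_t n = 0;      // number of bytes actually written

    for( char32_t cp : aText )
    {
        // Assumption: input holds valid scalar values (no surrogates).
        assert( cp <= 0x10FFFF && !( cp >= 0xD800 && cp <= 0xDFFF ) );

        if( cp < 0x80 )
        {
            out[n++] = static_cast<char>( cp );
        }
        else if( cp < 0x800 )
        {
            out[n++] = static_cast<char>( 0xC0 | ( cp >> 6 ) );
            out[n++] = static_cast<char>( 0x80 | ( cp & 0x3F ) );
        }
        else if( cp < 0x10000 )
        {
            out[n++] = static_cast<char>( 0xE0 | ( cp >> 12 ) );
            out[n++] = static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( cp & 0x3F ) );
        }
        else
        {
            out[n++] = static_cast<char>( 0xF0 | ( cp >> 18 ) );
            out[n++] = static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
            out[n++] = static_cast<char>( 0x80 | ( cp & 0x3F ) );
        }
    }

    out.resize( n );        // shrink down to the bytes actually used
    return out;
}
```

Allocating once and shrinking avoids a separate pass to measure the encoded length, at the cost of a temporarily oversized buffer.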
UTF8::UTF8 ( const std::string &  o)
inline

Definition at line 91 of file utf8.h.

References c_str(), and MAYBE_VERIFY_UTF8.

91  :
92  m_s( o )
93  {
94  MAYBE_VERIFY_UTF8( c_str() );
95  }
UTF8::UTF8 ( )
inline

Definition at line 97 of file utf8.h.

98  {
99  }
UTF8::~UTF8 ( )
inline

Definition at line 101 of file utf8.h.

102  {
103  }

Member Function Documentation

std::string::const_iterator UTF8::begin ( ) const
inline

Definition at line 189 of file utf8.h.

References m_s.

Referenced by LIB_TABLE::FormatOptions(), and KIGFX::STROKE_FONT::linesCount().

189 { return m_s.begin(); }
void UTF8::clear ( )
inline

Definition at line 113 of file utf8.h.

References m_s.

Referenced by LIB_ID::clear().

113 { m_s.clear(); }
int UTF8::compare ( const std::string &  s) const
inline

Definition at line 116 of file utf8.h.

References m_s.

Referenced by LIB_ID::compare(), SCH_REFERENCE::CompareLibName(), and SCH_REFERENCE::CompareRef().

116 { return m_s.compare( s ); }
bool UTF8::empty ( ) const
inline
std::string::const_iterator UTF8::end ( ) const
inline

Definition at line 190 of file utf8.h.

References m_s.

Referenced by LIB_TABLE::FormatOptions(), and KIGFX::STROKE_FONT::linesCount().

190 { return m_s.end(); }
std::string::size_type UTF8::find ( char  c) const
inline

Definition at line 110 of file utf8.h.

References m_s.

Referenced by KIGFX::STROKE_FONT::Draw(), and LIB_ID::Parse().

110 { return m_s.find( c ); }
std::string::size_type UTF8::find ( char  c,
size_t &  s 
) const
inline

Definition at line 111 of file utf8.h.

References m_s.

111 { return m_s.find( c, s ); }
std::string::size_type UTF8::find_first_of ( const std::string &  str,
std::string::size_type  pos = 0 
) const
inline

Definition at line 122 of file utf8.h.

References m_s.

Referenced by okLogical(), and LIB_ID::SetLibItemName().

123  {
124  return m_s.find_first_of( str, pos );
125  }
std::string::size_type UTF8::length ( ) const
inline

Definition at line 114 of file utf8.h.

References m_s.

Referenced by PCB_EDIT_FRAME::DoGenFootprintsPositionFile(), and LIB_ID::Parse().

114 { return m_s.length(); }
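Because length() forwards to m_s.length(), it reports bytes rather than unicode characters. A minimal sketch of counting code points instead, using a hypothetical free function over std::string and assuming the input is valid UTF8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the UTF8 class: counts code points in valid
// UTF8 text by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t countCodePoints( const std::string& aUtf8 )
{
    std::size_t count = 0;

    for( unsigned char byte : aUtf8 )
    {
        if( ( byte & 0xC0 ) != 0x80 )
            ++count;
    }

    assert( count <= aUtf8.size() );    // never more code points than bytes
    return count;
}
```

For plain ASCII text the two counts agree; they diverge as soon as the text contains a multibyte sequence.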
UTF8::operator const std::string & ( ) const
inline

Definition at line 180 of file utf8.h.

References m_s.

180 { return m_s; }
UTF8::operator wxString ( ) const

Definition at line 54 of file utf8.cpp.

55 {
56  return wxString( c_str(), wxConvUTF8 );
57 }
UTF8& UTF8::operator+= ( const UTF8 &  str)
inline

Definition at line 127 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

128  {
129  m_s += str.m_s;
130  MAYBE_VERIFY_UTF8( c_str() );
131  return (UTF8&) *this;
132  }
UTF8& UTF8::operator+= ( char  ch)
inline

Definition at line 134 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

135  {
136  m_s.operator+=( ch );
137  MAYBE_VERIFY_UTF8( c_str() );
138  return (UTF8&) *this;
139  }
UTF8& UTF8::operator+= ( const char *  s)
inline

Definition at line 141 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

142  {
143  m_s.operator+=( s );
144  MAYBE_VERIFY_UTF8( c_str() );
145  return (UTF8&) *this;
146  }
UTF8 & UTF8::operator= ( const wxString &  o)

Definition at line 60 of file utf8.cpp.

References m_s.

61 {
62  m_s = (const char*) o.utf8_str();
63  return *this;
64 }
UTF8& UTF8::operator= ( const std::string &  o)
inline

Definition at line 152 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

153  {
154  m_s = o;
155  MAYBE_VERIFY_UTF8( c_str() );
156  return *this;
157  }
UTF8& UTF8::operator= ( const char *  s)
inline

Definition at line 159 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

160  {
161  m_s = s;
162  MAYBE_VERIFY_UTF8( c_str() );
163  return *this;
164  }
UTF8& UTF8::operator= ( char  c)
inline

Definition at line 166 of file utf8.h.

References c_str(), m_s, and MAYBE_VERIFY_UTF8.

167  {
168  m_s = c;
169  MAYBE_VERIFY_UTF8( c_str() );
170  return *this;
171  }
bool UTF8::operator== ( const UTF8 &  rhs) const
inline

Definition at line 118 of file utf8.h.

References m_s.

118 { return m_s == rhs.m_s; }
bool UTF8::operator== ( const std::string &  rhs) const
inline

Definition at line 119 of file utf8.h.

References m_s.

119 { return m_s == rhs; }
bool UTF8::operator== ( const char *  s) const
inline

Definition at line 120 of file utf8.h.

References m_s.

120 { return m_s == s; }
std::string::size_type UTF8::size ( ) const
inline
std::string UTF8::substr ( size_t  pos = 0,
size_t  len = npos 
) const
inline

Definition at line 175 of file utf8.h.

References m_s.

Referenced by KIGFX::STROKE_FONT::Draw(), LIB_ID::Parse(), and LIB_ID::SetLibItemName().

176  {
177  return m_s.substr( pos, len );
178  }
uni_iter UTF8::ubegin ( ) const
inline

Function ubegin returns a uni_iter initialized to the start of "this" UTF8 byte sequence.

Definition at line 278 of file utf8.h.

References m_s.

Referenced by KIGFX::STROKE_FONT::ComputeStringBoundaryLimits(), and KIGFX::STROKE_FONT::drawSingleLineText().

279  {
280  return uni_iter( m_s.data() );
281  }
uni_iter UTF8::uend ( ) const
inline

Function uend returns a uni_iter initialized to the end of "this" UTF8 byte sequence.

Definition at line 287 of file utf8.h.

References m_s.

Referenced by KIGFX::STROKE_FONT::ComputeStringBoundaryLimits(), and KIGFX::STROKE_FONT::drawSingleLineText().

288  {
289  return uni_iter( m_s.data() + m_s.size() );
290  }
int UTF8::uni_forward ( const unsigned char *  aSequence,
unsigned *  aResult = NULL 
)
static

Function uni_forward advances over a single UTF8 encoded multibyte character, capturing the unicode character as it goes, and returning the number of bytes consumed.

Parameters
aSequence is the UTF8 byte sequence, must be aligned on start of character.
aResult is where to put the unicode character, and may be NULL if no interest.
Returns
int - the count of bytes consumed.
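The contract above (decode one character, report bytes consumed) can be sketched in a self-contained form. This is a simplified stand-in, not the KiCad implementation: decodeOne and codePoints are hypothetical names, decodeOne takes an extra length bound and returns 0 on malformed input instead of throwing, and it omits the additional second-byte range checks that uni_forward performs for the E0, F0 and F4 start bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified analogue of UTF8::uni_forward(). Decodes one code
// point from aSeq, reading at most aAvail bytes. Returns the number of bytes
// consumed, or 0 if the sequence is malformed or truncated.
int decodeOne( const unsigned char* aSeq, std::size_t aAvail, unsigned* aResult = nullptr )
{
    assert( aSeq && aAvail > 0 );

    unsigned ch  = aSeq[0];
    int      len = 0;

    if( ch < 0x80 )                       { len = 1; }
    else if( ch >= 0xC2 && ch <= 0xDF )   { len = 2; ch &= 0x1F; }
    else if( ch >= 0xE0 && ch <= 0xEF )   { len = 3; ch &= 0x0F; }
    else if( ch >= 0xF0 && ch <= 0xF4 )   { len = 4; ch &= 0x07; }
    else                                  { return 0; }     // illegal start byte

    if( aAvail < static_cast<std::size_t>( len ) )
        return 0;                                           // truncated sequence

    for( int i = 1; i < len; ++i )
    {
        if( ( aSeq[i] & 0xC0 ) != 0x80 )
            return 0;                                       // not a continuation byte

        ch = ( ch << 6 ) | ( aSeq[i] & 0x3F );
    }

    if( aResult )
        *aResult = ch;

    return len;
}

// Walks a byte string one code point at a time, advancing by the consumed
// byte count. Stops at the first malformed sequence.
std::vector<unsigned> codePoints( const std::string& aText )
{
    std::vector<unsigned> result;
    const unsigned char*  p = reinterpret_cast<const unsigned char*>( aText.data() );
    std::size_t           pos = 0;

    while( pos < aText.size() )
    {
        unsigned cp = 0;
        int      n = decodeOne( p + pos, aText.size() - pos, &cp );

        if( n == 0 )
            break;

        result.push_back( cp );
        pos += n;
    }

    return result;
}
```

Advancing by the returned byte count is the same stepping scheme the references above suggest uni_iter uses in operator++().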

Definition at line 70 of file utf8.cpp.

References THROW_IO_ERROR.

Referenced by IsUTF8(), UTF8::uni_iter::operator*(), UTF8::uni_iter::operator++(), and UTF8::uni_iter::operator->().

71 {
72  unsigned ch = *aSequence;
73 
74  if( ch < 0x80 )
75  {
76  if( aResult )
77  *aResult = ch;
78  return 1;
79  }
80 
81  const unsigned char* s = aSequence;
82 
83  static const unsigned char utf8_len[] = {
84  // Map encoded prefix byte to sequence length. Zero means
85  // illegal prefix. See RFC 3629 for details
86  /*
87  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 00-0F
88  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
89  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
90  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
91  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
92  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
93  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
94  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 70-7F
95  */
96  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 80-8F
97  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
98  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
99  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // B0-BF
100  0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // C0-C1 + C2-CF
101  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // D0-DF
102  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // E0-EF
103  4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 // F0-F4 + F5-FF
104  };
105 
106  int len = utf8_len[ *s - 0x80 /* top half of table is missing */ ];
107 
108  switch( len )
109  {
110  default:
111  case 0:
112  THROW_IO_ERROR( "invalid start byte" );
113  break;
114 
115  case 2:
116  if( ( s[1] & 0xc0 ) != 0x80 )
117  {
118  THROW_IO_ERROR( "invalid continuation byte" );
119  }
120 
121  ch = ((s[0] & 0x1f) << 6) +
122  ((s[1] & 0x3f) << 0);
123 
124  // assert( ch > 0x007F && ch <= 0x07FF );
125  break;
126 
127  case 3:
128  if( (s[1] & 0xc0) != 0x80 ||
129  (s[2] & 0xc0) != 0x80 ||
130  (s[0] == 0xE0 && s[1] < 0xA0)
131  // || (s[0] == 0xED && s[1] > 0x9F)
132  )
133  {
134  THROW_IO_ERROR( "invalid continuation byte" );
135  }
136 
137  ch = ((s[0] & 0x0f) << 12) +
138  ((s[1] & 0x3f) << 6 ) +
139  ((s[2] & 0x3f) << 0 );
140 
141  // assert( ch > 0x07FF && ch <= 0xFFFF );
142  break;
143 
144  case 4:
145  if( (s[1] & 0xc0) != 0x80 ||
146  (s[2] & 0xc0) != 0x80 ||
147  (s[3] & 0xc0) != 0x80 ||
148  (s[0] == 0xF0 && s[1] < 0x90) ||
149  (s[0] == 0xF4 && s[1] > 0x8F) )
150  {
151  THROW_IO_ERROR( "invalid continuation byte" );
152  }
153 
154  ch = ((s[0] & 0x7) << 18) +
155  ((s[1] & 0x3f) << 12) +
156  ((s[2] & 0x3f) << 6 ) +
157  ((s[3] & 0x3f) << 0 );
158 
159  // assert( ch > 0xFFFF && ch <= 0x10ffff );
160  break;
161  }
162 
163  if( aResult )
164  *aResult = ch;
165 
166  return len;
167 }
#define THROW_IO_ERROR(msg)
Definition: ki_exception.h:38
wxString UTF8::wx_str ( ) const

Definition at line 48 of file utf8.cpp.

References c_str().

Referenced by SCH_EDIT_FRAME::CreateArchiveLibrary(), FP_LIB_TABLE::FootprintLoad(), and DIALOG_SYMBOL_REMAP::remapSymbolsToLibTable().

49 {
50  return wxString( c_str(), wxConvUTF8 );
51 }

Member Data Documentation

const std::string::size_type UTF8::npos = std::string::npos
static

Definition at line 148 of file utf8.h.

Referenced by KIGFX::STROKE_FONT::Draw(), and LIB_ID::Parse().


The documentation for this class was generated from the following files:
utf8.h
utf8.cpp