Friday, September 3, 2010

Unicode With ICU

Building ICU on Mac OS X
At first sight the ICU library seems a bit overwhelming in its scope and depth, but like all such things it's a matter of persistence and familiarity. Unfortunately I ran into problems quite early, as just building it was failing.

With the usual aid of a Google search, though, it turned out that somebody at MacPorts had already found a workaround. In short, my CFLAGS includes the directive '-arch x86_64', which was enough to convince the ./configure script to build a universal binary. For reasons I'm not entirely clear on, this then resulted in a segmentation fault in the 'data' section of the build. The solution was to edit the configure script and alter the line:
*-arch*ppc*|*-arch*i386*|*-arch*x86_64*) ac_cv_c_bigendian=universal;;
to:
*-arch*ppc*) ac_cv_c_bigendian=universal;;

This is enough to generate a 64-bit-only binary and for the build to complete.

ICU Basic use
The first thing I then tried to do was to create an instance of the UnicodeString class so that I could convert it to UTF-8, which is the encoding I wanted to persist back to Postgres. This started out with:

#include <unicode/unistr.h>

icu::UnicodeString strUnicode( (UChar*)L"ΠαρθένωνΗ", 9 );


But then how to transform this into UTF-8? I could see there were two methods: toUTF8(), which was documented, and toUTF8String(), which was mentioned but nowhere documented.

toUTF8() takes the mysterious (to me) type ByteSink, and from the documentation I could not figure out how to get a pointer to the data held by this type. Only after a lot of digging through the source did I figure out that ByteSink is just an abstract interface for receiving bytes, and that its subclass CheckedArrayByteSink wraps a pointer to a char[] array and makes sure you cannot overflow the array bounds. The toUTF8String() method (more on this later) uses the same mechanism with a sibling class, StringByteSink, when it stores the transformed UnicodeString as UTF-8. This then yielded the following code to convert my UnicodeString into UTF-8:

  char utf8Buff[128]; // conversion target buffer
  icu::CheckedArrayByteSink byte_buff(utf8Buff, sizeof(utf8Buff) ); //safe wrapper for c array
  strUnicode.toUTF8( byte_buff ); // converts to UTF-8 and stores in utf8Buff
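To make the "safe wrapper around a char[] array" idea concrete, here is a minimal sketch of the pattern. BoundedSink and its member names are hypothetical and written only for illustration; this is not ICU's implementation, although as far as I can tell the real CheckedArrayByteSink exposes the same kind of information through Overflowed() and NumberOfBytesWritten().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the idea behind icu::CheckedArrayByteSink.
// It writes into a caller-supplied array and never goes past its end.
class BoundedSink {
public:
    BoundedSink(char* out, std::size_t capacity)
        : out_(out), capacity_(capacity), size_(0), overflowed_(false) {
        assert(out != nullptr || capacity == 0);
    }

    // Copy as many bytes as still fit; remember if anything had to be dropped.
    void append(const char* bytes, std::size_t n) {
        std::size_t room = capacity_ - size_;
        if (n > room) {
            n = room;
            overflowed_ = true;
        }
        if (n > 0) {
            std::memcpy(out_ + size_, bytes, n);
        }
        size_ += n;
    }

    std::size_t bytesWritten() const { return size_; }
    bool overflowed() const { return overflowed_; }

private:
    char* out_;             // caller-owned target array
    std::size_t capacity_;  // total size of that array in bytes
    std::size_t size_;      // bytes written so far
    bool overflowed_;       // true once any append was truncated
};
```

Note that a sink like this only copies bytes. It does not add a terminating '\0', so the caller has to track the length or zero the buffer beforehand.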


And the mysterious toUTF8String() method? It does exactly the same, but stores the output in a std::string instead. Its implementation can be seen in the <unicode/unistr.h> header file.
So the code:
  std::string cs;
  strUnicode.toUTF8String(cs);

will result in an STL string with your UTF-16 string transformed to UTF-8.

This *should* yield a proper UTF-8 output string.
But it didn't. Stored in the database was a string with just one character. What was happening? I'd constructed the class from what I took to be a UTF-16 string, L"ΠαρθένωνΗ", but the resulting UTF-8 string had every 3rd byte set to '\0'. It was as if the conversion had been only half done and the resulting string peppered with '\0' bytes. Lots of head scratching ensued.

And then it struck me: my base assumption was that the compiler would turn a string literal prefixed with 'L' into UTF-16, but this is not always so. On Windows, where I first used Unicode, wchar_t is 16 bits wide and UTF-16 is the encoding used throughout the NT API. It's straightforward and mostly predictable, and Java uses this encoding too. UTF-16 can still represent every Unicode character, but anything outside the Basic Multilingual Plane takes a pair of 16-bit units (a surrogate pair). UTF-32 is the wider encoding that stores every code point in a single 32-bit unit, and this is what the native wide char is on Mac OS X. A quick check confirmed this:
  int wchar_size = sizeof(wchar_t); // returns 4
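The 4-byte wide char also explains the "every 3rd byte is '\0'" symptom. The sketch below reproduces it without ICU, under two assumptions of mine: a little-endian machine (as x86_64 is), and char32_t standing in for the 32-bit wchar_t so that it behaves the same everywhere. The function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append the UTF-8 encoding of one 16-bit value to out.
// Surrogates are not treated specially; this demonstration never produces any.
void appendUtf8(std::string& out, std::uint16_t unit) {
    if (unit < 0x80) {
        out.push_back(static_cast<char>(unit));
    } else if (unit < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (unit >> 6)));
        out.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (unit >> 12)));
        out.push_back(static_cast<char>(0x80 | ((unit >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
    }
}

// Read the first `units` 16-bit values out of a 32-bit buffer, which is what
// the (UChar*) cast effectively did, and encode each value as UTF-8.
std::string misreadAsUtf16(const char32_t* text32, std::size_t len32,
                           std::size_t units) {
    assert(units * sizeof(std::uint16_t) <= len32 * sizeof(char32_t));
    std::string utf8;
    if (units == 0) {
        return utf8;
    }
    std::vector<std::uint16_t> raw(units);
    std::memcpy(raw.data(), text32, units * sizeof(std::uint16_t));
    for (std::uint16_t u : raw) {
        appendUtf8(utf8, u);
    }
    return utf8;
}
```

Calling misreadAsUtf16 on the nine code points of the test word with units set to 9 reads Π, 0, α, 0, ρ, 0, θ, 0, έ on a little-endian machine. Each Greek letter becomes two UTF-8 bytes and each zero becomes one, giving CE A0 00 CE B1 00 and so on. Anything that treats the result as a C string stops after the first character, which is the single character that ended up in the database.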

So now my example has to construct a string from UTF-32, not UTF-16 as I had wrongly assumed.
What I ended up with is:
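  #include <cstring>            // memset
  #include <string>
  #include <unicode/unistr.h>

  // Build the UnicodeString from UTF-32; the cast is only valid where wchar_t is 32 bits wide
  icu::UnicodeString strUnicode = icu::UnicodeString::fromUTF32( (const UChar32*)L"ΠαρθένωνΗ", 9 );

  // STL string from icu::UnicodeString
  std::string cs;
  strUnicode.toUTF8String(cs);

  // UTF-8 string in char[] buffer from UnicodeString
  char utf8Buff[128];
  memset( utf8Buff, 0, sizeof(utf8Buff) ); // zero-fill so a '\0' follows the converted bytes
  icu::CheckedArrayByteSink byte_buff(utf8Buff, sizeof(utf8Buff) );
  strUnicode.toUTF8( byte_buff );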


ICU 4.2
Mac OS X 10.6
