Tchar I18N Text Abstraction

This documentation describes how to write C software that supports international charsets and that compiles and runs on different platforms such as Linux, Microsoft Windows, BSD, or Mac OS X with little or no modification. In short, macros and typedefs are used to abstract the character type and all functions that operate on it. This permits the software to be compiled using plain 8 bit, multi-byte, or wide character encodings. Very little extra work is necessary to benefit from this technique, although there are pitfalls that will be described in detail.

Like most modules in libmba, the ideas and code can be extracted or adapted to meet the needs of your application and do not bind you to a particular "environment" or library. Although the text.h typedefs and macros are used throughout the libmba package, users can simply choose to ignore them and pass char * to functions that accept tchar * (or wchar_t * if libmba has been compiled with -DUSE_WCHAR).

Unicode, Charsets, and Character Encodings

To use this technique successfully it is essential to understand that each non-ASCII character may occupy a variable number of bytes in memory. Examples will be given below that illustrate why this is important but first some background information about Unicode, charsets, and character encodings might be useful.

Consider the Russian character called CYRILLIC CAPITAL LETTER GHE, which looks like an upside down 'L' and has a Unicode value of U+0413. This character's value will be different depending on which charset is being discussed, but Unicode is the international standard superset of virtually all charsets, so unless otherwise specified Unicode will be used to describe characters throughout this documentation. The number of bytes that the U+0413 character occupies depends on the character encoding being used. Notice that a charset and a character encoding are different things.

Charsets: A charset (or character set) is a map that defines which numeric value represents a particular character. In the Unicode charset CYRILLIC CAPITAL LETTER GHE is the numeric value 0x0413 (written U+0413 in the Unicode standard convention). Some example charsets are ISO-8859-1, CP1251, and GB2312.

Character Encodings: A character encoding defines how the numeric values representing characters are serialized into a sequence of bytes so that they can be operated on in memory, stored on a disk, or transmitted over a network. For example, at least two bytes of memory would be required to hold the value 0x0413. If this value were simply stored as the byte 0x04 followed by the byte 0x13, the encoding would be UCS-2BE, meaning Universal Character Set, 2 bytes, big-endian. The following are examples of some other Unicode character encodings:
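To make the difference between a charset value and its encoded bytes concrete, here is a minimal sketch that serializes one code point in three ways. The function names are hypothetical and are not part of libmba, and the sketch assumes the code point lies in the Basic Multilingual Plane (below U+10000 and not a surrogate), so UTF-16 needs no surrogate pair and UTF-8 needs at most three bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; each writes the encoded
 * bytes to dst and returns how many bytes were written. */

/* UCS-2BE: high byte first, then low byte. */
size_t
enc_ucs2be(uint32_t cp, unsigned char *dst)
{
	dst[0] = (unsigned char)(cp >> 8);
	dst[1] = (unsigned char)(cp & 0xFF);
	return 2;
}

/* UTF-16LE for a BMP code point: low byte first, then high byte. */
size_t
enc_utf16le_bmp(uint32_t cp, unsigned char *dst)
{
	dst[0] = (unsigned char)(cp & 0xFF);
	dst[1] = (unsigned char)(cp >> 8);
	return 2;
}

/* UTF-8 for a BMP code point: 1, 2, or 3 bytes depending on the value. */
size_t
enc_utf8_bmp(uint32_t cp, unsigned char *dst)
{
	if (cp < 0x80) {                /* ASCII is stored unchanged */
		dst[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {               /* 110xxxxx 10xxxxxx */
		dst[0] = (unsigned char)(0xC0 | (cp >> 6));
		dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	/* 1110xxxx 10xxxxxx 10xxxxxx */
	dst[0] = (unsigned char)(0xE0 | (cp >> 12));
	dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
	return 3;
}
```

For U+0413 these produce the bytes 04 13 (UCS-2BE), 13 04 (UTF-16LE), and D0 93 (UTF-8): the same charset value, three different byte sequences.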

I18N Text Handling

There are primarily three techniques for managing I18N strings in a C program: plain 8 bit strings, multi-byte strings, and wide character strings.

The Tchar Text Abstraction

Writing a program that will compile and run without modification on Linux, Windows, and a variety of other platforms is a matter of abstracting the string handling techniques that each platform uses. Linux uses both multi-byte and wide character encodings. Windows uses wide characters; it is important to note that Windows has traditionally not supported a UTF-8 locale, so if Unicode is desired there, wide character strings are the only option. Most other Unix and Unix-like systems support multi-byte strings and, to varying degrees, wide character strings. Programs written using the technique described here will still permit runtime-defined codepages through the standard setlocale(3) mechanism.

The Tchar Type

The idea behind this technique is to use a typedef for the character type that resolves to either an 8 bit char type or wchar_t. In this way the character type identifier does not change in the source code.

  #ifdef USE_WCHAR
  #include <wchar.h>              /* wchar_t and the wcs* functions */
  typedef wchar_t tchar;
  #else
  typedef unsigned char tchar;
  #endif
  

Abstracting String Functions

In addition to the character type, all functions that operate on it will need to be abstracted with macros that reference tchar rather than char or wchar_t. Consider the strncpy function. It uses the plain char type. Fortunately the major string functions have a wide character equivalent that usually has the same signature but accepts the wchar_t type.

  char *strncpy(char *dest, const char *src, size_t n);
  wchar_t *wcsncpy(wchar_t *dest, const wchar_t *src, size_t n);
  
From the above signatures it can be seen that the only difference is the character type. The number, order, and meaning of the parameters are the same. This permits the function to be abstracted with macros as follows:

  #ifdef USE_WCHAR
  #define tcsncpy wcsncpy
  #else
  #define tcsncpy strncpy
  #endif
  
Using the abstraction is now a matter of replacing all instances of strncpy or wcsncpy with tcsncpy. Depending on how the program is compiled, code that uses these functions will support wide character or multi-byte strings (but not both at the same time). See the Text Module API Documentation for a complete list of macros in text.h.
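As a rough illustration of how calling code looks once the type and function are abstracted, the sketch below copies a string into a fixed buffer. It is self-contained rather than built on text.h: TLIT, tcslen, and copy_name are hypothetical names introduced here, and the narrow branch uses plain char (an assumption made so that ordinary string literals can be passed without casts). Note that the buffer size is counted in tchar elements, not bytes, which is what both strncpy and wcsncpy expect.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#ifdef USE_WCHAR
typedef wchar_t tchar;
#define TLIT(s)  L##s           /* hypothetical literal macro: L"..." */
#define tcsncpy  wcsncpy
#define tcslen   wcslen
#else
typedef char tchar;             /* assumption: plain char for the narrow case */
#define TLIT(s)  s
#define tcsncpy  strncpy
#define tcslen   strlen
#endif

/* Copy src into dst, which holds dn elements (not bytes), and always
 * terminate the result. Returns the number of elements stored, not
 * counting the terminator. */
size_t
copy_name(tchar *dst, size_t dn, const tchar *src)
{
	if (dn == 0)
		return 0;
	tcsncpy(dst, src, dn - 1);
	dst[dn - 1] = TLIT('\0');   /* strncpy/wcsncpy may leave dst unterminated */
	return tcslen(dst);
}
```

A caller would typically write copy_name(buf, sizeof(buf) / sizeof(buf[0]), TLIT("hello")) so that the same line is correct whether tchar is one byte or several.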

There are of course many other functions that operate on strings. Fortunately most standard C library functions have wide character versions that are reasonably consistent about identifier names. An identifier that begins with str will likely have a wide character version that begins with wcs. Other names are not so obvious, and depending on the system being used there will certainly be omissions or incompatibilities (e.g. the wide character counterpart of vsnprintf is vswprintf, without the n, even though it accepts an n parameter).

If a function does not have a man page, or if the compiler issues a warning, it does not necessarily mean the function does not exist on your system. For example, with the GNU C library it may be necessary to request C99 or to define _XOPEN_SOURCE=500 to indicate that a UNIX98 environment is desired. Check your C library documentation (e.g. /usr/include/features.h) and the POSIX documentation on the Open Group website. On my RedHat Linux 7.3 system wcstol and several other conversion functions are not documented, and it is necessary to pass -std=c99 or -D_ISOC99_SOURCE to gcc for the header to declare them.
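The vsnprintf and vswprintf pair is a useful case study because the signatures line up but the return values do not: on truncation vsnprintf returns the length that would have been written, while vswprintf returns a negative value. The sketch below hides that difference behind one wrapper. The names vstprintf, TLIT, and tformat are hypothetical and chosen only for this illustration, and the narrow branch again assumes plain char.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

#ifdef USE_WCHAR
typedef wchar_t tchar;
#define TLIT(s)    L##s
#define vstprintf  vswprintf    /* no 'n' in the name, but it takes n */
#else
typedef char tchar;             /* assumption: plain char for the narrow case */
#define TLIT(s)    s
#define vstprintf  vsnprintf
#endif

/* Format into dst, which holds n elements. Returns the number of
 * elements written (excluding the terminator), or -1 if the output
 * was truncated or an error occurred, regardless of which underlying
 * function was used. */
int
tformat(tchar *dst, size_t n, const tchar *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = vstprintf(dst, n, fmt, ap);
	va_end(ap);

	/* vswprintf signals truncation with ret < 0, vsnprintf with ret >= n */
	if (ret < 0 || (size_t)ret >= n)
		return -1;
	return ret;
}
```

Normalizing the return value in one place means callers do not need their own #ifdef USE_WCHAR blocks to detect truncation.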

Variable Width Encodings

Unicode on Unix and Unix-like systems is supported using UTF-8. On Microsoft Windows UTF-16LE is used. As explained previously these are variable width encodings; each character can occupy a variable number of bytes in memory. The question is: when does this require special processing in your code?

A good example of when UTF-8 strings require special handling is when each character needs to be examined individually. Consider the caseless comparison of two strings. They cannot simply be compared element by element. Each character must be decoded to its wide character value and converted to upper or lowercase for the comparison to be valid. Below is just such a function:

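  #include <ctype.h>
  #include <errno.h>
  #include <string.h>
  #include <wchar.h>
  #include <wctype.h>
  #include <mba/msgno.h>          /* PMNO, the libmba error reporting macro */

  /* Case insensitive comparison of two UTF-8 strings. mbrtowc(3) decodes
   * according to the current locale, so a UTF-8 locale must have been
   * selected with setlocale(3). Returns -1 if a sequence cannot be decoded.
   */
  int
  utf8casecmp(const unsigned char *str1, const unsigned char *str1lim,
  		const unsigned char *str2, const unsigned char *str2lim)
  {
  	size_t n1, n2;
  	wchar_t ucs1, ucs2;
  	int ch1, ch2;
  	mbstate_t ps1, ps2;
  
  	memset(&ps1, 0, sizeof(ps1));
  	memset(&ps2, 0, sizeof(ps2));
  	while (str1 < str1lim && str2 < str2lim) {
  		if ((*str1 & 0x80) && (*str2 & 0x80)) {           /* both multibyte */
  			n1 = mbrtowc(&ucs1, (const char *)str1, (size_t)(str1lim - str1), &ps1);
  			n2 = mbrtowc(&ucs2, (const char *)str2, (size_t)(str2lim - str2), &ps2);
  			/* (size_t)-1 is an invalid sequence, (size_t)-2 an incomplete one */
  			if (n1 >= (size_t)-2 || n2 >= (size_t)-2) {
  				PMNO(errno);
  				return -1;
  			}
  			if (ucs1 != ucs2 && (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
  				return ucs1 < ucs2 ? -1 : 1;
  			}
  			str1 += n1;
  			str2 += n2;
  		} else {                                /* neither or one multibyte */
  			ch1 = *str1;
  			ch2 = *str2;
  
  			if (ch1 != ch2 && (ch1 = toupper(ch1)) != (ch2 = toupper(ch2))) {
  				return ch1 < ch2 ? -1 : 1;
  			} else if (ch1 == '\0') {
  				return 0;
  			}
  			str1++;
  			str2++;
  		}
  	}
  
  	return 0;
  }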
  
This is a fairly pathological example. In practice this is probably as difficult as it gets. For example, if the objective is to search for a certain ASCII character such as a space or the '\0' terminator, it is not necessary to decode a Unicode value at all. It might even be reasonable to use isspace and similar functions (but probably not ispunct, for example). This will require some experimenting and research.
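The reason no decoding is needed is that every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte below 0x80 can only ever be a complete ASCII character. A minimal sketch of such a search follows; the function name utf8_find_ascii is hypothetical, and the limit-pointer convention is borrowed from the comparison function above.

```c
#include <stddef.h>

/* Find the first occurrence of the ASCII character ch in the UTF-8
 * string [s, slim). No decoding is needed because bytes below 0x80
 * never appear inside a multi-byte sequence. Returns NULL if ch is
 * not found or is not an ASCII character. */
const unsigned char *
utf8_find_ascii(const unsigned char *s, const unsigned char *slim,
		unsigned char ch)
{
	if (ch & 0x80)
		return NULL;    /* not ASCII: a plain byte scan would be unsafe */
	for (; s < slim; s++) {
		if (*s == ch)
			return s;
	}
	return NULL;
}
```

The same reasoning does not hold for legacy multi-byte encodings in which trailing bytes may fall in the ASCII range, so this shortcut should be treated as specific to UTF-8.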

Another example of when a variable width encoding requires special handling in your code is when calculating the number of bytes of a string that will occupy at most a certain number of display positions in a terminal window. In this case it is necessary to convert each character to its Unicode value and then use the wcwidth(3) function. When the total of the values returned by wcwidth(3) equals or exceeds the desired number of columns, the number of bytes traversed in the substring is known.
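The column calculation can be sketched roughly as follows. The function name bytes_for_columns is hypothetical, the decoding relies on mbrtowc(3) and therefore on the locale selected with setlocale(3), and treating non-printable characters as zero columns wide is an assumption of this sketch rather than anything the standard prescribes. On glibc, wcwidth(3) is only declared when an X/Open feature test macro is defined, hence the first line.

```c
#define _XOPEN_SOURCE 700       /* ask the C library to declare wcwidth(3) */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Return the number of bytes at the start of the multi-byte string
 * [s, slim) that fit in at most cols display columns, or (size_t)-1
 * if an invalid or incomplete sequence is encountered. */
size_t
bytes_for_columns(const char *s, const char *slim, int cols)
{
	const char *p = s;
	mbstate_t ps;
	int used = 0;

	memset(&ps, 0, sizeof(ps));
	while (p < slim) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, p, (size_t)(slim - p), &ps);
		int w;

		if (n == (size_t)-1 || n == (size_t)-2)
			return (size_t)-1;
		if (n == 0)             /* reached a '\0' terminator */
			break;
		w = wcwidth(wc);
		if (w < 0)              /* non-printable: count as 0 (assumption) */
			w = 0;
		if (used + w > cols)    /* next character would not fit */
			break;
		used += w;
		p += n;
	}
	return (size_t)(p - s);
}
```

Stopping before a character that would overflow matters for double-width characters, where the running total can jump from one column short of the limit to one column past it.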

Potential Problems

This technique is not perfect. The wide character functions were not designed with this technique in mind. The prototypes are largely the same only for the sake of consistency. It is important to understand where problems can occur and how to correctly fix or avoid them. There are most certainly other problems and incompatibilities that I have omitted here. If you encounter any such example, please drop me a mail.

TCHAR in Microsoft Windows

For programmers who have used the variety of string handling functions on the Microsoft Windows platform, this character abstraction technique should look familiar. It is indeed the same. The abstract character type in the Win32 environment is named TCHAR, in uppercase rather than lowercase, and the string functions are prefixed with _tcs, like _tcsncpy rather than tcsncpy, but after macro processing the resulting code is the same. The identifier names were chosen to match those found on Windows (minus a few Windows coding conventions that clash with Unix/Linux conventions) simply because the Windows platform is very popular and there was no practical reason to use different names. The exception is that USE_WCHAR, rather than _UNICODE, is used to signal that wide characters should be used, because on Unix and Unix-like systems multi-byte strings support Unicode in the UTF-8 locale, which would make the _UNICODE macro somewhat inaccurate.