Text

Copyright 2003 Michael B. Allen <mba2000 ioplex.com>

mba/text.h

The text module provides typedefs and macros that abstract the character type and the standard library functions that operate on it. The resulting source code supports extended charsets and can be used with little or no modification on a variety of popular platforms. If USE_WCHAR is defined, wide characters are used (e.g. wchar_t is UTF-16LE on Windows). Otherwise the locale-dependent encoding is used (e.g. UTF-8 on Unix). Many functions in this library now accept tchar * strings; however, char * or wchar_t * strings can still be used with these functions because tchar is just a typedef for unsigned char or wchar_t.

Additionally, several sentinel pointer string functions are provided.

See Tchar I18N Text Abstraction for details.

Tchar Definitions

The tchar Type

To abstract the character type, the following typedefs are defined.

#ifdef USE_WCHAR
typedef wchar_t tchar;
typedef wint_t tint_t;
#define TEOF WEOF
#else
typedef unsigned char tchar;
typedef int tint_t;
#define TEOF EOF
#endif


Using this new type is a matter of substituting all instances of char or wchar_t with tchar. On Windows the program should be compiled with USE_WCHAR defined, whereas on other systems tchar is left defined as unsigned char, which is suitable for use with multibyte strings and locale-dependent codepages.
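As a minimal sketch of that substitution, the helper below is written once against tchar and compiles in either mode. The name count_tchar is hypothetical and is not part of the module, and the typedef is repeated locally only so that the fragment compiles without mba/text.h.

```c
#include <stddef.h>

#ifdef USE_WCHAR
#include <wchar.h>
typedef wchar_t tchar;          /* local copy of the module's typedef */
#else
typedef unsigned char tchar;    /* local copy of the module's typedef */
#endif

/* Hypothetical helper (not part of the module): count the elements of the
 * '\0'-terminated text at s that compare equal to ch. */
static size_t count_tchar(const tchar *s, tchar ch)
{
    size_t n = 0;

    while (*s) {
        if (*s == ch)
            n++;
        s++;
    }
    return n;
}
```

Nothing in the function body depends on whether an element is one byte or one wide character, which is the property the typedef is meant to provide.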

Of course this is not enough to abstract the character type. All string handling functions must be abstracted as well. Also, wide character string literals must be prefixed with an L.

The TEXT Macro

The problem of prefixing wide character string literals is easily resolved with the following trivial TEXT macro (or the identical shorthand _T macro).

#ifdef USE_WCHAR
#define TEXT(s) L##s
#define _T(s) L##s
#else
#define TEXT(s) s
#define _T(s) s
#endif


Depending on whether or not the target code is compiled with USE_WCHAR, the string or character literal will be properly prefixed with L. Consider the example below, which is written using tchar and the _T macro. If this code is compiled without USE_WCHAR, the _T macro is simply removed, producing code that manages strings using the standard locale-dependent behavior. If USE_WCHAR is defined, however, an L is prepended to string and character literals.

const tchar *foo = _T("bar");
if (ch == _T('\n')) {

/* preprocessing yields */

const unsigned char *foo = "bar";
if (ch == '\n') {

/* preprocessing with USE_WCHAR defined gives */

const wchar_t *foo = L"bar";
if (ch == L'\n') {


Function Macros

The macros for common library functions that accept characters and strings are defined as follows. Note that wide character stream I/O cannot be mixed with non-wide I/O. Because it is difficult to write a program that performs all character I/O using wide characters exclusively, there are currently no macros for wide character I/O functions such as fwprintf, fputwc, fgetws and so on. This may be addressed in a future version of this module.

#ifdef USE_WCHAR

#define istalnum iswalnum
#define istalpha iswalpha
#define istcntrl iswcntrl
#define istdigit iswdigit
#define istgraph iswgraph
#define istlower iswlower
#define istprint iswprint
#define istpunct iswpunct
#define istspace iswspace
#define istupper iswupper
#define istxdigit iswxdigit
#define istblank iswblank
#define totlower towlower
#define totupper towupper
#define tcscpy wcscpy
#define tcsncpy wcsncpy
#define tcscat wcscat
#define tcsncat wcsncat
#define tcscmp wcscmp
#define tcsncmp wcsncmp
#define tcscoll wcscoll
#define tcsxfrm wcsxfrm
#define tcscoll_l wcscoll_l
#define tcsxfrm_l wcsxfrm_l
#define tcsdup wcsdup
#define tcschr wcschr
#define tcsrchr wcsrchr
#define tcschrnul wcschrnul
#define tcscspn wcscspn
#define tcsspn wcsspn
#define tcspbrk wcspbrk
#define tcsstr wcsstr
#if defined(_WIN32)
#define tcstok(s,d,p) wcstok(s,d)
#else
#define tcstok wcstok
#endif
#define tcslen wcslen
#define tcsnlen wcsnlen
#define tmemcpy wmemcpy
#define tmemmove wmemmove
#define tmemset wmemset
#define tmemcmp wmemcmp
#define tmemchr wmemchr
#define tcscasecmp wcscasecmp
#define tcsncasecmp wcsncasecmp
#define tcscasecmp_l wcscasecmp_l
#define tcsncasecmp_l wcsncasecmp_l
#define tcpcpy wcpcpy
#define tcpncpy wcpncpy
#define tcstod wcstod
#define tcstof wcstof
#define tcstold wcstold
#define tcstol wcstol
#define tcstoul wcstoul
#define tcstoq wcstoq
#define tcstouq wcstouq
#define tcstoll wcstoll
#define tcstoull wcstoull
#define tcstol_l wcstol_l
#define tcstoul_l wcstoul_l
#define tcstoll_l wcstoll_l
#define tcstoull_l wcstoull_l
#define tcstod_l wcstod_l
#define tcstof_l wcstof_l
#define tcstold_l wcstold_l
#define tcsftime wcsftime
#define fputts _fputws
#if !defined(_WIN32)
#define stprintf swprintf
#define vstprintf vswprintf
#else
#define stprintf _snwprintf
#define vstprintf _vsnwprintf
#endif
#define stscanf swscanf
#define vstscanf vswscanf

#define text_length wcs_length
#define text_size wcs_size
#define text_copy wcs_copy
#define text_copy_new wcs_copy_new

#else

#define istalnum isalnum
#define istalpha isalpha
#define istcntrl iscntrl
#define istdigit isdigit
#define istgraph isgraph
#define istlower islower
#define istprint isprint
#define istpunct ispunct
#define istspace isspace
#define istupper isupper
#define istxdigit isxdigit
#define istblank isblank
#define totlower tolower
#define totupper toupper
#define tcscpy strcpy
#define tcsncpy strncpy
#define tcscat strcat
#define tcsncat strncat
#define tcscmp strcmp
#define tcsncmp strncmp
#define tcscoll strcoll
#define tcsxfrm strxfrm
#define tcscoll_l strcoll_l
#define tcsxfrm_l strxfrm_l
#define tcsdup strdup
#define tcschr strchr
#define tcsrchr strrchr
#define tcschrnul strchrnul
#define tcscspn strcspn
#define tcsspn strspn
#define tcspbrk strpbrk
#define tcsstr strstr
#if defined(__GNUC__)
#define tcstok strtok_r
#else
#define tcstok(s,d,p) strtok(s,d)
#endif
#define tcslen strlen
#define tcsnlen strnlen
#define tmemcpy memcpy
#define tmemmove memmove
#define tmemset memset
#define tmemcmp memcmp
#define tmemchr memchr
#define tcscasecmp strcasecmp
#define tcsncasecmp strncasecmp
#define tcscasecmp_l strcasecmp_l
#define tcsncasecmp_l strncasecmp_l
#define tcpcpy stpcpy
#define tcpncpy stpncpy
#define tcstod strtod
#define tcstof strtof
#define tcstold strtold
#define tcstol strtol
#define tcstoul strtoul
#define tcstoq strtoq
#define tcstouq strtouq
#define tcstoll strtoll
#define tcstoull strtoull
#define tcstol_l strtol_l
#define tcstoul_l strtoul_l
#define tcstoll_l strtoll_l
#define tcstoull_l strtoull_l
#define tcstod_l strtod_l
#define tcstof_l strtof_l
#define tcstold_l strtold_l
#define tcsftime strftime
#define fputts fputs
#if !defined(_WIN32)
#define stprintf snprintf
#define vstprintf vsnprintf
#else
#define stprintf _snprintf
#define vstprintf _vsnprintf
#endif
#define stscanf sscanf
#define vstscanf vsscanf

#define text_length str_length
#define text_size str_size
#define text_copy str_copy
#define text_copy_new str_copy_new

#endif
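
As an illustration of how these macros are meant to be used, the sketch below calls only the t-prefixed names, so the same source resolves to the <ctype.h> functions or to the <wctype.h> functions depending on USE_WCHAR. The helper upcase_words is hypothetical, and the two macros it needs are repeated locally so that the fragment compiles without mba/text.h.

```c
#include <stddef.h>

#ifdef USE_WCHAR
#include <wchar.h>
#include <wctype.h>
typedef wchar_t tchar;
#define istspace iswspace
#define totupper towupper
#else
#include <ctype.h>
typedef unsigned char tchar;
#define istspace isspace
#define totupper toupper
#endif

/* Hypothetical helper (not part of the module): upper-case the
 * '\0'-terminated text at s in place and return the number of words,
 * where a word is a run of non-space elements. */
static size_t upcase_words(tchar *s)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s; s++) {
        if (istspace(*s)) {
            in_word = 0;
        } else {
            if (!in_word)
                words++;        /* first element of a new word */
            in_word = 1;
            *s = (tchar)totupper(*s);
        }
    }
    return words;
}
```

Because tchar is unsigned char in the non-wide build, passing *s directly to the <ctype.h> functions stays within the range those functions accept.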

Sentinel Pointer String Functions

In addition to the standard library string functions, the text module provides some additional functions that under certain conditions are superior to their traditional counterparts. Because they take a limit pointer instead of a count, the limit does not need to be recalculated as the target pointer is advanced during complex text processing. The limit pointer never changes, which can make the resulting code simpler and inherently safer. Determining whether a pointer is within the bounds of the target text is a simple conditional expression (e.g. p < plim).
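
The limit pointer idiom can be sketched with two small scanning helpers. Both names (skip_spaces and parse_uint) are hypothetical and are not part of the module; they use unsigned char, the non-wide form of tchar, to keep the fragment short. The point to notice is that plim is handed unchanged from call to call while only p advances.

```c
#include <stddef.h>

/* Hypothetical helper: advance p past spaces and tabs, never beyond plim. */
static const unsigned char *skip_spaces(const unsigned char *p,
                                        const unsigned char *plim)
{
    while (p < plim && (*p == ' ' || *p == '\t'))
        p++;
    return p;
}

/* Hypothetical helper: read decimal digits starting at p, never beyond
 * plim. Stores the value in *out and returns the position after the last
 * digit consumed. Overflow is not checked in this sketch. */
static const unsigned char *parse_uint(const unsigned char *p,
                                       const unsigned char *plim,
                                       unsigned long *out)
{
    unsigned long v = 0;

    while (p < plim && *p >= '0' && *p <= '9')
        v = v * 10 + (unsigned long)(*p++ - '0');
    *out = v;
    return p;
}
```

With a count-based interface each call would need the remaining length recomputed from how far p has moved; here the caller simply threads p through the calls and compares it to plim when finished.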

int text_length(const tchar *src, const tchar *slim);

The text_length function returns the number of elements in the text at src up to but not including the '\0' terminator. This function returns 0 if:

no '\0' terminator is encountered before slim,
src == NULL,
or src >= slim

The text_length function is actually a macro for either str_length or wcs_length. The wcs_length function has the same prototype but accepts wchar_t parameters whereas str_length accepts unsigned char parameters.
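
The behavior described above can be sketched as follows. This is an illustrative reimplementation written from the description, under the hypothetical name sketch_str_length; it is not the library's source.

```c
#include <stddef.h>

/* Sketch of the documented str_length behavior (hypothetical name).
 * Returns the number of elements before the '\0' terminator, or 0 if
 * src is NULL, src >= slim, or no terminator is found before slim. */
static int sketch_str_length(const unsigned char *src,
                             const unsigned char *slim)
{
    const unsigned char *start = src;

    if (src == NULL || src >= slim)
        return 0;
    while (*src) {
        src++;
        if (src == slim)
            return 0;           /* reached the limit without a terminator */
    }
    return (int)(src - start);
}
```

Note that a return value of 0 is ambiguous in this interface: it is returned both for an empty text and for each of the failure conditions.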

int text_copy(const tchar *src, const tchar *slim, tchar *dst, tchar *dlim, int n);

The text_copy function copies at most n elements of the text at src into dst up to and including the '\0' terminator. The text at dst is always '\0' terminated unless dst is a null pointer or dst >= dlim.

The text_copy function is actually a macro for either str_copy or wcs_copy. The wcs_copy function has the same prototype but accepts wchar_t parameters whereas str_copy accepts unsigned char parameters. The text_copy function returns the number of elements in the text copied to dst, not including the '\0' terminator. This function returns 0 if:

no '\0' terminator is encountered before slim,
dst == NULL,
dst >= dlim,
src == NULL,
or src >= slim
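
A minimal sketch of a copy routine with these properties is shown below. It is written from the description above under the hypothetical name sketch_str_copy and is not the library's source. One assumption is made explicit in the comments: when either limit is reached before the copy completes, the partial copy is discarded and dst is left holding an empty text.

```c
#include <stddef.h>

/* Sketch of the documented str_copy behavior (hypothetical name).
 * Copies at most n elements from [src, slim) into [dst, dlim) and
 * terminates dst. Returns the number of elements copied, not counting
 * the terminator. */
static int sketch_str_copy(const unsigned char *src, const unsigned char *slim,
                           unsigned char *dst, unsigned char *dlim, int n)
{
    unsigned char *start = dst;

    if (dst == NULL || dst >= dlim)
        return 0;               /* no room to write even a terminator */
    if (src == NULL || src >= slim) {
        *dst = '\0';
        return 0;
    }
    while (n-- > 0 && *src) {
        *dst++ = *src++;
        if (src == slim || dst == dlim) {
            dst = start;        /* assumption: discard a partial copy */
            break;
        }
    }
    *dst = '\0';
    return (int)(dst - start);
}
```

Because dlim bounds the write position including the terminator, a destination of k elements can hold a text of at most k - 1 elements.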

int text_copy_new(const tchar *src, const tchar *slim, tchar **dst, int n, struct allocator *al);

The text_copy_new function copies at most n elements of the text at src up to and including the '\0' terminator into memory allocated from the allocator specified by the al parameter. The pointer pointed to by dst is set to point to the new memory. If the text is copied successfully it is always '\0' terminated.

The text_copy_new function is actually a macro for either str_copy_new or wcs_copy_new. The wcs_copy_new function has the same prototype but accepts wchar_t parameters whereas str_copy_new accepts unsigned char parameters. The text_copy_new function returns the number of elements in the text at *dst, not including the '\0' terminator. This function sets *dst to NULL and returns 0 if:

no '\0' terminator is encountered before slim,
src == NULL,
or src >= slim

It also returns 0 if dst == NULL. If memory for the text cannot be allocated, -1 is returned and errno is set appropriately.
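
The return conventions above can be sketched as follows. This fragment is written from the description under the hypothetical name sketch_str_copy_new. It deliberately swaps the module's struct allocator parameter for plain malloc so that it compiles on its own, which means it does not show how the al parameter is used.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the documented str_copy_new behavior (hypothetical name),
 * with malloc standing in for the module's allocator. */
static int sketch_str_copy_new(const unsigned char *src,
                               const unsigned char *slim,
                               unsigned char **dst, int n)
{
    const unsigned char *p = src;
    size_t len;

    if (dst == NULL)
        return 0;
    *dst = NULL;
    if (src == NULL || src >= slim)
        return 0;
    while (n-- > 0 && *p) {
        p++;
        if (p == slim)
            return 0;           /* no terminator before slim */
    }
    len = (size_t)(p - src);
    *dst = malloc(len + 1);
    if (*dst == NULL)
        return -1;              /* POSIX malloc sets errno to ENOMEM */
    memcpy(*dst, src, len);
    (*dst)[len] = '\0';
    return (int)len;
}
```

The caller owns the returned memory; in this sketch it is released with free, whereas with the real function it would be released through whatever allocator was passed in.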

size_t text_size(const tchar *src, const tchar *slim);

The text_size function returns the number of bytes occupied by the text at src including the '\0' terminator. This function returns 0 if:

no '\0' terminator is encountered before slim,
src == NULL,
or src >= slim

The text_size function is actually a macro for either str_size or wcs_size. The wcs_size function has the same prototype but accepts wchar_t parameters whereas str_size accepts unsigned char parameters.
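
The difference between a length in elements and a size in bytes is easiest to see in the wide character case. The sketch below is written from the description under the hypothetical name sketch_wcs_size and is not the library's source.

```c
#include <stddef.h>
#include <wchar.h>

/* Sketch of the documented wcs_size behavior (hypothetical name).
 * Returns the number of bytes occupied by the text at src including
 * the terminator, or 0 if src is NULL, src >= slim, or no terminator
 * is found before slim. */
static size_t sketch_wcs_size(const wchar_t *src, const wchar_t *slim)
{
    const wchar_t *start = src;

    if (src == NULL || src >= slim)
        return 0;
    while (*src) {
        src++;
        if (src == slim)
            return 0;           /* reached the limit without a terminator */
    }
    /* elements plus terminator, scaled to bytes */
    return (size_t)(src - start + 1) * sizeof(wchar_t);
}
```

For the three-element text L"abc" this yields 4 * sizeof(wchar_t), a value that can be passed directly to an allocator or to memcpy.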