Mbs

Copyright 2002 Michael B. Allen <mba2000 ioplex.com>

mba/mbs.h

The mbs(3m) module provides extended string functions that work with the locale-dependent encoding, such as UTF-8 or any 8-bit encoding. They are useful for determining complete substrings of UTF-8 sequences, for example when terminal output must account for the number of display positions that a sequence of characters will occupy. More generally, the objective of this module is to emulate the behavior of non-multibyte Unicode string manipulation, like that of UTF-16 and Java encodings, although such behavior has not been verified.

Please note that some of these functions are not actively used by the author. They have been tested but should be considered experimental and may be subject to change or removal.

Multibyte string functions

These functions convert multibyte sequences into UCS codes in order to determine the number of characters in a string (read: the number of display positions, provided your strings are not polluted with control characters), determine the size of a complete valid sequence of characters, create a copy of a complete valid sequence of characters, return the substring starting at an offset measured in characters, and so on.

Which encoding is used depends on the locale. Developers who use these functions can write programs that will exhibit the same behavior in many different locales. They can test the success of their work by running their program in the UTF-8 locale, provided they have a capable terminal, a Unicode font, supporting mbtowc(3) and wctomb(3) functions, and a __STDC_ISO_10646__ environment. Although this may not be obvious, the Linux glibc 2.2 and Solaris with dtterm environments appear to meet these requirements.

To execute a program in the UTF-8 locale on a glibc 2.2+ Linux system try:

plain$ xterm -u8 -fn '-*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1'
xterm$ LANG=en_US.UTF-8 ./someprogram
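
The commands above only set the environment. A C program starts in the "C" locale and has to call setlocale(3) itself before mbtowc(3) and related functions honor LANG. The following is a minimal sketch of that step using only standard C and POSIX calls. The helper names adopt_env_locale and wchar_is_iso10646 are hypothetical and are not part of mba/mbs.h.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale named by the environment (LANG,
 * LC_ALL, ...) and return the name of the character encoding now in effect.
 * *ok is set to 0 if the requested locale is not installed, in which case
 * the program stays in the locale it already had (normally "C"). */
const char *adopt_env_locale(int *ok)
{
    assert(ok != NULL);
    *ok = setlocale(LC_ALL, "") != NULL;
    return nl_langinfo(CODESET);    /* e.g. "UTF-8" in a UTF-8 locale */
}

/* Hypothetical helper: 1 if the compiler environment promises that wchar_t
 * values are ISO 10646 (UCS) code points, 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A program would typically call adopt_env_locale once at the top of main and print the returned name when diagnosing why multibyte conversion does not behave as expected.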

For more information on UTF-8 and i18n, particularly on Linux, read the UTF-8 and Unicode FAQ for Unix/Linux.

int mbslen(const char *src);

The mbslen function will return the number of characters in the multibyte string pointed to by src. Characters in this context are control characters and complete multibyte sequences. Combining characters are not reduced. See mbswidth(3m) for calculating display positions.

int mbsnlen(const char *src, size_t sn, int cn);

The mbsnlen function will return the number of characters in the multibyte string pointed to by src. Characters in this context are control characters and complete multibyte sequences. Combining characters are not reduced. See mbswidth(3m) for calculating display positions. No more than sn bytes of src will be examined and no more than cn characters will be converted to make the determination. Either or both of sn and cn can be -1, indicating that the constraint should be ignored (no limit).
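
To illustrate how a count with both limits could work, here is a minimal sketch built on the standard mbrtowc(3) function. It is not the library's implementation: the name sketch_mbsnlen is hypothetical, and stopping quietly at the first invalid or incomplete sequence is an assumption, since the text above does not describe error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of mba/mbs.h: count the complete multibyte
 * characters at the start of src in the current locale's encoding.
 * sn limits the bytes examined; (size_t)-1 is the largest size_t value and
 * so acts as "no limit". cn limits the characters counted; cn < 0 means
 * "no limit". */
int sketch_mbsnlen(const char *src, size_t sn, int cn)
{
    mbstate_t ps;
    int count = 0;

    assert(src != NULL);
    memset(&ps, 0, sizeof ps);          /* start in the initial shift state */

    while (sn > 0 && (cn < 0 || count < cn)) {
        /* Never offer mbrtowc more bytes than one character can need. */
        size_t lim = sn < MB_CUR_MAX ? sn : MB_CUR_MAX;
        size_t n = mbrtowc(NULL, src, lim, &ps);

        if (n == 0)                     /* reached the terminating '\0' */
            break;
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                      /* assumption: stop on bad input */
        src += n;
        sn -= n;
        count++;
    }
    return count;
}
```

Because a combining character is a complete multibyte sequence of its own, this loop counts it separately from its base character, which matches the "not reduced" behavior described above.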

size_t mbssize(const char *src);

The mbssize function returns the number of bytes in a complete character sequence. Note this will not be the same as strlen(3) if there is an incomplete multibyte sequence at the end of the string.

size_t mbsnsize(const char *src, size_t sn, int cn);

The mbsnsize function returns the number of bytes in a complete character sequence. No more than sn bytes of src will be examined and no more than cn characters will be converted. Note this will not be the same as strnlen(3) if the sn or cn constraints end on an incomplete multibyte sequence or if the '\0' is encountered in the middle of an incomplete multibyte sequence.

char *mbsdup(const char *src);

The mbsdup function will return a copy of the multibyte string at src. An incomplete multibyte sequence at the end of the string will not be copied. Only a complete valid multibyte string will be returned.

char *mbsndup(const char *src, size_t sn, int cn);

The mbsndup function will return a copy of the multibyte string at src. No more than sn bytes of src will be examined and no more than cn characters will be converted. If the sn or cn constraints end on an incomplete multibyte sequence or if the '\0' is encountered in the middle of an incomplete multibyte sequence those extra bytes will not be copied. Only a complete multibyte string will be returned.
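
The size and copy operations fit together naturally: measure the bytes occupied by complete characters, then copy exactly that many. The sketch below shows the idea with standard C only. The names sketch_mbsnsize and sketch_mbsndup are hypothetical, and treating invalid input the same way as a truncated sequence is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: bytes occupied by the leading run of complete
 * multibyte characters in src, bounded by sn bytes and cn characters
 * (cn < 0 means no character limit). */
size_t sketch_mbsnsize(const char *src, size_t sn, int cn)
{
    const char *p = src;
    mbstate_t ps;
    int count = 0;

    assert(src != NULL);
    memset(&ps, 0, sizeof ps);
    while (sn > 0 && (cn < 0 || count < cn)) {
        size_t lim = sn < MB_CUR_MAX ? sn : MB_CUR_MAX;
        size_t n = mbrtowc(NULL, p, lim, &ps);

        if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
            break;      /* '\0', invalid bytes, or a truncated sequence */
        p += n;
        sn -= n;
        count++;
    }
    return (size_t)(p - src);
}

/* Hypothetical helper: heap copy of only the complete characters.
 * Returns NULL if allocation fails; the caller frees the result. */
char *sketch_mbsndup(const char *src, size_t sn, int cn)
{
    size_t size = sketch_mbsnsize(src, sn, cn);
    char *dst = malloc(size + 1);

    if (dst != NULL) {
        memcpy(dst, src, size);
        dst[size] = '\0';
    }
    return dst;
}
```

The copy is always terminated, so the result can be passed to ordinary string functions even when the source was cut in the middle of a multibyte sequence.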

char *mbsoff(char *src, int off);

The mbsoff function will return the substring of src that starts at off. The off parameter is measured in characters, where characters are display positions and control characters; however, strings do not commonly contain control characters (and, from an ADT perspective, should not).
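
As a rough illustration of advancing by characters rather than bytes, the sketch below skips off complete multibyte characters with mbrtowc(3). The name sketch_mbsoff is hypothetical. The sketch assumes every complete character counts as one unit; how the library treats zero-width or double-width characters is not settled by the description above. Returning NULL on invalid input is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: pointer to the character that lies off characters
 * into src. If the string is shorter than off characters, the pointer to
 * its terminating '\0' is returned. A zero or negative off returns src.
 * Returns NULL on an invalid or incomplete sequence (an assumption). */
char *sketch_mbsoff(char *src, int off)
{
    mbstate_t ps;

    assert(src != NULL);
    memset(&ps, 0, sizeof ps);
    while (off > 0) {
        size_t n = mbrtowc(NULL, src, MB_CUR_MAX, &ps);

        if (n == 0)             /* end of string reached early */
            break;
        if (n == (size_t)-1 || n == (size_t)-2)
            return NULL;
        src += n;               /* step over one whole character */
        off--;
    }
    return src;
}
```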

char *mbsnoff(char *src, int off, size_t sn);

The mbsnoff function will return the substring of src that starts at off number of characters. No more than sn number of bytes of src will be examined. If the sn parameter is exhausted, a pointer to the next valid multibyte character sequence following the sn position is returned.

char *mbschr(char *src, wchar_t wc);

The mbschr function will return a substring pointing to the first occurrence of the character wc in the multibyte string represented by src.

char *mbsnchr(char *src, size_t sn, int cn, wchar_t wc);

The mbsnchr function will return a substring pointing to the first occurrence of the character wc in the multibyte string represented by src. No more than sn bytes of src will be examined and no more than cn characters will be converted. Either or both of sn and cn may be -1, indicating that the constraint should be ignored (no limit).
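
A character search differs from strchr(3) in that each multibyte sequence is decoded to a wide character before it is compared, so a match can never land in the middle of a sequence. The following sketch shows one way to do that with mbrtowc(3). The name sketch_mbsnchr is hypothetical, and returning NULL for invalid input or for a search for L'\0' is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: pointer to the first character in src equal to wc,
 * or NULL if it is not found within sn bytes and cn characters
 * (cn < 0 means no character limit). */
char *sketch_mbsnchr(char *src, size_t sn, int cn, wchar_t wc)
{
    mbstate_t ps;
    int count = 0;

    assert(src != NULL);
    memset(&ps, 0, sizeof ps);
    while (sn > 0 && (cn < 0 || count < cn)) {
        wchar_t cur;
        size_t lim = sn < MB_CUR_MAX ? sn : MB_CUR_MAX;
        size_t n = mbrtowc(&cur, src, lim, &ps);

        if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
            return NULL;        /* end of string or undecodable bytes */
        if (cur == wc)
            return src;         /* points at the start of the sequence */
        src += n;
        sn -= n;
        count++;
    }
    return NULL;
}
```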

int mbswidth(const char *src, size_t sn, int wn);

The mbswidth function will return the number of display positions a multibyte sequence will occupy. No more than sn bytes of src will be examined and no more than wn display positions will be considered. Control characters are considered to occupy 1 display position (so there should be no control characters in the src string).

char *mbssub(char *src, size_t sn, int wn);

The mbssub function will return a substring of the multibyte sequence src that is no larger in size than sn and will occupy no more than wn display positions should it be printed on a multibyte (UTF-8) capable display.
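
The width-limited operations can be pictured as the same decoding loop with a running column total. The sketch below uses the POSIX wcwidth(3) function for column widths and returns a byte count rather than a string, because the description above does not say how mbssub hands back its result. The name sketch_fit_bytes is hypothetical, wn is assumed to be non-negative, and counting non-printable characters as one column follows the mbswidth description.

```c
#define _XOPEN_SOURCE 700       /* exposes wcwidth(3) in <wchar.h> */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of leading bytes of src that form complete
 * characters, fit within sn bytes, and occupy at most wn display columns. */
size_t sketch_fit_bytes(const char *src, size_t sn, int wn)
{
    const char *p = src;
    mbstate_t ps;
    int used = 0;

    assert(src != NULL);
    memset(&ps, 0, sizeof ps);
    while (sn > 0) {
        wchar_t wc;
        size_t lim = sn < MB_CUR_MAX ? sn : MB_CUR_MAX;
        size_t n = mbrtowc(&wc, p, lim, &ps);
        int w;

        if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
            break;              /* end of string or undecodable bytes */
        w = wcwidth(wc);
        if (w < 0)
            w = 1;              /* non-printable: count as one column */
        if (used + w > wn)
            break;              /* next character would not fit */
        used += w;
        p += n;
        sn -= n;
    }
    return (size_t)(p - src);
}
```

The returned count can be passed as the precision of a "%.*s" conversion in printf(3) to print only the part of the string that fits in a fixed-width field.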