UTF-8 tools library


Introduction

The library libutf8tools is part of the GEDCOM parser library, but it can also be used in unrelated programs.  It provides helper functions for handling the UTF-8 encoding.  It comes with the following installed:
The following sections describe the features of the library.

UTF-8 string functions

The following simple functions are available to handle UTF-8 strings in general:
int is_utf8_string (char *input);
int utf8_strlen (char *input);
The first function returns 1 if the given input is a valid UTF-8 string and 0 otherwise.  The second returns the number of UTF-8 characters in the given input.  Note that the second function assumes that the input is valid UTF-8, and gives unpredictable results if it isn't.
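
To make the two notions concrete, here is a minimal standalone sketch of a validity check and a character count.  It is not the library's implementation: the names sketch_is_utf8 and sketch_utf8_strlen are hypothetical, and the strictness of the check (rejecting overlong forms and surrogates) is an assumption of this sketch, not something this document states about is_utf8_string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a validity check: returns 1 if input is
   well-formed UTF-8, 0 otherwise.  Not the library's own code. */
int sketch_is_utf8(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;

    assert(input != NULL);
    while (*p) {
        int extra;          /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (*p < 0x80)                { extra = 0; cp = *p;        min = 0; }
        else if ((*p & 0xE0) == 0xC0) { extra = 1; cp = *p & 0x1F; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { extra = 2; cp = *p & 0x0F; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { extra = 3; cp = *p & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */
        p++;

        while (extra-- > 0) {
            /* A terminating NUL also fails this test, so a truncated
               sequence is rejected without reading past the string. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    }
    return 1;
}

/* Hypothetical stand-in for a character count: counts every byte that
   is not a continuation byte.  Like the documented function, it is only
   meaningful for valid UTF-8 input. */
int sketch_utf8_strlen(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;
    int count = 0;

    assert(input != NULL);
    for (; *p; p++)
        if ((*p & 0xC0) != 0x80)
            count++;
    return count;
}
```

With these definitions, the 6-byte string "h\xC3\xA9llo" counts as 5 characters, which shows why the character count and strlen can differ.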

Converting character sets

For conversion from and to UTF-8 there is a generic interface which gives all the necessary flexibility, and a specific interface for conversion to and from the locale, which is less flexible, but more straightforward.

Generic interface

Conversion handle

In general, the program needs to initialize a conversion handle before some actual text can be converted to and from UTF-8.  This initialization (and the cleanup at the end) is performed via the following functions:
convert_t initialize_utf8_conversion (const char *charset, int ext_outbuf);
void cleanup_utf8_conversion (convert_t conv);
The first function returns a conversion handle, which needs to be passed to all generic conversion functions.  Through this handle, bidirectional conversion can take place between UTF-8 and the given character set 'charset'.  The implementation of this handle is not visible to the program that uses it.  In case of an error, the returned value is NULL and errno gives the error that occurred.

The second parameter, ext_outbuf, should be non-zero if you want to control the output buffer yourself (see below).  Under normal circumstances, you should pass 0 for this parameter.

To avoid memory leaks, it is advised that conversion handles are cleaned up when not needed anymore, using the cleanup_utf8_conversion function.  Note that after using this function, any access to the handle will result in undefined behaviour.
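
The following sketch shows one way an opaque, bidirectional handle of this kind could be built on top of the POSIX iconv interface.  Whether libutf8tools is implemented this way is an assumption; the type struct sketch_conv and the functions sketch_conv_open and sketch_conv_close are hypothetical names used only for illustration.  The sketch mirrors the documented error convention: NULL is returned and errno describes the failure.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical handle: one iconv descriptor per conversion direction. */
struct sketch_conv {
    iconv_t to_utf8;    /* charset -> UTF-8 */
    iconv_t from_utf8;  /* UTF-8 -> charset */
};

/* Returns a new handle, or NULL with errno set if the character set
   is not supported or memory runs out. */
struct sketch_conv *sketch_conv_open(const char *charset)
{
    struct sketch_conv *conv;
    int saved_errno;

    assert(charset != NULL);
    conv = malloc(sizeof *conv);
    if (conv == NULL)
        return NULL;

    conv->to_utf8 = iconv_open("UTF-8", charset);
    if (conv->to_utf8 == (iconv_t)-1) {
        saved_errno = errno;   /* free() may overwrite errno */
        free(conv);
        errno = saved_errno;
        return NULL;
    }

    conv->from_utf8 = iconv_open(charset, "UTF-8");
    if (conv->from_utf8 == (iconv_t)-1) {
        saved_errno = errno;
        iconv_close(conv->to_utf8);
        free(conv);
        errno = saved_errno;
        return NULL;
    }
    return conv;
}

/* Releases both descriptors and the handle itself.  The handle must
   not be used afterwards. */
void sketch_conv_close(struct sketch_conv *conv)
{
    if (conv == NULL)
        return;
    iconv_close(conv->to_utf8);
    iconv_close(conv->from_utf8);
    free(conv);
}
```

Pairing every successful open with exactly one close, as in this sketch, is the pattern the advice above about cleaning up handles is aiming at.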

Conversion functions

Once a conversion handle is initialized, it can be used to convert text between UTF-8 and the given character set.  There are three functions available to do so:
char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);

char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);
All three functions take the conversion handle as first parameter, and the text to convert as second parameter.  They return a pointer to an output buffer, which is overwritten at each call of the functions (unless you control your own output buffers, see below).  

The difference between the last two functions is that convert_to_utf8 converts only entire strings (i.e. it resets the conversion state each time), whereas convert_to_utf8_incremental takes previous conversions into account for the current conversion (left over input characters from the previous conversion can then be combined with the current input characters).  If you pass NULL as input to convert_to_utf8_incremental, the conversion restarts from a clean state.
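
The idea of carrying left-over input from one call to the next can be illustrated with a small standalone converter.  This is only a sketch under stated assumptions: the source character set is taken to be UTF-16LE restricted to the Basic Multilingual Plane (surrogate pairs are not handled), and struct sketch_state and sketch_to_utf8_incremental are hypothetical names, not part of the library.  As with the documented function, passing NULL as input resets the state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical conversion state: at most one input byte can be left
   over when a 2-byte UTF-16LE unit is split across two calls. */
struct sketch_state {
    int has_pending;
    unsigned char pending;
};

/* Converts UTF-16LE input (BMP only, surrogates not handled) to UTF-8.
   The caller must provide room for 3 * (input_len / 2 + 1) bytes in
   out.  Returns the number of bytes written; the output is not
   NUL-terminated.  A NULL input discards any left-over byte. */
size_t sketch_to_utf8_incremental(struct sketch_state *st, const char *input,
                                  size_t input_len, char *out)
{
    const unsigned char *in = (const unsigned char *)input;
    size_t i = 0, n = 0;

    assert(st != NULL && out != NULL);
    if (input == NULL) {
        st->has_pending = 0;   /* restart from a clean state */
        return 0;
    }

    while (i < input_len) {
        unsigned int lo, unit;

        if (st->has_pending) {
            lo = st->pending;  /* low byte saved by the previous call */
            st->has_pending = 0;
        } else {
            lo = in[i++];
            if (i == input_len) {
                /* Unit is incomplete: keep the byte for the next call. */
                st->pending = (unsigned char)lo;
                st->has_pending = 1;
                break;
            }
        }
        unit = lo | ((unsigned int)in[i++] << 8);

        if (unit < 0x80) {
            out[n++] = (char)unit;
        } else if (unit < 0x800) {
            out[n++] = (char)(0xC0 | (unit >> 6));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (unit >> 12));
            out[n++] = (char)(0x80 | ((unit >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        }
    }
    return n;
}
```

Feeding the three bytes 0x41 0x00 0xE9 produces "A" and keeps 0xE9 pending; a following call with the single byte 0x00 completes the unit and produces the two UTF-8 bytes for U+00E9.  A non-incremental conversion would instead have to treat the trailing 0xE9 as an error or drop it.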

Since conversion from UTF-8 to another character set can fail (it's possible that some characters cannot be encoded in the target character set), the function convert_from_utf8 has a third parameter, conv_fails, which can return the number of conversion failures in the input.  Pass a pointer to an integer if you're interested, or pass NULL otherwise.  Note that for conversion failures the string '?' will be put in the output instead of the character that could not be converted.  This string can be changed using:
int conversion_set_unknown (convert_t conv, const char *unknown);
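
The failure counting and the replacement string can be illustrated with a standalone sketch that converts UTF-8 to ISO-8859-1 by hand.  The function name sketch_utf8_to_latin1 and its caller-supplied output buffer are assumptions made for this example only; the library's own functions manage the output buffer for you and are not limited to one target character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Converts valid UTF-8 to ISO-8859-1.  Characters above U+00FF cannot
   be encoded: each one is replaced by the string 'unknown' and counted.
   Returns the number of bytes written (excluding the terminating NUL),
   or (size_t)-1 if out is too small.  conv_fails may be NULL; it is
   only written on success. */
size_t sketch_utf8_to_latin1(const char *input, const char *unknown,
                             char *out, size_t out_size, int *conv_fails)
{
    const unsigned char *p = (const unsigned char *)input;
    size_t used = 0, ulen;
    int fails = 0;

    assert(input != NULL && unknown != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;
    ulen = strlen(unknown);

    while (*p) {
        if (*p < 0x80) {
            /* ASCII is copied unchanged. */
            if (used + 1 >= out_size) return (size_t)-1;
            out[used++] = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two-byte sequence, one Latin-1 byte. */
            if (used + 1 >= out_size) return (size_t)-1;
            out[used++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* Not representable: emit the replacement string and skip
               the lead byte plus its continuation bytes. */
            if (used + ulen >= out_size) return (size_t)-1;
            memcpy(out + used, unknown, ulen);
            used += ulen;
            fails++;
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
        }
    }
    out[used] = '\0';
    if (conv_fails != NULL)
        *conv_fails = fails;
    return used;
}
```

In this sketch, the euro sign (U+20AC) in the input yields one conversion failure and a single replacement string in the output, regardless of how many bytes the character occupied in UTF-8.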
Some character sets use wide characters to encode text.  For simplicity, however, the conversion functions above all take and return normal char strings, so in some cases it is necessary to know how long the strings are: a string that actually contains wide characters cannot be treated as a null-terminated string, so strlen does not work on it.

For this reason, the function convert_from_utf8 has a fourth parameter which can return the length of the output string (pass NULL if you know you don't need it), and the other functions have an input_len parameter, which should always be the string length of the input string, even if it could also be retrieved via strlen.
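
The strlen problem is easy to reproduce in isolation.  The sketch below encodes ASCII text as UTF-16LE into a char buffer; the function sketch_ascii_to_utf16le is a hypothetical helper written for this example, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodes ASCII text as UTF-16LE into out, followed by a 2-byte
   terminator.  Returns the length in bytes (excluding the terminator),
   or (size_t)-1 if out is too small. */
size_t sketch_ascii_to_utf16le(const char *input, char *out, size_t out_size)
{
    size_t len, i;

    assert(input != NULL && out != NULL);
    len = strlen(input);
    if (out_size < 2 * len + 2)
        return (size_t)-1;

    for (i = 0; i < len; i++) {
        out[2 * i] = input[i];   /* low byte carries the ASCII value */
        out[2 * i + 1] = '\0';   /* high byte is zero for ASCII */
    }
    out[2 * len] = '\0';
    out[2 * len + 1] = '\0';
    return 2 * len;
}
```

For the input "Hi" the function reports 4 bytes, while strlen on the same buffer stops at the first zero byte and reports 1.  This is why the length has to be passed and returned explicitly.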

Controlling the output buffer

In some cases, you may want to control the output buffer yourself, e.g. when you want multiple output buffers so that converted strings do not have to be copied.  This can be done by declaring your intention at the initialization of the conversion handle (see above).  In that case, the initialization doesn't allocate an output buffer itself, and you have to control it via the following functions before you can do any conversion:
conv_buffer_t create_conv_buffer (int initial_size);
void free_conv_buffer (conv_buffer_t buf);

int conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);
The first function returns a handle to a new conversion buffer with given initial size (the buffer is expanded dynamically when necessary).  The second function frees the buffer: all further access to the buffer handle will result in undefined behaviour.

The third function sets the current output buffer for the given conversion handle, which makes it possible to switch between output buffers.  The function returns 1 on success, 0 on failure.
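
A buffer that starts at a given size and is expanded dynamically can be sketched as follows.  The internals of conv_buffer_t are not described in this document, so the structure struct sketch_buffer, its doubling growth policy, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; the contents are kept NUL-terminated. */
struct sketch_buffer {
    char *data;
    size_t size;   /* allocated bytes */
    size_t used;   /* bytes in use, excluding the terminating NUL */
};

/* Creates a buffer with the given initial size; returns NULL on
   allocation failure. */
struct sketch_buffer *sketch_buffer_create(size_t initial_size)
{
    struct sketch_buffer *buf = malloc(sizeof *buf);

    if (buf == NULL)
        return NULL;
    if (initial_size == 0)
        initial_size = 1;      /* always leave room for the NUL */
    buf->data = malloc(initial_size);
    if (buf->data == NULL) {
        free(buf);
        return NULL;
    }
    buf->size = initial_size;
    buf->used = 0;
    buf->data[0] = '\0';
    return buf;
}

/* Appends len bytes, doubling the allocation until they fit.
   Returns 1 on success, 0 on failure (the buffer is left unchanged). */
int sketch_buffer_append(struct sketch_buffer *buf, const char *bytes,
                         size_t len)
{
    size_t needed;

    assert(buf != NULL && bytes != NULL);
    needed = buf->used + len + 1;
    if (needed > buf->size) {
        size_t new_size = buf->size;
        char *p;

        while (new_size < needed)
            new_size *= 2;
        p = realloc(buf->data, new_size);
        if (p == NULL)
            return 0;
        buf->data = p;
        buf->size = new_size;
    }
    memcpy(buf->data + buf->used, bytes, len);
    buf->used += len;
    buf->data[buf->used] = '\0';
    return 1;
}

/* Frees the buffer; the pointer must not be used afterwards. */
void sketch_buffer_free(struct sketch_buffer *buf)
{
    if (buf == NULL)
        return;
    free(buf->data);
    free(buf);
}
```

Keeping two such buffers and alternating between them lets a program hold on to one converted string while the next conversion writes into the other, which is the use case described above for multiple output buffers.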

Specific locale conversion

For conversion to the current locale, there is a simpler interface available, which takes care of the conversion handle implicitly.  The following functions are available:
char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);
Both functions return a pointer to a static buffer that is overwritten on each call.  To function properly, the application must first set the locale using the setlocale function.  

If you pass a pointer to an integer to the first function, it will be set to the number of conversion failures, i.e. characters that couldn't be converted; pass NULL if you are not interested.  Usually, the interesting information is just whether there were any conversion failures, i.e. whether the integer is greater than zero.  The second function doesn't need this parameter, because any locale can be converted to UTF-8.

By default, "?" is output for each character that can't be converted.  You can change this to any string you want by calling the following function before the conversion calls:
void convert_set_unknown (const char *unknown);
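
Because the locale functions depend on setlocale having been called, the sketch below shows the standard way for a program to adopt the locale of its environment and to look up the character set that locale uses.  It relies only on the POSIX functions setlocale and nl_langinfo; the wrapper name sketch_locale_charset is hypothetical, and whether libutf8tools determines the character set in this way is an assumption.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Adopts the locale from the environment (falling back to "C" if that
   fails) and returns the name of its character set, e.g. "UTF-8" or
   "ISO-8859-1".  The returned string is owned by the C library. */
const char *sketch_locale_charset(void)
{
    const char *charset;

    if (setlocale(LC_ALL, "") == NULL)  /* "" selects the environment's locale */
        setlocale(LC_ALL, "C");

    charset = nl_langinfo(CODESET);
    assert(charset != NULL);
    return charset;
}
```

A program would typically make the setlocale call once at startup, before the first call to convert_utf8_to_locale or convert_locale_to_utf8.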

$Id: utf8tools.html,v 1.2 2002/12/28 17:00:08 verthezp Exp $
$Name: R0_90_0 $