UTF-8 tools library
libutf8tools is part of the GEDCOM parser library,
but it can be used in unrelated programs too. It provides some helper
functions for handling UTF-8 encoding. It comes with the following:
- a library 'libutf8tools.so', which should be linked into your program
- a header 'utf8tools.h', which should be included in the source code of your program
The following sections describe the features of the library.
UTF-8 string functions
The following simple functions are available to handle UTF-8 strings in general:
int is_utf8_string (char *input);
int utf8_strlen (char *input);
The first one returns 1 if the given input is a valid UTF-8 string and
0 otherwise; the second gives the number of UTF-8 characters in the given
input. Note that the second function assumes that the input is valid
UTF-8, and gives unpredictable results if it isn't.
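To make the difference between bytes and UTF-8 characters concrete, here is a minimal, self-contained sketch of how such checks can be written. It is not the implementation used by libutf8tools: the names sketch_utf8_strlen and sketch_is_utf8 are hypothetical, and the validity check is only structural (it does not reject overlong encodings or surrogate code points).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts UTF-8 characters by counting every byte that
   is not a continuation byte (10xxxxxx). Like the documented utf8_strlen,
   it assumes that the input is valid UTF-8. */
int sketch_utf8_strlen(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;
    int count = 0;

    assert(input != NULL);
    for (; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: returns 1 if every lead byte is followed by the
   expected number of continuation bytes, 0 otherwise. */
int sketch_is_utf8(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;

    assert(input != NULL);
    while (*p != '\0') {
        int extra;

        if (*p < 0x80)                extra = 0;  /* ASCII */
        else if ((*p & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence */
        else if ((*p & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence */
        else if ((*p & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence */
        else return 0;                            /* stray continuation or invalid lead byte */
        p++;
        while (extra-- > 0) {
            /* The terminating zero byte also fails this test, so a
               truncated sequence never reads past the end of the string. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            p++;
        }
    }
    return 1;
}
```

For the five-byte string "\xC3\xA9t\xC3\xA9" (the word "été"), strlen reports 5 while the character count is 3.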
Converting character sets
For conversion from and to UTF-8 there is a generic interface which gives
all the necessary flexibility, and a specific interface for conversion to
and from the locale, which is less flexible, but more straightforward.
In general, the program needs to initialize a conversion handle before any
actual text can be converted to and from UTF-8. This initialization
(and the cleanup at the end) is performed via the following functions:
convert_t initialize_utf8_conversion (const char *charset, int ext_outbuf);
void cleanup_utf8_conversion (convert_t conv);
The first function returns a conversion handle, which needs to be passed
to all generic conversion functions. Through this handle, bidirectional
conversion can take place between UTF-8 and the given character set.
The implementation of this handle is not visible to the program that
uses it. In case of an error, the returned value is NULL and
errno gives the error that occurred.
The second parameter, ext_outbuf, should be non-zero if you want
to control the output buffer yourself (see below). Under normal circumstances,
you should pass 0 for this parameter.
To avoid memory leaks, conversion handles should be cleaned up when they are no longer needed, using the
cleanup_utf8_conversion function. Note that after calling this function, any access to the handle results in undefined behaviour.
Once a conversion handle is initialized, it can be used to convert text between
UTF-8 and the given character set. There are three functions available
to do so:
char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);
char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);
All three functions take the conversion handle as first parameter, and the
text to convert as second parameter. They return a pointer to an output
buffer, which is overwritten at each call of the functions (unless you control
your own output buffers, see below).
The difference between the last two functions is that
convert_to_utf8 converts only entire strings (i.e. it resets the conversion state each time), whereas
convert_to_utf8_incremental takes previous conversions into account for the current conversion (left
over input characters from the previous conversion can then be combined with
the current input characters). If you pass NULL as input to
convert_to_utf8_incremental, the conversion restarts from a clean state.
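The idea of carrying left-over input from one call to the next can be illustrated with a small, self-contained sketch. It is not the library's implementation: the type sketch_state and the functions below are hypothetical, the input encoding is assumed to be UTF-16LE, and only characters in the Basic Multilingual Plane are handled (surrogate pairs are not combined).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical conversion state: at most one input byte is left over
   when a chunk ends in the middle of a 16-bit code unit. */
typedef struct {
    unsigned char pending;
    int has_pending;
} sketch_state;

void sketch_reset(sketch_state *st)
{
    assert(st != NULL);
    st->pending = 0;
    st->has_pending = 0;
}

/* Encodes one BMP code point as UTF-8; returns the number of bytes written. */
size_t sketch_put_utf8(unsigned int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Converts one chunk of UTF-16LE bytes to UTF-8. The caller must provide
   room for 3 * (in_len / 2 + 1) + 1 bytes in out. Passing NULL as input
   restarts from a clean state, mirroring the behaviour described above. */
size_t sketch_convert_chunk(sketch_state *st, const unsigned char *in,
                            size_t in_len, char *out)
{
    size_t written = 0;
    size_t i = 0;

    assert(st != NULL && out != NULL);
    if (in == NULL) {
        sketch_reset(st);
        out[0] = '\0';
        return 0;
    }
    if (st->has_pending && in_len > 0) {
        /* Combine the byte left over from the previous call with the
           first byte of this chunk. */
        written += sketch_put_utf8(st->pending | ((unsigned int)in[0] << 8),
                                   out + written);
        st->has_pending = 0;
        i = 1;
    }
    for (; i + 1 < in_len; i += 2)
        written += sketch_put_utf8(in[i] | ((unsigned int)in[i + 1] << 8),
                                   out + written);
    if (i < in_len) {
        /* Odd byte at the end of the chunk: keep it for the next call. */
        st->pending = in[i];
        st->has_pending = 1;
    }
    out[written] = '\0';
    return written;
}
```

If the UTF-16LE bytes 41 00 E9 00 ("Aé") arrive as a chunk of three bytes followed by a chunk of one byte, the first call produces "A" and the second produces the two-byte UTF-8 sequence for "é". A converter that reset its state on every call would lose the split character.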
Since conversion from UTF-8 to another character set can fail (it's possible
that some characters cannot be encoded in the target character set), the
function convert_from_utf8 has a third parameter,
which can return the number of conversion failures in the input. Pass
a pointer to an integer if you're interested, or pass NULL otherwise. Note
that for conversion failures the string '?' will be put in the output instead
of the character that could not be converted. This string can be changed
with the following function:
int conversion_set_unknown (convert_t conv, const char *unknown);
Some character sets use wide characters to encode text. But since the
conversion functions above, for simplicity, all take and return normal
strings, it is necessary in some cases to know how long the strings are (if
the string is actually using wide characters, then it cannot be considered
a null-terminated string, so strlen cannot work on it).
For this reason, the function convert_from_utf8 has a fourth
parameter which can return the length of the output string (pass NULL if
you know you don't need it), and the other functions have an
input_len parameter, which should always be the length of the
input string, even if it could also be retrieved via strlen.
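The reason strlen is unreliable for wide-character encodings can be seen in a tiny, self-contained sketch. The array and function names are hypothetical, and UTF-16LE is only used here as an example of such an encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE (4 bytes), followed by a two-byte terminator. */
const unsigned char sketch_utf16_hi[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

/* Treats the data as a normal C string, which is what strlen would do. */
size_t sketch_naive_length(const unsigned char *data)
{
    assert(data != NULL);
    return strlen((const char *)data);
}
```

Here sketch_naive_length(sketch_utf16_hi) yields 1, because the second byte of the first character is already zero, whereas the text actually occupies 4 bytes. This is why the length has to be passed or returned explicitly.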
Controlling the output buffer
In some cases, you'd like to control the output buffer yourself, e.g. when
you want to have multiple output buffers so that the strings don't have
to be copied. This can be done by declaring your intention at
the initialization of the conversion handle (see above). In that case,
the initialization doesn't allocate an output buffer itself, and you have
to control it via the following functions, before you can do any conversion:
conv_buffer_t create_conv_buffer (int initial_size);
void free_conv_buffer (conv_buffer_t buf);
int conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);
The first function returns a handle to a new conversion buffer with the given
initial size (the buffer is expanded dynamically when necessary). The
second function frees the buffer: all further access to the buffer handle
will result in undefined behaviour.
The third function sets the current output buffer for the given
conversion handle, which makes it possible to switch between output buffers. The
function returns 1 on success, 0 on failure.
Specific locale conversion
For conversion to the current locale, there is a simpler interface available,
which takes care of the conversion handle implicitly. The following
functions are available:
char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);
Both functions return a pointer to a static buffer that is overwritten
on each call. To function properly, the application must first set
the locale using the setlocale function.
If you pass a pointer to an integer to the first function, it will be
set to the number of conversion failures, i.e. characters that couldn't
be converted; you can also just pass NULL if you are not interested
(note that usually, the interesting information is just whether there
were conversion failures or not, which is then given by the integer
being bigger than zero or not). The second function doesn't need this,
because any locale can be converted to UTF-8.
You can change the "?" that is output for characters that can't be converted
to any string you want, using the following function before the conversion:
void convert_set_unknown (const char *unknown);
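The failure counting and the replacement string described above can be sketched with a self-contained example. It is not the library's implementation: the function sketch_utf8_to_latin1 is hypothetical, the target character set is assumed to be ISO-8859-1, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical converter from valid UTF-8 to ISO-8859-1. Each character
   that cannot be represented is replaced by the string 'unknown' and
   counted as one failure. The caller must provide an output buffer of at
   least strlen(input) * max(1, strlen(unknown)) + 1 bytes. Returns the
   number of bytes written, not counting the terminating zero. */
size_t sketch_utf8_to_latin1(const char *input, const char *unknown,
                             char *out, int *conv_fails)
{
    const unsigned char *p = (const unsigned char *)input;
    size_t written = 0;
    int fails = 0;

    assert(input != NULL && unknown != NULL && out != NULL);
    while (*p != '\0') {
        if (*p < 0x80) {
            /* ASCII is identical in both encodings. */
            out[written++] = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two-byte sequences with lead byte C2 or C3. */
            out[written++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* Not representable: emit the replacement string and skip
               the lead byte plus all of its continuation bytes. */
            size_t n = strlen(unknown);

            memcpy(out + written, unknown, n);
            written += n;
            fails++;
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
        }
    }
    out[written] = '\0';
    if (conv_fails != NULL)
        *conv_fails = fails;
    return written;
}
```

Converting "café €" (given as UTF-8) with "?" as replacement gives the six bytes "caf", 0xE9, " " and "?", with one failure reported, since the euro sign does not exist in ISO-8859-1. As noted above, a caller usually only needs to test whether the failure count is greater than zero.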
$Id: utf8tools.html,v 1.2 2002/12/28 17:00:08 verthezp Exp $
$Name: R0_90_0 $