UTF-8 tools library
Introduction
The library libutf8tools is part of the GEDCOM parser library, but it can
also be used in unrelated programs. It provides some helper functions for
handling UTF-8 encoding. The following files are installed with it:
- a library 'libutf8tools.so', which should be linked into your program
- a header 'utf8tools.h', which should be included in the source code of your program
The following sections describe the features of the library.
UTF-8 string functions
The following simple functions are available to handle UTF-8 strings in general:
int is_utf8_string (char *input);
int utf8_strlen (char *input);
The first function returns 1 if the given input is a valid UTF-8 string, and
0 otherwise. The second returns the number of UTF-8 characters in the given
input. Note that the second function assumes that the input is valid
UTF-8, and gives unpredictable results if it isn't.
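To illustrate what such a check and such a count involve, here is a minimal sketch in plain C. It is not the library's implementation: sketch_is_utf8 and sketch_utf8_strlen are hypothetical names, and the validity check only looks at the structure of lead and continuation bytes (it does not reject every overlong or surrogate encoding).

```c
#include <stddef.h>

/* Hypothetical stand-in for a validity check: returns 1 if every lead byte
   is followed by the right number of continuation bytes, 0 otherwise. */
int sketch_is_utf8(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;

    while (*p) {
        int extra;                          /* continuation bytes expected */

        if (*p < 0x80)
            extra = 0;
        else if (*p >= 0xC2 && *p <= 0xDF)
            extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF)
            extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4)
            extra = 3;
        else
            return 0;                       /* stray continuation or bad lead */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)        /* also catches a premature '\0' */
                return 0;
            p++;
        }
    }
    return 1;
}

/* Hypothetical stand-in for a character count: counts every byte that is
   not a continuation byte (10xxxxxx). Only meaningful for valid UTF-8. */
size_t sketch_utf8_strlen(const char *input)
{
    size_t count = 0;

    for (; *input; input++)
        if (((unsigned char)*input & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the string "h\xc3\xa9llo" (the word "héllo"), strlen reports 6 bytes, whereas the counting sketch reports 5 characters. On invalid input the count has no useful meaning, which is why validating first is the safer order.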
Converting character sets
For conversion from and to UTF-8 there is a generic interface which gives
all the necessary flexibility, and a specific interface for conversion to
and from the locale, which is less flexible, but more straightforward.
Generic interface
Conversion handle
In general, the program needs to initialize a conversion handle before some
actual text can be converted to and from UTF-8. This initialization
(and the cleanup at the end) is performed via the following functions:
convert_t initialize_utf8_conversion (const char *charset, int ext_outbuf);
void cleanup_utf8_conversion (convert_t conv);
The first function returns a conversion handle, which needs to be passed
to all generic conversion functions. Through this handle, bidirectional
conversion can take place between UTF-8 and the given character set 'charset'.
The implementation of this handle is not visible to the program that
uses it. In case of an error, the returned value is NULL and errno
gives the error that occurred.
The second parameter, ext_outbuf, should be non-zero if you want
to control the output buffer yourself (see below). Under normal circumstances,
you should pass 0 for this parameter.
To avoid memory leaks, clean up conversion handles that are no longer needed, using the cleanup_utf8_conversion
function. Note that after calling this function, any access to the handle results in undefined behaviour.
Conversion functions
Once a conversion handle is initialized, it can be used to convert text between
UTF-8 and the given character set. There are three functions available
to do so:
char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);
char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);
All three functions take the conversion handle as first parameter, and the
text to convert as second parameter. They return a pointer to an output
buffer, which is overwritten at each call of the functions (unless you control
your own output buffers, see below).
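Because the returned buffer is reused, a result that must survive the next conversion call has to be copied first. The following sketch shows the pitfall and the copy in plain C. The function sketch_convert is a hypothetical stand-in that merely copies its input into a reused internal buffer; it is not one of the library functions, and sketch_keep is likewise only an illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a conversion function that returns a pointer to
   an internal buffer which is overwritten on every call. */
const char *sketch_convert(const char *input)
{
    static char buffer[64];

    strncpy(buffer, input, sizeof buffer - 1);
    buffer[sizeof buffer - 1] = '\0';
    return buffer;
}

/* Copies a result into caller-owned memory so that it survives the next
   call. The caller must free() the returned pointer; NULL on failure. */
char *sketch_keep(const char *result)
{
    size_t len = strlen(result) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, result, len);
    return copy;
}
```

If the results of sketch_convert("one") and sketch_convert("two") are both kept as plain pointers, they point to the same memory and both read "two"; only a copy made with sketch_keep before the second call still reads "one".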
The difference between the last two functions is that convert_to_utf8
converts only entire strings (i.e. it resets the conversion state each time),
whereas convert_to_utf8_incremental takes previous conversions into account
for the current conversion (leftover input characters from the previous
conversion can then be combined with the current input characters). If you
pass NULL as input to convert_to_utf8_incremental, the conversion restarts
from a clean state.
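The idea of carrying leftover input from one call to the next can be sketched in plain C as follows. This is an illustration under stated assumptions, not the library's code: the input is assumed to be UCS-2 little-endian (no surrogate pairs), and sketch_state, sketch_put_utf8 and sketch_to_utf8_incremental are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical conversion state: at most one leftover input byte. */
typedef struct {
    unsigned char pending;
    int has_pending;
} sketch_state;

/* Writes code point cp (below 0x10000) as UTF-8; returns bytes written. */
static size_t sketch_put_utf8(unsigned cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Converts a chunk of UCS-2 little-endian input to UTF-8. A trailing odd
   byte is kept in the state and combined with the next chunk. Passing NULL
   as input discards the leftover byte. Returns the number of bytes written,
   not counting the terminating '\0'. out_size must be at least 1. */
size_t sketch_to_utf8_incremental(sketch_state *st, const unsigned char *in,
                                  size_t in_len, char *out, size_t out_size)
{
    size_t i, o = 0;

    if (in == NULL) {
        st->has_pending = 0;            /* restart from a clean state */
        in_len = 0;
    }
    for (i = 0; i < in_len; i++) {
        if (!st->has_pending) {
            st->pending = in[i];        /* low byte of a 16-bit unit */
            st->has_pending = 1;
        } else {
            if (o + 4 > out_size)       /* need up to 3 bytes plus '\0' */
                break;                  /* sketch only: stop when full */
            o += sketch_put_utf8(st->pending | ((unsigned)in[i] << 8),
                                 out + o);
            st->has_pending = 0;
        }
    }
    out[o] = '\0';
    return o;
}
```

Feeding the bytes 41 00 E9 and then 00 42 00 produces "A" from the first call and "é" followed by "B" from the second: the byte E9 at the end of the first chunk is held back until its second half arrives. A non-incremental conversion would have to treat that dangling byte as an error or drop it.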
Since conversion from UTF-8 to another character set can fail (some
characters may not be encodable in the target character set), the
function convert_from_utf8 has a third parameter, conv_fails, which can
return the number of conversion failures in the input. Pass a pointer to an
integer if you're interested, or pass NULL otherwise. Note that for
conversion failures the string '?' will be put in the output instead
of the character that could not be converted. This string can be changed
using:
int conversion_set_unknown (convert_t conv, const char *unknown);
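How failure counting and the replacement string interact can be shown with a small stand-alone converter. The sketch below is not the library's implementation: it assumes valid UTF-8 input and ISO-8859-1 as the target character set, and sketch_utf8_to_latin1 with its parameters is a hypothetical name chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Converts valid UTF-8 to ISO-8859-1. Characters above U+00FF are replaced
   by the string 'unknown' and counted in *fails (if fails is not NULL).
   Returns the output length, or (size_t)-1 if the buffer is too small. */
size_t sketch_utf8_to_latin1(const char *in, char *out, size_t out_size,
                             const char *unknown, int *fails)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t o = 0, ulen = strlen(unknown);
    int nfails = 0;

    if (out_size == 0)
        return (size_t)-1;
    while (*p) {
        if (*p < 0x80) {                        /* ASCII: copy as is */
            if (o + 1 >= out_size)
                return (size_t)-1;
            out[o++] = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two bytes in UTF-8, one in ISO-8859-1 */
            if (o + 1 >= out_size)
                return (size_t)-1;
            out[o++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {                                /* not representable */
            if (o + ulen >= out_size)
                return (size_t)-1;
            memcpy(out + o, unknown, ulen);
            o += ulen;
            nfails++;
            p++;                                /* skip the lead byte ... */
            while ((*p & 0xC0) == 0x80)         /* ... and its continuations */
                p++;
        }
    }
    out[o] = '\0';
    if (fails != NULL)
        *fails = nfails;
    return o;
}
```

Converting "café €" with the replacement string "?" gives "café ?" in ISO-8859-1 and a failure count of 1, because the euro sign has no ISO-8859-1 encoding. With "[?]" as the replacement string, the output becomes "café [?]" and the count is still 1, since it counts characters, not output bytes.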
Some character sets use wide characters to encode text. But since, for
simplicity, the conversion functions above all take and return normal char
strings, it is sometimes necessary to know how long the strings are (a
string that actually uses wide characters cannot be treated as
a null-terminated string, so strlen cannot work on it).
For this reason, the function convert_from_utf8 has a fourth parameter,
output_len, which can return the length of the output string (pass NULL if
you know you don't need it), and the other functions have an input_len
parameter, which should always be the length of the input
string, even if it could also be retrieved via strlen.
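Why strlen cannot be trusted on such data is easy to demonstrate. The sketch below uses only the standard library; the array and the two helper functions are hypothetical names, and UTF-16 little-endian is assumed as the example of a wide encoding.

```c
#include <string.h>

/* "AB" encoded as UTF-16 little-endian: 4 bytes of text followed by a
   2-byte terminator. Every second byte of the text is zero. */
static const char sketch_utf16le_ab[] = { 'A', 0, 'B', 0, 0, 0 };

/* Length as seen by strlen, which stops at the first zero byte. */
size_t sketch_strlen_view(void)
{
    return strlen(sketch_utf16le_ab);
}

/* Actual length of the text in bytes, without the terminator. */
size_t sketch_real_length(void)
{
    return sizeof sketch_utf16le_ab - 2;
}
```

Here strlen sees a string of length 1, although the text occupies 4 bytes. An explicit length, passed in or returned separately, is the only reliable way to describe such a buffer.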
Controlling the output buffer
In some cases, you may want to control the output buffer yourself, e.g. to
use multiple output buffers so that you can avoid copying strings. This can
be done by declaring your intention when initializing the conversion handle
(see above). In that case, the initialization doesn't allocate an output
buffer itself, and you have to manage it via the following functions before
you can do any conversion:
conv_buffer_t create_conv_buffer (int initial_size);
void free_conv_buffer (conv_buffer_t buf);
int conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);
The first function returns a handle to a new conversion buffer with given
initial size (the buffer is expanded dynamically when necessary). The
second function frees the buffer: all further access to the buffer handle
will result in undefined behaviour.
The third function sets the current output buffer for the given
conversion handle, which makes it possible to switch between output buffers.
The function returns 1 on success, 0 on failure.
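To make the idea of separate, dynamically growing output buffers concrete, here is a minimal sketch in plain C. It does not show the library's internals: sketch_buffer and the sketch_* functions are hypothetical, and doubling the capacity is just one common growth strategy, assumed here for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer: heap memory plus its current capacity. */
typedef struct {
    char *data;
    size_t size;
} sketch_buffer;

/* Allocates a buffer with the given initial capacity; NULL on failure. */
sketch_buffer *sketch_create_buffer(size_t initial_size)
{
    sketch_buffer *buf = malloc(sizeof *buf);

    if (buf == NULL)
        return NULL;
    buf->size = initial_size > 0 ? initial_size : 1;
    buf->data = malloc(buf->size);
    if (buf->data == NULL) {
        free(buf);
        return NULL;
    }
    buf->data[0] = '\0';
    return buf;
}

/* Releases the buffer; the pointer must not be used afterwards. */
void sketch_free_buffer(sketch_buffer *buf)
{
    if (buf != NULL) {
        free(buf->data);
        free(buf);
    }
}

/* Stores text in the buffer, doubling the capacity until it fits.
   Returns 1 on success, 0 on failure. */
int sketch_store(sketch_buffer *buf, const char *text)
{
    size_t needed = strlen(text) + 1;
    size_t new_size = buf->size;

    while (new_size < needed)
        new_size *= 2;
    if (new_size != buf->size) {
        char *grown = realloc(buf->data, new_size);

        if (grown == NULL)
            return 0;
        buf->data = grown;
        buf->size = new_size;
    }
    memcpy(buf->data, text, needed);
    return 1;
}
```

With two such buffers, two results can be held at the same time without copying either of them: writing into the second buffer leaves the contents of the first untouched.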
Specific locale conversion
For conversion to the current locale, there is a simpler interface available,
which takes care of the conversion handle implicitly. The following
functions are available:
char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);
Both functions return a pointer to a static buffer that is overwritten
on each call. To function properly, the application must first set
the locale using the setlocale function.
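The reason the locale must be set first is that a C program starts in the "C" locale, regardless of the user's environment. The sketch below shows, with standard POSIX calls only, how a program adopts the environment's locale and queries its character encoding. Whether the library determines the encoding in this way is an assumption; sketch_locale_charset is a hypothetical helper.

```c
#include <langinfo.h>
#include <locale.h>

/* Adopts the locale from the environment (LANG, LC_ALL, ...) and returns
   the name of its character encoding, e.g. "UTF-8" or "ISO-8859-1".
   If the environment's locale is not available, the current locale
   stays in effect and its encoding is returned instead. */
const char *sketch_locale_charset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

In a typical program, the setlocale(LC_ALL, "") call belongs at the start of main, before any locale-dependent conversion is attempted.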
If you pass a pointer to an integer to the first function, it will be
set to the number of conversion failures, i.e. characters that couldn't
be converted; you can also just pass NULL if you are not interested
(usually, the interesting information is just whether there were any
conversion failures, i.e. whether the integer is greater than zero).
The second function doesn't need this, because any locale can be
converted to UTF-8.
You can change the "?" that is output for characters that can't be converted
to any string you want, using the following function before the conversion
calls:
void convert_set_unknown (const char *unknown);
$Id: utf8tools.html,v 1.2 2002/12/28 17:00:08 verthezp Exp $
$Name: R0_90_0 $