Character encoding

The character encoding problem

Developers are usually familiar with the ASCII character set, which assigns a unique number to each of its characters: an "A" has ASCII code 65 (0x41 in hex), and an "a" has ASCII code 97 (0x61 in hex).  Some people may also have used ASCII codes for several drawing characters (such as a horizontal bar, a vertical bar, or a top-right corner) in the old DOS days, to draw nice windows in text mode.

However, strictly speaking, these drawing characters are not part of the ASCII set.  The standard ASCII set contains only the character positions from 0 to 127 (i.e. anything that fits into an integer that is 7 bits wide).  An example of this table can be found here.  Anything that has a code between 128 and 255 is in principle undefined.
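As a quick illustration of the 7-bit limit, here is a minimal Python sketch (the choice of Python is an assumption; the text itself names no language). It looks up the two codes quoted above and checks that a character outside positions 0 to 127 is rejected by a strict ASCII encoder.

```python
# Codes quoted in the text above.
code_upper_a = ord("A")   # 65, or 0x41
code_lower_a = ord("a")   # 97, or 0x61

# Every ASCII code fits in 7 bits, i.e. lies in the range 0..127.
fits_in_7_bits = all(code < 2**7 for code in (code_upper_a, code_lower_a))

# "\u00e9" is the character e-acute; it has no position in ASCII,
# so the strict ASCII codec refuses to encode it.
try:
    "\u00e9".encode("ascii")
    e_acute_is_ascii = True
except UnicodeEncodeError:
    e_acute_is_ascii = False

print(code_upper_a, hex(code_upper_a), code_lower_a, hex(code_lower_a))
print(fits_in_7_bits, e_acute_is_ascii)
```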

Now, several systems (including the old DOS) have defined those character positions anyway, but usually in totally different ways.  Well-known extensions include ISO-8859-1 (Latin-1) and the DOS character set, both of which are used in the examples below.

And these are only examples of character sets used in Western European languages.  For Japanese, Chinese, Korean, Vietnamese, ... there are separate character sets in which the meaning of one byte can even be influenced by the previous byte, i.e. these are multi-byte character sets.  This is because even 256 characters (the maximum for 8 bits) are totally inadequate to represent all characters in such languages.

To summarize: if a text file contains a byte with the value 65, it is pretty safe to assume that this byte represents an "A", if we ignore the multi-byte character sets mentioned above.  However, a value of 233 cannot be interpreted without knowing in which character set the text file is written.  In Latin-1, it happens to be the character "é", but in another character set it can be something totally different (e.g. in the DOS character set it is the Greek letter theta).

[Figure: Conversion from byte value]
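The ambiguity of byte value 233 can be reproduced with a small Python sketch. It assumes that Python's "cp437" codec (the original IBM PC code page) is a fair stand-in for what the text calls the DOS character set; other DOS code pages may map this byte differently.

```python
raw = bytes([233])  # a single byte with value 233 (0xE9)

# The same byte, interpreted in two different character sets.
as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_dos = raw.decode("cp437")       # assumption: cp437 as "the DOS character set"

print(as_latin1, as_dos)

# Byte 65 is safe: both character sets extend ASCII, so it is "A" in each.
a_in_latin1 = bytes([65]).decode("latin-1")
a_in_dos = bytes([65]).decode("cp437")
```

In cp437 the result is the capital form of theta (U+0398).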

Conversely, if you need to write the character "é" to a file, its numerical value in the file depends on the character set you use: in Latin-1 it will be 233, but in the DOS character set it will be 130.  This again makes it necessary to know the encoding when you want to re-read the file.

[Figure: Conversion to byte value]
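The opposite direction can be sketched the same way, again assuming Python and its "cp437" codec as a stand-in for the DOS character set. The last step shows what happens when the file is re-read with the wrong character set.

```python
e_acute = "\u00e9"  # the character e-acute

# One character, two different byte values on disk.
latin1_bytes = e_acute.encode("latin-1")
dos_bytes = e_acute.encode("cp437")  # assumption: cp437 as "the DOS character set"

print(list(latin1_bytes), list(dos_bytes))

# Re-reading the DOS bytes as Latin-1 silently yields a different character.
misread = dos_bytes.decode("latin-1")
round_trip = dos_bytes.decode("cp437")
```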

This is a source of great confusion as soon as you go outside the normal English character set, especially when you are using files on different systems...

Unicode code points

Enter the Unicode standard...

Unicode solves the problem of encoding by assigning a unique number to every character that is used anywhere in the world.  Since it is not possible to do this in 8 bits (with a maximum of 256 code positions), a Unicode character is usually represented by 16 bits, written as U+0000 to U+FFFF in hexadecimal notation.  A number such as U+0123 is called a "code point".

With Unicode 3.1, characters were also assigned beyond U+FFFF, so that the range in use is now U+000000 to U+10FFFF (21 bits).  Formally, the character set was defined as 31 bits wide to allow for future expansion, although later revisions of the standards limit it to U+10FFFF.
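The 21-bit figure can be checked with a few lines of Python. This is only a sketch, and it assumes Python 3.3 or later, where strings can hold any code point up to U+10FFFF.

```python
import sys

highest = 0x10FFFF                   # last code point in the range quoted above
bits_needed = highest.bit_length()   # number of bits needed to store it

# Python refuses to create a character beyond that limit.
try:
    chr(highest + 1)
    beyond_is_valid = True
except ValueError:
    beyond_is_valid = False

print(bits_needed, hex(sys.maxunicode), beyond_is_valid)
```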

The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1 character set (and thus automatically also with the ASCII character set), because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY.
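This compatibility is easy to verify. The Python sketch below (an illustration, not part of the original text) decodes every possible byte value as Latin-1 and checks that the resulting code point equals the byte value; the same check is repeated for the 128 ASCII positions.

```python
# For every ISO-8859-1 byte value 0xXY, the code point should be U+00XY.
latin1_matches_unicode = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

# ASCII is the lower half of Latin-1, so the same holds for 0..127.
ascii_matches_unicode = all(
    ord(bytes([value]).decode("ascii")) == value for value in range(128)
)

print(latin1_matches_unicode, ascii_matches_unicode)
```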

Some examples of Unicode code points (some of the characters here may not be displayed correctly in all browsers; current Mozilla works perfectly for this, but it also depends on the installed fonts of course):

Unicode code point    Character
U+0041                A
U+00E9                é
U+03B8                θ (the Greek theta)
U+20AC                € (the euro)
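The code points listed above can be looked up programmatically. The following Python sketch uses the standard unicodedata module; the helper name describe is hypothetical and exists only for this illustration.

```python
import unicodedata

def describe(character: str) -> str:
    """Hypothetical helper: format one character as 'U+XXXX <char> <NAME>'."""
    return f"U+{ord(character):04X} {character} {unicodedata.name(character)}"

# The four characters from the list above, written as escapes.
for character in ("A", "\u00e9", "\u03b8", "\u20ac"):
    print(describe(character))

# Going the other way: from a code point to its character.
theta = chr(0x03B8)
```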

With Unicode code points there is no longer any confusion about which character is meant, because each code point uniquely defines a character.  The full Unicode code charts can be found here (as a set of PDF documents).  A nice application to see all Unicode characters is the Unicode Character Map (ucm), which can be found here, and which lets you select and paste any Unicode character.

Some additional terminology (more terminology follows in the next section):

Unicode encodings, UTF-8

Since Unicode characters are generally represented by a number that is 16 bits wide, as seen above (for the basic plane), it would seem that all text files would double in size, since the usual ASCII characters are 8 bits wide.  However, the Unicode code points are not necessarily the values that are written to files...  

Indeed, the simplest solution is to take the code point that defines a character, split it up into two bytes, and write the two bytes to the file.  This is called the UCS-2 encoding scheme:

Character    Unicode code point    Byte values in file (UCS-2)
A            U+0041                0x00, 0x41
é            U+00E9                0x00, 0xE9
θ (theta)    U+03B8                0x03, 0xB8
€ (euro)     U+20AC                0x20, 0xAC

This table assumes a big-endian encoding of UCS-2: the endianness is in principle not defined, so there are two versions of UCS-2.  The little-endian encoding results in the same byte values as in the table above, but in reverse order.
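The splitting of a code point into two bytes can be sketched in Python as follows. The function ucs2_encode is a hypothetical helper written for this illustration (Python has no codec of that name); for code points up to U+FFFF its output should match Python's "utf-16-be" and "utf-16-le" codecs.

```python
import struct

def ucs2_encode(text: str, big_endian: bool = True) -> bytes:
    """Hypothetical helper: write each code point as one 16-bit unit."""
    fmt = ">H" if big_endian else "<H"  # H = unsigned 16-bit integer
    out = bytearray()
    for character in text:
        code_point = ord(character)
        if code_point > 0xFFFF:
            raise ValueError("UCS-2 cannot represent code points above U+FFFF")
        out += struct.pack(fmt, code_point)
    return bytes(out)

euro_big = ucs2_encode("\u20ac")                       # 0x20, 0xAC
euro_little = ucs2_encode("\u20ac", big_endian=False)  # 0xAC, 0x20
print(euro_big.hex(), euro_little.hex())
```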

So, we see that the UCS-2 encoding doubles the size of files that contain only English text.  This is one disadvantage of this encoding.  Another disadvantage is that null bytes can occur in normal text, breaking all conventions for null-terminated C strings if you use the normal char type.  This is why C also defines the wchar_t type, which can hold a 32-bit character (at least on GNU systems).  To avoid both of these disadvantages, UTF-8 was introduced.
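Both disadvantages can be made visible with a short Python sketch. It assumes that the "utf-16-be" codec may stand in for big-endian UCS-2, which holds for the plain English sample text used here.

```python
text = "Hello, world"

ucs2_bytes = text.encode("utf-16-be")  # assumption: equals UCS-2 for this text
utf8_bytes = text.encode("utf-8")

# Disadvantage 1: English text doubles in size.
size_ratio = len(ucs2_bytes) / len(utf8_bytes)

# Disadvantage 2: null bytes appear in ordinary text.
ucs2_has_null = 0 in ucs2_bytes
utf8_has_null = 0 in utf8_bytes

# Roughly what a C function such as strlen() would see: everything
# before the first null byte, which here is nothing at all.
visible_to_c = ucs2_bytes.split(b"\x00", 1)[0]

print(size_ratio, ucs2_has_null, utf8_has_null, visible_to_c)
```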

In UTF-8, the number of bytes used to write a character to a file depends on the Unicode code point.  The table corresponding to the one above is:

Character    Unicode code point    Byte values in file (UTF-8)
A            U+0041                0x41
é            U+00E9                0xC3, 0xA9
θ (theta)    U+03B8                0xCE, 0xB8
€ (euro)     U+20AC                0xE2, 0x82, 0xAC

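Some immediate observations: ASCII characters such as "A" are written as a single byte with the same value as in ASCII, so files that contain only English text do not grow; the other characters in the table take two or three bytes, none of which is a null byte.
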
An excellent explanation of how characters are encoded in UTF-8 can be found on this page.
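For readers who prefer code, the following Python sketch builds the UTF-8 bytes by hand. The bit layout it uses is the standard UTF-8 scheme, which the text above does not spell out; the function name utf8_encode is hypothetical, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Hypothetical helper: encode one code point as UTF-8 by hand.

    Unlike the built-in codec, this sketch does not reject the
    surrogate range U+D800..U+DFFF.
    """
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0x10FFFF:   # up to 21 bits: 11110xxx and three 10xxxxxx
        return bytes([
            0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("code point is beyond U+10FFFF")

# The four characters from the table above.
for code_point in (0x0041, 0x00E9, 0x03B8, 0x20AC):
    print(f"U+{code_point:04X}", utf8_encode(code_point).hex(" "))
```

Every byte of a multi-byte sequence has its highest bit set, which is why such sequences cannot be confused with ASCII characters or null bytes.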

Some additional terminology regarding encoding schemes (less important here):
Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big-endian or little-endian!
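Why the byte order matters can be shown with a small Python sketch, using Python's explicit "-be" and "-le" codec variants. Reading big-endian bytes as little-endian does not fail; it quietly produces a different character.

```python
euro = "\u20ac"  # the euro sign, U+20AC

# Written as big-endian UTF-16, then read back with the wrong byte order.
euro_utf16_be = euro.encode("utf-16-be")
misread = euro_utf16_be.decode("utf-16-le")

# The same holds for the 4-byte encoding: same bytes, reversed order.
euro_utf32_be = euro.encode("utf-32-be")
euro_utf32_le = euro.encode("utf-32-le")

print(euro_utf16_be.hex(" "), f"U+{ord(misread):04X}")
print(euro_utf32_be.hex(" "), euro_utf32_le.hex(" "))
```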

$Id: encoding.html,v 1.3 2002/01/13 12:20:00 verthezp Exp $
$Name: R0_90_0 $