Unicode code point | Character |
U+0041 | A |
U+00E9 | é |
U+03B8 | θ (the Greek theta) |
U+20AC | € (the euro) |
Character | Unicode code point | Byte values in file (UCS-2) |
A | U+0041 | 0x00, 0x41 |
é | U+00E9 | 0x00, 0xE9 |
θ (theta) | U+03B8 | 0x03, 0xB8 |
€ (euro) | U+20AC | 0x20, 0xAC |
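The UCS-2 byte values in the table above can be reproduced with a few lines of C. This is only a minimal sketch: the function name `ucs2_encode_be` is hypothetical, and it assumes the byte order shown in the table (high byte first).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any library: store one code point as
 * two bytes, high byte first, as in the table above.
 * Returns 2 on success, or 0 if the code point does not fit in 16 bits. */
size_t ucs2_encode_be(uint32_t code_point, unsigned char out[2])
{
    if (code_point > 0xFFFF) {
        return 0;  /* cannot be stored in two bytes */
    }
    out[0] = (unsigned char)(code_point >> 8);    /* high byte */
    out[1] = (unsigned char)(code_point & 0xFF);  /* low byte  */
    return 2;
}
```

Calling `ucs2_encode_be(0x20AC, buf)` fills `buf` with 0x20, 0xAC, the euro row of the table. Every character costs two bytes, including plain ASCII ones such as A.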
char type. This is why C also defines the wchar_t type, which can hold a 32-bit character (at least in GNU systems). To avoid both of these disadvantages, UTF-8 was introduced.
Character | Unicode code point | Byte values in file (UTF-8) |
A | U+0041 | 0x41 |
é | U+00E9 | 0xC3, 0xA9 |
θ (theta) | U+03B8 | 0xCE, 0xB8 |
€ (euro) | U+20AC | 0xE2, 0x82, 0xAC |
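The byte values in the UTF-8 table follow from a fixed bit layout: code points below U+0080 take one byte, those below U+0800 two bytes, those below U+10000 three bytes, and the rest up to U+10FFFF four bytes. The sketch below illustrates that layout. The function name `utf8_encode` is hypothetical, and rejecting the surrogate range U+D800 to U+DFFF is an assumption made here for strictness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any library: encode one code point as
 * UTF-8. Returns the number of bytes written (1 to 4), or 0 if the value
 * is not an encodable code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                     /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, `utf8_encode(0x00E9, buf)` returns 2 and writes 0xC3, 0xA9, the row for é. An ASCII character such as A is still written as the single byte 0x41.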
wchar_t type), although it is a little more difficult to get the length of the string.
$Id: encoding.html,v 1.3 2002/01/13 12:20:00 verthezp Exp $
$Name: R0_90_0 $