[Freeciv-Dev] (PR#7317) Add Unicode support to Freeciv
<URL: http://rt.freeciv.org/Ticket/Display.html?id=7317 >
For a motivation of this see issues 1824, 7226, 7228 and a few more.
The goal is to provide fairness for every player and to avoid the
existence of charset/encoding related bugs by design. Currently the
client and server code assumes that the internal data storage and
network protocol uses latin1 (aka ISO-8859-1). This is unfair for
players who use another charset. This may lead to cases where a
player can't even enter his own name. The solution for this is to
use Unicode, which supports (almost) all living languages and a lot
of invented and dead ones.
I will first start with some facts:
- unicode is a charset containing up to 2^21 chars
- unicode has various encodings
- the first 2^16 chars in unicode contain all chars relevant to
freeciv (freeciv has no need for "historic alphabets and ideographs,
mathematical and musical typesetting" or "Byzantine Musical Symbols")
- utf8 and utf16 are variable length encodings
- ucs2 and ucs4 are fixed length encodings
- the ranges of the possible encodings are:
utf8, utf16, utf32: 0-2^21
ucs2: 0-2^16
ucs4: 0-2^32
- the maximal number of bytes per char in the utf8 encoding is 4
(see the sketch after this list)
- the maximal number of bytes per char in the utf16 encoding is 4
- BOM (a special char) is used to determine the endianness of data
- the msgid for gettext is ASCII (read 7bit)
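To make the variable length point concrete, here is a rough sketch
(illustration only, the function name is made up and it isn't meant
for the tree) of how many bytes utf8 needs per char:

  #include <stdio.h>

  /* Illustration: number of bytes utf8 needs for a given unicode
   * code point.  ASCII stays one byte, nothing needs more than
   * four. */
  static int utf8_bytes_needed(unsigned int codepoint)
  {
    if (codepoint < 0x80) {
      return 1;   /* plain ASCII */
    } else if (codepoint < 0x800) {
      return 2;   /* Latin supplements, Greek, Cyrillic, ... */
    } else if (codepoint < 0x10000) {
      return 3;   /* rest of the Basic Multilingual Plane */
    } else {
      return 4;   /* everything above U+FFFF */
    }
  }

  int main(void)
  {
    printf("U+0041  -> %d byte(s)\n", utf8_bytes_needed(0x0041));
    printf("U+00E4  -> %d byte(s)\n", utf8_bytes_needed(0x00E4));
    printf("U+4E2D  -> %d byte(s)\n", utf8_bytes_needed(0x4E2D));
    printf("U+1D11E -> %d byte(s)\n", utf8_bytes_needed(0x1D11E));
    return 0;
  }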
Some other constraints:
- network encoding should be space efficient
- the msgid should either not be transformed or transformed to latin1
or utf8
Without compression utf-8 is the choice for the network protocol. With
compression (which defaults to on) it isn't as clear. Some testing
needs to be done on utf-8 vs ucs2/utf16.
The decision about the local encoding is less of a technical one.
But first let's state the facts (from
http://216.239.59.104/search?q=cache:Kw7QqNNqjaUJ:czyborra.com/utf/):
The good:
+ transparency and uniqueness for ASCII characters
The 7bit ASCII characters {U+0000...U+007F} pass through
transparently as {=00..=7F} and all non-ASCII are represented purely
with non-ASCII 8bit values {=80..=F7} so that you will not mistake
any non-ASCII character for an ASCII character, particularly the
string-delimiting ASCII =00 NULL only appears where a U+0000
NULL was intended, and you can use your ASCII-based text processing
tools on UTF-8 text as long as they pass 8bit characters without
interpretation.
+ processor-friendliness
UTF-8 can be read and written quickly with simple bitmask and
bitshift operations without any multiplication or division. And as
the lead-byte announces the length of the multibyte character you
can quickly tell how many bytes to skip for fast forward parsing.
+ reasonable compression
UTF-8 is a reasonably compact encoding: ASCII characters are not
inflated, most other alphabetic characters occupy only two bytes
each, no basic Unicode character needs more than three bytes and all
extended Unicode characters can be expressed with four bytes so that
UTF-8 is no worse than UCS-4.
+ canonical sort-order
Comparing two UTF-8 multibyte character strings with the old
strcmp(mbs1,mbs2) function gives the same result as the
wcscmp(wcs1,wcs2) function on the corresponding UCS-4 wide character
strings so that the lexicographic sorting and tree-search order is
preserved.
+ EOF and BOM avoidance
The octets =FE and =FF never appear in UTF-8 output. That means that
you can use =FE=FF (U+FEFF ZERO WIDTH NO-BREAK SPACE) as an
indicator for UTF-16 text and =FF=FE as an indicator for
byte-swapped UTF-16 text from haphazard programs on little-endian
machines. And it also means that C programs that
haphazardly store the result of getchar() in a char instead of an
int will no longer mistake U+00FF LATIN SMALL LETTER Y WITH
DIAERESIS as end of file because ÿ is now represented as =C3=BF. The
=FF octet was often mistaken as end of file because
/usr/include/stdio.h #defines EOF as -1 which looks just like =FF in
8bit 2-complement binary integer representation.
+ byte-order independence
UTF-8 has no byte-order problems as it defines an octet stream. The
octet order is the same on all systems. The UTF-8 representation
=EF=BB=BF of the byte-order mark U+FEFF ZERO WIDTH NO-BREAK SPACE is
not really needed as a UTF-8 signature.
+ detectability
You can detect that you are dealing with UTF-8 input with high
probability if you see the UTF-8 signature =EF=BB=BF or if you
see valid UTF-8 multibyte characters since it is very unlikely that
they accidentally appear in Latin1 text. You usually don't place a
Latin1 symbol after an accented capital letter or a whole row of
them after an accented small letter.
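Regarding the "processor-friendliness" point above, here is a
minimal sketch (my own, not from the quoted page) of how the lead
byte alone tells you the length of a multibyte char with a few mask
operations:

  /* Illustration: length of a utf8 sequence from its lead byte. */
  static int utf8_char_len(unsigned char lead)
  {
    if ((lead & 0x80) == 0x00) {
      return 1;   /* 0xxxxxxx: plain ASCII */
    } else if ((lead & 0xE0) == 0xC0) {
      return 2;   /* 110xxxxx */
    } else if ((lead & 0xF0) == 0xE0) {
      return 3;   /* 1110xxxx */
    } else if ((lead & 0xF8) == 0xF0) {
      return 4;   /* 11110xxx */
    } else {
      return -1;  /* continuation byte or illegal lead byte */
    }
  }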
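And for the "canonical sort-order" point: since higher code points
always encode to lexicographically larger byte sequences, a plain
byte-wise comparison already orders utf8 strings by code point. A
tiny demo (again just an illustration):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
    const char *a = "abc\xC3\xA4";      /* "abc" + U+00E4 */
    const char *b = "abc\xE4\xB8\xAD";  /* "abc" + U+4E2D */

    /* U+00E4 < U+4E2D and strcmp() agrees on the utf8 bytes. */
    printf("%s\n", strcmp(a, b) < 0 ? "a < b" : "a >= b");
    return 0;
  }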
The bad:
variable length
UTF-8 is a variable-length multibyte encoding which means that you
cannot calculate the number of characters from the mere number of
bytes and vice versa for memory allocation and that you have to
allocate oversized buffers or parse and keep counters. You cannot
load a UTF-8 character into a processor register in one operation
because you don't know how long it is going to be. Because of that
you won't want to use UTF-8 multibyte bitstrings as direct scalar
memory addresses but first convert to UCS-4 wide character
internally or single-step one byte at a time through a
single/double/triple indirect lookup table (somewhat similar to the
block addressing in Unix filesystem i-nodes). It also means that
dumb line breaking functions and backspacing functions will get
their character count wrong when processing UTF-8 text.
extra byte consumption
UTF-8 consumes two bytes for all non-Latin (Greek, Cyrillic, Arabic,
etc.) letters that have traditionally been stored in one byte and
three bytes for all symbols, syllabics and ideographs that have
traditionally only needed a double byte. This can be considered a
waste of space and bandwidth which is even tripled when the 8bit
form is MIME-encoded as quoted-printable ("=C3=A4" is 6 bytes for
the one character ä). SCSU aims to solve the compression problem.
illegal sequences
Because of the deliberate redundancy in its syntax, UTF-8 has illegal
sequences. Some applications even emit an error message and stop
working if they see illegal input: Java has its
UTFDataFormatException. Other UTF-8 implementations silently
interpret illegal input as 8bit ISO-8859-1 (or more sophisticated
defaults like your local 8bit charset or CP1252 or less
sophisticated defaults like whatever the implementation happens to
calculate) and output them through their putwchar() routines as
correct UTF-8 which may first seem as a nice feature but on second
thought it turns out as corrupting binary data (which then again you
weren't supposed to feed to your text processing tools anyway) and
throwing away valuable information. You cannot include arbitrary
binary string samples in a UTF-8 text such as this web page. Also,
by illegally spreading bit patterns of ASCII characters across
several bytes you can trick the system into having filenames
containing ungraspable nulls (=C0=80) and slashes (=C0=AF).
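To illustrate the "variable length" problem above: with utf8 even
something as basic as the character count has to walk the whole
string and skip continuation bytes. A sketch (the function name is
made up):

  #include <stddef.h>

  /* Illustration: number of chars in a utf8 string, i.e. the number
   * of bytes that are not continuation bytes (10xxxxxx). */
  static size_t utf8_strlen(const char *s)
  {
    size_t chars = 0;

    for (; *s != '\0'; s++) {
      if (((unsigned char) *s & 0xC0) != 0x80) {
        chars++;   /* lead byte or plain ASCII byte */
      }
    }
    return chars;
  }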
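And for the "illegal sequences" point: any utf8-based code would
also have to reject overlong encodings like the =C0=80 and =C0=AF
mentioned above. A sketch of such a check (illustration only):

  /* Illustration: a two-byte sequence with lead byte C0 or C1 is
   * overlong (its value fits into a single ASCII byte) and
   * therefore illegal; =C0=80 is a disguised NUL, =C0=AF a
   * disguised slash. */
  static int utf8_is_overlong_2byte(unsigned char lead,
                                    unsigned char cont)
  {
    return (lead == 0xC0 || lead == 0xC1) && (cont & 0xC0) == 0x80;
  }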
I now argue that this "variable length" property is what caused some
of the issues in the past and will also cause us problems in the
future. In the network code and in the server we want to ensure that
the received strings have a sane length. For this we use static
buffers and functions like strncpy. Here utf8 shows its problems:
how do you dimension these buffers? You would have to take the
conservative worst-case approach (multiply their size by 4). We have
also seen the problems of truncated utf8 strings.
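As an illustration of what every truncation site would have to do
with utf8 (plain strncpy can cut a multibyte char in half), here is
a sketch; the name is made up and our own string helpers would need
the same logic:

  #include <string.h>

  /* Illustration: copy at most size-1 bytes plus the NUL without
   * ending in the middle of a multibyte char (assumes size >= 1). */
  static void utf8_safe_copy(char *dest, const char *src, size_t size)
  {
    size_t len = strlen(src);

    if (len >= size) {
      len = size - 1;
      /* back up over continuation bytes (10xxxxxx) so the cut ends
       * on a char boundary */
      while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80) {
        len--;
      }
    }
    memcpy(dest, src, len);
    dest[len] = '\0';
  }

With a fixed-length encoding like ucs2 none of this bookkeeping is
needed; truncation and buffer sizing stay trivial.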
Because of this I'm for ucs2 as the internal encoding.
Raimar
--
email: rf13@xxxxxxxxxxxxxxxxx
"Real Users find the one combination of bizarre
input values that shuts down the system for days."