[Freeciv-Dev] (PR#7317) Add Unicode support to Freeciv
<URL: http://rt.freeciv.org/Ticket/Display.html?id=7317 >

For a motivation of this see issues 1824, 7226, 7228 and a few more. The goal is to provide fairness for every player and to avoid the existence of charset/encoding related bugs by design.

Currently the client and server code assumes that the internal data storage and the network protocol use latin1 (aka ISO-8859-1). This is unfair for players who use another charset and may lead to cases where a player can't even enter his own name. The solution for this is to use Unicode, which supports (almost) all spoken languages and a lot of constructed and dead ones.

I will first start with some facts:
 - unicode is a charset containing up to 2^21 chars
 - unicode has various encodings
 - the first 2^16 chars of unicode contain all chars relevant for freeciv
   (freeciv has no need for "historic alphabets and ideographs, mathematical
   and musical typesetting" or "Byzantine Musical Symbols")
 - utf8 and utf16 are variable-length encodings
 - ucs2 and ucs4 are fixed-length encodings
 - the ranges of the possible encodings are:
     utf8, utf16, utf32: 0-2^21
     ucs2: 0-2^16
     ucs4: 0-2^32
 - the maximal number of bytes of the utf8 encoding is 4 (see the encoder sketch below)
 - the maximal number of bytes of the utf16 encoding is 4
 - a BOM (a special char) is used to determine the endianness of the data
 - the msgid for gettext is ASCII (read: 7bit)

Some other constraints:
 - the network encoding should be space-efficient
 - the msgid should either not be transformed at all or be transformed to latin1 or utf8

Without compression utf-8 is the choice for the network protocol. With compression (which defaults to on) this isn't as clear. Some testing needs to be done on utf-8 vs ucs2/utf16.
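To make the variable-length facts above concrete, here is a minimal sketch of a UTF-8 encoder for a single code point, written with nothing but bitmasks and shifts. It is purely illustrative (the name utf8_encode is made up; this is not Freeciv code): ASCII stays one byte, everything in the 0-2^16 range that freeciv needs fits in at most three bytes, and nothing ever needs more than four.

  #include <stddef.h>

  /* Encode one Unicode code point (0 .. 0x10FFFF) as UTF-8 into buf.
   * Returns the number of bytes written (1-4), or 0 if the code point
   * is out of range.  Surrogates are not rejected in this sketch. */
  static size_t utf8_encode(unsigned long cp, unsigned char buf[4])
  {
    if (cp <= 0x7F) {                   /* 1 byte: plain ASCII, unchanged   */
      buf[0] = (unsigned char) cp;
      return 1;
    } else if (cp <= 0x7FF) {           /* 2 bytes: 110xxxxx 10xxxxxx       */
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    } else if (cp <= 0xFFFF) {          /* 3 bytes: covers all of 0-2^16    */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    } else if (cp <= 0x10FFFF) {        /* 4 bytes: the maximum             */
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
    return 0;                           /* outside the Unicode range        */
  }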
The decision about the local encoding is more non-technical. But first let's state the facts (from http://216.239.59.104/search?q=cache:Kw7QqNNqjaUJ:czyborra.com/utf/):

The good:

+ transparency and uniqueness for ASCII characters
  The 7bit ASCII characters {U+0000..U+007F} pass through transparently as {=00..=7F} and all non-ASCII characters are represented purely with non-ASCII 8bit values {=80..=F7}, so that you will not mistake any non-ASCII character for an ASCII character. In particular the string-delimiting ASCII =00 NULL only appears where a U+0000 NULL was intended, and you can use your ASCII-based text processing tools on UTF-8 text as long as they pass 8bit characters through without interpretation.

+ processor-friendliness
  UTF-8 can be read and written quickly with simple bitmask and bitshift operations without any multiplication or division. And as the lead byte announces the length of the multibyte character you can quickly tell how many bytes to skip for fast forward parsing.

+ reasonable compression
  UTF-8 is a reasonably compact encoding: ASCII characters are not inflated, most other alphabetic characters occupy only two bytes each, no basic Unicode character needs more than three bytes and all extended Unicode characters can be expressed with four bytes, so that UTF-8 is no worse than UCS-4.

+ canonical sort-order
  Comparing two UTF-8 multibyte character strings with the old strcmp(mbs1,mbs2) function gives the same result as the wcscmp(wcs1,wcs2) function on the corresponding UCS-4 wide character strings, so that the lexicographic sorting and tree-search order is preserved.

+ EOF and BOM avoidance
  The octets =FE and =FF never appear in UTF-8 output. That means that you can use =FE=FF (U+FEFF ZERO WIDTH NO-BREAK SPACE) as an indicator for UTF-16 text and =FF=FE as an indicator for byte-swapped UTF-16 text from haphazard programs on little-endian machines. It also means that C programs that haphazardly store the result of getchar() in a char instead of an int will no longer mistake U+00FF LATIN SMALL LETTER Y WITH DIAERESIS for end of file, because ÿ is now represented as =C3=BF. The =FF octet was often mistaken for end of file because /usr/include/stdio.h #defines EOF as -1, which looks just like =FF in 8bit 2's-complement binary integer representation.

+ byte-order independence
  UTF-8 has no byte-order problems as it defines an octet stream. The octet order is the same on all systems. The UTF-8 representation =EF=BB=BF of the byte-order mark U+FEFF ZERO WIDTH NO-BREAK SPACE is not really needed as a UTF-8 signature.

+ detectability
  You can detect that you are dealing with UTF-8 input with high probability if you see the UTF-8 signature =EF=BB=BF or if you see valid UTF-8 multibyte characters, since it is very unlikely that they accidentally appear in Latin1 text. You usually don't place a Latin1 symbol after an accented capital letter, or a whole row of them after an accented small letter.

The bad:

- variable length
  UTF-8 is a variable-length multibyte encoding, which means that you cannot calculate the number of characters from the mere number of bytes and vice versa for memory allocation, and that you have to allocate oversized buffers or parse and keep counters. You cannot load a UTF-8 character into a processor register in one operation because you don't know how long it is going to be. Because of that you won't want to use UTF-8 multibyte bitstrings as direct scalar memory addresses but first convert to UCS-4 wide characters internally, or single-step one byte at a time through a single/double/triple indirect lookup table (somewhat similar to the block addressing in Unix filesystem i-nodes). It also means that dumb line-breaking functions and backspacing functions will get their character count wrong when processing UTF-8 text.

- extra byte consumption
  UTF-8 consumes two bytes for all non-Latin (Greek, Cyrillic, Arabic, etc.) letters that have traditionally been stored in one byte, and three bytes for all symbols, syllabics and ideographs that have traditionally only needed a double byte. This can be considered a waste of space and bandwidth, which is even tripled when the 8bit form is MIME-encoded as quoted-printable ("=C3=A4" is 6 bytes for the one character ä). SCSU aims to solve the compression problem.

- illegal sequences
  Because of the wanted redundancy in its syntax, UTF-8 knows illegal sequences. Some applications even emit an error message and stop working if they see illegal input: Java has its UTFDataFormatException. Other UTF-8 implementations silently interpret illegal input as 8bit ISO-8859-1 (or more sophisticated defaults like your local 8bit charset or CP1252, or less sophisticated defaults like whatever the implementation happens to calculate) and output it through their putwchar() routines as correct UTF-8, which may at first seem like a nice feature but on second thought turns out to corrupt binary data (which then again you weren't supposed to feed to your text processing tools anyway) and throw away valuable information. You cannot include arbitrary binary string samples in a UTF-8 text such as this web page. Also, by illegally spreading bit patterns of ASCII characters across several bytes you can trick the system into having filenames containing ungraspable nulls (=C0=80) and slashes (=C0=AF).
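The "processor-friendliness" and "variable length" points above are easy to see in code. The following is only an illustrative sketch (the names are made up; this is not Freeciv code): the length of a multibyte character can be read straight off the lead byte with a couple of bitmasks, but the character count of a string can no longer be taken from strlen(), which counts bytes.

  #include <stddef.h>

  /* Length in bytes of the UTF-8 sequence starting with lead byte b,
   * found with bitmasks only; 0 means b is a continuation byte or illegal. */
  static size_t utf8_char_len(unsigned char b)
  {
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: plain ASCII  */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx               */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx               */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx               */
    return 0;                           /* 10xxxxxx or invalid    */
  }

  /* Number of characters (not bytes) in a NUL-terminated UTF-8 string:
   * every byte that is not a continuation byte starts a character. */
  static size_t utf8_strlen(const char *s)
  {
    size_t chars = 0;

    for (; *s != '\0'; s++) {
      if (((unsigned char) *s & 0xC0) != 0x80) {
        chars++;
      }
    }
    return chars;
  }

With latin1 or ucs2 both of these reduce to trivial arithmetic, which is exactly the simplicity argued for below.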
I now argue that "variable length" is the property which caused some of the issues in the past and will also cause us problems in the future. In the network code and in the server we want to ensure that the received strings have a sane length. For this we use static buffers and functions like strncpy. Here utf8 shows its problems: how do you dimension these buffers? You would have to take the conservative worst-case approach and multiply their size by 4. We have also seen the problems of truncated utf8 strings.

Because of this I'm for ucs2 as the internal encoding.
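To illustrate the truncation problem: a plain strncpy() into a fixed-size buffer may cut a UTF-8 multibyte character in half and leave an illegal sequence at the end of the string. A byte-bounded copy has to back up to a character boundary first, roughly as in this sketch (the helper name is made up; this is not Freeciv code). With a fixed-length encoding such as ucs2 this bookkeeping disappears, because every character is exactly two bytes.

  #include <string.h>

  /* Copy src into dest (n bytes large), always NUL-terminating and never
   * cutting a UTF-8 multibyte character in half: if the byte limit falls
   * inside a multibyte sequence, back up to the start of that sequence.
   * Returns the number of bytes copied, excluding the terminating NUL. */
  static size_t utf8_safe_strlcpy(char *dest, const char *src, size_t n)
  {
    size_t len = strlen(src);

    if (n == 0) {
      return 0;
    }
    if (len >= n) {
      len = n - 1;                      /* naive byte truncation ...        */
      while (len > 0
             && ((unsigned char) src[len] & 0xC0) == 0x80) {
        len--;                          /* ... backed off continuation bytes
                                         * onto a character boundary        */
      }
    }
    memcpy(dest, src, len);
    dest[len] = '\0';
    return len;
  }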
	Raimar

-- 
 email: rf13@xxxxxxxxxxxxxxxxx
 "Real Users find the one combination of bizarre input values that shuts down the system for days."
 