[Freeciv-Dev] (PR#7317) Add Unicode support to Freeciv
To: undisclosed-recipients: ;
Subject: [Freeciv-Dev] (PR#7317) Add Unicode support to Freeciv
From: "Raimar Falke" <i-freeciv-lists@xxxxxxxxxxxxx>
Date: Sun, 25 Jan 2004 11:21:32 -0800
Reply-to: rt@xxxxxxxxxxx

<URL: http://rt.freeciv.org/Ticket/Display.html?id=7317 >


For motivation, see issues 1824, 7226, 7228 and a few more.

The goal is to provide fairness for every player and to avoid
charset/encoding-related bugs by design. Currently the client and
server code assumes that the internal data storage and the network
protocol use latin1 (aka ISO-8859-1). This is unfair for players who
use another charset and may lead to cases where a player can't enter
his own name. The solution is to use Unicode, which supports (almost)
all living languages and a lot of dead and constructed ones.

I will first start with some facts:
 - unicode is a charset containing up to 2^21 chars
 - unicode has various encodings
 - the first 2^16 chars in unicode contain all chars relevant for
 freeciv (freeciv has no need for "historic alphabets and ideographs,
 mathematical and musical typesetting" or "Byzantine Musical Symbols")
 - utf8 and utf16 are variable-length encodings
 - ucs2 and ucs4 are fixed-length encodings
 - the ranges the encodings can represent are:
   utf8, utf16, utf32: 0-2^21
   ucs2: 0-2^16
   ucs4: 0-2^32
 - the maximal number of bytes of the utf8 encoding is 4 (see the
 encoding sketch after this list)
 - the maximal number of bytes of the utf16 encoding is 4
 - a BOM (a special char) is used to detect the endianness of data
 - the msgid for gettext is ASCII (read 7bit)
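
To make the utf8 byte counts above concrete, here is a minimal sketch
(not actual freeciv code) of encoding a single code point as utf8;
nothing up to 2^21 needs more than the 4 bytes listed above:

  #include <stddef.h>

  /* Encode one unicode code point as utf8 into buf (at least 4 bytes
   * large). Returns the number of bytes written, or 0 for a code
   * point outside the encodable range. A minimal sketch only. */
  static size_t utf8_encode(unsigned long cp, unsigned char *buf)
  {
    if (cp < 0x80) {                /* 1 byte: 0xxxxxxx */
      buf[0] = (unsigned char) cp;
      return 1;
    } else if (cp < 0x800) {        /* 2 bytes: 110xxxxx 10xxxxxx */
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    } else if (cp < 0x10000) {      /* 3 bytes: enough for the BMP */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    } else if (cp < 0x200000) {     /* 4 bytes: up to 2^21 */
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
    return 0;
  }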

Some other constraints:
 - network encoding should be space efficient
 - the msgid should either not be transformed or transformed to latin1
 or utf8

Without compression utf-8 is the choice for the network protocol. With
compression (which defaults to on) it isn't as clear. Some testing
needs to be done on utf-8 vs ucs2/utf16; a sketch of such a size
comparison follows.
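
One way to do that testing, sketched under the assumption that we
compare the raw encoded sizes first (the helpers here are
hypothetical, not existing freeciv functions):

  #include <stddef.h>

  /* Uncompressed size of a sequence of BMP code points in utf8:
   * 1 byte for ASCII, 2 for most alphabets, 3 for the rest. */
  static size_t utf8_size(const unsigned long *cps, size_t n)
  {
    size_t bytes = 0, i;

    for (i = 0; i < n; i++) {
      if (cps[i] < 0x80) {
        bytes += 1;
      } else if (cps[i] < 0x800) {
        bytes += 2;
      } else {
        bytes += 3;
      }
    }
    return bytes;
  }

  /* Uncompressed size in ucs2: always two bytes per BMP char. */
  static size_t ucs2_size(const unsigned long *cps, size_t n)
  {
    (void) cps;
    return 2 * n;
  }

Feeding the output of both encodings through the compression code and
comparing the results would settle the question for typical game data.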

The decision about the local encoding is more non-technical. But first
let's state the facts (from
http://216.239.59.104/search?q=cache:Kw7QqNNqjaUJ:czyborra.com/utf/):

The good:
+ transparency and uniqueness for ASCII characters 
  The 7bit ASCII characters {U+0000...U+007F} pass through
  transparently as {=00..=7F} and all non-ASCII characters are
  represented purely with non-ASCII 8bit values {=80..=F7}, so that
  you will not mistake any non-ASCII character for an ASCII character.
  In particular, the string-delimiting ASCII =00 NULL only appears
  where a U+0000 NULL was intended, and you can use your ASCII-based
  text processing tools on UTF-8 text as long as they pass 8bit
  characters through without interpretation.

+ processor-friendliness 
  UTF-8 can be read and written quickly with simple bitmask and
  bitshift operations without any multiplication or division. And as
  the lead-byte announces the length of the multibyte character you
  can quickly tell how many bytes to skip for fast forward parsing.
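
For illustration, "the lead-byte announces the length" boils down to
a few comparisons; a minimal sketch:

  /* Number of bytes in the utf8 character starting with lead byte b,
   * or 0 if b is a continuation byte or not valid as a lead byte. */
  static int utf8_char_len(unsigned char b)
  {
    if (b < 0x80) return 1;   /* 0xxxxxxx: plain ASCII */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation, not a start */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    return 0;                 /* =F8..=FF never appear in utf8 */
  }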

+ reasonable compression 
  UTF-8 is a reasonably compact encoding: ASCII characters are not
  inflated, most other alphabetic characters occupy only two bytes
  each, no basic Unicode character needs more than three bytes and all
  extended Unicode characters can be expressed with four bytes so that
  UTF-8 is no worse than UCS-4.

+ canonical sort-order 
  Comparing two UTF-8 multibyte character strings with the old
  strcmp(mbs1,mbs2) function gives the same result as the
  wcscmp(wcs1,wcs2) function on the corresponding UCS-4 wide character
  strings so that the lexicographic sorting and tree-search order is
  preserved.

+ EOF and BOM avoidance 
  The octets =FE and =FF never appear in UTF-8 output. That means that
  you can use =FE=FF (U+FEFF ZERO WIDTH NO-BREAK SPACE) as an
  indicator for UTF-16 text and =FF=FE as an indicator for
  byte-swapped UTF-16 text from haphazard programs on little-endian
  machines. It also means that C programs that haphazardly store the
  result of getchar() in a char instead of an int will no longer
  mistake U+00FF LATIN SMALL LETTER Y WITH DIAERESIS for end of file,
  because ÿ is now represented as =C3=BF. The =FF octet was often
  mistaken as end of file because /usr/include/stdio.h #defines EOF
  as -1, which looks just like =FF in 8bit two's-complement binary
  integer representation.

+ byte-order independence 
  UTF-8 has no byte-order problems as it defines an octet stream. The
  octet order is the same on all systems. The UTF-8 representation
  =EF=BB=BF of the byte-order mark U+FEFF ZERO WIDTH NO-BREAK SPACE is
  not really needed as a UTF-8 signature.

+ detectability 
  You can detect that you are dealing with UTF-8 input with high
  probability if you see the UTF-8 signature =EF=BB=BF or if you see
  valid UTF-8 multibyte characters, since it is very unlikely that
  they accidentally appear in Latin1 text. You usually don't place a
  Latin1 symbol after an accented capital letter or a whole row of
  them after an accented small letter.

The bad:

variable length 
  UTF-8 is a variable-length multibyte encoding which means that you
  cannot calculate the number of characters from the mere number of
  bytes and vice versa for memory allocation and that you have to
  allocate oversized buffers or parse and keep counters. You cannot
  load a UTF-8 character into a processor register in one operation
  because you don't know how long it is going to be. Because of that
  you won't want to use UTF-8 multibyte bitstrings as direct scalar
  memory addresses but first convert to UCS-4 wide character
  internally or single-step one byte at a time through a
  single/double/triple indirect lookup table (somewhat similar to the
  block addressing in Unix filesystem i-nodes). It also means that
  dumb line breaking functions and backspacing functions will get
  their character count wrong when processing UTF-8 text.
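
To illustrate the parse-and-count cost: getting the character count
of a utf8 string means skipping continuation bytes, as in this
minimal sketch:

  #include <stddef.h>

  /* Count characters (not bytes) in a NUL-terminated utf8 string by
   * counting every byte that is not a continuation byte (10xxxxxx).
   * Assumes the input is valid utf8. */
  static size_t utf8_strlen(const char *s)
  {
    size_t chars = 0;

    for (; *s != '\0'; s++) {
      if (((unsigned char) *s & 0xC0) != 0x80) {
        chars++;
      }
    }
    return chars;
  }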

extra byte consumption 
  UTF-8 consumes two bytes for all non-Latin (Greek, Cyrillic, Arabic,
  etc.) letters that have traditionally been stored in one byte and
  three bytes for all symbols, syllabics and ideographs that have
  traditionally only needed a double byte. This can be considered a
  waste of space and bandwidth which is even tripled when the 8bit
  form is MIME-encoded as quoted-printable ("=C3=A4" is 6 bytes for
  the one character ä). SCSU aims to solve the compression problem.

illegal sequences 
  Because of the deliberate redundancy in its syntax, UTF-8 knows
  illegal sequences. Some applications even emit an error message and
  stop working if they see illegal input: Java has its
  UTFDataFormatException. Other UTF-8 implementations silently
  interpret illegal input as 8bit ISO-8859-1 (or more sophisticated
  defaults like your local 8bit charset or CP1252, or less
  sophisticated defaults like whatever the implementation happens to
  calculate) and output it through their putwchar() routines as
  correct UTF-8, which may at first seem like a nice feature but on
  second thought turns out to corrupt binary data (which then again
  you weren't supposed to feed to your text processing tools anyway)
  and to throw away valuable information. You cannot include arbitrary
  binary string samples in a UTF-8 text such as this web page. Also,
  by illegally spreading the bit patterns of ASCII characters across
  several bytes you can trick the system into having filenames
  containing ungraspable nulls (=C0=80) and slashes (=C0=AF).
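
To catch the overlong forms mentioned above, a validator has to check
that a sequence actually needed as many bytes as it uses; a minimal
sketch for the two-byte case (so =C0=80 and =C0=AF are rejected
instead of decoded):

  /* Returns 1 iff b0,b1 form a legal two-byte utf8 sequence. Two
   * bytes must encode a code point >= 0x80; anything smaller is an
   * overlong encoding and therefore illegal. */
  static int utf8_pair_is_legal(unsigned char b0, unsigned char b1)
  {
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80) {
      return 0;   /* not a two-byte lead + continuation at all */
    }
    return ((((b0 & 0x1F) << 6) | (b1 & 0x3F)) >= 0x80);
  }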

I now argue that the "variable length" property is what caused some
of the issues in the past and will also cause us problems in the
future. In the network code and in the server we want to ensure that
the received strings have a sane length. For this we use static
buffers and functions like strncpy(). Here utf8 shows its problems:
how do you dimension these buffers? You would have to take the
conservative worst-case approach (multiply their size by 4). We have
also seen the problems of truncated utf8 strings; a sketch of the
extra care truncation needs is below.
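
For reference, here is a minimal sketch of that extra care (the
function name is made up; it is not existing freeciv code). Plain
strncpy() may cut a multibyte character in half, so a utf8-aware copy
has to back up to a character boundary:

  #include <string.h>

  /* Copy at most bufsize-1 bytes of utf8 string src into dest and
   * NUL-terminate, backing up so that no multibyte character is cut
   * in half. A minimal sketch. */
  static void utf8_strlcpy(char *dest, const char *src, size_t bufsize)
  {
    size_t len;

    if (bufsize == 0) {
      return;
    }
    len = strlen(src);
    if (len > bufsize - 1) {
      len = bufsize - 1;
      /* src[len] is the first byte cut off; while it is a
       * continuation byte we are inside a character, so back up. */
      while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80) {
        len--;
      }
    }
    memcpy(dest, src, len);
    dest[len] = '\0';
  }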

Because of this I'm for ucs2 as the internal encoding.

        Raimar

-- 
 email: rf13@xxxxxxxxxxxxxxxxx
  "Real Users find the one combination of bizarre
   input values that shuts down the system for days."



