[Freeciv-Dev] (PR#1824) charset discussion

To: Kenn.Munro@xxxxxxxxxxxxxx, jdorje@xxxxxxxxxxxxxxxxxxxxx, jdwheeler42@xxxxxxxxx, jrg45@xxxxxxxxxxxxxxxxx, pawel@xxxxxxxxxxxxxxx
Cc: mrproper@xxxxxxxxxx, jlangley@xxxxxxx
Subject: [Freeciv-Dev] (PR#1824) charset discussion
From: "Per I. Mathisen" <per@xxxxxxxxxxx>
Date: Thu, 19 Aug 2004 09:18:50 -0700
Reply-to: rt@xxxxxxxxxxx

<URL: http://rt.freeciv.org/Ticket/Display.html?id=1824 >

Regarding the big charset discussion, I have spent some hours reading up
on the new standards, searching Google, and looking at the issue, and I
have a few comments.

First, I agree with Vasco that UTF-8 internally is the best long-term
solution. UCS-2 is a dead end that is rapidly being obsoleted, as 2 bytes
are no longer enough to support all charsets. In particular, exotic
characters used for names had trouble fitting into UCS-2, and we have lots
of names.

We could go with UTF-32 (or UCS-4) internally and UTF-8 in data files, but
I do not like this much. It seems wasteful, and it is the least supported
format externally.

The rest of the world seems to go either for UTF-16 (Java) or UTF-8 (W3C,
Unix, Linux, GNU). Microsoft implemented UCS-2 very early, before the
Unicode consortium decided 2 bytes were not enough, but their .NET stuff
seems to support a variety of encodings (AFAICT UTF-32 is not supported).

AFAICT, Vasco (and Wikipedia) is wrong, however, in thinking that UTF-8 is
a maximum of 4 bytes long. As defined in RFC 2279, it is a maximum of 6
bytes long. The reason for this is that UTF-8 "wastes" a number of bits in
each byte on markers, so that in order to represent as many characters as
UCS-4, it needs more bytes per character. See
http://www.ietf.org/rfc/rfc2279.txt. So if we are to pad out every buffer
to hold the maximum size of a UTF-8 string, we would have to make it 6x
in size. This is just a really bad idea.
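
To make the 6x figure concrete, here is a minimal sketch of the byte
count per character under the RFC 2279 encoding table. The function name
utf8_len_rfc2279 is made up for illustration and is not existing Freeciv
code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode the value cp
 * under the RFC 2279 scheme (1 to 6 bytes, covering 31 bits).
 * Returns 0 if cp does not fit in 31 bits. */
size_t utf8_len_rfc2279(uint32_t cp)
{
  if (cp < 0x80) return 1;         /* 7 bits:  0xxxxxxx */
  if (cp < 0x800) return 2;        /* 11 bits: 110xxxxx + 1 continuation */
  if (cp < 0x10000) return 3;      /* 16 bits: 1110xxxx + 2 continuations */
  if (cp < 0x200000) return 4;     /* 21 bits: 11110xxx + 3 continuations */
  if (cp < 0x4000000) return 5;    /* 26 bits: 111110xx + 4 continuations */
  if (cp < 0x80000000u) return 6;  /* 31 bits: 1111110x + 5 continuations */
  return 0;
}
```

A buffer meant to hold N characters in the worst case would then need
6 * N bytes plus the terminator, which is the padding I am arguing
against.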

The simpler solution is, however, to just note that we have a fixed-length
buffer and a variable number of characters. If you use characters that
need more bytes, the maximum length of your string is shorter. It is the
responsibility of the input UI method to enforce this restriction. If, by
mistake, it is not enforced, the string will be truncated by the copy or
send methods (as usual, since we should always check the length of any
string we copy). This way of doing it should require minimal code changes.
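
A minimal sketch of what such a length-checked copy could look like,
assuming the input is valid UTF-8. The names utf8_copy_trunc and
utf8_char_count are hypothetical and not taken from the Freeciv source.
The only addition over a plain bounded copy is that the cut backs up to a
character boundary, so a multi-byte character is never split.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dest, which holds n bytes including
 * the terminating NUL. If src does not fit, truncate at a UTF-8 character
 * boundary. Returns the number of bytes copied, excluding the NUL. */
size_t utf8_copy_trunc(char *dest, size_t n, const char *src)
{
  size_t len;

  if (n == 0) {
    return 0;
  }
  len = strlen(src);
  if (len >= n) {
    len = n - 1;
    /* src[len] is the first byte that does not fit. If it is a
     * continuation byte (10xxxxxx), the cut falls inside a character,
     * so back up to that character's lead byte. */
    while (len > 0 && ((unsigned char) src[len] & 0xC0) == 0x80) {
      len--;
    }
  }
  memcpy(dest, src, len);
  dest[len] = '\0';
  return len;
}

/* Hypothetical helper: count characters by counting every byte that is
 * not a continuation byte. An input UI could use this together with
 * strlen() to see both the character and the byte length. */
size_t utf8_char_count(const char *s)
{
  size_t count = 0;

  for (; *s != '\0'; s++) {
    if (((unsigned char) *s & 0xC0) != 0x80) {
      count++;
    }
  }
  return count;
}
```

With a 3-byte buffer, copying "a" followed by a 2-byte character keeps
only the "a", since the second character would not fit whole.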

If this method makes a buffer too short, we can increase it on a
case-by-case basis. I doubt this will be a problem, though, since, AFAIK,
when UCS characters with very high numbers are used, each one usually
stands for a lot of content, and then the string will be short anyway.

I am still undecided about Jason's patch. I will look at it later.

  - Per



