[Freeciv-Dev] Re: (PR#1824) ruleset data is in incompatible charsets

To: freeciv-dev@xxxxxxxxxxx
Subject: [Freeciv-Dev] Re: (PR#1824) ruleset data is in incompatible charsets
From: Reinier Post <rp@xxxxxxxxxx>
Date: Mon, 6 Jan 2003 18:15:22 +0100

On Fri, Jan 03, 2003 at 05:02:09AM -0800, Jason Short via RT wrote:
> 
> Freeciv's ruleset data is mostly in ASCII.  But some is in latin1.  Some
> may be in other charsets (latin2).
> 
> These charsets are generally not compatible with anything except
> themselves.  And without manual conversion of the strings into the local
> charset, a user that is using a different charset is likely to see
> garbage when they are displayed.
> 
> A simple solution would be to say all data must be in ASCII.  Then any
> ascii-compatible charset will work.  But this is inferior.
>
> Yet, if we have latin1 and latin2 characters for some nations, why
> shouldn't we have chinese or japanese characters for those nations? 
> Where, if anywhere, do we draw the line?  And if we do allow this, how
> much work are we willing to expend to get it to work?

I know little about this, but my gut feeling is: the goal should be
to handle everything in terms of GNU gettext locales and GNU iconv
character set encodings, and to provide support for any locale that this
software will support, with the constraint that any incarnation of a
Freeciv client or server supports only a single locale (and charset)
at the same time (plus the default C locale for unlocalized parts).
This means you can see stuff from the C locale and from the German or
Japanese locale in the same client incarnation, but you can never see
both German and Japanese in the same client incarnation.

BTW, while a full locale consists of

  + a character set encoding
  + a message catalog
  + a notation for numbers
  + a notation for time
  + a notation for money
  + a collating sequence for string sorting

(and perhaps even more?) Freeciv aims to support only the first two,
as far as I know.

> The ruleset currently has the concept of two "types" of strings:
> translatable and non-translatable.  The name strings are, rightly,
> marked as non-translatable.  But there are two types of non-translatable
> strings: name strings (such as leader, nation, and city names) and data
> strings (such as names of other ruleset files).

I don't think there should be.  There should be only two types:
"translatable" strings (marked for localization) and other strings.
How to mark them is defined for the Freeciv source code; for documentation,
localization is supposed to be done on a per-file basis.  This breaks down
for the website, which has many pages that mix code and translatable text.
This is being worked on.

> To handle this
> correctly the ruleset needs a way to mark all three types - or the
> program needs to know which is which.  The translatable strings should
> be in ascii, and are translated into the local charset by gettext.  The
> name strings are in their own locale (specified in the ruleset), and
> should be converted into the local charset (by iconv) when loaded.  The
> data strings are in ascii, and shouldn't be touched.
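
A minimal sketch of the load-time handling proposed in the quoted
paragraph, written in Python rather than with iconv itself.  The function
name, the "kind" labels and the sample strings are assumptions made for
illustration; nothing here is taken from the Freeciv code.

```python
def load_string(raw: bytes, kind: str, ruleset_charset: str,
                local_charset: str) -> bytes:
    """Return the bytes to keep in memory for one ruleset string.

    kind is one of "translatable", "name" or "data" (hypothetical labels).
    """
    if kind == "name":
        # Name strings: decode from the charset the ruleset declares,
        # then re-encode for the local charset.  Characters the local
        # charset cannot represent become "?" instead of raising.
        text = raw.decode(ruleset_charset)
        return text.encode(local_charset, errors="replace")
    # Translatable and data strings are expected to be plain ASCII.
    # Decoding validates that; the bytes themselves are left untouched.
    raw.decode("ascii")
    return raw


# A latin1 city name loaded into a UTF-8 environment.
converted = load_string(b"M\xfcnchen", "name", "latin-1", "utf-8")
# A data string (hypothetical file name) passes through unchanged.
untouched = load_string(b"example.ruleset", "data", "latin-1", "utf-8")
```

Translation of the "translatable" kind is left out here, since that would
be the job of gettext at display time and not of the loader.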

I think letting names appear in a "native" locale will complicate matters
and will also confuse users.  Most English speakers won't like to see
Japanese city names in Japanese script.  It's much better to let, in the
Japansese client for example, all translatable strings be either localized
in the same (Japanese) locale, or if no translation is available, leave
them untranslated (i.e. in English).  That way you only have two types
of strings.  I don't know how the Japanese client handles untranslated
strings at the moment.

Unmarked text should not be called "ascii" text.  

> This is all simple if the server and client have the same charset (which
> in most cases they do).  But if that's not the case then we have a
> problem.

I don't see why the charset makes a difference.  What should be sent
is the unlocalized text including localization markings.  That text can
always be 7-bit clean ASCII.
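
To make the idea concrete, here is a rough Python sketch of sending
unlocalized text with a localization marking and translating it on the
receiving end.  The "_:" marker, the function names and the dictionary
used as a message catalog are all assumptions for illustration; this is
not the Freeciv protocol, and a real client would look strings up through
gettext.

```python
MARK = "_:"  # hypothetical wire marker meaning "translate this string"


def encode_for_wire(text: str, translatable: bool) -> bytes:
    """Server side: send English text, marked if it should be localized."""
    payload = MARK + text if translatable else text
    # encode("ascii") raises UnicodeEncodeError unless the text is 7-bit clean.
    return payload.encode("ascii")


def decode_on_client(data: bytes, catalog: dict) -> str:
    """Client side: localize marked strings, leave all others alone."""
    payload = data.decode("ascii")
    if payload.startswith(MARK):
        msgid = payload[len(MARK):]
        # Fall back to the untranslated English text if there is no entry.
        return catalog.get(msgid, msgid)
    return payload


# Stand-in for a German message catalog.
german = {"Game over": "Spiel beendet"}

translated = decode_on_client(encode_for_wire("Game over", True), german)
fallback = decode_on_client(encode_for_wire("Turn done", True), german)
unmarked = decode_on_client(encode_for_wire("example.ruleset", False), german)
```

In this sketch the charset of either end never matters on the wire: only
the client's catalog decides what is shown.  A real marker would also need
escaping so that unmarked text starting with "_:" is not mistaken for a
marked string.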

> The best/only way to solve this is to say all network
> communications must be in UTF (probably UTF-8); each end of the
> connection may convert the UTF into their local encoding.  Again we have
> different types of strings: name strings (which should be sent in UTF-8,
> then converted) and data strings (which are sent in ASCII and left
> untouched).

There is no reason to do this: you can always send English-language
strings plus any localization markings required.

> (This may be affected by patches that aim to improve the
> translation of server-side messages.)  Even with this, you may end up
> with an impossible conversion (for instance latin1 into japanese
> characters), but at least iconv should give you a valid string (as
> opposed to the current situation where latin1 strings aren't even valid
> in utf8 and are not displayed at all), and if the user uses UTF they
> should be able to deal with anything.
>
> jason
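
For reference, the two charset claims in the quoted paragraph can be
checked with a short Python sketch.  Python's built-in codecs stand in for
iconv here, and the helper names and sample strings are assumptions for
illustration only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_wire(raw: bytes, source_charset: str) -> bytes:
    """Sender: convert a string from its own charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def from_wire(data: bytes, local_charset: str) -> bytes:
    """Receiver: convert UTF-8 to the local charset.

    Characters the local charset lacks are replaced by "?", so the
    result is always a valid string in that charset.
    """
    return data.decode("utf-8").encode(local_charset, errors="replace")


latin1_name = b"Orl\xe9ans"  # "Orleans" with e-acute, encoded in latin1

# Raw latin1 bytes are not valid UTF-8, but the converted form is.
raw_ok = is_valid_utf8(latin1_name)
wire_ok = is_valid_utf8(to_wire(latin1_name, "latin-1"))

# An impossible conversion: two Japanese characters into latin1.
tokyo_utf8 = "\u6771\u4eac".encode("utf-8")
lossy = from_wire(tokyo_utf8, "latin-1")
```

The lossy result is unreadable but well-formed, which is the difference
between seeing placeholder characters and seeing nothing at all.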

-- 
Reinier

