
Locale Related Issues

This page describes locale-related problems and issues, and gives a general overview of things that can come up when configuring your system for various locales. Many (but not all) locale-related problems can be classified under one of the headings below. The severity ratings use the following criteria:

  • Critical: the program doesn't perform its main function. The fix would be very intrusive; it's better to search for a replacement.
  • High: part of the functionality that the program provides is not usable. If that functionality is required, it's better to search for a replacement.
  • Low: the program works in all typical use cases, but lacks some functionality normally provided by its equivalents.

The needed encoding is not a valid option in the program

Severity: critical

Some programs require the user to specify the character encoding for their input or output data, and present only a limited choice of encodings. E.g., this is the case for the "-X" option for A2PS and Enscript, the "-input-charset" option for unpatched Cdrtools, and the display character set setting in the menu of Links. If the required encoding is not in the list, the program usually becomes completely unusable. For non-interactive programs, it may be possible to work around this by converting the document to a supported input character set before submitting it to the program.

A solution to this type of problem is to implement the necessary support for the missing encoding as a patch to the original program (as done for Cdrtools in this book), or to find a replacement. If neither is possible, switch to a locale with a supported encoding.
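
For non-interactive programs, the conversion workaround is easy to automate. Below is a minimal C sketch using the iconv(3) interface, re-encoding a UTF-8 document as ISO-8859-1 on standard input and output before it is handed to a program such as a2ps. The encodings, the buffer size and the "//TRANSLIT" suffix (a Glibc extension that approximates unmappable characters instead of failing) are illustrative assumptions, not requirements.

  /* Sketch: convert UTF-8 on stdin to ISO-8859-1 on stdout via iconv(3). */
  #define _POSIX_C_SOURCE 200809L   /* for getline() */
  #include <iconv.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      iconv_t cd = iconv_open("ISO-8859-1//TRANSLIT", "UTF-8");
      if (cd == (iconv_t)-1) {
          perror("iconv_open");
          return EXIT_FAILURE;
      }

      char *line = NULL;
      size_t cap = 0;
      ssize_t len;
      while ((len = getline(&line, &cap, stdin)) != -1) {
          char outbuf[16384];              /* assumes converted lines fit */
          char *in = line, *out = outbuf;
          size_t inleft = (size_t)len, outleft = sizeof outbuf;
          if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
              perror("iconv");             /* e.g. invalid input sequence */
              return EXIT_FAILURE;
          }
          fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);
      }

      free(line);
      iconv_close(cd);
      return EXIT_SUCCESS;
  }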

Among BLFS packages, this type of problem applies to: Screen (for non-UTF-8 Chinese encodings), Links, Dillo, a2ps, and Enscript.

The program assumes locale-based encoding for external documents

Severity: high for non-text documents, low for text documents

Some programs (e.g., Nano or Joe) assume that their documents are always in the encoding implied by the current locale. While this assumption may be valid for user-created documents, it is not safe for external ones. When the assumption fails, non-ASCII characters are displayed incorrectly, and the document may become unreadable.

If the external document is entirely text-based, it can be converted to the current locale encoding using the "iconv" program.

This is not possible for documents that are not text-based, and the assumption may be completely invalid for documents where the MS Windows operating system sets de-facto standards. An example of the latter problem is ID3v1 tags in MP3 files, which are covered on a separate page. For these cases, the only solution is to find a replacement program that doesn't have the issue (e.g., that will allow you to specify the assumed document encoding).

Among BLFS packages, this type of problem applies to: Nano, Joe (FIXME: retest, Wikipedia states otherwise), and all media players except Audacious.

A different side of the same problem is "my friends can't read what I mailed to them". This is common when those friends use Microsoft Windows, where there is only one character encoding for a given country, and most Windows programs don't bother to support anything else. E.g., under Linux, teTeX can process UTF-8 encoded documents if they have a "\usepackage[utf8]{inputenc}" line in their preamble. UTF-8 is the default and preferred encoding for text documents in UTF-8 locales, which makes it convenient to create such TeX documents on a UTF-8 based Linux system with, e.g., the Joe editor. However, such UTF-8 encoded TeX documents are useless for Windows users who run WinEdt or TeXnicCenter, because these editors assume the default Windows 8-bit encoding and can't be configured to assume anything else. If you have to collaborate with such users, convert your TeX documents with iconv to the Windows codepage before sending them (and don't forget to change the \usepackage???{inputenc} line accordingly).

In extreme cases, Windows encoding compatibility issues may be solved only by running Windows programs under Wine.

The program uses or creates filenames in the wrong encoding

Severity: critical

POSIX mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category. This information is well-hidden on the page which specifies the behaviour of Tar and CPIO programs, and some programs get it wrong by default (or simply don't have enough information to get it right). The result is that they create filenames which are not subsequently shown correctly by "ls", or refuse to accept filenames that "ls" shows properly. For the Glib2 library, the problem can be corrected by setting the G_FILENAME_ENCODING environment variable to the special "@locale" value. Glib2-based programs that don't respect that environment variable are buggy.
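
A correct program therefore derives the filename encoding from the locale rather than hard-coding it. The following minimal C sketch shows the POSIX way to look it up (the printed label is only for illustration):

  /* Discover the filename encoding the POSIX way: from LC_CTYPE. */
  #include <langinfo.h>
  #include <locale.h>
  #include <stdio.h>

  int main(void)
  {
      /* Pick up LC_ALL / LC_CTYPE / LANG from the environment; without
         this call the program stays in the default "C" locale. */
      setlocale(LC_CTYPE, "");

      /* nl_langinfo(CODESET) names the encoding of the current locale,
         e.g. "UTF-8" or "ISO-8859-1". Per POSIX, filenames should be
         created and interpreted in this encoding. */
      printf("Filename encoding: %s\n", nl_langinfo(CODESET));
      return 0;
  }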

Among BLFS packages, the Zip, Unzip and Nautilus CD Burner programs have this problem because they hard-code the expected filename encoding. Unzip contains a hard-coded conversion table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings popular in the USA, and uses this table when extracting archives created under DOS or MS Windows. However, not everyone lives in the USA, and even US Linux users prefer UTF-8 now, so the table is wrong for such users, causing non-ASCII characters to be mangled in the extracted filenames. Nautilus CD Burner checks the names of files dropped into its window for UTF-8 validity, which is not the right thing to do in non-UTF-8 locales. Also, Nautilus CD Burner unconditionally calls mkisofs with the "-input-charset UTF-8" parameter, which is correct only in UTF-8 locales.

The general rule for avoiding this class of problems is to avoid installing broken programs, or to use a locale that doesn't trigger them. If this is impossible, the convmv command-line tool can be used to fix the filenames created by such broken programs, or to intentionally mangle existing filenames to meet their broken expectations.

In other cases, a similar problem is caused by importing filenames from a system using a different locale with a tool that is not charset-aware (e.g., NFS mount or scp). In order to avoid mangling non-ASCII characters when transferring files to a system with a different locale, any of the following methods can be used:

  • Transfer anyway, fix the damage with convmv.
  • Mail the files as attachments. Mail clients specify the encoding of attached filenames.
  • Transfer the files using SAMBA.
  • Transfer the files via FTP, using an RFC2640-aware server (currently this means only wu-ftpd, which has a bad security history) and client (e.g., lftp).
  • On the sending side, create a tar archive with the --format=posix switch passed to tar (will be the default in a future version of tar).
  • Write the files to a removable disk formatted with a FAT or FAT32 filesystem.

The last four methods work because they cause the filenames to be automatically converted from the sender's locale encoding to UNICODE, stored or sent in that form, and then transparently converted from UNICODE to the recipient's locale encoding.

The program breaks multibyte characters, or doesn't count character cells correctly

Severity: high or critical

Many programs were written at a time when multibyte locales were not common. Such programs assume that the "char" C data type can be used to store single characters (while in fact it stores bytes), that any sequence of bytes is a valid string, and that every character occupies a single character cell. These assumptions completely break in UTF-8 locales. The visible manifestation may be that the program truncates strings prematurely (e.g., at 80 bytes instead of 80 characters). Terminal-based programs don't place the cursor correctly on the screen, don't react to the "Backspace" key by erasing one character, and leave junk characters around when updating the screen, usually turning the screen into a complete mess.

Fixing this kind of problem is a tedious task from a programmer's point of view, like all other cases of retrofitting new concepts into an old flawed design. One has to redesign all data structures to accommodate the fact that a complete character may span a variable number of "char"s (or switch to wchar_t and convert as needed), and, for every call to "strlen" and similar functions, figure out whether a number of bytes, a number of characters, or the width of the string was really meant. Sometimes it is faster to write a program with the same functionality from scratch.
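
The distinction is easy to demonstrate. The minimal C sketch below prints the three different "lengths" of the same string: bytes (what "strlen" counts), characters (after conversion to wide characters), and screen cells (what a terminal program must use for cursor placement). It assumes it is run in a UTF-8 locale, and the sample string is an illustrative choice.

  /* Bytes vs. characters vs. screen cells in a multibyte locale. */
  #define _XOPEN_SOURCE 700   /* for wcswidth() */
  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  int main(void)
  {
      setlocale(LC_ALL, "");             /* must be a UTF-8 locale here */

      const char *s = "naïve 漢字";      /* 1-cell and 2-cell characters */

      size_t bytes = strlen(s);          /* counts bytes, not characters */

      wchar_t wbuf[64];
      size_t chars = mbstowcs(wbuf, s, 64);
      if (chars == (size_t)-1) {
          perror("mbstowcs");            /* invalid multibyte sequence */
          return EXIT_FAILURE;
      }

      int cells = wcswidth(wbuf, chars); /* CJK characters take 2 cells */

      /* In a UTF-8 locale this prints: bytes=13 characters=8 cells=10 */
      printf("bytes=%zu characters=%zu cells=%d\n", bytes, chars, cells);
      return EXIT_SUCCESS;
  }

A program that uses the byte count where the cell count is needed will misplace the cursor exactly as described above.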

Among BLFS packages, this type of problem applies to: MC, Nano-1.2.5 (fixable by upgrading to Nano-1.3.99), Ed, Xine-UI, and all shells.
