Changes between Version 11 and Version 12 of NewLocaleRelatedIssues
Timestamp: 08/24/2006 09:44:00 AM
NewLocaleRelatedIssues
== The needed encoding is not a valid option in the program ==

Some programs require the user to specify the character encoding for their input or output data, and present only a limited choice of encodings. E.g., this is the case for the "-X" option for [wiki:A2PS A2PS] and [wiki:Enscript Enscript], the "-input-charset" option for unpatched [wiki:Cdrtools Cdrtools], and the display character set setting in the menu of [wiki:LinksBrowser Links]. If the required encoding is not in the list, the program usually becomes completely unusable. For non-interactive programs, it may be possible to work around this by converting the document to a supported input character set before submitting it to the program.

A solution to this type of problem is to implement the necessary support for the missing encoding as a patch to the original program (as done for [wiki:Cdrtools Cdrtools] in this book), or to find a replacement.

== The program assumes locale-based encoding for external documents ==

Some programs (e.g., [wiki:Nano Nano] or [wiki:Joe Joe]) assume that their documents are always in the encoding implied by the current locale. While this assumption may be valid for user-created documents, it is not safe for external ones. When this assumption fails, non-ASCII characters are displayed incorrectly, and the document may become unreadable.

If the external document is entirely text-based, it can be converted to the current locale encoding using the "iconv" program.
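For example (the encodings and filenames below are only assumptions for illustration), converting a document known to be in CP1251 on a system with a UTF-8 based locale could look like this:

{{{
# Convert an external document from its known encoding (CP1251 is assumed
# here) to the encoding of the current locale (UTF-8 is assumed here):
iconv -f CP1251 -t UTF-8 external.txt > external-utf8.txt
}}}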
Such a conversion is not possible for documents that are not text-based, and the assumption may be completely invalid for documents where the MS Windows operating system sets de-facto standards. An example of the latter problem is ID3v1 tags in MP3 files, which are covered on [wiki:ID3v1Coding a separate page]. For these cases, the only solution is to find a replacement program that doesn't have the issue (e.g., one that allows you to specify the assumed document encoding).

A different side of the same problem is "my friends can't read what I mailed to them". This is common when these friends use Microsoft Windows, because Windows provides only one default character encoding for a given country, and most Windows programs don't bother to support anything else. E.g., under Linux, [wiki:tetex teTeX] can process UTF-8 encoded documents if they have a "\usepackage[utf8]{inputenc}" line in their preamble. UTF-8 is the default and preferred encoding for text documents in UTF-8 locales. This makes it convenient to create such TeX documents on a UTF-8 based Linux system with, e.g., the [wiki:Joe Joe] editor. However, such UTF-8 encoded TeX documents are useless for Windows users who run [http://www.winedt.com/ WinEdt] or [http://www.toolscenter.org/ TeXnicCenter], because these editors assume the default Windows 8-bit encoding and can't be configured to assume anything else. If you have to collaborate with such users, convert your TeX documents with iconv to the Windows codepage before sending them (and don't forget to change the \usepackage[???]{inputenc} line).
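As a minimal sketch of such a conversion, assuming the recipient uses the Western-European Windows codepage (CP1252 is only an assumption here; pick the codepage and the matching inputenc option for the actual target locale):

{{{
# Re-encode the UTF-8 TeX source for a Windows user (CP1252 is assumed):
iconv -f UTF-8 -t CP1252 article.tex > article-cp1252.tex
# Then, in article-cp1252.tex, change the preamble line
#   \usepackage[utf8]{inputenc}
# to
#   \usepackage[cp1252]{inputenc}
}}}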
…

POSIX mandates that the filename encoding is the encoding implied by the current LC_CTYPE locale category. This information is well-hidden on the page which specifies the behaviour of the Tar and CPIO programs, and some programs get it wrong by default (or simply don't have enough information to get it right). The result is that they create filenames which are not subsequently shown correctly by "ls", or refuse to accept filenames that "ls" shows properly. For the Glib2 library, the problem can be corrected by setting the G_FILENAME_ENCODING environment variable to the special "@locale" value. Glib2-based programs that don't respect that environment variable are buggy.

The Zip, Unzip and Nautilus CD Burner programs have this problem because they hard-code the expected filename encoding. E.g., Unzip contains a hard-coded conversion table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings popular in the USA, and uses this table when extracting archives created under DOS or MS Windows. However, not everyone lives in the USA, and even US Linux users prefer UTF-8 now. So the table is wrong for such users, causing non-ASCII characters to be mangled in the extracted filenames. Nautilus CD Burner checks names of files dropped into its window for UTF-8 validity, but this is not the right thing to do in non-UTF-8 locales. Also, Nautilus CD Burner unconditionally calls mkisofs with the "-input-charset UTF-8" parameter, which is correct only in UTF-8 locales.

The general rule for avoiding this class of problems is to avoid installing broken programs. If this is impossible, the [http://j3e.de/linux/convmv/ convmv] command-line tool can be used to fix filenames created by these broken programs, or to intentionally mangle the existing filenames to meet the broken expectations of such programs.

In other cases, a similar problem is caused by importing filenames from a system using a different locale with a tool that is not locale-aware (e.g., NFS mount or scp). In order to avoid mangling non-ASCII characters when transferring files to a system with a different locale, any of the following methods can be used:

 * Transfer anyway, fix the damage with convmv (sketched after this list).
 * On the sending side, create a tar archive with the --format=posix switch passed to tar (this will become the default in a future version of tar). This causes the filenames to be converted from the creator's locale encoding to UTF-8 when creating the archive, stored in the UTF-8 encoding in the archive, and converted from it to the recipient's locale encoding when unpacking.
 * Mail the files as attachments. Mail clients specify the encoding of attached filenames.
 * Write the files to a removable disk formatted with the FAT or FAT32 filesystem, which stores file names in UNICODE. The kernel automatically converts them to and from UNICODE on demand.
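A rough sketch of the first two methods (the directory names and the KOI8-R source encoding are only assumptions for the example):

{{{
# Method 1: transfer the files as-is and repair the names afterwards.
# convmv only reports what it would rename until --notest is added.
convmv -f KOI8-R -t UTF-8 -r /mnt/imported
convmv -f KOI8-R -t UTF-8 -r --notest /mnt/imported

# Method 2: on the sending side, pack the files into a POSIX-format archive
# so that tar records the filenames in UTF-8 inside the archive.
tar --format=posix -cf files-to-send.tar files-to-send/
}}}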