19 | | issues. In this paragraph you'll find a generic overview of things that can |
20 | | come up when configuring your system for various locales. The previous |
21 | | sentence and the remainder of this paragraph must still be |
22 | | revised/completed.</para> |
| 19 | issues. In this paragraph you'll find a generic overview of things that |
| 20 | can come up when configuring your system for various locales. Many (but |
| 21 | not all) existing locale-related problems can be classified and fall |
| 22 | under one of the headings below.</para> |
28 | | <para>For package-specific issues, find the concerned package from the list |
29 | | below and follow the link to view the available information. If a package |
30 | | is not listed here, it does not mean there are no known locale-specific |
31 | | issues or problems with that package. It only means that this page has not |
32 | | been updated with the locale-specific information regarding that package. |
33 | | Please reference the BLFS Wiki page for a particular package for any |
34 | | additional locale-specific information. </para> |
| 28 | <para>Some programs require the user to specify the character encoding |
| 29 | for their input or output data, and present only a limited choice of |
| 30 | encodings. This is the case for the <option>-X</option> option in |
| 31 | <xref linkend="a2ps"/> and <xref linkend="enscript"/>, |
| 32 | the <option>-input-charset</option> option in unpatched |
| 33 | <xref linkend="cdrtools"/>, and the character sets offered for display |
| 34 | in the menu of <xref linkend="links"/>. If the required encoding is not |
| 35 | in the list, the program usually becomes completely unusable. For |
| 36 | non-interactive programs, it may be possible to work around this by |
| 37 | converting the document to a supported input character set before |
| 38 | submitting to the program.</para> |
52 | | <sect3 id="locale-mc" xreflabel="MC-&mc-version;"> |
| 52 | <para>Some programs, <xref linkend="nano"/> or |
| 53 | <xref linkend="joe"/> for example, assume that documents are always |
| 54 | in the encoding implied by the current locale. While this assumption |
| 55 | may be valid for user-created documents, it is not safe for |
| 56 | external ones. When this assumption fails, non-ASCII characters are |
| 57 | displayed incorrectly, and the document may become unreadable.</para> |
56 | | <para>This package makes the assumption that <quote>characters</quote> |
57 | | and <quote>bytes</quote> are the same thing. This is not true in UTF-8 |
58 | | based locales. Due to this assumption <application>MC</application> will |
59 | | incorrectly position characters on the screen. After the cursor is moved |
60 | | a bit the screen becomes totally unreadable, as illustrated on |
61 | | <ulink url="&files-anduin;/mc-bad.png">this |
62 | | screenshot</ulink> (taken in a ru_RU.UTF-8 locale). Additionally, input |
63 | | of non-ASCII characters in the editor is impossible, even after selecting |
64 | | <quote>Other 8-bit</quote> encoding from the menu.</para> |
| 63 | <para>For documents that are not text-based, this is not possible. |
| 64 | In fact, the assumption made in the program may be completely |
| 65 | invalid for documents where the Microsoft Windows operating system |
| 66 | has set de-facto standards. An example of this problem is ID3v1 tags |
| 67 | in MP3 files (see <ulink url="&blfs-wiki;/ID3v1Coding">this page</ulink> |
| 68 | for more details). For these cases, the only solution is to find a |
| 69 | replacement program that doesn't have the issue (e.g., one that |
| 70 | will allow you to specify the assumed document encoding).</para> |
66 | | </sect3> |
| 72 | <para>Another problem in this category is when someone cannot read |
| 73 | the documents you've sent them because their operating system is |
| 74 | set up to handle character encodings differently. This can happen |
| 75 | often when the other person is using Microsoft Windows, which only |
| 76 | provides one character encoding for a given country. For example, |
| 77 | this causes problems with UTF-8 encoded TeX documents created in |
| 78 | Linux. On Windows, most applications will assume that these documents |
| 79 | have been created using the default Windows 8-bit encoding. See the |
| 80 | <ulink url="&blfs-wiki;/tetex">teTeX</ulink> Wiki page for more |
| 81 | details.</para> |
72 | | <note> |
73 | | <para>Use of <application>UnZip</application> in the |
74 | | <application>JDK</application>, <application>Mozilla</application>, |
75 | | <application>DocBook</application> or any other BLFS package |
76 | | installation is not a problem, as BLFS instructions never use |
77 | | <application>UnZip</application> to extract a file with non-ASCII |
78 | | characters in the file's name.</para> |
79 | | </note> |
| 89 | <sect2 id="locale-wrong-filename-encoding"> |
81 | | <para>The <application>UnZip</application> package assumes that filenames |
82 | | stored in the ZIP archives created on non-Unix systems are encoded in |
83 | | CP850, and that they should be converted to ISO-8859-1 when writing files |
84 | | onto the filesystem. Such assumptions are not always valid. In fact, |
85 | | inside the ZIP archive, filenames are encoded in the DOS codepage that is |
86 | | in use in the relevant country, and the filenames on disk should be in |
87 | | the locale encoding. In MS Windows, the OemToChar() C function (from |
88 | | <filename>User32.DLL</filename>) does the correct conversion (which is |
89 | | indeed the conversion from CP850 to a superset of ISO-8859-1 if MS |
90 | | Windows is set up to use the US English language), but there is no |
91 | | equivalent in Linux.</para> |
| 91 | <title>The Program Uses or Creates Filenames in |
| 92 | the Wrong Encoding</title> |
93 | | <para>When using <command>unzip</command> to unpack a ZIP archive |
94 | | containing non-ASCII filenames, the filenames are damaged because |
95 | | <command>unzip</command> uses improper conversion when any of its |
96 | | encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R |
97 | | locale, conversion of filenames from CP866 to KOI8-R is required, but |
98 | | conversion from CP850 to ISO-8859-1 is done, which produces filenames |
99 | | consisting of undecipherable characters instead of words (the closest |
100 | | equivalent understandable example for English-only users is rot13). There |
101 | | are several ways around this limitation:</para> |
| 94 | <para>The POSIX standard mandates that the filename encoding is |
| 95 | the encoding implied by the current LC_CTYPE locale category. This |
| 96 | information is well-hidden on the page which specifies the behaviour |
| 97 | of <application>Tar</application> and <application>Cpio</application> |
| 98 | programs. Some programs get it wrong by default (or simply don't |
| 99 | have enough information to get it right). The result is that they |
| 100 | create filenames which are not subsequently shown correctly by |
| 101 | <command>ls</command>, or they refuse to accept filenames that |
| 102 | <command>ls</command> shows properly. For the <xref linkend="glib2"/> |
| 103 | library, the problem can be corrected by setting the |
| 104 | <envar>G_FILENAME_ENCODING</envar> environment variable to the special |
| 105 | "@locale" value. <application>Glib2</application> based programs that |
| 106 | don't respect that environment variable are buggy.</para> |
103 | | <para>1) For unpacking ZIP archives with filenames containing non-ASCII |
104 | | characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while |
105 | | running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows |
106 | | emulator.</para> |
| 108 | <para><xref linkend="zip"/>, <xref linkend="unzip"/>, and |
| 109 | <xref linkend="nautilus-cd-burner"/> have this problem because |
| 110 | they hard-code the expected filename encoding. |
| 111 | <application>UnZip</application> contains a hard-coded conversion |
| 112 | table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and |
| 113 | uses this table when extracting archives created under DOS or |
| 114 | Microsoft Windows. However, this assumption only works for those |
| 115 | in the US and not for anyone using a UTF-8 locale. Non-ASCII |
| 116 | characters will be mangled in the extracted filenames.</para> |
108 | | <para>2) After running <command>unzip</command>, fix the damage made to |
109 | | the filenames using the <command>convmv</command> tool |
110 | | (<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example |
111 | | for the ru_RU.KOI8-R locale:</para> |
| 118 | <para>On the other hand, |
| 119 | <application>Nautilus CD Burner</application> checks names of |
| 120 | files added to its window for UTF-8 validity. This is wrong for |
| 121 | users of non-UTF-8 locales. Also, |
| 122 | <application>Nautilus CD Burner</application> unconditionally |
| 123 | calls <command>mkisofs</command> with the |
| 124 | <parameter>-input-charset UTF-8</parameter> parameter, which is |
| 125 | only correct in UTF-8 locales.</para> |
113 | | <blockquote> |
114 | | <para>Step 1. Undo the conversion done by |
115 | | <command>unzip</command>:</para> |
| 127 | <para>The general rule for avoiding this class of problems is to |
| 128 | avoid installing broken programs. If this is impossible, the |
| 129 | <ulink url="http://j3e.de/linux/convmv/">convmv</ulink> |
| 130 | command-line tool can be used to fix filenames created by these |
| 131 | broken programs, or intentionally mangle the existing filenames |
| 132 | to meet the broken expectations of such programs.</para> |
117 | | <screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \ |
118 | | <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen> |
| 134 | <para>In other cases, a similar problem is caused by importing |
| 135 | filenames from a system using a different locale with a tool that |
| 136 | is not locale-aware (e.g., <xref linkend="nfs-utils"/> or |
| 137 | <xref linkend="openssh"/>). In order to avoid mangling non-ASCII |
| 138 | characters when transferring files to a system with a different |
| 139 | locale, any of the following methods can be used:</para> |
126 | | <para>3) Apply this patch to unzip: |
127 | | <ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para> |
| 148 | <listitem> |
| 149 | <para>On the sending side, create a tar archive with the |
| 150 | <parameter>--format=posix</parameter> switch passed to |
| 151 | <command>tar</command> (this will be the default in a future |
| 152 | version of <command>tar</command>). This causes the filenames |
| 153 | to be converted from the creator's locale encoding to UTF-8 |
| 154 | when creating the archive, stored in the UTF-8 encoding in the |
| 155 | archive, and converted from it to the recipient's locale |
| 156 | encoding when unpacking.</para> |
| 157 | </listitem> |
129 | | <para>It allows you to specify the assumed filename encoding in the ZIP |
130 | | archive using the <option>-O charset_name</option> option and the |
131 | | on-disk filename encoding using the <option>-I charset_name</option> |
132 | | option. Defaults: the on-disk filename encoding is the locale encoding, |
133 | | the encoding inside the ZIP archive is guessed according to the builtin |
134 | | table based on the locale encoding. For US English users, this still |
135 | | means that unzip converts from CP850 to ISO-8859-1 by default.</para> |
| 159 | <listitem> |
| 160 | <para>Mail the files as attachments. Mail clients specify the |
| 161 | encoding of attached filenames.</para> |
| 162 | </listitem> |
144 | | <sect3 id="locale-nano" xreflabel="Nano-&nano-version;"> |
145 | | |
146 | | <title><xref linkend="nano"/></title> |
147 | | |
148 | | <para>The current stable version of <application>Nano</application> |
149 | | (&nano-version;) does not support UTF-8 character encodings. A |
150 | | development version is available which addresses these issues. This |
151 | | version can be downloaded at <ulink |
152 | | url="http://www.nano-editor.org/dist/v1.3/nano-1.3.11.tar.gz"/>. |
153 | | Instructions for installing this version are the same as those found on |
154 | | the <xref linkend="nano"/> page.</para> |
155 | | |
156 | | </sect3> |
157 | | |