Ticket #1993: new-locale-issues-2.diff
File new-locale-issues-2.diff, 24.6 KB (added 16 years ago)
introduction/important/locale-issues.xml
@@ -16,145 +16,234 @@
 <title>Locale Related Issues</title>
 
 <para>This page contains information about locale related problems and
-issues. In this paragraph you'll find a generic overview of things that can
-come up when configuring your system for various locales. The previous
-sentence and the remainder of this paragraph must still be
-revised/completed.</para>
+issues. In the following paragraphs you'll find a generic overview of
+things that can come up when configuring your system for various locales.
+Many (but not all) existing locale-related problems can be classified
+and fall under one of the headings below. The severity ratings below use
+the following criteria:</para>
 
-<sect2>
+<itemizedlist>
+<listitem>
+<para>Critical: The program doesn't perform its main function.
+The fix would be very intrusive, it's better to search for a
+replacement.</para>
+</listitem>
+<listitem>
+<para>High: Part of the functionality that the program provides
+is not usable. If that functionality is required, it's better to
+search for a replacement.</para>
+</listitem>
+<listitem>
+<para>Low: The program works in all typical use cases, but lacks
+some functionality normally provided by its equivalents.</para>
+</listitem>
+</itemizedlist>
 
-<title>Package Specific Locale Issues</title>
+<para>If there is a known workaround for a specific package, it will
+appear on that package's page.</para>
 
-<para>For package-specific issues, find the concerned package from the list
-below and follow the link to view the available information. If a package
-is not listed here, it does not mean there are no known locale-specific
-issues or problems with that package. It only means that this page has not
-been updated with the locale-specific information regarding that package.
-Please reference the BLFS Wiki page for a particular package for any
-additional locale-specific information. </para>
+<sect2 id="locale-not-valid-option"
+xreflabel="Needed Encoding Not a Valid Option">
 
-<itemizedlist>
+<title>The Needed Encoding is Not a Valid Option in the Program</title>
 
-<title>List of Packages with Locale Related Issues</title>
+<para>Severity: Critical</para>
 
-<listitem>
-<para><xref linkend="locale-mc"/></para>
-</listitem>
-<listitem>
-<para><xref linkend="locale-unzip"/></para>
-</listitem>
-<listitem>
-<para><xref linkend="locale-nano"/></para>
-</listitem>
+<para>Some programs require the user to specify the character encoding
+for their input or output data and present only a limited choice of
+encodings. This is the case for the <option>-X</option> option in
+<xref linkend="a2ps"/> and <xref linkend="enscript"/>,
+the <option>-input-charset</option> option in unpatched
+<xref linkend="cdrtools"/>, and the character sets offered for display
+in the menu of <xref linkend="links"/>. If the required encoding is not
+in the list, the program usually becomes completely unusable. For
+non-interactive programs, it may be possible to work around this by
+converting the document to a supported input character set before
+submitting to the program.</para>
 
-</itemizedlist>
+<para>A solution to this type of problem is to implement the necessary
+support for the missing encoding as a patch to the original program
+(as done for <xref linkend="cdrtools"/> in this book), or to find a
+replacement.</para>
 
-<sect3 id="locale-mc" xreflabel="MC-&mc-version;">
+</sect2>
 
-<title><xref linkend="mc"/></title>
+<sect2 id="locale-assumed-encoding"
+xreflabel="Program Assumes Encoding">
 
-<para>This package makes the assumption that <quote>characters</quote>
-and <quote>bytes</quote> are the same thing. This is not true in UTF-8
-based locales. Due to this assumption <application>MC</application> will
-incorrectly position characters on the screen. After the cursor is moved
-a bit the screen becomes totally unreadable, as illustrated on
-<ulink url="&files-anduin;/mc-bad.png">this
-screenshot</ulink> (taken in a ru_RU.UTF-8 locale). Additionally, input
-of non-ASCII characters in the editor is impossible, even after selecting
-<quote>Other 8-bit</quote> encoding from the menu.</para>
+<title>The Program Assumes the Locale-Based Encoding of External
+Documents</title>
 
-</sect3>
+<para>Severity: High for non-text documents, low for text
+documents</para>
 
-<sect3 id="locale-unzip" xreflabel="UnZip-&unzip-version;">
+<para>Some programs, <xref linkend="nano"/> or
+<xref linkend="joe"/> for example, assume that documents are always
+in the encoding implied by the current locale. While this assumption
+may be valid for the user-created documents, it is not safe for
+external ones. When this assumption fails, non-ASCII characters are
+displayed incorrectly, and the document may become unreadable.</para>
 
-<title><xref linkend="unzip"/></title>
+<para>If the external document is entirely text based, it can be
+converted to the current locale encoding using the
+<command>iconv</command> program.</para>
 
-<note>
-<para>Use of <application>UnZip</application> in the
-<application>JDK</application>, <application>Mozilla</application>,
-<application>DocBook</application> or any other BLFS package
-installation is not a problem, as BLFS instructions never use
-<application>UnZip</application> to extract a file with non-ASCII
-characters in the file's name.</para>
-</note>
+<para>For documents that are not text-based, this is not possible.
+In fact, the assumption made in the program may be completely
+invalid for documents where the Microsoft Windows operating system
+has set de facto standards. An example of this problem is ID3v1 tags
+in MP3 files (see <ulink url="&blfs-wiki;/ID3v1Coding">this page</ulink>
+for more details). For these cases, the only solution is to find a
+replacement program that doesn't have the issue (e.g., one that
+will allow you to specify the assumed document encoding).</para>
 
-<para>The <application>UnZip</application> package assumes that filenames
-stored in the ZIP archives created on non-Unix systems are encoded in
-CP850, and that they should be converted to ISO-8859-1 when writing files
-onto the filesystem. Such assumptions are not always valid. In fact,
-inside the ZIP archive, filenames are encoded in the DOS codepage that is
-in use in the relevant country, and the filenames on disk should be in
-the locale encoding. In MS Windows, the OemToChar() C function (from
-<filename>User32.DLL</filename>) does the correct conversion (which is
-indeed the conversion from CP850 to a superset of ISO-8859-1 if MS
-Windows is set up to use the US English language), but there is no
-equivalent in Linux.</para>
+<para>Among BLFS packages, this problem applies to
+<xref linkend="nano"/>, <xref linkend="joe"/>, and all media players
+except <xref linkend="audacious"/>.</para>
 
-<para>When using <command>unzip</command> to unpack a ZIP archive
-containing non-ASCII filenames, the filenames are damaged because
-<command>unzip</command> uses improper conversion when any of its
-encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R
-locale, conversion of filenames from CP866 to KOI8-R is required, but
-conversion from CP850 to ISO-8859-1 is done, which produces filenames
-consisting of undecipherable characters instead of words (the closest
-equivalent understandable example for English-only users is rot13). There
-are several ways around this limitation:</para>
+<para>Another problem in this category is when someone cannot read
+the documents you've sent them because their operating system is
+set up to handle character encodings differently. This can happen
+often when the other person is using Microsoft Windows, which only
+provides one character encoding for a given country. For example,
+this causes problems with UTF-8 encoded TeX documents created in
+Linux. On Windows, most applications will assume that these documents
+have been created using the default Windows 8-bit encoding. See the
+<ulink url="&blfs-wiki;/tetex">teTeX</ulink> Wiki page for more
+details.</para>
 
-<para>1) For unpacking ZIP archives with filenames containing non-ASCII
-characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while
-running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows
-emulator.</para>
+<para>In extreme cases, Windows encoding compatibility issues may be
+solved only by running Windows programs under
+<ulink url="http://www.winehq.com/">Wine</ulink>.</para>
 
-<para>2) After running <command>unzip</command>, fix the damage made to
-the filenames using the <command>convmv</command> tool
-(<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example
-for the ru_RU.KOI8-R locale:</para>
+</sect2>
 
-<blockquote>
-<para>Step 1. Undo the conversion done by
-<command>unzip</command>:</para>
+<sect2 id="locale-wrong-filename-encoding"
+xreflabel="Wrong Filename Encoding">
 
-<screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \
-<replaceable></path/to/unzipped/files></replaceable></userinput></screen>
+<title>The Program Uses or Creates Filenames in the Wrong Encoding</title>
 
-<para>Step 2. Do the correct conversion instead:</para>
+<para>Severity: Critical</para>
 
-<screen><userinput>convmv -f cp866 -t koi8-r -r --nosmart --notest \
-<replaceable></path/to/unzipped/files></replaceable></userinput></screen>
-</blockquote>
+<para>The POSIX standard mandates that the filename encoding is
+the encoding implied by the current LC_CTYPE locale category. This
+information is well-hidden on the page which specifies the behavior
+of <application>Tar</application> and <application>Cpio</application>
+programs. Some programs get it wrong by default (or simply don't
+have enough information to get it right). The result is that they
+create filenames which are not subsequently shown correctly by
+<command>ls</command>, or they refuse to accept filenames that
+<command>ls</command> shows properly. For the <xref linkend="glib2"/>
+library, the problem can be corrected by setting the
+<envar>G_FILENAME_ENCODING</envar> environment variable to the special
+"@locale" value. <application>Glib2</application> based programs that
+don't respect that environment variable are buggy.</para>
 
-<para>3) Apply this patch to unzip:
-<ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para>
+<para>The <xref linkend="zip"/>, <xref linkend="unzip"/>, and
+<xref linkend="nautilus-cd-burner"/> have this problem because
+they hard-code the expected filename encoding.
+<application>UnZip</application> contains a hard-coded conversion
+table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and
+uses this table when extracting archives created under DOS or
+Microsoft Windows. However, this assumption only works for those
+in the US and not for anyone using a UTF-8 locale. Non-ASCII
+characters will be mangled in the extracted filenames.</para>
 
-<para>It allows to specify the assumed filename encoding in the ZIP
-archive using the <option>-O charset_name</option> option and the
-on-disk filename encoding using the <option>-I charset_name</option>
-option. Defaults: the on-disk filename encoding is the locale encoding,
-the encoding inside the ZIP archive is guessed according to the builtin
-table based on the locale encoding. For US English users, this still
-means that unzip converts from CP850 to ISO-8859-1 by default.</para>
+<para>On the other hand,
+<application>Nautilus CD Burner</application> checks names of
+files added to its window for UTF-8 validity. This is wrong for
+users of non-UTF-8 locales. Also,
+<application>Nautilus CD Burner</application> unconditionally
+calls <command>mkisofs</command> with the
+<parameter>-input-charset UTF-8</parameter> parameter, which is
+only correct in UTF-8 locales.</para>
 
-<para>Caveat: this method works only with 8-bit locale encodings, not
-with UTF-8. Attempting to use a patched <command>unzip</command> in UTF-8
-locales may result in a segmentation fault and is probably a security
-risk.</para>
+<para>The general rule for avoiding this class of problems is to
+avoid installing broken programs. If this is impossible, the
+<ulink url="http://j3e.de/linux/convmv/">convmv</ulink>
+command-line tool can be used to fix filenames created by these
+broken programs, or intentionally mangle the existing filenames
+to meet the broken expectations of such programs.</para>
 
-</sect3>
+<para>In other cases, a similar problem is caused by importing
+filenames from a system using a different locale with a tool that
+is not locale-aware (e.g., <xref linkend="nfs-utils"/> or
+<xref linkend="openssh"/>). In order to avoid mangling non-ASCII
+characters when transferring files to a system with a different
+locale, any of the following methods can be used:</para>
 
-<sect3 id="locale-nano" xreflabel="Nano-&nano-version;">
+<itemizedlist>
+<listitem>
+<para>Transfer anyway, fix the damage with
+<command>convmv</command>.</para>
+</listitem>
+<listitem>
+<para>On the sending side, create a tar archive with the
+<parameter>--format=posix</parameter> switch passed to
+<command>tar</command> (this will be the default in a future
+version of <command>tar</command>).</para>
+</listitem>
+<listitem>
+<para>Mail the files as attachments. Mail clients specify the
+encoding of attached filenames.</para>
+</listitem>
+<listitem>
+<para>Write the files to a removable disk formatted with a FAT or
+FAT32 filesystem.</para>
+</listitem>
+<listitem>
+<para>Transfer the files using Samba.</para>
+</listitem>
+<listitem>
+<para>Transfer the files via FTP using RFC2640-aware server
+(this currently means only wu-ftpd, which has bad security history)
+and client (e.g., lftp).</para>
+</listitem>
+</itemizedlist>
 
-<title><xref linkend="nano"/></title>
+<para>The last four methods work because the filenames are automatically
+converted from the sender's locale to UNICODE and stored or sent in this
+form. They are then transparently converted from UNICODE to the
+recipient's locale encoding.</para>
 
-<para>The current stable version of <application>Nano</application>
-(&nano-version;) does not support UTF-8 character encodings. A
-development version is available which addresses these issues. This
-version can be downloaded at <ulink
-url="http://www.nano-editor.org/dist/v1.3/nano-1.3.11.tar.gz"/>.
-Instructions for installing this version are the same as those found on
-the <xref linkend="nano"/> page.</para>
+</sect2>
 
-</sect3>
+<sect2 id="locale-wrong-multibyte-characters"
+xreflabel="Wrong Multibyte Characters">
 
+<title>The Program Breaks Multibyte Characters or Doesn't Count
+Character Cells Correctly</title>
+
+<para>Severity: High or critical</para>
+
+<para>Many programs were written in an older era where multibyte
+locales were not common. Such programs assume that C "char" data
+type, which is one byte, can be used to store single characters.
+Further, they assume that any sequence of characters is a valid
+string and that every character occupies a single character cell.
+Such assumptions completely break in UTF-8 locales. The visible
+manifestation is that the program truncates strings prematurely
+(i.e., at 80 bytes instead of 80 characters). Terminal-based
+programs don't place the cursor correctly on the screen, don't react
+to the "Backspace" key by erasing one character, and leave junk
+characters around when updating the screen, usually turning the
+screen into a complete mess.</para>
+
+<para>Fixing this kind of problems is a tedious task from a
+programmer's point of view, like all other cases of retrofitting new
+concepts into the old flawed design. In this case, one has to redesign
+all data structures in order to accommodate to the fact that a complete
+character may span a variable number of "char"s (or switch to wchar_t
+and convert as needed). Also, for every call to the "strlen" and
+similar functions, find out whether a number of bytes, a number of
+characters, or the width of the string was really meant. Sometimes it
+is faster to write a program with the same functionality from scratch.
+</para>
+
+<para>Among BLFS packages, this problem applies to <xref linkend="mc"/>,
+<xref linkend="nano"/>, <xref linkend="ed"/>, <xref linkend="xine-ui"/>
+and all shells.</para>
+
 </sect2>
 
 </sect1>
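Review note on the "Wrong Multibyte Characters" section: the distinction the new text draws between a number of bytes, a number of characters, and the width of a string can be made concrete with a small sketch. The following Python is illustrative only and is not part of the patch; the helper names (display_width, measure, truncate_bytes) are hypothetical, and the width rule is a rough approximation (wide and fullwidth characters count as two cells, combining marks as zero), not what any particular terminal implements.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal cells: wide/fullwidth = 2, combining marks = 0."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous cell
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def measure(text: str, encoding: str = "utf-8") -> tuple:
    """Return (bytes, characters, approximate cells) for one string."""
    return len(text.encode(encoding)), len(text), display_width(text)


def truncate_bytes(text: str, limit: int, encoding: str = "utf-8") -> str:
    """Cut at a byte limit, dropping a character that would be split."""
    return text.encode(encoding)[:limit].decode(encoding, errors="ignore")


if __name__ == "__main__":
    for sample in ("hello", "привет", "日本語"):
        print(sample, measure(sample))
```

A program that treats the first number as if it were the second or third shows the symptoms described in the section: early truncation and misplaced cursors.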
postlfs/editors/nano.xml
@@ -10,6 +10,11 @@
 <!ENTITY nano-size "891 KB">
 <!ENTITY nano-buildsize "5.1 MB">
 <!ENTITY nano-time "0.1 SBU">
+
+<!-- The nano development version fixes a lot of issues w.r.t.
+locale issues. This entity can be removed when nano-2.0 stable
+is released and added to BLFS -->
+<!ENTITY nano-devel-version "1.9.99pre2">
 ]>
 
 <sect1 id="nano" xreflabel="nano-&nano-version;">
@@ -35,10 +40,12 @@
 
 <caution>
 <para>The <application>Nano</application> package has some issues when
-used in a UTF-8 based locale. A development version is available
-which addresses these issues. Please see the
-<xref linkend="locale-nano"/> section of the <xref
-linkend="locale-issues"/>.</para>
+used in a UTF-8 based locale. A development version is available
+which addresses these issues at <ulink
+url="http://www.nano-editor.org/dist/v1.3/nano-&nano-devel-version;.tar.gz"/>.
+This version can be installed with the same instructions shown below.
+See the <xref linkend="locale-issues"/> page for a more general
+discussion of these problems.</para>
 </caution>
 
 <bridgehead renderas="sect3">Package Information</bridgehead>
general/sysutils/mc.xml
@@ -37,9 +37,13 @@
 
 <caution>
 <para>The <application>MC</application> package has some issues when
-used in a UTF-8 based locale. For a full explanation of the issues, see
-the <xref linkend="locale-mc"/> section of the
-<xref linkend="locale-issues"/>.</para>
+used in a UTF-8 based locale because it assumes the characters are
+always one byte wide. See <ulink url="&files-anduin;/mc-bad.png">this
+screenshot</ulink> (taken in a ru_RU.UTF-8 locale).
+See the <ulink url="&blfs-wiki;/MC">MC Wiki</ulink> page for a way
+to work around these problems.
+For a general discussion of these types of issues, see
+the <xref linkend="locale-issues"/> page.</para>
 </caution>
 
 <bridgehead renderas="sect3">Package Information</bridgehead>
general/sysutils/unzip.xml
@@ -38,9 +38,10 @@
 
 <caution>
 <para>The <application>UnZip</application> package has some locale
-related issues. For a full explanation of the issues and some possible
-solutions, see the <xref linkend="locale-unzip"/> section of the
-<xref linkend="locale-issues"/>.</para>
+related issues. See the discussion below in the
+<xref linkend="unzip-locale-issues"/> section. A more general
+discussion of these problems can be found on the
+<xref linkend="locale-issues"/> page.</para>
 </caution>
 
 <bridgehead renderas="sect3">Package Information</bridgehead>
@@ -70,6 +71,80 @@
 
 </sect2>
 
+<sect2 id="unzip-locale-issues">
+<title>UnZip Locale Issues</title>
+
+<note>
+<para>Use of <application>UnZip</application> in the
+<application>JDK</application>, <application>Mozilla</application>,
+<application>DocBook</application> or any other BLFS package
+installation is not a problem, as BLFS instructions never use
+<application>UnZip</application> to extract a file with non-ASCII
+characters in the file's name.</para>
+</note>
+
+<para>The <application>UnZip</application> package assumes that filenames
+stored in the ZIP archives created on non-Unix systems are encoded in
+CP850, and that they should be converted to ISO-8859-1 when writing files
+onto the filesystem. Such assumptions are not always valid. In fact,
+inside the ZIP archive, filenames are encoded in the DOS codepage that is
+in use in the relevant country, and the filenames on disk should be in
+the locale encoding. In MS Windows, the OemToChar() C function (from
+<filename>User32.DLL</filename>) does the correct conversion (which is
+indeed the conversion from CP850 to a superset of ISO-8859-1 if MS
+Windows is set up to use the US English language), but there is no
+equivalent in Linux.</para>
+
+<para>When using <command>unzip</command> to unpack a ZIP archive
+containing non-ASCII filenames, the filenames are damaged because
+<command>unzip</command> uses improper conversion when any of its
+encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R
+locale, conversion of filenames from CP866 to KOI8-R is required, but
+conversion from CP850 to ISO-8859-1 is done, which produces filenames
+consisting of undecipherable characters instead of words (the closest
+equivalent understandable example for English-only users is rot13). There
+are several ways around this limitation:</para>
+
+<para>1) For unpacking ZIP archives with filenames containing non-ASCII
+characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while- running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows
+emulator.</para>
+
+<para>2) After running <command>unzip</command>, fix the damage made to
+the filenames using the <command>convmv</command> tool
+(<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example
+for the ru_RU.KOI8-R locale:</para>
+
+<blockquote>
+<para>Step 1. Undo the conversion done by
+<command>unzip</command>:</para>
+
+<screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \
+<replaceable></path/to/unzipped/files></replaceable></userinput></screen>
+
+<para>Step 2. Do the correct conversion instead:</para>
+
+<screen><userinput>convmv -f cp866 -t koi8-r -r --nosmart --notest \
+<replaceable></path/to/unzipped/files></replaceable></userinput></screen>
+</blockquote>
+
+<para>3) Apply this patch to unzip:
+<ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para>
+
+<para>It allows to specify the assumed filename encoding in the ZIP
+archive using the <option>-O charset_name</option> option and the
+on-disk filename encoding using the <option>-I charset_name</option>
+option. Defaults: the on-disk filename encoding is the locale encoding,
+the encoding inside the ZIP archive is guessed according to the builtin
+table based on the locale encoding. For US English users, this still
+means that unzip converts from CP850 to ISO-8859-1 by default.</para>
+
+<para>Caveat: this method works only with 8-bit locale encodings, not
+with UTF-8. Attempting to use a patched <command>unzip</command> in UTF-8
+locales may result in a segmentation fault and is probably a security
+risk.</para>
+
+</sect2>
+
 <sect2 role="installation">
 <title>Installation of UnZip</title>
 
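Review note on the two convmv steps in the UnZip section: the round trip they describe can be checked on plain byte strings without touching the filesystem. The sketch below is illustrative and not part of the patch. It uses Python's built-in codecs as stand-ins for unzip's hard-coded table and for convmv, which is an assumption: unzip's own table may treat CP850 characters that have no ISO-8859-1 equivalent differently, so the sample name was chosen to avoid them. The function names are hypothetical.

```python
def unzip_default_conversion(archive_name: bytes) -> bytes:
    """What unzip is described as doing: treat the stored name as CP850
    and write it to disk as ISO-8859-1."""
    return archive_name.decode("cp850").encode("iso-8859-1")


def undo_unzip_conversion(on_disk: bytes) -> bytes:
    """Step 1, the equivalent of: convmv -f iso-8859-1 -t cp850"""
    return on_disk.decode("iso-8859-1").encode("cp850")


def convert_to_locale(archive_name: bytes) -> bytes:
    """Step 2, the equivalent of: convmv -f cp866 -t koi8-r"""
    return archive_name.decode("cp866").encode("koi8_r")


if __name__ == "__main__":
    # A name as a Russian DOS/Windows system would store it in the archive.
    stored = "привет".encode("cp866")

    damaged = unzip_default_conversion(stored)
    print("after unzip:", damaged.decode("iso-8859-1"))

    restored = undo_unzip_conversion(damaged)
    fixed = convert_to_locale(restored)
    print("after both steps:", fixed.decode("koi8_r"))
```

Step 1 recovers the original archive bytes only because the wrong conversion happened to be reversible for these characters; a name containing a CP850 character outside ISO-8859-1 may not survive the round trip.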