source: introduction/important/locale-issues.xml

Last change on this file was ab4fdfc, checked in by Pierre Labastie <pierre.labastie@…>, 3 months ago

Change all xml decl to encoding=utf-8

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sect1 PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
   "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
  <!ENTITY % general-entities SYSTEM "../../general.ent">
  %general-entities;
]>

<sect1 id="locale-issues" xreflabel="Locale Related Issues">
  <?dbhtml filename="locale-issues.html"?>


  <title>Locale Related Issues</title>

  <para>This page contains information about locale-related problems and
  issues. The following paragraphs give a generic overview of things that
  can come up when configuring your system for various locales. Many (but
  not all) existing locale-related problems fall under one of the headings
  below. The severity ratings use the following criteria:</para>

  <itemizedlist>
    <listitem>
      <para>Critical: The program doesn't perform its main function.
      The fix would be very intrusive, so it is better to search for a
      replacement.</para>
    </listitem>
    <listitem>
      <para>High: Part of the functionality that the program provides
      is not usable. If that functionality is required, it is better to
      search for a replacement.</para>
    </listitem>
    <listitem>
      <para>Low: The program works in all typical use cases, but lacks
      some functionality normally provided by its equivalents.</para>
    </listitem>
  </itemizedlist>

  <para>If there is a known workaround for a specific package, it will
  appear on that package's page.</para>

  <sect2 id="locale-not-valid-option"
         xreflabel="Needed Encoding Not a Valid Option">

    <title>The Needed Encoding is Not a Valid Option in the Program</title>

    <para>Severity: Critical</para>

    <para>Some programs require the user to specify the character encoding
    for their input or output data and present only a limited choice of
    encodings. This is the case for the <option>-X</option> option in
    <!-- <xref linkend="a2ps"/> and --><xref linkend="enscript"/>,
    the <option>-input-charset</option> option in unpatched
    <xref linkend="cdrtools"/>, and the character sets offered for display
    in the menu of <xref linkend="Links"/>. If the required encoding is not
    in the list, the program usually becomes completely unusable. For
    non-interactive programs, it may be possible to work around this by
    converting the document to a supported input character set before
    submitting it to the program.</para>

    <para>A solution to this type of problem is to implement the necessary
    support for the missing encoding as a patch to the original program, or
    to find a replacement.</para>

  </sect2>

  <sect2 id="locale-assumed-encoding"
         xreflabel="Program Assumes Encoding">

    <title>The Program Assumes the Locale-Based Encoding of External
    Documents</title>

    <para>Severity: High for non-text documents, low for text
    documents</para>

    <para>Some programs, <xref linkend="nano"/> or
    <xref linkend="joe"/> for example, assume that documents are always
    in the encoding implied by the current locale. While this assumption
    may be valid for user-created documents, it is not safe for
    external ones. When this assumption fails, non-ASCII characters are
    displayed incorrectly, and the document may become unreadable.</para>

    <para>If the external document is entirely text-based, it can be
    converted to the current locale encoding using the
    <command>iconv</command> program.</para>
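
    <para>A minimal sketch of such a conversion follows. The file names are
    hypothetical, and both the ISO-8859-1 source encoding and the UTF-8
    target encoding are assumptions; substitute the real encoding of the
    document and the encoding reported by <command>locale charmap</command>.
    </para>

```shell
# Sketch only: file names and encodings are assumptions, not fixed values.
workdir=$(mktemp -d)

# Create a sample "external" file: octal 351 (0xE9) is "é" in ISO-8859-1.
printf 'caf\351\n' > "$workdir/external.txt"

# Convert from the document's encoding to the locale's encoding
# (UTF-8 is assumed here).
iconv -f ISO-8859-1 -t UTF-8 "$workdir/external.txt" > "$workdir/converted.txt"
```

    <para>The original file is left untouched, so the result can be checked
    in an editor before the original is replaced.</para>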

    <para>For documents that are not text-based, this is not possible.
    In fact, the assumption made in the program may be completely
    invalid for documents where the Microsoft Windows operating system
    has set de facto standards. An example of this problem is ID3v1 tags
    in MP3 files. For these cases, the only solution is to find a
    replacement program that doesn't have the issue (e.g., one that
    will allow you to specify the assumed document encoding).</para>

    <para>Among BLFS packages, this problem applies to
    <xref linkend="nano"/>, <xref linkend="joe"/>, and all media players
    except <xref linkend="audacious"/>.</para>

    <para>Another problem in this category is when someone cannot read
    the documents you've sent them because their operating system is
    set up to handle character encodings differently. This often happens
    when the other person is using Microsoft Windows, which only
    provides one character encoding for a given country. For example,
    this causes problems with UTF-8 encoded TeX documents created in
    Linux. On Windows, most applications will assume that these documents
    have been created using the default Windows 8-bit encoding.</para>

    <para>In extreme cases, Windows encoding compatibility issues may be
    solved only by running Windows programs under
    <ulink url="https://www.winehq.org/">Wine</ulink>.</para>

  </sect2>

  <sect2 id="locale-wrong-filename-encoding"
         xreflabel="Wrong Filename Encoding">

    <title>The Program Uses or Creates Filenames in the Wrong Encoding</title>

    <para>Severity: Critical</para>

    <para>The POSIX standard mandates that the filename encoding is
    the encoding implied by the current LC_CTYPE locale category. This
    information is well hidden on the page that specifies the behavior
    of the <application>Tar</application> and <application>Cpio</application>
    programs. Some programs get it wrong by default (or simply don't
    have enough information to get it right). The result is that they
    create filenames which are not subsequently shown correctly by
    <command>ls</command>, or they refuse to accept filenames that
    <command>ls</command> shows properly. For the <xref linkend="glib2"/>
    library, the problem can be corrected by setting the
    <envar>G_FILENAME_ENCODING</envar> environment variable to the special
    "@locale" value. <application>Glib2</application>-based programs that
    don't respect that environment variable are buggy.</para>
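
    <para>As an illustration, the variable could be set as shown below.
    Where the lines are placed is an assumption; a shell startup file such
    as <filename>~/.bash_profile</filename> is one possibility.</para>

```shell
# Tell GLib to use the character encoding of the current locale for filenames.
G_FILENAME_ENCODING=@locale
export G_FILENAME_ENCODING
```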

    <para>The <xref linkend="zip"/> and <xref linkend="unzip"/> programs
    have this problem because they hard-code the expected filename encoding.
    <application>UnZip</application> contains a hard-coded conversion table
    between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and uses this table
    when extracting archives created under DOS or Microsoft Windows. However,
    this assumption only works for those in the US and not for anyone using a
    UTF-8 locale. Non-ASCII characters will be mangled in the extracted
    filenames.</para>

    <!--<para>On the other hand,
    <application>Nautilus CD Burner</application> checks names of
    files added to its window for UTF-8 validity. This is wrong for
    users of non-UTF-8 locales. Also,
    <application>Nautilus CD Burner</application> unconditionally
    calls <command>mkisofs</command> with the
    <parameter>-input-charset UTF-8</parameter> parameter, which is
    only correct in UTF-8 locales.</para>-->

    <para>The general rule for avoiding this class of problems is to
    avoid installing broken programs. If this is impossible, the
    <ulink url="https://j3e.de/linux/convmv/">convmv</ulink>
    command-line tool can be used to fix filenames created by these
    broken programs, or intentionally mangle the existing filenames
    to meet the broken expectations of such programs.</para>
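
    <para>A hedged sketch of a <command>convmv</command> invocation follows.
    The function name <command>fix_filenames</command> is hypothetical, and
    the ISO-8859-1 source encoding and UTF-8 target encoding are assumptions
    to be replaced with the encodings that actually apply. By default
    <command>convmv</command> only reports what it would rename.</para>

```shell
# fix_filenames DIR: preview the renaming of files below DIR.
# The encodings are assumptions; adjust them to your situation.
fix_filenames() {
  dir=$1
  # Dry run (convmv's default): only print what would be renamed.
  convmv -f ISO-8859-1 -t UTF-8 -r "$dir"
  # Once the preview looks right, add --notest to really rename:
  # convmv -f ISO-8859-1 -t UTF-8 -r --notest "$dir"
}
```

    <para>Calling <command>fix_filenames ~/Downloads</command>, for example,
    would preview the changes for that directory tree without modifying
    it.</para>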

    <para>In other cases, a similar problem is caused by importing
    filenames from a system using a different locale with a tool that
    is not locale-aware (e.g., <!--<xref linkend="nfs-utils"/> or-->
    <xref linkend="openssh"/>). In order to avoid mangling non-ASCII
    characters when transferring files to a system with a different
    locale, any of the following methods can be used:</para>

    <itemizedlist>
      <listitem>
        <para>Transfer the files anyway, then fix the damage with
        <command>convmv</command>.</para>
      </listitem>
      <listitem>
        <para>On the sending side, create a tar archive with the
        <parameter>--format=posix</parameter> switch passed to
        <command>tar</command> (this will be the default in a future
        version of <command>tar</command>).</para>
      </listitem>
      <listitem>
        <para>Mail the files as attachments. Mail clients specify the
        encoding of attached filenames.</para>
      </listitem>
      <listitem>
        <para>Write the files to a removable disk formatted with a FAT or
        FAT32 filesystem.</para>
      </listitem>
      <listitem>
        <para>Transfer the files using Samba.</para>
      </listitem>
      <listitem>
        <para>Transfer the files via FTP using an RFC 2640-aware server
        (this currently means only wu-ftpd, which has a bad security
        history) and client (e.g., lftp).</para>
      </listitem>
    </itemizedlist>

    <para>The last four methods work because the filenames are automatically
    converted from the sender's locale to Unicode and stored or sent in this
    form. They are then transparently converted from Unicode to the
    recipient's locale encoding.</para>
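
    <para>The <command>tar</command> method from the list above might look
    like the following sketch. The directory and archive names are
    hypothetical, and a <command>tar</command> implementation that accepts
    <parameter>--format=posix</parameter> (such as GNU tar) is assumed.</para>

```shell
# Sketch only: directory and archive names are made up for illustration.
workdir=$(mktemp -d)
mkdir "$workdir/docs"
printf 'example\n' > "$workdir/docs/report.txt"

# Sending side: create the archive in the POSIX (pax) format.
tar --format=posix -cf "$workdir/docs.tar" -C "$workdir" docs

# Receiving side: list the contents (use -x instead of -t to extract).
tar -tf "$workdir/docs.tar"
```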

  </sect2>

  <sect2 id="locale-wrong-multibyte-characters"
         xreflabel="Breaks Multibyte Characters">

    <title>The Program Breaks Multibyte Characters or Doesn't Count
    Character Cells Correctly</title>

    <para>Severity: High or critical</para>

    <para>Many programs were written in an older era when multibyte
    locales were not common. Such programs assume that the C "char" data
    type, which is one byte, can be used to store single characters.
    Further, they assume that any sequence of characters is a valid
    string and that every character occupies a single character cell.
    Such assumptions completely break in UTF-8 locales. The visible
    manifestation is that the program truncates strings prematurely
    (e.g., at 80 bytes instead of 80 characters). Terminal-based
    programs don't place the cursor correctly on the screen, don't react
    to the "Backspace" key by erasing one character, and leave junk
    characters around when updating the screen, usually turning the
    screen into a complete mess.</para>
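
    <para>The effect of byte-oriented truncation can be reproduced with
    standard tools. The sketch below is only an illustration; the sample
    word and the cut-off of four bytes are arbitrary choices, and the octal
    escapes spell out the UTF-8 bytes so that the example does not depend on
    the encoding of the terminal.</para>

```shell
# "café" encoded in UTF-8: "é" is the two bytes 0xC3 0xA9 (octal 303 251).
word=$(printf 'caf\303\251')

# The word has four characters but occupies five bytes.
byte_count=$(printf '%s' "$word" | wc -c | tr -d ' ')
echo "bytes: $byte_count"

# Truncating at a byte count, as a byte-oriented program would, cuts the
# last character in half; iconv then rejects the result as invalid UTF-8.
if printf '%s\n' "$word" | cut -b 1-4 | iconv -f UTF-8 -t UTF-8 > /dev/null 2> /dev/null
then
  truncation=valid
else
  truncation=broken
fi
echo "after truncation to 4 bytes: $truncation"
```

    <para>A program that counted characters instead of bytes would keep all
    four characters of the word intact.</para>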

    <para>Fixing this kind of problem is a tedious task from a
    programmer's point of view, like all other cases of retrofitting new
    concepts into an old, flawed design. In this case, one has to redesign
    all data structures to accommodate the fact that a complete
    character may span a variable number of "char"s (or switch to wchar_t
    and convert as needed). Also, for every call to "strlen" and
    similar functions, one has to find out whether a number of bytes, a
    number of characters, or the width of the string was really meant.
    Sometimes it is faster to write a program with the same functionality
    from scratch.</para>

    <para>Among BLFS packages, this problem applies to
    <xref linkend="xine-ui"/> and all the shells.</para>

  </sect2>

  <sect2 id="locale-wrong-manpage-encoding"
         xreflabel="Incorrect Manual Page Encoding">

    <title>The Package Installs Manual Pages in Incorrect or
    Non-Displayable Encoding</title>

    <para>Severity: Low</para>

    <para>LFS expects that manual pages are in the language-specific (usually
    8-bit) encoding, as specified on the <ulink
    url="&lfs-root;/chapter08/man-db.html">LFS Man DB page</ulink>. However,
    some packages install translated manual pages in UTF-8 encoding (e.g.,
    Shadow, already dealt with), or manual pages in languages not in the table.
    Not all BLFS packages have been audited for conformance with the
    requirements set in LFS (the large majority have been checked, and fixes
    placed in the book for packages known to install non-conforming manual
    pages). If you find a manual page installed by any BLFS package that is
    obviously in the wrong encoding, please remove or convert it as needed, and
    report this to the BLFS team as a bug.</para>

    <para>You can easily check your system for any non-conforming manual pages
    by copying the following short shell script to some accessible location,

<screen><literal>#!/bin/sh
# Begin checkman.sh
# Usage: find /usr/share/man -type f | xargs checkman.sh
for a in "$@"
do
  # echo "Checking $a..."
  # Pure-ASCII manual page (possibly except comments) is OK
  grep -v '.\\"' "$a" | iconv -f US-ASCII -t US-ASCII >/dev/null 2>&amp;1 \
    &amp;&amp; continue
  # Non-UTF-8 manual page is OK
  iconv -f UTF-8 -t UTF-8 "$a" >/dev/null 2>&amp;1 || continue
  # Found a UTF-8 manual page, bad.
  echo "UTF-8 manual page: $a" >&amp;2
done
# End checkman.sh
</literal></screen>

    and then issuing the following command (modify the command below if the
    <command>checkman.sh</command> script is not in a directory listed in
    your <envar>PATH</envar> environment variable):</para>

<screen><userinput>find /usr/share/man -type f | xargs checkman.sh</userinput></screen>

    <para>Note that if you have manual pages installed in any location other
    than <filename class='directory'>/usr/share/man</filename> (e.g.,
    <filename class='directory'>/usr/local/share/man</filename>), you must
    modify the above command to include this additional location.</para>
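
    <para>One possible way to cover several locations is sketched below.
    The helper name <command>check_man_trees</command> is hypothetical, and
    the sketch assumes that <command>checkman.sh</command> can be found
    through <envar>PATH</envar>. Since <command>find</command> accepts
    several starting directories, simply listing them all in the original
    command works as well; the helper only adds a check that each directory
    exists.</para>

```shell
# check_man_trees DIR...: run checkman.sh on every regular file below each
# directory given as an argument, skipping directories that do not exist.
check_man_trees() {
  for dir in "$@"
  do
    [ -d "$dir" ] || continue
    find "$dir" -type f | xargs checkman.sh
  done
}
```

    <para>It could then be called as
    <command>check_man_trees /usr/share/man /usr/local/share/man</command>.
    </para>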

  </sect2>

</sect1>