Timestamp:
10/28/2006 07:13:18 AM (18 years ago)
Author:
Dan Nicholson <dnicholson@…>
Branches:
10.0, 10.1, 11.0, 11.1, 11.2, 11.3, 12.0, 12.1, 6.2, 6.2.0, 6.2.0-rc1, 6.2.0-rc2, 6.3, 6.3-rc1, 6.3-rc2, 6.3-rc3, 7.10, 7.4, 7.5, 7.6, 7.6-blfs, 7.6-systemd, 7.7, 7.8, 7.9, 8.0, 8.1, 8.2, 8.3, 8.4, 9.0, 9.1, basic, bdubbs/svn, elogind, gnome, kde5-13430, kde5-14269, kde5-14686, kea, ken/TL2024, ken/inkscape-core-mods, ken/tuningfonts, krejzi/svn, lazarus, lxqt, nosym, perl-modules, plabs/newcss, plabs/python-mods, python3.11, qt5new, rahul/power-profiles-daemon, renodr/vulkan-addition, systemd-11177, systemd-13485, trunk, upgradedb, xry111/intltool, xry111/llvm18, xry111/soup3, xry111/test-20220226, xry111/xf86-video-removal
Children:
0952b7d8
Parents:
3aeb033
Message:

Implemented Alexander Patrakov's Locale Related Issues changes

git-svn-id: svn://svn.linuxfromscratch.org/BLFS/trunk/BOOK@6364 af4574ff-66df-0310-9fd7-8a98e5e911e0

File:
1 edited

  • introduction/important/locale-issues.xml

    r3aeb033 r86eaa277  
    1717
    1818  <para>This page contains information about locale related problems and
    19   issues. In this paragraph you'll find a generic overview of things that can
    20   come up when configuring your system for various locales. The previous
    21   sentence and the remainder of this paragraph must still be
    22   revised/completed.</para>
    23 
    24  <sect2>
    25 
    26     <title>Package Specific Locale Issues</title>
    27 
    28     <para>For package-specific issues, find the concerned package from the list
    29     below and follow the link to view the available information. If a package
    30     is not listed here, it does not mean there are no known locale-specific
    31     issues or problems with that package. It only means that this page has not
    32     been updated with the locale-specific information regarding that package.
    33     Please reference the BLFS Wiki page for a particular package for any
    34     additional locale-specific information. </para>
     19  issues. In the following paragraphs you'll find a generic overview of
     20  things that can come up when configuring your system for various locales.
     21  Many (but not all) existing locale-related problems can be classified
     22  and fall under one of the headings below. The severity ratings below use
     23  the following criteria:</para>
     24
     25  <itemizedlist>
     26    <listitem>
     27      <para>Critical: The program doesn't perform its main function.
     28      The fix would be very intrusive, it's better to search for a
     29      replacement.</para>
     30    </listitem>
     31    <listitem>
     32      <para>High: Part of the functionality that the program provides
     33      is not usable. If that functionality is required, it's better to
     34      search for a replacement.</para>
     35    </listitem>
     36    <listitem>
     37      <para>Low: The program works in all typical use cases, but lacks
     38      some functionality normally provided by its equivalents.</para>
     39    </listitem>
     40  </itemizedlist>
     41
     42  <para>If there is a known workaround for a specific package, it will
     43  appear on that package's page.</para>
     44
     45  <sect2 id="locale-not-valid-option"
     46         xreflabel="Needed Encoding Not a Valid Option">
     47
     48    <title>The Needed Encoding is Not a Valid Option in the Program</title>
     49
     50    <para>Severity: Critical</para>
     51
     52    <para>Some programs require the user to specify the character encoding
     53    for their input or output data and present only a limited choice of
     54    encodings. This is the case for the <option>-X</option> option in
     55    <xref linkend="a2ps"/> and <xref linkend="enscript"/>,
     56    the <option>-input-charset</option> option in unpatched
     57    <xref linkend="cdrtools"/>, and the character sets offered for display
     58    in the menu of <xref linkend="links"/>. If the required encoding is not
     59    in the list, the program usually becomes completely unusable. For
     60    non-interactive programs, it may be possible to work around this by
     61    converting the document to a supported input character set before
     62    submitting to the program.</para>
     63
     64    <para>A solution to this type of problem is to implement the necessary
     65    support for the missing encoding as a patch to the original program
     66    (as done for <xref linkend="cdrtools"/> in this book), or to find a
     67    replacement.</para>
     68
     69  </sect2>
     70
     71  <sect2 id="locale-assumed-encoding"
     72         xreflabel="Program Assumes Encoding">
     73
     74    <title>The Program Assumes the Locale-Based Encoding of External
     75    Documents</title>
     76
     77    <para>Severity: High for non-text documents, low for text
     78    documents</para>
     79
     80    <para>Some programs, <xref linkend="nano"/> or
     81    <xref linkend="joe"/> for example, assume that documents are always
     82    in the encoding implied by the current locale. While this assumption
     83    may be valid for user-created documents, it is not safe for
     84    external ones. When this assumption fails, non-ASCII characters are
     85    displayed incorrectly, and the document may become unreadable.</para>
     86
     87    <para>If the external document is entirely text based, it can be
     88    converted to the current locale encoding using the
     89    <command>iconv</command> program.</para>
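As a rough illustration of what such a conversion involves, the following Python sketch decodes text from one encoding and re-encodes it in another, which is essentially what `iconv -f KOI8-R -t UTF-8` does. The helper name `reencode`, the sample text, and the choice of KOI8-R as the source encoding are assumptions made for this example; for a real document the source encoding has to be known or guessed.

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and encode them in the target one."""
    return data.decode(source).encode(target)


# Hypothetical external document: Russian text stored as KOI8-R.
external = "Привет".encode("koi8-r")

# Convert it to the encoding of a UTF-8 locale.
converted = reencode(external, "koi8-r", "utf-8")

# Bytes are printed in their escaped form, so this is safe on any terminal.
print(converted)
```

If the source encoding is guessed wrongly, the conversion usually still succeeds for 8-bit encodings but produces the wrong characters, so the result should be checked by eye.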
     90
     91    <para>For documents that are not text-based, this is not possible.
     92    In fact, the assumption made in the program may be completely
     93    invalid for documents where the Microsoft Windows operating system
     94    has set de facto standards. An example of this problem is ID3v1 tags
     95    in MP3 files (see <ulink url="&blfs-wiki;/ID3v1Coding">this page</ulink>
     96    for more details). For these cases, the only solution is to find a
     97    replacement program that doesn't have the issue (e.g., one that
     98    will allow you to specify the assumed document encoding).</para>
     99
     100    <para>Among BLFS packages, this problem applies to
     101    <xref linkend="nano"/>, <xref linkend="joe"/>, and all media players
     102    except <xref linkend="audacious"/>.</para>
     103
     104    <para>Another problem in this category is when someone cannot read
     105    the documents you've sent them because their operating system is
     106    set up to handle character encodings differently. This can happen
     107    often when the other person is using Microsoft Windows, which only
     108    provides one character encoding for a given country. For example,
     109    this causes problems with UTF-8 encoded TeX documents created in
     110    Linux. On Windows, most applications will assume that these documents
     111    have been created using the default Windows 8-bit encoding. See the
     112    <ulink url="&blfs-wiki;/tetex">teTeX</ulink> Wiki page for more
     113    details.</para>
     114
     115    <para>In extreme cases, Windows encoding compatibility issues may be
     116    solved only by running Windows programs under
     117    <ulink url="http://www.winehq.com/">Wine</ulink>.</para>
     118
     119  </sect2>
     120
     121  <sect2 id="locale-wrong-filename-encoding"
     122         xreflabel="Wrong Filename Encoding">
     123
     124    <title>The Program Uses or Creates Filenames in the Wrong Encoding</title>
     125
     126    <para>Severity: Critical</para>
     127
     128    <para>The POSIX standard mandates that the filename encoding is
     129    the encoding implied by the current LC_CTYPE locale category. This
     130    information is well-hidden on the page which specifies the behavior
     131    of <application>Tar</application> and <application>Cpio</application>
     132    programs. Some programs get it wrong by default (or simply don't
     133    have enough information to get it right). The result is that they
     134    create filenames which are not subsequently shown correctly by
     135    <command>ls</command>, or they refuse to accept filenames that
     136    <command>ls</command> shows properly. For the <xref linkend="glib2"/>
     137    library, the problem can be corrected by setting the
     138    <envar>G_FILENAME_ENCODING</envar> environment variable to the special
     139    "@locale" value. <application>Glib2</application> based programs that
     140    don't respect that environment variable are buggy.</para>
     141
     142    <para>The <xref linkend="zip"/>, <xref linkend="unzip"/>, and
     143    <xref linkend="nautilus-cd-burner"/> packages have this problem because
     144    they hard-code the expected filename encoding.
     145    <application>UnZip</application> contains a hard-coded conversion
     146    table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and
     147    uses this table when extracting archives created under DOS or
     148    Microsoft Windows. However, this assumption only works for those
     149    in the US and not for anyone using a UTF-8 locale. Non-ASCII
     150    characters will be mangled in the extracted filenames.</para>
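The effect of the hard-coded conversion can be sketched in Python. This is only an approximation under stated assumptions: Python's `cp850` and `iso-8859-1` codecs stand in for the conversion table in UnZip, the archive is assumed to have been created on a Russian DOS/Windows system (CP866), and the target locale is assumed to be ru_RU.KOI8-R. The function names are hypothetical.

```python
def unzip_style_mangle(stored: bytes) -> bytes:
    """Apply a CP850 -> ISO-8859-1 conversion to the stored filename bytes."""
    return stored.decode("cp850").encode("iso-8859-1")


def repair(mangled: bytes, archive_enc: str, locale_enc: str) -> bytes:
    """Undo the wrong conversion, then convert from the real archive encoding."""
    stored = mangled.decode("iso-8859-1").encode("cp850")
    return stored.decode(archive_enc).encode(locale_enc)


# Filename bytes as stored in the archive (assumed to be CP866).
stored_name = "тест".encode("cp866")

# What ends up on disk after the wrong CP850 -> ISO-8859-1 conversion.
mangled = unzip_style_mangle(stored_name)

# Step 1 reverses that conversion, step 2 applies CP866 -> KOI8-R instead.
fixed = repair(mangled, "cp866", "koi8-r")

print(mangled)
print(fixed)
```

The repair works only when every byte survives the first conversion unchanged in meaning. Characters that have no ISO-8859-1 equivalent cannot be round-tripped this way with these codecs.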
     151
     152    <para>On the other hand,
     153    <application>Nautilus CD Burner</application> checks names of
     154    files added to its window for UTF-8 validity. This is wrong for
     155    users of non-UTF-8 locales. Also,
     156    <application>Nautilus CD Burner</application> unconditionally
     157    calls <command>mkisofs</command> with the
     158    <parameter>-input-charset UTF-8</parameter> parameter, which is
     159    only correct in UTF-8 locales.</para>
     160
     161    <para>The general rule for avoiding this class of problems is to
     162    avoid installing broken programs. If this is impossible, the
     163    <ulink url="http://j3e.de/linux/convmv/">convmv</ulink>
     164    command-line tool can be used to fix filenames created by these
     165    broken programs, or intentionally mangle the existing filenames
     166    to meet the broken expectations of such programs.</para>
     167
     168    <para>In other cases, a similar problem is caused by importing
     169    filenames from a system using a different locale with a tool that
     170    is not locale-aware (e.g., <xref linkend="nfs-utils"/> or
     171    <xref linkend="openssh"/>). In order to avoid mangling non-ASCII
     172    characters when transferring files to a system with a different
     173    locale, any of the following methods can be used:</para>
    35174
    36175    <itemizedlist>
    37 
    38       <title>List of Packages with Locale Related Issues</title>
    39 
    40       <listitem>
    41         <para><xref linkend="locale-mc"/></para>
    42       </listitem>
    43       <listitem>
    44         <para><xref linkend="locale-unzip"/></para>
    45       </listitem>
    46       <listitem>
    47         <para><xref linkend="locale-nano"/></para>
    48       </listitem>
    49 
     176      <listitem>
     177        <para>Transfer the files anyway, then fix the damage with
     178        <command>convmv</command>.</para>
     179      </listitem>
     180      <listitem>
     181        <para>On the sending side, create a tar archive with the
     182        <parameter>--format=posix</parameter> switch passed to
     183        <command>tar</command> (this will be the default in a future
     184        version of <command>tar</command>).</para>
     185      </listitem>
     186      <listitem>
     187        <para>Mail the files as attachments. Mail clients specify the
     188        encoding of attached filenames.</para>
     189      </listitem>
     190      <listitem>
     191        <para>Write the files to a removable disk formatted with a FAT or
     192        FAT32 filesystem.</para>
     193      </listitem>
     194      <listitem>
     195        <para>Transfer the files using Samba.</para>
     196      </listitem>
     197      <listitem>
     198        <para>Transfer the files via FTP using an RFC2640-aware server
     199        (this currently means only wu-ftpd, which has a bad security history)
     200        and client (e.g., lftp).</para>
     201      </listitem>
    50202    </itemizedlist>
    51203
    52     <sect3 id="locale-mc" xreflabel="MC-&mc-version;">
    53 
    54       <title><xref linkend="mc"/></title>
    55 
    56       <para>This package makes the assumption that <quote>characters</quote>
    57       and <quote>bytes</quote> are the same thing. This is not true in UTF-8
    58       based locales. Due to this assumption <application>MC</application> will
    59       incorrectly position characters on the screen. After the cursor is moved
    60       a bit the screen becomes totally unreadable, as illustrated on
    61       <ulink url="&files-anduin;/mc-bad.png">this
    62       screenshot</ulink> (taken in a ru_RU.UTF-8 locale). Additionally, input
    63       of non-ASCII characters in the editor is impossible, even after selecting
    64       <quote>Other 8-bit</quote> encoding from the menu.</para>
    65 
    66     </sect3>
    67 
    68     <sect3 id="locale-unzip" xreflabel="UnZip-&unzip-version;">
    69 
    70       <title><xref linkend="unzip"/></title>
    71 
    72       <note>
    73         <para>Use of <application>UnZip</application> in the
    74         <application>JDK</application>, <application>Mozilla</application>,
    75         <application>DocBook</application> or any other BLFS package
    76         installation is not a problem, as BLFS instructions never use
    77         <application>UnZip</application> to extract a file with non-ASCII
    78         characters in the file's name.</para>
    79       </note>
    80 
    81       <para>The <application>UnZip</application> package assumes that filenames
    82       stored in the ZIP archives created on non-Unix systems are encoded in
    83       CP850, and that they should be converted to ISO-8859-1 when writing files
    84       onto the filesystem. Such assumptions are not always valid. In fact,
    85       inside the ZIP archive, filenames are encoded in the DOS codepage that is
    86       in use in the relevant country, and the filenames on disk should be in
    87       the locale encoding. In MS Windows, the OemToChar() C function (from
    88       <filename>User32.DLL</filename>) does the correct conversion (which is
    89       indeed the conversion from CP850 to a superset of ISO-8859-1 if MS
    90       Windows is set up to use the US English language), but there is no
    91       equivalent in Linux.</para>
    92 
    93       <para>When using <command>unzip</command> to unpack a ZIP archive
    94       containing non-ASCII filenames, the filenames are damaged because
    95       <command>unzip</command> uses improper conversion when any of its
    96       encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R
    97       locale, conversion of filenames from CP866 to KOI8-R is required, but
    98       conversion from CP850 to ISO-8859-1 is done, which produces filenames
    99       consisting of undecipherable characters instead of words (the closest
    100       equivalent understandable example for English-only users is rot13). There
    101       are several ways around this limitation:</para>
    102 
    103       <para>1) For unpacking ZIP archives with filenames containing non-ASCII
    104       characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while
    105       running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows
    106       emulator.</para>
    107 
    108       <para>2) After running <command>unzip</command>, fix the damage made to
    109       the filenames using the <command>convmv</command> tool
    110       (<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example
    111       for the ru_RU.KOI8-R locale:</para>
    112 
    113       <blockquote>
    114         <para>Step 1. Undo the conversion done by
    115         <command>unzip</command>:</para>
    116 
    117 <screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \
    118     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
    119 
    120         <para>Step 2. Do the correct conversion instead:</para>
    121 
    122 <screen><userinput>convmv -f cp866 -t koi8-r -r --nosmart --notest \
    123     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
    124       </blockquote>
    125 
    126       <para>3) Apply this patch to unzip:
    127       <ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para>
    128 
    129       <para>It allows to specify the assumed filename encoding in the ZIP
    130       archive using the <option>-O charset_name</option> option and the
    131       on-disk filename encoding using the <option>-I charset_name</option>
    132       option. Defaults: the on-disk filename encoding is the locale encoding,
    133       the encoding inside the ZIP archive is guessed according to the builtin
    134       table based on the locale encoding. For US English users, this still
    135       means that unzip converts from CP850 to ISO-8859-1 by default.</para>
    136 
    137       <para>Caveat: this method works only with 8-bit locale encodings, not
    138       with UTF-8. Attempting to use a patched <command>unzip</command> in UTF-8
    139       locales may result in a segmentation fault and is probably a security
    140       risk.</para>
    141 
    142     </sect3>
    143 
    144     <sect3 id="locale-nano" xreflabel="Nano-&nano-version;">
    145 
    146       <title><xref linkend="nano"/></title>
    147 
    148       <para>The current stable version of <application>Nano</application>
    149       (&nano-version;) does not support UTF-8 character encodings.  A
    150       development version is available which addresses these issues.  This
    151       version can be downloaded at <ulink
    152       url="http://www.nano-editor.org/dist/v1.3/nano-1.3.11.tar.gz"/>.
    153       Instructions for installing this version are the same as those found on
    154       the <xref linkend="nano"/> page.</para>
    155 
    156     </sect3>
     204    <para>The last four methods work because the filenames are automatically
     205    converted from the sender's locale to UNICODE and stored or sent in this
     206    form. They are then transparently converted from UNICODE to the
     207    recipient's locale encoding.</para>
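The POSIX tar format mentioned in the list relies on a similar idea: a non-ASCII member name is recorded as UTF-8 in an extended (pax) header, independently of the sender's locale. The sketch below uses Python's standard `tarfile` module as a stand-in for `tar --format=posix`, which is an assumption made for illustration; the file name and helper names are hypothetical.

```python
import io
import tarfile


def make_posix_tar(name: str, payload: bytes) -> bytes:
    """Build an in-memory tar archive in POSIX.1-2001 (pax) format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(archive: bytes) -> list:
    """Return the member names recorded in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.getnames()


archive = make_posix_tar("файл.txt", b"data")
names = list_names(archive)

# ascii() escapes non-ASCII characters, so this prints safely on any terminal.
print([ascii(n) for n in names])
```

The receiving side reads the name back as Unicode and can then write it to disk in whatever encoding its own locale implies.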
     208
     209  </sect2>
     210
     211  <sect2 id="locale-wrong-multibyte-characters"
     212         xreflabel="Wrong Multibyte Characters">
     213
     214    <title>The Program Breaks Multibyte Characters or Doesn't Count
     215    Character Cells Correctly</title>
     216
     217    <para>Severity: High or critical</para>
     218
     219    <para>Many programs were written in an era when multibyte
     220    locales were not common. Such programs assume that the C "char" data
     221    type, which is one byte, can be used to store single characters.
     222    Further, they assume that any sequence of characters is a valid
     223    string and that every character occupies a single character cell.
     224    Such assumptions completely break in UTF-8 locales. The visible
     225    manifestation is that the program truncates strings prematurely
     226    (e.g., at 80 bytes instead of 80 characters). Terminal-based
     227    programs don't place the cursor correctly on the screen, don't react
     228    to the "Backspace" key by erasing one character, and leave junk
     229    characters around when updating the screen, usually turning the
     230    screen into a complete mess.</para>
     231
     232    <para>Fixing this kind of problem is a tedious task from a
     233    programmer's point of view, like all other cases of retrofitting new
     234    concepts into an old, flawed design. In this case, one has to redesign
     235    all data structures to accommodate the fact that a complete
     236    character may span a variable number of "char"s (or switch to wchar_t
     237    and convert as needed). Also, for every call to "strlen" and
     238    similar functions, one must determine whether a number of bytes, a number of
     239    characters, or the width of the string was really meant. Sometimes it
     240    is faster to write a program with the same functionality from scratch.
     241    </para>
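The three quantities that a byte-oriented program conflates (bytes, characters, and character cells) can be told apart with a short Python sketch. The helper names are hypothetical, and the cell-width calculation is a simplified heuristic (wide and fullwidth East Asian characters count as two cells, combining marks as zero), not a full replacement for the C library's `wcwidth`.

```python
import unicodedata


def byte_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_count(text: str) -> int:
    """Number of Unicode code points in the text."""
    return len(text)


def cell_width(text: str) -> int:
    """Approximate number of terminal cells the text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the cell of the base character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


sample = "日本語"
print(byte_length(sample), char_count(sample), cell_width(sample))

# Cutting at a byte offset can split a character in the middle.
truncated = sample.encode("utf-8")[:4]
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("byte-based truncation produced an invalid string")
```

For plain ASCII text all three numbers coincide, which is why the underlying assumption goes unnoticed until a UTF-8 locale is used.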
     242
     243    <para>Among BLFS packages, this problem applies to <xref linkend="mc"/>,
     244    <xref linkend="nano"/>, <xref linkend="ed"/>, <xref linkend="xine-ui"/>
     245    and all shells.</para>
    157246
    158247  </sect2>