Ticket #1993: new-locale-issues.diff

File new-locale-issues.diff, 16.5 KB (added by dnicholson@…, 18 years ago)

Revised locale related issues page using generic classes. Based on NewLocaleRelatedIssues

  • introduction/important/locale-issues.xml

     
    1616  <title>Locale Related Issues</title>
    1717
    1818  <para>This page contains information about locale related problems and
    19   issues. In this paragraph you'll find a generic overview of things that can
    20   come up when configuring your system for various locales. The previous
    21   sentence and the remainder of this paragraph must still be
    22   revised/completed.</para>
     19  issues. In this paragraph you'll find a generic overview of things that
     20  can come up when configuring your system for various locales. Many (but
     21  not all) existing locale-related problems can be classified and fall
     22  under one of the headings below.</para>
    2323
    24  <sect2>
     24  <sect2 id="locale-not-valid-option">
    2525
    26     <title>Package Specific Locale Issues</title>
     26    <title>The Needed Encoding is Not a Valid Option in the Program</title>
    2727
    28     <para>For package-specific issues, find the concerned package from the list
    29     below and follow the link to view the available information. If a package
    30     is not listed here, it does not mean there are no known locale-specific
    31     issues or problems with that package. It only means that this page has not
    32     been updated with the locale-specific information regarding that package.
    33     Please reference the BLFS Wiki page for a particular package for any
    34     additional locale-specific information. </para>
     28    <para>Some programs require the user to specify the character encoding
     29    for their input or output data, and present only a limited choice of
     30    encodings. This is the case for the <option>-X</option> option in
     31    <xref linkend="a2ps"/> and <xref linkend="enscript"/>,
     32    the <option>-input-charset</option> option in unpatched
     33    <xref linkend="cdrtools"/>, and the character sets offered for display
     34    in the menu of <xref linkend="links"/>. If the required encoding is not
     35    in the list, the program usually becomes completely unusable. For
     36    non-interactive programs, it may be possible to work around this by
     37    converting the document to a supported input character set before
     38    submitting to the program.</para>
    3539
    36     <itemizedlist>
     40    <para>A solution to this type of problem is to implement the necessary
     41    support for the missing encoding as a patch to the original program
     42    (as done for <xref linkend="cdrtools"/> in this book), or to find a
     43    replacement.</para>
    3744
    38       <title>List of Packages with Locale Related Issues</title>
     45  </sect2>
    3946
    40       <listitem>
    41         <para><xref linkend="locale-mc"/></para>
    42       </listitem>
    43       <listitem>
    44         <para><xref linkend="locale-unzip"/></para>
    45       </listitem>
    46       <listitem>
    47         <para><xref linkend="locale-nano"/></para>
    48       </listitem>
     47  <sect2 id="locale-assumed-encoding">
    4948
    50     </itemizedlist>
     49    <title>The Program Assumes the Locale-Based Encoding of External
     50    Documents</title>
    5151
    52     <sect3 id="locale-mc" xreflabel="MC-&mc-version;">
     52    <para>Some programs, <xref linkend="nano"/> or
     53    <xref linkend="joe"/> for example, assume that documents are always
     54    in the encoding implied by the current locale. While this assumption
     55    may be valid for the user-created documents, it is not safe for
     56    external ones. When this assumption fails, non-ASCII characters are
     57    displayed incorrectly, and the document may become unreadable.</para>
    5358
    54       <title><xref linkend="mc"/></title>
     59    <para>If the external document is entirely text based, it can be
     60    converted to the current locale encoding using the
     61    <command>iconv</command> program.</para>
    5562
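A minimal sketch of the conversion that iconv performs on a text-based document, written in Python for illustration. The sample text, the KOI8-R source encoding, and the helper name reencode are assumptions made for this example; they are not part of the patch.

```python
def reencode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them in target_encoding."""
    text = data.decode(source_encoding)   # raises UnicodeDecodeError on invalid input
    return text.encode(target_encoding)   # raises UnicodeEncodeError if unrepresentable


# Hypothetical external document: Russian text saved in KOI8-R.
external = "Привет, мир\n".encode("koi8_r")

# Convert it for a UTF-8 locale, roughly what
# `iconv -f KOI8-R -t UTF-8` does for a whole file.
converted = reencode(external, "koi8_r", "utf-8")
```

The conversion only works when the source encoding is known. Neither iconv nor this sketch can guess it, which is why the locale-based assumption described above fails for external documents.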
    56       <para>This package makes the assumption that <quote>characters</quote>
    57       and <quote>bytes</quote> are the same thing. This is not true in UTF-8
    58       based locales. Due to this assumption <application>MC</application> will
    59       incorrectly position characters on the screen. After the cursor is moved
    60       a bit the screen becomes totally unreadable, as illustrated on
    61       <ulink url="&files-anduin;/mc-bad.png">this
    62       screenshot</ulink> (taken in a ru_RU.UTF-8 locale). Additionally, input
    63       of non-ASCII characters in the editor is impossible, even after selecting
    64       <quote>Other 8-bit</quote> encoding from the menu.</para>
     63    <para>For documents that are not text-based, this is not possible.
     64    In fact, the assumption made in the program may be completely
     65    invalid for documents where the Microsoft Windows operating system
     66    has set de-facto standards. An example of this problem is ID3v1 tags
     67    in MP3 files (see <ulink url="&blfs-wiki;/ID3v1Coding">this page</ulink>
     68    for more details). For these cases, the only solution is to find a
     69    replacement program that doesn't have the issue (e.g., one that
     70    will allow you to specify the assumed document encoding).</para>
    6571
    66     </sect3>
     72    <para>Another problem in this category is when someone cannot read
     73    the documents you've sent them because their operating system is
     74    set up to handle character encodings differently. This can happen
     75    often when the other person is using Microsoft Windows, which only
     76    provides one character encoding for a given country. For example,
     77    this causes problems with UTF-8 encoded TeX documents created in
     78    Linux. On Windows, most applications will assume that these documents
     79    have been created using the default Windows 8-bit encoding. See the
     80    <ulink url="&blfs-wiki;/tetex">teTeX</ulink> Wiki page for more
     81    details.</para>
    6782
    68     <sect3 id="locale-unzip" xreflabel="UnZip-&unzip-version;">
     83    <para>In extreme cases, Windows encoding compatibility issues may be
     84    solved only by running Windows programs under
     85    <ulink url="http://www.winehq.com/">Wine</ulink>.</para>
    6986
    70       <title><xref linkend="unzip"/></title>
     87  </sect2>
    7188
    72       <note>
    73         <para>Use of <application>UnZip</application> in the
    74         <application>JDK</application>, <application>Mozilla</application>,
    75         <application>DocBook</application> or any other BLFS package
    76         installation is not a problem, as BLFS instructions never use
    77         <application>UnZip</application> to extract a file with non-ASCII
    78         characters in the file's name.</para>
    79       </note>
     89  <sect2 id="locale-wrong-filename-encoding">
    8090
    81       <para>The <application>UnZip</application> package assumes that filenames
    82       stored in the ZIP archives created on non-Unix systems are encoded in
    83       CP850, and that they should be converted to ISO-8859-1 when writing files
    84       onto the filesystem. Such assumptions are not always valid. In fact,
    85       inside the ZIP archive, filenames are encoded in the DOS codepage that is
    86       in use in the relevant country, and the filenames on disk should be in
    87       the locale encoding. In MS Windows, the OemToChar() C function (from
    88       <filename>User32.DLL</filename>) does the correct conversion (which is
    89       indeed the conversion from CP850 to a superset of ISO-8859-1 if MS
    90       Windows is set up to use the US English language), but there is no
    91       equivalent in Linux.</para>
     91    <title>The Program Uses or Creates Filenames in
     92    the Wrong Encoding</title>
    9293
    93       <para>When using <command>unzip</command> to unpack a ZIP archive
    94       containing non-ASCII filenames, the filenames are damaged because
    95       <command>unzip</command> uses improper conversion when any of its
    96       encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R
    97       locale, conversion of filenames from CP866 to KOI8-R is required, but
    98       conversion from CP850 to ISO-8859-1 is done, which produces filenames
    99       consisting of undecipherable characters instead of words (the closest
    100       equivalent understandable example for English-only users is rot13). There
    101       are several ways around this limitation:</para>
     94    <para>The POSIX standard mandates that the filename encoding is
     95    the encoding implied by the current LC_CTYPE locale category. This
     96    information is well-hidden on the page which specifies the behaviour
     97    of <application>Tar</application> and <application>Cpio</application>
     98    programs. Some programs get it wrong by default (or simply don't
     99    have enough information to get it right). The result is that they
     100    create filenames which are not subsequently shown correctly by
     101    <command>ls</command>, or they refuse to accept filenames that
     102    <command>ls</command> shows properly. For the <xref linkend="glib2"/>
     103    library, the problem can be corrected by setting the
     104    <envar>G_FILENAME_ENCODING</envar> environment variable to the special
     105    "@locale" value. <application>Glib2</application> based programs that
     106    don't respect that environment variable are buggy.</para>
    102107
    103       <para>1) For unpacking ZIP archives with filenames containing non-ASCII
    104       characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while
    105       running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows
    106       emulator.</para>
     108    <para>The <xref linkend="zip"/>, <xref linkend="unzip"/>, and
     109    <xref linkend="nautilus-cd-burner"/> programs have this problem
     110    because they hard-code the expected filename encoding.
     111    <application>UnZip</application> contains a hard-coded conversion
     112    table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and
     113    uses this table when extracting archives created under DOS or
     114    Microsoft Windows. However, this assumption only works for those
     115    in the US and not for anyone using a UTF-8 locale. Non-ASCII
     116    characters will be mangled in the extracted filenames.</para>
    107117
    108       <para>2) After running <command>unzip</command>, fix the damage made to
    109       the filenames using the <command>convmv</command> tool
    110       (<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example
    111       for the ru_RU.KOI8-R locale:</para>
     118    <para>On the other hand,
     119    <application>Nautilus CD Burner</application> checks names of
     120    files added to its window for UTF-8 validity. This is wrong for
     121    users of non-UTF-8 locales. Also,
     122    <application>Nautilus CD Burner</application> unconditionally
     123    calls <command>mkisofs</command> with the
     124    <parameter>-input-charset UTF-8</parameter> parameter, which is
     125    only correct in UTF-8 locales.</para>
    112126
    113       <blockquote>
    114         <para>Step 1. Undo the conversion done by
    115         <command>unzip</command>:</para>
     127    <para>The general rule for avoiding this class of problems is to
     128    avoid installing broken programs. If this is impossible, the
     129    <ulink url="http://j3e.de/linux/convmv/">convmv</ulink>
     130    command-line tool can be used to fix filenames created by these
     131    broken programs, or intentionally mangle the existing filenames
     132    to meet the broken expectations of such programs.</para>
    116133
    117 <screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \
    118     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
     134    <para>In other cases, a similar problem is caused by importing
     135    filenames from a system using a different locale with a tool that
     136    is not locale-aware (e.g., <xref linkend="nfs-utils"/> or
     137    <xref linkend="openssh"/>). In order to avoid mangling non-ASCII
     138    characters when transferring files to a system with a different
     139    locale, any of the following methods can be used:</para>
    119140
    120         <para>Step 2. Do the correct conversion instead:</para>
     141    <itemizedlist>
    121142
    122 <screen><userinput>convmv -f cp866 -t koi8-r -r --nosmart --notest \
    123     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
    124       </blockquote>
     143      <listitem>
     144        <para>Transfer anyway, then fix the damage with
     145        <command>convmv</command>.</para>
     146      </listitem>
    125147
    126       <para>3) Apply this patch to unzip:
    127       <ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para>
     148      <listitem>
     149        <para>On the sending side, create a tar archive with the
     150        <parameter>--format=posix</parameter> switch passed to
     151        <command>tar</command> (this will be the default in a future
     152        version of <command>tar</command>). This causes the filenames
     153        to be converted from the creator's locale encoding to UTF-8
     154        when creating the archive, stored in the UTF-8 encoding in the
     155    archive, and converted from it to the recipient's locale
     156        encoding when unpacking.</para>
     157      </listitem>
    128158
    129       <para>It allows to specify the assumed filename encoding in the ZIP
    130       archive using the <option>-O charset_name</option> option and the
    131       on-disk filename encoding using the <option>-I charset_name</option>
    132       option. Defaults: the on-disk filename encoding is the locale encoding,
    133       the encoding inside the ZIP archive is guessed according to the builtin
    134       table based on the locale encoding. For US English users, this still
    135       means that unzip converts from CP850 to ISO-8859-1 by default.</para>
     159      <listitem>
     160        <para>Mail the files as attachments. Mail clients specify the
     161        encoding of attached filenames.</para>
     162      </listitem>
    136163
    137       <para>Caveat: this method works only with 8-bit locale encodings, not
    138       with UTF-8. Attempting to use a patched <command>unzip</command> in UTF-8
    139       locales may result in a segmentation fault and is probably a security
    140       risk.</para>
     164      <listitem>
     165        <para>Write the files to a removable disk formatted with a FAT or
     166        FAT32 filesystem that stores file names in UNICODE. The kernel
     167        automatically converts them to and from UNICODE on demand.</para>
     168      </listitem>
    141169
    142     </sect3>
     170    </itemizedlist>
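The tar --format=posix item in the list above relies on the pax interchange format, which records non-ASCII filenames as UTF-8 in an extended header. The sketch below uses Python's tarfile module as a stand-in for the tar command. The file name and the helper names are hypothetical. Python strings are already Unicode, so the locale-to-UTF-8 conversion that tar performs is not reproduced here.

```python
import io
import tarfile


def make_pax_archive(name: str, payload: bytes) -> bytes:
    """Build an in-memory pax (POSIX.1-2001) archive holding one file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_names(raw: bytes) -> list:
    """Return the member names stored in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as archive:
        return archive.getnames()


# Hypothetical non-ASCII file name.
raw_archive = make_pax_archive("привет.txt", b"hello\n")
names = list_names(raw_archive)
```

Because the name is stored as UTF-8 regardless of the creator's locale, the receiving side can convert it to whatever its own locale requires.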
    143171
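The convmv repair for filenames mangled by unzip (undo the CP850 to ISO-8859-1 conversion, then convert CP866 to KOI8-R) can be sketched in Python as follows. Python's built-in codecs stand in for unzip's hard-coded table, which is an assumption: the real table may treat CP850 characters that have no ISO-8859-1 equivalent differently. The sample name and the function name are hypothetical.

```python
def repair_unzip_name(mangled: bytes) -> bytes:
    """Repair a filename that unzip converted with the wrong tables."""
    # Step 1: undo unzip's CP850 -> ISO-8859-1 conversion to recover
    # the bytes as they were stored in the ZIP archive.
    archive_bytes = mangled.decode("iso-8859-1").encode("cp850")
    # Step 2: apply the conversion that should have been made,
    # here CP866 (Russian DOS codepage) -> KOI8-R (locale encoding).
    return archive_bytes.decode("cp866").encode("koi8_r")


# Hypothetical name as stored in an archive created on a Russian DOS system.
stored = "привет".encode("cp866")

# Simulate the damage: the bytes are treated as CP850 and written as ISO-8859-1.
mangled = stored.decode("cp850").encode("iso-8859-1")

repaired = repair_unzip_name(mangled)
```

The two steps correspond to the two convmv invocations with -f iso-8859-1 -t cp850 and -f cp866 -t koi8-r.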
    144     <sect3 id="locale-nano" xreflabel="Nano-&nano-version;">
    145 
    146       <title><xref linkend="nano"/></title>
    147 
    148       <para>The current stable version of <application>Nano</application>
    149       (&nano-version;) does not support UTF-8 character encodings.  A
    150       development version is available which addresses these issues.  This
    151       version can be downloaded at <ulink
    152       url="http://www.nano-editor.org/dist/v1.3/nano-1.3.11.tar.gz"/>.
    153       Instructions for installing this version are the same as those found on
    154       the <xref linkend="nano"/> page.</para>
    155 
    156     </sect3>
    157 
    158172  </sect2>
    159173
    160174</sect1>
  • postlfs/editors/nano.xml

     
    3333    simple text editor which aims to replace <application>Pico</application>,
    3434    the default editor in the <application>Pine</application> package.</para>
    3535
     36    <!-- Commented for now
    3637    <caution>
    3738      <para>The <application>Nano</application> package has some issues when
    3839      used in a UTF-8 based locale.  A development version is available
     
    4041      <xref linkend="locale-nano"/> section of the <xref
    4142      linkend="locale-issues"/>.</para>
    4243    </caution>
     44    -->
    4345
    4446    <bridgehead renderas="sect3">Package Information</bridgehead>
    4547    <itemizedlist spacing="compact">
  • general/sysutils/mc.xml

     
    3535    making many frequent file operations more efficient and preserving the
    3636    full power of the command prompt.</para>
    3737
     38    <!-- Commented for now
    3839    <caution>
    3940      <para>The <application>MC</application> package has some issues when
    4041      used in a UTF-8 based locale. For a full explanation of the issues, see
    4142      the <xref linkend="locale-mc"/> section of the
    4243      <xref linkend="locale-issues"/>.</para>
    4344    </caution>
     45    -->
    4446
    4547    <bridgehead renderas="sect3">Package Information</bridgehead>
    4648    <itemizedlist spacing="compact">
  • general/sysutils/unzip.xml

     
    3636    <application>PKZIP</application> or <application>Info-ZIP</application>
    3737    utilities, primarily in a DOS environment.</para>
    3838
     39    <!-- Commented for now
    3940    <caution>
    4041      <para>The <application>UnZip</application> package has some locale
    4142      related issues. For a full explanation of the issues and some possible
    4243      solutions, see the <xref linkend="locale-unzip"/> section of the
    4344      <xref linkend="locale-issues"/>.</para>
    4445    </caution>
     46    -->
    4547
    4648    <bridgehead renderas="sect3">Package Information</bridgehead>
    4749    <itemizedlist spacing="compact">