Ticket #1993: new-locale-issues.diff

File new-locale-issues.diff, 16.5 KB (added by dnicholson@…, 17 years ago)

Revised locale related issues page using generic classes. Based on NewLocaleRelatedIssues

  • introduction/important/locale-issues.xml

    16 16  <title>Locale Related Issues</title>
    18 18  <para>This page contains information about locale related problems and
    19   issues. In this paragraph you'll find a generic overview of things that can
    20   come up when configuring your system for various locales. The previous
    21   sentence and the remainder of this paragraph must still be
    22   revised/completed.</para>
     19  issues. In this paragraph you'll find a generic overview of things that
     20  can come up when configuring your system for various locales. Many (but
     21  not all) existing locale-related problems can be classified and fall
     22  under one of the headings below.</para>
    24  <sect2>
     24  <sect2 id="locale-not-valid-option">
    26     <title>Package Specific Locale Issues</title>
     26    <title>The Needed Encoding is Not a Valid Option in the Program</title>
    28     <para>For package-specific issues, find the concerned package from the list
    29     below and follow the link to view the available information. If a package
    30     is not listed here, it does not mean there are no known locale-specific
    31     issues or problems with that package. It only means that this page has not
    32     been updated with the locale-specific information regarding that package.
    33     Please reference the BLFS Wiki page for a particular package for any
    34     additional locale-specific information. </para>
     28    <para>Some programs require the user to specify the character encoding
     29    for their input or output data, and present only a limited choice of
     30    encodings. This is the case for the <option>-X</option> option in
     31    <xref linkend="a2ps"/> and <xref linkend="enscript"/>,
     32    the <option>-input-charset</option> option in unpatched
     33    <xref linkend="cdrtools"/>, and the character sets offered for display
     34    in the menu of <xref linkend="links"/>. If the required encoding is not
     35    in the list, the program usually becomes completely unusable. For
     36    non-interactive programs, it may be possible to work around this by
     37    converting the document to a supported input character set before
     38    submitting to the program.</para>
    36     <itemizedlist>
     40    <para>A solution to this type of problem is to implement the necessary
     41    support for the missing encoding as a patch to the original program
     42    (as done for <xref linkend="cdrtools"/> in this book), or to find a
     43    replacement.</para>
    38       <title>List of Packages with Locale Related Issues</title>
     45  </sect2>
    40       <listitem>
    41         <para><xref linkend="locale-mc"/></para>
    42       </listitem>
    43       <listitem>
    44         <para><xref linkend="locale-unzip"/></para>
    45       </listitem>
    46       <listitem>
    47         <para><xref linkend="locale-nano"/></para>
    48       </listitem>
     47  <sect2 id="locale-assumed-encoding">
    50     </itemizedlist>
     49    <title>The Program Assumes the Locale-Based Encoding of External
     50    Documents</title>
    52     <sect3 id="locale-mc" xreflabel="MC-&mc-version;">
     52    <para>Some programs, <xref linkend="nano"/> or
     53    <xref linkend="joe"/> for example, assume that documents are always
     54    in the encoding implied by the current locale. While this assumption
     55    may be valid for user-created documents, it is not safe for
     56    external ones. When this assumption fails, non-ASCII characters are
     57    displayed incorrectly, and the document may become unreadable.</para>
    54       <title><xref linkend="mc"/></title>
     59    <para>If the external document is entirely text based, it can be
     60    converted to the current locale encoding using the
     61    <command>iconv</command> program.</para>
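As a rough illustration of what that conversion amounts to, the Python sketch below decodes bytes from a known source encoding and re-encodes them in the target one. The helper name `convert_encoding` and the ISO-8859-1 sample are assumptions made for this example and are not part of the patch. On the command line the same step would look something like `iconv -f ISO-8859-1 -t UTF-8`.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode text bytes from `source` to `target` (hypothetical helper)."""
    # Raises UnicodeDecodeError if the bytes are not valid in `source`.
    text = data.decode(source)
    # Raises UnicodeEncodeError if `target` cannot represent a character.
    return text.encode(target)


# "café" as stored in an ISO-8859-1 document, converted for a UTF-8 locale.
latin1_bytes = b"caf\xe9"
utf8_bytes = convert_encoding(latin1_bytes, "iso-8859-1", "utf-8")
```

The conversion only works when the source encoding is actually known. Guessing it wrong produces valid but meaningless text rather than an error.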
    56       <para>This package makes the assumption that <quote>characters</quote>
    57       and <quote>bytes</quote> are the same thing. This is not true in UTF-8
    58       based locales. Due to this assumption <application>MC</application> will
    59       incorrectly position characters on the screen. After the cursor is moved
    60       a bit the screen becomes totally unreadable, as illustrated on
    61       <ulink url="&files-anduin;/mc-bad.png">this
    62       screenshot</ulink> (taken in a ru_RU.UTF-8 locale). Additionally, input
    63       of non-ASCII characters in the editor is impossible, even after selecting
    64       <quote>Other 8-bit</quote> encoding from the menu.</para>
     63    <para>For documents that are not text-based, this is not possible.
     64    In fact, the assumption made in the program may be completely
     65    invalid for documents where the Microsoft Windows operating system
     66    has set de-facto standards. An example of this problem is ID3v1 tags
     67    in MP3 files (see <ulink url="&blfs-wiki;/ID3v1Coding">this page</ulink>
     68    for more details). For these cases, the only solution is to find a
     69    replacement program that doesn't have the issue (e.g., one that
     70    will allow you to specify the assumed document encoding).</para>
    66     </sect3>
     72    <para>Another problem in this category is when someone cannot read
     73    the documents you've sent them because their operating system is
     74    set up to handle character encodings differently. This can happen
     75    often when the other person is using Microsoft Windows, which only
     76    provides one character encoding for a given country. For example,
     77    this causes problems with UTF-8 encoded TeX documents created in
     78    Linux. On Windows, most applications will assume that these documents
     79    have been created using the default Windows 8-bit encoding. See the
     80    <ulink url="&blfs-wiki;/tetex">teTeX</ulink> Wiki page for more
     81    details.</para>
    68     <sect3 id="locale-unzip" xreflabel="UnZip-&unzip-version;">
     83    <para>In extreme cases, Windows encoding compatibility issues may be
     84    solved only by running Windows programs under
     85    <ulink url="http://www.winehq.com/">Wine</ulink>.</para>
    70       <title><xref linkend="unzip"/></title>
     87  </sect2>
    72       <note>
    73         <para>Use of <application>UnZip</application> in the
    74         <application>JDK</application>, <application>Mozilla</application>,
    75         <application>DocBook</application> or any other BLFS package
    76         installation is not a problem, as BLFS instructions never use
    77         <application>UnZip</application> to extract a file with non-ASCII
    78         characters in the file's name.</para>
    79       </note>
     89  <sect2 id="locale-wrong-filename-encoding">
    81       <para>The <application>UnZip</application> package assumes that filenames
    82       stored in the ZIP archives created on non-Unix systems are encoded in
    83       CP850, and that they should be converted to ISO-8859-1 when writing files
    84       onto the filesystem. Such assumptions are not always valid. In fact,
    85       inside the ZIP archive, filenames are encoded in the DOS codepage that is
    86       in use in the relevant country, and the filenames on disk should be in
    87       the locale encoding. In MS Windows, the OemToChar() C function (from
    88       <filename>User32.DLL</filename>) does the correct conversion (which is
    89       indeed the conversion from CP850 to a superset of ISO-8859-1 if MS
    90       Windows is set up to use the US English language), but there is no
    91       equivalent in Linux.</para>
     91    <title>The Program Uses or Creates Filenames in
     92    the Wrong Encoding</title>
    93       <para>When using <command>unzip</command> to unpack a ZIP archive
    94       containing non-ASCII filenames, the filenames are damaged because
    95       <command>unzip</command> uses improper conversion when any of its
    96       encoding assumptions are incorrect. For example, in the ru_RU.KOI8-R
    97       locale, conversion of filenames from CP866 to KOI8-R is required, but
    98       conversion from CP850 to ISO-8859-1 is done, which produces filenames
    99       consisting of undecipherable characters instead of words (the closest
    100       equivalent understandable example for English-only users is rot13). There
    101       are several ways around this limitation:</para>
     94    <para>The POSIX standard mandates that the filename encoding is
     95    the encoding implied by the current LC_CTYPE locale category. This
     96    information is well-hidden on the page which specifies the behaviour
     97    of <application>Tar</application> and <application>Cpio</application>
     98    programs. Some programs get it wrong by default (or simply don't
     99    have enough information to get it right). The result is that they
     100    create filenames which are not subsequently shown correctly by
     101    <command>ls</command>, or they refuse to accept filenames that
     102    <command>ls</command> shows properly. For the <xref linkend="glib2"/>
     103    library, the problem can be corrected by setting the
     104    <envar>G_FILENAME_ENCODING</envar> environment variable to the special
     105    "@locale" value. <application>Glib2</application> based programs that
     106    don't respect that environment variable are buggy.</para>
    103       <para>1) For unpacking ZIP archives with filenames containing non-ASCII
    104       characters, use <ulink url="http://www.winzip.com/">WinZip</ulink> while
    105       running the <ulink url="http://www.winehq.com/">Wine</ulink> Windows
    106       emulator.</para>
     108    <para><xref linkend="zip"/>, <xref linkend="unzip"/>, and
     109    <xref linkend="nautilus-cd-burner"/> have this problem because
     110    they hard-code the expected filename encoding.
     111    <application>UnZip</application> contains a hard-coded conversion
     112    table between the CP850 (DOS) and ISO-8859-1 (UNIX) encodings and
     113    uses this table when extracting archives created under DOS or
     114    Microsoft Windows. However, this assumption only works for those
     115    in the US and not for anyone using a UTF-8 locale. Non-ASCII
     116    characters will be mangled in the extracted filenames.</para>
    108       <para>2) After running <command>unzip</command>, fix the damage made to
    109       the filenames using the <command>convmv</command> tool
    110       (<ulink url="http://j3e.de/linux/convmv/"/>). The following is an example
    111       for the ru_RU.KOI8-R locale:</para>
     118    <para>On the other hand,
     119    <application>Nautilus CD Burner</application> checks names of
     120    files added to its window for UTF-8 validity. This is wrong for
     121    users of non-UTF-8 locales. Also,
     122    <application>Nautilus CD Burner</application> unconditionally
     123    calls <command>mkisofs</command> with the
     124    <parameter>-input-charset UTF-8</parameter> parameter, which is
     125    only correct in UTF-8 locales.</para>
    113       <blockquote>
    114         <para>Step 1. Undo the conversion done by
    115         <command>unzip</command>:</para>
     127    <para>The general rule for avoiding this class of problems is to
     128    avoid installing broken programs. If this is impossible, the
     129    <ulink url="http://j3e.de/linux/convmv/">convmv</ulink>
     130    command-line tool can be used to fix filenames created by these
     131    broken programs, or intentionally mangle the existing filenames
     132    to meet the broken expectations of such programs.</para>
    117 <screen><userinput>convmv -f iso-8859-1 -t cp850 -r --nosmart --notest \
    118     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
     134    <para>In other cases, a similar problem is caused by importing
     135    filenames from a system using a different locale with a tool that
     136    is not locale-aware (e.g., <xref linkend="nfs-utils"/> or
     137    <xref linkend="openssh"/>). In order to avoid mangling non-ASCII
     138    characters when transferring files to a system with a different
     139    locale, any of the following methods can be used:</para>
    120         <para>Step 2. Do the correct conversion instead:</para>
     141    <itemizedlist>
    122 <screen><userinput>convmv -f cp866 -t koi8-r -r --nosmart --notest \
    123     <replaceable>&lt;/path/to/unzipped/files&gt;</replaceable></userinput></screen>
    124       </blockquote>
     143      <listitem>
     144        <para>Transfer anyway, then fix the damage with
     145        <command>convmv</command>.</para>
     146      </listitem>
    126       <para>3) Apply this patch to unzip:
    127       <ulink url="https://bugzilla.altlinux.ru/attachment.cgi?id=532"/></para>
     148      <listitem>
     149        <para>On the sending side, create a tar archive with the
     150        <parameter>--format=posix</parameter> switch passed to
     151        <command>tar</command> (this will be the default in a future
     152        version of <command>tar</command>). This causes the filenames
     153        to be converted from the creator's locale encoding to UTF-8
     154        when creating the archive, stored in the UTF-8 encoding in the
     155        archive, and converted from it to the recipient's locale
     156        encoding when unpacking.</para>
     157      </listitem>
    129       <para>It allows to specify the assumed filename encoding in the ZIP
    130       archive using the <option>-O charset_name</option> option and the
    131       on-disk filename encoding using the <option>-I charset_name</option>
    132       option. Defaults: the on-disk filename encoding is the locale encoding,
    133       the encoding inside the ZIP archive is guessed according to the builtin
    134       table based on the locale encoding. For US English users, this still
    135       means that unzip converts from CP850 to ISO-8859-1 by default.</para>
     159      <listitem>
     160        <para>Mail the files as attachments. Mail clients specify the
     161        encoding of attached filenames.</para>
     162      </listitem>
    137       <para>Caveat: this method works only with 8-bit locale encodings, not
    138       with UTF-8. Attempting to use a patched <command>unzip</command> in UTF-8
    139       locales may result in a segmentation fault and is probably a security
    140       risk.</para>
     164      <listitem>
     165        <para>Write the files to a removable disk formatted with a FAT or
     166        FAT32 filesystem that stores file names in Unicode. The kernel
     167        automatically converts them to and from Unicode on demand.</para>
     168      </listitem>
    142     </sect3>
     170    </itemizedlist>
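The POSIX (pax) archive method in the list above can be sketched with Python's standard `tarfile` module, used here as a stand-in for the `tar` command, which may differ in detail. The helper names and the sample filename are assumptions made for this example. The sketch writes one member with a non-ASCII name and checks that the name is stored in the archive as UTF-8.

```python
import io
import tarfile


def make_pax_archive(name: str, payload: bytes) -> bytes:
    """Build an in-memory pax-format archive with a single member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(raw: bytes) -> list:
    """Return the member names recorded in an in-memory archive."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as archive:
        return archive.getnames()


raw_archive = make_pax_archive("тест.txt", b"hello")
# The non-ASCII name travels in a pax extended header, encoded as UTF-8.
stored_as_utf8 = "тест.txt".encode("utf-8") in raw_archive
names = list_names(raw_archive)
```

Because the name is stored in one fixed encoding, the reader can convert it to whatever the local locale requires, independently of the locale used when the archive was created.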
    144     <sect3 id="locale-nano" xreflabel="Nano-&nano-version;">
    146       <title><xref linkend="nano"/></title>
    148       <para>The current stable version of <application>Nano</application>
    149       (&nano-version;) does not support UTF-8 character encodings.  A
    150       development version is available which addresses these issues.  This
    151       version can be downloaded at <ulink
    152       url="http://www.nano-editor.org/dist/v1.3/nano-1.3.11.tar.gz"/>.
    153       Instructions for installing this version are the same as those found on
    154       the <xref linkend="nano"/> page.</para>
    156     </sect3>
    158 172  </sect2>
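The UnZip filename mangling discussed in this diff, and the two-step convmv repair shown for the ru_RU.KOI8-R locale in the earlier revision of the page, can be mimicked with Python's codecs. This is only a sketch: the function names and the sample filename are hypothetical, and Python's strict codec tables stand in for UnZip's hard-coded table, which may map some characters differently.

```python
def unzip_style_mangle(name: str, archive_encoding: str = "cp866") -> bytes:
    """Simulate the wrong CP850 -> ISO-8859-1 conversion on extraction."""
    stored = name.encode(archive_encoding)  # bytes as stored in the ZIP archive
    # The bytes are misread as CP850 and written out as ISO-8859-1.
    return stored.decode("cp850").encode("iso-8859-1")


def repair(mangled: bytes, archive_encoding: str = "cp866",
           locale_encoding: str = "koi8-r") -> bytes:
    """Undo the wrong conversion, then apply the correct one."""
    # Step 1: ISO-8859-1 back to CP850 recovers the bytes from the archive.
    stored = mangled.decode("iso-8859-1").encode("cp850")
    # Step 2: convert from the real archive encoding to the locale encoding.
    return stored.decode(archive_encoding).encode(locale_encoding)


on_disk = unzip_style_mangle("тест")
fixed = repair(on_disk)
```

The repair works only because each step is reversible for the characters involved. Characters with no counterpart in one of the intermediate encodings would raise an error here instead of round-tripping.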
  • postlfs/editors/nano.xml

    33 33    simple text editor which aims to replace <application>Pico</application>,
    34 34    the default editor in the <application>Pine</application> package.</para>
     36    <!-- Commented for now
    36 37    <caution>
    37 38      <para>The <application>Nano</application> package has some issues when
    38 39      used in a UTF-8 based locale.  A development version is available
    40 41      <xref linkend="locale-nano"/> section of the <xref
    41 42      linkend="locale-issues"/>.</para>
    42 43    </caution>
     44    -->
    44 46    <bridgehead renderas="sect3">Package Information</bridgehead>
    45 47    <itemizedlist spacing="compact">
  • general/sysutils/mc.xml

    35 35    making many frequent file operations more efficient and preserving the
    36 36    full power of the command prompt.</para>
     38    <!-- Commented for now
    38 39    <caution>
    39 40      <para>The <application>MC</application> package has some issues when
    40 41      used in a UTF-8 based locale. For a full explanation of the issues, see
    41 42      the <xref linkend="locale-mc"/> section of the
    42 43      <xref linkend="locale-issues"/>.</para>
    43 44    </caution>
     45    -->
    45 47    <bridgehead renderas="sect3">Package Information</bridgehead>
    46 48    <itemizedlist spacing="compact">
  • general/sysutils/unzip.xml

    36 36    <application>PKZIP</application> or <application>Info-ZIP</application>
    37 37    utilities, primarily in a DOS environment.</para>
     39    <!-- Commented for now
    39 40    <caution>
    40 41      <para>The <application>UnZip</application> package has some locale
    41 42      related issues. For a full explanation of the issues and some possible
    42 43      solutions, see the <xref linkend="locale-unzip"/> section of the
    43 44      <xref linkend="locale-issues"/>.</para>
    44 45    </caution>
     46    -->
    46 48    <bridgehead renderas="sect3">Package Information</bridgehead>
    47 49    <itemizedlist spacing="compact">