7.1 Internationalization
Monotone initially handled only ASCII characters in file path
names, certificate names, key names, and packets. Some
conservative extensions are provided to permit internationalized
use. These extensions can be summarized as follows:
- Monotone uses GNU gettext to provide localized progress and error
messages. Translations may or may not exist for your locale, but the
infrastructure is present to add them.
- All command-line arguments are mapped from your local character set to
UTF-8 before processing. This means that monotone can only
handle key names, file names and certificate names which map cleanly
into UTF-8.
- Monotone's control files are stored in UTF-8. This includes: revisions
and manifests, both inside the database and when written to the
_MTN/ directory of the workspace; the _MTN/options and
_MTN/revision files. Converting these files to any other
character set will cause monotone to break; do not do so.
- File path names in the workspace are converted to the locale's
character set (determined via the LANG or CHARSET environment
variables) before monotone interacts with the file system. If you are
accustomed to being able to use file names in your locale's character
set, this should “just work” with monotone.
- Key and cert names, and similar “name-like” entities are subject to
some cleaning and normalization, and conversion into network-safe
subsets of ASCII (typically ACE). Generally, you should be able to use
“sensible” strings in your locale's character set as names, but they
may appear mangled or escaped in certain contexts such as network
transmission.
- Monotone's transmission and storage forms are otherwise
unchanged. Packets and database contents are 7-bit clean ASCII.
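As a rough illustration of the command-line mapping summarized above, the following Python sketch converts argument bytes from a local character set to UTF-8 and rejects input that does not map cleanly. The helper name `to_utf8` and the sample strings are assumptions made for illustration; this is not monotone's own code.

```python
def to_utf8(raw: bytes, local_charset: str) -> bytes:
    """Map bytes in the local character set to UTF-8 (illustrative helper)."""
    # decode() raises UnicodeDecodeError when the bytes are not valid in
    # the local character set, i.e. when they do not "map cleanly".
    text = raw.decode(local_charset, errors="strict")
    return text.encode("utf-8")


# 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
converted = to_utf8(b"caf\xe9", "iso-8859-1")

# The same byte is not valid ASCII, so an ASCII locale rejects it.
try:
    to_utf8(b"caf\xe9", "ascii")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```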
The remainder of this section is a precise specification of monotone's
internationalization behavior.
General Terms
- Character set conversion
- The process of mapping a string of bytes representing wide characters
from one encoding to another. Per-file character set conversions are
specified by a Lua hook get_charset_conv, which takes a filename
and returns a table of two strings: the first represents the
"internal" (database) charset, the second represents the "external"
(file system) charset.
- LDH
- Letters, digits, and hyphen: the set of ASCII bytes 0x2D,
0x30..0x39, 0x41..0x5A, and 0x61..0x7A.
- stringprep
- RFC 3454, a general framework for mapping, normalizing, prohibiting
and bidirectionality checking for international names prior to use in
public network protocols.
- nameprep
- RFC 3491, a specific profile of stringprep, used for preparing
international domain names (IDNs).
- punycode
- RFC 3492, a "bootstring" encoding of Unicode into ASCII.
- IDNA
- RFC 3490, international domain names for applications, a combination
of the above technologies (nameprep, punycoding, limiting to LDH
characters) to form a specific "ASCII compatible encoding" (ACE) of
Unicode, signified by the presence of an "unlikely" ACE prefix string
"xn--". IDNA is intended to make it possible to use Unicode relatively
"safely" over legacy ASCII-based applications. The general picture of
an IDNA string is this:
{ACE-prefix}{LDH-sanitized(punycode(nameprep(UTF-8-string)))}
It is important to understand that IDNA encoding does not
preserve the input string: it both prohibits a wide variety of
possible strings and normalizes non-equal strings to supposedly
"equivalent" forms.
By default, monotone does not decode IDNA when printing to the
console (IDNA names are ASCII, which is a subset of UTF-8, so this
normal form conversion can still apply, albeit oddly). This behavior
protects users against security problems associated with
malicious use of "similar-looking" characters. If the hook
display_decoded_idna returns true, IDNA names are decoded for
display.
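The terms above can be made concrete with a small Python sketch. It relies on Python's built-in "punycode" and "idna" codecs, which implement RFC 3492 and RFC 3490; whether monotone's own IDNA library agrees with them in every detail is an assumption. The helper names `is_ldh` and `to_ace` are hypothetical.

```python
import string

# LDH: ASCII letters, digits and hyphen (0x2D).
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh(label: str) -> bool:
    """Return True if the label is a nonempty string of LDH characters."""
    return bool(label) and all(ch in LDH for ch in label)


def to_ace(label: str) -> str:
    """IDNA-encode one label using Python's built-in RFC 3490 codec."""
    return label.encode("idna").decode("ascii")


raw_punycode = "bücher".encode("punycode")    # bootstring step only
ace = to_ace("bücher")                        # nameprep + punycode + ACE prefix
ace_upper = to_ace("Bücher")                  # nameprep folds case first
decoded = ace.encode("ascii").decode("idna")  # what a decoded display would show
```

Here `ace` and `ace_upper` are the same string, which illustrates the point that IDNA does not preserve its input: two non-equal names are normalized to one encoded form, and decoding returns only the normalized spelling.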
Filenames
- Filenames are subject to normal form conversion.
- Filenames are subject to an additional normal form stage which adjusts
for platform name semantics, for example changing the Windows
path separator 0x5C ('\') to 0x2F ('/'). This extra
processing is performed by boost::filesystem.
- FIXME: Monotone does not properly handle case insensitivity on Windows.
- A filename (in normal form) is constrained to be a nonempty sequence
of path components, separated by byte 0x2F (ASCII '/'), and
without a leading or trailing 0x2F.
- A path component is a nonempty sequence of any UTF-8 character codes
except the path separator byte 0x2F and any ASCII "control codes"
(0x00..0x1F and 0x7F).
- The path components "." and ".." are prohibited.
- Manifests and revisions are constructed from the normal form
(UTF-8). The LC_COLLATE locale category is not used to sort
manifest or revision entries.
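The constraints on normal-form filenames listed above can be expressed as a short validity check. This is a minimal Python sketch, not monotone's implementation; the function names `is_normal_form_path` and `from_windows_path` are hypothetical, and the input is assumed to be already decoded text.

```python
def is_normal_form_path(path: str) -> bool:
    """Check the normal-form filename constraints listed above."""
    if not path or path.startswith("/") or path.endswith("/"):
        return False
    for component in path.split("/"):
        # An empty component means two adjacent separators ("a//b").
        if component in ("", ".", ".."):
            return False
        # ASCII control codes 0x00..0x1F and 0x7F are prohibited.
        if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in component):
            return False
    return True


def from_windows_path(path: str) -> str:
    """Replace the Windows separator 0x5C with 0x2F (illustrative only)."""
    return path.replace("\\", "/")
```

For example, `is_normal_form_path("src/naïve.txt")` is true, while `"/etc/passwd"`, `"a//b"`, `"a/../b"` and `"a/b/"` are all rejected.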
File contents
- Files are subject to character set conversion and line ending
conversion.
- File SHA1 values are calculated from the internal form produced by
these conversions. If the external form of a file differs from the internal
form, running a third-party program such as sha1sum will produce
results that differ from the entries shown in the corresponding manifest.
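The effect on SHA1 values can be seen in a small Python sketch. It assumes, purely for illustration, an external form in ISO-8859-1 with CRLF line endings and an internal form in UTF-8 with LF line endings; the actual forms depend on the configured conversions. The helper `to_internal` is hypothetical.

```python
import hashlib


def to_internal(external: bytes, external_charset: str) -> bytes:
    """Convert external bytes to an assumed internal form: UTF-8 with LF."""
    text = external.decode(external_charset)
    return text.replace("\r\n", "\n").encode("utf-8")


# External (file system) form: ISO-8859-1 text with CRLF line endings.
external = "café\r\n".encode("iso-8859-1")
internal = to_internal(external, "iso-8859-1")

external_sha1 = hashlib.sha1(external).hexdigest()  # what sha1sum would see
internal_sha1 = hashlib.sha1(internal).hexdigest()  # hash of the internal form
```

The two digests differ because the byte sequences differ, even though both represent the same text.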
UI messages
UI messages are displayed via calls to gettext().
Host names
Host names are read on the command line and subject to normal form
conversion. Host names are then split at 0x2E (ASCII '.'), each
component is subject to IDNA encoding, and the components are
rejoined.
After processing, host names are stored internally as ASCII. The
invariant is that a host name inside monotone contains only sequences
of LDH separated by 0x2E.
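The split, encode and rejoin steps for host names can be sketched in Python as follows. The sketch uses Python's built-in "idna" codec as a stand-in for monotone's IDNA encoding, which is an assumption; `encode_host` and `HOST_INVARIANT` are hypothetical names.

```python
import re

# Invariant from the text: sequences of LDH separated by '.' (0x2E).
HOST_INVARIANT = re.compile(r"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*")


def encode_host(host: str) -> str:
    """Split at '.', IDNA-encode each component, and rejoin."""
    labels = host.split(".")
    encoded = [label.encode("idna").decode("ascii") for label in labels]
    result = ".".join(encoded)
    if not HOST_INVARIANT.fullmatch(result):
        raise ValueError(f"not a valid host name: {host!r}")
    return result


encoded_host = encode_host("www.bücher.example")
```

With this input, only the middle component contains non-ASCII characters, so only that component acquires the ACE prefix.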
Cert names
Cert names are read on the command line and subject to normal form
conversion and IDNA encoding as a single component. The invariant is
that a cert name inside monotone is a single LDH ASCII string.
Cert values
Cert values may be either text or binary, depending on the return
value of the hook cert_is_binary. If binary, the cert value is
never printed to the screen (the literal string "<binary>" is
displayed instead), and is never subjected to line ending or
character conversion. If text, the cert value is subject to normal
form conversion, and all UTF-8 codes corresponding to
ASCII control codes (0x00..0x1F and 0x7F) are prohibited in
the normal form, except 0x0A (ASCII LF).
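The rules for text and binary cert values can be sketched in Python as follows. Both function names are hypothetical, and the `is_binary` flag stands in for the result of the cert_is_binary hook; text values are assumed to be UTF-8.

```python
def is_valid_text_cert_value(value: str) -> bool:
    """True if the value has no ASCII control codes other than LF (0x0A)."""
    for ch in value:
        code = ord(ch)
        if code == 0x0A:
            continue  # LF is the one permitted control code
        if code <= 0x1F or code == 0x7F:
            return False
    return True


def display_cert_value(value: bytes, is_binary: bool) -> str:
    """Binary values are never printed; a placeholder is shown instead."""
    return "<binary>" if is_binary else value.decode("utf-8")
```

Under this check a multi-line value using LF passes, while a value containing a tab or a carriage return does not.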
Var domains
Var domains are read on the command line and subject to normal form
conversion and IDNA encoding as a single component. The invariant is
that a var domain inside monotone is a single LDH ASCII string.
Var names and values
Var names and values are assumed to be text, and subject to normal form
conversion.
Key names
Key names are read on the command line and subject to normal form
conversion and IDNA encoding as an email address (split and joined at
'.' and '@' characters). The invariant is that a key name inside
monotone contains only LDH, 0x2E (ASCII '.') and 0x40 (ASCII '@')
characters.
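Encoding a key name "as an email address" can be sketched in Python by splitting at both separators while keeping them in place. As with the other sketches, Python's built-in "idna" codec is an assumed stand-in for monotone's encoder, and `encode_key_name` and `KEY_INVARIANT` are hypothetical names.

```python
import re

# Invariant from the text: only LDH, '.' (0x2E) and '@' (0x40).
KEY_INVARIANT = re.compile(r"[A-Za-z0-9.@-]+")


def encode_key_name(name: str) -> str:
    """IDNA-encode each component of an email-style key name."""
    # The capturing group keeps the '.' and '@' separators in the result.
    parts = re.split(r"([.@])", name)
    encoded = [
        part if part in (".", "@") else part.encode("idna").decode("ascii")
        for part in parts
    ]
    result = "".join(encoded)
    if not KEY_INVARIANT.fullmatch(result):
        raise ValueError(f"not a valid key name: {name!r}")
    return result


encoded_key = encode_key_name("dev@bücher.example")
```

Note that the part before the '@' is encoded in the same way as the domain components, so a non-ASCII local part also appears in ACE form.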
Packets
Packets are 7-bit ASCII. The characters permitted in packets are
the union of these character sets:
- The 65 characters of base64 encoding (64 coding + "=" pad).
- The 16 characters of hex encoding.
- LDH, '@' and '.' characters, as required for key and cert names.
- '[' and ']', the packet delimiters.
- ASCII codes 0x0D (CR), 0x0A (LF), 0x09 (HT), and 0x20 (SP).
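The union of character sets listed above is small enough to write out directly. The following Python sketch builds it and checks a string against it; the names are hypothetical, and lowercase hex digits are assumed (the choice does not change the union, since base64 already contains both cases).

```python
import string

BASE64_CHARS = set(string.ascii_letters + string.digits + "+/=")  # 65 chars
HEX_CHARS = set("0123456789abcdef")  # assumption: lowercase hex digits
NAME_CHARS = set(string.ascii_letters + string.digits + "-@.")    # LDH, '@', '.'
DELIMITERS = set("[]")
WHITESPACE = set("\r\n\t ")  # CR, LF, HT, SP

PACKET_CHARS = BASE64_CHARS | HEX_CHARS | NAME_CHARS | DELIMITERS | WHITESPACE


def is_packet_clean(text: str) -> bool:
    """True if every character belongs to the permitted packet set."""
    return all(ch in PACKET_CHARS for ch in text)
```

Because the sets overlap heavily, the union contains 74 distinct characters in total.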