Monotone initially dealt with only ASCII characters, in file path names, certificate names, key names, and packets. Some conservative extensions are provided to permit internationalized use. These extensions can be summarized as follows:
The remainder of this section is a precise specification of monotone’s internationalization behavior.
Character set conversion is the process of mapping a string of bytes representing wide characters from one encoding to another. Per-file character set conversions are specified by a Lua hook, get_charset_conv, which takes a filename and returns a table of two strings: the first names the "internal" (database) charset, the second names the "external" (file system) charset.
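
The conversion described above can be sketched in a few lines. This is only an illustration, not monotone's implementation: the Python function get_charset_conv below is a hypothetical stand-in for the Lua hook of the same name, and the ".txt" rule and the helpers to_internal and to_external are assumptions made up for the example.

```python
def get_charset_conv(filename: str) -> tuple:
    """Hypothetical stand-in for the Lua hook: returns (internal, external)."""
    if filename.endswith(".txt"):
        # Assumed policy for this sketch: text files are Latin-1 on disk.
        return ("utf-8", "iso-8859-1")
    return ("utf-8", "utf-8")


def to_internal(filename: str, data: bytes) -> bytes:
    """Convert file-system bytes to the internal (database) charset."""
    internal, external = get_charset_conv(filename)
    return data.decode(external).encode(internal)


def to_external(filename: str, data: bytes) -> bytes:
    """Convert database bytes back to the external (file system) charset."""
    internal, external = get_charset_conv(filename)
    return data.decode(internal).encode(external)
```

Converting in both directions through the same hook result keeps the round trip lossless for any text the external charset can represent.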
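LDH: letters, digits, and hyphen, that is, the set of ASCII bytes 0x2D (’-’), 0x30 to 0x39 (digits), 0x41 to 0x5A (uppercase letters), and 0x61 to 0x7A (lowercase letters).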
stringprep: RFC 3454, a general framework for mapping, normalizing, prohibiting and bidirectionality checking for international names prior to use in public network protocols.
nameprep: RFC 3491, a specific profile of stringprep, used for preparing international domain names (IDNs).
punycode: RFC 3492, a "bootstring" encoding of Unicode into ASCII.
IDNA: RFC 3490, international domain names for applications, a combination of the above technologies (nameprep, punycoding, limiting to LDH characters) to form a specific "ASCII compatible encoding" (ACE) of Unicode, signified by the presence of an "unlikely" ACE prefix string "xn--". IDNA is intended to make it possible to use Unicode relatively "safely" over legacy ASCII-based applications. The general picture of an IDNA string is this:
It is important to understand that IDNA encoding does not preserve the input string: it both prohibits a wide variety of possible strings and normalizes non-equal strings to supposedly "equivalent" forms.
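
The lossy nature of IDNA encoding can be seen with a small experiment. The sketch below uses the "idna" codec from Python's standard library, which implements RFC 3490; it stands in for whatever library monotone itself links against, and the helper names idna_encode and idna_decode are made up for the example.

```python
def idna_encode(name: str) -> bytes:
    """Encode a name with Python's built-in RFC 3490 (IDNA) codec."""
    return name.encode("idna")


def idna_decode(ace: bytes) -> str:
    """Decode an ACE string back to Unicode."""
    return ace.decode("idna")


# Two different inputs: nameprep case-folds before punycode is applied,
# so both collapse to the same ACE form.
upper = idna_encode("MÜNCHEN")
lower = idna_encode("münchen")
print(upper == lower)

# Decoding returns the normalized form, not the original input.
print(idna_decode(upper))
```

Because decoding yields the normalized string, a name cannot in general be recovered exactly as it was typed once it has been IDNA-encoded.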
By default, monotone does not decode IDNA when printing to the console (IDNA names are ASCII, which is a subset of UTF-8, so this normal form conversion can still apply, albeit oddly). This behavior is to protect users against security problems associated with malicious use of "similar-looking" characters.
0x5C ’\’ path separator to 0x2F ’/’. This extra processing is performed by boost::filesystem.
0x2F (ASCII ’/’), and without a leading or trailing
0x2F and any ASCII "control codes" (
sha1sum will produce different results than those entries shown in a corresponding manifest.
UI messages are displayed via calls to
Host names are read on the command line and subject to normal form conversion. Host names are then split at 0x2E (ASCII ’.’), each component is subject to IDNA encoding, and the components are rejoined. After processing, host names are stored internally as ASCII. The invariant is that a host name inside monotone contains only sequences of LDH separated by 0x2E (ASCII ’.’).
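
As a rough illustration of the host name steps (split, encode each component, rejoin, check the invariant), here is a minimal sketch. It assumes the input has already been through normal form conversion and uses Python's standard "idna" codec in place of monotone's own IDNA routines; encode_host_name and LDH_LABEL are hypothetical names.

```python
import re

# One component made only of letters, digits, and hyphen (LDH).
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")


def encode_host_name(host: str) -> str:
    """Split at '.', IDNA-encode each component, rejoin, check the invariant."""
    labels = host.split(".")  # split at 0x2E
    encoded = [label.encode("idna").decode("ascii") for label in labels]
    for label in encoded:
        # Invariant: every stored component is a non-empty LDH string.
        if not LDH_LABEL.fullmatch(label):
            raise ValueError("host name component is not LDH: %r" % label)
    return ".".join(encoded)
```

For example, encode_host_name("bücher.example") yields "xn--bcher-kva.example", while a name with an empty component or a space is rejected by the invariant check.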
Cert names are read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a cert name inside monotone is a single LDH ASCII string.
Cert values may be either text or binary, depending on the return value of the hook cert_is_binary. If binary, the cert value is never printed to the screen (the literal string "<binary>" is displayed instead), and is never subjected to line ending or character conversion. If text, the cert value is subject to normal form conversion, and all UTF-8 codes corresponding to ASCII control codes (0x00 through 0x1F, and 0x7F) are prohibited in the normal form, except 0x0A (ASCII LF).
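
The rule for text cert values can be expressed as a simple check. The sketch below is an assumption-laden illustration, not monotone's code: it takes the ASCII control codes to be the standard range 0x00 through 0x1F plus 0x7F, and check_text_cert_value is a hypothetical helper name.

```python
def check_text_cert_value(value: str) -> str:
    """Reject ASCII control codes in a text cert value, allowing only LF."""
    for index, ch in enumerate(value):
        code = ord(ch)
        is_control = code <= 0x1F or code == 0x7F
        if is_control and code != 0x0A:  # 0x0A (LF) is the one exception
            raise ValueError(
                "control code 0x%02X at position %d" % (code, index)
            )
    return value
```

Note that under this rule a tab (0x09) or carriage return (0x0D) is rejected just like any other control code; only line feeds survive.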
Var domains are read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a var domain inside monotone is a single LDH ASCII string.
Var names and values are assumed to be text, and subject to normal form conversion.
Key names are read on the command line and subject to normal form conversion and IDNA encoding as an email address (split and joined at ’.’ and ’@’ characters). The invariant is that a key name inside monotone contains only LDH characters, 0x2E (ASCII ’.’), and 0x40 (ASCII ’@’).
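
Key names follow the same pattern as host names, with ’@’ as an additional separator. The following sketch shows one way this could look, again using Python's standard "idna" codec as a stand-in and assuming normal form conversion has already happened; encode_key_name is a hypothetical name.

```python
import re

SEPARATORS = re.compile(r"([.@])")  # capture group keeps the separators
LDH = re.compile(r"[A-Za-z0-9-]+")


def encode_key_name(name: str) -> str:
    """IDNA-encode each component of an email-style key name."""
    out = []
    for part in SEPARATORS.split(name):
        if part in (".", "@"):
            out.append(part)  # separators pass through unchanged
            continue
        encoded = part.encode("idna").decode("ascii")
        if not LDH.fullmatch(encoded):
            raise ValueError("key name component is not LDH: %r" % part)
        out.append(encoded)
    return "".join(out)
```

An all-ASCII key name such as "alice@example.com" passes through unchanged, while non-ASCII components on either side of the ’@’ are encoded independently.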
Packets are 7-bit ASCII. The characters permitted in packets are the union of these character sets: