Monotone initially dealt only with ASCII characters in file path names, certificate names, key names, and packets. Some conservative extensions are provided to permit internationalized use. These extensions can be summarized as follows:
The remainder of this section is a precise specification of monotone's internationalization behavior.
Per-file character set conversion is specified by a Lua hook
get_charset_conv, which takes a filename and returns a table of two
strings: the first represents the "internal" (database) charset, the
second represents the "external" (file system) charset.
Line ending conversion translates line endings (0x0D, 0x0A, or the
pair 0x0D 0x0A) from one convention to another. Per-file line ending
conversion is specified by a Lua hook get_linesep_conv, which takes a
filename and returns a table of two strings: the first represents the
"internal" (database) line ending convention, the second represents
the "external" (file system) line ending convention. Each string
should be one of the three strings "CR", "LF", or "CRLF".
Note that when both character set and line ending conversion are
enabled, line ending conversion is always performed on the internal
character set. This behavior is meant to encourage the use of
monotone's “normal form” (UTF-8, '\n') as an internal form for your
source files when working with multiple external forms. Also note that
line ending conversion only works on character encodings that use the
specific code bytes described above, such as ASCII, ISO-8859-x, and
UTF-8.
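As a rough illustration of the ordering described above, the following Python sketch converts external file content to the internal form: character set conversion first, then line ending conversion on the internal bytes. The function names and the LINESEP table are hypothetical and are not part of monotone; only the convention names "CR", "LF", and "CRLF" come from the text.

```python
# Hypothetical sketch; none of these names exist in monotone itself.
LINESEP = {"CR": b"\r", "LF": b"\n", "CRLF": b"\r\n"}

def convert_linesep(data: bytes, src: str, dst: str) -> bytes:
    """Replace every `src` line ending in `data` with the `dst` line ending."""
    return LINESEP[dst].join(data.split(LINESEP[src]))

def to_internal(data: bytes, ext_charset: str, int_charset: str,
                ext_sep: str, int_sep: str) -> bytes:
    """Convert external (file system) bytes to internal (database) bytes."""
    # Step 1: character set conversion, external -> internal.
    internal = data.decode(ext_charset).encode(int_charset)
    # Step 2: line ending conversion, performed on the internal charset.
    return convert_linesep(internal, ext_sep, int_sep)

external = "caf\u00e9\r\nmenu\r\n".encode("iso-8859-1")
print(to_internal(external, "iso-8859-1", "utf-8", "CRLF", "LF"))
```

The byte-level split only makes sense for encodings in which line endings are exactly the bytes 0x0D and 0x0A, which matches the restriction stated above.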
get_system_linesep
hook. No hooks exist for adjusting the
system character set, since the system character set must be known
during command-line argument processing, before any Lua hooks are
loaded.
Monotone's normal form is the UTF-8 character set and the 0x0A
(LF) line ending form. This form is used in any files monotone needs
to read, write, and interpret itself, such as: MT/revision,
MT/work, MT/options, .mt-attrs
LDH denotes the ASCII letters, digits, and hyphen, that is, the bytes
0x2D, 0x30..0x39, 0x41..0x5A, and 0x61..0x7A.
{ACE-prefix}{LDH-sanitized(punycode(nameprep(UTF-8-string)))}
It is important to understand that IDNA encoding does not preserve the input string: it both prohibits a wide variety of possible strings and normalizes non-equal strings to supposedly "equivalent" forms.
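The loss of information can be seen with a small Python sketch. It uses Python's built-in "idna" codec as a stand-in for monotone's own IDNA implementation, which is an assumption: the codec applies nameprep and punycode and adds the ACE prefix, but it is not guaranteed to match monotone's behavior in every case. The helper names to_ace and from_ace are hypothetical.

```python
def to_ace(label: str) -> str:
    """Encode one name component with Python's built-in IDNA codec."""
    return label.encode("idna").decode("ascii")

def from_ace(label: str) -> str:
    """Decode an ACE-encoded component back to Unicode."""
    return label.encode("ascii").decode("idna")

# Two non-equal inputs that differ only in case.
for original in ("b\u00fccher", "B\u00fccher"):
    encoded = to_ace(original)
    print(repr(original), "->", encoded, "->", repr(from_ace(encoded)))
```

Both inputs map to the same encoded form, so decoding cannot recover which of the two was originally supplied.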
By default, monotone does not decode IDNA when printing to the
console (IDNA names are ASCII, which is a subset of UTF-8, so this
normal form conversion can still apply, albeit oddly). This behavior
is to protect users against security problems associated with
malicious use of "similar-looking" characters. If the hook
display_decoded_idna
returns true, IDNA names are decoded for
display.
0x5C '\' path separator to 0x2F '/'. This extra processing is
performed by boost::filesystem.
0x2F (ASCII '/'), and without a leading or trailing 0x2F.
0x2F and any ASCII "control codes" (0x00..0x1F and 0x7F).
sha1sum
will produce
different results than those entries shown in a corresponding manifest.
UI messages are displayed via calls to gettext().
Host names are read on the command-line and subject to normal form
conversion. Host names are then split at 0x2E
(ASCII '.'), each
component is subject to IDNA encoding, and the components are
rejoined.
After processing, host names are stored internally as ASCII. The
invariant is that a host name inside monotone contains only sequences
of LDH separated by 0x2E.
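The split, encode, and rejoin steps for host names can be sketched in Python as follows. This is a minimal illustration, assuming Python's built-in "idna" codec behaves closely enough to monotone's IDNA encoding; encode_host and LDH_HOST are hypothetical names, not monotone functions.

```python
import re

# LDH sequences separated by 0x2E ('.'), as the invariant requires.
LDH_HOST = re.compile(r"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*")

def encode_host(host: str) -> str:
    """Split at '.', IDNA-encode each component, and rejoin."""
    parts = host.split(".")
    return ".".join(p.encode("idna").decode("ascii") for p in parts)

encoded = encode_host("www.b\u00fccher.example")
print(encoded)
# Check the invariant on the result.
assert LDH_HOST.fullmatch(encoded)
```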
Cert names are read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a cert name inside monotone is a single LDH ASCII string.
Cert values may be either text or binary, depending on the return
value of the hook cert_is_binary. If binary, the cert value is never
printed to the screen (the literal string "<binary>" is displayed
instead), and is never subjected to line ending or character
conversion. If text, the cert value is subject to normal form
conversion, and all UTF-8 codes corresponding to ASCII control codes
(0x00..0x1F and 0x7F) are prohibited in the normal form, except 0x0A
(ASCII LF).
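The text-value restriction and the binary display rule can be expressed as a short Python sketch. The function names are hypothetical, and the is_binary argument stands in for whatever the cert_is_binary hook returns; this is an illustration of the rules above, not monotone's implementation.

```python
def text_cert_value_ok(value: str) -> bool:
    """Return True if `value` has no ASCII control codes other than LF."""
    for ch in value:
        code = ord(ch)
        if (code <= 0x1F and code != 0x0A) or code == 0x7F:
            return False
    return True

def display_cert_value(value: str, is_binary: bool) -> str:
    """Binary values are never printed; a fixed placeholder is shown."""
    return "<binary>" if is_binary else value

print(text_cert_value_ok("first line\nsecond line"))
print(text_cert_value_ok("tab\tseparated"))
print(display_cert_value("ignored", is_binary=True))
```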
Var domains are read on the command line and subject to normal form conversion and IDNA encoding as a single component. The invariant is that a var domain inside monotone is a single LDH ASCII string.
Var names and values are assumed to be text, and subject to normal form conversion.
Key names are read on the command line and subject to normal form
conversion and IDNA encoding as an email address (split and joined at
'.' and '@' characters). The invariant is that a key name inside
monotone contains only LDH, 0x2E (ASCII '.') and 0x40 (ASCII '@')
characters.
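Encoding a key name as an email address differs from host name encoding only in that '@' is also a separator. A minimal Python sketch, again assuming Python's built-in "idna" codec as a stand-in for monotone's encoder and using the hypothetical name encode_key_name:

```python
import re

def encode_key_name(name: str) -> str:
    """IDNA-encode each component of an email-style key name."""
    # The capturing group keeps the '.' and '@' separators in the result.
    pieces = re.split(r"([.@])", name)
    encoded = []
    for piece in pieces:
        if piece in (".", "@"):
            encoded.append(piece)  # separators pass through unchanged
        else:
            encoded.append(piece.encode("idna").decode("ascii"))
    return "".join(encoded)

print(encode_key_name("user@b\u00fccher.example"))
```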
Packets are 7-bit ASCII. The characters permitted in packets are the union of these character sets:
Now uses 0x0A (ASCII LF) as a delimiter, to permit 0x20 in filenames. This may change in the future.