Encoding text

Some Sympathy nodes allow you to choose an encoding, especially ones that read or write files or communicate over a network. This section is a short introduction to encodings to help you choose.

Character encoding determines the translation between text characters and bytes, for example, stored in a file. Each encoding uses a different translation scheme and can support different languages.

  • Encode: text characters -> bytes

  • Decode: bytes -> text characters

To recreate the original text, choose the same encoding for decode as was used to encode the data.
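The round trip described above can be sketched in Python, using the str.encode and bytes.decode methods covered later in this section:

```python
text = "Björnbärssnår"

# Encode: text characters -> bytes
data = text.encode("utf-8")

# Decode: bytes -> text characters.
# Using the same encoding restores the original text.
restored = data.decode("utf-8")
```
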

See https://en.wikipedia.org/wiki/Character_encoding for more information.

Notable Encodings

Here are some encodings that we offer as choices in Sympathy.

Recommended encoding:

  • UTF-8, supports essentially all written languages. Widely used on the web and strongly recommended when you have the freedom to choose. Capable of encoding all valid Unicode characters.

Other encodings:

These are not recommended but could be needed when working with existing files and applications. Use only when required!

  • UTF-16, supports essentially all written languages. There are variations depending on byte order (endianness): UTF-16-LE, UTF-16-BE, and UTF-16, which uses a byte order mark (BOM) to choose between LE and BE. UTF-16 is used internally by Microsoft Windows but has now generally been superseded by UTF-8. Capable of encoding all valid Unicode characters.

  • US-ASCII, supports American English.

  • ISO 8859-1 (Latin-1), supports Western European languages, superset of US-ASCII.

  • ISO 8859-15 (Latin-9), supports Western European languages, similar to ISO 8859-1 but replaces some less common symbols, most notably introducing the euro sign.

  • Windows code page encodings, sometimes used by older applications and file formats, especially ones for Windows. Can be identified by code page identifiers. Superseded by Unicode and UTF-8, etc., but can still be found in files today.

  • Windows-1252, an example of a Windows code page which supports Western European languages. Superset of ISO 8859-1 in terms of printable characters. Used in legacy components of Microsoft Windows for English and many European languages.
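
The euro sign is a convenient character for telling these encodings apart. A short sketch (the specific byte values follow from the encoding tables):

```python
euro = "€"

# Latin-9 added the euro sign at position 0xA4
print(euro.encode("iso8859-15"))    # -> b'\xa4'

# Windows-1252 also includes it, at position 0x80
print(euro.encode("windows-1252"))  # -> b'\x80'

# Latin-1 predates the euro sign and cannot encode it
try:
    euro.encode("iso8859-1")
except UnicodeEncodeError as e:
    print("not in Latin-1:", e)
```
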

For other encodings (if you type the name by hand), use the Codec names from https://docs.python.org/3/library/codecs.html#standard-encodings.

Choosing an encoding

If you are responsible for both encoding and decoding, you should probably just use UTF-8 as the character encoding at both ends.

If you receive data files, you need to use the same encoding as was used to encode them. If you don’t know which encoding was used, you can try the ones listed above, in order, to try to identify the correct one.
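
Trying candidate encodings can be sketched as below; the candidate list and the sample bytes are illustrative assumptions, and the successful decodings need to be inspected manually since several encodings may decode without error:

```python
# Candidate encodings, tried in order of preference
candidates = ["utf-8", "utf-16", "us-ascii",
              "iso8859-1", "iso8859-15", "windows-1252"]

# Bytes of unknown origin (illustrative example)
unknown = b"Bj\xf6rnb\xe4rssn\xe5r"

for name in candidates:
    try:
        # Print each successful decoding for manual inspection
        print(f"{name}: {unknown.decode(name)}")
    except UnicodeDecodeError:
        print(f"{name}: failed")
```
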

If you produce data files you need to communicate the character encoding to the consumers of those data files.

There are also applications and libraries that use heuristics to try to automatically identify character encodings, but these are prone to failure in many cases and are not generally recommended.

Mismatched encodings

Decoding using a different character encoding than the one used for encoding may result in garbled text, making some characters appear as unrelated ones.

Example with the Swedish word “Björnbärssnår” (Blackberry thicket):

Encode    Decode        Result
UTF-8     UTF-8         Björnbärssnår
UTF-8     ISO-8859-1    Björnbärssnår
UTF-8     ISO-8859-2    BjÜrnbärssnür
UTF-8     UTF-16-LE     橂뛃湲썢犤獳썮犥
UTF-8     UTF-16-BE     䉪쎶牮拃ꑲ獳滃ꕲ

As seen, mismatched encodings can result in anything from misrepresented special characters to a result that is completely off. The result can also be correct for some words in a larger text and incorrect for others. Both encode and decode can also fail completely when there is no possible translation, depending on the combination of characters (encode) or bytes (decode).
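
The mismatch can be reproduced directly in Python, here for the UTF-8/ISO-8859-1 case:

```python
# Encode with UTF-8, then decode with the wrong encoding (ISO-8859-1).
# Each two-byte UTF-8 sequence becomes two unrelated Latin-1 characters.
data = "Björnbärssnår".encode("utf-8")
garbled = data.decode("iso8859-1")
print(garbled)  # -> BjÃ¶rnbÃ¤rssnÃ¥r
```
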

See https://en.wikipedia.org/wiki/Mojibake for more information.

Encodings in Python

Encoding and decoding in Python are performed using two methods: str.encode and bytes.decode. Names for available encodings can be found in the documentation for the codecs module.

Encoding using an unsupported encoding results in a UnicodeEncodeError.

>>> 'Björnbärssnår'.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 2: ordinal not in range(128)

Decoding using an unsupported encoding results in a UnicodeDecodeError.

>>> encoded = 'Björnbärssnår'.encode('iso-8859-1')
>>> encoded
b'Bj\xf6rnb\xe4rssn\xe5r'
>>> encoded.decode('iso-8859-1')
'Björnbärssnår'
>>> encoded.decode('utf-8')
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 2: invalid start byte

Often, the right way to deal with these exceptions is simply to choose the intended encoding. When the exact encoding is unknown, or if the data is somehow corrupt, Python offers the errors parameter for encode and decode, which can substitute or ignore unsupported symbols.

>>> encoded = 'Björnbärssnår'.encode('iso-8859-1')
>>> encoded
b'Bj\xf6rnb\xe4rssn\xe5r'
>>> encoded.decode('iso-8859-1')
'Björnbärssnår'
>>> encoded.decode('utf-8', errors='replace')
'Bj�rnb�rssn�r'

Here, errors='replace' substitutes � in place of undecodable bytes instead of raising a UnicodeDecodeError. For more options, see https://docs.python.org/3/howto/unicode.html.
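
A short sketch of two other standard error handlers, errors='ignore' and errors='backslashreplace', applied to the same bytes as above:

```python
encoded = "Björnbärssnår".encode("iso8859-1")

# 'ignore' silently drops bytes that cannot be decoded
print(encoded.decode("utf-8", errors="ignore"))           # -> Bjrnbrssnr

# 'backslashreplace' keeps the offending bytes as escape sequences
print(encoded.decode("utf-8", errors="backslashreplace")) # -> Bj\xf6rnb\xe4rssn\xe5r
```
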