Understanding Unicode and Character Encodings

Introduction: All Plaintext is Represented as Bytes (Integers)

Because text is stored as bytes, an encoding is needed to indicate which character each byte value represents.


A large part of the data we work with is stored in plaintext files, which are simply streams of characters. Plaintext file types include .txt, .csv, .json, .html and .xml. However, all files are stored in binary form on our computer systems: a plaintext file is stored as a sequence of integers, with each integer (or sometimes a group of 2-4 integers) representing a character.


    h   e   l   l   o   ,       w   o   r   l   d   !
   104 101 108 108 111 44   32 119 111 114 108 100  33

Each integer occupies a byte on our system, which is why strings of integers that represent characters are called bytestrings. In order to view text in an application (such as a text editor, a browser or a Python program), the integers in a bytestring must be converted to characters. To do this, the application must refer to an encoding that indicates which integer value corresponds to each character in the character set. Every time we look at text, whether in an editor, IDE, browser or Python program, bytestrings are being converted to text. This happens invisibly and seamlessly (although you may sometimes see a ? in a chat or web page - this means that the converter didn't know how to convert that integer).
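
As a small illustration (a sketch assuming the ascii encoding for these particular byte values), we can collect the integers shown above into a bytes object in Python and decode it back to text:

int_values = [104, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]

bytestring = bytes(int_values)          # b'hello, world!'

text = bytestring.decode('ascii')       # 'hello, world!'

print(text)                             # hello, world!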





The ASCII Table: the original character set

This 128-character set comprises the English letters and digits, plus punctuation and other symbols.


One of the earliest character sets used with computers is known as the ascii table. If you look at this table and pick out the integer equivalents for "hello, world!", you'll see that they match those in the earlier hello, world! example.


You can also see a similar translation in action by using the ord() and chr() functions:

ordinal = ord('A')        # 65

char = chr(65)            # 'A'

The problem with the ascii table is that it only contains 128 characters. This works fine for many files written in English, but many other characters and symbols are needed to represent languages around the world.
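
As a quick check with ord() (these particular characters are just examples), characters used outside of English fall beyond ascii's 0-127 range:

print(ord('A'))         # 65   -- inside the 128-character ascii range
print(ord('à'))         # 224  -- outside the ascii range
print(ord('€'))         # 8364 -- far outside the ascii range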





Unicode: the effort to create a universal character set

There are many character sets; utf-8 is the most widely adopted Unicode encoding.


Unicode is the effort to produce a character set that can represent a much more comprehensive range of the characters used in world languages (as well as other types of plaintext expression, such as emojis). A Unicode encoding such as utf-8 is capable of representing over a million code points. Most applications, websites and Python itself have embraced the utf-8 standard. However, this does not mean that utf-8 will make other character sets obsolete. In fact, there are hundreds of character sets in use today, and this is not likely to change. The IT world will always be awash in many different character sets, and as IT professionals it is our job to be able to interpret and use the character sets that we encounter.

Name                          Year Introduced   # of Code Points   Notes
ascii                         1963              128                one byte per character
latin-1 (a.k.a. ISO-8859-1)   1987              256                a superset of ascii; one byte per character
utf-8                         1993              1,112,064          a superset of ascii; variable-length encoding (1-4 bytes per character)
utf-16                        2000              1,112,064          not a superset of ascii; variable-width encoding (1-2 16-bit code units per character); in use in a small fraction of systems
utf-32                        c. 2003           1,112,064          not a superset of ascii; fixed-width encoding (4 bytes per character); in use in a small fraction of systems
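
As a brief sketch of utf-8's variable-length encoding (the characters chosen here are just illustrative), we can encode individual characters and count the resulting bytes:

print(len('A'.encode('utf-8')))         # 1 -- ascii characters remain one byte
print(len('à'.encode('utf-8')))         # 2 -- accented Latin letters take two bytes
print(len('€'.encode('utf-8')))         # 3 -- the euro sign takes three bytes
print(len('😀'.encode('utf-8')))        # 4 -- emojis take four bytes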





Unicode in Python: Encoding and Decoding bytestrings

A string is a sequence of characters; a bytestring is a sequence of integers that represent those characters, mapped through an encoding.


As we discussed, all text is stored as integer values, each integer representing a character. When text is displayed, those integers must be converted back to characters. This means that every time you see characters represented in a file -- in an editor, in a browser or by a Python program -- those characters were decoded from integer values. To decode an integer to its corresponding character, the application must refer to an encoding, which maps integer values to characters; ascii is one such encoding, utf-8 is another. In short, every string representation of text that you see has been decoded from integers using some encoding.


To have Python encode a string to an integer bytestring, we use the str.encode() method:

strvar = 'this is a string'

bytesvar = strvar.encode('utf-8')        # utf-8 is the default, but could be ascii, latin-1, etc.

print(bytesvar)                          # b'this is a string'

# view first character in string and bytestring
print(strvar[0])                         # 't'
print(bytesvar[0])                       # 116

A bytestring is notated with a b''. It prints with recognizable characters because Python displays bytes that fall in the printable ascii range as characters; any other byte values appear as \x escape codes.
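
For example (a small sketch using the latin-1 encoding for the accented character), byte values outside the printable ascii range appear as \x escape codes rather than as characters:

bytesvar = 'voilà'.encode('latin-1')

print(bytesvar)                          # b'voil\xe0' -- 224 is shown as the escape \xe0
print(bytesvar[4])                       # 224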


To decode a bytestring to a string, we use the bytes.decode() method, supplying the encoding:

bytesvar = b'hello'                     # this would usually come from another
                                        # source, such as a web page

strvar = bytesvar.decode('utf-8')       # utf-8 is the default

print(strvar)                           # 'hello'

 


Most modern applications, including Python, default to utf-8. If we wish to open a file with a different encoding, we can specify it in the call to open():

fh = open('../pyku.txt', encoding='ISO-8859-1')
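
As a brief sketch (the filename latin1_sample.txt is just an example), writing a file in latin-1 and reading it back with that encoding specified:

fh = open('latin1_sample.txt', 'w', encoding='ISO-8859-1')
fh.write('voilà')
fh.close()

fh = open('latin1_sample.txt', encoding='ISO-8859-1')
text = fh.read()                        # 'voilà'
fh.close()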

We can also open a file to be read as bytes by passing 'rb' as the mode argument:

fh = open('../pyku.txt', 'rb')

bytestr = fh.read()         # b"We're all out of gouda.\nThis parrot..."

And, we can also write raw bytes to a file:

fh = open('newfile.txt', 'wb')

text_to_write = b'this is text encoded as bytes'

fh.write(text_to_write)

fh.close()
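
Putting these together, here is a minimal round trip (reusing the newfile.txt name from above): encode a string, write the bytes, read them back in binary mode and decode:

fh = open('newfile.txt', 'wb')
fh.write('voilà'.encode('utf-8'))
fh.close()

fh = open('newfile.txt', 'rb')
bytestr = fh.read()                     # b'voil\xc3\xa0'
fh.close()

print(bytestr.decode('utf-8'))          # voilà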




Handling Decode and Encode Errors

We can trap decode and encode errors and/or remove or replace inappropriate characters.


Strings and bytestrings do not carry encoding information; this means that they do not "know" what encoding was used to create them, and thus cannot tell us which encoding to use.


So, we may receive text that we encode or decode incorrectly:

string = 'voilà'
bytestring = string.encode('ascii')

   # UnicodeEncodeError: 'ascii' codec can't encode character  '\xe0'
   # in position 4: ordinal not in range(128)


bytestring = string.encode('latin-1')     # successful - 'à' is part of the latin-1 character set


string = bytestring.decode('ascii')

   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in
   # position 4: ordinal not in range(128)

It is of course possible to trap these exceptions and take action if needed.


If we find that a bytestring cannot be decoded with a specified encoding, we can choose to either remove or replace the unknown character(s):

string = 'voilà'
bytestring = string.encode('latin-1')

try:
    string = bytestring.decode('ascii')     # this will raise a UnicodeDecodeError
except UnicodeDecodeError:
    string = bytestring.decode('ascii', errors='replace')        # 'voil�'
    string = bytestring.decode('ascii', errors='ignore')         # 'voil'

When decoding, 'replace' substitutes the official Unicode replacement character (�) for each byte it can't decode, while 'ignore' simply removes it (when encoding, 'replace' substitutes a literal question mark).




"Sniffing" the Encoding of a Bytestring

A bytestring does not contain meta information, so it does not "know" what encoding was used to create it.


The third-party chardet module (installed separately, for example with pip) can inspect a bytestring in order to guess its encoding. We call this kind of evidence-based examination "sniffing":

import chardet

s1 = 'there it is!'
sb1 = s1.encode('ascii')

s2 = 'voilà!'
sb2 = s2.encode('latin-1')

print(chardet.detect(sb1))        # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
print(chardet.detect(sb2))        # {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

Keep in mind that chardet doesn't know which character set was used, because the bytestring does not contain this information. chardet.detect() infers the character set from the byte values in the string, so its guess may sometimes be wrong.
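
A common pattern (a sketch only -- chardet's guess is probabilistic and can be wrong) is to feed the detected encoding back into decode():

import chardet

bytestring = 'voilà!'.encode('latin-1')

guess = chardet.detect(bytestring)          # e.g. {'encoding': 'ISO-8859-1', ...}

if guess['encoding'] is not None:
    text = bytestring.decode(guess['encoding'])              # 'voilà!'
else:
    text = bytestring.decode('utf-8', errors='replace')      # fall back if no guess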




