Unicode in Python

Andrew Svetlov

@andrew_svetlov
andrew.svetlov@gmail.com
https://asvetlov.github.io/pycon-minsk-2018/

Bio

Work at Ocean S.A. (https://ocean.io/)
aio-libs, aiohttp etc.
Python Core Developer

ASCII

ASCII age

Every symbol takes 7 bits
Codes above 127 are used by national encodings

Russian codecs

Windows 1251
KOI8-R
IBM 855
IBM 866
Mac Cyrillic
ISO 8859-5

Codecs are not compatible
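
A minimal sketch of the incompatibility, assuming Python's stdlib aliases for the six codecs listed above. The sample word is only an illustration; any Cyrillic text behaves the same way.

```python
# Stdlib codec names for the Russian encodings listed above.
CODECS = ["cp1251", "koi8_r", "cp855", "cp866", "mac_cyrillic", "iso8859_5"]

word = "Привет"  # illustrative sample text

# The same text becomes a different byte sequence under each codec.
encoded = {name: word.encode(name) for name in CODECS}
for name, raw in encoded.items():
    print(f"{name:13} {raw.hex()}")

# Decoding with the wrong codec raises no error: it silently yields mojibake.
mojibake = encoded["cp1251"].decode("koi8_r")
print(repr(mojibake))
```

Nothing in the bytes says which codec produced them, so the encoding has to be known from somewhere else.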

Unicode

Character → Code Point
1,114,112 code points
136,755 characters
139 modern and historic scripts
Hieroglyphs
Ideographs
Emojis
etc.

Not a binary representation
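
The character-to-code-point mapping can be explored with the built-ins `ord()` and `chr()`. This is only an illustration of the mapping; the variable names are arbitrary.

```python
import sys

# A code point is just an integer assigned to a character.
code_point = ord("A")           # character -> code point
character = chr(0x1F872)        # code point -> character

# The code space runs from 0 to sys.maxunicode inclusive.
total_code_points = sys.maxunicode + 1

print(code_point, hex(ord(character)), total_code_points)
```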

Codecs

Codecs and Unicode

UTF is Unicode Transformation Format

Unicode (str) → Code Points in memory
UTF (bytes) → for serialization
files / network / database
All other codecs produce bytes too
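
A short sketch of the str/bytes split: `str.encode()` serializes code points to bytes and `bytes.decode()` brings them back. The sample word is written with an escape so the code point is unambiguous.

```python
text = "L\u00f6we"               # str: a sequence of 4 code points

data = text.encode("utf-8")      # bytes: what goes to a file, socket or database
restored = data.decode("utf-8")  # back to str after reading

print(type(text).__name__, len(text))
print(type(data).__name__, len(data), data)
```

The lengths differ because `len()` counts code points for `str` and bytes for `bytes`.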

UTF

UTF-8
UTF-16-LE
UTF-16-BE
UTF-32-LE
UTF-32-BE
+BOM

Universal? Really? ☠ WTF?

UTF-8

1-4 bytes per code point
Can encode everything
[\u0000-\u007F] 0bbb_bbbb
[\u0080-\u07FF] 110b_bbbb+10bb_bbbb
[\u0800-\uFFFF] 1110_bbbb+10bb_bbbb*2
[\U00010000-\U0010FFFF] 1111_0bbb+10bb_bbbb*3
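
The bit layout above can be sketched as a hand-written encoder. `utf8_encode` is a hypothetical helper for illustration, not part of the stdlib; real code should simply call `str.encode('utf-8')`.

```python
def utf8_encode(ch: str) -> bytes:
    """Encode one character following the four-row table above.

    Illustration only: unlike str.encode, it does not reject
    surrogate code points (U+D800-U+DFFF).
    """
    cp = ord(ch)
    if cp <= 0x7F:                       # 0bbb_bbbb
        return bytes([cp])
    if cp <= 0x7FF:                      # 110b_bbbb 10bb_bbbb
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110_bbbb 10bb_bbbb 10bb_bbbb
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    return bytes([0b1111_0000 | (cp >> 18),  # 1111_0bbb + 3 continuation bytes
                  0b1000_0000 | ((cp >> 12) & 0x3F),
                  0b1000_0000 | ((cp >> 6) & 0x3F),
                  0b1000_0000 | (cp & 0x3F)])


# One sample per row of the table: 1, 2, 3 and 4 bytes.
for sample in ("A", "\u00f6", "\u20ac", "\U0001F872"):
    print(f"U+{ord(sample):04X}", utf8_encode(sample).hex())
```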

UTF-32

4 bytes per code point
Little Endian (UTF-32-LE) → B1 B2 B3 B4
Big Endian (UTF-32-BE) → B4 B3 B2 B1
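
A small sketch of the two byte orders, using the stdlib codec names `utf-32-le` and `utf-32-be`. The only difference is the order in which the same four bytes are written.

```python
ch = "A"  # U+0041

little = ch.encode("utf-32-le")  # least significant byte first
big = ch.encode("utf-32-be")     # most significant byte first

print(little, big)

# Both decode to the same code point once the byte order is known.
from_little = int.from_bytes(little, "little")
from_big = int.from_bytes(big, "big")
```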

UTF-16

2 or 4 bytes per code point
Gory heritage
Surrogate pairs:
D800-DBFF + DC00-DFFF

U+1F872 (🡲) to UTF-16 → D83E + DC72


off = ord(ch) - 0x10000
high = (off // 0x400) + 0xD800
low = (off % 0x400) + 0xDC00
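
The formula above, wrapped into a runnable sketch and checked against the stdlib codec. `to_surrogates` is a hypothetical helper name used only for this illustration.

```python
import struct


def to_surrogates(ch: str) -> tuple:
    """Return the (high, low) UTF-16 surrogates for a character above U+FFFF."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("BMP characters need no surrogate pair")
    off = cp - 0x10000
    high = (off // 0x400) + 0xD800   # top 10 bits of the offset
    low = (off % 0x400) + 0xDC00     # bottom 10 bits of the offset
    return high, low


arrow = "\U0001F872"
pair = to_surrogates(arrow)

# Cross-check: read the two 16-bit units the codec produces.
from_codec = struct.unpack(">2H", arrow.encode("utf-16-be"))

print([hex(unit) for unit in pair])
```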

BOM

Byte order mark

A magic number at the start of text

EF BB BF → UTF-8
FE FF → UTF-16-BE
FF FE → UTF-16-LE
00 00 FE FF → UTF-32-BE
FF FE 00 00 → UTF-32-LE
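
A sketch of BOM sniffing built on the constants in the stdlib `codecs` module. `sniff_bom` is a hypothetical helper. It assumes the data really starts with a BOM, and it cannot tell UTF-32-LE apart from UTF-16-LE text that begins with U+0000.

```python
import codecs

# UTF-32 marks are checked first: FF FE is a prefix of FF FE 00 00.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


payload = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(payload)
text = payload[skip:].decode(encoding)
print(encoding, repr(text))
```

For UTF-8 alone, the stdlib `utf-8-sig` codec strips the mark on decoding and writes it on encoding.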

Error modes

b.decode('utf-8', errors='strict')

Modes:

strict
ignore
replace
backslashreplace
...

https://docs.python.org/3/library/codecs.html#codec-base-classes
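
A short comparison of the handlers listed above on one invalid input. The sample bytes are "Löwe" in Latin-1, which is not valid UTF-8; the variable names are illustrative.

```python
raw = b"L\xf6we"  # Latin-1 bytes; 0xF6 cannot start a UTF-8 sequence

# The lenient handlers each produce a different string.
results = {
    mode: raw.decode("utf-8", errors=mode)
    for mode in ("ignore", "replace", "backslashreplace")
}
for mode, value in results.items():
    print(f"{mode:17} {value!r}")

# 'strict' is the default and raises instead of guessing.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict           ", exc.reason)
```

`ignore` and `replace` lose information; `backslashreplace` keeps the offending byte visible, which tends to be more useful in logs.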

Composites

Trick


>>> s1 = "Löwe"
>>> s2 = "Löwe"
>>> s1 == s2  # ???


>>> "Löwe" == "Löwe"
False
>>> len(s1)
4
>>> len(s2)
5
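
The two spellings look identical on screen, so a reproducible version of the trick needs explicit escapes. A minimal sketch, using `unicodedata.name()` to show what each string really contains:

```python
import unicodedata

composed = "L\u00f6we"      # precomposed o-with-diaeresis
decomposed = "Lo\u0308we"   # plain 'o' followed by a combining mark

for label, s in (("composed", composed), ("decomposed", decomposed)):
    print(label, len(s))
    for ch in s:
        print(f"    U+{ord(ch):04X} {unicodedata.name(ch)}")

equal = composed == decomposed  # str comparison is code point by code point
```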

Digging in


ö → 1 codepoint, 'ö'(\u00f6)


ö → 2 code points, 'o'(\u006f) + '̈'(\u0308)

Normalization

unicodedata.normalize(form, s)

NFC (Composed)
NFKC
NFD (Decomposed)
NFKD


>>> import unicodedata
>>> len(unicodedata.normalize('NFKC', "Löwe"))
4
>>> len(unicodedata.normalize('NFKD', "Löwe"))
5
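
A sketch of normalization-aware comparison. `same_text` is a hypothetical helper; which form to pick depends on the application, and NFC is only assumed here as a reasonable default.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "L\u00f6we"
decomposed = "Lo\u0308we"
print(composed == decomposed, same_text(composed, decomposed))

# The K (compatibility) forms also fold look-alikes such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfc = unicodedata.normalize("NFC", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)
print(repr(nfc), repr(nfkc))
```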

Categories

Code Point Names


>>> unicodedata.name(' ')
'SPACE'
>>> unicodedata.name('A')
'LATIN CAPITAL LETTER A'
>>> '\N{SPACE}'
' '
>>> '\N{LATIN CAPITAL LETTER A}'
'A'

Funny numbers


>>> import sys
>>> sum(unicodedata.category(chr(i)) == 'Zs'
...     for i in range(sys.maxunicode + 1))
17
>>> sum(unicodedata.category(chr(i)).startswith('P')
...     for i in range(sys.maxunicode + 1))
748

s.split(' ,.;&!*')
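
Why the call above does not do what it seems to: a multi-character argument to `str.split()` is one separator, not a set of characters. A small sketch with an illustrative sample string:

```python
text = "a\u2003b\u00a0c, d"  # EM SPACE, NO-BREAK SPACE, then comma + space

# The whole argument is ONE separator; it never occurs, so nothing is split.
as_one_separator = text.split(" ,.;&!*")

# Splitting on the ASCII space misses the other Unicode spaces.
on_ascii_space = text.split(" ")

# With no argument, split() breaks on runs of any Unicode whitespace,
# but punctuation stays attached to the words.
on_any_whitespace = text.split()

print(as_one_separator, on_ascii_space, on_any_whitespace, sep="\n")
```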

Spaces

'SPACE'
'NO-BREAK SPACE'
'OGHAM SPACE MARK'
'EN QUAD'
'EM QUAD'
'EN SPACE'
'EM SPACE'
'THREE-PER-EM SPACE'
'FOUR-PER-EM SPACE'

'SIX-PER-EM SPACE'
'FIGURE SPACE'
'PUNCTUATION SPACE'
'THIN SPACE'
'HAIR SPACE'
'NARROW NO-BREAK SPACE'
'MEDIUM MATHEMATICAL SPACE'
'IDEOGRAPHIC SPACE'
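
A list like the one above can be generated from the `unicodedata` module. The exact result depends on the Unicode version bundled with the interpreter, so treat this as a sketch.

```python
import sys
import unicodedata

# Names of every code point in the "Zs" (space separator) category.
space_names = [
    unicodedata.name(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

print(unicodedata.unidata_version)
for name in space_names:
    print(repr(name))
```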

Regular expressions

regex library

pip install regex

import regex

\p{property=value}
\P{property=value}
\p{value}
\P{value}

Example


>>> p = regex.compile(r'\p{Zs}+')
>>> p.split('a   b  c')
['a', 'b', 'c']

>>> regex.compile(r'[\p{Z}\p{P}]+')
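
The stdlib `re` module has no `\p{...}` support. When installing `regex` is not an option, a rough equivalent is to build the character class from `unicodedata`. This is a fallback sketch, not a replacement for the `regex` library; `chars_in_category` is a hypothetical helper.

```python
import re
import sys
import unicodedata


def chars_in_category(prefix: str) -> str:
    """All characters whose general category starts with `prefix`."""
    return "".join(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )


# Rough stand-in for \p{Zs}+ : an explicit class of every space separator.
space_run = re.compile("[" + re.escape(chars_in_category("Zs")) + "]+")

parts = space_run.split("a \u2003 b\u00a0\u3000c")
print(parts)
```

Scanning the whole code space takes a moment, so the pattern is best compiled once at import time.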

URL encoding

Percent encoding

Encode all non-safe symbols

'т' → '%D1%82'

'к' → '%D0%BA'


>>> from yarl import URL
>>> URL('https://wikipedia.org/текст')
URL('https://wikipedia.org/%D1%82%D0%B5%D0%BA%D1%81%D1%82')
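
The same percent-encoding is available in the stdlib through `urllib.parse.quote()`, which encodes the text as UTF-8 and escapes each byte. A sketch for path segments only; `yarl` also takes care of the other URL parts.

```python
from urllib.parse import quote, unquote

encoded_char = quote("т")        # one Cyrillic letter, two UTF-8 bytes
encoded_path = quote("/текст")   # '/' is kept as-is by default
decoded_path = unquote(encoded_path)

print(encoded_char)
print(encoded_path)
print(decoded_path)
```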

IDNA


>>> URL('https://雜草工作室.香港')
URL('https://xn--2qq421aovb6v1e3pu.xn--j6w193g')

U-Label → 雜草工作室.香港
A-Label → xn--2qq421aovb6v1e3pu.xn--j6w193g
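
The U-Label/A-Label conversion can be sketched with the stdlib `idna` codec. This is an assumption-laden illustration: the built-in codec follows the older IDNA 2003 rules, so for some host names it may disagree with what `yarl` or a browser produces.

```python
host = "雜草工作室.香港"

# U-Label -> A-Label: each non-ASCII label becomes "xn--" + Punycode.
a_label = host.encode("idna").decode("ascii")

# A-Label -> U-Label: the conversion is reversible.
u_label = a_label.encode("ascii").decode("idna")

print(a_label)
print(u_label)
```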

Questions?