Unicode in Python

Andrew Svetlov

@andrew_svetlov
andrew.svetlov@gmail.com
https://asvetlov.github.io/pycon-minsk-2018/

Bio

  • Work at Ocean S.A. (https://ocean.io/)
  • aio-libs, aiohttp etc.
  • Python Core Developer

ASCII

ASCII age

  • Every symbol takes 7 bits
  • Ord > 127 is left for national encodings

Russian codecs

  • Windows 1251
  • KOI8-R
  • IBM 855
  • IBM 866
  • Mac Cyrillic
  • ISO 8859-5

Codecs are not compatible
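
A small sketch of the incompatibility, using Python's built-in codec names for three of the encodings listed above. The sample word is an assumption picked for illustration.

```python
# The same text produces different bytes in each legacy Cyrillic codec.
text = "текст"

as_cp1251 = text.encode("cp1251")   # Windows 1251
as_koi8r = text.encode("koi8_r")    # KOI8-R
as_cp866 = text.encode("cp866")     # IBM 866

print(as_cp1251.hex(" "), "|", as_koi8r.hex(" "), "|", as_cp866.hex(" "))

# Decoding with the wrong codec does not fail; it silently yields mojibake.
garbled = as_cp1251.decode("koi8_r")
print(garbled)
```

The bytes carry no hint about which codec produced them, so the reader has to know the encoding from some other source.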

Unicode

  • Character → Code Point
  • 1,114,112 code points
  • 136,755 characters
  • 139 modern and historic scripts
  • Hieroglyphs
  • Ideographs
  • Emojis
  • etc.
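
The character-to-code-point mapping can be poked at directly from Python. A minimal sketch using only built-ins:

```python
import sys

# A code point is just an integer; ord() and chr() convert both ways.
code_point = ord("A")
arrow = chr(0x1F872)

# The code space runs from U+0000 to U+10FFFF inclusive.
total_code_points = sys.maxunicode + 1
print(f"U+{code_point:04X}", f"{total_code_points:,}")
```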

Not a binary representation

Codecs

Codecs and Unicode

UTF is Unicode Transformation Format

  • Unicode (str) → Code Points in memory
  • UTF (bytes) → for serialization
  • files / network / database
  • All other codecs produce bytes too
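
A short sketch of the str/bytes boundary described above. The sample word is written with an explicit escape so the code point is unambiguous.

```python
text = "L\u00f6we"            # str: a sequence of code points
data = text.encode("utf-8")   # bytes: what goes to files, sockets, databases

print(len(text), len(data))   # code points vs. bytes
restored = data.decode("utf-8")
```

Here the string has 4 code points but serializes to 5 bytes, because U+00F6 needs two bytes in UTF-8.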

UTF

  • UTF-8
  • UTF-16-LE
  • UTF-16-BE
  • UTF-32-LE
  • UTF-32-BE
  • +BOM

Universal? Really? WTF?

UTF-8

  • 1-4 bytes per code point
  • Can encode everything
  • [\u0000-\u007F] 0bbb_bbbb
  • [\u0080-\u07FF] 110b_bbbb+10bb_bbbb
  • [\u0800-\uFFFF] 1110_bbbb+10bb_bbbb*2
  • [\U00010000-\U0010FFFF] 1111_0bbb+10bb_bbbb*3
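
The bit layout above can be sketched as a hand-written encoder. This is only an illustration checked against the built-in codec; the function name `utf8_encode` is hypothetical, and surrogate code points (which real UTF-8 rejects) are not handled.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the 1-4 byte layout (illustrative only)."""
    if cp < 0x80:                      # 0bbb_bbbb
        return bytes([cp])
    if cp < 0x800:                     # 110b_bbbb 10bb_bbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110_bbbb 10bb_bbbb 10bb_bbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111_0bbb 10bb_bbbb 10bb_bbbb 10bb_bbbb
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ("A", "\u00f6", "\u20ac", "\U0001F872"):
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```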

UTF-32

  • 4 bytes per code point
  • Little Endian (UTF-32-LE) → B1 B2 B3 B4
  • Big Endian (UTF-32-BE) → B4 B3 B2 B1
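
A quick look at the two byte orders with the built-in codecs; the sample character is arbitrary.

```python
ch = "A"  # U+0041

le = ch.encode("utf-32-le")
be = ch.encode("utf-32-be")

print(le.hex(" "))  # least significant byte first
print(be.hex(" "))  # most significant byte first

# Either way the four bytes are just the code point as a 32-bit integer.
from_le = int.from_bytes(le, "little")
from_be = int.from_bytes(be, "big")
```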

UTF-16

  • 2 or 4 bytes per code point
  • Gory heritage
  • Surrogate pairs:
  • D800-DBFF + DC00-DFFF

U+1F872 (🡲) to UTF-16 → D83E + DC72


ch = '\U0001F872'
off = ord(ch) - 0x10000            # 0xF872
high = (off // 0x400) + 0xD800     # 0xD83E
low = (off % 0x400) + 0xDC00       # 0xDC72
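
The same arithmetic wrapped in a function and cross-checked against the built-in UTF-16 codec. The name `to_surrogates` is hypothetical, used only for this sketch.

```python
def to_surrogates(ch: str) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    off = ord(ch) - 0x10000
    if not 0 <= off <= 0xFFFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    return 0xD800 + (off >> 10), 0xDC00 + (off & 0x3FF)


high, low = to_surrogates("\U0001F872")
print(f"{high:04X} {low:04X}")

# The codec produces the same two 16-bit units, big-endian here.
encoded = "\U0001F872".encode("utf-16-be")
```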
	    

BOM

Byte order mark

A magic number at the start of text

  • EF BB BF → UTF-8
  • FE FF → UTF-16-BE
  • FF FE → UTF-16-LE
  • 00 00 FE FF → UTF-32-BE
  • FF FE 00 00 → UTF-32-LE
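
A minimal BOM sniffer built on the constants in the standard `codecs` module. The helper name `sniff_bom` is an assumption for this sketch; real files without a BOM still need another way to learn their encoding.

```python
import codecs

# Check UTF-32 before UTF-16: FF FE 00 00 starts with the UTF-16-LE mark FF FE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding or None, payload with the BOM stripped)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


encoding, payload = sniff_bom(b"\xef\xbb\xbfhi")
print(encoding, payload.decode(encoding))
```

For UTF-8 specifically, the built-in `utf-8-sig` codec strips a leading BOM on decode and writes one on encode.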

Error modes

b.decode('utf-8', errors='strict')

Modes:

  • strict
  • ignore
  • replace
  • backslashreplace
  • ...

https://docs.python.org/3/library/codecs.html#codec-base-classes
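
A sketch comparing the listed modes on one malformed input. The sample bytes are an assumption: a Latin-1 encoded word that is not valid UTF-8.

```python
raw = b"L\xf6we"  # Latin-1 bytes; 0xF6 is not a valid UTF-8 sequence

results = {
    mode: raw.decode("utf-8", errors=mode)
    for mode in ("ignore", "replace", "backslashreplace")
}
for mode, value in results.items():
    print(f"{mode:>16}: {value!r}")

# 'strict' is the default and raises instead of guessing.
try:
    raw.decode("utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
```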

Composites

Trick


>>> s1 = "Löwe"
>>> s2 = "Löwe"
>>> s1 == s2  # ???
	    

>>> "Löwe" == "Löwe"
False
>>> len(s1)
4
>>> len(s2)
5
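
Because the two literals render identically, the trick is easier to reproduce with explicit escapes. A sketch, assuming one string uses the precomposed letter and the other a base letter plus a combining mark:

```python
composed = "L\u00f6we"      # precomposed LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "Lo\u0308we"   # 'o' followed by COMBINING DIAERESIS

print(composed == decomposed)
print(len(composed), len(decomposed))
print([f"U+{ord(c):04X}" for c in decomposed])
```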
	    

Digging in


ö → 1 codepoint, 'ö'(\u00f6)
	    
vs

ö → 2 code points, 'o'(\u006f) + '̈'(\u0308)
	    

Normalization

unicodedata.normalize(form, s)

  • NFC (Composed)
  • NFKC
  • NFD (Decomposed)
  • NFKD

>>> import unicodedata
>>> len(unicodedata.normalize('NFKC', "Löwe"))
4
>>> len(unicodedata.normalize('NFKD', "Löwe"))
5
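
One way to make comparisons robust is to normalize both sides first. A sketch; the helper name `same_text` is hypothetical, and the choice of NFC as the default form is an assumption rather than a rule.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "L\u00f6we"
decomposed = "Lo\u0308we"
print(composed == decomposed, same_text(composed, decomposed))

# The K (compatibility) forms also fold look-alikes such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfc = unicodedata.normalize("NFC", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)
```

NFC and NFD keep the ligature as is, while NFKC and NFKD replace it with the two letters `fi`, so the K forms can change what the text means and are better suited to search and matching than to storage.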
	    

Categories

Category


>>> unicodedata.category(' ')
'Zs'   # (Z) separator, (s)pace
>>> unicodedata.category('1')
'Nd'   # (N)umber, (d)ecimal digit
>>> unicodedata.category('A')
'Lu'   # (L)etter, (u)ppercase
	    

Code Point Names


>>> unicodedata.name(' ')
'SPACE'
>>> unicodedata.name('A')
'LATIN CAPITAL LETTER A'
>>> '\N{SPACE}'
' '
>>> '\N{LATIN CAPITAL LETTER A}'
'A'
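
The reverse direction is available at run time as well. A short sketch with `unicodedata.lookup`, plus the default argument of `unicodedata.name` for code points that have no name:

```python
import unicodedata

# Name -> character, the run-time counterpart of the \N{...} escape.
o_umlaut = unicodedata.lookup("LATIN SMALL LETTER O WITH DIAERESIS")
print(repr(o_umlaut), unicodedata.name(o_umlaut))

# Control characters have no name; without a default, name() raises ValueError.
newline_name = unicodedata.name("\n", "<unnamed>")
print(newline_name)
```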
	    

Funny numbers


>>> import sys
>>> sum(unicodedata.category(chr(i)) == 'Zs'
...     for i in range(sys.maxunicode + 1))
17
>>> sum(unicodedata.category(chr(i)).startswith('P')
...     for i in range(sys.maxunicode + 1))
748
	    
s.split(' ,.;&!*')

Spaces

  • 'SPACE'
  • 'NO-BREAK SPACE'
  • 'OGHAM SPACE MARK'
  • 'EN QUAD'
  • 'EM QUAD'
  • 'EN SPACE'
  • 'EM SPACE'
  • 'THREE-PER-EM SPACE'
  • 'FOUR-PER-EM SPACE'
  • 'SIX-PER-EM SPACE'
  • 'FIGURE SPACE'
  • 'PUNCTUATION SPACE'
  • 'THIN SPACE'
  • 'HAIR SPACE'
  • 'NARROW NO-BREAK SPACE'
  • 'MEDIUM MATHEMATICAL SPACE'
  • 'IDEOGRAPHIC SPACE'
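
A list like the one above can be regenerated from the interpreter's own Unicode database. A sketch; the exact result depends on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata

# Collect the names of every code point in category Zs (space separator).
space_names = [
    unicodedata.name(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]
print("\n".join(space_names))

# str.split() with no argument already treats these as whitespace.
parts = "a\u00a0b\u2003c".split()
```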

Regular expressions

regex library

pip install regex
import regex
  • \p{property=value}
  • \P{property=value}
  • \p{value}
  • \P{value}

Example


>>> p = regex.compile(r'\p{Zs}+')
>>> p.split('a   b  c')
['a', 'b', 'c']

>>> regex.compile(r'[\p{Z}\p{P}]+')
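
When installing `regex` is not an option, a similar split can be approximated with the standard library only. A sketch; the helper name `split_words` is hypothetical, and it is slower than a compiled pattern.

```python
import unicodedata


def split_words(text: str) -> list[str]:
    """Split on any run of separator (Z*) or punctuation (P*) characters."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "ZP":
            if current:                 # close the word collected so far
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("a\u00a0b, c\u2003d!"))
```

As with `\p{Z}`, tabs and newlines are not covered, because they belong to the control category `Cc` rather than to a separator category.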
	    

URL encoding

Percent encoding

Encode all non-safe symbols

'т' → '%D1%82'

'к' → '%D0%BA'


>>> from yarl import URL
>>> URL('https://wikipedia.org/текст')
URL('https://wikipedia.org/%D1%82%D0%B5%D0%BA%D1%81%D1%82')
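
The same percent-encoding is available in the standard library through `urllib.parse`. A sketch of the round trip, shown as an alternative to yarl rather than a description of what yarl does internally:

```python
from urllib.parse import quote, unquote

# Each non-safe character is UTF-8 encoded, then every byte becomes %XX.
path = quote("/текст")
print(path)

single = quote("т")
restored = unquote(path)
```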
	    

IDNA


>>> URL('https://雜草工作室.香港')
URL('https://xn--2qq421aovb6v1e3pu.xn--j6w193g')
	    
  • U-Label → 雜草工作室.香港
  • A-Label → xn--2qq421aovb6v1e3pu.xn--j6w193g
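
Python ships an `idna` codec that converts between the two label forms. A sketch with a different, well-known host name; note that the built-in codec implements the older IDNA 2003 rules, while the third-party `idna` package implements IDNA 2008.

```python
host = "b\u00fccher.example"   # U-Label form

a_label = host.encode("idna")  # each non-ASCII label becomes xn--<punycode>
print(a_label)

u_label = a_label.decode("idna")
```

Labels that are already ASCII, such as `example` here, pass through unchanged.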

Questions?
