Andrew Svetlov
@andrew_svetlov
andrew.svetlov@gmail.com
https://asvetlov.github.io/pycon-minsk-2018/
Codecs are not compatible
Not a binary representation
UTF is Unicode Transformation Format
Universal? Really? ☠ WTF?
U+1F872 (🡲) to UTF-16 → D83E + DC72
off = ord(ch) - 0x10000
high = (off // 0x400) + 0xD800
low = (off % 0x400) + 0xDC00
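The three lines above can be wrapped into a small function and checked against Python's own UTF-16 codec. This is a minimal sketch; the name `to_surrogates` is made up for illustration.

```python
def to_surrogates(ch: str) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("code points up to U+FFFF need no surrogate pair")
    off = cp - 0x10000
    high = (off // 0x400) + 0xD800   # top 10 bits of the offset
    low = (off % 0x400) + 0xDC00     # bottom 10 bits of the offset
    return high, low


high, low = to_surrogates("\U0001F872")
print(f"{high:04X} {low:04X}")  # D83E DC72

# Cross-check with the built-in codec (big-endian, no BOM).
assert "\U0001F872".encode("utf-16-be") == bytes.fromhex("D83EDC72")
```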
A magic number at the start of text
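Assuming the "magic number" here is the byte order mark (BOM), the sketch below shows how Python's standard codecs write and strip it. Only stdlib codec names and constants are used.

```python
import codecs

text = "hi"

# 'utf-8-sig' writes the 3-byte BOM EF BB BF before the payload.
data = text.encode("utf-8-sig")
print(data)                              # b'\xef\xbb\xbfhi'
print(data.startswith(codecs.BOM_UTF8))  # True

# Plain 'utf-8' keeps the BOM as U+FEFF; 'utf-8-sig' strips it.
print(repr(data.decode("utf-8")))        # '\ufeffhi'
print(repr(data.decode("utf-8-sig")))    # 'hi'

# 'utf-16' uses the platform byte order and prepends the matching BOM.
bom = text.encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
```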
b.decode('utf-8', errors='strict')
Modes:
https://docs.python.org/3/library/codecs.html#codec-base-classes
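A short sketch of how some of the standard error handlers behave on the same invalid input. The sample bytes are an illustrative assumption: 'Löwe' encoded as Latin-1 and then wrongly decoded as UTF-8.

```python
raw = b"L\xf6we"  # Latin-1 bytes; 0xF6 is not a valid UTF-8 start byte

# 'strict' (the default) raises on the first bad byte.
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

print(raw.decode("utf-8", errors="ignore"))            # Lwe
print(raw.decode("utf-8", errors="replace"))           # L\ufffdwe (U+FFFD marks the bad byte)
print(raw.decode("utf-8", errors="backslashreplace"))  # L\xf6we

# 'surrogateescape' smuggles the bad byte through so it can be written back.
escaped = raw.decode("utf-8", errors="surrogateescape")
print(escaped.encode("utf-8", errors="surrogateescape") == raw)  # True
```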
>>> s1 = "L\u00f6we"   # 'ö' as a single code point
>>> s2 = "Lo\u0308we"  # 'o' followed by a combining diaeresis
>>> s1 == s2 # ???
>>> "Löwe" == "Löwe"
False
>>> len(s1)
4
>>> len(s2)
5
ö → 1 code point, 'ö'(\u00f6)
vs
ö → 2 code points, 'o'(\u006f) + '̈'(\u0308)
unicodedata.normalize(form, s)
>>> import unicodedata
>>> len(unicodedata.normalize('NFKC', "Löwe"))
4
>>> len(unicodedata.normalize('NFKD', "Löwe"))
5
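One way to compare such strings is to normalize both sides to the same form first. This is a sketch only: the helper name `same_text` is hypothetical, and NFC is just one reasonable default.

```python
import unicodedata

composed = "L\u00f6we"     # precomposed 'ö'
decomposed = "Lo\u0308we"  # 'o' + combining diaeresis


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                         # False
print(same_text(composed, decomposed))                # True
print(len(unicodedata.normalize("NFC", decomposed)))  # 4
print(len(unicodedata.normalize("NFD", composed)))    # 5
```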
>>> unicodedata.category(' ')
'Zs' # (Z)separator (s)pace
>>> unicodedata.category('1')
'Nd' # (N)umber, (d)ecimal digit
>>> unicodedata.category('A')
'Lu' # (L)etter, (u)ppercase
>>> unicodedata.name(' ')
'SPACE'
>>> unicodedata.name('A')
'LATIN CAPITAL LETTER A'
>>> '\N{SPACE}'
' '
>>> '\N{LATIN CAPITAL LETTER A}'
'A'
>>> import sys
>>> sum(unicodedata.category(chr(i)) == 'Zs'
...     for i in range(sys.maxunicode + 1))
17
>>> sum(unicodedata.category(chr(i)).startswith('P')
...     for i in range(sys.maxunicode + 1))
748
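To see which characters are behind the `Zs` count, they can be listed by name. A minimal sketch; the exact set depends on the Unicode version bundled with the interpreter, so no output is shown.

```python
import sys
import unicodedata

# Collect every code point whose general category is Zs (space separator).
spaces = [
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == "Zs"
]

for ch in spaces:
    # The second argument is a fallback in case a character has no name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```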
s.split(' ,.;&!*')  # splits only on that exact 7-character string
pip install regex
import regex
>>> p = regex.compile(r'\p{Zs}+')
>>> p.split('a b c')
['a', 'b', 'c']
>>> regex.compile(r'[\p{Z}\p{P}]+')
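If the third-party `regex` package is not an option, a rough stdlib-only approximation of `[\p{Z}\p{P}]+` can be built on `unicodedata.category`. This is a different technique from the slides, not a drop-in replacement, and the helper names `is_delimiter` and `split_words` are invented for this sketch.

```python
import unicodedata
from itertools import groupby


def is_delimiter(ch: str) -> bool:
    # Major class Z (separators) or P (punctuation), roughly [\p{Z}\p{P}].
    return unicodedata.category(ch)[0] in "ZP"


def split_words(text: str) -> list[str]:
    # Group consecutive characters by delimiter/non-delimiter, keep the words.
    return [
        "".join(group)
        for is_delim, group in groupby(text, key=is_delimiter)
        if not is_delim
    ]


# NO-BREAK SPACE, comma + space, IDEOGRAPHIC SPACE and '!' all act as delimiters.
print(split_words("a\u00a0b, c\u3000d!"))  # ['a', 'b', 'c', 'd']
```

Unlike `regex.split`, this variant never yields empty strings at the edges of the input.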
Encode all non-safe symbols
'т' → '%D1%82'
'к' → '%D0%BA'
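The same percent-encoding can be reproduced with the standard library: `urllib.parse.quote` encodes the text as UTF-8 and escapes every byte outside the safe set. A minimal sketch:

```python
from urllib.parse import quote, unquote

print(quote("т"))  # %D1%82
print(quote("к"))  # %D0%BA

# '/' is in the default safe set, so path separators survive.
path = quote("/текст")
print(path)  # /%D1%82%D0%B5%D0%BA%D1%81%D1%82

# unquote reverses the transformation.
print(unquote(path))  # /текст
```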
>>> from yarl import URL
>>> URL('https://wikipedia.org/текст')
URL('https://wikipedia.org/%D1%82%D0%B5%D0%BA%D1%81%D1%82')
>>> URL('https://雜草工作室.香港')
URL('https://xn--2qq421aovb6v1e3pu.xn--j6w193g')
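Host names are not percent-encoded; each non-ASCII label becomes a Punycode `xn--` label. The sketch below uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules and may differ from what `yarl` does internally. The host `bücher.example` is a made-up example.

```python
host = "bücher.example"  # hypothetical host name, for illustration only

# Encode each label to its ASCII-compatible form.
ascii_host = host.encode("idna")
print(ascii_host)  # b'xn--bcher-kva.example'

# Decoding restores the Unicode form.
print(ascii_host.decode("idna"))  # bücher.example
```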