1 of 30

Character Encoding

and You�

Dr. Rachael Tatman

@rctatman

2 of 30

“I'm working on a new Django site, and, after migrating in a pile of data, have started running into a deeply frustrating DjangoUnicodeDecodeError. The bad character in question is a \xe8 (e-grave).”
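
A minimal sketch of how an error like the one quoted above can arise. The post does not say how the data was encoded, so this assumes it was written as Latin-1 and then read as UTF-8; the sample bytes are made up for illustration.

```python
# Assumption: the migrated data is Latin-1, where the single byte 0xE8 is "è".
# On its own, 0xE8 is not a complete UTF-8 sequence.
raw = b"caff\xe8"

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as err:
    decoded_as_utf8 = False
    print(f"UTF-8 decode failed: {err.reason}")

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
print(text)
```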

3 of 30

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting, encode and decode iso8859-1, utf8, I haven't been able to unmunge and get the original filename.

  • Stack Overflow user wim
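
A hedged sketch of the kind of round trip that can undo such a filename. It assumes, as a guess from the look of the characters, that Shift_JIS bytes were displayed using code page 437; only a short fragment of the filename is used, and other fragments may need a different code page.

```python
# Assumption: Shift_JIS bytes were shown using code page 437.
garbled = "âAâjâüâpâX"  # fragment copied from the filename above

raw = garbled.encode("cp437")        # undo the wrong decoding step
recovered = raw.decode("shift_jis")  # decode with the assumed real encoding
print(recovered)
```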

4 of 30

UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)

�U�C���A�ӻP���͡A�n��h�f�A�ƨ��i�G�F�j����ť���y�A�_��³��A�פ

ºÚ

"Mojibake"

\mo.dʒi.ba.ke\
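
Both failure modes above can be reproduced in a few lines. This is an illustrative sketch: the sample string is an assumption, not the text shown on the slide.

```python
# Encoding a character outside ASCII with the ASCII codec fails outright.
try:
    "\ua000".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the wrong codec often "succeeds" but yields mojibake.
original = "文字化け"  # "mojibake" written in Japanese
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))
```

The error is the friendlier outcome: mojibake passes silently and can spread through a dataset before anyone notices.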

6 of 30

  • What are character encodings?
  • When am I likely to run into trouble?
  • How can I get myself out of trouble?

8 of 30

01100001

9 of 30

01100001

ASCII

a

ISO/IEC 8859-1

a

Windows-1251

a

Big5

a

Shift_JIS

a

UTF-8

a
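
A small sketch that checks the slide's claim with Python's built-in codecs. The codec names are Python's spellings of the encodings listed above.

```python
byte = bytes([0b01100001])  # the bit pattern from the slide, 0x61

encodings = ["ascii", "iso8859-1", "cp1251", "big5", "shift_jis", "utf-8"]
decoded = {name: byte.decode(name) for name in encodings}

for name, char in decoded.items():
    print(f"{name:10} {char}")
```

All six agree because each of these encodings keeps the ASCII letters at the same single-byte values.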

10 of 30

11000011

10101001

ASCII

é

ISO/IEC 8859-1

Ã©

Windows-1251

Г©

Big5

Shift_JIS

テゥ

UTF-8

é
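
The same two bytes can be decoded under each codec to see where the renderings diverge. This is a sketch using Python's built-in codecs. Strict ASCII only defines bytes 0 to 127, so Python's ascii codec treats both bytes as undecodable, and `errors="replace"` substitutes U+FFFD for them.

```python
raw = bytes([0b11000011, 0b10101001])  # 0xC3 0xA9

for name in ["ascii", "iso8859-1", "cp1251", "big5", "shift_jis", "utf-8"]:
    # errors="replace" substitutes U+FFFD for bytes the codec cannot decode
    print(f"{name:10} {raw.decode(name, errors='replace')!r}")

as_latin1 = raw.decode("iso8859-1")
as_cp1251 = raw.decode("cp1251")
as_utf8 = raw.decode("utf-8")
```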

11 of 30

11001010

10010010

ASCII

�

ISO/IEC 8859-1

�

Windows-1251

К’

Big5

Shift_JIS

ハ�

UTF-8

ʒ

12 of 30

11100010 10000010 10101100

ASCII

�

ISO/IEC 8859-1

�

Windows-1251

в‚¬

Big5

��

Shift_JIS

竄ャ

UTF-8

€
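
The byte patterns on the last few slides follow from UTF-8 being a variable-width encoding. The sketch below prints how many bytes each of the example characters takes.

```python
for char in ["a", "é", "ʒ", "€"]:
    encoded = char.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"{char}  {len(encoded)} byte(s)  {bits}")

lengths = [len(c.encode("utf-8")) for c in ["a", "é", "ʒ", "€"]]
euro_bits = " ".join(f"{b:08b}" for b in "€".encode("utf-8"))
```

A decoder that assumes one byte per character splits each multi-byte sequence into several wrong characters, which is the pattern in the non-UTF-8 rows.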

13 of 30

Please use UTF-8.

  • Supports many languages/sets of symbols
  • ASCII is a valid subset of UTF-8
  • >90% of text on the web is in UTF-8
  • UTF-8 is the default Python source encoding (PEP 3120)
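
The "ASCII is a valid subset" point can be checked directly. This sketch compares the two codecs over every ASCII code point.

```python
ascii_bytes = bytes(range(128))  # every ASCII code point, 0 through 127

# Decoding pure-ASCII bytes gives the same text under both codecs...
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
# ...and encoding pure-ASCII text gives the same bytes.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print(same_text, same_bytes)
```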

???

15 of 30

01100001

ASCII

a

UTF-8

a

UTF-16

UTF-32

16 of 30

01100011

01100001

01110100

ASCII

cat

UTF-8

cat

UTF-16

捡�

UTF-32
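
A sketch of why the three "cat" bytes do not survive as UTF-16 or UTF-32. The big-endian byte order is an assumption made to match the character shown above.

```python
word = "cat"
utf8 = word.encode("utf-8")       # 3 bytes: 63 61 74
utf16 = word.encode("utf-16-be")  # 6 bytes: 00 63 00 61 00 74
utf32 = word.encode("utf-32-be")  # 12 bytes, four per character

# Misreading the first two UTF-8 bytes as one big-endian UTF-16 code unit
# gives U+6361; the leftover third byte cannot form a code unit by itself.
misread = utf8.decode("utf-16-be", errors="replace")
print(len(utf8), len(utf16), len(utf32), repr(misread))
```

UTF-32 needs four bytes per character, so three bytes are not enough for even one.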

17 of 30

00000000

00000001

11110110

10011111

ASCII

<control>

<control>

öŸ

UTF-8

<control>

UTF-16

Ā

<in private use area>

UTF-32

18 of 30

Please use UTF-8.

Once more,

It's great!

19 of 30

  • What are character encodings?
  • When am I likely to run into trouble?
  • How can I get myself out of trouble?

20 of 30

Potential ⚠Danger Zones⚠:

  • Old datasets/datasets produced by legacy systems
  • Non-English text, especially:
    • Languages with a bunch of d̤ɨa͋c̹r͓i̊tic̟ʂ (Italian, Norwegian, Yoruba, etc.)
    • Languages spoken in the former Soviet Union and Eastern Bloc (Russian, Bulgarian, etc.)
    • Mongolian
    • CJK (Chinese, Japanese, Korean)
    • Scripts only recently added to Unicode (e.g. Gondi, Nüshu)
  • Files that have been converted from/to proprietary file formats (e.g. .xls, .doc)

21 of 30

  • What are character encodings?
  • When am I likely to run into trouble?
  • How can I get myself out of trouble?

22 of 30

Follow Best Practices

  1. Use UTF-8 (both for data & code)
  2. Make sure to l👀k at your input/output files
  3. Try to stick to Python 3
    1. Python 3 str type defaults to Unicode (yay!)
    2. Python 2 str type defaults to ASCII (boo!)

Python 2?

No, thank you!

  4. If using anything other than UTF-8 for data:
    1. Convert it into UTF-8 as soon as you read it in
    2. Convert it back when you write it out
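
The last practice can be sketched with Python 3 file handling, where naming the encoding in `open()` does the conversion at the boundary. The file names, sample text, and the choice of Windows-1252 as the legacy encoding are all assumptions for illustration.

```python
import os
import tempfile

text = "Déjà vu, señor"

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")  # hypothetical file names
    out_path = os.path.join(tmp, "legacy_out.txt")

    # Stand-in for a file produced by a legacy system in Windows-1252.
    with open(legacy_path, "w", encoding="cp1252") as f:
        f.write(text)

    # Name the source encoding when reading; the result is a Unicode str.
    with open(legacy_path, "r", encoding="cp1252") as f:
        loaded = f.read()

    # Work in Unicode, then convert back only when writing out.
    with open(out_path, "w", encoding="cp1252") as f:
        f.write(loaded.upper())

    with open(out_path, "rb") as f:
        written = f.read()
```

Passing `encoding=` explicitly avoids depending on the platform default, which can differ between machines.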

23 of 30

Follow along with the code examples!

https://www.kaggle.com/rtatman/character-encodings-tips-tricks/

24 of 30

Convert to Unicode

Convert Back
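
A minimal sketch of the two directions, not necessarily the code shown in the talk. `str.encode` turns text into bytes and `bytes.decode` turns bytes back into text; the sample sentence is an assumption.

```python
before = "This is the euro symbol: €"

# Text -> bytes
as_bytes = before.encode("utf-8")
# Bytes -> text
round_trip = as_bytes.decode("utf-8")

# A codec that lacks the character loses it when errors="replace" is used.
lossy = before.encode("ascii", errors="replace").decode("ascii")
print(round_trip)
print(lossy)
```

The UTF-8 round trip is lossless, while the ASCII one replaces the euro sign with a question mark.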

25 of 30

https://www.kaggle.com/rtatman/character-encodings-tips-tricks/

27 of 30

Double Check Your Encodings

https://www.kaggle.com/rtatman/character-encodings-tips-tricks/
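
One standard-library way to double check an encoding is to try strict decoding with a short list of candidates. This is a sketch under assumptions: the helper name and the candidate list are hypothetical, and a statistical detector such as the third-party chardet package is another option.

```python
def first_working_encoding(raw, candidates=("utf-8", "shift_jis", "cp1251")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

sample = "文字化け".encode("shift_jis")
guess = first_working_encoding(sample)
print(guess)
```

A successful decode is evidence, not proof: single-byte codecs accept almost any input, so order the candidates from strictest to most permissive and still look at the decoded text.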

28 of 30

Ungarble Your Unicode

https://www.kaggle.com/rtatman/character-encodings-tips-tricks/
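
When the wrong codec is known, mojibake can often be reversed by re-encoding with that codec and decoding with the right one. The sketch below assumes UTF-8 text that was mistakenly decoded as Latin-1. Libraries such as the third-party ftfy package automate this kind of repair for cases where the wrong codec is unknown.

```python
garbled = "Ã©tÃ©"  # UTF-8 bytes for "été" that were decoded as Latin-1

# Re-encode with the codec that was wrongly applied, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

This only works if no bytes were dropped or replaced along the way; once a character has become "?" or U+FFFD, the original byte is gone.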

29 of 30

  • What are character encodings?
  • When am I likely to run into trouble?
  • How can I get myself out of trouble?

30 of 30

Thanks! Questions?

Further reading:

Code: https://www.kaggle.com/rtatman/character-encodings-tips-tricks/
