Character Encoding
and You�
Dr. Rachael Tatman
@rctatman
“I'm working on a new Django site, and, after migrating in a pile of data, have started running into a deeply frustrating DjangoUnicodeDecodeError. The bad character in question is a \xe8 (e-grave).”
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting, encode and decode iso8859-1, utf8, I haven't been able to unmunge and get the original filename.
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
�U�C���A�ӻP���͡A�n��h�f�A�ƨ��i�G�F�j����ť���y�A�_��³��A�פ
ºÚ
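The UnicodeEncodeError quoted on this slide can be reproduced in Python 3 with a one-character string. This is a minimal sketch; the variable names are illustrative only.

```python
# U+A000 is outside the 128-character ASCII range.
text = "\ua000"

try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    # Same wording as the error shown on the slide.
    message = str(err)
    print(message)

# Encoding the same string as UTF-8 succeeds and produces three bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

The string itself is fine. The failure comes from asking a 7-bit codec to represent a character it has no byte for.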
"Mojibake"
\mo.dʒi.ba.ke\
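Mojibake appears when bytes written in one encoding are read back in another. A small sketch, assuming text stored as Shift_JIS and then misread as ISO 8859-1 (the pairing is chosen for illustration, not taken from the filename example earlier):

```python
original = "文字化け"  # the word "mojibake" in Japanese

# Written to disk in one encoding...
raw = original.encode("shift_jis")

# ...then read back with the wrong one. latin-1 maps every byte to some
# character, so no error is raised; the result is just nonsense.
garbled = raw.decode("latin-1")
print(repr(garbled))

# Because latin-1 is lossless byte-for-byte, the damage is reversible
# as long as the garbled text has not been altered further.
recovered = garbled.encode("latin-1").decode("shift_jis")
print(recovered)
```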
01100001
ASCII | a |
ISO/IEC 8859-1 | a |
Windows-1251 | a |
Big5 | a |
Shift_JIS | a |
UTF-8 | a |
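The table above can be checked directly: the byte 01100001 sits in the ASCII range, which all six encodings share. A short sketch using Python's codec names for the encodings listed:

```python
byte = bytes([0b01100001])

# Python codec names corresponding to the rows of the table.
encodings = ["ascii", "latin-1", "cp1251", "big5", "shift_jis", "utf-8"]

decoded = {enc: byte.decode(enc) for enc in encodings}
for enc, char in decoded.items():
    print(f"{enc:10} {char}")
```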
11000011
10101001
ASCII | Ã© |
ISO/IEC 8859-1 | Ã© |
Windows-1251 | Г© |
Big5 | 矇 |
Shift_JIS | テゥ |
UTF-8 | é |
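Once a byte has its high bit set, the encodings stop agreeing. The sketch below decodes the two bytes from this slide under each codec. Note that Python's `ascii` codec is strict 7-bit, so with `errors="replace"` it yields replacement characters; tools that display a letter for such bytes under "ASCII" are usually applying an 8-bit extension such as ISO 8859-1 or Windows-1252.

```python
data = bytes([0b11000011, 0b10101001])

encodings = ["ascii", "latin-1", "cp1251", "big5", "shift_jis", "utf-8"]

# errors="replace" substitutes U+FFFD for anything a codec cannot decode,
# so every row prints instead of raising.
results = {enc: data.decode(enc, errors="replace") for enc in encodings}
for enc, text in results.items():
    print(f"{enc:10} {text!r}")
```

Only the UTF-8 row gives the intended single character; every other row is a different flavor of mojibake.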
11001010
10010010
ASCII | � |
ISO/IEC 8859-1 | � |
Windows-1251 | К’ |
Big5 | � |
Shift_JIS | ハ� |
UTF-8 | ʒ |
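To see why UTF-8 reads these two bytes as one character, the bit layout of a two-byte sequence (110xxxxx 10yyyyyy) can be unpacked by hand. A minimal sketch of that arithmetic, checked against Python's own decoder:

```python
b1, b2 = 0b11001010, 0b10010010

# A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy.
is_two_byte_lead = (b1 >> 5) == 0b110
is_continuation = (b2 >> 6) == 0b10

# Drop the marker bits and concatenate the payload bits: xxxxx yyyyyy.
code_point = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)
char = chr(code_point)

print(f"U+{code_point:04X} {char}")
```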
11100010
10000010
10101100
ASCII | � |
ISO/IEC 8859-1 | � |
Windows-1251 | в‚¬ |
Big5 | �� |
Shift_JIS | 竄ャ |
UTF-8 | € |
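Going the other direction, encoding the euro sign as UTF-8 yields the three bytes on this slide, and reading those bytes as Windows-1251 shows one character per byte. A short sketch:

```python
euro = "€"
encoded = euro.encode("utf-8")

# Show each byte as eight bits, matching the slide's notation.
bits = [format(b, "08b") for b in encoded]
print(bits)

# A single-byte encoding such as Windows-1251 sees three unrelated characters.
cp1251_view = encoded.decode("cp1251")
print(cp1251_view)
```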
Please use UTF-8.
???
01100001
ASCII | a |
UTF-8 | a |
UTF-16 | � |
UTF-32 | � |
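The single byte fails under UTF-16 and UTF-32 because those encodings use fixed units of two and four bytes. A sketch comparing the encoded sizes of the same character (little-endian variants are chosen here only to avoid a byte-order mark in the output):

```python
char = "a"

sizes = {
    "utf-8": len(char.encode("utf-8")),
    "utf-16-le": len(char.encode("utf-16-le")),
    "utf-32-le": len(char.encode("utf-32-le")),
}
print(sizes)

# One byte is less than a full UTF-16 code unit, so decoding cannot succeed;
# errors="replace" makes the failure visible instead of raising.
lone_byte = bytes([0b01100001])
as_utf16 = lone_byte.decode("utf-16-le", errors="replace")
print(repr(as_utf16))
```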
01100011
01100001
01110100
ASCII | cat |
UTF-8 | cat |
UTF-16 | 捡� |
UTF-32 | � |
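With three bytes, UTF-16 can pair up the first two into one unrelated character and is left with a stray byte, while UTF-32 still lacks a full four-byte unit. A sketch, assuming big-endian byte order (which matches the character shown in the UTF-16 row):

```python
data = b"cat"  # bytes 0x63 0x61 0x74

# 0x63 0x61 is read as the single code unit U+6361; 0x74 is left over.
as_utf16 = data.decode("utf-16-be", errors="replace")

# Three bytes cannot form even one 32-bit code unit.
as_utf32 = data.decode("utf-32-be", errors="replace")

print(repr(as_utf16))
print(repr(as_utf32))
```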
00000000
00000001
11110110
10011111
ASCII | <control> <control> öŸ |
UTF-8 | <control> 柀 |
UTF-16 | Ā <in private use area> |
UTF-32 | 🚟 |
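These four bytes form exactly one UTF-32 code unit, so this time UTF-32 is the encoding that produces a single sensible character. A sketch that decodes them several ways, assuming big-endian byte order for UTF-16 and UTF-32 and Windows-1252 as the 8-bit reading; other byte orders would give different rows.

```python
data = bytes([0b00000000, 0b00000001, 0b11110110, 0b10011111])

as_utf32 = data.decode("utf-32-be")                  # one code point
as_utf16 = data.decode("utf-16-be")                  # two code units
as_cp1252 = data.decode("cp1252")                    # one character per byte
as_utf8 = data.decode("utf-8", errors="replace")     # last two bytes are invalid

print(f"utf-32: U+{ord(as_utf32):04X}")
print(f"utf-16: {[f'U+{ord(c):04X}' for c in as_utf16]}")
print(f"cp1252: {as_cp1252!r}")
print(f"utf-8:  {as_utf8!r}")
```

In the UTF-16 reading, the second code unit falls in the Private Use Area (U+E000 to U+F8FF), which is why no glyph is defined for it.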
Once more,
Please use UTF-8.
It's great!
Potential ⚠Danger Zones⚠:
Follow Best Practices
Python 2?
No, thank you!
Follow along with the code examples!
https://www.kaggle.com/rtatman/character-encodings-tips-tricks/
Convert to Unicode
Convert Back
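One common way to apply these two steps in Python 3 is to decode bytes to `str` as soon as they arrive, work only with `str`, and encode again on the way out. This is a sketch under assumed inputs (Windows-1251 source data, UTF-8 output), not a transcription of the linked notebook.

```python
# Pretend these bytes came from a legacy file known to be Windows-1251.
incoming = "Привет".encode("cp1251")

# Convert to Unicode: bytes -> str, naming the source encoding explicitly.
text = incoming.decode("cp1251")

# Do all processing on str, never on raw bytes.
processed = text.upper()

# Convert back: str -> bytes, writing UTF-8 regardless of the input encoding.
outgoing = processed.encode("utf-8")
print(outgoing)
```

Naming the encoding at both boundaries avoids relying on platform defaults, which differ between systems.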
Double Check Your Encodings
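Dedicated detection libraries exist for this step. As a standard-library-only sketch, the hypothetical helper `plausible_encodings` below simply tries a list of candidates and reports which ones decode without error. Passing is weak evidence: single-byte encodings such as Latin-1 accept any byte sequence, so a human still needs to look at the decoded text.

```python
def plausible_encodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings that decode `data` without error.

    Hypothetical helper for illustration; success does not prove the
    encoding is the right one.
    """
    matches = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        matches.append(enc)
    return matches


sample = "naïve café".encode("utf-8")
matches = plausible_encodings(sample, ["ascii", "utf-8", "latin-1"])
print(matches)
```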
Ungarble Your Unicode
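When the wrong decoding step is known or can be guessed, mojibake can often be reversed by undoing it: re-encode with the wrong codec, then decode with the right one. The hypothetical helper `ungarble` sketches this for the common case of UTF-8 misread as Latin-1. Real-world repair tools handle many more cases than this.

```python
def ungarble(text: str, wrong: str = "latin-1", right: str = "utf-8") -> str:
    """Undo one layer of mojibake, or return the text unchanged.

    Hypothetical helper for illustration. If the round trip fails, the
    text was probably not garbled in this particular way.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed = ungarble("cafÃ©")       # UTF-8 bytes that were read as Latin-1
untouched = ungarble("café")    # already correct, so left alone
print(fixed, untouched)
```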
Thanks! Questions?
Further reading:
Code: https://www.kaggle.com/rtatman/character-encodings-tips-tricks/