This document is basically about dealing with strings that have non-English characters in them like Ǣ.  If you’re lucky, you may never have issues with this.  But, various people have run into this in various ways so I figured putting together a brief explanation might be handy.

So Jython has two different kinds of “strings”.  One kind is called 8-bit strings and the other is called unicode strings.  

8-Bit Strings

An 8-bit string (at its core) is basically just an array of numbers.  So "ABC" is in some sense very similar to [65, 66, 67].  When "ABC" gets sent to the server, what gets sent is actually those numbers.  The largest any individual one of those numbers can be is 255, because each one is stored in a single 8-bit chunk (a byte), which is the unit of data everybody has agreed to work with.
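To see those numbers for yourself, here is a minimal sketch.  It assumes a Jython 2.7 style interpreter, where the b prefix (which just marks an 8-bit string explicitly) is available; the variable names are only for illustration.

```python
# An 8-bit string is a sequence of numbers between 0 and 255.
data = b"ABC"                    # b"..." marks an 8-bit string explicitly
numbers = list(bytearray(data))  # view the same data as plain numbers
print(numbers)                   # [65, 66, 67]
```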

But how to interpret those numbers?  Well, by convention people use the ASCII encoding, which basically says that 65='A'.  As long as you're using ordinary English characters, that's good enough.  But if you need more...well, things get a little more complex.  One wrinkle is that many languages have more than 255 symbols, so you can't have the simple "one 8-bit chunk = one letter" rule anymore.  So there are many "encodings", or ways of converting numerical data to language glyphs; utf-8 (part of the unicode standard) is one such encoding.  Most of them are ascii compatible, so "ABC" = [65, 66, 67] in utf-8 and many of these other systems.  They use the fact that ascii only gives a meaning to the numbers 0 through 127, which leaves 128 through 255 free for other purposes.  So in utf-8, "ABßC" looks like [65, 66, 195, 159, 67].  The 195 signals the start of a non-ascii character, and together with the 159 that follows it indicates which character it is.

The problem is that by looking at just the [65, 66, 195, 159, 67], there is no way to know what encoding is being used.  [65, 66, 195, 159, 67] could well have a different meaning in some other encoding (in latin-1, for example, it reads as "ABÃ", then an invisible control character, then "C").  And there is no single standard way of representing what encoding is being used.  This is the problem with 8-bit strings...they can store any kind of string data, but unless all the data happens to be ASCII there is no way to be sure what the characters mean.
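Here is a small sketch of that ambiguity.  The byte values 65, 66, 195, 159, 67 are the utf-8 encoding of "ABßC", but nothing stops you from reading the very same numbers as latin-1.  The variable names are made up for the example, and the b prefix assumes a Jython 2.7 style interpreter.

```python
# The same bytes, interpreted under two different encodings.
raw = b"AB\xc3\x9fC"               # the numbers 65, 66, 195, 159, 67

as_utf8 = raw.decode("utf-8")      # 195 and 159 together mean one character
as_latin1 = raw.decode("latin-1")  # 195 and 159 are each their own character

print(len(as_utf8))                # 4
print(len(as_latin1))              # 5
print(as_utf8 == as_latin1)        # False
```

Neither call fails, which is exactly the trouble: both readings are "valid", and only outside knowledge tells you which one was intended.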

Unicode Strings

To fix this, python has introduced a new kind of string - the Unicode string.  Unicode strings have the same methods 8-bit strings have and often can be used interchangeably with 8-bit strings.  But, most importantly, they store actual characters rather than raw 8-bit numbers, so in effect they carry their encoding with them.  So with a unicode string, you know how to interpret all those funky non-ascii characters so they can be displayed correctly.

Don’t fall into the trap of thinking about “unicode strings” as simply strings with non-English characters.  A unicode string could be all English, but still have that encoding info carried around with it.  Conversely, you could have an 8-bit string that contains text with non-English characters in it.  In fact, that’s what causes problems most of the time: you have an 8-bit string with some funky characters in it, and python is not sure how to interpret those characters (i.e. what encoding to use) when it’s time to display.
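A quick sketch of that distinction, assuming a Jython 2.7 style interpreter (the u and b prefixes mark unicode and 8-bit strings explicitly; the variable names are just for illustration):

```python
english = u"Hello"                   # a unicode string that is pure English
accented = u"caf\xe9"                # a unicode string: c, a, f, e-acute
as_bytes = accented.encode("utf-8")  # the same word as an 8-bit string

print(len(accented))                 # 4 characters
print(len(as_bytes))                 # 5 numbers, since e-acute takes two in utf-8
print(english.encode("ascii") == b"Hello")  # True
```

So "unicode or 8-bit" and "English or non-English" are two separate questions.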

So Why Is There a Problem?

When things go well, all this string encoding magic happens behind the scenes and you never have to worry about it.  The problem is that sometimes this magic doesn't work, and you have to explicitly think about what kind of string you are using.

Here’s some things you might need to do:

Determine if a string is an 8-bit string or a unicode string

Sometimes, when you print out unicode strings they will print out with a u in front of them.  So the unicode string “Hello world” will look like u'Hello world'.  But not always.  So if you want to check, the easiest thing to do is just print out the type of the string.

print type(mysteryString)

If the result is <type 'unicode'> it’s unicode.  If it’s <type 'str'> it’s an 8-bit string.
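If you need to make that check inside a script rather than by eye, something like the following sketch can work.  The helper name describe_string is made up for this example, and the code assumes a Jython 2.7 style interpreter.

```python
# type(u"") is the unicode string type (named `unicode` in Jython).
UNICODE_TYPE = type(u"")

def describe_string(value):
    """Return 'unicode' or '8-bit' for a string (hypothetical helper)."""
    if isinstance(value, UNICODE_TYPE):
        return "unicode"
    return "8-bit"

print(describe_string(u"Hello world"))  # unicode
print(describe_string(b"Hello world"))  # 8-bit
```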

Convert an 8-bit string to a unicode string

If the string is ordinary ascii characters, this is easy as pie.  You use the function unicode to do the conversion.

myString = "Hello"  # myString is an 8 bit string

myUniString = unicode(myString)  # myUniString is the unicode version of that same string

That will work great, right up until you have a string with some non-ascii character in it.  Then it will fail with an error like this: 'ascii' codec can't decode byte 0xff.  What this error means is that you tried to convert something to unicode, but because the string wasn’t just standard ascii characters, Python wasn’t sure how to do the conversion.  Remember what I said before about there being multiple encodings?  So you have to tell python what encoding is being used.  The good news is that (unless you know otherwise or the situation is odd) the string is probably encoded in utf-8.  So if you tell python that explicitly, it will work:

myUniString = unicode(myString, "utf-8")

So if you've got some non-English data coming from a website or other source, try the line above to convert it from utf-8 encoded data into a unicode string.
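If you aren't sure the data really is utf-8, one possible approach is to try a short list of encodings in order.  This is only a sketch: the helper name to_unicode and the choice of fallback encodings are assumptions for the example, not anything Jython provides.  On an 8-bit string, raw.decode(encoding) does the same job as unicode(raw, encoding).

```python
def to_unicode(raw, encodings=("utf-8", "latin-1")):
    """Decode an 8-bit string, trying each encoding in turn (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError("none of the encodings fit: %r" % (encodings,))

utf8_bytes = b"caf\xc3\xa9"      # "café" encoded as utf-8
latin1_bytes = b"caf\xe9"        # "café" encoded as latin-1

print(to_unicode(utf8_bytes) == to_unicode(latin1_bytes))  # True
```

Keep in mind that latin-1 accepts any sequence of numbers, so as a last resort it never raises an error, but it can quietly give you the wrong characters if the data was really in some other encoding.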

Convert a unicode string to an 8-bit string

Going the other direction is pretty easy too.  You want to do this when you've found that some code is incompatible with unicode strings.  One thing that can work is the str function:

myString = str(myUniString)

But this will give you an error ('ascii' codec can't encode character) if you have non-English characters in there.  A better plan is to use the encode method.  Encode converts unicode strings into 8-bit strings, and you can be a bit more specific about how it does it:

myString = myUniString.encode('ascii','ignore') # skips any non-ascii characters

myString = myUniString.encode('utf-8') # leaves characters there using utf-8 encoding
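To compare what those options actually produce, here is a short sketch using a made-up sample word.  It also shows "replace", another standard error handler that encode accepts.  The b prefix in the comments assumes a Jython 2.7 style interpreter.

```python
text = u"caf\xe9"                # unicode string: c, a, f, e-acute

dropped = text.encode("ascii", "ignore")   # b"caf": the e-acute is silently lost
marked = text.encode("ascii", "replace")   # b"caf?": a visible placeholder instead
as_utf8 = text.encode("utf-8")             # b"caf\xc3\xa9": nothing is lost

try:
    text.encode("ascii")         # with no second argument, errors are fatal
    failed = False
except UnicodeEncodeError:
    failed = True                # the same failure str(myUniString) runs into

print(failed)                    # True
```

Which one you want depends on what the receiving code can cope with; utf-8 is the only one of the three that keeps all the information.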

Non-ASCII characters in Source Files

A similar problem can arise when there are non-ascii characters located in your source code.  By default, python expects files to be written in ascii.  If that is not true, the file needs to explicitly specify what encoding to use.

The usual source of this trouble is cutting and pasting code from other places...especially PowerPoint slides.  Often, PowerPoint has converted quotes to angled 'smart quotes'...those are not ascii compatible.  When you have this problem, you'll see an error like this:

SyntaxError: Non-ASCII character '\xc3' in file /tmp/p263.py

 on line 2, but no encoding declared; see

 http://www.python.org/peps/pep-0263.html for details
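If the offending character is hard to spot by eye, a small script can point it out.  The following is only a sketch: the helper name find_non_ascii and the sample data are made up for this example, and it assumes a Jython 2.7 style interpreter.

```python
def find_non_ascii(raw_source):
    """List (line, column, number) for each non-ascii byte (hypothetical helper)."""
    found = []
    for line_number, line in enumerate(raw_source.split(b"\n"), 1):
        for column, value in enumerate(bytearray(line), 1):
            if value > 127:      # ascii only covers 0 through 127
                found.append((line_number, column, value))
    return found

# Example: line 2 contains a pasted "smart quote" (three bytes in utf-8).
sample = b'x = 1\ny = \xe2\x80\x9chi"\n'
for line_number, column, value in find_non_ascii(sample):
    print("line %d, column %d: byte 0x%02x" % (line_number, column, value))
```

To check a real file, read it in binary mode first, for example with open(path, "rb").read(), and pass the result to the helper.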

If you know what you just pasted in, one solution is to just figure out what the non-ascii character is and delete it.  But you can also add a line to your file to specify the encoding.  It looks like this:

# -*- coding: latin-1 -*-

That line must be the first or second line of your python file.  I've specified the latin-1 encoding, because it's the encoding most often used by PowerPoint slides.  Depending on where you cut and paste from, you might have stuff in utf-8 instead:

# -*- coding: utf-8 -*-

Of course, you can set this even if you don't have problems cutting and pasting.  If you'd like to write comments or even strings in some non-English language, this line will let you do that.
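As a rough illustration, a file along these lines should load without the SyntaxError, assuming your editor really does save it as utf-8 (the variable name and text are just an example):

```python
# -*- coding: utf-8 -*-
# With the line above at the top of the file, non-ascii text is allowed
# in comments (Grüße!) and in string literals.
greeting = u"Grüße"              # five characters: G, r, ü, ß, e
print(len(greeting))             # 5
```

The declaration has to match how the file is actually saved; if the two disagree, the characters will come out wrong even though the error goes away.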

More Info

Everything I know about Python encoding issues, I learned from the doc:

http://docs.python.org/howto/unicode.html

It's not aimed at a beginner level, but it gets into a lot more detail if you need it.