Files
The final built-in object type of Python allows us to access files. The open() function creates a Python file object, which links to an external file. After a file is opened, you can read and write to it like normal.
Files in Python are different from the previous types I've covered. They are neither numbers, sequences, nor mappings; they only export methods for common file processing. Technically, files are a prebuilt C extension that provides a wrapper for the C stdio (standard input/output) library. If you already know how to use C files, you pretty much know how to use Python files.
Files are a way to save data permanently. Everything you’ve worked with so far resides only in memory; as soon as you close down Python or turn off your computer, it goes away. You would have to retype everything if you wanted to use it again.
The files that Python creates are manipulated by the computer’s file system. Python is able to use operating-system-specific functions to import, save, and modify files. It may take a little work to make certain features behave correctly in a cross-platform manner, but it means that your program can be used by more people. Of course, if you are writing your program for a specific operating system, then you only need to worry about the OS-specific functions.
File Operations
To keep things consistent, here's the list of Python file operations:
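As a rough sketch, the most common of these operations look like the following. The scratch path and the variable names are placeholders chosen for illustration, and only calls that work the same way in Python 2 and Python 3 are used.

```python
import os
import tempfile

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

out = open(path, "w")            # create (or truncate) a file for writing
out.write("first line\n")        # write a single string
out.writelines(["second line\n", "third line\n"])  # write a list of strings
out.close()                      # flush buffers and release the file

src = open(path, "r")            # reopen the same file for reading
first = src.readline()           # one line, including its newline
rest = src.readlines()           # all remaining lines, as a list
leftover = src.read()            # nothing is left, so an empty string
src.close()
```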
Because Python has a built-in garbage collector, you don't strictly need to close your files manually; once a file object is no longer referenced, its memory is reclaimed and the file is closed along with it. This applies to all objects in Python, including files. However, exactly when that cleanup happens is not guaranteed, so it's recommended to close files yourself, especially in large systems; it won't hurt anything and it’s good to get into the habit in case you ever have to work in a language that doesn’t have garbage collection.
Files and Streams
Following the Unix tradition, Python treats a file as a data stream, i.e. each file is read and stored as a sequential flow of bytes. Once the last byte of data has been read, the file reports an end-of-file condition. This is useful because you can write a program that reads a file in pieces rather than loading the entire file into memory at one time. When the end of the file is reached, your program knows there is nothing further to read and can continue with whatever processing it needs to do.
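Reading in pieces can be sketched as below. The helper name count_bytes, the chunk size, and the sample file are assumptions made for illustration; the one detail that matters is that read() returns an empty result once the end of the file has been reached.

```python
import os
import tempfile

def count_bytes(path, chunk_size=4096):
    """Return the size of a file by reading it one chunk at a time."""
    total = 0
    source = open(path, "rb")
    try:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:          # empty result means end of file
                break
            total += len(chunk)
    finally:
        source.close()             # always release the file
    return total

# Build a small sample file to read back (hypothetical data).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
sample = open(sample_path, "wb")
sample.write(b"x" * 10000)
sample.close()

total_bytes = count_bytes(sample_path, chunk_size=1024)
```

At no point does the loop hold more than chunk_size bytes in memory, regardless of how large the file is.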
When a file is read, such as with the readline() method, the end of the file is signaled by an empty string; empty lines within the file are returned as strings containing just an end-of-line character. Here's an example:
Generic Code Example:
>>> myfile = open('myfile', 'w')          #open file for output (creates)
>>> myfile.write('hello text file')       #write a line of text
>>> myfile.close()
>>> myfile = open('myfile', 'r')          #open for input
>>> myfile.readline()                     #read the line back
'hello text file'
>>> myfile.readline()
''                                        #empty string: end of file
Creating a File
Creating a file is extremely easy with Python. As shown in the example above, you simply call open() with a filename, tell Python that you want to write to the file, and assign the result to the variable that will represent the file.
If you don’t expressly tell Python that you want to write to a file, it will be opened in read-only mode. This acts as a safety feature to prevent you from accidentally overwriting files. In addition to the standard “w” to indicate writing and “r” for reading, Python supports several other file access modes.
Most of the files you work with will contain alphanumeric text, which is why binary access is handled by separate modes (a “b” added to the mode string). Unless you have a specific need, the text modes will be fine for most of your tasks. In a later section, I will talk about saving files that are composed of lists, dictionaries, or other data elements.
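A minimal sketch of a few of these modes follows. The file name is a placeholder, and the selection of modes shown (write, append, read, and binary read) is only a sample, not the full set.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt")  # hypothetical file

log = open(path, "w")         # "w": create the file, or truncate an existing one
log.write("started\n")
log.close()

log = open(path, "a")         # "a": append; existing contents are preserved
log.write("finished\n")
log.close()

log = open(path, "r")         # "r": read-only text mode (the default)
text = log.read()
log.close()

raw_file = open(path, "rb")   # "rb": read the same file as raw bytes
raw = raw_file.read()
raw_file.close()
```

Opening the file with "w" a second time instead of "a" would have discarded the first line, which is the accidental overwrite that the read-only default guards against.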
Reading From a File
Notice that the standard read modes produce an I/O (input/output) error if the file doesn’t exist. If you end up with this error, your program will halt and give you an error message, like below:
>>> file = open("myfile", "r")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'myfile'
>>>
To fix this, you should always open files in such a way as to catch the error before it kills your program. This is called “catching the exception”, because the IOError shown is actually an exception raised by the Python interpreter. There is a chapter dedicated to exception handling, but here is a brief overview.
When you are performing an operation where there is a potential for an exception to occur, you should wrap that operation within a try/except code block. This will try to run the operation; if an exception is thrown, you can catch it and deal with it gracefully. Otherwise, your program crashes and burns.
So, how do you handle potential exception errors? Just give it a try. (Sorry, bad joke.)
>>> f = open("myfile", "w")
>>> f.write("hello there, my text file.\nWill you fail gracefully?")
>>> f.close()
>>> try:
...     file = open("myfile", "r")
...     file.readlines()
...     file.close()
... except IOError:
...     print "The file doesn't exist"
...
['hello there, my text file.\n', 'Will you fail gracefully?']
>>>
What’s going on here? Well, the first few lines are essentially the same as the first example in this chapter; then we set up a try/except block so that a failure to open the file is handled gracefully.
So, what happens if the exception does occur? This:
>>> try:
...     file3 = open("file3", "r")
...     file3.readlines()
...     file3.close()
... except IOError:
...     print "The file doesn't exist. Check the filename."
...
The file doesn't exist. Check the filename.
>>>
The file “file3” hasn’t been created, so of course there is nothing to open. Normally you would get an IOError traceback, but since you are expressly looking for this error, you can handle it. When the exception is raised, the except block runs and prints the message you told it to, and the program carries on instead of crashing.
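In a real program the same pattern is often wrapped in a small function. The sketch below is one possible shape; the name read_lines and the choice to return an empty list on failure are assumptions, not something prescribed by Python.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of a text file, or an empty list if it can't be opened."""
    try:
        source = open(path, "r")
    except IOError:              # on Python 3, IOError is an alias of OSError
        return []
    try:
        return source.readlines()
    finally:
        source.close()           # runs whether or not readlines() succeeded

# A path inside a fresh, empty directory, so the file cannot exist.
missing_path = os.path.join(tempfile.mkdtemp(), "file3")
missing = read_lines(missing_path)
```

Keeping only the open() call inside the first try block means that an unrelated error later in the function is not mistaken for a missing file.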
One final note: when using files, it’s important to close them when you’re done using them. Though Python has built-in garbage collection, you can't count on it to close a file at any particular moment. Open files consume system resources and, depending on the file mode, other programs may not be able to access them.
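One way to make sure a file is always closed is the with statement, which is not covered in the text above and requires Python 2.6 or later (or Python 3). A minimal sketch, using a placeholder file name:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")  # hypothetical file

with open(path, "w") as out:
    out.write("hello text file\n")
# Leaving the block closes the file, even if an exception was raised inside it.

with open(path, "r") as src:
    contents = src.read()

closed_after_block = src.closed   # True: no explicit close() call was needed
```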
Iterating Through Files
We’ve talked about iteration before and we’ll talk about it again in later chapters. Iteration is simply performing an operation on data in a sequential fashion, usually through the for loop. With files, iteration can be used to read the information in the file and process it in an orderly manner. It also limits the amount of memory taken up when a file is read, which not only reduces system resource use but can also improve performance.
Say you have a file of tabular information, e.g. a payroll file. You want to read the file and print out each line, with “pretty” formatting so it is easy to read. Here’s an example of how to do that. (We’re assuming that the information has already been put in the file. Also, the normal Python interpreter prompts aren’t visible because you would actually write this as a full-blown program, as we’ll see later.)
try:
    file = open("payroll", "r")
except IOError:
    print "The file doesn't exist. Check filename."
else:
    #only runs if the file was opened successfully
    individuals = file.readlines()
    #trailing commas keep the three columns on one line
    print "Account".ljust(10),
    print "Name".ljust(10),
    print "Balance".rjust(10)
    for record in individuals:
        columns = record.split()
        print columns[0].ljust(10),
        print columns[1].ljust(10),
        print columns[2].rjust(10)
    file.close()
Output:
Account    Name          Balance
101        Jeffrey        100.50
105        Patrick        325.49
110        Susan          210.50
A shortcut would be rewriting the for block so that, instead of iterating through the variable individuals, it simply reads the file directly, as such:
for record in file:
This will iterate through the file, read each line, and assign it to record. This results in each line being processed immediately, rather than having to wait for the entire file to be read into memory. The readlines() method requires the file to be placed in memory before it can be processed; for large files, this can result in a performance hit.
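Put together, a version of the payroll report that iterates over the file directly might look like the sketch below. The sample data, the helper format_row, and the use of the with statement are assumptions added for illustration; the syntax is chosen so that it also runs under Python 3.

```python
import os
import tempfile

# Sample data standing in for the payroll file described above.
payroll_path = os.path.join(tempfile.mkdtemp(), "payroll")
with open(payroll_path, "w") as out:
    out.write("101 Jeffrey 100.50\n105 Patrick 325.49\n110 Susan 210.50\n")

def format_row(account, name, balance):
    """Lay out one record in three 10-character columns."""
    return " ".join([account.ljust(10), name.ljust(10), balance.rjust(10)])

report = [format_row("Account", "Name", "Balance")]
with open(payroll_path, "r") as payroll:
    for record in payroll:        # one line at a time, no readlines()
        account, name, balance = record.split()
        report.append(format_row(account, name, balance))

print("\n".join(report))
```

Only the current line is held in memory while the loop runs, so the same code would cope with a payroll file far larger than this three-line sample.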
Seeking
Seeking is the process of moving a pointer within a file to an arbitrary position. This allows you to get data from anywhere within the file without having to start at the beginning every time.
The seek() method takes up to two arguments. The first argument (offset) is the number of bytes to move the pointer. The second, optional argument is the reference point from which the offset is measured: 0 is the default value and indicates an offset relative to the beginning of the file, 1 is relative to the current position within the file, and 2 is relative to the end of the file.
file.seek(15) #position pointer 15 bytes from beginning of file
file.seek(12, 1) #position pointer 12 bytes from current location
file.seek(-50, 2) #position pointer 50 bytes backwards from end of file
file.seek(0, 2) #position pointer at end of file
The tell() method returns the current position of the pointer within the file. This can be useful for troubleshooting (to make sure the pointer is actually in the location you think it is) or as a returned value for a function.
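The sketch below combines seek() and tell() on a small sample file whose contents are made up for illustration. The file is opened in binary mode because Python 3 restricts seeks relative to the current position or the end of the file when a file is opened in text mode.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "alphabet.bin")  # hypothetical file
with open(path, "wb") as out:
    out.write(b"abcdefghijklmnopqrstuvwxyz")   # 26 bytes

with open(path, "rb") as data:
    data.seek(10)                 # 10 bytes from the beginning
    at_ten = data.read(3)         # reads "klm" and advances the pointer
    after_read = data.tell()      # 13

    data.seek(2, 1)               # 2 bytes forward from the current position
    skipped = data.read(1)        # the byte at offset 15, "p"

    data.seek(-5, 2)              # 5 bytes back from the end
    tail = data.read()            # the last five bytes, "vwxyz"

    data.seek(0, 2)               # jump to the end of the file
    size = data.tell()            # 26, the file's length in bytes
```

Seeking to the end and calling tell(), as in the last two lines, is a common way to find a file's size without reading it.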
Serialization
Serialization (pickling) allows you to save non-textual information to disk or transmit it over a network. Pickling essentially takes a data object, such as a dictionary, a list, or even a class instance (which we’ll cover later), and converts it into a byte stream that can be used to “reconstitute” the original data.
>>> import cPickle                          #import cPickle library
>>> a_list = ["one", "two", "buckle", "my", "shoe"]
>>> save_file = open("pickled_list", "wb")  #binary mode is safe for any pickle protocol
>>> cPickle.dump(a_list, save_file)         #serialize list to file
>>> save_file.close()
>>> open_file = open("pickled_list", "rb")
>>> b_list = cPickle.load(open_file)        #rebuild the list from the file
>>> open_file.close()
In the above example, we actually used the cPickle library rather than the pickle library. The reason is related to information we discussed in Chapter 2. Since Python is interpreted, it runs a bit slower compared to compiled languages, like C. Because of this, Python has a pre-compiled version of pickle that was written in C, hence cPickle. Using cPickle makes your program run faster.
Of course, with processor speeds getting faster all the time, you probably won’t see a significant difference. However, it is there and the use is the same as the normal pickle library, so you might as well use it. (As an aside, anytime you need to increase the speed of your program, you can write the bottleneck in C and bind it into Python. I won’t cover that in this book but you can learn more in the official Python documentation.)
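For comparison, here is a round trip using the plain pickle module, which is available in both Python 2 and Python 3 (cPickle exists only in Python 2; Python 3's pickle picks up its C implementation automatically). The dictionary contents and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pickled_data")  # hypothetical file
record = {"name": "Susan", "accounts": [110, 112], "balance": 210.50}

with open(path, "wb") as save_file:   # binary mode is the safe choice for pickles
    pickle.dump(record, save_file)    # serialize the dictionary to the file

with open(path, "rb") as open_file:
    restored = pickle.load(open_file) # rebuild an equal, independent object
```

The restored object is a new dictionary with the same contents as the original, not a reference to it. As a general caution, only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.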
Shelves are similar to pickles except that they pickle objects to an access-by-key database, much like dictionaries. Shelves allow you to simulate a random-access file or a database. It’s not a true database but, often, it works well enough for development and testing purposes.
>>> import shelve                         #import shelve library
>>> a_list = ["one", "two", "buckle", "my", "shoe"]
>>> dbase = shelve.open("filename")
>>> dbase["rhyme"] = a_list               #save list under key name
>>> b_list = dbase["rhyme"]               #retrieve list
>>> dbase.close()                         #write changes to disk
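To show that a shelf really persists, the following sketch closes the shelf and opens it again, as a later run of the program would. The shelf name and scratch directory are placeholders, and the exact files created on disk depend on which database backend your Python installation provides.

```python
import os
import shelve
import tempfile

shelf_path = os.path.join(tempfile.mkdtemp(), "rhymes")  # hypothetical shelf

dbase = shelve.open(shelf_path)
dbase["rhyme"] = ["one", "two", "buckle", "my", "shoe"]
dbase.close()                       # write everything out to disk

dbase = shelve.open(shelf_path)     # reopen the same shelf
has_rhyme = "rhyme" in dbase        # keys can be tested like dictionary keys
stored_keys = list(dbase.keys())
b_list = dbase["rhyme"]             # unpickled on access
dbase.close()
```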