Chapter 10: Reading and Writing from Text Files

Believe it or not, you aren't required to rely on the input function for all of your program's input. If you've got documents, configuration files, or other text files sitting on your hard drive, these can all be used as input as easily as data typed at the keyboard. In this chapter, we'll build some sample applications that both accept text files as input and write text files as output.

File objects

The input function is fairly abstract because there's really only one way to query the keyboard for a string. It isn't necessary to tell input to use the active keyboard, to wait until the newline character comes through before it's over, or any other extraneous information. Before we can read input from a file, however, it is necessary to obtain a specific reference to the file itself. With a file, we have two key pieces of information: the location of the file, and whether we'd like to open the file for reading or for writing. To get this file reference, we use the open function. In this example, we use a file called hello.txt, which sits in the same directory as the program source code and contains a single line of text.


input_file = open("hello.txt", "r")

This open call uses the file name as the first parameter. Since the file is in the same directory as our source code, we don't need to include the full system path. The second parameter is the single letter "r", which is short for "read". This tells Python that we want to open the file for reading access, meaning that we aren't going to make any changes to hello.txt. We're just going to see what it contains.

There are a number of different ways to get data contained in the file, but we focus on the two most common ones here. You can either read the entire file into memory at once as a list by using readlines, or you can read a file one line at a time by using the in keyword. For smaller examples, you really won't notice any difference. For larger files, you don't want to load the entire thing into memory at once, so in general, we're going to stick with single lines at a time.

First, let's take a look at readlines.


input_file = open("hello.txt", "r")

lines = input_file.readlines()

input_file.close()

print("The entire hello.txt file:")

print(lines)

print("The hello.txt file, line-by-line:")

for x in lines:

    print(x)

print("All done!")



The entire hello.txt file:

['Hello world!\n']

The hello.txt file, line-by-line:

Hello world!



All done!

The readlines method is called on the opened file object to read the entire file into memory. This means that when readlines finishes, the returned variable is a list with a length equal to the number of lines in the text file itself. The hello.txt file created earlier is a single line terminated by a new-line character. You can see the results in lines when it is printed to the screen. The list has a length of one and consists of a single string that terminates with the newline character "\n".

You can see how we can use the list in a loop to access each of the lines in turn. The lines of the file exist in memory now, so individual lines can be accessed by index. Notice that we also called the close method for the input_file file object. When the file is no longer needed, you'll want to make sure that access to the file is stopped. Python will often close files automatically if they're still open when the program terminates, but as your program continues running and you keep opening files again and again, you can run into access problems.
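
One common way to make sure close is always called is Python's with statement, which closes the file as soon as the indented block ends, even if an error occurs inside it. The following is a minimal sketch rather than the style used in the rest of this chapter. So that it runs anywhere, it first creates its own hello.txt in a temporary folder; the folder, the path variable, and the write step are assumptions added for illustration.

```python
import os
import tempfile

# Assumption: create a throwaway hello.txt so the sketch is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "hello.txt")
with open(path, "w") as output_file:
    output_file.write("Hello world!\n")

# The file is closed automatically when the with block ends.
with open(path, "r") as input_file:
    lines = input_file.readlines()

print(lines)              # ['Hello world!\n']
print(input_file.closed)  # True
```

With this form there is no separate close call to forget, which helps in programs that open many files over a long run.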

The alternative way to read the lines from a file is one-by-one. A loop can be used to do this using the file object as an iterator. The format for this is as follows:


input_file = open("hello.txt", "r")

print("The hello.txt file, line-by-line using a for-loop:")

for x in input_file:

    print(x)

input_file.close()

print("All done!")



The hello.txt file, line-by-line using a for-loop:

Hello world!



All done!

Each of the lines in hello.txt is accessed by using the in keyword. It requires much less memory when used with larger files, and its syntax is slightly more straightforward. We'll use both in and readlines going forward. You are of course free to use the one that feels most comfortable.
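
To make the memory point concrete, here is a small sketch that counts the lines in a file and remembers the longest one while only ever holding the current line. The sample file, its contents, and the variable names line_count and longest are hypothetical stand-ins for a much larger input.

```python
import os
import tempfile

# Assumption: a small sample file stands in for a much larger one.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
sample = open(path, "w")
sample.write("first line\nsecond, longer line\nthird\n")
sample.close()

line_count = 0
longest = ""
input_file = open(path, "r")
for line in input_file:
    # Only the current line is needed here, not the whole file.
    line_count += 1
    if len(line.strip()) > len(longest):
        longest = line.strip()
input_file.close()

print(line_count)  # 3
print(longest)     # second, longer line
```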

So far, we've only looked at an input file with a single line of text. How do larger source files appear? As a sample, let's use the Project Gutenberg text version of The Time Machine, by H. G. Wells. This file was downloaded from http://www.gutenberg.org/cache/epub/35/pg35.txt and saved in the same folder as the source code that will attempt to read it.


input_file = open("pg35.txt", "r")

lines = input_file.readlines()

input_file.close()

print("The input file has {0} lines of text.".format(len(lines)))

print(lines[0])

for x in range(37, 49):

    print("{0}: {1}".format(x, lines[x]), end="")

print("All done!")



The input file has 3604 lines of text.

Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells



37: The Time Traveller (for so it will be convenient to speak of him)

38: was expounding a recondite matter to us. His grey eyes shone and

39: twinkled, and his usually pale face was flushed and animated. The

40: fire burned brightly, and the soft radiance of the incandescent

41: lights in the lilies of silver caught the bubbles that flashed and

42: passed in our glasses. Our chairs, being his patents, embraced and

43: caressed us rather than submitted to be sat upon, and there was that

44: luxurious after-dinner atmosphere when thought roams gracefully

45: free of the trammels of precision. And he put it to us in this

46: way--marking the points with a lean forefinger--as we sat and lazily

47: admired his earnestness over this new paradox (as we thought it)

48: and his fecundity.

All done!

You might have noticed the change to the print call in the for-loop above. Because readlines returns each line in full, every element of the list still ends with the newline character from the file. Passing end="" stops print from adding a second newline of its own. You can verify that the newlines are there in the interpreter.


>>> lines[0]

"Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells\n"

The "\n" character at the end of the string is the newline.


>>> lines[37:49]

['The Time Traveller (for so it will be convenient to speak of him)\n', 'was expounding a recondite matter to us. His grey eyes shone and\n', 'twinkled, and his usually pale face was flushed and animated. The\n', 'fire burned brightly, and the soft radiance of the incandescent\n', 'lights in the lilies of silver caught the bubbles that flashed and\n', 'passed in our glasses. Our chairs, being his patents, embraced and\n', 'caressed us rather than submitted to be sat upon, and there was that\n', 'luxurious after-dinner atmosphere when thought roams gracefully\n', 'free of the trammels of precision. And he put it to us in this\n', 'way--marking the points with a lean forefinger--as we sat and lazily\n', 'admired his earnestness over this new paradox (as we thought it)\n', 'and his fecundity.\n']

If you'd like to just acquire the text in each line without the unnecessary whitespace on either end, we can go back to the old string method strip so that the print statement doesn't need to be supplemented with the end parameter. This is usually what I do in my own code, but of course your mileage may vary.


input_file = open("pg35.txt", "r")

lines = input_file.readlines()

input_file.close()

print("The input file has {0} lines of text.".format(len(lines)))

print(lines[0])

for x in range(37, 40):

    print("{0}: {1}".format(x, lines[x].strip()))

print("All done!")



The input file has 3604 lines of text.

Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells



37: The Time Traveller (for so it will be convenient to speak of him)

38: was expounding a recondite matter to us. His grey eyes shone and

39: twinkled, and his usually pale face was flushed and animated. The

All done!
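
One thing to keep in mind is that strip removes whitespace from both ends of the string, so any leading indentation in the line is lost as well. If the indentation matters, rstrip("\n") removes only the trailing newline. A quick sketch on a made-up string:

```python
# A hypothetical line of text with leading spaces and a trailing newline.
line = "    indented text\n"

stripped = line.strip()        # removes whitespace on both ends
trimmed = line.rstrip("\n")    # removes only the trailing newline

print(repr(stripped))  # 'indented text'
print(repr(trimmed))   # '    indented text'
```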

Writing to a text file

When opening a text file, we must decide whether to open the file for reading or for writing. For the programs in this chapter, it doesn't really make sense to do both at the same time, so a distinction between the two is made. For example, when opening The Time Machine, we never intended to rewrite the story. Rather, we were interested in opening the story and reading the data for some purpose.

We've already seen reading, so let's turn to writing. With writing, the open function is used in a similar way as with reading, but the second parameter is either "w" for write or "a" for append. If you open a file for appending, any new information written to the file is added onto the end of the existing contents.

When you open a file for writing, a new empty file is created in the specified location. If the file already exists and you use "w", it will be deleted and overwritten by the new file. That's important! Be careful when specifying your target, because it is fantastically easy to overwrite your input file by accidentally using "w" instead of "r" when opening. I speak from experience on this one.


output_file = open("output.txt", "w")

To start getting data into the file, we can use write. The write function works a lot like print, except that it sends data to a file instead of to the screen. When a string is written to a file, the newline character won't be added unless it is explicitly specified. Let's build a sample program to see how we can start writing user input to a file.


output_file = open("output.txt", "w")

print("Enter a few strings to write to the file, and type quit when finished.")

done = False

while not done:

    st = input("> ")

    if st == "quit":

        done = True

    else:

        output_file.write("{0}\n".format(st))

output_file.close()

print("All done!")



Enter a few strings to write to the file, and type quit when finished.

> This is a test.

> Writing to a file is fun!

> Wowwwwwww.

> quit

All done!



output.txt:

This is a test.

Writing to a file is fun!

Wowwwwwww.

Now if we run the code a second time and enter in some more data, the original output.txt will be deleted and replaced by the new set of user input. Let's change the "w" in the open function to "a" and see how it works instead.


output_file = open("output.txt", "a")



Enter a few strings to write to the file, and type quit when finished.

> Appending data is also fun.

> quit

All done!



output.txt:

This is a test.

Writing to a file is fun!

Wowwwwwww.

Appending data is also fun.
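
The difference between the two modes can also be seen in a single short program. This is only a sketch: the helper functions write_line and read_all are hypothetical names introduced here, and the file lives in a temporary folder so that nothing of yours is overwritten.

```python
import os
import tempfile

# Assumption: work in a temporary folder to avoid touching real files.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

def write_line(mode, text):
    # Open the file in the given mode, write one line, and close it.
    output_file = open(path, mode)
    output_file.write("{0}\n".format(text))
    output_file.close()

def read_all():
    # Return the current contents of the file as a list of lines.
    input_file = open(path, "r")
    lines = input_file.readlines()
    input_file.close()
    return lines

write_line("w", "first")    # creates the file
write_line("a", "second")   # adds to the end
after_append = read_all()

write_line("w", "third")    # replaces everything written so far
after_write = read_all()

print(after_append)  # ['first\n', 'second\n']
print(after_write)   # ['third\n']
```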

As long as the data is a string, it can be safely written to the file. Unlike print, write won't automatically attempt to convert your data to a string, so you'll have to do it manually. For example, if we try to write a number to the file, we'll hit a TypeError exception.


output_file = open("output.txt", "w")

output_file.write(5)

output_file.close()

print("All done!")



Traceback (most recent call last):

  File "C:\Python33\sandbox.py", line 2, in <module>

    output_file.write(5)

TypeError: must be str, not int

The way around this is to wrap the number in str to explicitly convert the number 5 into the string "5".


output_file.write(str(5))

The same thing goes for other data types. Just make sure that the actual conversion is done manually, and write will be able to send the data to the file.
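
As a short sketch of that idea, the loop below converts several kinds of values with str before writing them, then reads the file back to show what was stored. The list of sample values and the temporary file are assumptions made for the example.

```python
import os
import tempfile

# Assumption: write to a temporary file rather than a real output.txt.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

values = [5, 3.14, True, ["a", "b"]]

output_file = open(path, "w")
for value in values:
    # write only accepts strings, so convert each value first.
    output_file.write(str(value) + "\n")
output_file.close()

input_file = open(path, "r")
written = input_file.readlines()
input_file.close()

print(written)  # ['5\n', '3.14\n', 'True\n', "['a', 'b']\n"]
```

Everything read back from the file is a string again, so a program that needs the number 5 rather than the text "5" has to convert it back, for example with int.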

Combining reads and writes

Now let's take an input file and do some interesting processing on it before writing some new data to an output file. We'll have two file objects, one for input and one for output. For this program, we'll use The Time Machine again, and count the number of words. To do that, we'll strip out any character that isn't a letter and assume that any continuous set of letters makes up a word. For example, let's take an arbitrary string from input and split it up into each of the words one by one.


input_string = input("Enter a string: ")

new_string = ""

for x in input_string:

    if x.isalpha() or x == " ":

        new_string = "{0}{1}".format(new_string, x.lower())

# Keep collapsing pairs of spaces until only single spaces remain.

while new_string.find("  ") >= 0:

    new_string = new_string.replace("  ", " ")

new_string = new_string.strip()

words = new_string.split(" ")

words.sort()

print("The words in your string are:")

print(words)



Enter a string: This is a test.

The words in your string are:

['a', 'is', 'test', 'this']

The isalpha method returns True if all of the characters in a non-empty string are alphabetic. In the program above, we look at each character in input_string one at a time, test whether it is either a letter or a space, and if it is, concatenate it to new_string. When we're done, new_string should only consist of letters and spaces. Any run of two or more spaces is eliminated by repeatedly replacing two spaces with one space. Finally, split is called to get a list of the words in the string. If we didn't replace all pairs of spaces, split would give us some empty words as results. For example, if you remove the replace call, you can see output like this:


Enter a string: This           is a test.

The words in your string are:

['', '', '', '', '', '', '', '', '', '', 'a', 'is', 'test', 'this']

Since there can be extra spaces at the start or end of a string that can still throw off the results, we also make a call to strip to clean up the ends. Finally, we sort the word list, and print it to the screen.
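
As an alternative sketch, not the approach used in this chapter, split can also be called with no arguments. In that form it splits on any run of whitespace and ignores whitespace at the ends, so the space-collapsing loop and the strip call aren't needed. The sample string here is made up for the example.

```python
# A hypothetical input with extra spaces in the middle and at both ends.
input_string = "  This           is a test.  "

new_string = ""
for x in input_string:
    # Keep letters and spaces only, lowercasing as we go.
    if x.isalpha() or x == " ":
        new_string = "{0}{1}".format(new_string, x.lower())

# With no arguments, split treats any run of whitespace as one separator.
words = new_string.split()
words.sort()

print(words)  # ['a', 'is', 'test', 'this']
```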

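Let's use the same approach with the input file. Instead of printing the entire list to the screen, let's print a subset of the words, along with the length of the word list. The previous code needs a slight modification because the new code processes multiple lines. It also replaces each character that isn't a letter with a space, so that words joined only by punctuation, such as "way--marking", are still counted separately.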


input_file = open("pg35.txt", "r")

words = []

for line in input_file:

    new_string = ""

    for x in line:

        if x.isalpha() or x == " ":

            new_string = "{0}{1}".format(new_string, x.lower())

        else:

            new_string = "{0} ".format(new_string)

    # Keep collapsing pairs of spaces until only single spaces remain.

    while new_string.find("  ") >= 0:

        new_string = new_string.replace("  ", " ")

    new_string = new_string.strip()

    if len(new_string) > 0:

        for word in new_string.split(" "):

            words.append(word)

input_file.close()

words.sort()

print("There are {0} words in the file.".format(len(words)))

print("Some of the words in your file are:")

print(words[:10])

print(words[10000:10010])



There are 35261 words in the file.

Some of the words in your file are:

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

['front', 'front', 'frugivorous', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruit', 'fruits']

This looks great, and we've got a full list of all the words in the file. However, it would be nice to have each word show up only once. In particular, it would be handy to have a count of the actual number of times each word occurs. Let's change the list to a dictionary and track each word a little more closely.


input_file = open("pg35.txt", "r")

words = {}

for line in input_file:

    new_string = ""

    for x in line:

        if x.isalpha() or x == " ":

            new_string = "{0}{1}".format(new_string, x.lower())

    # Keep collapsing pairs of spaces until only single spaces remain.

    while new_string.find("  ") >= 0:

        new_string = new_string.replace("  ", " ")

    new_string = new_string.strip()

    if len(new_string) > 0:

        for word in new_string.split(" "):

            if word in words:

                words[word] += 1

            else:

                words[word] = 1

input_file.close()

words_keys = sorted(words)

print("There are {0} words in the file.".format(len(words)))

print("Some of the words in your file are:")

for x in range(10):

    print(words_keys[x], words[words_keys[x]])



There are 5173 words in the file.

Some of the words in your file are:

a 861

abandon 1

abandoned 1

abide 1

able 3

abnormally 1

abominable 2

abominations 1

about 78

above 23

Much better! We can see how frequent the words are, and with the sorted keys from the dictionary, it's possible to view a sorted list of the elements of the dictionary.
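
The standard library also offers a ready-made tool for this kind of tally, collections.Counter, which behaves like a dictionary of counts. The sketch below is an alternative to the if/else bookkeeping above rather than a replacement for it, and it uses a short made-up word list instead of the full text.

```python
from collections import Counter

# Assumption: a tiny hand-made word list stands in for the book's words.
sample_words = ["the", "time", "machine", "the", "time", "the"]

# Counter builds the word -> count mapping in one step.
counts = Counter(sample_words)

for word in sorted(counts):
    print(word, counts[word])
# machine 1
# the 3
# time 2

print(counts.most_common(1))  # [('the', 3)]
```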

Let's go back to writing files so that this information can be saved in an output file. Change the last part of the program to this:


words_keys = sorted(words)

output_file = open("output.txt", "w")

output_file.write("There are {0} words in the file.\n".format(len(words)))

output_file.write("Some of the words in your file are:\n")

for key in words_keys:

    output_file.write("{0}: {1}\n".format(key, words[key]))

output_file.close()

If you run this program and take a look at the new contents of the output.txt file we just wrote, you'll see the following data:


There are 5173 words in the file.

Some of the words in your file are:

a: 861

abandon: 1

abandoned: 1

abide: 1

able: 3

abnormally: 1

abominable: 2

abominations: 1

about: 78

above: 23

[..]

In a single program, we've managed to read in an entire story line by line, do some word analysis on the data, and send the word frequencies to an output file. The amount of data we're looking at is too great to print to the screen at once, so saving it to a file gives us a manageable way of keeping track of the information.
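
The output file above is ordered alphabetically. If you would rather see the most frequent words first, one possible sketch is to sort the keys by their counts before writing. The small words dictionary below reuses a few counts from the output shown earlier, and the temporary file path and the by_count name are assumptions for the example.

```python
import os
import tempfile

# A few word counts copied from the earlier output, for illustration.
words = {"a": 861, "abandon": 1, "about": 78, "above": 23}

# Sort by count, largest first; ties fall back to alphabetical order.
by_count = sorted(words, key=lambda word: (-words[word], word))

# Assumption: write to a temporary file rather than a real output.txt.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
output_file = open(path, "w")
for key in by_count:
    output_file.write("{0}: {1}\n".format(key, words[key]))
output_file.close()

input_file = open(path, "r")
written = input_file.readlines()
input_file.close()

print(written)  # ['a: 861\n', 'about: 78\n', 'above: 23\n', 'abandon: 1\n']
```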

Breaking Stuff

Reading and writing to text files is fairly straightforward using Python. There's an important caveat there though: reading and writing to files that are made up of text data is fairly straightforward. There is another type of data stored in files that is usually referred to as binary data. Binary data doesn't look like sentences or natural language. It's not structured for people to read directly. Instead, it's used by computers to store the data they need in an efficient way.

If you try to open a binary file, like an archived zip file, you'll get some strange results on the screen. Consider this simple program:


input_file = open("test.zip", "r")

for line in input_file:

    print(line)

input_file.close()

This code works perfectly fine as long as the input file is made up of a collection of individual lines. Each line will be printed to the screen, one by one, and the file will be closed. However, as you might have noticed in the code block, we're opening a zip file here. What do we get in this case? Well, in my terminal, I get the following:


.]1Dt.pyUX

          7]?R7]?R?+J?,NUp?HN-(????P??P??WT???P??H.]1D??H?

    ??t.pyU7]?R7]?RPK>_

Not particularly helpful. This file actually consists of a single Python file that I created as an example called t.py. You can see hints that the file name is stored in the binary data. However, the contents of the file are compressed and unreadable, so what you get on the screen is a bunch of spaghetti.

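Note that Python didn't crash when we tried to open a binary file in this way. All that we said was to open the file and to treat it as a text file with line breaks. In this run there were no exceptions and no bright red warnings, and the output was the only clue that we'd done something unexpected. That isn't guaranteed, though: depending on the bytes in the file and the text encoding Python uses on your system, the same code can instead stop with a UnicodeDecodeError.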
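
If you do need to look inside a binary file, open accepts "rb" as the mode, which reads the raw bytes without trying to interpret them as text. The sketch below builds its own small zip archive with the standard zipfile module so that it can run anywhere; the archive contents and the temporary path are assumptions for illustration.

```python
import os
import tempfile
import zipfile

# Assumption: create a small test.zip containing one file named t.py.
path = os.path.join(tempfile.mkdtemp(), "test.zip")
archive = zipfile.ZipFile(path, "w")
archive.writestr("t.py", "print('Hello world!')\n")
archive.close()

# "rb" opens the file for reading in binary mode.
input_file = open(path, "rb")
header = input_file.read(4)   # read only the first four bytes
input_file.close()

print(header)        # b'PK\x03\x04'
print(type(header))  # <class 'bytes'>
```

The result is a bytes object rather than a string. The first two bytes, PK, match the "PK" fragments visible in the garbled terminal output above.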

Summary

Reading and writing information using text files is a common operation in Python that is implemented in an elegant way. Reading a file looks just like accessing elements in a list, and writing strings to a file is extremely similar to printing to the screen.