File Reading and Writing

Saving and loading data.

Files

Text Files

f = open('secrets.txt')
secret_data = f.read()
f.close()

secret_data is a string

Note

In Python 3, files are opened in text mode by default, so in the usual case you get a proper Unicode string to work with. The default encoding depends on the platform: it is UTF-8 on most Linux and macOS systems, but it can be something else, notably on Windows. UTF-8 is the most common encoding for text, and it is ASCII compatible, so ASCII files will “just work”. If “Unicode” and “ASCII” mean nothing to you, don’t worry about it; just know that things will usually work for text, even non-English text. And if you get odd characters or a UnicodeDecodeError, then your file is probably not in the encoding Python assumed, and it’s time to pass an explicit encoding argument to open() or Google “Python Unicode”. (more info here: Unicode in Python)
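A minimal sketch of naming the encoding explicitly instead of relying on the default. The file name demo_utf8.txt and the sample text are made up for illustration, and the temporary directory is only there so the example cleans up after itself:

```python
import os
import tempfile

text = "naïve café, 日本語\n"  # sample non-English text (made up for this example)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_utf8.txt")  # hypothetical file name

    # Name the encoding so the result does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # latin-1 accepts any byte, so the wrong encoding does not fail here;
    # it quietly gives back the wrong characters instead.
    with open(path, encoding="latin-1") as f:
        garbled = f.read()

print(round_trip == text)   # True
print(garbled == text)      # False
```

Other wrong guesses (for example reading latin-1 bytes as UTF-8) raise UnicodeDecodeError rather than producing odd characters.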

Binary Files

f = open('secrets.bin', 'rb')
secret_data = f.read()
f.close()

secret_data is a byte string (with arbitrary bytes in it – well, not arbitrary – whatever is in the file!)

(See the struct module to unpack binary data.)
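As a rough illustration of what struct does, here is a sketch that writes one small binary record and reads it back. The record layout (a 16-bit integer followed by a 32-bit float) is invented for this example and is not any real file format:

```python
import os
import struct
import tempfile

# Hypothetical record layout: little-endian unsigned 16-bit int, then a 32-bit float.
RECORD = struct.Struct("<Hf")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "secrets.bin")

    with open(path, "wb") as f:
        f.write(RECORD.pack(7, 2.5))

    with open(path, "rb") as f:
        raw = f.read()   # a bytes object, exactly as stored on disk

version, value = RECORD.unpack(raw)
print(len(raw), version, value)   # 6 7 2.5
```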

File Opening Modes

f = open('secrets.txt', [mode])
'r', 'w', 'a'
'rb', 'wb', 'ab'
'r+', 'w+', 'a+'
'r+b', 'w+b', 'a+b'

These follow the Unix conventions, and aren’t all that well documented in the Python docs. But these BSD docs make it pretty clear:

http://www.manpagez.com/man/3/fopen/

Gotcha – ‘w’ modes always clear the file if it already exists!
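A small sketch of the difference between 'w' and 'a', using a throwaway file in a temporary directory (the name notes.txt is made up). It also shows 'x', a related mode not listed above, which refuses to open a file that already exists:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")  # hypothetical file name

    with open(path, "w") as f:     # creates the file
        f.write("first\n")

    with open(path, "a") as f:     # appends, keeping the existing content
        f.write("second\n")

    with open(path) as f:
        after_append = f.read()

    with open(path, "w") as f:     # truncates on open, before anything is written
        pass

    with open(path) as f:
        after_w = f.read()

    # 'x' creates a new file, but fails if the file is already there.
    try:
        open(path, "x")
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_w))        # ''
print(exclusive_failed)     # True
```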

Text File Notes

Text is default:

  • Newlines are translated when reading: \r\n -> \n

  • They are translated when writing too: \n -> the platform’s line ending (\r\n on Windows)

  • Use *nix-style in your code: \n

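Gotcha:

  • in Python 3, text and binary modes differ on every platform: text mode gives you str, binary mode gives you bytes

  • so opening binary data in text mode will cause an error, or silently alter the data, on *nix as well as on Windows.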
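The newline handling can be checked directly. This sketch writes Windows-style line endings as raw bytes and reads them back three ways; the file name is made up, and newline="" is the open() argument that switches translation off in text mode:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "windows_style.txt")  # hypothetical file name

    # Write the line endings as raw bytes so nothing gets translated on the way out.
    with open(path, "wb") as f:
        f.write(b"one\r\ntwo\r\n")

    with open(path) as f:               # text mode: \r\n becomes \n
        as_text = f.read()

    with open(path, "rb") as f:         # binary mode: bytes come back untouched
        as_bytes = f.read()

    with open(path, newline="") as f:   # text mode, translation switched off
        untranslated = f.read()

print(repr(as_text))        # 'one\ntwo\n'
print(repr(as_bytes))       # b'one\r\ntwo\r\n'
print(repr(untranslated))   # 'one\r\ntwo\r\n'
```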

File Reading

Reading part of a file:

header_size = 4096
f = open('secrets.txt')
secret_header = f.read(header_size)
secret_rest = f.read()
f.close()

Common Idioms

for line in open('secrets.txt'):
    print(line)

(The file object is an iterable that iterates through the lines in a text file.)

f = open('secrets.txt')
while True:
    line = f.readline()
    if not line:  # readline() returns '' at the end of the file
        break
    do_something_with_line(line)  # placeholder for your own processing
f.close()

We will learn more about the keyword with later (it creates a “context manager”), but for now, just understand the syntax and the advantage over simply opening the file:

with open('workfile', 'r') as f:
    read_data = f.read()
f.closed  # True: the file was closed when the with block ended

You use with to open the file, and assign it a name (f in this case). The file remains open while in the with block. At the end of the with block, the file is unconditionally closed, even if an Exception is raised. Your code will (mostly) work without it, but it’s a good habit to always use with to open a file.
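To see the “even if an Exception is raised” part in action, here is a sketch that deliberately raises an error inside the with block. The ValueError is simulated, standing in for whatever might go wrong while processing a file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workfile")
    with open(path, "w") as f:
        f.write("some data\n")

    try:
        with open(path) as f:
            first_line = f.readline()
            raise ValueError("simulated failure while processing")
    except ValueError:
        # f is still bound to the file object, and it is already closed.
        closed_after_error = f.closed

print(closed_after_error)   # True
```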

File Writing

outfile = open('output.txt', 'w')
for i in range(10):
    outfile.write("this is line: %i\n"%i)
outfile.close()

with open('output.txt', 'w') as f:
    for i in range(10):
        f.write("this is line: %i\n"%i)

File Methods

Commonly Used Methods:

f.read() f.readline()  f.readlines()

f.write(str) f.writelines(seq)

f.seek(offset)   f.tell() # for binary files, mostly

f.close()
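A short sketch that exercises most of these methods on a throwaway file. It uses binary mode throughout so that the offsets reported by tell() are plain byte counts on every platform; the file name and contents are made up:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.bin")  # hypothetical file name

    with open(path, "wb") as f:
        # writelines() does not add newlines; each item must carry its own.
        f.writelines([b"alpha\n", b"beta\n", b"gamma\n"])

    with open(path, "rb") as f:
        first = f.readline()      # one line, including its newline
        position = f.tell()       # byte offset after that line
        rest = f.readlines()      # list of the remaining lines
        f.seek(0)                 # back to the start of the file
        everything = f.read()     # the whole file in one go

print(first, position)   # b'alpha\n' 6
print(rest)              # [b'beta\n', b'gamma\n']
```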

StringIO

A StringIO object is a “file like” object that stores the content in memory. That is, it has all the methods of a file, and behaves the same way, but never writes anything to disk.

In [6]: import io

In [7]: f = io.StringIO()

In [8]: f.write("some stuff")
Out[8]: 10

In [9]: f.seek(0)
Out[9]: 0

In [10]: f.read()
Out[10]: 'some stuff'

In [11]: f.getvalue()
Out[11]: 'some stuff'

In [12]: f.close()

(This can be handy for testing file handling code…)
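For example, a function that accepts an open file can be tested without touching the disk. This is only a sketch; count_lines is a made-up helper, not something from the standard library:

```python
import io


def count_lines(infile):
    """Count the non-blank lines in any open text file-like object (hypothetical helper)."""
    return sum(1 for line in infile if line.strip())


# A StringIO stands in for a real file opened in text mode.
fake_file = io.StringIO("first\n\nsecond\nthird\n")
n = count_lines(fake_file)
print(n)   # 3
```

The same function works unchanged on a real file opened with open(), which is the point of writing it against the file interface.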

Paths and Directories

Paths

Paths are generally handled with simple strings.

Relative paths:

'secret.txt'
'./secret.txt'

Absolute paths:

'/home/chris/secret.txt'

Either works with open(), etc.

Relative paths are relative to the current working directory, which is mostly relevant to command-line programs, where it is the directory the program was started from.

os module

os.getcwd()
os.chdir(path)

os.path module

os.path.split()
os.path.splitext()
os.path.basename()
os.path.dirname()
os.path.join()
os.path.abspath()
os.path.relpath()

(all platform independent)
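A quick tour of what these functions return, as a sketch. The results in the comments are what you would see for this particular path; joined and absolute depend on your platform and current working directory, so they are not shown:

```python
import os

path = '/home/chris/secret.txt'   # the absolute path from the example above

head, tail = os.path.split(path)       # ('/home/chris', 'secret.txt')
root, ext = os.path.splitext(path)     # ('/home/chris/secret', '.txt')
name = os.path.basename(path)          # 'secret.txt'
folder = os.path.dirname(path)         # '/home/chris'

joined = os.path.join('data', 'secret.txt')   # uses the platform's separator
absolute = os.path.abspath('secret.txt')      # resolved against os.getcwd()
relative = os.path.relpath(absolute)          # back to 'secret.txt'

print(head, tail, ext)
print(joined, relative)
```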

Directories

os.listdir()
os.mkdir()
os.walk()

(Note the shutil module provides higher level operations.)
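A sketch of the three directory functions working together, on a small directory tree built in a temporary location. The directory and file names are made up:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, name), "w") as f:
            f.write("x")

    # listdir() gives names only (not full paths), and does not recurse.
    top_level = sorted(os.listdir(top))

    # walk() visits every directory below top, yielding its files and subdirectories.
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            found.append(os.path.relpath(full, top))
    found.sort()

print(top_level)   # ['a.txt', 'sub']
print(found)       # 'a.txt', then b.txt inside sub
```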

pathlib

pathlib is a standard library module for handling paths in an OO way:

http://pathlib.readthedocs.org/en/pep428/

All the stuff in os.path and more:

In [14]: import pathlib

In [15]: pth = pathlib.Path('./')

In [16]: pth.is_dir()
Out[16]: True

In [17]: pth.absolute()
Out[17]: PosixPath('/Users/Chris/PythonStuff/UWPCE/Fall2018-PY210A/examples/Session02')

In [18]: for f in pth.iterdir():
    ...:     print(f)
    ...:
    ...:

And it has a really nifty way to join paths, by overloading the “division” operator:

In [49]: p = pathlib.Path.home()  # create a path to the user home dir.

In [50]: p
Out[50]: PosixPath('/Users/Chris')

In [51]: p / "a_dir" / "one_more" / "a_filename"
Out[51]: PosixPath('/Users/Chris/a_dir/one_more/a_filename')

Kinda slick, eh?

For the full docs:

https://docs.python.org/3/library/pathlib.html
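Path objects can do more than build paths; they can also create directories and read and write files. A minimal sketch, with made-up directory and file names, run inside a temporary directory:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    target = base / "a_dir" / "notes.txt"     # hypothetical names

    target.parent.mkdir(parents=True)         # create a_dir
    target.write_text("hello\n", encoding="utf-8")

    name, stem, suffix = target.name, target.stem, target.suffix
    content = target.read_text(encoding="utf-8")
    txt_files = sorted(p.name for p in base.rglob("*.txt"))   # recursive search
    exists = target.exists()

print(name, stem, suffix)   # notes.txt notes .txt
print(repr(content))        # 'hello\n'
print(txt_files, exists)    # ['notes.txt'] True
```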

The Path Protocol

As of Python 3.6, there is now a protocol for making arbitrary objects act like paths:

Read about it in PEP 519:

https://www.python.org/dev/peps/pep-0519/

This was added because most built-in file handling modules, as well as any number of third party packages that needed a path, worked only with string paths.

Even after pathlib was added to the standard library, you couldn’t pass a Path object in where a path was needed, not even to the most common functions like open().

So you could use the nifty path manipulation stuff, but still needed to call str on it:

p = pathlib.Path.home() / "a_filename.txt"

f = open(str(p), 'r')

Rather than add explicit support for Path objects, a new protocol was defined, and most of the standard library was updated to support the new protocol.

This way, third party path libraries could be used with the standard library as well.

What this means to you

Unless you are writing a path manipulation library, or a library that deals with paths other than with the stdlib packages (like open()), all you need to know is that you can use Path objects most places you need a path.
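If you are curious what the protocol looks like, it comes down to one method, __fspath__, plus the os.fspath() function that calls it. In this sketch DataFile is a made-up class, used only to show that open() accepts any object implementing the protocol, just as it accepts a Path:

```python
import os
import pathlib
import tempfile


class DataFile:
    """Hypothetical path-like class; __fspath__ is the hook defined by PEP 519."""

    def __init__(self, folder, name):
        self.folder = folder
        self.name = name

    def __fspath__(self):
        return os.path.join(self.folder, self.name)


with tempfile.TemporaryDirectory() as tmp:
    custom = DataFile(tmp, "a_filename.txt")

    with open(custom, "w") as f:        # no str() call needed
        f.write("via __fspath__\n")

    as_path = pathlib.Path(tmp) / "a_filename.txt"
    with open(as_path) as f:            # Path objects work the same way
        content = f.read()

    as_text = os.fspath(custom)         # the plain string behind the object

print(repr(content))   # 'via __fspath__\n'
```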

I expect we will see expanded use of pathlib as Python 3.6 and 3.7 become widely used.

Some added notes:

Using files and “with”

Sorry for the confusion, but I’ll be more clear now.

When working with files, unless you have a good reason not to, use with:

with open(the_filename, 'w') as outfile:
    outfile.write(something)
    # ... do some more with outfile ...
# now done with outfile -- it will be closed, regardless of errors, etc.
# ... do other stuff ...

with invokes a context manager – which can be confusing, but for now, just follow this pattern – it really is more robust.

And you can even do two at once:

with open(source, 'rb') as infile, open(dest, 'wb') as outfile:
    outfile.write(infile.read())

Binary files

Python can open files in one of two modes:

  • Text

  • Binary

This is just what you’d think – if the file contains text, you want text mode. If the file contains arbitrary binary data, you want binary mode.

All data in all files is binary – that’s how computers work. So in Python 3, “text” actually means Unicode – which is a particular system for assigning a number (a “code point”) to every character.

But this too is complicated – there are multiple ways that binary data can be mapped to Unicode text, known as “encodings”. In Python, text files are by default opened with the platform’s preferred encoding, which is “utf-8” on most systems. These days, that mostly “just works”.
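A small sketch of what “multiple encodings” means in practice: the same four-character string becomes different bytes depending on the encoding, and decoding with the wrong one gives either wrong characters or an error. The sample word is arbitrary:

```python
text = "café"

utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'  (5 bytes)
latin1_bytes = text.encode("latin-1")   # b'caf\xe9'      (4 bytes)

# Decoding with the encoding that was used to write gives the text back.
assert utf8_bytes.decode("utf-8") == text

# Decoding with a different one gives wrong characters...
mojibake = utf8_bytes.decode("latin-1")   # 'cafÃ©'

# ...or fails outright.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(len(utf8_bytes), len(latin1_bytes))   # 5 4
print(mojibake, decode_failed)              # cafÃ© True
```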

But if you read a binary file as text, then Python will try to interpret the bytes as utf-8 encoded text – and this will likely fail:

In [13]: open("a_photo.jpg").read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-13-5c699bc20e80> in <module>()
----> 1 open("a_photo.jpg").read()

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

In Python 2, it’s less likely that you’ll get an error like this, because it doesn’t try to decode the file as it’s read, even for text files. That makes it a bit tricky and more error prone.

NOTE: If you want to actually DO anything with a binary file, other than passing it around, then you’ll need to know a lot about the details of what the bytes in the file mean – and most likely, you’ll use a library for that – like an image processing library for the jpeg example above.
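As a very small taste of “knowing what the bytes mean”, here is a sketch that opens a file in binary mode and checks its first two bytes. The file content is made up: it is not a real image, it only starts with FF D8, the two bytes that begin a JPEG file. The same file fails to open as text, just like the photo above:

```python
import os
import tempfile

# Made-up data: the JPEG start-of-image marker followed by filler bytes.
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a_photo.jpg")
    with open(path, "wb") as f:
        f.write(fake_jpeg)

    with open(path, "rb") as f:     # binary mode: no decoding, no error
        header = f.read(2)

    try:
        with open(path, encoding="utf-8") as f:   # text mode: tries to decode
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

looks_like_jpeg = header == b"\xff\xd8"
print(looks_like_jpeg, text_mode_failed)   # True True
```

For anything beyond a check like this, an image library is the right tool.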