Dealing with files

This chapter will discuss the open() built-in function and introduce some of the built-in modules for file processing.

open and close

The open() built-in function is one of the ways to read and write files. The first argument to this function is the name of the file to be processed, given as a relative or absolute path. The rest are optional arguments that you can configure, usually passed as keywords. In text mode, the return value is a TextIOWrapper object (i.e. a filehandle), which you can use as an iterator over the lines of the file. Here's an example:

# default mode is rt i.e. read text
>>> fh = open('ip.txt')
>>> fh
<_io.TextIOWrapper name='ip.txt' mode='r' encoding='UTF-8'>
>>> next(fh)
'hi there\n'
>>> next(fh)
'today is sunny\n'
>>> next(fh)
'have a nice day\n'
>>> next(fh)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

# check if the filehandle is active or closed
>>> fh.closed
False
# close the filehandle
>>> fh.close()
>>> fh.closed
True

The mode argument specifies what kind of processing you want. Only text mode, which is the default, will be covered in this chapter. You can combine options, for example, rb means read in binary mode. Here are the relevant details from the documentation:

  • 'r' open for reading (default)
  • 'w' open for writing, truncating the file first
  • 'x' open for exclusive creation, failing if the file already exists
  • 'a' open for writing, appending to the end of the file if it exists
  • 'b' binary mode
  • 't' text mode (default)
  • '+' open for updating (reading and writing)
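Here's a minimal sketch of how the x, a and w modes differ in practice. The filename demo.txt and the temporary directory are assumptions made for illustration, so that no real file gets touched.

```python
import os
import tempfile

# work inside a throwaway directory so that no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical filename

    # 'x' creates the file, but only if it doesn't exist yet
    with open(path, mode='x') as f:
        f.write('first\n')

    # a second 'x' on the same path fails with FileExistsError
    try:
        open(path, mode='x')
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

    # 'a' keeps the existing content and adds to the end
    with open(path, mode='a') as f:
        f.write('second\n')
    with open(path) as f:
        after_append = f.read()

    # 'w' truncates the file before writing
    with open(path, mode='w') as f:
        f.write('third\n')
    with open(path) as f:
        after_write = f.read()

print(exclusive_failed)    # True
print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'third\n'
```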

The encoding argument is meaningful only in text mode. If you don't specify it, a platform-dependent default is used. You can check the default encoding for your environment using the locale module as shown below. See docs.python: standard encodings and docs.python HOWTOs: Unicode for more details.

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

Here's how Python handles line separators by default (see the documentation for more details):

On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

On output, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep.
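The input side of this behavior can be sketched as follows. The file mixed.txt is a made-up sample, written in binary mode so that the three kinds of line endings are stored exactly as given. Passing newline='' is shown for contrast: going by the documentation, lines are still split on all three endings, but they are returned untranslated.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mixed.txt')  # hypothetical filename

    # write raw bytes so that the three line endings are stored as is
    with open(path, mode='wb') as f:
        f.write(b'a\r\nb\rc\n')

    # default (newline=None): every line ending is translated to '\n'
    with open(path, encoding='ascii') as f:
        translated = f.readlines()

    # newline='': lines are still split, but the endings are left untouched
    with open(path, encoding='ascii', newline='') as f:
        untranslated = f.readlines()

print(translated)    # ['a\n', 'b\n', 'c\n']
print(untranslated)  # ['a\r\n', 'b\r', 'c\n']
```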

If the given filename doesn't exist, you'll get a FileNotFoundError exception.

>>> open('xyz.txt', mode='r', encoding='ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'xyz.txt'
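If a missing file is an expected situation in your program, you can catch the exception instead of letting it terminate the program. Here's one possible sketch; the function name read_or_default and the fallback behavior are assumptions made for illustration.

```python
def read_or_default(filename, default=''):
    """Return the contents of filename, or default if the file doesn't exist."""
    try:
        with open(filename, mode='r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        return default

# hypothetical path that is not expected to exist
content = read_or_default('no_such_dir_xyz/no_such_file.txt', default='(missing)')
print(content)  # (missing)
```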

Context manager

Quoting from docs.python: Reading and Writing Files:

It is good practice to use the with keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using with is also much shorter than writing equivalent try-finally blocks.

# read_file.py
with open('ip.txt', mode='r', encoding='ascii') as f:
    for ip_line in f:
        op_line = ip_line.rstrip('\n').capitalize() + '.'
        print(op_line)

Recall that the as keyword was seen before in the Different ways of importing and try-except sections. Here's the output of the above program:

$ python3.9 read_file.py
Hi there.
Today is sunny.
Have a nice day.
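For comparison, here's a rough sketch of the same processing written with try-finally instead of with. It isn't an exact expansion of what the with statement does, only an illustration of the cleanup you'd otherwise have to write yourself. The sample file is recreated in a temporary directory so that the snippet runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # recreate the sample input file used in this chapter
    path = os.path.join(tmp_dir, 'ip.txt')
    with open(path, mode='w', encoding='ascii') as f:
        f.write('hi there\ntoday is sunny\nhave a nice day\n')

    # manual version: open, process, and always close
    fh = open(path, mode='r', encoding='ascii')
    try:
        op_lines = [line.rstrip('\n').capitalize() + '.' for line in fh]
    finally:
        # runs whether or not the try block raised an exception
        fh.close()

print(op_lines)   # ['Hi there.', 'Today is sunny.', 'Have a nice day.']
print(fh.closed)  # True
```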

info See The Magic of Python Context Managers for more examples and details.

read, readline and readlines

The read() method gives you the entire remaining contents of the file as a single string. The readline() method gives the next line of text and readlines() gives all the remaining lines as a list of strings.

>>> open('ip.txt').read()
'hi there\ntoday is sunny\nhave a nice day\n'

>>> fh = open('ip.txt')
# readline() is similar to next(), but returns an empty string
# at the end of the file instead of raising StopIteration
>>> fh.readline()
'hi there\n'
>>> fh.readlines()
['today is sunny\n', 'have a nice day\n']
>>> fh.readline()
''
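The word "remaining" matters when these methods are mixed on the same filehandle. Here's a small sketch of that, with the sample file recreated in a temporary directory so that the snippet is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ip.txt')
    with open(path, mode='w', encoding='ascii') as f:
        f.write('hi there\ntoday is sunny\nhave a nice day\n')

    with open(path, encoding='ascii') as f:
        first = f.readline()  # consumes only the first line
        rest = f.read()       # everything after the first line
        at_end = f.read()     # nothing left to read

print(repr(first))   # 'hi there\n'
print(repr(rest))    # 'today is sunny\nhave a nice day\n'
print(repr(at_end))  # ''
```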

write

# write_file.py
with open('op.txt', mode='w', encoding='ascii') as f:
    f.write('this is a sample line of text\n')
    f.write('yet another line\n')

You can call the write() method on a filehandle to add contents to that file (provided the mode you have set supports writing). Unlike print(), the write() method doesn't automatically add newline characters.

$ python3.9 write_file.py

$ cat op.txt
this is a sample line of text
yet another line

$ file op.txt
op.txt: ASCII text

warning If the file already exists, the w mode will overwrite the contents (i.e. existing content will be lost).

info You can also use the print() function for writing by passing the filehandle to the file argument. The fileinput module supports in-place editing and other features (see In-place editing with fileinput section for examples).
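Here's a short sketch of writing with print() and the file argument. Unlike write(), print() adds the newline for you and accepts multiple values. The output file is created in a temporary directory for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'op.txt')

    with open(path, mode='w', encoding='ascii') as f:
        # the newline is added automatically, as with normal print() usage
        print('this is a sample line of text', file=f)
        # multiple values are joined using sep (a space by default)
        print('yet', 'another', 'line', file=f)

    with open(path, encoding='ascii') as f:
        written = f.read()

print(repr(written))  # 'this is a sample line of text\nyet another line\n'
```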

File processing modules

This section gives introductory examples for some of the built-in modules that are handy for file processing. Quoting from docs.python: os:

This module provides a portable way of using operating system dependent functionality.

>>> import os

# current working directory
>>> os.getcwd()
'/home/learnbyexample/Python/programs'

# value of an environment variable
>>> os.getenv('SHELL')
'/bin/bash'

# file size
>>> os.stat('ip.txt').st_size
40

# check if given path is a file
>>> os.path.isfile('ip.txt')
True
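A few more os.path functions are sketched below as a script. The selection (getsize, isdir, exists, basename) is my own pick for illustration, and the sample file is recreated in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ip.txt')
    # newline='\n' turns off newline translation,
    # so the file size is the same on every platform
    with open(path, mode='w', encoding='ascii', newline='\n') as f:
        f.write('hi there\ntoday is sunny\nhave a nice day\n')

    size_from_stat = os.stat(path).st_size
    size_from_getsize = os.path.getsize(path)  # shortcut for the above
    is_file = os.path.isfile(path)
    dir_is_file = os.path.isfile(tmp_dir)      # a directory is not a file
    dir_is_dir = os.path.isdir(tmp_dir)
    missing_exists = os.path.exists(os.path.join(tmp_dir, 'xyz.txt'))
    base_name = os.path.basename(path)

print(size_from_stat, size_from_getsize)  # 40 40
print(is_file, dir_is_file, dir_is_dir)   # True False True
print(missing_exists)                     # False
print(base_name)                          # ip.txt
```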

Quoting from docs.python: glob:

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched.

>>> import glob

# list of files (including directories) containing '_file' in their name
>>> glob.glob('*_file*')
['read_file.py', 'write_file.py']
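The ? and [] wildcards mentioned in the quote can be sketched like this. The filenames are made up and created as empty files in a temporary directory. Results are sorted because glob doesn't guarantee any order, and glob.escape() is used in case the directory name contains wildcard characters.

```python
import glob
import os
import tempfile

def match(directory, pattern):
    """Return sorted base names in directory that match the glob pattern."""
    full_pattern = os.path.join(glob.escape(directory), pattern)
    return sorted(os.path.basename(p) for p in glob.glob(full_pattern))

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical set of empty files to match against
    for name in ('read_file.py', 'write_file.py', 'ip.txt', 'f1.txt', 'f2.txt'):
        open(os.path.join(tmp_dir, name), mode='w').close()

    star = match(tmp_dir, '*_file*')     # * matches any number of characters
    question = match(tmp_dir, 'f?.txt')  # ? matches exactly one character
    bracket = match(tmp_dir, '[ir]*')    # names starting with i or r

print(star)      # ['read_file.py', 'write_file.py']
print(question)  # ['f1.txt', 'f2.txt']
print(bracket)   # ['ip.txt', 'read_file.py']
```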

Quoting from docs.python: shutil:

The shutil module offers a number of high-level operations on files and collections of files. In particular, functions are provided which support file copying and removal.

>>> import shutil

>>> shutil.copy('ip.txt', 'ip_file.txt')
'ip_file.txt'
>>> glob.glob('*_file*')
['read_file.py', 'ip_file.txt', 'write_file.py']
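Here's a sketch covering a few more shutil operations: copying into a directory, moving (renaming) a file and removing a whole directory tree. The directory layout and names like backup and renamed.txt are assumptions made for illustration.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()  # scratch directory, removed at the end
src = os.path.join(base, 'ip.txt')
with open(src, mode='w', encoding='ascii') as f:
    f.write('hi there\n')

backup_dir = os.path.join(base, 'backup')
os.mkdir(backup_dir)

# if the destination is a directory, the file keeps its name inside it
copied = shutil.copy(src, backup_dir)

# move works as a rename when source and destination are on the same filesystem
moved = shutil.move(src, os.path.join(base, 'renamed.txt'))

names = sorted(os.listdir(base))

# rmtree deletes a directory along with everything inside it
shutil.rmtree(base)
removed = not os.path.exists(base)

print(os.path.basename(copied))  # ip.txt
print(os.path.basename(moved))   # renamed.txt
print(names)                     # ['backup', 'renamed.txt']
print(removed)                   # True
```

Be careful with rmtree() in real programs, since the removal is permanent.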

Quoting from docs.python: pathlib:

This module offers classes representing filesystem paths with semantics appropriate for different operating systems. Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

>>> from pathlib import Path

# use 'rglob' instead of 'glob' if you want to match names recursively
>>> list(Path('programs').glob('*file.py'))
[PosixPath('programs/read_file.py'), PosixPath('programs/write_file.py')]
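The difference between pure and concrete paths mentioned in the quote can be sketched as follows. PurePosixPath is used for the first half so that the results look the same on any operating system; the file written in the second half lives in a temporary directory.

```python
import tempfile
from pathlib import Path, PurePosixPath

# pure path: only computations on the path itself, nothing on disk is accessed
p = PurePosixPath('programs/read_file.py')
print(p.name)                 # read_file.py
print(p.stem)                 # read_file
print(p.suffix)               # .py
print(p.parent)               # programs
print(p.with_suffix('.txt'))  # programs/read_file.txt

# concrete path: same computations, plus methods that perform I/O
with tempfile.TemporaryDirectory() as tmp_dir:
    op = Path(tmp_dir) / 'op.txt'  # the / operator joins path components
    exists_before = op.exists()
    op.write_text('yet another line\n', encoding='ascii')
    exists_after = op.exists()
    content = op.read_text(encoding='ascii')

print(exists_before, exists_after)  # False True
print(repr(content))                # 'yet another line\n'
```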

See pathlib module: taming the file system and stackoverflow: How can I iterate over files in a given directory? for more details and examples.

There are specialized modules for structured data processing as well, for example, the csv and json modules from the standard library.
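As a small taste, here's a sketch using the csv and json modules from the standard library. The sample data is made up, and io.StringIO stands in for a real file opened in text mode.

```python
import csv
import io
import json

# csv: parse comma separated text into lists of strings
csv_text = 'name,marks\nRavi,86\nAmy,92\n'  # made-up sample data
rows = list(csv.reader(io.StringIO(csv_text)))

# json: convert between Python objects and their text representation
record = {'name': 'Amy', 'marks': 92}
encoded = json.dumps(record)
decoded = json.loads(encoded)

print(rows)               # [['name', 'marks'], ['Ravi', '86'], ['Amy', '92']]
print(encoded)            # {"name": "Amy", "marks": 92}
print(decoded == record)  # True
```

Note that csv.reader gives every field as a string, so numeric columns have to be converted explicitly.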

Exercises

  • Write a program that reads a file named f1.txt, which contains a single column of numbers written in Python syntax. Your task is to display the sum of these numbers, which is 10485.14 for the given example.

    $ cat f1.txt 
    8
    53
    3.14
    84
    73e2
    100
    2937
    
  • Read the documentation for glob.glob() and write a program to list all files ending with .txt in the current directory as well as sub-directories, recursively.