Plain text input

In this section, you'll see how to match each word of plain text input against a known set of words. Any input word that is not found in this set will be displayed as part of the output. You'll see how to build the reference set of words from a dictionary file and what kind of data scrubbing is needed for this task.

Naive split

Here's a simple implementation that attempts to catch typos if input words are not present in the given dictionary file.

>>> def spell_check(text):
...     return [w for w in text.split() if w not in words]
... 
>>> word_file = 'word_files/words.txt'
>>> with open(word_file) as f:
...     words = {line.rstrip() for line in f}
... 
>>> spell_check('hi there')
[]
>>> spell_check('this has a tpyo')
['tpyo']
>>> spell_check('How are you?')
['How', 'you?']

The set data type uses hash-based membership lookup, which takes a constant amount of time on average, irrespective of the number of elements (see Hashtables for details). So, it is the ideal data type for storing dictionary words in this project.
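Here's a rough way to see that difference for yourself (the word list built below is made up for the demo, not the actual dictionary file):

```python
from timeit import timeit

# build a large collection of fake words (illustrative only)
words_list = [f'word{i}' for i in range(100_000)]
words_set = set(words_list)

# looking up an element near the end of the list forces a linear scan,
# while the set lookup is a single hash computation on average
list_time = timeit(lambda: 'word99999' in words_list, number=100)
set_time = timeit(lambda: 'word99999' in words_set, number=100)

print(f'list: {list_time:.4f}s, set: {set_time:.4f}s')
```

On a typical machine, the set lookup will be faster by several orders of magnitude, and the gap keeps growing as the collection gets bigger.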

The input lines from the dictionary file will have line ending characters, so the rstrip() string method is used to remove them. You can use the strip() method instead if there can be spurious whitespace characters at the start of the line as well.
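Here's a quick look at the difference between these two methods:

```python
line = '  caterpillar\n'

# rstrip() removes trailing whitespace, including the line ending
print(repr(line.rstrip()))   # '  caterpillar'

# strip() removes whitespace from both ends
print(repr(line.strip()))    # 'caterpillar'
```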

The spell_check() function accepts a string input and returns a list of words not found in the dictionary. In this naive implementation, the input text is split on whitespace and the resulting words are looked up in the set of dictionary words. As seen from the sample tests, punctuation characters and the case of the input string can result in false mismatches.

info /usr/share/dict/words is used as words.txt for this project. See wikipedia: words for a bit of information about the words file in different Linux distributions. See linuxwords if you want to view or download a smaller dictionary file for this project.

info You can use app.aspell.net to create dictionary files based on specific country, diacritic handling, etc.

Data scrubbing

Here's an improved version that removes punctuation and ignores case for word comparisons:

# plain_text.py
from string import punctuation

def spell_check(text):
    op = []
    for w in text.split():
        w = w.strip(punctuation)
        if w and w.lower() not in words:
            op.append(w)
    return op

word_file = 'word_files/words.txt'
with open(word_file) as f:
    words = {line.rstrip().lower() for line in f}

The lower() string method is applied to the lines of the dictionary file as well as the input words. This reduces false mismatches at the cost of losing typos that are related to the case of the text.

The other major change is removing punctuation characters at the start and end of input words. The built-in string.punctuation constant is passed to the strip() method, and the modified input words are then compared against the dictionary words.
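Note that strip() only removes characters from the ends of a string, so internal apostrophes survive (which is handy for words like I'm and That's):

```python
from string import punctuation

print('"Hello!"'.strip(punctuation))   # Hello
print("That's...".strip(punctuation))  # That's

# a caveat: apostrophes used as surrounding quotes are stripped too
print("'tis".strip(punctuation))       # tis
```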

Here are some sample test cases for this improved version:

>>> from plain_text import *
>>> spell_check('hi there')
[]
>>> spell_check('this has a tpyo')
['tpyo']
>>> spell_check('How are you?')
[]
>>> spell_check('# Headery titles')
['Headery']
>>> spell_check("I'm fine. That's nothing!")
[]

Unicode input

While this project assumes ASCII input for the most part, here's how you can adapt a few things to work with Unicode data. The pypi: regex module comes in handy, with character sets like \p{P} matching punctuation characters.

>>> from plain_text import *
>>> text = '“Should I get this gadget?”'
>>> spell_check(text)
['“Should', 'gadget?”']
# punctuation has only ASCII characters, hence the issue
>>> [w.strip(punctuation) for w in text.split()]
['“Should', 'I', 'get', 'this', 'gadget?”']

# regex module comes in handy for Unicode punctuation
>>> import regex
>>> [regex.sub(r'^\p{P}+|\p{P}+$', '', w) for w in text.split()]
['Should', 'I', 'get', 'this', 'gadget']

However, unlike string.punctuation, the \p{P} set doesn't consider symbols like >, +, and so on as punctuation characters. You'll have to add \p{S} as well to include such symbols.

>>> from string import punctuation
>>> text = '"+>foo=-'
>>> text.strip(punctuation)
'foo'

>>> import regex
>>> regex.sub(r'^\p{P}+|\p{P}+$', '', text)
'+>foo='
>>> regex.sub(r'^[\p{P}\p{S}]+|[\p{P}\p{S}]+$', '', text)
'foo'

info If you do not want to use the regex module, you can build all the Unicode punctuation/symbol characters using the unicodedata module. See this stackoverflow thread for details.
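Here's a sketch of that unicodedata approach. Iterating over every codepoint is slow, so you'd typically build this string once and reuse it:

```python
import sys
import unicodedata

# collect every codepoint whose Unicode general category starts with
# 'P' (punctuation) or 'S' (symbol)
punct_sym = ''.join(chr(i) for i in range(sys.maxunicode + 1)
                    if unicodedata.category(chr(i)).startswith(('P', 'S')))

# the resulting string can then be passed to strip(), just like
# string.punctuation was earlier
print('“gadget?”'.strip(punct_sym))   # gadget
print('"+>foo=-'.strip(punct_sym))   # foo
```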