Plain text input
In this section, you'll see how to match each word of plain text input against a known set of words. Any input word that is not found in this set will be displayed as part of the output. You'll see how to build the reference set of words from a dictionary file and what kind of data scrubbing is needed for this task.
Naive split
Here's a simple implementation that flags an input word as a possible typo if it is not present in the given dictionary file.
>>> def spell_check(text):
...     return [w for w in text.split() if w not in words]
...
>>> word_file = 'word_files/words.txt'
>>> with open(word_file) as f:
...     words = {line.rstrip() for line in f}
...
>>> spell_check('hi there')
[]
>>> spell_check('this has a tpyo')
['tpyo']
>>> spell_check('How are you?')
['How', 'you?']
The set data type uses hash-based membership lookup, which takes a constant amount of time on average, irrespective of the number of elements (see Hashtables for details). That makes it an ideal data type for storing the dictionary words in this project.
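As a rough illustration of the constant-time lookup described above, the sketch below times the same membership test against a list and a set. The names `entries`, `as_list`, `as_set`, and `target` are made up for this example, the data is a stand-in rather than the real dictionary file, and the exact timings will differ from machine to machine.

```python
from timeit import timeit

# Hypothetical stand-in for dictionary words; not the real words.txt
entries = [f'word{i}' for i in range(100_000)]
as_list = entries
as_set = set(entries)

# The last element is the worst case for a linear scan of the list
target = 'word99999'

# Both containers give the same answer for the membership test
print(target in as_list, target in as_set)

# The list is scanned element by element, the set hashes the key instead
list_time = timeit(lambda: target in as_list, number=100)
set_time = timeit(lambda: target in as_set, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.6f}s')
```

On typical hardware the set lookup should be faster by several orders of magnitude, and the gap grows with the number of words.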
The input lines from the dictionary file will have line ending characters, so the rstrip() string method is used to remove them. You can use the strip() method instead if there can be spurious whitespace characters at the start of the line as well.
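To make the difference between the two methods concrete, here is a small sketch. The `line` value is a made-up example of what a file line might look like, not taken from the actual dictionary file.

```python
# Hypothetical line as read from a file, with stray spaces around the word
line = '  apple \n'

trailing_removed = line.rstrip()     # whitespace removed from the end only
both_removed = line.strip()          # whitespace removed from both ends
newline_removed = line.rstrip('\n')  # only the trailing newline removed

print(repr(trailing_removed))  # '  apple'
print(repr(both_removed))      # 'apple'
print(repr(newline_removed))   # '  apple '
```

Passing '\n' explicitly is an option if other trailing whitespace should be preserved, though that is unlikely to matter for a word list.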
The spell_check() function accepts a string input and returns a list of words not found in the dictionary. In this naive implementation, the input text is split on whitespace and the resulting words are compared as is. As seen from the sample tests, punctuation characters and letter case in the input can result in false mismatches.
/usr/share/dict/words is used as words.txt for this project. See wikipedia: words for a bit of information about the words file in different Linux distributions. See linuxwords if you want to view or download a smaller dictionary file for this project.
You can use app.aspell.net to create dictionary files based on specific country, diacritic handling, etc.
Data scrubbing
Here's an improved version that removes punctuation and ignores case for word comparisons:
# plain_text.py
from string import punctuation

def spell_check(text):
    op = []
    for w in text.split():
        # remove punctuation from both ends of the word
        w = w.strip(punctuation)
        # skip empty strings and compare case-insensitively
        if w and w.lower() not in words:
            op.append(w)
    return op

word_file = 'word_files/words.txt'
with open(word_file) as f:
    # dictionary words are stored in lowercase
    words = {line.rstrip().lower() for line in f}
The lower() string method is applied to the lines of the dictionary file as well as the input words. This reduces false mismatches at the cost of missing typos that are related to letter case.
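The trade-off can be seen with a tiny stand-in dictionary. In the sketch below, `known`, `known_lower`, `case_sensitive()`, and `case_insensitive()` are hypothetical names used only for illustration and are not part of plain_text.py.

```python
# Hypothetical miniature dictionary; the real project loads words.txt
known = {'Python', 'is', 'fun'}
known_lower = {w.lower() for w in known}

def case_sensitive(text):
    # words must match a dictionary entry exactly
    return [w for w in text.split() if w not in known]

def case_insensitive(text):
    # words are lowercased before lookup
    return [w for w in text.split() if w.lower() not in known_lower]

text = 'PYthon Is fun'
print(case_sensitive(text))    # ['PYthon', 'Is']
print(case_insensitive(text))  # []
```

The exact comparison wrongly flags 'Is', while the lowercase comparison lets the genuine case typo 'PYthon' slip through.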
The other major change is removing punctuation characters at the start and end of input words. The built-in string.punctuation constant is passed to the strip() method, and the modified input words are then compared against the dictionary words.
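The following sketch shows what strip() does with string.punctuation on a few made-up tokens (`samples` and `stripped` are names invented for this example). Only the characters at either end are removed, and a token made entirely of punctuation becomes an empty string, which is presumably why the function checks that the word is non-empty before the dictionary lookup.

```python
from string import punctuation

# The ASCII punctuation characters that strip() will remove
print(punctuation)

# Hypothetical sample tokens, as produced by splitting text on whitespace
samples = ['you?', '"quoted"', "I'm", 'well-known', '...', '#']
stripped = [w.strip(punctuation) for w in samples]

for before, after in zip(samples, stripped):
    print(repr(before), '->', repr(after))
```

Characters inside a word, such as the apostrophe in "I'm" or the hyphen in 'well-known', are left untouched.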
Here are some sample test cases with this improved version:
>>> from plain_text import *
>>> spell_check('hi there')
[]
>>> spell_check('this has a tpyo')
['tpyo']
>>> spell_check('How are you?')
[]
>>> spell_check('# Headery titles')
['Headery']
>>> spell_check("I'm fine. That's nothing!")
[]
Unicode input
While this project assumes ASCII input for the most part, here's how you can adapt a few things for working with Unicode data. The pypi: regex module comes in handy with character sets like \p{P} for punctuation characters.
>>> from plain_text import *
>>> text = '“Should I get this gadget?”'
>>> spell_check(text)
['“Should', 'gadget?”']
# punctuation has only ASCII characters, hence the issue
>>> [w.strip(punctuation) for w in text.split()]
['“Should', 'I', 'get', 'this', 'gadget?”']
# regex module comes in handy for Unicode punctuation
>>> import regex
>>> [regex.sub(r'^\p{P}+|\p{P}+$', '', w) for w in text.split()]
['Should', 'I', 'get', 'this', 'gadget']
However, unlike string.punctuation, the \p{P} set doesn't consider symbols like >, +, etc. as punctuation characters. You'll have to use \p{S} as well to include such symbols.
>>> from string import punctuation
>>> text = '"+>foo=-'
>>> text.strip(punctuation)
'foo'
>>> import regex
>>> regex.sub(r'^\p{P}+|\p{P}+$', '', text)
'+>foo='
>>> regex.sub(r'^[\p{P}\p{S}]+|[\p{P}\p{S}]+$', '', text)
'foo'
If you do not want to use the regex module, you can build all the Unicode punctuation/symbol characters using the unicodedata module. See this stackoverflow thread for details.
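One possible way to do this with only the standard library is sketched below. It is not necessarily the approach from the linked thread. The name `punct_symbols` is made up for this example, and the set is built from every code point whose Unicode general category starts with P (punctuation) or S (symbol).

```python
import sys
import unicodedata

# Collect every character whose general category starts with
# 'P' (punctuation) or 'S' (symbol); this scans all code points once
punct_symbols = ''.join(
    ch for ch in map(chr, range(sys.maxunicode + 1))
    if unicodedata.category(ch)[0] in 'PS'
)

text = '“Should I get this gadget?”'
cleaned = [w.strip(punct_symbols) for w in text.split()]
print(cleaned)  # ['Should', 'I', 'get', 'this', 'gadget']

# ASCII symbols such as + and > are covered as well
print('"+>foo=-'.strip(punct_symbols))  # foo
```

Building the string takes a moment since every code point is checked, so it makes sense to compute it once at import time and reuse it.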