# Markdown input
In this section, you'll see how to check for typos in Markdown input files. A complete Markdown parser is out of scope for this project, but you'll see how a few lines of code can prevent code snippets and hyperlinks from being flagged as typos. You'll also see how to manage multiple input files.
From wikipedia: Markdown:

>Markdown is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber and Aaron Swartz created Markdown in 2004 as a markup language that is appealing to human readers in its source code form. Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Single Markdown file
There are different implementations of Markdown. I use GitHub Flavored Markdown; see its spec for details.
The contents of `md_files/sample.md` are shown below. Code blocks (which can span multiple lines) are enclosed by lines starting with three or more backticks, and a specific programming language can be given for syntax highlighting purposes. Lines starting with `#` character(s) are headers. Inline code is formatted by surrounding the code with backticks. Quotes start with the `>` character. Hyperlinks are created using the `[link text](hyperlink)` format. And so on.
````
# re introduction

In this chapter, you'll get an introduction of `re` module
that is part of Python's standard library.

## re.search

Use `re.search` function to tesr if the the given regexp pattern
matches the input string. Syntax is shown below:

>`re.search(pattern, string, flags=0)`

```python
>>> sentence = 'This is a sample string'
>>> bool(re.search(r'is.*am', sentence))
True
>>> bool(re.search(r'str$', sentence))
False
```

[My book](https://github.com/learnbyexample/py_regular_expressions)
on Python regexp has more details.
````
Writing a parser to handle the complete Markdown spec is out of scope for this project. The main aim here is to find spelling issues in normal text. That means avoiding code blocks, inline code, hyperlinks, and so on. Here's one such implementation:
```python
# markdown.py
import re
from string import punctuation

def spell_check(words, text):
    for w in text.split():
        w = w.strip(punctuation + '—')
        if w and w.lower() not in words:
            yield w

def process_md(words, md_file):
    links = re.compile(r'\[([^]]+)\]\([^)]+\)')
    inline_code = re.compile(r'`[^`]+`')
    hist = {}
    code_block = False
    with open(md_file) as f:
        for line in f:
            if line.startswith('```'):
                code_block = not code_block
            elif not code_block:
                line = links.sub(r'\1', line)
                line = inline_code.sub('', line)
                for w in spell_check(words, line):
                    hist[w] = hist.get(w, 0) + 1
    return hist

if __name__ == '__main__':
    word_file = 'word_files/words.txt'
    with open(word_file) as f:
        words = {line.rstrip().lower() for line in f}

    hist = process_md(words, 'md_files/sample.md')
    for k in sorted(hist, key=lambda k: (k.lower(), -hist[k])):
        print(f'{k}: {hist[k]}')
```
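Before walking through the changes, here's a quick REPL check of `spell_check()` with a toy word set (the words and sample text here are made up for illustration; run this from the directory containing `markdown.py`):

```python
>>> from markdown import spell_check
>>> words = {'hello', 'to', 'the'}
>>> list(spell_check(words, 'Hello to thw — world!'))
['thw', 'world']
```

Leading and trailing punctuation (including the em dash) is stripped and the comparison is case insensitive, so only `thw` and `world` are reported against this toy word set.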
Here's an explanation of the additional code compared to the plain text implementation seen earlier:
- Em dash `—` is also scrubbed as a punctuation character.
- The `words` set is passed to the `spell_check()` function as an argument instead of using global variables. The `process_md()` function takes care of removing code blocks, hyperlinks, etc.
- The `code_block` flag is used here to skip code blocks.
    - See softwareengineering: FSM examples if you are not familiar with state machines.
- As mentioned earlier, the hyperlink formatting is `[link text](hyperlink)`. The `links` regexp `\[([^]]+)\]\([^)]+\)` handles this case. The portion between the `[` and `]` characters is captured and the rest of the text gets deleted.
    - You can use sites like regex101 and debuggex to understand this regexp better. See my Python re(gex)? ebook if you want to learn more about regular expressions.
- The `inline_code` regexp `` `[^`]+` `` deletes inline code from the input text (both substitutions are demonstrated after this list).
- After these processing steps, the remaining text is passed to the `spell_check()` function.
- Typos (especially false mismatches) might be repeated multiple times in the given input file. So, a histogram is created here to save the potential typos as keys and their number of occurrences as values.
- Since a dictionary data type is being used to handle the potential list of typos, the `spell_check()` function has been changed to `yield` the words one by one instead of returning a list of words.
    - See stackoverflow: What does the yield keyword do? if you want to know more about the `yield` keyword.
- Finally, the potential typos are displayed in alphabetical order.
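And here's the demo promised above, showing the two substitutions on a made-up input line (the string is my own, not from `sample.md`):

```python
>>> import re
>>> links = re.compile(r'\[([^]]+)\]\([^)]+\)')
>>> inline_code = re.compile(r'`[^`]+`')
>>> line = 'Use `re.search` as shown in [My book](https://example.com)'
>>> line = links.sub(r'\1', line)
>>> line
'Use `re.search` as shown in My book'
>>> inline_code.sub('', line)
'Use  as shown in My book'
```

The link is reduced to its text and the inline code is removed entirely, leaving only normal text for `spell_check()` to process.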
```bash
$ python3.9 markdown.py
re.search: 1
regexp: 2
tesr: 1
```
Even with this narrowed-down version of Markdown parsing, there are cases that aren't handled properly:
- The content of a code block might itself have lines starting with triple backticks; in that case, the surrounding code block markers have to use a higher number of backticks. That's how the contents of `md_files/sample.md` were displayed above. This scenario will not be parsed properly by the above implementation.
    - As a workaround, you can save the number of backticks in the opening marker and look for a closing marker with at least that many backticks (see the sketch after this list).
- Similarly, inline code can have backtick characters and hyperlinks can have `()` characters. Again, this isn't handled by the above implementation.
    - You can use a regexp to handle a few levels of nesting. Or, you can even implement a recursive regexp with the third party `regex` module. See the Recursive matching section from my regexp ebook for details on both these workarounds.
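Here's the sketch for the first workaround mentioned in the list above. It remembers the length of the opening fence and treats a later fence as closing only if it has at least that many backticks. The `strip_code_blocks()` helper is my own illustration, not part of the programs in this chapter:

```python
import re

def strip_code_blocks(lines):
    """Yield only the lines outside fenced code blocks."""
    opening = 0     # length of the opening fence, 0 while outside a block
    for line in lines:
        m = re.match(r'(`{3,})', line)
        if m:
            if opening == 0:
                opening = len(m[1])     # entering a code block
            elif len(m[1]) >= opening:
                opening = 0             # fence long enough to close the block
            continue                    # fence lines themselves are skipped
        if opening == 0:
            yield line
```

In `process_md()`, the `for line in f` loop could then iterate over `strip_code_blocks(f)` instead of toggling the `code_block` flag.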
## Multiple files
A project could have multiple Markdown files, and they might not all be grouped together in a single directory. Another improvement is maintaining extra word files that cover false mismatches like programming terms, or even valid words that are not present in the reference dictionary file.
Here's one such implementation:
```python
# typos.py
import glob
import re
from string import punctuation

def reference_words(word_files):
    words = set()
    for word_file in word_files:
        with open(word_file) as f:
            words.update(line.rsplit(':', 1)[0].rstrip().lower() for line in f)
    return words

def spell_check(words, text):
    for w in text.split():
        w = w.strip(punctuation + '—')
        if w and w.lower() not in words:
            yield w

def process_md(words, md_file):
    links = re.compile(r'\[([^]]+)\]\([^)]+\)')
    inline_code = re.compile(r'`[^`]+`')
    hist = {}
    code_block = False
    with open(md_file) as f:
        for line in f:
            if line.startswith('```'):
                code_block = not code_block
            elif not code_block:
                line = links.sub(r'\1', line)
                line = inline_code.sub('', line)
                for w in spell_check(words, line):
                    hist[w] = hist.get(w, 0) + 1
    return hist

if __name__ == '__main__':
    word_files = glob.glob('word_files/**/*.txt', recursive=True)
    words = reference_words(word_files)

    with open('typos.log', 'w') as opf:
        for md in glob.glob('md_files/**/*.md', recursive=True):
            hist = process_md(words, md)
            if hist:
                opf.write(f'{md}\n')
                for k in sorted(hist, key=lambda k: (k.lower(), -hist[k])):
                    opf.write(f'{k}: {hist[k]}\n')
                opf.write(f'{"-" * 50}\n\n')
```
- The `glob` module is helpful to get all the filenames that match the given wildcard expression. `*.txt` will match all files ending with the `.txt` extension. If you want to match filenames from sub-directories at any depth as well, prefix the expression with `**/` and set the `recursive` parameter to `True`.
    - See docs.python: glob and wikipedia: glob for more details.
- The `reference_words()` function accepts a sequence of files from which the `words` set will be built.
    - You might also notice that `rsplit()` processing has been added. This makes it easier to build extra reference files by copy pasting the false mismatches from the output of this program (see the demo after this list). Or, if you are not lazy like me, you could copy paste only the relevant string instead of whole lines and avoid this extra pre-processing step.
- The Markdown input files are also determined recursively using the `glob` module.
- The output is now formatted with a filename prefix to make it easier to find and fix the typos.
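Here's a small REPL session to make these points concrete (the file listing shown is hypothetical):

```python
>>> import glob
>>> glob.glob('word_files/**/*.txt', recursive=True)
['word_files/words.txt', 'word_files/programming_terms.txt']
>>> 'lookarounds: 3'.rsplit(':', 1)[0]
'lookarounds'
>>> 'oleander'.rsplit(':', 1)[0]
'oleander'
```

Note that plain dictionary lines without a colon pass through `rsplit()` unchanged, so both kinds of word files can be processed the same way.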
Here's a sample output with the `word_files` directory containing only the `words.txt` file:
```bash
$ python3.9 typos.py
$ cat typos.log
md_files/sample.md
re.search: 1
regexp: 2
tesr: 1
--------------------------------------------------

md_files/re/lookarounds.md
groupins: 1
lookahead: 2
lookarounds: 3
Lookarounds: 1
lookbehind: 2
--------------------------------------------------
```
Some of the terms in the above output are false mismatches. Save such lines in a separate file as shown below:
```bash
$ cat word_files/programming_terms.txt
re.search: 1
regexp: 2
lookahead: 2
lookarounds: 3
lookbehind: 2
```
Running the program again will give only the valid typos:
```bash
$ python3.9 typos.py
$ cat typos.log
md_files/sample.md
tesr: 1
--------------------------------------------------

md_files/re/lookarounds.md
groupins: 1
--------------------------------------------------
```
## Managing word files
You can have any number of extra files to serve as word references. For example, if you are processing a text file of a novel, you might want to create a file for missing dictionary words, another for characters, yet another for fictional words, etc. That way, you can reuse specific files for future projects and this also makes it easier to manually review these files later for mistakes.
You can also speed up creating these extra files by filtering words with a minimum count, three for example. You would still have to review the results manually, but it reduces the copy paste effort. With multiple input files, this minimum count makes more sense when you maintain a histogram of mismatches across all the input files and filter at the end, instead of on a per file basis. A sketch of this idea is shown below.
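Here's a minimal sketch of that idea. It assumes it is run from the same directory as `typos.py` (so its functions can be imported); the script name, the `total` histogram and the `MIN_COUNT` threshold are my own choices:

```python
# frequent_typos.py
import glob
from typos import reference_words, process_md

MIN_COUNT = 3   # arbitrary threshold, tune it for your project

word_files = glob.glob('word_files/**/*.txt', recursive=True)
words = reference_words(word_files)

# aggregate the mismatches across all the input files
total = {}
for md in glob.glob('md_files/**/*.md', recursive=True):
    for w, cnt in process_md(words, md).items():
        total[w] = total.get(w, 0) + cnt

# frequent mismatches are likely candidates for the extra word files
for w in sorted(w for w in total if total[w] >= MIN_COUNT):
    print(f'{w}: {total[w]}')
```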