Markdown input

In this section, you'll see how to check Markdown input files for typos. A complete Markdown parser is out of scope for this project, but you'll see how a few lines of code can prevent code snippets and hyperlinks from being checked for typos. You'll also see how to manage multiple input files.

From the Wikipedia article on Markdown:

Markdown is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber and Aaron Swartz created Markdown in 2004 as a markup language that is appealing to human readers in its source code form. Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.

Single Markdown file

There are different implementations of Markdown. I use GitHub Flavored Markdown; see its spec for details.

The contents of md_files/sample.md are shown below. Code blocks (which can span multiple lines) are specified by surrounding them with lines starting with three or more backticks. A specific programming language can be given for syntax highlighting purposes. Lines starting with # character(s) are headers. Inline code can be formatted by surrounding the code with backticks. Blockquotes start with the > character. Hyperlinks are created using the [link text](hyperlink) format, and so on.

# re introduction

In this chapter, you'll get an introduction of `re` module  
that is part of Python's standard library.

## re.search

Use `re.search` function to tesr if the the given regexp pattern  
matches the input string. Syntax is shown below:

>`re.search(pattern, string, flags=0)`

```python
>>> sentence = 'This is a sample string'
>>> bool(re.search(r'is.*am', sentence))
True
>>> bool(re.search(r'str$', sentence))
False
```

[My book](https://github.com/learnbyexample/py_regular_expressions)  
on Python regexp has more details.

Writing a parser to handle the complete Markdown spec is out of scope for this project. The main aim here is to find spelling issues in normal text, which means skipping code blocks, inline code, hyperlinks, etc. Here's one such implementation:

# markdown.py
import re
from string import punctuation

def spell_check(words, text):
    # yield words not present in the reference set, ignoring case
    # and surrounding punctuation (including em dash)
    for w in text.split():
        w = w.strip(punctuation + '—')
        if w and w.lower() not in words:
            yield w

def process_md(words, md_file):
    links = re.compile(r'\[([^]]+)\]\([^)]+\)')
    inline_code = re.compile(r'`[^`]+`')
    hist = {}
    code_block = False
    with open(md_file) as f:
        for line in f:
            # toggle the flag at code block markers, skip lines in between
            if line.startswith('```'):
                code_block = not code_block
            elif not code_block:
                # replace hyperlinks with just the link text
                line = links.sub(r'\1', line)
                # remove inline code
                line = inline_code.sub('', line)
                for w in spell_check(words, line):
                    hist[w] = hist.get(w, 0) + 1
    return hist

if __name__ == '__main__':
    word_file = 'word_files/words.txt'
    with open(word_file) as f:
        words = {line.rstrip().lower() for line in f}

    hist = process_md(words, 'md_files/sample.md')
    for k in sorted(hist, key=lambda k: (k.lower(), -hist[k])):
        print(f'{k}: {hist[k]}')

Here's an explanation of the additional code compared to the plain text implementation seen earlier:

  • The em dash (—) is also stripped, in addition to the punctuation characters.
  • The words set is passed to the spell_check() function as an argument instead of using global variables.
  • process_md() function takes care of removing code blocks, hyperlinks, etc.
    • The code_block flag is used here to skip code blocks.
    • As mentioned earlier, the hyperlink formatting is [link text](hyperlink). The links regexp \[([^]]+)\]\([^)]+\) handles this case. The portion between the [ and ] characters is captured and the rest of the match gets deleted (see the demo after this list).
      • You can use sites like regex101 and debuggex to understand this regexp better. See my Python re(gex)? ebook if you want to learn more about regular expressions.
    • The inline_code regexp `[^`]+` deletes inline code from input text.
    • After these processing steps, the remaining text is passed to the spell_check() function.
    • Typos (especially false mismatches) might be repeated multiple times in the given input file. So, a histogram is created here to save the potential typos as keys and their number of occurrences as values.
    • Since a dictionary data type is being used to handle the potential list of typos, the spell_check() function has been changed to yield the words one by one instead of returning a list of words.
  • Finally, the potential typos are displayed in case-insensitive alphabetical order.
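To see how the scrubbing steps behave, here's a short REPL session (the sample strings are made up for illustration):

>>> import re
>>> from string import punctuation

>>> 'regexp,—'.strip(punctuation + '—')
'regexp'

>>> line = 'See [My book](https://github.com/learnbyexample/py_regular_expressions) and `re.search`'
>>> line = re.sub(r'\[([^]]+)\]\([^)]+\)', r'\1', line)
>>> line
'See My book and `re.search`'
>>> re.sub(r'`[^`]+`', '', line)
'See My book and '

And here's the output of the above program: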
$ python3.9 markdown.py
re.search: 1
regexp: 2
tesr: 1

Even with this narrowed scope of Markdown parsing, some cases aren't handled properly:

  • The content of a code block might itself need to display lines starting with three backticks, in which case the enclosing code block markers will use more than three backticks. That's how the contents of md_files/sample.md were displayed above. This scenario will not be parsed properly by the above implementation.
    • As a workaround, you can save the number of backticks in the starting marker and look for an ending marker with at least as many backticks (see the sketch after this list).
  • Similarly, inline code can have backtick characters and hyperlinks can have () characters. Again, this isn't handled by the above implementation.
    • You can use regexp to handle a few levels of nesting. Or, you can even implement a recursive regexp with the third-party regex module. See the Recursive matching section of my regexp ebook for details on both of these workarounds.
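Here's a minimal sketch of the backtick-counting workaround mentioned above. The function name text_lines() is made up for illustration; it yields only the lines outside code blocks:

import re

def text_lines(md_file):
    fence = None
    with open(md_file) as f:
        for line in f:
            m = re.match(r'`{3,}', line)
            if fence is None and m:
                # opening marker, remember how many backticks it has
                fence = len(m[0])
            elif fence is not None and m and len(m[0]) >= fence:
                # closing marker needs at least as many backticks
                fence = None
            elif fence is None:
                # normal text line, safe to check for typos
                yield line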

Multiple files

A project could have multiple Markdown files, and they might not all be grouped together in a single directory. Another improvement is maintaining extra word files that cover false mismatches like programming terms, or even valid words that are not present in the reference dictionary file.

Here's one such implementation:

# typos.py
import glob
import re
from string import punctuation

def reference_words(word_files):
    words = set()
    for word_file in word_files:
        with open(word_file) as f:
            # strip the ': count' suffix, if present, so that lines from
            # this program's output can be pasted as-is into word files
            words.update(line.rsplit(':', 1)[0].rstrip().lower() for line in f)
    return words

def spell_check(words, text):
    # yield words not present in the reference set, ignoring case
    # and surrounding punctuation (including em dash)
    for w in text.split():
        w = w.strip(punctuation + '—')
        if w and w.lower() not in words:
            yield w

def process_md(words, md_file):
    links = re.compile(r'\[([^]]+)\]\([^)]+\)')
    inline_code = re.compile(r'`[^`]+`')
    hist = {}
    code_block = False
    with open(md_file) as f:
        for line in f:
            # toggle the flag at code block markers, skip lines in between
            if line.startswith('```'):
                code_block = not code_block
            elif not code_block:
                # replace hyperlinks with just the link text
                line = links.sub(r'\1', line)
                # remove inline code
                line = inline_code.sub('', line)
                for w in spell_check(words, line):
                    hist[w] = hist.get(w, 0) + 1
    return hist

if __name__ == '__main__':
    # reference words can come from multiple files, at any depth
    word_files = glob.glob('word_files/**/*.txt', recursive=True)
    words = reference_words(word_files)

    with open('typos.log', 'w') as opf:
        for md in glob.glob('md_files/**/*.md', recursive=True):
            hist = process_md(words, md)
            if hist:
                # group the typos under their filename in the log file
                opf.write(f'{md}\n')
                for k in sorted(hist, key=lambda k: (k.lower(), -hist[k])):
                    opf.write(f'{k}: {hist[k]}\n')
                opf.write(f'{"-" * 50}\n\n')

  • The glob module helps to get all the filenames matching the given wildcard expression. *.txt matches all files ending with the .txt extension. To match filenames from sub-directories at any depth as well, prefix the expression with **/ and set the recursive parameter to True (see the demos after this list).
  • The reference_words() function accepts a sequence of files from which the words set will be built.
    • You might also notice the added rsplit() processing. This makes it easier to build extra reference files by copy-pasting the false mismatches from the output of this program. Or, if you are not lazy like me, you can copy-paste only the relevant string instead of whole lines and avoid this extra pre-processing step.
  • The Markdown input files are also determined recursively using the glob module.
  • The output is now formatted with a filename prefix to make it easier to find and fix the typos.
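Here are the demos promised above. The example strings are based on this project's files; the actual order of glob results may differ:

>>> import glob
>>> glob.glob('md_files/**/*.md', recursive=True)
['md_files/sample.md', 'md_files/re/lookarounds.md']

>>> 'regexp: 2'.rsplit(':', 1)[0]
'regexp'
>>> 'hello'.rsplit(':', 1)[0]
'hello'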

Here's a sample output with the word_files directory containing only the words.txt file:

$ python3.9 typos.py

$ cat typos.log
md_files/sample.md
re.search: 1
regexp: 2
tesr: 1
--------------------------------------------------

md_files/re/lookarounds.md
groupins: 1
lookahead: 2
lookarounds: 3
Lookarounds: 1
lookbehind: 2
--------------------------------------------------

Some of the terms in the above output are false mismatches. Save such lines in a separate file as shown below:

$ cat word_files/programming_terms.txt
re.search: 1
regexp: 2
lookahead: 2
lookarounds: 3
lookbehind: 2

Running the program again will report only the genuine typos:

$ python3.9 typos.py

$ cat typos.log
md_files/sample.md
tesr: 1
--------------------------------------------------

md_files/re/lookarounds.md
groupins: 1
--------------------------------------------------

Managing word files

You can have any number of extra files to serve as word references. For example, if you are processing a text file of a novel, you might want to create a file for missing dictionary words, another for character names, yet another for fictional words, and so on. That way, you can reuse specific files for future projects, and it also makes it easier to manually review these files for mistakes later.

You can also speed up creating these extra word files by filtering words with a minimum count, say three. You'd still have to review the results manually, but it will reduce the copy-paste effort. With multiple input files, this minimum count makes more sense if you maintain a histogram of mismatches from all the input files and filter at the end, instead of on a per-file basis. A sketch of this idea is shown below:
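This sketch assumes the reference_words() and process_md() functions from typos.py seen earlier. The min_count.py filename and the threshold value are made up for illustration:

# min_count.py (hypothetical filename)
import glob
from typos import process_md, reference_words

MIN_COUNT = 3  # assumed threshold, adjust to taste

word_files = glob.glob('word_files/**/*.txt', recursive=True)
words = reference_words(word_files)

# combine the histograms from all the input files
total = {}
for md in glob.glob('md_files/**/*.md', recursive=True):
    for w, cnt in process_md(words, md).items():
        total[w] = total.get(w, 0) + cnt

# report only the mismatches seen at least MIN_COUNT times
for k in sorted(total, key=lambda k: (k.lower(), -total[k])):
    if total[k] >= MIN_COUNT:
        print(f'{k}: {total[k]}')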