Data similarity

Now that you have all the author names, the next task is to take care of typos. You'll see how to use the rapidfuzz module to calculate the similarity between two strings. This helps to remove the majority of the typos, for example Courtney Schaefer and Courtney Shafer. However, it also introduces new errors when similar-looking names belong to different authors and aren't typos, for example R.J. Barker and R.J. Parker.

From pypi: rapidfuzz:

RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from FuzzyWuzzy.

From pypi: fuzzywuzzy:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

# virtual environment
$ pip install rapidfuzz

# normal environment
# use py instead of python3.9 for Windows
$ python3.9 -m pip install --user rapidfuzz

Examples

Here are some examples of using fuzz.ratio() to calculate the similarity between two strings. An output of 100.0 means an exact match.

>>> from rapidfuzz import fuzz

>>> fuzz.ratio('Courtney Schaefer', 'Courtney Schafer')
96.96969696969697
>>> fuzz.ratio('Courtney Schaefer', 'Courtney Shafer')
93.75
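To get a feel for where these numbers come from, here's a minimal pure-Python sketch. It assumes, as the RapidFuzz documentation describes, that fuzz.ratio() is a normalized Indel similarity, where only insertions and deletions are counted and a substitution therefore costs 2. The helper names lcs_length and indel_ratio are made up for illustration and aren't part of rapidfuzz, which is implemented far more efficiently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity in the range 0 to 100 based on insertions and deletions only."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # characters outside the common subsequence must be deleted or inserted,
    # so the similarity is the share of characters that are kept
    return 200.0 * lcs_length(a, b) / total


print(round(indel_ratio('Courtney Schaefer', 'Courtney Schafer'), 2))  # 96.97
print(indel_ratio('Courtney Schaefer', 'Courtney Shafer'))  # 93.75
```

In the second example, all 15 characters of the shorter name appear in order within the longer one, which gives 2 * 15 / (17 + 15) = 93.75%.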

If you choose 90 as the cut-off limit, here are some cases that will be missed.

>>> fuzz.ratio('Ursella LeGuin', 'Ursula K. LeGuin')
80.0
>>> fuzz.ratio('robin hobb', 'Robin Hobb')
80.0
>>> fuzz.ratio('R. F. Kuang', 'RF Kuang')
84.21052631578948

Ignoring string case and removing the . characters before comparing the author names helps in some cases.

>>> fuzz.ratio('robin hobb'.lower(), 'Robin Hobb'.lower())
100.0
>>> fuzz.ratio('R. F. Kuang'.replace('.', ''), 'RF Kuang'.replace('.', ''))
94.11764705882354
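Since the same two clean-up steps are applied to both strings, they can be wrapped in a small helper. This is only a sketch: the function name normalize is an assumption made for illustration, and you may want further steps (collapsing extra spaces, for instance) depending on your data.

```python
def normalize(name):
    """Lowercase the name and drop periods so that trivial variations compare equal."""
    return name.lower().replace('.', '')


print(normalize('robin hobb') == normalize('Robin Hobb'))  # True
print(normalize('R. F. Kuang'))  # r f kuang
print(normalize('RF Kuang'))  # rf kuang
```

The last two results still differ by a space, which is why that pair scores below 100 even after clean-up.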

Here's an example where the names of two different authors differ by only a single character. This results in a false positive, which could be reduced by also comparing book names.

>>> fuzz.ratio('R.J. Barker', 'R.J. Parker')
90.9090909090909
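One way the book-name idea could look is sketched below. It assumes you have a collection of book titles for each author entry, which the author lists used in this project don't provide. The function likely_same_author and the placeholder titles are hypothetical.

```python
def likely_same_author(name_score, books1, books2, cutoff=90):
    """Treat two names as the same author only if the names are similar
    AND the two entries share at least one book title."""
    return name_score >= cutoff and bool(set(books1) & set(books2))


# Score taken from the fuzz.ratio() example above; the book titles are placeholders.
score = 90.9090909090909
print(likely_same_author(score, {'Book A'}, {'Book B'}))  # False
print(likely_same_author(score, {'Book A'}, {'Book A', 'Book C'}))  # True
```

Exact title matching is itself sensitive to typos, so a real implementation would probably need fuzzy comparison for the titles as well.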

Top authors

The program below processes the author lists created earlier.

# top_authors.py
from rapidfuzz import fuzz

ip_files = ('authors_2019.txt', 'authors_2021.txt')
op_files = ('top_authors_2019.csv', 'top_authors_2021.csv')

for ip_file, op_file in zip(ip_files, op_files):
    authors = {}
    with open(ip_file) as ipf, open(op_file, 'w') as opf:
        for line in ipf:
            name = line.rstrip('\n')
            authors[name] = authors.get(name, 0) + 1

        fuzzed = {}
        for k1 in sorted(authors, key=lambda k: -authors[k]):
            s1 = k1.lower().replace('.', '')
            for k2 in fuzzed:
                s2 = k2.lower().replace('.', '')
                if round(fuzz.ratio(s1, s2)) >= 90:
                    fuzzed[k2] += authors[k1]
                    break
            else:
                fuzzed[k1] = authors[k1]

        opf.write('Author,votes\n')
        for name in sorted(fuzzed, key=lambda k: -fuzzed[k]):
            votes = fuzzed[name]
            if votes >= 5:
                opf.write(f'{name},{votes}\n')

First, a naive histogram is created with author name as key and total number of exact matches as the value.

Then, rapidfuzz is used to merge similar author names. The names are processed in descending order of votes using sorted(), so the most popular spelling becomes the key under which similar names are merged.
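To try the merging step on a small sample, it can be pulled out into a function, as in the sketch below. So that the sketch runs without third-party packages, it uses difflib.SequenceMatcher from the standard library as a stand-in scorer. That is an assumption for illustration only: its scores aren't guaranteed to match fuzz.ratio(), and you could pass fuzz.ratio as the scorer argument instead. The names merge_similar, stdlib_ratio and normalize, as well as the sample votes, are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def stdlib_ratio(s1, s2):
    # Stand-in scorer (assumption): not the same algorithm as fuzz.ratio()
    return 100 * SequenceMatcher(None, s1, s2).ratio()


def normalize(name):
    return name.lower().replace('.', '')


def merge_similar(counts, scorer=stdlib_ratio, cutoff=90):
    """Merge vote counts of similar names into the most popular spelling."""
    merged = {}
    # most popular spellings are visited first, so they become the keys
    for name in sorted(counts, key=lambda k: -counts[k]):
        s1 = normalize(name)
        for kept in merged:
            if round(scorer(s1, normalize(kept))) >= cutoff:
                merged[kept] += counts[name]
                break
        else:
            # no similar name found so far, start a new entry
            merged[name] = counts[name]
    return merged


# made-up sample votes
votes = Counter(['Robin Hobb'] * 3 + ['robin hobb']
                + ['N.K. Jemisin'] * 2 + ['NK Jemisin'])
print(merge_similar(votes))  # {'Robin Hobb': 4, 'N.K. Jemisin': 3}
```

Counter builds the same kind of histogram as the dict.get() idiom used in the program.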

Finally, the fuzzed dictionary is sorted again by highest votes and written to the output file. The result is in CSV format with a header row, and only authors with at least 5 votes are included.
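Writing the rows with an f-string works as long as no author name contains a comma. If that could happen in your data, the csv module from the standard library takes care of quoting. Here's a minimal sketch, where the function name write_top_authors and the sample data are made up for illustration, and io.StringIO stands in for the output file.

```python
import csv
import io


def write_top_authors(votes_by_author, out, min_votes=5):
    """Write authors with at least min_votes to out, highest votes first."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Author', 'votes'])
    for name in sorted(votes_by_author, key=lambda k: -votes_by_author[k]):
        votes = votes_by_author[name]
        if votes >= min_votes:
            writer.writerow([name, votes])


# hypothetical data, including a name with a comma
sample = {'Robin Hobb': 7, 'Doe, Jane': 5, 'Rare Author': 2}
buffer = io.StringIO()
write_top_authors(sample, buffer)
print(buffer.getvalue())
```

The name containing a comma is written as "Doe, Jane" with surrounding quotes, so it stays a single field. When writing to a real file with the csv module, open it with newline='' as recommended in the Python documentation.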

Here's a table of top-10 authors:

| 2021 | Votes | 2019 | Votes |
|------|-------|------|-------|
| Ursula K. Le Guin | 139 | N.K. Jemisin | 58 |
| Robin Hobb | 127 | Ursula K. Le Guin | 57 |
| N.K. Jemisin | 127 | Lois McMaster Bujold | 52 |
| Martha Wells | 113 | Robin Hobb | 47 |
| Lois McMaster Bujold | 112 | J.K. Rowling | 47 |
| Naomi Novik | 110 | Naomi Novik | 45 |
| Susanna Clarke | 81 | Becky Chambers | 36 |
| Becky Chambers | 76 | Katherine Addison | 33 |
| Katherine Addison | 74 | Martha Wells | 30 |
| Madeline Miller | 72 | Jacqueline Carey | 29 |

If you wish to compare with the actual results, visit the threads linked below (see the comment section for counts based on author names). The top-10 lists shown above happen to match the actual results for both polls, but with a slightly different order and vote counts.