Data similarity

Now that you have all the author names, the next task is to take care of typos. You'll see how to use the rapidfuzz module to calculate the similarity between two strings. This helps to merge the majority of the typos, for example Courtney Schaefer and Courtney Shafer. However, it would also introduce new errors when similar-looking names actually belong to different authors, for example R.J. Barker and R.J. Parker.

From pypi: rapidfuzz:

RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from FuzzyWuzzy.

From pypi: fuzzywuzzy:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

# virtual environment
$ pip install rapidfuzz

# normal environment
# use py instead of python3.13 for Windows
$ python3.13 -m pip install --user rapidfuzz

Examples

Here are some examples of using the fuzz.ratio() function to calculate the similarity between two strings. An output of 100.0 means an exact match.

>>> from rapidfuzz import fuzz

>>> fuzz.ratio('Courtney Schaefer', 'Courtney Schafer')
96.96969696969697
>>> fuzz.ratio('Courtney Schaefer', 'Courtney Shafer')
93.75
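
To get a feel for where these numbers come from: the scores shown in this section are consistent with a normalized insert/delete similarity, i.e. 200 * LCS / (len(a) + len(b)), where LCS is the length of the longest common subsequence. Here's a minimal pure-Python sketch of that calculation. The function names lcs_length and indel_ratio are made up for this illustration and aren't part of rapidfuzz, whose own implementation is compiled and much faster.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # extend the best subsequence that ended before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # skip a character from either a or b, whichever is better
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity from 0 to 100 based only on insertions and deletions."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 200 * lcs_length(a, b) / total


print(indel_ratio('Courtney Schaefer', 'Courtney Shafer'))  # 93.75
```

Two of the 17 characters in the first name are missing from the second one, so 15 characters are shared out of a combined length of 32, which gives 200 * 15 / 32.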

If you choose 90 as the cut-off, here are some cases that will be missed.

>>> fuzz.ratio('Ursella LeGuin', 'Ursula K. LeGuin')
80.0
>>> fuzz.ratio('robin hobb', 'Robin Hobb')
80.0
>>> fuzz.ratio('R. F. Kuang', 'RF Kuang')
84.21052631578947

Ignoring string case and removing the . characters before comparison helps in some cases.

>>> fuzz.ratio('robin hobb'.lower(), 'Robin Hobb'.lower())
100.0
>>> fuzz.ratio('R. F. Kuang'.replace('.', ''), 'RF Kuang'.replace('.', ''))
94.11764705882352

Here's an example where two different authors differ by only a single character. This would result in a false positive, which could be avoided in some cases by also comparing the book names.

>>> fuzz.ratio('R.J. Barker', 'R.J. Parker')
90.9090909090909
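
Here's a rough sketch of the book name idea, assuming each entry is available as an (author, title) pair, which isn't how the author lists in this project are stored. To keep the sketch runnable without extra installs, difflib.SequenceMatcher from the standard library stands in for fuzz.ratio(), so the scores aren't guaranteed to be identical to those from rapidfuzz. The function names and the titles are placeholders, not real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in scorer on a 0 to 100 scale (difflib, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def same_entry(entry1, entry2, cutoff=90):
    """Treat two (author, title) pairs as the same author only if
    both the author names and the book titles are similar enough."""
    (author1, title1), (author2, title2) = entry1, entry2
    return (similarity(author1.lower(), author2.lower()) >= cutoff
            and similarity(title1.lower(), title2.lower()) >= cutoff)


# placeholder titles, not real book data
e1 = ('R.J. Barker', 'Placeholder Title One')
e2 = ('R.J. Parker', 'Completely Different Book')

print(similarity(e1[0], e2[0]) >= 90)  # True: the names alone look alike
print(same_entry(e1, e2))              # False: the titles don't match
```

The trade-off is that the same author listed with two different books would no longer be merged, so a check like this is better suited for flagging suspicious matches than for deciding them outright.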

Top authors

The program below processes the author lists created earlier.

# top_authors.py
from rapidfuzz import fuzz

ip_files = ('authors_2019.txt', 'authors_2021.txt')
op_files = ('top_authors_2019.csv', 'top_authors_2021.csv')

for ip_file, op_file in zip(ip_files, op_files):
    authors = {}
    with open(ip_file) as ipf, open(op_file, 'w') as opf:
        for line in ipf:
            name = line.rstrip('\n')
            authors[name] = authors.get(name, 0) + 1

        fuzzed = {}
        for k1 in sorted(authors, key=lambda k: -authors[k]):
            s1 = k1.lower().replace('.', '')
            for k2 in fuzzed:
                s2 = k2.lower().replace('.', '')
                if round(fuzz.ratio(s1, s2)) >= 90:
                    fuzzed[k2] += authors[k1]
                    break
            else:
                fuzzed[k1] = authors[k1]

        opf.write('Author,votes\n')
        for name in sorted(fuzzed, key=lambda k: -fuzzed[k]):
            votes = fuzzed[name]
            if votes >= 5:
                opf.write(f'{name},{votes}\n')

First, a naive histogram is created with the author name as the key and the total number of exact matches as the value.
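
The counting step can also be written with collections.Counter from the standard library. This is just an alternative sketch, not what the program above does, and count_names is a name made up for this illustration.

```python
from collections import Counter


def count_names(lines):
    """Count exact-match occurrences of each author name."""
    return Counter(line.rstrip('\n') for line in lines)


# a list of lines stands in for the open file object
sample = ['Robin Hobb\n', 'robin hobb\n', 'Robin Hobb\n']
print(count_names(sample).most_common())
# [('Robin Hobb', 2), ('robin hobb', 1)]
```

Note that differently cased spellings are still counted separately at this stage, which is why the merging step is needed.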

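Then, rapidfuzz is used to merge similar author names. The sorted() function processes the names from most to least frequent, so that the most popular spelling is added to the fuzzed dictionary first and wins over the rarer variants.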
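
To see the "most popular spelling wins" behavior in isolation, here's the merging loop pulled out into a function, as a sketch. The names merge_similar and similarity are made up for this illustration, and difflib.SequenceMatcher is used as a stand-in scorer so that the code runs without rapidfuzz. You could pass fuzz.ratio as the scorer argument instead.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for fuzz.ratio(), scaled to 0 to 100 (difflib, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def merge_similar(counts, scorer=similarity, cutoff=90):
    """Fold each name into the first already-accepted similar name."""
    merged = {}
    # most frequent names first, so they become the accepted spellings
    for name in sorted(counts, key=counts.get, reverse=True):
        s1 = name.lower().replace('.', '')
        for kept in merged:
            s2 = kept.lower().replace('.', '')
            if round(scorer(s1, s2)) >= cutoff:
                merged[kept] += counts[name]
                break
        else:
            # no similar name found so far, accept this spelling as is
            merged[name] = counts[name]
    return merged


counts = {'robin hobb': 2, 'Robin Hobb': 5, 'Naomi Novik': 3}
print(merge_similar(counts))
# {'Robin Hobb': 7, 'Naomi Novik': 3}
```

Since every name is compared against all the accepted names so far, the loop is quadratic in the worst case, which is fine for a few thousand names.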

Finally, the fuzzed dictionary is sorted again by highest votes and written to the output files. The result is written in csv format with a header, keeping only the authors with at least 5 votes.
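
Writing the rows with f-strings works as long as no author name contains a comma. As a hedged alternative, the csv module from the standard library takes care of quoting such fields. In this sketch, write_top is a made-up name and the author names are placeholders.

```python
import csv
import io


def write_top(fuzzed, out, min_votes=5):
    """Write names with at least min_votes votes as CSV, highest first."""
    writer = csv.writer(out)
    writer.writerow(['Author', 'votes'])
    for name in sorted(fuzzed, key=fuzzed.get, reverse=True):
        if fuzzed[name] >= min_votes:
            writer.writerow([name, fuzzed[name]])


# io.StringIO stands in for the output file; the names are placeholders
buf = io.StringIO()
write_top({'Author A': 10, 'Author B': 3, 'Author C, Jr.': 6}, buf)
print(buf.getvalue())
```

When writing to an actual file with the csv module, open it with newline='' as recommended in the csv documentation.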

Here's a table of the top-10 authors:

2021	Votes	2019	Votes
Ursula K. Le Guin	139	N.K. Jemisin	58
Robin Hobb	127	Ursula K. Le Guin	57
N.K. Jemisin	127	Lois McMaster Bujold	52
Martha Wells	113	Robin Hobb	47
Lois McMaster Bujold	112	J.K. Rowling	47
Naomi Novik	110	Naomi Novik	45
Susanna Clarke	81	Becky Chambers	36
Becky Chambers	76	Katherine Addison	33
Katherine Addison	74	Martha Wells	30
Madeline Miller	72	Jacqueline Carey	29

Note that the results you get might differ from what is shown here, since the Reddit comments under analysis might have been edited or deleted in the meantime.

If you wish to compare with the actual results, visit the threads linked below (see the comment section for author name based counts). The top-10 list shown above happens to match the actual results for both polls, but with a slightly different order and vote counts.
