Data similarity
Now that you have all the author names, the next task is to take care of typos. You'll see how to use the rapidfuzz module for calculating the similarity between two strings. This helps to merge the majority of the typos into the intended name, for example Courtney Schaefer and Courtney Shafer. However, it also introduces new errors when similar-looking names actually belong to different authors, for example R.J. Barker and R.J. Parker.
From PyPI: rapidfuzz:
RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from FuzzyWuzzy.
From PyPI: fuzzywuzzy:
It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
# virtual environment
$ pip install rapidfuzz
# normal environment
# use py instead of python3.9 for Windows
$ python3.9 -m pip install --user rapidfuzz
Examples
Here are some examples of using fuzz.ratio() to calculate the similarity between two strings. An output of 100.0 means an exact match.
>>> from rapidfuzz import fuzz
>>> fuzz.ratio('Courtney Schaefer', 'Courtney Schafer')
96.96969696969697
>>> fuzz.ratio('Courtney Schaefer', 'Courtney Shafer')
93.75
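To see where these numbers come from, here's a minimal pure-Python sketch. It assumes that fuzz.ratio() is a normalized Indel similarity (only insertions and deletions are counted), which works out to 200 * LCS / (len(s1) + len(s2)), where LCS is the length of the longest common subsequence. The function names lcs_length and indel_ratio are made up for this illustration and aren't part of rapidfuzz.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # extend the best subsequence that ends before both characters
                curr.append(prev[j - 1] + 1)
            else:
                # skip a character from either a or b
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity score from 0 to 100 (assumed to mirror fuzz.ratio)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100 * 2 * lcs_length(a, b) / total
```

For 'Courtney Schaefer' and 'Courtney Shafer', the longest common subsequence has 15 characters and the combined length is 32, which gives 200 * 15 / 32 = 93.75, the same value as the output shown above.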
If you choose 90 as the cut-off limit, here are some cases that will be missed.
>>> fuzz.ratio('Ursella LeGuin', 'Ursula K. LeGuin')
80.0
>>> fuzz.ratio('robin hobb', 'Robin Hobb')
80.0
>>> fuzz.ratio('R. F. Kuang', 'RF Kuang')
84.21052631578948
Ignoring string case and removing the . characters before comparing the author names helps in some cases.
>>> fuzz.ratio('robin hobb'.lower(), 'Robin Hobb'.lower())
100.0
>>> fuzz.ratio('R. F. Kuang'.replace('.', ''), 'RF Kuang'.replace('.', ''))
94.11764705882354
Here's an example where two different authors differ by only a single character. With a cut-off limit of 90, these two names would be merged, which is a false positive. Such cases can be reduced if book names are also compared.
>>> fuzz.ratio('R.J. Barker', 'R.J. Parker')
90.9090909090909
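As a rough sketch of the book name idea, the function below treats two spellings as the same author only if the names are similar and they also share at least one book title. The function name, the scorer parameter (any callable returning a score from 0 to 100, such as fuzz.ratio), the exact title matching rule and the placeholder titles are all assumptions made for this illustration.

```python
def likely_same_author(name1, titles1, name2, titles2, scorer, cutoff=90):
    """Return True if the names are similar and share a book title."""
    s1 = name1.lower().replace('.', '')
    s2 = name2.lower().replace('.', '')
    if scorer(s1, s2) < cutoff:
        return False
    # titles are compared case-insensitively; requiring an exact
    # title match is an assumption made to keep the sketch short
    shared = {t.lower() for t in titles1} & {t.lower() for t in titles2}
    return bool(shared)


# placeholder titles for illustration, not real book names
barker = {'Title A', 'Title B'}
parker = {'Title C'}
typo = {'title a'}
```

With rapidfuzz installed, likely_same_author('R.J. Barker', barker, 'R.J. Parker', parker, fuzz.ratio) would return False, since the two title sets have nothing in common even though the name score crosses the cut-off.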
Top authors
The program below processes the author lists created earlier.
# top_authors.py
from rapidfuzz import fuzz

ip_files = ('authors_2019.txt', 'authors_2021.txt')
op_files = ('top_authors_2019.csv', 'top_authors_2021.csv')

for ip_file, op_file in zip(ip_files, op_files):
    authors = {}
    with open(ip_file) as ipf, open(op_file, 'w') as opf:
        # histogram of exact name matches
        for line in ipf:
            name = line.rstrip('\n')
            authors[name] = authors.get(name, 0) + 1

        # merge similar names, most popular spelling first
        fuzzed = {}
        for k1 in sorted(authors, key=lambda k: -authors[k]):
            s1 = k1.lower().replace('.', '')
            for k2 in fuzzed:
                s2 = k2.lower().replace('.', '')
                if round(fuzz.ratio(s1, s2)) >= 90:
                    fuzzed[k2] += authors[k1]
                    break
            else:
                # no similar name found so far, add a new entry
                fuzzed[k1] = authors[k1]

        # write names with at least 5 votes, highest votes first
        opf.write('Author,votes\n')
        for name in sorted(fuzzed, key=lambda k: -fuzzed[k]):
            votes = fuzzed[name]
            if votes >= 5:
                opf.write(f'{name},{votes}\n')
First, a naive histogram is created with the author name as the key and the total number of exact matches as the value.
Then, rapidfuzz is used to merge similar author names. The names are processed in descending order of votes using the sorted() function, so that the most popular spelling wins. The else clause of the inner for loop runs only when no similar name was found, in which case the name is added as a new entry.
Finally, the fuzzed dictionary is sorted again by highest votes and written to the output files. The result is written in CSV format with a header and a cut-off limit of minimum 5 votes.
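The merging step described above can also be tried on a small in-memory dictionary. Here's a minimal sketch, where normalize, merge_similar and the scorer parameter are names made up for this illustration. The scorer is assumed to be any callable that returns a score from 0 to 100 for two strings, so the function doesn't depend on a particular library.

```python
def normalize(name):
    """Lowercase the name and drop '.' characters before comparing."""
    return name.lower().replace('.', '')


def merge_similar(counts, scorer, cutoff=90):
    """Merge vote counts of similar names; the most voted spelling is kept.

    counts: dict of author name -> votes
    scorer: callable returning a similarity score from 0 to 100
    """
    merged = {}
    # visit popular spellings first so that they become the dictionary keys
    for name in sorted(counts, key=lambda k: -counts[k]):
        for kept in merged:
            if round(scorer(normalize(name), normalize(kept))) >= cutoff:
                merged[kept] += counts[name]
                break
        else:
            # no break: nothing similar seen so far, so start a new entry
            merged[name] = counts[name]
    return merged
```

With rapidfuzz installed, passing fuzz.ratio as the scorer, as in merge_similar(authors, fuzz.ratio), should give the same merging behavior as the loop in top_authors.py.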
Here's a table of top-10 authors:
2021 | Votes | 2019 | Votes |
---|---|---|---|
Ursula K. Le Guin | 139 | N.K. Jemisin | 58 |
Robin Hobb | 127 | Ursula K. Le Guin | 57 |
N.K. Jemisin | 127 | Lois McMaster Bujold | 52 |
Martha Wells | 113 | Robin Hobb | 47 |
Lois McMaster Bujold | 112 | J.K. Rowling | 47 |
Naomi Novik | 110 | Naomi Novik | 45 |
Susanna Clarke | 81 | Becky Chambers | 36 |
Becky Chambers | 76 | Katherine Addison | 33 |
Katherine Addison | 74 | Martha Wells | 30 |
Madeline Miller | 72 | Jacqueline Carey | 29 |
If you wish to compare with the actual results, visit the threads linked below (see the comment section for author name based counts). The top-10 list shown above happens to match the actual results for both polls, though with a slightly different order and vote counts.