In this project, you'll learn how to use an application programming interface (API) to fetch data. From this raw data, you'll extract the data of interest and then apply heuristic rules to correct possible mistakes (at the cost of introducing new bugs). Finally, you'll see options for displaying the results.
- Get top-level comments from Reddit threads
- Use regular expressions to explore data inconsistencies and extract author names
- Correct typos by comparing similarity between names
- Display results as a word cloud
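
To give a feel for the middle two steps, here is a minimal sketch that uses only the standard library. It is not the approach developed later in the project: the sample comments, the `Title by Author` format, the helper names `extract_author` and `canonicalize`, and the `0.9` threshold are all assumptions made up for illustration, and `difflib` stands in for whichever similarity module you end up using.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample data; real entries would be fetched from Reddit.
comments = [
    "Mistborn by Brandon Sanderson",
    "The Way of Kings by Brandon Sandersan",  # typo in the surname
    "The Name of the Wind by Patrick Rothfuss",
]

def extract_author(comment):
    """Return the text after the last 'by' in the comment, or None."""
    # The greedy '.*' pushes the match to the last standalone 'by'.
    match = re.search(r'.*\bby\s+(.+)$', comment, flags=re.I)
    return match.group(1).strip() if match else None

def canonicalize(names, threshold=0.9):
    """Replace each name with the first earlier name that is similar enough."""
    known_names = []
    result = []
    for name in names:
        for known in known_names:
            ratio = SequenceMatcher(None, name.lower(), known.lower()).ratio()
            if ratio >= threshold:
                result.append(known)
                break
        else:
            # No close match so far, so treat this as a new name.
            known_names.append(name)
            result.append(name)
    return result

authors = [extract_author(c) for c in comments]
fixed = canonicalize([a for a in authors if a is not None])
counts = Counter(fixed)
print(counts)
```

The threshold is where the "new bugs" come in: set it too low and two different authors with similar names get merged, set it too high and genuine typos slip through as separate entries.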
The following modules and concepts will be utilized in this project:
I read a lot of fantasy novels and /r/Fantasy/ is one of my favorite social forums. They conduct a few polls every year for best novels, novellas, standalones, etc. These polls help me pick new books to read.
The poll results are tallied manually, since there can be typos, bad entries, etc. I wanted to see if this process could be automated, and it gave me an excuse to get familiar with APIs and some third-party Python modules.
I learned a lot, especially about the challenges in data analysis. I hope you'll learn a lot too.