After learning the basics of spaCy, we decided that a Simpsons quote search engine would be a perfectly cromulent project. Luckily, Kaggle has a Simpsons dataset with dialogue from the first 27 seasons. We were not particularly concerned with the lack of dialogue from seasons 28-31.
The next task was to compute a vector for each line of dialogue and store it in a new column of the dialogue dataframe. First, we created a list of the lemmas (the base forms under which words are listed in a dictionary) of all the words in the line that weren’t stop words (common words such as “the”, “a”, “an”, and “in” that aren’t particularly useful for our purposes). We then joined the lemmas into a single string, with spaces between the words, and used spaCy to calculate a vector for that string.
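As a rough sketch of that preprocessing step (not our exact code), the two helpers below strip stop words, join the lemmas, and ask spaCy for a vector. The function names, the `en_core_web_lg` model name, and the `spoken_words` column in the usage comment are assumptions made for illustration.

```python
def lemmatize_line(nlp, text):
    """Return the lemmas of all non-stop-word tokens, joined by spaces."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc if not token.is_stop)


def line_to_vector(nlp, text):
    """Vectorize a line of dialogue from its stop-word-free lemma string."""
    return nlp(lemmatize_line(nlp, text)).vector


def load_pipeline(model_name="en_core_web_lg"):
    """Load a spaCy pipeline that ships with word vectors.

    The model name is an assumption; spaCy is imported here so the helpers
    above can be exercised with any callable that returns spaCy-like docs.
    """
    import spacy

    return spacy.load(model_name)


# Hypothetical usage with a pandas dataframe `df` of dialogue lines:
# nlp = load_pipeline()
# df["vector"] = df["spoken_words"].apply(lambda text: line_to_vector(nlp, text))
```

Passing `nlp` in as an argument means the model is loaded once and reused for every line, which matters when the dataframe holds many thousands of rows.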
When a user enters a search string, we follow the same process to compute a vector for it. Then scikit-learn’s KNeighbors finds the five vectors closest to the input vector and returns the dialogue lines associated with them. We got the search functionality working relatively quickly; the most difficult part of this project turned out to be deployment.
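To make the lookup concrete, here is a dependency-free, brute-force sketch of the same idea. It is a stand-in for scikit-learn's neighbor search rather than the code we ran, it assumes Euclidean distance, and the name `closest_lines` is hypothetical.

```python
import heapq
import math


def closest_lines(query_vector, vectors, lines, k=5):
    """Return the k dialogue lines whose vectors are nearest to query_vector.

    Brute force: measure the Euclidean distance to every stored vector and
    keep the k smallest. `vectors` and `lines` must be in the same order.
    """
    scored = (
        (math.dist(query_vector, vector), line)
        for vector, line in zip(vectors, lines)
    )
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return [line for _, line in nearest]
```

A library implementation can build an index once and answer each query faster than this linear scan, which is one reason to reach for scikit-learn on a dataset of this size.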
The original incarnation of this project was a cross-functional build week at Lambda School, so we needed a way to serve up search results that our full-stack web teammates could easily access. Deploying it as an API via Flask seemed like the simplest solution, but we were in for an annoying surprise when we tried to deploy the API to the free tier of Heroku: we needed spaCy with the large English language model, and that’s over a gigabyte. D’oh!
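A minimal sketch of what such a Flask endpoint might look like is shown below. The `/search` route, the `q` query parameter, the JSON shape, and the `search_fn` callable are all assumptions for illustration, not the API our teammates actually consumed.

```python
def search_payload(query, search_fn, k=5):
    """Build the response body and HTTP status for one search request."""
    query = (query or "").strip()
    if not query:
        return {"error": "missing 'q' query parameter"}, 400
    return {"query": query, "results": search_fn(query, k)}, 200


def create_app(search_fn):
    """Create a Flask app that exposes search_fn at GET /search?q=...

    search_fn is a hypothetical callable taking (query, k) and returning a
    list of dialogue lines. Flask is imported here so that search_payload
    can be used and tested without Flask installed.
    """
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/search")
    def search():
        body, status = search_payload(request.args.get("q"), search_fn)
        return jsonify(body), status

    return app


# Hypothetical usage:
# app = create_app(my_search_function)
# app.run()
```

Keeping the request handling in a plain function separates the search logic from the web framework, so the front end only ever needs to know one URL and one JSON shape.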
At that point we needed to investigate paid options for deployment. Luckily, we had plenty of free AWS credits, so we spun up an EC2 instance, installed the necessary packages, and deployed our API. When it came time to write this article, the original front end was no longer active, but we were able to deploy a stripped-down single-page version (without login or the ability to save quotes) with the serve package in Node.js.
Test out the search engine here: Simpsons Dialogue Search
And check out the GitHub here: Simpson Says