Data Mining Term Project (Fall 2019)

Movie Search and Classifier

Phase 1: Movie Search

Demo Link
GitHub Link

Dataset
The dataset contains metadata for 45,000 movies listed in the Full MovieLens Dataset, all released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, and countries. Source: https://www.kaggle.com/rounakbanik/the-movies-dataset

Pre-processing
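In the pre-processing step, for each document in the dataset, we create the index structures that the formulas below rely on: postings (the count of each token in each movie), lengths (the number of tokens in each movie), and document_frequency (the number of movies containing each token).

Calculations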
First, the query entered by the user is tokenized, and for each query token the list of all documents containing that token is retrieved. A vector is then built for each of these documents using the following formulae.
TF of a token in a document is calculated by the formula:
tf = self.postings[query_term][movie_id] / self.lengths[movie_id]
IDF of a token is calculated by the formula:
idf = log10(len(self.lengths) / self.document_frequency[query_term])
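As a small worked illustration of the two formulas above, the sketch below computes TF and IDF over a hypothetical three-movie index. The dictionaries mirror the postings, lengths, and document_frequency attributes used in the formulas, but their contents, the assumption that document frequency counts the movies containing a token, and the helper names tf and idf are all made up for this example.

```python
from math import log10

# Hypothetical toy index; only the attribute names come from the project.
postings = {"space": {1: 2, 3: 1}, "love": {2: 3}}  # token -> {movie_id: count}
lengths = {1: 10, 2: 6, 3: 4}                       # movie_id -> number of tokens
# Assumption: document frequency = number of movies containing the token.
document_frequency = {term: len(docs) for term, docs in postings.items()}


def tf(term, movie_id):
    """Count of the term in the movie, divided by the movie's length."""
    return postings[term].get(movie_id, 0) / lengths[movie_id]


def idf(term):
    """log10 of (total number of movies / movies containing the term)."""
    return log10(len(lengths) / document_frequency[term])


print(tf("space", 1))                  # 2 occurrences in 10 tokens
print(idf("space"))                    # term appears in 2 of 3 movies
print(tf("space", 1) * idf("space"))   # TF-IDF weight for movie 1
```

A token that appears in every movie gets an IDF of zero under this formula, so it contributes nothing to the ranking.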
Documents are ranked by the cosine similarity between the query vector and each document vector:
dot(query_vector, document_vectors[movie_id]) / (norm(query_vector) * norm(document_vectors[movie_id]))
Finally, the documents are sorted in decreasing order of score, and the top five results are returned.
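The whole ranking procedure can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the toy index, the search function, the whitespace tokenizer, and the choice to weight the query vector by the query's own term frequency times IDF are all hypothetical. The dot and norm helpers are plain-Python stand-ins for the functions used in the formula above.

```python
from math import log10, sqrt

# Hypothetical toy index (token -> {movie_id: count}); not the project's data.
postings = {
    "space": {1: 2, 3: 1},
    "war": {1: 1, 2: 2},
    "love": {2: 3},
}
lengths = {1: 10, 2: 6, 3: 4}  # movie_id -> number of tokens
document_frequency = {term: len(docs) for term, docs in postings.items()}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return sqrt(dot(a, a))


def search(query, top_k=5):
    """Return up to top_k (movie_id, score) pairs, best score first."""
    words = query.lower().split()  # assumed tokenizer: lowercase + whitespace
    # Unique query tokens that exist in the index, in first-seen order.
    tokens = [t for t in dict.fromkeys(words) if t in postings]
    if not tokens:
        return []
    idf = {t: log10(len(lengths) / document_frequency[t]) for t in tokens}
    # Assumed query weighting: term frequency within the query times IDF.
    query_vector = [words.count(t) / len(words) * idf[t] for t in tokens]
    # Candidate movies: every movie that contains at least one query token.
    candidates = set().union(*(postings[t] for t in tokens))
    scores = {}
    for movie_id in candidates:
        doc_vector = [
            postings[t].get(movie_id, 0) / lengths[movie_id] * idf[t]
            for t in tokens
        ]
        denom = norm(query_vector) * norm(doc_vector)
        # Guard against zero vectors (e.g. every token has an IDF of zero).
        scores[movie_id] = dot(query_vector, doc_vector) / denom if denom else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


for movie_id, score in search("space war"):
    print(movie_id, round(score, 4))
```

In this toy index, movie 1 contains both query tokens and therefore ranks first; the other two movies each match only one token.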

Contributions

Challenges Faced
The biggest challenge in the project was deploying the Flask app on PythonAnywhere. Because the pre-processing takes a very long time, the deployment would always time out. This was mitigated by using Python's pickle module.
The pickle module is a serialization library: it can write Python objects to files and read them back. The pre-processing step was run once on the local machine and its results were saved as pickle files, so that on every later run the app skips pre-processing and simply reads the data from those files. This reduced the app's deployment time from several minutes to a few seconds and solved the time-out problem on PythonAnywhere.
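The caching pattern described above might look roughly like the sketch below. The function names load_or_build and build_index, the file name index.pickle, and the tiny index contents are hypothetical; only the use of the pickle module comes from the project.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if the file exists; otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


def build_index():
    # Stand-in for the slow pre-processing step (hypothetical contents).
    return {"postings": {"space": {1: 2}}, "lengths": {1: 10}}


# A temporary directory keeps the example self-contained.
cache_file = os.path.join(tempfile.mkdtemp(), "index.pickle")
first = load_or_build(cache_file, build_index)   # builds the index, writes the file
second = load_or_build(cache_file, build_index)  # reads the file, skips the build
print(first == second)
```

With this arrangement the pickle files can be generated locally and uploaded alongside the app, so the server only ever takes the fast read path. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.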
