Use data_collection.sh to download and extract the datasets into the following directory structure:
```
data
|-- imdb/*.tsv
|-- ml-25/*.csv
\-- misc/*.csv
```
- IMDb non-commercial datasets
- MovieLens
- plots.csv, generated using cinemagoer by another team member (similar to get_plot.py; see the sketch after this list)
- user_events.csv, generated by another team member (not uploaded)
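For reference, here is a minimal cinemagoer sketch of the kind of plot lookup get_plot.py does; the title IDs and output path below are illustrative, not taken from the actual script:

```python
import csv
from imdb import Cinemagoer  # pip install cinemagoer

ia = Cinemagoer()

# Illustrative title IDs; the real script iterates over the IMDb dump.
tconsts = ["tt0111161", "tt0068646"]

with open("data/misc/plots_sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["tconst", "plot"])
    for tconst in tconsts:
        movie = ia.get_movie(tconst.lstrip("t"))  # cinemagoer expects the numeric part of the ID
        plots = movie.get("plot") or []
        writer.writerow([tconst, plots[0] if plots else ""])
```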
Use data_preparation.py to process the TSV/CSV files into CSV/JSON for MongoDB import; see collections.txt for the schema. This step takes around 30 minutes.
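As a simplified sketch of the kind of conversion this step performs (the field selection below is assumed from the public title.basics.tsv layout, not the actual mapping in data_preparation.py / collections.txt):

```python
import csv
import json
import os

os.makedirs("collections", exist_ok=True)

# Convert one IMDb TSV dump into newline-delimited JSON that mongoimport accepts.
with open("data/imdb/title.basics.tsv", newline="", encoding="utf-8") as src, \
     open("collections/shows.json", "w", encoding="utf-8") as dst:
    reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        doc = {
            "_id": row["tconst"],
            "title": row["primaryTitle"],
            "type": row["titleType"],
            "startYear": None if row["startYear"] == "\\N" else int(row["startYear"]),
            "genres": [] if row["genres"] == "\\N" else row["genres"].split(","),
        }
        dst.write(json.dumps(doc) + "\n")
```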
The previous step creates the files collections/$collection.{csv,json}. Then run:
```sh
sh import_collection_from_file.sh collections/people.json --drop
sh import_collection_from_file.sh collections/user_events.csv --drop
sh import_collection_from_file.sh collections/user_ratings.csv --drop
sh import_collection_from_file.sh collections/shows.csv --drop
sh import_collection_from_file.sh collections/shows.json --mode=merge
sh import_collection_from_file.sh collections/shows.actors.json --mode=merge
sh import_collection_from_file.sh collections/shows.crew.json --mode=merge
```
The import takes around 30 minutes.
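To sanity-check the import afterwards, something like the following works; the collection and database names here are inferred from the file names above, so adjust them to whatever import_collection_from_file.sh actually uses:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["imdb"]  # database name is an assumption

# Print a rough document count for each imported collection.
for name in ["people", "shows", "user_events", "user_ratings"]:
    print(name, db[name].estimated_document_count())
```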
See natural_language_queries.txt for the list of queries. Run mongosh < queries.js > queries_out.txt to execute them.
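The actual queries live in queries.js; purely to illustrate the pattern, here is one such query expressed through PyMongo instead of mongosh. The collection and field names are assumptions, not the project schema:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["imdb"]  # database name assumed

# e.g. "top 5 highest-rated shows with at least 1000 ratings"
pipeline = [
    {"$group": {"_id": "$show_id",                  # field names are assumptions
                "avg_rating": {"$avg": "$rating"},
                "num_ratings": {"$sum": 1}}},
    {"$match": {"num_ratings": {"$gte": 1000}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5},
]
for doc in db.user_ratings.aggregate(pipeline):
    print(doc)
```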
The web app is written using FastAPI and some JavaScript. See img/ for screenshots.
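A minimal sketch of the kind of endpoint such an app exposes; the route, database name, and projected fields are illustrative, and the real routes are in the FastAPI code:

```python
from fastapi import FastAPI
from pymongo import MongoClient

app = FastAPI()
db = MongoClient("mongodb://localhost:27017")["imdb"]  # database name assumed

@app.get("/shows/{tconst}")
def get_show(tconst: str):
    # Look up a single show by IMDb ID; the projected fields are illustrative.
    return db.shows.find_one({"_id": tconst}, {"_id": 1, "title": 1, "startYear": 1})
```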