This project uses Python and regular expressions to build a web scraper that collects movie titles, release dates, descriptions, metascores, and images from Metacritic. It fetches the Metacritic listing URL for a given year and page, constructs a list of movies, and writes it to a CSV file. It then reads the file back and performs an analysis on the data.
The project is built with Python and regular expressions in a Jupyter Notebook.
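A rough sketch of the scrape-and-save flow described above; every value, field name, and the output file name here is a placeholder rather than the project's actual code (pandas is assumed to be imported as pd, as in the import list below):

# Each scraped movie becomes one record of the extracted fields;
# the full list for a year/page is then written out to a CSV file.
movies = [
    {"title": "Example Movie", "release_date": "January 1, 2020",
     "description": "...", "metascore": 85, "image": "https://example.com/poster.jpg"},
]
pd.DataFrame(movies).to_csv("metacritic_movies.csv", index=False)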
Imports used to run this program (the corresponding import statements are shown after the list):
- re
- urllib3
- certifi
- json
- pymongo
- time
- pandas
- matplotlib (pyplot and FormatStrFormatter)
- Seaborn
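The import statements could look like the following; the aliases and the exact matplotlib sub-imports are conventional choices assumed here, not copied from the project:

import re
import json
import time

import urllib3
import certifi
import pymongo
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns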
To install the dependencies:
- Open a terminal
- From the project directory, run: pip3 install {package to install}
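Since re, json, and time ship with the Python standard library, only the third-party packages need installing, for example:

pip3 install urllib3 certifi pymongo pandas matplotlib seaborn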
This project uses two files, one for the scraper and another for the analysis.
Connect to MongoDB
with open("/fileLocation/credentialsFileName.json") as f:
    data = json.load(f)
mongo_connection_string = data['mongodb']
Retrieve the data in your MongoDB collection
client = pymongo.MongoClient(mongo_connection_string, tlsCAFile=certifi.where())
db1_database = client['databaseName']
metacritic_data = db1_database['collectionName']
Get the Metacritic URL
url = "https://www.metacritic.com/browse/movies/score/metascore/year/filtered?year_selected=(year)&sort=desc&view=detailed&page=(page)"
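A minimal sketch of fetching one listing page, assuming the urllib3, certifi, re, and time imports listed above; the illustrative year/page values, the browser-like User-Agent header, the sample regex, and the sleep interval are assumptions rather than the project's actual code:

year, page = 2020, 0
url = f"https://www.metacritic.com/browse/movies/score/metascore/year/filtered?year_selected={year}&sort=desc&view=detailed&page={page}"

# Verify HTTPS certificates against certifi's CA bundle; a browser-like
# User-Agent helps avoid the site rejecting the default urllib3 agent.
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
response = http.request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
html = response.data.decode("utf-8")

# Illustrative extraction: each field (title, date, metascore, ...) gets its
# own regular expression written against the page markup.
titles = re.findall(r'<h3[^>]*>(.*?)</h3>', html, re.DOTALL)

time.sleep(2)  # pause between pages to avoid hammering the site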
Retrieve credentials from the JSON credentials file stored on the local computer and fetch the MongoDB collection
# Retrieve credentials
with open("/fileLocation/credentialsFileName.json") as f:
    data = json.load(f)
mongo_connection_string = data['mongodb']
# Fetch the database named "DB1"
client = pymongo.MongoClient(mongo_connection_string, tlsCAFile=certifi.where())
db1_database = client['databaseName']
metacritic_data = db1_database['collectionName']
metacritic = pd.DataFrame(metacritic_data.find())
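If release_date was stored as text rather than as native MongoDB dates, it may need converting before the .dt accessor used below will work; this extra step is an assumption, not part of the original code:

metacritic['release_date'] = pd.to_datetime(metacritic['release_date'], errors='coerce')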
Add year and month columns to dataframe
metacritic['year'] = metacritic.release_date.dt.year
metacritic['month'] = metacritic.release_date.dt.month
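As one example of the analysis step, a minimal sketch that assumes the documents include a metascore field (converted to numeric first in case it was scraped as text) and that seaborn, matplotlib.pyplot, and FormatStrFormatter are imported as shown earlier:

# Make sure metascore is numeric before aggregating
metacritic['metascore'] = pd.to_numeric(metacritic['metascore'], errors='coerce')

# Average metascore per release year
yearly = metacritic.groupby('year', as_index=False)['metascore'].mean()

ax = sns.barplot(data=yearly, x='year', y='metascore', color='steelblue')
ax.yaxis.set_major_formatter(FormatStrFormatter('%.1f'))
ax.set_xlabel('Release year')
ax.set_ylabel('Average metascore')
plt.show()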
Distributed under the MIT license. See LICENSE.txt
for more information.