Assigned: May 14th
Due: May 24th at 12 p.m. (noon) (10 days to complete)
Please remember the Academic Honesty policy on problem sets. If you are ever confused about what is okay and what isn't, ask me.
The goal of this project is to let students challenge themselves and scrape sites they're interested in or that would be useful for other projects.
If I've approved your final project in class, then it will already meet these requirements. Here are the requirements I talked about during the second week of class:
Content requirements:
- The data scraped from the website needs to be of journalistic value. This definition is flexible, but it would not allow for projects such as scraping the .mp3 files off a music site.
Technical requirements:
- Scraping the site you've selected needs to be challenging. This can mean:
  - A complex form you need to fill out to get the information
  - Tricky HTML code
  - Multiple pages of information
  - Scraping two websites simultaneously and saving their results together
- Final projects cannot:
  - Be a single page with a table
  - Use data that is already downloadable as a CSV or XLS file
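To make two of the options above concrete ("multiple pages of information" and saving results from two websites together), here is a minimal sketch using only the Python standard library. It is one possible approach, not a required one. The site names, the page contents, and the `fake_fetch` function are all invented stand-ins so the example runs without a network connection; in a real project you would replace `fake_fetch` with a function that actually requests each page. The sketch also visits the two sources one after the other rather than at the same time.

```python
import csv
import io
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def parse_rows(html):
    parser = RowParser()
    parser.feed(html)
    return parser.rows


def scrape_all_pages(fetch, source):
    """Request page 1, 2, 3, ... until a page comes back with no rows."""
    results = []
    page = 1
    while True:
        rows = parse_rows(fetch(source, page))
        if not rows:
            break
        # Tag each row with its source so two sites can share one file.
        results.extend([source] + row for row in rows)
        page += 1
    return results


# Hypothetical stand-in for real HTTP requests: canned HTML keyed by (site, page).
FAKE_PAGES = {
    ("site_a", 1): "<table><tr><td>Ann</td><td>Bronx</td></tr></table>",
    ("site_a", 2): "<table><tr><td>Bob</td><td>Queens</td></tr></table>",
    ("site_b", 1): "<table><tr><td>Cy</td><td>Kings</td></tr></table>",
}


def fake_fetch(source, page):
    return FAKE_PAGES.get((source, page), "<table></table>")


combined = scrape_all_pages(fake_fetch, "site_a") + scrape_all_pages(fake_fetch, "site_b")

output = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
writer = csv.writer(output)
writer.writerow(["source", "name", "county"])
writer.writerows(combined)
csv_text = output.getvalue()
```

Passing the fetch function in as an argument keeps the paging logic separate from the network code, which makes it easier to test your parser on saved HTML before pointing it at a live site.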
Unlike previous weeks, I'd like for you to email me your final project scripts by the deadline. In your email, please include the following:
- What URL(s) are you getting your data from?
  ex: http://www.criminaljustice.ny.gov/SomsSUBDirectory/search_index.jsp?offenderSubmit=true&LastName=&County=31&Zip=&Submit=Search
- What’s the data you’re trying to scrape? Be specific and include exactly what data points you’re looking for.
  ex: I’m trying to scrape the Offender ID, Risk and Address for every sex offender in New York City. This requires scraping the URL I listed, plus all of the URLs linked to for each sex offender to gather the data.
- Attach your final project Python script to the email.
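The example answer above describes a common pattern: scrape a listing page for links, then visit each linked page to collect the data points. Here is a rough sketch of that pattern using only the standard library. Everything about the pages is an assumption made for illustration: the `example.com` URLs, the "Label: value" layout of the detail pages, and every value in `FAKE_SITE` are placeholders, not the structure or contents of any real site. A dictionary lookup stands in for the real page request so the code runs offline.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def detail_urls(listing_url, listing_html):
    parser = LinkParser()
    parser.feed(listing_html)
    # Links are often relative, so resolve them against the listing page's URL.
    return [urljoin(listing_url, href) for href in parser.hrefs]


def extract_fields(detail_html, labels):
    """Pull 'Label: value' pairs out of a detail page (assumed layout)."""
    record = {}
    for label in labels:
        match = re.search(re.escape(label) + r":\s*([^<]+)", detail_html)
        record[label] = match.group(1).strip() if match else None
    return record


def scrape(listing_url, fetch, labels):
    records = []
    for url in detail_urls(listing_url, fetch(listing_url)):
        records.append(extract_fields(fetch(url), labels))
    return records


# Hypothetical pages standing in for a real site; every value is a placeholder.
LISTING = "http://example.com/search?county=31"
FAKE_SITE = {
    LISTING: '<a href="detail?id=1">Record 1</a> <a href="detail?id=2">Record 2</a>',
    "http://example.com/detail?id=1":
        "<p>Offender ID: 0001</p><p>Risk: 1</p><p>Address: 1 Example St</p>",
    "http://example.com/detail?id=2":
        "<p>Offender ID: 0002</p><p>Risk: 2</p><p>Address: 2 Example Ave</p>",
}

records = scrape(LISTING, FAKE_SITE.__getitem__, ["Offender ID", "Risk", "Address"])
```

A field that is missing from a detail page comes back as `None` instead of stopping the script, which is useful when a few records on a real site are incomplete.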
That's it! Make sure it lands in my inbox by noon on Sunday, May 24th.