forked from jshen/harvardnow
Scraping and API Calls.txt
1. Title
2. Scraping is a type of data extraction that takes information from the HTML of a webpage. Parsing raw HTML directly can be cumbersome, but the Beautiful Soup library organizes the HTML into a tree that is easy to navigate.
3. This is sample code from Beautiful Soup’s documentation. The first block is the sample HTML, and the second block shows how to invoke the library.
4. Here are some examples of how to extract information from Beautiful Soup’s tree. As these examples show, an HTML element can be retrieved by its tag name, and keywords like string access the text inside the element. The last few examples show other useful ways to navigate the soup tree, such as identifying elements by id.
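The navigation styles described above can be sketched as follows. This is a minimal illustration, not the sample from Beautiful Soup’s documentation: the HTML, the ids link1 and link2, and the variable names are all made up for this example, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this sketch.
html_doc = """
<html><head><title>Laundry Rooms</title></head>
<body>
<p id="intro" class="note">Pick a <b>room</b> below.</p>
<a id="link1" href="/rooms/1">Room One</a>
<a id="link2" href="/rooms/2">Room Two</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string        # text inside the <title> element
first_link = soup.a                   # first <a> element in the document
all_links = soup.find_all("a")        # every <a> element, in document order
second_link = soup.find(id="link2")   # look up a single element by its id

# Attributes of an element are read with dictionary-style access.
link_targets = [a["href"] for a in all_links]
```

Accessing a tag name as an attribute (soup.title, soup.a) returns only the first match, so find_all is the safer choice whenever a page may contain several elements of the same kind.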
5. In the code for Harvard Now, we used scraping to obtain the laundry status. The image on the left shows what the laundry webpage looks like, and the image on the right shows the HTML of this page. To report the status of the washers, we had to extract each washer’s name and its status. From these two images we can see that both are stored in text fields that we can extract.
6. This code shows how we can use Beautiful Soup to extract those pieces of information. After parsing the URL, we open the page with Beautiful Soup. Because the list of washers has a header with the id washer, we can use soup.find to get this element and then move to the first machine with next_sibling. Once we have the correct list element, we can extract the two text elements we want by following their relationship to that <li>.
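The approach described above might look roughly like this. It is a sketch under stated assumptions, not the Harvard Now code: the markup, the class names name and status, and the function scrape_machines are hypothetical, and the real laundry page is structured differently. It also uses find_next_siblings instead of next_sibling, because next_sibling can return the whitespace between tags rather than the next <li>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the laundry page; the real page differs.
LAUNDRY_HTML = """
<ul>
  <li id="washer">Washers</li>
  <li><span class="name">Washer 1</span><span class="status">Available</span></li>
  <li><span class="name">Washer 2</span><span class="status">In use, 12 min left</span></li>
</ul>
"""

def scrape_machines(html):
    """Return a list of (name, status) pairs for the machines after the header."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(id="washer")
    machines = []
    if header is None:
        # The header is missing, so there is nothing to walk from.
        return machines
    # find_next_siblings("li") skips the whitespace text nodes between tags.
    for item in header.find_next_siblings("li"):
        name = item.find("span", class_="name").get_text(strip=True)
        status = item.find("span", class_="status").get_text(strip=True)
        machines.append((name, status))
    return machines

machines = scrape_machines(LAUNDRY_HTML)
```

Returning an empty list when the header is missing keeps the caller from crashing if the page layout changes, which is the main fragility of scraping.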
7. Once we obtain this list of machines, formatting it into a readable string is simple.
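As a rough illustration of that formatting step, assume the scraped list holds (name, status) pairs. That data shape and the function name format_machines are assumptions made for this sketch, not taken from the Harvard Now code.

```python
def format_machines(machines):
    """Turn (name, status) pairs into one line of text per machine."""
    if not machines:
        return "No machines found."
    return "\n".join(f"{name}: {status}" for name, status in machines)

# Hypothetical sample data in the assumed (name, status) shape.
sample = [("Washer 1", "Available"), ("Washer 2", "In use")]
message = format_machines(sample)
```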
8. Using an API can save a great deal of time. While scraping is versatile, it is tedious to code and can break completely if a small change is made to the website you are scraping. An API call is a better solution because the developer intends for you to obtain information this way. For Harvard Now, we used Transloc’s API to get the shuttle times.
9. One difference between scraping and API calls is the setup an API call requires. Most APIs require a key and some information about the format in which you want to receive data. In this example, we specify the data format as JSON and specify an agency, which in this case selects the shuttles for Harvard. The get function makes the request to the API.
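The setup described above can be sketched with the standard library alone. Every concrete value here is a placeholder: the URL, the X-Api-Key header name, the parameter names agencies and format, and the agency ID are assumptions for illustration and are not Transloc’s real endpoint or credentials. The sketch only assembles the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholders only: not a real endpoint, key, or agency ID.
BASE_URL = "https://api.example.com/stops.json"
API_KEY = "YOUR_API_KEY"
AGENCY_ID = "000"

def build_request(base_url, api_key, agency_id):
    """Assemble the URL and headers for the API call without sending it."""
    # The query string names the agency and the response format.
    query = urlencode({"agencies": agency_id, "format": "json"})
    # The key travels in a header; the header name here is hypothetical.
    headers = {"X-Api-Key": api_key, "Accept": "application/json"}
    return Request(f"{base_url}?{query}", headers=headers)

request = build_request(BASE_URL, API_KEY, AGENCY_ID)
```

Passing the resulting object to urllib.request.urlopen would send it. A library such as requests accepts the same URL, parameters, and headers through its get function.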
10. This code shows the request to the API. In this case we are asking for the data field associated with the stops part of the API. The image on the right shows the structure of the JSON. The outermost set of braces is the stops dictionary, and its data element is a list that contains each stop. Once we focus on one stop, we can extract the name, ID, and list of routes associated with that stop. The other code shows the call to obtain routes.
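To make the structure described above concrete, here is a small sketch that parses a response of that shape. The response text is invented, and the field names stop_id, name, and routes are assumptions about the JSON rather than confirmed details of the API.

```python
import json

# Hypothetical response body: a dictionary whose "data" entry is a list of stops.
RESPONSE_TEXT = """
{
  "data": [
    {"stop_id": "101", "name": "Quad", "routes": ["1", "2"]},
    {"stop_id": "102", "name": "Mass Ave Garden St", "routes": ["2"]}
  ]
}
"""

def parse_stops(response_text):
    """Index each stop's name and routes by its ID."""
    stops = json.loads(response_text)["data"]
    return {
        stop["stop_id"]: {"name": stop["name"], "routes": stop["routes"]}
        for stop in stops
    }

stops_by_id = parse_stops(RESPONSE_TEXT)
```

Indexing the stops by ID up front makes the later lookup for a single stop a plain dictionary access.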
11. Once we obtain this information, the remaining code is fairly simple. When a stop is specified by its ID, we can easily look up the information needed to see all the arriving shuttles. These two functions show how we build a readable string of all the shuttles arriving at a stop.
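A pair of functions along those lines could be sketched as below. These are not the two functions from the slides: the names arrivals_for_stop and format_arrivals, and the record fields stop_id, route, and minutes, are hypothetical stand-ins for whatever the API actually returns.

```python
def arrivals_for_stop(stop_id, arrivals):
    """Keep only the arrival records that belong to the given stop."""
    return [a for a in arrivals if a["stop_id"] == stop_id]

def format_arrivals(stop_name, arrivals):
    """Build a readable summary, soonest shuttle first."""
    if not arrivals:
        return f"No shuttles arriving at {stop_name}."
    lines = [f"Shuttles arriving at {stop_name}:"]
    for arrival in sorted(arrivals, key=lambda a: a["minutes"]):
        lines.append(f"  {arrival['route']} in {arrival['minutes']} min")
    return "\n".join(lines)

# Hypothetical arrival records in the assumed shape.
sample_arrivals = [
    {"stop_id": "101", "route": "Quad Express", "minutes": 9},
    {"stop_id": "102", "route": "River House", "minutes": 3},
    {"stop_id": "101", "route": "Crimson Cruiser", "minutes": 4},
]
summary = format_arrivals("Quad", arrivals_for_stop("101", sample_arrivals))
```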