
Data Extraction


COVID-19

Data of interest:

  • For each country/state, we are interested in:

    • Population
    • Total cases
    • Total deaths
    • Total new cases

Sample code:

import pandas as pd

pd.set_option('display.max_columns', 7)

# Daily case and death counts per US state, from the CDC.
covid_data = pd.read_csv("https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD")
print(covid_data[["submission_date", "state", "tot_cases", "new_case", "tot_death"]][100:110])
covid_data.to_csv("../docs/cdc_covid.csv")



US Raw data sample:

    submission_date state  tot_cases  new_case  tot_death
100      05/01/2020    CO      15668       486        820
101      05/02/2020    CO      16120       452        832
102      05/03/2020    CO      16534       414        842
103      05/04/2020    CO      16878       344        851
104      05/05/2020    CO      17317       439        903
105      05/06/2020    CO      17738       421        921
106      05/07/2020    CO      18318       580        944
107      05/08/2020    CO      18793       475        960
108      05/09/2020    CO      19316       523        967
109      05/10/2020    CO      19632       316        971
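
The per-state totals of interest can be read straight off the latest rows of this file. A minimal sketch, assuming the cdc_covid.csv written above:

import pandas as pd

covid_data = pd.read_csv("../docs/cdc_covid.csv")

# Parse dates so "latest" means chronologically last, not alphabetically.
covid_data["submission_date"] = pd.to_datetime(covid_data["submission_date"])

# The most recent row per state carries the cumulative totals to date.
latest = covid_data.sort_values("submission_date").groupby("state").tail(1)
print(latest[["state", "tot_cases", "tot_death"]].head())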
 
covid_world_data = pd.read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
print(covid_world_data.columns)
# Most column names in this CSV arrive with a leading space
# (e.g. " Country"), hence the odd-looking selectors below.
print(covid_world_data[["Date_reported", " Country", " New_cases", " Cumulative_cases", " New_deaths", " Cumulative_deaths"]][100:110])
covid_world_data.to_csv("../docs/who_covid_world.csv")
 
 
World data sample:

    Date_reported      Country   New_cases   Cumulative_cases   New_deaths   Cumulative_deaths
100    2020-04-12  Afghanistan          34                555            3                  18
101    2020-04-13  Afghanistan         110                665            3                  21
102    2020-04-14  Afghanistan          49                714            2                  23
103    2020-04-15  Afghanistan          70                784            2                  25
104    2020-04-16  Afghanistan          10                794            4                  29
105    2020-04-17  Afghanistan          51                845            1                  30
106    2020-04-18  Afghanistan          63                908            0                  30
107    2020-04-19  Afghanistan          88                996            3                  33
108    2020-04-20  Afghanistan           0                996            0                  33
109    2020-04-21  Afghanistan          96               1092            3                  36
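
The leading spaces in the WHO column names are easy to trip over downstream. A small sketch that normalizes the headers first, assuming the who_covid_world.csv saved above:

import pandas as pd

covid_world_data = pd.read_csv("../docs/who_covid_world.csv")

# Strip the stray spaces so columns can be addressed as "Country"
# rather than " Country".
covid_world_data.columns = covid_world_data.columns.str.strip()

# After cleaning, one country's time series is a plain filter.
afg = covid_world_data[covid_world_data["Country"] == "Afghanistan"]
print(afg[["Date_reported", "New_cases", "Cumulative_cases"]].tail())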

Stocks

S&P 500 stock symbols were taken from datahub.io. The Yahoo Finance API, accessed through pandas-datareader, can be used to extract the price data; it takes a stock ticker as input.

import datetime
import logging
import os

from pandas_datareader import data

try:
    start_date = '2008-01-01'
    end_date = datetime.datetime.now()

    # Daily price history for one ticker from Yahoo Finance.
    # (This snippet comes from a class method; self.config.data_dir
    # is the project's configured output directory.)
    panel_data = data.DataReader('AAP', 'yahoo', start_date, end_date)
    new_path = os.path.join(self.config.data_dir, "AAP.csv")
    panel_data.to_csv(new_path)
except Exception as ex:
    logging.error("Error extracting data %s", ex)
    raise

Stock price data sample:
             open   high    low  close     volume
date                                             
2016-01-04  25.65  26.34  25.50  26.34  270597548
2016-01-05  26.44  26.46  25.60  25.68  223163968
2016-01-06  25.14  25.59  24.97  25.18  273829552
2016-01-07  24.67  25.03  24.11  24.11  324377712
2016-01-08  24.64  24.78  24.19  24.24  283192064
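
The single-ticker call generalizes to the full S&P list. A hedged sketch: the constituents URL below is an assumption (check datahub.io for the current path), and the Yahoo endpoint in pandas-datareader is known to throttle or break:

import datetime
import logging

import pandas as pd
from pandas_datareader import data

# Assumed datahub.io path to the S&P 500 constituents list.
symbols = pd.read_csv(
    "https://datahub.io/core/s-and-p-500-companies/r/constituents.csv"
)["Symbol"]

start_date = '2008-01-01'
end_date = datetime.datetime.now()

for ticker in symbols[:5]:  # a few tickers as a smoke test
    try:
        prices = data.DataReader(ticker, 'yahoo', start_date, end_date)
        prices.to_csv(f"../datasets/{ticker}.csv")
    except Exception as ex:
        # Yahoo rate-limits aggressively; log the failure and move on.
        logging.error("Error extracting %s: %s", ticker, ex)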

Unemployment Rate (USA)

  • Series of interest:
    • Unemployment by industry
    • Unemployment by occupation
    • Employment by education level
    • Employment by race
    • Total unemployment

Sample:

import json

import prettytable
import requests

# Request monthly values for one BLS series over 2011-2014.
headers = {'Content-type': 'application/json'}
data = json.dumps({"seriesid": ['LNU03032229'],
                   "startyear": "2011", "endyear": "2014"})
p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/',
                  data=data, headers=headers)
json_data = json.loads(p.text)

for series in json_data['Results']['series']:
    x = prettytable.PrettyTable(["series id", "year", "period", "value", "footnotes"])
    seriesId = series['seriesID']
    for item in series['data']:
        year = item['year']
        period = item['period']
        value = item['value']
        footnotes = ""
        for footnote in item['footnotes']:
            if footnote:
                footnotes = footnotes + footnote['text'] + ','
        # Periods M01-M12 are months; M13 is the annual average.
        if 'M01' <= period <= 'M12':
            x.add_row([seriesId, year, period, value, footnotes[0:-1]])
    # Write one text table per series.
    with open('../datasets/' + seriesId + '.txt', 'w') as output:
        output.write(x.get_string())

Output:

+-------------+------+--------+-------+-----------+
|  series id  | year | period | value | footnotes |
+-------------+------+--------+-------+-----------+
| LNU03032229 | 2014 |  M12   |  6341 |           |
| LNU03032229 | 2014 |  M11   |  6413 |           |
| LNU03032229 | 2014 |  M10   |  6580 |           |
| LNU03032229 | 2014 |  M09   |  6766 |           |
| LNU03032229 | 2014 |  M08   |  7200 |           |
| LNU03032229 | 2014 |  M07   |  7366 |           |
| LNU03032229 | 2014 |  M06   |  7164 |           |
| LNU03032229 | 2014 |  M05   |  7243 |           |
| LNU03032229 | 2014 |  M04   |  7105 |           |
| LNU03032229 | 2014 |  M03   |  8110 |           |
| LNU03032229 | 2014 |  M02   |  8479 |           |
| LNU03032229 | 2014 |  M01   |  8437 |           |
| LNU03032229 | 2013 |  M12   |  7662 |           |
| LNU03032229 | 2013 |  M11   |  7882 |           |
| LNU03032229 | 2013 |  M10   |  8011 |           |
| LNU03032229 | 2013 |  M09   |  8216 |           |
| LNU03032229 | 2013 |  M08   |  8391 |           |
| LNU03032229 | 2013 |  M07   |  8683 |           |
| LNU03032229 | 2013 |  M06   |  8800 |           |
| LNU03032229 | 2013 |  M05   |  8655 |           |
| LNU03032229 | 2013 |  M04   |  8608 |           |
| LNU03032229 | 2013 |  M03   |  9148 |           |
| LNU03032229 | 2013 |  M02   |  9826 |           |
| LNU03032229 | 2013 |  M01   | 10435 |           |
| LNU03032229 | 2012 |  M12   |  9104 |           |
| LNU03032229 | 2012 |  M11   |  8600 |           |
| LNU03032229 | 2012 |  M10   |  8892 |           |
| LNU03032229 | 2012 |  M09   |  8935 |           |
| LNU03032229 | 2012 |  M08   |  9552 |           |
| LNU03032229 | 2012 |  M07   |  9692 |           |
| LNU03032229 | 2012 |  M06   |  9626 |           |
| LNU03032229 | 2012 |  M05   |  9419 |           |
| LNU03032229 | 2012 |  M04   |  9197 |           |
| LNU03032229 | 2012 |  M03   | 10106 |           |
| LNU03032229 | 2012 |  M02   | 10517 |           |
| LNU03032229 | 2012 |  M01   | 10736 |           |
| LNU03032229 | 2011 |  M12   |  9956 |           |
| LNU03032229 | 2011 |  M11   |  9740 |           |
| LNU03032229 | 2011 |  M10   | 10126 |           |
| LNU03032229 | 2011 |  M09   | 10375 |           |
| LNU03032229 | 2011 |  M08   | 10524 |           |
| LNU03032229 | 2011 |  M07   | 10515 |           |
| LNU03032229 | 2011 |  M06   | 10733 |           |
| LNU03032229 | 2011 |  M05   | 10628 |           |
| LNU03032229 | 2011 |  M04   | 10560 |           |
| LNU03032229 | 2011 |  M03   | 11288 |           |
| LNU03032229 | 2011 |  M02   | 11641 |           |
| LNU03032229 | 2011 |  M01   | 11778 |           |
+-------------+------+--------+-------+-----------+
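
For analysis, the same response is handier as a DataFrame than as a text table. A minimal sketch reusing the json_data parsed above:

import pandas as pd

# Flatten the BLS response into one row per (series, year, month).
rows = [
    {"series_id": series["seriesID"],
     "year": int(item["year"]),
     "period": item["period"],
     "value": float(item["value"])}
    for series in json_data["Results"]["series"]
    for item in series["data"]
    if "M01" <= item["period"] <= "M12"  # monthly observations only
]
df = pd.DataFrame(rows)
print(df.head())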

Housing

  • Series of interest:
    • Top-tier (typical value for homes within the 65th to 95th percentile range for a given region)
    • Bottom-tier (typical value for homes within the 5th to 35th percentile range for a given region)
    • Typical value for all single-family homes in a given region
    • For-Sale Inventory
    • Days to Pending
    • Median Sale Price

Sample:

import pandas as pd

pd.set_option('display.max_columns', 7)

# Zillow Home Value Index (ZHVI) by city: top tier (0.67-1.0),
# single-family homes and condos, smoothed, seasonally adjusted, monthly.
housing_data = pd.read_csv("http://files.zillowstatic.com/research/public_v2/zhvi/City_zhvi_uc_sfrcondo_tier_0.67_1.0_sm_sa_mon.csv")
print(housing_data.columns)
print(housing_data[["RegionName", "StateName", "CountyName", '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30']][10:20])
housing_data.to_csv("../datasets/housing_usa.csv")


Housing data sample:

       RegionName StateName            CountyName  ...  2020-07-31  2020-08-31
10         Austin        TX         Travis County  ...    668559.0    675088.0
11       San Jose        CA    Santa Clara County  ...   1408057.0   1434708.0
12   Jacksonville        FL          Duval County  ...    298747.0    300912.0
13   Indianapolis        IN         Marion County  ...    254294.0    257201.0
14  San Francisco        CA  San Francisco County  ...   2071877.0   2076277.0
15      Charlotte        NC    Mecklenburg County  ...    456816.0    461385.0
16     Fort Worth        TX        Tarrant County  ...    317164.0    319543.0
17         Tucson        AZ           Pima County  ...    347079.0    350456.0
18       Columbus        OH       Franklin County  ...    271790.0    274608.0
19     Louisville        KY      Jefferson County  ...    307231.0    309979.0
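
The Zillow file is wide, with one column per month. A minimal sketch reshaping it to long form for time-series work, assuming the housing_usa.csv saved above:

import pandas as pd

housing_data = pd.read_csv("../datasets/housing_usa.csv")

# Columns named like "2020-08-31" are the monthly ZHVI observations;
# melt them into long (region, month, value) rows.
id_cols = ["RegionName", "StateName", "CountyName"]
date_cols = [c for c in housing_data.columns if c.count("-") == 2]
long_df = housing_data.melt(id_vars=id_cols, value_vars=date_cols,
                            var_name="month", value_name="zhvi")
long_df["month"] = pd.to_datetime(long_df["month"])
print(long_df.head())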