Novice data idea: Dog registrations in New York City #67

Closed
cwickham opened this issue May 20, 2019 · 1 comment · Fixed by #126
Labels
dataset Background information on a proposed dataset discussion discussion before a proposal

Comments

@cwickham

A slightly lighthearted data option for the novice courses.

Source: https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp

All dog owners residing in NYC are required by law to license their dogs. The data is sourced from the DOHMH Dog Licensing System (https://a816-healthpsi.nyc.gov/DogLicense), where owners can apply for and renew dog licenses. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. Each record stands as a unique license period for the dog over the course of the yearlong time frame.
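Because a renewal creates a second record for the same dog, one rough way to see how often this happens is to count records that agree on the descriptive columns. This is only a sketch on invented rows: the column names follow the CSV, but matching on them is a heuristic, since two different dogs in the same ZIP code could share a name, gender, birth month and breed.

```r
library(tidyverse)

# Toy rows, invented for illustration; column names follow the real CSV.
licenses <- tribble(
  ~AnimalName, ~AnimalGender, ~AnimalBirthMonth,        ~BreedName, ~ZipCode, ~LicenseIssuedDate,
  "BELLA",     "F",           "05/01/2013 12:00:00 AM", "Maltese",  11435,    "01/10/2016",
  "BELLA",     "F",           "05/01/2013 12:00:00 AM", "Maltese",  11435,    "12/20/2016",
  "MAX",       "M",           "07/01/2010 12:00:00 AM", "Pug",      10473,    "03/02/2016"
)

# Count records sharing the same descriptive columns; groups with more
# than one record are candidates for "same dog, renewed license".
possible_repeats <- licenses %>%
  count(AnimalName, AnimalGender, AnimalBirthMonth, BreedName, ZipCode,
        name = "n_records") %>%
  filter(n_records > 1)

possible_repeats
```

There is no dog identifier in the data, so this can only suggest repeats, not confirm them.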

Some inspiration for questions: https://www.nytimes.com/interactive/2018/02/08/realestate/dogs-of-new-york.html?module=inline

Some example exercises from my quick exploration follow the Pros and Cons.

Pros

  • Accessible content area - most people have at least a basic familiarity with dog breeds and know NYC is a big city.

  • Fertile for data manipulation questions.

  • Pretty tidy.

  • Room for joining with census tract data.

Cons

  • Is this being updated? Docs say 2016, but it actually now includes 2017 registrations.

  • Similar data probably isn't available for other geographic areas.

  • Requires some work with datetimes, although we could do this ourselves and provide a cleaner version.
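One possible shape for a "cleaner version" is to parse the date columns when the file is read, so novices never see them as character strings. The sketch below uses two invented rows in the same layout as the real file; the format strings are my assumption based on the values shown in the glimpse() output further down, and would need checking against the full CSV.

```r
library(tidyverse)

# Two invented rows laid out like the date columns in the real file.
csv_text <- "AnimalName,AnimalBirthMonth,LicenseIssuedDate,LicenseExpiredDate
BONITA,05/01/2013 12:00:00 AM,10/24/2014,11/15/2017
ROCKY,NULL,10/25/2014,10/25/2019
"

# A string containing a newline is read by read_csv() as literal data.
dogs_clean <- read_csv(
  csv_text,
  na = "NULL",
  col_types = cols(
    AnimalName = col_character(),
    # Assumed format: month/day/year followed by a 12-hour clock time.
    AnimalBirthMonth = col_datetime(format = "%m/%d/%Y %I:%M:%S %p"),
    LicenseIssuedDate = col_date(format = "%m/%d/%Y"),
    LicenseExpiredDate = col_date(format = "%m/%d/%Y")
  )
)

dogs_clean
```

The cleaned table could then be saved with write_csv() and distributed alongside the raw file.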

Example exercises

library(tidyverse)
  1. Read in "NYC_Dog_Licensing_Dataset.csv" and take a look with glimpse(). You'll need to make sure values that are "NULL" in the CSV file are interpreted as missing values.

    dogs <- read_csv("NYC_Dog_Licensing_Dataset.csv", na = "NULL")
    ## Parsed with column specification:
    ## cols(
    ##   RowNumber = col_double(),
    ##   AnimalName = col_character(),
    ##   AnimalGender = col_character(),
    ##   AnimalBirthMonth = col_character(),
    ##   BreedName = col_character(),
    ##   Borough = col_character(),
    ##   ZipCode = col_double(),
    ##   CommunityDistrict = col_double(),
    ##   CensusTract2010 = col_double(),
    ##   NTA = col_character(),
    ##   CityCouncilDistrict = col_double(),
    ##   CongressionalDistrict = col_double(),
    ##   StateSenatorialDistrict = col_double(),
    ##   LicenseIssuedDate = col_character(),
    ##   LicenseExpiredDate = col_character()
    ## )
    
    glimpse(dogs)
    ## Observations: 121,949
    ## Variables: 15
    ## $ RowNumber               <dbl> 533, 548, 622, 633, 655, 872, 874, 875, …
    ## $ AnimalName              <chr> "BONITA", "ROCKY", "BULLY", "COCO", "SKI…
    ## $ AnimalGender            <chr> "F", "M", "M", "M", "F", "M", "M", "M", …
    ## $ AnimalBirthMonth        <chr> "05/01/2013 12:00:00 AM", "05/01/2014 12…
    ## $ BreedName               <chr> "Unknown", "Labrador Retriever Crossbree…
    ## $ Borough                 <chr> "Queens", "Queens", "Queens", "Queens", …
    ## $ ZipCode                 <dbl> 11435, 11691, 11419, 11692, 11691, 11692…
    ## $ CommunityDistrict       <dbl> 412, 414, 410, 414, 414, 414, 414, 414, …
    ## $ CensusTract2010         <dbl> 208, 100801, 98, 964, 100802, 964, 94201…
    ## $ NTA                     <chr> "QN61", "QN15", "QN55", "QN12", "QN15", …
    ## $ CityCouncilDistrict     <dbl> 28, 31, 28, 31, 31, 31, 32, 31, 28, 28, …
    ## $ CongressionalDistrict   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
    ## $ StateSenatorialDistrict <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
    ## $ LicenseIssuedDate       <chr> "10/24/2014", "10/25/2014", "10/28/2014"…
    ## $ LicenseExpiredDate      <chr> "11/15/2017", "10/25/2019", "09/24/2016"…
    
  2. What are the most popular dog names?

    dogs %>% 
      group_by(AnimalName) %>% 
      count(sort = TRUE) 
    ## # A tibble: 16,800 x 2
    ## # Groups:   AnimalName [16,800]
    ##    AnimalName            n
    ##    <chr>             <int>
    ##  1 UNKNOWN            2489
    ##  2 NAME NOT PROVIDED  1764
    ##  3 BELLA              1360
    ##  4 MAX                1287
    ##  5 CHARLIE             984
    ##  6 COCO                943
    ##  7 ROCKY               880
    ##  8 LOLA                876
    ##  9 LUCY                767
    ## 10 BUDDY               747
    ## # … with 16,790 more rows
    
  3. Make sure the values UNKNOWN and NAME NOT PROVIDED in the AnimalName column are interpreted as missing values, then find the most popular male dog names.

    dogs %>% 
      group_by(
        name = AnimalName %>% na_if("UNKNOWN") %>% na_if("NAME NOT PROVIDED")
      ) %>% 
      filter(AnimalGender == "M") %>% 
      count(sort = TRUE)
    ## # A tibble: 10,330 x 2
    ## # Groups:   name [10,330]
    ##    name        n
    ##    <chr>   <int>
    ##  1 <NA>     2506
    ##  2 MAX      1268
    ##  3 CHARLIE   868
    ##  4 ROCKY     867
    ##  5 BUDDY     741
    ##  6 LUCKY     625
    ##  7 TEDDY     559
    ##  8 TOBY      523
    ##  9 JACK      488
    ## 10 MILO      443
    ## # … with 10,320 more rows
    
  4. What are some of the longest dog names that have been registered?

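    dogs %>% 
      mutate(name_length = stringr::str_length(AnimalName)) %>% 
      # With no `wt` given, top_n() ranks by the last column (name_length)
      # and keeps ties, so more than 10 rows can be returned.
      top_n(n = 10)  %>% 
      arrange(desc(name_length)) %>% 
      select(AnimalName, BreedName)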
    ## Selecting by name_length
    
    ## # A tibble: 11 x 2
    ##    AnimalName                                          BreedName           
    ##    <chr>                                               <chr>               
    ##  1 "BLU \tM\t10/01/2015\t1\tMaltese\tBronx\t10473\t20… ""                  
    ##  2 CARLYAPPLEWHITECRAWFORDCOLEMAN                      Havanese            
    ##  3 JEFFERSONBARNARDRAMSEYDONNELLY                      Jack Russell Terrier
    ##  4 PIPLONGFELLOWBUTTERFIELDFROUDE                      Jack Russell Terrier
    ##  5 SAMSONMAXWELLWALTERZANE(SAMMY)                      Yorkshire Terrier   
    ##  6 BUDDYVONYANKEEDESHORTHAIR                           Pointer, German Sho…
    ##  7 EMILIE.BUNNELL@GMAIL.COM                            Jack Russell Terrier
    ##  8 SHAWN-MICHAEL-VINCIENT                              Cocker Spaniel      
    ##  9 FLYNN-(BILLYGSWANNBE)                               Greyhound           
    ## 10 DANGERFIELDS-MR.BOBBY                               Bull Dog, French    
    ## 11 EUNICETHOMPSON-STROUD                               Pug
    

    The first one looks like an entire record ended up in the name field and was then truncated - a copy and paste entry error?

  5. What breeds are most common?

    dogs %>% 
      group_by(BreedName) %>% 
      count(sort = TRUE)
    ## # A tibble: 300 x 2
    ## # Groups:   BreedName [300]
    ##    BreedName                                n
    ##    <chr>                                <int>
    ##  1 Unknown                              16763
    ##  2 Yorkshire Terrier                     7773
    ##  3 Shih Tzu                              7141
    ##  4 Chihuahua                             5771
    ##  5 Maltese                               4292
    ##  6 Labrador Retriever                    4196
    ##  7 American Pit Bull Mix / Pit Bull Mix  3401
    ##  8 American Pit Bull Terrier/Pit Bull    3341
    ##  9 Labrador Retriever Crossbreed         2774
    ## 10 Pomeranian                            2195
    ## # … with 290 more rows
    
  6. How does the number of registrations change over time?

    dogs %>% 
      mutate(
        issued_month = lubridate::mdy(LicenseIssuedDate) %>% 
          lubridate::floor_date(unit = "month")) %>% 
      group_by(issued_month) %>% 
      count() %>% 
      ggplot(aes(issued_month, n)) +
        geom_line()
    ## Warning: Removed 1 rows containing missing values (geom_path).
    

    [Figure: line plot of the number of licenses issued per month (n against issued_month)]

    Peaks in summer? Generally increasing trend - are there more dogs or just more registrations?

  7. When are dogs born?

    Complicated by the fact that animal birth month is entered as a full datetime in M/D/Y format - month and year are meaningful, day and time are not:

    dogs %>% 
      pull(AnimalBirthMonth) %>% 
      head()
    ## [1] "05/01/2013 12:00:00 AM" "05/01/2014 12:00:00 AM"
    ## [3] "07/01/2010 12:00:00 AM" "02/01/2005 12:00:00 AM"
    ## [5] "09/01/2012 12:00:00 AM" "11/01/2013 12:00:00 AM"
    
    dogs <- dogs %>% 
      mutate(
        birth_date = lubridate::mdy_hms(AnimalBirthMonth),
        birth_month = lubridate::month(birth_date, label = TRUE),
        birth_year = lubridate::year(birth_date)
        )

    Then plot the number of dogs by birth month:

    dogs %>% 
      ggplot(aes(birth_month)) +
      geom_bar()

    [Figure: bar chart of the number of dogs by birth_month]
    Is Jan the default value?
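Two of the data-quality questions raised in the exercises above (the record pasted into a name field, and whether January is a default birth month) could be turned into quick checks. The following is a sketch on invented rows, not a result from the real data: the first name imitates the malformed record, and the birth months are made up.

```r
library(tidyverse)

# Toy rows, invented for illustration; the first name mimics the malformed record.
toy_dogs <- tibble(
  AnimalName = c("BLU \tM\t10/01/2015\t1\tMaltese", "BELLA", "MAX", "COCO"),
  AnimalBirthMonth = c(NA, "01/01/2012 12:00:00 AM",
                       "01/01/2015 12:00:00 AM", "06/01/2014 12:00:00 AM")
)

# Names containing a tab are likely whole records pasted into one field.
suspect_names <- toy_dogs %>%
  filter(str_detect(AnimalName, "\t"))

# Share of births recorded in each month. In the real data, a January share
# far above 1/12 would be consistent with January acting as a default.
birth_month_share <- toy_dogs %>%
  filter(!is.na(AnimalBirthMonth)) %>%
  mutate(birth_month = lubridate::month(lubridate::mdy_hms(AnimalBirthMonth))) %>%
  count(birth_month) %>%
  mutate(share = n / sum(n))

suspect_names
birth_month_share
```

A high January share would not prove a default on its own; comparing it across license years or boroughs might help.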

Other ideas:

  • What does it mean that there might be more than one record per unique dog? Why can't we identify individual dogs?

  • Combine with census tract data: plot number of registrations against demographics, or registrations of a particular breed against demographics.

  • Are there breeds that are increasing or decreasing in popularity (hard with only 2 years of registrations)?

  • Compare lengths of names between dogs with "Unknown" breed to those with a stated breed. E.g. do mutts get shorter names?
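The census tract idea above could be sketched as a count followed by a join. Both tables below are invented, and `tract_demographics` with its `median_income` column is a hypothetical stand-in for whatever census table is eventually chosen. The join uses Borough as well as CensusTract2010 because tract numbers are only unique within a county.

```r
library(tidyverse)

# Invented registrations; column names follow the real CSV.
toy_dogs <- tibble(
  AnimalName = c("BELLA", "MAX", "COCO", "ROCKY"),
  Borough = c("Queens", "Queens", "Queens", "Bronx"),
  CensusTract2010 = c(208, 208, 964, 100)
)

# Hypothetical demographics table; `median_income` values are made up.
tract_demographics <- tibble(
  Borough = c("Queens", "Queens", "Bronx"),
  CensusTract2010 = c(208, 964, 100),
  median_income = c(52000, 41000, 38000)
)

# One row per tract with its registration count, then attach demographics.
registrations_by_tract <- toy_dogs %>%
  count(Borough, CensusTract2010, name = "n_registrations") %>%
  left_join(tract_demographics, by = c("Borough", "CensusTract2010"))

registrations_by_tract
```

A left join keeps tracts that have registrations but no matching demographics row, which makes gaps in the census table visible as missing values.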

@cwickham cwickham added discussion discussion before a proposal Novice R labels May 20, 2019
@gvwilson

I like this a lot - and now we can motivate the introduction of join by looking at the relative prevalence of names for children and dogs :-)

@DamienIrving DamienIrving added the dataset Background information on a proposed dataset label May 28, 2019

3 participants