Novice data idea: Dog registrations in New York City #67

cwickham · 2019-05-20T22:46:20Z

A slightly lighthearted data option for the novice courses.

Source: https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp

All dog owners residing in NYC are required by law to license their dogs. The data is sourced from the DOHMH Dog Licensing System (https://a816-healthpsi.nyc.gov/DogLicense), where owners can apply for and renew dog licenses. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. Each record stands as a unique license period for the dog over the course of the yearlong time frame.

Some inspiration for questions: https://www.nytimes.com/interactive/2018/02/08/realestate/dogs-of-new-york.html?module=inline

Some example exercises from my quick exploration follow the Pros and Cons.

Pros

Accessible content area - most people have at least a basic familiarity with dog breeds and know NYC is a big city.
Fertile for data manipulation questions.
Pretty tidy.
Room for joining with census tract data.
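
A rough sketch of what such a join might look like, using a few made-up rows. The `tract_demographics` table, its `population` column, and all its values are invented for illustration. Tract numbers such as 208 can, as far as I know, repeat across boroughs, so the sketch joins on both `Borough` and `CensusTract2010`.

```r
library(dplyr)

# Made-up rows standing in for the real dog data
toy_dogs <- tibble(
  Borough         = c("Queens", "Queens", "Bronx", "Queens"),
  CensusTract2010 = c(208, 100801, 208, 208)
)

# Hypothetical demographics table: column names and values are invented
tract_demographics <- tibble(
  Borough         = c("Queens", "Queens", "Bronx"),
  CensusTract2010 = c(208, 100801, 208),
  population      = c(3000, 4500, 5200)
)

# Count registrations per tract, then attach the tract-level population
registrations_by_tract <- toy_dogs %>%
  count(Borough, CensusTract2010, name = "n_registrations") %>%
  left_join(tract_demographics, by = c("Borough", "CensusTract2010")) %>%
  mutate(registrations_per_1000 = 1000 * n_registrations / population)

registrations_by_tract
```

Real census tables would probably identify tracts differently (for example by county code rather than borough name), so the join keys here are only a placeholder.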

Cons

Is this being updated? Docs say 2016, but it actually now includes 2017 registrations.
Probably not similar data for other geographic areas.
Requires some work with datetimes, although we could do that cleaning ourselves and make a cleaner version available.

Example exercises

library(tidyverse)

Read in "NYC_Dog_Licensing_Dataset.csv" and take a look with glimpse(). You'll need to make sure values that are "NULL" in the CSV file are interpreted as missing values.

dogs <- read_csv("NYC_Dog_Licensing_Dataset.csv", na = "NULL")

## Parsed with column specification:
## cols(
##   RowNumber = col_double(),
##   AnimalName = col_character(),
##   AnimalGender = col_character(),
##   AnimalBirthMonth = col_character(),
##   BreedName = col_character(),
##   Borough = col_character(),
##   ZipCode = col_double(),
##   CommunityDistrict = col_double(),
##   CensusTract2010 = col_double(),
##   NTA = col_character(),
##   CityCouncilDistrict = col_double(),
##   CongressionalDistrict = col_double(),
##   StateSenatorialDistrict = col_double(),
##   LicenseIssuedDate = col_character(),
##   LicenseExpiredDate = col_character()
## )

glimpse(dogs)

## Observations: 121,949
## Variables: 15
## $ RowNumber               <dbl> 533, 548, 622, 633, 655, 872, 874, 875, …
## $ AnimalName              <chr> "BONITA", "ROCKY", "BULLY", "COCO", "SKI…
## $ AnimalGender            <chr> "F", "M", "M", "M", "F", "M", "M", "M", …
## $ AnimalBirthMonth        <chr> "05/01/2013 12:00:00 AM", "05/01/2014 12…
## $ BreedName               <chr> "Unknown", "Labrador Retriever Crossbree…
## $ Borough                 <chr> "Queens", "Queens", "Queens", "Queens", …
## $ ZipCode                 <dbl> 11435, 11691, 11419, 11692, 11691, 11692…
## $ CommunityDistrict       <dbl> 412, 414, 410, 414, 414, 414, 414, 414, …
## $ CensusTract2010         <dbl> 208, 100801, 98, 964, 100802, 964, 94201…
## $ NTA                     <chr> "QN61", "QN15", "QN55", "QN12", "QN15", …
## $ CityCouncilDistrict     <dbl> 28, 31, 28, 31, 31, 31, 32, 31, 28, 28, …
## $ CongressionalDistrict   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ StateSenatorialDistrict <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ LicenseIssuedDate       <chr> "10/24/2014", "10/25/2014", "10/28/2014"…
## $ LicenseExpiredDate      <chr> "11/15/2017", "10/25/2019", "09/24/2016"…

What are the most popular dog names?

dogs %>% 
  group_by(AnimalName) %>% 
  count(sort = TRUE)

## # A tibble: 16,800 x 2
## # Groups:   AnimalName [16,800]
##    AnimalName            n
##    <chr>             <int>
##  1 UNKNOWN            2489
##  2 NAME NOT PROVIDED  1764
##  3 BELLA              1360
##  4 MAX                1287
##  5 CHARLIE             984
##  6 COCO                943
##  7 ROCKY               880
##  8 LOLA                876
##  9 LUCY                767
## 10 BUDDY               747
## # … with 16,790 more rows

Make sure the values UNKNOWN and NAME NOT PROVIDED in the AnimalName column are interpreted as missing values, then find the most popular male dog names.

dogs %>% 
  group_by(
    name = AnimalName %>% na_if("UNKNOWN") %>% na_if("NAME NOT PROVIDED")
  ) %>% 
  filter(AnimalGender == "M") %>% 
  count(sort = TRUE)

## # A tibble: 10,330 x 2
## # Groups:   name [10,330]
##    name        n
##    <chr>   <int>
##  1 <NA>     2506
##  2 MAX      1268
##  3 CHARLIE   868
##  4 ROCKY     867
##  5 BUDDY     741
##  6 LUCKY     625
##  7 TEDDY     559
##  8 TOBY      523
##  9 JACK      488
## 10 MILO      443
## # … with 10,320 more rows
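
The missing names still top the list above. A possible variant that drops them before counting, shown on a few made-up rows rather than the real data:

```r
library(dplyr)

# Made-up rows standing in for the real `dogs` data
toy_dogs <- tibble(
  AnimalName   = c("MAX", "UNKNOWN", "MAX", "NAME NOT PROVIDED", "BELLA", "ROCKY"),
  AnimalGender = c("M", "M", "M", "M", "F", "M")
)

male_names <- toy_dogs %>%
  mutate(
    # Treat both placeholder strings as missing
    name = AnimalName %>% na_if("UNKNOWN") %>% na_if("NAME NOT PROVIDED")
  ) %>%
  # Keep male dogs that have a usable name
  filter(AnimalGender == "M", !is.na(name)) %>%
  count(name, sort = TRUE)

male_names
```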

What are some of the longest dog names that have been registered?

dogs %>% 
  mutate(name_length = stringr::str_length(AnimalName)) %>% 
  top_n(n = 10)  %>% 
  arrange(desc(name_length)) %>% 
  select(AnimalName, BreedName)

## Selecting by name_length

## # A tibble: 11 x 2
##    AnimalName                                          BreedName           
##    <chr>                                               <chr>               
##  1 "BLU \tM\t10/01/2015\t1\tMaltese\tBronx\t10473\t20… ""                  
##  2 CARLYAPPLEWHITECRAWFORDCOLEMAN                      Havanese            
##  3 JEFFERSONBARNARDRAMSEYDONNELLY                      Jack Russell Terrier
##  4 PIPLONGFELLOWBUTTERFIELDFROUDE                      Jack Russell Terrier
##  5 SAMSONMAXWELLWALTERZANE(SAMMY)                      Yorkshire Terrier   
##  6 BUDDYVONYANKEEDESHORTHAIR                           Pointer, German Sho…
##  7 EMILIE.BUNNELL@GMAIL.COM                            Jack Russell Terrier
##  8 SHAWN-MICHAEL-VINCIENT                              Cocker Spaniel      
##  9 FLYNN-(BILLYGSWANNBE)                               Greyhound           
## 10 DANGERFIELDS-MR.BOBBY                               Bull Dog, French    
## 11 EUNICETHOMPSON-STROUD                               Pug

The first one looks like an entire tab-separated record has ended up in the name field (the printed value is cut off) - copy and paste entry error?
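
One way to look for records like this is to search the name field for tab characters. A minimal sketch on made-up rows; the suspect value below is a shortened copy of the one printed above, and the other rows are invented:

```r
library(dplyr)
library(stringr)

# Made-up rows; the second name mimics the start of the suspect record
toy_dogs <- tibble(
  AnimalName = c("BELLA", "BLU \tM\t10/01/2015\t1\tMaltese\tBronx\t10473", "MAX"),
  BreedName  = c("Pug", "", "Havanese")
)

# A tab inside a name suggests several fields were pasted into one
suspect_rows <- toy_dogs %>%
  filter(str_detect(AnimalName, fixed("\t")))

suspect_rows
```

On the real data, the same `filter()` could be applied to `dogs` to see whether more than one record is affected.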

What breeds are most common?

dogs %>% 
  group_by(BreedName) %>% 
  count(sort = TRUE)

## # A tibble: 300 x 2
## # Groups:   BreedName [300]
##    BreedName                                n
##    <chr>                                <int>
##  1 Unknown                              16763
##  2 Yorkshire Terrier                     7773
##  3 Shih Tzu                              7141
##  4 Chihuahua                             5771
##  5 Maltese                               4292
##  6 Labrador Retriever                    4196
##  7 American Pit Bull Mix / Pit Bull Mix  3401
##  8 American Pit Bull Terrier/Pit Bull    3341
##  9 Labrador Retriever Crossbreed         2774
## 10 Pomeranian                            2195
## # … with 290 more rows

How does the number of registrations change over time?

dogs %>% 
  mutate(
    issued_month = lubridate::mdy(LicenseIssuedDate) %>% 
      lubridate::floor_date(unit = "month")) %>% 
  group_by(issued_month) %>% 
  count() %>% 
  ggplot(aes(issued_month, n)) +
    geom_line()

## Warning: Removed 1 rows containing missing values (geom_path).

Peaks in summer? Generally increasing trend - are there more dogs or just more registrations?
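
The warning about one removed row suggests that one `issued_month` value is missing, possibly because a `LicenseIssuedDate` is absent or did not parse. A sketch of how to find such rows, using made-up data (the missing and malformed dates here are assumptions, not values taken from the file):

```r
library(dplyr)
library(lubridate)

# Made-up rows: two valid dates, one missing, one malformed
toy_dogs <- tibble(
  AnimalName        = c("BONITA", "ROCKY", "BLU", "COCO"),
  LicenseIssuedDate = c("10/24/2014", "10/25/2014", NA, "not a date")
)

# quiet = TRUE suppresses the parse-failure warning; failures become NA
unparsed <- toy_dogs %>%
  mutate(issued = mdy(LicenseIssuedDate, quiet = TRUE)) %>%
  filter(is.na(issued))

unparsed
```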

When are dogs born?

Complicated by the fact that animal birth month is entered as a full datetime in M/D/Y format - month and year are meaningful, day and time are not:

dogs %>% 
  pull(AnimalBirthMonth) %>% 
  head()

## [1] "05/01/2013 12:00:00 AM" "05/01/2014 12:00:00 AM"
## [3] "07/01/2010 12:00:00 AM" "02/01/2005 12:00:00 AM"
## [5] "09/01/2012 12:00:00 AM" "11/01/2013 12:00:00 AM"

dogs <- dogs %>% 
  mutate(
    birth_date = lubridate::mdy_hms(AnimalBirthMonth),
    birth_month = lubridate::month(birth_date, label = TRUE),
    birth_year = lubridate::year(birth_date)
    )

When are dogs born?

dogs %>% 
  ggplot(aes(birth_month)) +
  geom_bar()

Is Jan the default value?
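
One rough way to probe this is to compute the share of January births within each birth year: a share far above 1/12 would hint at a default. A sketch on made-up rows (the dates are invented, so the proportions mean nothing by themselves):

```r
library(dplyr)
library(lubridate)

# Made-up birth months in the same format as AnimalBirthMonth
toy_dogs <- tibble(
  AnimalBirthMonth = c(
    "01/01/2010 12:00:00 AM", "01/01/2010 12:00:00 AM", "05/01/2010 12:00:00 AM",
    "01/01/2013 12:00:00 AM", "07/01/2013 12:00:00 AM", "11/01/2013 12:00:00 AM"
  )
)

jan_share <- toy_dogs %>%
  mutate(
    birth_date = mdy_hms(AnimalBirthMonth),
    birth_year = year(birth_date),
    is_january = month(birth_date) == 1
  ) %>%
  group_by(birth_year) %>%
  # mean() of a logical gives the proportion of TRUE values
  summarise(n = n(), prop_january = mean(is_january))

jan_share
```

If owners who know only the birth year are recorded as January, the January share might be higher for older dogs, which this breakdown by year would help show.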

Other ideas:

What does it mean that there might be more than one record per unique dog? Why can't we identify individual dogs?
Combine with census tract data: plot number of registrations against demographics, or registrations of a particular breed against demographics.
Are there breeds that are increasing or decreasing in popularity (hard with only 2 years of registrations)?
Compare lengths of names between dogs with "Unknown" breed to those with a stated breed. E.g. do mutts get shorter names?
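
The name-length comparison in the last idea could start from something like the sketch below. It uses made-up rows and treats `"Unknown"` as the only marker of an unstated breed, which is an assumption about the real data:

```r
library(dplyr)
library(stringr)

# Made-up rows standing in for the real `dogs` data
toy_dogs <- tibble(
  AnimalName = c("MAX", "BELLA", "CHARLIE", "COCO", "UNKNOWN", "BUDDY"),
  BreedName  = c("Unknown", "Unknown", "Pug", "Maltese", "Unknown", "Havanese")
)

name_lengths <- toy_dogs %>%
  mutate(
    # Placeholder names would distort the lengths, so treat them as missing
    name = AnimalName %>% na_if("UNKNOWN") %>% na_if("NAME NOT PROVIDED"),
    breed_known = BreedName != "Unknown"
  ) %>%
  filter(!is.na(name)) %>%
  group_by(breed_known) %>%
  summarise(
    n_dogs        = n(),
    mean_length   = mean(str_length(name)),
    median_length = median(str_length(name))
  )

name_lengths
```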

gvwilson · 2019-05-21T14:24:37Z

I like this a lot - and now we can motivate the introduction of join by looking at the relative prevalence of names for children and dogs :-)

cwickham added the discussion and Novice R labels May 20, 2019

DamienIrving added the dataset label May 28, 2019

lwjohnst86 mentioned this issue Jun 18, 2019

Added (cleaned and messy) Dog data and downloaded CO2 datasets #126

Merged

lwjohnst86 closed this as completed in #126 Jun 25, 2019
