---
title: "California Water Rate Survey Results 2017"
output:
  html_document:
    keep_md: yes
  pdf_document: default
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
library(scales)
# library(raster)
library(RateParser)
library(yaml)
library(purrr)
library(fuzzywuzzyR)
library(lubridate)
library(captioner)
library(png)
library(grid)
library(gridExtra)
library(cowplot)
library(ggthemes)
library(stringr)
library(Cairo)

cadc_red <- "#AA3939"
cadc_blue <- "#32b8df"
cadc_yellow <- "#D4BE6A"

cadc_lightblue <- "#D5DCE8"
```

# Introduction

The [California Data Collaborative ("CaDC")](http://californiadatacollaborative.org/) is a coalition of water utilities that have pioneered a new data-infrastructure 501(c)(3) non-profit to support water managers in meeting their reliability objectives and to serve the public good.

One important contribution of the CaDC was to establish a standard format, and to provide the infrastructure for storing and maintaining an open database of water rates. This facilitates the work of analysts, economists, and software developers interested in analyzing and understanding the differences in water rate structures and prices across many agencies and locations. The water rate structures are organized in [Open-Water-Rate-Specification (OWRS)](https://github.com/California-Data-Collaborative/Open-Water-Rate-Specification) files, a format based on [YAML](http://yaml.org/), which is designed to be easy to store, transmit, and parse in any programming language while also being easy for humans to read.

This report summarizes the types of analyses and insights that can be obtained from the OWRS files, especially when they are combined with water consumption data from water agencies and with Census data.

# Data

This report combines data from four sources:

* Water rates from the [Open-Water-Rate-Specification](https://github.com/California-Data-Collaborative/Open-Water-Rate-Specification).
* Water consumption data reported by the water agencies [ADD MORE DETAIL]
* Demographic data from the American Community Survey [??]
* Qualitative data from a survey conducted by the California Data Collaborative with water agencies in 2017.

```{r, include=FALSE}
source("R/utils.R")
source("R/plots.R")
source("R/fuzzy_matching.R")

#starting function to create Figure captions
fig_nums <- captioner(prefix = "Figure")


#Declare the customer classes to be tested for each utility
customer_classes <- c("RESIDENTIAL_SINGLE"
                      #"RESIDENTIAL_MULTI",
                      #"COMMERCIAL"
)
#End


df_adjustable_sample <- tbl_df(data.frame (usage_month = 4, days_in_period = 30.4, usage_year=2017,
                                           hhsize = 3, meter_size = '3/4"', usage_zone = 1, 
                                           landscape_area = 2000, irr_area = 2000,
                                           et_amount = 4.0, wrap_customer = "No", 
                                           carw_customer = "No", season = "Summer", tax_exemption = "granted", 
                                           lot_size_group = 3, temperature_zone = "Medium", pressure_zone = 1, 
                                           water_font = "city_delivered", city_limits = "inside_city", 
                                           water_type = "potable", rate_class = "C1", dwelling_units = 10, 
                                           elevation_zone = 2, greater_than = "False", usage_indoor_budget_ccf = .3, 
                                           meter_type = "Turbine", block = 1, tariff_area = 1, turbine_meter = "No",
                                           senior = "no", cust_class = customer_classes[1]))

# Usage range for the scatter plots. Setting start equal to end makes processing much faster when only histograms and bar/pie charts are needed.
start <- 0
end <- 50
interval <- 5
df_usage <- as.data.frame(list("usage_ccf"=seq(start,end,interval), "cust_class"="RESIDENTIAL_SINGLE"))
df_sample <-  left_join(df_usage, df_adjustable_sample, by="cust_class")

singleTargetValue <- 15

# Retrieve the directories and files from the Open-Water-Rate-Specification directory
owrs_path <- "../Open-Water-Rate-Specification/full_utility_rates"

#TODO insert the filename gathering function here
df_OWRS <- tbl_df(as.data.frame(list("filepath"=getFileNames(owrs_path)), stringsAsFactors=FALSE)) %>% 
  
  mutate(state = map(filepath, strsplit, split="/") %>% map(c(1,1))) %>%
  
  mutate(owrs_directory = map(filepath, strsplit, split="/") %>% map(c(1,2))) %>%
  
  mutate(filename = map(filepath, strsplit, split="/") %>% map(c(1,3))) %>%
  
  mutate(utility_id = map(owrs_directory, strsplit, split=" ") %>%  
                      map(1) %>%map(tail, n=1) %>%
                      map(gsub, pattern="\\D", replacement="") %>%
                      map(as.numeric)) %>%
  
  mutate(effective_date = as.Date(sapply(filename, extract_date)) ) %>% 
  mutate(utility_name = sapply(as.character(owrs_directory), extract_utility_name) )

df_OWRS <- df_OWRS %>% group_by(utility_name) %>% 
  arrange(desc(effective_date)) %>% 
  filter(row_number()==1)

# Join in supplier report
#load supplier reports, geoinformation and pwsid_record
supplier_reports <- read.csv('data/supplier_report2.csv', stringsAsFactors=FALSE) %>%
  mutate(rgpcd = report_production_calculated*report_percent_residential/report_population/report_days_in_month)
#supplier_geo <- read.csv('data/suppliers.csv', stringsAsFactors=FALSE)
supplier_pwsid <- read.csv('data/utilities_for_OWRS.csv', stringsAsFactors=FALSE)


# append to df_OWRS the best fuzzy match for utility_name to get pwsid
# cutoff chosen arbitrarily; other values can be tested
df_OWRS$utility_name_supplier_report <- as.character(sapply(df_OWRS$utility_name, GetCloseMatches,
                                              sequence_strings = supplier_pwsid$Agency_Name, n=1L, cutoff = 0.85))

owrs_to_supplier_report_manual_map <- list(
  "Big Bear Lake  City Of" = "City of Big Bear Lake, Dept of Water & Power",
  "Calaveras Public Utilities District" = "",
  "Crescent City" = "Crescent City City of",
  "Discovery Bay  Town Of" = "Discovery Bay Community Services District",
  "Golden State Water Company - Hawthorne" = "Hawthorne City of",
  "Golden State Water Company - Lakewood" = "",
  "Marina Coast Water District - Central Marina" = "Marina Coast Water District",
  "Marina Coast Water District - Ord Community" = "Marina Coast Water District",
  "Rio Dell  City Of" = "",
  "San Bernardino County Service Area 64 Spring Valley Lake" = "San Bernardino County Service Area 64",
  "Sierra Estates Mutual Water Company" = ""
)

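# Names absent from the manual map come back as NULL, which as.character() turns into "NULL";
# those rows keep their fuzzy match below
manually_mapped_names <- as.character(owrs_to_supplier_report_manual_map[df_OWRS$utility_name])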
df_OWRS$utility_name_supplier_report <- ifelse(manually_mapped_names == "NULL", 
                                               df_OWRS$utility_name_supplier_report,
                                               manually_mapped_names)

merged_OWRS <- merge(df_OWRS, supplier_pwsid, by.x = "utility_name_supplier_report", by.y = "Agency_Name", 
                     all.x=TRUE, all.y=FALSE)

# Merge in household sizes from Census
district_hhsize <- read.csv("Demographics_by_Water_District/Census_Data/Water_District_Household_Sizes_2017.csv")
supplier_reports <- dplyr::left_join(supplier_reports, district_hhsize, by=c('report_pwsid'='pwsid') )

# calculate bills

df_bill <- calculate_bills_for_all_utilities(merged_OWRS, supplier_reports, df_sample, owrs_path, customer_classes, singleTargetValue)

#Format the Bill Information so that only valid data entries are presented, the decimal points are rounded, and the data is arranged by utility
df_final_bill <- tbl_df(df_bill) %>% filter(!is.na(bill)) %>%
  mutate(bill = round(as.numeric(bill), 2),
         commodity_charge = round(as.numeric(commodity_charge), 2),
         service_charge = round(service_charge, 2),
         utility_name = as.character(utility_name),
         bill_frequency = as.character(bill_frequency)) %>% 
  arrange(utility_name)
#End

# Customized benchmark
df_final_bill_customized <- df_final_bill %>% filter(is_single == TRUE)
df_final_bill <- df_final_bill %>% filter(is_single == FALSE)
# Global 15 CCF benchmark
df_final_bill_15 <- df_final_bill %>% filter(usage_ccf == singleTargetValue)

df_final_bill_single <- df_final_bill_customized
```
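
The name-matching step in the chunk above layers a hand-maintained map on top of the fuzzy matches. The following is a minimal sketch of that override pattern, using made-up utility names and a named character vector in place of a list. Both the names and the vector-based lookup are assumptions for illustration, not the code used in this report.

```r
# Hypothetical fuzzy-match results, keyed by OWRS utility name
fuzzy_match <- c("Alpha  City Of"      = "Alpha City of",
                 "Beta Water District" = "Beta Water Dist",
                 "Gamma Mutual Water"  = "Gamma Municipal Water")

# Hypothetical manual corrections; "" marks a utility with no counterpart
manual_map <- c("Gamma Mutual Water" = "")

# Indexing a named vector by a missing name yields NA,
# so NA means "no manual entry: keep the fuzzy match"
manual <- unname(manual_map[names(fuzzy_match)])
resolved <- ifelse(is.na(manual), fuzzy_match, manual)
names(resolved) <- names(fuzzy_match)

print(resolved)
```

The report's chunk gets the same effect from a list lookup: missing entries come back as `NULL`, and `as.character()` turns them into the string `"NULL"`, which is then used as the "no manual entry" marker.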

Load the supplier report data and join it with the list of utilities from the OWRS files.
```{r, echo=FALSE}
# merge with supplier report
merged_OWRS <- merge(merged_OWRS, supplier_reports, by.x = "PWSID", by.y = "report_pwsid", all.x=TRUE, all.y=FALSE)

# The target GPCD (gallons per capita per day) is assumed to be 55
target_gpcd <- 55
ET_adj_factor <- 0.8
unit_conversion <- 0.62

merged_OWRS$production_target <- target_gpcd * merged_OWRS$report_population * merged_OWRS$report_days_in_month +
  merged_OWRS$report_irr_area_sf * merged_OWRS$report_eto * ET_adj_factor * unit_conversion

merged_OWRS$residential_use <- merged_OWRS$report_production_calculated * merged_OWRS$report_percent_residential

merged_OWRS$pct_above_target <- (merged_OWRS$residential_use / merged_OWRS$production_target) - 1

merged_OWRS$report_monthyear <- as.Date(paste('01', as.character(merged_OWRS$report_month),
                                              as.character(merged_OWRS$report_year)), "%d %m %Y")
```
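
The efficiency target computed above is the sum of a per-capita term and an evapotranspiration term over irrigated area. The sketch below works through that arithmetic for a single utility-month. All input values are invented, and it assumes volumes are in gallons, irrigated area in square feet, and ETo in inches.

```r
# Illustrative inputs for one utility-month (all values are made up)
population    <- 10000   # residents served
days_in_month <- 30
irr_area_sf   <- 5e6     # irrigated area, square feet
eto_inches    <- 5       # reference evapotranspiration, inches

target_gpcd     <- 55    # gallons per capita per day
ET_adj_factor   <- 0.8
unit_conversion <- 0.62  # approx. gallons per square foot per inch of water

# Per-capita part plus landscape part of the target
per_capita_target <- target_gpcd * population * days_in_month
landscape_target  <- irr_area_sf * eto_inches * ET_adj_factor * unit_conversion
production_target <- per_capita_target + landscape_target

# Hypothetical residential use for the same month, in gallons
residential_use  <- 31790000
pct_above_target <- residential_use / production_target - 1

print(production_target)
print(round(pct_above_target, 3))
```

With these inputs the target comes to roughly 28.9 million gallons, and the hypothetical use is about 10% above it. A negative `pct_above_target` would mean use below the target.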


```{r, echo=FALSE}
# Initialize income data

# Create an ordering for the factor levels
income_cats = c("$15,000 to $19,999", "$20,000 to $24,999",   
"$25,000 to $29,999", "$30,000 to $34,999", "$35,000 to $39,999", "$40,000 to $44,999", 
"$45,000 to $49,999", "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999",
"$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", "$200,000 or more")

# Assume a numeric representation (roughly the bracket midpoint) for each factor level
income_placeholders = c(17500, 22500, 27500, 32500, 37500, 42500, 
47500, 55000, 67500, 87500, 112500, 137500, 175000, 200000)

# Relate the factor level with its numeric assumption in a dataframe
df_income_levels = data.frame(list("median_category"=income_cats, "income_placeholder"=income_placeholders)) %>%
  mutate(median_category = factor(median_category, levels = income_cats, ordered = TRUE))

district_income <- read.csv("Demographics_by_Water_District/Census_Data/Water_District_Income_2017.csv") %>%
  select(report_pwsid, report_agency_name, Medium, Medium_val) %>%
  rename(median_category = Medium, median_percentile = Medium_val) %>%
  distinct() %>%
  mutate(median_category = factor(median_category, levels = income_cats, ordered = TRUE))
```


```{r, echo=FALSE, warning=FALSE}
# Initialize qualitative data

#read the survey data in
hd <- read.csv('data/2017 CA-NV Rate Survey.csv', nrows=2, header=FALSE, stringsAsFactors = FALSE)
quali_survey <- read.csv('data/2017 CA-NV Rate Survey.csv', skip=2, header=FALSE, stringsAsFactors = FALSE)
names(quali_survey) <- paste(hd[1,], hd[2,], sep = "#")
names(quali_survey)[36] <- "costs_pct_fixed"
names(quali_survey)[37] <- "rev_pct_fixed"
names(quali_survey)[10] <- "agency_name"
names(quali_survey)[5] <- "IP_address"
names(quali_survey)[14] <- "city-town"
quali_survey <- quali_survey[, !duplicated(colnames(quali_survey))]

quali_survey$costs_pct_fixed <- as.numeric(quali_survey$costs_pct_fixed)
quali_survey$rev_pct_fixed <- as.numeric(quali_survey$rev_pct_fixed)
quali_survey$fixedRev_per_fixedCosts <- quali_survey$rev_pct_fixed / quali_survey$costs_pct_fixed
```
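
The survey export has two header rows, so the chunk above reads the headers and the data separately and then joins the two header rows into one name per column. Here is a small self-contained sketch of that technique on an invented three-column CSV. The column labels and values are hypothetical, and `unlist()` is used here to turn each header row into a plain character vector.

```r
# A tiny stand-in for the survey export: two header rows, then data
# (column labels and values are invented for illustration)
csv_text <- paste(
  "Agency,Agency,Rates",
  "Name,City,Percent fixed",
  "Example Water District,Springfield,35",
  "Sample Utility,Shelbyville,50",
  sep = "\n"
)

# Read the two header rows and the data rows separately
hd  <- read.csv(text = csv_text, nrows = 2, header = FALSE, stringsAsFactors = FALSE)
dat <- read.csv(text = csv_text, skip = 2, header = FALSE, stringsAsFactors = FALSE)

# Combine the header rows into one name per column, e.g. "Agency#Name"
names(dat) <- paste(unlist(hd[1, ]), unlist(hd[2, ]), sep = "#")

# Coerce the answer column to numeric, as done for costs_pct_fixed above
dat[["Rates#Percent fixed"]] <- as.numeric(dat[["Rates#Percent fixed"]])

print(names(dat))
```

Renaming key columns by position, as the report does, is then a matter of assigning to `names(dat)[i]`. This is fragile if the export's column order changes, so matching on the combined names may be a safer alternative.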


```{r past_years_data, echo=FALSE}
# Initialize historical data from previous surveys

df_past_years <- read.csv("data/raftelis_rate_surveys.csv", stringsAsFactors = FALSE) %>%
  rename(bill_frequency = Billing.Frequency,
         utility_name = Water.Service.Provider,
         service_charge = Fixed.Charge,
         commodity_charge = Commodity.Charge,
         bill = Total.Charge,
         bill_type = Rate.Format) %>%
  mutate(bill_type = ifelse(bill_type=="Inclining", "Tiered", bill_type),
         bill_type = ifelse(bill_type=="Declining", "Other", bill_type))

df_past_years <- fuzzy_district_left_join(df_past_years, df_final_bill_15)

# raftelis_to_owrs_manual_map <- list(
#   "Valley Springs Public Utility District" = "",
#   "City of Pittsburg" = "Pittsburg City Of",
#   "Lukins Brothers Water Company, Inc." = "",
#   "Bakman Water Company" = "",
#   "City of Eureka" = "Eureka City Of",
#   "City of Calexico" = "",
#   "City of Bishop" = "",
#   "Arvin Community Services District" = "",
#   "Greenfield County Water District" = "",
#   "Sundale Mutual Water Company" = "",
#   "Clearlake Oaks County Water District" = "",
#   "Callayomi County Water District" = "",
#   "Hilmar County Water District" = "",
#   "North Tahoe Public Utility District" = "",
#   "Colton City of " = "",
#   "City of Colton" = "",
#   "Hesperia Water District" = "",
#   "West Valley Water District" = "",
#   "Yuima Municipal Water District" = "",
#   "Montecito Water District" = "",
#   "Santa Maria City of " = "",
#   "City of Santa Maria" = "",
#   "Sutter Community Services District" = "",
#   "Casitas Municipal Water District" = "",
#   "Westwood Community Services District" = "",
#   "California Water Service Company" = "",
#   "Lompico County Water District" = "",
#   "Oceano Community Services District" = "",
#   "Nipomo Community Services District" = "",
#   "Ramona Municipal Water District" = "",
#   "Rainbow Municipal Water District" = "",
#   "Otay Water District" = "")

# df_past_years <- assign_fuzzy_match_names(df_past_years, 
#                                           source_column_name = "utility_name_raftelis",
#                                           new_name_column = "utility_name_owrs",
#                                           names_to_match_with = df_final_bill_15$utility_name,
#                                           manual_map = raftelis_to_owrs_manual_map,
#                                           cutoff = 0.85)
# 
# cross_year_fixed <- df_past_years %>% 
#   select(Survey.Year, Service.Area, utility_name, utility_name_owrs, service_charge) %>% 
#   tidyr::spread( key=Survey.Year, value=c(service_charge), sep=".") %>%
#   arrange(utility_name) %>%
#   filter(Survey.Year.2013 != "$-") %>%
#   filter(!is.na(Survey.Year.2013)&!is.na(Survey.Year.2015)) %>%
#   select(-Survey.Year.NA) %>%
#   left_join(df_final_bill_15 %>% select(utility_name, service_charge), 
#             by=c("utility_name_owrs"="utility_name"))


cross_year_fixed <- df_past_years %>% 
  select(Survey.Year, Service.Area, utility_name, utility_name_owrs, service_charge) %>% 
  arrange(utility_name) %>%
  filter(Survey.Year == 2015) %>%
  left_join(df_final_bill_15 %>% select(utility_name, service_charge), 
            by=c("utility_name_owrs"="utility_name")) %>%
  filter(!is.na(service_charge.x)&!is.na(service_charge.y)) %>%
  rename(service_charge_2015 = service_charge.x,
         service_charge_2017 = service_charge.y) %>%
  mutate(service_charge_2015 = as.numeric(service_charge_2015),
         service_charge_2017 = as.numeric(service_charge_2017)) %>%
  mutate(service_charge_2015 = ifelse(is.na(service_charge_2015), 0, service_charge_2015),
         service_charge_2017 =ifelse(is.na(service_charge_2017), 0, service_charge_2017)) %>%
  mutate(change = service_charge_2017 - service_charge_2015) %>%
  mutate(percent_change = 100*change/service_charge_2015) %>%
  mutate(percent_change = replace(percent_change, is.nan(percent_change), 0) )


cross_year_commodity <- df_past_years %>% 
  select(Survey.Year, Service.Area, utility_name, utility_name_owrs, commodity_charge) %>% 
  arrange(utility_name) %>%
  filter(Survey.Year == 2015) %>%
  left_join(df_final_bill_15 %>% select(utility_name, commodity_charge), 
            by=c("utility_name_owrs"="utility_name")) %>%
  filter(!is.na(commodity_charge.x)&!is.na(commodity_charge.y)) %>%
  rename(commodity_charge_2015 = commodity_charge.x,
         commodity_charge_2017 = commodity_charge.y) %>%
  mutate(commodity_charge_2015 = as.numeric(commodity_charge_2015),
         commodity_charge_2017 = as.numeric(commodity_charge_2017)) %>%
  mutate(commodity_charge_2015 = ifelse(is.na(commodity_charge_2015), 0, commodity_charge_2015),
         commodity_charge_2017 =ifelse(is.na(commodity_charge_2017), 0, commodity_charge_2017)) %>%
  mutate(change = commodity_charge_2017 - commodity_charge_2015) %>%
  mutate(percent_change = 100*change/commodity_charge_2015) %>%
  mutate(percent_change = replace(percent_change, is.nan(percent_change), 0) )


cross_year_total <- df_past_years %>% 
  select(Survey.Year, Service.Area, utility_name, utility_name_owrs, bill) %>% 
  arrange(utility_name) %>%
  filter(Survey.Year == 2015) %>%
  left_join(df_final_bill_15 %>% select(utility_name, bill), 
            by=c("utility_name_owrs"="utility_name")) %>%
  filter(!is.na(bill.x)&!is.na(bill.y)) %>%
  rename(bill_2015 = bill.x,
         bill_2017 = bill.y) %>%
  mutate(bill_2015 = as.numeric(bill_2015),
         bill_2017 = as.numeric(bill_2017)) %>%
  mutate(bill_2015 = ifelse(is.na(bill_2015), 0, bill_2015),
         bill_2017 =ifelse(is.na(bill_2017), 0, bill_2017)) %>%
  mutate(change = bill_2017 - bill_2015) %>%
  mutate(percent_change = 100*change/bill_2015) %>%
  mutate(percent_change = replace(percent_change, is.nan(percent_change), 0) )


cross_year_rate_type <- df_past_years %>% 
  select(Survey.Year, Service.Area, utility_name, utility_name_owrs, bill_type) %>%
  arrange(utility_name) %>%
  filter(Survey.Year == 2015) %>%
  left_join(df_final_bill_15 %>% select(utility_name, bill_type), 
            by=c("utility_name_owrs"="utility_name")) %>%
  filter(!is.na(bill_type.x)&!is.na(bill_type.y)) %>%
  rename(bill_type_2015 = bill_type.x,
         bill_type_2017 = bill_type.y)

```
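
The cross-year tables above compute a percent change relative to the 2015 value and reset `NaN` results to zero. The short sketch below, on invented charges, shows how that arithmetic behaves at the edge cases. The vectors are hypothetical and are not taken from the survey data.

```r
# Hypothetical 2015 and 2017 charges for four utilities
charge_2015 <- c(20, 25, 0, 0)
charge_2017 <- c(24, 25, 0, 10)

change         <- charge_2017 - charge_2015
percent_change <- 100 * change / charge_2015

# 0/0 gives NaN, which is reset to 0 as in the chunk above
percent_change <- replace(percent_change, is.nan(percent_change), 0)

print(percent_change)

# Count rows where the percent change is not finite
print(sum(!is.finite(percent_change)))
```

A utility that moves from a zero charge to a positive one produces `Inf`. `is.nan()` does not catch this, so such rows may need separate handling before averaging.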


```{r service_charge_15_vs_17, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
tmp <- cross_year_fixed %>% summarise(`2015`=mean(service_charge_2015),
                                      `2017`=mean(service_charge_2017)) %>%
  tidyr::gather(key="Survey Year", value="Mean Service Charge") %>% 
  mutate(`Survey Year` = as.character(`Survey Year`))

p <- ggplot(tmp, aes(`Survey Year`, `Mean Service Charge`)) + 
  geom_col(fill=cadc_blue) +
  theme(axis.text.x = element_text(size = 14), axis.text.y = element_text(size = 14), 
        axis.title = element_text(size = 20), title = element_text(size = 25),
          legend.position = "none") +
# + scale_fill_manual(values=c(cadc_blue, cadc_red))  
  geom_text(aes(y=`Mean Service Charge`+ 1, 
                label=printCurrency(round(`Mean Service Charge`, 2)),
                size=14)
            ) 


img <- "img/eps/service_charge_15_vs_17.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/service_charge_15_vs_17.png"
ggsave(img, p)
knitr::include_graphics(img)
```

```{r commodity_charge_15_vs_17, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
tmp <- cross_year_commodity %>% summarise(`2015`=mean(commodity_charge_2015),
                                          `2017`=mean(commodity_charge_2017)) %>%
  tidyr::gather(key="Survey Year", value="Mean Commodity Charge") %>% 
  mutate(`Survey Year` = as.character(`Survey Year`))

p <- ggplot(tmp, aes(`Survey Year`, `Mean Commodity Charge`)) + 
  geom_col(fill=cadc_blue) +
  theme(axis.text.x = element_text(size = 14), axis.text.y = element_text(size = 14), 
        axis.title = element_text(size = 20), title = element_text(size = 25),
          legend.position = "none") +
# + scale_fill_manual(values=c(cadc_blue, cadc_red))  
  geom_text(aes(y=`Mean Commodity Charge`+ 2, 
                label=printCurrency(round(`Mean Commodity Charge`, 2)),
                size=14)
            )

img <- "img/eps/commodity_charge_15_vs_17.eps"
ggsave(img, p, device = cairo_ps)
img <- "img/commodity_charge_15_vs_17.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r total_bill_15_vs_17, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
tmp <- cross_year_total %>% summarise(`2015`=mean(bill_2015),
                                      `2017`=mean(bill_2017)) %>%
  tidyr::gather(key="Survey Year", value="Mean Bill") %>% 
  mutate(`Survey Year` = as.character(`Survey Year`))

p <- ggplot(tmp, aes(`Survey Year`, `Mean Bill`)) + 
  geom_col(fill=cadc_blue) +
  theme(axis.text.x = element_text(size = 14), axis.text.y = element_text(size = 14), 
        axis.title = element_text(size = 20), title = element_text(size = 25),
          legend.position = "none") +
# + scale_fill_manual(values=c(cadc_blue, cadc_red))  
  geom_text(aes(y=`Mean Bill`+ 3, 
                label=printCurrency(round(`Mean Bill`, 2)),
                size=14)
            )

img <- "img/eps/total_bill_15_vs_17.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/total_bill_15_vs_17.png"
ggsave(img, p)
knitr::include_graphics(img)
```


# Analysis Using Customized Usage Benchmarks from Supplier Report


```{r, echo=FALSE, warning=FALSE}
df_final_bill_single <- df_final_bill_customized


df_final_bill_single <- assign_fuzzy_match_names(df_final_bill_single, 
                                          source_column_name = "utility_name",
                                          new_name_column = "utility_name_supplier_report",
                                           names_to_match_with = supplier_pwsid$Agency_Name,
                                           manual_map = owrs_to_supplier_report_manual_map)

# df_final_bill_single <- merge(df_final_bill_single, supplier_pwsid, by.x = "utility_name_supplier_report", by.y = "Agency_Name", all.x=TRUE, all.y=FALSE)


# merge on supplier report utility name, month and year
eff_vs_rate <- merge(df_final_bill_single, merged_OWRS, 
                     by.x = c("utility_name_supplier_report", "usage_month", "usage_year"),
                     by.y = c("utility_name_supplier_report", "report_month", "report_year"),
                     all.x=TRUE, all.y=FALSE)


df_rates_income <- eff_vs_rate %>% left_join(district_income, by=c('PWSID'='report_pwsid')) %>%
  filter(is.na(median_category)==FALSE) %>%
  left_join(df_income_levels, by="median_category" ) %>%
  mutate(bill_over_income = bill / income_placeholder,
         bill_over_monthly_income = bill / (income_placeholder/12) )
```
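
The affordability measures above divide the bill by a numeric stand-in for the district's median income bracket. The sketch below works this through for one hypothetical district. The bill amount is illustrative, and the placeholder is the value this report assigns to the "$60,000 to $74,999" bracket.

```r
# Hypothetical district whose median income bracket is "$60,000 to $74,999"
income_placeholder <- 67500   # assumed annual income for the bracket, dollars
bill               <- 60.68   # monthly bill in dollars (illustrative)

# Share of annual income and share of monthly income
bill_over_income         <- bill / income_placeholder
bill_over_monthly_income <- bill / (income_placeholder / 12)

# Express the monthly share as a percentage
print(round(100 * bill_over_monthly_income, 2))
```

Because the income side is a bracket placeholder and not an exact median, these ratios are best read as rough comparisons between districts, not as precise household burdens.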


## Summary Statistics

This section discusses general characteristics of the rates for utilities analyzed in this survey.

```{r bill_frequency_pie, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px", fig.cap = bill_freq_cap }
p <- plot_bill_frequency_piechart(df_final_bill_single, 2017)

img <- "img/eps/bill_frequency_pie.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/bill_frequency_pie.png"
ggsave(img, p)
knitr::include_graphics(img)

bill_freq_cap <- fig_nums(name='bill_frequency_piechart', caption= 'Bill Frequency Pie Chart. About three quarters of the water agencies use a monthly billing system.')

```

```{r bill_frequency_pie_2015, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px", fig.cap = bill_freq_cap }
tmp <- df_past_years %>% filter(Survey.Year == 2015) %>%
  mutate(bill_frequency = ifelse(bill_frequency=="Bi-monthly", "Bimonthly", bill_frequency),
         bill_frequency = ifelse(bill_frequency=="Tri-monthly", "Trimonthly", bill_frequency))

p <- plot_bill_frequency_piechart(tmp, 2015)

img <- "img/eps/bill_frequency_pie_2015.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/bill_frequency_pie_2015.png"
ggsave(img, p)
knitr::include_graphics(img)

bill_freq_cap <- fig_nums(name='bill_frequency_piechart_2015', caption= 'Bill Frequency Pie Chart for the 2015 survey.')

```

```{r bill_frequency_pie_2013, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px", fig.cap = bill_freq_cap }
tmp <- df_past_years %>% filter(Survey.Year == 2013) %>%
  mutate(bill_frequency = ifelse(bill_frequency=="Bi-monthly", "Bimonthly", bill_frequency),
         bill_frequency = ifelse(bill_frequency=="Tri-monthly", "Trimonthly", bill_frequency))

p <- plot_bill_frequency_piechart(tmp, 2013)

img <- "img/eps/bill_frequency_pie_2013.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/bill_frequency_pie_2013.png"
ggsave(img, p)
knitr::include_graphics(img)

bill_freq_cap <- fig_nums(name='bill_frequency_piechart_2013', caption= 'Bill Frequency Pie Chart for the 2013 survey.')

```


```{r mean_bill_by_parts_pie, echo=FALSE, message=FALSE, out.width="600px", fig.cap = mean_bill_pie_cap}
p <- plot_mean_bill_pie(df_final_bill_single)

img <- "img/eps/mean_bill_by_parts_pie.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/mean_bill_by_parts_pie.png"
ggsave(img, p)
knitr::include_graphics(img)

mean_bill_pie_cap <- fig_nums(name='mean_bill_pie', caption= 'Average bill by parts for all agencies, considering a consumption of 10 CCF in a month. The average total bill is $60.68, with an average service charge (fixed) of $24.63 (40.6%) and an average commodity charge (variable) of $35.61 (58.7%).')
```


```{r rate_structure_type_pie, echo=FALSE, message=FALSE, out.width="600px"}
p <- plot_rate_type_pie(df_final_bill_single, 2017)

img <- "img/eps/rate_structure_type_pie.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/rate_structure_type_pie.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r rate_structure_type_pie_2015, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
tmp <- df_past_years %>% filter(Survey.Year == 2015)

p <- plot_rate_type_pie(tmp, 2015)

img <- "img/eps/rate_structure_type_pie_2015.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/rate_structure_type_pie_2015.png"
ggsave(img, p)
knitr::include_graphics(img)

```

```{r rate_structure_type_pie_2013, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
tmp <- df_past_years %>% filter(Survey.Year == 2013) %>%
  mutate(bill_type = ifelse(bill_type=="Inclining", "Tiered", bill_type),
         bill_type = ifelse(bill_type=="Declining", "Other", bill_type))

p <- plot_rate_type_pie(tmp, 2013)

img <- "img/eps/rate_structure_type_pie_2013.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/rate_structure_type_pie_2013.png"
ggsave(img, p)
knitr::include_graphics(img)

```


```{r usage_histogram, echo=FALSE, message=FALSE, out.width="600px"}
p <- plot_usage_histogram(df_final_bill_single)

img <- "img/eps/usage_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/usage_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r service_charge_ratio_histogram, echo=FALSE, message=FALSE, out.width="600px"}
# meanpercentFixed <- round(mean(as.numeric(df_final_bill$percentFixed[df_final_bill$usage_ccf == singleTargetValue])), 3)


p <- plot_ratio_histogram(df_final_bill_single)

img <- "img/eps/service_charge_ratio_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/service_charge_ratio_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```

```{r total_bill_histogram, echo=FALSE, message=FALSE, out.width="600px"}
p <- plot_bill_histogram(df_final_bill_single)

img <- "img/eps/total_bill_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/total_bill_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```

## Variation in Bills at Different Use Levels

```{r bill_quantiles_vs_usage, echo=FALSE, message=FALSE, out.width="600px"}
p <- plot_bill_quantiles_vs_usage(df_final_bill)

img <- "img/eps/bill_quantiles_vs_usage.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/bill_quantiles_vs_usage.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r commodity_charge_vs_usage_boxplot, echo=FALSE, message=FALSE, out.width="600px"}
p <- boxplot_bills_vs_usage(df_final_bill, start, end, interval)

img <- "img/eps/commodity_charge_vs_usage_boxplot.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/commodity_charge_vs_usage_boxplot.png"
ggsave(img, p)
knitr::include_graphics(img)
```


## Interaction between Rates and Efficiency

```{r efficiency_goal_time_series_boxplot, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_efficiency_ts(merged_OWRS)

img <- "img/eps/efficiency_goal_time_series_boxplot.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/efficiency_goal_time_series_boxplot.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r gpcd_time_series_boxplot, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_gpcd_ts(merged_OWRS)

img <- "img/eps/gpcd_time_series_boxplot.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/gpcd_time_series_boxplot.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r boxplot_bill_by_region, echo=FALSE, message=FALSE, warning=FALSE, out.width="600px"}
p <- boxplot_bill_by_region(eff_vs_rate)

img <- "img/eps/boxplot_bill_by_region.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/boxplot_bill_by_region.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r average_bill_part_by_region, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}

p <- barchart_average_charge_by_region(eff_vs_rate)

img <- "img/eps/average_bill_part_by_region.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/average_bill_part_by_region.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r efficiency_goal_vs_total_bill_scatter_trend, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_eff_vs_bill(eff_vs_rate)

img <- "img/eps/efficiency_goal_vs_total_bill_scatter_trend.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/efficiency_goal_vs_total_bill_scatter_trend.png"
ggsave(img, p)
knitr::include_graphics(img)
```

## Joining Data from the Qualitative Survey

```{r efficiency_goal_vs_percent_fixed_scatter_trend, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
agg_eff_vs_rate <- eff_vs_rate[c("PWSID", "usage_month", "usage_year",
                                 "bill", "pct_above_target")] %>% na.omit() %>%
                        group_by(PWSID) %>% summarise_all(mean)

p <- plot_eff_vs_pctFixed(eff_vs_rate)# %>% filter(usage_ccf == singleTargetValue))

img <- "img/eps/efficiency_goal_vs_percent_fixed_scatter_trend.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/efficiency_goal_vs_percent_fixed_scatter_trend.png"
ggsave(img, p)
knitr::include_graphics(img)
```

```{r fixed_cost_percentage_histogram, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
# chart_count <- nrow(quali_survey %>% filter(!is.na(costs_pct_fixed)))
p <- plot_fixed_costs_percentage_histogram(quali_survey)

img <- "img/eps/fixed_cost_percentage_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/fixed_cost_percentage_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r fixed_revenue_percentage_histogram, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
# chart_count <- nrow(quali_survey %>% filter(!is.na(rev_pct_fixed)))

p <- plot_fixed_revenue_percentage_histogram(quali_survey)


img <- "img/eps/fixed_revenue_percentage_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/fixed_revenue_percentage_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r fixed_costs_vs_fixed_rev_scatter, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
# chart_count <- nrow(quali_survey %>% filter(!is.na(costs_pct_fixed)&!is.na(rev_pct_fixed)))

p <- plot_fixed_costs_vs_fixed_rev_scatter(quali_survey)

img <- "img/eps/fixed_costs_vs_fixed_rev_scatter.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/fixed_costs_vs_fixed_rev_scatter.png"
ggsave(img, p)
knitr::include_graphics(img)
```

## Affordability

```{r income_bracket_barchart, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px" }
p <- income_bracket_barchart(df_rates_income)

img <- "img/eps/income_bracket_barchart.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/income_bracket_barchart.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r affordability_histogram, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_affordability_histogram(df_rates_income)

img <- "img/eps/affordability_histogram.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/affordability_histogram.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r bill_vs_income_scatter, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_bill_vs_income_scatter(df_rates_income)

img <- "img/eps/bill_vs_income_scatter.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/bill_vs_income_scatter.png"
ggsave(img, p)
knitr::include_graphics(img)
```


# Analysis Using the 15 CCF Usage Benchmark for Comparison with Previous Years


```{r, echo=FALSE, warning=FALSE}
df_final_bill_single <- df_final_bill_15


# df_past_years$utility_name_raftelis <- sapply(df_past_years$utility_name, preprocess_raftelis_name) 
# df_past_years <- assign_fuzzy_match_names(df_past_years, 
#                                           source_column_name = "utility_name_raftelis",
#                                           new_name_column = "utility_name_owrs",
#                                           names_to_match_with = df_final_bill_single$utility_name,
#                                           manual_map = NULL,
#                                           cutoff = 0.85)


df_final_bill_single <- assign_fuzzy_match_names(df_final_bill_single,
                                                 source_column_name = "utility_name",
                                                 new_name_column = "utility_name_supplier_report",
                                                 names_to_match_with = supplier_pwsid$Agency_Name,
                                                 manual_map = owrs_to_supplier_report_manual_map)

# df_final_bill_single <- merge(df_final_bill_single, supplier_pwsid, by.x = "utility_name_supplier_report", by.y = "Agency_Name", all.x=TRUE, all.y=FALSE)


# Merge on the supplier-report utility name, month, and year
eff_vs_rate <- merge(df_final_bill_single, merged_OWRS, 
                     by.x = c("utility_name_supplier_report", "usage_month", "usage_year"),
                     by.y = c("utility_name_supplier_report", "report_month", "report_year"),
                     all.x=TRUE, all.y=FALSE)


df_rates_income <- eff_vs_rate %>% left_join(district_income, by=c('PWSID'='report_pwsid')) %>%
  filter(!is.na(median_category)) %>%
  left_join(df_income_levels, by="median_category" ) %>%
  mutate(bill_over_income = bill / income_placeholder,
         bill_over_monthly_income = bill / (income_placeholder/12) )
```
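
As a rough illustration of the two affordability ratios computed in the chunk above, the sketch below applies the same arithmetic to a small made-up table. The agency names, bills, and incomes are hypothetical values chosen for readability, not survey results; only the column names `bill` and `income_placeholder` come from the analysis code.

```r
# Hypothetical monthly bills and approximate annual median incomes
# (illustrative values only, not taken from the survey)
toy_income <- data.frame(
  utility_name = c("Agency A", "Agency B"),
  bill = c(60, 90),
  income_placeholder = c(36000, 72000)
)

# Monthly bill as a share of annual income
toy_income$bill_over_income <- toy_income$bill / toy_income$income_placeholder

# Monthly bill as a share of monthly income (annual income divided by 12)
toy_income$bill_over_monthly_income <-
  toy_income$bill / (toy_income$income_placeholder / 12)

toy_income
```

Under these assumed numbers, the first agency's bill is 2% of monthly income and the second's is 1.5%, which shows how a higher bill can still be more affordable where incomes are higher.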


## Summary Statistics

This section discusses general characteristics of the rates for utilities analyzed in this survey.


```{r mean_bill_by_parts_pie_15, echo=FALSE, message=FALSE, out.width="600px", fig.cap = mean_bill_pie_cap}
p <- plot_mean_bill_pie(df_final_bill_single)

img <- "img/eps/15ccf/mean_bill_by_parts_pie_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/mean_bill_by_parts_pie_15.png"
ggsave(img, p)
knitr::include_graphics(img)

mean_bill_pie_cap <- fig_nums(name='mean_bill_pie', caption= 'Average bill by parts for all agencies, considering a consumption of 10 CCF in a month. The average total bill is $60.68, with an average service charge (fixed) of $24.63 (40.6%) and an average commodity charge (variable) of $35.61 (58.7%).')
```


```{r service_charge_ratio_histogram_15, echo=FALSE, message=FALSE, out.width="600px"}
# meanpercentFixed <- round(mean(as.numeric(df_final_bill$percentFixed[df_final_bill$usage_ccf == singleTargetValue])), 3)


p <- plot_ratio_histogram(df_final_bill_single)

img <- "img/eps/15ccf/service_charge_ratio_histogram_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/service_charge_ratio_histogram_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```
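
The share of each bill recovered through the fixed service charge can be sketched as follows. This is a minimal example that assumes the ratio is defined as `service_charge / bill`, with `bill` taken as the sum of the service and commodity charges; the internals of `plot_ratio_histogram` are not shown in this report and may differ. All values are hypothetical.

```r
# Hypothetical bill components for three agencies (illustrative values only)
toy_bills <- data.frame(
  utility_name = c("Agency A", "Agency B", "Agency C"),
  service_charge = c(20, 30, 10),
  commodity_charge = c(30, 30, 40)
)

# Total bill, assumed here to be the sum of fixed and variable parts
toy_bills$bill <- toy_bills$service_charge + toy_bills$commodity_charge

# Fraction of the bill that is fixed
toy_bills$pct_fixed <- toy_bills$service_charge / toy_bills$bill

# Average fixed share across agencies
mean_pct_fixed <- mean(toy_bills$pct_fixed)
mean_pct_fixed
```

Note that the mean of the per-agency ratios is generally not the same as the ratio of the mean service charge to the mean bill, so the two summaries should not be compared directly.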

```{r total_bill_histogram_15, echo=FALSE, message=FALSE, out.width="600px", out.height="800px"}
p1 <- plot_bill_histogram(df_final_bill_customized, axis=FALSE, title_text = "Total Bill - Localized")
p2 <- plot_bill_histogram(df_final_bill_15, title_text = "Total Bill - 15 CCF")

p <- plot_grid(p1, p2, align = "h", nrow = 2)

img <- "img/eps/comparisons/total_bill_histograms.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/comparisons/total_bill_histograms.png"
ggsave(img, p)
knitr::include_graphics(img)

# img <- "img/15ccf/total_bill_histogram_15.png"
# ggsave(img, p)
# knitr::include_graphics(img)
```


## Interaction between Rates and Efficiency

```{r efficiency_goal_time_series_boxplot_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_efficiency_ts(merged_OWRS)

img <- "img/eps/15ccf/efficiency_goal_time_series_boxplot_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/efficiency_goal_time_series_boxplot_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r gpcd_time_series_boxplot_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_gpcd_ts(merged_OWRS)

img <- "img/eps/15ccf/gpcd_time_series_boxplot_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/gpcd_time_series_boxplot_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r boxplot_bill_by_region_15, echo=FALSE, message=FALSE, warning=FALSE, out.width="600px"}
p <- boxplot_bill_by_region(eff_vs_rate)

img <- "img/eps/15ccf/boxplot_bill_by_region_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/boxplot_bill_by_region_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r average_bill_part_by_region_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}

p <- barchart_average_charge_by_region(eff_vs_rate)

img <- "img/eps/15ccf/average_bill_part_by_region_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/average_bill_part_by_region_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r efficiency_goal_vs_total_bill_scatter_trend_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_eff_vs_bill(eff_vs_rate)

img <- "img/eps/15ccf/efficiency_goal_vs_total_bill_scatter_trend_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/efficiency_goal_vs_total_bill_scatter_trend_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```

## Joining Data from the Qualitative Survey

```{r efficiency_goal_vs_percent_fixed_scatter_trend_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
agg_eff_vs_rate <- eff_vs_rate[c("PWSID", "usage_month", "usage_year",
                                 "bill", "pct_above_target")] %>% na.omit() %>%
                        group_by(PWSID) %>% summarise_all(mean)

p <- plot_eff_vs_pctFixed(eff_vs_rate) # %>% filter(usage_ccf == singleTargetValue))

img <- "img/eps/15ccf/efficiency_goal_vs_percent_fixed_scatter_trend_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/efficiency_goal_vs_percent_fixed_scatter_trend_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```
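
A per-utility average such as the `agg_eff_vs_rate` table built in the chunk above can be sketched in base R. This example uses `aggregate()` rather than the dplyr pipeline from the report, and the PWSID codes and values are hypothetical. With the formula interface, rows containing missing values are dropped before averaging, which plays the role of the `na.omit()` step.

```r
# Hypothetical monthly records for two utilities (illustrative values only)
toy_eff <- data.frame(
  PWSID = c("CA001", "CA001", "CA002", "CA002"),
  bill = c(50, 70, 80, 100),
  pct_above_target = c(0.1, 0.3, 0.0, NA)
)

# Average bill and share above target per utility;
# the row with a missing pct_above_target is excluded by default
toy_agg <- aggregate(cbind(bill, pct_above_target) ~ PWSID,
                     data = toy_eff, FUN = mean)
toy_agg
```

Because the incomplete record for the second utility is dropped entirely, its average bill reflects only the one complete month.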


## Affordability

```{r affordability_histogram_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_affordability_histogram(df_rates_income)

img <- "img/eps/15ccf/affordability_histogram_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/affordability_histogram_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r bill_vs_income_scatter_15, echo=FALSE, warning=FALSE, message=FALSE, out.width="600px"}
p <- plot_bill_vs_income_scatter(df_rates_income)

img <- "img/eps/15ccf/bill_vs_income_scatter_15.eps"
ggsave(img, p, device=cairo_ps)
img <- "img/15ccf/bill_vs_income_scatter_15.png"
ggsave(img, p)
knitr::include_graphics(img)
```


```{r create_summary_table, echo=FALSE}
df_rates_income <- df_final_bill_customized %>% left_join(district_income, by=c('pwsid'='report_pwsid')) %>%
  left_join(df_income_levels, by="median_category" ) %>%
  mutate(bill_over_income = bill / income_placeholder,
         bill_over_monthly_income = bill / (income_placeholder/12) )

df_rates_income$tier_prices_simplified <- ifelse(is.na(df_rates_income$tier_prices), 
                                              df_rates_income$flat_rate,
                                              df_rates_income$tier_prices)

cols <- c('utility_name', 'pwsid', 'effective_date', 'bill_frequency', 'bill_unit','usage_ccf', 'hhsize', 
          'et_amount', 'bill_type', 'service_charge', 'commodity_charge', 'bill', 'tier_starts', 
          'tier_prices_simplified', 'median_category', 'income_placeholder')
df_summary <- df_rates_income[cols] %>%
  rename(median_income_category = median_category,
         approximate_median_income = income_placeholder,
         tier_prices = tier_prices_simplified,
         avg_household_size = hhsize)


write.csv(df_summary, file = 'summary_table.csv', row.names = FALSE)

```
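
The `tier_prices_simplified` column in the summary table falls back to the flat rate whenever a utility has no tiered prices. A minimal sketch of that fallback is shown below, assuming both columns are stored as character strings; the agency names, prices, and the comma-separated layout are hypothetical.

```r
# Hypothetical rate records: one tiered utility and one flat-rate utility
toy_rates <- data.frame(
  utility_name = c("Agency A", "Agency B"),
  tier_prices = c("2.10, 3.40, 5.00", NA),
  flat_rate = c(NA, "2.75"),
  stringsAsFactors = FALSE
)

# Use the flat rate where tier prices are missing, otherwise keep the tiers
toy_rates$tier_prices_simplified <- ifelse(is.na(toy_rates$tier_prices),
                                           toy_rates$flat_rate,
                                           toy_rates$tier_prices)
toy_rates
```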