---
title: "Homework 1"
subtitle: STAT 334
author: "Krishnanshu Gupta, Ishaan Sathaye, Sreshta Talluri"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: readable
    highlight: haddock
    toc: true
    toc_float: true
    code_folding: hide  
---

```{r chunk setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval=TRUE,
                      message=FALSE,
                      warning=FALSE)
```

# Problem 1

## Part (b)

```{r}
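# Import data into R from an Excel file
# (1) Run install.packages("readxl") once if it has never been installed
# (2) Save the data file in the same folder as this R Markdown file
# (3) Run the following commands
library(readxl)
# knitr already runs chunks from the document's folder, so setwd() is only
# needed (and only possible) when running interactively inside RStudio
if (interactive() && requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}
mitchellData <- read_excel("mitchell.xlsx")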
```

## Part (c)

```{r}
# head() returns the top rows of the mitchellData dataframe
head(mitchellData)
# str() displays the structure of the mitchellData dataframe. 
str(mitchellData)
```

## Part (d)

```{r, fig.width=6, fig.height=4}
plot(Temp~Month, data=mitchellData, pch=18, col='darkred', xlab='month', ylab='temp')
```

# Problem 2

## Part (a)
We expect the relationship between scoring average and driving distance to be negative. Longer drives usually put golfers closer to the green, so they should need fewer strokes to reach the green and complete the hole. Thus, as driving distance increases, scoring average should decrease, resulting in a negative relationship.

## Part (b)
We expect the relationship between scoring average and putting average to be positive. A lower putting average means that a golfer requires fewer putts to complete holes, which leads to lower (better) scores. Thus, as putting average increases, scoring average should also increase, resulting in a positive relationship.

## Part (c)
```{r}
golfers = read_excel("golfers.xls")
plot(avgscore ~ driving, data = golfers, pch = 1, col='blue', 
     xlab='Driving Distance', ylab='Scoring Average', 
     main='Scoring Average vs. Driving Distance')
with(golfers, cor(driving, avgscore))

# Marking an example of outlier with different color
points(260, 70.44, col = "red", pch = 19)
```
The scatterplot of scoring average vs. driving distance displays a weak, negative linear relationship. A few data points fall outside the general trend, such as the point at a driving distance of 260 yards, which is marked in red.

## Part (d)
```{r}
plot(avgscore ~ avgputts, data = golfers, pch = 1, col='blue', 
     xlab='Putting Average', ylab='Scoring Average', 
     main='Scoring Average vs. Putting Average')
with(golfers, cor(avgputts, avgscore))

# Marking an example of outlier with different color
points(1.753, 68.78, col = "red", pch = 19)
```
The scatterplot of scoring average vs. putting average displays a moderate, positive linear relationship. A few data points fall outside the general trend, such as the point at a putting average of 1.753, which is marked in red.

## Part (e)
The relationships in the scatterplots match our expectations in Parts (a) and (b). The scatterplot of scoring average vs. putting average displays a moderate linear relationship, while the scatterplot of scoring average vs. driving distance displays a weaker one. The correlation coefficients agree with these observations: the correlation between scoring average and putting average is larger in magnitude, at approximately 0.44, while the correlation between scoring average and driving distance is approximately -0.27.


# Problem 3

## Part (a)
```{r}
y=c(631.8, 338.6, 627.9, 352.6, 699.8, 470.7, 
    557.8, 547.6, 569.83, 321.1, 344.7, 427.67)

# Create a vector of candidate values of yhat (400 through 600)
# We will see which of these candidates minimizes the SAE and the SSE
yhat <- seq(400, 600, by = 0.1)

# Create a function that computes the SAE for a given yhat
SAE.fun <- function(yhat) {
  sum(abs(y - yhat))
}

# Compute the SAE for all the candidate yhat's
SAE <- sapply(yhat, SAE.fun)

# Plot the SAE for each candidate yhat
# cex = 0.5 makes the points smaller
plot(yhat, SAE, cex = 0.5)

# Determine the yhat(s) that minimize the SAE
# (a small tolerance guards against floating-point rounding in the sums)
yhat[abs(SAE - min(SAE)) < 1e-8]
```

Looking at which yhat's give the minimum SAE, we see that the minimum is attained
over a range of values rather than at a single point, and both the median and
the mean fall in that range. On the graph, the median appears to sit at the
middle of the flat minimum range (more so than the mean). This makes sense, as
the median is known to minimize the sum of absolute differences between itself
and each point in a set of data: it splits the data so that half the points are
less than or equal to it and half are greater than or equal to it, so moving
yhat away from the median increases more absolute errors than it decreases.
Because there are 12 observations (an even number), every value between the two
middle observations, 470.7 and 547.6, gives the same minimum SAE.
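
As a quick check of the reasoning above, the following sketch recomputes the SAE at a few specific candidates instead of the whole grid. It re-enters the same twelve observations so that it runs on its own. The helper name `sae_at` and the objects `middle` and `candidates` are introduced here for illustration and are not part of the assignment.

```{r}
# Same twelve observations as in Part (a), repeated so this chunk is self-contained
y <- c(631.8, 338.6, 627.9, 352.6, 699.8, 470.7,
       557.8, 547.6, 569.83, 321.1, 344.7, 427.67)

# Hypothetical helper: sum of absolute errors for a single candidate value
sae_at <- function(candidate) sum(abs(y - candidate))

# With an even number of observations, the two middle order statistics
# bound the interval on which the SAE is flat
middle <- sort(y)[c(length(y) / 2, length(y) / 2 + 1)]
middle

# SAE at the lower middle value, the median, the mean, and the upper middle value
candidates <- c(lower = middle[1], median = median(y),
                mean = mean(y), upper = middle[2])
sapply(candidates, sae_at)

# A candidate 10 units below the interval gives a strictly larger SAE
sae_at(middle[1] - 10) > sae_at(median(y))
```

If the four SAE values printed by `sapply()` agree, that supports the claim that the median is one of many minimizers for this dataset.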

## Part (b)
```{r}
#create a function that computes the SSE for a given yhat
SSE.fun <- function(yhat){
  sum((y-yhat)^2)
}

#compute the SSE for all the candidate yhat's
SSE=sapply(yhat,SSE.fun)

#plot the SSE for each candidate yhat's
#cex=0.5 makes the size of the points smaller 
plot(yhat,SSE,cex=0.5)

#determine the yhat(s) that minimizes the SSE
yhat[SSE==min(SSE)]
```

Looking at which yhat has the minimum SSE, it appears to be the mean. The
graph confirms this, as the minimum is located at yhat = 490.8, the candidate
closest to the mean of about 490.84. This makes sense: to find the minimum of
the SSE mathematically, we take the derivative with respect to yhat, set it
equal to 0, and simplify, and the end result is the equation for the mean.
The mean balances the dataset so that the sum of the errors above the mean is
equal to the sum of the errors below the mean. Because the SSE squares the
errors, it penalizes larger deviations more severely than the SAE does.
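
The derivative argument mentioned above can be written out as follows, using $n$ for the number of observations and $\bar{y}$ for their mean (notation introduced here for the sketch):

$$
\begin{aligned}
\text{SSE}(\hat{y}) &= \sum_{i=1}^{n} (y_i - \hat{y})^2 \\
\frac{d}{d\hat{y}}\,\text{SSE}(\hat{y}) &= -2 \sum_{i=1}^{n} (y_i - \hat{y}) = 0 \\
\sum_{i=1}^{n} y_i &= n\,\hat{y} \\
\hat{y} &= \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
\end{aligned}
$$

The second derivative is $2n > 0$, so this critical point is a minimum.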

## Part (c)
In this exercise, we learned about two loss functions, the SAE and the SSE.
The SAE is the sum of absolute differences between the predicted and observed
values, and the predictor that minimizes it is the median. The median is less
sensitive to outliers, making it a good estimator when there are extreme
values. The SSE is the sum of squared differences between the predicted and
observed values, and the predictor that minimizes it is the mean. The mean
works well for symmetrically distributed data but is more sensitive to
outliers.

This exercise demonstrated the implications of using different measures of
center (mean vs. median). The mean minimizes the SSE because it balances the
deviations above and below it, and since the SSE squares those deviations, the
mean can be heavily influenced by outliers. In contrast, the median minimizes
the SAE by effectively cutting the dataset in half, so that the number of
points above and below it is balanced.
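
To make the point about outliers concrete, here is a small sketch that appends one made-up extreme value to the twelve observations and compares how far the mean and the median move. The value 5000 and the name `y_with_outlier` are assumptions chosen only for illustration and are not part of the assignment data.

```{r}
# Same twelve observations as in Part (a), repeated so this chunk is self-contained
y <- c(631.8, 338.6, 627.9, 352.6, 699.8, 470.7,
       557.8, 547.6, 569.83, 321.1, 344.7, 427.67)

# Hypothetical extreme value, added only to illustrate sensitivity to outliers
y_with_outlier <- c(y, 5000)

# Compare both measures of center before and after adding the outlier
data.frame(
  measure      = c("mean", "median"),
  original     = c(mean(y), median(y)),
  with_outlier = c(mean(y_with_outlier), median(y_with_outlier))
)
```

The mean shifts much more than the median because every observation contributes its full value to the mean, while the median depends only on the order of the observations.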


# Problem 4

## Part (a)

The self-report method has some potential limitations that may have biased the measurements of hangover severity. One limitation is participants' subjective interpretation of the hangover symptoms. Items on the symptom list such as "had difficulty concentrating" or "felt very weak" are open to interpretation: what one person considers very weak might be only a mild feeling to another. This could affect the conclusions, since the factor loadings in Table 4 could be different. Another limitation is that participants were asked to recall symptoms over the past year, so more recent or more severe hangovers may be over-represented. Finally, when comparing men and women, symptoms could be under-reported by women, given the social stigma around reporting symptoms, especially for women.

In terms of how this method affected the conclusions, it is possible that women under-reported their symptoms. However, after controlling for sex differences in alcohol involvement, women had higher scores on the scale than men, which runs against the direction of that potential bias. This suggests that these potential biases did not favor one gender over the other in a way that would drastically change the conclusions.

## Part (b)

The sample was a convenience sample. It consisted of 1230 currently drinking college students from the University of Missouri-Columbia. Of those, 62% were women, 91% were Caucasian, and the average age was 18.8 years, with most between 18 and 22 years old. This sample places several limitations on generalizing the results of the paper. The age range is narrow, so the findings about hangover symptoms may not apply to other age groups. In addition, the college student status, the lack of racial diversity, and the use of just one university make the sample inadequate for generalizing to larger populations. In fact, the paper addresses this limitation by saying, "it is important to point out that participants were called from a single class on a single campus, and this caution must be exercised when comparing these findings to those of other larger, multi-campus studies" (Slutske et al. 1449).

## Part (c)

The two variables measured for each person to provide this result are their score on the HSS (Hangover Symptoms Scale) and their frequency of drinking in the past year. The logical explanatory variable is the frequency of drinking, with the HSS score as the outcome variable. Drinking happens before the symptoms, and it is more frequent drinking that leads to more frequent hangover symptoms; intuitively, drinking more results in more hangovers. The paper also mentions that if the HSS is a valid measure of hangover, then it should be related to drinking heaviness.

## Part (d)

The value r = 0.44 is the Pearson correlation coefficient, which measures the strength and direction of the linear association between two quantitative variables. In this case, the two variables are the HSS score and the frequency of drinking in the past year. A value of 0.44 indicates a moderate, positive linear relationship between HSS scores and drinking frequency: as drinking frequency increases, HSS scores also tend to increase.
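
As a rough illustration of what the Pearson correlation coefficient computes, the sketch below calculates r by hand and compares it with R's built-in `cor()`. The data are simulated for illustration only and are not the study's measurements. The variable names `drinking_freq` and `hss_score` and all simulation settings are assumptions.

```{r}
# Simulated data for illustration only; these are NOT the study's measurements
set.seed(334)
drinking_freq <- rnorm(200, mean = 50, sd = 15)
hss_score <- 0.1 * drinking_freq + rnorm(200, mean = 0, sd = 3)

# Pearson's r: the sum of cross-products of deviations, scaled by the
# square root of the product of the two sums of squared deviations
dx <- drinking_freq - mean(drinking_freq)
dy <- hss_score - mean(hss_score)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function should agree with the manual calculation
c(manual = r_manual, builtin = cor(drinking_freq, hss_score))
```

Because r is scaled by the spread of both variables, it always lies between -1 and 1 and does not depend on the units of either variable.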

## Part (e)

As used in the quote, the word significant means statistically significant: a statistical test was conducted to see whether the correlation was different from zero, and the p-value from that test was below the significance level (the paper reports p \< 0.001). In other words, a sample correlation this large would be unlikely to occur by chance alone if there were no association in the population. So the correlation likely reflects a real, non-zero association between HSS scores and frequency of drinking, and not just sampling error.
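
A test of this kind can be sketched in R with `cor.test()`, which tests the null hypothesis that the population correlation is zero. The data below are simulated for illustration only and are not the study's measurements. The names `x`, `z`, and `result` and the simulation settings are assumptions, and the paper may have used a different procedure.

```{r}
# Simulated data for illustration only; these are NOT the study's measurements
set.seed(334)
x <- rnorm(200)
z <- 0.5 * x + rnorm(200)

# Tests H0: population correlation = 0 against the two-sided alternative
result <- cor.test(x, z, method = "pearson")

result$estimate  # sample correlation r
result$p.value   # probability of an r at least this extreme if H0 were true
```

A p-value below the chosen significance level leads to rejecting the null hypothesis of no linear association, which is the sense in which a correlation is called significant.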