This analysis of the MechaCar dataset examines relationships between vehicle features and miles per gallon (mpg) across a variety of cars. Several types of statistical analysis were performed in RStudio to identify potential relationships and to assess the reliability of each test.
## Linear Regression to Predict MPG

The following linear regression model and its summary were created using the MechaCar_mpg dataset. The outputs are as follows:
Linear Regression:
Linear Regression Summary:
From the output of the linear regression and its summary, we can conclude the following:
- The variables/coefficients that provided a non-random amount of variance to the mpg values in the data set are vehicle length and ground clearance. This means these variables have a statistically significant effect on miles per gallon (mpg) in the data analyzed. Their p-values are 2.60e-12 (vehicle length) and 5.21e-08 (ground clearance), both well under the significance threshold of 0.05. The other variables considered (vehicle weight, spoiler angle, and whether the car is all-wheel drive (AWD)) do not have p-values under 0.05 and are therefore not statistically significant predictors of mpg within the given data.
- The slope of the linear model is not considered to be zero. A slope of zero would indicate no linear relationship between the variables and mpg. The model's p-value is 5.35e-11, well below the threshold of 0.05, so there is sufficient evidence to reject the null hypothesis that the slope is zero in favor of the alternative that the slope is non-zero.
- The model predicts the mpg of MechaCar prototypes reasonably effectively, given its r-squared value of 0.7149. This indicates that roughly 71.49% of the variability in mpg is explained by the model. Because the intercept is also significant (p-value 5.08e-08), there may be additional variables that help explain the variability of mpg; transforming the existing variables or including additional data sets may further improve the model's predictions.
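
The quantities discussed in the list above (per-coefficient p-values, the overall model p-value, and r-squared) can all be read from an `lm` summary in R. The following is a minimal sketch, not the original analysis script: the helper name `summarize_linear_model` is hypothetical, and it is demonstrated on R's built-in `mtcars` data rather than the MechaCar data, so the printed values will not match the figures reported here.

```r
# Hypothetical helper: regress `response` on every other column of `data`
# and collect the values interpreted in the text above.
summarize_linear_model <- function(data, response = "mpg", alpha = 0.05) {
  model <- lm(as.formula(paste(response, "~ .")), data = data)
  fit <- summary(model)

  # One p-value per coefficient, including the intercept
  coef_p <- fit$coefficients[, "Pr(>|t|)"]

  # Overall model p-value, derived from the F-statistic
  f <- fit$fstatistic
  model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

  list(
    significant_terms    = names(coef_p)[coef_p < alpha],
    coefficient_p_values = coef_p,
    model_p_value        = model_p,
    r_squared            = fit$r.squared
  )
}

# Illustration on built-in data (NOT the MechaCar data)
cars <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
result <- summarize_linear_model(cars)
print(result$significant_terms)
print(result$model_p_value)
print(result$r_squared)
```

To apply the same idea to the MechaCar data, the data frame would be loaded first (for example with `read.csv`) and passed in place of `cars`; the exact file and column names depend on the project's resources.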
## Summary Statistics on Suspension Coils
- Design specifications for the MechaCar suspension coils dictate that the variance of the suspension coils cannot exceed 100 pounds per square inch (PSI). Taking summary statistics with R, we can conclude that across the entire dataset the PSI variance is 62.29, well below the 100 PSI variance limit, as shown below:
- Three different manufacturing lots are represented within the data. To ensure precise and consistent results, the same summary statistics were run while grouping by manufacturing lot. From the resulting statistics we can conclude that only lots 1 and 2 are within PSI specifications, while lot 3 is above the 100 PSI variance limit. The very low variance of lots 1 and 2 pulls down the variance of the combined data, which is why the dataset as a whole passes the specification. This is why in-depth analysis is critical: it allows variance within individual lots to be scrutinized and problems to be identified. Lot 3 would need significant improvement to pass, as it is well above the limit. The resulting tables from the grouped summary statistics are as follows:
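
The effect described above, where pooled variance passes while one lot fails, can be reproduced with a short base-R sketch. The helper name `summarize_psi` and the column names `PSI` and `Manufacturing_Lot` are assumptions, and the small data frame is made up purely for illustration; it is not the suspension coil data. The original analysis may well have used `dplyr` instead of `aggregate`.

```r
# Hypothetical helper: summary statistics for PSI overall and per lot,
# flagging whether the variance stays within the design limit.
summarize_psi <- function(coils, limit = 100) {
  stats <- function(x) {
    c(mean = mean(x), median = median(x), variance = var(x), sd = sd(x))
  }

  # Statistics over every coil, ignoring lots
  total <- as.data.frame(t(stats(coils$PSI)))
  total$within_spec <- total$variance <= limit

  # Same statistics grouped by lot; do.call() flattens the matrix column
  by_lot <- do.call(data.frame,
                    aggregate(PSI ~ Manufacturing_Lot, data = coils, FUN = stats))
  names(by_lot) <- c("Manufacturing_Lot", "mean", "median", "variance", "sd")
  by_lot$within_spec <- by_lot$variance <= limit

  list(total = total, by_lot = by_lot)
}

# Made-up illustrative values (NOT the MechaCar measurements)
toy <- data.frame(
  Manufacturing_Lot = rep(c("LotA", "LotB", "LotC"), each = 5),
  PSI = c(1500, 1501, 1499, 1500, 1500,
          1498, 1502, 1500, 1499, 1501,
          1480, 1520, 1490, 1510, 1500)
)
psi_summary <- summarize_psi(toy)
print(psi_summary$total)
print(psi_summary$by_lot)
```

In this toy data the combined variance is under the limit even though `LotC` is far over it, which mirrors why the grouped table is the one to check.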
## T-Tests on Suspension Coils

T-tests were performed to test the hypothesis that the mean PSI across the manufacturing lots is statistically different from the population mean (1,500 PSI). The null hypothesis is that there is no statistical difference between the mean PSI of the manufacturing lots and the population mean (1,500 PSI).
A t-test was performed on the data with all lots represented together. The resulting t-test shows a p-value of 2.2e-16, well below the significance level of 0.05. This means that for this data set we have sufficient evidence to reject the null hypothesis and therefore conclude that the means are statistically different. The results of the t-test for all data within the suspension coils data set are as follows:
When looking at the summary statistics for each manufacturing lot individually, it was found that the total data set did not accurately represent the individual lots. T-tests were therefore performed on each lot individually, with the same hypothesis and null hypothesis in mind.
- Lot 1's t-test shows a p-value of 1, well above the significance level of 0.05. This means that for this lot we do not have sufficient evidence to reject the null hypothesis, so the lot mean is not statistically different from the population mean. The t-test for lot 1 is depicted below:
- Lot 2's t-test shows a p-value of 0.6072, above the significance level of 0.05. This means that for this lot we do not have sufficient evidence to reject the null hypothesis, so the lot mean is not statistically different from the population mean. The t-test for lot 2 is depicted below:
- Lot 3's t-test shows a p-value of 0.04168, below the significance level of 0.05. This means that for this lot we have sufficient evidence to reject the null hypothesis and therefore conclude that the lot mean is statistically different from the population mean. The t-test for lot 3 is depicted below:
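
The pooled and per-lot tests above follow the same pattern, which can be sketched with base R's `t.test`. The helper name `psi_t_tests` and the column names are assumptions, and the data frame holds made-up values for illustration only. Note that `t.test` defaults to `mu = 0`, so the population mean has to be passed explicitly for a comparison against 1,500 PSI.

```r
# Hypothetical helper: one-sample t-tests of PSI against a population mean,
# first for all coils together and then for each manufacturing lot.
psi_t_tests <- function(coils, mu = 1500, alpha = 0.05) {
  # mu must be supplied; t.test() would otherwise test against 0
  p_of <- function(x) t.test(x, mu = mu)$p.value

  p_values <- c(
    all_lots = p_of(coils$PSI),
    sapply(split(coils$PSI, coils$Manufacturing_Lot), p_of)
  )

  data.frame(
    group       = names(p_values),
    p_value     = unname(p_values),
    reject_null = unname(p_values < alpha)
  )
}

# Made-up illustrative values (NOT the MechaCar measurements)
toy <- data.frame(
  Manufacturing_Lot = rep(c("LotA", "LotB", "LotC"), each = 5),
  PSI = c(1500, 1501, 1499, 1500, 1500,
          1498, 1502, 1500, 1499, 1501,
          1490, 1492, 1488, 1491, 1489)
)
t_results <- psi_t_tests(toy)
print(t_results)
```

Returning one row per group keeps the pooled result next to the per-lot results, which makes it easier to spot a lot that behaves differently from the combined data.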
## Study Design: MechaCar vs Competition
To quantify how MechaCar performs against the competition, a variety of metrics could be compared with those of other car manufacturers. The metrics most likely to interest consumers are cost, fuel efficiency, safety ratings, and whether the car is available in all-wheel drive (AWD). Additional data sets containing information on these car metrics would be required. demo2.csv in resources would be a good starting data set, though others such as R's built-in Motor Trend Car Road Tests (mtcars) could be used as well.
To test each metric, MechaCar's mean for that variable would be compared with the competition's. The hypothesis is that MechaCar's metric is better than the competition's, and the null hypothesis is that there is no statistical difference between MechaCar and the competition for that metric.
To compare cost, vehicles could be grouped by type, averaged, and compared through summary statistics to see whether MechaCar's vehicle costs are statistically different from those of the competition. Since the dependent variable (cost) is continuous and the grouping variable is categorical, an ANOVA test could be used to determine statistical significance.
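
A one-way ANOVA of this kind could be sketched in R with `aov`. The helper name `one_way_anova_p` is hypothetical, and because no cost data is available here, the example uses the built-in `mtcars` data with `mpg` as the continuous variable and `cyl` as the categorical grouping. It only illustrates the mechanics and says nothing about MechaCar or its competitors.

```r
# Hypothetical helper: p-value of a one-way ANOVA comparing the mean of a
# continuous column across the levels of a categorical column.
one_way_anova_p <- function(data, value, group) {
  # factor() ensures the grouping column is treated as categorical
  model_formula <- as.formula(paste(value, "~ factor(", group, ")"))
  fit <- aov(model_formula, data = data)

  # First row of the ANOVA table belongs to the grouping term
  summary(fit)[[1]][["Pr(>F)"]][1]
}

# Illustration on built-in data (NOT cost data)
p_value <- one_way_anova_p(mtcars, value = "mpg", group = "cyl")
print(p_value)
print(p_value < 0.05)  # TRUE would mean at least one group mean differs
```

For the cost comparison, `value` would be the cost column and `group` the manufacturer or vehicle type column of the joined data set.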
To compare fuel efficiency, miles per gallon (mpg), whether city, highway, or overall, could be compared with grouping and summary statistics. Joining the data sets and grouping when running summary statistics would likely create the most easily interpreted table, with all information readily available in one result, as was done with the summary statistics for the suspension coils. Box plots could then be used to visually represent outliers.
Many other metrics could be compared in similar ways, with adjustments based on data types and the number of variables being compared. The following informational sheet on statistical tests and data types could prove useful in choosing the appropriate statistical test and interpreting its results. (Visual representation supplied by Vanderbilt's Data Analytics Boot Camp, available for view in resources file):
## Contact Me