Testing carpet samples for chemical compounds to determine their age using SAS. I use linear regression in SAS Studio with a dataset from "Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content."
To begin the project, you'll need to download the following dataset: Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content.
Source: J. Csapo, Z. Csapo-Kiss, T.G. Martin, S. Folestad, O. Orwar, A. Tivesten, and S. Nemethy (1995). "Age Estimation of Old Carpets Based on Cystine and Cysteic Acid Content," Analytica Chimica Acta, Vol. 300, pp. 313-320.
You will need to download SAS in order to run the code. More details on how to install SAS on a Windows machine are here.
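Once SAS is available and the file is downloaded, the data might be loaded along these lines. This is only a sketch: the file path is hypothetical, and it assumes the data has been saved as a CSV file with a header row naming the columns age, cys_acid, cys, met, and tyr. Adjust the DBMS= option if your copy is in another format.

```sas
/* Hypothetical path: replace with the location of your downloaded file. */
/* Assumes a CSV file whose first row holds the column names.            */
proc import datafile="/home/your_user/carpet.csv"
    out=work.import
    dbms=csv
    replace;
  getnames=yes;
run;

/* Check that age, cys_acid, cys, met, and tyr were read as numeric. */
proc contents data=work.import;
run;
```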
Our covariates are the four organic compounds: Cysteic Acid, Cystine, Methionine, and Tyrosine. My first step was to create a Q-Q plot to check whether the residuals follow a normal distribution.
proc reg DATA=dg.carpet plots(only)=QQPLot;
model age=cys_acid cys met tyr;
ods select QQPlot;
run;
Make sure to import the data file correctly before creating your Q-Q plot. The plot should look like this:
Although the plot deviates slightly in the tails on both ends, the residuals appear to be approximately normally distributed, which is what we want. To further verify that there is a linear relationship, we can plot the residuals, which are the differences between our observed and predicted values. Ideally, the residual plot should look random: a symmetric cloud of points around zero with no distinct pattern.
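/* Limit processing to the first 1000 observations */
options obs=1000;
/* Drop rows with a missing age */
data subset;
set dis2.carpet;
if age=. then delete;
run;
/* Correlations and scatter plot matrix for age and the four compounds */
proc corr data=subset plots=matrix;
var age cys_acid cys met tyr;
run;
/* Fit the model and save predicted values and residuals. */
/* work.carpet_resid is an illustrative name; writing to dis2.carpet would overwrite the source data. */
proc reg data=subset;
model age=cys_acid cys met tyr;
output out=work.carpet_resid p=pred r=resid;
run;
quit;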
Please note there is a typo in line one: the first statement should read 'libname', which associates the chemicals library with a libref. Sorry!
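For reference, a LIBNAME statement generally takes the form sketched here. The folder path is hypothetical; `dis2` is the libref used in the code above, and the path should point to wherever the carpet dataset is stored.

```sas
/* Hypothetical folder: replace with the directory that holds the carpet dataset. */
libname dis2 "/home/your_user/carpet_project";

/* Confirm that the libref resolves and the dataset is visible. */
proc contents data=dis2.carpet;
run;
```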
Your model should look like the image below:
If there is a distinct pattern, outliers, or shape, we can further improve the model. Figure 2 shows the residual plots for each of our four covariates. There doesn't seem to be a distinct pattern, so we can check off these assumptions: the error terms must have a mean of zero, and the variance of the error terms must be constant.
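One way to produce residual plots like these is through the PLOTS= option of PROC REG. The sketch below assumes the `subset` dataset created earlier; it is not necessarily how the original figure was generated. RESIDUALS requests a panel of residuals against each covariate, and RESIDUALBYPREDICTED plots residuals against the fitted values.

```sas
ods graphics on;

/* Residuals against each covariate, plus residuals against predicted age. */
proc reg data=subset plots(only)=(residuals residualbypredicted);
  model age = cys_acid cys met tyr;
run;
quit;
```

A funnel shape in the residual-by-predicted plot would suggest non-constant variance, while a curve would suggest a missing nonlinear term.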
From the data summary, we can note that cysteic acid has the smallest p-value and thus shows the strongest evidence of an association with the age of our carpet samples. All of the compounds have p-values (Pr > F) below 1%, so for any of the four covariates you would reject the null hypothesis of no effect at alpha = 0.01.
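/* List the variables and their types */
proc contents data=carpet;
run;
/* Simple regression of age on cystine */
proc reg data=carpet;
model age=cys;
run;
/* Simple regression of age on methionine */
proc reg data=carpet;
model age=met;
run;
quit;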
Your output should resemble the PROC REG results for our model. Make sure each PROC REG statement is followed by a MODEL statement that specifies the regression model.
Our adjusted coefficient of determination is approximately 0.9946, implying that 99.46% of the variation in age can be explained by our linear model in Cysteic Acid, Cystine, Methionine, and Tyrosine. Though it isn't quite 1, the regression predictions almost perfectly fit the data, so we're on the right track. I tried a logarithmic transformation on age but didn't really see a difference (I expected a tighter Q-Q plot for the data but instead got Figure 3). For this reason, I would suggest sticking to the first model, since its coefficient of determination is closest to one and it gives better results overall.
/* Add natural-log versions of age and two of the compounds */
data work.transform;
set WORK.IMPORT;
log_age=log(age);
log_cys_acid=log(cys_acid);
log_cys=log(cys);
run;
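The transformed model could then be refit to get its Q-Q plot. This is a sketch under an assumption: the text only says age was log-transformed, so the untransformed covariates are kept on the right-hand side here. Swap in `log_cys_acid` and `log_cys` if you want to transform those as well.

```sas
/* Assumed model: log of age against the four untransformed compounds. */
proc reg data=work.transform plots(only)=qqplot;
  model log_age = cys_acid cys met tyr;
run;
quit;
```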
The Q-Q plot for the log-transformed age variable:
We can assess the quality of the fit with the Fit Diagnostics panel that PROC REG produces.
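A minimal way to request the fit diagnostics, assuming the `subset` dataset from earlier, is the DIAGNOSTICS plot request in PROC REG:

```sas
ods graphics on;

/* Show only the diagnostics panel: residual plots, Q-Q plot, */
/* Cook's D, and related summaries for the full model.        */
proc reg data=subset plots(only)=diagnostics;
  model age = cys_acid cys met tyr;
run;
quit;
```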