---
title: "Tutorial 4: Modeling the Relation between Two Variables (Drug Concentration vs Viability)"
output: html_document
---
```{r echo=FALSE}
knitr::opts_chunk$set(cache=FALSE, warning=FALSE)
```
IC50 and AUC statistics are designed to summarize a drug response curve into
a single number. This summarization step facilitates downstream
analyses. Apart from summarizing drug responses, IC50 and AUC also have
intuitive interpretations. For an overview of these statistics, have a
look at Tutorial #2 (Using Correlation Measures to Assess
Replicability of Drug Response Studies).

A limitation of these summary statistics, however, is that they usually
require assumptions about the data. As we will see in this vignette, some of
these assumptions might not always hold. When going through this vignette,
try to think about the following question: Can the inconsistencies
between the different studies be attributed to the modelling assumptions?

## Exploring the drug response data

Let's start by exploring the IC50 and AUC statistics that were
published in the original manuscripts. We first load the data into the current working session and
define a function that visualizes the relation between
drug response and drug concentration.
```{r plotResponse}
rawFile <- "rawPharmacoData.csv"
summarizedFile <- "summarizedPharmacoData.csv"
# Run downloadData.R if the raw data file is not available locally
if( !file.exists( rawFile ) ){
  source("downloadData.R")
}
pharmacoData <- read.csv(rawFile)
summarizedData <- read.csv(summarizedFile)
library(ggplot2)
library(dplyr)
library(cowplot)

# Plot viability against log10 concentration for one drug-cell line pair,
# optionally marking the published IC50 of each study with a vertical line
plotResponse <- function(drugA, cellLineA, addPublishedIC50=TRUE ){
  pharSub <- filter( pharmacoData, drug==drugA, cellLine==cellLineA )
  sumSub <- filter( summarizedData, drug==drugA, cellLine==cellLineA )
  p <- ggplot( pharSub, aes( log10(concentration), viability, col=study)) +
    geom_point(size=2.1) + geom_line(lwd=1.1) + ylim(0, 150)
  if( addPublishedIC50 ){
    p <- p + geom_vline( xintercept=log10( sumSub[,"ic50_CCLE"] ), col="#d95f02", linetype="longdash") +
      geom_vline( xintercept=log10( sumSub[,"ic50_GDSC"]), col="#1b9e77", linetype="longdash") +
      geom_hline( yintercept=50, col="#00000050", linetype="longdash")
  }
  p <- p + scale_colour_manual( values = c("CCLE" = "#d95f02", "GDSC" = "#1b9e77" ))
  # Make sure the x axis covers both the tested concentrations and the published IC50 values
  xlims <- xlim( range(log10(c(pharSub$concentration, sumSub$ic50_CCLE, sumSub$ic50_GDSC ) ) ) )
  p + xlims
}
```
The function defined above visualizes the viability scores as a function
of the drug concentration in each study. The vertical dashed lines
display the IC50 values published by each study. Let's start by
exploring how the response curve for the drug 17-AAG behaves in the
cell line H4. Note that this drug was reported to have consistent viability
responses between the two studies.
```{r}
plotResponse( drugA="17-AAG", cellLineA="H4", TRUE )
```
What observations can you draw from this curve? Do the response
data satisfy the assumptions needed to estimate an IC50 value?

Let's now select another drug-cell line combination.
```{r}
plotResponse( drugA="Nilotinib", cellLineA="22RV1" )
```
Do the reported IC50 values reflect the actual behaviour
of the response curves? How can IC50 values be estimated if
there are no viabilities below 50%, as in the second example?
How did the two studies deal with these cases?
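One way to make these questions concrete is to check whether a measured curve actually crosses 50% viability. The sketch below is an illustration only and is not the procedure used by either study: `interpolateIC50` is a hypothetical helper, and the concentrations and viabilities are made-up numbers. It interpolates linearly on the log10 concentration scale and returns `NA` when the curve never reaches 50% within the tested range.

```{r}
# Hypothetical helper (not from either study): concentration at which the
# observed curve first reaches the threshold, by linear interpolation
interpolateIC50 <- function( concentration, viability, threshold=50 ){
  ord <- order( concentration )
  x <- log10( concentration[ord] )
  y <- viability[ord]
  below <- which( y <= threshold )
  if( length( below ) == 0 ){
    return( NA_real_ )  # the threshold is never reached in the tested range
  }
  i <- below[1]
  if( i == 1 ){
    return( 10^x[1] )   # already at or below the threshold at the lowest dose
  }
  # Interpolate between the two points that bracket the threshold
  frac <- ( y[i-1] - threshold ) / ( y[i-1] - y[i] )
  10^( x[i-1] + frac * ( x[i] - x[i-1] ) )
}

# Synthetic curves (made-up values, for illustration only)
toyConc <- c(0.01, 0.1, 1, 10)
interpolateIC50( toyConc, c(100, 90, 10, 5) )   # crosses 50% between 0.1 and 1
interpolateIC50( toyConc, c(100, 95, 90, 80) )  # NA: viability stays above 50%
```

A result of `NA` signals that any IC50 reported for such a curve must come from extrapolation or from a convention such as reporting the highest tested concentration.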

## Logistic regression

A common way to model viability response curves is to fit logistic
regression models. If you would like to know more about
logistic regression or modelling approaches in general, [this book](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf)
gives an excellent introduction to these topics.

The idea of the model is that it should describe how
viability decreases as the drug concentration increases.
Let's write a function that fits a logistic regression model to
the data. The *fitLogisticModel* function defined below receives as input
a drug, a cell line and a study, and fits a regression
model viability ~ concentration on these data, with the concentration on the log10 scale.
```{r}
fitLogisticModel <- function(drugA, cellLineA, studyA){
  pharSub <- filter( pharmacoData, drug==drugA, cellLine==cellLineA, study==studyA)
  # Round viability to whole numbers so it can be used as a binomial count
  pharSub$viability <- round(pharSub$viability)
  pharSub$concentration <- log10( pharSub$concentration )
  # Binomial denominator: 100, or the viability itself when it exceeds 100
  maxVal <- pmax( pharSub$viability, 100 )
  # Model "viable out of maxVal" as a function of log10 concentration
  fit <- glm( cbind( viability, maxVal-viability ) ~ concentration,
              data=pharSub, family=binomial )
  fit
}
```
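To see what this model form implies, the following minimal sketch fits the same kind of binomial model to synthetic data. The `toyData` values are made up for illustration and do not come from CCLE or GDSC. It shows that the fitted curve is 100 times the logistic function of a straight line in log10 concentration.

```{r}
# Synthetic example (made-up values): log10 concentrations and rounded viabilities
toyData <- data.frame( concentration=seq(-2, 1, by=0.5) )
toyData$viability <- round( 100 * plogis( -1 - 3 * toyData$concentration ) )

# Binomial response: "viable" units out of 100 at each concentration
toyFit <- glm( cbind( viability, 100-viability ) ~ concentration,
               data=toyData, family=binomial )
coef( toyFit )

# The fitted curve equals 100 * plogis( intercept + slope * log10 concentration )
b <- coef( toyFit )
manual <- 100 * plogis( b["(Intercept)"] + b["concentration"] * toyData$concentration )
all.equal( unname( manual ), unname( 100 * fitted( toyFit ) ) )
```

A negative slope means that viability decreases with increasing concentration, and the size of the slope controls how steep the transition is.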
Let's now use this function to fit models on the data. We will use the
two drug-cell line combinations mentioned in the first section of this vignette.
```{r}
lrCCLE1 <- fitLogisticModel( "17-AAG", "H4", "CCLE" )
lrGDSC1 <- fitLogisticModel( "17-AAG", "H4", "GDSC" )
lrCCLE2 <- fitLogisticModel( "Nilotinib", "22RV1", "CCLE" )
lrGDSC2 <- fitLogisticModel( "Nilotinib", "22RV1", "GDSC" )
lrCCLE1
lrCCLE2
```
Let's evaluate the logistic regression models by plotting the model
and the raw data together. The function *predictValues* receives a fit as
input and returns the response values predicted by that model.
The *plotFit* function defined below overlays
the model predictions on the raw data.
```{r}
# Predict viability (in %) on an evenly spaced grid that spans the tested
# log10 concentrations
predictValues <- function( fit, numPred=1000){
  minConc <- min( fit$data$concentration )
  maxConc <- max( fit$data$concentration )
  valuesToPredict <- seq(minConc, maxConc, length.out=numPred)
  predicted <- predict( fit,
                        data.frame(concentration=valuesToPredict),
                        type="response" )
  data.frame( concentration=valuesToPredict,
              viability=predicted*100 )
}
# Overlay the fitted curves of both studies (dashed lines) on a plot
# created with plotResponse
plotFit <- function(p, fitCCLE, fitGDSC ){
  p <- p + geom_line( aes( concentration, viability ),
                      data=predictValues( fitCCLE ), lwd=1.2,
                      linetype="dashed", col="#d95f02" )+
    geom_line( aes( concentration, viability ),
               data=predictValues( fitGDSC ), lwd=1.2,
               linetype="dashed", col="#1b9e77")
  p
}
```
Now let's use these functions to evaluate the regression fits
from the two drug-cell line combinations mentioned before. Ideally, we would like the regression model to be as
close as possible to the individual data points.
```{r}
plotFit( plotResponse( "17-AAG", "H4", FALSE ),
fitCCLE=lrCCLE1, fitGDSC=lrGDSC1 )
plotFit( plotResponse( "Nilotinib", "22RV1", FALSE ),
fitCCLE=lrCCLE2, fitGDSC=lrGDSC2 ) +
xlim(-2, 1.3)
```
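To put a number on how close a fitted curve is to the data points, one option is the root-mean-square error between observed and fitted viabilities. The following is a minimal sketch on synthetic data; `fitRMSE` and the `toyData` values are assumptions introduced for illustration and are not part of the original analysis.

```{r}
# Synthetic dose-response data (made-up values); concentration is on the log10 scale
toyData <- data.frame( concentration=seq(-2, 1, by=0.5),
                       viability=c(99, 95, 90, 60, 30, 6, 3) )
toyFit <- glm( cbind( viability, 100-viability ) ~ concentration,
               data=toyData, family=binomial )

# Hypothetical helper: root-mean-square error in viability percentage points
fitRMSE <- function( fit ){
  observed <- fit$data$viability
  predicted <- 100 * fitted( fit )
  sqrt( mean( ( observed - predicted )^2 ) )
}
fitRMSE( toyFit )
```

For fits produced by *fitLogisticModel*, points with a viability above 100 are modelled as a proportion of 1, so this comparison is only approximate for such points.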
### IC50 and AUC calculation from logistic regression models
The code below computes the
IC50 and AUC statistics for the drug-cell line combinations mentioned above.
Note that these implementations are not based on code from previous
publications.

Using the logistic models fitted before, let's estimate IC50
values by finding the drug concentration at which the
logistic regression model predicts a viability score of 50%.
```{r}
library(magrittr)
# Return the log10 concentration on the prediction grid whose predicted
# viability is closest to 50% (NA if the model did not converge)
getIC50Value <- function( fit ){
  if( !fit$converged ){
    return( NA )
  }
  predictValues( fit, numPred=10000 ) %>%
    { .$concentration[which.min( abs( .$viability - 50) )] }
}
# Back-transform from the log10 scale and compare with the published values
10^getIC50Value( lrCCLE1 )
10^getIC50Value( lrGDSC1 )
filter( summarizedData, drug=="17-AAG", cellLine=="H4")[,c("ic50_CCLE", "ic50_GDSC")]
10^getIC50Value( lrCCLE2 )
10^getIC50Value( lrGDSC2 )
filter( summarizedData, drug=="Nilotinib", cellLine=="22RV1")[,c("ic50_CCLE", "ic50_GDSC")]
```
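For a logistic model, the 50% point can also be written in closed form: the predicted viability is 50% exactly when the linear predictor is zero, so the log10 IC50 is minus the intercept divided by the slope. The sketch below compares this with a grid search on synthetic data. The `toyData` values are made up, and the grid search is written out inline so that the example does not depend on the functions above.

```{r}
# Synthetic example (made-up values); concentration is on the log10 scale
toyData <- data.frame( concentration=seq(-2, 1, by=0.5) )
toyData$viability <- round( 100 * plogis( -1 - 3 * toyData$concentration ) )
toyFit <- glm( cbind( viability, 100-viability ) ~ concentration,
               data=toyData, family=binomial )

# Closed form: intercept + slope * x = 0 at 50% viability
b <- coef( toyFit )
log10IC50 <- unname( -b["(Intercept)"] / b["concentration"] )

# Grid search restricted to the tested concentration range
concGrid <- seq( min( toyData$concentration ), max( toyData$concentration ),
                 length.out=10000 )
pred <- 100 * predict( toyFit, data.frame( concentration=concGrid ), type="response" )
gridIC50 <- concGrid[which.min( abs( pred - 50 ) )]

c( closedForm=10^log10IC50, gridSearch=10^gridIC50 )
```

The two values agree when the fitted curve crosses 50% inside the tested range. When it does not, the grid search returns a concentration at the edge of the range while the closed form extrapolates beyond it, so neither value should be interpreted without looking at the curve.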
Let's now calculate AUC values based on the
logistic regression model.
```{r}
# Average of (1 - predicted viability) over an evenly spaced grid of log10
# concentrations (NA if the model did not converge)
getAUCValue <- function( fit ){
  numbOfPredictions <- 10000
  if( !fit$converged ){
    return( NA )
  }
  ## difference between 1 and the predicted viability probability
  x <- 1 - ( predictValues( fit, numPred=numbOfPredictions )$viability / 100 )
  x <- sum( x ) ## summing all the predicted values
  x / numbOfPredictions ## normalize so that the maximum possible area is 1
}
getAUCValue( lrCCLE1 )
getAUCValue( lrGDSC1 )
filter( summarizedData, drug=="17-AAG", cellLine=="H4")
getAUCValue( lrCCLE2 )
getAUCValue( lrGDSC2 )
filter( summarizedData, drug=="Nilotinib", cellLine=="22RV1")
```
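Averaging the predicted response over an evenly spaced grid is a simple approximation of the area under the response curve divided by the width of the tested range. As a sanity check, the sketch below compares that average with the trapezoidal rule on synthetic data. The `toyData` values are made up, and this is not necessarily how the published AUC values were computed.

```{r}
# Synthetic example (made-up values); concentration is on the log10 scale
toyData <- data.frame( concentration=seq(-2, 1, by=0.5) )
toyData$viability <- round( 100 * plogis( -1 - 3 * toyData$concentration ) )
toyFit <- glm( cbind( viability, 100-viability ) ~ concentration,
               data=toyData, family=binomial )

concGrid <- seq( min( toyData$concentration ), max( toyData$concentration ),
                 length.out=10000 )
response <- 1 - predict( toyFit, data.frame( concentration=concGrid ), type="response" )

# Mean over the grid, as in getAUCValue
aucMean <- mean( response )

# Trapezoidal rule, divided by the width of the tested range
n <- length( concGrid )
aucTrapezoid <- sum( diff( concGrid ) * ( response[-1] + response[-n] ) / 2 ) /
  diff( range( concGrid ) )

c( mean=aucMean, trapezoid=aucTrapezoid )
```

With a dense grid the two numbers are nearly identical. Both depend on the range of concentrations tested, which is worth keeping in mind when comparing AUC values between studies that used different ranges.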
## Estimating regressions, IC50 values and AUC values for all combinations of drugs x cell-lines
The following code fits a logistic regression model for each drug-cell line combination
and estimates both IC50 and AUC values for both the CCLE and the GDSC data.
```{r, cache=TRUE}
# For every drug-cell line pair, fit one model per study and derive IC50 and AUC;
# pairs whose fit fails are recorded as NA
mySummarizedData <- suppressWarnings( lapply( seq_len( nrow( summarizedData )), function(x){
  drug <- as.character( summarizedData$drug[x] )
  cellLine <- as.character( summarizedData$cellLine[x] )
  fitCCLE <- try( fitLogisticModel( drug, cellLine, "CCLE" ), silent=TRUE)
  fitGDSC <- try( fitLogisticModel( drug, cellLine, "GDSC" ), silent=TRUE)
  if( inherits(fitCCLE, "try-error") ){
    ic50CCLE <- NA
    aucCCLE <- NA
  }else{
    ic50CCLE <- 10^getIC50Value( fitCCLE )
    aucCCLE <- getAUCValue( fitCCLE )
  }
  if( inherits(fitGDSC, "try-error") ){
    ic50GDSC <- NA
    aucGDSC <- NA
  }else{
    ic50GDSC <- 10^getIC50Value( fitGDSC )
    aucGDSC <- getAUCValue( fitGDSC )
  }
  data.frame( drug=drug,
              cellLine=cellLine,
              ic50_CCLE=ic50CCLE,
              auc_CCLE=aucCCLE,
              ic50_GDSC=ic50GDSC,
              auc_GDSC=aucGDSC )
} ) )
# Stack the one-row data frames into a single table
mySummarizedData <- do.call( rbind, mySummarizedData )
```
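Let's compare, between the two studies, the scores estimated with the code from this vignette. After the merge, columns ending in `.x` hold the published values and columns ending in `.y` hold our estimates.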
```{r}
allSummarizedData <- merge( x=summarizedData, y=mySummarizedData, by=c("drug", "cellLine"))
ggplot( filter( allSummarizedData, drug=="17-AAG"),
        aes( -log10(ic50_GDSC.y), -log10( ic50_CCLE.y) ) ) +
  geom_point()
ggplot( filter( allSummarizedData, drug=="17-AAG"),
        aes( auc_GDSC.y, auc_CCLE.y ) ) +
  geom_point()
```
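The `.x` and `.y` suffixes come from `merge`, which appends them to columns that share a name in both tables and are not used as keys. The sketch below illustrates this on two small made-up tables (`published` and `estimated` are hypothetical stand-ins) and then summarizes the agreement between studies with a Spearman correlation, one of several possible choices.

```{r}
# Made-up tables that mimic the column layout used above
published <- data.frame( drug="drugA", cellLine=c("c1", "c2", "c3", "c4"),
                         ic50_CCLE=c(0.5, 2, 8, 1), ic50_GDSC=c(0.7, 1.5, 9, 3) )
estimated <- data.frame( drug="drugA", cellLine=c("c1", "c2", "c3", "c4"),
                         ic50_CCLE=c(0.4, 2.5, 6, NA), ic50_GDSC=c(0.6, 1.0, 7, 2.5) )

both <- merge( x=published, y=estimated, by=c("drug", "cellLine") )
names( both )  # key columns first, then the .x (published) and .y (estimated) columns

# Rank correlation between studies for the re-estimated values;
# pairs with a missing estimate are dropped
rho <- cor( both$ic50_GDSC.y, both$ic50_CCLE.y,
            method="spearman", use="complete.obs" )
rho
```

Missing estimates arise whenever a fit fails or does not converge, so it is worth checking how many pairs remain before interpreting the correlation.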
## Modeling drug response using linear models
The function defined below fits a linear regression instead of the
logistic regression used in *fitLogisticModel*.
```{r}
fitLinearModel <- function(drugA, cellLineA, studyA){
  pharSub <- filter( pharmacoData, drug==drugA, cellLine==cellLineA, study==studyA)
  pharSub$concentration <- log10( pharSub$concentration )
  fit <- lm( viability ~ concentration, data=pharSub )
  fit
}
```
Below you will find an example of how to use the *fitLinearModel* function and how to extract
the slope of the linear regression.
```{r}
linearModelCCLE1 <- fitLinearModel( "17-AAG", "H4", "CCLE" )
slope1 <- coefficients( linearModelCCLE1 )["concentration"]
linearModelGDSC1 <- fitLinearModel( "17-AAG", "H4", "GDSC" )
slope2 <- coefficients( linearModelGDSC1 )["concentration"]
linearModelCCLE2 <- fitLinearModel( "Nilotinib", "22RV1", "CCLE" )
coefficients( linearModelCCLE2 )["concentration"]
linearModelGDSC2 <- fitLinearModel( "Nilotinib", "22RV1", "GDSC" )
coefficients( linearModelGDSC2 )["concentration"]
```
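Because the concentration is on the log10 scale, the slope can be read as the change in viability, in percentage points, for each 10-fold increase in concentration. The following minimal sketch illustrates this on synthetic data; the `toyData` values are made up and chosen so that the underlying trend is a drop of 20 points per 10-fold increase.

```{r}
# Synthetic example (made-up values); concentration is on the log10 scale
toyData <- data.frame( concentration=seq(-2, 1, by=0.5) )
toyData$viability <- 50 - 20 * toyData$concentration + c(1, -1, 1, -1, 1, -1, 1)

toyLinear <- lm( viability ~ concentration, data=toyData )
toySlope <- coefficients( toyLinear )["concentration"]
toySlope

# Predicted change in viability when the concentration increases 10-fold
diff( predict( toyLinear, data.frame( concentration=c(0, 1) ) ) )
```

Unlike the logistic model, the linear model places no bounds on the predicted viability, so predictions outside the tested range can fall below 0 or rise above 100.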