---
title: "Tutorial 1: Exploratory analysis of pharmacogenomic data"
output: html_document
---
```{r echo=FALSE}
knitr::opts_chunk$set(cache=FALSE)
```
## Introduction
Probably the most important step in analyzing a dataset is to understand
the data. This understanding is crucial for knowing what kinds of questions we can
answer with it. This tutorial contains code that will help guide
you through this process. Make sure you understand the experimental design of the
two studies well, and try to link each variable to this experimental design. Also, make sure you
understand what each *R* command is doing. Feel free to hack the code!
If you have any questions about the code, ask one of the mentors.
Also remember that [Google](https://www.google.com) is one of the most important tools for
data science.
Let's start by loading the data into the current working session.
```{r readRaw}
rawFile <- "rawPharmacoData.csv"
pharmacoData <- read.csv(rawFile)
```
What kind of variables are in the data? Are these variables numerical and/or categorical? What does each column represent?
```{r quest2}
head( pharmacoData )
str( pharmacoData )
```
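One compact way to approach the question about variable types is to list the class of every column. The sketch below runs on a small made-up data frame (`toyData`, a hypothetical stand-in with invented values) that only mimics the column names used in this tutorial; replacing `toyData` with `pharmacoData` should give the answer for the real data.

```{r}
# Toy stand-in for pharmacoData; all values are invented for illustration
toyData <- data.frame(
    study = c("studyA", "studyA", "studyB", "studyB"),
    drug = c("drug1", "drug2", "drug1", "drug2"),
    doseID = c("dose1", "dose2", "dose1", "dose2"),
    concentration = c(0.01, 0.1, 0.02, 0.2),
    viability = c(95.2, 60.1, 101.3, 45.8)
)

# Class of every column: "character" or "factor" columns are categorical,
# "numeric" or "integer" columns are numerical
columnClasses <- sapply( toyData, class )
columnClasses

# Names of the numerical columns only
numericColumns <- names( toyData )[ sapply( toyData, is.numeric ) ]
numericColumns
```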
How many drugs are contained in these data?
```{r quest3}
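# unique() works whether `drug` was read in as character or as factor;
# levels() returns NULL for character columns (the read.csv default since R 4.0)
length( unique( pharmacoData$drug ) )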
```
How many drug concentrations were used in each study?
```{r quest4}
tapply( pharmacoData$doseID, pharmacoData$study, function(x){
    length( unique( x ) )
})
```
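If the `tapply` call above looks opaque, it may help to run the same pattern on data small enough to check by hand. This is a minimal sketch on an invented data frame (`toyDoses` and its values are hypothetical): `tapply` splits the first argument by the groups in the second, then applies the function to each piece.

```{r}
# Invented example: studyA uses 2 distinct dose IDs, studyB uses 3
toyDoses <- data.frame(
    study = c("studyA", "studyA", "studyA", "studyB", "studyB", "studyB", "studyB"),
    doseID = c("d1", "d2", "d2", "d1", "d2", "d3", "d3")
)

# Split doseID by study, then count the distinct values within each study
dosesPerStudy <- tapply( toyDoses$doseID, toyDoses$study, function(x){
    length( unique( x ) )
})
dosesPerStudy
```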
One of the first things data scientists do when
digging into new data is to explore the distributions of its variables.
Histograms visualize these distributions and can also point us towards suitable statistical
models. The code below
transforms the drug concentrations to a log2 scale and plots a histogram separately for each study. Based on these plots, which study would you say
has the most consistent experimental protocol?
```{r quest6, warning=FALSE}
library(ggplot2)
library(cowplot)
ggplot( pharmacoData, aes( log2(concentration) ) ) +
    geom_histogram(fill = "white", colour="black") +
    facet_wrap(~study)
```
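Why use a log scale for the histogram above? Drug doses are often chosen as dilution series, where each dose is a fixed multiple of the previous one. The sketch below uses a hypothetical two-fold series (the values in `toyConcentration` are invented and not taken from either study) to show that such doses are unevenly spaced on the raw scale but evenly spaced after `log2`.

```{r}
# Hypothetical two-fold dilution series (invented values, arbitrary units)
toyConcentration <- 0.01 * 2^(0:7)

# On the raw scale, the gap between consecutive doses grows with the dose
diff( toyConcentration )

# On the log2 scale, every step has the same size: one unit per doubling
log2Steps <- diff( log2( toyConcentration ) )
log2Steps
```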
Viability scores are the percentage
of cells that survive exposure to a given drug, so we would expect them to lie between 0 and 100.
Below, we explore the range of the data and count how
many data points fall below 0 or above 100.
```{r}
range( pharmacoData$viability )
sum( pharmacoData$viability < 0 )
sum( pharmacoData$viability > 100 )
```
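The counts above cover both studies together. To see whether out-of-range values are concentrated in one study, one option is to compute the fraction of such values per study. This is a sketch on invented numbers (`toyViability` is a hypothetical stand-in); the same two lines should work with `pharmacoData$viability` and `pharmacoData$study`.

```{r}
# Invented viability scores for two hypothetical studies
toyViability <- data.frame(
    study = rep( c("studyA", "studyB"), each = 4 ),
    viability = c(-2.5, 50, 99, 120, 10, 80, 100, 100.5)
)

# TRUE for every score outside the expected 0-100 range
outOfRange <- toyViability$viability < 0 | toyViability$viability > 100

# The mean of a logical vector is the proportion of TRUE values
fractionOutOfRange <- tapply( outOfRange, toyViability$study, mean )
fractionOutOfRange
```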
We can also compare the distribution of viability scores between
the two studies using density plots. Based on the distribution of
the viability scores, would you say there are obvious differences
between the two studies?
```{r}
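# xlim() drops scores outside 0-170 before the densities are estimated
# (ggplot2 warns about the removed rows)
ggplot( pharmacoData, aes( viability, group=study, colour=study) ) +
    geom_density(fill="white", lwd=2, alpha=0.1) + xlim(0, 170)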
```
The code below plots the viability scores as boxplots for each drug, stratified by the two studies. What can you tell about the toxicity of the different
drugs? Are these properties consistent across studies?
```{r}
# ylim() drops scores outside 0-200 before the boxplot statistics are computed
ggplot( pharmacoData, aes( y=viability, x=drug, fill=study) ) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=.5)) +
    ylim(0, 200)
```
```
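A numerical summary can complement the boxplots when comparing drugs across studies. The sketch below computes the median viability per drug and study with `aggregate`; the data frame `toyDrugs` and all its values are invented for illustration, and the same formula should carry over to `pharmacoData`.

```{r}
# Invented scores: two hypothetical drugs measured in two hypothetical studies
toyDrugs <- data.frame(
    study = rep( c("studyA", "studyB"), each = 4 ),
    drug = rep( c("drug1", "drug1", "drug2", "drug2"), times = 2 ),
    viability = c(90, 100, 20, 40, 80, 100, 30, 50)
)

# One row per drug/study combination, holding the median viability
medianViability <- aggregate( viability ~ drug + study, data = toyDrugs, FUN = median )
medianViability
```

A low median suggests that a drug kills most cells at the tested doses; similar medians in both studies for the same drug would point towards consistent measurements.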