forked from resbaz/r-novice-gapminder
-
Notifications
You must be signed in to change notification settings - Fork 2
/
12-plyr.html
299 lines (294 loc) · 17.7 KB
/
12-plyr.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: R for reproducible scientific analysis</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-responsive.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="stylesheet" type="text/css" href="css/swc-workshop-and-lesson.css" />
<link rel="stylesheet" type="text/css" href="css/lesson.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container container-full-width card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<div class="row-fluid">
<div class="span10 offset1">
<h1 class="title">R for reproducible scientific analysis</h1>
<h2 class="subtitle">Split-apply-combine</h2>
<div id="learning-objectives" class="objectives">
<h2>Learning Objectives</h2>
<ul>
<li>To be able to use the split-apply-combine strategy for data analysis</li>
</ul>
</div>
<p>Previously we looked at how you can use functions to simplify your code. We defined the <code>calcGDP</code> function, which takes the gapminder dataset, and multiplies the population and GDP per capita column. We also defined additional arguments so we could filter by <code>year</code> and <code>country</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Takes a dataset and multiplies the population column</span>
<span class="co"># with the GDP per capita column.</span>
calcGDP <-<span class="st"> </span>function(dat, <span class="dt">year=</span><span class="ot">NULL</span>, <span class="dt">country=</span><span class="ot">NULL</span>) {
if(!<span class="kw">is.null</span>(year)) {
dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
}
if (!<span class="kw">is.null</span>(country)) {
dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country,]
}
gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
new <-<span class="st"> </span><span class="kw">cbind</span>(dat, <span class="dt">gdp=</span>gdp)
<span class="kw">return</span>(new)
}</code></pre>
<p>A common task you'll encounter when working with data, is that you'll want to run calculations on different groups within the data. In the above, we were simply calculating the GDP by multiplying two columns together. But what if we wanted to calculated the mean GDP per continent?</p>
<p>We could run <code>calcGPD</code> and then take the mean of each continent:</p>
<pre class="sourceCode r"><code class="sourceCode r">withGDP <-<span class="st"> </span><span class="kw">calcGDP</span>(gapminder)
<span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Africa"</span>, <span class="st">"gdp"</span>]) </code></pre>
<pre class="output"><code>[1] 20904782844</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Americas"</span>, <span class="st">"gdp"</span>]) </code></pre>
<pre class="output"><code>[1] 379262350210</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">mean</span>(withGDP[withGDP$continent ==<span class="st"> "Asia"</span>, <span class="st">"gdp"</span>]) </code></pre>
<pre class="output"><code>[1] 227233738153</code></pre>
<p>But this isn't very <em>nice</em>. Yes, by using a function, you have reduced a substantial amount of repetition. That <strong>is</strong> nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.</p>
<p>We could write a new function that is flexible like <code>calcGDP</code>, but this also takes a substantial amount of effort and testing to get right.</p>
<p>The abstract problem we're encountering here is know as "split-apply-combine":</p>
<div class="figure">
<img src="img/splitapply.png" alt="Split apply combine" /><p class="caption">Split apply combine</p>
</div>
<p>We want to <em>split</em> our data into groups, in this case continents, <em>apply</em> some calculations on that group, then optionally <em>combine</em> the results together afterwards.</p>
<h4 id="the-plyr-package">The <code>plyr</code> package</h4>
<p>For those of you who have used R before, you might be familiar with the <code>apply</code> family of functions. While R's built in functions do work, we're going to introduce you to another method for solving the "split-apply-combine" probelm. The <a href="http://had.co.nz/plyr/">plyr</a> package provides a set of functions that we find more user friendly for solving this problem.</p>
<p>We installed this package in an earlier challenge. Let's load it now:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(plyr)</code></pre>
<p>Plyr has functions for operating on <code>lists</code>, <code>data.frames</code> and <code>arrays</code> (matrices, or n-dimensional vectors). Each function performs:</p>
<ol style="list-style-type: decimal">
<li>A <strong>split</strong>ting operation</li>
<li><strong>Apply</strong> a function on each split in turn.</li>
<li>Re<strong>combine</strong> output data as a single data object.</li>
</ol>
<p>The functions are named based on the data structure they expect as input, and the data structure you want returned as output: [a]rray, [l]ist, or [d]ata.frame. The first letter corresponds to the input data structure, the second letter to the output data structure, and then the rest of the function is named "ply".</p>
<p>This gives us 9 core functions **ply. There are an additional three functions which will only perform the split and apply steps, and not any combine step. They're named by their input data type and represent null output by a <code>_</code> (see table)</p>
<p>Note here that plyr's use of "array" is different to R's, an array in ply can include a vector or matrix.</p>
<div class="figure">
<img src="img/full_apply_suite.png" alt="Full apply suite" /><p class="caption">Full apply suite</p>
</div>
<p>Each of the xxply functions (<code>daply</code>, <code>ddply</code>, <code>llply</code>, <code>laply</code>, ...) has the same structure and has 4 key features and structure:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">xxply</span>(.data, .variables, .fun)</code></pre>
<ul>
<li>The first letter of the function name gives the input type and the second gives the output type.</li>
<li>.data - gives the data object to be processed</li>
<li>.variables - identifies the splitting variables</li>
<li>.fun - gives the function to be called on each piece</li>
</ul>
<p>Now we can quickly calculate the mean GDP per continent:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre>
<pre class="output"><code> continent V1
1 Africa 20904782844
2 Americas 379262350210
3 Asia 227233738153
4 Europe 269442085301
5 Oceania 188187105354</code></pre>
<p>Let's walk through what just happened:</p>
<ul>
<li>The <code>ddply</code> function feeds in a <code>data.frame</code> (function starts with <strong>d</strong>) and returns another <code>data.frame</code> (2nd letter is a <strong>d</strong>) i</li>
<li>the first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called <code>calcGDP</code> on it first so that it would have the additional <code>gdp</code> column added to it.</li>
<li>The second argument indicated our split criteria: in this case the "continent" column. Note that we just gave the name of the column, not the actual column itself like we've done previously with subsetting. Plyr takes care of these implementation details for you.</li>
<li>The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in <code>x</code>, the first argument of our function. This is an anonymous function: we haven't defined it elsewhere, and it has no name. It only exists in the scope of our call to <code>ddply</code>.</li>
</ul>
<p>What if we want a different type of output data structure?:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dlply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre>
<pre class="output"><code>$Africa
[1] 20904782844
$Americas
[1] 379262350210
$Asia
[1] 227233738153
$Europe
[1] 269442085301
$Oceania
[1] 188187105354</code></pre>
<p>We called the same function again, but changed the second letter to an <code>l</code>, so the output was returned as a list.</p>
<p>We can specify multiple columns to group by:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="kw">c</span>(<span class="st">"continent"</span>, <span class="st">"year"</span>),
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre>
<pre class="output"><code> continent year V1
1 Africa 1952 5992294608
2 Africa 1957 7359188796
3 Africa 1962 8784876958
4 Africa 1967 11443994101
5 Africa 1972 15072241974
6 Africa 1977 18694898732
7 Africa 1982 22040401045
8 Africa 1987 24107264108
9 Africa 1992 26256977719
10 Africa 1997 30023173824
11 Africa 2002 35303511424
12 Africa 2007 45778570846
13 Americas 1952 117738997171
14 Americas 1957 140817061264
15 Americas 1962 169153069442
16 Americas 1967 217867530844
17 Americas 1972 268159178814
18 Americas 1977 324085389022
19 Americas 1982 363314008350
20 Americas 1987 439447790357
21 Americas 1992 489899820623
22 Americas 1997 582693307146
23 Americas 2002 661248623419
24 Americas 2007 776723426068
25 Asia 1952 34095762661
26 Asia 1957 47267432088
27 Asia 1962 60136869012
28 Asia 1967 84648519224
29 Asia 1972 124385747313
30 Asia 1977 159802590186
31 Asia 1982 194429049919
32 Asia 1987 241784763369
33 Asia 1992 307100497486
34 Asia 1997 387597655323
35 Asia 2002 458042336179
36 Asia 2007 627513635079
37 Europe 1952 84971341466
38 Europe 1957 109989505140
39 Europe 1962 138984693095
40 Europe 1967 173366641137
41 Europe 1972 218691462733
42 Europe 1977 255367522034
43 Europe 1982 279484077072
44 Europe 1987 316507473546
45 Europe 1992 342703247405
46 Europe 1997 383606933833
47 Europe 2002 436448815097
48 Europe 2007 493183311052
49 Oceania 1952 54157223944
50 Oceania 1957 66826828013
51 Oceania 1962 82336453245
52 Oceania 1967 105958863585
53 Oceania 1972 134112109227
54 Oceania 1977 154707711162
55 Oceania 1982 176177151380
56 Oceania 1987 209451563998
57 Oceania 1992 236319179826
58 Oceania 1997 289304255183
59 Oceania 2002 345236880176
60 Oceania 2007 403657044512</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">daply</span>(
<span class="dt">.data =</span> <span class="kw">calcGDP</span>(gapminder),
<span class="dt">.variables =</span> <span class="kw">c</span>(<span class="st">"continent"</span>, <span class="st">"year"</span>),
<span class="dt">.fun =</span> function(x) <span class="kw">mean</span>(x$gdp)
)</code></pre>
<pre class="ouput"><code> year
continent 1952 1957 1962 1967 1972
Africa 5992294608 7359188796 8784876958 11443994101 15072241974
Americas 117738997171 140817061264 169153069442 217867530844 268159178814
Asia 34095762661 47267432088 60136869012 84648519224 124385747313
Europe 84971341466 109989505140 138984693095 173366641137 218691462733
Oceania 54157223944 66826828013 82336453245 105958863585 134112109227
year
continent 1977 1982 1987 1992 1997
Africa 18694898732 22040401045 24107264108 26256977719 30023173824
Americas 324085389022 363314008350 439447790357 489899820623 582693307146
Asia 159802590186 194429049919 241784763369 307100497486 387597655323
Europe 255367522034 279484077072 316507473546 342703247405 383606933833
Oceania 154707711162 176177151380 209451563998 236319179826 289304255183
year
continent 2002 2007
Africa 35303511424 45778570846
Americas 661248623419 776723426068
Asia 458042336179 627513635079
Europe 436448815097 493183311052
Oceania 345236880176 403657044512</code></pre>
<p>You can use these functions in place of <code>for</code> loops (and its usually faster to do so): just write the body of the for loop in the anonymous function:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">d_ply</span>(
<span class="dt">.data=</span>gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(x) {
meanGDPperCap <-<span class="st"> </span><span class="kw">mean</span>(x$gdpPercap)
<span class="kw">print</span>(<span class="kw">paste</span>(
<span class="st">"The mean GDP per capita for"</span>, <span class="kw">unique</span>(x$continent),
<span class="st">"is"</span>, <span class="kw">format</span>(meanGDPperCap, <span class="dt">big.mark=</span><span class="st">","</span>)
))
}
)</code></pre>
<div id="tip-printing-numbers" class="callout">
<h4>Tip: printing numbers</h4>
<p>The <code>format</code> function can be used to make numeric values "pretty" for printing out in messages.</p>
</div>
<div id="challenge-1" class="challenge">
<h4>Challenge 1</h4>
<p>Calculate the average life expectancy per continent. Which has the longest? Which had the shortest?</p>
</div>
<div id="challenge-2" class="challenge">
<h4>Challenge 2</h4>
<p>Calculate the average life expectancy per continent and year. Which had the longest and shortest in 2007? Which had the greatest change in between 1952 and 2007?</p>
</div>
<div id="advanced-challenge" class="challenge">
<h4>Advanced Challenge</h4>
<p>Calculate the difference in mean life expectancy between the years 1952 and 2007 from the output of challenge 2 using one of the <code>plyr</code> functions.</p>
</div>
<div id="alternate-challenge-if-class-seems-lost" class="challenge">
<h4>Alternate Challenge if class seems lost</h4>
<p>Without running them, which of the following will calculate the average life expectancy per continent:</p>
<ol style="list-style-type: decimal">
<li><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> gapminder$continent,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></li>
<li><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> <span class="kw">mean</span>(dataGroup$lifeExp)
)</code></pre></li>
<li><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ddply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></li>
<li><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">adply</span>(
<span class="dt">.data =</span> gapminder,
<span class="dt">.variables =</span> <span class="st">"continent"</span>,
<span class="dt">.fun =</span> function(dataGroup) {
<span class="kw">mean</span>(dataGroup$lifeExp)
}
)</code></pre></li>
</ol>
</div>
</div>
</div>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="http://software-carpentry.org/v5/js/bootstrap/bootstrap.min.js"></script>
</body>
</html>