-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathindex.html
115 lines (98 loc) · 12.5 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="">
<meta name="author" content="">
<title>Annotated ETL Code Examples with Make</title>
<link href="http://maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet">
<link href="static/custom.css" rel="stylesheet">
</head>
<body>
<div class="container-fluid">
<div class="row">
<div class="col-sm-10">
<h1 class="heavy">Annotated ETL Code Examples with Make</h1>
<h3 class="heavy">Intro</h2>
<p>When we make data at DataMade, we use GNU make to achieve a reproducible data transformation workflow</p>
<p>For more background on make, see our <a href="https://github.com/datamade/data-making-guidelines/blob/master/make.md" target="blank">overview of make & makefiles</a></p>
<p>For ETL best practices, see our <a href="https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md" target="blank">DataMade ETL styleguide</a></p>
<p><i class="fa fa-hand-o-up"></i> <i>hover (or click if you're on a touchscreen) on highlighted text for annotations</i></p>
<h3 class="heavy">Example Make Rules</h2>
<h4><i>General Structure of a Rule</i></h4>
<div class="codeblock pseudocode">
<span class="more" data-toggle="popover" data-placement="top" data-content="the target is what you want to generate. make expects all targets to be files, with the exception of phony target. a file target can be an output filename, an output file pattern, or a variable.">target</span>: <span class="more" data-toggle="popover" data-placement="top" data-content="dependencies are everything that needs to exist in order to make the target. make expects all dependencies to be files. dependencies can be filenames, filename patterns, or variables. dependencies are optional.">dependencies</span>
<br>
<span class="more" data-toggle="popover" data-placement="top" data-content="recipes are commands for generating the target file. any command you can run on the terminal is fair game for recipes - bash commands, invoking a script, etc.">recipe</span>
</div>
<h4>1. Setup phony targets</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="by default, make assumes that targets are files. however, sometimes it is useful to run commands that do not represent physical files - for example, making all targets or cleaning your directory. this specifies that the following targets are not file targets">.PHONY</span>: <span class="more" data-toggle="popover" data-placement="top" data-content="we use this phony target to make all finished files">all</span> <span class="more" data-toggle="popover" data-placement="top" data-content="we use this phony target to restore the directory to its state before running any make commands">clean</span>
<br><br>
<span class="more" data-toggle="popover" data-placement="top" data-content="this rule allows you to generate all of your finished files with one command, 'make all'">all</span>: $(<span class="more" data-toggle="popover" data-placement="top" data-content="this is a variable that should be defined to include all finished files">GENERATED_FILES</span>)
<br><br>
<span class="more" data-toggle="popover" data-placement="top" data-content="this rule allows you to clean all generated files with one command, 'make clean'">clean</span>:
<br>
rm -Rf <span class="more" data-toggle="popover" data-placement="top" data-content="we put all finished files in the finished/ directory, so that they can be cleaned all at once">finished/*<span class="more" data-toggle="popover" data-placement="top" data-content="***">
</div>
<h4>2. Downloading a zip directory</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="this is the target. there's no dependency here, b/c it's being downloaded from the web">parcels.zip</span>:
<br>
<span class="more" data-toggle="popover" data-placement="top" data-content="wget retrieves a file from a URL">wget</span>
<span class="more" data-toggle="popover" data-placement="top" data-content="this option causes the downloaded file to have a local timestamp, rather than using the timestamp of the file on the server (default wget behavior). this is necessary b/c make looks at the last-modified metadata to determine what needs to be built">--no-use-server-timestamps</span> <span class="more" data-toggle="popover" data-placement="top" data-content="break up code into multiple lines when it gets long, for readability">\</span>
<br>
http://maps.indiana.edu/download/Reference/Land_Parcels_County_IDHS.zip <span class="more" data-toggle="popover" data-placement="top" data-content="a wget option that allows you to rename what you're downloading">-O</span> <span class="more" data-toggle="popover" data-placement="top" data-content="this is an automatic variable referring to the filename of the target, in this case parcels.zip">$@</span>
</div>
<h4>3. Unzipping a zip directory</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="this states that chicomm.shp is an intermediate target, meaning that it will be deleted after all targets listing it as a dependency have been built">.INTERMEDIATE</span>:
chicomm.shp<br>
<span class="more" data-toggle="popover" data-placement="top" data-content="the target">chicomm.shp</span>:
<span class="more" data-toggle="popover" data-placement="top" data-content="the zipped dependency">chicomm.zip</span>
<br>
unzip <span class="more" data-toggle="popover" data-placement="bottom" data-content="this option overwrites existing files">-o</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="this is an automatic variable referring to the name of the first dependency - in this case, chicomm.zip">$<</span>
</div>
<h4>4. Converting excel to csv</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="this states that parcel_survey.csv is an intermediate target, meaning that it will be deleted after all targets listing it as a dependency have been built">.INTERMEDIATE</span>:
parcel_survey.csv<br>
<span class="more" data-toggle="popover" data-placement="top" data-content="the csv target">parcel_survey.csv</span>:
<span class="more" data-toggle="popover" data-placement="top" data-content="the excel dependency">parcel_survey.xlsx</span>
<br>
<span class="more" data-toggle="popover" data-placement="bottom" data-content="a csvkit command for converting to csv">in2csv</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="an automatic variable referring to the filename of the first dependency, in this case parcel_survey.xlsx">$<</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="this redirects output to a file instead of stdout (the console)">></span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="an automatic variable referring to the filename of the target, in this case parcel_survey.csv">$@</span>
</div>
<h4>5. Grabbing select columns from an excel doc, & creating a csv with a new header</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="the csv target">school_id_lookup.csv</span>:
<span class="more" data-toggle="popover" data-placement="top" data-content="the excel dependency">School_data_8-3-14.xlsx</span>
<br>
<span class="more" data-toggle="popover" data-placement="bottom" data-content="a csvkit command for converting to csv">in2csv</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="an automatic variable referring to the filename of the first dependency, in this case School_data_8-3-14.xlsx">$<</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="chaining processes: stdout of what's on the left = stdin of what's on the right">|</span>\
<br>
<span class="more" data-toggle="popover" data-placement="bottom" data-content="a csvkit command for taking a slice of a csv">csvcut</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="the -c option speficies which columns to grab - this grabs the first and second column in the csv">-c "1,2"</span> |\
<br>
(<span class="more" data-toggle="popover" data-placement="bottom" data-content="create a csv header row with two columns">echo "school_id,school_name"</span>; <span class="more" data-toggle="popover" data-placement="bottom" data-content="prints output from the second line onwards - everything in a csv except for the header row">tail +2</span>) <span class="more" data-toggle="popover" data-placement="bottom" data-content="this redirects output to a file instead of stdout (the console)">></span> finished/<span class="more" data-toggle="popover" data-placement="bottom" data-content="notdir is a function - given a filepath, it returns only the file name, without the directory path">$(notdir $@)</span>
</div>
<h4>6. Join csvs, using an implicit rule</h4>
<div class="codeblock">
<span class="more" data-toggle="popover" data-placement="top" data-content="the same stem as the dependency">%</span>hourly.joined.csv: <span class="more" data-toggle="popover" data-placement="top" data-content="% will match any non-empty substring (the match is called the stem), making this an implicit rule">%</span>hourly.csv stations.csv
<br>
<span class="more" data-toggle="popover" data-placement="bottom" data-content="a csvkit command for joining csvs">csvjoin</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="the -c option speficies which columns to join - this joins on the 3rd column in the first csv = the 4th column in the second csv">-c "3,4"</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="the first csv to be joined">$<</span> <span class="more" data-toggle="popover" data-placement="bottom" data-content="the second csv to be joined">stations.csv</span>
<span class="more" data-toggle="popover" data-placement="bottom" data-content="this redirects output to a file instead of stdout (the console)">></span> finished/<span class="more" data-toggle="popover" data-placement="bottom" data-content="notdir is a function - given a filepath, it returns only the file name, without the directory path">$(notdir $@)</span>
</div>
<br><br><br>
</div>
</div>
<div>
</body>
<script src="https://code.jquery.com/jquery-1.11.2.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js"></script>
<script>
$(function () {
$('[data-toggle="popover"]').popover({
trigger: 'hover'
});
})
</script>
</html>