<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Mediative</title>
<link>https://mediative.github.io/index.xml</link>
<description>Recent content on Mediative</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Sun, 10 Jul 2016 20:47:14 -0400</lastBuildDate>
<atom:link href="https://mediative.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Simple Spark ml pipeline</title>
<link>https://mediative.github.io/post/2016/07/simple-spark-ml-pipeline/</link>
<pubDate>Sun, 10 Jul 2016 20:47:14 -0400</pubDate>
<guid>https://mediative.github.io/post/2016/07/simple-spark-ml-pipeline/</guid>
<description>
<p>Mediative recently hosted an <a href="http://www.meetup.com/Montreal-Apache-Spark-Meetup/events/231285569/">Apache Spark Montreal Meetup</a>&rsquo;s project night where some of us decided to create a simple ML pipeline.
To avoid installing Spark, we used the <a href="https://databricks.com/try-databricks">Databricks community edition</a>.
Since the goal was to see if we could make it work,
we wanted to use data that we knew was correlated.
But to make the project a little more fun,
we decided to explore something other than the usual <a href="https://en.wikipedia.org/wiki/Data_set#Classic_data_sets">data sets</a>,
so we went for the Dow Jones and the Nasdaq.
Among the wealth of <code>R</code> packages, one can find the <a href="https://cran.r-project.org/web/packages/quantmod/quantmod.pdf">quantmod package</a> for accessing financial data.</p>
<h2 id="getting-data">Getting data</h2>
<p>The notebook was created for the <code>Scala</code> language, so we used the <code>%r</code> magic to install and use the <code>R</code> package to access the data. While we were at it, we merged both data sets right away:</p>
<pre><code class="language-r">%r
install.packages(&quot;quantmod&quot;)
library(&quot;quantmod&quot;)
## NASDAQ
nsd&lt;-as.data.frame(getSymbols(Symbols = &quot;^NDX&quot;,
src = &quot;yahoo&quot;, from = &quot;2015-01-01&quot;,to = &quot;2016-01-01&quot;, env = NULL))
## Dow Jones
dji&lt;-as.data.frame(getSymbols(Symbols = &quot;^DJI&quot;,
src = &quot;yahoo&quot;, from = &quot;2015-01-01&quot;,to = &quot;2016-01-01&quot;, env = NULL))
## Add the date as a column (above, the dates are row names and would otherwise be lost when the Spark DataFrame is created)
nsd$date&lt;-rownames(nsd)
dji$date&lt;-rownames(dji)
## Merge the tables together on date
mrgIndx&lt;-merge(nsd, dji)
dfIndx &lt;- createDataFrame(sqlContext, mrgIndx)
registerTempTable(dfIndx, &quot;testIndx&quot;)
</code></pre>
<p>The last line registers the DataFrame as a (temporary) table to make it available outside of the <code>R</code> scope.</p>
<p>The following (default) Scala cell creates a DataFrame back from this table.</p>
<pre><code class="language-scala">val df = sqlContext.sql(&quot;SELECT * FROM testIndx&quot;)
</code></pre>
<h2 id="looking-at-the-data">Looking at the data</h2>
<p>Of all the fields, we will only consider the <code>date</code> and the adjusted Nasdaq (<code>NDX_Adjusted</code>) and Dow Jones (<code>DJI_Adjusted</code>) values. Why adjusted? No reason, so why not! Let&rsquo;s see if they are correlated, so that we can hope to predict one with the other:</p>
<pre><code class="language-scala">import org.apache.spark.sql.functions.lit
display(df.withColumn(&quot;NDXTimes5&quot;, $&quot;NDX_Adjusted&quot;.cast(DoubleType).multiply(lit(5))))
</code></pre>
<figure >
<img src="https://mediative.github.io/images/simple-ml-pipeline/djiNndx5.png" />
</figure>
<p>No fancy statistical tools are needed to see that these two curves are correlated. The Nasdaq value has been scaled up by five (using the imported <code>lit</code> function) to make the comparison more obvious, but this scaling will not be used in the training. Hopefully, even a basic model can take care of that.</p>
<h2 id="preparing-the-data-and-the-model">Preparing the data and the model</h2>
<p>Let&rsquo;s keep only the fields that we will need:</p>
<pre><code class="language-scala">val data = df.withColumn(&quot;NDX&quot;, $&quot;NDX_Adjusted&quot;)
.withColumn(&quot;DJI&quot;, $&quot;DJI_Adjusted&quot;)
.select(&quot;NDX&quot;, &quot;DJI&quot;)
</code></pre>
<p>And let&rsquo;s set aside a random subsample for testing purposes; the rest will be used for training the model.</p>
<pre><code class="language-scala">val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)
</code></pre>
<p>We use the <code>VectorAssembler</code> to create the feature vector used by the model. We want to predict the Dow Jones with the Nasdaq (<code>NDX</code>), so the latter will be our feature, which we will wisely call <code>features</code>.</p>
<pre><code class="language-scala">import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(&quot;NDX&quot;))
.setOutputCol(&quot;features&quot;)
</code></pre>
<p>We will also need a model to learn with. For such a simple task, let&rsquo;s use a simple linear regression, where we define the Dow Jones (<code>DJI</code>) as the target we want to learn (what <code>ml</code> calls the <code>label</code>).</p>
<pre><code class="language-scala">import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
.setLabelCol(&quot;DJI&quot;)
.setFeaturesCol(&quot;features&quot;)
</code></pre>
<h2 id="set-up-the-pipeline">Set up the pipeline</h2>
<p>Now that we have all the elements, we can easily assemble them with the <code>pipeline</code> functionality.</p>
<pre><code class="language-scala">import org.apache.spark.ml.Pipeline
val steps: Array[org.apache.spark.ml.PipelineStage] = Array(assembler, lr)
val pipeline = new Pipeline().setStages(steps)
</code></pre>
<h2 id="fitting-the-model">Fitting the model</h2>
<p>Preparing the data and training the model is done with a single call to the pipeline:</p>
<pre><code class="language-scala">val myModel = pipeline.fit(training)
</code></pre>
<p>We can now see how well the model works by comparing its predictions with the actual Dow Jones values:</p>
<pre><code class="language-scala">display(myModel.transform(test).select(&quot;prediction&quot;, &quot;DJI&quot;))
</code></pre>
<figure >
<img src="https://mediative.github.io/images/simple-ml-pipeline/predNactual.png" />
</figure>
<p>The model obviously managed to learn the correlation between the Dow Jones and the Nasdaq. Nothing to impress your broker, but it is a basis on which to build better predictions.</p>
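<p>As a hypothetical next step (not part of the original notebook), the fit could be quantified with <code>ml</code>&rsquo;s <code>RegressionEvaluator</code>. Here is a minimal sketch, assuming the same <code>myModel</code> and <code>test</code> as above:</p>
<pre><code class="language-scala">import org.apache.spark.ml.evaluation.RegressionEvaluator

// Hypothetical follow-up: score the pipeline predictions on the held-out test set.
val predictions = myModel.transform(test)
val evaluator = new RegressionEvaluator()
  .setLabelCol(&quot;DJI&quot;)
  .setPredictionCol(&quot;prediction&quot;)
  .setMetricName(&quot;rmse&quot;)
val rmse = evaluator.evaluate(predictions)
println(s&quot;Test RMSE: $rmse&quot;)
</code></pre>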
</description>
</item>
<item>
<title>Sparrow version 0.2.0</title>
<link>https://mediative.github.io/post/2016/02/sparrow-version-0.2.0/</link>
<pubDate>Mon, 29 Feb 2016 21:44:36 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/sparrow-version-0.2.0/</guid>
<description>
<p>Sparrow version <a href="https://github.com/mediative/sparrow/releases/tag/0.2.0">0.2.0</a>
is now available with updated dependency on Spark 1.6.0.</p>
<p>It&rsquo;s both available as a <a href="http://spark-packages.org/package/mediative/sparrow">Spark
package</a> and from the <a href="https://github.com/mediative/sparrow/blob/6589d0f3302520d284461e0aced147d9e14ddb7d/README.md#getting-started">YPG
Data Bintray
repository</a>.</p>
<h2 id="release-notes">Release notes</h2>
<ul>
<li>Bump Spark version to 1.6.0</li>
<li>Test against Scala 2.10.6 on Travis</li>
<li>Bump the Macro Paradise plugin to 2.1.0</li>
</ul>
</description>
</item>
<item>
<title>News from Spark Summit East</title>
<link>https://mediative.github.io/post/2016/02/news-from-spark-summit-east/</link>
<pubDate>Sun, 28 Feb 2016 21:38:37 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/news-from-spark-summit-east/</guid>
<description>
<p>Mediative is building a data pipeline on top of Spark
so I went to <a href="https://spark-summit.org/east-2016/">Spark Summit East</a>
to see what other people are doing and what&rsquo;s coming.
There were many conference tracks, including Enterprise, Developer and Data Science.
I mostly attended Data Science talks; below are the highlights.
Some of this information also came from the <a href="http://www.meetup.com/Spark-NYC/events/228233164/">NYC Spark Meetup</a>,
held on the first evening of the conference.</p>
<h1 id="spark-2-0">Spark 2.0</h1>
<p>Some of the main news about Spark 2.0:</p>
<ul>
<li>Should be available late April to early May</li>
<li>(Almost) no API changes for 2.0</li>
<li>Will unify Datasets and DataFrames (see the sketch after this list)
<ul>
<li><code>DataFrame = Dataset[Row]</code></li>
<li>Libraries will accept both interchangeably</li>
</ul></li>
</ul>
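<p>To give an idea of what that unification means in practice, here is a minimal, hypothetical sketch against the announced Spark 2.0 API, where <code>DataFrame</code> becomes a plain alias for <code>Dataset[Row]</code>:</p>
<pre><code class="language-scala">import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical illustration (Spark 2.0): DataFrame is an alias for Dataset[Row],
// so the same value can be passed wherever either type is expected.
val spark = SparkSession.builder.appName(&quot;unification&quot;).getOrCreate()
val df: DataFrame = spark.range(10).toDF(&quot;n&quot;)
val ds: Dataset[Row] = df // no conversion needed, they are the same type
</code></pre>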
<h2 id="tungsten">Tungsten</h2>
<p>Tungsten is the under-the-hood project that improves memory and CPU efficiency for Spark applications.
Project Tungsten was introduced in Spark 1.4.
See <a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html">this blog</a> for more information. Here is what to expect from the new releases:</p>
<ul>
<li><strong>Phase I</strong> Spark 2.0
<ul>
<li>~5x faster</li>
<li>Improved IO by better pruning of the data to process</li>
<li>Native memory management (fewer Java objects and their costly initialization)</li>
</ul></li>
<li><strong>Phase II</strong> Spark 2.x
<ul>
<li>~10x faster</li>
<li>Spark will work as a compiler: reading the provided code and creating its own optimized version.</li>
</ul></li>
</ul>
<h2 id="spark-streaming">Spark Streaming</h2>
<p>Processing data in real time will become more integrated with batch applications
through:</p>
<ul>
<li>Structured streams
<ul>
<li>Will extend DataFrames/Datasets</li>
<li>More analysis on streaming data</li>
</ul></li>
<li>Support for interactive &amp; batch queries (e.g. aggregating data in a stream, then serving it over JDBC; see the sketch below)</li>
</ul>
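<p>As a rough idea of what this might look like, here is a minimal, hypothetical sketch of a streaming word count using the Spark 2.x structured streaming API (not taken from the talk), written with the same DataFrame/Dataset operations as a batch query:</p>
<pre><code class="language-scala">import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName(&quot;streaming-sketch&quot;).getOrCreate()
import spark.implicits._

// Hypothetical sketch: read lines from a socket, count words, print running totals.
val lines = spark.readStream
  .format(&quot;socket&quot;)
  .option(&quot;host&quot;, &quot;localhost&quot;)
  .option(&quot;port&quot;, &quot;9999&quot;)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(&quot; &quot;))
  .groupBy(&quot;value&quot;)
  .count()

// The aggregated stream keeps a running count, the kind of result one could then
// serve to interactive or batch consumers.
val query = counts.writeStream
  .outputMode(&quot;complete&quot;)
  .format(&quot;console&quot;)
  .start()

query.awaitTermination()
</code></pre>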
<p>(more info on Spark 2.0 <a href="https://spark-summit.org/east-2016/events/keynote-day-2/">here</a>)</p>
<hr />
<h1 id="pipelines">Pipelines</h1>
<p>The summit featured lots of pipeline talks; the two examples below are particularly
interesting for their similarities with our projects at Mediative.</p>
<h2 id="netflix-distributed-time-travel-for-feature-generation">Netflix Distributed Time Travel for Feature Generation</h2>
<p>The goal is to build a time machine that snapshots online services
and uses the snapshot data offline to reconstruct the inputs
that a model would have seen online, in order to generate features.</p>
<p>First, an appropriate sample of contexts is selected
(sampling based on properties such as viewing patterns, devices, time spent on the service, region, etc.)
and persisted into S3 (Parquet), as represented by the <code>Context Set</code> below.
Interestingly, they also store a confidence level for each snapshotted service:
the percentage of data fetched successfully.
The batch API fetches the S3 location of the snapshot data from Cassandra and loads the snapshot data into Spark.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/netflixSnapshotAPI.png" />
</figure>
<p>Here is an example call to their API, returning an RDD:</p>
<pre><code class="language-scala">val snapshot = new SnapshotDataManager(sqlContext))
.withTimestams(1445470140000L)
.withContextId(OUTATIME)
.getViewingHistory
</code></pre>
<p>(more info <a href="https://spark-summit.org/east-2016/events/distributed-time-travel-for-feature-generation/">here</a>)</p>
<hr />
<h2 id="real-time-data-pipelines-with-kafka">Real Time Data Pipelines with Kafka</h2>
<p>If you have <code>n</code> systems to connect, it is very likely that you&rsquo;ll end up writing <code>n*n</code> connections.
Here is a scary example:
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/conplexPipeline.png" />
</figure>
</p>
<p><strong>Kafka connect&rsquo;s two modes</strong></p>
<ul>
<li>Source connectors: from some system to Kafka</li>
<li>Sink connectors: from Kafka to some system
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/kafka2modes.png" />
</figure>
</li>
</ul>
<p>Kafka&rsquo;s buffering makes it possible to stream to (non-streaming) destinations like HDFS</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/kafkaDataIntegration.png" />
</figure>
<p>It is even possible to copy an entire database (suggested partition: by table)</p>
<p>more information <a href="https://spark-summit.org/east-2016/events/building-realtime-data-pipelines-with-kafka-connect-and-spark-streaming/">here</a></p>
<h1 id="machine-learning">Machine Learning</h1>
<p>There were many examples with MLlib and SparkR, and packages like <strong>Sparkling Water</strong> (H2O), an open-source package with tools like customized DataFrames and notebooks.
The incubating <strong>SystemML</strong> (IBM) translates high-level (R or Python) code
into optimized code that adapts to the underlying input formats and physical data representations.</p>
<hr />
<h2 id="tensorspark">TensorSpark</h2>
<p>A distributed TensorFlow on Spark (Arimo, Inc.), motivated by TensorFlow (at the time)
only being released for single machines.
Even with a distributed TensorFlow released, TensorSpark might be more appropriate for integrating with existing Spark infrastructure.</p>
<p>The figure below represents how an instance of TensorFlow runs on each machine, with
the driver acting as the parameter server: receiving gradients from the workers and broadcasting the updated model.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/tensorSparkArchitecture.png" />
</figure>
<p>more information <a href="https://spark-summit.org/east-2016/events/distributed-tensor-flow-on-spark-scaling-googles-deep-learning-library/">here</a></p>
<hr />
<h2 id="online-bidding">Online bidding</h2>
<p>Of particular interest to Mediative was a talk about real-time bidding on display ads with machine learning.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/AdbidPipeline.png" />
</figure>
<p>Their pipeline could train multiple models in parallel and choose the most effective one.
A very nice outcome was that the most effective model varies from campaign to campaign, as shown below.</p>
<p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/AdModelCompare.png" />
</figure>
more information <a href="https://spark-summit.org/east-2016/events/spark-dataxu-multi-model-machine-learning-for-real-time-bidding-over-display-ads/">here</a></p>
<hr />
<h1 id="visualization">Visualization</h1>
<p>Visualization still mostly relies on (non-scalable) libraries, although significant progress
was shown with the integration of ggplot2 with SparkR, where 47% of the API is implemented
(as shown <a href="https://spark-summit.org/east-2016/events/generalized-linear-models-in-spark-mllib-and-sparkr/">here</a>).
There is also the incubating Zoomdata, which shows nice promise.
Meanwhile, it is better to filter your data and use a non-distributed library.</p>
<hr />
<h1 id="others">Others</h1>
<p>A quick mention of other interesting subjects:</p>
<ul>
<li><p><a href="https://spark-summit.org/east-2016/events/magellan-spark-as-a-geospatial-analytics-engine/">Magellan-Spark Geospatial analytics</a></p>
<ul>
<li>Cartesian join: joining polygons and points</li>
<li>Supported formats include GeoJSON, ESRI, OSM-XML</li>
</ul></li>
<li><p><a href="https://spark-summit.org/east-2016/events/beyond-collect-and-parallelize-for-tests/">Beyond Collect and Parallelize for Tests</a></p>
<ul>
<li>Addressing problems of testing at scale</li>
<li>Comparing RDD, DataFrames, DataSets</li>
<li>Getting test (big) data</li>
</ul></li>
</ul>
<hr />
<h1 id="spark-community-edition-beta">Spark community edition (beta)</h1>
<p>Finally, Databricks announced a free edition of their very nice service,
which includes access to 6GB clusters.</p>
<ul>
<li><p>beta edition available in the coming weeks</p>
<ul>
<li><a href="http://go.databricks.com/databricks-community-edition-beta-waitlist">waiting list</a></li>
</ul></li>
<li><p>Includes learning utilities</p></li>
<li><p>See <a href="https://www.youtube.com/watch?v=35Y-rqSMCCA">demo</a></p></li>
</ul>
</description>
</item>
<item>
<title>Running Zeppelin on CDH</title>
<link>https://mediative.github.io/post/2016/02/running-zeppelin-on-cdh/</link>
<pubDate>Fri, 26 Feb 2016 14:46:46 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/running-zeppelin-on-cdh/</guid>
<description>
<h2 id="download-and-build-zeppelin">Download and Build Zeppelin</h2>
<p>Go to the <a href="http://zeppelin.incubator.apache.org/download.html">download page</a>
and get the latest source package.</p>
<p>Untar the source package and create a git repo to make bower happy:</p>
<pre><code>$ tar zxvf zeppelin-0.5.6-incubating.tgz
$ cd zeppelin-0.5.6-incubating
$ git init
</code></pre>
<p>Before building from source, first determine the Hadoop version by running the
following command on the edge node:</p>
<pre><code>$ hadoop version
Hadoop 2.6.0-cdh5.4.8
...
This command was run using /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/hadoop-common-2.6.0-cdh5.4.8.jar
</code></pre>
<p>Build Zeppelin with <a href="http://zeppelin.incubator.apache.org/docs/0.5.6-incubating/install/yarn_install.html">YARN support</a>
enabled using the Maven profile corresponding to the Hadoop version found above:</p>
<pre><code>$ mvn clean package -Pbuild-distr -Pyarn -Pspark-1.5 -Dspark.version=1.5.2 \
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.8 -DskipTests -Pvendor-repo
</code></pre>
<p>Note we are assuming that you are using a custom Spark version as described in
<a href="https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/">our previous post</a>.</p>
<h2 id="installing-zeppelin-on-the-edge-node">Installing Zeppelin on the Edge Node</h2>
<p>Copy the distribution to the edge node:</p>
<pre><code>$ scp zeppelin-distribution/target/zeppelin-x.y.z-incubating.tar.gz edge-node:
</code></pre>
<p>SSH to the edge node, extract the tarball and <code>cd</code> to the Zeppelin installation directory:</p>
<pre><code>$ tar zxvf /path/to/zeppelin-x.y.z-incubating.tar.gz
$ cd zeppelin-x.y.z-incubating/
</code></pre>
<p>Configure Zeppelin by creating and editing <code>conf/zeppelin-env.sh</code>:</p>
<pre><code>$ cp conf/zeppelin-env.sh{.template,}
</code></pre>
<p>It should contain the following variables:</p>
<pre><code class="language-sh">export SPARK_HOME=&quot;$HOME/spark-x.y.z-bin-cdhx.y.z&quot; # Assuming you are using a custom Spark version
export MASTER=yarn-client
export ZEPPELIN_JAVA_OPTS=&quot;-Dspark.yarn.jar=$HOME/spark-x.y.z-bin-cdhx.y.z/lib/spark-assembly-x.y.z-hadoopx.y.z-cdhx.y.z.jar&quot;
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-x.y.z-1.cdhx.y.z.p0.11/lib/hadoop
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n &quot;$HADOOP_HOME&quot; ]; then
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
</code></pre>
<h2 id="manage-the-zeppelin-server">Manage the Zeppelin Server</h2>
<p>To start the server run:</p>
<pre><code> $ bin/zeppelin-daemon.sh start
</code></pre>
<p>To stop it:</p>
<pre><code> $ bin/zeppelin-daemon.sh stop
</code></pre>
</description>
</item>
<item>
<title>Mesos Stack version 0.4.0</title>
<link>https://mediative.github.io/post/2016/02/mesos-stack-version-0.4.0/</link>
<pubDate>Wed, 24 Feb 2016 14:25:03 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/mesos-stack-version-0.4.0/</guid>
<description>
<p>Version <a href="https://github.com/mediative/mesos-stack/releases/tag/0.4.0">0.4.0</a> of
our Mesos stack has been released. It updates Marathon-LB to use an upstream
released version and adds a new GlusterFS role to distribute files across the
Mesos cluster. Also enjoy the new and improved
<a href="https://mediative.github.io/mesos-stack/">documentation</a> which is generated from
the Ansible role files.</p>
<h2 id="release-notes">Release notes</h2>
<p>Improvements:</p>
<ul>
<li>mesos-master, mesos-agent: Use fully qualified host names.</li>
<li>Generate Ansible role documentation from YAML files so they are always up to
date.</li>
<li>marathon-lb: Upgrade to version 1.1.1.</li>
<li>common: Disable IPv6 on all cluster nodes.</li>
<li>New glusterfs role which adds persistent storage across nodes.</li>
</ul>
</description>
</item>
<item>
<title>Installing a Custom Spark Version on CDH</title>
<link>https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/</link>
<pubDate>Sat, 13 Feb 2016 19:54:46 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/</guid>
<description><p>Since Spark can be run as a YARN application, it is possible to run a Spark
version other than the one provided by the Cloudera platform (CDH). This
document lists the instructions for how to compile a specific Spark version
against the Hadoop version supported by CDH. The instructions are based on the
post <a href="https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa">Running Spark 1.5.1 on
CDH</a>.</p>
<ol>
<li><p>Determine the version of CDH and Hadoop by running the following command on
the edge node:</p>
<pre><code>$ hadoop version
Hadoop 2.6.0-cdh5.4.8
...
</code></pre></li>
<li><p><a href="http://spark.apache.org/downloads.html">Download Spark</a> and extract the
sources.</p></li>
<li><p><a href="http://spark.apache.org/docs/latest/building-spark.html">Build Spark</a> by
opening the distribution directory in the shell and running the following
command using the CDH and Hadoop version from step 1:</p>
<pre><code>$ ./make-distribution.sh --tgz --name cdh5.4.8 -Pyarn \
-Phadoop-2.6 -Phadoop-provided -Dhadoop.version=2.6.0-cdh5.4.8 \
-Phive -Phive-thriftserver
</code></pre>
<p>Note that <code>-Phadoop-provided</code> enables the profile to build the assembly
without including the Hadoop-ecosystem dependencies provided by Cloudera. To
compile with Scala 2.11 support, first run:</p>
<pre><code>$ ./dev/change-scala-version.sh 2.11
</code></pre>
<p>and pass <code>-Dscala-2.11</code> to <code>make-distribution.sh</code>.</p></li>
<li><p>Copy the resulting <code>tgz</code> file to the edge node:</p>
<pre><code>$ scp spark-x.x.x-bin-cdh5.4.8.tgz user@edge-node:
</code></pre></li>
<li><p>Connect to the edge node</p></li>
<li><p>Extract the <code>tgz</code> file</p></li>
<li><p><code>cd</code> into the custom Spark distribution and configure the custom Spark
distribution:</p>
<pre><code> $ cp -R /etc/spark/conf/* conf/
# Change SPARK_HOME to point to the folder with the custom Spark distribution
$ sed -i &quot;s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#&quot; conf/spark-env.sh
# Tell YARN which Spark JAR to use
$ echo &quot;spark.yarn.jar=$(pwd)/$(ls lib/spark-assembly-*.jar)&quot; &gt;&gt; conf/spark-defaults.conf
$ cp /etc/hive/conf/hive-site.xml conf/
</code></pre></li>
<li><p>Test the custom Spark distribution:</p>
<pre><code> $ ./bin/run-example SparkPi 10 --master yarn-client
$ ./bin/spark-shell --master yarn-client
</code></pre></li>
</ol>
</description>
</item>
<item>
<title>Projects</title>
<link>https://mediative.github.io/projects/</link>
<pubDate>Sat, 13 Feb 2016 19:51:05 -0500</pubDate>
<guid>https://mediative.github.io/projects/</guid>
<description><p>A curated list of OSS projects maintained by YPG Data</p>
<ul>
<li><a href="https://mediative.github.io/mesos-stack">Mesos Stack</a>:
Scripts to configure a Mesos cluster using Mesos and Mesosphere components.</li>
<li><a href="https://mediative.github.io/eigenflow">Eigenflow</a>:
ETL orchestration platform with recoverability and process monitoring features.</li>
<li><a href="https://mediative.github.io/sparrow">Sparrow</a>:
Scala library for converting Spark rows to case classes.</li>
<li><a href="https://github.com/mediative/sbt-mediative">sbt-mediative</a>:
A collection of opinionated plugins to minimize boilerplate when setting up new SBT projects.</li>
<li><a href="https://github.com/mediative/TTFI">TTFI</a>:
Scala port of the ideas from the paper on <a href="http://okmij.org/ftp/tagless-final/course/lecture.pdf">Typed Tagless-Final Interpreters</a>.</li>
</ul>
</description>
</item>
</channel>
</rss>