<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Mediative</title>
<link>https://mediative.github.io/index.xml</link>
<description>Recent content on Mediative</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Sun, 10 Jul 2016 20:47:14 -0400</lastBuildDate>
<atom:link href="https://mediative.github.io/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Simple Spark ml pipeline</title>
<link>https://mediative.github.io/post/2016/07/simple-spark-ml-pipeline/</link>
<pubDate>Sun, 10 Jul 2016 20:47:14 -0400</pubDate>
<guid>https://mediative.github.io/post/2016/07/simple-spark-ml-pipeline/</guid>
<description>
<p>Mediative recently hosted an <a href="http://www.meetup.com/Montreal-Apache-Spark-Meetup/events/231285569/">Apache Spark Montreal Meetup</a>&rsquo;s project night where some of us decided to create a simple ML pipeline.
To avoid installing Spark, we used the <a href="https://databricks.com/try-databricks">Databricks community edition</a>.
Since the goal was to see if we could make it work,
we wanted to use data that we knew was correlated.
But to make the project a little more fun,
we decided to explore something other than the usual <a href="https://en.wikipedia.org/wiki/Data_set#Classic_data_sets">data sets</a>,
so we went for the Dow Jones and the Nasdaq.
Among the wealth of <code>R</code> packages, one can find the <a href="https://cran.r-project.org/web/packages/quantmod/quantmod.pdf">quantmod package</a> for accessing financial data.</p>
<h2 id="getting-data">Getting data</h2>
<p>The notebook was created for the <code>Scala</code> language, so we used the <code>%r</code> magic to install and use the <code>R</code> package to access the data. While we were at it, we merged both data sets right away:</p>
<pre><code class="language-r">%r
install.packages(&quot;quantmod&quot;)
library(&quot;quantmod&quot;)
## NASDAQ
nsd&lt;-as.data.frame(getSymbols(Symbols = &quot;^NDX&quot;,
src = &quot;yahoo&quot;, from = &quot;2015-01-01&quot;,to = &quot;2016-01-01&quot;, env = NULL))
## Dow Jones
dji&lt;-as.data.frame(getSymbols(Symbols = &quot;^DJI&quot;,
src = &quot;yahoo&quot;, from = &quot;2015-01-01&quot;,to = &quot;2016-01-01&quot;, env = NULL))
## Add the date as a column (above, the dates are row names and would otherwise be lost when the Spark DataFrame is created)
nsd$date&lt;-rownames(nsd)
dji$date&lt;-rownames(dji)
## Merge the tables together on date
mrgIndx&lt;-merge(nsd, dji)
dfIndx &lt;- createDataFrame(sqlContext, mrgIndx)
registerTempTable(dfIndx, &quot;testIndx&quot;)
</code></pre>
<p>The last line registers the DataFrame as a (temporary) table to make it available outside of the <code>R</code> scope.</p>
<p>The following (default) Scala cell creates a DataFrame back from this table.</p>
<pre><code class="language-scala">val df = sqlContext.sql(&quot;SELECT * FROM testIndx&quot;)
</code></pre>
<h2 id="looking-at-the-data">Looking at the data</h2>
<p>Of all the fields, we will only consider the <code>date</code> and the adjusted Nasdaq (<code>NDX_Adjusted</code>) and Dow Jones (<code>DJI_Adjusted</code>) values. Why adjusted? No reason, so why not! Let&rsquo;s see if they are correlated, so that we can hope to predict one with the other:</p>
<pre><code class="language-scala">import org.apache.spark.sql.functions.lit
display(df.withColumn(&quot;NDXTimes5&quot;, $&quot;NDX_Adjusted&quot;.cast(DoubleType).multiply(lit(5))))
</code></pre>
<figure >
<img src="https://mediative.github.io/images/simple-ml-pipeline/djiNndx5.png" />
</figure>
<p>No fancy statistical tools are needed to see that these two curves are correlated. The Nasdaq value has been scaled up by five (using the imported <code>lit</code> function) to make the comparison more obvious, but this scaling will not be used in the training. Hopefully, even a basic model can take care of that.</p>
<h2 id="preparing-the-data-and-the-model">Preparing the data and the model</h2>
<p>Let&rsquo;s keep only the fields that we will need:</p>
<pre><code class="language-scala">val data = df.withColumn(&quot;NDX&quot;, $&quot;NDX_Adjusted&quot;)
.withColumn(&quot;DJI&quot;, $&quot;DJI_Adjusted&quot;)
.select(&quot;NDX&quot;, &quot;DJI&quot;)
</code></pre>
<p>And let&rsquo;s set aside a random subsample for testing purposes; the rest will be used for training the model.</p>
<pre><code class="language-scala">val Array(training, test) = data.randomSplit(Array(0.75, 0.25), seed = 12345)
</code></pre>
<p>We use the <code>VectorAssembler</code> to create the feature vector used by the model. We want to predict the Dow Jones with the Nasdaq (<code>NDX</code>), so the latter will be our feature, which we will wisely call <code>features</code>.</p>
<pre><code class="language-scala">import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array(&quot;NDX&quot;))
.setOutputCol(&quot;features&quot;)
</code></pre>
<p>We will also need a model to learn with. For such a simple task, let&rsquo;s use a simple linear regression, where we define the Dow Jones (<code>DJI</code>) as the target we want to learn (what <code>ml</code> calls the <code>label</code>).</p>
<pre><code class="language-scala">import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression()
.setLabelCol(&quot;DJI&quot;)
.setFeaturesCol(&quot;features&quot;)
</code></pre>
<h2 id="set-up-the-pipeline">Set up the pipeline</h2>
<p>Now that we have all the elements, we can easily assemble them with the <code>pipeline</code> functionality.</p>
<pre><code class="language-scala">import org.apache.spark.ml.Pipeline
val steps: Array[org.apache.spark.ml.PipelineStage] = Array(assembler, lr)
val pipeline = new Pipeline().setStages(steps)
</code></pre>
<h2 id="fitting-the-model">Fitting the model</h2>
<p>Preparing the data and training the model is done with a single call to the pipeline:</p>
<pre><code class="language-scala">val myModel = pipeline.fit(training)
</code></pre>
<p>We can now see how well the model works by comparing its predictions with the actual Dow Jones values:</p>
<pre><code class="language-scala">display(myModel.transform(test).select(&quot;prediction&quot;, &quot;DJI&quot;))
</code></pre>
<figure >
<img src="https://mediative.github.io/images/simple-ml-pipeline/predNactual.png" />
</figure>
<p>The model obviously managed to learn the correlation between the Dow Jones and the Nasdaq. Nothing to impress your broker, but it is a basis on which to build better predictions.</p>
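<p>As a hypothetical next step (not part of the original notebook), the fit could be quantified with <code>ml</code>&rsquo;s <code>RegressionEvaluator</code>. Here is a minimal sketch, assuming the same <code>myModel</code> and <code>test</code> as above:</p>
<pre><code class="language-scala">import org.apache.spark.ml.evaluation.RegressionEvaluator

// Hypothetical follow-up: score the pipeline predictions on the held-out test set.
val predictions = myModel.transform(test)
val evaluator = new RegressionEvaluator()
  .setLabelCol(&quot;DJI&quot;)
  .setPredictionCol(&quot;prediction&quot;)
  .setMetricName(&quot;rmse&quot;)
val rmse = evaluator.evaluate(predictions)
println(s&quot;Test RMSE: $rmse&quot;)
</code></pre>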
</description>
</item>
<item>
<title>Sparrow version 0.2.0</title>
<link>https://mediative.github.io/post/2016/02/sparrow-version-0.2.0/</link>
<pubDate>Mon, 29 Feb 2016 21:44:36 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/sparrow-version-0.2.0/</guid>
<description>
<p>Sparrow version <a href="https://github.com/mediative/sparrow/releases/tag/0.2.0">0.2.0</a>
is now available with updated dependency on Spark 1.6.0.</p>
<p>It&rsquo;s both available as a <a href="http://spark-packages.org/package/mediative/sparrow">Spark
package</a> and from the <a href="https://github.com/mediative/sparrow/blob/6589d0f3302520d284461e0aced147d9e14ddb7d/README.md#getting-started">YPG
Data Bintray
repository</a>.</p>
<h2 id="release-notes">Release notes</h2>
<ul>
<li>Bump Spark version to 1.6.0</li>
<li>Test against Scala 2.10.6 on Travis</li>
<li>Bump the Macro Paradise plugin to 2.1.0</li>
</ul>
</description>
</item>
<item>
<title>News from Spark Summit East</title>
<link>https://mediative.github.io/post/2016/02/news-from-spark-summit-east/</link>
<pubDate>Sun, 28 Feb 2016 21:38:37 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/news-from-spark-summit-east/</guid>
<description>
<p>Mediative is building a data pipeline on top of Spark
so I went to <a href="https://spark-summit.org/east-2016/">Spark Summit East</a>
to see what other people are doing and what&rsquo;s coming.
There were many conference tracks, including Enterprise, Developer and Data Science.
I mostly attended Data Science talks; below are the highlights.
Some of this information also came from the <a href="http://www.meetup.com/Spark-NYC/events/228233164/">NYC Spark Meetup</a>,
held on the first evening of the conference.</p>
<h1 id="spark-2-0">Spark 2.0</h1>
<p>Some of the main news about Spark 2.0:</p>
<ul>
<li>Should be available late April to early May</li>
<li>(Almost) no API changes for 2.0</li>
<li>Will unify Datasets and DataFrames (see the sketch after this list)
<ul>
<li><code>DataFrame = Dataset[Row]</code></li>
<li>Libraries will accept both interchangeably</li>
</ul></li>
</ul>
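<p>To give an idea of what that unification means in practice, here is a minimal, hypothetical sketch against the announced Spark 2.0 API, where <code>DataFrame</code> becomes a plain alias for <code>Dataset[Row]</code>:</p>
<pre><code class="language-scala">import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical illustration (Spark 2.0): DataFrame is an alias for Dataset[Row],
// so the same value can be passed wherever either type is expected.
val spark = SparkSession.builder.appName(&quot;unification&quot;).getOrCreate()
val df: DataFrame = spark.range(10).toDF(&quot;n&quot;)
val ds: Dataset[Row] = df // no conversion needed, they are the same type
</code></pre>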
<h2 id="tungsten">Tungsten</h2>
<p>Tungsten is the under-the-hood project that improves memory and CPU efficiency for Spark applications.
Project Tungsten was introduced in Spark 1.4.
See <a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html">this blog</a> for more information. Here is what to expect from the new releases:</p>
<ul>
<li><strong>Phase I</strong> Spark 2.0
<ul>
<li>~5x faster</li>
<li>Improved IO by better pruning of the data to process</li>
<li>Native memory management (fewer Java objects and their costly initialization)</li>
</ul></li>
<li><strong>Phase II</strong> Spark 2.x
<ul>
<li>~10x faster</li>
<li>Spark will work as a compiler: reading the provided code and creating its own optimized version.</li>
</ul></li>
</ul>
<h2 id="spark-streaming">Spark Streaming</h2>
<p>Processing data in real time will become more integrated with batch applications
through:</p>
<ul>
<li>Structured streams
<ul>
<li>Will extend DataFrames/Datasets</li>
<li>More analysis on streaming data</li>
</ul></li>
<li>Support for interactive &amp; batch queries (e.g. aggregating data in a stream, then serving it over JDBC; see the sketch below)</li>
</ul>
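<p>As a rough idea of what this might look like, here is a minimal, hypothetical sketch of a streaming word count using the Spark 2.x structured streaming API (not taken from the talk), written with the same DataFrame/Dataset operations as a batch query:</p>
<pre><code class="language-scala">import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName(&quot;streaming-sketch&quot;).getOrCreate()
import spark.implicits._

// Hypothetical sketch: read lines from a socket, count words, print running totals.
val lines = spark.readStream
  .format(&quot;socket&quot;)
  .option(&quot;host&quot;, &quot;localhost&quot;)
  .option(&quot;port&quot;, &quot;9999&quot;)
  .load()

val counts = lines.as[String]
  .flatMap(_.split(&quot; &quot;))
  .groupBy(&quot;value&quot;)
  .count()

// The aggregated stream keeps a running count, the kind of result one could then
// serve to interactive or batch consumers.
val query = counts.writeStream
  .outputMode(&quot;complete&quot;)
  .format(&quot;console&quot;)
  .start()

query.awaitTermination()
</code></pre>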
<p>(more info on Spark 2.0 <a href="https://spark-summit.org/east-2016/events/keynote-day-2/">here</a>)</p>
<hr />
<h1 id="pipelines">Pipelines</h1>
<p>The summit featured lots of pipeline talks; the two examples below are particularly
interesting for their similarities with our projects at Mediative.</p>
<h2 id="netflix-distributed-time-travel-for-feature-generation">Netflix Distributed Time Travel for Feature Generation</h2>
<p>The goal is to build a time machine that snapshots online services
and uses the snapshot data offline to reconstruct the inputs
that a model would have seen online, in order to generate features.</p>
<p>First, an appropriate sample of contexts is selected
(sampling based on properties such as viewing patterns, devices, time spent on the service, region, etc.)
and persisted into S3 (Parquet), as represented by the <code>Context Set</code> below.
Interestingly, they also store a confidence level for each snapshotted service:
the percentage of data fetched successfully.
The batch API fetches the S3 location of the snapshot data from Cassandra and loads the snapshot data into Spark.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/netflixSnapshotAPI.png" />
</figure>
<p>Here is an example call to their API, returning an RDD:</p>
<pre><code class="language-scala">val snapshot = new SnapshotDataManager(sqlContext))
.withTimestams(1445470140000L)
.withContextId(OUTATIME)
.getViewingHistory
</code></pre>
<p>(more info <a href="https://spark-summit.org/east-2016/events/distributed-time-travel-for-feature-generation/">here</a>)</p>
<hr />
<h2 id="real-time-data-pipelines-with-kafka">Real Time Data Pipelines with Kafka</h2>
<p>If you have <code>n</code> systems to connect, it is very likely that you&rsquo;ll end up writing <code>n*n</code> connections.
Here is a scary example:
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/conplexPipeline.png" />
</figure>
</p>
<p><strong>Kafka connect&rsquo;s two modes</strong></p>
<ul>
<li>Source connectors: from some system to Kafka</li>
<li>Sink connectors: from Kafka to some system
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/kafka2modes.png" />
</figure>
</li>
</ul>
<p>Kafka&rsquo;s buffering makes it possible to stream to (non-streaming) destinations like HDFS</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/kafkaDataIntegration.png" />
</figure>
<p>It is even possible to copy an entire database (suggested partition: by table)</p>
<p>more information <a href="https://spark-summit.org/east-2016/events/building-realtime-data-pipelines-with-kafka-connect-and-spark-streaming/">here</a></p>
<h1 id="machine-learning">Machine Learning</h1>
<p>There were many examples with MLlib and SparkR, and packages like <strong>Sparkling Water</strong> (H2O), an open-source package with tools like customized DataFrames and notebooks.
The incubating <strong>SystemML</strong> (IBM) translates high-level (R or Python) code
into optimized code that adapts to the underlying input formats and physical data representations.</p>
<hr />
<h2 id="tensorspark">TensorSpark</h2>
<p>A distributed TensorFlow on Spark (Arimo, Inc.), motivated by TensorFlow (at the time)
only being released for single machines.
Even with a distributed TensorFlow released, TensorSpark might be more appropriate for integrating with existing Spark infrastructure.</p>
<p>The figure below represents how an instance of TensorFlow runs on each machine, with
the driver acting as the parameter server: receiving gradients from the workers and broadcasting the updated model.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/tensorSparkArchitecture.png" />
</figure>
<p>more information <a href="https://spark-summit.org/east-2016/events/distributed-tensor-flow-on-spark-scaling-googles-deep-learning-library/">here</a></p>
<hr />
<h2 id="online-bidding">Online bidding</h2>
<p>Of particular interest to Mediative was a talk about real-time bidding on display ads with machine learning.</p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/AdbidPipeline.png" />
</figure>
<p>Their pipeline could train multiple models in parallel and choose the most effective one.
A very nice outcome was that the most effective model varies from campaign to campaign, as shown below.</p>
<p>
<figure >
<img src="https://mediative.github.io/images/spark-summit-east-2016/AdModelCompare.png" />
</figure>
more information <a href="https://spark-summit.org/east-2016/events/spark-dataxu-multi-model-machine-learning-for-real-time-bidding-over-display-ads/">here</a></p>
<hr />
<h1 id="visualization">Visualization</h1>
<p>Visualization still mostly relies on (non-scalable) libraries, although significant progress
was shown with the integration of ggplot2 with SparkR, where 47% of the API is implemented
(as shown <a href="https://spark-summit.org/east-2016/events/generalized-linear-models-in-spark-mllib-and-sparkr/">here</a>).
There is also the incubating Zoomdata, which shows nice promise.
Meanwhile, it is better to filter your data and use a non-distributed library.</p>
<hr />
<h1 id="others">Others</h1>
<p>A quick mention of other interesting subjects:</p>
<ul>
<li><p><a href="https://spark-summit.org/east-2016/events/magellan-spark-as-a-geospatial-analytics-engine/">Magellan-Spark Geospatial analytics</a></p>
<ul>
<li>Cartesian join: joining polygons and points</li>
<li>Supported formats include GeoJSON, ESRI, OSM-XML</li>
</ul></li>
<li><p><a href="https://spark-summit.org/east-2016/events/beyond-collect-and-parallelize-for-tests/">Beyond Collect and Parallelize for Tests</a></p>
<ul>
<li>Addressing problems of testing at scale</li>
<li>Comparing RDD, DataFrames, DataSets</li>
<li>Getting test (big) data</li>
</ul></li>
</ul>
<hr />
<h1 id="spark-community-edition-beta">Spark community edition (beta)</h1>
<p>Finally, Databricks announced a free edition of their very nice service,
which includes access to 6GB clusters.</p>
<ul>
<li><p>beta edition available in the coming weeks</p>
<ul>
<li><a href="http://go.databricks.com/databricks-community-edition-beta-waitlist">waiting list</a></li>
</ul></li>
<li><p>Includes learning utilities</p></li>
<li><p>See <a href="https://www.youtube.com/watch?v=35Y-rqSMCCA">demo</a></p></li>
</ul>
</description>
</item>
<item>
<title>Running Zeppelin on CDH</title>
<link>https://mediative.github.io/post/2016/02/running-zeppelin-on-cdh/</link>
<pubDate>Fri, 26 Feb 2016 14:46:46 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/running-zeppelin-on-cdh/</guid>
<description>
<h2 id="download-and-build-zeppelin">Download and Build Zeppelin</h2>
<p>Go to the <a href="http://zeppelin.incubator.apache.org/download.html">download page</a>
and get the latest source package.</p>
<p>Untar the source package and create a git repo to make bower happy:</p>
<pre><code>$ tar zxvf zeppelin-0.5.6-incubating.tgz
$ cd zeppelin-0.5.6-incubating
$ git init
</code></pre>
<p>Before building from source, first determine the Hadoop version by running the
following command on the edge node:</p>
<pre><code>$ hadoop version
Hadoop 2.6.0-cdh5.4.8
...
This command was run using /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/hadoop-common-2.6.0-cdh5.4.8.jar
</code></pre>
<p>Build Zeppelin with <a href="http://zeppelin.incubator.apache.org/docs/0.5.6-incubating/install/yarn_install.html">YARN support</a>
enabled using the Maven profile corresponding to the Hadoop version found above:</p>
<pre><code>$ mvn clean package -Pbuild-distr -Pyarn -Pspark-1.5 -Dspark.version=1.5.2 \
-Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.8 -DskipTests -Pvendor-repo
</code></pre>
<p>Note we are assuming that you are using a custom Spark version as described in
<a href="https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/">our previous post</a>.</p>
<h2 id="installing-zeppelin-on-the-edge-node">Installing Zeppelin on the Edge Node</h2>
<p>Copy the distribution to the edge node:</p>
<pre><code>$ scp zeppelin-distribution/target/zeppelin-x.y.z-incubating.tar.gz edge-node:
</code></pre>
<p>SSH to the edge node, extract the tarball and <code>cd</code> to the Zeppelin installation directory:</p>
<pre><code>$ tar zxvf /path/to/zeppelin-x.y.z-incubating.tar.gz
$ cd zeppelin-x.y.z-incubating/
</code></pre>
<p>Configure Zeppelin by creating and editing <code>conf/zeppelin-env.sh</code>:</p>
<pre><code>$ cp conf/zeppelin-env.sh{.template,}
</code></pre>
<p>It should contain the following variables:</p>
<pre><code class="language-sh">export SPARK_HOME=&quot;$HOME/spark-x.y.z-bin-cdhx.y.z&quot; # Assuming you are using a custom Spark version
export MASTER=yarn-client
export ZEPPELIN_JAVA_OPTS=&quot;-Dspark.yarn.jar=$HOME/spark-x.y.z-bin-cdhx.y.z/lib/spark-assembly-x.y.z-hadoopx.y.z-cdhx.y.z.jar&quot;
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-x.y.z-1.cdhx.y.z.p0.11/lib/hadoop
export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n &quot;$HADOOP_HOME&quot; ]; then
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
</code></pre>
<h2 id="manage-the-zeppelin-server">Manage the Zeppelin Server</h2>
<p>To start the server run:</p>
<pre><code> $ bin/zeppelin-daemon.sh start
</code></pre>
<p>To stop it:</p>
<pre><code> $ bin/zeppelin-daemon.sh stop
</code></pre>
</description>
</item>
<item>
<title>Mesos Stack version 0.4.0</title>
<link>https://mediative.github.io/post/2016/02/mesos-stack-version-0.4.0/</link>
<pubDate>Wed, 24 Feb 2016 14:25:03 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/mesos-stack-version-0.4.0/</guid>
<description>
<p>Version <a href="https://github.com/mediative/mesos-stack/releases/tag/0.4.0">0.4.0</a> of
our Mesos stack has been released. It updates Marathon-LB to use an upstream
released version and adds a new GlusterFS role to distribute files across the
Mesos cluster. Also enjoy the new and improved
<a href="https://mediative.github.io/mesos-stack/">documentation</a> which is generated from
the Ansible role files.</p>
<h2 id="release-notes">Release notes</h2>
<p>Improvements:</p>
<ul>
<li>mesos-master, mesos-agent: Use fully qualified host names.</li>
<li>Generate Ansible role documentation from YAML files so they are always up to
date.</li>
<li>marathon-lb: Upgrade to version 1.1.1.</li>
<li>common: Disable IPv6 on all cluster nodes.</li>
<li>New glusterfs role which adds persistent storage across nodes.</li>
</ul>
</description>
</item>
<item>
<title>Installing a Custom Spark Version on CDH</title>
<link>https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/</link>
<pubDate>Sat, 13 Feb 2016 19:54:46 -0500</pubDate>
<guid>https://mediative.github.io/post/2016/02/installing-a-custom-spark-version-on-cdh/</guid>
<description><p>Since Spark can be run as a YARN application, it is possible to run a Spark
version other than the one provided by the Cloudera platform (CDH). This
document lists the instructions for how to compile a specific Spark version
against the Hadoop version supported by CDH. The instructions are based on the
post <a href="https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa">Running Spark 1.5.1 on
CDH</a>.</p>
<ol>
<li><p>Determine the version of CDH and Hadoop by running the following command on
the edge node:</p>
<pre><code>$ hadoop version
Hadoop 2.6.0-cdh5.4.8
...
</code></pre></li>
<li><p><a href="http://spark.apache.org/downloads.html">Download Spark</a> and extract the
sources.</p></li>
<li><p><a href="http://spark.apache.org/docs/latest/building-spark.html">Build Spark</a> by
opening the distribution directory in the shell and running the following
command using the CDH and Hadoop version from step 1:</p>
<pre><code>$ ./make-distribution.sh --tgz --name cdh5.4.8 -Pyarn \
-Phadoop-2.6 -Phadoop-provided -Dhadoop.version=2.6.0-cdh5.4.8 \
-Phive -Phive-thriftserver
</code></pre>
<p>Note that <code>-Phadoop-provided</code> enables the profile to build the assembly
without including the Hadoop-ecosystem dependencies provided by Cloudera. To
compile with Scala 2.11 support, first run:</p>
<pre><code>$ ./dev/change-scala-version.sh 2.11
</code></pre>
<p>and pass <code>-Dscala-2.11</code> to <code>make-distribution.sh</code>.</p></li>
<li><p>Copy the resulting <code>tgz</code> file to the edge node:</p>
<pre><code>$ scp spark-x.x.x-bin-cdh5.4.8.tgz user@edge-node:
</code></pre></li>
<li><p>Connect to the edge node</p></li>
<li><p>Extract the <code>tgz</code> file</p></li>
<li><p><code>cd</code> into the custom Spark distribution and configure the custom Spark
distribution:</p>
<pre><code> $ cp -R /etc/spark/conf/* conf/
# Change SPARK_HOME to point to the folder with the custom Spark distribution
$ sed -i &quot;s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#&quot; conf/spark-env.sh
# Tell YARN which Spark JAR to use
$ echo &quot;spark.yarn.jar=$(pwd)/$(ls lib/spark-assembly-*.jar)&quot; &gt;&gt; conf/spark-defaults.conf
$ cp /etc/hive/conf/hive-site.xml conf/
</code></pre></li>
<li><p>Test the custom Spark distribution:</p>
<pre><code> $ ./bin/run-example SparkPi 10 --master yarn-client
$ ./bin/spark-shell --master yarn-client
</code></pre></li>
</ol>
</description>
</item>
<item>
<title>Projects</title>
<link>https://mediative.github.io/projects/</link>
<pubDate>Sat, 13 Feb 2016 19:51:05 -0500</pubDate>
<guid>https://mediative.github.io/projects/</guid>
<description><p>A curated list of OSS projects maintained by YPG Data</p>
<ul>
<li><a href="https://mediative.github.io/mesos-stack">Mesos Stack</a>:
Scripts to configure a Mesos cluster using Mesos and Mesosphere components.</li>
<li><a href="https://mediative.github.io/eigenflow">Eigenflow</a>:
ETL orchestration platform with recoverability and process monitoring features.</li>
<li><a href="https://mediative.github.io/sparrow">Sparrow</a>:
Scala library for converting Spark rows to case classes.</li>
<li><a href="https://github.com/mediative/sbt-mediative">sbt-mediative</a>:
A collection of opinionated plugins to minimize boilerplate when setting up new SBT projects.</li>
<li><a href="https://github.com/mediative/TTFI">TTFI</a>:
Scala port of the ideas from the paper on <a href="http://okmij.org/ftp/tagless-final/course/lecture.pdf">Typed Tagless-Final Interpreters</a>.</li>
</ul>
</description>
</item>
</channel>
</rss>