---
layout: global
title: Home
custom_title: Apache Spark™ - Unified Analytics Engine for Big Data
description: Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
type: page
navigation:
---
Apache Spark™ is a unified analytics engine for large-scale data processing.
<p class="lead">
Run workloads up to 100x faster.
</p>
<p>
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
</p>
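<p>
The same high-level API covers both cases. As a minimal sketch (the <code>events/</code> directory of JSON files and the <code>level</code> column are hypothetical), one query can run either as a batch job or as a Structured Streaming job:
</p>

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()

# Batch: read a static directory of JSON files and aggregate it.
batch_df = spark.read.json("events/")
batch_df.groupBy("level").count().show()

# Streaming: treat files arriving in the same directory as an unbounded table.
# Streaming sources need an explicit schema, so reuse the one inferred above.
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
query = (stream_df.groupBy("level").count()
         .writeStream.outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```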
<p class="lead">
Write applications quickly in Java, Scala, Python, R, and SQL.
</p>
<p>
Spark offers over 80 high-level operators that make it easy to build parallel apps.
And you can use it <em>interactively</em>
from the Scala, Python, R, and SQL shells.
</p>
<pre><code>df = spark.read.json("logs.json")
df.where("age > 21") \
  .select("name.first").show()</code></pre>
<p><em>Spark's Python DataFrame API: read JSON files with automatic schema inference.</em></p>
<p class="lead">
Combine SQL, streaming, and complex analytics.
</p>
<p>
Spark powers a stack of libraries including
<a href="{{site.baseurl}}/sql/">SQL and DataFrames</a>, <a href="{{site.baseurl}}/mllib/">MLlib</a> for machine learning,
<a href="{{site.baseurl}}/graphx/">GraphX</a>, and <a href="{{site.baseurl}}/streaming/">Spark Streaming</a>.
You can combine these libraries seamlessly in the same application.
</p>
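<p>
As a hedged sketch of what that combination can look like in a single Python application (the <code>users.parquet</code> file and its columns are made-up placeholders), a Spark SQL query can feed directly into an MLlib model:
</p>

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-plus-mllib").getOrCreate()

# Hypothetical input: a Parquet table of users with numeric columns and a label.
spark.read.parquet("users.parquet").createOrReplaceTempView("users")

# Spark SQL: filter and project with a plain SQL query.
adults = spark.sql("SELECT age, income, label FROM users WHERE age > 21")

# MLlib: assemble features and fit a model on the SQL result, in the same job.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
model = LogisticRegression(maxIter=10).fit(assembler.transform(adults))
print(model.coefficients)
```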
<p class="lead">
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.
</p>
<p>
You can run Spark using its <a href="{{site.baseurl}}/docs/latest/spark-standalone.html">standalone cluster mode</a>,
on <a href="https://github.com/amplab/spark-ec2">EC2</a>,
on <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a>,
on <a href="https://mesos.apache.org">Mesos</a>, or
on <a href="https://kubernetes.io/">Kubernetes</a>.
Access data in <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">HDFS</a>,
<a href="https://www.alluxio.org/">Alluxio</a>,
<a href="https://cassandra.apache.org">Apache Cassandra</a>,
<a href="https://hbase.apache.org">Apache HBase</a>,
<a href="https://hive.apache.org">Apache Hive</a>,
and hundreds of other data sources.
</p>
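<p>
As an illustration (the master URL, paths, and table name below are placeholders, and cloud or Hive access assumes the usual connectors are configured), the application code stays the same across cluster managers; only the master setting and the data source URIs change:
</p>

```python
from pyspark.sql import SparkSession

# Placeholder master URL: could equally be "yarn", "k8s://https://<host>:<port>",
# or "local[*]" on a laptop; the rest of the program does not change.
spark = (SparkSession.builder
         .appName("multi-source")
         .master("spark://master-host:7077")
         .enableHiveSupport()
         .getOrCreate())

# Mix data sources in one session (all paths and tables here are hypothetical).
logs = spark.read.json("hdfs:///data/logs.json")          # HDFS
events = spark.read.parquet("s3a://my-bucket/events/")    # cloud object storage
users = spark.table("warehouse.users")                    # Hive metastore table

logs.join(users, "user_id").groupBy("country").count().show()
```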
<p>
Spark is used at a wide range of organizations to process large datasets.
You can find many example use cases on the
<a href="{{site.baseurl}}/powered-by.html">Powered By</a> page.
</p>
<p>
There are many ways to reach the community:
</p>
<ul class="list-narrow">
<li>Use the <a href="{{site.baseurl}}/community.html#mailing-lists">mailing lists</a> to ask questions.</li>
<li>In-person events include numerous <a href="{{site.baseurl}}/community.html#events">meetup groups and conferences</a>.</li>
<li>We use <a href="https://issues.apache.org/jira/browse/SPARK">JIRA</a> for issue tracking.</li>
</ul>
<p>
Apache Spark is built by a wide set of developers from over 300 companies.
Since 2009, more than 1200 developers have contributed to Spark!
</p>
<p>
The project's
<a href="{{site.baseurl}}/committers.html">committers</a>
come from more than 25 organizations.
</p>
<p>
If you'd like to participate in Spark, or contribute to the libraries on top of it, learn
<a href="{{site.baseurl}}/contributing.html">how to contribute</a>.
</p>
<p>Learning Apache Spark is easy whether you come from a Java, Scala, Python, R, or SQL background:</p>
<ul class="list-narrow">
<li><a href="{{site.baseurl}}/downloads.html">Download</a> the latest release: you can run Spark locally on your laptop, as in the example below.</li>
<li>Read the <a href="{{site.baseurl}}/docs/latest/quick-start.html">quick start guide</a>.</li>
<li>Learn how to <a href="{{site.baseurl}}/docs/latest/#launching-on-a-cluster">deploy</a> Spark on a cluster.</li>
</ul>
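<p>
For a first local run, a session on a single machine can be as small as this sketch (the <code>README.md</code> path is just an example; any local text file works):
</p>

```python
from pyspark.sql import SparkSession

# Run Spark locally, using all cores on the machine; no cluster is required.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("first-steps")
         .getOrCreate())

# Hypothetical input file: any local text file, e.g. Spark's own README.md.
lines = spark.read.text("README.md")
print(lines.count())                                        # total lines
print(lines.filter(lines.value.contains("Spark")).count())  # lines mentioning Spark

spark.stop()
```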