Skip to content

Commit

Permalink
[SQL] Minor edits to sql programming guide.
Browse files Browse the repository at this point in the history
Author: Henry Cook <hcook@eecs.berkeley.edu>

Closes #2316 from hcook/sql-docs and squashes the following commits:

373f94b [Henry Cook] Minor edits to sql programming guide.
  • Loading branch information
hcook authored and marmbrus committed Sep 8, 2014
1 parent 386bc24 commit 26bc765
Showing 1 changed file with 47 additions and 45 deletions.
92 changes: 47 additions & 45 deletions docs/sql-programming-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,10 @@ title: Spark SQL Programming Guide

Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed
[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed of
[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects, along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
in a traditional relational database. A SchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.
Expand All @@ -26,21 +26,21 @@ All of the examples on this page use sample data included in the Spark distribut
<div data-lang="java" markdown="1">
Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed
[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects along with
[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed of
[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects, along with
a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
</div>

<div data-lang="python" markdown="1">

Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed
[Row](api/python/pyspark.sql.Row-class.html) objects along with
[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed of
[Row](api/python/pyspark.sql.Row-class.html) objects, along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
in a traditional relational database. A SchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.
Expand Down Expand Up @@ -68,11 +68,11 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
{% endhighlight %}

In addition to the basic SQLContext, you can also create a HiveContext, which provides a strict
super set of the functionality provided by the basic SQLContext. Additional features include
In addition to the basic SQLContext, you can also create a HiveContext, which provides a
superset of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing hive setup, and all of the data sources available to a SQLContext are still available.
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
Expand All @@ -95,7 +95,7 @@ In addition to the basic SQLContext, you can also create a HiveContext, which pr
super set of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing hive setup, and all of the data sources available to a SQLContext are still available.
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
Expand All @@ -118,7 +118,7 @@ In addition to the basic SQLContext, you can also create a HiveContext, which pr
super set of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
existing hive setup, and all of the data sources available to a SQLContext are still available.
existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
Expand Down Expand Up @@ -146,11 +146,11 @@ describes the various methods for loading data into a SchemaRDD.

Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection based approach leads to more concise code and works well went the schema is known ahead
of time, while you are writing your Spark application.
reflection based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.

The second method for creating SchemaRDDs is through a programmatic interface that allows you to
construct a schema and then apply it to and existing RDD. While this method is more verbose, it allows
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct SchemaRDDs when the columns and their types are not known until runtime.

### Inferring the Schema Using Reflection
Expand Down Expand Up @@ -266,10 +266,10 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {

<div data-lang="python" markdown="1">

Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the datatypes . Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the columns names of the table,
Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the datatypes. Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by looking at the first row. Since we currently only look at the first
row, it is important that there is no missing data in the first row of the RDD. In future version we
row, it is important that there is no missing data in the first row of the RDD. In future versions we
plan to more completely infer the schema by looking at more data, similar to the inference that is
performed on JSON files.

Expand Down Expand Up @@ -306,14 +306,14 @@ for teenName in teenNames.collect():

<div data-lang="scala" markdown="1">

In cases that case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string or a text dataset will be parsed
When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in the step 1.
`Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via `applySchema` method provided
by `SQLContext`.

Expand Down Expand Up @@ -358,14 +358,14 @@ results.map(t => "Name: " + t(0)).collect().foreach(println)

<div data-lang="java" markdown="1">

In cases that JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string or a text dataset will be parsed and
When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
`Row`s in the RDD created in the step 1.
`Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via `applySchema` method provided
by `JavaSQLContext`.

Expand Down Expand Up @@ -427,10 +427,10 @@ List<String> names = results.map(new Function<Row, String>() {

<div data-lang="python" markdown="1">

For some cases (for example, the structure of records is encoded in a string or
a text dataset will be parsed and fields will be projected differently for
different users), it is desired to create `SchemaRDD` with a programmatically way.
It can be done with three steps.
When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
Expand Down Expand Up @@ -566,7 +566,7 @@ for teenName in teenNames.collect():

### Configuration

Configuration of parquet can be done using the `setConf` method on SQLContext or by running
Configuration of Parquet can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.

<table class="table">
Expand All @@ -575,23 +575,23 @@ Configuration of parquet can be done using the `setConf` method on SQLContext or
<td><code>spark.sql.parquet.binaryAsString</code></td>
<td>false</td>
<td>
Some other parquet producing systems, in particular Impala and older versions of Spark SQL, do
not differentiate between binary data and strings when writing out the parquet schema. This
Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do
not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.cacheMetadata</code></td>
<td>false</td>
<td>
Turns on caching of parquet schema metadata. Can speed up querying
Turns on caching of Parquet schema metadata. Can speed up querying of static data.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>snappy</td>
<td>
Sets the compression codec use when writing parquet files. Acceptable values include:
Sets the compression codec use when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
</td>
</tr>
Expand Down Expand Up @@ -805,9 +805,8 @@ Spark SQL can cache tables using an in-memory columnar format by calling `cacheT
Then Spark SQL will scan only required columns and will automatically tune compression to minimize
memory usage and GC pressure. You can call `uncacheTable("tableName")` to remove the table from memory.

Note that if you just call `cache` rather than `cacheTable`, tables will _not_ be cached in
in-memory columnar format. So we strongly recommend using `cacheTable` whenever you want to
cache tables.
Note that if you call `cache` rather than `cacheTable`, tables will _not_ be cached using
the in-memory columnar format, and therefore `cacheTable` is strongly recommended for this use case.

Configuration of in-memory caching can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.
Expand All @@ -833,7 +832,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL

</table>

## Other Configuration
## Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible
that these options will be deprecated in future release as more optimizations are performed automatically.
Expand All @@ -842,7 +841,7 @@ that these options will be deprecated in future release as more optimizations ar
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
<td>false</td>
<td>10000</td>
<td>
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
Expand Down Expand Up @@ -876,7 +875,7 @@ code.
## Running the Thrift JDBC server

The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
in Hive 0.12. You can test the JDBC server with the beeline script comes with either Spark or Hive 0.12.
in Hive 0.12. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.12.

To start the JDBC server, run the following in the Spark directory:

Expand All @@ -899,12 +898,12 @@ your machine and a blank password. For secure mode, please follow the instructio

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

You may also use the beeline script comes with Hive.
You may also use the beeline script that comes with Hive.

## Running the Spark SQL CLI

The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
queries input from command line. Note: the Spark SQL CLI cannot talk to the Thrift JDBC server.
queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

To start the Spark SQL CLI, run the following in the Spark directory:

Expand All @@ -916,7 +915,10 @@ options.

# Compatibility with Other Systems

## Migration Guide for Shark Users
## Migration Guide for Shark User

### Scheduling
s
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
users can set the `spark.sql.thriftserver.scheduler.pool` variable:

Expand All @@ -925,7 +927,7 @@ users can set the `spark.sql.thriftserver.scheduler.pool` variable:
### Reducer number

In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
SQL deprecates this property by a new property `spark.sql.shuffle.partitions`, whose default value
SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
is 200. Users may customize this property via `SET`:

SET spark.sql.shuffle.partitions=10;
Expand Down

0 comments on commit 26bc765

Please sign in to comment.