-For some cases (for example, the structure of records is encoded in a string or
-a text dataset will be parsed and fields will be projected differently for
-different users), it is desired to create `SchemaRDD` with a programmatically way.
-It can be done with three steps.
+When a dictionary of kwargs cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
+fields will be projected differently for different users),
+a `SchemaRDD` can be created programmatically with three steps.
1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
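For a concrete picture of these steps, here is a minimal Python sketch. It assumes an existing `SparkContext` named `sc`, the Spark 1.1-era `applySchema` method on `SQLContext`, and a comma-separated `people.txt` of `name,age` records; treat it as an illustration rather than the guide's own example:

```python
from pyspark.sql import SQLContext, StructType, StructField, StringType

sqlContext = SQLContext(sc)

# Step 1: build an RDD of tuples from the raw text file.
lines = sc.textFile("examples/src/main/resources/people.txt")
people = lines.map(lambda l: l.split(",")).map(lambda p: (p[0], p[1].strip()))

# Step 2: construct a StructType from a schema encoded in a string.
schemaString = "name age"
fields = [StructField(name, StringType(), True) for name in schemaString.split()]
schema = StructType(fields)

# Step 3: apply the schema to the RDD to get a SchemaRDD, then register it as a table.
schemaPeople = sqlContext.applySchema(people, schema)
schemaPeople.registerTempTable("people")
```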
@@ -566,7 +566,7 @@ for teenName in teenNames.collect():
### Configuration
-Configuration of parquet can be done using the `setConf` method on SQLContext or by running
+Configuration of Parquet can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.
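As a quick illustration of the SQL route, a Python sketch assuming an existing `sqlContext` (the keys used are among the options described in the table below):

```python
# Set Parquet options for the current context via SQL SET commands
# (the same keys can be passed to setConf on the SQLContext).
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
sqlContext.sql("SET spark.sql.parquet.compression.codec=gzip")
```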
@@ -575,8 +575,8 @@ Configuration of parquet can be done using the `setConf` method on SQLContext or
spark.sql.parquet.binaryAsString |
false |
- Some other parquet producing systems, in particular Impala and older versions of Spark SQL, do
- not differentiate between binary data and strings when writing out the parquet schema. This
+ Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do
+ not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
|
@@ -584,14 +584,14 @@ Configuration of parquet can be done using the `setConf` method on SQLContext or
spark.sql.parquet.cacheMetadata |
false |
- Turns on caching of parquet schema metadata. Can speed up querying
+ Turns on caching of Parquet schema metadata. Can speed up querying of static data.
|
spark.sql.parquet.compression.codec |
snappy |
- Sets the compression codec use when writing parquet files. Acceptable values include:
+ Sets the compression codec used when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
|
@@ -805,9 +805,8 @@ Spark SQL can cache tables using an in-memory columnar format by calling `cacheT
Then Spark SQL will scan only required columns and will automatically tune compression to minimize
memory usage and GC pressure. You can call `uncacheTable("tableName")` to remove the table from memory.
-Note that if you just call `cache` rather than `cacheTable`, tables will _not_ be cached in
-in-memory columnar format. So we strongly recommend using `cacheTable` whenever you want to
-cache tables.
+Note that if you call `cache` rather than `cacheTable`, tables will _not_ be cached using
+the in-memory columnar format, and therefore `cacheTable` is strongly recommended for this use case.
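To make the distinction concrete, a short Python sketch, assuming a `SQLContext` named `sqlContext` with a registered temporary table `people` and that `cacheTable`/`uncacheTable` are exposed on the Python `SQLContext` as they are in Scala:

```python
# Caches the table using the in-memory columnar format.
sqlContext.cacheTable("people")

# Subsequent queries against "people" scan only the columns they need.
names = sqlContext.sql("SELECT name FROM people")

# Calling .cache() on the SchemaRDD itself would cache ordinary rows,
# not the columnar representation described above.

# Remove the table from memory when it is no longer needed.
sqlContext.uncacheTable("people")
```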
Configuration of in-memory caching can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.
@@ -833,7 +832,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
-## Other Configuration
+## Other Configuration Options
The following options can also be used to tune the performance of query execution. It is possible
that these options will be deprecated in future release as more optimizations are performed automatically.
@@ -842,7 +841,7 @@ that these options will be deprecated in future release as more optimizations ar
Property Name | Default | Meaning |
spark.sql.autoBroadcastJoinThreshold |
- false |
+ 10000 |
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
@@ -876,7 +875,7 @@ code.
## Running the Thrift JDBC server
The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
-in Hive 0.12. You can test the JDBC server with the beeline script comes with either Spark or Hive 0.12.
+in Hive 0.12. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.12.
To start the JDBC server, run the following in the Spark directory:
@@ -899,12 +898,12 @@ your machine and a blank password. For secure mode, please follow the instructio
Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.
-You may also use the beeline script comes with Hive.
+You may also use the beeline script that comes with Hive.
## Running the Spark SQL CLI
The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
-queries input from command line. Note: the Spark SQL CLI cannot talk to the Thrift JDBC server.
+queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.
To start the Spark SQL CLI, run the following in the Spark directory:
@@ -916,7 +915,10 @@ options.
# Compatibility with Other Systems
-## Migration Guide for Shark Users
+## Migration Guide for Shark Users
+
+### Scheduling
+
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
users can set the `spark.sql.thriftserver.scheduler.pool` variable:
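For example, from the JDBC client session one might run the following (the pool name `accounting` is only an illustration):

SET spark.sql.thriftserver.scheduler.pool=accounting;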
@@ -925,7 +927,7 @@ users can set the `spark.sql.thriftserver.scheduler.pool` variable:
### Reducer number
In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
-SQL deprecates this property by a new property `spark.sql.shuffle.partitions`, whose default value
+SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
is 200. Users may customize this property via `SET`:
SET spark.sql.shuffle.partitions=10;
|