Important: this example does not work on Basic clusters. A defect has been raised with the IBM Bluemix engineering team.

Overview

This example shows how to use WebHCat to submit Hive, MapReduce, Pig and Sqoop jobs to a BigInsights cluster and wait for the job response. An example use case for programmatically running jobs on the BigInsights cluster is a client application that needs to process data in BigInsights and send the results of the processing to a third-party application. Here the client application could be an ETL tool job, a microservice, or some other custom application.

This example uses WebHCat to run and monitor four types of job: Hive, MapReduce, Pig and Sqoop.

WebHCat, previously known as Templeton, is the REST API for HCatalog, a table and storage management layer for Hadoop.

BigInsights for Apache Hadoop clusters are secured using Apache Knox. Apache Knox is a REST API gateway for interacting with Apache Hadoop clusters. The Apache Knox project provides a Java client library that can be used for interacting with the WebHCat REST API. Using the library is usually more productive because it provides a higher level of abstraction than working directly with the REST API.
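
As a rough illustration (not the repository's exact code), the snippet below opens a Knox session with the client library and issues a simple WebHCat call through it. The gateway URL, user name and password are placeholders, and the class names are those used by the Knox client library at the time of writing.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.job.Job

// Placeholders: gateway URL, user name and password for your cluster.
def gateway  = 'https://your-cluster-host:8443/gateway/default'
def username = 'changeme'
def password = 'changeme'

// Open an authenticated session against the Knox gateway...
def session = Hadoop.login(gateway, username, password)

// ...and issue a simple WebHCat call through it: list the job queue.
println Job.queryQueue(session).now().string

session.shutdown()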

Developer experience

Developers will gain the most from these examples if they are:

  • Comfortable using Windows, OS X or *nix command prompts
  • Familiar with compiling and running Java applications
  • Able to read code written in a high-level language such as Groovy
  • Familiar with the Gradle build tool
  • Familiar with Map/Reduce, Hive and Pig concepts
  • Familiar with the WebHdfsGroovy Example

Example Requirements

You have met the prerequisites and followed the setup instructions in the top-level README.

For the Sqoop example, make sure to specify the following in your connection.properties file:

dashdb_sqoop_url:jdbc:db2://changeme:changeme/BLUDB:sslConnection=true;
dashdb_sqoop_username:changeme
dashdb_sqoop_password:changeme
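
As a hedged sketch only, the snippet below shows one way these values could be loaded with java.util.Properties; the property names match the entries above, but the example scripts may read connection.properties differently.

// Hedged sketch: loading the Sqoop connection settings shown above with
// java.util.Properties. The actual scripts may read connection.properties
// differently; this only illustrates the key/value format.
def props = new Properties()
new File('connection.properties').withInputStream { props.load(it) }

def sqoopUrl      = props.getProperty('dashdb_sqoop_url')       // jdbc:db2://...
def sqoopUser     = props.getProperty('dashdb_sqoop_username')
def sqoopPassword = props.getProperty('dashdb_sqoop_password')

assert sqoopUrl && sqoopUser && sqoopPassword : 'Sqoop settings are missing'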

Run the example

This example consists of:

  • Hive.groovy - Groovy script to submit the Hive job via WebHCat
  • MapReduce.groovy - Groovy script to submit the Map/Reduce job via WebHCat
  • Pig.groovy - Groovy script to submit the Pig job via WebHCat
  • Sqoop.groovy - Groovy script to submit the Sqoop job via WebHCat
  • build.gradle - Gradle script to compile and package the Map/Reduce code and execute each Groovy script

To run the examples, open a command prompt window:

  • change into the directory containing this example and run gradle to execute the example
    • ./gradlew Example (OS X / *nix)
    • gradlew.bat Example (Windows)

You can also run an individual example:

  • change into the directory containing this example and run gradle to execute the example
    • ./gradlew Hive (OS X / *nix)
    • gradlew.bat Hive (Windows)
  • some output from running the command on my machine is shown below
$ cd examples/WebHCatGroovy
$ ./gradlew Hive
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:Hive
[Hive.groovy] Delete /user/biadmin/test: 200
[Hive.groovy] Mkdir /user/biadmin/test: 200
[Hive.groovy] Copy Hive query file to HDFS
[Hive.groovy] Submitted job: job_1469408964315_0126
[Hive.groovy] Polling up to 240s for job completion...
.........................................................
[Hive.groovy] Job status: true
[exit, stderr, stdout]
[Hive.groovy] Content of stdout:
set TABLE_NAME
TABLE_NAME=biadmin_temp_1470349178741


create table if not exists ${hiveconf:TABLE_NAME} ( id int, name string )

insert into table ${hiveconf:TABLE_NAME} values (1, "Hello World!")

select * from ${hiveconf:TABLE_NAME}
1	Hello World!

drop table ${hiveconf:TABLE_NAME}


BUILD SUCCESSFUL

Total time: 1 mins 13.49 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Hive.groovy script prefixed with [Hive.groovy]
  • the Hive job output, taken from the content of stdout
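
For orientation, here is a minimal, hedged sketch of the kind of calls Hive.groovy makes through the Knox client DSL for WebHDFS and WebHCat; the file names are placeholders and the session object is assumed to have been created as in the earlier login sketch.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed to come from Hadoop.login(...) as in the earlier sketch.
def testDir   = '/user/biadmin/test'
def statusDir = "${testDir}/status"

println 'Delete ' + testDir + ': ' + Hdfs.rm(session).file(testDir).recursive().now().statusCode
println 'Mkdir ' + testDir + ': ' + Hdfs.mkdir(session).dir(testDir).now().statusCode

// 'hive_query.hql' is a placeholder name for the local Hive script.
Hdfs.put(session).file('hive_query.hql').to("${testDir}/hive_query.hql").now()

// Submit the query file via WebHCat and capture the job id.
def jobId = Job.submitHive(session).file("${testDir}/hive_query.hql").statusDir(statusDir).now().jobId
println 'Submitted job: ' + jobId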

To run the MapReduce example, use ./gradlew MapReduce:

$ cd examples/WebHCatGroovy
$ ./gradlew MapReduce
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:jar UP-TO-DATE
:MapReduce
[MapReduce.groovy] Delete /user/biadmin/test: 200
[MapReduce.groovy] Mkdir /user/biadmin/test: 200
[MapReduce.groovy] Copy input file to HDFS
[MapReduce.groovy] Copy jar file to HDFS
[MapReduce.groovy] Submitted job: job_1469408964315_0128
[MapReduce.groovy] Polling up to 60s for job completion...
................................................
[MapReduce.groovy] Job status: true
[_SUCCESS, part-r-00000]

BUILD SUCCESSFUL

Total time: 1 mins 2.985 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the MapReduce.groovy script prefixed with [MapReduce.groovy]
  • the listing of the MapReduce job's output directory on HDFS ([_SUCCESS, part-r-00000])
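
Similarly, a hedged sketch of the MapReduce submission step: the jar built by Gradle is uploaded over WebHDFS and submitted through the Knox DSL. The jar path, input file and class name below are placeholders rather than the repository's exact values.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch.
def testDir = '/user/biadmin/test'

// Upload a sample input file and the jar produced by the gradle build
// (both paths are placeholders).
Hdfs.put(session).file('input.txt').to("${testDir}/input/input.txt").now()
Hdfs.put(session).file('build/libs/example.jar').to("${testDir}/example.jar").now()

// Submit the jar via WebHCat; the main class name is an assumption based on
// the WordCount source compiled by the build.
def jobId = Job.submitJava(session).
        jar("${testDir}/example.jar").
        app('org.apache.hadoop.examples.WordCount').
        input("${testDir}/input").
        output("${testDir}/output").
        now().jobId
println 'Submitted job: ' + jobId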

To run the Pig example, use ./gradlew Pig:

$ cd examples/WebHCatGroovy
$ ./gradlew Pig
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:Pig
[Pig.groovy] Delete /user/biadmin/test: 200
[Pig.groovy] Mkdir /user/biadmin/test: 200
[Pig.groovy] Copy Pig query file to HDFS
[Pig.groovy] Copy data file to HDFS
[Pig.groovy] Submitted job: job_1469408964315_0124
[Pig.groovy] Polling up to 60s for job completion...
...................................................
[Pig.groovy] Job status: true
[exit, stderr, stdout]
[Pig.groovy] Content of stdout:
(ctdean)
(pauls)
(carmas)
(dra)


BUILD SUCCESSFUL

Total time: 1 mins 9.279 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Pig.groovy script prefixed with [Pig.groovy]
  • the Pig job output, taken from the content of stdout
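
Again as a hedged sketch, a Pig script and its data file can be uploaded over WebHDFS and submitted through the Knox DSL roughly as follows; file names and paths are placeholders.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch.
def testDir = '/user/biadmin/test'

// Upload the Pig script and its data file (placeholder names).
Hdfs.put(session).file('script.pig').to("${testDir}/script.pig").now()
Hdfs.put(session).file('data.txt').to("${testDir}/data.txt").now()

// Submit the script via WebHCat, writing stderr/stdout to the status directory.
def jobId = Job.submitPig(session).
        file("${testDir}/script.pig").
        arg('-v').
        statusDir("${testDir}/status").
        now().jobId
println 'Submitted job: ' + jobId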

To run the Sqoop example, use ./gradlew Sqoop:

$ cd examples/WebHCatGroovy
$ ./gradlew Sqoop
:compileJava UP-TO-DATE
:compileGroovy
Note: /Users/pierreregazzoni/BigInsights-on-Apache-Hadoop/examples/WebHCatGroovy/src/main/java/org/apache/hadoop/examples/WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
:processResources UP-TO-DATE
:classes
:Sqoop
[Sqoop.groovy] Delete /user/biadmin/sqoop: 200
[Sqoop.groovy] Mkdir /user/biadmin/sqoop: 200
[Sqoop.groovy] Submitted job: job_1471997327942_0112
[Sqoop.groovy] Polling up to 60s for job completion...
..............................................
[Sqoop.groovy] Job status: true
[exit, stderr, stdout]

[Sqoop.groovy] Content of stderr:
log4j:WARN custom level class [Relative to Yarn Log Dir Prefix] not found.
16/08/25 13:18:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6_IBM_27

...

	File Output Format Counters 
		Bytes Written=87
16/08/25 13:18:34 INFO mapreduce.ImportJobBase: Transferred 87 bytes in 22.8155 seconds (3.8132 bytes/sec)
16/08/25 13:18:34 INFO mapreduce.ImportJobBase: Retrieved 3 records.

[_SUCCESS, part-m-00000]

[Sqoop.groovy] Content of part-m-00000
b,N/A (less than 5 years old)
1,Yes, speaks another language
2,No, speaks only English


BUILD SUCCESSFUL

Total time: 59.665 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Sqoop.groovy script prefixed with [Sqoop.groovy]
  • the Sqoop job output, taken from the content of stderr
  • the content of the table that was 'sqooped' from dashDB to HDFS

NOTE: The Sqoop example uses a local patch for JIRA KNOX-743.
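
With that patch applied, the submission step looks roughly like the sketch below. submitSqoop() is not available in an unpatched Knox client at this level, so the exact DSL may differ; the table name and directories are placeholders, and the connection values are assumed to have been loaded from connection.properties as in the earlier properties sketch.

import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch; sqoopUrl, sqoopUser and
// sqoopPassword from the connection.properties sketch. MYTABLE and the
// directories are placeholders.
def statusDir = '/user/biadmin/sqoop/status'
def command   = "import --connect ${sqoopUrl} --username ${sqoopUser} " +
                "--password ${sqoopPassword} --table MYTABLE " +
                "--target-dir /user/biadmin/sqoop/output -m 1"

// submitSqoop() requires the KNOX-743 patch mentioned above.
def jobId = Job.submitSqoop(session).
        command(command).
        statusDir(statusDir).
        now().jobId
println 'Submitted job: ' + jobId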

Decomposition Instructions

The examples use a Gradle build file, build.gradle, which runs when you execute ./gradlew or gradlew.bat. The build.gradle does the following (a hedged sketch of the relevant configuration follows this list):

  • compile the Map/Reduce code; the apply plugin: 'java' statement controls this (see the Gradle Java plugin documentation for more information)
  • create a Jar file with the compiled Map/Reduce code for the MapReduce example; the jar { ... } block controls this (see the Gradle Jar task documentation for more information)
  • compile and execute the Groovy scripts
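
A hedged sketch of what such a build.gradle might contain (not the repository's actual build file; the dependency versions and task wiring are assumptions):

apply plugin: 'groovy'
apply plugin: 'java'

repositories { mavenCentral() }

dependencies {
    compile 'org.codehaus.groovy:groovy-all:2.4.7'   // version is an assumption
    compile 'org.apache.knox:gateway-shell:0.9.1'    // Knox client DSL used by the scripts
    compile 'org.apache.hadoop:hadoop-client:2.7.2'  // needed to compile the WordCount source
}

jar {
    // Package the compiled Map/Reduce classes for the MapReduce example.
    baseName = 'webhcat-example'                     // placeholder name
}

// One task of this shape per script (Hive, MapReduce, Pig, Sqoop): a Groovy
// script compiles to a class with a main() method, so JavaExec can run it.
task Hive(type: JavaExec, dependsOn: 'classes') {
    main      = 'Hive'
    classpath = sourceSets.main.runtimeClasspath
}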

Each example script performs the following steps:

  • Create an HTTP session on the BigInsights Knox REST API
  • Upload the required query file, jar file and/or data file over WebHDFS
  • Submit the job using the Knox REST API for WebHCat
  • Check the status of the job every second using the Knox REST API for WebHCat
  • When the job finishes successfully, check and download the output from the job
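
Put together, the polling and output-retrieval steps look roughly like the sketch below (not a copy of the scripts' exact code); jobId, session and the status directory are assumed from the submission sketches above.

import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session', 'jobId' and the status directory are assumed from the
// submission sketches above.
def statusDir = '/user/biadmin/test/status'
def done = false
def seconds = 0

// Poll WebHCat once per second until the job reports completion.
while (!done && seconds++ < 240) {
    sleep(1000)
    def json = Job.queryStatus(session).jobId(jobId).now().string
    done = JsonPath.read(json, '$.status.jobComplete')
    print '.'
}
println ''
println 'Job status: ' + done

// List the status directory ([exit, stderr, stdout]) and print stdout.
println JsonPath.read(Hdfs.ls(session).dir(statusDir).now().string,
        '$.FileStatuses.FileStatus[*].pathSuffix')
println Hdfs.get(session).from("${statusDir}/stdout").now().string

session.shutdown()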

All code is well commented and it is suggested that you browse the build.gradle and *.groovy scripts to understand in more detail how they work.