Important: this example does not work on Basic clusters. A defect has been raised with the IBM Bluemix engineering team.

Overview

This example shows how to use WebHCat to submit Hive, MapReduce, Pig and Sqoop jobs to a BigInsights cluster and wait for the job response. An example use case for programmatically running jobs on the BigInsights cluster is a client application that needs to process data in BigInsights and send the results of the processing to a third-party application. Here the client application could be an ETL tool job, a microservice, or some other custom application.

This example uses WebHCat to run and monitor four types of job: Hive, MapReduce, Pig and Sqoop.

WebHCat, previously known as Templeton, is the REST API for HCatalog, a table and storage management layer for Hadoop.

BigInsights for Apache Hadoop clusters are secured using Apache Knox. Apache Knox is a REST API gateway for interacting with Apache Hadoop clusters. The Apache Knox project provides a Java client library that can be used for interacting with the WebHCat REST API. Using the library is usually more productive because it provides a higher level of abstraction than working directly with the REST API.
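
As a rough illustration (not the repository's exact code), the snippet below opens a Knox session with the client library and issues a simple WebHCat call through it. The gateway URL, user name and password are placeholders, and the class names are those used by the Knox client library at the time of writing.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.job.Job

// Placeholders: gateway URL, user name and password for your cluster.
def gateway  = 'https://your-cluster-host:8443/gateway/default'
def username = 'changeme'
def password = 'changeme'

// Open an authenticated session against the Knox gateway...
def session = Hadoop.login(gateway, username, password)

// ...and issue a simple WebHCat call through it: list the job queue.
println Job.queryQueue(session).now().string

session.shutdown()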

Developer experience

Developers will gain the most from these examples if they are:

  • Comfortable using Windows, OS X or *nix command prompts
  • Familiar with compiling and running Java applications
  • Able to read code written in a high-level language such as Groovy
  • Familiar with the Gradle build tool
  • Familiar with Map/Reduce, Hive and Pig concepts
  • Familiar with the WebHdfsGroovy Example

Example Requirements

You have met the prerequisites and followed the setup instructions in the top-level README.

For the Sqoop example, make sure to specify the following in your connection.properties file:

dashdb_sqoop_url:jdbc:db2://changeme:changeme/BLUDB:sslConnection=true;
dashdb_sqoop_username:changeme
dashdb_sqoop_password:changeme
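
As a hedged sketch only, the snippet below shows one way these values could be loaded with java.util.Properties; the property names match the entries above, but the example scripts may read connection.properties differently.

// Hedged sketch: loading the Sqoop connection settings shown above with
// java.util.Properties. The actual scripts may read connection.properties
// differently; this only illustrates the key/value format.
def props = new Properties()
new File('connection.properties').withInputStream { props.load(it) }

def sqoopUrl      = props.getProperty('dashdb_sqoop_url')       // jdbc:db2://...
def sqoopUser     = props.getProperty('dashdb_sqoop_username')
def sqoopPassword = props.getProperty('dashdb_sqoop_password')

assert sqoopUrl && sqoopUser && sqoopPassword : 'Sqoop settings are missing'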

Run the example

This example consists of:

  • Hive.groovy - Groovy script to submit the Hive job via WebHCat
  • MapReduce.groovy - Groovy script to submit the Map/Reduce job via WebHCat
  • Pig.groovy - Groovy script to submit the Pig job via WebHCat
  • Sqoop.groovy - Groovy script to submit the Sqoop job via WebHCat
  • build.gradle - Gradle script to compile and package the Map/Reduce code and execute each Groovy script

To run the examples, open a command prompt window:

  • change into the directory containing this example and run gradle to execute the example
    • ./gradlew Example (OS X / *nix)
    • gradlew.bat Example (Windows)

You can also run an individual example:

  • change into the directory containing this example and run gradle to execute the example
    • ./gradlew Hive (OS X / *nix)
    • gradlew.bat Hive (Windows)
  • some output from running the command on my machine is shown below
$ cd examples/WebHCatGroovy
$ ./gradlew Hive
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:Hive
[Hive.groovy] Delete /user/biadmin/test: 200
[Hive.groovy] Mkdir /user/biadmin/test: 200
[Hive.groovy] Copy Hive query file to HDFS
[Hive.groovy] Submitted job: job_1469408964315_0126
[Hive.groovy] Polling up to 240s for job completion...
.........................................................
[Hive.groovy] Job status: true
[exit, stderr, stdout]
[Hive.groovy] Content of stdout:
set TABLE_NAME
TABLE_NAME=biadmin_temp_1470349178741


create table if not exists ${hiveconf:TABLE_NAME} ( id int, name string )

insert into table ${hiveconf:TABLE_NAME} values (1, "Hello World!")

select * from ${hiveconf:TABLE_NAME}
1	Hello World!

drop table ${hiveconf:TABLE_NAME}


BUILD SUCCESSFUL

Total time: 1 mins 13.49 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Hive.groovy script prefixed with [Hive.groovy]
  • the Hive job output, taken from the content of stdout
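
For orientation, here is a minimal, hedged sketch of the kind of calls Hive.groovy makes through the Knox client DSL for WebHDFS and WebHCat; the file names are placeholders and the session object is assumed to have been created as in the earlier login sketch.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed to come from Hadoop.login(...) as in the earlier sketch.
def testDir   = '/user/biadmin/test'
def statusDir = "${testDir}/status"

println 'Delete ' + testDir + ': ' + Hdfs.rm(session).file(testDir).recursive().now().statusCode
println 'Mkdir ' + testDir + ': ' + Hdfs.mkdir(session).dir(testDir).now().statusCode

// 'hive_query.hql' is a placeholder name for the local Hive script.
Hdfs.put(session).file('hive_query.hql').to("${testDir}/hive_query.hql").now()

// Submit the query file via WebHCat and capture the job id.
def jobId = Job.submitHive(session).file("${testDir}/hive_query.hql").statusDir(statusDir).now().jobId
println 'Submitted job: ' + jobId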

To run the MapReduce example, use ./gradlew MapReduce:

$ cd examples/WebHCatGroovy
$ ./gradlew MapReduce
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:jar UP-TO-DATE
:MapReduce
[MapReduce.groovy] Delete /user/biadmin/test: 200
[MapReduce.groovy] Mkdir /user/biadmin/test: 200
[MapReduce.groovy] Copy input file to HDFS
[MapReduce.groovy] Copy jar file to HDFS
[MapReduce.groovy] Submitted job: job_1469408964315_0128
[MapReduce.groovy] Polling up to 60s for job completion...
................................................
[MapReduce.groovy] Job status: true
[_SUCCESS, part-r-00000]

BUILD SUCCESSFUL

Total time: 1 mins 2.985 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the MapReduce.groovy script prefixed with [MapReduce.groovy]
  • the listing of the MapReduce job's output directory on HDFS ([_SUCCESS, part-r-00000])
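
Similarly, a hedged sketch of the MapReduce submission step: the jar built by Gradle is uploaded over WebHDFS and submitted through the Knox DSL. The jar path, input file and class name below are placeholders rather than the repository's exact values.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch.
def testDir = '/user/biadmin/test'

// Upload a sample input file and the jar produced by the gradle build
// (both paths are placeholders).
Hdfs.put(session).file('input.txt').to("${testDir}/input/input.txt").now()
Hdfs.put(session).file('build/libs/example.jar').to("${testDir}/example.jar").now()

// Submit the jar via WebHCat; the main class name is an assumption based on
// the WordCount source compiled by the build.
def jobId = Job.submitJava(session).
        jar("${testDir}/example.jar").
        app('org.apache.hadoop.examples.WordCount').
        input("${testDir}/input").
        output("${testDir}/output").
        now().jobId
println 'Submitted job: ' + jobId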

To run the Pig example, use ./gradlew Pig:

$ cd examples/WebHCatGroovy
$ ./gradlew Pig
:compileJava UP-TO-DATE
:compileGroovy UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:Pig
[Pig.groovy] Delete /user/biadmin/test: 200
[Pig.groovy] Mkdir /user/biadmin/test: 200
[Pig.groovy] Copy Pig query file to HDFS
[Pig.groovy] Copy data file to HDFS
[Pig.groovy] Submitted job: job_1469408964315_0124
[Pig.groovy] Polling up to 60s for job completion...
...................................................
[Pig.groovy] Job status: true
[exit, stderr, stdout]
[Pig.groovy] Content of stdout:
(ctdean)
(pauls)
(carmas)
(dra)


BUILD SUCCESSFUL

Total time: 1 mins 9.279 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Pig.groovy script prefixed with [Pig.groovy]
  • the Pig job output, taken from the content of stdout
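
Again as a hedged sketch, a Pig script and its data file can be uploaded over WebHDFS and submitted through the Knox DSL roughly as follows; file names and paths are placeholders.

import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch.
def testDir = '/user/biadmin/test'

// Upload the Pig script and its data file (placeholder names).
Hdfs.put(session).file('script.pig').to("${testDir}/script.pig").now()
Hdfs.put(session).file('data.txt').to("${testDir}/data.txt").now()

// Submit the script via WebHCat, writing stderr/stdout to the status directory.
def jobId = Job.submitPig(session).
        file("${testDir}/script.pig").
        arg('-v').
        statusDir("${testDir}/status").
        now().jobId
println 'Submitted job: ' + jobId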

To run the Sqoop example, use ./gradlew Sqoop:

$ cd examples/WebHCatGroovy
$ ./gradlew Sqoop
:compileJava UP-TO-DATE
:compileGroovy
Note: /Users/pierreregazzoni/BigInsights-on-Apache-Hadoop/examples/WebHCatGroovy/src/main/java/org/apache/hadoop/examples/WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
:processResources UP-TO-DATE
:classes
:Sqoop
[Sqoop.groovy] Delete /user/biadmin/sqoop: 200
[Sqoop.groovy] Mkdir /user/biadmin/sqoop: 200
[Sqoop.groovy] Submitted job: job_1471997327942_0112
[Sqoop.groovy] Polling up to 60s for job completion...
..............................................
[Sqoop.groovy] Job status: true
[exit, stderr, stdout]

[Sqoop.groovy] Content of stderr:
log4j:WARN custom level class [Relative to Yarn Log Dir Prefix] not found.
16/08/25 13:18:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6_IBM_27

...

	File Output Format Counters 
		Bytes Written=87
16/08/25 13:18:34 INFO mapreduce.ImportJobBase: Transferred 87 bytes in 22.8155 seconds (3.8132 bytes/sec)
16/08/25 13:18:34 INFO mapreduce.ImportJobBase: Retrieved 3 records.

[_SUCCESS, part-m-00000]

[Sqoop.groovy] Content of part-m-00000
b,N/A (less than 5 years old)
1,Yes, speaks another language
2,No, speaks only English


BUILD SUCCESSFUL

Total time: 59.665 secs

The output above shows:

  • the steps performed by gradle (command names are prefixed by ':' such as :compileJava)
  • the steps performed by the Sqoop.groovy script prefixed with [Sqoop.groovy]
  • the Sqoop job output, taken from the content of stderr
  • the content of the table that was 'sqooped' from dashDB to HDFS

NOTE: The Sqoop example uses a local patch for JIRA KNOX-743.
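
With that patch applied, the submission step looks roughly like the sketch below. submitSqoop() is not available in an unpatched Knox client at this level, so the exact DSL may differ; the table name and directories are placeholders, and the connection values are assumed to have been loaded from connection.properties as in the earlier properties sketch.

import org.apache.hadoop.gateway.shell.job.Job

// 'session' is assumed from the earlier login sketch; sqoopUrl, sqoopUser and
// sqoopPassword from the connection.properties sketch. MYTABLE and the
// directories are placeholders.
def statusDir = '/user/biadmin/sqoop/status'
def command   = "import --connect ${sqoopUrl} --username ${sqoopUser} " +
                "--password ${sqoopPassword} --table MYTABLE " +
                "--target-dir /user/biadmin/sqoop/output -m 1"

// submitSqoop() requires the KNOX-743 patch mentioned above.
def jobId = Job.submitSqoop(session).
        command(command).
        statusDir(statusDir).
        now().jobId
println 'Submitted job: ' + jobId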

Decomposition Instructions

The examples use a Gradle build file, build.gradle, which runs when you execute ./gradlew or gradlew.bat. The build.gradle does the following (a hedged sketch of the relevant configuration follows this list):

  • compile the Map/Reduce code; the apply plugin: 'java' statement controls this (see the Gradle Java plugin documentation for more information)
  • create a Jar file with the compiled Map/Reduce code for the MapReduce example; the jar { ... } block controls this (see the Gradle Jar task documentation for more information)
  • compile and execute the Groovy scripts
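
A hedged sketch of what such a build.gradle might contain (not the repository's actual build file; the dependency versions and task wiring are assumptions):

apply plugin: 'groovy'
apply plugin: 'java'

repositories { mavenCentral() }

dependencies {
    compile 'org.codehaus.groovy:groovy-all:2.4.7'   // version is an assumption
    compile 'org.apache.knox:gateway-shell:0.9.1'    // Knox client DSL used by the scripts
    compile 'org.apache.hadoop:hadoop-client:2.7.2'  // needed to compile the WordCount source
}

jar {
    // Package the compiled Map/Reduce classes for the MapReduce example.
    baseName = 'webhcat-example'                     // placeholder name
}

// One task of this shape per script (Hive, MapReduce, Pig, Sqoop): a Groovy
// script compiles to a class with a main() method, so JavaExec can run it.
task Hive(type: JavaExec, dependsOn: 'classes') {
    main      = 'Hive'
    classpath = sourceSets.main.runtimeClasspath
}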

Each example script performs the following steps:

  • Create an HTTP session on the BigInsights Knox REST API
  • Upload the required query file, jar file and/or data file over WebHDFS
  • Submit the job using the Knox REST API for WebHCat
  • Check the status of the job every second using the Knox REST API for WebHCat
  • When the job finishes successfully, check and download the output from the job
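
Put together, the polling and output-retrieval steps look roughly like the sketch below (not a copy of the scripts' exact code); jobId, session and the status directory are assumed from the submission sketches above.

import com.jayway.jsonpath.JsonPath
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import org.apache.hadoop.gateway.shell.job.Job

// 'session', 'jobId' and the status directory are assumed from the
// submission sketches above.
def statusDir = '/user/biadmin/test/status'
def done = false
def seconds = 0

// Poll WebHCat once per second until the job reports completion.
while (!done && seconds++ < 240) {
    sleep(1000)
    def json = Job.queryStatus(session).jobId(jobId).now().string
    done = JsonPath.read(json, '$.status.jobComplete')
    print '.'
}
println ''
println 'Job status: ' + done

// List the status directory ([exit, stderr, stdout]) and print stdout.
println JsonPath.read(Hdfs.ls(session).dir(statusDir).now().string,
        '$.FileStatuses.FileStatus[*].pathSuffix')
println Hdfs.get(session).from("${statusDir}/stdout").now().string

session.shutdown()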

All code is well commented and it is suggested that you browse the build.gradle and *.groovy scripts to understand in more detail how they work.