This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMS and file systems, so that data can be loaded into Hadoop and unloaded from it.

Goal

Connect Hadoop seamlessly with different systems and data sources. Performance must be kept in mind, since we are dealing with petabytes of data. As far as possible, use high performance vendor extensions to achieve the desired scale.

Status

HIHO is in active development; the latest major release is 0.3.0. Check the README for the features in the current release.

HIHO helps copy large amounts of data between different databases and Hadoop. It allows easy import and export of data between traditional RDBMS and the Hadoop Distributed File System (HDFS). If you want to move data from your database into Hadoop for further processing, or if you are done with Hadoop processing and want to load your crunched data back into the database so that your users can view the reports, HIHO helps you do that.

HIHO allows user specified, query based integration with JDBC compliant databases: the user specifies both the query to run against the database and the query that determines how the data is split across the mappers. Single table imports to HDFS are also supported, as is loading data from Hadoop back into databases. In addition, HIHO connects to Salesforce and FTP servers.
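To make the query-plus-split-query idea concrete, here is a minimal sketch using the stock Hadoop DataDrivenDBInputFormat rather than HIHO's own job classes; the JDBC URL, credentials, table and column names, and the OrderRecord record type are illustrative assumptions, not part of HIHO.

```java
// A minimal sketch of query based import with range splitting, shown with the
// stock Hadoop DataDrivenDBInputFormat (not HIHO's own job classes).
// Connection details, table/column names and OrderRecord are assumptions.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class QuerySplitSketch {

  // A hypothetical record type mapping one row of the query result.
  public static class OrderRecord implements Writable, DBWritable {
    long id;
    double amount;
    String createdAt;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      amount = rs.getDouble(2);
      createdAt = rs.getString(3);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setDouble(2, amount);
      ps.setString(3, createdAt);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      amount = in.readDouble();
      createdAt = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeDouble(amount);
      out.writeUTF(createdAt);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/sales", "user", "password");

    Job job = new Job(conf, "query based import sketch");
    job.setInputFormatClass(DataDrivenDBInputFormat.class);

    // The query that fetches the records; $CONDITIONS is replaced per mapper
    // with a range predicate so each mapper reads a disjoint slice of rows.
    String inputQuery =
        "SELECT id, amount, created_at FROM orders WHERE $CONDITIONS";
    // The split (bounding) query tells the framework the low/high values of
    // the split column, from which the per-mapper ranges are derived.
    String boundingQuery = "SELECT MIN(id), MAX(id) FROM orders";

    DataDrivenDBInputFormat.setInput(job, OrderRecord.class,
        inputQuery, boundingQuery);

    // ... set mapper, output format and output path, then:
    // job.waitForCompletion(true);
  }
}
```

HIHO exposes the same two pieces of user input: the data query and the query that yields the split boundaries.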

Main Features

  • User specified query and split query for importing data into Hadoop.
  • Range based splitting of input records, with no LIMIT or OFFSET clauses. The user can specify how the records are split amongst the mappers.
  • Support for JDBC compliant databases.
  • Single table imports to HDFS.
  • No code generation.
  • Transformation of input results to delimited records. Choice of delimiter.
  • Transformation of input results to Avro GenericRecord (http://search-hadoop.com/jd/avro/org/apache/avro/generic/GenericRecord.html).
  • Reusable GenericDBWritable and AppendFileOutputFormat in your applications for custom formats.
  • Optimized loading to MySQL and Oracle. The existing Hadoop DBOutputFormat takes a record based approach, creating multiple inserts into the database, which is inherently slow (see Speed of Inserts in MySQL: http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html). By utilising MySQL and Oracle specific extensions, we are able to bulk load existing Hadoop data files into the database; see the sketch after this list.
  • Jobs can be run from the command line.
  • Integration with Salesforce: get data from Salesforce into Hadoop.
  • Integration with FTP servers. Load files to and from FTP servers.
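To illustrate the bulk-load point from the features list, the sketch below issues MySQL's LOAD DATA LOCAL INFILE over plain JDBC against a delimited file such as one produced by a Hadoop job. HIHO drives this from within its output machinery; the standalone program, connection details, file path and table name here are assumptions.

```java
// Sketch only: bulk load a delimited file into MySQL with LOAD DATA LOCAL
// INFILE, the kind of vendor extension the optimized loading feature relies
// on, instead of issuing one INSERT per record as the generic DBOutputFormat
// does. The JDBC URL, credentials, file path and table name are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost/sales", "user", "password");
    try {
      Statement stmt = conn.createStatement();
      // One bulk statement loads the whole file, which is far faster than
      // record-at-a-time inserts for large data volumes.
      stmt.execute(
          "LOAD DATA LOCAL INFILE '/tmp/part-r-00000' " +
          "INTO TABLE orders FIELDS TERMINATED BY ','");
      stmt.close();
    } finally {
      conn.close();
    }
  }
}
```

Depending on the MySQL Connector/J version, LOCAL infile loading may need to be enabled on the client side (for example via the allowLoadLocalInfile connection property).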

Bugs/Issues/Feature Requests

We look forward to your suggestions and ideas. If you see any bug, or if you want to request a feature, please drop us an email at support@nubetech.co.
