Skip to content
Ashay Shirwadkar edited this page Jul 15, 2019 · 7 revisions

Apache Arrow

Overview

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Source: https://arrow.apache.org/

Terminologies

Tables are collections of columns in accordance to a schema. They are the most capable dataset-providing abstraction in Arrow.

Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.

Columns contains the actual data in the form of Chunked Arrays. Chunked Array is a data structure managing a list of primitive Arrow arrays logically as one large array. An array holds actual data in contagious memory. Following diagram shows relation between chunked array and array.

Following is a logical representation of Arrow Table

Table {
    Schema {
        Metadata {
            keys:vec;   // Vector of strings
            vals:vec;   // Vector of strings
        }
        Fields:vec {
            name:string;    // Column name
            type:datatype;  // Column data type
        }
    }
    Column:vec {
        name:string;
        type:datatype;
        ChunkedArray {
            chunks:array   // Array which holds actual data.
        }
    }
}

Creating table

Prior adding data, add data type and name of the column in schema

std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
    arrow::field("colname_1", arrow::int8()), arrow::field("colname_2", arrow::uint16())};

auto schema = std::make_shared<arrow::Schema>(schema_vector);

For adding values into the column, Apache Arrow has provided builder classes for each data type. Therefore, for two columns above we need to define instances of the builder class

arrow::Int8Builder int8_builder(pool);
arrow::UInt16Builder uint16_builder(pool);

Add the elements using Append function provided builder class

int8_builder.Append(10);
uint16_builder.Append(12);

At the end, we finalise the arrays, declare the (type) schema and combine them into a single arrow::Table:

std::shared_ptr<arrow::Array> int8_array;
ARROW_RETURN_NOT_OK(int8_builder.Finish(&int8_array));
std::shared_ptr<arrow::Array> uint16_array;
ARROW_RETURN_NOT_OK(uint16_builder.Finish(&uint16_array));

arrow::Table table = arrow::Table::Make(schema, {int8_array, uint16_array});

Print Arrow table

Each column in arrow is represented using Chunked Array. A chunked array is a vector of chunks i.e. arrays which holds actual data.

          Chunked Array		 Array 1
              +---+	             +--+--+--+
              |   +------------> |	|  |  |
              +---+		     +--+--+--+
              |   +---+
              +---+	  |	 +--+--+--+
              |...|	  +----->|  |  |  |
              +---+		 +--+--+--+
	                    Array 2

Therefore, for printing a column will require to iterate through all the arrays inside a chunked array. Consider we have n rows and m columns in a table, and in each column (i.e. chunked_array) we have 3 chunks (i.e. arrays). As we have n rows, each column will have n elements and they are distributed sequentially in all the chunks. To print this table is to print each rows. For printing 1st row, we need to get 1st element from col 1, then 1st element from col 2, so on till 1st element of col m. In terms of arrow, we need to print 1st element in 1st chunk of 1st chunked array then print 1st element in 1st chunk of 2nd chunked array, so on till mth chunked_array. We will print like this until we exahust 1st chunk of each chunked_array. Then we will move to 2nd chunked array and print elements inside it.

Representation of Skyhook Table

Skyhook Metadata

Skyhook metadata is mapped into arrow schema as key-value pairs under arrow->schema->metadata. The index of key and value under metadata vector is represented by

enum arrow_metadata_t {
    METADATA_SKYHOOK_VERSION,
    METADATA_DATA_SCHEMA_VERSION,
    METADATA_DATA_STRUCTURE_VERSION,
    METADATA_DATA_FORMAT_TYPE,
    METADATA_DATA_SCHEMA,
    METADATA_DB_SCHEMA,
    METADATA_TABLE_NAME,
    METADATA_NUM_ROWS
};

For example, you can get number of rows using following code

auto schema = table->schema();
auto metadata = schema->metadata();
int num_rows = metadata->value(METADATA_NUM_ROWS);

RID and Deleted Vector

In case of flatbuffers RID is stored as a separate table i.e. table ROW_TABLE inside the flatbuffer schema and deleted vector is represented as a bitmap inside table ROW. But in case of arrow, both are represented as different column inside the arrow table. They are always the last two columns at the end of the arrow table. Using the following macro declared

#define ARROW_RID_INDEX(cols) (cols)
#define ARROW_DELVEC_INDEX(cols) (cols + 1)
Clone this wiki locally