Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of StructuredFile data endpoint to support Euca ingest #145

Merged
merged 12 commits into from
May 25, 2017

Conversation

smgallo
Copy link
Contributor

@smgallo smgallo commented May 19, 2017

This PR is a rewrite of the StructuredFile data endpoint to support ingesting Eucalyptus accounting records. A number of small debugging enhancements implemented during testing are also included.

This PR is a prerequisite for ubccr/xdmod-value-analytics#14.

Description

Structured files contain data in a known (structured) format including CSV, tab-delimited, and JSON. Records are separated by an optional Record Separator (RS) and individual fields may be separated by a Field Separator (FS) if the format allows or requires it (e.g., CSV and TSV). Typically, we will read a structured file by iterating over the records and operating on the fields within the record.

This PR implements a StructuredFile data endpoint that implements Iterator, allowing ETL actions to parse the file and then simply aggregate over the endpoint to access the parsed records. While JSON documents are the only structured file we currently use, CSV/TSV files will be supported in the future and functionality to support them has been included (e.g., field separator and record names). Much of the workflow for handling structured files is contained in aStructuredFile, allowing the classes that support particular formats such as JSON to implement only the portions that are required for their particular format such as parsing the file format and decoding the individual records. The StructuredFile ingestor has been updated accordingly.

In addition, a flexible method for filtering input files through one or more external processes has been implemented. ETL\DataEndpoint\Filter\ExternalProcess allows us to specify a pipeline of process that data will be sent through before it is seen by the data endpoint itself. This is implemented using PHP stream filters that are attached directly to a file handle on fopen() in aStructuredFile::attachFilters() and process the input stream before it is accessed by fread(). The ExternalProcess class opens input and output pipes to the external process and sends data through this process allowing any program that accepts input in stdin and produces output on stdout to act as a filter. This includes jq, sed, awk, and even gzip. An example of a filter specification on a data endpoint is:

"endpoints": {
    "source": {
        "type": "jsonfile",
        "name": "Value Analytics People Input File (Filtered to Organizations)",
        "path": "value_analytics/people.json",
        "filters": {
            "jq": {
                "type": "external",
                "path": "jq",
                "arguments": "'map({name: .organizations[].name}) | unique'"
            }
        }
    }
}

Examples

CSV

  • RS: \n (UNIX), \r\n (Windows), \r (OSX)
  • FS: , (CSV), \t
one, two, "three blind mice"

JSON

  • Optional RS: \n (UNIX), \r\n (Windows), \r (OSX)

JSON documents can represent arrays and objects and therefore do not require a RS and do not support a FS. For documents that do not specify a RS it is assumed that the document will be either a single record or an array of records.

A single file containing a 2-dimensional array with no RS. This data is parsed all at once and the individual arrays are available through the iterator.

[
    [1, 2, 3, 4],
    [2, 3, 5, 6]
]

An single file containing an array of objects with no RS where the record names are the object keys. This data is parsed all at once and the individual objects are available through the iterator.

[
    {
        "first": "Steve",
        "last": "Gallo",
        "middle": "M."
    },{
        "first": "Ben",
        "last": "Plessenger"
    }
]

A single file containing one object per line (RS = EOL). The data may be streamed or parsed all at once.

{ "first": "Steve", "last": "Gallo", "middle": "M." }
{ "first": "Ben", "last": "Plessenger" }

Motivation and Context

Needed a more flexible method for ingesting JSON data in a variety of configurations.

Tests performed

Unit tests were added to cover the following cases:

  • File data endpoint
    • Test trying to read a directory instead of a file
    • Test trying to open a file with an invalid mode
    • Test reading a simple string from a file
  • StructuredFile data endpoint (extends File)
    • Test parsing a simple JSON file containing an array of objects (using sample XDMoD-VA data)
    • Test parsing a simple JSON file containing multiple objects, each on a single line, separated by a newline (using sample Euca accounting records)
    • Test error reporting when config is not valid
    • Test error reporting when a filter type is not provided
    • Test filter syntax error
    • Test parsing a simple JSON file containing an array of objects filtered through an external process (jq)
    • Test parsing a simple JSON file containing multiple records separated by a newline and filtered through an external process (jq)
    • Test successful JSON schema validation
    • Test failed JSON schema validation

Testing of parsing JSON files was performed by running the XDMoD-VA ETL and comparing the data to a dump of the modw_value_analytics database on the "VA Demo" site (see PR ubccr/xdmod-value-analytics#14). In addition, all JSON configuration files that are parsed go through this data endpoint.

The CloudAssetTypeIngestor action uses an in-line JSON document (which will be depricated at a later date) and was used to verify this functionality.

$ ./etl_overseer.php -c ../../../etc/etl/etl.json -a CloudAssetTypeIngestor -v info
2017-05-19 15:04:25 [info] Command: './etl_overseer.php' '-c' '../../../etc/etl/etl.json' '-a' 'CloudAssetTypeIngestor' '-v' 'info'
2017-05-19 15:04:25 [notice] dw_extract_transform_load start (process_start_time: 2017-05-19 15:04:25)
2017-05-19 15:04:25 [info] Loading configuration file /home/smgallo/xdmod-structured-file/etc/etl/etl.json
2017-05-19 15:04:26 [info] Parsing /home/smgallo/xdmod-structured-file/etc/etl/etl.json
...
2017-05-19 15:04:26 [info] Verifying endpoint: ('Utility DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Verifying endpoint: ('Cloud DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw_cloud, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Verifying endpoint: ('Cloud DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw_cloud, host=tas-db1-nofw.ccr.buffalo.edu:3306, user=xdmod)
2017-05-19 15:04:26 [info] Create action CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor)
2017-05-19 15:04:26 [info] Parse definition file: '/home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json'
2017-05-19 15:04:26 [info] Loading configuration file /home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json
2017-05-19 15:04:26 [info] Parsing /home/smgallo/xdmod-structured-file/etc/etl/etl_tables.d/cloud/asset_type.json
2017-05-19 15:04:26 [info] Verifying action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor)
2017-05-19 15:04:26 [info] Obtaining lock file '/var/lock/xdmod/etlv2_25727'
2017-05-19 15:04:26 [info] start (action_name: CloudAssetTypeIngestor, action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_date: , end_date: )
2017-05-19 15:04:26 [info] Truncate destination table: `modw_cloud`.`asset_type`
2017-05-19 15:04:26 [info] Table does not exist: '`modw_cloud`.`asset_type`', skipping.
2017-05-19 15:04:26 [notice] Table `modw_cloud`.`asset_type` does not exist, creating.
2017-05-19 15:04:26 [info] Truncate destination table: `modw_cloud`.`asset_type`
2017-05-19 15:04:26 [info] Process date interval (1/1) (start: none, end: none)
2017-05-19 15:04:26 [notice] (action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_time: 1495220666.0755, end_time: 1495220666.0853, elapsed_time: 0.00976, records_loaded: 5)
2017-05-19 15:04:26 [info] ETL\Ingestor\StructuredFileIngestor: Rows Processed: 5 records (Time Taken: 0.01 s)
2017-05-19 15:04:26 [notice] (action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor), start_time: 1495220666.0739, end_time: 1495220666.0858, elapsed_time: 0.01188, records_examined: 5, records_loaded: 5)
2017-05-19 15:04:26 [info] end (action_name: CloudAssetTypeIngestor, action: CloudAssetTypeIngestor (ETL\Ingestor\StructuredFileIngestor))
2017-05-19 15:04:26 [info] Releasing lock '/var/lock/xdmod/etlv2_25727'
2017-05-19 15:04:26 [notice] dw_extract_transform_load end (process_end_time: 2017-05-19 15:04:26)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added enhancement Enhancement of the functionality of an existing feature Category:ETL Extract Transform Load labels May 19, 2017
@smgallo smgallo added this to the v6.7.0 milestone May 19, 2017
Copy link
Contributor

@tyearke tyearke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good. The only overarching concern I had was about how to specify the order filters are applied to a file if there's more than one filter. It looks like the order is currently determined by the order the key-value pairs are given in a config file. This implicitly works given the current way PHP's built-in JSON decoder converts the data to PHP objects, but this behavior is not guaranteed by the JSON specification.

An object is an unordered set of name/value pairs.

I suggest instead using an array for the filters to explicitly encode the order they are applied. If the keys that would be lost are useful for debugging, perhaps the filter objects could take on a new attribute like name in place of the key (or perhaps the index is good enough).

* @param array $propertyList The list of required properties
* @param array $missing Optional reference to an array that will contain a list of the
* missing properties.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment block needs an @return.

* the values are property types.
* @param array $messages Optional reference to an array that will contain a list of
* messages regarding the property types.
* @param int $skipMissingProperties If set to FALSE, properties that are not present in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This int should be a boolean.

* Verify the types of object properties, optionally skipping properties that are not
* present in the object. Property types must match the PHP is_*() methods (e.g.,
* is_int(), is_object(), is_string()) and will silently be skipped if a function
* corresponding to the specified type does not exist.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a type that can't be checked by a built-in PHP function is given (whether purposefully or by accident), could there be some kind of explicit fallback behavior instead of silently ignoring the property? Perhaps an instanceof test or just adding a message if an untestable type is given.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I'll add a message to the returned message list.


fclose($this->pipes[1]);
fclose($this->pipes[2]);
fclose($this->pipes[3]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for $this->pipes specifies indices 0 through 2, but these lines close entries 1 through 3.

const DEFAULT_READ_BYTES = 4096;

/**
* The path to a JSON Schema describing each record the JSON file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structured file might not be JSON or use JSON schema for validation.


if ( null === $comparisonColumn ) {
$comparisonColumn = $firstCol;
print "WARNING: No non-nlullable columns, potential for false positives." . PHP_EOL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a minor typo in "non-nullable".

@smgallo
Copy link
Contributor Author

smgallo commented May 24, 2017

@tyearke the array vs object order is a good point - I'll change it to be an array of objects so we can better specify order. The keys are used for debugging but I was updating my design documentation and came up with a better way to handle the logging and identification so losing them won't matter in the long run.

@smgallo
Copy link
Contributor Author

smgallo commented May 25, 2017

@tyearke Requested changes have been made (as well as corresponding changes to ubccr/xdmod-value-analytics#14)

if ( ! is_object($config) ) {
$this->logger->warning(sprintf(
"Filter config must be an object, '%s' given. Skipping.",
$filterKey,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a stray reference to a no-longer-existent variable here, but other than that, looks good to go.

@smgallo smgallo merged commit 66ba63a into ubccr:xdmod6.7 May 25, 2017
@smgallo smgallo deleted the etl/structured-file branch May 25, 2017 15:17
@tyearke tyearke modified the milestones: v7.0.0, v6.7.0 Jun 6, 2017
ryanrath pushed a commit to ryanrath/xdmod that referenced this pull request Jul 24, 2017
…#145)

* Update commit reference for xdmod-test-artifacts
* Add functions for verifying types of stdClass properties
* Rename specs_dir -> schema_dir
* Rewrite StructuredFile data endpoint to allow record separators and data filtering by external processes
* Add tests for File and StructuredFile data endpoints
* When verifying tables select a non-nullable column for comparing LEFT OUTER JOIN results to reduce chance of false positives
* Return distinct negative values from compare() to facilitate debugging
* Change StructuredFile filter definition to an array. Update tests accordingly.
chakrabortyr pushed a commit to chakrabortyr/xdmod that referenced this pull request Oct 17, 2017
…#145)

* Update commit reference for xdmod-test-artifacts
* Add functions for verifying types of stdClass properties
* Rename specs_dir -> schema_dir
* Rewrite StructuredFile data endpoint to allow record separators and data filtering by external processes
* Add tests for File and StructuredFile data endpoints
* When verifying tables select a non-nullable column for comparing LEFT OUTER JOIN results to reduce chance of false positives
* Return distinct negative values from compare() to facilitate debugging
* Change StructuredFile filter definition to an array. Update tests accordingly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category:ETL Extract Transform Load enhancement Enhancement of the functionality of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants