Update Structured File Ingestor #180

Merged 22 commits on Jul 19, 2017


@smgallo smgallo commented Jul 7, 2017

Update the StructuredFileIngestor to utilize the StructuredFile data endpoint.

This PR is a prerequisite for the following PRs:

Description

This work integrates the new StructuredFile data endpoint into the StructuredFileIngestor.

StructuredFileIngestor

Since the StructuredFile endpoint implements Iterator, we can now simply iterate over the endpoint to access the data records in the file (or in multiple files if using the DirectoryScanner endpoint). Common functionality was consolidated in parent classes to make it more broadly useful to all ingestors. For example, the code generating the destination_record_map was moved from pdoIngestor to aRdbmsDestinationAction so it can be used by StructuredFileIngestor, which greatly simplifies the ingestor and makes the specification of actions that use it more concise.
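The iteration pattern described above can be sketched roughly as follows. This is an illustrative Python sketch, not the XDMoD PHP code: the class `StructuredFileEndpoint`, the function `ingest`, and the flat destination-to-source shape of `destination_record_map` are all assumptions made for the example.

```python
import json
from typing import Any, Dict, Iterator, List


class StructuredFileEndpoint:
    """Hypothetical stand-in for a file endpoint that can be iterated."""

    def __init__(self, text: str) -> None:
        # Parsed eagerly for brevity; a real endpoint would likely stream.
        self._records: List[Dict[str, Any]] = json.loads(text)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self._records)


def ingest(endpoint, destination_record_map: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build one destination row per source record.

    The map is assumed to be {destination_column: source_field}.
    """
    rows = []
    for record in endpoint:  # same loop shape as a database result set
        rows.append(
            {dest: record.get(src) for dest, src in destination_record_map.items()}
        )
    return rows


source = '[{"id": 1, "name": "cpu"}, {"id": 2, "name": "gpu"}]'
rows = ingest(StructuredFileEndpoint(source), {"unit_id": "id", "unit_name": "name"})
print(rows)
```

The ingest loop does not need to know whether the records came from a file or a query, which is the consolidation the description is aiming for.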

We have deprecated the use of the source_data and destination_field directives in the action configuration for structured files. Data that was previously contained directly in the action definition has been moved to its own data files in the etl_data.d directory. Since the data was entirely made up of JSON arrays, a header record was added containing the field names (this is now supported by the StructuredFile endpoint as described below). The destination_field directives have also been removed and replaced by the destination_record_map directive (now automatically generated in many cases). This simplifies configuration and allows us to separate action configuration from data.

StructuredFile Data Endpoint

The StructuredFile endpoint now supports auto-discovery of the fields in records and also supports returning a subset or superset of the fields using the following directives:

  • header_record A boolean where true (the default value) indicates that the first record is a header record containing field names. This may be ignored by the implementation if a specific file format (such as a JSON object) does not support this behavior.
  • field_names An array of field names to return. Different file formats may interpret field names differently, but all implementations are expected to return data for all fields specified here. If a field does not exist in the data, its value is expected to be null. If this option is not present, all existing fields in the record are returned. Note that the existing record fields are typically determined by examining the first record present.

CSV/TSV files (to be supported in the future) may contain an optional header row specifying field names. If a header row exists it contains the field names and is skipped for data import. If the header row does not exist, the field names must be specified in the endpoint options. There is no support for optional fields.

  • header_record If true, skip the first record for data and use its values as the field names.
  • field_names: [...] If header_record is false, field_names is required; if the file contains more fields than there are field names, the additional data fields are ignored. If header_record is true, field_names specifies the fields to return, with null values for any requested field that does not exist in the data (e.g., when more fields are requested than exist in the data).

JSON arrays are treated similarly to CSV/TSV files.

  • header_record If true, skip the first record and use its values as the field names.
  • field_names: [...] If header_record is false, field_names is required; if the file contains more fields than there are field names, the additional data fields are ignored. If header_record is true, field_names specifies the fields to return, with null values for any requested field that does not exist in the data.
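The header_record and field_names rules for array-style records (JSON arrays, and CSV/TSV once supported) can be sketched as below. This is a Python illustration of the behavior described above under stated assumptions, not the endpoint's PHP implementation; the function name `array_records` and its signature are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Sequence


def array_records(
    rows: Iterable[Sequence[Any]],
    header_record: bool = True,
    field_names: Optional[Sequence[str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per data row, following the directive rules above."""
    it = iter(rows)
    if header_record:
        try:
            available = list(next(it))  # first record holds names, not data
        except StopIteration:
            return  # empty input: nothing to yield
        requested = list(field_names) if field_names is not None else available
    else:
        if field_names is None:
            raise ValueError("field_names is required when header_record is false")
        available = list(field_names)
        requested = available
    for row in it:
        # zip stops at the shorter input, so extra data fields are ignored.
        values = dict(zip(available, row))
        # Requested fields that are missing from the data come back as None.
        yield {name: values.get(name) for name in requested}


data = [["id", "name"], [1, "cpu"], [2, "gpu"]]
print(list(array_records(data)))
print(list(array_records(data, field_names=["name", "label"])))
print(list(array_records([[1, "cpu", "extra"]], header_record=False,
                         field_names=["id", "name"])))
```

Because this sketch is a generator, the missing-field_names error is raised when iteration starts rather than when the function is called.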

JSON objects specify field names as keys in the object. If field_names is not specified, the keys of the first object will be examined and used as the field names. Support for optional fields is handled by specifying the complete set of possible fields using field_names and returning null values for fields not present in the record.

  • header_record Ignored for this data type.
  • field_names: [...] If specified, return these record fields with null values for any fields that are not found in the record. If the data contains optional fields, then field_names must be specified and must contain the full list of possible field names. If field_names is not specified and subsequent data contains more fields than were found in the first record, those fields will be ignored and only fields found in the first record will be returned.
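For JSON objects, the field discovery and optional-field behavior described above can be sketched as follows. Again this is a hedged Python illustration rather than the actual PHP endpoint; `object_records` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence


def object_records(
    objects: Iterable[Dict[str, Any]],
    field_names: Optional[Sequence[str]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield records restricted to a fixed set of field names."""
    names: Optional[List[str]] = list(field_names) if field_names is not None else None
    for obj in objects:
        if names is None:
            # No field_names given: discover them from the first record only.
            names = list(obj.keys())
        # Missing fields become None; fields outside `names` are dropped.
        yield {name: obj.get(name) for name in names}


objs = [{"id": 1}, {"id": 2, "note": "late arrival"}]
print(list(object_records(objs)))                  # "note" is ignored
print(list(object_records(objs, ["id", "note"])))  # "note" is None for the first record
```

The first call shows why data with optional fields needs an explicit field_names list: a field that first appears after the first record is otherwise silently dropped.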

Motivation and Context

Preparatory work for ingesting Eucalyptus data. The end goal is to be able to use the StructuredFile data endpoint to parse JSON, CSV/TSV, and resource manager log files. This will allow us to consolidate ingestion code in one place and simply iterate over parsed results the same way we do for database result sets.

Tests performed

All existing tests pass. In addition, ingestion was performed on all of the data for XDMoD-VA (JSON files), Resource Allocations (XDCDB query, JSON file, Update ingestor), job dimension tables (JSON files), and XDCDB jobs (XDCDB query, MySQL query, ingestion, aggregation).

XDMoD-VA

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p value-analytics -o "experimental_enable_batch_aggregation=true"

./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people -x jobs_person_id
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people_organizations -x division -x appointment_type --ignore-column-type
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t people_groups \
  -t people_identifiers \
  -t organizations \
  -t identity_providers \
  -t groups \
  -t grant_types \
  -t grants_people \
  -t grants \
  -t grant_roles \
  -t funding_agencies
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_year -x pi_jobs_person_id -w 'src.year_id < 201700000'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_quarter -x pi_jobs_person_id -w 'src.quarter_id < 201700002'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_month -x pi_jobs_person_id -w 'src.month_id < 201700005'
./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest -n 1 -v info \
  -t va_grant_fact_by_day -x pi_jobs_person_id -w 'src.day_id < 201700145'

Resource Allocations

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p jobs-common

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -t countable_type \
  -t job_record_type \
  -t job_task_type \
  -t unit \
  -t submission_venue

Common Job Tables

./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p jobs-common

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -t countable_type \
  -t job_record_type \
  -t job_task_type \
  -t unit \
  -t submission_venue

XDCDB Jobs

./etl_overseer.php -c ../../../etc/etl/etl.json -s "2017-01-01 00:00:00" -e "2017-01-31 23:59:59" -v debug -p jobs-xdcdb

./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -x last_modified \
  -t job_records \
  -t job_tasks
./verify_table_data.php -s modw_cloud_baseline -d modw_cloud_etltest -n 1 -v info \
  -x last_modified \
  -t jobfact_by_year \
  -t jobfact_by_month \
  -t jobfact_by_quarter \
  -t jobfact_by_year

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added this to the v7.0.0 milestone Jul 7, 2017
@smgallo smgallo added the labels bug (Bugfixes), enhancement (Enhancement of the functionality of an existing feature), and Category:ETL (Extract Transform Load) Jul 7, 2017

@plessbd plessbd left a comment


Remove commented out code then LGTM

@@ -770,11 +773,13 @@ protected function _execute($aggregationUnit)
$sourceJoins = $this->etlSourceQuery->joins;
$firstJoin = array_shift($sourceJoins);
$newFirstJoin = clone $firstJoin;
$newFirstJoin->setName($tmpTableName);
// $newFirstJoin->setName($tmpTableName);

remove commented out code

$newFirstJoin->schema = $this->sourceEndpoint->getSchema();

$this->etlSourceQuery->deleteJoins();
$this->etlSourceQuery->addJoin($newFirstJoin);
// $this->etlSourceQuery->deleteJoins();

remove commented out code

@@ -788,8 +793,9 @@ protected function _execute($aggregationUnit)


remove commented out code

$this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);
// $this->etlSourceQuery->deleteJoins();
$this->etlSourceQuery->joins = array($this->etlSourceQueryOrigFromTable);
// $this->etlSourceQuery->addJoin($this->etlSourceQueryOrigFromTable);

remove commented out code

@smgallo smgallo merged commit 7649467 into ubccr:xdmod7.0 Jul 19, 2017
jtpalmer pushed a commit to jtpalmer/xdmod that referenced this pull request Jul 24, 2017
* Add support for iterating over a StructuredFile data endpoint
* Improve comments, logging, debugging output, and phpcs style fixes
* Move handling of destination_record_map config directive from pdoIngestor to aRdbmsDestinationAction
* Consolidate common functionality. Move pre- and post-execute tasks into aAction and aRdbmsDestinationAction
* Add handling of requested and expected record fields to StructuredFile endpoint plus associated tests
* Apply command-line where clause to table row count calculation in table verifier
* Fix support for passing overrides on the ETL command line
* Allow public properties of any object type to be verified
* Fix bugs in experimental aggregation from update to DbModel
* Add handling of record fields to StructuredFile endpoint
* Update ingestors to use updated StructuredFile endpoint
* Update sample ETL files
* Move source data out of ETL table definitions into data directory. Update ETL config to use new endpoint
ryanrath pushed a commit to ryanrath/xdmod that referenced this pull request Jul 24, 2017
@smgallo smgallo deleted the etl/structured-file-ingestor branch September 22, 2017 17:39
chakrabortyr pushed a commit to chakrabortyr/xdmod that referenced this pull request Oct 17, 2017