Downloader Users Guide
The herd downloader application is a command line program that copies data (files and directories) registered with the herd Registry from an S3 bucket to the local file system. As part of the download, a "manifest.json" side-car file is also created.
The JAR is built as part of the herd application suite in the dm-tools project.
The downloader uses the Amazon S3 SDK which downloads files into the system temporary directory (e.g. /tmp). You should ensure there is adequate space in the temporary directory if large numbers of files are downloaded.
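As a pre-flight check before a large download, the free space in the temporary directory can be compared with the expected download size. The following is a minimal Python sketch and not part of herd: the function name `has_enough_temp_space` and the `required_bytes` parameter are hypothetical, and it assumes the JVM uses the same temporary directory that Python reports.

```python
import shutil
import tempfile


def has_enough_temp_space(required_bytes, temp_dir=None):
    """Return True if temp_dir has at least required_bytes free."""
    # Assumption: the JVM's temporary directory matches Python's default one.
    temp_dir = temp_dir or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(temp_dir).free
    return free_bytes >= required_bytes
```

For example, `has_enough_temp_space(5 * 1024 ** 3)` reports whether about 5 GiB is free; pass `temp_dir` explicitly if the JVM is configured with a different temporary directory.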
java -jar dm-downloader.jar
[-a <S3AccessKey>]
[-p <S3SecretKey>]
[-e <S3Endpoint>]
-l <LocalDirPath>
-m <ManifestFilePath>
-H <RegServerHost>
-P <RegServerPort>
[-s true]
[-u <username>]
[-w <password>]
[-n <HttpProxyHost>]
[-o <HttpProxyPort>]
[-t <MaxThreads>]
[-c <socketTimeout>]
- Required: No
- Type: String
The AWS access key ID used to identify the user making S3 service requests. When specified, make sure the s3SecretKey is also specified. If the s3AccessKey and s3SecretKey parameters aren't both specified, the AWS Java default credential provider chain is used to find credentials. If no credentials are found, an error results. See the following link for more details: http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html.
- Required: No
- Type: String
The AWS secret access key to be used to authenticate the user making S3 service requests. When specified, make sure the s3AccessKey is also specified.
- Required: No
- Type: String
The optional Amazon S3 endpoint to use when making S3 service calls.
- Required: Yes
- Type: String
The path to a local directory, relative to which the downloaded files will be created. The local path and the S3 path, which was prepended to the data when uploaded, are used together to build the target local directory where the downloaded files will be created. Please note that the target local directory must be empty, but the local path does not have to be.
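The emptiness requirement on the target directory can be checked before running the tool. The sketch below is illustrative only: it assumes the target directory is the local path joined with the S3 key prefix unchanged, and it treats a directory that does not exist yet as acceptable. Neither assumption is confirmed by this guide, and the names `resolve_target_dir`, `is_missing_or_empty`, and `s3_key_prefix` are hypothetical.

```python
import os


def resolve_target_dir(local_path, s3_key_prefix):
    """Join the local path and the S3 key prefix into one directory path."""
    # Assumption: the S3 key prefix is appended to the local path as-is.
    parts = s3_key_prefix.strip("/").split("/")
    return os.path.join(local_path, *parts)


def is_missing_or_empty(directory):
    """Return True if the directory does not exist or has no entries."""
    return not os.path.isdir(directory) or not os.listdir(directory)
```

A wrapper script could call `is_missing_or_empty(resolve_target_dir(local_path, prefix))` and stop early with a clear message instead of letting the download fail.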
- Required: Yes
- Type: String
Local path to the manifest file.
- Required: Yes
- Type: String
Registration Server hostname.
- Required: Yes
- Type: Integer
Registration Server port.
- Required: Yes
- Type: String
DEPRECATED. Use regServerHost parameter.
- Required: Yes
- Type: Integer
DEPRECATED. Use regServerPort parameter.
- Required: No
- Type: Boolean
- Default: false
If set to true, enables SSL (HTTPS) to communicate with the herd Registration Service. Otherwise, uses HTTP.
- Required: No
- Type: String
The username used for HTTPS client authentication with the herd Registration Service.
Note: To avoid problems parsing a username that contains spaces, enclose the username in "" (double quotes).
- Required: No
- Type: String
The password used for HTTPS client authentication with the herd Registration Service.
Note: To avoid problems parsing the password, enclose the password in "" (double quotes).
- Required: No
Display usage information and exit.
- Required: No
Display version information and exit.
- Required: No
- Type: String
The hostname of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyPort is also specified when using this option.
- Required: No
- Type: Integer
The port number of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyHost is also specified when using this option.
- Required: No
- Type: Integer
- Default: 10
The maximum number of threads to use during file transfers. If this argument isn't specified, a suitable default will be used. Amazon does a good job of determining how many threads to use, so using this option is not recommended unless there is a specific need. Expect roughly 55 Mbps of throughput per thread, and run the tool on a machine sized for the required performance.
- Required: No
- Type: Integer
- Default: 50000
- Release: 0.18.0
The socket timeout in milliseconds. 0 indicates no timeout.
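The per-thread throughput figure given for the maximum-threads option can be turned into a rough transfer-time estimate. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes "Mbps" means megabits per second and that throughput scales linearly with the thread count, which network or disk limits may prevent in practice. The function name is hypothetical.

```python
def estimate_download_seconds(total_bytes, threads=10, mbps_per_thread=55.0):
    """Estimate transfer time assuming linear scaling across threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Assumption: 1 Mbps = 1,000,000 bits per second.
    bits_per_second = threads * mbps_per_thread * 1_000_000
    return total_bytes * 8 / bits_per_second
```

Under these assumptions, 55 GB (55,000,000,000 bytes) over 10 threads works out to 440,000,000,000 bits / 550,000,000 bits per second = 800 seconds.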
The command line program returns zero when execution succeeds and non-zero when execution fails.
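A calling script can rely on this exit-code convention. The following Python sketch shows one way to do so; the helper names are hypothetical, and `build_downloader_command` covers only the required options from the usage above. Passing the arguments as a list also avoids shell quoting issues for values that contain spaces.

```python
import subprocess


def build_downloader_command(jar, local_dir, manifest, host, port):
    """Build the argument list using only the required options."""
    return ["java", "-jar", jar, "-l", local_dir, "-m", manifest,
            "-H", host, "-P", str(port)]


def run_and_check(command):
    """Run a command and return True if it exits with code zero."""
    completed = subprocess.run(command)
    return completed.returncode == 0
```

A wrapper would typically call `run_and_check(build_downloader_command(...))` and treat a `False` result as a failed download.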
The downloader displays output, including errors, on the console. Informational messages, such as key program parameters and the total number of files and bytes copied, are logged.
NOTE: You might see the socket and HTTP exceptions below in the downloader output. These exceptions are safe to ignore, since the AWS Java SDK typically handles them seamlessly. Still, if you observe them, try reducing the number of threads used by the affected downloader instance and/or limiting the number of downloader instances (parallel download jobs) running on the same machine.
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Timeout waiting for connection org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Broken pipe java.net.SocketException: Socket is closed
- ... INFO com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Socket is closed java.net.SocketException: Broken pipe
The information provided in the manifest file, or "side-car" file, is used by the downloader to retrieve information on the business object data registered with the Data Registry. This subsection describes the specification for the manifest file, which includes required and optional fields.
The characteristics of the file should be:
- Name: <manifest_file_name>.json
- Type: Text
- Encoding: UTF8
- Format: JSON
- Required: Yes
- Type: String
- Case sensitive: No
The namespace to which the business object definition belongs.
- Required: Yes
- Type: String
- Case sensitive: No
The business object definition name (e.g. NEW_ORDERS).
- Required: Yes
- Type: String
- Case sensitive: No
The business object format usage (e.g. PRC).
- Required: Yes
- Type: String
- Case sensitive: No
The business object format file type (e.g. ORC).
- Required: No
- Type: Quoted integer
The business object format version (e.g. 0). When the format version is not specified, the business object data with the latest business object format version available for this partition value is returned.
- Required: Yes
- Type: String
- Case sensitive: No
The business object format partition key (e.g. TDATE).
- Required: Yes
- Type: String
- Case sensitive: Yes
The business object data partition value (e.g. 2014-07-21).
- Required: No
- Type: List of String
- Case sensitive: Yes
The business object data sub-partition values.
- Required: No
- Type: Quoted integer
The business object data version (e.g. 0). When the data version is not specified, the latest business object data is returned.
- Required: No
- Type: String
- Case sensitive: No
The name of the storage to download from. Defaults to S3_MANAGED.
{
"namespace": STRING,
"businessObjectDefinitionName": STRING,
"businessObjectFormatUsage": STRING,
"businessObjectFormatFileType": STRING,
"businessObjectFormatVersion": STRING,
"partitionKey": STRING,
"partitionValue": STRING,
"subPartitionValues" : [STRING,STRING,STRING,STRING],
"businessObjectDataVersion": STRING,
"storageName": STRING
}
Below is an example of a manifest file (e.g. manifest.json) that retrieves the latest data version of the NEW_ORDERS processed data for 2014-04-01.
{
"namespace": "APPLICATION_A",
"businessObjectDefinitionName": "NEW_ORDERS",
"businessObjectFormatUsage": "PRC",
"businessObjectFormatFileType": "TXT",
"businessObjectFormatVersion": "2",
"partitionKey": "PROCESS_DATE",
"partitionValue": "2014-04-01"
}
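Before invoking the downloader, a manifest can be checked against the required fields listed in the specification above. The following Python sketch is a hypothetical pre-check and not part of herd; it treats an empty value the same as a missing one, which is an assumption about how the downloader behaves.

```python
import json

# Required field names, taken from the manifest specification above.
REQUIRED_FIELDS = (
    "namespace",
    "businessObjectDefinitionName",
    "businessObjectFormatUsage",
    "businessObjectFormatFileType",
    "partitionKey",
    "partitionValue",
)


def missing_required_fields(manifest_text):
    """Return the required field names absent from a manifest JSON string."""
    manifest = json.loads(manifest_text)
    # Assumption: an empty or null value counts as missing.
    return [name for name in REQUIRED_FIELDS if not manifest.get(name)]
```

An empty list means all required fields are present; `json.loads` raises an error if the file is not valid JSON, which catches problems such as a missing comma early.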
The downloader creates a "manifest.json" side-car file along with the downloaded data. This subsection describes the specification for that manifest file, which includes required and optional fields.
The characteristics of the file should be:
- Name: manifest.json
- Type: Text
- Encoding: UTF8
- Format: JSON
Field Name | Description |
---|---|
namespace | The business object definition namespace. |
businessObjectDefinitionName | The business object definition name. |
businessObjectFormatUsage | The business object format usage. |
businessObjectFormatFileType | The business object format file type. |
businessObjectFormatVersion | The business object format version. |
partitionKey | The business object format partition key. |
partitionValue | The business object data partition value. |
subPartitionValues | The business object data sub-partition values. |
businessObjectDataVersion | The business object data version. |
storageName | The name of the storage. |
manifestFiles | The list of file information. |
fileName | The file name of a manifest file. |
fileSizeBytes | The size in bytes of the contents of the manifest file. |
rowCount | The row count of a manifest file. |
attributes | The list of name/value pairs associated with the data. |
businessObjectDataParents | The list of business object data parents (i.e. predecessors) that were used in the creation of this data. |
businessObjectDefinitionName | The name of the business object definition for a specific business object data parent. |
businessObjectFormatUsage | The business object format usage for a specific business object data parent. |
businessObjectFormatFileType | The business object format file type for a specific business object data parent. |
businessObjectFormatVersion | The business object format version for a specific business object data parent. |
partitionValue | The partition value for a specific business object data parent. |
subPartitionValues | The business object data sub-partition values. |
businessObjectDataVersion | The business object data version for a specific business object data parent. |
businessObjectDataChildren | The list of business object data children (i.e. successors) that are dependent on this business object data. |
{
"namespace": STRING,
"businessObjectDefinitionName": STRING,
"businessObjectFormatUsage": STRING,
"businessObjectFormatFileType": STRING,
"businessObjectFormatVersion": STRING,
"partitionKey": STRING,
"partitionValue": STRING,
"subPartitionValues" : [STRING,STRING,STRING,STRING],
"businessObjectDataVersion": STRING,
"storageName": STRING,
"manifestFiles" : [ {
"fileName" : STRING,
"fileSizeBytes" : NUMBER,
"rowCount" : NUMBER
},
...
],
"attributes": { STRING: STRING, STRING: STRING, ... },
"businessObjectDataParents" : [ {
"businessObjectDefinitionName" : STRING,
"businessObjectFormatUsage" : STRING,
"businessObjectFormatFileType" : STRING,
"businessObjectFormatVersion" : NUMBER,
"partitionValue" : STRING,
"subPartitionValues" : [STRING,STRING,STRING,STRING],
"businessObjectDataVersion" : NUMBER
},
...
],
"businessObjectDataChildren" : [ {
"businessObjectDefinitionName" : STRING,
"businessObjectFormatUsage" : STRING,
"businessObjectFormatFileType" : STRING,
"businessObjectFormatVersion" : NUMBER,
"partitionValue" : STRING,
"subPartitionValues" : [STRING,STRING,STRING,STRING],
"businessObjectDataVersion" : NUMBER
},
...
]
}
Below is an example of a manifest file for the NEW_ORDERS processed data for 2014-04-01.
{
"namespace": "APPLICATION_A",
"businessObjectDefinitionName": "NEW_ORDERS",
"businessObjectFormatUsage": "PRC",
"businessObjectFormatFileType": "TXT",
"businessObjectFormatVersion": "2",
"partitionKey": "PROCESS_DATE",
"partitionValue": "2014-04-01",
"storageName": "S3_MANAGED",
"manifestFiles" : [ {
"fileName" : "testFile1.gz",
"fileSizeBytes" : 10000,
"rowCount" : 1000
}, {
"fileName" : "testFile2.gz",
"fileSizeBytes" : 20000,
"rowCount" : 2000
} ],
"attributes": {"name1": "value1", "name2": "value2"},
"businessObjectDataParents" : [ {
"businessObjectDefinitionName" : "NEW_ORDERS",
"businessObjectFormatUsage" : "SRC",
"businessObjectFormatFileType" : "TXT",
"businessObjectFormatVersion" : 1,
"partitionValue" : "2014-04-01",
"businessObjectDataVersion" : 0
} ],
"businessObjectDataChildren" : [ {
"businessObjectDefinitionName" : "NEW_ORDERS",
"businessObjectFormatUsage" : "PRC2",
"businessObjectFormatFileType" : "TXT",
"businessObjectFormatVersion" : 1,
"partitionValue" : "2014-04-01",
"businessObjectDataVersion" : 0
} ]
}
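The manifestFiles entries in the generated manifest can be used to sanity-check a completed download. The following Python sketch totals the file count, bytes, and rows recorded in a manifest; the function name is hypothetical, and treating an absent or null rowCount as zero is an assumption rather than documented behavior.

```python
import json


def summarize_manifest_files(manifest_text):
    """Return (file_count, total_bytes, total_rows) from a manifest string."""
    manifest = json.loads(manifest_text)
    files = manifest.get("manifestFiles", [])
    total_bytes = sum(entry.get("fileSizeBytes", 0) for entry in files)
    # Assumption: rowCount may be absent or null, so treat it as zero.
    total_rows = sum(entry.get("rowCount") or 0 for entry in files)
    return len(files), total_bytes, total_rows
```

For the two files in the example above (10000 and 20000 bytes, 1000 and 2000 rows), the totals are 2 files, 30000 bytes, and 3000 rows. These totals can be compared with the file and byte counts that the downloader logs.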
The command below downloads the NEW_ORDERS 2014-04-01 data registered with herd from the S3_MANAGED DEV bucket to the local file system.
java -jar dm-downloader.jar \
-a <accessKey> \
-p <secretKey> \
-e s3-external-1.amazonaws.com \
-l /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/ \
-m /export/home/application_a_dev/dm-downloader-manifest-files/new-orders-pd-v2-2014-04-01.json \
-H myHostname.us-east-1.elb.amazonaws.com \
-P 80 \
-s true \
-u <username> \
-w <password> \
-n 10.0.0.100 \
-o 3128
- Make sure that the server where you run the downloader can reach the herd application server. This might require a new firewall rule.
- Depending on your environment, the downloader might need values for the HTTP proxy parameters (-n and -o) in order to communicate with AWS S3.