DPR2-1642: Include part-*.snappy.parquet in batch job during ingestio… #9199
The batch job of the replay pipeline does not process diff files created by the reload pipeline. This is because the batch job looks for files matching the name pattern LOAD*.parquet, whereas the reload diff files have the name pattern part-*.snappy.parquet.
Steps to reproduce:
To fix this, an extra argument will be supplied to the batch job invocation during the replay pipeline:
--dpr.batch.load.fileglobpattern : {part-*.snappy.parquet,LOAD*parquet}
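
To illustrate why the brace pattern picks up both kinds of file, here is a minimal Python sketch. It is not the batch job's code: the job presumably resolves the glob with its own file system library, whereas this sketch approximates that behaviour with the standard `fnmatch` module plus a hand-written expansion of a single, non-nested `{a,b}` group. The file names and the helpers `expand_braces` and `matches` are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatchcase


def expand_braces(pattern: str) -> list[str]:
    """Expand one non-nested {a,b,...} group into separate glob patterns."""
    start = pattern.find("{")
    end = pattern.find("}", start)
    if start == -1 or end == -1:
        return [pattern]  # no brace group: use the pattern unchanged
    prefix, suffix = pattern[:start], pattern[end + 1:]
    return [prefix + alt + suffix for alt in pattern[start + 1:end].split(",")]


def matches(filename: str, pattern: str) -> bool:
    """Return True if the file name matches any alternative in the pattern."""
    return any(fnmatchcase(filename, p) for p in expand_braces(pattern))


# Pattern described above as the one the batch job currently looks for.
OLD_PATTERN = "LOAD*.parquet"
# Pattern passed via --dpr.batch.load.fileglobpattern in this change.
NEW_PATTERN = "{part-*.snappy.parquet,LOAD*parquet}"

# Hypothetical file names, for illustration only.
files = [
    "LOAD00000001.parquet",           # shaped like a LOAD file
    "part-00000-abc.snappy.parquet",  # shaped like a reload diff file
    "_SUCCESS",                       # unrelated marker file
]

old_matched = [f for f in files if matches(f, OLD_PATTERN)]
new_matched = [f for f in files if matches(f, NEW_PATTERN)]

print("old pattern:", old_matched)
print("new pattern:", new_matched)
```

Under these assumptions, the old pattern selects only the LOAD-style name, while the new pattern selects both the LOAD-style and the part-style names and still skips unrelated files.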