-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for incremental ingestion #374
Labels
Comments
yruslan
added a commit
that referenced
this issue
Sep 6, 2024
yruslan
added a commit
that referenced
this issue
Sep 6, 2024
yruslan
added a commit
that referenced
this issue
Sep 9, 2024
yruslan
added a commit
that referenced
this issue
Sep 10, 2024
yruslan
added a commit
that referenced
this issue
Sep 10, 2024
yruslan
added a commit
that referenced
this issue
Sep 10, 2024
yruslan
added a commit
that referenced
this issue
Sep 11, 2024
yruslan
added a commit
that referenced
this issue
Sep 11, 2024
yruslan
added a commit
that referenced
this issue
Sep 12, 2024
yruslan
added a commit
that referenced
this issue
Sep 12, 2024
yruslan
added a commit
that referenced
this issue
Sep 12, 2024
yruslan
added a commit
that referenced
this issue
Sep 12, 2024
…and add more useful methods to metastore interfaces.
yruslan
added a commit
that referenced
this issue
Sep 16, 2024
yruslan
added a commit
that referenced
this issue
Sep 17, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan
added a commit
that referenced
this issue
Sep 18, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan
added a commit
that referenced
this issue
Sep 19, 2024
yruslan
added a commit
that referenced
this issue
Sep 20, 2024
yruslan
added a commit
that referenced
this issue
Sep 20, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
…for timezone differences in timestamp offsets.
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Sep 30, 2024
yruslan
added a commit
that referenced
this issue
Oct 1, 2024
yruslan
added a commit
that referenced
this issue
Oct 1, 2024
yruslan
changed the title
Add support for incremental ingestion
Add support for incremental ingestion (in Pramen)
Oct 3, 2024
yruslan
changed the title
Add support for incremental ingestion (in Pramen)
Add support for incremental ingestion
Oct 3, 2024
yruslan
added a commit
that referenced
this issue
Oct 4, 2024
yruslan
added a commit
that referenced
this issue
Oct 8, 2024
This is because for inclusive intervals minimums are not needed.
yruslan
added a commit
that referenced
this issue
Oct 9, 2024
This is when the input table does not have an information date field, and uncommitted offsets are old. Then they wasn't checked.
yruslan
added a commit
that referenced
this issue
Oct 9, 2024
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Background
Currently, incremental updates are made by overwriting the latest info date partitions multiple times a day.
This can be inefficient, especially for big tables with many events.
If the source table has a monotonically increasing field (a timestamp, a number, etc), it can be used for updates to the latest partition.
Feature
Add support for incremental ingestion (in Pramen).
Example
Proposed Solution
This PR adds 'incremental' as a schedule type, and mechanisms for managing offsets (experimental).
Pramen
version 1.10
introduces the concept of incremental ingestion. It allows running a pipeline multiple times a daywithout reprocessing data that was already processed. In order to enable it, use
incremental
schedule when defining youringestion operation:
In order for the incremental ingestion to work you need to define a monotonically increasing field, called an offset.
Usually, this incremental field can be a counter, or a record creation timestamp. You need to define the offset field in
your source. The source should support incremental ingestion in order to use this mode.
Offset types available at the moment:
short
,int
,long
)datetime
ortimestamp
fieldsstring
/varchar(n)
types.Only ingestion jobs support incremental schedule at the moment. Incremental transformations and sinks are planned to be
available soon.
Pramen PR: #487
After the completion of the issue an epic will be created for Aqueduct.
The text was updated successfully, but these errors were encountered: