Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for incremental ingestion #374

Closed
yruslan opened this issue Mar 15, 2024 · 0 comments · Fixed by #487
Closed

Add support for incremental ingestion #374

yruslan opened this issue Mar 15, 2024 · 0 comments · Fixed by #487
Assignees
Labels
DE enhancement New feature or request Pramen-Scala

Comments

@yruslan
Copy link
Collaborator

yruslan commented Mar 15, 2024

Background

Currently, incremental updates are made by overwriting the latest info date partitions multiple times a day.
This can be inefficient, especially for big tables with many events.

If the source table has a monotonically increasing field (a timestamp, a number, etc), it can be used for updates to the latest partition.

Feature

Add support for incremental ingestion (in Pramen).

Example

offset.column {
  name = "created_at"
  type = "datetime"
}

Proposed Solution

This PR adds 'incremental' as a schedule type, and mechanisms for managing offsets (experimental).

Pramen version 1.10 introduces the concept of incremental ingestion. It allows running a pipeline multiple times a day
without reprocessing data that was already processed. In order to enable it, use incremental schedule when defining your
ingestion operation:

schedule = "incremental"

In order for the incremental ingestion to work you need to define a monotonically increasing field, called an offset.
Usually, this incremental field can be a counter, or a record creation timestamp. You need to define the offset field in
your source. The source should support incremental ingestion in order to use this mode.

offset.column {
  name = "created_at"
  type = "datetime"
}

Offset types available at the moment:

Type Description
integral Any integral type (short, int, long)
datetime A datetime or timestamp fields
string Only string / varchar(n) types.

Only ingestion jobs support incremental schedule at the moment. Incremental transformations and sinks are planned to be
available soon.

Pramen PR: #487

After the completion of the issue an epic will be created for Aqueduct.

@yruslan yruslan added enhancement New feature or request Pramen-Scala labels Mar 15, 2024
@yruslan yruslan added the DE label Mar 22, 2024
@yruslan yruslan self-assigned this Sep 5, 2024
yruslan added a commit that referenced this issue Sep 12, 2024
…and add more useful methods to metastore interfaces.
yruslan added a commit that referenced this issue Sep 13, 2024
yruslan added a commit that referenced this issue Sep 18, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan added a commit that referenced this issue Sep 18, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan added a commit that referenced this issue Sep 30, 2024
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
yruslan added a commit that referenced this issue Sep 30, 2024
yruslan added a commit that referenced this issue Sep 30, 2024
…for timezone differences in timestamp offsets.
yruslan added a commit that referenced this issue Oct 1, 2024
@yruslan yruslan changed the title Add support for incremental ingestion Add support for incremental ingestion (in Pramen) Oct 3, 2024
@yruslan yruslan changed the title Add support for incremental ingestion (in Pramen) Add support for incremental ingestion Oct 3, 2024
yruslan added a commit that referenced this issue Oct 4, 2024
yruslan added a commit that referenced this issue Oct 4, 2024
yruslan added a commit that referenced this issue Oct 7, 2024
yruslan added a commit that referenced this issue Oct 8, 2024
This is because for inclusive intervals minimums are not needed.
yruslan added a commit that referenced this issue Oct 9, 2024
This is when the input table does not have an information date field, and uncommitted offsets are old. Then they wasn't checked.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DE enhancement New feature or request Pramen-Scala
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant