
#374 Incremental Ingestion #487

Merged · 55 commits into main · Oct 29, 2024

Conversation

@yruslan yruslan commented Sep 18, 2024

Closes #374
Closes #421

This PR adds 'incremental' as a schedule type, and mechanisms for managing offsets (experimental).

Pramen version 1.10 introduces the concept of incremental ingestion. It allows running a pipeline multiple times a day
without reprocessing data that was already processed. To enable it, use the incremental schedule when defining your
ingestion operation:

schedule = "incremental"

For incremental ingestion to work, you need to define a monotonically increasing field, called an offset, in your
source. Usually this is a counter or a record creation timestamp. The source must support incremental ingestion in
order to use this mode.

offset.column {
  name = "created_at"
  type = "datetime"
}

Offset types available at the moment:

Type      Description
integral  Any integral type (short, int, long).
datetime  A datetime or timestamp field.
string    Only string / varchar(n) types.
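
Putting this together, a sketch of a full setup might look like the following. The table and source names are
hypothetical, and keys other than schedule and offset.column follow the usual Pramen operation layout, so treat
the exact layout as illustrative:

pramen.sources = [
  {
    name = "my_jdbc_source"
    # ... connection settings for the source ...

    # The offset column is defined in the source.
    offset.column {
      name = "created_at"
      type = "datetime"
    }
  }
]

pramen.operations = [
  {
    name = "Incremental sourcing of my_table"
    type = "ingestion"
    schedule = "incremental"
    source = "my_jdbc_source"

    tables = [
      {
        input.db.table = "my_table"
        output.metastore.table = "my_table"
      }
    ]
  }
]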

Only ingestion jobs support the incremental schedule at the moment. Incremental transformations and sinks are planned
to be available soon.

@yruslan yruslan force-pushed the feature/374-incremental-ingestion branch 3 times, most recently from 9d9ee04 to 41c8f14 Compare September 18, 2024 15:01

github-actions bot commented Sep 19, 2024

Unit Test Coverage

Overall Project 84.16% -2.66% 🍏
Files changed 77.8%

Module Coverage
pramen:core Jacoco Report 84.93% -2.93%
Files
Module File Coverage
pramen:core Jacoco Report IdentityTransformer.scala 100% 🍏
OffsetInfoParser.scala 100% 🍏
TableReaderJdbcConfig.scala 100% -1.08% 🍏
JournalHadoopCsv.scala 100% 🍏
JdbcSource.scala 100% 🍏
TableReaderJdbcBase.scala 100% 🍏
JournalTasks.scala 100% 🍏
TaskCompletedCsv.scala 100% -18.46%
JournalTask.scala 100% -6.15%
MetaTableStats.scala 100% -57.14%
OffsetRecords.scala 100% 🍏
OffsetRecordConverter.scala 100% 🍏
DataOffsetRequest.scala 100%
DataOffsetAggregated.scala 100%
OffsetRecord.scala 100%
MetaTable.scala 98.91% 🍏
Schedule.scala 98.58% -8.87%
MetastorePersistenceRaw.scala 98.39% 🍏
TransferTable.scala 97.83% 🍏
TaskCompleted.scala 97.48% -7.56% 🍏
InfoDateConfig.scala 97% 🍏
TransferJob.scala 96.77% 🍏
SqlGeneratorDenodo.scala 96.43% -1.34% 🍏
SqlGeneratorDb2.scala 96.4% -1.35% 🍏
SqlGeneratorHive.scala 96.38% -1.36% 🍏
SqlGeneratorHsqlDb.scala 96.38% -1.36% 🍏
OperationSplitter.scala 96.29% 🍏
SqlGeneratorGeneric.scala 96.26% -1.4% 🍏
ScheduleStrategyUtils.scala 95.78% 🍏
TransformationJob.scala 95.56% -1.78% 🍏
SqlGeneratorSas.scala 95.5% -1.74% 🍏
SparkSource.scala 95.45% -4.03% 🍏
LocalSparkSource.scala 94.88% -1.02%
SparkUtils.scala 94.76% -0.57% 🍏
SinkJob.scala 93.67% 🍏
SqlGeneratorOracle.scala 93.38% -1.32% 🍏
PramenImpl.scala 93.23% 🍏
ScheduleStrategyIncremental.scala 93.16% -6.84% 🍏
SqlGeneratorPostgreSQL.scala 91.71% -5.99% 🍏
SqlGeneratorMySQL.scala 91.71% -3.23% 🍏
PythonTransformationJob.scala 91.71% 🍏
PipelineNotificationBuilderHtml.scala 91.3% -1.74%
JobBase.scala 91.26% -1.82% 🍏
AppContextImpl.scala 90.56% -2.58% 🍏
MetastoreImpl.scala 90.33% -1.73% 🍏
ConcurrentJobRunnerImpl.scala 88.98% -8.33%
AppRunner.scala 88.91% 🍏
PipelineStateImpl.scala 88.89% 🍏
OperationDef.scala 88.74% -1.2%
RawFileSource.scala 88.72% -0.51%
TaskRunnerBase.scala 88.47% -0.17% 🍏
SqlGeneratorMicrosoft.scala 87.5% -2.53% 🍏
OffsetManagerJdbc.scala 86.85% -13.15% 🍏
MetastorePersistenceParquet.scala 86.6% -2.49% 🍏
MetastorePersistence.scala 84.43% 🍏
MetastorePersistenceTransient.scala 84.38% 🍏
TableReaderJdbc.scala 84.03% -2.47%
IncrementalIngestionJob.scala 81.65% -18.35% 🍏
JournalJdbc.scala 81.53% 🍏
BookkeeperBase.scala 81.21% -3.76%
IngestionJob.scala 80.93% 🍏
TableReaderJdbcNative.scala 78.96% -16.69%
MetastorePersistenceDelta.scala 71.7% -11.52%
TableReaderSpark.scala 70.63% -25.92%
Bookkeeper.scala 69.28% 🍏
SlickUtils.scala 67.05% -13.64%
MetastorePersistenceTransientEager.scala 64.39% -7.2%
BookkeeperJdbc.scala 56.35% -0.73%
PramenDb.scala 54.19% -1.83%

@yruslan yruslan force-pushed the feature/374-incremental-ingestion branch 2 times, most recently from a5f21d3 to 91e7891 Compare September 25, 2024 12:28
@yruslan yruslan force-pushed the feature/374-incremental-ingestion branch 2 times, most recently from 35017a7 to 09420c2 Compare September 30, 2024 13:39
…and add more useful methods to metastore interfaces.
Apparently, Spark 2.4.8 infers '2021-02-18' as timestamp :O
@yruslan yruslan force-pushed the feature/374-incremental-ingestion branch from 6c108ed to f861bf9 Compare September 30, 2024 14:00
@yruslan yruslan marked this pull request as ready for review October 1, 2024 08:40
@yruslan yruslan marked this pull request as draft October 3, 2024 13:02
@yruslan commented Oct 3, 2024

Putting this back to draft since Kevin suggested using full intervals for offset tracking instead of half-intervals.
Potential benefits (illustrated in the sketch after this list):

  • Easier to understand the offsets table
  • We don't have to rely on absolute minimum values for each offset type
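
As a rough illustration of the difference for an integral offset type (the case class names are hypothetical, not the actual offsets table schema):

// Half-intervals track (min, max]: the very first batch has no previous
// maximum, so a type-specific sentinel minimum is needed (e.g. Long.MinValue).
case class HalfIntervalOffset(minOffsetExclusive: Long, maxOffset: Long)
val firstBatchHalf = HalfIntervalOffset(Long.MinValue, 100L)

// Full intervals track [min, max]: each record stores the actual first and
// last offsets ingested, so no sentinel values are required and the offsets
// table is easier to read.
case class FullIntervalOffset(minOffset: Long, maxOffset: Long)
val firstBatchFull = FullIntervalOffset(1L, 100L)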

@yruslan yruslan force-pushed the feature/374-incremental-ingestion branch from 96d43fc to 99b3689 Compare October 4, 2024 11:08
@yruslan yruslan marked this pull request as ready for review October 8, 2024 07:55
@yruslan yruslan requested a review from kevinwallimann October 8, 2024 07:56
This is because, for inclusive intervals, minimums are not needed.
This is for when the input table does not have an information date field and uncommitted offsets are old. Previously they weren't checked.
TaskPreDef(date, TaskRunReason.New)
})
} else {
Seq(TaskPreDef(infoDate, TaskRunReason.New))
Contributor:

Should there not be the same logic as in line 75? I.e. empty seq if maximumInfoDate is after infoDate?

Collaborator Author:

This is because when we don't have an information date at the source, our information date is the processing date, and we can't run for previous information dates, hence the empty Seq. But when the source has an information date column, new offsets could have arrived for that specific date, so we allow it to run.

When I looked at the code, and especially when I started writing unit tests for it as you suggested, I decided to refactor it a bit to make it less confusing.
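
A sketch of that decision, using the names from the diff excerpt above (hasInfoDateColumn and the surrounding method shape are assumptions):

if (!hasInfoDateColumn) {
  // Without an information date at the source, the information date is the
  // processing date, so runs for earlier information dates are not possible.
  Seq.empty[TaskPreDef]
} else {
  // With an information date column, new offsets may have arrived for that
  // specific date, so a run for it is allowed.
  Seq(TaskPreDef(infoDate, TaskRunReason.New))
}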

offsets.map(OffsetRecordConverter.toDataOffset)
}

override def getUncommittedOffsets(table: String, onlyForInfoDate: Option[LocalDate]): Array[UncommittedOffset] = {
Contributor:

There seems to be some duplicated logic with getOffsets. What different purpose do these two methods have?

Collaborator Author:

The number of offsets for a table can be quite large, but the count of uncommitted offsets should be very low. So this method makes sure that when we get uncommitted offsets for a table, the filtering is done in the database, not after the rows are transferred.

getOffsets() requires an information date to ensure millions of offsets won't be loaded, but it returns both committed and uncommitted offsets.
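
A rough sketch of the distinction; getUncommittedOffsets matches the diff excerpt above, while the getOffsets signature, the trait shape, and the stub result types are approximations:

import java.time.LocalDate

// Stub result types; the real ones in the PR carry offset values and commit metadata.
case class DataOffset(tableName: String)
case class UncommittedOffset(tableName: String)

trait OffsetManager {
  // Requires an information date, so at most one day's offsets are loaded.
  // Returns both committed and uncommitted offsets for that date.
  def getOffsets(table: String, infoDate: LocalDate): Array[DataOffset]

  // Uncommitted offsets are expected to be few, so the filter (and the
  // optional info date restriction) runs in the database, not on rows
  // already transferred to the application.
  def getUncommittedOffsets(table: String, onlyForInfoDate: Option[LocalDate]): Array[UncommittedOffset]
}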

Contributor:

Why did you not test splitComplexIdentifier for MySQL, but for MicrosoftSQL?

Collaborator Author:

splitComplexIdentifier() is defined in SqlGeneratorBase, which is already tested in SqlGeneratorDenodoSuite. But for MicrosoftSQL the method is special in order to support two types of escaping, [] and "", and their mix.

But I'll move the testing of splitComplexIdentifier() defined in SqlGeneratorBase to SqlGeneratorGenericSuite, or create a SqlGeneratorBaseSuite, to avoid confusion.
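
For illustration, hypothetical identifiers showing why the Microsoft SQL dialect needs its own handling (the expected splits are an assumption based on the description above):

// Both quoting styles, and a mix of them, are valid in Microsoft SQL:
val bracketed = "[my schema].[my table]"     // -> [my schema], [my table]
val mixed     = "\"my schema\".[my.table]"   // -> "my schema", [my.table]
// A naive split on '.' would break the second case, since the quoted
// part itself contains a dot.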

"wrapped query without alias for SQL queries " in {
assert(gen.getDtable("SELECT A FROM B") == "(SELECT A FROM B)")
}
}
Contributor:

Why did you not add a test for quote and unquote?

Collaborator Author:

Added tests. And found a bug. 😄

@yruslan commented Oct 29, 2024

Hi Kevin, I've fixed all the obvious issues and added explanations for some. I'm happy with the results. Thanks again for the effort of reviewing this!

Going to merge this now to unblock some other issues in the stack. The comments remain open, and if you want to continue the conversation on any of them, I'll be happy to.

@yruslan yruslan merged commit 23171be into main Oct 29, 2024
8 checks passed
@yruslan yruslan deleted the feature/374-incremental-ingestion branch October 29, 2024 11:08
@yruslan yruslan mentioned this pull request Nov 7, 2024