
#520 Add ability to run incremental transformers and sinks #526

Merged (26 commits) on Jan 2, 2025

Conversation

yruslan (Collaborator) commented on Dec 12, 2024

Closes #520

This PR adds the ability to specify incremental as the schedule type, which enables incremental transformers and sinks.
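As a sketch, enabling the new schedule type in a pipeline config might look like this (the operation name, class, and table names are hypothetical; the surrounding keys follow Pramen's existing HOCON operation definitions):

```hocon
pramen.operations = [
  {
    name = "My incremental transformation"
    type = "transformation"
    class = "com.example.MyIncrementalTransformer"

    # The new schedule type introduced by this PR
    schedule.type = "incremental"

    output.table = "my_output_table"
  }
]
```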

In order for a transformer or a sink to use a table from the metastore in an incremental way, the code should invoke the metastore.getCurrentBatch() method instead of metastore.getTable(). metastore.getCurrentBatch() also works for normal batch pipelines.
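A minimal transformer sketch illustrating the change (the class and table names are hypothetical; the method signatures follow the public Transformer API, and getCurrentBatch() is the method added by this PR):

```scala
import java.time.LocalDate

import org.apache.spark.sql.DataFrame
import za.co.absa.pramen.api.{MetastoreReader, Reason, Transformer}

class MyIncrementalTransformer extends Transformer {
  override def validate(metastore: MetastoreReader,
                        infoDate: LocalDate,
                        options: Map[String, String]): Reason = Reason.Ready

  override def run(metastore: MetastoreReader,
                   infoDate: LocalDate,
                   options: Map[String, String]): DataFrame = {
    // With a daily/weekly/monthly schedule this is equivalent to:
    //   metastore.getTable("my_table", Some(infoDate), Some(infoDate))
    // With an incremental schedule it returns only the latest
    // unprocessed data, tracked via offsets.
    val df = metastore.getCurrentBatch("my_table")

    df // apply the actual transformation here
  }
}
```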

  • When getCurrentBatch() is used with a daily, weekly, or monthly schedule, it returns data for the information date corresponding to the running job, the same as invoking metastore.getTable("my_table", Some(infoDate), Some(infoDate)).
  • When getCurrentBatch() is used with an incremental schedule, it returns only the latest unprocessed data. Offset management is used to keep track of processed data.
  • The column pramen_batchid is added automatically to output tables of ingested and transformed data in order to track offsets. The exception is the metastore raw format, which keeps original files as they are, so the pramen_batchid column cannot be added to such tables.
  • The offset manager updates offsets only after the output of a transformer or sink has succeeded, and it does the update in a transactional manner. However, if the update fails partway through, duplicates are possible on subsequent runs, so Pramen provides 'at least once' semantics for incremental transformation pipelines.
  • Reruns are possible for full days to remove duplicates. But for incremental sinks, such as the Kafka sink, duplicates might still happen.

Example offsets with one ingestion of raw files, one transient transformer, one normal transformer, and a sink:

Notification of the second run for a day:
[screenshot: pipeline notification, 2024-12-12]

The corresponding offsets are:
[screenshot: offsets table, 2024-12-12]

Offsets committed together are highlighted. They are committed atomically: either all or none. Note that offsets of transient jobs are committed only if the job that calls them has saved its output successfully. This ensures that no records are lost.

Records are as follows:

  • TEST_POC_raw minimum and maximum offsets for the raw file ingestion
  • TEST_POC_raw->TEST_POC_parquet for tracking the input of the conversion transient transformer
  • TEST_POC_parquet->TEST_POC_publish for tracking the input of the standardization transformer
  • TEST_POC_publish for tracking the output of the standardization transformer
  • TEST_POC_publish->TEST_POC_publish->kafka_avro for tracking the input of the sink
  • TEST_POC_publish->kafka_avro for tracking the output of the sink.

The table name column in the 'offsets' table needs to be wider than in the bookkeeping table because virtual table names embed other table names as components.
Comment on lines +201 to +203
if (isDataQuery) {
df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
}
yruslan (Collaborator, Author) commented on Dec 12, 2024


This is the fix for #398

This is not part of the incremental transformer PR, but it fixes a bug that was already fixed in 1.9.12; the fix is ported here.


github-actions bot commented Dec 12, 2024

Unit Test Coverage

Overall Project 84.22% -1.15% 🍏
Files changed 72.92%

Module Coverage
pramen:core Jacoco Report 84.97% -1.27%
pramen-extras Jacoco Report 76.75% 🍏
Files
Module File Coverage
pramen:core Jacoco Report ConversionTransformer.scala 100% 🍏
IdentityTransformer.scala 100% 🍏
TableReaderJdbcConfig.scala 100% -1.03% 🍏
TrackingTable.scala 100%
OrchestratorImpl.scala 100% 🍏
OffsetManagerUtils.scala 100% 🍏
MetastoreReaderBatchImpl.scala 100% 🍏
OffsetRecords.scala 100% 🍏
OffsetCommitRequest.scala 100%
MetaTable.scala 98.96% 🍏
TransformationJob.scala 96.54% 🍏
MetastorePersistenceRaw.scala 96.51% -2.33% 🍏
OffsetManagerCached.scala 96.21% -3.79% 🍏
OperationSplitter.scala 93.32% 🍏
LocalCsvSink.scala 93.22% 🍏
TransientTableManager.scala 92.74% 🍏
RawFileSource.scala 92.13% 🍏
JobBase.scala 91.42% 🍏
RuntimeConfig.scala 91.28% 🍏
PipelineNotificationBuilderHtml.scala 91.04% -0.9%
AppContextImpl.scala 90.79% 🍏
PythonTransformationJob.scala 90.52% -1.38%
ConcurrentJobRunnerImpl.scala 88.86% 🍏
SinkJob.scala 88.35% -7.91% 🍏
OffsetManagerJdbc.scala 87.96% 🍏
MetastoreImpl.scala 87.49% -4.24% 🍏
TaskRunnerBase.scala 85.54% 🍏
ReaderMode.scala 84.85%
TableReaderJdbc.scala 84.62% 🍏
MetastoreReaderIncrementalImpl.scala 83.23% -16.77% 🍏
IngestionJob.scala 81.1% 🍏
IncrementalIngestionJob.scala 80.98% -0.45% 🍏
MetastoreReaderBase.scala 77.93% -22.07%
BookkeeperJdbc.scala 56.51% 🍏
pramen-extras Jacoco Report EcsPipelineNotificationTarget.scala 91.45% 🍏
KafkaAvroSink.scala 0% 🍏

pramen/core/src/main/resources/reference.conf (review comment outdated and resolved)
import za.co.absa.pramen.api.MetastoreReader

trait MetastoreReaderCore extends MetastoreReader {
def commitOutputTable(tableName: String, trackingName: String): Unit
Contributor commented:

I would rename the class to something like MetastoreReaderIncremental and the method to commitIncrementalOutputTable to make it clear that this trait is only for incremental processing.

yruslan (Collaborator, Author) replied:

The intent of the trait is that it should contain methods available only to the framework, not to user code. Yes, it only has incremental-related methods now, but it can be extended in the future. I agree, though, that the name might be misleading. What about MetastoreReaderInternal?

yruslan (Collaborator, Author) added:

But now that I have looked at the other suggestion (the one about encapsulation), it does make sense to specialize this trait only for the incremental processing logic.

Contributor replied:

Ok, for the future, I think MetastoreReaderInternal is more self-explanatory than MetastoreReaderCore. I thought it was some core functionality that is essential for the Metastore functionality, therefore I was confused 😆

val metastore = this

-    new MetastoreReader {
+    new MetastoreReaderCore {
Contributor commented:

I think it would be a clearer separation of concerns and encapsulation if you had two separate implementations, one for batch and one for the incremental processing. E.g. leave the original MetastoreReader impl as is, and then extend it for the incremental functionality, and override getCurrentBatch for the additional logic for the incremental case.

yruslan (Collaborator, Author) replied:

Good suggestion! I agree. The inline anonymous class created here is becoming way too big.

@yruslan yruslan merged commit cc0301e into main Jan 2, 2025
8 checks passed
@yruslan yruslan deleted the feature/520-incremental-processing-trafsformers branch January 2, 2025 16:29
@yruslan yruslan mentioned this pull request Jan 3, 2025
Successfully merging this pull request may close these issues.

Incremental transformations and sinks