
Full execution provenance resolution #5639

Draft · wants to merge 54 commits into base: master
Conversation

pditommaso
Member

This PR implements the ability to trace the full provenance of a Nextflow pipeline, so that once a task execution completes, it reports the set of direct upstream tasks that originated one or more of its inputs.

How it works

Each output value emitted by a task or an operator is wrapped in an object instance. This makes it possible to assign each emitted value a unique identity based on the underlying Java object identity.

Each object is associated with the corresponding task or operator run (i.e. TaskRun and OperatorRun).

Once the output value is received as an input by a downstream task, the upstream task is determined by inspecting the output-run association table.
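The mechanism can be sketched roughly as follows. This is a minimal Java illustration with hypothetical names (`Msg`, `ProvenanceTable`), not the PR's actual classes: each emitted value is wrapped in a fresh object, that object's identity hash acts as a unique message id, and a table associates the id with the run that produced it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a fresh wrapper per emitted value gives each
// value a unique identity, independent of equals()/hashCode().
class Msg {
    final Object value;
    Msg(Object value) { this.value = value; }
    int id() { return System.identityHashCode(this); }
}

// Output-run association table: message id -> run that emitted it.
class ProvenanceTable {
    private final Map<Integer, Object> messages = new HashMap<>();

    void recordOutput(Msg msg, Object run) {
        messages.put(msg.id(), run);
    }

    // consulted when a downstream task receives the message as an input
    Object producerOf(Msg msg) {
        return messages.get(msg.id());
    }
}
```

Note that two values that are equal (e.g. two identical strings) still get distinct ids, which is the point of using object identity rather than value equality.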

Required changes

This approach requires enclosing each output value in a wrapper object, and "unwrapping" it once it is received by the downstream task or operator, so that the corresponding operation is not altered.

The input unwrapping can be easily automated for both tasks and operators, because they share a common message-receive interface.

However, the output wrapping requires modifying all Nextflow operators, because each of them has custom logic to produce its outputs.
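The wrap/unwrap boundary can be illustrated with a small sketch (hypothetical names, simplified to plain Java, not the PR's API): values are wrapped on emission and transparently unwrapped on receive, so the task or operator logic never sees the wrapper.

```java
// Hypothetical sketch of the wrap/unwrap boundary.
final class Wrapped {
    final Object value;
    Wrapped(Object value) { this.value = value; }
}

class MessageBoundary {
    // wrap on emission: every output gets its own wrapper instance,
    // which is what gives it a unique identity
    static Object wrap(Object value) {
        return new Wrapped(value);
    }

    // unwrap on receive: downstream code gets the plain value back,
    // so the computation itself is unchanged
    static Object unwrap(Object msg) {
        return (msg instanceof Wrapped) ? ((Wrapped) msg).value : msg;
    }
}
```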

Possible problems

The impact on the underlying Java heap of creating an object instance for each output value generated by the workflow execution should be assessed.

Similarly, keeping a heap reference for each task and operator run may cause memory pressure on large workflow graphs.

Current state and next steps

The current implementation demonstrates that this approach is viable. The solution already supports all tasks and the following operators: branch, map, flatMap, collectFile.

Tests are available for these cases.

The remaining operators should be added to fully support existing workflow applications.

Alternative solution

A simpler solution is possible using the output file paths as the identity value to track task provenance, with logic very similar to the above proposal.

However, the path approach is limited to the case in which all workflow tasks and operators produce file values. Provenance cannot be tracked for tasks having one or more non-file input/output values.
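For comparison, the path-based alternative could look like the following sketch (hypothetical names, using plain strings for paths to keep it simple): the file path itself is the identity key, so no wrapper objects are needed, but any non-file value simply cannot be traced.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the path-based alternative: identity is the
// output file path, so only file values participate in provenance.
class PathProvenance {
    private final Map<String, String> producerByPath = new HashMap<>();

    void recordOutput(String path, String taskId) {
        producerByPath.put(path, taskId);
    }

    // returns null for any non-file input (e.g. a number parsed out of
    // a CSV record), which is exactly the limitation described above
    String producerOf(Object input) {
        if (input instanceof String)
            return producerByPath.get((String) input);
        return null;
    }
}
```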

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso requested review from jorgee and bentsherman and removed request for jorgee January 5, 2025 13:06

netlify bot commented Jan 5, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 0e30a8f
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/678ed618926b500008a64754

@pditommaso pditommaso marked this pull request as draft January 5, 2025 13:22
@bentsherman
Member

This is why we need fewer operators 😆

The splitter operators should work similarly to flatMap

@pditommaso
Member Author

I know, I know, but they exist


@bentsherman
Member

For posterity, here is an example from fetchngs that would not be captured by file-based tracking:

https://github.com/nf-core/fetchngs/blob/8ec2d934f9301c818d961b1e4fdf7fc79610bdc5/workflows/sra/main.nf#L54-L57

SRA_RUNINFO_TO_FTP outputs a csv file that is split into records using the splitCsv operator. These records are filtered and eventually passed to CUSTOM_SRATOOLSNCBISETTINGS:

https://github.com/nf-core/fetchngs/blob/8ec2d934f9301c818d961b1e4fdf7fc79610bdc5/subworkflows/nf-core/fastq_download_prefetch_fasterqdump_sratools/main.nf#L20

So fetchngs should be a good test case for identity-based tracking.

@bentsherman
Member

After a first pass, I feel good about the overall approach. Most of the changes seem to be general cleanup, which is appreciated, and most of the new behavior is isolated into new packages, which should keep it easy to evolve. I only have some minor questions that we can discuss later.

The trickiest part is clearly the operators -- linking inputs to outputs correctly, wrapping/unwrapping values correctly, especially for the scatter and gather operators. I will want to dig into this bit to see if we can simplify anything, but even the current amount of overhead looks acceptable.

It looks like the memory impact will be manageable. The Msg wrapper itself shouldn't cost much, at most a few MB. My main concern was keeping lots of intermediate objects alive, but it looks like you avoided this by keeping only the object identity of the Msg and not the reference itself. You do have to keep all task runs alive, which in truth I have not stress-tested, but nf-prov is already doing this so no change there.

I'm curious to see how you handle groupTuple... should be similar to buffer/collate, just harder 😆

@@ -55,20 +62,21 @@ class GroupTupleOp {

private sort

GroupTupleOp(Map params, DataflowReadChannel source) {
private OpContext context = new ContextRunPerThread()
Member

Why does groupTuple use the run-per-thread context? Same question for collate. I would expect both to use ContextGrouping, whereas ContextRunPerThread seems to be for combining multiple input channels. But groupTuple and collate are both single-threaded.

Member Author

ContextRunPerThread is needed when an operator uses multiple DataflowProcessors under the hood. Likely ContextSequential should also work here, because the real logic is provided by this snippet, which unwraps the values and creates a new OperatorRun with the corresponding input ids:

for( Object it : tuple ) {
    if( it instanceof ArrayBag ) {
        final bag = it
        for( int i=0; i<bag.size(); i++ ) {
            bag[i] = OpDatum.unwrap(bag[i], inputs)
        }
    }
}
return new OperatorRun(new LinkedHashSet<Integer>(inputs))

Member

I guess you could make it work with either the ContextGrouping or the ContextSequential. Still trying to understand how the OpDatum is being used but it seems like a good way to associate inputs with outputs when the mapping is not simply 1-to-1.

I figure we should avoid the ContextRunPerThread where it isn't needed, since both collate and groupTuple have only one DataflowProcessor

Member Author

OpDatum keeps together a concrete value, the OperatorRun instance that acquired it, and ultimately the set of input ids. When the operator emission is composed, a new run instance is created with all the inputs that contributed to that output. Makes sense?
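If it helps, that pattern can be written out as a simplified Java sketch (types and names reduced for illustration, not the actual nextflow.prov classes): each datum carries the input id it came from, and composing an emission unwraps the values while collecting the union of ids into one new run.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of an operator run holding its contributing input ids.
class OperatorRun {
    final Set<Integer> inputIds;
    OperatorRun(Set<Integer> inputIds) { this.inputIds = inputIds; }
}

class OpDatum {
    final Object value;
    final int inputId;
    OpDatum(Object value, int inputId) { this.value = value; this.inputId = inputId; }

    // unwrap one element, accumulating its input id into the provenance set
    static Object unwrap(Object it, Set<Integer> inputs) {
        if (it instanceof OpDatum) {
            OpDatum d = (OpDatum) it;
            inputs.add(d.inputId);
            return d.value;
        }
        return it;
    }

    // compose an emission: unwrap every element and bind a single new run
    // to the union of all input ids that contributed to the output
    static OperatorRun composeRun(List<Object> tuple, List<Object> out) {
        Set<Integer> inputs = new LinkedHashSet<>();
        for (Object it : tuple)
            out.add(unwrap(it, inputs));
        return new OperatorRun(inputs);
    }
}
```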

Member

Yes that makes sense. Perhaps the better question is, what does an "operator run" mean for an operator like groupTuple? Each input is a separate "run" in one sense, but only one of those "runs" will emit an output. So is there an OperatorRun for each input or one for each emitted group?

I will write some tests and use my mermaid diagram to test some of these operators. I suspect something is missing for buffer / collate / groupTuple

Member Author

In principle it should be an instance for each run, but in reality it's used as an object holder to associate an output message with the upstream inputs. See here:

protected Set<TaskId> findUpstreamTasks0(final int msgId, Set<TaskId> upstream) {
    final run = messages.get(msgId)
    if( run instanceof TaskRun ) {
        upstream.add(run.id)
        return upstream
    }
    if( run instanceof OperatorRun ) {
        for( Integer it : run.inputIds ) {
            if( it!=msgId ) {
                findUpstreamTasks0(it, upstream)
            }
            else {
                log.trace "Skip duplicate provenance message id=${msgId}"
            }
        }
    }
    return upstream
}

Member

Right. Maybe it would be better to rename OperatorRun to OperatorLink, and rename TrailRun to ProvenanceLink, to clarify their purpose as a link from a set of inputs to an output

@bentsherman
Member

bentsherman commented Jan 18, 2025

I ripped the code from nf-prov to generate a mermaid diagram of the task graph using your provenance method.

Your rnaseq-nf toy pipeline works fine:

[image: mermaid diagram of the rnaseq-nf task graph]

I tried to run against fetchngs, but the run hangs at the very end 😞

You should be able to reproduce it with:

make pack
./build/releases/nextflow-24.11.0-edge-dist run nf-core/fetchngs -r 1.12.0 -profile test,conda --outdir results

@pditommaso
Member Author

Well done. I'll check fetchngs asap

import groovy.util.logging.Slf4j
import nextflow.prov.OperatorRun
/**
* Implements an operator context that binds a new run to the current thread
Member

Worth commenting that a "thread" here refers to a DataflowProcessor (i.e. an input channel).

My custom tests on join and mix are working correctly, so I'm assuming that each DataflowProcessor uses the same thread for all of its runs (as long as maxForks is 1).

For operators like collate and groupTuple that use only one DP, the run-per-thread context essentially allows you to manually override how the provenance links are recorded. They are also working correctly with my tests.

Member

Even better would be to replace the thread-local with a Map<DataflowProcessor,OperatorLink>, but not a big deal if that would be too complicated
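For what it's worth, that idea could look like the following plain-Java sketch (with a stand-in for the GPars DataflowProcessor class, purely illustrative): the current run is keyed by processor rather than by thread, so correctness no longer depends on thread affinity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the GPars DataflowProcessor class, used here only
// to make the sketch self-contained.
class DataflowProcessor { }

// Illustrative sketch: track the current run per processor instead of
// using a thread-local, as suggested above.
class RunContext {
    private final Map<DataflowProcessor, Object> currentRun = new ConcurrentHashMap<>();

    // get (or lazily create) the run associated with this processor
    Object runFor(DataflowProcessor dp) {
        return currentRun.computeIfAbsent(dp, k -> new Object());
    }

    // drop the association once the run has emitted its outputs
    void complete(DataflowProcessor dp) {
        currentRun.remove(dp);
    }
}
```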
