Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-49073] Parallel step should propagate the worst result when not using failFast #325

Merged
merged 20 commits into from
Nov 26, 2019

Conversation

basil
Copy link
Member

@basil basil commented Sep 18, 2019

Problem

See JENKINS-49073 (comment).

Solution

Inspired by jenkinsci/pipeline-build-step-plugin#24 (comment). Add a new propagateWorst property to the parallel step that implements the desired behavior. When this is in effect, the "worst" branch propagates its build status, as defined by the ordering Jesse suggested.

Implementation

To make this change easier to implement, I first did two commits that simply refactor the existing code. These should be reviewed first to ensure that I didn't break anything.

  1. The first commit simplifies originalFailure from a SimpleEntry<String,Throwable> to just a Throwable. The key was never used, so this is just a simplification.
  2. The second commit changes the saved on-disk state from the first failure with all subsequent failures chained onto it via suppressed exceptions to a LinkedHashSet of failures in insertion order. This allows us to delay choosing the top-level exception until later in the execution in checkAllDone, which is needed in order to implement the new propagateWorst functionality.


@Issue("JENKINS-49073")
@Test
public void parallelPropagatesStatus() throws Exception {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these new test cases, which exercise the new functionality and fail prior to this change. Are there any other use cases I should consider or test cases I should write?

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice from a quick glance; have not yet done a detailed review of the logic.

@@ -167,6 +167,12 @@
<version>1.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>pipeline-build-step</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary I think; pending JENKINS-27092, can use non-sandboxed scripts which directly throw new FlowInterruptedException(…) for tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this isn't necessary, but I think it makes the test more realistic. I saw that we're already pulling in pipeline-input-step, so pulling in pipeline-build-step as well didn't seem like a big price to pay for the benefit of more realistic integration testing. I think I'd rather keep this as-is, but let me know if you feel strongly otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong feeling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd lean towards the FlowInterruptedException approach since it would make parallelPropagatesStatusImpl much easier to read, but I think either approach is fine.

@jglick jglick self-requested a review September 18, 2019 19:45
@basil basil closed this Sep 27, 2019
@basil basil reopened this Sep 27, 2019
@basil
Copy link
Member Author

basil commented Sep 27, 2019

My most recent test run failed due to an issue that doesn't look related to this change:

[ERROR] inProgressButProgramLoadFailure(org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest)  Time elapsed: 195.219 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 180 seconds
	at java.lang.Thread.sleep(Native Method)
	at org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest.assertCompletedCleanly(PersistenceProblemsTest.java:56)
	at org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest.lambda$inProgressButProgramLoadFailure$19(PersistenceProblemsTest.java:321)
	at org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest$$Lambda$2/1166807841.run(Unknown Source)
	at org.jvnet.hudson.test.RestartableJenkinsRule$3.evaluate(RestartableJenkinsRule.java:200)
	at org.jvnet.hudson.test.RestartableJenkinsRule$6.evaluate(RestartableJenkinsRule.java:240)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:556)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

Restarting to get another build.

@basil basil closed this Sep 27, 2019
@basil basil reopened this Sep 27, 2019
@basil
Copy link
Member Author

basil commented Sep 27, 2019

Now all branches of my build failed without even running a single test. I'm just going to give up for today and try to run a build again tomorrow.

@slonopotamus
Copy link
Contributor

slonopotamus commented Sep 30, 2019

  1. How propagateWorst=true interacts with failFast=true?
  2. I'd like propagateWorst=true to be default. What is the procedure of discussing this design decision?

@basil basil closed this Sep 30, 2019
@basil basil reopened this Sep 30, 2019
@basil
Copy link
Member Author

basil commented Sep 30, 2019

Now all branches of my build failed without even running a single test. I'm just going to give up for today and try to run a build again tomorrow.

I got further today. My test run passed in all environments except in windows-11-2.164.1, where the following two errors occurred:

[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 218.424 s <<< FAILURE! - in org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest
[ERROR] inProgressButProgramLoadFailure(org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest)  Time elapsed: 192.424 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 180 seconds
	at java.base@11.0.3/java.lang.Thread.sleep(Native Method)
	at app//org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest.assertCompletedCleanly(PersistenceProblemsTest.java:56)
	at app//org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest.lambda$inProgressButProgramLoadFailure$19(PersistenceProblemsTest.java:321)
	at app//org.jenkinsci.plugins.workflow.cps.PersistenceProblemsTest$$Lambda$67/0x0000000100125c40.run(Unknown Source)
	at app//org.jvnet.hudson.test.RestartableJenkinsRule$3.evaluate(RestartableJenkinsRule.java:200)
	at app//org.jvnet.hudson.test.RestartableJenkinsRule$6.evaluate(RestartableJenkinsRule.java:240)
	at app//org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:556)
	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at app//org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.base@11.0.3/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
[ERROR] Tests run: 15, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 188.762 s <<< FAILURE! - in org.jenkinsci.plugins.workflow.cps.FlowDurabilityTest
[ERROR] testCompleteAndLoadBuilds(org.jenkinsci.plugins.workflow.cps.FlowDurabilityTest)  Time elapsed: 4.283 s  <<< FAILURE!
org.junit.ComparisonFailure: 
expected:<...e] End of Pipeline
[]> but was:<...e] End of Pipeline
[Finished: SUCCESS
]>
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.jenkinsci.plugins.workflow.cps.FlowDurabilityTest$2.evaluate(FlowDurabilityTest.java:420)
	at org.jvnet.hudson.test.RestartableJenkinsRule$6.evaluate(RestartableJenkinsRule.java:240)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:556)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:834)

I can't reproduce these locally and they don't appear to be related to my changes. I'll leave it to the maintainers to comment on any possible next steps here.

@basil
Copy link
Member Author

basil commented Sep 30, 2019

How propagateWorst=true interacts with failFast=true?

The two are mutually exclusive and an IllegalArgumentException is thrown if one attempts to use both simultaneously (see f1c1968).

I'd like propagateWorst=true to be default. What is the procedure of discussing this design decision?

I would also like it to be the default, but I erred on the conservative side in this change. There is precedent for changing existing behavior (and breaking backward compatibility!) to adopt a more sensible default in jenkinsci/pipeline-build-step-plugin#24 (comment):

though I think the behavior in this PR is a natural default

Given this past precedent, it seems possible to change the default behavior of the parallel step to propagate the worst result by default, and (to me) this seems like a "natural default", but I defer to the judgment of the maintainers regarding this matter.

@basil basil mentioned this pull request Sep 30, 2019
@dwnusbaum
Copy link
Member

@basil Before jenkinsci/pipeline-build-step-plugin#24, was there any normal scenario in which users would care about the difference between propagate: true vs propagate: false ? The only one I can think of is for users manually throwing FlowInterruptedException, but I don't know how common that is. I am thinking it would be better to just make propagate: true the default so we don't have another option, but that depends on whether users care about the previous behavior.

@basil
Copy link
Member Author

basil commented Oct 2, 2019

Before jenkinsci/pipeline-build-step-plugin#24, was there any normal scenario in which users would care about the difference between propagate: true vs propagate: false?

I think you meant to ask if there was any normal scenario in which users would care about the difference between propagateWorst: true vs propagateWorst: false in the parallel step (i.e., what I am proposing in this PR) rather than propagate: true and propagate: false in the build step (which is what was affected in jenkinsci/pipeline-build-step-plugin#24). So I'll answer the question I think you meant to ask. But if you really did mean to ask about the behavior of the build step, let me know and we can discuss that as well.

I took a look at the 116 Pipeline jobs in use at my company, and I didn't find a single instance where the author manually threw a FlowInterruptedException. In all the cases where we are using parallel steps (without failFast: true), we had initially expected the worst result to be propagated and were surprised when it wasn't. To work around this, we wrote the code described in JENKINS-49073 (comment). This is why I think propagating the worst result would be a natural default (when failFast: true is not being used).

Based on the above, I think we have a few options:

  1. The most conservative option is to preserve the existing behavior (propagate the first result) by default, add a new option to propagate the worst result (off by default), and fail with an IllegalStateException when propagateWorst is used with failFast.
  2. The simplest option is to propagate the worst result by default (a change in existing behavior), except when failFast is set. No new syntax would be added to the API. When failFast is set, we'd continue to propagate the first result as we do today, but there would no longer be a way to propagate the first result if failFast is not set.
  3. A compromise would be to propagate the worst result by default (a change in existing behavior), except when failFast is set, but to also add a new propagateFirst option, which could be used if you want to get the old behavior of propagating the first result without setting failFast.

I personally don't think there is a real use case for propagating the first result without setting failFast. At least I have never wanted this in any of my pipelines. So I dislike option 3 because it unnecessarily complicates the API. So for me the real choice is between option 1 and option 2. Since I think propagating the worst result by default (except when failFast is set) is the most natural default, I personally favor option 2 (knowing that it breaks backward compatibility). Another reason to favor option 2 is that it keeps the syntax simple and avoids having to throw an IllegalArgumentException when failFast is used with propagateWorst.

@dwnusbaum
Copy link
Member

@basil I think option 2 seems like the most obvious behavior. Unfortunately, we have no real way to know how many users might be relying on the existing behavior, but given that users would have to be manually throwing FlowInterruptedException to notice a difference, it seems safe enough to change the default behavior, so my vote is for option 2.

@basil basil changed the title Parallel step should offer the ability to propagate the worst result [JENKINS-49073] Parallel step should offer the ability to propagate the worst result Oct 31, 2019
@basil
Copy link
Member Author

basil commented Oct 31, 2019

Sounds good to me! I've updated the change to propagate the worst result by default when failFast is false and removed the propagateWorst setting.

Comment on lines +618 to +625
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.UNSTABLE, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.UNSTABLE, Result.FAILURE, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.NOT_BUILT, Result.NOT_BUILT);
Copy link
Member Author

@basil basil Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely satisfied with how the code is currently working for these test cases. For example, for three parallel builds whose results are SUCCESS, UNSTABLE, and NOT_BUILT we are currently propagating a status of NOT_BUILT. I don't consider this ideal; to me, UNSTABLE is "worse" than NOT_BUILT. Similarly, for four parallel builds whose results are SUCCESS, UNSTABLE, FAILURE, and NOT_BUILT we are currently propagating a status of NOT_BUILT. I don't consider this ideal; to me, FAILURE is "worse" than NOT_BUILT. Finally, for two parallel builds whose status is SUCCESS and NOT_BUILT we are currently propagating a status of NOT_BUILT. This doesn't make sense to me. If all the child jobs had a status of NOT_BUILT, then I could understanding the parent job having a status of NOT_BUILT, but as long as a single child job has a status of SUCCESS, I would expect that to be propagated to the parent.

I will think some more about how to tweak the comparator algorithm to handle these cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another tricky question is how to handle ABORTED. Between SUCCESS and ABORTED, ABORTED is clearly worse. But what about between FAILURE/UNSTABLE and ABORTED? I would argue that FAILURE/UNSTABLE are "worse", because we know the status of the failed/unstable job (whereas we don't know the status of the aborted job, which never ran to completion). Between NOT_BUILT and ABORTED I think ABORTED is worse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: What result is worst, I would just use the ordering defined by Result.ordinal (used to implement Result.isWorseThan and friends). Whether that ordering makes the most sense could be debated, but that's the ordering that everything else in Jenkins uses, so I would use it here as well.

Might be interesting to run pipeline-model-definition's tests with this change to see if any of them start failing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: What result is worst, I would just use the ordering defined by Result.ordinal (used to implement Result.isWorseThan and friends). Whether that ordering makes the most sense could be debated, but that's the ordering that everything else in Jenkins uses, so I would use it here as well.

My comparison logic is already using Result#isWorseThan and Result#isBetterThan, so if we're OK with the behavior demonstrated by the new tests (which I think is acceptable, but not ideal) then I can leave the comparison logic as-is.

Might be interesting to run pipeline-model-definition's tests with this change to see if any of them start failing.

Sure. I opened jenkinsci/pipeline-model-definition-plugin#360 to cover this testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened jenkinsci/pipeline-model-definition-plugin#360 to cover this testing.

These tests passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate on my earlier comment, the values in Result have always seemed weird to me. ABORTED and NOT_BUILT feel like a different kind of thing entirely than SUCCESS, FAILURE, and UNSTABLE. Sometimes I think it would have made more sense for results to have 2 parts, a "completion" result, like , NOT_BUILT, ABORTED, COMPLETED, and a "goodness" result, like SUCCESS, WARNING, ERROR. Then you could have something like a Pipeline step that is ABORTED and SUCCESS in the case of a user cancelling the build, or ABORTED and WARNING in the case of something like a timeout, NOT_BUILT and SUCCESS could be used in cases like a Declarative when expression that is false, and here in the parallel step, the completion result would always be COMPLETED, and only the "goodness" result would be determined by the results of the branches. Using two values creates other problems though, and not all combinations make sense.

So all that to say, I agree that the behavior here with NOT_BUILT feels wrong, but I'm not sure if it makes sense to try to change it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your observations here. The awkwardness in the use cases I described above is exposing a deeper problem in how the Result API is defined in Jenkins core. When I get a chance I'll try to file a bug/RFE against Jenkins core with the results of this discussion. Perhaps one day that can be addressed, but I agree that it is quite orthogonal to this effort.

Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me in general, added some minor comments.

Comment on lines +618 to +625
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.UNSTABLE, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.SUCCESS, Result.UNSTABLE, Result.FAILURE, Result.NOT_BUILT);
parallelPropagatesStatusImpl(
Result.NOT_BUILT, Result.NOT_BUILT, Result.NOT_BUILT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate on my earlier comment, the values in Result have always seemed weird to me. ABORTED and NOT_BUILT feel like a different kind of thing entirely than SUCCESS, FAILURE, and UNSTABLE. Sometimes I think it would have made more sense for results to have 2 parts, a "completion" result, like , NOT_BUILT, ABORTED, COMPLETED, and a "goodness" result, like SUCCESS, WARNING, ERROR. Then you could have something like a Pipeline step that is ABORTED and SUCCESS in the case of a user cancelling the build, or ABORTED and WARNING in the case of something like a timeout, NOT_BUILT and SUCCESS could be used in cases like a Declarative when expression that is false, and here in the parallel step, the completion result would always be COMPLETED, and only the "goodness" result would be determined by the results of the branches. Using two values creates other problems though, and not all combinations make sense.

So all that to say, I agree that the behavior here with NOT_BUILT feels wrong, but I'm not sure if it makes sense to try to change it.


private FreeStyleProject getDownstreamJob(Result result) throws IOException {
FreeStyleProject ds = story.j.createFreeStyleProject();
ds.getBuildersList().add(new TestBuilder() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe simpler to use a Pipeline job for the downstream builds instead:

WorkflowJob ds = story.j.createProject(WorkflowJob.class);
ds.setDefinition(new CpsFlowDefinition("currentBuild.result = " + result, true));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes a big difference one way or the other. I did it this way simply because I was copying an existing example from the pipeline-build-step plugin's tests.

@@ -167,6 +167,12 @@
<version>1.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>pipeline-build-step</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd lean towards the FlowInterruptedException approach since it would make parallelPropagatesStatusImpl much easier to read, but I think either approach is fine.

basil and others added 2 commits November 4, 2019 09:49
Co-Authored-By: Devin Nusbaum <dwnusbaum@users.noreply.github.com>
@dwnusbaum
Copy link
Member

dwnusbaum commented Nov 25, 2019

@basil My suggestions were a bit messed up and incomplete, so I filed basil#1 for improvements. It would be great if you could take a look at that PR and merge it in if you are ok with the changes.

@basil
Copy link
Member Author

basil commented Nov 25, 2019

@basil My suggestions were a bit messed up and incomplete, so I filed basil#1 for improvements. It would be great if you could take a look at that PR and merge it in if you are ok with the changes.

Thanks for helping me take this across the finish line! At this point I think the only remaining change we need to make is to release pipeline-build-step and resolve the TODO in pom.xml in this PR by pulling in the released version.

pom.xml Outdated Show resolved Hide resolved
Co-Authored-By: Devin Nusbaum <dwnusbaum@users.noreply.github.com>
@dwnusbaum dwnusbaum changed the title [JENKINS-49073] Parallel step should offer the ability to propagate the worst result [JENKINS-49073] Parallel step should propagate the worst result when not using failFast Nov 26, 2019
@dwnusbaum dwnusbaum merged commit f005885 into jenkinsci:master Nov 26, 2019
@basil basil deleted the propagate branch November 23, 2021 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants