
File output of multiple tasks? #110

Closed
henricasanova opened this issue Jun 5, 2019 · 12 comments
@henricasanova
Contributor

One of our users in Rennes mentioned that he had problems related to the fact that, in some DAX or JSON workflows, a single file is the output of multiple tasks! This causes weird behaviors in FileRegistry services: one task would produce a file called "foo", and then another task would find that "foo" already exists somewhere even though it's supposed to produce it.

I just assumed those were bogus workflow descriptions, and added a simple check in the WorkflowTask::addOutputFile method to throw an exception if a file is already output of another task. To my surprise, that broke one of our tests! The JSON workflow in the WorkflowLoadFromJSONTest.LoadValidJSON test has such a strange behavior. The error message in my exception is explicit:

Trying to set file 'columns.txt' as output of task 'individuals_ID0000002', but this file is already the output of task 'individuals_ID0000001'

So, is it valid to have a single file be the output of multiple tasks??? Or is the JSON in that test actually bogus?

@rafaelfsilva
Member

That should not be the case, so it is definitely a bug.

On a side note, when implementing the WorkQueue simulator, I noticed that WRENCH does not allow an input file to be used by more than one job. Is there a reason for that?

@henricasanova
Contributor Author

An input file cannot be used by more than one job? Perhaps you mean "one task"? But regardless, that should not be the case, and I am quite sure that we have tests in which this is the case....

@henricasanova
Contributor Author

I removed the addFileToMap() helper function, which was useless as it was used only twice, and tried to do too much, which led to problems. That code dates from way back when I was learning C++ :)

I double-checked that the code allows a single file to be input to multiple tasks, but checks that a file cannot be the output of multiple tasks. So I believe the code does the right thing now. This is not committed yet (see below).

But now the WorkflowLoadFromJSONTest.LoadValidJSON test fails because, in that JSON, we have a file that's the output of multiple tasks. More specifically, the file "columns.txt" is listed as the output of 40+ tasks!!! (for instance: output of individuals_ID0000001 and output of individuals_ID0000002). So, I am not sure what to do. For the test, of course, we can rename this file in 40+ different ways. But since this is a real Pegasus JSON file, I am concerned. Is this allowed???? And if it is, what does it mean...

@henricasanova
Contributor Author

henricasanova commented Aug 2, 2019

Just merged a branch into master that addresses the above in the following way:

  • Took out the Pegasus-specific workflow loading from the WRENCH core and put it in the tools/ directory. Compiling the tools now generates a separate library and installs one header file that can be used to parse Pegasus files. This may not be the best way of doing it, but I think it will do for a while.
  • Removed the notion of "task type" from the WRENCH core, as this was Pegasus-specific
  • Enabled exceptions in WorkflowTask::addOutputFile() whenever the same file is about to become the output of two distinct tasks

The changes above likely break the wrench-pegasus implementation because the code in tools/pegasus currently ignores the "auxiliary" and "transfer" tasks (there are TODOs in there). Furthermore, the WorkflowLoadFromJSONTest.LoadValidJSON test is now disabled, as it is also broken.

What remains to be done is:

  • tools/pegasus/src/PegasusWorkflowParser.cpp to do something with "auxiliary" and "transfer" tasks. These would entail creating dummy tasks, cloning/renaming files, adding control dependencies
  • Un-disable the WorkflowLoadFromJSONTest.LoadValidJSON test and make sure it passes
  • Make sure that everything works for the Pegasus simulator

@henricasanova
Contributor Author

I've been thinking about this, and thought I'd write down my thoughts here so as not to forget.

The WRENCH model: the workflow consists of tasks that read/write files, thus defining data dependencies, and can also incorporate control dependencies that are not data dependencies.

In Pegasus workflows we find, in addition to the above

  • Auxiliary and transfer tasks
  • tasks that have the same file as input and output (e.g., transfer tasks)
  • multiple tasks that have the same file as output (perhaps some concatenation operations?)

Part of the impedance mismatch above is that Pegasus workflows encode some implementation decisions applied to abstract workflows. This really cannot be part of WRENCH, as it is not WMS-generic but a Pegasus-specific approach.

Goal: how do we keep WRENCH simple/generic, yet make it possible to simulate Pegasus easily, AND allow people who do not care about Pegasus to import sensible Pegasus workflows (as abstract workflows), because Pegasus is a good source of workflows?

Questions: Here is a list of questions, and if the answers are what I think they are, then at the end of this post I have a proposal for a solution.

  1. From a .dax or .json Pegasus workflow file, can we infer the abstract workflow? (I think the answer is "yes")

  2. In an abstract Pegasus workflow, is it true that no task can have the same file as input and output? (I think the answer is "yes")

  3. In abstract Pegasus workflows, is it true that a file can be the output of multiple tasks? (I think the answer is, sadly, "yes", but a "no" would be good - meaning that there are bugs in currently available Pegasus .dax/.json files)

Proposal:

  1. By default, when calling createWorkflow(), the workflow loaded is the abstract workflow (assuming the answer to Question 1 above is "yes" and the answer to Question 2 above is "yes", and dealing with Question 3, in case its answer is "yes", by creating extra "fake" files...)

  2. An extra argument can be passed so that the workflow loaded is NOT the abstract workflow, but the realized Pegasus workflow with "auxiliary" and "transfer" tasks.

The challenge for step #2 above is how to include this Pegasus-specific stuff in WRENCH without making WRENCH too Pegasus-specific. This was done before by creating special task types, but that seemed strange because it wouldn't be relevant to other WMSs. But WorkflowTask is now just one type. BUT, the WMS developer can of course extend the WorkflowTask class and encode any workflow-specific things (e.g., including auxiliary/transfer/whatever tasks). And so, a Pegasus simulator would inspect the dynamic type and take appropriate action if the task is not a "normal" WRENCH task.

@rafaelfsilva
Member

Here are some comments to your questions and proposal:

Questions

  1. Yes
  2. Yes. This is a bug on the JSON generator for Pegasus workflows. I will fix it soon.
  3. Unfortunately, Pegasus allows an output file to be output of several tasks. However, I think this is not the correct behavior. I did a test run and I noticed that the output file gets overwritten by each task. So, I think that it is fine if we do not allow that. Thoughts?

Proposal

  1. I agree with this proposal. I would avoid creating "fake" files to reduce confusion, and just throw an exception if the situation occurs. I will fix the Pegasus workflows that we will make available as examples.
  2. I think we should avoid putting too many Pegasus-specific things into WRENCH. Maybe we could allow a way to extend the load function, so the simulator implementation can handle that. I completely agree that we should limit WRENCH to handling the abstract workflow; auxiliary tasks should be handled by simulator-specific implementations.

@henricasanova
Contributor Author

ok, this all sounds good:

  • if it's ok to throw exceptions when a file is the output of multiple tasks, that's great and easier.

  • Pegasus-specific things are ok-ish right now, because the code of the load function is in tools/pegasus. But yes, allowing the load function to be extended could be an option.

I'll start working on implementing the above, and see where that leads us.
Then you'll have to update the Pegasus simulator and see if the API changes are ok.

@henricasanova
Contributor Author

Looking at a few DAX files, I don't see any "auxiliary" or "transfer" jobs, like in the JSON. Is it the case that DAX files only show abstract workflows?

@rafaelfsilva
Member

The DAX used for simulation is an extension of the actual abstract workflow – i.e., it only contains the compute jobs extended with task runtimes and in some cases sizes of input/output files.

The JSON file is extracted from the actual execution, thus it contains all tasks for the executable workflow.

@henricasanova
Contributor Author

I just pushed changes to master (and, again, forgot to include the issue number in the commit message). Given that DAX and JSON files are qualitatively different, the API now has three methods:

  • PegasusWorkflowParser::createWorkflowFromDAX()
  • PegasusWorkflowParser::createWorkflowFromJSON()
  • PegasusWorkflowParser::createExecutableWorkflowFromJSON()

The first two methods above are implemented (with the JSON one removing auxiliary and transfer tasks... we'll see how it needs to be modified if/when it fails on some workflows). The third one is yet to be implemented, and we'll have to figure out how to do it. Here are some options:

  1. Not implement it at all. Just put its code in the Pegasus simulator; yes, some of that code will be cut-and-pasted from PegasusWorkflowParser::createWorkflowFromJSON(), but it's not a lot of code. In the Pegasus simulator, WorkflowTask could be extended, say as PegasusWorkflowTask, to implement the auxiliary and transfer tasks.

  2. Implement it and augment WorkflowTask so that it has some "custom/generic/string/void*" field that encodes the behavior/spec of auxiliary/transfer tasks.

  3. Implement it and add the above PegasusWorkflowTask in PegasusWorkflowParser.cpp

@henricasanova
Contributor Author

henricasanova commented Aug 15, 2019

After discussion with Rafael, option 1 above is the best choice. So, what remains to be done:

  • Update the Pegasus simulator to deal with the executable workflow stuff
  • Clean up the workflow generator and published workflows to have NO shared output file
  • Remove PegasusWorkflowParser::createExecutableWorkflowFromJSON() from WRENCH

@rafaelfsilva rafaelfsilva modified the milestones: 1.5, 1.6 Feb 4, 2020
@rafaelfsilva
Member

I am closing this issue since the Pegasus simulator (https://github.com/wrench-project/pegasus) has a parser (wrench-pegasus-parser.py) for actual Pegasus executions that fixes the bug mentioned above. Traces that will be generated with this new parser will be uploaded to the WorkflowHub Traces section (https://github.com/workflowhub/traces-pegasus).
