Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem in the constructed BT structure and execution #211

Open
roveri-marco opened this issue Apr 5, 2022 · 13 comments
Open

Problem in the constructed BT structure and execution #211

roveri-marco opened this issue Apr 5, 2022 · 13 comments

Comments

@roveri-marco
Copy link
Collaborator

I've created a simple example with three durative actions move deliver and load. I've added a mutual exclusion predicate `(nobusy ?r)'
The model does not contain any numeric fluent but only predicates. I can successfully run the launcher and the terminal

ros2 launch pddl_project_problem_4 pddl_project_problem_4_launch.py on one terminal

ros2 run plansys2_terminal plansys2_terminal on the second terminal

Then on the terminal I issue the commands specified in the command files in attach, and then I ask to generate a plan and to run it.

> get plan
plan: 
0:	(load r h c1 depot n3 empty)	[2]
2.001:	(load r h c4 depot n2 n3)	[2]
4.001:	(move h r depot l1)	[3]
7.001:	(deliver r h p1 c1 food l1 n2 n3)	[2]
9.001:	(move h r l1 depot)	[3]
12.001:	(load r h c2 depot n2 n3)	[2]
14.001:	(move h r depot l2)	[3]
17.001:	(deliver r h p2 c2 food l2 n2 n3)	[2]
19.001:	(move h r l2 depot)	[3]
22.001:	(load r h c3 depot n2 n3)	[2]
24.001:	(move h r depot l3)	[3]
27.001:	(deliver r h p4 c3 food l3 n2 n3)	[2]
29.002:	(deliver r h p4 c4 medecine l3 n3 empty)	[2]
> run
[(deliver r h p1 c1 food l1 n2 n3) [(move h r l1 depot) [ERROR] [1649160044.052233393] [executor_client]: Plan Failed
[WARN] [1649160044.052428155] [executor_client]: Action: (deliver r h p1 c1 food l1 n2 n3):7001 was cancelled
[WARN] [1649160044.052531583] [executor_client]: Action: (deliver r h p2 c2 food l2 n2 n3):17001 was not executed
[WARN] [1649160044.052605985] [executor_client]: Action: (deliver r h p4 c3 food l3 n2 n3):27001 was not executed
[WARN] [1649160044.052680407] [executor_client]: Action: (deliver r h p4 c4 medecine l3 n3 empty):29002 was not executed
[WARN] [1649160044.052755220] [executor_client]: Action: (load r h c1 depot n3 empty):0 succeeded with message_status: Loaded c1 on h
[WARN] [1649160044.052837107] [executor_client]: Action: (load r h c2 depot n2 n3):12001 was not executed
[WARN] [1649160044.052909024] [executor_client]: Action: (load r h c3 depot n2 n3):22001 was not executed
[WARN] [1649160044.052984950] [executor_client]: Action: (load r h c4 depot n2 n3):2000 succeeded with message_status: Loaded c4 on h
[WARN] [1649160044.053187625] [executor_client]: Action: (move h r depot l1):4001 succeeded with message_status: Moved handr from depot to l1
[WARN] [1649160044.053268048] [executor_client]: Action: (move h r depot l2):14001 was not executed
[WARN] [1649160044.053343593] [executor_client]: Action: (move h r depot l3):24001 was not executed
[WARN] [1649160044.053418866] [executor_client]: Action: (move h r l1 depot):9001 was cancelled
[WARN] [1649160044.053499951] [executor_client]: Action: (move h r l2 depot):19001 was not executed

Finished with error(s) 
> 

The launcher terminates with this error

Moving handr from depot to l1 [100%]  
[plansys2_node-1] [ERROR] [1649160041.842474977] [executor]: [(deliver r h p1 c1 food l1 n2 n3):7001]Error checking over all requirements: (and (isat r l1)(isat h l1)(isat p1 l1)(need p1 food)(cratecontent c1 food)(capacity_follow n2 n3))
[plansys2_node-1] [ERROR] [1649160041.939288584] [executor]: Executor BT finished with FAILURE state
[plansys2_node-1] [INFO] [1649160041.940379489] [executor]: Plan Failed
[plansys2_node-1] [PublisherZMQ] Server quitting.
[plansys2_node-1] [PublisherZMQ] just died. Exeption Context was terminated

I believe that the BT constructed is buggy since it tries to execute the (deliver r h p1 c1 food l1 n2 n3) and the (move h r l1 depot) together, while the move shall wait for the deliver to terminate. Thus, the overall conditions of the deliver are not met and the execution fails. If you comment out the lines that also force the move to use the (nobusy ?r) predicate, then the execution succeeds since it correctly waits for the deliver to terminate before starting the move.

If you comment the lines in the domain.pddl file where the (nobusy ?r) predicate is used, then other errors appear, although the plan is generated correctly (the execution fails).

The files to reproduce the errors:
pddl_project_problem_4-buggy.zip

commands.txt

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 7, 2022

Hi @roveri-marco.

This sounds similar to some issues I've seen in the past with incorrect parallel actions. I recently made some modifications to the Plan-to-Action Graph portion of the BTBuilder, which I think should have helped to correct this. See #206. Have you pulled the latest version of the code? I tested the example you provided on my machine and I could not replicate the error. If your code is up to date and you're still seeing this error, I will look into this further next week to double check there are no errors in the recent modifications.

> get plan
plan: 
0:	(load r h c1 depot n3 empty)	[2]
2.001:	(load r h c4 depot n2 n3)	[2]
4.001:	(move h r depot l1)	[3]
7.001:	(deliver r h p1 c1 food l1 n2 n3)	[2]
9.001:	(move h r l1 depot)	[3]
12.001:	(load r h c2 depot n2 n3)	[2]
14.001:	(move h r depot l2)	[3]
17.001:	(deliver r h p2 c2 food l2 n2 n3)	[2]
19.001:	(move h r l2 depot)	[3]
22.001:	(load r h c3 depot n2 n3)	[2]
24.001:	(move h r depot l3)	[3]
27.001:	(deliver r h p4 c3 food l3 n2 n3)	[2]
29.002:	(deliver r h p4 c4 medecine l3 n3 empty)	[2]
> run
[INFO] [1649304012.473790968] - executor_client:/home/jjzapf/src/ros2_planning_system/plansys2_executor/src/plansys2_executor/ExecutorClient.cpp:execute_and_check_plan:88: Plan Succeeded

Successful finished

@roveri-marco
Copy link
Collaborator Author

@jjzapf I'm no longer able to reproduce it, although I was in the master and I've not changed anything...

However, if I change the domain as in the domain.txt file attach (that was my original version of the problem) there is another error that appears (not sure it is only on my side or is it a bug). The commands are the same as the example above, but in this case the plan is different and there is more parallelism (for this reason I've created two instances of the load and deliver actions. The new plan is the following:

> get plan
plan: 
0:	(load r h c1 depot n3 empty)	[2]
0.001:	(load r h c4 depot n3 empty)	[2]
2.001:	(move h r depot l1)	[3]
5.001:	(deliver r h p1 c1 food l1 n3 empty)	[2]
7.001:	(move h r l1 depot)	[3]
10.001:	(load r h c2 depot n3 empty)	[2]
12.001:	(move h r depot l2)	[3]
15.001:	(deliver r h p2 c2 food l2 n3 empty)	[2]
17.001:	(move h r l2 depot)	[3]
20.001:	(load r h c3 depot n3 empty)	[2]
22.001:	(move h r depot l3)	[3]
25.001:	(deliver r h p4 c3 food l3 n3 empty)	[2]
25.002:	(deliver r h p4 c4 medecine l3 n3 empty)	[2]

It first executes the (load r h c1 depot n3 empty) and then the (load r h c4 depot n3 empty), but the two shall be executed together or the second wait for the first to start, indeed in this case if it waits the end then the semantics of execution is wrong since at the end of the first actions the preconditions of the second are no longer valid!

Robot r loading crate c1 on carrier h at depot [100%]  
Robot r loading crate c1 on carrier h at depot [100%]  
[plansys2_node-1] [ERROR] [1649314400.270742446] [executor]: [(load r h c4 depot n3 empty):1]Error checking at start reqs: (and (isat c4 depot)(capacity_c h empty))
[plansys2_node-1] [ERROR] [1649314400.366455060] [executor]: [(load r h c4 depot n3 empty):1]Error checking at start reqs: (and (isat c4 depot)(capacity_c h empty))

I believe, looking at the bt.txt in attach, that the BT wrongly waits for the first action to finish to start the second.

Last, in some more complex application we are building on top of plansys2 we face several random crash with very limited information on the reason some of the nodes crashes... it seems some shared pointer die, but it is very hard to debug.

@jjzapf @fmrico can you suggest me some approach to debug this random crashes?

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 9, 2022

When computing the root set, we look for actions that can be run in parallel. To test whether 2 actions can be run in parallel, we first test whether each action is executable given the current state. We then check whether the effects of one action will contradict the requirements of the other. For example, in a move action you might have a (at start (not (robot_at r loc))) effect that would contradict a (over all (robot_at r loc)) requirement. Since we do not know precisely how these actions will overlap, it seemed good to assume that they can overlap in any way possible or even be ran in sequence. Hence, we also check the requirements of an action against the "at end" effects. See code link below.

https://github.com/IntelligentRoboticsLabs/ros2_planning_system/blob/master/plansys2_executor/src/plansys2_executor/BTBuilder.cpp#L172

Unfortunately, this logic fails in the case you provided where an at end effect contradicts an at start requirement. Clearly PDDL does allow for this and 2 actions should be allowed to run in parallel so long as their at start and over all requirements/effects do not lead to contradictions. I tested removing the checks for the "at end" effects and this does seem to help the code to progress further. However, I still ran into some more errors later down the pipe.

I will have time early next week to dig into this further and will hopefully come up with a fix. Thanks for finding this error and providing a nice example for debugging!

@roveri-marco
Copy link
Collaborator Author

@jjzapf Thanks for the explanation. I look forward to testing it.
Moreover, is there a document that describes the conversion from a PDDL plan to the BT? I know there is a paper by @fmrico , but I believe that is only partially explaining the translation and I believe there have been modifications w.r.t. what is explained in that paper. We may work together on this document to precisely and correctly handle the conversion if you think it would be worth it (e.g. also with @fmrico ) and think to publish somewhere the result if it would be worth publishing at the end.

The other problem that we are facing is the random crash of some action nodes and/or of the executor.
We think that It might be related to resources (memory and/or CPU) since the machine was with all the CPUs loaded at 100%, but we are not sure... it might also be some memory problem. Can you or @fmrico suggest debugging strategies (e.g. enable compilation flags, enable additional logging, ...)?

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 9, 2022

I was thinking about fleshing out the description I made in #206 into a more complete document. That description covers the plan-to-action graph part of the code/algorithm. I tried to describe as best I could the previous approach as well as the modifications I made. I agree that it is crucial to get this part of the code correct and document the resulting approach. We should definitely update the code/website documentation. I'm also very open to working on a future publication with you and @fmrico.

I haven’t been able to look into the random crashes yet and don’t have any ideas off the top of my head. This week has been pretty busy for me, but I hope to dig into these issues in the coming week.

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 12, 2022

@roveri-marco, I opened PR #216, which fixes some of the issues discussed above. In particular, it fixes the bad logic regarding the checks for effect/condition contradictions when determining if 2 actions can run in parallel. It also fixes a bug that I found in the prune_backwards function in the BTBuilder. (See description in #216 for more details.)

Here is the action graph that I am getting for your example after applying the changes in #216.

action_graph

Here is the behavior tree that I am getting for your example after applying the changes in #216. Note that the red boxes represent the WaitAction sets.

bt

Unfortunately, I am still getting an error with your example. I added some additional print statements to see what was going on.

[plansys2_node-1] [ERROR] [(load r h c1 depot n3 empty):0] applying at end effects : (and (not (capacity_c h empty))(capacity_c h n3)(carry h c1))
[plansys2_node-1] [ERROR] [(move h r depot l1):2000] applying at start effects : (and (not (isat h depot))(not (isat r depot)))
[plansys2_node-1] [ERROR] [(load r h c4 depot n3 empty):1] Error checking over all requirements: (and (isat r depot)(isat h depot)(capacity_follow n3 empty)(deliverable c4))

It seems that an "over all" condition of (load r h c4 depot n3 empty) is being checked after an "at start" effect of (move h r depot l1) is being applied. However, (load r h c4 depot n3 empty) should finish before (move h r depot l1) starts. So I am not sure what is going on here. As far as I can tell, this error is not do to the action graph formation, but rather occurs somewhere later in the code. I will try to dig into this a bit more and open another PR if I find a solution.

@fmrico
Copy link
Contributor

fmrico commented Apr 12, 2022

We may work together on this document to precisely and correctly handle the conversion if you think it would be worth it (e.g. also with @fmrico ) and think to publish somewhere the result if it would be worth publishing at the end.

This would be great. Maybe we could try to define a theoretical framework to prove the correctness of the current algorithm 🤔 We could send it again to AAMAS or maybe try in ICAPS. Next week, when I return from holidays, I could start to think about it. Any comments or ideas are welcome.

@jjzapf I have just merged #216 , but I read that the bug is not still fixed. Should I reopen again?

@roveri-marco
Copy link
Collaborator Author

Hi @fmrico and @jjzapf I'll check and let you know as well.
We may discuss the theoretical framework after the Easter holidays.

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 12, 2022

I'll be happy to work on the theoretical framework and publication with you as well. I'll be on vacation the 1st half of next week, but I should be available starting Wednesday or Thursday.

@fmrico, thanks for merging #216. You don't need to reopen it. That PR focused on a bug related to the action graph formulation. I believe we are now getting the correct action graph for the example @roveri-marco provided. I think the unfixed bug has to do with the behavior tree execution. So I believe that should be handled in a separate PR. I'm not exactly sure what's going on, but I'll try to dig into it a bit more.

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 13, 2022

If we modify the if statement in this line

https://github.com/IntelligentRoboticsLabs/ros2_planning_system/blob/master/plansys2_executor/src/plansys2_executor/behavior_tree/check_overall_req_node.cpp#L49

to this

if (!(*action_map_)[action].action_executor->is_finished() && !check(reqs, problem_client_))

then the example @roveri-marco provided will finish successfully. I think the issue is that we are using a ReactiveSequence to handle this part of the behavior tree. From the BehaviorTree.CPP documentation, I found this warning concerning the use of a ReactiveSequence.

This node is particularly useful to continuously check Conditions; but the user should also be careful when using asynchronous children, to be sure that they are not ticked more often that expected.

By adding some print statements, I was able to discover that an "over all" requirements check for the 2nd "load" action was occurring after the "load" action had finished as well as after the "at start" effect of the "move" action had been applied. This is a problem because the "at start" effects of the "move" action conflict with the "over all" conditions of the "load" action. Thus, to fix this I added a check to the CheckOverAllReq code to see if the action had finished before checking the requirements and possibly triggering a failure.

Unfortunately, this change breaks some of our unit tests. I will try to see if it is possible to fix these.

One other thing that I noticed is that even though the first 2 actions should run in parallel, they seem to be running one after the other. This could be do to the fact that they are of the same action type "load". I noticed that @roveri-marco created 2 "load" action executor clients. However, I'm not sure if the system knows that it can use the 2nd action executor client if the first is busy. This is just a guess. @fmrico do you have any insight as to how the system would handle running 2 actions of the same type at the same time?

@roveri-marco
Copy link
Collaborator Author

The action auction mechanism should be able to handle it (at least from documentation), but I'm not sure it is handled internally.

@jjzapf
Copy link
Collaborator

jjzapf commented Apr 13, 2022

I think I found a better solution to the issue I discussed in my previous comment. See PR #217 for details. The short story is that I don't think the WaitAction node was waiting for the entire action sub-tree to finish, just the ExecuteAction node. Thus an "at start" effect of a successor action was allowed to occur before the predecessor action was really finished.

@fmrico
Copy link
Contributor

fmrico commented Apr 14, 2022

@fmrico do you have any insight as to how the system would handle running 2 actions of the same type at the same time?

If there is more than one action performer for one action, there should not be any problem. If there is only one, it will wait until one is available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants