-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
btl/tcp: Don't use mca_btl_tcp_put. #8984
btl/tcp: Don't use mca_btl_tcp_put. #8984
Conversation
Use btl/base active messaging protocols instead. Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
@bosilca Can you please take a look? |
Hmm, how does this impact the large message two-sided performance when using TCP? Also, if it is better to disable btl/tcp put you should just remove it from the default for btl_flags. That way it can be turned on if needed. |
@hjelmn doing a code inspection, I don't see a consumer of btl_put except for osc/rdma, but I could be wrong on that. If it's not the case, is it possible to turn this off only for osc and leave it on for point-to-point? |
@hjelmn ^^^ ? |
@hjelmn Can you please take another look? This apparently works around a wrong answer in one-sided, so it's desirable as a band-aid for v5.0.0 until we can get a better fix. |
@wckzhang Are you also familiar with BTL/TCP? Could you also please take a look? v5.0.0rc1 is less than one month away. |
We don't know why the TCP BTL put is incorrect? So we're disabling it in in favor of other protocols as a temporary fix? Is the wrong answer difficult to debug? |
I'll dig up the test case, it was a simple one. If I remember right, the test was doing accumulate in a loop, and the accumulates were completing out of order with this path. |
btl_put is used by pml/ob1 as well. |
I'm pretty wary of just disabling a feature like this. I feel like this is a place where the "temporary" solution would just become permanent since nobody will want to commit time for a real fix. |
Exactly. Though I am concerned that this behavior is incorrect even for ob1. I am not expecting to have time to look tomorrow or Monday. |
The counter to that statement is: since nobody will want to commit time to create a fix, this would block all future OMPI releases...which is why we propose the alternative. If you truly feel that this feature must work for OMPI to release, then it sounds to me like you are taking responsibility for making that happen in a timely fashion - yes? |
The proposed solution has the unfortunate side-effect of decreasing the performance for two-sided communications over TCP, a quite drastic "alternative". I tried to follow the discussion across the issues and past PR, but it is not clear to me what the root cause is. Can someone summarize it? |
This test fails under IBM mtt with a wrong answer, here:
Here's the test source: https://github.com/open-mpi/ompi-tests/blob/master/ibm/onesided/c_post_start.c My thinking was this path shouldn't be used for the osc/rdma path, but perhaps I am wrong and there is an alternative fix. I am not sure if this path can be disabled only for osc and not pml/ob1 - I'm not familiar with it enough. |
Use btl/base active messaging protocols instead.
This was included/discussed a bit in #8756, but that PR has some other issues.
It might be easier to get changes in piecemeal.
Fixes #8983
Signed-off-by: Austen Lauria awlauria@us.ibm.com