-
Notifications
You must be signed in to change notification settings - Fork 293
comm: refactor comm hierarchical information #7367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3d4b1e9
to
339e488
Compare
With MPIR_comm_rank_to_lpid, we can look up process' node_id from MPIR_Process.node_map. For dynamic processes, we don't support node id anyway.
Replace direct access comm_ptr->intranode_table[] with MPIR_Get_intranode_rank. We can optimize MPIR_Get_intranode_rank later by exploiting regular node topology as well as inlining the function. Maintaining intranode_table is not efficient.
TODO: optimize MPIR_Get_intranode_rank.
c640b0f
to
b62ac53
Compare
Factor out the create_subcomm from creating node_comm and node_roots_comm. Now that subcomm calls MPID_Subcomm_create_hook rather than MPID_Comm_commit_pre_hook, we no longer need the wrapper MPIR_Comm_commit_internal. It is removed. Move the creation of subcomms to the end of MPIR_Comm_commit so the subcomms can inherit settings such as vci_enabled and seq.
We need construct subcomms before MPID_Comm_commit_post_hook because device-layer need subcomms to setup vcis, currently for comm_world, but later for MPI_Comm_create_from_group. The vci settings need be propagated to subcomms after MPID_Comm_commit_post_hook.
MPIR_Find_local and MPIR_Find_external are obscure and inefficient. Refactor by first obtain comm->{num_local, num_external, local_rank, external_rank} and then internode_table and intranode_table. Then we can directly construct subcomms from internode_table and intranode_table. TODO: Storing the rank tables in every comm takes memory. We should optimize for flat comms and balanced and consecutive comms since those rankmaps can be easily computed.
Everything can be figured out more efficiently with comm->node_map. Remove internode_table and intranode_table. node_count is the same as num_external.
The assumption is MPIDI_Comm_create_multi_leaders only called for a regular hierarchical communicator. With that assumption, the multi_leads_comm can be directly computed rather via MPIR_Find_local and MPIR_Find_external. Also the for-loop with condition to select i==local_rank is weird and unnecessary. This clean up removes the usage of MPIR_Find_local and MPIR_Find_external. TODO: clean up other internal comms.
When comm->attr is 0, many fields in MPIR_Comm are unset. The comm should still work but all the optimization parts should check comm->attr before accessing the corresponding comm->fields. Add attribute MPIR_COMM_ATTR__HIERARCHY to mark the comm contains hierarchical fields. This allows for lightweight internal comm that can be used in boot strapping or in commposite algorithms that do not need complex and heavy full Comm_commit.
There is no easy way to tell whether a comm is a user-visible or an internal subcomm. Add a attr bit for it. NOTE: subcomms are only used in collectives. If we swap the ranks in MPIC send/recv functions and convert the subcomm traffic into parent comm, then we don't need context offset such as MPIR_CONTEXT_INTRANODE_OFFSET and MPIR_CONTEXT_INTERNODE_OFFSET. In addition, we can skip most of facilities in the subcomm achieving the same goals as in pmodels#7103.
Group the fields by usage groups.
A more concise and extendable way for hierarchical charateristics. This will replace hierarchy_kind.
The only real usage for checking hierarchy_kind is whether the comm contatains optimizable hierarchical structure. This is to check whether hierarchy_kind == MPIR_COMM_HIERARCHY_KIND__PARENT. We can simply check whether the MPIR_COMM_ATTR__HIERARCY bit is on for the same result.
30ca384
to
2357b7a
Compare
test:mpich/ch4/most |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments but looks OK. It's going to be disruptive to anyone maintaining hierarchical algos, so we should note in CHANGES. Maybe with a nod to the benefits 😉.
@@ -121,6 +121,9 @@ enum MPIR_COMM_HINT_PREDEFINED_t { | |||
S*/ | |||
struct MPIR_Comm { | |||
MPIR_OBJECT_HEADER; /* adds handle and ref_count fields */ | |||
int attr; /* if attr is 0, only the core set of fields are set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some way we can make it clearer these are separate from user attributes? It's not critical, just noting the name collision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to suggestions. But I think it is ok. There is no way people will confuse user attributes with this since there is no way user attributes can fit into the simplicity of single int
:)
src/mpi/comm/commutil.c
Outdated
if (num_local * num_nodes == comm_size) { | ||
comm->hierarchy_flags |= MPIR_COMM_HIERARCHY__NODE_BALANCED; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this check strict enough? the old check put all the ranks in node buckets and checked that each bucket was the same size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. The num_local
is only the size of the local node. Let me update it by iterating through the node map.
} else { | ||
/* canonical or trivial */ | ||
if ((comm->rank / comm->num_local) == (r / comm->num_local)) { | ||
return r % comm->num_local; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there another minor optimization to be had in the node consecutive case where we can subtract and offset rather than use the modulo operator? otherwise this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you still need maintain an array of node offsets. I don't think the use case will be common enough for the extra code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only thinking of the node consecutive case and where we've already determined they are on the same node. Something like:
/* canonical or trivial */
if ((comm->rank / comm->num_local) == (r / comm->num_local)) {
if (comm->hierarchy_flags & MPIR_COMM_HIERARCHY__NODE_CONSECUTIVE) {
return r - comm->local_offset;
}
return r % comm->num_local;
} else {
return -1;
}
But we're already 3 branches deep so maybe it's not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't reliably use comm->rank / comm->num_local
when comm->num_local
may not represent node boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting a not-yet-defined comm->local_offset
or comm->node_offset
which you could compute at creation time. Anyway we don't need to hold up this PR on it.
Rather than figuring out some hierarchical characteristics at runtime, do the check at commit time.
Always check (attr | MPIR_COMM_ATTR_HIERARCHY) before access comm's hierarchical fields.
Cananonical comm, node consecutive and node balanced, can trivially compute rank maps so we don't need store the tables. This should save some memory for very large jobs.
we may not always construct comm->node_comm and comm->node_roots_comm especially for trivial hierarchy. Use MPIR_Comm_get_node_comm and MPIR_Comm_get_node_roots_comm will work for the trivial cases and it is useful for simplicity in some cases.
Builtin comms may get created during bootstrap especially comm_self. Thus we should check to avoid duplicate building builtin comms.
The node_comm may not get created for trivial hierarchy.
test:mpich/ch4/most I don't think these are related to this PR:
but I wonder whether this indicates a regression at some point. |
Pull Request Description
MPIR_Comm
currently contains hierarchical fields including:hierarchy_kind
is not particularly useful. All we need is a boolean attribute to tell whether it communicator has hierarchical structure for optimization or not.[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.