Skip to content

comm: refactor comm hierarchical information #7367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 7, 2025
Merged

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Apr 3, 2025

Pull Request Description

MPIR_Comm currently contains hierarchical fields including:

    MPIR_Comm_hierarchy_kind_t hierarchy_kind;  /* flat, parent, node, or node_roots */
    struct MPIR_Comm *node_comm;        /* Comm of processes in this comm that are on
                                         * the same node as this process. */
    struct MPIR_Comm *node_roots_comm;  /* Comm of root processes for other nodes. */
    int *intranode_table;       /* intranode_table[i] gives the rank in
                                 * node_comm of rank i in this comm or -1 if i
                                 * is not in this process' node_comm.
                                 * It is of size 'local_size'. */
    int *internode_table;       /* internode_table[i] gives the rank in
                                 * node_roots_comm of rank i in this comm.
                                 * It is of size 'local_size'. */
    int node_count;             /* number of nodes this comm is spread over */
  1. Hierarchical fields are optimizations and they should be optional. A communicator without hierarchical info should still work and be treated as a "flat" topology.
  2. The design of these fileds are inefficient, both inflexible and wastes memory. For example, if a communicator is "balanced" and "consecutive" -- a common target for performance engineer -- then it is easier and faster to directly compute rank translation without consulting a table. If the communicator is "irregular", then it is likely not worthwhile for the optimization.
  3. hierarchy_kind is not particularly useful. All we need is a boolean attribute to tell whether it communicator has hierarchical structure for optimization or not.
  4. The subcomms need be lightweight and flexible. They should not be used for querying hierarchical info.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou force-pushed the 2504_subcomm branch 2 times, most recently from 3d4b1e9 to 339e488 Compare April 3, 2025 20:43
hzhou added 3 commits April 3, 2025 16:17
With MPIR_comm_rank_to_lpid, we can look up process' node_id from
MPIR_Process.node_map. For dynamic processes, we don't support node id
anyway.
Replace direct access comm_ptr->intranode_table[] with
MPIR_Get_intranode_rank. We can optimize MPIR_Get_intranode_rank later
by exploiting regular node topology as well as inlining the function.
Maintaining intranode_table is not efficient.
TODO: optimize MPIR_Get_intranode_rank.
@hzhou hzhou force-pushed the 2504_subcomm branch 11 times, most recently from c640b0f to b62ac53 Compare April 6, 2025 03:03
hzhou added 10 commits April 5, 2025 22:04
Factor out the create_subcomm from creating node_comm and
node_roots_comm.

Now that subcomm calls MPID_Subcomm_create_hook rather than
MPID_Comm_commit_pre_hook, we no longer need the wrapper
MPIR_Comm_commit_internal. It is removed.

Move the creation of subcomms to the end of MPIR_Comm_commit so the
subcomms can inherit settings such as vci_enabled and seq.
We need construct subcomms before MPID_Comm_commit_post_hook because
device-layer need subcomms to setup vcis, currently for comm_world, but
later for MPI_Comm_create_from_group. The vci settings need be
propagated to subcomms after MPID_Comm_commit_post_hook.
MPIR_Find_local and MPIR_Find_external are obscure and inefficient.
Refactor by first obtain comm->{num_local, num_external, local_rank,
external_rank} and then internode_table and intranode_table. Then we
can directly construct subcomms from internode_table and
intranode_table.

TODO: Storing the rank tables in every comm takes memory. We should
optimize for flat comms and balanced and consecutive comms since those
rankmaps can be easily computed.
Everything can be figured out more efficiently with comm->node_map.

Remove internode_table and intranode_table.

node_count is the same as num_external.
The assumption is MPIDI_Comm_create_multi_leaders only called for
a regular hierarchical communicator. With that assumption, the
multi_leads_comm can be directly computed rather via MPIR_Find_local and
MPIR_Find_external.

Also the for-loop with condition to select i==local_rank is weird and
unnecessary.

This clean up removes the usage of MPIR_Find_local and
MPIR_Find_external.

TODO: clean up other internal comms.
When comm->attr is 0, many fields in MPIR_Comm are unset. The comm
should still work but all the optimization parts should check comm->attr
before accessing the corresponding comm->fields.

Add attribute MPIR_COMM_ATTR__HIERARCHY to mark the comm contains
hierarchical fields.

This allows for lightweight internal comm that can be used in boot
strapping or in commposite algorithms that do not need complex and heavy
full Comm_commit.
There is no easy way to tell whether a comm is a user-visible or an
internal subcomm. Add a attr bit for it.

NOTE: subcomms are only used in collectives. If we swap the ranks in
MPIC send/recv functions and convert the subcomm traffic into parent
comm, then we don't need context offset such as
MPIR_CONTEXT_INTRANODE_OFFSET and MPIR_CONTEXT_INTERNODE_OFFSET.
In addition, we can skip most of facilities in the subcomm achieving the
same goals as in pmodels#7103.
Group the fields by usage groups.
A more concise and extendable way for hierarchical charateristics.

This will replace hierarchy_kind.
The only real usage for checking hierarchy_kind is whether the comm
contatains optimizable hierarchical structure. This is to check whether
hierarchy_kind == MPIR_COMM_HIERARCHY_KIND__PARENT.

We can simply check whether the MPIR_COMM_ATTR__HIERARCY bit is on for
the same result.
@hzhou hzhou force-pushed the 2504_subcomm branch 2 times, most recently from 30ca384 to 2357b7a Compare April 6, 2025 21:30
@hzhou
Copy link
Contributor Author

hzhou commented Apr 6, 2025

test:mpich/ch4/most
test:mpich/ch3/most

@hzhou hzhou requested a review from raffenet April 7, 2025 03:24
Copy link
Contributor

@raffenet raffenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments but looks OK. It's going to be disruptive to anyone maintaining hierarchical algos, so we should note in CHANGES. Maybe with a nod to the benefits 😉.

@@ -121,6 +121,9 @@ enum MPIR_COMM_HINT_PREDEFINED_t {
S*/
struct MPIR_Comm {
MPIR_OBJECT_HEADER; /* adds handle and ref_count fields */
int attr; /* if attr is 0, only the core set of fields are set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way we can make it clearer these are separate from user attributes? It's not critical, just noting the name collision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to suggestions. But I think it is ok. There is no way people will confuse user attributes with this since there is no way user attributes can fit into the simplicity of single int :)

Comment on lines 534 to 557
if (num_local * num_nodes == comm_size) {
comm->hierarchy_flags |= MPIR_COMM_HIERARCHY__NODE_BALANCED;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this check strict enough? the old check put all the ranks in node buckets and checked that each bucket was the same size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The num_local is only the size of the local node. Let me update it by iterating through the node map.

} else {
/* canonical or trivial */
if ((comm->rank / comm->num_local) == (r / comm->num_local)) {
return r % comm->num_local;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there another minor optimization to be had in the node consecutive case where we can subtract and offset rather than use the modulo operator? otherwise this looks good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you still need maintain an array of node offsets. I don't think the use case will be common enough for the extra code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only thinking of the node consecutive case and where we've already determined they are on the same node. Something like:

/* canonical or trivial */
if ((comm->rank / comm->num_local) == (r / comm->num_local)) {
    if (comm->hierarchy_flags & MPIR_COMM_HIERARCHY__NODE_CONSECUTIVE) {
        return r - comm->local_offset;
    }
    return r % comm->num_local;
} else {
    return -1;
}

But we're already 3 branches deep so maybe it's not worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't reliably use comm->rank / comm->num_local when comm->num_local may not represent node boundary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting a not-yet-defined comm->local_offset or comm->node_offset which you could compute at creation time. Anyway we don't need to hold up this PR on it.

hzhou added 7 commits April 7, 2025 12:39
Rather than figuring out some hierarchical characteristics at runtime, do
the check at commit time.
Always check (attr | MPIR_COMM_ATTR_HIERARCHY) before access comm's
hierarchical fields.
Cananonical comm, node consecutive and node balanced, can trivially
compute rank maps so we don't need store the tables. This should save
some memory for very large jobs.
we may not always construct comm->node_comm and comm->node_roots_comm
especially for trivial hierarchy. Use MPIR_Comm_get_node_comm and
MPIR_Comm_get_node_roots_comm will work for the trivial cases and it is
useful for simplicity in some cases.
Builtin comms may get created during bootstrap especially comm_self.
Thus we should check to avoid duplicate building builtin comms.
The node_comm may not get created for trivial hierarchy.
@hzhou
Copy link
Contributor Author

hzhou commented Apr 7, 2025

test:mpich/ch4/most
test:mpich/ch3/most

I don't think these are related to this PR:

rma.02754 - ./rma/linked_list 4
rma.02793 - ./rma/linked_list_bench_lock_excl 4
rma.02794 - ./rma/linked_list_bench_lock_shr 4
rma.02795 - ./rma/linked_list_bench_lock_shr_nocheck 4 

but I wonder whether this indicates a regression at some point.

@hzhou hzhou merged commit 2a1e89c into pmodels:main Apr 7, 2025
7 of 8 checks passed
@hzhou hzhou deleted the 2504_subcomm branch April 7, 2025 21:42
@hzhou hzhou mentioned this pull request Apr 7, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants