Increase MaxSessions and MaxStartups in sshd config
When more than 10 data movement requests come in at once for a particular
rabbit, the default sshd configuration (MaxStartups 10:30:100) starts
dropping 30% of new connections, drops a growing share as more pile up,
and drops all of them once 100 unauthenticated connections are pending.
This causes data movement requests to fail, since any concurrency over 10
can cause sshd to close connections coming from mpirun.
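
As a rough illustration of the 10:30:100 setting described above, the sketch below estimates the percentage of new connections refused for a given number of pending unauthenticated connections. It assumes the linear ramp documented in sshd_config(5) for `MaxStartups start:rate:full`; the helper name `drop_pct` is hypothetical and is not part of sshd or of this commit.

```shell
#!/bin/sh
# Hypothetical helper (not part of sshd): approximate percentage of new
# connections refused when `pending` unauthenticated connections exist,
# given MaxStartups start:rate:full.
drop_pct() {
    pending=$1; start=$2; rate=$3; full=$4
    if [ "$pending" -lt "$start" ]; then
        echo 0          # below the start threshold nothing is refused
    elif [ "$pending" -ge "$full" ]; then
        echo 100        # at or above full every attempt is refused
    else
        # linear ramp from rate% at start up to 100% at full
        echo $(( rate + (100 - rate) * (pending - start) / (full - start) ))
    fi
}

# Walk through the default 10:30:100 setting.
for count in 5 10 55 100; do
    echo "$count pending -> $(drop_pct "$count" 10 30 100)% refused"
done
```

With the defaults, anything past 10 simultaneous unauthenticated connections is at risk, which matches the failures described above.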

This change raises both limits so that sshd can handle the maximum
theoretical load for a particular rabbit. This image runs as 1 pod per
rabbit node (i.e. nnf-dm-worker-*), and each rabbit node supports 16
compute nodes of 192 cores. Each core on a compute node could be creating
a data movement request.

16 * 192 = 3072

Bump it up to the next power of 2 for good measure -> 4096.
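
The sizing arithmetic above can be reproduced with a small shell sketch. The function name `next_pow2` and the variable names are illustrative assumptions, not something taken from the repository.

```shell
#!/bin/sh
# Hypothetical helper: smallest power of 2 that is >= the argument.
next_pow2() {
    target=$1
    power=1
    while [ "$power" -lt "$target" ]; do
        power=$(( power * 2 ))
    done
    echo "$power"
}

compute_nodes=16       # compute nodes per rabbit node (from the commit message)
cores_per_node=192     # cores per compute node (from the commit message)
peak=$(( compute_nodes * cores_per_node ))

echo "peak=$peak limit=$(next_pow2 "$peak")"
# prints: peak=3072 limit=4096
```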

Signed-off-by: Blake Devcich <blake.devcich@hpe.com>
bdevcich committed Aug 23, 2024
1 parent 042f4c6 commit 822127c
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions Dockerfile
@@ -98,6 +98,13 @@ COPY --from=builder /deps/dtcmp/lib/ /usr

COPY --from=builder /mfu/ /usr

# Increase the number of allowed incoming ssh connections to support many mpirun applications
# attempting to hit an mpi host/worker (i.e. rabbit node) all at once. A compute node has 192 cores,
# and each rabbit has 16 compute nodes. This means 3072 (192*16) ssh connections could come in at
# once. Round up to the next power of 2 for good measure.
RUN sed -i "s/[ #]\(.*MaxSessions\).*/\1 4096/g" /etc/ssh/sshd_config \
&& sed -i "s/[ #]\(.*MaxStartups\).*/\1 4096/g" /etc/ssh/sshd_config

###############################################################################
# Pull in the debugging symbols on top of production image
FROM production AS debug
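
To see what the two sed expressions in the diff do, the sketch below applies them to a throwaway file. It assumes the image's /etc/ssh/sshd_config ships the stock commented-out defaults (`#MaxSessions 10`, `#MaxStartups 10:30:100`); the sample file is made up for illustration, and `-i` is dropped so the result goes to stdout instead of being written in place.

```shell
#!/bin/sh
# Build a sample config with the commented-out defaults (assumed content).
sample=$(mktemp)
printf '%s\n' '#MaxSessions 10' '#MaxStartups 10:30:100' > "$sample"

# Same expressions as in the Dockerfile, applied without -i.
result=$(sed -e "s/[ #]\(.*MaxSessions\).*/\1 4096/g" \
             -e "s/[ #]\(.*MaxStartups\).*/\1 4096/g" "$sample")

echo "$result"
rm -f "$sample"
```

Note that the pattern requires a leading space or `#` before the option name, so a line that already starts with `MaxSessions` at column 0 would be left unchanged. Running `sshd -T` inside the built image is one way to check the effective values.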