Memory leak with persistent MPI sends and the ob1 "get" protocol #6565

Closed

s-kuberski opened this issue Apr 3, 2019 · 18 comments

@s-kuberski

Background information

A memory leak appears when using persistent communication with the vader BTL and large message sizes.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.0 on a local computer; 2.1.2 and 3.1.3 on clusters

Describe how Open MPI was installed

4.0.0: from tarball

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04 / CentOS 7.5 / Scientific Linux release 7.5
  • Computer hardware: Laptop / Intel Skylake / Intel Nehalem
  • Network type: vader RDMA

Details of the problem

While running a simulation program with Open MPI, we observed a memory leak that eventually caused the application to crash. The behaviour can be reproduced with the code block below.

When the vader BTL is used, memory usage increases linearly over time. The bug is directly connected to the message size:
with btl_vader_eager_limit set to 4096 and a message size of 4041 bytes, the bug appears. If the eager limit is raised or the message size is decreased, no problem occurs.

Of the single-copy mechanisms, only btl_vader_single_copy_mechanism=cma could be tested; no problem is seen if the value is set to none.

If buffered communication is used and the buffer is detached and attached manually, the problem does not appear.

Only the shared-memory communication with vader is affected. If the processes are located on different nodes or -mca btl ^vader is set, everything is fine.

#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
  int rank, size, rplus, rminus, cnt, max = 4000000, i;
  char *sbuf, *rbuf;
  static MPI_Request req;
  int lbuf = 4041;

  /* Initialize MPI and assert even size */
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

  if ( size & 1 ) {
    if ( rank == 0 ) fprintf(stderr,"ERROR: Invalid number of MPI tasks: %d\n", size);
    MPI_Finalize();
    return -1;
  }

  /* Optional arguments: max and lbuf*/
  if ( argc > 1 ) {
    if(atoi(argv[1])>0) max = atoi(argv[1]);
    if ( rank == 0 ) printf("max=%d\n", max);
  }
  if ( argc > 2 ) {
    lbuf = atoi(argv[2]);
    if ( rank == 0 ) printf("lbuf=%d\n", lbuf);
  }
  /* allocate buffers */
  sbuf = malloc(sizeof(char) * lbuf);
  rbuf = malloc(sizeof(char) * lbuf);
  /* Initialize buffers */
  for(i=0; i<lbuf; i++) sbuf[i] = rbuf[i] = 0; 
  sbuf[0] = rank;

  /* Initialize communicators: single message from all even ranks to next odd rank */
  rplus  = ( rank + 1 ) % size;
  rminus = ( rank - 1 + size ) % size;
  if ( rank & 1 )
    MPI_Recv_init(rbuf, lbuf, MPI_CHAR, rminus, 0, MPI_COMM_WORLD, &req);
  else
    MPI_Send_init(sbuf, lbuf, MPI_CHAR, rplus, 0, MPI_COMM_WORLD, &req);

  /* Repeat communications */
  MPI_Barrier(MPI_COMM_WORLD);
  for(cnt = 0; cnt<max; cnt++) {
    MPI_Status stat;
    MPI_Start(&req);
    MPI_Wait(&req,&stat);
  }
  MPI_Request_free(&req);
  MPI_Finalize();
  return 0;
}
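
For reference, here is a minimal build-and-run sketch for the reproducer above. The binary name getleak is an assumption, and the MCA values simply mirror the settings reported above (an eager limit of 4096 and the cma single-copy mechanism); adjust as needed for your installation.

$ mpicc -o getleak getleak.c
$ mpirun -np 2 \
    --mca btl vader,self \
    --mca btl_vader_eager_limit 4096 \
    --mca btl_vader_single_copy_mechanism cma \
    ./getleak
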
@jsquyres
Member

jsquyres commented Apr 9, 2019

Per #6547:

I think this is related to issue #5798 and was fixed with commit c076be5 but I don't think this patch was pulled into 4.0.x.

Need to check to see if c076be5 made it to v4.0.x.

@hppritcha @gpaulsen @aingerson FYI

@jsquyres
Member

jsquyres commented Apr 9, 2019

Ah, did #6550 fix the issue?

@aingerson @s-kuberski Can you check the v4.0.x nightly snapshot tarball that will be generated tonight (i.e., in a few hours) and see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.0.x/

@aingerson

Will do!

@s-kuberski
Author

I still see the behaviour with the mpirun executable built from the tarball (i.e., mpirun (Open MPI) 4.0.2a1), but @aingerson had better double-check this, in case I did something wrong with the installation.

@aingerson

I've tested the nightly snapshot twice with all of our tests and don't seem to be seeing the problem anymore.

@jsquyres
Member

An interesting disparity -- @s-kuberski can you try with the openmpi-v4.0.x-201904100241-811dfc6.tar.bz2 (or later) tarball from https://www.open-mpi.org/nightly/v4.0.x/?

@aingerson

I will also add that, for fun, I just tested the nightly tarball from a couple of days ago (before the fix was applied) and am still not seeing the error anymore... I'm going to do a bit more testing and figure out which exact test is triggering this issue. This is a race condition, right?

@hppritcha
Member

#6550 went into v4.0.x a couple of days ago.

@aingerson

It looks like it went in yesterday, and I tested the snapshot from 4/6.

@aingerson

I don't think my issue is actually related to this, or to what I originally thought the fix was.
I've been doing some more testing and have noticed that my issue was actually resolved between 4.0.0 and 4.0.1.
I hadn't tested 4.0.1 until now, and it seems to be OK with the stable release. Sorry for causing all the drama 😬

@s-kuberski
Author

> An interesting disparity -- @s-kuberski can you try with the openmpi-v4.0.x-201904100241-811dfc6.tar.bz2 (or later) tarball from https://www.open-mpi.org/nightly/v4.0.x/?

This is the version I tried yesterday. There is still a linear increase in memory usage...

@jsquyres
Member

I am able to reproduce this problem on master. Two things I notice so far:

  1. It happens with the ob1 "get" protocol; it does not happen with "put" or "send".
  2. The sender process is the process that grows without bound; the receiver process stays flat (in terms of memory usage).

The vader CMA get/put methods are downright simple; I can't see where a leak would happen there (particularly if the receiver is doing the CMA read, but the sender process is the one that is growing without bound).

This implies that it's an OB1 issue...?
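
As a quick way to see this per-rank growth while the reproducer runs, the resident set size of each process can be watched. This is a hedged sketch that assumes the reproducer binary is named getleak and that both ranks run on the local node.

# Prints the PID and RSS (in kB) of every process named "getleak", refreshed once per second.
$ watch -n 1 "ps -C getleak -o pid,rss,comm"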

@jsquyres
Member

Spoiler: I can reproduce the problem with the TCP BTL.

Expanding on the list from above:

  1. This appears on master, v2.x, v3.0.x, v3.1.x, and v4.0.x branches. I did not test back further than that.
  2. Here are the conditions for which I have been able to reproduce the problem:
    • Persistent send is used (receive mode does not matter; this does not happen with regular or buffered sends), AND
    • The Vader BTL is used with CMA or emulated (it does not happen with btl_vader_single_copy_mechanism=none), AND
    • The Vader BTL "get" flag must be set (i.e., if "get" is not set, it does not happen), AND
    • @s-kuberski cited (and I confirm) that btl_vader_eager_limit=4096 and the sent message is a contiguous 4041 bytes (I have not checked other eager limits / message sizes).
  3. I note that this also happens with the TCP BTL when the "get" flag is set and the btl_tcp_eager_limit=4096 and the message is a contiguous 4041 bytes.
  4. When it happens, the sender process grows without bound (easy to see via top).

Since the problem also happens with the TCP BTL, I'd say that vader is in the clear. This is likely an ob1 problem, or perhaps a general persistent send problem.

Here's how I reproduced the problem:

# Vader
# Both "emulated" and "cma" trigger the problem
$ mpirun --mca btl vader,self \
    --mca btl_vader_flags send,get,inplace,atomics,fetching-atomics \
    --mca btl_vader_single_copy_mechanism emulated \
    -np 2 ./leaky-mcleakface 40000000

# TCP
$ mpirun --mca btl tcp,self \
    --mca btl_tcp_flags send,get,inplace \
    -np 2 ./leaky-mcleakface 40000000
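
Based on the observation above that the leak only appears when the "get" flag is set, one possible workaround is to drop "get" from the BTL flag list while leaving everything else unchanged. This is only a sketch and has not been verified here:

# Untested workaround sketch: same vader command as above, with "get" removed from the flags
$ mpirun --mca btl vader,self \
    --mca btl_vader_flags send,inplace,atomics,fetching-atomics \
    --mca btl_vader_single_copy_mechanism emulated \
    -np 2 ./leaky-mcleakface 40000000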

jsquyres changed the title from "Memory leak with vader BTL" to "Memory leak with persistent MPI sends" on Apr 12, 2019
jsquyres changed the title from "Memory leak with persistent MPI sends" to "Memory leak with persistent MPI sends and the ob1 "get" protocol" on Apr 12, 2019
@kawashima-fj
Member

I commented out MPI_Finalize in @s-kuberski's original source and ran it under Valgrind. Valgrind reported a suspicious trace.

==19521== 51,553,992 bytes in 1,563 blocks are still reachable in loss record 5,703 of 5,703
==19521==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==19521==    by 0x5B0A10B: opal_free_list_grow_st (opal_free_list.c:210)
==19521==    by 0x10695885: opal_free_list_wait_st (opal_free_list.h:297)
==19521==    by 0x106958EA: opal_free_list_wait (opal_free_list.h:314)
==19521==    by 0x10698505: mca_pml_ob1_send_request_start_rdma (pml_ob1_sendreq.c:695)
==19521==    by 0x1069A881: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:426)
==19521==    by 0x1069AA25: mca_pml_ob1_send_request_start_seq (pml_ob1_sendreq.h:467)
==19521==    by 0x1069AB4B: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:500)
==19521==    by 0x1069AD12: mca_pml_ob1_start (pml_ob1_start.c:100)
==19521==    by 0x4F045E9: PMPI_Start (pstart.c:76)
==19521==    by 0x108EC4: main (getleak.c:50)

https://github.com/open-mpi/ompi/blob/master/ompi/mca/pml/ob1/pml_ob1_sendreq.c#L694-L698

I ran:

mpiexec -n 2 \
  --mca btl self,vader \
  --mca btl_vader_flags send,get,inplace,atomics,fetching-atomics \
  --mca btl_vader_eager_limit 4096 \
  --mca btl_vader_single_copy_mechanism emulated \
  valgrind --leak-check=full --show-leak-kinds=all ./getleak 100000

@kawashima-fj
Member

For blocking and nonblocking operations, the mca_pml_ob1_send_request_fini function is called in MPI_Send or MPI_Wait, and sendreq->rdma_frag is returned to the free list.

For persistent operations, the sendreq object is reused and the mca_pml_ob1_send_request_fini function is called only once, in MPI_Request_free. However, sendreq->rdma_frag is allocated by MCA_PML_OB1_RDMA_FRAG_ALLOC every time MPI_Start is called. This is probably the cause of the memory leak.

Do we need to return sendreq->rdma_frag to the free list with MCA_PML_OB1_RDMA_FRAG_RETURN every time the send operation completes (in mca_pml_ob1_rget_completion or somewhere)?

I don't have enough time to investigate further this week...
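
To make this concrete, the following self-contained sketch illustrates the pattern described above. It is not the actual ob1 code: the structs and functions are stand-ins for sendreq->rdma_frag, the free list, and the MCA_PML_OB1_RDMA_FRAG_ALLOC / MCA_PML_OB1_RDMA_FRAG_RETURN macros. It shows why allocating a fragment on every MPI_Start but returning it only once in the fini path grows memory without bound, and how returning the fragment on every completion avoids that.

/*
 * Simplified illustration of the leak pattern; NOT the actual ob1 code.
 * All names here are stand-ins for the real Open MPI internals.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct frag { struct frag *next; } frag_t;   /* stand-in for an rdma_frag */

static frag_t *free_list = NULL;  /* stand-in for the opal free list              */
static int grow_count = 0;        /* how many times the free list had to grow     */

/* stand-in for MCA_PML_OB1_RDMA_FRAG_ALLOC */
static frag_t *frag_alloc(void) {
  if (free_list != NULL) {
    frag_t *f = free_list;
    free_list = f->next;
    return f;
  }
  grow_count++;                   /* list empty: grow it with a fresh allocation  */
  return (frag_t *) malloc(sizeof(frag_t));
}

/* stand-in for MCA_PML_OB1_RDMA_FRAG_RETURN */
static void frag_return(frag_t *f) {
  if (f != NULL) {
    f->next = free_list;
    free_list = f;
  }
}

typedef struct { frag_t *rdma_frag; } sendreq_t;     /* stand-in for the send request */

int main(void) {
  sendreq_t req = { NULL };
  const int iterations = 1000;
  int i;

  /* Leaky pattern: every "MPI_Start" allocates a fragment, but the fragment is
   * only returned in the "fini" path, which a persistent request reaches once. */
  for (i = 0; i < iterations; i++) {
    req.rdma_frag = frag_alloc();
  }
  frag_return(req.rdma_frag);     /* "MPI_Request_free": only the last frag returns */
  printf("leaky pattern: free list grew %d times\n", grow_count);

  /* Fixed pattern: return the fragment on every completion, as suggested above. */
  grow_count = 0;
  for (i = 0; i < iterations; i++) {
    req.rdma_frag = frag_alloc();
    frag_return(req.rdma_frag);   /* done in the completion callback              */
    req.rdma_frag = NULL;
  }
  printf("fixed pattern: free list grew %d times\n", grow_count);
  return 0;
}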

kawashima-fj added a commit that referenced this issue May 3, 2019
Fix the leak of fragments for persistent sends (issue #6565)
@gpaulsen
Member

Resolved on v4.0.x in PR #6634.
Removing Target: v4.0.x label.

@awlauria
Contributor

It sounds like this issue can be closed?

@awlauria
Contributor

Closing - this went into the 4.0 series and is in master. 3.0 is closed as far as I know. If someone disagrees, please reopen.
