Memory leak with persistent MPI sends and the ob1 "get" protocol #6565

Closed

s-kuberski opened this issue Apr 3, 2019 · 18 comments

@s-kuberski

Background information

A memory leak appears when using persistent communication with the vader BTL and large message sizes.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.0 on a local computer; 2.1.2 and 3.1.3 on clusters

Describe how Open MPI was installed

4.0.0: from tarball

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04 / CentOS 7.5 / Scientific Linux release 7.5
  • Computer hardware: Laptop / Intel Skylake / Intel Nehalem
  • Network type: vader RDMA

Details of the problem

While running a simulation program with Open MPI, we observed a memory leak that eventually caused the application to crash. The behaviour can be reproduced with the code block below.

When the vader BTL is used, memory usage increases linearly over time. The bug is directly connected to the message size:
with btl_vader_eager_limit set to 4096 and a message size of 4041 bytes, the bug appears. If the eager limit is raised or the message size is decreased, no problem occurs.

Of the single-copy mechanisms, only btl_vader_single_copy_mechanism=cma could be tested; no problem is seen if the value is set to none.

If buffered communication is used and the buffer is detached and attached manually, the problem does not appear.

Only the shared-memory communication with vader is affected. If the processes are located on different nodes or -mca btl ^vader is set, everything is fine.

#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
  int rank, size, rplus, rminus, cnt, max = 4000000, i;
  char *sbuf, *rbuf;
  static MPI_Request req;
  int lbuf = 4041;

  /* Initialize MPI and assert even size */
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

  if ( size & 1 ) {
    if ( rank == 0 ) fprintf(stderr,"ERROR: Invalid number of MPI tasks: %d\n", size);
    MPI_Finalize();
    return -1;
  }

  /* Optional arguments: max and lbuf*/
  if ( argc > 1 ) {
    if(atoi(argv[1])>0) max = atoi(argv[1]);
    if ( rank == 0 ) printf("max=%d\n", max);
  }
  if ( argc > 2 ) {
    lbuf = atoi(argv[2]);
    if ( rank == 0 ) printf("lbuf=%d\n", lbuf);
  }
  /* allocate buffers */
  sbuf = malloc(sizeof(char) * lbuf);
  rbuf = malloc(sizeof(char) * lbuf);
  /* Initialize buffers */
  for(i=0; i<lbuf; i++) sbuf[i] = rbuf[i] = 0; 
  sbuf[0] = rank;

  /* Initialize communicators: single message from all even ranks to next odd rank */
  rplus  = ( rank + 1 ) % size;
  rminus = ( rank - 1 + size ) % size;
  if ( rank & 1 )
    MPI_Recv_init(rbuf, lbuf, MPI_CHAR, rminus, 0, MPI_COMM_WORLD, &req);
  else
    MPI_Send_init(sbuf, lbuf, MPI_CHAR, rplus, 0, MPI_COMM_WORLD, &req);

  /* Repeat communications */
  MPI_Barrier(MPI_COMM_WORLD);
  for(cnt = 0; cnt<max; cnt++) {
    MPI_Status stat;
    MPI_Start(&req);
    MPI_Wait(&req,&stat);
  }
  MPI_Request_free(&req);
  MPI_Finalize();
  return 0;
}
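
For reference, here is a minimal build-and-run sketch for the reproducer above. The binary name getleak is an assumption, and the MCA values simply mirror the settings reported above (an eager limit of 4096 and the cma single-copy mechanism); adjust as needed for your installation.

$ mpicc -o getleak getleak.c
$ mpirun -np 2 \
    --mca btl vader,self \
    --mca btl_vader_eager_limit 4096 \
    --mca btl_vader_single_copy_mechanism cma \
    ./getleak
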
@jsquyres
Member

jsquyres commented Apr 9, 2019

Per #6547:

I think this is related to issue #5798 and was fixed with commit c076be5 but I don't think this patch was pulled into 4.0.x.

Need to check to see if c076be5 made it to v4.0.x.

@hppritcha @gpaulsen @aingerson FYI

@jsquyres
Member

jsquyres commented Apr 9, 2019

Ah, did #6550 fix the issue?

@aingerson @s-kuberski Can you check the v4.0.x nightly snapshot tarball that will be generated tonight (i.e., in a few hours) and see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.0.x/

@aingerson

Will do!

@s-kuberski
Author

I still see the behaviour with the mpirun executable built from the tarball (i.e., mpirun (Open MPI) 4.0.2a1), but @aingerson had better double-check this, in case I did something wrong with the installation.

@aingerson

I've tested the nightly snapshot twice with all of our tests and don't seem to be seeing the problem anymore.

@jsquyres
Member

An interesting disparity -- @s-kuberski can you try with the openmpi-v4.0.x-201904100241-811dfc6.tar.bz2 (or later) tarball from https://www.open-mpi.org/nightly/v4.0.x/?

@aingerson

I will also add that, for fun, I just tested the nightly tarball from a couple of days ago (before the fix was applied) and am still not seeing the error anymore... I'm going to do a bit more testing and figure out which exact test is triggering this issue. This is a race condition, right?

@hppritcha
Member

#6550 went into v4.0.x a couple of days ago.

@aingerson

It looks like it went in yesterday, and I tested the snapshot from 4/6.

@aingerson

I don't think my issue is actually related to this, or to what I originally thought the fix was.
I've been doing some more testing and have noticed that my issue was actually resolved between 4.0.0 and 4.0.1.
I hadn't tested 4.0.1 until now, and it seems to be OK with the stable release. Sorry for causing all the drama 😬

@s-kuberski
Author

> An interesting disparity -- @s-kuberski can you try with the openmpi-v4.0.x-201904100241-811dfc6.tar.bz2 (or later) tarball from https://www.open-mpi.org/nightly/v4.0.x/?

This is the version I tried yesterday. There is still a linear increase in memory usage...

@jsquyres
Member

I am able to reproduce this problem on master. Two things I notice so far:

  1. It happens with the ob1 "get" protocol; it does not happen with "put" or "send".
  2. The sender process is the process that grows without bound; the receiver process stays flat (in terms of memory usage).

The vader CMA get/put methods are downright simple; I can't see where a leak would happen there (particularly if the receiver is doing the CMA read, but the sender process is the one that is growing without bound).

This implies that it's an OB1 issue...?
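
As a quick way to see this per-rank growth while the reproducer runs, the resident set size of each process can be watched. This is a hedged sketch that assumes the reproducer binary is named getleak and that both ranks run on the local node.

# Prints the PID and RSS (in kB) of every process named "getleak", refreshed once per second.
$ watch -n 1 "ps -C getleak -o pid,rss,comm"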

@jsquyres
Member

Spoiler: I can reproduce the problem with the TCP BTL.

Expanding on the list from above:

  1. This appears on master, v2.x, v3.0.x, v3.1.x, and v4.0.x branches. I did not test back further than that.
  2. Here are the conditions for which I have been able to reproduce the problem:
    • Persistent send is used (receive mode does not matter; this does not happen with regular or buffered sends), AND
    • The Vader BTL is used with CMA or emulated (it does not happen with btl_vader_single_copy_mechanism=none), AND
    • The Vader BTL "get" flag must be set (i.e., if "get" is not set, it does not happen), AND
    • @s-kuberski cited (and I confirm) that btl_vader_eager_limit=4096 and the sent message is a contiguous 4041 bytes (I have not checked other eager limits / message sizes).
  3. I note that this also happens with the TCP BTL when the "get" flag is set and the btl_tcp_eager_limit=4096 and the message is a contiguous 4041 bytes.
  4. When it happens, the sender process grows without bound (easy to see via top).

Since the problem also happens with the TCP BTL, I'd say that vader is in the clear. This is likely an ob1 problem, or perhaps a general persistent send problem.

Here's how I reproduced the problem:

# Vader
# Both "emulated" and "cma" trigger the problem
$ mpirun --mca btl vader,self \
    --mca btl_vader_flags send,get,inplace,atomics,fetching-atomics \
    --mca btl_vader_single_copy_mechanism emulated \
    -np 2 ./leaky-mcleakface 40000000

# TCP
$ mpirun --mca btl tcp,self \
    --mca btl_tcp_flags send,get,inplace \
    -np 2 ./leaky-mcleakface 40000000
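
Based on the observation above that the leak only appears when the "get" flag is set, one possible workaround is to drop "get" from the BTL flag list while leaving everything else unchanged. This is only a sketch and has not been verified here:

# Untested workaround sketch: same vader command as above, with "get" removed from the flags
$ mpirun --mca btl vader,self \
    --mca btl_vader_flags send,inplace,atomics,fetching-atomics \
    --mca btl_vader_single_copy_mechanism emulated \
    -np 2 ./leaky-mcleakface 40000000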

jsquyres changed the title from "Memory leak with vader BTL" to "Memory leak with persistent MPI sends" on Apr 12, 2019
jsquyres changed the title from "Memory leak with persistent MPI sends" to "Memory leak with persistent MPI sends and the ob1 "get" protocol" on Apr 12, 2019
@kawashima-fj
Member

I commented out MPI_Finalize in @s-kuberski's original source and ran it under Valgrind. Valgrind reported a suspicious trace.

==19521== 51,553,992 bytes in 1,563 blocks are still reachable in loss record 5,703 of 5,703
==19521==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==19521==    by 0x5B0A10B: opal_free_list_grow_st (opal_free_list.c:210)
==19521==    by 0x10695885: opal_free_list_wait_st (opal_free_list.h:297)
==19521==    by 0x106958EA: opal_free_list_wait (opal_free_list.h:314)
==19521==    by 0x10698505: mca_pml_ob1_send_request_start_rdma (pml_ob1_sendreq.c:695)
==19521==    by 0x1069A881: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:426)
==19521==    by 0x1069AA25: mca_pml_ob1_send_request_start_seq (pml_ob1_sendreq.h:467)
==19521==    by 0x1069AB4B: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:500)
==19521==    by 0x1069AD12: mca_pml_ob1_start (pml_ob1_start.c:100)
==19521==    by 0x4F045E9: PMPI_Start (pstart.c:76)
==19521==    by 0x108EC4: main (getleak.c:50)

https://github.com/open-mpi/ompi/blob/master/ompi/mca/pml/ob1/pml_ob1_sendreq.c#L694-L698

I ran:

mpiexec -n 2 \
  --mca btl self,vader \
  --mca btl_vader_flags send,get,inplace,atomics,fetching-atomics \
  --mca btl_vader_eager_limit 4096 \
  --mca btl_vader_single_copy_mechanism emulated \
  valgrind --leak-check=full --show-leak-kinds=all ./getleak 100000

@kawashima-fj
Member

For blocking and nonblocking operations, the mca_pml_ob1_send_request_fini function is called in MPI_Send or MPI_Wait, and sendreq->rdma_frag is returned to the free list.

For persistent operations, the sendreq object is reused and the mca_pml_ob1_send_request_fini function is called only once, in MPI_Request_free. However, sendreq->rdma_frag is allocated by MCA_PML_OB1_RDMA_FRAG_ALLOC every time MPI_Start is called. This is probably the cause of the memory leak.

Do we need to return sendreq->rdma_frag to the free list with MCA_PML_OB1_RDMA_FRAG_RETURN every time the send operation completes (in mca_pml_ob1_rget_completion or somewhere)?

I don't have enough time to investigate further this week...
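
To make this concrete, the following self-contained sketch illustrates the pattern described above. It is not the actual ob1 code: the structs and functions are stand-ins for sendreq->rdma_frag, the free list, and the MCA_PML_OB1_RDMA_FRAG_ALLOC / MCA_PML_OB1_RDMA_FRAG_RETURN macros. It shows why allocating a fragment on every MPI_Start but returning it only once in the fini path grows memory without bound, and how returning the fragment on every completion avoids that.

/*
 * Simplified illustration of the leak pattern; NOT the actual ob1 code.
 * All names here are stand-ins for the real Open MPI internals.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct frag { struct frag *next; } frag_t;   /* stand-in for an rdma_frag */

static frag_t *free_list = NULL;  /* stand-in for the opal free list              */
static int grow_count = 0;        /* how many times the free list had to grow     */

/* stand-in for MCA_PML_OB1_RDMA_FRAG_ALLOC */
static frag_t *frag_alloc(void) {
  if (free_list != NULL) {
    frag_t *f = free_list;
    free_list = f->next;
    return f;
  }
  grow_count++;                   /* list empty: grow it with a fresh allocation  */
  return (frag_t *) malloc(sizeof(frag_t));
}

/* stand-in for MCA_PML_OB1_RDMA_FRAG_RETURN */
static void frag_return(frag_t *f) {
  if (f != NULL) {
    f->next = free_list;
    free_list = f;
  }
}

typedef struct { frag_t *rdma_frag; } sendreq_t;     /* stand-in for the send request */

int main(void) {
  sendreq_t req = { NULL };
  const int iterations = 1000;
  int i;

  /* Leaky pattern: every "MPI_Start" allocates a fragment, but the fragment is
   * only returned in the "fini" path, which a persistent request reaches once. */
  for (i = 0; i < iterations; i++) {
    req.rdma_frag = frag_alloc();
  }
  frag_return(req.rdma_frag);     /* "MPI_Request_free": only the last frag returns */
  printf("leaky pattern: free list grew %d times\n", grow_count);

  /* Fixed pattern: return the fragment on every completion, as suggested above. */
  grow_count = 0;
  for (i = 0; i < iterations; i++) {
    req.rdma_frag = frag_alloc();
    frag_return(req.rdma_frag);   /* done in the completion callback              */
    req.rdma_frag = NULL;
  }
  printf("fixed pattern: free list grew %d times\n", grow_count);
  return 0;
}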

kawashima-fj added a commit that referenced this issue May 3, 2019
Fix the leak of fragments for persistent sends (issue #6565)
@gpaulsen
Member

Resolved on v4.0.x in PR #6634.
Removing Target: v4.0.x label.

@awlauria
Contributor

It sounds like this issue can be closed?

@awlauria
Contributor

Closing - this went into the 4.0 series and is in master. 3.0 is closed as far as I know. If someone disagrees, please reopen.
