ompi/v3.x.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 #6932

Closed
ericch1 opened this issue Aug 27, 2019 · 20 comments

Comments

@ericch1

ericch1 commented Aug 27, 2019

Hi,

EDIT: I corrected the SHAs in this first message, since it originally pointed to the wrong commit.
Up to commit d3587f5 everything was fine, but
as of commit 390e0bc some of our tests are failing with errors like this:

[dockercentos7:18478] opal_datatype_pack.c:203
	Pointer 0xdf6c970 size 9 is outside [0xdf6c880,0xdf6c969] for
	base ptr 0xdf6c880 count 10 and data 
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:* 
--C---P-D--[---][---]    OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

[dockercentos7:18478] opal_datatype_unpack.c:135
	Pointer 0xeb57a98 size 9 is outside [0xeb579a8,0xeb57a91] for
	base ptr 0xeb579a8 count 10 and data 
[dockercentos7:18478] Datatype 0xa10f7a0[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT1:* OPAL_INT8:* OPAL_FLOAT8:* 
--C---P-D--[---][---]    OPAL_FLOAT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT1 count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x0 (0) blen 8 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

Other example:

[dockercentos7:09967] opal_datatype_pack.c:203
	Pointer 0x8be7d78 size 9 is outside [0x8be4c40,0x8be7d71] for
	base ptr 0x8be4c40 count 525 and data 
[dockercentos7:09967] Datatype 0x8ab8650[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT8:* OPAL_BOOL:* 
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_BOOL count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

[dockercentos7:09967] *** Process received signal ***
[dockercentos7:09967] Signal: Aborted (6)
[dockercentos7:09967] Signal code:  (-6)
[dockercentos7:09967] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f355e57d5d0]
[dockercentos7:09967] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f355d5a2207]
[dockercentos7:09967] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f355d5a38f8]
[dockercentos7:09967] [ 3] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_Z15attacheDebuggerv+0x2c5e)[0x41a3ee]
[dockercentos7:09967] [ 4] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x2bd0)[0x7f356bcfd7e0]
[dockercentos7:09967] [ 5] /lib64/libc.so.6(+0x36280)[0x7f355d5a2280]
[dockercentos7:09967] [ 6] /lib64/libc.so.6(__sched_yield+0x7)[0x7f355d64ed47]
[dockercentos7:09967] [ 7] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(opal_progress+0xc0)[0x7f355c1988f0]
[dockercentos7:09967] [ 8] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0x187)[0x7f355c1a10a5]
[dockercentos7:09967] [ 9] /opt/openmpi-4.x_debug/lib/libmpi.so.40(+0x5ef27)[0x7f355f164f27]
[dockercentos7:09967] [10] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7f355f164fe9]
[dockercentos7:09967] [11] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xeb)[0x7f355f209957]
[dockercentos7:09967] [12] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x35e)[0x7f355f20b976]
[dockercentos7:09967] [13] /opt/openmpi-4.x_debug/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xa8)[0x7f354b37e42e]
[dockercentos7:09967] [14] /opt/openmpi-4.x_debug/lib/libmpi.so.40(PMPI_Allreduce+0x3c5)[0x7f355f181612]

http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_confdefs.h
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.19.20h08m05s_ompi_info_all.txt

All failing tests have more than 1 process.
They are all showing opal_datatype_pack.c:203 and opal_datatype_unpack.c:135 as above.

Note that we are compiling/testing with --enable-debug ...

I do not have a MWE now, but I wanted to report asap so you can be aware of this.

Thanks,

Eric

@hppritcha
Member

@ericch1 could you provide us with a test case?

@ericch1
Author

ericch1 commented Aug 27, 2019

Ok, I will try to do this this week. It is not easy to extract an example from the code, but since it looks like it happens at the beginning, I should be able to do it...

@ericch1
Author

ericch1 commented Aug 27, 2019

It is not as easy as I thought... I will try to see if valgrind gives us some clues...
In the meantime, I will test commit fd13b27

@ericch1
Author

ericch1 commented Aug 28, 2019

Ok, the commit fd13b27 is good.
Now I am launching the tests against commit 7b09c15, I will have the results in a few hours.

@ericch1
Author

ericch1 commented Aug 28, 2019

Ok, the run against 7b09c15 is not yet finished, but all the tests that were failing now pass! So the bad merge really seems to be only the modifications in f96994b...

I was looking for the Open MPI validation tests. Are you still using Jenkins to test merge requests? I found this:
http://bgate.mellanox.com/jenkins/view/all/

but when I look into the details of a build/test run, I only see a few tests that last 13 seconds:

http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/10301/console

Am I looking in the right place?

Is it possible or necessary to have a build with "--enable-debug" to catch my issue?
If it costs only 13 s, it would be worth the trouble!!! :)
For us, all the tests (3341) take about 1-2 hours...

@ericch1 ericch1 changed the title from "ompi/v4.0.x bug since August 19: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135" to "ompi/v4.0.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135" Aug 29, 2019
@ericch1
Author

ericch1 commented Aug 29, 2019

Hi,
I was testing the commits when I saw that I had misidentified the problem: it occurred on August 21 with commit 390e0bc. The preceding merge, d3587f5, is good.

Sorry for this misleading error!!! :/

@bosilca
Member

bosilca commented Aug 29, 2019

To confirm whether this is a problem with the merge or with the datatype engine itself, can you run your test with master?

@ericch1
Author

ericch1 commented Aug 29, 2019

Ok, I will test master this morning. Also, I have to look at how to launch your test suite with my configuration (particularly --enable-debug).

@ericch1
Author

ericch1 commented Aug 29, 2019

Ok, which master/SHAs should I test? Maybe in this order:

@ericch1
Author

ericch1 commented Aug 29, 2019

Ok, I just tested 390e0bc without "--enable-debug" and the problem is still there, but I have less information on stderr:

[dockercentos7:17688] *** Process received signal ***
[dockercentos7:17688] Signal: Aborted (6)
[dockercentos7:17688] Signal code:  (-6)
[dockercentos7:17688] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f14f92a35d0]
[dockercentos7:17688] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f14f82c8207]
[dockercentos7:17688] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f14f82c98f8]
[dockercentos7:17688] [ 3] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_Z15attacheDebuggerv+0x2c5e)[0x41a3ee]
[dockercentos7:17688] [ 4] /home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x2bd0)[0x7f150695a7e0]
[dockercentos7:17688] [ 5] /lib64/libc.so.6(+0x36280)[0x7f14f82c8280]
[dockercentos7:17688] [ 6] /lib64/libc.so.6(__sched_yield+0x7)[0x7f14f8374d47]
[dockercentos7:17688] [ 7] /opt/openmpi-4.x_debug/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x7f14f6f72dc5]
[dockercentos7:17688] [ 8] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_request_default_wait+0x1f0)[0x7f14f9e7cb40]
[dockercentos7:17688] [ 9] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xc9)[0x7f14f9ecc789]
[dockercentos7:17688] [10] /opt/openmpi-4.x_debug/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_recursivedoubling+0x296)[0x7f14f9ecd016]
[dockercentos7:17688] [11] /opt/openmpi-4.x_debug/lib/libmpi.so.40(PMPI_Allreduce+0x183)[0x7f14f9e8f8a3]

I will now launch the tests with branch master 8f32a59 to verify it is ok, then I will launch tests against master 94f26f5.

Also, is https://github.com/open-mpi/mtt the suite I could launch to test my local Open MPI installation more deeply? (I have never used it before.)

Thanks,
Eric

@ericch1
Author

ericch1 commented Aug 29, 2019

Looks like 8f32a59 is ok. All failing tests passed (the remaining tests are running).
I am about to launch tests with 94f26f5.

@bosilca
Member

bosilca commented Aug 29, 2019

Looking at your output, it seems to me that the datatype representation could have been optimized further, so that the optimized description looks like

Optimized description 
-cC---P-DB-[---][---]      OPAL_INT1 count 1 disp 0x0 (0) blen 17 extent 17 (size 17)
-------G---[---][---]    OPAL_LOOP_E prev 1 elements first elem displacement 0 size of data 17

instead of

Optimized description 
-cC---P-DB-[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

I created #6945 to address this optimization issue, but I don't think it fixes anything else. If you can give it a try let me know the outcome.

Also, do you have a reproducer for your test case ?

@ericch1
Author

ericch1 commented Aug 30, 2019

Ok, the sha 94f26f5 is bad.

Here is the stderr:

[dockercentos7:11303] opal_datatype_unpack.c:135
	Pointer 0x8fd31a8 size 9 is outside [0x8fd0070,0x8fd31a1] for
	base ptr 0x8fd0070 count 525 and data 
[dockercentos7:11303] Datatype 0x8ebac70[] size 17 align 8 id 0 length 4 used 3
true_lb 0 true_ub 17 (true_extent 17) lb 0 ub 24 (extent 24)
nbElems 3 loops 0 flags 114 (committed contiguous )-cC----GD--[---][---]
   contain OPAL_INT8:* OPAL_BOOL:* 
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_INT8 count 1 disp 0x8 (8) blen 1 extent 8 (size 8)
--C---P-D--[---][---]      OPAL_BOOL count 1 disp 0x10 (16) blen 1 extent 1 (size 1)
-------G---[---][---]    OPAL_LOOP_E prev 3 elements first elem displacement 0 size of data 17
Optimized description 
-cC---P-DB-[---][---]      OPAL_INT8 count 1 disp 0x0 (0) blen 1 extent 8 (size 8)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1 disp 0x8 (8) blen 9 extent 9 (size 9)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 17

*** Error in `/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt': corrupted size vs. prev_size: 0x0000000008fd31a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f5d4)[0x7f7399f095d4]
/lib64/libc.so.6(+0x816cb)[0x7f7399f0b6cb]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN17PAScatterMultipleISt4pairIlS0_IlbEEED1Ev+0x3d)[0x7f73aa7befed]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN11PAPartitionI52PATraitStockagePartitionConteneurDichotomiqueVecteurI10PtrPorteurI6SommetS2_EEE12lecturePriveER18PAPRFichierLecturel+0xe1d)[0x7f73aa7e166d]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_MaillageUtil.so(_ZN27LectureConnectiviteMaillage15lectureMaillageER18PAPRFichierLectureR8Maillage+0x592)[0x7f73aa7b98b2]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Maillage.so(_ZN8Maillage24importeParalleleVersion1ERKSsRK17PAGroupeProcessusl+0xcf0)[0x7f73aac912e0]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Maillage.so(_ZN8Maillage16importeParalleleERKSsRK17PAGroupeProcessusl+0xd90)[0x7f73aac93610]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Contact.so(_ZN17CorpsAvecMaillage28lisDonneesDeBaseAvecMaillageERKSsRSsR20GestionFichierChampsRPS3_R24ListeEntitesGeometriquesRPS7_R9GeometrieRPSB_S1_RP17EntiteGeometriqueRSF_b+0x1f0)[0x7f73a6f10d60]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/GIREF/lib/libgiref_opt_Contact.so(_ZN7CorpsEF16lisDonneesDeBaseERKSs+0xf0)[0x7f73a6f60070]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_ZN17CollectionDeCorps10lisUnCorpsISsEE18SYEnveloppeMessageISsERKT_bbPP5Corps+0x8f2)[0x433e12]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt(_ZN17CollectionDeCorps16lisDonneesDeBaseIN9__gnu_cxx17__normal_iteratorIPSsSt6vectorISsSaISsEEEEEE18SYEnveloppeMessageISsET_SA_bb+0xdf)[0x43434f]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt[0x414a5b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f7399eac3d5]
/home/cmpbib/compilation_BIB_docker/COMPILE_AUTO/BIB/bin/Test.BIBProblemeGD.opt[0x4159af]

I will test ebe7ed6 tomorrow.

@jsquyres
Member

@gpaulsen @hppritcha I think you guys should evaluate the severity of this issue for the upcoming release.

@ericch1
Author

ericch1 commented Aug 30, 2019

I have good news! The patch in commit ebe7ed6 fixes all failing tests and all our other tests are 100% successful! :)
Hope this is incorporated into v4.0.x before next release!

Thanks,
Eric

http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.30.09h45m24s_config.log
http://www.giref.ulaval.ca/~cmpgiref/ompi_4.x/2019.08.30.09h45m24s_ompi_info_all.txt

@gpaulsen gpaulsen assigned bosilca and unassigned bosilca Aug 30, 2019
@hppritcha hppritcha removed this from the v4.0.2 milestone Aug 30, 2019
@bosilca
Member

bosilca commented Aug 31, 2019

The fix (and a tester to prevent it from happening in the future) is pending; it can be merged as soon as Jenkins is happy. The patch should be easy to backport to the stable branches (but I do not have time before next week).

jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 3, 2019
This patch fixes the merge of contiguous elements into larger but more
compact datatypes, and allows for contiguous elements to have their
blocklen increasing instead of the count. The idea is to always maximize
the blocklen, aka. the contiguous part of the datatype.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit 41e6f55)

Addendum to original cherry-pick commit:

This is a cherry-pick from master to the v3.1.x branch, which required
some conflict resolution.  This commit was shown to fix
open-mpi#6932.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres jsquyres assigned jsquyres and bosilca and unassigned jsquyres Sep 3, 2019
@jsquyres
Member

jsquyres commented Sep 3, 2019

We got a PR for v4.0.x (#6952).

The fix does not apply directly to v3.1.x / v3.0.x, though -- @bosilca is going to have a look.

@ericch1
Author

ericch1 commented Sep 10, 2019

ompi/v4.0.x is 100% functional for us again this morning, thanks a lot!

@gpaulsen
Member

gpaulsen commented Mar 4, 2020

This was fixed in v4.0.2 on the v4.0.x stream.

@jsquyres, @bwbarrett are you still working with @bosilca on a similar PR for the v3.1.x or v3.0.x streams? If not, please close this issue.

@gpaulsen gpaulsen changed the title from "ompi/v4.0.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135" to "ompi/v3.x.x bug since August 21: opal_datatype_pack.c:203 and opal_datatype_unpack.c:135" Mar 9, 2020
@jsquyres
Member

jsquyres commented Apr 3, 2020

It looks like it was too difficult to port back to v3.x -- the official guidance is that the fix is in the v4.0.x series.

@jsquyres jsquyres closed this as completed Apr 3, 2020