Topic/datatype #3441

bosilca · 2017-05-02T06:00:46Z

This started as a fix for #3439 but then evolved into a complete redesign of the handling of some datatype and convertor internal structures (mainly the array of counts for predefined types).

bosilca · 2017-05-02T06:02:01Z

@ggouaillardet can you try this in a heterogeneous environment?

ibm-ompi · 2017-05-02T06:13:04Z

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/805cc76376493b5d175dca9ac64833a2

ibm-ompi · 2017-05-02T06:15:48Z

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d227a5018dfdd6cb9d3336835b9338bd

ibm-ompi · 2017-05-02T06:22:16Z

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/30256b2e57e9a3a40fc8ef555877a110

ggouaillardet · 2017-05-02T07:44:17Z

@bosilca this PR builds, but all tests crash.
i will be off until Monday, and i am unlikely to have a chance to review this before then.
once this is fixed, do you plan to backport it to the release branches? or make a simpler fix just for the release branches?

bosilca · 2017-05-02T14:54:43Z

I haven't checked yet, but I think this patch would allow us to drastically reduce the size of the predefined datatypes. If we implement the reduction, we will change the ABI, so this might only be good for 3.0. For the others I will take a look at having a slimmed down version.

ibm-ompi · 2017-05-03T03:19:24Z

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/2fce5a06ccb4313d246c882baaeb5239

ibm-ompi · 2017-05-03T03:19:47Z

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/90bfae8e90e611662bc9637eb7aca130

ibm-ompi · 2017-05-03T03:26:47Z

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/cc4639d8b917a0e8d3bac6fd2ed0dbd2

ibm-ompi · 2017-05-03T19:51:01Z

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6fd8c492a06afe924bcdeedd43ed6d5c

ibm-ompi · 2017-05-03T19:53:57Z

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/f4a14ddc5fca9a636699d7d67d76ce26

ibm-ompi · 2017-05-03T19:58:07Z

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/4d387051d59b342b93d537dea980d4c3

bosilca · 2017-05-04T05:05:30Z

It turns out that in addition to being a bug fix, this patch also provides a performance boost. We have shaved a few nsecs off our shared memory latency, from 95.450 nsecs to 91.865 nsecs on an E5-2650 v3 @ 2.30GHz.

ggouaillardet

datatype/getel and datatype/darray-pack from the ibm test suite fail. possible fixes are in the inline comments

ggouaillardet · 2017-05-08T04:58:53Z

ompi/datatype/ompi_datatype_create_darray.c

@@ -216,9 +216,14 @@ int32_t ompi_datatype_create_darray(int size,
    }

    /* Build up array */
+    displs[0] = st_offsets[start_loop];


did you really mean to commit changes to this file?
before entering the loop, the st_offsets array is uninitialized, and this commit causes a failure in the datatype/darray-pack test from the ibm test suite

bosilca · 2017-05-08

This was not supposed to get pushed. It is from another patch, but it is wrong and incomplete. I will remove it.

ggouaillardet · 2017-05-08T05:02:37Z

opal/datatype/opal_datatype_get_count.c

+ * when we use get_element_count). Thus, we will pay the cost once per
+ * datatype, but we will only update this array if/when needed.
+ */
+int opal_datatype_compute_ptypes( opal_datatype_t* datatype )


the datatype/getel test from the ibm test suite loops forever.

i got things working with the patch below.
opal_datatype_get_element_count might have to be updated as well, though i could not write a program that causes the same endless loop

diff --git a/opal/datatype/opal_datatype_get_count.c b/opal/datatype/opal_datatype_get_count.c
index a860d5f..c7c87ea 100644
--- a/opal/datatype/opal_datatype_get_count.c
+++ b/opal/datatype/opal_datatype_get_count.c
@@ -169,10 +169,13 @@ int opal_datatype_compute_ptypes( opal_datatype_t* datatype )
     while( 1 ) {  /* loop forever the exit condition is on the last OPAL_DATATYPE_END_LOOP */
         if( OPAL_DATATYPE_END_LOOP == pElems[pos_desc].elem.common.type ) { /* end of the current loop */
             if( --(pStack->count) == 0 ) { /* end of loop */
-                stack_pos--; pStack--;
-                if( stack_pos == -1 ) return 0;  /* completed */
+                if( stack_pos == 0 ) return 0;  /* completed */
+                stack_pos--;
+                pStack--;
+                pos_desc++;
+            } else {
+                pos_desc = pStack->index + 1;
             }
-            pos_desc = pStack->index + 1;
             continue;
         }
         if( OPAL_DATATYPE_LOOP == pElems[pos_desc].elem.common.type ) {

bosilca · 2017-05-09

Thanks for noticing. The same pattern appears in several locations; I fixed them all. Please review the updated version.

ggouaillardet · 2017-05-09T07:22:08Z

@bosilca this looks good to me
i also tried in a heterogeneous environment and could not find any issue

jsquyres · 2017-05-09T10:35:51Z

bot:ompi:retest

jsquyres · 2017-05-09T15:32:47Z

Note that per discussions, this PR caused master to become ABI-incompatible with the v3.x (which will shortly be renamed to be v3.0.x) branch. That's not a problem -- it just means that the next release series will be v4.0.x (vs. v3.1.x).

bosilca · 2017-05-09T18:57:27Z

We have not yet released 3.0. Why not include this change before the official release? This PR is not only a performance fix; it also addresses a real issue (identified by Gilles in #3439).

jsquyres · 2017-05-09T20:10:26Z

@bwbarrett @hppritcha What say you for v3.0.0?

bwbarrett · 2017-05-09T20:15:58Z

I'm not happy about a big DDT patch 2 weeks before we do our first RC. Would have been better to fix the bug without all the other work, but I suppose I'm not going to have a choice, am I?

bosilca · 2017-05-09T20:40:43Z

I like your pragmatism ;) In defense of the patch, it looks big because it changes all the accesses to a datatype field (from ".btypes" into "->ptypes"), but the logical changes are rather small.

Technically it is possible to make a patch for 3.x that does not have the dynamic ptypes array. This will address the problem Gilles found, but: 1) the code will diverge between 3.x and master (and additional issues might be difficult to fix); 2) it will also require 47 * sizeof(size_t) extra bytes per datatype (including the predefined ones); and 3) the homogeneous case will not be as streamlined as with this patch. Your call.

hppritcha · 2017-05-10T15:54:30Z

I tried to apply this patch to v3.x and it doesn't patch cleanly.

pn1249323:~/ompi (v3.x)$ git am patch_file
Applying: Don't overflow the internal datatype count. Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI).
error: patch failed: opal/datatype/opal_datatype_internal.h:155
error: opal/datatype/opal_datatype_internal.h: patch does not apply
error: patch failed: opal/datatype/opal_datatype_optimize.c:50
error: opal/datatype/opal_datatype_optimize.c: patch does not apply
Patch failed at 0001 Don't overflow the internal datatype count. Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI).
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
pn1249323:~/ompi (v3.x|AM 1/6)$ git am --abort
pn1249323:~/ompi (v3.x)$ git show HEAD

* Don't overflow the internal datatype count.
  Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI).
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Optimize the datatype creation.
  The internal array of counts of predefined types is now only created when needed, which is either in a heterogeneous environment, or when one call get_elements. It saves space and makes the convertor creation a little faster in some cases.
  Rearrange the fields in the datatype description structs.
  The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the static array was only partially created. All predefined types should have the ptypes array created and initialized.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Fix the boundary computation.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* test/datatype: add test for short unpack on heteregeneous cluster
  Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Trying to reduce the cost of creating a convertor.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Respect the unpack boundaries.
  As Gilles suggested on open-mpi#2535 the opal_unpack_general_function was unpacking based on the requested count and not on the amount of packed data provided.
  Fixes open-mpi#2535.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

bosilca · 2017-05-10T16:14:51Z

Indeed, we didn't make the change from OPAL_PTRDIFF_TYPE to ptrdiff_t in master. I have prepared a PR for you: #3504

* Don't overflow the internal datatype count.
  Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI).
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Optimize the datatype creation.
  The internal array of counts of predefined types is now only created when needed, which is either in a heterogeneous environment, or when one call get_elements. It saves space and makes the convertor creation a little faster in some cases.
  Rearrange the fields in the datatype description structs.
  The macro OPAL_DATATYPE_INIT_PTYPES_ARRAY had a bug, and the static array was only partially created. All predefined types should have the ptypes array created and initialized.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Fix the boundary computation.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* test/datatype: add test for short unpack on heteregeneous cluster
  Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Trying to reduce the cost of creating a convertor.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

* Respect the unpack boundaries.
  As Gilles suggested on open-mpi#2535 the opal_unpack_general_function was unpacking based on the requested count and not on the amount of packed data provided.
  Fixes open-mpi#2535.
  Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@cbf03b3)

bosilca requested a review from ggouaillardet May 2, 2017 06:00

bosilca added bug enhancement labels May 2, 2017

bosilca added this to the v2.1.2 milestone May 2, 2017

bosilca force-pushed the topic/datatype branch from 95c6e8f to c8dd1c9 Compare May 3, 2017 03:05

rhc54 mentioned this pull request May 3, 2017

Track RTE-related commits that need to go to v3.0 #3289

Closed

27 tasks

bosilca force-pushed the topic/datatype branch from c8dd1c9 to 6f17464 Compare May 3, 2017 19:37

bosilca force-pushed the topic/datatype branch from 6f17464 to e28f647 Compare May 4, 2017 04:42

bosilca force-pushed the topic/datatype branch from 0761ed7 to 56ac2ec Compare May 5, 2017 03:28

bosilca mentioned this pull request May 5, 2017

opal_unpack_general_function might upack too much data #2535

Closed

ggouaillardet requested changes May 8, 2017

bosilca added 6 commits May 8, 2017 23:31

Don't overflow the internal datatype count.

307ff6b

Change the type of the count to be a size_t (it does not alter the total size of the internal structures, so has no impact on the ABI). Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Fix the boundary computation.

d946268

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

test/datatype: add test for short unpack on heteregeneous cluster

a5e08f5

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Trying to reduce the cost of creating a convertor.

1022817

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Respect the unpack boundaries.

b3cc530

As Gilles suggested on open-mpi#2535 the opal_unpack_general_function was unpacking based on the requested count and not on the amount of packed data provided. Fixes open-mpi#2535. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

bosilca force-pushed the topic/datatype branch from 9e836d4 to b3cc530 Compare May 9, 2017 03:36

ggouaillardet approved these changes May 9, 2017

bosilca merged commit cbf03b3 into open-mpi:master May 9, 2017

bosilca deleted the topic/datatype branch May 9, 2017 13:31

jsquyres removed this from the v2.1.2 milestone May 9, 2017
