Graph: Deterministic coloring #249

lucbv · 2018-06-04T17:21:59Z

@srajama1
Adding deterministic graph coloring based on a dependency list.
This is a simple version that parallelizes using range policies and uses only one atomic, to update newFrontierSize.
As a next step, I plan to switch bannedColors to a bit array, which will be more compact in memory.
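For readers unfamiliar with the approach, here is a minimal serial sketch of coloring driven by a dependency list. It is not the code in this PR: the `CsrGraph` type, the `hasPriority` rule (higher degree first, ties broken by lower vertex id), and the function names are all assumptions made for illustration. The idea it shows is that a vertex enters the frontier only once every higher-priority neighbor has been colored, so the result does not depend on the order in which frontier vertices are processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR graph: neighbors of v are adj[xadj[v]] .. adj[xadj[v+1]-1].
struct CsrGraph {
  std::vector<std::size_t> xadj;
  std::vector<std::size_t> adj;
  std::size_t numVerts() const { return xadj.size() - 1; }
};

// Assumed priority rule: higher degree first, ties broken by lower vertex id.
inline bool hasPriority(const CsrGraph& g, std::size_t u, std::size_t v) {
  const std::size_t du = g.xadj[u + 1] - g.xadj[u];
  const std::size_t dv = g.xadj[v + 1] - g.xadj[v];
  return du > dv || (du == dv && u < v);
}

// Returns colors in 1..maxDegree+1; 0 is reserved for "uncolored".
std::vector<int> colorDeterministic(const CsrGraph& g) {
  const std::size_t nv = g.numVerts();
  std::vector<int> colors(nv, 0);
  std::vector<int> deps(nv, 0);
  std::vector<std::size_t> frontier, newFrontier;
  std::size_t maxDegree = 0;

  // Dependency count = number of neighbors that must be colored first.
  for (std::size_t v = 0; v < nv; ++v) {
    maxDegree = std::max(maxDegree, g.xadj[v + 1] - g.xadj[v]);
    for (std::size_t e = g.xadj[v]; e < g.xadj[v + 1]; ++e)
      if (hasPriority(g, g.adj[e], v)) ++deps[v];
    if (deps[v] == 0) frontier.push_back(v);
  }

  // A greedy coloring never needs more than maxDegree + 1 colors.
  std::vector<char> banned(maxDegree + 2);
  while (!frontier.empty()) {
    newFrontier.clear();
    // Frontier vertices are never adjacent, so this loop could run in parallel.
    for (std::size_t v : frontier) {
      std::fill(banned.begin(), banned.end(), 0);
      for (std::size_t e = g.xadj[v]; e < g.xadj[v + 1]; ++e)
        banned[colors[g.adj[e]]] = 1;
      int c = 1;
      while (banned[c]) ++c;  // smallest color not used by a neighbor
      colors[v] = c;
      // Release lower-priority neighbors that were waiting on v.
      for (std::size_t e = g.xadj[v]; e < g.xadj[v + 1]; ++e) {
        const std::size_t u = g.adj[e];
        if (hasPriority(g, v, u) && --deps[u] == 0) newFrontier.push_back(u);
      }
    }
    frontier.swap(newFrontier);
  }
  return colors;
}
```

In a parallel version the `--deps[u]` and the append to the new frontier would need atomics, which is presumably where the single atomic on newFrontierSize mentioned above comes in.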

srajama1 · 2018-06-06T22:15:25Z

perf_test/graph/KokkosGraph_color.cpp

@@ -181,6 +184,8 @@ void run_experiment(
    kh.set_verbose(true);
  }

+  std::cout << "algorithm: " << algorithm << std::endl;


Assume this is temporary.

Never mind previous comment, this is in testing

I added guards, so these outputs now have to be requested explicitly. It might actually make sense to have a dedicated debugging flag for that level of output detail?

srajama1 · 2018-06-06T22:23:57Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+
+  bool _ticToc; //if true print info in each step
+  int _chunkSize; //the size of the minimum work unit assigned to threads. Changes the convergence on GPUs
+  char _use_color_set; //the VBD algorithm type.


I thought this sort of thing would be in the handle.

For some reason it is being kept separate in the other GraphColor classes, so I am using the same pattern. I could look into why/if this design is appropriate?

srajama1 · 2018-06-06T22:33:11Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+
+  /** \brief Function to color the vertices of the graphs. Performs a vertex-based coloring.
+   * \param colors is the output array corresponding the color of each vertex. Size is this->nv.
+   *   Attn: Color array must be nonnegative numbers. If there is no initial colors,


srajama1 · 2018-06-06T22:35:42Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+   *   Attn: Color array must be nonnegative numbers. If there is no initial colors,
+   *   it should be all initialized with zeros. Any positive value in the given array, will make the
+   *   algorithm to assume that the color is fixed for the corresponding vertex.
+   * \param num_phases: The number of iterations (phases) that algorithm takes to converge.


srajama1 · 2018-06-06T22:44:33Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+                           KOKKOS_LAMBDA(const size_type frontierIdx) {
+
+                             size_type frontierNode = frontier(frontierIdx);
+                             int* bannedColors = new int[maxColors];


This new seems to be trouble.

srajama1 · 2018-06-06T22:49:37Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+      update = ( (valueType) score_(i) < update ? update : (valueType) score_(i) );
+    }
+  }; // functorScoreCalcution()
+


This looks like a good baseline to start from and optimize (except the new for bannedColors above).

lucbv · 2018-06-07T16:57:19Z

@srajama1 let me know what you think of this new cut. I found a bug and fixed it, and I also added a bit array variant for bannedColors.
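As an illustration of what a bit array for banned colors can look like, here is a minimal sketch that packs up to 64 banned colors into one `std::uint64_t`. The function names are hypothetical, and the sketch makes no attempt to handle graphs that need more than 64 colors, which the actual variant in this PR may well treat differently.

```cpp
#include <cassert>
#include <cstdint>

// Mark a 1-based color (1..64) as taken by a neighbor.
// Color 0 means "uncolored" and is ignored, as is anything above 64.
inline std::uint64_t banColor(std::uint64_t banned, int color) {
  if (color < 1 || color > 64) return banned;
  return banned | (std::uint64_t{1} << (color - 1));
}

// Smallest color in 1..64 that is not banned, or 0 if all 64 are banned.
inline int smallestFreeColor(std::uint64_t banned) {
  std::uint64_t freeBits = ~banned;
  if (freeBits == 0) return 0;
  int color = 1;
  while ((freeBits & 1u) == 0) {
    freeBits >>= 1;
    ++color;
  }
  return color;
}
```

Compared with an `int` array of length maxColors, the mask lives in a register and needs no per-thread allocation.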

srajama1

Can we also add a unit test and run on bowman and white ?

srajama1 · 2018-06-07T20:34:51Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+
+  template <class score_type, class max_type, class execution_space>
+  struct functorScoreCalcution {
+    typedef typename Kokkos::Experimental::Max<max_type, execution_space>::value_type valueType;


@crtrott : Why is this Max still experimental ? I assume we can use it.

srajama1 · 2018-06-07T21:16:02Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

@@ -157,7 +159,7 @@ class GraphColor {


    //create a ban color array to keep track of
-    //which colors have been taking by the neighbor vertices.
+    //which colors have been taken by the neighbor vertices.


Does this banned_color need to be deleted?

srajama1 · 2018-06-18T22:41:32Z

src/batched/KokkosBatched_Util.hpp

@@ -3,7 +3,7 @@

 /// \author Kyungjoo Kim (kyukim@sandia.gov)

-#define __KOKKOSBATCHED_PROMOTION__ 1
+//#define __KOKKOSBATCHED_PROMOTION__ 1


These changes are already in master, right ? How will this merge work ? I am just asking as I don't know.

I think that some commits were put in my branch by mistake when I rebased my code to a more recent version of develop.
I will investigate how to clean this.

I also have a problem running the unit tests with Cuda and Serial instantiated at the same time. The Cuda unit tests run fine on their own, and OpenMP+Serial runs fine too, but the combinations Cuda+OpenMP and Cuda+Serial crash...

For some reason the crash does not happen in the perf-test, though?

lucbv · 2018-06-18T23:38:26Z

@srajama1, I removed the rogue commit from history, it should look better now.

srajama1

Most of my comments are related to performance; one is about allocating arrays. I am OK with this as long as they are addressed before the next master promotion, so I am approving this PR so it can be merged.

srajama1 · 2018-06-19T00:04:44Z

src/blas/impl/KokkosBlas1_team_nrm2_spec.hpp

@@ -76,7 +76,7 @@ struct TeamNrm2<TeamType, XV, false> {
  typedef Kokkos::Details::ArithTraits<typename IPT::mag_type>   AT;

  static KOKKOS_INLINE_FUNCTION mag_type team_nrm2 (const TeamType& team, const XV& X) {
-    mag_type result;
+    mag_type result = 0.0; //Kokkos::Details::ArithTraits<mag_type>zero();


Is this a TBD later for using ArithTraits ?

I had issues compiling the code when result was not initialized. I tried using ArithTraits, as it seemed to be the cleanest option, but the compiler complained about it, so I left it as a comment in order to look at it again later.

Ok, Thanks ! Just a TBD that needs clean up in the next round.
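One thing worth checking during that clean-up: the commented-out expression in the diff has no `::` between `ArithTraits<mag_type>` and `zero()`, which may be what the compiler rejected (the thread does not show the actual error, so this is a guess). The sketch below uses a hypothetical local stand-in, `MagTraits`, so that it compiles without Kokkos; in KokkosKernels the spelling would presumably be `Kokkos::Details::ArithTraits<mag_type>::zero()`.

```cpp
#include <cassert>

// Hypothetical stand-in for Kokkos::Details::ArithTraits, kept local so the
// sketch compiles without Kokkos.
template <class T>
struct MagTraits {
  static constexpr T zero() { return T(0); }
};

// Sum of squares accumulated from a traits-provided zero instead of 0.0.
template <class mag_type>
mag_type sumOfSquares(const mag_type* x, int n) {
  mag_type result = MagTraits<mag_type>::zero();  // note the "::" before zero()
  for (int i = 0; i < n; ++i) result += x[i] * x[i];
  return result;
}
```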

srajama1 · 2018-06-19T01:02:01Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+                           }
+                         });
+
+    Kokkos::deep_copy(host_newFrontierSize, newFrontierSize);


This is something we need to explore. It would be better to avoid this copy every time and launch another kernel later. Maybe just a TBD for now.

Yes, I think that ultimately the whole while loop probably needs to be pushed on the device but I am not a big fan of writing "fake" parallel_for functions to push code on device although at the moment that is probably the only option?
Short of putting the while loop on the device we need to transfer some data back and forth at each iteration to know if the graph has been explored completely or if there is still a new frontier to explore.

I wish there were a way to return a value, or to do a device-initiated copy back to the host. I am fine with this for now.

srajama1 · 2018-06-19T01:03:04Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+			     frontierSize() = newFrontierSize();
+			     newFrontierSize() = 0;
+			   });
+      Kokkos::deep_copy(host_frontierSize, frontierSize);


Another device-to-host copy; TBD for later.

srajama1 · 2018-06-19T01:05:31Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+                             KOKKOS_LAMBDA(const size_type frontierIdx) {
+
+                               size_type frontierNode = frontier(frontierIdx);
+                               int* bannedColors = new int[maxColors];


I still don't like this array. This is a crash waiting to happen and something that cannot perform well due to global memory access. Can we check the node_type early in the code and issue a warning. Doesn't have to be in this PR, but better to do it before next master promotion.

srajama1 · 2018-06-19T01:17:21Z

If spot check on white and bowman is clean I can click the merge button.

srajama1 · 2018-06-19T15:53:36Z

src/graph/impl/KokkosGraph_GraphColor_impl.hpp

+                               colors(frontierNode) = myColor;
+                             }); // Loop over current frontier
+      }
+      Kokkos::deep_copy(host_newFrontierSize, newFrontierSize);


Is there a fence missing before this? How can we be sure newFrontierSize is updated?

srajama1 · 2018-06-19T15:55:32Z

See my latest comment. Other than that I am willing to merge if spot checks are passing.

lucbv · 2018-06-20T19:25:06Z

@srajama1, here is what I am getting from the spot check on white:

[lberge@white22 spot_check]$ ./test_all_sandia --kokkoskernels-path=/home/lberge/kk_lucbv --kokkos-path=/home/lberge/kokkos --spot-check --with-cuda-options=enable_lambda
Running on machine: white
Going to test compilers:  gcc/5.4.0 ibm/13.1.6 cuda/8.0.44 cuda/9.0.103
Testing compiler gcc/5.4.0
  Starting job gcc-5.4.0-Serial-release
  Starting job gcc-5.4.0-OpenMP-release
  PASSED gcc-5.4.0-OpenMP-release
Testing compiler ibm/13.1.6
  Starting job gcc-5.4.0-OpenMP_Serial-release
  PASSED gcc-5.4.0-Serial-release
  Starting job ibm-13.1.6-OpenMP-release
  PASSED gcc-5.4.0-OpenMP_Serial-release
  Starting job ibm-13.1.6-Serial-release
  PASSED ibm-13.1.6-Serial-release
Testing compiler cuda/8.0.44
  Starting job ibm-13.1.6-OpenMP_Serial-release

The ibm compiler crashes for the two OpenMP builds with the following error, which seems unrelated to what I have been doing in this PR:

/home/projects/pwr8-rhel73-lsf/ibm/xl/xlC/13.1.6/bin/xlC  -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-rel\
ease/install/include -I./ -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/a\
scldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/ascldap/users/lberge/kk_lucbv_b\
uild/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-0\
6-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include/eti -std=c++11 -mcpu=power8 -mtune=power8 -qsmp=omp -I/home/lberge/kokkos/tpls/gtest -I/home/l\
berge/kk_lucbv/unit_test/ -I/home/lberge/kk_lucbv/unit_test/blas -I/home/lberge/kk_lucbv/unit_test/sparse -I/home/lberge/kk_lucbv/unit_test/graph -I/home/lbe\
rge/kk_lucbv/unit_test/../test_common -I/home/lberge/kk_lucbv/unit_test/batched -I/home/lberge/kk_lucbv/unit_test/openmp -O3 -Werror -Wall -Wshadow -pedantic\
 -Wsign-compare -Wtype-limits -Wuninitialized   -c /home/lberge/kk_lucbv/unit_test/openmp/Test_OpenMP_Batched_VectorView.cpp
1586-494 (U) INTERNAL COMPILER ERROR: Signal 11.
Calling signal handler...
/opt/ibm/xlC/13.1.6/bin/.orig/xlC: error: 1501-230 Internal compiler error; please contact your Service Representative. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make[2]: *** [Test_OpenMP_Batched_VectorMisc.o] Error 251
make[2]: *** Waiting for unfinished jobs....

The Cuda build actually never starts, probably because of the xlC crash?
I will start a spot check on bowman, hopefully it will go through more smoothly!

ndellingwood · 2018-06-20T19:32:45Z

@lucbv don't worry about the xlC failure, there is an internal compiler issue with this particular compiler getting tested.
Try the following to test with Cuda:
./test_all_sandia --kokkoskernels-path=/home/lberge/kk_lucbv --kokkos-path=/home/lberge/kokkos --spot-check --with-cuda-options=enable_lambda cuda/8.0.44
Adding the extra compiler argument will force the script to run just that cuda build.

lucbv · 2018-06-20T22:14:13Z

@ndellingwood, thanks, I just got results from spot-check for cuda on white:

Running on machine: white
Going to test compilers:  cuda/8.0.44
Testing compiler cuda/8.0.44
  Starting job cuda-8.0.44-Cuda_OpenMP-release
  PASSED cuda-8.0.44-Cuda_OpenMP-release
  Starting job cuda-8.0.44-Cuda_Serial-release
  PASSED cuda-8.0.44-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-8.0.44-Cuda_OpenMP-release build_time=877 run_time=612
cuda-8.0.44-Cuda_Serial-release build_time=878 run_time=992
#######################################################
FAILED TESTS
#######################################################

still waiting for bowman to complete and that should be it...

lucbv · 2018-06-21T14:14:03Z

@srajama1 @ndellingwood here are my results from bowman:

#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1342 run_time=2234
intel-16.4.258-Pthread_Serial-release build_time=1909 run_time=4551
intel-16.4.258-Serial-release build_time=1274 run_time=2237
intel-17.2.174-OpenMP-release build_time=1586 run_time=812
intel-17.2.174-OpenMP_Serial-release build_time=1992 run_time=3108
intel-17.2.174-Pthread-release build_time=1166 run_time=2194
intel-17.2.174-Pthread_Serial-release build_time=1738 run_time=4417
intel-17.2.174-Serial-release build_time=1168 run_time=2248
intel-18.0.128-Pthread-release build_time=1071 run_time=2171
#######################################################
FAILED TESTS
#######################################################
intel-18.0.128-OpenMP-release (test failed)
intel-18.0.128-OpenMP_Serial-release (test failed)
intel-18.0.128-Pthread_Serial-release (test failed)
intel-18.0.128-Serial-release (test failed)

Here are the tests that failed during the spot-check sorted by build:

intel-18.0.128-OpenMP-release
[==========] 332 tests from 1 test case ran. (789436 ms total)
[  PASSED  ] 326 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_int_TestExecSpace
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_size_t_TestExecSpace
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-Serial-release
[==========] 332 tests from 1 test case ran. (2188996 ms total)
[  PASSED  ] 328 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-OpenMP_Serial-release
[==========] 332 tests from 1 test case ran. (849489 ms total)
[  PASSED  ] 326 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_int_TestExecSpace
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_size_t_TestExecSpace
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-Pthread_Serial-release
[==========] 332 tests from 1 test case ran. (2016547 ms total)
[  PASSED  ] 328 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

srajama1 · 2018-06-21T15:20:13Z

This is weird. I don't know why we don't see them in our jenkins jobs and why we don't see them with Intel 18. Anyway, this is not due to your PR. If you are OK with it, I can merge.

Can you file one issue for batched trsm (@kyungjoo-kim) and another one for replaceSumInto (@crtrott)?

kyungjoo-kim · 2018-06-21T15:30:58Z

@lucbv You did not update your personal repo. Merge your repo with the develop branch then test it again.

lucbv · 2018-06-21T15:56:37Z

@kyungjoo-kim sure, I can fetch from kokkos-kernels and rebase on develop. Did you already fix the errors I have been seeing in my spot-check?

I am adding a new version of the algorithm called VBDBIT, which uses a 64-bit bit array to store the banned colors. I also found a bug related to decrementing the dependency list: the decrement is now done atomically, and it is guarded more rigorously so that it happens only when necessary.
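To make the atomic decrement concrete, here is a small sketch using `std::atomic` rather than Kokkos atomics. The function name, its signature, and the pre-sized `newFrontier` buffer are assumptions for illustration, not the code in this PR. The caller is expected to invoke it only for neighbors that actually depend on the vertex just colored, which is one way to read the guard described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Called by the thread that just colored a higher-priority neighbor of u.
// newFrontier must already be sized to hold every vertex.
// Returns true if this call appended u to the new frontier.
inline bool releaseDependency(std::vector<std::atomic<int>>& deps,
                              std::size_t u,
                              std::vector<std::size_t>& newFrontier,
                              std::atomic<std::size_t>& newFrontierSize) {
  // fetch_sub returns the previous value, so exactly one caller observes 1
  // and becomes responsible for adding u to the frontier.
  if (deps[u].fetch_sub(1) != 1) return false;
  // Reserve a unique slot with a single atomic increment.
  const std::size_t slot = newFrontierSize.fetch_add(1);
  newFrontier[slot] = u;
  return true;
}
```

With concurrent callers the order of entries in `newFrontier` can vary from run to run, but the set of vertices does not, and since colors depend only on already-colored neighbors the coloring itself should stay deterministic. In Kokkos the analogous calls would presumably be `Kokkos::atomic_fetch_sub` and `Kokkos::atomic_fetch_add`.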

lucbv · 2018-07-09T17:21:18Z

@srajama1, the code has been reworked to use functors instead of lambdas and is passing the spot check on white and bowman, so you can merge it in at your convenience : )

lucbv self-assigned this Jun 6, 2018

lucbv added InDevelop feature request labels Jun 6, 2018

lucbv requested a review from srajama1 June 6, 2018 19:50

srajama1 reviewed Jun 6, 2018

srajama1 reviewed Jun 7, 2018

srajama1 reviewed Jun 18, 2018

lucbv force-pushed the Deterministic_coloring branch from 1ae61c1 to 3b9575e Compare June 18, 2018 23:37

srajama1 approved these changes Jun 19, 2018

srajama1 reviewed Jun 19, 2018

srajama1 approved these changes Jun 20, 2018

srajama1 mentioned this pull request Jun 21, 2018

Implemented fix to Issue #258 #261

Merged

lucbv force-pushed the Deterministic_coloring branch from 8e9e5df to 4aaf45e Compare June 21, 2018 16:09

lucbv force-pushed the Deterministic_coloring branch from 4aaf45e to eda1653 Compare July 6, 2018 13:19

lucbv added 4 commits July 6, 2018 07:22

Deterministic coloring: first implementation in serial

a709e16

It seems to work OK, will check more tomorrow.

Deterministic coloring: clean up a bit serial implementation

ce32e71

Deterministic coloring: adding parallel calculation and reduction

2d36f21

over score array in order to compute the maximum number of colors in the graph.

Deterministic graph coloring parallel

5b1afbc

lucbv added 9 commits July 6, 2018 07:22

Deterministic coloring: first parallelization attempt using simple ra…

898e3bf

…nge policy

Deterministic coloring: implementing bit array for banned colors

a2eed06

This is a first draft and it is buggy, it seems that some conflicts lead to nodes missing their color. Will need to look at the atomics carefully and also at the bit array logic...

Deterministic coloring: fixes for CUDA

32e822a

Deterministic coloring: adding unit-test

2cb7b3b

Deterministic coloring: fixing parallel_for execution space

d26c37e

Deterministic coloring: small fix to delete array properly

3a48ad0

Deterministic coloring: switching from lambda to functors

eda1653

small fix for -Wshadow error

96ee015

srajama1 merged commit e463b6e into kokkos:develop Jul 18, 2018

srajama1 mentioned this pull request Jul 18, 2018

Deterministic Coloring #271

Closed

lucbv deleted the Deterministic_coloring branch August 13, 2018 22:44

kokkos-devops-admin mentioned this pull request Nov 24, 2021

Add rocBLAS GEMV wrapper #1201

Merged
