Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph: Deterministic coloring #249

Merged
merged 13 commits into from
Jul 18, 2018
Merged

Conversation

lucbv
Copy link
Contributor

@lucbv lucbv commented Jun 4, 2018

@srajama1
Adding deterministic graph coloring based on dependency list.
This is a simple version that parallelizes using range policies and uses only one atomic to update newFrontierSize.
In a next step I am planning on switching bannedColors to use a bit array that will be more compact in memory.

@lucbv lucbv self-assigned this Jun 6, 2018
@lucbv lucbv requested a review from srajama1 June 6, 2018 19:50
@@ -181,6 +184,8 @@ void run_experiment(
kh.set_verbose(true);
}

std::cout << "algorithm: " << algorithm << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume this is temporary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind previous comment, this is in testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added guards and now you need to request these outputs explicitly, it might actually make sense to have some special debugging flag to get that level of output details?


bool _ticToc; //if true print info in each step
int _chunkSize; //the size of the minimum work unit assigned to threads. Changes the convergence on GPUs
char _use_color_set; //the VBD algorithm type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this sort of this would be in the handle ..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason it is being kept separate in the other GraphColor classes, so I am using the same pattern. I could look into why/if this design is appropriate?


/** \brief Function to color the vertices of the graphs. Performs a vertex-based coloring.
* \param colors is the output array corresponding the color of each vertex. Size is this->nv.
* Attn: Color array must be nonnegative numbers. If there is no initial colors,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"are no"

* Attn: Color array must be nonnegative numbers. If there is no initial colors,
* it should be all initialized with zeros. Any positive value in the given array, will make the
* algorithm to assume that the color is fixed for the corresponding vertex.
* \param num_phases: The number of iterations (phases) that algorithm takes to converge.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_loops

KOKKOS_LAMBDA(const size_type frontierIdx) {

size_type frontierNode = frontier(frontierIdx);
int* bannedColors = new int[maxColors];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new seems to be trouble.

update = ( (valueType) score_(i) < update ? update : (valueType) score_(i) );
}
}; // functorScoreCalcution()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears like a good baseline to start with and optimize (except the new for bannedcolors above).

@lucbv
Copy link
Contributor Author

lucbv commented Jun 7, 2018

@srajama1 let me know what you think of this new cut, I found a bug and fixed it, I also added a bit array variant for the bannedColors.

Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a unit test and run on bowman and white ?


template <class score_type, class max_type, class execution_space>
struct functorScoreCalcution {
typedef typename Kokkos::Experimental::Max<max_type, execution_space>::value_type valueType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crtrott : Why is this Max still experimental ? I assume we can use it.

@@ -157,7 +159,7 @@ class GraphColor {


//create a ban color array to keep track of
//which colors have been taking by the neighbor vertices.
//which colors have been taken by the neighbor vertices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This banned_color need to be deleted ?

@@ -3,7 +3,7 @@

/// \author Kyungjoo Kim (kyukim@sandia.gov)

#define __KOKKOSBATCHED_PROMOTION__ 1
//#define __KOKKOSBATCHED_PROMOTION__ 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are already in master, right ? How will this merge work ? I am just asking as I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that some commits were put in my branch by mistake when I rebased my code to a more recent version of develop.
I will investigate how to clean this.

I also have a problem with running the unit-test with Cuda and Serial instantiated at the same time. The Cuda unit-tests run fine on there own and OpenMP+Serial runs fine too but combinations of Cuda+OpenMP or Cuda+Serial are crashing...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crash do not happen in the perf-test though for some reason?

@lucbv lucbv force-pushed the Deterministic_coloring branch from 1ae61c1 to 3b9575e Compare June 18, 2018 23:37
@lucbv
Copy link
Contributor Author

lucbv commented Jun 18, 2018

@srajama1, I removed the rogue commit from history, it should look better now.

Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments are related to performance. There is one related to allocating arrays. I am ok as long as these can be addressed before next master promotion so approving this PR so it can be merged.

@@ -76,7 +76,7 @@ struct TeamNrm2<TeamType, XV, false> {
typedef Kokkos::Details::ArithTraits<typename IPT::mag_type> AT;

static KOKKOS_INLINE_FUNCTION mag_type team_nrm2 (const TeamType& team, const XV& X) {
mag_type result;
mag_type result = 0.0; //Kokkos::Details::ArithTraits<mag_type>zero();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TBD later for using ArithTraits ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had issues compiling the code when result is not initialized, I attempted using the ArithTraits as it seemed to be the cleanest option but the compiler complained about it so I left it as a comment in order to look at it again later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Thanks ! Just a TBD that needs clean up in the next round.

}
});

Kokkos::deep_copy(host_newFrontierSize, newFrontierSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we need to explore. It would be better to avoid this copy every time and launch another kernel later. May be just a TBD for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that ultimately the whole while loop probably needs to be pushed on the device but I am not a big fan of writing "fake" parallel_for functions to push code on device although at the moment that is probably the only option?
Short of putting the while loop on the device we need to transfer some data back and forth at each iteration to know if the graph has been explored completely or if there is still a new frontier to explore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there is a way to return a value or a device initiated copy back to host. I am fine with this for now.

frontierSize() = newFrontierSize();
newFrontierSize() = 0;
});
Kokkos::deep_copy(host_frontierSize, frontierSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another host to device copy, TBD for later

KOKKOS_LAMBDA(const size_type frontierIdx) {

size_type frontierNode = frontier(frontierIdx);
int* bannedColors = new int[maxColors];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like this array. This is a crash waiting to happen and something that cannot perform well due to global memory access. Can we check the node_type early in the code and issue a warning. Doesn't have to be in this PR, but better to do it before next master promotion.

@srajama1
Copy link
Contributor

If spot check on white and bowman is clean I can click the merge button.

colors(frontierNode) = myColor;
}); // Loop over current frontier
}
Kokkos::deep_copy(host_newFrontierSize, newFrontierSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a fence missing here before ? How are we sure newFrontierSize is updated ?

Copy link
Contributor

@srajama1 srajama1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a fence missing here before ? How are we sure newFrontierSize is updated ?

@srajama1
Copy link
Contributor

See my latest comment. Other than that I am willing to merge if spot checks are passing.

@lucbv
Copy link
Contributor Author

lucbv commented Jun 20, 2018

@srajama1, here is what I am getting from the spot check on white:

[lberge@white22 spot_check]$ ./test_all_sandia --kokkoskernels-path=/home/lberge/kk_lucbv --kokkos-path=/home/lberge/kokkos --spot-check --with-cuda-options=enable_lambda
Running on machine: white
Going to test compilers:  gcc/5.4.0 ibm/13.1.6 cuda/8.0.44 cuda/9.0.103
Testing compiler gcc/5.4.0
  Starting job gcc-5.4.0-Serial-release
  Starting job gcc-5.4.0-OpenMP-release
  PASSED gcc-5.4.0-OpenMP-release
Testing compiler ibm/13.1.6
  Starting job gcc-5.4.0-OpenMP_Serial-release
  PASSED gcc-5.4.0-Serial-release
  Starting job ibm-13.1.6-OpenMP-release
  PASSED gcc-5.4.0-OpenMP_Serial-release
  Starting job ibm-13.1.6-Serial-release
  PASSED ibm-13.1.6-Serial-release
Testing compiler cuda/8.0.44
  Starting job ibm-13.1.6-OpenMP_Serial-release

The ibm compiler crashes for the two OpenMP builds with the following error that seem unrelated to what I have been doing in this PR

/home/projects/pwr8-rhel73-lsf/ibm/xl/xlC/13.1.6/bin/xlC  -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-rel\
ease/install/include -I./ -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/a\
scldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/ascldap/users/lberge/kk_lucbv_b\
uild/spot_check/TestAll_2018-06-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include -I/ascldap/users/lberge/kk_lucbv_build/spot_check/TestAll_2018-0\
6-20_08.05.12/ibm/13.1.6/OpenMP-release/kokkos/install/include/eti -std=c++11 -mcpu=power8 -mtune=power8 -qsmp=omp -I/home/lberge/kokkos/tpls/gtest -I/home/l\
berge/kk_lucbv/unit_test/ -I/home/lberge/kk_lucbv/unit_test/blas -I/home/lberge/kk_lucbv/unit_test/sparse -I/home/lberge/kk_lucbv/unit_test/graph -I/home/lbe\
rge/kk_lucbv/unit_test/../test_common -I/home/lberge/kk_lucbv/unit_test/batched -I/home/lberge/kk_lucbv/unit_test/openmp -O3 -Werror -Wall -Wshadow -pedantic\
 -Wsign-compare -Wtype-limits -Wuninitialized   -c /home/lberge/kk_lucbv/unit_test/openmp/Test_OpenMP_Batched_VectorView.cpp
1586-494 (U) INTERNAL COMPILER ERROR: Signal 11.
Calling signal handler...
/opt/ibm/xlC/13.1.6/bin/.orig/xlC: error: 1501-230 Internal compiler error; please contact your Service Representative. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make[2]: *** [Test_OpenMP_Batched_VectorMisc.o] Error 251
make[2]: *** Waiting for unfinished jobs....

The Cuda build actually never starts, probably because of the xlC crash?
I will start a spot check on bowman, hopefully it will go through more smoothly!

@ndellingwood
Copy link
Contributor

@lucbv don't worry about the xlC failure, there is an internal compiler issue with this particular compiler getting tested.
Try the following to test with Cuda:
./test_all_sandia --kokkoskernels-path=/home/lberge/kk_lucbv --kokkos-path=/home/lberge/kokkos --spot-check --with-cuda-options=enable_lambda cuda/8.0.44
Adding the extra compiler argument will force the script to run just that cuda build.

@lucbv
Copy link
Contributor Author

lucbv commented Jun 20, 2018

@ndellingwood, thanks, I just got results from spot-check for cuda on white:

Running on machine: white
Going to test compilers:  cuda/8.0.44
Testing compiler cuda/8.0.44
  Starting job cuda-8.0.44-Cuda_OpenMP-release
  PASSED cuda-8.0.44-Cuda_OpenMP-release
  Starting job cuda-8.0.44-Cuda_Serial-release
  PASSED cuda-8.0.44-Cuda_Serial-release
#######################################################
PASSED TESTS
#######################################################
cuda-8.0.44-Cuda_OpenMP-release build_time=877 run_time=612
cuda-8.0.44-Cuda_Serial-release build_time=878 run_time=992
#######################################################
FAILED TESTS
#######################################################

still waiting for bowman to complete and that should be it...

@lucbv
Copy link
Contributor Author

lucbv commented Jun 21, 2018

@srajama1 @ndellingwood here are my results from bowman:

#######################################################
PASSED TESTS
#######################################################
intel-16.4.258-Pthread-release build_time=1342 run_time=2234
intel-16.4.258-Pthread_Serial-release build_time=1909 run_time=4551
intel-16.4.258-Serial-release build_time=1274 run_time=2237
intel-17.2.174-OpenMP-release build_time=1586 run_time=812
intel-17.2.174-OpenMP_Serial-release build_time=1992 run_time=3108
intel-17.2.174-Pthread-release build_time=1166 run_time=2194
intel-17.2.174-Pthread_Serial-release build_time=1738 run_time=4417
intel-17.2.174-Serial-release build_time=1168 run_time=2248
intel-18.0.128-Pthread-release build_time=1071 run_time=2171
#######################################################
FAILED TESTS
#######################################################
intel-18.0.128-OpenMP-release (test failed)
intel-18.0.128-OpenMP_Serial-release (test failed)
intel-18.0.128-Pthread_Serial-release (test failed)
intel-18.0.128-Serial-release (test failed)

Here are the tests that failed during the spot-check sorted by build:

intel-18.0.128-OpenMP-release
[==========] 332 tests from 1 test case ran. (789436 ms total)
[  PASSED  ] 326 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_int_TestExecSpace
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_size_t_TestExecSpace
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-Serial-release
[==========] 332 tests from 1 test case ran. (2188996 ms total)
[  PASSED  ] 328 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-OpenMP_Serial-release
[==========] 332 tests from 1 test case ran. (849489 ms total)
[  PASSED  ] 326 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_int_TestExecSpace
[  FAILED  ] openmp.sparse_replaceSumIntoLonger_double_int64_t_size_t_TestExecSpace
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] openmp.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

intel-18.0.128-Pthread_Serial-release
[==========] 332 tests from 1 test case ran. (2016547 ms total)
[  PASSED  ] 328 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_serial_trsm_l_u_nt_n_dcomplex_double
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex
[  FAILED  ] serial.batched_scalar_team_trsm_l_u_nt_n_dcomplex_double

@srajama1
Copy link
Contributor

This is weird. I don't know why we don't see in our jenkins jobs and why we don't see them with Intel 18. Any way this is not due to your PR. If you are ok I can merge it.

Can you file one issues for batched trsm (@kyungjoo-kim) and another one for replaceSumInto (@crtrott) ?

@kyungjoo-kim
Copy link
Contributor

@lucbv You did not update your personal repo. Merge your repo with the develop branch then test it again.

@lucbv
Copy link
Contributor Author

lucbv commented Jun 21, 2018

@kyungjoo-kim sure I can fetch from kokkos-kernels and rebase on develop, did you already fix the errors I have been seeing in my spot-check?

@lucbv lucbv force-pushed the Deterministic_coloring branch from 8e9e5df to 4aaf45e Compare June 21, 2018 16:09
@lucbv lucbv force-pushed the Deterministic_coloring branch from 4aaf45e to eda1653 Compare July 6, 2018 13:19
lucbv added 9 commits July 6, 2018 07:22
This is a first draft and it is buggy, it seems that some conflicts lead to
 nodes missing their color. Will need to look at the atomics carefully and
also at the bit array logic...
I am adding a new version of the algorithm called VBDBIT which implements
a 64 bits, bit array to store the banned colors.
I also found a bug related to decrementing the dependency list, it is now
done atomically and that atomic decrement is more rigorously guarded to
happen only when necessary.
@lucbv
Copy link
Contributor Author

lucbv commented Jul 9, 2018

@srajama1, the code has been reworked to use functors instead of lambdas and is passing the spot check on white and bowman, so you can merge it in at your convenience : )

@srajama1 srajama1 merged commit e463b6e into kokkos:develop Jul 18, 2018
@srajama1 srajama1 mentioned this pull request Jul 18, 2018
@lucbv lucbv deleted the Deterministic_coloring branch August 13, 2018 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants