
Catastrophically slow usage of OpenMP #517

Open
ianmccul opened this issue Nov 22, 2024 · 6 comments

@ianmccul
Contributor

I came across this fragment of code:

#ifdef UNI_OMP
#pragma omp parallel for schedule(dynamic)
#endif
for (cytnx_uint64 i = 0; i < min; i++) out[i] = elem[i * N + i];
}

This is a misuse of OpenMP. As discussed in #511, OpenMP with schedule(dynamic), without a chunk size, hands each thread a single iteration of the loop at a time. It is intended for cases where each iteration does a significant amount of work that may take a variable amount of time. It is not appropriate in loops where every iteration takes the same small, fixed amount of time. Also, if you do use schedule(dynamic), then unless each iteration takes a very long time, you should set the chunk size, e.g.

#pragma omp parallel for schedule(dynamic,100000)

The effect of using schedule(dynamic) on a tight loop is catastrophic. From the Fill example discussed in #511,

| Benchmark           | Time (ms) |
|---------------------|-----------|
| FillCpu             | 1.20837   |
| FillCpuStatic       | 1.6324    |
| FillCpuDynamic      | 278.007   |
| FillCpuDynamicChunk | 1.8086    |

As this benchmark shows, there really isn't a useful benefit from using OpenMP at all in memory-bound tight loops such as the above; at best you get a few percent improvement even with lots of threads. That particular example is a reimplementation of BLAS dcopy anyway, and it is hard to imagine N ever being large enough to benefit from hand-optimization (the QR decomposition is $O(N^3)$, so speeding up an $O(N)$ loop won't do much). But using schedule(dynamic) means the loop will be running ~ 200x slower.

On further checking, the string schedule(dynamic) appears 2360 times in the codebase. I checked a random sample of 10 of these: 7 were definitely erroneous, with code essentially identical to the fragment above. The remaining 3 are most likely erroneous, but might benefit from OpenMP if done carefully (they were also 3 copies of the same code fragment, which should probably be a helper function). To confirm, I checked an additional sample of 10 and found the same mix: 7 definitely erroneous and 3 likely erroneous.

There is some benchmark code in the bm_tests/ directory, but it appears to be disabled in CMakeLists.txt. Is there a way to build the benchmarks?

I honestly don't know what to do here. Some options:

  • use sed to remove all #pragma omp lines that contain schedule(dynamic)
  • use sed to replace schedule(dynamic) with schedule(static)
  • Check by hand every use of schedule(dynamic). For 2360 occurrences, of which the sampling above suggests approximately 100% will need modifying, this will take time.

I think my preferred option is probably the first (remove all #pragma omp lines that contain schedule(dynamic)), and then (1) get the benchmarking infrastructure working, and (2) require that every use of OpenMP be accompanied by a benchmark that demonstrates a useful speedup.

@ianmccul
Contributor Author

It is also unclear to me why the #pragma omp directives are surrounded by #ifdef UNI_OMP tests. OpenMP is designed so that it can be used transparently: if the compiler isn't using OpenMP (or doesn't know what OpenMP is), the pragmas are simply ignored. It is also trivial to make a version of omp.h that implements dummy versions of the OpenMP API calls, so conditional testing isn't necessary for those either.

I suggest removing all of the #ifdef UNI_OMP conditionals. The only place UNI_OMP should appear is in a wrapper header for omp.h, where it would look similar to

// cytnx_omp.hpp

#ifdef UNI_OMP
  #include <omp.h>
#else
  #include "omp_dummy.h"
#endif
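A minimal omp_dummy.h could look like the sketch below. Which functions need stubs depends on what the codebase actually calls; the names mirror the real OpenMP API, and the serial-execution semantics (one thread, id 0) are the assumption:

```cpp
// omp_dummy.h -- sketch of serial fallbacks for common OpenMP API calls.
// With OpenMP disabled there is exactly one "thread", so each function
// returns the value the real API would return inside a serial region.
#ifndef OMP_DUMMY_H
#define OMP_DUMMY_H

inline int omp_get_num_threads() { return 1; }  // team size is 1
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }   // always the initial thread
inline int omp_get_num_procs() { return 1; }
inline void omp_set_num_threads(int) {}         // no-op
inline int omp_in_parallel() { return 0; }      // never in a parallel region

#endif  // OMP_DUMMY_H
```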

@ianmccul
Contributor Author

ianmccul commented Nov 22, 2024

The benchmark code used above:

// compile command: g++ -std=c++17 -fopenmp -O3 -o fill fill.cpp && ./fill 1000000 4

#include <iostream>
#include <iomanip>
#include <chrono>
#include <vector>
#include <cstdlib>
#include <omp.h>

using namespace std;

// from https://github.com/google/benchmark/blob/62a321d6dc377e0ba9c712b6a8d64360616de564/include/benchmark/benchmark.h#L525
template <class Tp>
inline void DoNotOptimize(Tp& value) {
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}

template <class Tp>
inline void DoNotOptimize(Tp&& value) {
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}

template <typename DType>
void FillCpu(DType* first, DType value, size_t count) {
  for (int i = 0; i < count; ++i) {
    DoNotOptimize(first[i] = value);
  }
}

template <typename DType>
void FillCpuStatic(DType* first, const DType &value, size_t count) {
   #pragma omp parallel for schedule(static)
   for (int i = 0; i < count; ++i) {
      DoNotOptimize(first[i] = value);
   }
}

template <typename DType>
void FillCpuDynamic(DType* first, const DType &value, size_t count) {
#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < count; ++i) {
    DoNotOptimize(first[i] = value);
  }
}

template <typename DType>
void FillCpuDynamicChunk(DType* first, const DType &value, size_t count) {
#pragma omp parallel for schedule(dynamic,100000)
  for (int i = 0; i < count; ++i) {
    DoNotOptimize(first[i] = value);
  }
}

struct BenchmarkResult
{
  std::string Name;
  std::chrono::duration<double> Time;
};

std::ostream& operator<<(std::ostream& out, BenchmarkResult const& x)
{
  return out << "Total time for " << x.Name << ": " << (x.Time.count()*1000) << " milliseconds";
}

std::vector<BenchmarkResult> AllResults;

void ShowAllResults()
{
  cout << "\n| Benchmark            | Time (ms) |\n";
  cout <<   "|----------------------|-----------|\n";
  for (const auto& result : AllResults) {
    cout << "| " << std::setw(20) << result.Name
         << " | " << (result.Time.count() * 1000)
         << "     |\n";
  }
}

template <typename TestFunc>
void Benchmark(std::string Test, TestFunc f, long long num_iterations)
{
  cout << "\nTesting " << Test << "..." << endl;
  auto start = chrono::high_resolution_clock::now();
  for (long long iter = 0; iter < num_iterations; ++iter) {
    f();
  }
  auto end = chrono::high_resolution_clock::now();
  const std::chrono::duration<double> total_time = end - start;
  BenchmarkResult R{Test, total_time/num_iterations};
  AllResults.push_back(R);
  cout << R << endl;
}

#define BENCHMARK(Func, numiter, ...) \
Benchmark(#Func, [&]() { Func(__VA_ARGS__); }, numiter)

int main(int argc, char** argv) {
  if (argc < 3) {
    cout << "expected: fill <count> <threads>\n";
    return 1;
  }
  int count = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  omp_set_num_threads(nthreads);
  long long num_iterations = 10'000'000'000ll / count;  // avoid int overflow for small count
  cout << "Using " << num_iterations << " iterations." << endl;
  int *ptr = new int[count];
  int value = 10;

  BENCHMARK(FillCpu, num_iterations, ptr, value, count);
  BENCHMARK(FillCpuStatic, num_iterations, ptr, value, count);
  BENCHMARK(FillCpuDynamic, num_iterations/100, ptr, value, count);  // reduce iterations -- it is too slow otherwise
  BENCHMARK(FillCpuDynamicChunk, num_iterations, ptr, value, count);

  ShowAllResults();
  delete[] ptr;
}

@yingjerkao
Collaborator

I suggest removing all of the #pragma omp directives, since the places where data parallelism matters are in BLAS and LAPACK, which should be taken care of by MKL or OpenBLAS.

@kaihsin
Member

kaihsin commented Nov 23, 2024

I'd love to see the legacy uni10 code being replaced, and it's nice that someone is finally starting to work on it. Yes, this is one of the issues that has been rooted there for a while, and we have lacked the bandwidth to work on it.

@ianmccul
Contributor Author

Remove UNI_OMP completely?

@yingjerkao
Collaborator

Remove UNI_OMP completely?

Yes, at this moment.
