Benchmark on a simple image processing task, much slower than cpp version, how to improve please? #6405

jiangsutx · 2022-10-23T09:03:09Z

I hope to use Taichi on image processing tasks for performance.
So I tested on a simple task: a brute-force implementation of box filter.

My Taichi code:

import time
import cv2
import numpy as np  # needed for np.float32 below
import taichi as ti

ti.init(arch=ti.cpu)

img = cv2.imread('1.png').astype(np.float32) / 255.0
h, w = img.shape[:2]
ti_img = ti.Vector.field(3, dtype=ti.f32, shape=(h, w))
ti_img.from_numpy(img)
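pixels = ti.Vector.field(3, dtype=ti.f32, shape=(h, w))

# d and r are not defined in the original post. The values here are an
# assumption taken from the 'ti_blur5x5' label in the print statement below.
d = 5       # filter diameter (assumed)
r = d // 2  # filter radius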

@ti.kernel
def ti_boxfilter():
    for x, y in pixels:
        pixels[x, y] = [0, 0, 0]
    for x, y in pixels:
        for i in range(d):
            for j in range(d):
                pixels[x, y] += ti_img[x + i - r, y + j - r]

    for x, y in pixels:
        pixels[x, y] /= d ** 2

N = 20
ti_boxfilter()
t0 = time.time()
for i in range(N):
    ti_boxfilter()
pixels.to_numpy()
t1 = time.time()

print('time for ti_blur5x5: ', (t1 - t0) / N)

I tested on my 2019 MacBook Pro with a 1080x1080 image.
The time is 154 ms on CPU and about 3 ms on Metal.

I also tested a C++ version and tried using SIMD via ispc, which finally brought the CPU time down to about 10 ms.

I understand Taichi may not be optimal for this kind of task on this kind of device.
I just wonder whether there is anything I can do to push Taichi's limits further. It is really easy to write algorithms in Python.

Thanks in advance for help.


neozhaoliang · 2022-10-24T01:24:14Z

@jiangsutx May I ask what's the cpp code you are using?

The box filter is separable, so there is no need to traverse all $d^2$ pixels in the window: two 1D convolutions suffice.

Also, all three for loops can be merged into a single one:

for x, y in pixels:
    pixels[x, y] = 0.0
    ...
    pixels[x, y] /= d * d
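
To illustrate the separability point, here is a minimal pure-Python sketch (not a Taichi kernel, and not code from this thread) that checks that two 1D passes give the same window sums as the brute-force double loop. The helper names and the clamp-to-edge border handling are assumptions made for the example; the original post does not handle borders at all.

```python
def clamp(v, lo, hi):
    """Clamp v into [lo, hi]; used here as an assumed border policy."""
    return max(lo, min(v, hi))


def box_sum_brute(img, r):
    """Sum over the full (2r+1) x (2r+1) window: O(d^2) reads per pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x in range(h):
        for y in range(w):
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    out[x][y] += img[clamp(x + i, 0, h - 1)][clamp(y + j, 0, w - 1)]
    return out


def box_sum_separable(img, r):
    """Two 1D passes: O(d) reads per pixel in each pass."""
    h, w = len(img), len(img[0])
    # Pass 1: sum along each row.
    tmp = [[sum(img[x][clamp(y + j, 0, w - 1)] for j in range(-r, r + 1))
            for y in range(w)] for x in range(h)]
    # Pass 2: sum the row sums along each column.
    return [[sum(tmp[clamp(x + i, 0, h - 1)][y] for i in range(-r, r + 1))
             for y in range(w)] for x in range(h)]


# Small integer test image so the comparison is exact.
img = [[(3 * x + 7 * y) % 11 for y in range(6)] for x in range(5)]
print(box_sum_brute(img, 2) == box_sum_separable(img, 2))  # prints True
```

Dividing each sum by d * d (with d = 2r + 1) gives the box-filter mean. In Taichi, the same idea would presumably become two kernels, or two top-level loops, with an intermediate field.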

jiangsutx · 2022-10-24T02:06:46Z

> @jiangsutx May I ask what's the cpp code you are using?
>
> The box filter is a separable filter, hence one does not need to traverse all the $d^2$ pixels, just two 1d convolutions would suffice.
>
> Also, all three for loops can be put into a single one:
>
> for x, y in pixels:
>     pixels[x, y] = 0.0
>     ...
>     pixels[x, y] /= d * d

task void boxfilter_ispc_vanilla(uniform const uint8 img[],
                                 uniform uint8 out[],
                                 uniform int height, uniform int width,
                                 uniform int d, uniform int span) {
    // Each task processes a band of `span` rows.
    uniform int y0 = taskIndex * span;
    uniform int y1 = min((taskIndex + 1) * span, height);
    uniform int r = d / 2;  // unused: the radius is hard-coded as br below
    foreach (row = y0 ... y1, col = 0 ... width) {
        int offh = row * width * 1;
        int cnt = 0;
        int sum0 = 0;
        int br = 7;
        // Brute-force sum over the (2 * br + 1)^2 window.
        // Note: there is no bounds check at the image borders.
        for (int dy = -br; dy <= br; ++dy) {
            for (int dx = -br; dx <= br; ++dx) {
                int idx = (row + dy) * width * 1 + (col + dx) * 1;
                sum0 += img[idx];
                cnt += 1;
            }
        }
        int idx = offh + col * 1;
        out[idx] = sum0 / max(cnt, 1);
    }
}

export void boxfilter_ispc_vanilla_mt(uniform const uint8 img[],
                                      uniform uint8 out[],
                                      uniform int height, uniform int width,
                                      uniform int d) {
    uniform int span = 32;
    // Launch one task per band of `span` rows.
    launch[height / span + 1] boxfilter_ispc_vanilla(img, out, height, width, d, span);
}

I wrote my own C++ code corresponding to the Taichi code above (not strictly equivalent, for example in data type).
Actually, it is written in the ispc language.

I know the filter can easily be accelerated to O(n), for example with an integral image, as in OpenCV's cv::boxFilter, which takes 8 ms.

As I mentioned, I deliberately chose a simple, brute-force piece of code because I just want to evaluate the limit of Taichi's performance.
By the way, I also tested the integral image method in Taichi, and it is still slower than OpenCV's version.
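
For readers unfamiliar with the integral image approach mentioned above, the following is a minimal pure-Python sketch of the idea. It is not the Taichi or OpenCV implementation discussed in this thread. The function names are made up for the example, and windows are truncated at the image borders, which is an assumption.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column:
    sat[x][y] holds the sum of img[0:x][0:y]."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for x in range(h):
        row = 0  # running sum of the current row
        for y in range(w):
            row += img[x][y]
            sat[x + 1][y + 1] = sat[x][y + 1] + row
    return sat


def box_sum_integral(img, r):
    """Window sums with four table lookups per pixel, independent of r."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for x in range(h):
        for y in range(w):
            # Window bounds, truncated at the borders (assumed policy).
            x0, x1 = max(x - r, 0), min(x + r + 1, h)
            y0, y1 = max(y - r, 0), min(y + r + 1, w)
            out[x][y] = sat[x1][y1] - sat[x0][y1] - sat[x1][y0] + sat[x0][y0]
    return out


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(box_sum_integral(img, 1)[1][1])  # prints 45, the sum of all nine pixels
```

Dividing each sum by the window area, (x1 - x0) * (y1 - y0), gives the mean. The cost per pixel no longer depends on the filter size, which is why this method scales linearly with the number of pixels.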

neozhaoliang · 2022-10-24T02:15:34Z

The performance of the compiled kernel in your taichi code should be approximately the same as the cpp code.
The observed gap may be because Taichi used pybind11, and we are working on this.

strongoier · 2022-10-28T10:26:03Z

@jiangsutx Could you try the latest taichi-nightly (pip install -i https://pypi.taichi.graphics/simple/ taichi-nightly) and test again under packed mode (ti.init(arch=ti.cpu, packed=True))? We are more than glad to help improve the performance together. Thanks!

jiangsutx · 2022-10-31T03:40:44Z

Thanks for your reply.

I did some tests:

I did not change the Taichi box filter code.
Simply using ti.init(arch=ti.cpu, packed=True) increased the time from about 210 ms to about 285 ms.
After I also installed taichi-nightly, the time decreased to about 56 ms.

I observed about a 4x speed-up on my MacBook. Great!

Would you please explain the differences a little, or give some advice on versions and best practices?

bobcao3 · 2022-10-31T03:49:37Z

That smells like JIT overhead... Kind of confusing, because that's an extremely big speedup, but from the code it shouldn't be a JIT issue

strongoier · 2022-10-31T11:53:45Z

Hi @jiangsutx. Thanks for reporting the results. You can first take a look at https://docs.taichi-lang.org/docs/layout#packed-mode. Then you will realize that when packed=False, the shape of your field is not (1080, 1080), but (2048, 2048). Therefore, you are actually iterating on a field which is 4x larger than you expect.
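
The arithmetic behind this can be sketched in a few lines of plain Python. This assumes, as the comment above describes, that non-packed mode pads each axis up to the next power of two. The helper `next_pow2` is made up for the example and is not a Taichi API.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


h, w = 1080, 1080
ph, pw = next_pow2(h), next_pow2(w)  # padded shape under the assumption above
ratio = (ph * pw) / (h * w)          # how many more elements the padded field has
print(ph, pw, round(ratio, 2))       # prints: 2048 2048 3.6
```

So the padded field holds roughly 3.6 times as many elements as the image, which matches the "4x larger" estimate above.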

> Simply using ti.init(arch=ti.cpu, packed=True) increased the time from about 210 ms to about 285 ms.

This is because packed mode used to have an overhead on address calculation as stated in the doc.

> After I also installed taichi-nightly, the time decreased to about 56 ms.

This is because in the latest commit such overhead has been eliminated for common scenarios.

strongoier · 2022-11-11T05:33:05Z

The remaining gap between taichi and ispc is likely due to lack of SIMD support in Taichi CPU backends. We are working on that and you can expect better performance in future versions. I'll close this issue for now. Feel free to open new issues if you meet other problems.

jiangsutx added the question (Question on using Taichi) label Oct 23, 2022

taichi-gardener added this to Taichi Lang Oct 23, 2022

taichi-gardener moved this to Untriaged in Taichi Lang Oct 23, 2022

neozhaoliang self-assigned this Oct 24, 2022

neozhaoliang assigned turbo0628 Oct 28, 2022

neozhaoliang moved this from Untriaged to In Progress in Taichi Lang Oct 28, 2022

strongoier closed this as completed Nov 11, 2022

Repository owner moved this from In Progress to Done in Taichi Lang Nov 11, 2022
