Benchmark on a simple image processing task, much slower than cpp version, how to improve please? #6405

Closed
jiangsutx opened this issue Oct 23, 2022 · 8 comments

@jiangsutx

I hope to use Taichi to speed up image processing tasks, so I tested it on a simple one: a brute-force implementation of a box filter.

My Taichi code:

import time
import cv2
import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)

img = cv2.imread('1.png').astype(np.float32) / 255.0
h, w = img.shape[:2]
ti_img = ti.Vector.field(3, dtype=ti.f32, shape=(h, w))
ti_img.from_numpy(img)
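pixels = ti.Vector.field(3, dtype=ti.f32, shape=(h, w))

# d and r are not defined in the original post; these values are assumed from the "blur5x5" label below.
d = 5       # window size
r = d // 2  # window radius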

@ti.kernel
def ti_boxfilter():
    for x, y in pixels:
        pixels[x, y] = [0, 0, 0]
    for x, y in pixels:
        for i in range(d):
            for j in range(d):
                pixels[x, y] += ti_img[x + i - r, y + j - r]

    for x, y in pixels:
        pixels[x, y] /= d ** 2

N = 20
ti_boxfilter()
t0 = time.time()
for i in range(N):
    ti_boxfilter()
pixels.to_numpy()
t1 = time.time()

print('time for ti_blur5x5: ', (t1 - t0) / N)

I tested on my 2019 MacBook Pro with a 1080x1080 image.
The time on CPU is 154 ms, and on Metal it is about 3 ms.

I also tested a cpp version and tried SIMD via ispc, which brought the CPU time down to about 10 ms.

I understand Taichi may not be optimal for this kind of task on this device.
I just wonder whether there is anything I can do to push Taichi's performance further. It is really easy to write algorithms in Python.

Thanks in advance for your help.

@jiangsutx jiangsutx added the question Question on using Taichi label Oct 23, 2022
@taichi-gardener taichi-gardener moved this to Untriaged in Taichi Lang Oct 23, 2022
@neozhaoliang
Contributor

neozhaoliang commented Oct 24, 2022

@jiangsutx May I ask what's the cpp code you are using?

The box filter is separable, so one does not need to traverse all $d^2$ pixels in the window; two 1D convolutions suffice.

Also, all three for loops can be merged into a single one:

for x, y in pixels:
       pixels[x, y] = 0.0
       ...
       pixels[x, y] /= d * d
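
The separable-filter point above can be illustrated with a small sketch. It is written in plain Python so it runs without dependencies, works on a single-channel image stored as a list of rows, and clamps coordinates at the borders; the function names and the border handling are assumptions for illustration, not code from this thread. The same two-pass structure should carry over to two Taichi loops with an intermediate field.

```python
def box_filter_bruteforce(img, r):
    """Average over a (2r+1) x (2r+1) window: O(d^2) reads per pixel."""
    h, w = len(img), len(img[0])
    d = 2 * r + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in range(-r, r + 1):
                yy = min(max(y + dy, 0), h - 1)  # clamp row index (assumed border rule)
                for dx in range(-r, r + 1):
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    s += img[yy][xx]
            out[y][x] = s / (d * d)
    return out


def box_filter_separable(img, r):
    """Same result with two 1D passes: O(d) reads per pixel per pass."""
    h, w = len(img), len(img[0])
    d = 2 * r + 1
    # Horizontal pass: average each pixel with its neighbors in the same row.
    tmp = [[sum(img[y][min(max(x + dx, 0), w - 1)] for dx in range(-r, r + 1)) / d
            for x in range(w)]
           for y in range(h)]
    # Vertical pass: average the row-filtered values along each column.
    return [[sum(tmp[min(max(y + dy, 0), h - 1)][x] for dy in range(-r, r + 1)) / d
             for x in range(w)]
            for y in range(h)]
```

For a window of size d, the per-pixel work drops from d * d reads to 2 * d reads, at the cost of one temporary buffer.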

@neozhaoliang neozhaoliang self-assigned this Oct 24, 2022
@jiangsutx
Author


task void boxfilter_ispc_vanilla(uniform const uint8 img[],
                                   uniform uint8 out[],
                                   uniform int height, uniform int width, uniform int d, uniform int span) {
    uniform int y0 = taskIndex * span;
    uniform int y1 = min((taskIndex + 1) * span, height);
    uniform int r = d / 2;
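    foreach (row = y0 ... y1, col = 0 ... width) {
        int offh = row * width * 1;
        int cnt = 0;
        int sum0 = 0;
        // The window radius is hardcoded to 7 (a 15x15 window); d and r above are unused.
        int br = 7;
        // No bounds check: near the borders idx wraps into adjacent rows or falls outside the buffer.
        for (int dy = -br; dy <= br; ++dy) {
            for (int dx = -br; dx <= br; ++dx) {
                int idx = (row + dy) * width * 1 + (col + dx) * 1;
                sum0 += img[idx];
                cnt += 1;
            }
        }
        int idx = offh + col * 1;
        out[idx] = sum0 / max(cnt, 1);
    }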
}

export void boxfilter_ispc_vanilla_mt(uniform const uint8 img[],
                                   uniform uint8 out[], uniform int height, uniform int width, uniform int d) {
    uniform int span = 32;
    launch[height / span + 1] boxfilter_ispc_vanilla(img, out, height, width, d, span);
}

I wrote my own cpp code corresponding to the Taichi code above (not strictly equivalent, for example in data type).
Actually, it is written in the ispc language.

I know it can easily be accelerated to O(n), for example with an integral image as in OpenCV's cv::boxFilter, which takes 8 ms.

As I mentioned, I deliberately chose a simple, brute-force piece of code because I just want to evaluate the limit of Taichi's performance.
By the way, I also tested the integral image method in Taichi, and it is still slower than OpenCV's version.
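
For readers unfamiliar with the integral image method mentioned above, here is a minimal sketch in plain Python. It is not the Taichi or OpenCV implementation from this thread; the function names are hypothetical, the image is a single-channel list of rows, and windows are truncated at the borders (dividing by the number of pixels actually covered), which is an assumed border rule.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column.

    sat[y][x] holds the sum of img over rows [0, y) and columns [0, x).
    """
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_filter_integral(img, r):
    """Box filter with four table lookups per pixel, independent of r."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(y - r, 0), min(y + r + 1, h)  # truncate window at borders
        for x in range(w):
            x0, x1 = max(x - r, 0), min(x + r + 1, w)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            out[y][x] = total / ((y1 - y0) * (x1 - x0))
    return out
```

The cost per pixel no longer depends on the window size, which is why the total work is linear in the number of pixels.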

@neozhaoliang
Contributor

neozhaoliang commented Oct 24, 2022

The compiled kernel in your Taichi code should perform approximately the same as the cpp code.
The observed gap may be because Taichi uses pybind11; we are working on this.

@neozhaoliang neozhaoliang moved this from Untriaged to In Progress in Taichi Lang Oct 28, 2022
@strongoier
Contributor

@jiangsutx Could you try the latest taichi-nightly (pip install -i https://pypi.taichi.graphics/simple/ taichi-nightly) and test again under packed mode (ti.init(arch=ti.cpu, packed=True))? We are more than glad to help improve the performance together. Thanks!

@jiangsutx
Author

Thanks for your reply.

I did some tests:

  1. I did not change the Taichi box filter code.
  2. When I simply used ti.init(arch=ti.cpu, packed=True), the time increased from about 210 ms to about 285 ms.
  3. After I additionally installed taichi-nightly, it decreased to about 56 ms.

I observed about a 4x speed-up on my MacBook. Great!

Could you please explain the differences a little, or give some advice about which version to use or best practices?

@bobcao3
Collaborator

bobcao3 commented Oct 31, 2022

That smells like JIT overhead... Kind of confusing, because that's an extremely big speedup, but from the code it shouldn't be a JIT issue
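
One way to check whether first-call compilation is inflating a measurement is to time the first call separately from the repeated calls. The helper below is a generic sketch using only the standard library; its name and signature are hypothetical. It assumes fn blocks until the work is finished, so on an asynchronous backend the callable would need to include a synchronization step.

```python
import time


def time_first_and_steady(fn, repeats=20):
    """Return (first_call_seconds, average_seconds_over_repeats).

    The first call typically includes any one-time cost such as JIT
    compilation; the later calls approximate steady-state run time.
    """
    t0 = time.perf_counter()
    fn()
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    steady = (time.perf_counter() - t0) / repeats
    return first, steady
```

A large gap between the two numbers points at one-time overhead; similar numbers suggest the kernel itself is the cost.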

@strongoier
Contributor

Hi @jiangsutx. Thanks for reporting the results. You can first take a look at https://docs.taichi-lang.org/docs/layout#packed-mode. Then you will realize that when packed=False, the shape of your field is not (1080, 1080), but (2048, 2048). Therefore, you are actually iterating on a field which is 4x larger than you expect.

When I simply used ti.init(arch=ti.cpu, packed=True), the time increased from about 210 ms to about 285 ms.

This is because packed mode used to have an overhead on address calculation as stated in the doc.

After I additionally installed taichi-nightly, it decreased to about 56 ms.

This is because in the latest commit such overhead has been eliminated for common scenarios.
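
The padding arithmetic described above can be worked through in a few lines. This sketch assumes, as the comment and the linked layout doc describe, that with packed=False each field dimension is rounded up to the next power of two; the helper name is hypothetical and not part of the Taichi API.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


h, w = 1080, 1080
padded = (next_power_of_two(h), next_power_of_two(w))
ratio = (padded[0] * padded[1]) / (h * w)
print(padded, round(ratio, 2))  # (2048, 2048) 3.6
```

So the padded field holds about 3.6 times as many elements as the image, which is the "4x larger" iteration space referred to above.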

@strongoier
Contributor

The remaining gap between taichi and ispc is likely due to lack of SIMD support in Taichi CPU backends. We are working on that and you can expect better performance in future versions. I'll close this issue for now. Feel free to open new issues if you meet other problems.

Repository owner moved this from In Progress to Done in Taichi Lang Nov 11, 2022