Benchmark on a simple image processing task, much slower than cpp version, how to improve please? #6405
Comments
@jiangsutx May I ask what's the cpp code you are using? The box filter is a separable filter, hence one does not need to traverse all the Also, all three for loops can be put into a single one: for x, y in pixels:
pixels[x, y] = 0.0
...
pixels[x, y] /= d * d into the above for loop.
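To make the separability remark concrete, here is a minimal sketch in plain Python (not Taichi, so that it runs without extra dependencies). The function names `box_sum_bruteforce` and `box_sum_separable` are illustrative, and treating pixels outside the image as zero is an assumption, not something the thread specifies. The brute-force version touches (2r+1)² pixels per output pixel, while the two 1D passes touch 2(2r+1).

```python
def box_sum_bruteforce(img, r):
    """Sum over a (2r+1) x (2r+1) window; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        s += img[yy][xx]
            out[y][x] = s
    return out


def box_sum_separable(img, r):
    """Same result via a horizontal 1D pass followed by a vertical 1D pass."""
    h, w = len(img), len(img[0])
    # Horizontal pass: each pixel sums its neighbours within the same row.
    tmp = [[sum(img[y][max(0, x - r):min(w, x + r + 1)]) for x in range(w)]
           for y in range(h)]
    # Vertical pass: each pixel sums its column neighbours in the intermediate image.
    out = [[sum(tmp[yy][x] for yy in range(max(0, y - r), min(h, y + r + 1)))
            for x in range(w)]
           for y in range(h)]
    return out


# Small synthetic image; both versions must agree exactly.
img = [[(3 * y + 5 * x) % 17 for x in range(9)] for y in range(7)]
assert box_sum_separable(img, 2) == box_sum_bruteforce(img, 2)
```

Dividing each sum by the window area then gives the box-filter mean. The same two-pass structure should carry over to two Taichi kernels with an intermediate field, though that port is not shown here.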
task void boxfilter_ispc_vanilla(uniform const uint8 img[],
                                 uniform uint8 out[],
                                 uniform int height, uniform int width, uniform int d, uniform int span) {
    // Each task handles `span` consecutive rows.
    uniform int y0 = taskIndex * span;
    uniform int y1 = min((taskIndex + 1) * span, height);
    uniform int r = d / 2;  // computed but not used below
    foreach (row = y0 ... y1, col = 0 ... width) {
        int offh = row * width * 1;
        int cnt = 0;
        int sum0 = 0;
        int br = 7;  // hard-coded radius, i.e. a 15x15 window
        // Note: no border handling, so idx can fall outside the image near the edges.
        for (int dy = -br; dy <= br; ++dy) {
            for (int dx = -br; dx <= br; ++dx) {
                int idx = (row + dy) * width * 1 + (col + dx) * 1;
                sum0 += img[idx];
                cnt += 1;
            }
        }
        int idx = offh + col * 1;
        out[idx] = sum0 / max(cnt, 1);
    }
}
export void boxfilter_ispc_vanilla_mt(uniform const uint8 img[],
                                      uniform uint8 out[], uniform int height, uniform int width, uniform int d) {
    uniform int span = 32;
    // One task per block of `span` rows.
    launch[height / span + 1] boxfilter_ispc_vanilla(img, out, height, width, d, span);
}
I wrote my own cpp code, corresponding to the taichi code above (not strictly equivalent, e.g. in data types). I know it can easily be accelerated to O(n), for example by using an integral image, as in OpenCV. As I mentioned, I chose a simple, brute-force piece of code because I just want to evaluate the limit of taichi's performance.
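For readers unfamiliar with the integral-image approach mentioned above, the following is a rough sketch in plain Python. The names `integral_image` and `box_mean_integral` are illustrative, and clipping the window at the image borders is an assumption (the ISPC code above does not clip). After one pass to build the table, each output pixel needs only four lookups, regardless of the window size.

```python
def integral_image(img):
    """ii[y][x] is the sum of all img pixels in rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]  # extra zero row and column
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_mean_integral(img, r):
    """Integer box mean over a (2r+1) x (2r+1) window clipped to the image."""
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            # Window sum from four corner lookups in the integral image.
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            out[y][x] = total // ((y1 - y0) * (x1 - x0))
    return out
```

The cost is O(1) per pixel after the O(n) table build, which is why this is not a fair stand-in for the brute-force benchmark discussed in this thread.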
The performance of the compiled kernel in your taichi code should be approximately the same as the cpp code.
@jiangsutx Could you try the latest taichi-nightly (
Thanks for your reply. I did some tests:
I observed about Would you please explain the differences a little, or give some advice about which version to use or best practices?
That smells like JIT overhead... Kind of confusing, because that's an extremely big speedup, but from the code it shouldn't be a JIT issue.
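One way to check whether JIT overhead is part of a measurement is to time the first call separately from later calls. The helper below is a hypothetical sketch in plain Python (`time_first_and_steady` and `workload` are made-up names, and the workload is only a stand-in for the kernel under test). On GPU backends one would presumably also need to synchronize before stopping the timer.

```python
import time


def time_first_and_steady(fn, repeats=10):
    """Return (first_call_seconds, best_later_call_seconds) for a zero-argument callable."""
    t0 = time.perf_counter()
    fn()  # first call: for a JIT-compiled kernel this would include compilation
    first = time.perf_counter() - t0

    steady = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        steady = min(steady, time.perf_counter() - t0)
    return first, steady


# Stand-in workload (an assumption; replace with a call to the kernel under test).
def workload():
    return sum(i * i for i in range(100_000))


first, steady = time_first_and_steady(workload)
print(f"first call: {first * 1e3:.2f} ms, steady state: {steady * 1e3:.2f} ms")
```

If the first call is far slower than the steady-state time, compilation is likely included in the reported number; if both are similar, the gap comes from the kernel itself.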
Hi @jiangsutx. Thanks for reporting the results. You can first take a look at https://docs.taichi-lang.org/docs/layout#packed-mode. Then you will realize that when
This is because packed mode used to have an overhead on address calculation as stated in the doc.
This is because in the latest commit such overhead has been eliminated for common scenarios.
The remaining gap between taichi and ispc is likely due to the lack of SIMD support in Taichi's CPU backends. We are working on that, and you can expect better performance in future versions. I'll close this issue for now. Feel free to open new issues if you run into other problems.
I hope to use Taichi on image processing tasks for performance.
So I tested on a simple task: a brute-force implementation of box filter.
My taichi code:
I tested on my MacBook Pro 2019, with an image of resolution 1080x1080. The time on CPU is 154 ms, and on Metal it is about 3 ms. I also tested cpp code and tried to use SIMD via ispc, and the CPU time finally dropped to about 10 ms. I understand taichi may not be optimal for such a task and such a device.
I just wonder whether there is anything I can do to further push the limits of taichi. It is really easy to write algorithms in Python.
Thanks in advance for help.