Passing struct vs dereferencing fields in struct performance #4802

Open
bsavery opened this issue Apr 15, 2022 · 6 comments
Labels: advanced optimization, discussion

Comments

@bsavery
Contributor

bsavery commented Apr 15, 2022

In short, I see a significant performance decrease when passing a struct to a function versus dereferencing the struct fields and passing the values.

Here I show two ways to intersect a ray with a bunch of spheres: one passes the sphere struct, the other reads the fields of the sphere struct and passes those.

import taichi as ti
import time

ti.init(arch=ti.gpu)

n_spheres = 100000

# just a ray pointing in Z direction
ray_org = ti.Vector([0.0, 0.0, 0.0])
ray_dir = ti.Vector([0.0, 0.0, 1.0])

# random set of spheres
sphere_type = ti.types.struct(center=ti.types.vector(3, ti.f32), radius=ti.f32)
spheres = sphere_type.field(shape=(n_spheres,))
@ti.kernel
def fill_spheres():
    # generate random spheres in 0-100 xyz with radius 0-10
    for i in spheres:
        spheres[i].radius = ti.random() * 10.0
        spheres[i].center = ti.Vector([ti.random(), ti.random(), ti.random()]) * 100.0
fill_spheres()


@ti.kernel
def pass_reference() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        if intersect_sphere(sphere, ray_org, ray_dir):
            num_hit += 1
    return num_hit


@ti.kernel
def pass_decomposed() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        center, radius = sphere.center, sphere.radius
        if intersect_center_radius(center, radius, ray_org, ray_dir):
            num_hit += 1
    return num_hit


@ti.func
def intersect_sphere(sphere, ray_origin, ray_direction):
    # return if the ray hits the sphere
    oc = ray_origin - sphere.center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - sphere.radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0

@ti.func
def intersect_center_radius(center, radius, ray_origin, ray_direction):
    # return if the ray hits the decomposed sphere
    oc = ray_origin - center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0


t = time.time()
pass_reference()
print('Passing reference', time.time() - t)

t = time.time()
pass_decomposed()
print('Passing decomposed', time.time() - t)

Also note that the choice of Metal vs. Vulkan affects the timings (in seconds) quite a bit:
Metal
Passing reference 0.23202180862426758
Passing decomposed 0.09195494651794434

Vulkan
Passing reference 0.07849979400634766
Passing decomposed 0.05303597450256348

CPU
Passing reference 0.08902120590209961
Passing decomposed 0.06803393363952637

Originally posted by @bsavery in #4784 (reply in thread)

@FantasyVR
Collaborator

cc @qiao-bo @turbo0628

@k-ye
Member

k-ye commented Apr 18, 2022

As a good first step, we can enable ti.init(..., print_ir=True) to see if there's a significant difference in the CHI IR. See #4784 (reply in thread)

@k-ye
Member

k-ye commented Apr 18, 2022

Note that you might also want to exclude the timing for the first run, because that counts JIT time as well.
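
A minimal sketch of a timing helper that leaves the first (JIT-compiling) call out of the measurement, using only the standard library. The `benchmark` helper and its `warmup`, `repeats`, and `sync` parameters are hypothetical names for illustration and are not part of Taichi; `fake_kernel` is a stand-in whose first call is artificially slow.

```python
import time


def benchmark(fn, warmup=1, repeats=10, sync=None):
    """Return the best wall-clock time (seconds) of fn() over `repeats` runs.

    Hypothetical helper: `warmup` untimed calls absorb one-off costs such as
    JIT compilation; `sync`, if given, is called before each clock read so
    that asynchronous work is finished before it is measured.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


calls = []


def fake_kernel():
    # Stand-in for a kernel: only the first call is slow, like JIT compilation.
    if not calls:
        time.sleep(0.05)
    calls.append(1)


best = benchmark(fake_kernel, warmup=1, repeats=5)
print(f"best of 5 after warmup: {best:.6f} s")
```

With the script from the issue, this might be called as `benchmark(pass_reference, sync=ti.sync)` and `benchmark(pass_decomposed, sync=ti.sync)`. Taking the minimum over several runs is one common choice for reducing noise; a mean or median would also be reasonable.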

@bsavery
Contributor Author

bsavery commented Apr 18, 2022

@k-ye thanks for that.

So with a simple change to exclude the JIT time (run each kernel once before timing), I get:

Vulkan
Passing reference 0.02786087989807129
Passing decomposed 0.047796010971069336 (worse!)

CPU
Passing reference 0.01233816146850586
Passing decomposed 0.011642217636108398

Metal
Passing reference 0.0014309883117675781
Passing decomposed 0.0014178752899169922

So the two kernels do seem to produce different code, and the earlier timings mostly reflected that the compilation / JIT time differs between them (because different code is being generated).

Side question: Is there a way to not recompile kernels each test run? I.e. if I run python test.py and the code for the kernel doesn't change, it would be nice to not recompile next time I call python test.py. Not sure how you could do this, maybe with timestamps or checking if the IR is different, but you could imagine this being useful.

@k-ye
Member

k-ye commented Apr 18, 2022

> Is there a way to not recompile kernels each test run?

Yup, @PGZXB is working on an offline cache system (#4401), starting with the LLVM backend. We are using the AST as the cache key for now. @PGZXB only has one day or so per week to work on Taichi, so the feature is a bit slow to release. But we are moving in that direction. Thanks for your suggestion :-)
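
To illustrate the general idea of keying a cache on an AST, here is a minimal sketch using Python's standard `ast` module. This is an assumption-laden analogy, not Taichi's implementation: Taichi's cache works on its own internal AST, and `ast_cache_key` is a hypothetical name. A real cache key would presumably also need to cover things like the backend, compile options, and library version.

```python
import ast
import hashlib


def ast_cache_key(source: str) -> str:
    """Hash the parsed AST of `source` (hypothetical helper).

    Line/column attributes are excluded from the dump, so formatting and
    comments do not change the key, while any change to the code does.
    """
    tree = ast.parse(source)
    dumped = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


a = ast_cache_key("def f(x):\n    return x + 1\n")
b = ast_cache_key("def f(x):   # same code, different formatting\n    return x+1\n")
c = ast_cache_key("def f(x):\n    return x + 2\n")

print(a == b)  # same AST, so the cached artifact could be reused
print(a == c)  # different AST, so a recompile would be needed
```

Hashing the tree rather than the raw source text or a file timestamp means that cosmetic edits do not invalidate the cache.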

@FantasyVR added the discussion and advanced optimization labels Apr 21, 2022
@bobcao3
Collaborator

bobcao3 commented Apr 21, 2022

That's quite fascinating. I'd imagine the generated CHI IR should be quite similar, and it's weird that only Vulkan shows a big regression. (And I'm assuming this is on a Mac, where SPIR-V is actually translated to MSL by MoltenVK?)

We should check three things:

  1. Check for consistency and reproducibility on other devices (i.e. whether this problem depends on a specific environment or comes purely from codegen)
  2. Compare the CHI-IR differences
  3. Compare the SPIR-V differences

Something else that can be quite helpful is to run the two SPIR-V outputs through Radeon Graphics Analyzer as well, to get the raw assembly along with instruction counts and cycle latency readings.
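
For the comparison steps in the list above, a minimal sketch of diffing two text dumps with the standard `difflib` module. It assumes the IR (or disassembled SPIR-V) for each kernel has already been captured as text by some means; `diff_ir` is a hypothetical helper, and the two dump strings are placeholders, not real CHI IR.

```python
import difflib


def diff_ir(ir_a, ir_b, name_a="pass_reference", name_b="pass_decomposed"):
    """Return a unified diff (list of lines) between two IR text dumps."""
    return list(
        difflib.unified_diff(
            ir_a.splitlines(),
            ir_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
            n=1,  # one line of context around each change
        )
    )


# Placeholder text only; real dumps would come from the compiler output.
dump_a = "load sphere\nload center\nload radius\ncompute discriminant"
dump_b = "load center\nload radius\ncompute discriminant"

for line in diff_ir(dump_a, dump_b):
    print(line)
```

Real dumps often differ in auto-generated statement IDs, so some normalization before diffing may be needed to keep the output readable.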

@Leonz5288 moved this to Untriaged in Taichi Lang Apr 29, 2022
@Leonz5288 moved this from Untriaged to Backlog in Taichi Lang Apr 29, 2022