
[TOPI][Hexagon] Implement quantized avgpool #12340

Merged
17 commits merged on Aug 24, 2022

Conversation

jverma-quic
Contributor

@jverma-quic jverma-quic commented Aug 8, 2022

Thanks for contributing to TVM! Please refer to guideline https://tvm.apache.org/docs/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

cc @mehrdadh

@jverma-quic
Contributor Author

jverma-quic commented Aug 8, 2022

@github-actions github-actions bot requested a review from mehrdadh August 8, 2022 20:14
Contributor

@cconvey cconvey left a comment

Nice work! I may have more comments later, but I thought I had enough feedback for this iteration.

raise RuntimeError("Output width is too large")


def saturate(x, dtype):

This comment was marked as resolved.

Comment on lines 55 to 57
return te.max(0, te.min(x, 255))
elif dtype == "int8":
return te.max(-128, te.min(x, 127))

This comment was marked as resolved.

Contributor Author

Sure, will look into it.

raise RuntimeError(f"Unexpected layout '{layout}'")


def get_fixed_point_value(flp, dtype="int16"):
Contributor

Python type annotations (for params and return value) would definitely be helpful here.

E.g., is flp a Python intrinsic, a Numpy numeric value, a TE PrimExpr, or something else?

raise RuntimeError(f"Unexpected layout '{layout}'")


def get_fixed_point_value(flp, dtype="int16"):
Contributor

Could we have a few unit tests for this function? It's sufficiently complicated that the code isn't obviously correct just from reading it.

Contributor Author

I'll look into adding some test cases. Thanks!

raise RuntimeError(f"Unexpected layout '{layout}'")


def get_fixed_point_value(flp, dtype="int16"):
Contributor

I don't notice anything in the function body (or docs) indicating if/how this function handles flp values that are:

  • denormalized
  • positive or negative infinity
  • NaN

Should this function handle those cases gracefully or at least assert if they're encountered?

I think regardless of how they're handled, the docstring should discuss the issue.

Contributor Author

Good point! Thanks! The function handles denormalized values, but doesn't handle NaN or infinity. I'll add asserts for those two cases and will also include additional details on the denormalized case.

PoolArea = kh * kw

scale = input_scale / output_scale
scale_fixed_point, rsh = get_fixed_point_value(scale, "int16")
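For intuition, here is a rough NumPy sketch (not the PR's actual compute; function name and the scale values are hypothetical) of how an integer multiplier plus a right shift replaces the float multiply when requantizing an average pool:

```python
import numpy as np

def avg_pool_requantize(window: np.ndarray, scale_fxp: int, rsh: int, pool_area: int) -> int:
    """Hypothetical sketch: sum the window, rescale with a fixed-point
    multiplier and right shift, average, then saturate to uint8."""
    total = int(window.astype(np.int32).sum())
    # (total * scale_fxp) >> rsh approximates total * scale
    scaled = (total * scale_fxp) >> rsh
    avg = scaled // pool_area
    return int(np.clip(avg, 0, 255))

window = np.array([[10, 20], [30, 40]], dtype=np.uint8)
# scale = 0.5 represented as fxp = 1 << 14 with rsh = 15 (hypothetical values)
out = avg_pool_requantize(window, 1 << 14, 15, 4)  # (100 * 0.5) / 4, floored
```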

This comment was marked as resolved.

Contributor Author

rsh (short for right shift) is log2 of the scale factor. Sorry, I couldn't really think of a better name (maybe log2_scale_factor). If you have any suggestions, please do share.
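To illustrate the relationship between flp, fxp, and rsh, here is a hypothetical sketch (not the PR's implementation of get_fixed_point_value):

```python
import math

def to_fixed_point(flp: float, bits: int = 16):
    """Hypothetical illustration: pick rsh so the scaled value fills most
    of the available bits, then round to an integer."""
    exp = math.frexp(flp)[1]      # flp = m * 2**exp with 0.5 <= |m| < 1
    rsh = (bits - 2) - exp        # rsh is log2 of the scale factor
    fxp = round(flp * (2 ** rsh))
    return fxp, rsh

fxp, rsh = to_fixed_point(0.75)   # fxp / 2**rsh recovers ~0.75
```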

This comment was marked as resolved.



def qnn_avg_pool2d_compute(
data,
Contributor

Would it make sense to document any limitation regarding the applicability / correctness of this function?

I.e., if someone was writing a model and tried to use this TOPI code, how would they discover any limitations (if there are any) on how this could be used?

Contributor Author

Yes, there are some assumptions made which are listed at the top of the file.

Comment on lines 117 to 171
def schedule_nhwc_8h8w32c(outs, ins, output_layout: str, input_layout: str):
"""Schedule for input and output layout nhwc-8h8w32c"""
func = te.create_prim_func([ins, outs])
s = tir.Schedule(func)
Sum = s.get_block("sum")
Avg = s.get_block("avg")

input_transform_fn = get_layout_transform_fn(input_layout)
output_transform_fn = get_layout_transform_fn(output_layout)
s.transform_layout(Sum, ("read", 0), input_transform_fn)
s.transform_layout(Avg, ("write", 0), output_transform_fn)

# Schedule 'Avg'
n, h, w, c = s.get_loops(Avg)
ho, hi = s.split(h, [None, 8])
wo, wi = s.split(w, [None, 8])
wio, wii = s.split(wi, [None, 4])
co, ci = s.split(c, [None, 32])
s.reorder(n, ho, wo, co, hi, wio, wii, ci)
wii_ci = s.fuse(wii, ci)
s.vectorize(wii_ci)

# Schedule 'Sum'
s.compute_at(Sum, wio)
Sum_axis = s.get_loops(Sum)
s.reorder(Sum_axis[-2], Sum_axis[-1], Sum_axis[-4], Sum_axis[-3])
ci_wii = s.fuse(Sum_axis[-4], Sum_axis[-3])
# s.vectorize(ci_wii) # Doesn't work
return s


def schedule_n11c_2048c(outs, ins, output_layout: str, input_layout: str):
"""Schedule for output layout: n11c-2048c, input layout: nhwc-8h8w32c"""
func = te.create_prim_func([ins, outs])
s = tir.Schedule(func)
Sum = s.get_block("sum")
Avg = s.get_block("avg")

input_transform_fn = get_layout_transform_fn(input_layout)
output_transform_fn = get_layout_transform_fn(output_layout)
s.transform_layout(Sum, ("read", 0), input_transform_fn)
s.transform_layout(Avg, ("write", 0), output_transform_fn)

# Schedule 'Avg'
n, h, w, c = s.get_loops(Avg)
co, ci = s.split(c, [None, 2048])
cio, cii = s.split(ci, [None, 128])
s.vectorize(cii)

# Schedule 'Sum'
s.compute_at(Sum, cio)
Sum_axis = s.get_loops(Sum)
s.reorder(Sum_axis[-2], Sum_axis[-1], Sum_axis[-3])
# s.vectorize(Sum_axis[-3]) # Doesn't work
return s

This comment was marked as resolved.

Contributor Author

Sure! I'll add some comments.

@jverma-quic
Contributor Author

Please let me know if there are any additional comments. If not, can someone approve and merge it for me please?

@TejashShah

cc @cconvey @mehrdadh, please take a relook of the updated files.

@cconvey
Contributor

cconvey commented Aug 16, 2022

@jverma-quic : Could you please click the "Ready for re-review" link next to my name? I think that might be necessary to separate my old vs. new review comments.

Comment on lines 41 to 42
fp6 = np.random.uniform(2.44885652993e38, 2.54885652993e38, size=(1))
fp7 = np.random.uniform(1.46711479073e-34, 1.76098837843e-34, size=(1))
Contributor

These numbers seem pretty specific. It would be nice to have a comment indicating what (if anything) they correspond to.

Contributor Author

The numbers don't really mean anything; I just wanted to test with some very large and very small floating-point values to make sure that the conversion function handles them properly, i.e., doesn't introduce a large error.

python/tvm/topi/hexagon/utils.py (outdated; resolved)
Comment on lines +45 to +53
fxp, rsh = utils.get_fixed_point_value(flp, "int16")
# Compute scale_factor using rsh (rsh is log2 of the scale_factor). While doing this,
# we use IEEE-754 floating-point representation since rsh can be negative or positive.

scale = ((rsh + 127) & 0xFF) << 23 # Add bias (127) and position it into exponent bits
scale_i = struct.pack("I", scale) # Pack it as integer
scale_f = struct.unpack("f", scale_i) # Unpack as float

converted_flp = fxp / scale_f[0]
Contributor

Would it make sense to move this logic into new function in utils, e.g. get_floating_point_value(fxp:int, rsh:int, dtype="float16") -> float ?

I'm just thinking that the two conversion functions probably belong in the same place, even if one is currently used only for testing.
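Under the suggested signature, such a helper might look like the following sketch (it simply mirrors the test's inline bit manipulation; it is not part of the PR):

```python
import struct

def get_floating_point_value(fxp: int, rsh: int) -> float:
    """Hypothetical helper: recover the float that a (fixed-point value,
    right-shift) pair represents. Builds 2**rsh from its IEEE-754 float32
    bit pattern so that negative rsh values also work."""
    scale_bits = ((rsh + 127) & 0xFF) << 23          # biased exponent field
    (scale,) = struct.unpack("f", struct.pack("I", scale_bits))
    return fxp / scale

value = get_floating_point_value(12288, 14)          # 12288 / 2**14
```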

Contributor Author

That's what I had earlier, but I decided against it mainly because it's needed just for testing and doesn't provide any additional value. I would prefer to keep it this way unless this is a major concern.

Contributor

Thanks, that makes sense. I think it's just a matter of personal preference, so no objection to keeping it as is.

Comment on lines 36 to 42
fp1 = np.random.uniform(0.00001, 0.0002, size=(10))
fp2 = np.random.uniform(0.001, 0.02, size=(10))
fp3 = np.random.uniform(1, 20, size=(10))
fp4 = np.random.uniform(900, 1000, size=(10))
fp5 = np.random.uniform(1e9, 1e10, size=(10))
fp6 = np.random.uniform(2.44885652993e38, 2.54885652993e38, size=(1))
fp7 = np.random.uniform(1.46711479073e-34, 1.76098837843e-34, size=(1))
Contributor

I'm wondering if random draws are worth the effort / complexity here...

  • If the goal is just simple, sanity-checking unit tests, then I'm not sure we really need randomness. Especially if they lead to test-failures that can't be reproduced for the sake of debugging, due to the randomization.

  • If the goal is to check corner cases, I would think that's better done using specifically chosen values, e.g.

    • extreme value / special values for floating-point numbers
    • floating point values that, by inspection of the conversion algorithm, are likely to be critical

Contributor Author

I didn't really think about it since this is just a small unit test and we're constructing at most 10-element arrays. If you're really concerned about the complexity aspect, I don't mind doing what you're suggesting, but otherwise I would prefer leaving it as is.

Contributor

No objection to leaving the code as it is. Would you consider just adding a comment about the semi-arbitrary nature of those numbers? Usually when I see something as precise as 2.44885652993e38, I assume the number is chosen for a particular reason. It might save other readers some time to know that there's no deeper meaning here.

Contributor Author

I agree. I'll add some comments to make it explicit.


# Make sure that 'flp' isn't NaN or infinity
if math.isnan(flp) or math.isinf(flp):
raise RuntimeError("Can not handle NaN or INF")
Contributor

@cconvey cconvey Aug 17, 2022

Nitpick: Sometimes comments like this indicate a temporary limitation of the function that could be addressed in a later version. But IIUC, the fixed-point format we're dealing with here is simply incapable of expressing those two concepts.

It might be helpful to use an error message that's clearer about this.

Comment on lines 249 to 250
if exp_stored_value == 0:
raise RuntimeError("Can not handle denormalized values")
Contributor

(This is somewhat redundant to a comment I left regarding the function's docstring, above.)

It would be nice to have a comment regarding why denormalized values aren't handled. E.g.:

  • they're always indistinguishable from 0 in the resulting fixed-point representation, or
  • we don't need to support them yet, so we're just not dealing with them for now, or
  • (something else)

Contributor Author

Sure, I'll elaborate on this. Thanks!


Additional notes on various floating-point values:
------------------------------------------------
1) Denormalized values: Can't be represented as fixed-point - causes assertion failure
Contributor

@cconvey cconvey Aug 17, 2022

I'm confused by the claim that denormal values can't be expressed as fixed-point.

My understanding is that the IEEE-754 denormalized format is simply a special way of encoding numbers that are much closer to 0 than normalized float16 values can express. I don't understand why that's fundamentally inexpressible as fixed-point.

Are we assuming some additional unstated limitations on our fixed-point representation? E.g., the range of values that we're willing to let rsh take on?

Contributor Author

To convert denormalized values into fixed point values, we'll require a very large scale factor which can't be represented using the available bits.
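A quick sanity check of that claim (illustrative only): the smallest positive float32 denormal is 2**-149, and scaling it into an int16 fixed-point value would need a right-shift amount far larger than a 16-bit scheme would typically encode:

```python
import math

denorm = 2.0 ** -149                 # smallest positive float32 denormal
exp = math.frexp(denorm)[1]          # denorm = 0.5 * 2**exp, so exp = -148
rsh = 14 - exp                       # shift needed to fill an int16: 162
# 162 is far outside any practical right-shift range for this scheme
```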


def saturate(x: te.Tensor, dtype: str):
"""Saturate value for the specified data type"""
return te.max(te.min_value(dtype), te.min(x, te.max_value(dtype)))
Contributor

When I looked at several of the Hexagon .so files produced by this PR's unit tests, I didn't see any indication that Hexagon's saturate or :sat instructions were being used.

This isn't a critique of the PR; I'm just mentioning it as a point of interest for future work.

Contributor

@cconvey cconvey Aug 18, 2022

Actually, I'm wondering if the saturate can sometimes be elided entirely when dtype=float16.

Here's my (maybe flawed) reasoning:

  • Hexagon has two units that might do this math: the ARM core, and the HVX units.
  • If this code runs on the ARM core, then it's treated as an IEEE-754 single-precision float.
  • If this code runs on an HVX core, then it's going to be processed using qfloat16 semantics, which automatically uses saturate behavior.

So any dataflow path that definitely involves qfloat16 representation could (perhaps) entirely avoid explicit saturation logic.

I'm starting to wonder if TIR should eventually distinguish saturated vs. unsaturated ops.

Contributor Author

Thanks for the comment, @cconvey! You're correct about saturate not being needed for the float16 dtype. Please note that the functions in this file (qnn/avg_pool2d.py) are meant to be used only for quantized models and therefore should have uint8 or int8 dtypes.
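For reference, the clamp that saturate() expands to is just a min/max pair over the dtype's representable range; a NumPy sketch of the int8 case (illustrative, not the PR's code):

```python
import numpy as np

x = np.array([-300, -128, 0, 127, 300], dtype=np.int32)
# te.max(min_value, te.min(x, max_value)) corresponds to this clamp:
sat = np.maximum(-128, np.minimum(x, 127)).astype(np.int8)
# which is identical to clipping into the int8 range
assert (sat == np.clip(x, -128, 127).astype(np.int8)).all()
```

This min/max sequence is the pattern that would have to be recognized to emit Hexagon's saturating instructions, as discussed below.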

Contributor Author

When I looked at several of the Hexagon .so files produced by this PR's unit tests, I didn't see any indication that Hexagon's saturate or :sat instructions were being used.

This isn't a critique of the PR; I'm just mentioning it as a point of interest for future work.

That's very likely. Thanks for looking into it!

Unless we generate saturating llvm instructions through TVM, we will have to add additional code in LLVM to recognize the sequence of min, max as saturate.

Contributor

@cconvey cconvey Aug 19, 2022

Unless we generate saturating llvm instructions through TVM, we will have to add additional code in LLVM to recognize the sequence of min, max as saturate.

I wonder if there's a good way to coordinate on where these replacement-patterns get implemented.

I imagine it makes sense to eventually put an optimization like this into one of TVM or LLVM, but it's potentially a waste of effort to put it into both.

Contributor Author

I agree. I think it will be better to do it in TVM, as we can generate the appropriate LLVM saturating instruction during TVM codegen, which can then be lowered into target-specific instructions in the LLVM backend.

@cconvey
Contributor

cconvey commented Aug 18, 2022

This PR introduces several TOPI-related functions (qnn_avg_pool2d_compute and qnn_avg_pool2d_schedule). Does this PR make these functions available for compile-time consideration by TOPI?

I'm not very familiar with the mechanisms TVM uses for this, so apologies if I'm just missing how it happens.

Co-authored-by: Christian Convey <christian.convey@gmail.com>
@jverma-quic
Contributor Author

This PR introduces several TOPI-related functions (qnn_avg_pool2d_compute and qnn_avg_pool2d_schedule). Does this PR make these functions available for compile-time consideration by TOPI?

I'm not very familiar with the mechanisms TVM uses for this, so apologies if I'm just missing how it happens.

That's correct. The PR does introduce several TOPI-related functions. However, since they require inputs and outputs to be in 2-d discontiguous buffers, they aren't yet available for use by OpStrategy. One of the issues here is that Relay is unable to handle the complex layouts needed for these discontiguous buffers; that requires some additional work.

@jverma-quic jverma-quic requested review from cconvey and removed request for mehrdadh August 19, 2022 18:28
Contributor

@cconvey cconvey left a comment

Thanks @jverma-quic, the PR is looking really good. I left one small suggestion; feel free to ignore it.

Once you're satisfied, I'm happy with getting this merged. You may want to ping @mehrdadh for that official review; I think he was waiting for my review to finish.

Comment on lines 36 to 42
fp1 = np.random.uniform(0.00001, 0.0002, size=(10))
fp2 = np.random.uniform(0.001, 0.02, size=(10))
fp3 = np.random.uniform(1, 20, size=(10))
fp4 = np.random.uniform(900, 1000, size=(10))
fp5 = np.random.uniform(1e9, 1e10, size=(10))
fp6 = np.random.uniform(2.44885652993e38, 2.54885652993e38, size=(1))
fp7 = np.random.uniform(1.46711479073e-34, 1.76098837843e-34, size=(1))
Contributor

No objection to leaving the code as it is. Would you consider just adding a comment about the semi-arbitrary nature of those numbers? Usually when I see something as precise as 2.44885652993e38, I assume the number is chosen for a particular reason. It might save other readers some time to know that there's no deeper meaning here.

@jverma-quic
Contributor Author

Thanks @jverma-quic, the PR is looking really good. I left one small suggestion; feel free to ignore it.

Once you're satisfied, I'm happy with getting this merged. You may want to ping @mehrdadh for that official review; I think he was waiting for my review to finish.

Thanks @cconvey! I really appreciate you taking time to review this PR and your detailed comments.

@github-actions github-actions bot removed the request for review from cconvey August 19, 2022 20:00
@github-actions github-actions bot requested a review from mehrdadh August 19, 2022 20:00
@jverma-quic
Contributor Author

@cconvey, @mehrdadh, @kparzysz-quic, @TejashShah : I'm waiting to merge this PR. Unless there are additional comments, can someone approve and merge it for me please? Thanks!

@mehrdadh
Member

@jverma-quic sorry for the delay, looking at the PR now

@mehrdadh mehrdadh merged commit 1afd059 into apache:main Aug 24, 2022
@mehrdadh
Member

@jverma-quic PR is merged! Thanks for your contribution!
Moving forward, please use a meaningful PR description.

@cconvey thanks for the review!

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [TOPI][Hexagon] Implement quantized avgpool

* Fix pylint errors

* Needed to adjust input padding for int8 buffer layout

* Fix formatting issue

* Add unit test for fixed-point conversion utility function

Also, address review comments.

* Remove pytest.skip for test_avg_pool2d_slice.py to enable on-target testing

* Fix formatting issue

* Update python/tvm/topi/hexagon/utils.py

Co-authored-by: Christian Convey <christian.convey@gmail.com>

* Update comments and error messages

* Address review comments

* Import Tuple from typing

* Address pylint error

Co-authored-by: Christian Convey <christian.convey@gmail.com>