
[Hexagon] Add Hand written HVX conv2d #12204

Merged · 5 commits · Sep 12, 2022
Conversation

Contributor

@quic-sanirudh quic-sanirudh commented Jul 27, 2022

This PR adds an initial version of a hand-coded HVX intrinsics implementation of conv2d for Hexagon target.

@kparzysz-quic is a co-author to the patch as he has written many of the utility functions for this patch.

cc @mehrdadh

Contributor

@cconvey cconvey left a comment

Interesting PR!

I've made some comments on low-hanging items. I'll save the rest of my review for a future iteration, just to avoid having too many open topics at the same time.

#include <cassert>
#include <cinttypes>

#include "conv2d.h"
Contributor

@cconvey cconvey Aug 8, 2022

Just my subjective opinion, but there might be less room for confusion / bugs if we used a longer path here, e.g. #include <runtime/hexagon/ops/conv2d.h>, rather than adding this directory to the include path via the CMakeLists.txt file.

Contributor Author

Done, thanks for the suggestion; it makes sense to me. By the way, sorry it took a bit of time to address these.

Contributor

Thanks @quic-sanirudh , happy it was helpful. Please click on the "Ready for re-review" link next to my name so I can provide feedback on the updated code.

* specific language governing permissions and limitations
* under the License.
*/

Contributor

IIUC, this header file is only #included by conv2d_hvx.cc. Would it make sense to simply put this content into that .cc file and delete this header?

Contributor Author

Actually, I was planning to reuse this header for the quantized implementation (which might consist of a couple of different files as well) that I'm working on, which is why I made it a separate header.

Comment on lines 38 to 44
#define HAP_CALL(hap_fn, ...) \
{ \
int rc = hap_fn(__VA_ARGS__); \
if (rc != 0) { \
debug("%s failed: rc=%x", #hap_fn, rc); \
} \
}
Contributor

I don't see any users for this macro. Would it make sense to remove it from the PR?

If we keep this header, I think its content belongs somewhere in/under the tvm::runtime::hexagon C++ namespace, rather than ::detail. (At least, that's the idiom I see in other header files under src/runtime.)

Contributor

Agreed with @cconvey here, and hexagon_common.cc implements similar functionality into TVM's runtime logger which we should aim to use for debugging in the runtime. This can be done with LOG(INFO/WARNING/FATAL) << "message" which is automatically directed to FARF.

Contributor Author

@cconvey Thanks for the suggestion, I've moved them into the tvm::runtime::hexagon namespace and removed the HAP_CALL macro

@csullivan I've modified all the debug messages to LOG_INFO calls, thanks for that suggestion. I did not know that this had already been adapted to make FARF calls on Hexagon.

Comment on lines 49 to 50
// Standalone DLTensor: the standalone-ness means that this object owns the shape
// (as opposed to a DLTensor).
Contributor

Would it make sense for the comment to elaborate on "owns the shape"?

I'm not sure what that means, and it would also be helpful if the comment gave a little more info/context for why this is being done.

Contributor Author

I originally had a different understanding, but I'm not completely sure right now. Since @kparzysz-quic introduced the SDLTensor type, I'll let him explain this part, as I don't want to give an incorrect explanation.

Contributor

DLTensor has a pointer to an array of dimensions, but doesn't own the storage: someone else has to allocate it somewhere, put the shape in it, and store the pointer in the DLTensor. This is annoying, since a DLTensor is not an entity that can live on its own (the strides are like this too, IIRC, but we don't use them here). The standalone DLTensor was invented to contain the shape as part of the tensor object.


// Standalone DLTensor: the standalone-ness means that this object owns the shape
// (as opposed to a DLTensor).
template <size_t N>
Contributor

Trivial nitpick: if the template parameter were named RANK or NDIM rather than N, I would have needed about five fewer seconds to understand its role.

Contributor Author

Done, thanks.

Comment on lines 102 to 113
constexpr uintptr_t nhwc_at(const DLTensor& a, int n, int y, int x, int c) {
if (y < 0 || y >= a.shape[1]) return uintptr_t(0);
auto p = static_cast<uintptr_t*>(a.data);
assert(n == 0);
return p[y * a.shape[2] * a.shape[3] + x * a.shape[3] + c];
}

constexpr uintptr_t hwio_at(const DLTensor& f, int y, int x, int i, int o) {
auto p = static_cast<uintptr_t*>(f.data);
return p[y * f.shape[1] * f.shape[2] * f.shape[3] + x * f.shape[2] * f.shape[3] + i * f.shape[3] +
o];
}
Contributor

Is there a particular reason these functions are constexpr? At least with nhwc_at, I didn't notice it being used in a context that would benefit from it being constexpr.

Contributor Author

Yeah, I removed constexpr here, but then wondered whether any of the functions above could benefit from constexpr either, since these files are compiled as part of the runtime, before model compilation (which is when the op shapes and other constants become known).

Contributor

What value does removing constexpr add here?

Contributor

IMHO both options are reasonable. When you see a constexpr function, it could be for either of two reasons:

  • a human noticed that it was possible, and added constexpr in case it's useful in the future, or
  • there is, or will be in the future, a callsite that needs the function to be constexpr.

I haven't noticed any codebases that clarify which of those reasons applies, so there's ambiguity either way.

void* GetDataSpace() const { return data_space; }

private:
SDLTensor(void* data_ptr, DLDataType data_type, void* data_space) : data_space(data_space) {
Contributor

What's the relationship between data_ptr and data_space? I.e.:

  • Is data_ptr always implicitly an address within the data_space allocation?

  • Is either of these ever allowed to be nullptr?

  • Is a single data space ever shared by multiple SDLTensor instances?

Contributor Author

Again, as I commented above, @kparzysz-quic could probably provide the proper answer.

Having said that, I think the difference is that data_space is meant to store the pointer returned from AllocDataSpace, which means we can later call FreeDataSpace on the same pointer. data_ptr is the same as data_space in the case of activations.

In the case of weights, data_ptr stores pointers to the first element of each "chunk" in the layout (I've explained the difference between "chunkified" and "blockized" below and also added it as comments in the code). In short, since chunks can be of different sizes (due to the restriction that there is no padding along height and width as part of the layout), data_ptr allows efficient access to each of the chunks in the data space.

For the questions you've asked:

  1. data_ptr could be the same as data_space or a completely separate pointer pointing to the start addresses of each chunk of weights
  2. I think both have to be valid pointers. In case of weights, the data_ptr points to addresses on the stack, so it gets freed automatically, but data_space is always freed with a call to FreeDataSpace
  3. I'm not sure; right now it is used only for the activations and weights of convolution, and there are always separate SDLTensor instances for them.
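The data_space / data_ptr split described above can be sketched in Python (a toy model with made-up chunk sizes; build_chunk_table is a hypothetical helper, not code from the PR):

```python
# Illustrative sketch: one flat allocation plays the role of data_space,
# while a separate table of chunk start offsets plays the role of data_ptr
# when chunks have unequal sizes.
def build_chunk_table(chunk_sizes):
    """Return (total_size, start_offsets) for chunks packed back to back."""
    offsets, pos = [], 0
    for size in chunk_sizes:
        offsets.append(pos)
        pos += size
    return pos, offsets

# One allocation of `total` bytes (the data_space); chunk i is then read
# starting at starts[i] (the data_ptr-style table).
total, starts = build_chunk_table([128, 128, 96])  # last chunk is smaller
```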

Comment on lines 136 to 139
void release(tvm::runtime::DeviceAPI* device_api, const SDLTensor<N>& tensor) {
if (auto* data_space = tensor.GetDataSpace()) {
device_api->FreeDataSpace(hexagon_device, data_space);
}
Contributor

(The validity of this comment hinges on the answer to one of my questions above.)

Is there a possibility that this function will leave one or more SDLTensor instances with stale data and/or data_space pointers?

Contributor Author

In short, no. As I explained above, data is either the same as data_space, and so gets freed together with it, or it points to memory on the stack, which is freed automatically on return.

Comment on lines 120 to 125
void blockize_hwc_16b(void* out, void* inp_flat, int height, int width, int depth);

void deblockize_hwc_16b(void* out_flat, void* inp, int height, int width, int depth);

void chunkify_hwio_16b(void** out_ptr, int out_ptr_size, void* out, void* inp, int height,
int width, int idepth, int odepth);
Contributor

It might be helpful to explain what "blockize" vs. "chunkify" mean. Neither term means anything specific to me, and I can't tell (based on the names) if they're two different names for the same concept.

Contributor

Agreed; I made a comment about this above, and after reading further I think even my definitions are not accurate.

Contributor Author

@csullivan your comment below was almost correct. "Blockized" means the layout is modified into the non-contiguous blocks of data used on Hexagon. "Chunkified" means the data is allocated as packed contiguous blocks, with the difference that it cannot be padded along height and width. This leads to chunks of different sizes, which is why I think "packed" might not be the right term.
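The "no padding along height and width" point can be illustrated with a small Python sketch (the 8x4 spatial block shape is assumed from the layout discussion elsewhere in this thread; chunk_dims is a hypothetical helper, not part of the PR):

```python
def chunk_dims(height, width, block_h=8, block_w=4):
    """(h, w) sizes of each chunk when height/width are NOT padded up to the
    block size: edge chunks along either axis come out smaller."""
    hs = [min(block_h, height - y) for y in range(0, height, block_h)]
    ws = [min(block_w, width - x) for x in range(0, width, block_w)]
    return [(h, w) for h in hs for w in ws]

# A 10x6 plane splits into chunks of unequal sizes:
# chunk_dims(10, 6) -> [(8, 4), (8, 2), (2, 4), (2, 2)]
```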

extern "C" int conv2d_packed(TVMValue* args, int* type_codes, int num_args, TVMValue* out_val,
int out_code, void* res_handle);

namespace detail {
Contributor

Same comment as in the header file: I think this belongs in/under the tvm::runtime::hexagon namespace.

Contributor Author

Done, thanks.

Contributor

@csullivan csullivan left a comment

Bravo 👏 on the impressive vectorized conv2d kernel on a very complicated set of layouts.


* @brief Return the input vector filled with the 2 channel elements(which is the 1st and 3rd
* element) from base_ptr filled up 32 times to get 64 elements
*
* 1. It's generated by first creating 2 vectors "splatted" with the 2 required elements
Contributor

"splatted" == broadcast?

Contributor Author

Yes. I used "splatted" because the Hexagon instruction that does this is called vsplat.

* @param bias_flat Flat bias values and are not used right now
* @param filt_shape Original filter shape
* @param pad_shape Pad top and pad left shape
* @param relu Whether to apply relu after convolution, not done right now
Contributor

Please add a # TODO(author): Add support for relu comment, and similar for all the params which are not yet supported but planned to be.

Contributor Author

Done, thanks.

* @param cr_filt Chunkified weights as returned from output of prepare_hwio
* @param out_shape Original output shape of the tensor before blockization
* @param act_shape Original input shape
* @param bias_flat Flat bias values and are not used right now
Contributor

Here also as below

Contributor Author

Done, thanks.

Comment on lines 147 to 151
* @param cr_out blockized output tensor with zeros already filled in
* @param cr_act blockized activations
* @param cr_filt Chunkified weights as returned from output of prepare_hwio
Contributor

blockized means the data has been packed into an array of non-contiguous allocations
chunkified means weights are in a contiguous packed layout

For chunkified maybe we just refer to packed instead, and define the meaning of blocked somewhere in the docstring.

Contributor Author

I replied to @cconvey's query about the difference above. I've also added the meaning of chunkify in the comments to make it more obvious for people going through the code.

*
* Activations vector would thus be:
* ----------- ------ ----- ---- --
* act_vec = [i0,i1,i0,i1,i0,i1,...,i0,i1] - 2 elements of the input channels broadcasted 32 times
Contributor

I think the mapping is lambda n, h, w, c: [n, h//8, w//4, c//32, h%8, (w%4)//2, c%32, w%2]
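As a sanity check, the proposed mapping can be exercised in Python over a single block: for a 1x8x4x32 input, every (h, w, c) should land on a distinct offset within the 8*2*32*2 = 1024-element block (a sketch assuming the block shape implied by the mapping; not code from the PR):

```python
def blocked_coords(n, h, w, c):
    # The index map proposed above, written as a plain function.
    return (n, h // 8, w // 4, c // 32, h % 8, (w % 4) // 2, c % 32, w % 2)

def flat_offset(coords, dims=(1, 1, 1, 1, 8, 2, 32, 2)):
    # Row-major offset of the blocked coordinates (single-block case).
    off = 0
    for idx, dim in zip(coords, dims):
        off = off * dim + idx
    return off

# All 8*4*32 = 1024 coordinates of one block map to distinct offsets.
offsets = {flat_offset(blocked_coords(0, h, w, c))
           for h in range(8) for w in range(4) for c in range(32)}
```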


Comment on lines 121 to 131
inline HVX_Vector computeOuputVector(HVX_Vector act_vec, HVX_Vector wgt_vec) {
HVX_Vector v_res = Q6_Vqf16_vmpy_VhfVhf(act_vec, wgt_vec); // result is in qf16
HVX_Vector v_rot = Q6_V_vror_VR(v_res, 2);
HVX_Vector v_reduced = Q6_Vqf16_vadd_Vqf16Vqf16(v_res, v_rot);
HVX_Vector v_hf = Q6_Vhf_equals_Vqf16(v_reduced);
HVX_Vector v1e = getOddEvenOnes().first;
HVX_Vector v_reduced_even_lanes = Q6_V_vand_VV(v_hf, v1e);
return v_reduced_even_lanes;
}
Contributor

Neat
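A scalar model of what this sequence computes may help (plain Python lists stand in for HVX vectors; this illustrates only the lane shuffle, not exact qf16 rounding or HVX semantics):

```python
def pairwise_even_lane_reduce(act, wgt):
    """Multiply lanewise, rotate the product by one lane, add, then keep only
    even lanes -- mirroring the vmpy / vror / vadd / even-lane masking above."""
    prod = [a * b for a, b in zip(act, wgt)]      # Q6_Vqf16_vmpy analog
    rot = prod[1:] + prod[:1]                     # Q6_V_vror by one 16-bit lane
    summed = [p + r for p, r in zip(prod, rot)]   # Q6_Vqf16_vadd analog
    return [s if i % 2 == 0 else 0 for i, s in enumerate(summed)]

# Each even lane k ends up holding prod[k] + prod[k+1]: a 2-element dot
# product per lane pair.
out = pairwise_even_lane_reduce([1, 2, 3, 4], [10, 20, 30, 40])
# out == [50, 0, 250, 0]
```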

return module


shape_parameters = [
Contributor

Instead, using tvm's testing helpers, you can either do

act_shape, wgt_shape, inp_offset, inp_stride = tvm.testing.parameters(((1,8,4,3),(3,3,3,3),(0,0),(1,1)), ((1,10,14,3), (3,3,3,3), (0,0), (1,1)), ...)

or

act_shape = tvm.testing.parameter((1,8,4,3), (1,10,14,3), ...)
wgt_shape = tvm.testing.parameter((3,3,3,3), (3,3,3,64), ...)
inp_offset = tvm.testing_parameter((0, 0))
inp_stride = tvm.testing_parameter((1,1), (2,2))

which will do the cartesian product of the individually set parameters.
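For illustration, the cartesian-product expansion of independently declared parameters can be mimicked with itertools.product (a sketch of the resulting case list, not how tvm.testing is implemented):

```python
from itertools import product

# Two independently declared parameter lists...
act_shapes = [(1, 8, 4, 3), (1, 10, 14, 3)]
inp_strides = [(1, 1), (2, 2)]

# ...expand to every combination, i.e. 2 x 2 = 4 test cases.
cases = list(product(act_shapes, inp_strides))
```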

Contributor Author

Thanks. I've done it the first way (using tvm.testing.parameters(...) and not parameter(...)) because there are a few restrictions on the compatibility of inputs, weights, and strides to make sure we don't get an incorrect output shape. So using cartesian products to generate the input/weight/stride combinations was creating problems.



@tvm.testing.requires_hexagon
@pytest.mark.parametrize("shapes", shape_parameters, ids=map(gen_id, shape_parameters))
Contributor

After using tvm.testing.parameter this decoration can be deleted and test_conv2d can take these params as function arguments directly.

Contributor Author

Done, thanks.

Comment on lines 182 to 186
"act_shape": (1, 8, 8, 3),
"wgt_shape": (2, 2, 3, 3),
"inp_offset": (0, 0),
"inp_stride": (2, 2),
},
Contributor

These are all very small convolutions. Can we add some that are representative of real workloads and mark them with the slow decorator, or skip.

Contributor Author

I tried adding larger convolution tests earlier, but realized that they won't work due to the VTCM limit. I can still add them marked as skip if you think that's a good idea for the future, but as it stands now, the op has to be split into chunks to fit within VTCM.

Contributor

@csullivan

Also @quic-sanirudh, if @kparzysz-quic is a co-author I suggest amending the commit message with Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com> which will attribute mutual authorship in Git and GitHub.

Comment on lines 89 to 117
inline uintptr_t nhwc_at(const DLTensor& a, int n, int y, int x, int c) {
if (y < 0 || y >= a.shape[1]) return uintptr_t(0);
auto p = static_cast<uintptr_t*>(a.data);
assert(n == 0);
return p[y * a.shape[2] * a.shape[3] + x * a.shape[3] + c];
}

inline uintptr_t hwio_at(const DLTensor& f, int y, int x, int i, int o) {
auto p = static_cast<uintptr_t*>(f.data);
return p[y * f.shape[1] * f.shape[2] * f.shape[3] + x * f.shape[2] * f.shape[3] + i * f.shape[3] +
o];
}
Contributor

Would Horner's algorithm be a good technique for computing p's index? It's potentially faster and potentially easier to read.
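For reference, the two forms of the index computation are algebraically identical; a quick Python check (shape values are made up for illustration):

```python
def flat_index_naive(shape, y, x, c):
    # Index expression as currently written (expanded products).
    return y * shape[2] * shape[3] + x * shape[3] + c

def flat_index_horner(shape, y, x, c):
    # Same index via Horner's scheme: one multiply per dimension,
    # and arguably easier to read.
    return (y * shape[2] + x) * shape[3] + c

shape = (1, 5, 7, 3)
# Both forms agree for every in-range (y, x, c).
```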

Contributor

The compiler should be able to take care of it.

Contributor

That sounds plausible. My suggestion could be a case of premature optimization; I guess we'll know either way if this turns out to be a hotspot during profiling.

Comment on lines 89 to 117 (same nhwc_at / hwio_at snippet as quoted above)
Contributor

@cconvey cconvey Aug 18, 2022

The DLTensor struct has two members that don't seem to be considered here: strides and byte_offset.

Is it truly safe to ignore those members here? If so, it might be helpful to (ideally) add an assert and/or a comment about why they can be ignored.

If the reason we can safely ignore strides and byte_offset is that we know they were set to nullptr and 0, respectively, by SDLTensor's constructor, then would it make sense to have these two functions explicitly operate on SDLTensor objects rather than DLTensor objects? (And if so, is there any reason they shouldn't be class methods of SDLTensor rather than free-standing functions?)

Comment on lines 89 to 117 (same nhwc_at / hwio_at snippet as quoted above)
Contributor

These functions seem to assume that a.ndim >= 4. Would it make sense to ensure that's true, by e.g.:

  • assert(a.ndim >= 4), or
  • converting these to template functions, and replace the first argument with const SDLTensor<4> & a?

Contributor

These functions should be fast. I'd rather not have any extra code in here. It would actually need to be ICHECK, since TVM is compiled without assertions.

Contributor

That makes sense. If we consider speed to be a primary concern, would any of the following be a good compromise?

  • convert these to members of SDLTensor, and (optionally) only provide implementations of the two methods for the template specialization NDIM=4, or
  • leave these as free-standing functions, but still templatize them and only provide implementations for NDIM=4, or
  • add a few comments indicating the assumptions these functions make about the rank of a / f, or
  • adjust the names of these two functions to indicate their assumption about the rank of a / f.

The only way I'd see the templatization approaches being worth the additional code complexity is if:

  • you're already planning in converting these to methods of SDLTensor, which is already a template class, or
  • you have future plans for this code that would involve values of NDIM other than 4.

Contributor

@kparzysz-quic kparzysz-quic Aug 18, 2022

I think adding extra comments is probably the best at the moment. The names of these functions are specific to particular layouts, so it's not clear what they should be called if they were to work in other cases. I think that hwio_at is somewhat generic, so it can be templatized, or made a member. For nhwc_at there is some specificity (like returning 0 for out-of-bounds y-accesses), so to avoid additional changes, I suggest just adding comments to it.

Contributor Author

I've added some comments; not sure if that's enough, but I guess we can always add more in follow-up PRs.

Comment on lines 102 to 105
inline uint32_t* bias_at(const DLTensor& b, int d) {
auto p = static_cast<uint32_t*>(b.data);
return p + d;
}
Contributor

This function doesn't appear to be used anywhere in the PR. Is that intentional?

Usually for TVM PRs, we try to ensure thorough test coverage of the new code (link).

A code-comment below states that bias isn't currently accepted. Given that, would it make sense to just remove this function from the PR?

Contributor

It can be deleted. It will need to be resurrected once we support bias.

Contributor Author

Done, thanks

}

inline uint32_t* bias_at(const DLTensor& b, int d) {
auto p = static_cast<uint32_t*>(b.data);
Contributor

Is there a reasonable way to ensure that this static_cast is actually valid for b?

Contributor

No.

Comment on lines 91 to 114
auto p = static_cast<uintptr_t*>(a.data);
assert(n == 0);
return p[y * a.shape[2] * a.shape[3] + x * a.shape[3] + c];
}

inline uintptr_t hwio_at(const DLTensor& f, int y, int x, int i, int o) {
auto p = static_cast<uintptr_t*>(f.data);
Contributor

Is there a reasonable way to ensure that these static_cast is actually valid for a and f?

Contributor

Same here, there is no way to guarantee anything like that.

Comment on lines 53 to 59
inline uint16_t* getElementPtr(int block_out_y, int block_out_x, int block_out_c, int yi, int xio,
int ci, int xii, const DLTensor& block) {
auto block_ptr = nhwc_at(block, 0, block_out_y, block_out_x, block_out_c);
auto block_offset = yi * 128 + xio * 64 + ci * 2 + xii;
auto first_element_ptr = reinterpret_cast<uint16_t*>(block_ptr);
return first_element_ptr + block_offset;
}
Contributor

@cconvey cconvey Aug 18, 2022

This function's name and argument list suggest that it handles tensors with a variety of element dtypes. But its body and return type assume uint16_t.

Is there some reason that's a safe assumption in this context?

If yes, then it might be helpful to add some comments explaining why. If not, then I'd suggest some form of disambiguation, e.g.:

  • rename this to getElementPtr_u16, or
  • convert this to a template that's parameterized by dtype.

Even if this uint16_t is being used as a stand-in for qfloat16 or IEEE-754 half-precision floats, it's not (immediately) obvious that this code will only ever be used for 16-bit HVX values.

Contributor

We should rename the files to conv2d_fp16_hvx.cc/.h instead.

Contributor

I like the idea of renaming the files. Even with that, we should consider any symbols that will be visible outside of conv2d_fp16_hvx.cc.

Contributor Author

I've renamed the file, and I've made that function static inline for now so that the symbol is not visible outside, as it's not used elsewhere yet anyway.

namespace runtime {
namespace hexagon {

inline uint16_t* getElementPtr(int block_out_y, int block_out_x, int block_out_c, int yi, int xio,
Contributor

@cconvey cconvey Aug 18, 2022

If uint16_t is a stand-in for qfloat16, is there a typename supplied by the Hexagon SDK that might be more descriptive?

This might not be a valid example, but I'm thinking of something like qhl_q0_t.

Or if the SDK provides no such type name, could it make sense to just define one here? E.g., using qfloat16_ptr = uint16_t*;

Contributor

It's not qfloat16, the storage type is _Float16. I'm opposed to using such typedefs, since they actually obscure things instead of clarifying anything.

Contributor

Any reason we couldn't just replace uint16_t with _Float16 then? AFAICT both GCC and LLVM recognize _Float16, and IIUC we're currently only planning to ever build this code with LLVM.

Contributor

Recognition of _Float16 in clang, at least, is target-specific. For Hexagon, clang recognizes it when the target architecture is >= v68.

Comment on lines +50 to +67
SDLTensor(void* data_ptr, DLDataType data_type, void* data_space) : data_space(data_space) {
data = data_ptr;
device = hexagon_device;
ndim = NDIM;
dtype = data_type;
shape = dims;
strides = nullptr;
byte_offset = 0;
Contributor

The roles of data_ptr and data_space are explained in this PR comment, but I don't see that same information in the current code.

Could we put that same information into one or more code comments?

Also, even if that PR comment is right about there being no risk of dangling pointers, that isn't obviously true just from normal-level-of-effort reading of the code. Is there some way we can address that?

Contributor Author

I've added comments explaining them, thanks for the suggestion. As for the dangling-pointer concern, I'll try to address that separately, as I'm not sure how generic the usage of these will be.

Comment on lines +114 to +129
/**
* @brief Convert the layout of weights from flat to "chunked". The term chunked is explained below:
*
* Weights are packed into the below mentioned layout (notation similar to index map):
* Since weights cannot be exactly represented into a index map notation, the
* base split up is mentioned below with a few gotchas
*
* lambda h, w, i, o: h//8, w//4, o//32, i//32, h%8, w%4, (i%32)//2, o%32, i%2
*
* The gotchas are:
* - (w%4) is actually stored in the right to left order, as in 3,2,1,0 instead of 0,1,2,3
* - The h%8 and (w%4) dimensions are not padded up, leading to chunks of different sizes
* (thereby the name "chunked" instead of packed)
* - The thinnest chunk of width is stored first. For example, if a kernel is 5x5, the first
* chunk along the width has size 1 (representing index 0) and then next one has size 4
* representing indices (1,2,3,4)
Contributor

Could we move the code-comments that explain "chunked" and "blockized" to just before the first usage of those terms, i.e. near the top of conv2d.h or into a separate README.rst file? It took me a while to notice this explanation because of where it's placed.

Contributor Author

I've moved it to the conv2d.h file as well, and I'll create a separate PR with the documentation, thanks for the suggestions.

@cconvey
Contributor

cconvey commented Aug 19, 2022

FYI, due to work schedules, @csullivan will follow up on any issues/suggestions I raised in my reviews. I'm happy for this to be merged once @csullivan approves.

int o = round_up(hwio_flat->shape[3], 32);
int64_t shape_1d[] = {h * w * i * o};
void* hwio_vtcm =
device_api->AllocDataSpace(hexagon_device, 1, shape_1d, hwio_flat->dtype, vtcm_scope);
Contributor

Where is this freed?

Contributor Author

This is actually freed by the caller of prepare_hwio() here

int c = round_up(nhwc_flat->shape[3], 32);
int64_t shape_2d[2] = {(n * h * w * c) / (8 * 4 * 32), 8 * 4 * 32};
void* nhwc_vtcm =
device_api->AllocDataSpace(hexagon_device, 2, shape_2d, nhwc_flat->dtype, vtcm_scope);
Contributor

Where is this freed?

Contributor Author

This is actually freed by the caller of prepare_nhwc() here

int num_wgt_chunks = hexagonrt::calculate_num_weight_chunks(wgt_flat->shape);
LOG_INFO << "num_wgt_chunks: " << num_wgt_chunks;
auto wgt_ptr_table =
reinterpret_cast<void**>(__builtin_alloca(num_wgt_chunks * sizeof(uintptr_t)));
Contributor

This isn't freed as far as I can tell. It should also probably use the device API to allocate memory on the device.

Contributor Author

Since we're just allocating on the stack, it'll be freed on function return, as described in the __builtin_alloca documentation.

* @param width
* @param depth
*/
void blockize_hwc_16b(void* out, void* inp_flat, int height, int width, int depth) {
Contributor

Some unit testing of these utilities would be ideal and is possible pre-commit with gtest on the simulator.

Contributor Author

Sure, I'll try adding a few unit tests and update the PR, thanks for the comments.

Contributor Author

@quic-sanirudh quic-sanirudh Sep 9, 2022

@csullivan I've added the unit tests with gtest. Could you please check out the test cases and review the patch again when you get a chance, thanks.

quic-sanirudh and others added 4 commits September 6, 2022 19:54
Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>
Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>
@quic-sanirudh quic-sanirudh requested review from csullivan and removed request for mehrdadh September 9, 2022 18:49
Contributor

@cconvey cconvey left a comment

Thanks @quic-sanirudh and @kparzysz-quic !

I think you addressed @csullivan 's feedback. Let's go ahead and merge this, and we can address any follow-up concerns later if necessary.

Contributor

@tmoreau89 tmoreau89 left a comment

Taking @cconvey 's thumbs up to green light this PR and make forward progress on this.

xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* [Hexagon] Add Hand written HVX conv2d

Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>

* Address review comments

Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>

* Add some more comments and a file rename

* Add gtest unit tests for blockize/deblockize

* Add gtest unit tests fp16 utils

Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>
5 participants