Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033)
### Description
Add CPU kernels for DynamicTimeWarping and UnfoldTensor.
mindest authored Oct 11, 2024
1 parent f1f3d94 commit 3c80aa9
Showing 10 changed files with 312 additions and 20 deletions.
4 changes: 2 additions & 2 deletions docs/ContribOperators.md
@@ -1545,7 +1545,7 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

### <a name="com.microsoft.DynamicTimeWarping"></a><a name="com.microsoft.dynamictimewarping">**com.microsoft.DynamicTimeWarping**</a>

Input is cost matrix where each value in input[r][c] is the cost for pass the point (r, c). From current point(r, c), points (r+1, c), (r+1, c+1) or (r, c+1) could be arrived in next move. Given such cost matrix, return dynamic time wrapping of shape [2, x], where the path made by all points (output[0][t], output[1][t])have the lowest cost among all paths from (0, 0) to (M-1, N-1).
Input is a cost matrix where each value input[r][c] is the cost of passing the point (r, c). From the current point (r, c), the points (r+1, c), (r+1, c+1), or (r, c+1) can be reached in the next move. Given such a cost matrix, return the dynamic time warping path of shape [2, x], where the path made by all points (output[0][t], output[1][t]) has the lowest cost among all paths from (0, 0) to (M-1, N-1).
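
For illustration, the recurrence this operator computes can be sketched in a few lines of standalone C++ (a minimal sketch on a hypothetical 3x3 cost matrix; not part of this commit or the operator's API):

// Minimal standalone sketch of the DTW recurrence described above.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
  const int M = 3, N = 3;
  const float in[M][N] = {{1, 5, 5},
                          {5, 1, 5},
                          {5, 5, 1}};
  // cost has an extra infinity border so the boundary cases need no special handling.
  std::vector<std::vector<float>> cost(M + 1, std::vector<float>(N + 1, std::numeric_limits<float>::infinity()));
  cost[0][0] = 0.0f;
  for (int i = 1; i <= M; ++i) {
    for (int j = 1; j <= N; ++j) {
      cost[i][j] = in[i - 1][j - 1] +
                   std::min({cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]});
    }
  }
  // The optimal path here is the diagonal (0,0)->(1,1)->(2,2), so the operator
  // would return output = [[0, 1, 2], [0, 1, 2]].
  std::printf("lowest path cost: %g\n", cost[M][N]);  // prints 3
  return 0;
}

The CPU kernel added in this commit implements the same recurrence and then backtracks through a trace matrix to materialize the path.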

#### Version

@@ -5974,7 +5974,7 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

### <a name="com.microsoft.UnfoldTensor"></a><a name="com.microsoft.unfoldtensor">**com.microsoft.UnfoldTensor**</a>

Returns a tensor which contains all slices of size size from input tensor in the dimension dim. Step between two slices is given by step. If sizedim is the size of dimension dim for input tensor, the size of dimension dim in the returned tensor will be (sizedim - size) / step + 1. An additional dimension of size size is appended in the returned tensor.
Returns a tensor which contains all slices of size `size` from the input tensor in the dimension `dim`. The step between two slices is given by `step`. If `sizedim` is the size of dimension `dim` for the input tensor, the size of dimension `dim` in the returned tensor will be `(sizedim - size) / step + 1`. An additional dimension of size `size` is appended to the returned tensor.
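
As a quick check of the shape rule, a small sketch (illustrative values; UnfoldedDimSize is a hypothetical helper, not part of this commit):

#include <cassert>
#include <cstdint>

// Size of dimension `dim` after unfolding, per the formula above.
int64_t UnfoldedDimSize(int64_t sizedim, int64_t size, int64_t step) {
  return (sizedim - size) / step + 1;
}

int main() {
  // e.g. an input of shape [3, 10] with dim = 1, size = 4, step = 2
  // unfolds to shape [3, 4, 4]: (10 - 4) / 2 + 1 = 4 windows along dim 1,
  // each of length size = 4, appended as the last axis.
  assert(UnfoldedDimSize(10, 4, 2) == 4);
  return 0;
}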

#### Version

2 changes: 2 additions & 0 deletions docs/OperatorKernels.md
@@ -471,6 +471,7 @@ Do not modify directly.*
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)<br/> **T2** = tensor(float)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicTimeWarping|*in* input:**F**<br> *out* output:**I**|1+|**F** = tensor(float)<br/> **I** = tensor(int32)|
|EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
|ExpandDims|*in* X:**T**<br> *in* axis:**tensor(int32)**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **axis** = tensor(int32)|
|FastGelu|*in* X:**T**<br> *in* bias:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
@@ -518,6 +519,7 @@ Do not modify directly.*
|Tokenizer|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(string)|
|TransposeMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
|Trilu|*in* X:**T**<br> *in* k:**tensor(int64)**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int64)|
|UnfoldTensor|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
|Unique|*in* x:**T**<br> *out* y:**T**<br> *out* idx:**tensor(int64)**<br> *out* counts:**tensor(int64)**|1+|**T** = tensor(float)|
|WhisperBeamSearch|*in* input_ids:**F**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* num_beams:**I**<br> *in* num_return_sequences:**I**<br> *in* length_penalty:**T**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**M**<br> *in* prefix_vocab_mask:**M**<br> *in* attention_mask:**I**<br> *in* decoder_input_ids:**I**<br> *in* logits_processor:**I**<br> *in* cross_qk_layer_head:**I**<br> *in* extra_decoding_ids:**I**<br> *in* temperature:**T**<br> *out* sequences:**I**<br> *out* sequences_scores:**T**<br> *out* scores:**T**<br> *out* cross_qk:**V**<br> *out* non_speech_probs:**T**|1+|**T** = tensor(float)|
|WordConvEmbedding|*in* Sequence:**T**<br> *in* W:**T1**<br> *in* B:**T1**<br> *in* C:**T1**<br> *out* Y:**T1**|1+|**T** = tensor(int32)<br/> **T1** = tensor(float)|
4 changes: 4 additions & 0 deletions onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -148,6 +148,8 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, SkipSimplifiedLayerNormalization);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Inverse);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Trilu);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, UnfoldTensor);
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicTimeWarping);

#ifdef ENABLE_ATEN
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kPytorchAtenDomain, 1, ATen);
@@ -358,6 +360,8 @@ Status RegisterCpuContribKernels(KernelRegistry& kernel_registry) {
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, SkipSimplifiedLayerNormalization)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Inverse)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Trilu)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, UnfoldTensor)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicTimeWarping)>,

#ifdef ENABLE_ATEN
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kPytorchAtenDomain, 1, ATen)>,
101 changes: 101 additions & 0 deletions onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.cc
@@ -0,0 +1,101 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#include "contrib_ops/cpu/tensor/dynamic_time_warping.h"
#include "core/providers/cpu/tensor/utils.h"

#include <vector>
#include <numeric>
#include <limits>

using namespace onnxruntime::common;

namespace onnxruntime {
namespace contrib {

ONNX_OPERATOR_KERNEL_EX(
    DynamicTimeWarping,
    kMSDomain,
    1,
    kCpuExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("F", DataTypeImpl::GetTensorType<float>())
        .TypeConstraint("I", DataTypeImpl::GetTensorType<int32_t>()),
    DynamicTimeWarping);

Status DynamicTimeWarping::Compute(OpKernelContext* ctx) const {
  const Tensor& input_tensor = *ctx->Input<Tensor>(0);
  const auto& input_dims = input_tensor.Shape().GetDims();
  int rank = SafeInt<int>(input_dims.size());
  ORT_ENFORCE(rank == 2 || (rank == 3 && input_dims[0] == 1),
              "Input rank must be 2, or 3 with the first dim equal to 1, but got: ", rank);

  const size_t rows = SafeInt<size_t>(input_dims[rank == 3 ? 1 : 0]);
  const size_t cols = SafeInt<size_t>(input_dims[rank == 3 ? 2 : 1]);

  std::vector<std::vector<float>> cost(rows + 1, std::vector<float>(cols + 1, std::numeric_limits<float>::infinity()));
  std::vector<std::vector<int8_t>> trace(rows + 1, std::vector<int8_t>(cols + 1, -1));
  std::vector<std::vector<int32_t>> path_helper;

  // Compute the cost and trace matrices
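  // Recurrence: cost[i][j] = input[i-1][j-1] + min(cost[i-1][j-1], cost[i-1][j], cost[i][j-1]);
  // trace records the chosen predecessor (0: diagonal, 1: up, 2: left).
  // The strict '<' comparisons below fall through to c2 on exact ties,
  // mirroring the reference Whisper-style DTW formulation.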
  cost[0][0] = 0;
  for (size_t j = 1; j <= cols; ++j) {
    for (size_t i = 1; i <= rows; ++i) {
      const float c0 = cost[i - 1][j - 1];
      const float c1 = cost[i - 1][j];
      const float c2 = cost[i][j - 1];

      float cur_cost;
      int8_t cur_trace;
      if (c0 < c1 && c0 < c2) {
        cur_cost = c0;
        cur_trace = 0;
      } else if (c1 < c0 && c1 < c2) {
        cur_cost = c1;
        cur_trace = 1;
      } else {
        cur_cost = c2;
        cur_trace = 2;
      }

      cost[i][j] = cur_cost + input_tensor.Data<float>()[(i - 1) * cols + j - 1];
      trace[i][j] = cur_trace;
    }
  }

  // Backtrack to find the optimal path
  int i = static_cast<int>(rows);
  int j = static_cast<int>(cols);
  int result_len = 0;
  while (i > 0 && j > 0) {
    path_helper.push_back({i - 1, j - 1});
    ++result_len;
    int8_t cur_trace = trace[i][j];
    switch (cur_trace) {
      case 0:
        --i;
        --j;
        break;
      case 1:
        --i;
        break;
      case 2:
        --j;
        break;
      default:
        ORT_THROW("Invalid trace value: ", cur_trace);
    }
  }

  // Update the output tensor
  Tensor* output_tensor = ctx->Output(0, TensorShape{2LL, SafeInt<int64_t>(result_len)});
  auto* output_data = output_tensor->MutableData<int32_t>();
  for (int k = 0; k < result_len; ++k) {
    output_data[k] = path_helper[static_cast<size_t>(result_len) - k - 1][0];
    output_data[k + result_len] = path_helper[static_cast<size_t>(result_len) - k - 1][1];
  }

  return Status::OK();
}

} // namespace contrib
} // namespace onnxruntime
24 changes: 24 additions & 0 deletions onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.h
@@ -0,0 +1,24 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#pragma once
#include "core/framework/op_kernel.h"
#include <core/common/safeint.h>

namespace onnxruntime {
namespace contrib {

using onnxruntime::OpKernelContext;
using onnxruntime::OpKernelInfo;

class DynamicTimeWarping : public OpKernel {
 public:
  DynamicTimeWarping(const OpKernelInfo& info) : OpKernel(info) {}

  ~DynamicTimeWarping() = default;

  Status Compute(OpKernelContext* context) const override;
};

} // namespace contrib
} // namespace onnxruntime
109 changes: 109 additions & 0 deletions onnxruntime/contrib_ops/cpu/tensor/unfold.cc
@@ -0,0 +1,109 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#include "contrib_ops/cpu/tensor/unfold.h"
#include "core/providers/cpu/tensor/utils.h"
#include "core/providers/common.h"
#include "core/platform/threadpool.h"

#include <vector>
#include <numeric>

using namespace onnxruntime::common;

namespace onnxruntime {
namespace contrib {

ONNX_OPERATOR_KERNEL_EX(
    UnfoldTensor,
    kMSDomain,
    1,
    kCpuExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::AllTensorTypes()),
    UnfoldTensor);

template <typename T>
Status LaunchUnfoldTensor(const T* input,
                          T* output,
                          int64_t leading_dims_size,
                          int64_t unfold_dim_size,
                          int64_t tailing_dims_size,
                          int64_t unfold_size,
                          int64_t step_size,
                          concurrency::ThreadPool* tp) {
  int64_t unfold_dim_size_dst = (unfold_dim_size - unfold_size) / step_size + 1;
  int64_t N = leading_dims_size * unfold_dim_size_dst * tailing_dims_size * unfold_size;

  int64_t stride_leading_dst = unfold_size * tailing_dims_size * unfold_dim_size_dst;
  int64_t stride_fold_dim_src = tailing_dims_size * step_size;
  int64_t stride_leading_src = tailing_dims_size * unfold_dim_size;

  static constexpr double cost = 1.0;
  concurrency::ThreadPool::TryParallelFor(
      tp, static_cast<ptrdiff_t>(N), cost,
      [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
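        // Decompose the flat destination index (row-major) into
        // [leading, fold, tailing, append], then map it back to the source
        // offset [leading, fold * step + append, tailing].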
        for (std::ptrdiff_t i = begin; i != end; ++i) {
          const int64_t idx = static_cast<int64_t>(i);
          const int64_t idx_leading = idx / stride_leading_dst;
          int64_t n = idx % stride_leading_dst;
          const int64_t stride_fold_dim_dst = tailing_dims_size * unfold_size;
          const int64_t idx_fold = n / stride_fold_dim_dst;
          n %= stride_fold_dim_dst;
          const int64_t idx_tailing = n / unfold_size;
          const int64_t idx_append = n % unfold_size;

          int64_t idx_src = idx_leading * stride_leading_src +
                            idx_fold * stride_fold_dim_src + idx_tailing +
                            idx_append * tailing_dims_size;
          output[idx] = input[idx_src];
        }
      });

  return Status::OK();
}

Status UnfoldTensor::Compute(OpKernelContext* ctx) const {
  const Tensor& input_tensor = *ctx->Input<Tensor>(0);
  const auto& input_dims = input_tensor.Shape().GetDims();
  int rank = SafeInt<int>(input_dims.size());

  int dim = SafeInt<int>(HandleNegativeAxis(dim_, rank));
  ORT_ENFORCE(dim < rank, "attribute dim: ", dim, " must be less than input rank: ", rank);
  ORT_ENFORCE(input_dims[dim] >= size_, "dim size: ", input_dims[dim], " is less than unfold size: ", size_);

  int64_t leading_dims = std::accumulate(input_dims.begin(), input_dims.begin() + static_cast<size_t>(dim),
                                         1LL, std::multiplies<int64_t>());
  int64_t tailing_dims = std::accumulate(input_dims.begin() + (static_cast<size_t>(dim) + 1),
                                         input_dims.end(), 1LL, std::multiplies<int64_t>());

  std::vector<int64_t> output_dims(static_cast<size_t>(rank) + 1, 0);
  std::copy(input_dims.begin(), input_dims.end(), output_dims.begin());
  output_dims[dim] = (input_dims[dim] - size_) / step_ + 1;
  output_dims.back() = size_;
  TensorShape output_shape(output_dims);
  Tensor* output_tensor = ctx->Output(0, output_shape);

  auto* tp = ctx->GetOperatorThreadPool();

  Status status;
  if (input_tensor.IsDataType<float>()) {
    status = LaunchUnfoldTensor<float>(input_tensor.Data<float>(), output_tensor->MutableData<float>(),
                                       leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
  } else if (input_tensor.IsDataType<double>()) {
    status = LaunchUnfoldTensor<double>(input_tensor.Data<double>(), output_tensor->MutableData<double>(),
                                        leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
  } else if (input_tensor.IsDataType<int32_t>()) {
    status = LaunchUnfoldTensor<int32_t>(input_tensor.Data<int32_t>(), output_tensor->MutableData<int32_t>(),
                                         leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
  } else if (input_tensor.IsDataType<int64_t>()) {
    status = LaunchUnfoldTensor<int64_t>(input_tensor.Data<int64_t>(), output_tensor->MutableData<int64_t>(),
                                         leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
  } else {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "Unsupported data type: ", input_tensor.DataType());
  }

  return status;
}

} // namespace contrib
} // namespace onnxruntime
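
For readers checking the index arithmetic in LaunchUnfoldTensor above, a naive reference loop with the same semantics (a sketch over plain row-major buffers; UnfoldNaive is hypothetical, not part of this commit or the ONNX Runtime API):

#include <cstdint>
#include <cstdio>
#include <vector>

// Naive reference for UnfoldTensor on a flat row-major buffer shaped
// [leading, dim, tailing]: windows of `size` every `step` along the middle
// axis, with window elements appended as the innermost axis.
template <typename T>
std::vector<T> UnfoldNaive(const std::vector<T>& in, int64_t leading,
                           int64_t dim, int64_t tailing, int64_t size, int64_t step) {
  const int64_t dim_out = (dim - size) / step + 1;
  std::vector<T> out;
  out.reserve(leading * dim_out * tailing * size);
  for (int64_t l = 0; l < leading; ++l)
    for (int64_t f = 0; f < dim_out; ++f)
      for (int64_t t = 0; t < tailing; ++t)
        for (int64_t a = 0; a < size; ++a)
          out.push_back(in[(l * dim + f * step + a) * tailing + t]);
  return out;
}

int main() {
  // 1-D example: [0..9], size = 4, step = 2 -> windows {0,1,2,3}, {2,3,4,5}, ...
  std::vector<int> x(10);
  for (int i = 0; i < 10; ++i) x[i] = i;
  auto y = UnfoldNaive(x, /*leading=*/1, /*dim=*/10, /*tailing=*/1, /*size=*/4, /*step=*/2);
  for (size_t i = 0; i < y.size(); ++i) std::printf("%d%c", y[i], (i % 4 == 3) ? '\n' : ' ');
  return 0;
}

Running it prints the four windows 0 1 2 3 / 2 3 4 5 / 4 5 6 7 / 6 7 8 9, matching the documented (sizedim - size) / step + 1 = 4 output rows.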
47 changes: 47 additions & 0 deletions onnxruntime/contrib_ops/cpu/tensor/unfold.h
@@ -0,0 +1,47 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#pragma once
#include "core/framework/op_kernel.h"
#include "core/platform/threadpool.h"
#include <core/common/safeint.h>

namespace onnxruntime {
namespace contrib {

using onnxruntime::OpKernelContext;
using onnxruntime::OpKernelInfo;

template <typename T>
Status LaunchUnfoldTensor(
    const T* input,
    T* output,
    int64_t leading_dims_size,
    int64_t unfold_dim_size,
    int64_t tailing_dims_size,
    int64_t unfold_size,
    int64_t step_size,
    concurrency::ThreadPool* tp);

class UnfoldTensor final : public OpKernel {
 public:
  UnfoldTensor(const OpKernelInfo& info) : OpKernel(info) {
    dim_ = SafeInt<int>(info.GetAttrOrDefault<int64_t>("dim", -1LL));
    step_ = SafeInt<int>(info.GetAttrOrDefault<int64_t>("step", 1LL));
    ORT_ENFORCE(step_ > 0, "step must be greater than zero!");

    int64_t temp_size;
    ORT_ENFORCE(info.GetAttr("size", &temp_size).IsOK());
    size_ = SafeInt<int>(temp_size);
  }

  ~UnfoldTensor() = default;

  Status Compute(OpKernelContext* context) const override;

 private:
  int dim_;
  int size_;
  int step_;
};

} // namespace contrib
} // namespace onnxruntime
12 changes: 6 additions & 6 deletions onnxruntime/core/graph/contrib_ops/contrib_defs.cc
@@ -1067,11 +1067,11 @@ ONNX_MS_OPERATOR_SET_SCHEMA(GridSample, 1,
ONNX_MS_OPERATOR_SET_SCHEMA(
    UnfoldTensor, 1,
    OpSchema()
        .SetDoc("Returns a tensor which contains all slices of size size from input tensor in the dimension dim. "
                "Step between two slices is given by step. "
                "If sizedim is the size of dimension dim for input tensor, the size of dimension dim in "
                "the returned tensor will be (sizedim - size) / step + 1. "
                "An additional dimension of size size is appended in the returned tensor.")
        .SetDoc("Returns a tensor which contains all slices of size `size` from the input tensor in the dimension `dim`. "
                "The step between two slices is given by `step`. "
                "If `sizedim` is the size of dimension `dim` for the input tensor, the size of dimension `dim` in "
                "the returned tensor will be `(sizedim - size) / step + 1`. "
                "An additional dimension of size `size` is appended to the returned tensor.")
        .Attr("dim", "specify the dimension to unfold", AttributeProto::INT, static_cast<int64_t>(-1))
        .Attr("size", "specify the size", AttributeProto::INT)
        .Attr("step", "specify the step.", AttributeProto::INT, static_cast<int64_t>(1))
@@ -1122,7 +1122,7 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
    OpSchema()
        .SetDoc("Input is a cost matrix where each value input[r][c] is the cost of passing the point (r, c). From the current point "
                "(r, c), the points (r+1, c), (r+1, c+1), or (r, c+1) can be reached in the next move. Given such a cost matrix, return "
                "dynamic time wrapping of shape [2, x], where the path made by all points (output[0][t], output[1][t])"
                "the dynamic time warping path of shape [2, x], where the path made by all points (output[0][t], output[1][t]) "
                "has the lowest cost among all paths from (0, 0) to (M-1, N-1).")
        .Input(0, "input", "Input cost tensor; it must be a 2D tensor of shape M x N, or 1 x M x N", "F")
        .Output(0, "output", "Output tensor of shape [2, x], where max(M, N) <= x < M + N", "I")
9 changes: 5 additions & 4 deletions onnxruntime/test/contrib_ops/dynamic_time_warping_op_test.cc
@@ -11,12 +11,12 @@ using namespace ONNX_NAMESPACE;
namespace onnxruntime {
namespace test {

TEST(DynamicTimeWarp, simple) {
TEST(DynamicTimeWarping, simple) {
#ifdef USE_CUDA
  if (NeedSkipIfCudaArchLowerThan(530)) {
    return;
  }
#endif

  std::vector<float> X = {
      3.0f,
@@ -113,11 +113,12 @@ TEST(DynamicTimeWarp, simple) {
  tester.AddOutput<int32_t>("output", {2, 12}, Y);

  std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
#ifdef USE_CUDA
  execution_providers.push_back(DefaultCudaExecutionProvider());
#endif
  execution_providers.push_back(DefaultCpuExecutionProvider());
  tester.Run(OpTester::ExpectResult::kExpectSuccess, "", {}, nullptr, &execution_providers);
}

#endif

} // namespace test
} // namespace onnxruntime