sync branches: develop and infrt (PaddlePaddle#39509)

* 【Pten】Adjust the Empyt dev_api (PaddlePaddle#39143) * adjust the Empyt dev_api * fix merge conflict * fix sparse_utils_kernel * Fix code conflict of empty dev_api (PaddlePaddle#39430) * fix code conflict * clear cache * just try * [PluggableDevice] custom kernel supports multi cpp_dtype registering (PaddlePaddle#39385) * [PTen] Add standard kernel suffix set (PaddlePaddle#39404) * add standard_suffix_set_and_remove_reshape_with_xshape * revert reshape change * polish reduce name * [pten] update isnan registration (PaddlePaddle#39419) * update isnan registration * fix compile * [bf16] add bf16 kernel: dropout & reshape & slice (PaddlePaddle#39395) * add dropout * add reshape * add slice * refien slice unittest * refine slice unittest * add cpu bf16 kernel * [bf16] add bf16 kernel: squeeze & unsqueeze & stack (PaddlePaddle#39402) * add squeeze unsqueeze stack * add unittest * add cpu kernel * Modify the unsqueeze dimension of input data in conv1d NCL And NLC format (PaddlePaddle#38425) * optimize conv1d forward * add conv opt * Optimize memory copy * delete share data with * set num_filters=512 * add nlc optimize * Optimize num_filter=512 data on A100 and V100 * Fix the workspace_size size setting of filter * 【Pten】Refactor C++ API code-gen (PaddlePaddle#39408) * refactor C++ API code-gen * fix windows problem of C++ API * Refactored Python-C Attributes Parsing Functions (PaddlePaddle#39328) * Add _get_parameter method to Lamb optimizer (PaddlePaddle#39416) * add _get_parameter func to lamb * remove duplicate code * mkldnn layout issue fix (PaddlePaddle#39422) * mkldnn conv fix * definetion * fix compile error on jetson (PaddlePaddle#39441) * move Masked select to pten (PaddlePaddle#39193) * move masked select cpu kernel * add masked selected gpu kernel; test=develop * fix bugs; test=develop * bug fix; test=develop * bug fix; test=develop * add namespace to set mask array; test=develop * fix bug; test=develop * fix bugs; test=develop * fix ddim bug; test=develop * 
fix npu op bug; test=develop * fix xpu dependecy bug; test=develop * move kernel args to sig.cc; test=develop * 【PaddlePaddle Hackathon】31. Add Java frontend for Paddle Inference (PaddlePaddle#37162) * fix check error of ResetHolder (PaddlePaddle#39439) * Added python-c code generation for final state Eager Dygraph (PaddlePaddle#39233) * Removed debug info * Added automatic code generation for final state Eager Dygraph * Modified backward yaml * Added EagerUtils helper functions for final state CodeGen * Adjusted CMakeFiles to support compilation for final state auto generated codes * Added python-c code generation for final state Eager Dygraph * Fixed minor issue * Fixed yaml.load() method failure * Fixed minor issues * Refactored Python-C Attributes Parsing Functions * Fixed minor issue with Python-C AddFunctions * Fixed issues from merge * Fixed merge issues * change dtype of pooling mask to 'int32' for Paddle2ONNX (PaddlePaddle#39314) * change dtype of pooling mask to 'int32' for Paddle2ONNX * empty commit to rerun ci * fix format * share MemOptVarInfos of external variables into cinn_launch subgraph (PaddlePaddle#39209) * add a graph pass to share MemOptVarInfos of external variables into subgraph * update pass name * fix compile failed * add share_mem_opt_info_to_subgraph_pass test * share_mem_opt_info_to_subgraph_pass_test pass * modify some codes for better style and more robust * update cmake * [NPU] add reduce_min (PaddlePaddle#39019) [NPU] add reduce_min * [MLU] add mlu kernel for accuracy op (PaddlePaddle#39337) * [MLU] add mlu kernel for accuracy op * fix license format * fix error message * [Dy2St]Handle `a, b = paddle.shape(x)` in Static Analysis (PaddlePaddle#39245) * refine Assign * add UT * 【Pten】Auto-Generate InterMeta register (PaddlePaddle#39436) * fix code conflict * generate inter_meta register * clear cache * just try * add sign c++ api * polish some code * Support different dtypes of inputs for elementwise ops (PaddlePaddle#38859) * improve 
backward performance * support different dtypes for elementwise ops * Add profiler node tree implementation (PaddlePaddle#39316) * add event node implementation * modify profiler.stop interface * fix according to review * fix file mode * modify class method name in event_node.cc * modify LLONG_MAX to ULLONG_MAX * fix ci error * fix ci error * add print pten kernel tool (PaddlePaddle#39371) * test=document_fix;add print pten kernel tool * test=document_fix * test=document_fix * test=document_fix * test=document_fix * add print_pten_kernels tool * add print_pten_kernels tool * fix windows compile * notest,test=rocm_ci * add merge tool * add comments * [new-exec] set type of op-kernel op by place (PaddlePaddle#39458) * Add log for executor (PaddlePaddle#39459) * add align for WorkQueue * add spinlock * merge develop * merge * Add EventsWaiter * Revert "Add EventsWaiter" This reverts commit e206173. * add log for Executor Co-authored-by: liutiexing <liutiexing@google.com> * [Paddle Inference] support ernie quant model with interleaved (PaddlePaddle#39424) * support ernie quant model with interleaved * support ernie quant model with interleaved * support ernie quant model with interleaved * support ernie quant model with interleaved * support ernie quant model with interleaved * support ernie quant model with interleaved * support ernie quant model with interleaved * Unify PS development - python (PaddlePaddle#39431) * delete gloo connect retry * the_one_ps dirs reconstruct * . * . * create the_one_ps dirs * create the_one_ps dirs * create the_one_ps dirs * create the_one_ps dirs * create the_one_ps dirs * create the_one_ps dirs * the one ps dirs modify * the one ps dirs modify * the one ps dirs modify * the one ps dirs modify * refactor ps optimize * refactor ps optimize * refactor ps optimize * . * . * . * . * . * . 
* refactor theoneps * the_one_ps * add ps pass unittest * add ps pass unittest * ps unitest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * ps unittest frame * add cpu_async_ps_mode test * add cpu_async_ps_mode test * add cpu_async_ps_mode test * ps unittest ready * ps unittest ready * solve dist_pass init conflict * solve import CommContext error * unittest ok * implement AllocateFrom * solve setup.py.in conflict * solve conflict * solve conflict * solve conflict * . * . * cpu-async-ps minimize test ok & gpu minimize test ok Co-authored-by: zkh2016 <zhangkaihuo@baidu.com> * [PTen] Move grad GetExpectedPtenKernelArgs into pten (PaddlePaddle#39418) * move grad get expected pten kernel args * fix reduce sum error * fix element_sub_grad failed * revert kernel judge change * fix compilation warning on mac (PaddlePaddle#39438) * get build time (PaddlePaddle#39368) * fix prelu trt convert (PaddlePaddle#39389) * Optimize bilinear interpolation foward (PaddlePaddle#39243) * bilinear_fw init * optimize code * pre-compute linear_interp input index * Optimize performance of softmax_bwd when axis!=-1 (PaddlePaddle#38609) * Optimize performance of softmax_bwd when axis!=-1 * fix * fix * fix * fix * [PTen] Remove pten core's dependency on fluid xxx_info.h (PaddlePaddle#39401) * ermove xxx_info include * fix namespace error * resolve conflict * skip xpu context in registry * fix macro error * resolve conflict * resolve conflict * revert xpu convert * remove trans to fluid place * remove useless headers * [Pten] move operators/math/math_function_* to pten/kernels/func (PaddlePaddle#39300) * move operators/math/math_function_* to pten/kernels/func * namespace from `paddle::operators::math` to `pten::funcs` * [MLU] add pool2d 
and pool2d_grad mlu kernel (PaddlePaddle#39453) * [MLU]support c_gen_cncl_id_op run on MLU device (PaddlePaddle#39336) Co-authored-by: zhangna <zhangna@cambricon.com> * [bf16] add bf16 kernel: transpose & unbind (PaddlePaddle#39457) * add transpose unbind * add unittest * refine transpose unittest * uniform_random op for mlu (PaddlePaddle#39450) * [MLU] add pool2d pytest (PaddlePaddle#39454) * Added shape (U)INT8/BF16/FP32 oneDNN kernel (PaddlePaddle#36033) * added shape oneDNN kernel * removed unnecessary import from test * added skipping tests for GPU * refactoring * refactored shape kernel * added tests in new framework * removed one line * minor change * added newline at EOF * added formatting * added attributes as extra * move memcpy.h into cc file (PaddlePaddle#39469) * Add TensorRT inspector into Paddle-TRT (PaddlePaddle#38362) * Fix add profiler node tree implementation cmake error (PaddlePaddle#39474) * add event node implementation * modify profiler.stop interface * fix according to review * fix file mode * modify class method name in event_node.cc * modify LLONG_MAX to ULLONG_MAX * fix ci error * fix ci error * fix dependency error * unify naming style (PaddlePaddle#39481) * [Pten] Generate Wrapped InferMeta by Yaml (PaddlePaddle#39482) * generate wrapped_infer_meta * add test for wrapped_infer_meta * Update test_meta_fn_utils.cc * change the dir of generated file Co-authored-by: Chen Weihang <chenweihang@baidu.com> Co-authored-by: Chen Weihang <chenwhpro@163.com> * Adjusted python-level trace_op to accomodate final state Eager Dygraph (PaddlePaddle#39319) * Removed debug info * Added automatic code generation for final state Eager Dygraph * Modified backward yaml * Added EagerUtils helper functions for final state CodeGen * Adjusted CMakeFiles to support compilation for final state auto generated codes * Added python-c code generation for final state Eager Dygraph * Fixed minor issue * Fixed yaml.load() method failure * Fixed minor issues * Refactored 
Python-C Attributes Parsing Functions * Fixed minor issue with Python-C AddFunctions * Adjusted python-level trace_op to accomodate final state Eager Dygraph * Added Logs for final state Eager Dygraph * Fixed merge issues * Fixed minor issue * Fixed get_tensor method for EagerTensor (PaddlePaddle#39414) * Enabled Eager OpTest PaddlePaddle#1 * Enabled Eager OpTest PaddlePaddle#1 * Fixed get_tensor method for EagerTensor * [Approver Update] update check approver of qili93, test=document_fix (PaddlePaddle#39483) * [MLU] add mlu kernel for c_broadcast op (PaddlePaddle#39470) * update xpu test build script and fix get_test_cover_info, *test=kunlun (PaddlePaddle#39235) * fix gather_nd, *test=kunlun (PaddlePaddle#39283) * [pten] add split kernel (PaddlePaddle#39060) * add split kernel * add split kernel signature * fix split bug * modify MakePtenScalarArrayFromVarList * modify MakePtenScalarArrayFromVarList * fix split windows register error * add test case for split kernel * replace raw split kernel with pten kernel * fix makeScalar/ScalarArray bug * remove debug log * remove int64_t type in buildPtcontext * update by code review * fix split dev test failed * change DenseTensorMeta to MetaTensor * change split api code from auto gen to manual * split cuda kernel support bfloat16 type * fix conflict * rm raw split kernel * merge develop branch * change to pten::errors * new may of test cases, *test=kunlun (PaddlePaddle#39444) * new may of test cases, *test=kunlun * new may of test cases, *test=kunlun * new may of test cases, *test=kunlun * [PTen] Add HasAttr for ArgumentMappingContext (PaddlePaddle#39464) * add has_attr for arg map context * skip useless attr now * skip attr if not exists * fix typo * [ROCm] fix missing dcu kernel in operator.cmake, test=develop (PaddlePaddle#39480) Co-authored-by: zyfncg <zhangyunfei07@baidu.com> Co-authored-by: Aganlengzi <aganlengzi@gmail.com> Co-authored-by: Chen Weihang <chenweihang@baidu.com> Co-authored-by: Leo Chen 
<chenqiuliang@baidu.com> Co-authored-by: zhangbo9674 <82555433+zhangbo9674@users.noreply.github.com> Co-authored-by: crystal <62974595+Zjq9409@users.noreply.github.com> Co-authored-by: Zhanlue Yang <jim19930609@gmail.com> Co-authored-by: sneaxiy <32832641+sneaxiy@users.noreply.github.com> Co-authored-by: wenbin <wang3323032@qq.com> Co-authored-by: Wilber <jiweibo@baidu.com> Co-authored-by: hong <43953930+phlrain@users.noreply.github.com> Co-authored-by: chenyanlann <62465397+chenyanlann@users.noreply.github.com> Co-authored-by: Wei Shengyu <weisy11@163.com> Co-authored-by: TeFeng Chen <ctfeng66@163.com> Co-authored-by: furnace <34057289+windstamp@users.noreply.github.com> Co-authored-by: fwenguang <95677191+fwenguang@users.noreply.github.com> Co-authored-by: 0x45f <23097963+0x45f@users.noreply.github.com> Co-authored-by: Zhang Ting <zhangting_2017@163.com> Co-authored-by: chenjian <chenjian26@baidu.com> Co-authored-by: Shang Zhizhou <shangzhizhou@baidu.com> Co-authored-by: liutiexing <74819124+liutiexing@users.noreply.github.com> Co-authored-by: liutiexing <liutiexing@google.com> Co-authored-by: Wangzheee <634486483@qq.com> Co-authored-by: ziyoujiyi <73728031+ziyoujiyi@users.noreply.github.com> Co-authored-by: zkh2016 <zhangkaihuo@baidu.com> Co-authored-by: zhangchunle <clzhang_cauc@163.com> Co-authored-by: JingZhuangzhuang <75348594+JZZ-NOTE@users.noreply.github.com> Co-authored-by: Lijunhui <1578034415@qq.com> Co-authored-by: Zhang Zheng <32410583+ZzSean@users.noreply.github.com> Co-authored-by: Feiyu Chan <chenfeiyu@baidu.com> Co-authored-by: zn <96479180+kangna-qi@users.noreply.github.com> Co-authored-by: zhangna <zhangna@cambricon.com> Co-authored-by: joeqiao12 <45232181+joeqiao12@users.noreply.github.com> Co-authored-by: jakpiase <jakpia21@gmail.com> Co-authored-by: Leo Chen <39020268+leo0519@users.noreply.github.com> Co-authored-by: Chen Weihang <chenwhpro@163.com> Co-authored-by: Qi Li <qili93@qq.com> Co-authored-by: maxhuiy <1508399706@qq.com> 
Co-authored-by: TTerror <tangzhiyi11@users.noreply.github.com> Co-authored-by: chentianyu03 <chentianyu03@baidu.com> Co-authored-by: helen88 <z8hanghuan@126.com>
winter-wang · Feb 16, 2022 · 9fd407c
1 parent 2c7f6e6
commit 9fd407c
Showing 29 changed files with 346 additions and 651 deletions.
diff --git a/paddle/fluid/framework/ir/memory_optimize_pass/share_varinfo_into_cinn_pass_test.cc b/paddle/fluid/framework/ir/memory_optimize_pass/share_varinfo_into_cinn_pass_test.cc
@@ -26,7 +26,7 @@
 
 USE_OP(mul);
 USE_OP(cinn_launch);
-USE_OP_ITSELF(elementwise_add);
+USE_OP(elementwise_add);
 namespace paddle::framework {
 
 using Name2VarInfoMap =

diff --git a/paddle/fluid/imperative/gradient_accumulator.cc b/paddle/fluid/imperative/gradient_accumulator.cc
@@ -744,14 +744,12 @@ void EagerGradientAccumulator::SumGrad(std::shared_ptr<VariableWrapper> var,
         VLOG(6) << "Dims of " << dst_var->Name() << " is set as: "
                 << var->Var().Get<framework::LoDTensor>().dims();
         tensor->Resize(var->Var().Get<framework::LoDTensor>().dims());
-        tensor->mutable_data(place,
-                             framework::TransToPtenDataType(var->DataType()));
+        tensor->mutable_data(place, var->DataType());
         pten::funcs::set_constant(*dev_ctx, tensor, 0.0);
       } else {
         auto* tensor =
             dst_var->MutableVar()->GetMutable<framework::LoDTensor>();
-        tensor->mutable_data(place,
-                             framework::TransToPtenDataType(var->DataType()));
+        tensor->mutable_data(place, var->DataType());
         pten::funcs::set_constant(*dev_ctx, tensor, 0.0);
       }
     }
@@ -878,14 +876,12 @@ void SortedGradientAccumulator::SumGrad(std::shared_ptr<VariableWrapper> var,
         VLOG(6) << "Dims of " << dst_var->Name() << " is set as: "
                 << var->Var().Get<framework::LoDTensor>().dims();
         tensor->Resize(var->Var().Get<framework::LoDTensor>().dims());
-        tensor->mutable_data(place,
-                             framework::TransToPtenDataType(var->DataType()));
+        tensor->mutable_data(place, var->DataType());
         pten::funcs::set_constant(*dev_ctx, tensor, 0.0);
       } else {
         auto* tensor =
             dst_var->MutableVar()->GetMutable<framework::LoDTensor>();
-        tensor->mutable_data(place,
-                             framework::TransToPtenDataType(var->DataType()));
+        tensor->mutable_data(place, var->DataType());
         pten::funcs::set_constant(*dev_ctx, tensor, 0.0);
       }
     }

diff --git a/paddle/fluid/operators/dropout_impl.cu.h b/paddle/fluid/operators/dropout_impl.cu.h
@@ -133,18 +133,15 @@ __global__ void VectorizedRandomGenerator(const size_t n, uint64_t seed,
 
 template <typename T, typename MaskType>
 struct CudaDropoutGradFunctor {
-  using MT = typename details::MPTypeTrait<T>::Type;
-
-  explicit CudaDropoutGradFunctor(const MT factor) : factor_(factor) {}
+  explicit CudaDropoutGradFunctor(const T factor) : factor_(factor) {}
 
   __device__ __forceinline__ T operator()(const T dout,
                                           const MaskType mask) const {
-    return static_cast<T>(static_cast<MT>(dout) * static_cast<MT>(mask) *
-                          factor_);
+    return dout * static_cast<T>(mask) * factor_;
   }
 
  private:
-  MT factor_;
+  T factor_;
 };
 
 template <typename T, typename MaskType, int VecSize>
@@ -287,7 +284,7 @@ void DropoutGradGPUKernelDriver(const platform::CUDADeviceContext& dev_ctx,
       if (dropout_prob == 1.0f) {
         dX.device(place) = static_cast<T>(0) * dY;
       } else {
-        auto factor = static_cast<MT>(1.0f / (1.0f - dropout_prob));
+        auto factor = static_cast<T>(1.0f / (1.0f - dropout_prob));
         auto stream = dev_ctx.stream();
         std::vector<const framework::Tensor*> ins = {&grad_y, &mask};
         std::vector<framework::Tensor*> outs = {grad_x};

diff --git a/paddle/fluid/operators/fill_constant_batch_size_like_op.h b/paddle/fluid/operators/fill_constant_batch_size_like_op.h
@@ -62,17 +62,15 @@ class FillConstantBatchSizeLikeOpKernel : public framework::OpKernel<T> {
     if (cpu_place) {
       auto &dev_ctx = *pool.Get(platform::CPUPlace());
       pten::funcs::SetConstant<platform::CPUDeviceContext, T> functor;
-      out->mutable_data(platform::CPUPlace(),
-                        framework::TransToPtenDataType(data_type));
+      out->mutable_data(platform::CPUPlace(), data_type);
       functor(reinterpret_cast<const platform::CPUDeviceContext &>(dev_ctx),
               out, static_cast<T>(value));
     }
 #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
     if (!cpu_place) {
       auto &dev_ctx = *pool.Get(ctx.GetPlace());
       pten::funcs::SetConstant<platform::CUDADeviceContext, T> functor;
-      out->mutable_data(ctx.GetPlace(),
-                        framework::TransToPtenDataType(data_type));
+      out->mutable_data(ctx.GetPlace(), data_type);
       functor(reinterpret_cast<const platform::CUDADeviceContext &>(dev_ctx),
               out, static_cast<T>(value));
     }

diff --git a/paddle/fluid/operators/fill_constant_batch_size_like_op_npu.cc b/paddle/fluid/operators/fill_constant_batch_size_like_op_npu.cc
@@ -71,8 +71,7 @@ class FillConstantBatchSizeLikeOpNPUKernel : public framework::OpKernel<T> {
     if (cpu_place) {
       auto &dev_ctx = *pool.Get(platform::CPUPlace());
       pten::funcs::SetConstant<platform::CPUDeviceContext, T> functor;
-      out->mutable_data(platform::CPUPlace(),
-                        framework::TransToPtenDataType(data_type));
+      out->mutable_data(platform::CPUPlace(), data_type);
       functor(reinterpret_cast<const platform::CPUDeviceContext &>(dev_ctx),
               out, static_cast<T>(value));
     } else {

diff --git a/paddle/fluid/operators/fill_constant_op.h b/paddle/fluid/operators/fill_constant_op.h
@@ -121,16 +121,14 @@ class FillConstantKernel : public framework::OpKernel<T> {
       VLOG(4) << "[CPU] FillConstantKernel"
               << ((data_type == framework::proto::VarType::BF16) ? "<bfloat16>"
                                                                  : "<T>");
-      tensor->mutable_data(platform::CPUPlace(),
-                           framework::TransToPtenDataType(data_type));
+      tensor->mutable_data(platform::CPUPlace(), data_type);
       pten::funcs::SetConstant<platform::CPUDeviceContext, T> functor;
       auto &dev_ctx = *pool.Get(platform::CPUPlace());
       functor(reinterpret_cast<const platform::CPUDeviceContext &>(dev_ctx),
               tensor, static_cast<T>(value));
     } else if (actual_place == 1) {
 #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
-      tensor->mutable_data(ctx.GetPlace(),
-                           framework::TransToPtenDataType(data_type));
+      tensor->mutable_data(ctx.GetPlace(), data_type);
       pten::funcs::SetConstant<platform::CUDADeviceContext, T> functor;
       auto &dev_ctx = *pool.Get(ctx.GetPlace());
       functor(reinterpret_cast<const platform::CUDADeviceContext &>(dev_ctx),
@@ -141,8 +139,7 @@ class FillConstantKernel : public framework::OpKernel<T> {
 #endif
     } else if (actual_place == 2) {
 #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP)
-      tensor->mutable_data(platform::CUDAPinnedPlace(),
-                           framework::TransToPtenDataType(data_type));
+      tensor->mutable_data(platform::CUDAPinnedPlace(), data_type);
       pten::funcs::SetConstant<platform::CUDAPinnedDeviceContext, T> functor;
       auto &dev_ctx = *pool.Get(platform::CUDAPinnedPlace());
       functor(
@@ -154,8 +151,7 @@ class FillConstantKernel : public framework::OpKernel<T> {
 #endif
     } else if (actual_place == 3) {
 #ifdef PADDLE_WITH_XPU
-      tensor->mutable_data(ctx.GetPlace(),
-                           framework::TransToPtenDataType(data_type));
+      tensor->mutable_data(ctx.GetPlace(), data_type);
       pten::funcs::SetConstant<platform::XPUDeviceContext, T> functor;
       auto &dev_ctx = *pool.Get(ctx.GetPlace());
       functor(reinterpret_cast<const platform::XPUDeviceContext &>(dev_ctx),