PaddlePaddle 2.6.0 Release Note EN
- Paddle new generation IR (PIR): To further improve the scalability of the PaddlePaddle framework, we have developed a new generation intermediate representation. It abstracts the underlying core concepts of the PaddlePaddle framework, such as Operation, Attribute and Type, providing developers with flexible and efficient basic components. By introducing the Dialect mechanism, PIR can comprehensively and hierarchically satisfy the needs of each module for intermediate representations, greatly enhancing the scalability of the framework. PIR strictly follows the Static Single Assignment (SSA) principle, ensuring unity of the top-level structure and harmonious coexistence of "operator sequentiality" and "computational graph semantics". In addition, PIR provides a more concise and low-cost Pass development process, with a series of rich and functional built-in Pass optimization strategies. It provides technical support for the ultimate performance optimization of large-scale models.
- Static graph construction and compiler optimization architecture: To further improve the performance of the framework, PaddlePaddle's dynamic-to-static training capability has been comprehensively upgraded to support adaptive graph construction. This has been tested on more than 700 PaddlePaddle industry-level models, with a 100% success rate for starting static-graph training with a one-line conversion. Meanwhile, the Compiler Infrastructure for Neural Networks (CINN) of the PaddlePaddle framework is integrated into the PaddlePaddle main repo, making the compiler and PaddlePaddle more tightly integrated. CINN completes architectural optimization and improvement of expansion capability, increasing system stability. Based on the PIR framework, it is much easier to bind dynamic-to-static, primitive operators, the executor and the compiler together, providing more space for boosting the overall performance of the PaddlePaddle framework.
- Enhanced dynamic graph distributed capability: Large models pose higher demands on the distributed training performance of the framework. PaddlePaddle has made comprehensive optimizations in the dimensions of communication library, graph analysis, distributed strategy and task enable/disable, enhancing the distributed computing capability of PaddlePaddle's dynamic graph and providing support for efficient training of large models. In terms of performance, training performance is further improved by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment-variable control of the Optimizer. In addition, system stability is significantly improved by fixing related bugs.
- Auto parallel architecture with dynamic-static unification: To further reduce the difficulty of programming and optimizing large models, PaddlePaddle has fully optimized the Semi-Auto Parallel programming paradigm with dynamic-static unification, simplifying programming complexity for developers. Developers do not need to deeply understand complex concepts and APIs under the manual parallel programming paradigm, such as row-parallel and column-parallel. They only need a small amount of tensor distribution annotations to implement hybrid parallelism. The distribution specification will be propagated to all tensors and operators automatically, and the framework will handle the communication and synchronization needed by distributed training appropriately. Meanwhile, it supports dynamic-to-static distributed training by adding only one extra line of code, allowing developers to efficiently implement any hybrid parallelism strategy and greatly simplifying the development process of the hybrid-parallel training paradigm.
- Hardware Integration Solution (CustomDevice): With the increased demand for parallel training on new hardware in large model scenarios, PaddlePaddle has added support for advanced distributed policies, custom operators, and custom fusion policies. The distributed communication library is upgraded, with newly added support for many advanced distributed policies such as MP, GroupShared, PP, SP and MOE. Moreover, it allows vendors to flexibly access Transformer operator libraries of different granularities and to modify the computation graph through Fusion Pass for performance acceleration.
- Installation and development experience: Modular compilation is used to optimize the logic of the CMake code and improve the efficiency of full and incremental compilation of PaddlePaddle, which also increases the efficiency of developer workflows. It supports Python 3.12, CUDA 12, and Hopper architecture compilation, with the introduction of Clang and other tools to fully optimize code formats. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce the compiled size. These optimizations provide users with a smoother and more efficient installation and development experience.
- To avoid misuse, we removed the 0-dimensional Tensor compatibility state switch, so that API behaviors match the industry's mainstream conventions. In the previous version, we already supported the 0-dimensional Tensor, but we added a compatibility state switch to avoid error reporting of some models as much as possible. That is, in some scenarios where model suites are used frequently and modification had not been completed, we still used a 1-dimensional Tensor with only 1 element to replace the 0-dimensional Tensor by default. In this version, the compatibility state switch is removed, so the 1-dimensional Tensor with only 1 element is no longer used to replace the 0-dimensional Tensor in any scenario. The behaviors of 376 APIs that should support the 0-dimensional Tensor have been corrected and unified, to thoroughly complete support for the 0-dimensional Tensor. #57036, #54581, #54500
- To improve API usability, paddle.nn.functional.diag_embed has been streamlined to paddle.diag_embed, and use via Tensor.diag_embed is also supported. #58223
- To solve the problem of differentiation errors caused by Tensor index writing (e.g., tensor[0] = 10) under static graphs, and to comply with static graph specifications, this version introduces the paddle.static.setitem API. In static graph environments, this API is recommended for indexed write operations on Tensors instead of the subscript operator (see the sketch after this list). This change does not affect dynamic graph environments, where index writes using subscript operators are still allowed. #53682
- The paddle.fluid API is completely retired in this version. In this update, we completely removed all paddle.fluid APIs and deleted the fluid directory. Meanwhile, a small number of PaddlePaddle underlying public components have been consolidated into the paddle.base directory. PaddlePaddle users no longer need to pay attention to fluid-related concepts and APIs, which further simplifies the PaddlePaddle API system and improves readability. #56576, #54424, #54829, #53992, #54806, #55754, #55986, #55345, #56099, #51717, #54152, #55522, #55757, #58521, #54936, #55007, #55661, #55970
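A minimal sketch of the paddle.static.setitem replacement for subscript writes under static graphs; the shapes, the scalar value, and the assumption that a Python scalar is accepted as the value are illustrative only:

```python
import paddle

paddle.enable_static()

main_prog = paddle.static.Program()
with paddle.static.program_guard(main_prog):
    x = paddle.zeros([3, 3], dtype="float32")
    # Under static graphs, prefer paddle.static.setitem over `x[0] = 10.0`;
    # the call returns a new variable instead of mutating `x` in place.
    x = paddle.static.setitem(x, 0, 10.0)

exe = paddle.static.Executor()
(result,) = exe.run(main_prog, fetch_list=[x])
print(result)  # first row filled with 10.0

paddle.disable_static()
```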
This version comprehensively optimizes the basic index, advanced index and joint index functions of Tensor, to better comply with industry standards and user habits. Specifically, we added support for view in basic indexing, fixed some wrong behaviors in advanced indexing, and implemented the read function of joint indexing. In addition, we have sunk index parsing to the C++ level, improved the performance of advanced indexing operators, and removed redundant computations in bool indexing. With these optimizations, the performance of Tensor's basic, advanced and joint indexing has been improved comprehensively. #56893, #58643, #57986, #56272, #58856, #55211, #57023, #56613, #55602, #59281, #57737
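The index forms described above, in a small dynamic-graph sketch; the view behavior of basic indexing is stated only as described in the note above:

```python
import paddle

x = paddle.arange(12, dtype="float32").reshape([3, 4])

row = x[0]              # basic index (integers/slices); returns a view per the note above
sub = x[:, 1:3]         # basic slicing
picked = x[[0, 2]]      # advanced index with an integer list
mixed = x[[0, 2], 1:3]  # joint index: advanced + basic in one read
masked = x[x > 5.0]     # boolean-mask (advanced) index
```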
In earlier versions, to ensure correctness of inverse differentiation calculations, when the reverse computation of an API depends on its forward input data, PaddlePaddle avoided Inplace operations, which could overwrite the original input data. This mechanism simplified the implementation, but it also limited the ability of many APIs to offer Inplace functionality, which may affect user experience. In this version, PaddlePaddle has fully upgraded the Inplace mechanism. It automatically detects whether the reverse computation depends on forward inputs and saves the input data when needed, so more Inplace operations are supported. This improvement not only improves memory usage efficiency, but also enhances the functionality of the APIs. In addition, we have added 109 new APIs that support Inplace operations, including paddle.abs_, paddle.sin_/cos_/tan_, comparison operations such as paddle.greater_than_/less_than_/equal_, logical operations such as paddle.logical_and_/logical_or_/logical_not_, paddle.neg_ and paddle.log_. While enriching the feature set of PaddlePaddle, this improves users' efficiency and convenience in numerical computation and deep learning tasks. #54683, #55078, #55576, #56888, #55509, #57093
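A small sketch of the new in-place variants; the Tensor-method spellings (e.g. x.sin_()) are assumed to mirror the functional paddle.*_ forms listed above:

```python
import paddle

x = paddle.to_tensor([-1.0, 2.0, -3.0])

paddle.abs_(x)   # absolute value written back into x -> [1., 2., 3.]
paddle.log_(x)   # natural log, in place
x.sin_()         # Tensor-method spelling of the same in-place pattern (assumed)
print(x)
```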
- Added paddle.nn.functional.scaled_dot_product_attention. This significantly improves the computational efficiency of the attention mechanism in large models, and meets the demand for high-performance computation in large-scale deep learning models (see the combined sketch after this list). #55242
- Added a series of new scientific computing-related APIs, including paddle.cummax and paddle.cummin for cumulative maximum and minimum computation, paddle.index_fill and paddle.masked_fill for filling a tensor by index or mask, paddle.linalg.pca_lowrank for low-rank principal component analysis, paddle.hypot for calculating the length of the hypotenuse of a right triangle, and paddle.atleast_1d, paddle.atleast_2d, and paddle.atleast_3d to ensure the tensor is at least one, two, or three dimensional. We also provide paddle.select_scatter and paddle.diagonal_scatter for more flexible selection and scattering of tensor data, and paddle.multigammaln for computing the natural logarithm of the multivariate gamma function. In addition, new optimizer-related APIs are added in this version, including paddle.optimizer.lr.LinearLR and paddle.optimizer.lr.CosineAnnealingWarmRestarts for learning rate scheduling strategies, and paddle.io.SubsetRandomSampler to support random sampling from a subset of data. These added APIs will further enhance the flexibility and efficiency of PaddlePaddle in various application scenarios. #57416, #53546, #53743, #57295, #57726, #58764, #58323, #57720, #58209, #58214, #57792, #51395, #57724, #57355, #57744, #58244, #57599, #59343, #57879
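A combined sketch of two of the additions above. The input layout [batch, seq_len, num_heads, head_dim] and the float16 requirement for scaled_dot_product_attention are assumptions about the flash-attention path, and the axis argument of cummax is likewise assumed:

```python
import paddle
import paddle.nn.functional as F

# New scientific-computing helpers.
x = paddle.to_tensor([3.0, 1.0, 4.0, 1.0, 5.0])
values, indices = paddle.cummax(x, axis=-1)   # running maximum and its positions
y = paddle.atleast_2d(x)                      # promote to at least 2 dimensions
h = paddle.hypot(paddle.to_tensor(3.0), paddle.to_tensor(4.0))  # hypotenuse -> 5.0

# Fused attention (GPU-only in this sketch).
if paddle.is_compiled_with_cuda():
    q = paddle.randn([1, 128, 8, 64], dtype="float16")
    k = paddle.randn([1, 128, 8, 64], dtype="float16")
    v = paddle.randn([1, 128, 8, 64], dtype="float16")
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```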
PIR systematically abstracts the underlying core concepts such as Operation, Attribute and Type, to build a set of flexible and powerful base components for developers. In addition, PaddlePaddle can comprehensively and hierarchically manage the requirements of each module on Intermediate Representation (IR) by introducing the concept of Dialect, and supports developers in customizing extensions of Dialect according to specific needs, significantly improving the scalability and adaptability of the framework. In terms of design, PIR strictly follows the Static Single Assignment (SSA) principle, unifies the top-level structure, and realizes compatibility of "operator sequentiality" and "computational graph semantics". This provides a clear and consistent view of the complex computation process. In order to further optimize the performance of large models, PIR also provides a set of more concise and low-cost Pass development processes, including Declarative Rewrite Rule (DRR) and Pattern Rewriter. In addition, a series of rich and full-featured Pass optimization strategies are built in, to deeply optimize applications according to the characteristics of large models, thus providing strong support for the ultimate performance of large models. Through these innovative designs and optimization methods, PIR lays a solid foundation for efficient operation and continuous expansion of the PaddlePaddle framework.
- Abstracted core concepts of IR bottom layer and provided developers with flexible base components, such as Operation, Attribute, Value, Type, Trait, and Interface. #56354,#57106,#57349,#54844,#54984,#54565,#54562,#57249,#57550,#59278,#54875,#55041,#54987,#55903,#57582,#57580,#58052,#55322,#57418,#57635,#55328,#57463,#59791,#59821,#59115,#57461,#59392,#57373,#59118
- Added the Dialect mechanism to support comprehensive and hierarchical management of the intermediate representation requirements of each module of the framework. Through the built-in Builtin Dialect, it allows developers to customize and extend Dialects according to their needs. #56325,#57539,#54682,#55381,#56156,#56431,#56615,#57103,#57209
- Normalized PaddlePaddle static graph operator system. Added OperatorDialect and KernelDialect. Managed conceptual differences of operators in the form of Dialect during compilation and execution, making Architecture clearer. #56284,#54469,#58660,#58975,#56680,#54790,#54826,#54840,#55699,#55648,#55880,#56101,#56754,#54944,#56836,#57185,#58757,#56243,#56436,#57741,#59124,#57054,#56984,#57403,#57904,#58031,#56924,#59270,#55343,#56557,#55693,#54428
- Added ShapeDialect with built-in rich shape operation operators for constructing dynamic shape constraints and expressions for AI compilers. #56727,#59254,#58368,#57069,#57337,#56351,#57029,#58036,#59032,#57961,#56427,#57459
- Unified top-level structure of Framework Program, supporting compatible representation of "operator sequentiality" and "computational graph semantics", decoupling dependency on ir::Graph, and strictly following the principle of Static Single Assignment (SSA). #59369,#54563,#57051,#57306,#57857
- Added IrPrinter and IrParser components to support serialization and deserialization of PIR Programs, providing a friendly debugging experience for PIR development. #55695,#59449,#54369,#54499,#55518,#55784,#57180,#57471,#54859,#54968,#55209,#57314,#57969
- Built a new, simple and low-cost Pass development system based on Declarative Rewrite Rule (DRR) and Pattern Rewriter, with a series of rich and full-featured built-in Pass optimization strategies, to accelerate training and inference execution. #54385,#54738,#55859,#56638,#57090,#58673,#59415,#56729,#58655
- Added the ProgramTranslator component, to support one-step conversion from ProgramDesc to PaddlePaddle's new generation IR representation, with easy-to-use C++ and Python interfaces. #55433,#54470,#58044,#58390,#58100,#55403,#55406,#54719,#56550,#55448,#55453,#56294,#56308,#56842,#58517
- With help of automatic code generation technology, it can generate the full amount of static graph operator representations for PaddlePaddle framework by pressing one key. Sank static graph networking logic to C++ side and bind it to _C_ops module. This can greatly streamline API code on Python side, realize ultimate dynamic-static unification of APIs of PaddlePaddle Framework, and upgrade a lot of Python APIs to support static graph networking of the new IR. #56570,#55745,#56955,#57298,#57946,#57248,#56080,#54396,#54551,#56520,#55002,#57067,#59320,#59348,#57164,#57267,#59064,#54340,#54895,#55004,#56196,#56862,#58991,#55428,#55909,#56241,#56526,#56571,#56518,#57016,#56653,#56809,#57158,#55422,#55458,#55432,#55467,#55483,#55419,#55517,#55500,#56674,#57693,#55008,#57166,#57157,#57159,#57175,#57325,#57330,#57415,#57122,#57393,#57344,#57667,#57348,#57700,#58093,#58005,#58081,#58094,#58137,#58287,#58352,#58340,#58363,#58331,#58343,#58317,#58450,#58377,#58466,#58470,#58491,#58546,#58587,#58453,#58634,#58604,#58605,#58593,#58675,#58699,#58384,#58629,#58579,#58695,#58548,#58688,#58792,#58843,#58840,#58718,#58883,#58785,#58608,#58781,#58783,#58429,#58685,#58696,#58690,#58831,#58929,#58740,#58937,#58782,#58833,#58882,#58935,#58931,#59041,#59040,#58877,#58888,#59042,#58780,#58682,#58815,#58676,#58678,#58446,#59077,#59091,#58661,#58832,#58642,#58698,#59313,#59371,#58700,#58953,#58879,#59469,#59573,#59481,#59419,#59509,#58735,#59616,#59582,#59420,#59500,#58911,#59535,#54891,#56794,#57477,#57929,#57765,#58693,#58603,#56291,#57123,#57317,#57341,#57020,#57324,#57761,#57762,#57907,#57909,#58099,#58110,#58114,#58139,#58144,#58165,#58194,#58138,#58113,#58245,#58318,#58105,#58348,#58235,#58354,#58341,#58445,#58418,#58239,#58473,#58239,#58391,#58501,#58519,#58416,#58588,#58531,#58730,#58773,#58862,#58946,#58500,#56585,#57480,#57433,#58498
- Upgraded the static graph executor to extend more Kernel Instruction types, and supported loading and efficiently scheduling the execution of PIR. This brings significant GPU memory and performance gains in training and inference. #54570,#58665,#57291,#54452,#57431,#54692,#55112,#55210,#55401,#55772,#55828,#56148,#54763,#56886,#57284,#57268,#57791,#56789,#56704,#57594,#58397,#58337,#58756,#58371
- Reconstructed the auto-differentiation module for PIR, and migrated and adapted the high-order auto-differentiation function. Optimized the Stop Gradient transfer mechanism, so the logic is clearer and the functionality more robust. #55660,#57084,#56890,#58942,#59373,#57206,#58145,#55235,#57255,#56925,#55957,#56163,#56316,#57294,#57449,#59520,#59565,#56265,#56512,#56650,#57183,#57956,#59100
- Optimized design and representation of control flow forward and reverse operators, introduced ControlFlow Dialect, and supported conversion and execution from control flow operators to PIR under ProgramDesc. #58729,#57364,#58625,#57475,#57265,#56799,#59033,#57342,#57801,#57958,#57949,#57937,#59231,#59496,#59321,#58088,#58198,#58024,#58089,#58086,#59175,#59423,#59567,#58098,#58163,#58250,#58277,#58355,#59020,#59200,#59585,#58109
- Upgraded the dynamic-to-static execution flow to support PIR, optimized the dynamic-to-static subgraph Pass mechanism, and allowed users to try PIR-based features under the @to_static function. #57566,#55620,#56791,#57357,#59152,#59312,#58630,#56035,#59447,#57361,#59261,#59774
- Upgraded the combination operator functionality by introducing the concept of Backend to manage the logic of the combination operator module of dynamic and static graphs in a hierarchical way. Sank necessary components and operator splitting rules into C++, to dramatically reduce maintenance costs. #58153,#56391,#56614,#57030,#57554,#58018,#58130,#58581,#58679,#59054,#55480,#58451,#55647,#56342,#56798,#57561,#58023,#57722
- Added structure-optimization Passes for PIR Programs, such as DCE and constant_folding_pass. #54935,#59430,#58753,#58732
- Added operator-fusion Passes, such as fused_attention, fused_dropout_add, fused_gemm_epilogue_pass, fused_linear_param_grad_add_pass, fused_weight_only_linear_pass, and fused_softmax_mask_upper_triangle, to improve training and inference performance. #57557,#58272,#58188,#58401,#59366,#57655,#57360,#56672,#58537,#56247,#59391,#58897,#54933
Dynamic-to-static graph conversion is a key technology in deep learning frameworks. It allows developers to find the best balance between flexibility and training efficiency. This version of PaddlePaddle has fully upgraded the core dynamic-to-static functionality. The success rate of dynamic-to-static training reaches 100% among 700+ models in the PaddlePaddle industry-grade model library.
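A minimal example of the one-line dynamic-to-static switch via paddle.jit.to_static; the tiny network and its tensor-dependent branch are illustrative only:

```python
import paddle

class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 4)

    def forward(self, x):
        y = self.fc(x)
        # Tensor-dependent control flow is handled by the adaptive
        # graph-break / AST hybrid mechanism described below.
        if paddle.mean(y) > 0:
            y = paddle.tanh(y)
        return y

net = paddle.jit.to_static(Net())      # the one-line switch to static-graph execution
out = net(paddle.randn([2, 8]))
paddle.jit.save(net, "./net_static")   # export an inference model
```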
- Adopted Python Eval Frame and VM simulation execution technology to innovatively implement an adaptive Graph Break mechanism. This mechanism is especially designed for control flow scenarios. By introducing the CallLayer mechanism, it makes full use of PaddlePaddle's advantage of dynamic-static unification, and supports a hybrid mode of Abstract Syntax Tree (AST) and bytecode simulation. It efficiently captures control flow operators, thus dramatically improving the ability of the computational graph to be made static. At the cache optimization level, advanced optimization technologies such as common sub-expression elimination are fused in, significantly improving the execution efficiency of Guard. These optimizations not only reduce redundant computations, but also improve overall system operation speed. To enhance the robustness of the system, a simple and efficient data intermediate layer structure is designed. The structure supports correctness recovery of SideEffects, ensuring the stability and reliability of the system in complex environments. In addition, it is widely compatible with mainstream interpreter versions from Python 3.8 to 3.11, providing users with a wide range of applicability. #57824,#55887,#58155,#56107,#57490,#58829,#57240,#57588,#58117,#59823,#56077,#58956,#57653,#59855,#59017,#58424,#58187,#57793,#59698,#59747,#59710,#59297,#58423,#56262,#58103,#58538,#58771,#59191,#57754,#59439,#59816,#59035
- Added dynamic to static syntax transcription parsing for PyLayer functions, making PyLayer's conversion between dynamic and static graphs smoother. Users can now seamlessly carry out dynamic to static training on PyLayer, to easily export inference models. #56108,#56531,#57066,#57633
- Fixed the issue of abnormal GPU memory usage in some dynamic-to-static scenarios with is_test=True mode. #58350
- Fixed the issue that exporting a jit.save model from a function decorated by @to_static fails in scenarios like foo(x, x, y). #55963
- Fixed the issue that the dynamic and static graph logic of some API behaviors was inconsistent. This improves the success rate and user experience of dynamic-to-static graph conversion. #56092
- Fixed a potential security vulnerability in use of eval() in dynamic to static syntax transcription module. #60100
In order to meet the needs of large models, this version focuses on improving the distributed computing capability of PaddlePaddle's dynamic graph. Various improvements have been made in the communication library, graph analysis, distributed policies and task enable/disable, to provide comprehensive support for large model training. In terms of performance, we further improved training performance by reducing pipeline-parallel GPU memory occupation, adopting TensorFusion technology, implementing communication-computation overlap, and reducing non-essential data synchronization copies. Meanwhile, the flexibility of hybrid-parallel debugging is improved through environment-variable control of the Optimizer. In addition, system stability is further improved by fixing related bugs.
- Added the TraceHang function in the communication library, to quickly locate the faulty node when a Hang problem occurs in cluster training. #59217
- To improve training efficiency and reduce memory usage, the dynamic graph supports the stride mechanism. #55156,#54762,#55850,#59190,#57005,#57331,#58033,#58303,#57835,#57189
- Enhanced paddleviz function to facilitate analysis of computational graphs. #56837,#57626
- In distributed Sharding strategies (Stage1,2,3), added main_grad function to support higher precision gradient accumulation, and reduce precision loss caused by low precision accumulation. #57972,#57934,#57473,#57537,#59611,#57960
- In Sharding Stage1 strategy, added a switch variable to control whether to perform fusion calculation on Optimizer. #58790
- In Recompute function, added support for Tuple input parameters, enhancing calling ability of Recompute interface. #56793
- Enhanced Launch function, allowing distributed training without specifying endpoints in dynamic graphs. #54636
- Implemented new communication library with dynamic-static unification. Communication operators are fully adapted to PHI operator system, reducing development and maintenance costs to better support dynamic graphs and auto parallel architecture upgrade. #54417,#57768,#57897,#55537,#56604,#57519,#56088,#57153,#57161,#57252,#57251,#57208,#57305,#57424,#57548,#57560,#57564,#57233,#55726,#58073
- TCPStore is changed to a single instance to support dynamic graphs and auto parallel more flexibly. #55956
- Improved maintainability and flexibility of distributed policies such as MP/PP/SP, including addition of printing warning and error messages, structural cleanup of code files, and optimization of PP restrictions on inputs. #54448,#59762,#55462,#54788,#54664,#56456,#55540
- In PP strategy, added support for P2P communication in computation flow, making communication mode more flexible. #54747
- Sharding strategy supports reduce Operation on gradient. #58842,#57967,#55495
- Implemented timely release of the last layer in the PP strategy, to save GPU memory. #54505
- In MP strategy Tensor fusion, supported incoming params group to enhance Tensor fusion function. Improved allreduce asynchronous communication performance, and enhanced training performance through overlap of computation and communication. #57690,#55662
- In the Sharding strategy, overlapped reverse computation and gradient communication to improve training performance. For Sharding Stage1, added Tensor fusion and fused grad clip and optimizer to improve computational efficiency. Supported overlap between VPP and DP/Sharding Stage1, to improve communication and computation parallelism. Optimized the performance of Sharding Stage1 under FP16: in the check-finite stage, only the gradients this sharding rank is responsible for are checked, to reduce computation overhead; added environment variables to control whether the Optimizer is performed, to save GPU memory and enable model training debugging with fewer resources. #55598,#55427,#56063,#55766,#59848
- In the Hybrid Parallel strategy, moved Tensor fusion under PP/VPP to run ahead of time, to solve the extra GPU memory overhead of runtime fusion. Improved model training performance by reducing non-essential synchronous memcpy. #54403,#57215
- Fixed 13 bugs in PP, Launch function, MP strategy, and fuse_rope, to enhance stability of distributed strategies. At mechanism level, fixed errors of inplace and tensor reference to improve stability. #55116,#55782,#59609,#57394,#55864,#58482,#54571,#55896,#54648,#58307,#55679,#58133,#58408,#59707,#55342,#54703,#54869,#55568,#55233,#56418,#56428,#56892,#57192,#59161,#59340,#57006,#57353,#57352,#59088
- Fixed the bug that the PP strategy cannot release single-layer outputs in time. Fixed the bug that the initialization process may hang. #54624,#58844,#54673,#58376
- Fixed the bug that the calculation is wrong when the input data type is not uniform under the MP strategy. Fixed the bug of parameter synchronization under the MP strategy. Fixed the bug that user input config is not used correctly. #58858,#57918,#58037
- Unified judgment method of dygraph and dynamic mode. #54633
- Fixed the bug that the shapes of sin and cos in fuse_rope are incorrect. #56132
- Fixed the bug that tasks fail due to long endpoints in Launch distributed scenarios. Fixed the bug that endpoints may be out of order. #55011,#55478
- Fixed the bug that the MEA function may cause a segmentation fault. #55408
This release fully optimizes the Auto Parallel programming paradigm with dynamic-static unification to simplify programming complexity for developers. Developers do not need to understand complex concepts and APIs in the manual parallel programming paradigm, such as row-parallel, column-parallel, and so on. Only a small amount of tensor distribution annotations is required to build a hybrid parallel model. The framework handles the derivation of the distribution states of all tensors and operators, and adds appropriate communication operators. Meanwhile, it supports dynamic-to-static distributed training with just one extra line of code, enabling developers to efficiently and easily implement any hybrid parallel strategy. This can significantly reduce the development cost of hybrid-parallel training code.
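A minimal sketch of the annotation-based paradigm, assuming two visible GPUs and a launch via `python -m paddle.distributed.launch --gpus 0,1`:

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

w = paddle.randn([8, 8])
# One annotation is enough: shard dim 0 of `w` across the "x" mesh axis.
w_dist = dist.shard_tensor(w, mesh, [dist.Shard(0)])
# A replicated placement keeps a full copy on every rank.
b_dist = dist.shard_tensor(paddle.zeros([8, 8]), mesh, [dist.Replicate()])

# Downstream operators consume distributed tensors directly; placements of the
# outputs are derived automatically and communication is inserted as needed.
y = paddle.matmul(w_dist, b_dist)
```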
- Implemented auto parallel core APIs such as process_mesh, placement, shard_tensor, reshard, dtensor_from_fn, unshard_dtensor, shard_layer, to_static, and so on. #55494,#59059,#56561,#54425,#59557,#59682,#56565,#59862,#59856,#59342,#59575,#57604,#57293,#57278
- Implemented Sharding derivation rules based on Enisum expressions, and completed 20+ classes of operator Sharding derivation rules, which covers LLaMA, GPT and other transformer-like large language models. #55196,#53863,#56257,#55394,#54810,#55508,#56257,#57813,#58149,#58506,#58563,#58360,#58920,#59050,#58760,#59083,#59236,#59350,#59411,#59260,#54373,#54991,#55397,#55350,#55177,#56443,#58097,#56509,#56502,#56504,#56506,#56507,#56505,#57176,#57374,#57573,#57545,#57875,#57866,#58854,#59109,#59185,#58913,#59547,#58296,#59545,#59039,#59002,#58087,#56367,#57877,#56839,#59003,#57269,#55130,#58474,#57197,#57467,#57259,#57280,#56508
- Implemented distributed checkpoint storage and loading with dynamic-static unification. Supports ReShard when storing and loading under arbitrary sharding states. #59659,#59843,#60033,#60034
- Basic data structure supplementation: Added DistTensor, Placements and other distributed-specific basic data structures on the C++ side, and exposed them to the Python side. Supports debugging and printing of related attributes and values. #58930,#59068,#55436,#56449,#59683,#55593,#58032,#56368,#59086
- Added SPMD derivation and Reshard generation logic in execution flow for all operators, and adapted to multiple types of inputs and outputs such as vector and optional, as well as special mechanisms such as cpu fallback and multi-kernel selection. #56602,#57321,#57092,#56831,#57119,#58819,#58254,#55698,#59241,#59328,#58644,#56202,#59159,#58573,#59246,#59133,#59186,#57505,#57241,#58928
- Adapted auto parallel execution logic for special types of operators, such as custom operators. Supports automatic conversion of DistTensor and DenseTensor as mixed inputs. #57774,#59108,#58436,#59523,#59136,#59352,#59062,#58434,#59148,#58553,#58716,#58369,#59061,#58841,#59139,#59141,#58837,#59137,#59143
- Optimized dynamic graph execution system: Adapted the Autograd execution process. Supports the dynamic graph's reverse gradient aggregation, AMP, Hook, PyLayer, View, custom operators, and other surrounding mechanisms. #58437,#58769,#58796,#58339,#58409,#58772,#58380,#58447,#58706,#58656,#58172,#59401,#58727,#58238,#59243,#58469,#58442,#58487,#58476,#59706
- Added support for Pipeline Parallelism, Sequence Parallelism and other distributed parallelism. #58126,#59766,#59060,#59841,#58609,#59688,#58449,#59598
- Added various Reshard strategies and supported tensor conversion between different distributed states. #58592,#59138,#59367,#59621,#59758,#59777,#56975,#58550,#58703,#57210,#58734,#56833,#59292,#57432,#57568,#56553,#58284,#56039,#55552,#56149
- Added Sequence Parallelism; added FThenB, Interleaved 1F1B, Eager 1F1B, VPP and other scheduling modes for Pipeline Parallelism, and supported hybrid parallelism between the above new parallelism modes and the original ones. Supported visualization of pipeline scheduling. Upgraded the gradient synchronization mechanism, which supports gradient synchronization when data is sharded on any broadcast dimension. #57605,#54727,#54409,#54787,#58313,#59179,#59416,#59719,#59822,#59057,#59522,#57061
- Adapted the executor to PIR, and supported PIR optimization Passes. In distributed scenarios, supports fusions such as fuse_linear, to improve performance. #58459,#58528,#55555,#59757,#59102,#57917
- Upgraded underlying architecture: upgraded the executor to reuse the results of data-flow dependency analysis and static kernel selection; upgraded entire graph based sharding completion mechanism, to switch to new sharding derivation rules and support some long-tailed cases; optimized the support of control flow under distributed static graph to adapt to more scenarios; reduced the graph compilation time and refined error message format to improve user experience. #55389,#55650,#54938,#57447,#57751,#57742,#59524,#59526,#58669,#57616,#56511,#55727,#58906,#56016,#54897
- Optimized GPU memory usage in static graph mode, and added a refined recomputing strategy; optimized the auto mixed precision pass, and allowed users to manually specify the auto-cast region, with some bugs fixed; supports parallel computation of cross-entropy; supports fusion operators such as scaled_dot_product_attention, fuse_rope, etc.; performed scheduling optimization to support better overlap between communication and computation in tensor parallelism and pipeline parallelism. #58421,#58533,#59498,#59187,#59188,#58172,#58628,#56185,#56696,#59497,#58304,#58977
This release implements a profiling based automatic search and tuning tool named AutoTuner for parallel strategies, to automatically combine parallel and optimization strategies. Users can select effective combination configurations for experiments, and AutoTuner will search for the optimal configuration for large model training and inference given the model and hardware specification. In addition, AutoTuner implements a variety of pruning methods, including gpu memory modelling based pruning, so the search space and search time can be significantly reduced. #54460,#54668,#59794,#59727,#59782,#54834,#58127,#56968,#55466,#56939,#58183,#58314,#55499,#59748
In order to improve maintainability of PaddlePaddle framework, some deprecated operators in the framework (e.g. diag_v1, isfinite_v1, pad2d_v1, etc.) have been removed, and models using these operators saved through the PaddlePaddle 1.x training will not be able to infer on new version of PaddlePaddle. #57895,#57892,#57898,#57730,#57732,#57810,#57884,#57794,#57926,#57925,#57807,#57808
- The complex kernels of PaddlePaddle PHI operator library have been further enhanced, and a total of 40+ complex kernels have been added. #55380, #56349, #56412, #56323, #56723, #56457, #56903, #56914, #57116, #56048, #57244, #57639, #57638, #57540, #58545, #58336, #58532, #58839, #59079, #59277, #59122, #57058
- Optimized and added XPU kernels for some operators, and enhanced the support for data types such as bfloat16 on XPU kernel. #54478, #57740, #58346, #58456, #58662, #59066, #59263, #59375, #59505, #59653, #55001, #57272, #56169, #59454, #59480, #55914, #54758, #54827, #58364, #58419, #58982, #57216, #59166, #55033, #55375, #58805, #59389, #57077, #55166, #56773
- Added some operators for optimizing large model training and inference performance. #55758, #54998, #55400, #54630, #55969, #55026, #58986
- Improved the Tensor Strided mechanism in the operator library. #59422, #59325, #56863, #56882, #56947
- Optimized function implementations and template functions in some kernels to reduce the size of the compiled library package. #57083, #57299, #57261, #57290, #57118, #57551, #57509, #57558, #57064, #57365, #57327, #57603, #57671, #57672, #57631, #57082, #57721, #57823, #57821, #57815, #57822, #57541, #57817, #57838
- Fixed some bugs with CUDA 12 adaptation of the PaddlePaddle framework. #54640, #57820, #58958, #58179, #55594
- Added the debugging API paddle.amp.debugging.check_numerics, which calculates and returns the number of outliers (NaN, Inf) and zero elements in the Tensor value. #54301
- Added the fused_rope fusion operator to accelerate training of LLaMA-class large models. #54351
- Updated CUDNN Frontend API version to v0.9.1 and added fused_scale_bias_add_relu fusion operator to accelerate ResNet networks. Note this feature is in experimental period and is disabled by default. #58367, #54949, #58504
- Based on Flash-Attention v2, added support for a Tensor-like Mask. The reverse operator supports deterministic computation for debugging. #57276, #56363
- Modified sparse conv3d backend implementation to support 2d shapes, avoiding front-end reshape overhead. #54707
- Added matmul_int8 operator. (#55228)
- Optimized CUDA Graph’s support for random number operators.#58310
- Enhanced the default functionality of automatic mixed-precision training (a usage sketch follows this list), including:
- Optimized the experience of using the automatic mixed precision training interface. #58152,#55364,#57903
- Added matrix computation class operators such as fused_attention, fused_feedforward, and fused_gemm_epilogue to framework's default whitelist, and unified default black and white list settings for dynamic and static graphs. #55373, #55713
- The argsort, dist, erfinv, nanmedian, poisson operators and lamb optimizer operators support FP16 and BF16 low precision computing. #51662, #55105, #55287, #55824, #56056, #56184, #55641
- Fixed elementwise_max operator low-precision implementation. Changed to use FP32 type for numerical computing, and reduce precision loss. #54799
- Changed temporary result Tensor needed for Reduce class operator to FP32 type, to avoid precision loss caused by converting intermediate result to low precision. #55709
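A usage sketch of the default AMP workflow that the items above refine; it assumes a CUDA device so that white-list ops actually run in FP16:

```python
import paddle

model = paddle.nn.Linear(16, 16)
opt = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

data = paddle.randn([4, 16])
with paddle.amp.auto_cast(level="O1"):   # default black/white lists apply here
    loss = model(data).mean()

scaler.scale(loss).backward()            # scale the loss to avoid FP16 underflow
scaler.step(opt)                         # unscale gradients and update parameters
scaler.update()
opt.clear_grad()
```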
- Optimized GPU codes for flip, roll & roll_grad, index_put & index_put_grad, etc. Removed unnecessary C++ templates to optimize compilation time and reduce compiled binary size without performance degradation. #57309, #57525
- For the bernoulli operator, added a check on legitimacy of input probabilities. #59174
- Optimized BroadcastKernel's support for large Tensors. Changed to split the large Tensor and call the INT32 version implementation multiple times, improving operator performance by 7.27x. #57313, #57996
- Optimized performance of Tensor save interface by copying the Tensor to CPU and then converting to numpy, to avoid overhead of automatically converting the Tensor to a continuous Tensor when Tensor is not continuous. #57040
- Fixed the bug of the memory_efficient_attention operator when supporting sm_90. #58070
- Fixed the NaN problem of softmax operator when axis=-1 and length is greater than 100000. #57851
- Fixed bug of GPU access error in some cases for set_constant operator. #59905
- Fixed GPU storage read/write contention issue in fast implementation version of layer_norm operator. #56435
In this update, PaddlePaddle CINN focuses on optimization of the architecture and comprehensive expansion of its capabilities. In view of the increasing demand for dynamic shapes in large models, effective operation and optimization strategies of the compiler under dynamic shapes are initially explored and implemented. At the architectural level, a Python DSL is introduced, significantly improving CINN's development convenience and debugging capability and enabling developers to write and debug code more efficiently. Meanwhile, the Schedule logic has been refactored to be dominated by GroupSchedule, enabling more general and stable optimization strategies at the operator Group level. In order to enhance the stability of CINN, a strong-constraint component is explored and introduced, which can effectively reduce uncertainties and potential errors in the system. In addition, CINN's historical tool classes and software structure are systematically organized, optimized and improved, to further enhance the readability and maintainability of the code. In terms of integration with other PaddlePaddle components, the tight integration of CINN with PIR and Paddle has been further strengthened, making the compiler more coherent with the overall PaddlePaddle framework. This improvement not only enhances the performance of the compiler, but also provides developers with a smoother and more unified development experience.
- Updated storage read interface to be compatible with Paddle 2.0. #55836
- Updated relu6 Op Mapper compatibility. #55611
- Removed old Schedule form. #55566,#55391
- Removed some obsolete tests. #56245,#57987
- Removed the remove_nested_block Visitor tool that no longer works. #56972
- Removed other useless codes. #55413
- Added CINN paddle.framework.core.is_run_with_cinn() API on the PaddlePaddle side. #54355
- Added CINN-related operator logic, including decomposition logic for various combination operators. #56072,#58210,#58502, #58591, #58981, #59135, #59274, #59306, #59202, #59176, #59534, #59713, #59798; Supports bf16, amp and other forms #54399, #54368, #54608; Supports 0-dimensional Tensor capability for operators #54892, #54919, #54907, #54966
- Supports the joint operation mode of CINN with PaddlePaddle PIR and combination operators, so that the new PIR and CINN operation is integrated. #54732, #56074, #58216, #55680, #56302, #59037, #55186, #58641
- Added strong-constraint components to stabilize CINN changes. #58719, #59309, #58993
- Added Group Schedule related processes to the CINN architecture. #58399, #56444
- Preliminarily added CUTLASS support, error handling, and the NVRTC Cubin Fmad option to CINN architecture functions. #58079, #57198, #58794
- Added a Python interface (DSL) for CINN. #57731, #57515, #57644, #57981, #58009
- Added dynamic shape functionality for CINN, covering ASTGen to generate dynamic shape symbols in place of ISL #56360, #57207, #57454; Added Bucket Conditional Compilation functionality #59165; Added Schedule, Device, and IR level support for dynamic shape #58988, #59493, #58717, #58602, #59196
- Supports the CINN Group Schedule operator: performs more general and stable Schedule optimization at the Group level. #56122, #57777, #57569
- Enriched and improved operator functionality, including improvements to various operator processes such as reverse-pass fixes, FP16, InferShape, operator unit tests, etc. #56320, #56845, #54939,#54378,#55321,#55336,#55337,#55442,#55470,#55489,#55510,#55547,#55505,#55563,#54280,#59650,#54862,#55135,#55292,#55333,#55316,#55379,#55326
- Improved the joint operation of CINN with PaddlePaddle PIR and combination operators, including mutual support between PIR (and its executor) interfaces and CINN. #59170,#58766,#59255,#59203,#59024,#57829,#58135,#58193,#58207,#58606,#59437,#59759,#55075,#56805,#57764,#58620,#59769,#58702,#58749,#59025,#58820,#58908,#58169
- Improved the strong-constraint components that stabilize CINN changes. #55090,#55705,#57587,#59501
- Improved CINN IR and related tool codes. #55145,#55955,#56307,#55519,#56958,#57019,#57230,#57531,#57532,#57524,#58770,#59337,#59096,#56274,#56350,#57312,#55171
- Supports the CINN Group Schedule operator: performs more general and stable Schedule optimization at the Group level. #54982,#57963,#58220,#55484,#55935,#55590,#56530,#58344,#59810
- CINN architectural improvements, including parallel compilation, low-level storage allocation method, print information, Group structure, Pass structure, etc. #56282, #59014,#59209,#52660,#54749,#58694,#58940,#59504,#56123
- Improved CINN codegen, jit instruction, dim args, and host kernel to support dynamic shape. #58825,#59395,#59398,#59540,#59470,#59640
- CINN error reporting optimization. #54983,#55544
- Improved cleanup of CINN codes, including CI, file paths, C++17, Flags, third-party libraries, Docker, etc. #55018,#55121,#55009,#55888,#56168,#56192,#56896,#53861,#55208
- Fixed operator-related bugs. #56280,#57767,#58406,#54406,#54494,#54751,#55674,#55684,#55683,#57798,#57816,#57687,#56719,#59756,#59770,#58811
- Fixed process architecture-related bugs. #54899,#59737,#59356,#56105,#56662,#58146,#58910,#58121,#58943,#58886,#59642,#56164,#56338,#56966,#59112,#55820,#56660,#57307,#57530,#58236,#55190,#55043,#55667
- Fixed other bugs. #57239,#55530,#56605,#58243,#58197,#56086,#56065,#58775,#54750,#58595,#58873
- Added README file. #58349
This version upgrade improves the performance and ease of use of the inference engine on GPU and CPU, reducing user cost and the application cost of online inference. On GPU: a high-performance multi-threaded asynchronous executor is supported, and the inference performance of each model is improved by 5%-10%. The new version of TensorRT and BF16 inference capabilities are also supported, and TensorRT inference performance and ease of use are further improved. On CPU: the latest version of OneDNN high-performance inference is supported. SwinTransformer, FastRCNN and other series of models see greatly improved performance.
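A short Paddle Inference sketch of the GPU and CPU paths this release speeds up; the model file names are placeholders for any exported inference model:

```python
import numpy as np
from paddle.inference import Config, create_predictor

config = Config("model.pdmodel", "model.pdiparams")  # placeholder file names
config.enable_use_gpu(256, 0)      # GPU path: 256 MB initial memory pool on device 0
# config.enable_mkldnn()           # CPU path: OneDNN optimizations (on by default now)

predictor = create_predictor(config)
name = predictor.get_input_names()[0]
handle = predictor.get_input_handle(name)
handle.copy_from_cpu(np.random.rand(1, 3, 224, 224).astype("float32"))

predictor.run()
out = predictor.get_output_handle(predictor.get_output_names()[0]).copy_to_cpu()
```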
- matmul supports transpose and broadcast operations. #56827
- TruncatedNormal and Assign support the FP64 data type. #57507
- Supports conv2d explicit quantized inference. #57160,#58015
- Added conv_fuse_pass. Supports conv + bn fusion. The conv2d_fusion operator is renamed fused_conv2d_add_act. #58724,#55374,#54477,#59431
- Mixed precision inference supports OP whitelisting. #56535
- OneDNN optimization is enabled by default. Supports inference optimizations for SwinTransformer, FastRCNN and other models. #58560,#59394,#59421,#58435,#58488,#59259,#56303,#56782,#57598,#58361,#59641,#59527,#59663,#59744
- Added support for passing specified data into share_data. #57933
Fine-grained fusion inference optimization of generative large models is realized. The optimization solution ensures high-performance inference and excellent extensibility. Users can flexibly combine various fine-grained fusion operators and PaddlePaddle native operators to build the network structure of generative large models as required, thus achieving efficient and low-cost inference. In addition, our solution also supports mainstream generative large model structures, significantly reducing the deployment cost of inference for such models and strongly supporting efficient and low-cost implementation of generative large models.
- Supports the FMHA/MMHA for CacheKV division block scheduling. #59462
- RoPE encoding fusion operator supports input sin/cos values. #55415
- Added fine-grained fusion operators. Supports high-performance inference optimization of generative large models. Added operators such as quant_linear, weight_quantize, and linear_compress to support large model quantization inference. #57852,#55128,#59090,#56706,#59951,#55490,#59291,#59441,#59778,#59651,#55301,#58637,#56673,#56401
- Supports variable length inference series API. #57948
- Supports the GQA inference. #58472,#58836
- Added masked multihead attention. Supports high performance MMHA inference. #55344,#56411,#58134,#57936
- weight_quantize/weight_only_linear supports the Volta architecture. #58082
- Added weight_only_linear_grad to support gradient back-propagation for large model weight-only quantization. #57685
- Fixed large model dynamic to static bug. Optimized communication initialization logic between static graph cards. #56390,#57169,#56688,#56592,#58868
- Optimized top_p_sampling random number generation logic. #59494
- elementwise_add fusion supports NHWC format. #56795
- conv2d supports filter as input. #55246
- Supports BF16 and FP64 inference. #59765,#55520
- Added MarkTrtEngineOutputs API. Users can specify TensorRT Engine outputs. #56858,#56188,#57407
- Customized OP can generate TensorRT Plugin automatically. #58976,#56037
- TensorRT inference allows users to specify input hook to optimize shape collection process. #59466,#54841,#57498,#54861,#54432,#55503
- TensorRT inference supports running inference with models saved after tuning. #55893,#56952,#57031
- Supports variable length Transformer model PromptTuning. #57034
- Added operators such as bitwise_and, bitwise_or, bitwise_not, cumsum, einsum, lookup_table, assign, flip, size, scatter, solve, unbind, reduce, and argsort. Optimized support of existing operators. #59214,#59293,#54882,#54097,#54860,#55426,#54372,#55688,#56069,#59563,#59317,#59424,#55476,#56043,#58549,#57326,#59409
- TensorRT enables GPU memory sharing by default. #59495,#58251
- PrelnResidualBiasPluginDynamic supports 4D input. #56304
- Added support for FlashAttention for Paddle-TRT inference for architectures below SM80.#56492
- Fixed the link flags conflict issue of the inference .so library. #59755
- Fixed constant_folding pass execution error. #55556
- Fixed the softmax forward speed bug and reverse accuracy bug. #56036,#57858,#57538
- Fixed the custom OP while-loop error and the export bug. #58898,#59318
- Fixed CUDA 12.0 compilation problem on Windows platform. #59852
- Fixed the error of some inference operators when the TensorRT version is later than 8.6. #54379,#54679,#54251
- Fixed and removed inference fusion Pass. #54846,#54887,#55573,#56434,#56326,#56753,#57491,#56909,#54536,#55073,#55081,#55240,#56439,#59009
- Fixed error of multi-stream inference context switching. #57629,#58048,#54994
In this update, support for advanced distributed strategies, custom operators and custom fusion strategies is added. By upgrading the distributed communication library, PaddlePaddle supports MP, GroupShared, PP, SP, MOE and other advanced distributed strategies. Meanwhile, it enables vendors to flexibly access Transformer operator libraries of different granularities and modify the computation graph through Fusion Pass for performance acceleration.
- Upgraded CustomDevice to support Paddle's latest distributed communication library CommContext. Added a variety of advanced distributed strategies such as GroupShared and MOE. #56301,#54671,#57957,#56669,#54384,#54572,#54573,#54676
- Upgraded CustomDevice to support CustomOP. Users can register operators not defined in the Paddle PHI operator library, and CustomDevice can support CustomOP via the C API. #57038,#55532,#56755,#55533,#55659
- Added CustomDevice's support for CustomPass function. Modified the computation graph IR through Python API. #55511,#55728
- Added CustomDevice’s support for Paddle run_check. #56318
- Added CustomDevice’s support for StreamSafeAllocator. #55393,#56380,#56536,#58035
- Added CustomDevice’s support for DataTransform. #56627
- Added CustomDevice’s support for more PaddlePaddle APIs such as Variable.set_value, adamw, share_external_data, mp_allreduce_sum, tensor.numpy, get_paddle_place, and GeneratorState. #55272, #56386, #57253, #56927,#56189,#55225,#55247
- Modified CustomDevice dynamic library loading method from RTLD_NOW to RTLD_LAZY, to facilitate subsequent checking of compatibility of CustomDevice related software stack version. #57544
- Added CustomDevice's detection function for FP16 operator under mixed precision training. #56053,#56176
- Fixed some problems in CustomDevice's support for distributed communication libraries. #55293,#58038,#59800
- Fixed some problems in CustomDevice on some operators, including c_softmax_with_cross_entropy, data loader, SplitDenseTensor, grad accumulation, and atan2 grad. #56486,#55541,#55615,#56052,#56067
- Fixed some problems of device management in CustomDevice, including device exceptions (#56556,#58639,#55173), exception events (#56745,#58059), device memory exceptions (#56977,#59247,#54606), device initialization (#57099,#57994), device release (#54932,#55351,#55783), and device resource pooling, etc. (#55229,#56580)
- Fixed CustomDevice compilation-related issues. #56760,#56766
- Added XPTI (XPU Profiling Tool Interface) to support collection and analysis function of runtime performance data. #54685,#54690,#54800
- Supports Paddle's latest distributed communication library CommContext. #59418
- Added XPU fusion operators, for example, fast_where. #55628
- Added support for the XPU Plugin function, facilitating users to develop XPU customized operators through XTDK programming. #55101,#59326
- Added XPU’s support for AutoGrowthAllocator. #54121
- Added operator support list of Kunlun3. #57683
- Upgraded XPU Inference API. #54342
- Optimized performance of some XPU operators. Added support for bf16 in some XPU operators, including unique/index_put, squeeze/unsqueeze kernels, swish/swish_grad, scatter_nd_add_grad/slice, rsqrt/bitwise_or/arange_tensor, where, collective. #56582,#58161,#58440,#58580,#58950,#58616,#59273
- Optimized XPU memory management to avoid memory leakage. #59334,#54847
- Supports INT8 inference. #57258
- Added support for FP16 series inference operators. #55642,#54410
- Supports share_external_memory interface to pass input and output. #55170
- Supports open source quantization model XPU inference. #58568
- Added context_gm_size configuration, instead of allocating global memory in Pass. #54674
- Added embedding and fast_gather_nd plugin. #56488,#56103
- Supports fusion of fast_layternorm + leaky_relu. #57113
- Supports elementwise_min/max/floordiv/where inference in KL1 and KL2 precision. #58422
- Supports autotune configuration of fc and conv2d operator. #58801
- Supports conv and fc dynamic quantization. #59307
- fc + act fusion support for sigmoid, swish and relu6. #54486
- elementwise_sub/elementwise_div supports int data type. #55920
- Fixed XPU communication library issues and some operator issues including rnn, layer_norm_grad, yolo_box. (#55475,#55515) (#55656,#54669,#55310)
- Fixed some operator bugs of Hygon DCU, including rnn, concat/split, fft, and so on. #59402,#55821,#56340
- Fixed issues related to communication library of Hygon DCU. #57110
- Fixed compilation-related problems of Hygon DCU. #59775,#55507,#55612,#54952,#55076,#56079,#54874
- Fixed support issue of Hygon DCU for BF16 data type. #56517
Adopted modular compilation to optimize the CMake code, improving the compilation efficiency of PaddlePaddle and increasing the efficiency of local development. Meanwhile, it supports compilation with Python 3.12, CUDA 12, and the Hopper architecture, and uses the Clang tool to comprehensively optimize code formats. In addition, C++ unit tests are changed from linking static libraries to linking dynamic libraries to reduce the compiled size. These improvements provide users with a smoother and more efficient installation and development experience.
- CMake code optimization: stratify directories into independent static libraries, to improve incremental compilation efficiency. #59095, #58960,#56591,#58484
- CMake compilation stratification: to realize compilation layering of PaddlePaddle architecture from bottom-up and improve compilation efficiency. #56442,#54729,#55733,#56352,#55109,#54992,#57698,#55147,#55113,#56691,#58618,#58899,#59140,#59129,#59222,#59105,#59711
- Offline compilation of third-party libraries: Third-party dependent libraries are compiled offline, so CI/CE system does not need to download third-party libraries repeatedly in every compilation, improving operation efficiency of the CI/CE system. #54344,#54370,#54466,#54438,#54388,#54436,#54392,#54646,#54380,#55501,#55136,#54451,#55631,#55549,#56165,#54391,#54614,#54522,#54764,#54400,#54322
- PaddlePaddle supports Python 3.12. #59396,#58069
- Using Clang tool to optimize source codes and improve code quality. #59626,#55895,#56632,#54449,#54523,#54796,#55847,#55807,#56261,#57522,#57868,#57809,#55658,#58285,#55491,#55506,#55279,#55741,#55894,#55704,#55800,#55799,#55983,#55954,#55764,#56246,#56219,#56217,#56216,#56208,#56134,#56253,#56255,#56693,#56692,#56637,#56636,#56647,#56218,#56640,#56635,#55675,#56601,#56485,#56648,#56747,#56676,#56649,#56895,#56994,#56904,#56744,#56954,#57114,#57343,#57483,#57871,#57861,#58028,#57627,#59072
- C++ unit tests have changed from linking static libraries to linking dynamic libraries, reducing the compiled size and improving compilation efficiency. #59477,#56630,#57789,#54257,#59620,#59384,#59619,#58583,#58821,#58710,#58619
- Fixed bug related to source code compilation, improving compilation efficiency. #56617,#58195,#56136,#54540,#57172,#54429,#55603,#54807,#56102,#56829,#56951,#56555,#57781,#57836,#58807,#54535,#54946,#54437,#54411,#54411,#54391,#54466,#54480,#54480,#54724,#59193,#54735,#54812,#56430,#56655,#56684,#56774,#56936,#56949,#56974,#57171,#57712,#56617,#58181,#58253,#58268,#59051,#59048,#59081,#59076,#59155,#59253,#59347,#58957,#59443,#58998,#57574,#55889,#59078,#55762,#56252,#56715,#54905,#56978,#57032,#57179,#57179,#58996,#59915,#54883,#56746,#57674,#60117,#55627,#54568,#54450,#54513,#54615,#54913,#54916,#55148,#55125,#55479,#55723,#55831,#55904,#56085,#56259,#56366,#56366,#56546,#56679,#57222,#57387,#57993,#59556,#57931,#58112,#54228,#56913,#56993,#55042,#55305,#55286,#56634,#57778,#58374,#58640,#58822,#59055,#59303,#59487,#58400,#59283,#54791,#59134,#56206,#56199,#56670,#58923
- Fixed bug related to Paddle ARM compilation. #55416,#55548
Azure-Tang, zhaoyinglia, From00, JZ-LIANG, xysheng-baidu, SylarTiaNII, kuizhiqing, zhiqiu, FeixLiu, liuzhenhai93, GhostScreaming, pangengzheng, xiaoyewww, wanghuancoder, ForFishes, hitywt, danleifeng, tianshuo78520a, ykkk2333, houj04, lj970926, XiaociZhang, HarperCy, cqulilujia, runzhech, RuohengMa, Caozhou1995, kangguangli, heavyrain-lzy, zyfncg, SigureMo, YuanRisheng, lchdl, LiYuRio, AndSonder, Wennie396, zhangbo9674, liudongxue01, risemeup1, phlrain, winter-wang, yuanlehome, NALLEIN, Liujie0926, yuguo-Jack, gitliuyf, zh794390558, Aurelius84, 6clc, GGBond8488, xiaoguoguo626807, Wong4j, iosmers, xiaoxiaohehe001, LielinJiang, carryyu, Difers, yangxiaoyu14, xuxinyi389, cxxly, gongshaotian, jjyaoao, lijialin03, lxd-cumt, cyber-pioneer, HydrogenSulfate, MayYouBeProsperous, Charles-hit, Patrick-Star125, ScottWong98, huangjiyi, DrRyanHuang, jinyouzhi, BeingGod, Wanglongzhi2001, yangguohao, zyt1024, longranger2, 2742195759, megemini, thisjiang, kevincheng2, zhoutianzi666, Wangzheee, ming1753, tianhaodongbd, freeliuzc, zhenyun-li, MARD1NO, RichardWooSJTU, eee4017, leo0519, csy0225, wwbitejotunn, bukejiyu, jiweibo, iamsonderr, ckl117, ronny1996, zhanglirong1999, LLee233, ZHUI, wangxn12138, zhwesky2010, Courtesy-Xs, zoooo0820, llyyxx0413, Asthestarsfalll, zxcd, pkuzyc, idontkonwher, sneaxiy, hong19860320, ZibinGuo, leolishaohao, MuShangCC, zhupengyang, shentanyue, Travis-Lee, wz1qqx, frank-oops, newway, QingshuChen, zhangyk0314, HandSomeLEEw, Shixiaowei02, zhangyuqin1998, Xing-lil, zhhsplendid, jiahy0825, xinyu-intel, MarioLulab, 0x45f, Tom-Zheng, xingmingyyj, zhangbopd, gouzil, zeroRains, BiynXu, WintersMontagne10335, wuhuachaocoding, GreatV, chenwhql, deepllz, parap1uie-s, ozogxyz, FisherWY, changeyoung98, zhiboniu, YangQun1 dynamicheart, Xreki, liugddx, Lylinnnnn, YSF-A, zzjjay, YanhuiDua, lishicheng1996, USTCKAY, abenmao, cocoshe, HermitSun, ccsuzzh, sanbuphy, enkilee, RedContritio, Liyulingyue, zrr1999, chen2016013, Galaxy1458, chalsliu, mrcangye, XieYunshen, zhiheng-liu, haohongxiang, ZzSean, JamesLim-sy, yuehuayingxueluo, niuliling123, umiswing, sijunhe, littsk, SecretXV, zhurou603, zhangjun, caizejun, yangjianfengo1, vivienfanghuagood, Xinyu302, lizexu123, yghstill, Li-fAngyU, VigiZhang, co63oc, dhanush-2501, ooooo-create, PommesPeter, zeus2x7, akshatvishu, jzhang533, Sekiro-x, gumblex, BernieHuang2008, YibinLiu666, qiuwenbogdut, XavierZXY, MqLeet, zhangting2020, mingxu1067, Ainavo, SSKlearns, yuchen202, silverling, zade23, wenxiaohahaha, NKNaN, Tsaiyue, fsczz, Tomoko-hjf, rhmaaa, zbt78, Hhankyangg, wangzhen38, zhengqiwen1997, engineer1109, onepick, qili93, Rane2021, nemonameless, DesmonDay, RachelXu7, ceci3, lyuwenyu, liuruyan, LokeZhou, shiyutang, lanxianghit, feifei-111, Sahala08, sunzhongkai588, Kaedeharai, Candy2Tang, liyongchao911, whisky-12, InsaneOnion, yoyoIcy, KongAKun, linzeyang, MuhammadNizamani, eltociear, Ligoml, LUZY0726, Windfarer, FlyingQianMM, jeng1220, junelotus, zlsh80826, Vvsmile, Frida-a, TonibMw, guoshengCS, zhink, ZhangYulongg, AlbertVan, fengxin-hello, mjp9527, entired, DanGuge.