- 11 Mar, 2022 3 commits
-
-
Jianfeng Yan authored
-
Jianfeng Yan authored
-
Jianfeng Yan authored
* fixed a corner case in GetCoordinateResetStep * clean * rename num_accesses to num_access Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 10 Mar, 2022 1 commit
-
-
Qianfeng authored
* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter * Rename the folder name for the pool2d and reduce examples * Update to reduction test scripts * Add Readme for pool2d_fwd and reduce_blockwise examples * Tiny fix in reduce profiler and tiny update in reduce testing scripts * Tiny fix in testing script profile_reduce_no_index.sh * Tiny change in script/profile_reduce_with_index.sh * Renaming and refining in Reduction profiler/device layer/examples * Renaming and refining in Reduction profiler/device layer/examples * Renaming all NumReduceDims to NumReduceDim
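The entries above repeatedly describe reducing a tensor over a chosen set of dimensions (the NumReduceDim template parameter counts them). As a minimal pure-Python sketch of what the device reduce instances compute on the host side, the following reduces a flat row-major tensor by summing over selected axes; the function name and signature are illustrative, not CK's API:

```python
from itertools import product

def reduce_sum(data, shape, reduce_dims):
    """Sum `data` (flat, row-major, of `shape`) over the axes in `reduce_dims`;
    returns (out_flat, out_shape). Hypothetical helper, not CK's interface."""
    keep = [d for d in range(len(shape)) if d not in reduce_dims]
    out_shape = [shape[d] for d in keep] or [1]
    # row-major strides of the input
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    out = {}
    for idx in product(*[range(s) for s in shape]):
        flat = sum(i * s for i, s in zip(idx, strides))
        key = tuple(idx[d] for d in keep)        # coordinates of kept dims
        out[key] = out.get(key, 0.0) + data[flat]
    # emit outputs in row-major order of the kept dimensions
    return [out[k] for k in sorted(out)], out_shape
```

Reducing a 2x3 tensor over dimension 1 yields one sum per row; reducing over all dimensions yields a single scalar.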
-
- 09 Mar, 2022 1 commit
-
-
Chao Liu authored
* delete obsolete files * move files * build * update cmake * update cmake * fix build * reorg examples * update cmake for example and test
-
- 07 Mar, 2022 1 commit
-
-
Chao Liu authored
-
- 05 Mar, 2022 5 commits
-
-
Qianfeng authored
* Initial adding of generic reduction * Initial adding of generic reduction ... * Updates to make compiling done * clang-format all files * clang-format some files again * Renaming in profiler/include/profile_reduce.hpp * Updates and make BlockWise cases passed * Updates and make ThreadWise and MultiBlockTwoCall cases passed * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes * format * adding pooling * added max and average pooling * comment out cout and kernel timing * Tiny simplification in profiler/reduce_profiler.cpp * Add example for reduce_blockwise * Tiny updates * Change to pass the ElementWiseOp from device layer to kernel * Fix the vectorDim and vectorSize in Device layer * Enable vector load on both dim0 and dim1 for Threadwise method * Tiny updates * Change to let the user to pass the preUnaryOp and posUnaryOp * Make pooling example work * split device_reduce_instance into two libraries * Tiny update * Replace nanPropaOpt enum by boolean propagate_nan * Simplification in DeviceReduce layer codes * update build * Change to clarify the difference between ck::half_t and half_float::half * Renaming in all the reduction codes * Add VectorSize as template parameter for device layer * Add BetaIsZero as kernel template and as AccDataType for alpha * print * Small updates for pooling * Updates for host_generic_reduction for reference * Update to make AVG pooling pass * Update to make MAX pooling with indices output pass * fix * add OutDst vector store to threadwise reduction and pooling * tweak * turn off check_indices that caused build issue * refactor pooling * clean up * turn off check_indices for building issue for php-compiler * add more tile size for odd C * tweak conv for odd C * update script * clean up elementwise op * add hack in reduction_operator.hpp to avoid compile error. 
To fix it, need to use element_wise_op in reduction op * Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations * Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp * Change to the unary operators * Move the definitions of unary operations to element_wise_operation.hpp * re-org files * Refine in device interfaces and multiblock kernels * Split the reduction configurations into instances for specific methods * Update in getTypeString() of device pool2d * Renaming in host and kernel * Tiny update in profiler/src/profiler.cpp * Uncomment in device_operation/CMakeLists.txt to enable the building of all operations * Make check_indices a templated function to remove some linking issue * Renaming in the profiler reduce module * Add support for double Reduction (but disable MultiblockAtomicAdd for double) * Tiny correction of literal string * Rename DevicePoolFwd to DevicePool2dFwd * Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling * Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx * Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp * Renaming and refining in the Reduction codes * Tiny change in the unary operators * Renaming symbols and files * Renaming symbols in the kernels * Move kernel kernel_set_buffer_value to separate file * Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer * Tiny update in the kernels * Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue * Simplify a helper function in device layer * Tiny adjustment in testing data initialization * Renaming in kernel/device/host * Add two testing scripts for reduction * Refine the Unary operators in element_wise_operation.hpp * Update in the reduce profiler module * Update to the reduction testing scripts * reduce compile parallelism * 
change CI docker to rocm5.0 * remove unused variables * fix build Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Chao Liu authored
revert changes in threadwise copy due to PR #101 (space filling curve used in threadwise copy) (#111)
-
rocking5566 authored
* Add int8 of mk_nk_mn to the ckProfiler * Add example of int8 gemm * Fix typo, use ushort instead of half_t for bfloat16 * replace ushortXXX_t to bhalfXXX_t * rename ushort to bhalf_t * Add bf16 example * Add bf16 gemm to ckProfiler * Fix alignment * Fix typo * Add unit test for gemm_xdl int8 * Add gemm_xdl fp32 unit test * Add gemm_xdl bf16 unit test * fix build * fix build issue due to merge conflict * Fix build * Fix build error * [What] gemm + relu inference [How] gemm + requant + relu + requant + clamp * clean Co-authored-by:
rocking <chunylai@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
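The "[What] gemm + relu inference [How] gemm + requant + relu + requant + clamp" note above describes a fused int8 inference epilogue. A minimal scalar sketch of that pipeline, assuming illustrative scale values (the function name and parameters are hypothetical, not CK's API):

```python
def requant_relu_requant_clamp(acc_i32, scale0, scale1, lo=-128, hi=127):
    """int32 GEMM accumulator -> requantize -> ReLU -> requantize -> clamp
    to the int8 range, per the commit's description. Scales are assumptions."""
    x = acc_i32 * scale0                     # first requantization
    x = max(x, 0.0)                          # ReLU
    x = x * scale1                           # second requantization
    return int(min(max(round(x), lo), hi))   # clamp to [-128, 127]
```

Applied per output element, this keeps the whole epilogue inside the GEMM kernel instead of launching separate requant/activation kernels.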
-
Chao Liu authored
* fix tests * remove useless file * fix test build * reduce parallelism when compiling * fix test
-
ltqin authored
* add wrw reference * start device * raw not split version * run simple example * start to use atomic add * simple transform result correct * first version that can run * fix atomic and set operator choice * add check split-k * format * change input parameter * add pad for t total * rename example index Co-authored-by:
ltqin <letaoqin@amd.com>
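The entry above splits the K loop of the weight-gradient GEMM across workgroups and combines partial products with atomic adds. A plain-Python sketch of that split-k scheme on list-of-lists matrices (the sequential `+=` stands in for the GPU's atomic add; names are illustrative):

```python
def splitk_gemm(A, B, k_batch):
    """C = A @ B with the K dimension split into k_batch chunks; each chunk's
    partial product is added into a zero-initialized C, the way the kernel
    uses atomic adds. Sketch only, not CK's device API."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]            # "set c matrix zero"
    chunk = (K + k_batch - 1) // k_batch
    for b in range(k_batch):                     # each b maps to a workgroup
        k0, k1 = b * chunk, min((b + 1) * chunk, K)
        for i in range(M):
            for j in range(N):
                partial = sum(A[i][k] * B[k][j] for k in range(k0, k1))
                C[i][j] += partial               # atomic add on the GPU
    return C
```

The output must be zeroed before launch, and `k_batch` trades per-workgroup work against contention on the atomic adds, which is why the entry adds a split-k check.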
-
- 04 Mar, 2022 4 commits
-
-
rocking5566 authored
* Add int8 of mk_nk_mn to the ckProfiler * Add example of int8 gemm * Fix typo, use ushort instead of half_t for bfloat16 * replace ushortXXX_t to bhalfXXX_t * rename ushort to bhalf_t * Add bf16 example * Add bf16 gemm to ckProfiler * Fix alignment * Fix typo * Add unit test for gemm_xdl int8 * Add gemm_xdl fp32 unit test * Add gemm_xdl bf16 unit test * fix build * fix build issue due to merge conflict * Fix build * Fix build error Co-authored-by:
rocking <chunylai@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Chao Liu authored
-
Jianfeng Yan authored
* add space_filling_curve * cleanup and move space_filling_curve into test * WIP: start refactoring threadwise_transfer_v1r3 * threadwise_copy works but needs further refactoring * add some comments * add SpaceFillingCurve::GetIndices() * minor changes * removed GetIndices; refactored GetDstCoordinateResetStep * add DynamicBuffer::Transfer, but Add is not tested * rebased agaist develop * threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work * minor changes * refactored threadcopy v3r1, v2; removed old implementations * clang-format * cleanup * fix a typo in v6r3 * format Co-authored-by:
Chao Liu <chao.liu2@amd.com>
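The space-filling-curve refactor above orders a thread's tensor accesses so that consecutive accesses differ by a single coordinate step, which is what makes GetForwardStep/GetDstCoordinateResetStep cheap. A minimal sketch of one such curve, a 2D snake (boustrophedon) traversal; this illustrates the idea only and is not CK's SpaceFillingCurve class:

```python
def snake_order(dim0, dim1):
    """Visit a dim0 x dim1 access window in snake order: even rows
    left-to-right, odd rows right-to-left, so every consecutive pair of
    accesses is one unit step apart."""
    path = []
    for i in range(dim0):
        cols = range(dim1) if i % 2 == 0 else range(dim1 - 1, -1, -1)
        path.extend((i, j) for j in cols)
    return path
```

Because adjacent visits differ by one step, the coordinate update between accesses is a single increment rather than a full index recomputation.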
-
ltqin authored
* start conv2d bwd api * kernel running * add bwd reference * change to no shuffle * fix bwd reference * pass verification * add Filter1x1Stride1Pad0 and start testing * change some tuning parameter * fix test error * add fp16 tuning parameter * add bf16 tuning parameter * add int8 tuning parameters * change fp32 tuning parameter * add bwd to profiler * fix bug for bwd profiler * fix ckProfiler bug * change conv2d_bwd_xdl to fp16 * fix bug in comments * fix precompile id * fix enum conv name * change _bwd_ to _bwd_data_ * change conv2d_bwd example id * bwd to bwd data * fix prehead * fix MakeDefaultBlock2CTileMap, import from merge develop * format bwd instance * bwd to bwd data * change name bwd to bwd data * change name bwd to bwd data in example * format code * change conv2d bwd data id in example * rewrite readme for example * fix CalculateMagicNumbers about div zero * add workaround CK_WORKAROUND_SWDEV_325164 * change test_conf2d_bwd_data show info * format * fix bug for workaround:CK_WORKAROUND_SWDEV_325164 * format tuning parameters * format tuning parameters again * format tuning parameters 3 * format tuning parameters 4 * remove add function template * format * update comment Co-authored-by:
ltqin <letaoqin@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 03 Mar, 2022 1 commit
-
-
JD authored
* add docker file and make default target buildable * add Jenkinsfile * remove empty env block * fix package stage * remove render group from docker run * clean up Jenkins file * add cppcheck as dev dependency * update cmake file * Add profiler build stage * add hip_version config file for reduction operator * correct jenkins var name * Build release instead of debug * Update test CMakeLists.txt reorg test dir add test stage * reduce compile threads to prevent compiler crash * add optional debug stage, update second test * remove old test target * fix tests to return proper results and self review * Fix package name and make test run without args * change Dockerfile to use rocm4.3.1 * remove parallelism from build * Lower parallelism Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 28 Feb, 2022 1 commit
-
-
Anthony Chang authored
* add gitignore * host tensor: allow generating sequentially increasing value in a given dimension * gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor - remove dangling header include - modify example gemm_xdl accordingly - infer KPack value from M/NPerXdl - device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1 (API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight) * add LDS data dump utility * profiler: reflect API change for distinct K0/K1 for A/B matrices * profiler: add conflict-free LDS write FP16 kernel instances * fix accidental perf regression * address feedback; cosmetic changes * clang-format for new files * format Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 25 Feb, 2022 2 commits
-
-
zjing14 authored
* init for splitk f16 * a working prototype * debug * perf debug * update example * instances for mk kn * add instances for all layers * clean * clean * add tuning * format * add mn_padding into irregular tile * clean Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Jianfeng Yan authored
* add space_filling_curve * cleanup and move space_filling_curve into test * add functions for backward and forward step; hard coded results in unit test * minor changes
-
- 23 Feb, 2022 3 commits
-
-
Chao Liu authored
* clean up * add multiple thread scratch to ThreadwiseTensorSliceTransfer_v3r1 * add 2 stage prefetch * add more sanity check into transform_tensor_descriptor * tweak * enabling 2 stage prefetch to existing gridwise gemm; tweak * enabling 2 stage prefetch to existing gridwise gemm * move gridwise gemm pipeline in class; clean up * add some irregular tile size * update CalculateHasMainK0BlockLoop for multi-stage-prefetch * refactor gridwise gemm pipeline class
-
Adam Osewski authored
* Convolution ND * Code unification across dimensions for generating tensor descriptors. * Example * Instances * Move convnd f32 instance file to comply with repo structure. * Conv 1D tensor layouts. * Formatting and use ReferenceConv * Reference ConvFwd supporting 1D and 2D convolution. * Debug printing TensorLayout name. * Conv fwd 1D instance f32 * Refactor conv ND example. Needed to support various conv dimensions. * Rename conv nd example directory to prevent conflicts. * Refactor some common utility to single file. Plus some tests. * Refactor GetHostTensorDescriptor + UT. * Add 1D test case. * Test reference convolution 1d/2d * Remove some leftovers. * Fix convolution example error for 1D * Refactor test check errors utility function. * Test Conv2D Fwd XDL * More UT for 1D case. * Parameterize input & weight initializers. * Rename example...
-
Jianfeng Yan authored
* conv3d compiles but has memory error * conv3d works * fix performance issue by using __builtin_amdgcn_readfirstlane * change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to* * clang-format * remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d * format * remove useless macro * add comment Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 21 Feb, 2022 1 commit
-
-
rocking5566 authored
* [What] Refactor verification of gemm alpha_beta, move to reference operation [Why] Sync with other verification * Profile mk_nk for gemm bias 2d * Support bias 2d with mn * kn in profiler * Support bias 2d with km*kn and km*nk in profiler * Support fp32 bias 2d in profiler * format * format Co-authored-by:
rocking <chunylai@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 19 Feb, 2022 1 commit
-
-
JD authored
* add docker file and make default target buildable * add Jenkinsfile * remove empty env block * fix package stage * remove render group from docker run * clean up Jenkins file * add cppcheck as dev dependency * update cmake file * Add profiler build stage * add hip_version config file for reduction operator * correct jenkins var name * Build release instead of debug * clean up Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 12 Feb, 2022 1 commit
-
-
ltqin authored
* add fwd bf16 conv * change tuning parameters * add int8 for conv fwd * remove comments * change tuning parameters for int8 * change init int8 example * add test for conv2d fwd * change device operation file pos because merge develop * fwd int8 use reference * test_conv_fwd use reference * add bracket for if statement * rename fwd example name * remove StaticBufferOfVectorTypeV2 * tweak example Co-authored-by:
ltqin <letaoqin@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 11 Feb, 2022 4 commits
-
-
zjing14 authored
* add DeviceGemmSplitKXdl * add file device_gemm_splitk_xdl.hpp * set c matrix zero * using atomic * add all tuning parameter to f32 mkkn * grid size change to 720 * add tuning parameter for NT * add tuning parameter for TN * add tuning parameter for TT * add m=96 tuning parameter * add lost config * debug * fix sweep * add failed tuning params * fixed sweep logic * clean * add padding to M/N for irr tile size * clean code * add element wise operation * fixed MPerBlock=96 * remove macro for split-k switch * add test * add new line at the end of device_gemm_xdl_instance.hpp * remove step hack * separate split-k instance files * add tuning parameters * change desired grid size to parameters * remove slice length * add desiredgridsize parameter to ckProfiler * add lost file device_gemm_xdl_splitk_instance.hpp * change desired grid size to kbatch * format * format * clean up * add selection of device_instances * clean code * clean code * add small tile size in fp16 nn * test for rocm 4.5 * merge develop * clean * clean * clean * remove unused code * add padding switch to device_gemm_xdl * add padding switch for ksplit fp32 * clean * clean * add files * rename * Update profiler.cpp * format Co-authored-by:
ltqin <letaoqin@amd.com> Co-authored-by:
ltqin <letao.qin@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
zjing14 authored
* prepare host for batched_gemm * init commit of batched kernels * fixed * refine transform with freeze * m/n padding * fixed a bug; clean * add small tiles * clean * clean code * clean code * add nt, tn, tt layout * add missing file * use StaticBufferTupleOfVector instead * add reference_batched_gemm * fixed a macro
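The batched-GEMM entry above adds a reference_batched_gemm for host-side verification: one independent GEMM per batch slice. A minimal list-of-lists sketch of that reference (illustrative, not CK's reference_batched_gemm signature):

```python
def reference_batched_gemm(A, B):
    """C[g] = A[g] @ B[g] for each batch g, with A of shape G x M x K and
    B of shape G x K x N. Host-side verification sketch."""
    out = []
    for Ag, Bg in zip(A, B):
        K, N = len(Ag[0]), len(Bg[0])
        out.append([[sum(Ag_row[k] * Bg[k][j] for k in range(K))
                     for j in range(N)] for Ag_row in Ag])
    return out
```

Each batch is fully independent, which is what lets the device kernel map batches to separate grid dimensions.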
-
rocking5566 authored
* [What] Add 2d version of bias, prepare to implement alpha / beta scaling * Add alpha / beta functor * Refine parameter of example * [What] Use real type instead of template [Why] Prevent implicit cast * Rename parameter for general operator * Remove redundant comment * Fix compile error Co-authored-by:
rocking <chunylai@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Anthony Chang authored
- device_gemm_xdl_c_shuffle function signature matches split-k - retire host_driver since it is no longer maintained - linter error (unused variable) Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 07 Feb, 2022 1 commit
-
-
Chao Liu authored
* tweak conv for odd C * update script * clean up elementwise op * fix build * clean up * added example for gemm+bias+relu+add * added example for gemm+bias+relu * add profiler for gemm_s_shuffle; re-org files * add profiler * fix build * clean up * clean up * clean up * fix build
-
- 04 Feb, 2022 1 commit
-
-
ltqin authored
* add reference * clean up * add reference for conv * rename Co-authored-by:
ltqin <letaoqin@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 03 Feb, 2022 2 commits
-
-
zjing14 authored
* test mfma builtins * add fp16 builtins * add int8 builtins * add bf16 builtins * simplify host conv forward * clean * clean
-
ltqin authored
* add DeviceGemmSplitKXdl * add file device_gemm_splitk_xdl.hpp * set c matrix zero * using atomic * add all tuning parameter to f32 mkkn * grid size change to 720 * add tuning parameter for NT * add tuning parameter for TN * add tuning parameter for TT * add m=96 tuning parameter * add lost config * add element wise operation * fixed MPerBlock=96 * remove macro for split-k switch * add test * add new line at the end of device_gemm_xdl_instance.hpp * remove step hack * separate split-k instance files * add tuning parameters * change desired grid size to parameters * remove slice length * add desiredgridsize parameter to ckProfiler * add lost file device_gemm_xdl_splitk_instance.hpp * change desired grid size to kbatch * format * format * clean up * add selection of device_instances * clean code * fix build issue Co-authored-by:
ltqin <letaoqin@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com> Co-authored-by:
Jing Zhang <jizhan@amd.com>
-
- 25 Jan, 2022 1 commit
-
-
rocking5566 authored
* Do not hardcode the function parameter, use template instead. * [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API [Why] "C_Shuffle" version is supposed to be similar to the vanilla one * Fix typo Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1
-
- 21 Jan, 2022 1 commit
-
-
rocking5566 authored
* [What] 1. Add DeviceGemmXdl_C_Shuffle 2. Revise example of gemm_xdl [Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C [How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
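The entry above prepares a shuffle version of D = alpha * (A * B) + beta * C. A minimal host-side sketch of that operation on list-of-lists matrices (illustrative, not the DeviceGemmXdl_C_Shuffle interface):

```python
def gemm_alpha_beta(A, B, C, alpha, beta):
    """D = alpha * (A @ B) + beta * C: the scaled-GEMM-plus-bias the
    C_Shuffle device op is being prepared for."""
    K, N = len(A[0]), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * C[i][j]
             for j in range(N)] for i in range(len(A))]
```

In the fused kernel the alpha/beta scaling is applied in the C-shuffle epilogue while the accumulator tile is still in registers/LDS, avoiding an extra pass over C.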
-
- 18 Jan, 2022 1 commit
-
-
Chao Liu authored
* fix build issue
-
- 26 Dec, 2021 1 commit
-
-
Chao Liu authored
* fix relu * clean up * clean up * adding 1x1 conv * adding 1x1 conv * added 1x1 conv * refactor * refactor * refactor * added profiler for conv+bias+relu+add * clean up * adding conv+bias+relu * adding conv+bias+relu * added conv+bias+relu * Update README.md * update cpu verification * adding c shuffle * update static_tensor for dealing with invalid element * adding c shuffle * debugging * fix bug * convert to fp16 before shuffle * shuffle more than one M/NRepeat * clean up * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1 * clean up * remove coordinate step hack from all gridwise gemm xdl * clean up coordinate step hack * clean up coordinate step hack * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst * adding output shuffle in conv+bias+relu+add * update * added conv+bias+relu+add with c shuffle * added conv+bias+relu+add with c shuffle * fix forward_sweep bugs in threadwise copy * clean up * refactor * clean up * clean up * added conv_c_shuffle+bias_relu * clean up * added conv+bias+relu+atomic_add * clean up * clean up * clean up * clean up * clean up * clean up * misc fixes; add 1x1 specialization * clean up * delete unused device op * clean up * add support for odd C value
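The 1x1 specialization mentioned above exploits the fact that a 1x1, stride-1, pad-0 convolution in NHWC layout is just a GEMM over the flattened (N*H*W) dimension: every output pixel is a dot product over C. A minimal sketch of that equivalence (hypothetical helper, not CK's device op):

```python
def conv1x1_nhwc(x, w):
    """1x1 / stride-1 / pad-0 forward convolution. x is NHWC as nested
    lists (N x H x W x C), w is K x C. Each output pixel out[n][h][w][k]
    is a dot product over C, i.e. a GEMM row for the flattened pixel."""
    return [[[[sum(px[c] * wk[c] for c in range(len(px))) for wk in w]
              for px in row]                 # W dimension
             for row in img]                 # H dimension
            for img in x]                    # N dimension
```

Because no spatial window or padding logic is needed, the specialized kernel can skip the im2col-style index transforms entirely.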
-
- 13 Dec, 2021 1 commit
-
-
Chao Liu authored
* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2() * Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"
-
- 04 Dec, 2021 1 commit
-
-
Chao Liu authored
* fix relu * clean up * clean up
-
- 03 Dec, 2021 1 commit
-
-
Chao Liu authored
* gemm+activation * move C pointwise operation into threadwise copy * add pointwise operation to A/B matrix * update ckProfiler * adding bias add * adding bias add * adding bias add * added bias add; worked around compiler issues * clean up * clean up * Update README.md * Update README.md * Update README.md * clean up * add conv_xdl example * adding conv_xdl_bias_relu_add example * add conv+bias+relu+add, but has register spill issue * tweak * tweak * refactor * Update README.md update readme for example/2_gemm_xdl_bias_relu_add * clean up * Update README.md update readme for example/3_conv_xdl * Update README.md
-