1. 19 Apr, 2022 1 commit
  2. 15 Apr, 2022 1 commit
    • Compile CK for all targets (#188) · 4221505d
      Illia Silin authored
      
      * compile ck for all targets
      
      * update the target criteria
      
      * change the target condition
      
      * fixed some typos
      
      * fixed missed file
      
      * revert changes in README
      
      * revert device_conv3d_fwd_xdl_...
      
      * update device_conv3d_fwd_xdl_...
      
      * update device_batched_gemm_reduce...
      
      * test the unused arguments fix
      
      * test the warning suppression
      
      * try suppress warnings in device_batched_gemm_reduce_xdl...
      
      * fix the last warnings
      
      * replace UNUSED with std::ignore
      
      * fix a typo
      
      * replaced std::ignore with ignore
      
      * add ignore header to common_header
      
      * refactor atomicAdd
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      4221505d
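The commit above silences unused-argument warnings, first with `std::ignore` and then with a project-local `ignore` whose definition is not shown in this log. A minimal sketch of the `std::ignore` variant follows; the function `scaled_size` and the macro `USE_STRIDE` are hypothetical names used only for illustration.

```cpp
#include <tuple> // std::ignore

// Hypothetical helper: `stride` is only needed when USE_STRIDE is defined.
int scaled_size(int size, int stride)
{
#if defined(USE_STRIDE) // hypothetical build-time switch
    return size * stride;
#else
    // Assigning to std::ignore marks the argument as used, which silences
    // -Wunused-parameter without changing run-time behaviour.
    std::ignore = stride;
    return size;
#endif
}
```

Compared with a hand-written `UNUSED(x)` macro, this keeps the suppression in ordinary C++ and needs no preprocessor definition.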
  3. 07 Apr, 2022 1 commit
  4. 05 Apr, 2022 4 commits
    • Common forward convolution utility refactor. (#141) · abf4bdb9
      Adam Osewski authored
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example directory to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      
      * 3D cases for convolution utility functions.
      
      * 3D reference convolution.
      
      * Add support for 3D convolution.
      
      * Check for inputs bigger than 2GB.
      
      * Formatting
      
      * Support for bf16/f16/f32/i8 - conv instances + UT.
      
      * Use check_err from test_util.hpp.
      
      * Split convnd test into separate files for each dim.
      
      * Fix data generation and use proper instances.
      
      * Formatting
      
      * Skip tensor initialization if not necessary.
      
      * Fix CMakefiles.
      
      * Remove redundant conv2d_fwd test.
      
      * Lower problem size for conv3D UT.
      
      * 3D case for convnd example.
      
      * Remove leftovers after merge.
      
      * Add Conv Specialization string to GetTypeString
      
      * Skip instance causing numerical errors.
      
      * Small fixes.
      
      * Remove redundant includes.
      
      * Fix namespace name error.
      
      * Script for automatic testing and logging convolution fwd UTs
      
      * Comment out numactl cmd.
      
      * Refine weights initialization and relax rtol for fp16
      
      * Move test_util.hpp to check_err.hpp
      
      * Refine weights initialization and relax rtol for fp16
      
      * Refactor common part of test conv utils.
      
      * Move utility function to single common place.
      
      * Add additional common functions to utility.
      
      * Refactor convnd_fwd_xdl examples.
      
      * Remove redundant files.
      * Unify structure.
      
      * Add constructor to ConvParams.
      
      * And add input parameters validation.
      
      * Modify conv examples to use single utility file.
      
      * Remove check_error from host_tensor.hpp
      
      * Get rid of check_indices function.
      
      * Remove bf16_to_f32 function overload for scalars.
      
      * Fix namespace.
      
      * Add half_float::half for check_err.
      
      * Fix conv params size in UT.
      
      * Fix weights initialization for int8.
      
      * Fix weights initialization for int8.
      
      * Add type_convert when store output in ref conv 1D.
      
      * Get back old conv2d_fwd_xdl operation.
      
      * Silence conv debug print.
      
      * format
      
      * clean
      
      * clean
      
      * Fix merge.
      
      * Fix namespace for check_err
      
      * Formatting.
      
      * Fix merge artifacts.
      
      * Remove deleted header.
      
      * Fix some includes and use ck::utils::check_err.
      
      * Remove unused check_indices restored by previous merge.
      
      * Fix namespaces after merge.
      
      * Fix compilation error.
      
      * Small fixes.
      
      * Use common functions.
      * Fix filename
      * Fix namespaces.
      
      * Fix merge artifact - restore function removed by accident.
      
      * Fix ConvForwardSpecialization.
      
      * Adhere to coding style rules.
      
      * Fix merge artifacts.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      abf4bdb9
    • Patch for bwd data comments (#174) · 6717168c
      ltqin authored
      * change function name and the way input is set to zero
      
      * change enable if
      6717168c
    • NHWC Conv2d Bwd weight fp16 ckprofiler and test (#166) · 781cacd2
      ltqin authored
      * change backward weight name
      
      * start add bwd weight lib and profiler
      
      * change tuning parameter
      
      * change output info
      
      * add bwd weight test
      
      * change test info
      
      * using conv_util
      
      * change wgt to weight
      
      * add }
      
      * add fp32
      781cacd2
    • Improve Reduction kernel api (#152) · 82c8b9f8
      Qianfeng authored
      * Add ThreadwiseReduction functor as per-thread reduction api
      
      * Use the ThreadwiseReduce api and change how the PartitionedBlockwiseReduction api is used to simplify the kernels
      
      * Add comments and remove useless declarations in the kernels
      
      * Tiny updates
      82c8b9f8
  5. 01 Apr, 2022 1 commit
  6. 31 Mar, 2022 7 commits
  7. 30 Mar, 2022 1 commit
    • Batched gemm and reduction (#156) · 34c661e7
      Jianfeng Yan authored
      * adding batched_gemm_and_reduction
      
      * batched_gemm_reduce works with batch_count=1
      
      * fix a bug in grid_size; batched_gemm_reduce works for batch_count > 1
      
      * adding profiler for batched_gemm_fp16
      
      * fixed a bug in declaration of d1 and d0; both example and profiler work
      
      * clang-format
      
      * cleanup
      
      * batched_gemm_reduce: add test
      
      * minor change
      
      * fixed some typos in function names
      34c661e7
  8. 29 Mar, 2022 2 commits
    • Refine kernel parameter of int8 (ScalarPerVector) (#155) · 98e1e2d0
      rocking5566 authored
      * Change int8 ScalarPerVector
      
      * Modify vector width of C
      98e1e2d0
    • Unified implementation of 1d/2d/3d conv bwd-data. fp32/fp16/bfp16/int8 (#134) · 0536f2b3
      ltqin authored
      
      * start convnd bwd data
      
      * add 3d layout name
      
      * add conv1d reference
      
      * add conv3d reference
      
      * finished example client code
      
      * conv1d kernel finished
      
      * fix input error
      
      * add conv3d
      
      * add 3d layout in conv_utils.hpp
      
      * fix special check
      
      * add convnd lib
      
      * add test for bwd data
      
      * finished test
      
      * add check slice length
      
      * convnd bwd data start
      
      * profiler can be compiled
      
      * fix some bugs
      
      * set input to zero
      
      * modify readme for example
      
      * fix test_convnd_bwd_data bug
      
      * test_convnd_bwd_data parameter desc
      
      * workaround for 1d
      
      * workaround for 2d
      
      * change init value
      
      * workaround for 3d int8
      
      * fix init value bug
      
      * remove workaround
      
      * fix acc data type
      
      * add int32
      
      * change select function to template
      
      * tilda to tilde
      
      * remove int32 instance
      
      * fix commit for device hpp
      
      * fix comments for profiler
      
      * using profile imp to test
      
      * add pass verification
      
      * fix conv2d reference
      
      * fix conflict
      
      * remove double batched_gemm
      
      * fix example conv2d data and test convnd
      
      * format
      
      * change conv2d_bwd_data return value
      
      * remove repeat = 1
      
      * remove conv bwd data
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      0536f2b3
  9. 28 Mar, 2022 1 commit
  10. 25 Mar, 2022 1 commit
  11. 24 Mar, 2022 3 commits
    • Gemm test return value (#148) · 3ba14932
      rocking5566 authored
      * Add return value
      
      * Replace _Float16 with ck::half_t
      
      * A test should return 0 on success and non-zero on failure
      3ba14932
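The exit-status convention named in the commit above (0 for a passing test, non-zero for a failing one) is what test drivers such as CTest rely on by default. A minimal sketch follows; `all_close` and `exit_code` are hypothetical helpers, not functions from the repository, and the absolute tolerance is an arbitrary choice.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: true when both vectors have the same length and every
// element pair differs by at most `atol`.
bool all_close(const std::vector<float>& out, const std::vector<float>& ref, float atol = 1e-5f)
{
    if(out.size() != ref.size())
        return false;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        if(std::fabs(out[i] - ref[i]) > atol)
            return false;
    }
    return true;
}

// Map a verification result to a process exit status: 0 on success, 1 on failure.
int exit_code(bool pass) { return pass ? 0 : 1; }
```

A test executable would then end with something like `return exit_code(all_close(device_result, host_result));` in `main`, so that a mismatch is visible to the test driver rather than only printed.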
    • fixed alloc mem size (#145) · 12f4cfce
      zjing14 authored
      12f4cfce
    • Gemm+Reduce Fusion (#128) · f95267f1
      Chao Liu authored
      * add gridwise gemm v4r1
      
      * rename
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * use sfc in shuffling
      
      * remove hardcode
      
      * remove hardcode
      
      * refactor
      
      * fix build
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * format
      
      * clean
      
      * adding gemm+reduce
      
      * adding profiler for gemm+reduce
      
      * adding gemm+reduce profiler
      
      * fix build
      
      * clean up
      
      * gemm+reduce
      
      * fix build
      
      * update DeviceGemm_Xdl_CShuffle; update enum to enum class
      
      * clean up
      
      * add test for gemm+reduce
      
      * clean up
      
      * refactor
      
      * fix build
      
      * fix build
      f95267f1
  12. 23 Mar, 2022 2 commits
    • Unified conv3D API + support for all data types. (#133) · f91579aa
      Adam Osewski authored
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example directory to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      
      * 3D cases for convolution utility functions.
      
      * 3D reference convolution.
      
      * Add support for 3D convolution.
      
      * Check for inputs bigger than 2GB.
      
      * Formatting
      
      * Support for bf16/f16/f32/i8 - conv instances + UT.
      
      * Use check_err from test_util.hpp.
      
      * Split convnd test into separate files for each dim.
      
      * Fix data generation and use proper instances.
      
      * Formatting
      
      * Skip tensor initialization if not necessary.
      
      * Fix CMakefiles.
      
      * Remove redundant conv2d_fwd test.
      
      * Lower problem size for conv3D UT.
      
      * 3D case for convnd example.
      
      * Remove leftovers after merge.
      
      * Add Conv Specialization string to GetTypeString
      
      * Skip instance causing numerical errors.
      
      * Small fixes.
      
      * Remove redundant includes.
      
      * Fix namespace name error.
      
      * Script for automatic testing and logging convolution fwd UTs
      
      * Comment out numactl cmd.
      
      * Refine weights initialization and relax rtol for fp16
      
      * Fix weights initialization for int8.
      
      * Add type_convert when store output in ref conv 1D.
      
      * Get back old conv2d_fwd_xdl operation.
      
      * Silence conv debug print.
      
      * format
      
      * clean
      
      * clean
      
      * Fix merge.
      
      * Fix namespace for check_err
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      f91579aa
    • clean (#143) · 22061366
      Chao Liu authored
      22061366
  13. 22 Mar, 2022 3 commits
    • Batched gemm bf16 (#142) · d91f9f11
      Jianfeng Yan authored
      * add bf16 for batched gemm
      
      * batched_gemm_bf16 works
      
      * recover accidentally changed files
      d91f9f11
    • Grouped GEMM for fp16 (#126) · 716f1c7f
      zjing14 authored
      * init of grouped_gemm
      
      * 2 gemm test
      
      * perf test
      
      * clean
      
      * wrap desc into a struct
      
      * test cast static_arr to pointer
      
      * add ptr to GemmDesc
      
      * add grouped gemm profiler
      
      * fixed mem issue with unique_ptr
      
      * clean
      
      * clean
      
      * finished ckprofiler
      
      * Update README.md
      
      * readme
      
      * fixed readme
      
      * add example
      
      * improve code
      
      * fixed comments: reserve, separate ptr and gemm_shapes
      
      * merge group and non-group
      
      * fixed comments: replace push_back with emplace_back to avoid copy constructor
      
      * fixed comments: unified blk2ctile; add test
      
      * ci fix
      
      * fixed ci
      
      * fixed ci
      
      * fixed ci
      716f1c7f
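Two review fixes in the commit above, `reserve` and replacing `push_back` with `emplace_back` to avoid the copy constructor, can be illustrated with a copy-counting type. This is a sketch only: `GemmShape`, `g_copies`, and both functions are hypothetical and do not come from the repository.

```cpp
#include <vector>

// Global counter of copy constructions, for illustration only.
int g_copies = 0;

// Hypothetical stand-in for a per-group GEMM shape record.
struct GemmShape
{
    int M, N, K;
    GemmShape(int m, int n, int k) : M(m), N(n), K(k) {}
    GemmShape(const GemmShape& o) : M(o.M), N(o.N), K(o.K) { ++g_copies; }
};

// Construct each element in place; with reserve() there is no reallocation,
// so no copy constructor call happens at all.
int copies_with_emplace_back(int count)
{
    g_copies = 0;
    std::vector<GemmShape> shapes;
    shapes.reserve(count);
    for(int i = 0; i < count; ++i)
        shapes.emplace_back(256, 128, 64);
    return g_copies;
}

// Build a named object first, then push_back copies it into the vector.
int copies_with_push_back(int count)
{
    g_copies = 0;
    std::vector<GemmShape> shapes;
    shapes.reserve(count);
    for(int i = 0; i < count; ++i)
    {
        GemmShape s(256, 128, 64);
        shapes.push_back(s); // one copy per element
    }
    return g_copies;
}
```

The `reserve` call matters as much as `emplace_back` here: without it, vector growth would copy existing elements on each reallocation, since this type declares no move constructor.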
    • Reduction for int8 and bfloat16 (#125) · 9a8ee8a3
      Qianfeng authored
      
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Add support for bfp16 reduction (using bhalf_t = ushort)
      
      * Tiny fix in amd_buffer_addressing.hpp
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Use AccDataType for Beta value and use element_wise::PassThrough
      
      * Use type_convert for type converting in host layer reduction
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      
      * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
      
      * Update to testing scripts to add bf16 support
      
      * added more static_assert
      
      * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
      
      * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
      
      * minor change
      
      * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
      
      * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
      
      * Tiny fix in script/profile_reduce_no_index.sh
      
      * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
      
      * Generic renaming in host reduction and DeviceReduce layer
      
      * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
      
      * Use multi-thread and simplification for host Reduction implementation
      
      * Add ctest for reduction
      
      * Update to clarify the use of data init method in produce_reduce/example_reduce/test_reduce/
      
      * Update to the reduce CTest executables to enable default testing behavior when no command argument
      
      * Renaming
      Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
      9a8ee8a3
  14. 21 Mar, 2022 3 commits
    • refactored deviceBatchedGemm; removed GridwiseBatchedGemm; added fp32 and int8 to profiler (#120) · cb87b049
      Jianfeng Yan authored
      changed long_index_t to index_t when computing memory offset
      
      uncomment other ops in profiler
      
      added test for batched_gemm
      cb87b049
    • Gemm_c_shuffle (4 layouts) X (fp32 bf16 int8) (#131) · 485ea46a
      rocking5566 authored
      
      * [What] Separate fixpoint gemm from gemm example
      [Why] let example of gemm_int8 be pure gemm.
      [What]
      1. Add gemm_requant_relu_requant,
      2. Let CDataType be int32 in pure gemm, because no one uses int8 CDataType. It is also part of gemm_requant_relu_requant
      
      * Fix path
      
      * Revise cmakelist due to merge develop
      
      * Add gemm fp16 test
      
      * Extract PrepareGemmTensor
      
      * Extract TestGemm
      
      * Add test for different layout
      
      * Add 4 layouts of shuffle version of fp32
      
      * Add 4 layouts of shuffle version of int8
      
      * Add 4 layouts of shuffle version of bf16
      
      * replace all DeviceGemmPtr_ with DeviceGemmNoOpPtr to fit naming convention
      
      * Add test for non-shuffle version of gemm
      
      * Fix typo
      
      * Print kernel information
      
      * Add rest of the fp32 kernel to the test
      
      * 1. Add rest of the fp16 device op.
      2. Mark the invalid device operation
      Co-authored-by: rocking <chunylai@amd.com>
      485ea46a
    • Fix conv2d bwd data bug when filter is 1x1 and stride = 2 (#132) · b51808d7
      ltqin authored
      
      * fix bwd data filter1strid2 bug
      
      * change short to ck::bhalf_t
      
      * reset input to zero
      Co-authored-by: ltqin <letaoqin@amd.com>
      b51808d7
  15. 12 Mar, 2022 1 commit
    • Consider gemm requant relu requant as gemm fusion (#116) · 9a17e7fb
      rocking5566 authored
      
      * [What] Separate fixpoint gemm from gemm example
      [Why] let example of gemm_int8 be pure gemm.
      [What]
      1. Add gemm_requant_relu_requant,
      2. Let CDataType be int32 in pure gemm, because no one uses int8 CDataType. It is also part of gemm_requant_relu_requant
      
      * Fix path
      
      * Revise cmakelist due to merge develop
      Co-authored-by: rocking <chunylai@amd.com>
      9a17e7fb
  16. 11 Mar, 2022 2 commits
  17. 10 Mar, 2022 1 commit
    • Pr82 followup (#115) · 827301d9
      Qianfeng authored
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      827301d9
  18. 09 Mar, 2022 1 commit
    • Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obsolete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
      5d37d7bf
  19. 07 Mar, 2022 1 commit
  20. 05 Mar, 2022 3 commits
    • Reduction in Composable Kernel (#82) · e17c0d80
      Qianfeng authored
      
      * Initial adding of generic reduction
      
      * Initial adding of generic reduction ...
      
      * Updates to make compiling done
      
      * clang-format all files
      
      * clang-format some files again
      
      * Renaming in profiler/include/profile_reduce.hpp
      
      * Updates and make BlockWise cases passed
      
      * Updates and make ThreadWise and MultiBlockTwoCall cases passed
      
      * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
      
      * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
      
      * format
      
      * adding pooling
      
      * added max and average pooling
      
      * comment out cout and kernel timing
      
      * Tiny simplification in profiler/reduce_profiler.cpp
      
      * Add example for reduce_blockwise
      
      * Tiny updates
      
      * Change to pass the ElementWiseOp from device layer to kernel
      
      * Fix the vectorDim and vectorSize in Device layer
      
      * Enable vector load on both dim0 and dim1 for Threadwise method
      
      * Tiny updates
      
      * Change to let the user pass the preUnaryOp and posUnaryOp
      
      * Make pooling example work
      
      * split device_reduce_instance into two libraries
      
      * Tiny update
      
      * Replace nanPropaOpt enum by boolean propagate_nan
      
      * Simplification in DeviceReduce layer codes
      
      * update build
      
      * Change to clarify the difference between ck::half_t and half_float::half
      
      * Renaming in all the reduction codes
      
      * Add VectorSize as template parameter for device layer
      
      * Add BetaIsZero as kernel template and as AccDataType for alpha
      
      * print
      
      * Small updates for pooling
      
      * Updates for host_generic_reduction for reference
      
      * Update to make AVG pooling pass
      
      * Update to make MAX pooling with indices output pass
      
      * fix
      
      * add OutDst vector store to threadwise reduction and pooling
      
      * tweak
      
      * turn off check_indices that caused build issue
      
      * refactor pooling
      
      * clean up
      
      * turn off check_indices for building issue for php-compiler
      
      * add more tile size for odd C
      
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op
      
      * Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations
      
      * Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp
      
      * Change to the unary operators
      
      * Move the definitions of unary operations to element_wise_operation.hpp
      
      * re-org files
      
      * Refine in device interfaces and multiblock kernels
      
      * Split the reduction configurations into instances for specific methods
      
      * Update in getTypeString() of device pool2d
      
      * Renaming in host and kernel
      
      * Tiny update in profiler/src/profiler.cpp
      
      * Uncomment in device_operation/CMakeLists.txt to enable the building of all operations
      
      * Make check_indices a templated function to remove some linking issue
      
      * Renaming in the profiler reduce module
      
      * Add support for double Reduction (but disable MultiblockAtomicAdd for double)
      
      * Tiny correction of literal string
      
      * Rename DevicePoolFwd to DevicePool2dFwd
      
      * Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling
      
      * Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx
      
      * Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp
      
      * Renaming and refining in the Reduction codes
      
      * Tiny change in the unary operators
      
      * Renaming symbols and files
      
      * Renaming symbols in the kernels
      
      * Move kernel kernel_set_buffer_value to separate file
      
      * Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer
      
      * Tiny update in the kernels
      
      * Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue
      
      * Simplify a helper function in device layer
      
      * Tiny adjustment in testing data initialization
      
      * Renaming in kernel/device/host
      
      * Add two testing scripts for reduction
      
      * Refine the Unary operators in element_wise_operation.hpp
      
      * Update in the reduce profiler module
      
      * Update to the reduction testing scripts
      
      * reduce compile parallelism
      
      * change CI docker to rocm5.0
      
      * remove unused variables
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      e17c0d80
    • revert changes in threadwise copy due to PR #101 (space filling curve used in threadwise copy) (#111) · 12dfba3d
      Chao Liu authored
      12dfba3d
    • Int8 quantization gemm xdl (#108) · ad41aa0e
      rocking5566 authored
      
      * Add int8 of mk_nk_mn to the ckProfiler
      
      * Add example of int8 gemm
      
      * Fix typo, use ushort instead of half_t for bfloat16
      
      * replace ushortXXX_t with bhalfXXX_t
      
      * rename ushort to bhalf_t
      
      * Add bf16 example
      
      * Add bf16 gemm to ckProfiler
      
      * Fix alignment
      
      * Fix typo
      
      * Add unit test for gemm_xdl int8
      
      * Add gemm_xdl fp32 unit test
      
      * Add gemm_xdl bf16 unit test
      
      * fix build
      
      * fix build issue due to merge conflict
      
      * Fix build
      
      * Fix build error
      
      * [What] gemm + relu inference
      [How] gemm + requant + relu + requant + clamp
      
      * clean
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      ad41aa0e
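The commit above describes its int8 inference path as "gemm + requant + relu + requant + clamp". The log does not say how requantization is done, so the scalar sketch below assumes it means multiplying by a floating-point scale and rounding to nearest; the function names, the scales, and the rounding mode are all assumptions for illustration, applied to a single int32 GEMM accumulator.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed requantization step: scale, then round to the nearest integer value.
inline float requant(float x, float scale) { return std::nearbyint(x * scale); }

// Hypothetical epilogue applied to one int32 accumulator from the GEMM.
int8_t requant_relu_requant_clamp(int32_t acc, float scale0, float scale1)
{
    float v = requant(static_cast<float>(acc), scale0); // first requant
    v       = std::max(v, 0.0f);                        // relu
    v       = requant(v, scale1);                       // second requant
    v       = std::min(std::max(v, -128.0f), 127.0f);   // clamp to the int8 range
    return static_cast<int8_t>(v);
}
```

Clamping before the final cast is the important step: converting an out-of-range float directly to `int8_t` is undefined behaviour in C++, so the value is saturated first.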