1. 11 Mar, 2022 3 commits
  2. 10 Mar, 2022 1 commit
    • Pr82 followup (#115) · 827301d9
      Qianfeng authored
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
  3. 09 Mar, 2022 1 commit
    • Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obsolete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
  4. 07 Mar, 2022 1 commit
  5. 05 Mar, 2022 5 commits
    • Reduction in Composable Kernel (#82) · e17c0d80
      Qianfeng authored
      
      * Initial adding of generic reduction
      
      * Initial adding of generic reduction ...
      
      * Updates to make compiling done
      
      * clang-format all files
      
      * clang-format some files again
      
      * Renaming in profiler/include/profile_reduce.hpp
      
      * Updates and make BlockWise cases passed
      
      * Updates and make ThreadWise and MultiBlockTwoCall cases passed
      
      * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
      
      * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
      
      * format
      
      * adding pooling
      
      * added max and average pooling
      
      * comment out cout and kernel timing
      
      * Tiny simplification in profiler/reduce_profiler.cpp
      
      * Add example for reduce_blockwise
      
      * Tiny updates
      
      * Change to pass the ElementWiseOp from device layer to kernel
      
      * Fix the vectorDim and vectorSize in Device layer
      
      * Enable vector load on both dim0 and dim1 for Threadwise method
      
      * Tiny updates
      
      * Change to let the user pass the preUnaryOp and posUnaryOp
      
      * Make pooling example work
      
      * split device_reduce_instance into two libraries
      
      * Tiny update
      
      * Replace nanPropaOpt enum by boolean propagate_nan
      
      * Simplification in DeviceReduce layer codes
      
      * update build
      
      * Change to clarify the difference between ck::half_t and half_float::half
      
      * Renaming in all the reduction codes
      
      * Add VectorSize as template parameter for device layer
      
      * Add BetaIsZero as kernel template and as AccDataType for alpha
      
      * print
      
      * Small updates for pooling
      
      * Updates for host_generic_reduction for reference
      
      * Update to make AVG pooling pass
      
      * Update to make MAX pooling with indices output pass
      
      * fix
      
      * add OutDst vector store to threadwise reduction and pooling
      
      * tweak
      
      * turn off check_indices that caused build issue
      
      * refactor pooling
      
      * clean up
      
      * turn off check_indices for building issue for php-compiler
      
      * add more tile size for odd C
      
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op
      
      * Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations
      
      * Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp
      
      * Change to the unary operators
      
      * Move the definitions of unary operations to element_wise_operation.hpp
      
      * re-org files
      
      * Refine in device interfaces and multiblock kernels
      
      * Split the reduction configurations into instances for specific methods
      
      * Update in getTypeString() of device pool2d
      
      * Renaming in host and kernel
      
      * Tiny update in profiler/src/profiler.cpp
      
      * Uncomment in device_operation/CMakeLists.txt to enable the building of all operations
      
      * Make check_indices a templated function to remove some linking issue
      
      * Renaming in the profiler reduce module
      
      * Add support for double Reduction (but disable MultiblockAtomicAdd for double)
      
      * Tiny correction of literal string
      
      * Rename DevicePoolFwd to DevicePool2dFwd
      
      * Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling
      
      * Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx
      
      * Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp
      
      * Renaming and refining in the Reduction codes
      
      * Tiny change in the unary operators
      
      * Renaming symbols and files
      
      * Renaming symbols in the kernels
      
      * Move kernel kernel_set_buffer_value to separate file
      
      * Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer
      
      * Tiny update in the kernels
      
      * Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue
      
      * Simplify a helper function in device layer
      
      * Tiny adjustment in testing data initialization
      
      * Renaming in kernel/device/host
      
      * Add two testing scripts for reduction
      
      * Refine the Unary operators in element_wise_operation.hpp
      
      * Update in the reduce profiler module
      
      * Update to the reduction testing scripts
      
      * reduce compile parallelism
      
      * change CI docker to rocm5.0
      
      * remove unused variables
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • revert changes in threadwise copy due to PR #101 (space filling curve used in threadwise copy) (#111) · 12dfba3d
      Chao Liu authored
    • Int8 quantization gemm xdl (#108) · ad41aa0e
      rocking5566 authored
      
      * Add int8 of mk_nk_mn to the ckProfiler
      
      * Add example of int8 gemm
      
      * Fix typo, use ushort instead of half_t for bfloat16
      
      * replace ushortXXX_t to bhalfXXX_t
      
      * rename ushort to bhalf_t
      
      * Add bf16 example
      
      * Add bf16 gemm to ckProfiler
      
      * Fix alignment
      
      * Fix typo
      
      * Add unit test for gemm_xdl int8
      
      * Add gemm_xdl fp32 unit test
      
      * Add gemm_xdl bf16 unit test
      
      * fix build
      
      * fix build issue due to merge conflict
      
      * Fix build
      
      * Fix build error
      
      * [What] gemm + relu inference
      [How] gemm + requant + relu + requant + clamp
      
      * clean
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Fix Tests build (#109) · 5b178874
      Chao Liu authored
      * fix tests
      
      * remove useless file
      
      * fix test build
      
      * reduce parallelism when compiling
      
      * fix test
    • Example for conv2d backward weight fp16 (#106) · 7a9b93f4
      ltqin authored
      
      * add wrw reference
      
      * start device
      
      * raw not split version
      
      * run simple example
      
      * start to use atomic add
      
      * simple transform result correct
      
      * first version that can run
      
      * fix atomic and set operator choice
      
      * add check split-k
      
      * format
      
      * change input parameter
      
      * add pad for t total
      
      * rename example index
      Co-authored-by: ltqin <letaoqin@amd.com>
  6. 04 Mar, 2022 4 commits
    • [Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32
      rocking5566 authored
      
      * Add int8 of mk_nk_mn to the ckProfiler
      
      * Add example of int8 gemm
      
      * Fix typo, use ushort instead of half_t for bfloat16
      
      * replace ushortXXX_t to bhalfXXX_t
      
      * rename ushort to bhalf_t
      
      * Add bf16 example
      
      * Add bf16 gemm to ckProfiler
      
      * Fix alignment
      
      * Fix typo
      
      * Add unit test for gemm_xdl int8
      
      * Add gemm_xdl fp32 unit test
      
      * Add gemm_xdl bf16 unit test
      
      * fix build
      
      * fix build issue due to merge conflict
      
      * Fix build
      
      * Fix build error
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • fix type in PR #101 (#107) · 0c79af12
      Chao Liu authored
    • Refactor threadwise copy using sfcurve (#101) · 0619ebf7
      Jianfeng Yan authored
      
      * add space_filling_curve
      
      * cleanup and move space_filling_curve into test
      
      * WIP: start refactoring threadwise_transfer_v1r3
      
      * threadwise_copy works but needs further refactoring
      
      * add some comments
      
      * add SpaceFillingCurve::GetIndices()
      
      * minor changes
      
      * removed GetIndices; refactored GetDstCoordinateResetStep
      
      * add DynamicBuffer::Transfer, but Add is not tested
      
      * rebased against develop
      
      * threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work
      
      * minor changes
      
      * refactored threadcopy v3r1, v2; removed old implementations
      
      * clang-format
      
      * cleanup
      
      * fix a typo in v6r3
      
      * format
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab
      ltqin authored
      
      * start conv2d bwd api
      
      * kernel running
      
      * add bwd reference
      
      * change to no shuffle
      
      * fix bwd reference
      
      * pass verification
      
      * add Filter1x1Stride1Pad0 and start testing
      
      * change some tuning parameter
      
      * fix test error
      
      * add fp16 tuning parameter
      
      * add bf16 tuning parameter
      
      * add int8 tuning parameters
      
      * change fp32 tuning parameter
      
      * add bwd to profiler
      
      * fix bug for bwd profiler
      
      * fix ckProfiler bug
      
      * change conv2d_bwd_xdl to fp16
      
      * fix bug in comments
      
      * fix precompile id
      
      * fix enum conv name
      
      * change _bwd_ to _bwd_data_
      
      * change conv2d_bwd example id
      
      * bwd to bwd data
      
      * fix prehead
      
      * fix MakeDefaultBlock2CTileMap, import from merge develop
      
      * format bwd instance
      
      * bwd to bwd data
      
      * change name bwd to bwd data
      
      * change name bwd to bwd data in example
      
      * format code
      
      * change conv2d bwd data id in example
      
      * rewrite readme for example
      
      * fix CalculateMagicNumbers about div zero
      
      * add workaround CK_WORKAROUND_SWDEV_325164
      
      * change test_conf2d_bwd_data show info
      
      * format
      
      * fix bug for workaround:CK_WORKAROUND_SWDEV_325164
      
      * format tuning parameters

      * format tuning parameters again

      * format tuning parameters 3

      * format tuning parameters 4
      
      * remove add function template
      
      * format
      
      * update comment
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  7. 03 Mar, 2022 1 commit
    • Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88) · 992f71e3
      JD authored
      
      * add docker file and make default target buildable
      
      * add Jenkinsfile
      
      * remove empty env block
      
      * fix package stage
      
      * remove render group from docker run
      
      * clean up Jenkins file
      
      * add cppcheck as dev dependency
      
      * update cmake file
      
      * Add profiler build stage
      
      * add hip_version config file for reduction operator
      
      * correct jenkins var name
      
      * Build release instead of debug
      
      * Update test CMakeLists.txt
      reorg test dir
      add test stage
      
      * reduce compile threads to prevent compiler crash
      
      * add optional debug stage, update second test
      
      * remove old test target
      
      * fix tests to return proper results and self review
      
      * Fix package name and make test run without args
      
      * change Dockerfile to use rocm4.3.1
      
      * remove parallelism from build
      
      * Lower parallelism
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  8. 28 Feb, 2022 1 commit
    • Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef
      Anthony Chang authored
      
      * add gitignore
      
      * host tensor: allow generating sequentially increasing value in a given dimension
      
      * gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
      
      - remove dangling header include
      - modify example gemm_xdl accordingly
      - infer KPack value from M/NPerXdl
      - device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
      (API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
      
      * add LDS data dump utility
      
      * profiler: reflect API change for distinct K0/K1 for A/B matrices
      
      * profiler: add conflict-free LDS write FP16 kernel instances
      
      * fix accidental perf regression
      
      * address feedback; cosmetic changes
      
      * clang-format for new files
      
      * format
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  9. 25 Feb, 2022 2 commits
    • Split k f16 (#97) · e221d11e
      zjing14 authored
      
      * init for splitk f16
      
      * a working prototype
      
      * debug
      
      * perf debug
      
      * update example
      
      * instances for mk kn
      
      * add instances for all layers
      
      * clean
      
      * clean
      
      * add tuning
      
      * format
      
      * add mn_padding into irregular tile
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Space filling curve (#96) · bdedf64b
      Jianfeng Yan authored
      * add space_filling_curve
      
      * cleanup and move space_filling_curve into test
      
      * add functions for backward and forward step; hard coded results in unit test
      
      * minor changes
  10. 23 Feb, 2022 3 commits
    • Add gridwise GEMM pipeline (#89) · 22d438ae
      Chao Liu authored
      * clean up
      
      * add multiple thread scratch to ThreadwiseTensorSliceTransfer_v3r1
      
      * add 2 stage prefetch
      
      * add more sanity check into transform_tensor_descriptor
      
      * tweak
      
      * enabling 2 stage prefetch to existing gridwise gemm; tweak

      * enabling 2 stage prefetch to existing gridwise gemm
      
      * move gridwise gemm pipeline in class; clean up
      
      * add some irregular tile size
      
      * update CalculateHasMainK0BlockLoop for multi-stage-prefetch
      
      * refactor gridwise gemm pipeline class
    • Unify Convolution FWD XDL 1D/2D implementation. (#93) · 756a7617
      Adam Osewski authored
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example directory to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example...
    • Conv3d new (#94) · 6dfb92bb
      Jianfeng Yan authored
      
      * conv3d compiles but has memory error
      
      * conv3d works
      
      * fix performance issue by using __builtin_amdgcn_readfirstlane
      
      * change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
      
      * clang-format
      
      * remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d
      
      * format
      
      * remove useless macro
      
      * add comment
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  11. 21 Feb, 2022 1 commit
  12. 19 Feb, 2022 1 commit
    • Initial Setup for CI (#86) · 2778e997
      JD authored
      
      * add docker file and make default target buildable
      
      * add Jenkinsfile
      
      * remove empty env block
      
      * fix package stage
      
      * remove render group from docker run
      
      * clean up Jenkins file
      
      * add cppcheck as dev dependency
      
      * update cmake file
      
      * Add profiler build stage
      
      * add hip_version config file for reduction operator
      
      * correct jenkins var name
      
      * Build release instead of debug
      
      * clean up
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  13. 12 Feb, 2022 1 commit
    • NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9
      ltqin authored
      
      * add fwd bf16 conv
      
      * change tuning parameter
      
      * add int8 for conv fwd
      
      * remove comments
      
      * change tuning parameter for int8
      
      * change init int8 example
      
      * add test for conv2d fwd
      
      * change device operation file pos because merge develop
      
      * fwd int8 use reference
      
      * test_conv_fwd use reference
      
      * add bracket for if statement
      
      * rename fwd example name
      
      * remove StaticBufferOfVectorTypeV2
      
      * tweak example
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  14. 11 Feb, 2022 4 commits
    • Add small tile size for fp16/fp32 and NN layout (#80) · 20a672d0
      zjing14 authored
      
      * add DeviceGemmSplitKXdl
      
      * add file device_gemm_splitk_xdl.hpp
      
      * set c matrix zero
      
      * using atomic
      
      * add all tuning parameter to f32 mkkn
      
      * grid size change to 720
      
      * add tuning parameter for NT

      * add tuning parameter for TN

      * add tuning parameter for TT

      * add m=96 tuning parameter
      
      * add lost config
      
      * debug
      
      * fix sweep
      
      * add failed tuning params
      
      * fixed sweep logic
      
      * clean
      
      * add padding to M/N for irr tile size
      
      * clean code
      
      * add element wise operation
      
      * fixed MPerBlock=96
      
      * remove macro for splitk switch
      
      * add test
      
      * add new line at the end of device_gemm_xdl_instance.hpp
      
      * remove step hack
      
      * separate split-k instance files

      * add tuning parameters

      * change desired grid size to parameters
      
      * remove slice length
      
      * add desiredgridsize parameter to ckProfiler
      
      * add lost file device_gemm_xdl_splitk_instance.hpp

      * change desired grid size to kbatch
      
      * format
      
      * format
      
      * clean up
      
      * add selection of device_instances
      
      * clean code
      
      * clean code
      
      * add small tile size in fp16 nn
      
      * test for rocm 4.5
      
      * merge develop
      
      * clean
      
      * clean
      
      * clean
      
      * remove no-use code
      
      * add padding switch to device_gemm_xdl
      
      * add padding switch for ksplit fp32
      
      * clean
      
      * clean
      
      * add files
      
      * rename
      
      * Update profiler.cpp
      
      * format
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Batched GEMM for fp16 (#79) · b53e9d08
      zjing14 authored
      * prepare host for batched_gemm
      
      * init commit of batched kernels
      
      * fixed
      
      * refine transform with freeze
      
      * m/n padding
      
      * fixed a bug; clean
      
      * add small tiles
      
      * clean
      
      * clean code
      
      * clean code
      
      * add nt, tn, tt layout
      
      * add missing file
      
      * use StaticBufferTupleOfVector instead
      
      * add reference_batched_gemm
      
      * fixed a macro
    • Support alpha beta scaling for GEMM (#78) · 6f928a08
      rocking5566 authored
      
      * [What] Add 2d version of bias, prepare to implement alpha / beta scaling
      
      * Add alpha / beta functor
      
      * Refine parameter of example
      
      * [What] Use real type instead of template
      [Why] Prevent implicit cast
      
      * Rename parameter for general operator
      
      * Remove redundant comment
      
      * Fix compile error
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • fix build breaks (#81) · 904cbe2a
      Anthony Chang authored
      
      - device_gemm_xdl_c_shuffle function signature matches split-k
      - retire host_driver since it is no longer maintained
      - linter error (unused variable)
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  15. 07 Feb, 2022 1 commit
    • GEMM+Bias+ReLU+Add (#76) · 823657ed
      Chao Liu authored
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * fix build
      
      * clean up
      
      * added example for gemm+bias+relu+add
      
      * added example for gemm+bias+relu
      
      * add profiler for gemm_s_shuffle; re-org files
      
      * add profiler
      
      * fix build
      
      * clean up
      
      * clean up
      
      * clean up
      
      * fix build
  16. 04 Feb, 2022 1 commit
  17. 03 Feb, 2022 2 commits
    • Replace llvm Intrinsics with clang builtins (#65) · 6d92959a
      zjing14 authored
      * test mfma builtins
      
      * add fp16 builtins

      * add int8 builtins

      * add bf16 builtins
      
      * simplify host conv forward
      
      * clean
      
      * clean
    • add split-k GEMM (#59) · 4be7f019
      ltqin authored
      
      * add DeviceGemmSplitKXdl
      
      * add file device_gemm_splitk_xdl.hpp
      
      * set c matrix zero
      
      * using atomic
      
      * add all tuning parameter to f32 mkkn
      
      * grid size change to 720
      
      * add tuning parameter for NT

      * add tuning parameter for TN

      * add tuning parameter for TT

      * add m=96 tuning parameter
      
      * add lost config
      
      * add element wise operation
      
      * fixed MPerBlock=96
      
      * remove macro for splitk switch
      
      * add test
      
      * add new line at the end of device_gemm_xdl_instance.hpp
      
      * remove step hack
      
      * separate split-k instance files

      * add tuning parameters

      * change desired grid size to parameters
      
      * remove slice length
      
      * add desiredgridsize parameter to ckProfiler
      
      * add lost file device_gemm_xdl_splitk_instance.hpp

      * change desired grid size to kbatch
      
      * format
      
      * format
      
      * clean up
      
      * add selection of device_instances
      
      * clean code
      
      * fix build issue
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
  18. 25 Jan, 2022 1 commit
    • Do not hardcode the function parameter, use template instead. (#72) · ca47a6cf
      rocking5566 authored
      * Do not hardcode the function parameter, use template instead.
      
      * [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API
      [Why] "C_Shuffle" version is supposed to be similar to the vanilla one
      
      * Fix typo
      Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1
  19. 21 Jan, 2022 1 commit
    • Add gemm_shuffle host api (#71) · 4d40b197
      rocking5566 authored
      * [What]
      1. Add DeviceGemmXdl_C_Shuffle
      2. Revise example of gemm_xdl
      [Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
      [How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
  20. 18 Jan, 2022 1 commit
  21. 26 Dec, 2021 1 commit
    • Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7
      Chao Liu authored
      * fix relu
      
      * clean up
      
      * clean up
      
      * adding 1x1 conv
      
      * adding 1x1 conv
      
      * added 1x1 conv
      
      * refactor
      
      * refactor
      
      * refactor
      
      * added profiler for conv+bias+relu+add
      
      * clean up
      
      * adding conv+bias+relu
      
      * adding conv+bias+relu
      
      * added conv+bias+relu
      
      * Update README.md
      
      * update cpu verification
      
      * adding c shuffle
      
      * update static_tensor for dealing with invalid element
      
      * adding c shuffle
      
      * debugging
      
      * fix bug
      
      * convert to fp16 before shuffle
      
      * shuffle more than one M/NRepeat
      
      * clean up
      
      * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
      
      * clean up
      
      * remove coordinate step hack from all gridwise gemm xdl
      
      * clean up coordinate step hack
      
      * clean up coordinate step hack
      
      * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
      
      * adding output shuffle in conv+bias+relu+add
      
      * update
      
      * added conv+bias+relu+add with c shuffle
      
      * added conv+bias+relu+add with c shuffle
      
      * fix forward_sweep bugs in threadwise copy
      
      * clean up
      
      * refactor
      
      * clean up
      
      * clean up
      
      * added conv_c_shuffle+bias_relu
      
      * clean up
      
      * added conv+bias+relu+atomic_add
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * misc fixes; add 1x1 specialization
      
      * clean up
      
      * delete unused device op
      
      * clean up
      
      * add support for odd C value
  22. 13 Dec, 2021 1 commit
    • manually apply bug fix changes in pr #63 (#64) · a4f24233
      Chao Liu authored
      * Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
      * Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"
  23. 04 Dec, 2021 1 commit
  24. 03 Dec, 2021 1 commit
    • GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md