1. 25 May, 2022 6 commits
  2. 24 May, 2022 4 commits
    • Navi21 gemm (#197) · 40b59a63
      Jianfeng Yan authored
      
      * start adding navi21 GEMM
      
      * navi_gemm_km_kn_mn_fp32 compiles and passes one test.
      
      * rename variables and functions in gridwise_gemm_dlops_v1r3
      
      * add other 3 layouts; format instance
      
      * adding more tuning parameters
      
      add tuning parameters for other 3 layouts
      
      * add gemm_dlops_f16
      
      * tmp
      
      * add dependence of DeviceGemm::IsSupportedArg() on arch
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * push gemm_dlops into profiler
      
      * minor changes
      
      * if using xdl or dlops is moved into profiler_gemm_impl
      
      * minor changes
      
      * minor changes
      
      * remove is_xdl from profile_gemm_impl
      
      * make IsSupportedArg dependent on arch for other device_gemm
      
      * minor changes
      
      * minor changes
      
      * fix a bug in f_generate_tensor_value
      
      * add 64x64x64 for gemm_dlops_int8
      
      * add 64x64x64 for gemm_dlops_int8
      
      * comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
      
      * fix
      
      * start fixing tuning parameters
      
      * minor
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * fixing
      
      * adding example
      
      * adding example
      
      * adding example
      
      * add gemm fp32 example
      
      * clean up
      
      * use 128x128x16 as MNK tile in navi21 gemm example
      
      * bug fix
      
      * fix test
      
      * use new block c tile
      
      * clean
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: shaojiewang <wsjmessi@163.com>
    • Overhaul to Reduction and its dependants (#237) · 63eee2d9
      Qianfeng authored
      * Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
      
      * Update to host layer and host reduction
      
      * Merge and remove reduction kernels
      
      * Merge and remove reduction device interfaces and update pooling device interface
      
      * Merge and remove useless reduction device instances
      
      * Update to reduction profiler and reduction ctests
      
      * Update to reduction and pooling examples and add one reduction example
      
      * Change reduction examples to make them testable by ctest
      
      * Add explicit pass checking for reduction and pooling examples
      
      * Explicit assignment of tensor shapes in example reduce_blockwise_two_call
      
      * Use atomic_add to replace atomicAdd and add atomic_add for double type
      
      * Add reduce ctest support for double data type
      
      * Replace to_int_vector() by using c++ std::vector::assign()
      
      * Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
      
      * Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
      
      * Add GetAtomicOperationZeroValue() support for AtomicMax
      
      * Tiny change to reduce example README.md
      
      * Fix some tiny issues due to branch merging
      
      * Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
      
      * Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
      
      * Renaming
      
      * Clean up the header includes in device_reduce instance header files
    • Add performance tests as a stage of CI. (#247) · 1085794d
      Illia Silin authored
      * modify ckProfiler_gemm output
      
      * fix syntax
      
      * change ckProfiler output and return 0
      
      * fix syntax
      
      * output datatype
      
      * fix syntax
      
      * output datatype in another way
      
      * fix syntax
      
      * fix syntax
      
      * test return values of ckProfiler
      
      * add layout info and tests, make sure ckprofiler returns 0
      
      * fix syntax
      
      * change layout output
      
      * fix syntax
      
      * fix syntax again
      
      * update script to process perf results
      
      * rearrange jenkins stages
      
      * fix typo
      
      * add python packages to Docker file
      
      * adding setuptools-rust package
      
      * modify parsing for new test parameters
      
      * test db credentials on jenkins
      
      * fix syntax
      
      * update python script to handle incomplete lines
      
      * upgrade python to 3.8 and write the gemm_params table
      
      * add sqlalchemy package to docker
      
      * move perf data processing to master node
      
      * move the master node inside a steps region
      
      * add new stage for result processing
      
      * move results processing to separate stage
      
      * reduce number of tests to speedup debugging
      
      * pass config to processPerfResults stage
      
      * run script on master in a docker container
      
      * replace show_node_info
      
      * try loading docker on master node again
      
      * use ansible node instead of master
      
      * get rid of pymysql package
      
      * try ssh connection using paramiko
      
      * put back pymysql
      
      * put the perf data processing back on the gpu node
      
      * put back artifact definition
      
      * archive the perf_log before parsing
      
      * clean up jenkinsfile, fix parsing
      
      * fix typo
      
      * enable all perf tests
      
      * put all stages in original order, finalize script
      
      * fix gpu_arch version
      
      * update parsing script
      
      * remove obsolete file causing merge conflict
    • add GetWorkSpaceSize to base arg (#253) · 0d08cf18
      Shaojie WANG authored
      * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
      
      * remove redundant compute
      
      * use datatype and split k to check whether a workspace is used
      
      * remove unused computation for work space size
  3. 23 May, 2022 1 commit
  4. 20 May, 2022 8 commits
    • example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244) · ac543313
      Shaojie WANG authored
      
      * enable example of conv 1d/3d for bwd weight
      
      * make bf16 kernel not use atomic add
      
      * using new gridwise gemm for bwd weight on convnd bwd weight
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • remove options.hpp.in (#240) · 44943e0e
      Chao Liu authored
    • Refactor block to C tile map (#235) · a054f7d6
      Anthony Chang authored
      * refactor block-to-ctile-map
      
      * gridwise gemm block2ctile generic validity check
      
      * format
      
      * amend split-k gemm block2ctile map refactor
      
      * add test
      
      * format
      
      * amend
      
      * revert to calculating batch index in kernel instead of passing as block_id_z
      
      * move file
      
      * add valid ctile index check to gridwise v2r4
    • [conv bwd-weight]Binding gemm k1 to conv n (#202) · 070619fb
      Shaojie WANG authored
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * binding gemm k1 to conv n
      
      * try using half_4 to do ds_read
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * use fastest config
      
      * fix clang format for the other files
      
      * remove gemmk0 pad for output
      
      * add gemmk padding macro
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * add Right padding macro
      
      * add 2 test asm code
      
      * using 256x256x32 tile size
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * using small vec
      
      * 256*128 kernel size for example
      
      * remove asm files
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * reset gridwise gemm v2r4r2
      
      * remove unused code
      
      * revert instance file
      
      * revert example instance
      
      * format file
      
      * remove macros
      
      * resolve compile error
      
      * rename wrw kernel invoker
      
      * use gridwisegemm pipeline struct instead of implementing run function in the same header
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and... · b9b9c3b8
      Shaojie WANG authored
      [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations (#190)
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * fix clang format for the other files
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * rename kernel invoker
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Hotfix eltiwseop (#242) · bb4b82a9
      rocking5566 authored
      
      * Use vector constructor instead
      
      * Fix typo
      
      * Move blockSize to the MakeArgumentPointer
      
      * Fix naming
      
      * Fix clang format
      
      * remove blockSize from DeviceBinaryElementwise::Argument()
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Gemm reduce max (#209) · 0ffe956a
      rocking5566 authored
      
      * [What] Rename the example
      [Why] Prepare to add unary reduction
      
      * Add global operation to the parameter
      
      * Add atomicmax
      
      * Fix compile error
      
      * Support atomicMax (hip library)
      
      * Rename the reduction example
      
      * Fix target name
      
      * use p_d1_grid as the indicator directly
      
      * Prevent performance issue. Let passthrough handle it.
      
      * Implement the function template and specialize it for float2
      
      * No need to separate into two lines
      
      * Remove empty line
      
      * add comment
      
      * Fix compile error due to merge from develop
      
      * make the implementation of atomic_max / atomic_add explicit for each datatype
      
      * Refine typo
      
      * For future CI test
      
      * Fix compiler error in ckProfiler
      
      * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
      
      * simply use remove_pointer
      
      * Rename type and var
      
      * Refine example
      
      * Modify reducemax example
      
      * Fix bug in reduction
      
      * Change initialize range
      
      * Implement F64 version of atomicMax
      
      * Move reduction  code together
      
      * Add buffer atomic_max
      
      * Fix coding style by clang-format
      
      * Integrate new api of DeviceGemmReduce_Xdl_CShuffle
      
      * Integrate Batch gemm reduction
      
      * Fix example
      
      * fix example
      
      * clean up
      
      * Fix batch gemm tensor operation
      
      * Fix coding style
      
      * Fix template argument
      
      * Fix clang format
      
      * Keep flexible of different stride for each D tensor
      
      * Fix compile error for ckProfiler
      
      * Fix typo
      
      * [What] Fix naming
      [Why] Prepare to add out elementop
      
      * Add DoutElementOp
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: rocking <chunylai@amd.com>
  5. 19 May, 2022 1 commit
    • elementwise op (#238) · aafc3ac2
      rocking5566 authored
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim . Prepare to support multiple dimension
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be the multiple number of CU
      
      * Move thread per block to the parameter of constructor
      
      * rename threadPerBlock with blockSize
      
      * Support double
      
      * rename kernel function name
      
      * remove redundant include header
      
      * Refine type
      
      * Need to the final dimension
      
      * Refine variable name
      
      * Refine type
      
      * Use index_t instead of int in API
      Co-authored-by: rocking <chunylai@amd.com>
  6. 13 May, 2022 1 commit
  7. 12 May, 2022 2 commits
    • Add host API (#220) · cec69bc3
      JD authored
      
      * Add host API
      
      * manually rebase on develop
      
      * clean
      
      * manually rebase on develop
      
      * exclude tests from all target
      
      * address review comments
      
      * update client app name
      
      * fix missing lib name
      
      * clang-format update
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix test issue
      
      * refactor
      
      * refactor
      
      * refactor
      
      * update cmake and readme
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • enable convnd bwd data test (#234) · 0f912e20
      ltqin authored
  8. 11 May, 2022 1 commit
    • Manual control of MAC cluster for improved interwave performance (#184) · 76764d8c
      Anthony Chang authored
      * manual control of MAC cluster for improved 2-wave performance
      
      ensure setprio's order; ensure inner loop size >= local read size
      
      synchronize when single mac cluster
      
      * format
      
      * use value field from ck::integral_constant
      
      * roll out inter-wave loop scheduler to c-shuffle gemm variants
      
      will gradually roll out to other applicable device ops when occasional reg spill is resolved
      
      * additional comments
      
      * format
      
      * fix mismatch between inter-wave pipeline and interwave blockwise gemm
      
      * address review feedback
      
      * amend
  9. 10 May, 2022 1 commit
  10. 09 May, 2022 3 commits
    • Resolution of issue #153: Add compiler warning on comparing int and size_t (#212) · f03a1738
      myamlak authored
      
      * Turning compare warnings on
      
      * Cleaning part I
      
      * Cleaning part II
      
      * Explicit static_cast to ck::type_convert
      
      * Resolving large tensor size issue.
      
      * format
      
      * revert change to tensor descriptor; promote ElementSpaceSize to 64bit
      
      * use integer value for GEMM test
      
      * Review remarks
      
      * Review remarks + issues with (un)signed arithmetic
      
      * Format fix
      
      * Format
      
      * Clang-format.
      
      * fix 2gb limit issue
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Adam Osewski <aosewski@amd.com>
    • Update README.md (#228) · 968bd932
      Wen-Heng (Jack) Chung authored
    • Code refactor (#175) · ec7c2e91
      Chao Liu authored
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * clean up
      
      * clean up
      
      * clean up
  11. 08 May, 2022 1 commit
    • Add Benchmark test into CI (#226) · a3c910ac
      Illia Silin authored
      
      * add performance test to jenkins pipeline
      
      * fix typo
      
      * fix the syntax in conv_fwd_util.cpp
      
      * fix the error message syntax spacing
      
      * fix the error message syntax spacing again
      
      * run profile_gemm and archive results
      
      * fix typo
      
      * try to figure out the paths
      
      * try to figure out the paths one more time
      
      * skip the copying step
      
      * build ckProfiler release only once
      
      * change directory using dir
      
      * fix dir syntax
      
      * change the gemm parameters
      
      * do not pipe script output to file
      
      * try running ckProfiler directly
      
      * fix typo
      
      * use set +e
      
      * run profile_gemm.sh || true
      
      * run multiple gemms and parse results
      
      * fix typo in jenkinsfile
      
      * fix syntax
      
      * add new gemm sizes, update scripts
      
      * put all jenkins steps in original order
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
  12. 30 Apr, 2022 6 commits
  13. 29 Apr, 2022 5 commits