1. 17 Jan, 2022 2 commits
  2. 26 Dec, 2021 1 commit
    • Chao Liu's avatar
      Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7
      Chao Liu authored
      * fix relu
      
      * clean up
      
      * clean up
      
      * adding 1x1 conv
      
      * adding 1x1 conv
      
      * added 1x1 conv
      
      * refactor
      
      * refactor
      
      * refactor
      
      * added profiler for conv+bias+relu+add
      
      * clean up
      
      * adding conv+bias+relu
      
      * adding conv+bias+relu
      
      * added conv+bias+relu
      
      * Update README.md
      
      * update cpu verification
      
      * adding c shuffle
      
      * update static_tensor for dealing with invalid element
      
      * adding c shuffle
      
      * debugging
      
      * fix bug
      
      * convert to fp16 before shuffle
      
      * shuffle more than one M/NRepeat
      
      * clean up
      
      * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
      
      * clean up
      
      * remove coordinate step hack from all gridwise gemm xdl
      
      * clean up coordinate step hack
      
      * clean up coordinate step hack
      
      * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
      
      * adding output shuffle in conv+bias+relu+add
      
      * update
      
      * added conv+bias+relu+add with c shuffle
      
      * added conv+bias+relu+add with c shuffle
      
      * fix forward_sweep bugs in threadwise copy
      
      * clean up
      
      * refactor
      
      * clean up
      
      * clean up
      
      * added conv_c_shuffle+bias_relu
      
      * clean up
      
      * added conv+bias+relu+atomic_add
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * misc fixes; add 1x1 specialization
      
      * clean up
      
      * delete unused device op
      
      * clean up
      
      * add support for odd C value
      acbd7bd7
  3. 13 Dec, 2021 1 commit
    • Chao Liu's avatar
      manually apply bug fix changes in pr #63 (#64) · a4f24233
      Chao Liu authored
      * Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
      * Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"
      a4f24233
  4. 04 Dec, 2021 1 commit
  5. 03 Dec, 2021 1 commit
    • Chao Liu's avatar
      GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md
      41cdd380
  6. 02 Dec, 2021 3 commits
  7. 30 Nov, 2021 2 commits
  8. 24 Nov, 2021 1 commit
  9. 18 Nov, 2021 3 commits
    • Chao Liu's avatar
      Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff
      Chao Liu authored
      * reworking vector_type
      
      * use __builtin_memcpy for bit_cast and vector access of scalar pointer
      
      * clean up
      64350aff
    • zjing14's avatar
      v5r1 fusion kernels for inference (#49) · 970fa3e9
      zjing14 authored
      
      * init
      
      * refactor for 1x1
      
      * rename e0_e1
      
      * add e1 with bugs
      
      * debug
      
      * fixed
      
      * fixed e1
      
      * add timer
      
      * imprve threadwise gemm with dot2
      
      * add e2
      
      * tuning
      
      * seperate c2
      
      * add nhwc
      
      * restore nchwc
      
      * clean
      
      * opt
      
      * fixed; tuning
      
      * add BGlobalMoveSliceWindowStepHacks{}
      
      * tuning
      
      * repeat running
      
      * adjust
      
      * merge v5r1 nchwc
      
      * add adaptors
      
      * split k0 k1 in c_thread_grid
      
      * split h and w
      
      * remove v5r1 nhwc
      
      * clean for pr
      
      * remove host_conv_add
      
      * clean code
      
      * clean
      
      * add dynamic support
      
      * static mode
      
      * test static
      
      * add conv+add fusion
      
      * fixed validation
      
      * naming fix
      
      * use activ_enum
      
      * make static
      
      * refactor conv_add for InMem::add
      
      * add bias
      
      * add conv_out
      
      * add configurable makeddesc
      
      * add maxpool fusion
      
      * add maxpool host for validation
      
      * enable static desc
      
      * conv-only use v5r1_add
      
      * test
      
      * test
      
      * for binary dumps
      
      * fixed incorrect results due to typo
      
      * clean
      
      * debugging maxpool
      
      * workaround with offset trick
      
      * clean code
      
      * modularize ops of fusion
      
      * add gridwise_gemm_v3
      
      * create seperate fusion fun
      
      * enable dynamic mode of conv and conv+resize_add
      
      * add dynamic mode of maxpool
      
      * add pass by point
      
      * add activ_type as arguments
      
      * merge develop
      
      * clean
      
      * reset config to old default
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      970fa3e9
    • zjing14's avatar
      Fixed bfp16 host_conv_fwd (#52) · a651ea4f
      zjing14 authored
      
      * fixed bfloat16 issues
      
      * refactor type_convert
      
      * fixed host_convolution_forward for ushort
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      a651ea4f
  10. 16 Nov, 2021 2 commits
  11. 15 Nov, 2021 2 commits
    • zjing14's avatar
      Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03
      zjing14 authored
      
      * init StaticBufferV2
      
      * clean
      
      * adopt old output stage for staticBufferV2
      
      * clean
      
      * remove hack
      
      * clean
      
      * clean
      
      * add parameters
      
      * clean code
      
      * move c_buffer alloc into blockwise gemm
      
      * add adaptors for m/n_thread_data_on_grid
      
      * tweak gemm
      
      * adjust blockwise_gemm_xdlops
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * add bf16_1k
      
      * clang-format
      
      * clean
      
      * add bfp16 for gfx908
      
      * add verification
      
      * clean up
      
      * clean code
      
      * restore bfl16
      
      * clean
      
      * add bfp16 support into gemm_driver
      
      * apply new generator to other drivers
      
      * add int8 support
      
      * cleanb
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      Co-authored-by: default avatarroot <root@hayabusa6111.amd.com>
      3737bb03
    • Chao Liu's avatar
      FP16 data in-register transpose (#41) · b491ebf3
      Chao Liu authored
      * start fixing 16bit data packing
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * add missing constexpr
      
      * adding static tensor
      
      * adding static tensor
      
      * adding transpose
      
      * add inline asm for transpose 2x2 of half_t
      
      * add general transpose_vectors(), but have unnecessary register initialization using v_mov
      
      * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
      
      * add hardcoded logic for NHWC wrw
      
      * improve asm for v_pack
      
      * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
      
      * tweak
      
      * reorganize file
      b491ebf3
  12. 14 Nov, 2021 1 commit
    • Chao Liu's avatar
      ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from top most folder
      
      * Update README.md
      
      * clean up
      e823d518
  13. 27 Oct, 2021 3 commits
  14. 26 Oct, 2021 1 commit
  15. 21 Oct, 2021 1 commit
  16. 19 Oct, 2021 2 commits
    • Chao Liu's avatar
      bug fix (#39) · c3018794
      Chao Liu authored
      c3018794
    • ltqin's avatar
      add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) · fd49ff80
      ltqin authored
      
      * add add new algorithm from v4r4r2
      
      * program once issue
      
      * add split k functiion
      
      * redefine code
      
      * add a matrix unmerge
      
      * add b matrix unmerge k0
      
      * trans a and b to gridegemm
      
      * nhwc init
      
      * no hacks and vector load
      
      * add hacks
      
      * modify some parameter
      
      * fix tuning prometer for fp32
      
      * fix tuning prometer for fp16
      
      * start change gridwise k split
      
      * init ok
      
      * revome a b matrix k0mk1 desc in grid
      
      * carewrite lculate gridsize
      
      * add kbatch to CalculateBottomIndex
      
      * remove some unused funtion
      
      * add clear data function before call kernel
      
      * out hacks
      
      * in hacks
      
      * rename device convolution file and function name
      
      * modify kBatch value
      
      * fix some tuning code
      
      * start from v4r4 nhwc
      
      * nhwc atomic is able to run
      
      * just for fp32
      
      * enable nchw atomic
      
      * tweak
      
      * tweak
      
      * re-arrange gridwise gemm hot loop for wrw
      
      * add wrw v4r5
      
      * v4r4r5 fp16
      
      * v4r4r4 fp16
      
      * v4r4r2 fp16
      
      * V4R4R4XDLNHWC fp16
      
      * V4R4R2XDLATOMICNCHW fp16
      
      * adjust for fp16
      
      * input gridsize
      
      * change kbatch to gridsize
      
      * testing wrw
      
      * clean up
      
      * k_batch to gridsize
      
      * fix bug
      
      * wrw v4r4r4 kbatch change to gride size
      
      * wrw v4r4r2 kbatch change to gride size
      
      * after merge , change gridwise gemm v2r4
      
      * change MakeCBlockClusterAdaptor
      
      * other method use new gridwise gemm
      
      * clean up
      
      * chapad method nge to make_right_pad_transform
      
      * kbatch out from transform function
      
      * clean up and fix bug
      
      * fix bug
      
      * using function type reduce template parameters
      
      * using auto replace define fuction type
      
      * clean up
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      fd49ff80
  17. 06 Oct, 2021 3 commits
    • Qianfeng's avatar
      [MIOpen Downstream] Fix Reduction Kernel (#34) · b2dc55f8
      Qianfeng authored
      
      * Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
      
      * Fix with regard to implementing GetZeroVal() in both kernel and host
      
      * Avoid convert to compType from dstDataType before writting the output value
      
      * Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
      
      * Add CONSTANT decorator for descriptor read buffer
      
      * Use get_thread_local_1d_id() for thread local Id
      
      * Rename GetZeroVal() to GetReductionZeroVal() in the kernels
      
      * Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
      
      * Occasional tiny simplification and update in the kernel files
      
      * Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
      
      * Update to remove OpenCL tidy checking failures
      
      * Update for better readability
      
      * Remove unused codes and not-needed template parameters in the kernel wrappers
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      b2dc55f8
    • Chao Liu's avatar
      Tweak GEMM kernel (#38) · b3e8d57d
      Chao Liu authored
      * add parameters
      
      * tweak gemm
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * clean up
      b3e8d57d
    • zjing14's avatar
      Add VectorType support into StaticBuffer (#27) · 846f462b
      zjing14 authored
      
      * init StaticBufferV2
      
      * clean
      
      * adopt old output stage for staticBufferV2
      
      * clean
      
      * remove hack
      
      * clean
      
      * clean
      
      * clean code
      
      * move c_buffer alloc into blockwise gemm
      
      * add adaptors for m/n_thread_data_on_grid
      
      * adjust blockwise_gemm_xdlops
      
      * reorder ops in GEMM hot loop
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      846f462b
  18. 29 Sep, 2021 1 commit
    • Qianfeng's avatar
      [Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156) · dfb80c4e
      Qianfeng authored
      * Squashed 'src/composable_kernel/' content from commit f6edda61
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: f6edda61
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5
      
      5781adf5 Update develop (#5) (#6)
      97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
      7b1ec41e refactor
      49c33aae refactor
      54b3e73d rename
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: 5781adf5
      
      
      
      * fix
      
      * refactor
      
      * remove online compilation from CK
      
      * refactor
      
      * fix
      
      * add ctest
      
      * tidy
      
      * add tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * tidy
      
      * add c-style pointer cast
      
      * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
      
      * fix clang warning suppression
      
      * tidy
      
      * suppress cppcheck
      
      * fix enum issue
      
      * revert chagnes to hip build
      
      * fix kernel filename
      
      * update CK build script
      
      * rename
      
      * rename
      
      * make innner product compatiable on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * refactor
      
      * refactor
      
      * change cmakelist
      
      * change ck common utility
      
      * fix
      
      * Squashed 'src/composable_kernel/' changes from 5781adf5..31b40352
      
      31b40352 Merge pull request #16 from ROCmSoftwarePlatform/develop
      b62bf8c3 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
      ccc4a1d3 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
      67ad47e7 refactor
      16effa76 refactor
      a91b68df DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      2cbabbba use int instead of index_t in kernel wrapper
      0834bc76 compiler parameter use stream
      f2ac7832 make innner product compatiable on gfx900
      4e57b30a rename
      c03045ce rename
      b2589957 update CK build script
      2c48039d fix kernel filename
      d626dccc fix enum issue
      643ebd4f tidy
      ddd49ec9 fix clang warning suppression
      4f566c62 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
      172036d7 add c-style pointer cast
      76f31319 tidy
      d1842890 tidy
      f885c131 tidy
      80120f0a tidy
      c3efeb5e tidy
      56fc0842 tidy
      54fba515 tidy
      e62bae7a tidy
      24c87289 add tidy
      61487e0a fix
      ae98b52a remove online compilation from CK
      cb954213 refactor
      73ca9701 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
      3b866461 Merge pull request #7 from ROCmSoftwarePlatform/master
      d09ea4f4 Update develop (#5)
      3d32ae94 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: 31b40352
      
      
      
      * Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
      
      * Fix with regard to implementing GetZeroVal() in both kernel and host
      
      * Avoid convert to compType from dstDataType before writting the output value
      
      * Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
      
      * Add CONSTANT decorator for descriptor read buffer
      
      * Use get_thread_local_1d_id() for thread local Id
      
      * Rename GetZeroVal() to GetReductionZeroVal() in the kernels
      
      * Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
      
      * Occasional tiny simplification and update in the kernel files
      
      * Update in src/reducetensor.cpp for consistent IDs passing to the kernel
      
      * Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
      
      * Update to remove OpenCL tidy checking failures
      
      * Small updates in src/reducetensor.cpp
      
      * Update for better readability
      
      * Remove unused codes and not-needed template parameters in the kernel wrappers
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      dfb80c4e
  19. 21 Sep, 2021 4 commits
  20. 05 Sep, 2021 2 commits
  21. 31 Aug, 2021 1 commit
    • ltqin's avatar
      Backward weight v4r4r2 with xdlops (#18) · 627d8ef3
      ltqin authored
      
      * start
      
      * modify transformat
      
      * modify device convolutiion
      
      * modify host
      
      * added host conv bwd and wrw
      
      * remove bwd, seperate wrw
      
      * clean
      
      * hacall k to zero
      
      * out log
      
      * fixed
      
      * fixed
      
      * change to (out in wei)
      
      * input hack
      
      * hack to out
      
      * format
      
      * fix by comments
      
      * change wei hacks(wei transform has not merge)
      
      * fix program once issue
      
      * fix review comment
      
      * fix vector load issue
      
      * tweak
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      627d8ef3
  22. 27 Aug, 2021 2 commits
    • Chao Liu's avatar
      Misc fixes (#24) · 10bb8110
      Chao Liu authored
      * use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64
      
      * use remove_cvref_t
      10bb8110
    • Qianfeng's avatar
      [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108) · 9e80cdce
      Qianfeng authored
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * make inner product compatible on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
      
      * Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
      
      * Update to generic reduction C++ host interface layer to support dynamic generic reduction
      
      * Update to remove tidy complaints in host interface layer
      
      * Change the unary operator form from void op(T &x) to T op(T x)
      
      * Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
      
      * Use cppcheck-suppress to prevent some strange warnings
      
      * Re-use operator [] and () for DynamicBuffer and update to depending codes
      
      * Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers
      
      * [performance] Remove un-needed local buffer initialization
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      9e80cdce