1. 31 Aug, 2022 6 commits
  2. 30 Aug, 2022 2 commits
    • Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      * GEMM + Reduce max fp16+fp32
      
      * GEMM + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common functions of the mean_meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      - All ReductionAccData and R<N>DataType have to be F32; the INT32 data
      type gives wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to division rounding error.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
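
      A minimal host-side sketch of what these examples exercise (helper and
      type names here are illustrative, not the library's API), assuming
      row-major A[M,K] and B[K,N]. The float accumulator for the reduction
      mirrors the note above that INT32 accumulation gives wrong results:

        #include <algorithm>
        #include <limits>
        #include <vector>

        // C = A * B, then R[m] = max_n C[m][n]; the reduction accumulates
        // in float even when InT/OutT are integer types.
        template <typename InT, typename OutT>
        void gemm_reduce_max(const std::vector<InT>& a, const std::vector<InT>& b,
                             std::vector<OutT>& c, std::vector<float>& r,
                             int M, int N, int K)
        {
            for(int m = 0; m < M; ++m)
            {
                float row_max = std::numeric_limits<float>::lowest();
                for(int n = 0; n < N; ++n)
                {
                    float acc = 0.f;
                    for(int k = 0; k < K; ++k)
                        acc += static_cast<float>(a[m * K + k]) *
                               static_cast<float>(b[k * N + n]);
                    c[m * N + n] = static_cast<OutT>(acc);
                    row_max      = std::max(row_max, acc);
                }
                r[m] = row_max;
            }
        }
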
    • Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments on usage of the padded bmm+scale+softmax+bmm
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
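
      The padding itself is plain round-up arithmetic: sequence lengths that
      are not multiples of the kernel's tile sizes are padded up, and the
      padded region is masked out of the softmax. A tiny sketch (tile-size
      name assumed):

        // round x up to the next multiple of the block tile size
        constexpr int round_up(int x, int tile) { return (x + tile - 1) / tile * tile; }

        // e.g. with MPerBlock = 128, a sequence length of 1000 runs as 1024
        static_assert(round_up(1000, 128) == 1024, "");
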
  3. 29 Aug, 2022 2 commits
  4. 26 Aug, 2022 2 commits
    • Add an option to build CK with clang directly (#387) · 1e5b59df
      Illia Silin authored
      * replace hipcc compiler with clang++
      
      * build client app with hipcc
      
      * build client app with clang
      
      * add an option to build with hipcc or clang
      
      * fix the environment for client app
      
      * fix setting up compiler in cmake_build
      
      * change the way the compiler is set
    • Fixed splitk gemm fp32 (#384) · 9881625b
      zjing14 authored
      * add scripts
      
      * fixed splitK_gemm_fp32
      
      * clean
      
      * clean
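
      For reference, split-K partitions the K dimension across workgroups and
      combines the partial products. A host sketch of the computation (names
      illustrative; real kernels combine partials with atomic adds or a final
      reduction pass):

        // c must be zero-initialized before the call
        void splitk_gemm_ref(const float* a, const float* b, float* c,
                             int M, int N, int K, int k_batch)
        {
            const int k_per_batch = K / k_batch; // assume divisible, for brevity
            for(int kb = 0; kb < k_batch; ++kb)  // one partial product per batch
                for(int m = 0; m < M; ++m)
                    for(int n = 0; n < N; ++n)
                    {
                        float partial = 0.f;
                        for(int k = kb * k_per_batch; k < (kb + 1) * k_per_batch; ++k)
                            partial += a[m * K + k] * b[k * N + n];
                        c[m * N + n] += partial;
                    }
        }
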
  5. 25 Aug, 2022 5 commits
  6. 24 Aug, 2022 2 commits
  7. 23 Aug, 2022 5 commits
    • Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
      * Re-use pre-defined alias in int4 examples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compiling sources with wrong settings
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid too much generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common codes
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common codes together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148ef.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
      * Handle the case where the user specifies all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b07.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
      * Convert Tensor<> values after transferring data to host
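
      These int4 examples hinge on keeping the user-side type (ck::int4_t)
      apart from the kernel-side storage type (int8), and on widening both
      sides before comparing. A hypothetical helper in that spirit (not the
      library's actual check_err() signature):

        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // compare possibly different-sized integral results after widening
        template <typename OutT, typename RefT>
        bool check_err_integral(const std::vector<OutT>& out,
                                const std::vector<RefT>& ref, std::int64_t atol = 0)
        {
            if(out.size() != ref.size())
                return false;
            for(std::size_t i = 0; i < out.size(); ++i)
            {
                const std::int64_t o    = static_cast<std::int64_t>(out[i]);
                const std::int64_t r    = static_cast<std::int64_t>(ref[i]);
                const std::int64_t diff = o > r ? o - r : r - o;
                if(diff > atol)
                {
                    std::fprintf(stderr, "mismatch at index %zu\n", i);
                    return false;
                }
            }
            return true;
        }
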
    • Attention with output permutation (#370) · e0d8806c
      Anthony Chang authored
      * comment on specialization for TensorSpecialization::Packed
      
      * gemm_softmax_gemm with output permutation
      
      * scaling
      
      * refactor MatrixPadder; rename to GemmPadder
      
      * remove old sanity check
      
      * restore original gemm_softmax_gemm
      
      * revise comment in gemm_softmax_gemm example
      
      * use GetElementSpaceSize()
      
      * remove extra header
      
      * typo
      
      * remove archaic DeviceOpPtr
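
      Output permutation here means the kernel writes the second GEMM's
      result through permuted strides, fusing a layout transpose into the
      epilogue. A sketch of the index math (layout names assumed: attention
      naturally yields [batch, head, seq, head_dim]; downstream ops often
      want [batch, seq, head, head_dim]):

        #include <cstddef>

        // logical access (b, h, s, d) into a physical
        // [batch, seq, head, head_dim] buffer
        inline std::size_t out_offset(int b, int h, int s, int d,
                                      int H, int S, int D)
        {
            return ((static_cast<std::size_t>(b) * S + s) * H + h) * D + d;
        }
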
    • Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583
      zjing14 authored
      
      * add examples into grouped/batched_gemm
      
      * adding splitK examples
      
      * fixed splitK
      
      * add bfp16 int8 example into splitK
      
      * formatting
      
      * use static_cast
      
      * added common for batched_gemm
      
      * add commons for examples of splitK/batched/grouped_gemm
      
      * return true
      
      * adjust splitK check tol
      
      * update example
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
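
      The difference between the two families, in one sketch: batched GEMM
      runs the same M x N x K problem at a fixed batch stride, while grouped
      GEMM carries one descriptor per group so every group can have its own
      shape (descriptor layout hypothetical):

        #include <vector>

        struct GemmDesc
        {
            int M, N, K;
            const float* a;
            const float* b;
            float* c;
        };

        void grouped_gemm_ref(const std::vector<GemmDesc>& groups)
        {
            for(const auto& g : groups) // each group may differ in shape
                for(int m = 0; m < g.M; ++m)
                    for(int n = 0; n < g.N; ++n)
                    {
                        float acc = 0.f;
                        for(int k = 0; k < g.K; ++k)
                            acc += g.a[m * g.K + k] * g.b[k * g.N + n];
                        g.c[m * g.N + n] = acc;
                    }
        }
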
    • Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compile with wrong setting
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af7.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
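
      The fused epilogue computes e = fastgelu(c + d0 + d1). A scalar sketch
      using the common tanh approximation of GeLU (the library's exact
      polynomial may differ):

        #include <cmath>

        inline float fast_gelu(float x)
        {
            return 0.5f * x *
                   (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
        }

        inline float add_add_fast_gelu(float c, float d0, float d1)
        {
            return fast_gelu(c + d0 + d1);
        }
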
    • Implement padding and sanity checks for fused GEMM+GEMM (#376) · f4047c94
      Anthony Chang authored
      * GemmPadder and GemmGemmPadder
      
      * proper padding using GemmGemmPadder
      
      * test gemm_gemm padding
      
      * properly check size K in IsSupportedArgument()
      
      * properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
      
      * comment
      
      * format
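
      The flavor of the checks being added, as a simplified sketch (the real
      IsSupportedArgument() also inspects the instance's padding
      specialization):

        // vector loads of width SrcScalarPerVector must not run past an
        // unpadded K dimension
        bool is_supported(int K, int src_scalar_per_vector, bool k_padded)
        {
            return k_padded || (K % src_scalar_per_vector == 0);
        }
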
  8. 22 Aug, 2022 1 commit
  9. 18 Aug, 2022 2 commits
  10. 17 Aug, 2022 1 commit
  11. 15 Aug, 2022 2 commits
    • Hotfix LDS data hazard in fused attention (#360) · c961ce92
      Anthony Chang authored
      * avoid LDS data hazard in gemm_softmax_gemm pipeline
      
      * trivial refactors
      
      * comments
      
      * shrink blockwise gemm v2 thread buffer size
      
      * reclaim A-block LDS space during 2nd gemm
      
      * amend
      
      * amend
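
      The hazard in miniature (a HIP/CUDA-style skeleton, not the CK pipeline
      itself): when one LDS buffer is reused across loop iterations, a
      barrier is needed after the reads as well as after the writes,
      otherwise a fast wave can overwrite data a slow wave has not yet read:

        __global__ void lds_reuse_skeleton(const float* a, float* out, int num_tiles)
        {
            __shared__ float tile[256];
            float acc = 0.f;
            for(int t = 0; t < num_tiles; ++t)
            {
                tile[threadIdx.x] = a[t * 256 + threadIdx.x]; // write LDS
                __syncthreads(); // writes visible before any thread reads
                acc += tile[255 - threadIdx.x];               // read LDS
                __syncthreads(); // the fix: reads finish before the next
                                 // iteration overwrites the same buffer
            }
            out[threadIdx.x] = acc;
        }
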
    • Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320) · 53ea4713
      Qianfeng authored
      * Implement multiple-reduction in one kernel (kernels, device ops, examples)
      
      * Add generic elementwise kernel and device interface
      
      * Add generator for normal-distributed data initialization
      
      * Add host reference implementation of batchnorm-forward and batchnorm-infer
      
      * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
      
      * Remove unneeded includes in batchnorm example
      
      * Rename generic_elementwise to elementwise in kernel and device classes/functions
      
      * Change gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
      
      * Change example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
      
      * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
      
      * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
      
      * Removing and renaming files
      
      * Update to synchronize gemm_layernorm client example to the generic element-wise device op API
      
      * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
      
      * Merge two static member functions in device_elementwise.hpp
      
      * Remove unary_elementwise_1d kernel and device
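
      The batchnorm-infer math that the host reference validates, per
      channel: y = gamma * (x - mean) / sqrt(var + eps) + beta, with
      mean/var taken from the saved statistics. A scalar sketch:

        #include <cmath>

        inline float bn_infer(float x, float mean, float var,
                              float gamma, float beta, float eps = 1e-5f)
        {
            return gamma * (x - mean) / std::sqrt(var + eps) + beta;
        }
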
  12. 13 Aug, 2022 8 commits
    • fix build issue (#357) · 5ee30459
      Chao Liu authored
      * fix build
      
      * exclude example_gemm_max_xdl_fp16 from testing due to random failure on gfx908
    • Change all device operations to use add_instance_library (#338) · fb1cbf02
      cloudhan authored
      
      * Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
      
      * update DeviceMem
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix bug when ThreadSliceK > 1
      
      * Fix bug when merging two empty sets
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
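
      For reference, the Welford pieces these commits wire into layernorm: a
      running (count, mean, M2) update per thread, plus Chan's formula to
      merge two partial aggregates. The merge must tolerate empty sets,
      which is the bug called out above. A minimal sketch, not the CK
      implementation:

        struct Welford
        {
            int count  = 0;
            float mean = 0.f, m2 = 0.f;
        };

        inline void update(Welford& w, float x)
        {
            ++w.count;
            const float delta = x - w.mean;
            w.mean += delta / w.count;
            w.m2 += delta * (x - w.mean);
        }

        inline Welford merge(const Welford& a, const Welford& b)
        {
            Welford out;
            out.count = a.count + b.count;
            if(out.count == 0)
                return out; // merging two empty sets must not divide by zero
            const float delta = b.mean - a.mean;
            out.mean = a.mean + delta * b.count / out.count;
            out.m2   = a.m2 + b.m2 + delta * delta * a.count * b.count / out.count;
            return out;
        }
        // layernorm then uses mean and variance = m2 / count
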
    • Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
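
      What the fused kernel computes, as a host reference; the point of the
      fusion is that the intermediate C = A*B0 stays in registers/LDS
      instead of round-tripping through global memory:

        #include <cstddef>
        #include <vector>

        // E[M,O] = (A[M,K] * B0[K,N]) * B1[N,O], row-major
        void gemm_gemm_ref(const float* a, const float* b0, const float* b1,
                           float* e, int M, int N, int K, int O)
        {
            std::vector<float> c(static_cast<std::size_t>(M) * N, 0.f);
            for(int m = 0; m < M; ++m)
                for(int n = 0; n < N; ++n)
                    for(int k = 0; k < K; ++k)
                        c[m * N + n] += a[m * K + k] * b0[k * N + n];
            for(int m = 0; m < M; ++m)
                for(int o = 0; o < O; ++o)
                {
                    float acc = 0.f;
                    for(int n = 0; n < N; ++n)
                        acc += c[m * N + n] * b1[n * O + o];
                    e[m * O + o] = acc;
                }
        }
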
    • Skip LDS of B matrix (#326) · 10b3278b
      ltqin authored
      * start
      
      * read for gridwise gemm
      
      * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1
      
      * add thread  copy desc and register buffer
      
      * add K0PerBlock dim
      
      * add read global data
      
      * finish gridwise gemm
      
      * finish blockwise gemm
      
      * add print data
      
      * add smallest config
      
      * add compare code for gridwise gemm
      
      * fix NXdlPerWave
      
      * fix K0PerThread and gridwise gemm main loop
      
      * remove b matrix lds alloc
      
      * fix name
      
      * add test code
      
      * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter
      
      * add double register
      
      * modify b_thread_desc_
      
      * add float
      
      * fp16 tag
      
      * add tail for pipeline
      
      * finish main loop
      
      * optimize main loop
      
      * start clear gridwise gemm
      
      * clear code
      
      * clear redundant code
      
      * change file name
      
      * change file name
      
      * fix bug after merging develop
      
      * fix input parameters
      
      * use MultiK0 to control the B-load data loop
      
      * fix some config
      
      * 4 buffer
      
      * fix bug
      
      * one can use
      
      * change read order
      
      * change buffer array to tuple
      
      * change to 8 buffer
      
      * interleave buffer load
      
      * change to 16
      
      * read 8 buffer
      
      * add data buffer to template
      
      * fix after merging develop (header file)
      
      * format
      
      * change to 4 buffer
      
      * remove unnecessary lambda fun
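
      The underlying idea: B tiles go from global memory straight into a
      small ring of register buffers, skipping LDS entirely, and the load of
      the next tile overlaps the math on the current one (the log above
      experiments with 4/8/16 buffers). A CPU-flavored sketch under those
      assumptions, names illustrative:

        template <int BufSize, typename F>
        void consume_b_tiles(const float* b_global, int num_tiles, F&& accumulate)
        {
            float buf[2][BufSize]; // register buffers in the real kernel
            auto load = [&](int t, float* dst) {
                for(int i = 0; i < BufSize; ++i)
                    dst[i] = b_global[t * BufSize + i]; // global -> registers, no LDS
            };
            load(0, buf[0]); // prologue: prefetch tile 0
            for(int t = 0; t < num_tiles; ++t)
            {
                if(t + 1 < num_tiles)
                    load(t + 1, buf[(t + 1) & 1]); // overlap next load with math
                accumulate(buf[t & 1]);            // tail iteration has no prefetch
            }
        }
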
    • Add examples for reduction fp16/fp32/bf16/int8/fp64 for 3d/4d/5d (#342) · 14932e8d
      Qianfeng authored
      * Update the reduce_blockwise example to support user specified data type and input+reducing dimensions
      
      * Add examples for using reduce_multiblock_atomic_add
      
      * Add more running examples to the default command-line
      
      * Remove unnecessary header includes
      
      * Update the example README.md
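
      The examples all reduce user-chosen dimensions of an n-d tensor; the
      3-d case with an ADD reduction over the middle dimension is enough to
      show the pattern the 4-d/5-d variants generalize:

        #include <cstddef>
        #include <vector>

        // in[D0, D1, D2] -> out[D0, D2], reducing D1 with ADD
        void reduce_middle_dim(const std::vector<float>& in, std::vector<float>& out,
                               int D0, int D1, int D2)
        {
            out.assign(static_cast<std::size_t>(D0) * D2, 0.f);
            for(int i = 0; i < D0; ++i)
                for(int j = 0; j < D1; ++j)
                    for(int k = 0; k < D2; ++k)
                        out[i * D2 + k] += in[(i * D1 + j) * D2 + k];
        }
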
    • Gemm multiple d multiple r (#335) · 6c3c06bf
      rocking5566 authored
      * Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction
      
      * Implement run of kernel
      
      * Add example
      
      * Fix typo in parameter
      
      * Rewrite the reduceMax example
      
      * Rewrite the reduceMean + reduceMeanSquare example
      
      * Refine naming
      
      * Refine folder name
      
      * refine naming
      
      * Rewrite the gemm + bias + relu + add + layernorm example
      
      * Rewrite the gemm + layernorm example
      
      * clang-format
      
      * Fix bug in LDS sync
      
      * Fix compile error
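
      The interface's contract, in scalar form: E = cde_op(C, D0, D1, ...)
      elementwise, then each R_i reduces E inside the same kernel, e.g. the
      mean and mean-of-squares that feed layernorm. A host sketch of that
      epilogue (names illustrative):

        #include <vector>

        // mean and meansquare must be sized M on entry
        void mean_meansquare_epilogue(const std::vector<float>& e, int M, int N,
                                      std::vector<float>& mean,
                                      std::vector<float>& meansquare)
        {
            for(int m = 0; m < M; ++m)
            {
                float s = 0.f, ss = 0.f;
                for(int n = 0; n < N; ++n)
                {
                    const float v = e[m * N + n];
                    s += v;
                    ss += v * v;
                }
                mean[m]       = s / N;
                meansquare[m] = ss / N;
            }
        }
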
    • Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwise softmax v1
      
      * iteratively update softmax+gemm
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
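
      Between the two GEMMs sits a row-wise softmax; the kernel computes it
      blockwise with running max/sum so the full row never materializes. The
      math it must reproduce, as a host reference:

        #include <algorithm>
        #include <cmath>
        #include <vector>

        // in-place softmax over each length-N row of s
        void softmax_rows(std::vector<float>& s, int M, int N)
        {
            for(int m = 0; m < M; ++m)
            {
                float mx = s[m * N];
                for(int n = 1; n < N; ++n)
                    mx = std::max(mx, s[m * N + n]); // row max, for stability
                float sum = 0.f;
                for(int n = 0; n < N; ++n)
                {
                    s[m * N + n] = std::exp(s[m * N + n] - mx);
                    sum += s[m * N + n];
                }
                for(int n = 0; n < N; ++n)
                    s[m * N + n] /= sum;
            }
        }
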
  13. 12 Aug, 2022 2 commits