1. 12 Jul, 2022 1 commit
  2. 08 Jul, 2022 2 commits
    • Po Yen Chen's avatar
      GEMM pipeline v2 (#317) · 63914743
      Po Yen Chen authored
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * use pipeline_v2 for gemm kernel
      
      * Remove inconsistent indent
      
      * Fix compilation errors due to incomplete merge process
      
      * Add missing include directives
      
      * Fix compilation errors in currently unused files
      
      * Add license in newly added files
      
      * Re-format touched files by clang-format-10
      
      * Fix wrong template argument count of DeviceGemm<>
      
      * Use language construct to choose between types
      
      * Use language construct to choose GEMM example instance
      
      * Fix compilation error due to interface change
      
      * Re-use type alias to avoid duplication
      
      * Unify type alias usage in source file
      
      * Only use v2 pipeline in one gridwise GEMM type
      
      * Remove no-longer used include directives
      
      * Add static_assert() to check pipeline type requirements
      
      * Revert "Add static_assert() to check pipeline type requirements"
      
      This reverts commit f0985f0a
      
      .
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarshaojiewang <wsjmessi@163.com>
      63914743
    • Shaojie WANG's avatar
      add conv1d/3d bwd weight instances (#318) · 763ca615
      Shaojie WANG authored
      * add conv1d/3d bwd weight instances
      
      * add profiler code
      763ca615
  3. 07 Jul, 2022 1 commit
    • Chao Liu's avatar
      N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393
      Chao Liu authored
      * adding contraction
      
      * add contraction example
      
      * update examle
      
      * update example
      
      * format
      
      * update readme
      
      * clean header
      
      * clean header
      
      * contraction with multiple D
      
      * rename
      
      * fix naming issue; add instances for contraction+bilinear
      
      * change assumed virtual layout of contraction; add client example
      
      * update example
      
      * update
      
      * contraction+scale
      
      * use type_convert
      
      * rename
      4fe9c393
  4. 06 Jul, 2022 1 commit
  5. 02 Jul, 2022 1 commit
    • Chao Liu's avatar
      Gemm+Bilinear (#316) · 9e4429f9
      Chao Liu authored
      * refactor
      
      * update example
      
      * update example
      
      * gemm bilinear
      
      * clean
      
      * update
      9e4429f9
  6. 01 Jul, 2022 5 commits
    • guangzlu's avatar
      modified grouped gemm addressing method (#307) · 8e374781
      guangzlu authored
      
      * modified grouped gemm addressing method
      
      * modified addressing method in device_grouped_gemm_xdl.hpp
      Co-authored-by: default avatarroot <root@dc-smc-13.amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      8e374781
    • Anthony Chang's avatar
      Single-kernel GEMM + layernorm (#263) · 63fd5da6
      Anthony Chang authored
      
      * dump lds content in appropriate precision type
      
      * add squared add reduction op; allows sq sum
      
      * initial stub from regular gemm impl
      
      * layernorm example code & host verification
      
      * initial layernorm implementation
      
      * tidy up
      
      * make C0 precision type consistent with C
      
      * clang-tidy and additional comments
      
      * tighten up example code
      
      * account for extra flops/bytes from normalization
      
      * clang-format
      
      * c0 bias/beta/gamma now have its own precision type
      
      * AccElemOp for gemm outputs prior to feeding to layernorm
      
      * update workgroup mapping
      
      * rename kernel template param to reflect its dual use
      
      * use LDS mem pool for reduction workspace
      
      * change cshuffle precision type to f16; clean up
      
      * clang-format
      
      * correct naming
      
      * explicit cast
      
      * fully implemented gemm + bias + activation + add + norm
      
      * activation in correct order
      
      * reflect reduction API's recent change
      
      * amend
      
      * clean up; add comment
      
      * keep up with recent changes in reduction API
      
      * format
      
      * resolve merge conflicts
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      63fd5da6
    • zjing14's avatar
      add batch_stride into batched gemm (#314) · 1c8126a4
      zjing14 authored
      
      * add batch_stride
      
      * fixed test
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      1c8126a4
    • Chao Liu's avatar
      Improve external interface for GEMM and GEMM+add+add+fastgelu (#311) · 0dcb3496
      Chao Liu authored
      * interface for GEMM and GEMM+add+add+fastgelu
      
      * rename namespace
      
      * instance factory
      
      * fix build
      
      * fix build; add GEMM client example
      
      * clean
      0dcb3496
    • zjing14's avatar
      Gemm + bias + c_permute (#312) · fa9a0a5c
      zjing14 authored
      * init commit
      
      * add desc
      
      * finished c permute
      
      * fixed vector lens
      fa9a0a5c
  7. 30 Jun, 2022 3 commits
    • zjing14's avatar
      Grouped Gemm ckProfiler hotfix (#313) · ab6c82c9
      zjing14 authored
      * add setWorkspace in profiler
      
      * fix
      ab6c82c9
    • Anthony Chang's avatar
      Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d
      Anthony Chang authored
      * use 'sweep once' softmax kernel where applicable
      
      * threadwise copy's dst buffer can specify invalid element value
      
      * add int8 in/out float compute softmax support
      
      give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
      
      * format
      
      * softmax inherits DeviceNormalization
      
      * softmax profiler stub
      
      * tighten up reference softmax interface
      
      * example prints tensor dimension
      
      * add fp32 to softmax profiler
      
      * rename header
      
      * hook with ckProfiler
      
      * format
      
      * resolve merge conflict
      
      * resolve merge conflicts
      
      * update normalization profiler help string
      
      * resolve conflict
      
      * typo
      
      * remove residual
      
      * softmax profiler: address feedback
      
      * test for mixed precision input/output
      
      * fully qualify ck::math::isnan
      
      * add comment for device normalization interface
      
      * revise wording
      
      * constness for alpha/beta scaler pointer
      93c99f3d
    • Liam Wrubleski's avatar
      eccf8773
  8. 27 Jun, 2022 2 commits
    • rocking5566's avatar
      external api for gemm + layernorm (#285) · 12235112
      rocking5566 authored
      * Extract base class for elementwise
      
      * Refactor interface of DeviceGemmReduce. Do not use tuple in interface
      
      * [What] Rename d into reduce in gemm + reduction related code
      [Why] Prepare to add d term for add
      
      * Unify base class of gemm + reduce and gemm + bias + add + reduce
      
      * 1. Rename gemm_bias_add_reduce for external api
       2. Refine cmake
      
      * Add normalize device operation
      
      * [What] Reorder the argument
      [Why] Because d0 is also the input of c.
      
      * Add type string
      
      * Add example of gemm_bias_add_layernorm  via external api
      
      * Refactor example code
      
      * clang-format
      
      * Fix compile error
      
      * clang-format
      
      * Add external api for gemm_add_add_layernorm and normalize
      
      * Add client example
      
      * clang-format
      12235112
    • Chao Liu's avatar
      External Interface (#304) · aebd211c
      Chao Liu authored
      * add client example
      
      * clean
      
      * clean
      
      * reorg
      
      * clean up profiler
      
      * reorg
      
      * clea
      
      * fix profiler
      
      * function for getinstances
      
      * update client example
      
      * update client example
      
      * update client example
      
      * update
      
      * update example
      
      * update Jenkins file
      
      * update cmake
      
      * update Jenkins
      aebd211c
  9. 25 Jun, 2022 3 commits
    • Liam Wrubleski's avatar
      Switch to standard ROCm packaging (#301) · b653c5eb
      Liam Wrubleski authored
      
      * Switch to standard ROCm packaging
      
      * Revert .gitignore changes
      
      * install new rocm-cmake version
      
      * update readme
      Co-authored-by: default avatarillsilin <Illia.Silin@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      b653c5eb
    • Chao Liu's avatar
      add license in file (#303) · d3051d75
      Chao Liu authored
      d3051d75
    • Chao Liu's avatar
      Absolute include path (#281) · d1db6a0c
      Chao Liu authored
      * ad gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add client app example
      
      * update readme
      
      * delete obselete files
      
      * remove old client app
      
      * delete old file
      
      * cleaning
      
      * clean
      
      * remove half
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path for all examples
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * revert client app example
      
      * clean build
      
      * fix build
      
      * temporary disable client test on Jenkins
      
      * clean
      
      * clean
      
      * clean
      d1db6a0c
  10. 23 Jun, 2022 2 commits
    • Chao Liu's avatar
      update license (#297) · a49115b9
      Chao Liu authored
      * update license
      
      * update license
      
      * update license
      
      * update license
      a49115b9
    • Adam Osewski's avatar
      Testing all fwd convolution specializations. (#259) · a2edd7d8
      Adam Osewski authored
      
      * UniforFill with integer values.
      
      * Log tested instance type string.
      
      * Add UT for all convolution specializations.
      
      * debugging conv
      
      * Fix dangling reference bug.
      
      * Small refinements.
      
      * Fix call to error checking function.
      
      * Small refinements to tests.
      
      * Configure error tolerance
      * Change problem size.
      * Remove OddC case from types that do not support it.
      
      * Add helper traits for AccumulatorDataType.
      
      * Print first 5 errs in check_err for integral types.
      
      * Rename FillUniform to FillUniformDistribution
      
      * Refactor
      
      * Do not use typed tests.
      * Instead use plain fixture class with templatized member functions.
      * Initialize tensors with integer values.
      
      * Refine test instances.
      
      * Properly set accumulator data type.
      * Add another "big" instance.
      
      * Refactor convolution tests.
      
      * Revert "debugging conv"
      
      This reverts commit b109516455631ff8fd6dce99cf7c14bf8e323ebb.
      
      * Add pragma once + format + small refinement.
      
      * Fix some unwanted changes.
      
      * Clang-format
      
      * Fix profile_convnd to use renamed tensor initializer.
      
      * Add instances for ConvFWDND kernel case 2D
      
      * Helpers to get ConvNDFwd 2D instances.
      
      * Refactoring.
      
      * Remove "small block" instance as it was generating compiler errors.
      * Remove default template parameters values.
      
      * Refine and fix test.
      
      * Fix problem with default template parameter types.
      * Adjust error thresholds for floating point values test.
      * Use integer values initialization for instances test.
      * Add tests for ConvNDFwd 2D case.
      
      * Remove AccumulatorDataType type trait.
      
      * Update unit-tests.
      
      * Remove operator<< overload.
      
      * Unlock conv1d/3d nd fwd instances.
      
      * Enable skipping calculating reference using flag.
      
      * Fix number of channels for first ResNet50 layer.
      
      * Clang-format.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      a2edd7d8
  11. 21 Jun, 2022 5 commits
    • Shaojie WANG's avatar
      fix Issue 291 (#294) · 4634b120
      Shaojie WANG authored
      * rename for typeconvert functor
      
      * refine code
      4634b120
    • Anthony Chang's avatar
      Standalone softmax kernel (#284) · 15c89e81
      Anthony Chang authored
      * initial stub for standalone softmax
      
      * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
      
      * host softmax validates
      
      * compiles; to implement beta scaling
      
      * use NaN trick to efficiently ignore OOB values during sum of exponentials
      
      * freeload device_reduce's utility functions
      
      * clean up interface
      
      * adding prior value (beta scaling)
      
      * remove restriction related to perf considerations
      
      * apply clang-format
      
      * clean; disable diagnostics
      
      * resolve conflicts
      
      * add exp wrapper
      
      * honor HostTensorDesc interface; allow implicit cast from different vector<T> type
      
      * test softmax for fp16/fp32
      
      * update readme
      
      * amend commit NaN trick
      
      * remove redundant param added during development
      
      * format
      
      * replace ScalarDataType with AccDataType
      
      * separate out test programs by precision type
      
      * move softmax sample code to its own folder
      
      * format
      
      * keep up with recent changes in reduction API
      
      * remove extra header
      15c89e81
    • Chao Liu's avatar
      Create MIT LICENSE (#229) · be60d60d
      Chao Liu authored
      * Create LICENSE
      
      * add contributors, add license into config.hpp
      
      * update
      be60d60d
    • Anthony Chang's avatar
    • Chao Liu's avatar
      update readme and script (#290) · ccbd8d90
      Chao Liu authored
      ccbd8d90
  12. 19 Jun, 2022 1 commit
    • Chao Liu's avatar
      GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9
      Chao Liu authored
      * ad gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add comment
      
      * use type_convert
      
      * clean
      
      * clean element wise op
      56adf7e9
  13. 17 Jun, 2022 5 commits
    • Illia Silin's avatar
      Don't look up the /sys/module/amdgpu/version file. (#287) · e4584d91
      Illia Silin authored
      
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
      
      * dos2unix for run_perf_tests.sh
      
      * try the new git credentials
      
      * use env var for git credentials
      
      * don't look up /sys/module/amdgpu/version
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      e4584d91
    • Qianfeng's avatar
      Regulate reduction accumulator operations and Element-wise operations (#274) · 1f543bfa
      Qianfeng authored
      * Remove template from Reducton operation classes and add template to their operator() and GetIdentityValue() interfaces
      
      * Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
      
      * Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
      
      * Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
      
      * Use struct-scope operator template instantiation for binary and unary element-wise operations
      
      * Change a few more elementwise operations to use template for operator()
      
      * Tiny correction in Normalize operator
      
      * Add static_assert to check the data type appliability for some reduction accumulator and element-wise operatons
      
      * Correction in some examples with regard to using ReduceAccDataType
      
      * Use static_assert for UnaryDivide
      
      * Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
      
      * Tiny fix with regard to SetWorkSpacePointer()
      1f543bfa
    • Shaojie WANG's avatar
      63cdd923
    • ltqin's avatar
      add p_workspace to baseargument (#275) · c7a96ed5
      ltqin authored
      c7a96ed5
    • rocking5566's avatar
      Gemm + bias + relu + add + layernorm (#272) · 6eb55499
      rocking5566 authored
      * Copy "gemm reduce" to "gemm bias add reduce"
      
      * Implement gemm bias add reduction
      
      * Fix compiler error due to merge from develop
      
      * Add tensor operation for gemm + bias + add + reduce
      
      * Add gemm_bais_add_reduce to ckProfiler
      
      * Add c1 functor
      
      * Refine type
      
      * Use reduceAccDataType instead of explicitly float
      
      * Change to use check_err()
      
      * Do relu in float32 instead of bhalf_t. Because bhalf_t is unsigned
      
      * Refactor relu. using type_trait instead of overloading
      
      * Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
      
      * Fix denominator
      
      * Refine nameing
      
      * Fix denominator  in host
      
      * Remove useless include header
      
      * Use AccDataType
      
      * Fix static_cast order
      
      * Refine type
      
      * [What] Remove tuple type in the base class
      [Why] External api depend on base class. if base class has relationship with type, we will need many class for different type
      6eb55499
  14. 16 Jun, 2022 2 commits
    • Shaojie WANG's avatar
      example for convnd bwd weight bf16 splitk (#265) · 561ec12f
      Shaojie WANG authored
      * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
      
      * add bwd weight for bf16: init
      
      * remove redundant compute
      
      * use datatype and split k to check whether a workspace is used
      
      * remove unused computation for work space size
      
      * add some code for bfp16
      
      * add device/grid unary op
      
      * add unary type convert to bwd-weight example
      
      * support bf16 splitk kernel for convnd bwd weight
      
      * 1. remove comments. 2. add checkvalidity. 3. add gridsize computation
      
      * add workspace size check
      
      * fix format
      
      * change function name
      561ec12f
    • Illia Silin's avatar
      Use new github credentials (#278) · fb9b6b1e
      Illia Silin authored
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
      
      * dos2unix for run_perf_tests.sh
      
      * try the new git credentials
      
      * use env var for git credentials
      fb9b6b1e
  15. 10 Jun, 2022 1 commit
    • Illia Silin's avatar
      Add performance tests on MI200 in CI, reporting number of CUs, add stand-alone perf test. (#277) · 1ced00a5
      Illia Silin authored
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
      1ced00a5
  16. 02 Jun, 2022 3 commits
  17. 31 May, 2022 2 commits
    • zjing14's avatar
      Pass gemm_descs for grouped gemm via __constant__ buff (#232) · b6eaf3eb
      zjing14 authored
      * moved gemm_descs_args into const buff
      
      * use CK_CONSTANT_ADDRESS_SPACE instead of global constant
      
      * clean
      
      * moved hipMemAlloc outside of deviceOp
      
      * add SetWorkSpacePointer
      
      * fix ignore
      b6eaf3eb
    • myamlak's avatar
      Multi-kernel CGEMM (#230) · 7b1e2c37
      myamlak authored
      * Reference CGEMM + test stub
      
      * Format.
      
      * Incomplete simple implementation
      
      * Library instances
      
      * Sketch of tests
      
      * Test fixes.
      
      * Example added
      
      * Cosmetics
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim . Prepare to support multiple dimension
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Second auxiliary buffer added
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be the multiple number of CU
      
      * Move thread per block to the parameter of constructor
      
      * Consuming binary ops to do A+B / A-B
      
      * Fix + cosmetics + bf16 test commented out temporarily
      
      * Format
      
      * Enabling bf16 test
      
      * Revert "Enabling bf16 test"
      
      This reverts commit f497e2ba.
      
      * Fix + test reenabled
      
      * fix build
      
      * Revert "fix build"
      
      This reverts commit d7310238
      
      .
      
      * post PR #235 merge fix
      
      * amend
      
      * Single workspace for cgemm + helper
      
      * Perf calc fix
      
      * Review remarks: static_cast
      
      * Review remarks: binary ops templated
      
      * Cleaning
      
      * Removal of instances and their tests
      
      * Review remarks from aosew addressed
      
      * Review remark: unnecessary attribute
      
      * Post-merge fixes
      
      * Restrict 4gemm to PassThrough + bug fix
      
      * Review remarks
      
      * update licence
      
      * change cgemm example to fp16
      Co-authored-by: default avatarrocking <chunylai@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      7b1e2c37