Commits · modified_grouped_gemm_addressing_method · liyinrong / composable_kernel

01 Jul, 2022 2 commits
- Merge branch 'develop' into modified_grouped_gemm_addressing_method · f5de8b57
  Chao Liu authored 3 years ago
  
  f5de8b57
- Gemm + bias + c_permute (#312) · fa9a0a5c
  zjing14 authored 3 years ago
```
* init commit

* add desc

* finished c permute

* fixed vector lens
```
  fa9a0a5c
30 Jun, 2022 3 commits

Grouped Gemm ckProfiler hotfix (#313) · ab6c82c9
zjing14 authored 3 years ago
```
* add setWorkspace in profiler

* fix
```
ab6c82c9

Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d

Anthony Chang authored 3 years ago

* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scaler pointer

93c99f3d

Remove incorrect old packaging statement (#308) · eccf8773
Liam Wrubleski authored 3 years ago

eccf8773

28 Jun, 2022 2 commits
- modified addressing method in device_grouped_gemm_xdl.hpp · e83c7061
  root authored 3 years ago
  
  e83c7061
- modified grouped gemm addressing method · e536e096
  root authored 3 years ago
  
  e536e096
27 Jun, 2022 2 commits

external api for gemm + layernorm (#285) · 12235112

rocking5566 authored 3 years ago

* Extract base class for elementwise

* Refactor interface of DeviceGemmReduce. Do not use tuple in interface

* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add

* Unify base class of gemm + reduce and gemm + bias + add + reduce

* 1. Rename gemm_bias_add_reduce for external api
 2. Refine cmake

* Add normalize device operation

* [What] Reorder the argument
[Why] Because d0 is also the input of c.

* Add type string

* Add example of gemm_bias_add_layernorm  via external api

* Refactor example code

* clang-format

* Fix compile error

* clang-format

* Add external api for gemm_add_add_layernorm and normalize

* Add client example

* clang-format

12235112

External Interface (#304) · aebd211c

Chao Liu authored 3 years ago

* add client example

* clean

* clean

* reorg

* clean up profiler

* reorg

* clean

* fix profiler

* function for getinstances

* update client example

* update client example

* update client example

* update

* update example

* update Jenkins file

* update cmake

* update Jenkins

aebd211c

25 Jun, 2022 3 commits

Switch to standard ROCm packaging (#301) · b653c5eb

Liam Wrubleski authored 3 years ago


* Switch to standard ROCm packaging

* Revert .gitignore changes

* install new rocm-cmake version

* update readme
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

b653c5eb

add license in file (#303) · d3051d75
Chao Liu authored 3 years ago

d3051d75

Absolute include path (#281) · d1db6a0c

Chao Liu authored 3 years ago

* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obsolete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean

d1db6a0c

23 Jun, 2022 2 commits

update license (#297) · a49115b9

Chao Liu authored 3 years ago

* update license

* update license

* update license

* update license

a49115b9

Testing all fwd convolution specializations. (#259) · a2edd7d8

Adam Osewski authored 3 years ago


* UniforFill with integer values.

* Log tested instance type string.

* Add UT for all convolution specializations.

* debugging conv

* Fix dangling reference bug.

* Small refinements.

* Fix call to error checking function.

* Small refinements to tests.

* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.

* Add helper traits for AccumulatorDataType.

* Print first 5 errs in check_err for integral types.

* Rename FillUniform to FillUniformDistribution

* Refactor

* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.

* Refine test instances.

* Properly set accumulator data type.
* Add another "big" instance.

* Refactor convolution tests.

* Revert "debugging conv"

This reverts commit b109516455631ff8fd6dce99cf7c14bf8e323ebb.

* Add pragma once + format + small refinement.

* Fix some unwanted changes.

* Clang-format

* Fix profile_convnd to use renamed tensor initializer.

* Add instances for ConvFWDND kernel case 2D

* Helpers to get ConvNDFwd 2D instances.

* Refactoring.

* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.

* Refine and fix test.

* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.

* Remove AccumulatorDataType type trait.

* Update unit-tests.

* Remove operator<< overload.

* Unlock conv1d/3d nd fwd instances.

* Enable skipping calculating reference using flag.

* Fix number of channels for first ResNet50 layer.

* Clang-format.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

a2edd7d8

21 Jun, 2022 5 commits

fix Issue 291 (#294) · 4634b120
Shaojie WANG authored 3 years ago
```
* rename for typeconvert functor

* refine code
```
4634b120

Standalone softmax kernel (#284) · 15c89e81

Anthony Chang authored 3 years ago

* initial stub for standalone softmax

* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m

* host softmax validates

* compiles; to implement beta scaling

* use NaN trick to efficiently ignore OOB values during sum of exponentials

* freeload device_reduce's utility functions

* clean up interface

* adding prior value (beta scaling)

* remove restriction related to perf considerations

* apply clang-format

* clean; disable diagnostics

* resolve conflicts

* add exp wrapper

* honor HostTensorDesc interface; allow implicit cast from different vector<T> type

* test softmax for fp16/fp32

* update readme

* amend commit NaN trick

* remove redundant param added during development

* format

* replace ScalarDataType with AccDataType

* separate out test programs by precision type

* move softmax sample code to its own folder

* format

* keep up with recent changes in reduction API

* remove extra header

15c89e81
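
The softmax commit above mentions beta scaling against a prior value. A minimal host-side sketch of that idea follows, assuming the operation is `out = alpha * softmax(in) + beta * out` applied row by row. The function name `softmax_rows` and its signature are hypothetical and are not taken from the library.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference sketch (hypothetical helper, not the library's API).
// For each row of an M x K row-major matrix (K > 0) it computes
//   out = alpha * softmax(in) + beta * out
// where "out" on the right-hand side is the prior content of the output buffer.
void softmax_rows(const std::vector<float>& in,
                  std::vector<float>& out,
                  std::size_t M,
                  std::size_t K,
                  float alpha,
                  float beta)
{
    for(std::size_t m = 0; m < M; ++m)
    {
        const float* x = in.data() + m * K;
        float* y       = out.data() + m * K;

        // Subtract the row maximum before exponentiating for numerical stability.
        const float row_max = *std::max_element(x, x + K);

        // Sum of exponentials over the row.
        float sum = 0.0f;
        for(std::size_t k = 0; k < K; ++k)
            sum += std::exp(x[k] - row_max);

        // Scale the normalized value by alpha and blend in the prior output via beta.
        for(std::size_t k = 0; k < K; ++k)
            y[k] = alpha * std::exp(x[k] - row_max) / sum + beta * y[k];
    }
}
```

With `beta = 0` this reduces to a scaled softmax; a nonzero `beta` requires the output buffer to hold valid data before the call.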

Create MIT LICENSE (#229) · be60d60d

Chao Liu authored 3 years ago

* Create LICENSE

* add contributors, add license into config.hpp

* update

be60d60d

bring up to date with the usage of __builtin_amdgcn_sched_barrier (#293) · 1ae24109
Anthony Chang authored 3 years ago

1ae24109
update readme and script (#290) · ccbd8d90
Chao Liu authored 3 years ago

ccbd8d90

19 Jun, 2022 1 commit

GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9

Chao Liu authored 3 years ago

* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add comment

* use type_convert

* clean

* clean element wise op

56adf7e9
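
For readers unfamiliar with the two activations named in the commit above, here is a small sketch of the exact GeLU and the widely used tanh-based approximation. Whether the commit's fast GeLU uses exactly this approximation is an assumption; the function names are illustrative only.

```cpp
#include <cmath>

// Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2))).
inline float gelu(float x)
{
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Tanh-based approximation, often called "fast GeLU" (assumed form):
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float fast_gelu(float x)
{
    const float c = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}
```

The approximation trades a small error for avoiding `erf`, which is typically more expensive to evaluate than `tanh` or an equivalent exponential form.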

17 Jun, 2022 5 commits

Don't look up the /sys/module/amdgpu/version file. (#287) · e4584d91

Illia Silin authored 3 years ago


* use pre-built docker instead of building a new one

* try docker.image.pull

* change syntax in docker.image()

* add 30 min timeout

* increase timeout to 3 hours

* move performance tests to first stage for testing

* set image variable to the new container name

* update image name

* check available images

* check available images in both places

* try different image name

* use image ID to refer to image

* run performance on gfx90a

* fix the gpu_arch labeling, add parameter

* move env vars out of stages

* add stand-alone performance script, MI200 tests, CU numbers

* dos2unix for run_perf_tests.sh

* try the new git credentials

* use env var for git credentials

* don't look up /sys/module/amdgpu/version
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e4584d91

Regulate reduction accumulator operations and Element-wise operations (#274) · 1f543bfa

Qianfeng authored 3 years ago

* Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces

* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers

* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers

* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation

* Use struct-scope operator template instantiation for binary and unary element-wise operations

* Change a few more elementwise operations to use template for operator()

* Tiny correction in Normalize operator

* Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations

* Correction in some examples with regard to using ReduceAccDataType

* Use static_assert for UnaryDivide

* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly

* Tiny fix with regard to SetWorkSpacePointer()

1f543bfa

use universal workspace pointer in bwd-weight (#286) · 63cdd923
Shaojie WANG authored 3 years ago

63cdd923
add p_workspace to baseargument (#275) · c7a96ed5
ltqin authored 3 years ago

c7a96ed5

Gemm + bias + relu + add + layernorm (#272) · 6eb55499

rocking5566 authored 3 years ago

* Copy "gemm reduce" to "gemm bias add reduce"

* Implement gemm bias add reduction

* Fix compiler error due to merge from develop

* Add tensor operation for gemm + bias + add + reduce

* Add gemm_bais_add_reduce to ckProfiler

* Add c1 functor

* Refine type

* Use reduceAccDataType instead of explicitly float

* Change to use check_err()

* Do relu in float32 instead of bhalf_t, because bhalf_t is unsigned

* Refactor relu. using type_trait instead of overloading

* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation

* Fix denominator

* Refine naming

* Fix denominator  in host

* Remove useless include header

* Use AccDataType

* Fix static_cast order

* Refine type

* [What] Remove tuple type in the base class
[Why] The external API depends on the base class. If the base class depends on the type, we would need many classes for different types

6eb55499
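
The commit above notes that relu is done in float32 because bhalf_t is unsigned. The sketch below illustrates why, assuming bf16 values are stored as raw 16-bit unsigned integers holding the upper half of an IEEE-754 float. The alias `bf16_bits` and all helper names are hypothetical, and the float-to-bf16 conversion here simply truncates, which may differ from the library's rounding.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: a bf16 value is stored as the upper 16 bits of a float32.
using bf16_bits = std::uint16_t;

inline float bf16_to_float(bf16_bits v)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Narrowing by truncation (no rounding), for illustration only.
inline bf16_bits float_to_bf16(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<bf16_bits>(bits >> 16);
}

// Incorrect: the raw storage is unsigned, so "v > 0" is true for every
// nonzero bit pattern, including those that encode negative numbers.
inline bf16_bits relu_on_bits(bf16_bits v) { return v > 0 ? v : bf16_bits{0}; }

// Correct: widen to float, clamp at zero, then narrow back.
inline bf16_bits relu_via_float(bf16_bits v)
{
    const float f = bf16_to_float(v);
    return float_to_bf16(f > 0.0f ? f : 0.0f);
}
```

For example, -1.0 is encoded with the sign bit set, so `relu_on_bits` passes it through unchanged while `relu_via_float` clamps it to zero.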

16 Jun, 2022 2 commits

example for convnd bwd weight bf16 splitk (#265) · 561ec12f

Shaojie WANG authored 3 years ago

* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight

* add bwd weight for bf16: init

* remove redundant compute

* use datatype and split k to check whether a workspace is used

* remove unused computation for work space size

* add some code for bfp16

* add device/grid unary op

* add unary type convert to bwd-weight example

* support bf16 splitk kernel for convnd bwd weight

* 1. remove comments. 2. add checkvalidity. 3. add gridsize computation

* add workspace size check

* fix format

* change function name

561ec12f

Use new github credentials (#278) · fb9b6b1e

Illia Silin authored 3 years ago

* use pre-built docker instead of building a new one

* try docker.image.pull

* change syntax in docker.image()

* add 30 min timeout

* increase timeout to 3 hours

* move performance tests to first stage for testing

* set image variable to the new container name

* update image name

* check available images

* check available images in both places

* try different image name

* use image ID to refer to image

* run performance on gfx90a

* fix the gpu_arch labeling, add parameter

* move env vars out of stages

* add stand-alone performance script, MI200 tests, CU numbers

* dos2unix for run_perf_tests.sh

* try the new git credentials

* use env var for git credentials

fb9b6b1e

10 Jun, 2022 1 commit

Add performance tests on MI200 in CI, reporting number of CUs, add stand-alone perf test. (#277) · 1ced00a5

Illia Silin authored 3 years ago

* use pre-built docker instead of building a new one

* try docker.image.pull

* change syntax in docker.image()

* add 30 min timeout

* increase timeout to 3 hours

* move performance tests to first stage for testing

* set image variable to the new container name

* update image name

* check available images

* check available images in both places

* try different image name

* use image ID to refer to image

* run performance on gfx90a

* fix the gpu_arch labeling, add parameter

* move env vars out of stages

* add stand-alone performance script, MI200 tests, CU numbers

1ced00a5

02 Jun, 2022 3 commits

Adding Resnet50 test to Performance tests (#268) · 1677cf70

Illia Silin authored 3 years ago

* add resnet50 test to performance tests

* add blanks before gpu_arch in log files

* add resnet50 test with N=4 and process its results

* add ROCM and HIP versions to test tables

* uncomment the sql queries

* fix script syntax in jenkinsfile

1677cf70

use old ctile to avoid conv2d fwd bias relu add compute error (#271) · 1c5d06f2
Shaojie WANG authored 3 years ago

1c5d06f2

Unify the naming of the math functions used by the host and kernel (#262) · 86185bd7

Qianfeng authored 3 years ago

* Use the unified naming for math functions on host and HIP kernel

* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming

* Renaming GetReductionZeroVal() to GetIdentityValue()

* Tiny renaming in profile_reduce_impl.hpp

* More renaming in profile_reduce_impl.hpp

* Replace zeroVal by identiyVal

* Remove ck_ prefix in the naming of ck::math provided functions

86185bd7

31 May, 2022 3 commits

Pass gemm_descs for grouped gemm via __constant__ buff (#232) · b6eaf3eb

zjing14 authored 3 years ago

* moved gemm_descs_args into const buff

* use CK_CONSTANT_ADDRESS_SPACE instead of global constant

* clean

* moved hipMemAlloc outside of deviceOp

* add SetWorkSpacePointer

* fix ignore

b6eaf3eb

Multi-kernel CGEMM (#230) · 7b1e2c37

myamlak authored 3 years ago

* Reference CGEMM + test stub

* Format.

* Incomplete simple implementation

* Library instances

* Sketch of tests

* Test fixes.

* Example added

* Cosmetics

* Add elementwise operation kernel and example

* Add comment

* Add template argument of dim . Prepare to support multiple dimension

* Rename example

* Support 1 dimension

* Add static assert

* Add comment

* Second auxiliary buffer added

* Extract pad

* Remove redundant argument

* Support any dimension for elementwise operation

* Remove line

* Let it be the multiple number of CU

* Move thread per block to the parameter of constructor

* Consuming binary ops to do A+B / A-B

* Fix + cosmetics + bf16 test commented out temporarily

* Format

* Enabling bf16 test

* Revert "Enabling bf16 test"

This reverts commit f497e2ba.

* Fix + test reenabled

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend

* Single workspace for cgemm + helper

* Perf calc fix

* Review remarks: static_cast

* Review remarks: binary ops templated

* Cleaning

* Removal of instances and their tests

* Review remarks from aosew addressed

* Review remark: unnecessary attribute

* Post-merge fixes

* Restrict 4gemm to PassThrough + bug fix

* Review remarks

* update licence

* change cgemm example to fp16
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

7b1e2c37

Minor fix for recent PR (#260) · 85fc91c3

Chao Liu authored 3 years ago

* fix example

* update IsSupportedArgument

* fix

* disable fp64 conv example as test

85fc91c3

30 May, 2022 1 commit

gemm + layernorm (#261) · d32a67a9

rocking5566 authored 3 years ago

* Implement reduction mean and reduction square mean

* Refine file name

* Add reduce mean and square mean

* Fix parameter name

* Add normalize device op (not implement invoker::run())

* Remove epsilon

* Refine deviceop

* Add 5ary elementwise for normalization

* Add layernorm example

* layerNorm verification

* Fix compiler error due to merge from develop

* Fix typo

* Fix compile error

* Refine naming

* [What] Support non pointer for invoker and argument
[Why] Sync coding style with gemm

* Refine folder name

* Refine class name

* Evaluate perf of the kernel

* Fix compile error

* [What] Refine perf evaluation in example of gemm + reduction
[Why] Evaluation of gemm + reduction may cause verification to fail, because evaluation will not initialize global memory

* clang-format

d32a67a9
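
The commit above builds layernorm from two reductions (mean and mean of squares) followed by an elementwise normalization. A minimal host-side sketch of that decomposition follows, using the identity var = E[x^2] - (E[x])^2. The function `layernorm_row`, its signature, and the `eps` parameter are assumptions made for illustration and are not the library's interface.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference for one row.
// Step 1: reduce to mean and mean of squares.
// Step 2: normalize elementwise using x, mean, mean_sq, gamma and beta.
std::vector<float> layernorm_row(const std::vector<float>& x,
                                 const std::vector<float>& gamma,
                                 const std::vector<float>& beta,
                                 float eps)
{
    const float n = static_cast<float>(x.size());

    float sum    = 0.0f;
    float sum_sq = 0.0f;
    for(float v : x)
    {
        sum += v;
        sum_sq += v * v;
    }

    const float mean    = sum / n;
    const float mean_sq = sum_sq / n;

    // Variance from the two reductions: E[x^2] - (E[x])^2.
    const float var     = mean_sq - mean * mean;
    const float inv_std = 1.0f / std::sqrt(var + eps);

    std::vector<float> y(x.size());
    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];

    return y;
}
```

Computing the variance this way needs only one pass over the data, at the cost of possible cancellation error when the mean is large relative to the spread.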

27 May, 2022 1 commit

Fixing conv bug (#258) · 91d8b7d6

Chao Liu authored 3 years ago


* debugging conv

* fix oversight where ctile map is constructed before initializing c desc

* example program should return error code

* clean up

* changed Block2CTileMap in conv2d and convnd

* clean up

* clean up

* cleanup
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

91d8b7d6

26 May, 2022 2 commits

Add FP64 XDL GEMM built-in function (#199) · 3e6c2610

ltqin authored 3 years ago


* add intrin_mfma_f64_16x16x4f64

* add example

* gemm reference add double data type

* change init data

* fix M N PerXdlops

* fix ifdef

* add comparison config

* add conv fwd example

* format log out

* change rc matrix register layout

* reorganize example

* reorganize example 2

* format,because merge develop

* fix call impl adding acc data type

* lost ;

* add compiler warning

* change example tuning parameters

* add test for fp64

* add instance

* add test/gemm/gemm_fp64.cpp

* fix get name issue

* remove some tuning parameter

* fix conflict

* format

* use integer value for GEMM test

* add acc data type

* remove typeid because fp16

* fix streamconfig etc bug from merging develop

* format

* remove test_gemm_xdl_fp64

* add AccDataType

* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

3e6c2610

Add pooling example (#257) · 97c4d486

Qianfeng authored 3 years ago

* Add example for computing LayerNorm mean and meansquare

* Refactor the pool2d_fwd example and add example for float type testing

* Revert "Add example for computing LayerNorm mean and meansquare"

This reverts commit df52e6f9d897b00c981baa48f291450bcd60925d.

* Tiny fix in pool2d_fwd_common.hpp

97c4d486

25 May, 2022 2 commits

Hotfix binary elementwise (for broadcast on fastest axis) (#254) · 82d7d993

rocking5566 authored 3 years ago


* Support different length of ScalarPerVector

* Add example of broadcast on fastest axis

* Typo

* Refine fastest example

* Add dimension check

* Modify fastest broadcast example to 3d

* Require users to give scalarPerVector explicitly

* 1. Add CscalarPerVedctor
2. Not only broadcast on fastest need to set scalarPerVector to 1

* Rename var

* Move IsScalarPerVectorValid() inside IsSupportedArgument()

* Separate GridDesc_M0 into A, B and C

* rename var

* Rename var of length
Co-authored-by: rocking <chunylai@amd.com>

82d7d993

Tensile-style block to C tile map (#239) · e579c9e5

Anthony Chang authored 3 years ago

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend

* adds tensile-style c-tile map

* make it dynamic version

* add k-split flavor tile map

* apply tensile-style tile map to all xdl gridwise gemms

* remove dead code
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e579c9e5
