1. 15 Sep, 2022 1 commit
  2. 14 Sep, 2022 1 commit
• batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c
      ltqin authored
      
      * refactor
      
      * start
      
      * add device gemm file
      
      * add BatchStrideD0
      
* add StrideD0
      
      * add gridwise file
      
      * add d0 parameters to gridwise gemm
      
      * add c layout transformer
      
      * add d0 threadwise copy
      
      * init kernel
      
      * init kernel
      
      * regular code
      
      * nm desc put to out
      
* kernel parameters cannot be passed by reference

* host add bias+gelu

* runs correctly for bias+gelu

* move AddFastGelu into another file

* add d1 bias parameters to interface

* add d1 parameter to argument

* add d1 parameter to gridwise

* first complete code, not yet verified

* change gelu to relu and fix GetElementSpaceSize bug
      
      * add instance
      
      * start add to ckprofiler
      
      * ckprofiler finish code
      
      * change input parameter for ckProfiler
      
      * fix host bias+gelu bug
      
      * show help for ckProfiler
      
* fix bug where kernel launch ignored parameters

* add padding and fix related bug

* multiple d0

* add dynamic d0_element_op

* change profiler and instance to multiple d0

* example has 2 d0

* remove unused comments

* give each of the 2 d0 its own parameters
      
      * change d element_op name
      
      * change class name(multiple_d)
      
      * fix bug
      
* fix bug where a file could not be found
      
      * update profiler
      
      * refactor
      
      * update profiler
      
      * clean
      
      * revert example change
      
      * add gon layout
      
      * optimize parameter for gno
      
      * add gon to gemm+gemm
      
* change help text for input parameters
      
      * change to GemmPadder_v2
      
      * using ForEach
      
      * fix gb_per_sec
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>
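The commit trail above threads extra "D" tensors (bias, residual) and their element-wise ops through the batched GEMM. For orientation, a minimal host-side sketch of the resulting epilogue, assuming a bias d0, a residual d1, and an add+add+ReLU element op; the names here are illustrative, not the CK API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hedged sketch of a "multiple D" epilogue: after C = A * B, each output
// element is combined with the D tensors through a user-supplied op.
template <typename Op>
void MultipleDEpilogue(std::vector<float>& e,
                       const std::vector<float>& c,
                       const std::vector<float>& d0, // e.g. per-element bias
                       const std::vector<float>& d1, // e.g. residual input
                       Op op)
{
    for(std::size_t i = 0; i < c.size(); ++i)
        e[i] = op(c[i], d0[i], d1[i]);
}

int main()
{
    std::vector<float> c{-1.f, 2.f}, d0{0.5f, 0.5f}, d1{0.f, 1.f}, e(2);
    // mirrors the bias + activation fusion described in the log above
    MultipleDEpilogue(e, c, d0, d1, [](float ci, float x0, float x1) {
        return std::max(ci + x0 + x1, 0.f);
    });
}
```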
  3. 13 Sep, 2022 1 commit
  4. 09 Sep, 2022 1 commit
  5. 08 Sep, 2022 1 commit
  6. 07 Sep, 2022 1 commit
  7. 06 Sep, 2022 3 commits
• Fused attention instances & padding tests (#395) · 868e5c55
      Anthony Chang authored
      * modify comment
      
      * trim unnecessary check
      
      * add gemm spec in kernel name
      
      * add TNTT gemm_gemm + atten kernel instances
      
      * refactor attention padding to better fit in unit tests
      
This streamlines usage: "ResetNaNToMinusInf" is now hidden from the user-facing device op.
Also added compile-time conditionals that load the OOB value as NaN only when padding is enabled.
      
      * add adhoc padding test for atten
      
      * shrink input value range for attention kernel validation to avoid occasional error by 1e-3
      
Still unsure whether this kind of deterministic floating point accuracy issue is expected
or not. May want to try the exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.
      
      * attention kernel proper granular padding for all 4 dims
      
      * IsSupportedArgument checks
      
      * test more padded cases
      
      * block PadK specialization in attention kernels
      
      * workaround clang crash for gfx908
      
      (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
      error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
      VGPR_32: Cannot scavenge register without an emergency spill slot!"
this falls back to a less ideal way of handling NPadding in the fused attention kernel
      
      * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
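For orientation on the padding/NaN handling above: padded (out-of-bounds) attention logits are made to vanish under softmax by forcing them to -inf, since exp(-inf) == 0. A hedged host-side sketch of that masking, assuming at least one valid element per row; this is illustrative, not the kernel code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Mask the padded tail of a row of logits, then apply a numerically
// stable softmax. Padded entries contribute exactly zero weight.
void MaskedRowSoftmax(std::vector<float>& row, std::size_t valid_len)
{
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for(std::size_t j = valid_len; j < row.size(); ++j)
        row[j] = neg_inf; // OOB / padded logits

    float m = row[0]; // assumes valid_len >= 1
    for(std::size_t j = 1; j < valid_len; ++j)
        m = std::max(m, row[j]);

    float sum = 0.f;
    for(float& v : row)
    {
        v = std::exp(v - m); // exp(-inf - m) == 0 for padded entries
        sum += v;
    }
    for(float& v : row)
        v /= sum;
}
```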
• GemmGemm TNNT instances (#399) · fe52c94c
      Anthony Chang authored
      * add gemm_gemm TNNT instance
      
      * sanitize Gemm1KPack
      
      * disable instances that failed validation on mi100
• Softmax client example (#396) · 3da5c19e
      Adam Osewski authored
      
      * Update Softmax device operation interface.
      
      * Update ckProfiler.
      
      * Update Softmax UT.
      
      * Update example.
      
      * Client example.
      
      * Clang format
Co-authored-by: Adam Osewski <aosewski@amd.com>
  8. 02 Sep, 2022 1 commit
  9. 01 Sep, 2022 1 commit
  10. 31 Aug, 2022 2 commits
• Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa
      Po Yen Chen authored
      * Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
      
      * Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
      
      * Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
      
      * Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
      
      * Add 'TransformConvFwdToGemm<>' utility class (from Chao)
      
      * Use 'TransformConvFwdToGemm<>' to shorten code
      
      * Fix ill-formed method declaration
      
      * Re-implement MakeRGridDescriptor_M() function
      
      * Change problem description
      
      * Use macro to define layout types
      
      * Define K-reduced output tensor layout types
      
* Let user decide R output tensor layout
      
      * Rename variables
      
      * Add padding to the reduced output tensor if necessary
      
      * Extract common code as helper method
      
      * Remove debug message
      
      * Add missing include directive
      
      * Add partial fp16 Conv + Reduction example
      
      * Add example verification code for 2D Conv problem
      
      * Use type alias to simplify ...
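As background for the TransformConvFwdToGemm usage above: a forward convolution over an NHWC input with KYXC weights is viewed as an implicit GEMM. A hedged sketch of the dimension mapping (struct and function names are illustrative, not the CK class):

```cpp
#include <cstddef>

// Implicit-GEMM view of forward convolution (NHWC activation, KYXC weight).
struct ConvToGemmSizes
{
    std::size_t GemmM; // N * Ho * Wo: one row per output pixel
    std::size_t GemmN; // K: one column per output channel
    std::size_t GemmK; // C * Y * X: reduction over the input patch
};

ConvToGemmSizes MakeConvToGemmSizes(std::size_t N, std::size_t K, std::size_t C,
                                    std::size_t Y, std::size_t X,
                                    std::size_t Ho, std::size_t Wo)
{
    return {N * Ho * Wo, K, C * Y * X};
}
```

The K-reduced "R" output mentioned above (MakeRGridDescriptor_M) is then presumably indexed by GemmM, i.e. one reduced value per output pixel.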
• conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
      Chao Liu authored
      * refactor conv
      
      * add conv+conv example, 1x1 only
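Why 1x1-only composes into gemm+gemm: with 1x1 filters, unit stride and no padding, the output spatial size equals the input's and the patch reduction collapses to the channel dimension, so each convolution is exactly a GEMM. A minimal sketch of the shape mapping (names illustrative):

```cpp
#include <cstddef>

struct GemmShape
{
    std::size_t M, N, K;
};

// conv(NHWC input, K-many 1x1xC filters) == GEMM of (N*H*W, C) x (C, K).
// Two stacked 1x1 convs are then E = (A * B0) * B1, i.e. gemm+gemm.
GemmShape Conv1x1AsGemm(std::size_t batch, std::size_t H, std::size_t W,
                        std::size_t C, std::size_t K)
{
    return {batch * H * W, K, C};
}
```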
  11. 30 Aug, 2022 2 commits
• Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      * GEMM + Reduce max fp16+fp32
      
* Gemm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
* Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
- All ReductionAccData and R<N>DataType have to be F32; the INT32 data type gives wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data...
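For context on the F32-accumulator note above, a hedged host-side sketch of the fused mean/meansquare reduction: per output row of the MxN GEMM result, both reductions accumulate in float, matching the observation that integer accumulators gave wrong results:

```cpp
#include <cstddef>
#include <vector>

// r0[m] = mean(E[m,:]), r1[m] = mean(E[m,:]^2); accumulation stays in float.
void MeanMeanSquare(const std::vector<float>& e, std::size_t M, std::size_t N,
                    std::vector<float>& r0, std::vector<float>& r1)
{
    for(std::size_t m = 0; m < M; ++m)
    {
        float acc = 0.f, acc_sq = 0.f;
        for(std::size_t n = 0; n < N; ++n)
        {
            const float v = e[m * N + n];
            acc += v;
            acc_sq += v * v;
        }
        r0[m] = acc / N;
        r1[m] = acc_sq / N;
    }
}
```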
• Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
  12. 29 Aug, 2022 2 commits
  13. 26 Aug, 2022 2 commits
  14. 25 Aug, 2022 5 commits
  15. 24 Aug, 2022 2 commits
  16. 23 Aug, 2022 5 commits
• Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
* Re-use pre-defined alias in int4 examples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compile sources with wrong setting
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid too much generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common codes
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common codes together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148ef.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
* Handle the case where the user specifies all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b07.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
* Convert Tensor<> values after transferring data to host
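A hedged sketch of the integer check_err() behavior the commits above describe: values of different-sized integral types (say, an int4 result widened through int8 against an int8 golden tensor) are compared through a common wide type. The name and signature are illustrative, not the actual CK utility:

```cpp
#include <cstdint>
#include <vector>

template <typename OutT, typename RefT>
bool CheckErrIntegral(const std::vector<OutT>& out,
                      const std::vector<RefT>& ref,
                      std::int64_t atol = 0)
{
    if(out.size() != ref.size())
        return false;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        // widen both sides so differently sized integral types compare safely
        const std::int64_t diff = static_cast<std::int64_t>(out[i]) -
                                  static_cast<std::int64_t>(ref[i]);
        if(diff > atol || diff < -atol)
            return false;
    }
    return true;
}
```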
• Attention with output permutation (#370) · e0d8806c
      Anthony Chang authored
      * comment on specialization for TensorSpecialization::Packed
      
      * gemm_softmax_gemm with output permutation
      
      * scaling
      
      * refactor MatrixPadder; rename to GemmPadder
      
      * remove old sanity check
      
      * restore original gemm_softmax_gemm
      
      * revise comment in gemm_softmax_gemm example
      
      * use GetElementSpaceSize()
      
      * remove extra header
      
      * typo
      
      * remove archaic DeviceOpPtr
• Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583
      zjing14 authored
      
      * add examples into grouped/batched_gemm
      
      * adding splitK examples
      
      * fixed splitK
      
      * add bfp16 int8 example into splitK
      
      * formatting
      
      * use static_cast
      
      * added common for batched_gemm
      
      * add commons for examples of splitK/batched/grouped_gemm
      
      * return true
      
      * adjust splitK check tol
      
      * update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
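For orientation on split-K: the K dimension is cut into KBatch slices, each producing a partial MxN product, and the partials are reduced at the end. Since this changes float summation order, validation tolerance may need loosening (the "adjust splitK check tol" item above). A serial, hedged sketch of the idea; on the GPU each slice would map to its own workgroup set:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void SplitKGemm(const std::vector<float>& a, // MxK, row-major
                const std::vector<float>& b, // KxN, row-major
                std::vector<float>& c,       // MxN, zero-initialized
                std::size_t M, std::size_t N, std::size_t K, std::size_t KBatch)
{
    const std::size_t KPerSlice = (K + KBatch - 1) / KBatch;
    for(std::size_t slice = 0; slice < KBatch; ++slice)
    {
        const std::size_t k_begin = slice * KPerSlice;
        const std::size_t k_end   = std::min(K, k_begin + KPerSlice);
        for(std::size_t m = 0; m < M; ++m)
            for(std::size_t n = 0; n < N; ++n)
            {
                float partial = 0.f;
                for(std::size_t k = k_begin; k < k_end; ++k)
                    partial += a[m * K + k] * b[k * N + n];
                c[m * N + n] += partial; // reduction across slices
            }
    }
}
```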
• Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compile with wrong setting
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
* Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af7.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
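The element op in this example computes e = FastGelu(c + d0 + d1). A hedged sketch using the common tanh-based fast-GeLU approximation; CK's exact polynomial may differ:

```cpp
#include <cmath>

// GeLU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
float FastGelu(float x)
{
    const float u = 0.7978845608f * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.f + std::tanh(u));
}

// AddAddFastGelu-style element op: two adds fused with the activation.
float AddAddFastGelu(float c, float d0, float d1)
{
    return FastGelu(c + d0 + d1);
}
```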
• Implement padding and sanity checks for fused GEMM+GEMM (#376) · f4047c94
      Anthony Chang authored
      * GemmPadder and GemmGemmPadder
      
      * proper padding using GemmGemmPadder
      
      * test gemm_gemm padding
      
      * properly check size K in IsSupportedArgument()
      
      * properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
      
      * comment
      
      * format
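Conceptually, a GemmPadder-style helper rounds each GEMM dimension up to its tile size so the kernel always runs whole tiles, with the pad region masked out. A minimal sketch of the size arithmetic, with an illustrative name:

```cpp
#include <cstddef>

// Round size up to the next multiple of tile.
constexpr std::size_t PadTo(std::size_t size, std::size_t tile)
{
    return (size + tile - 1) / tile * tile;
}

static_assert(PadTo(1000, 256) == 1024, "M=1000 with 256-wide tiles pads to 1024");
```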
  17. 22 Aug, 2022 1 commit
  18. 18 Aug, 2022 2 commits
  19. 17 Aug, 2022 1 commit
  20. 15 Aug, 2022 2 commits
• Hotfix LDS data hazard in fused attention (#360) · c961ce92
      Anthony Chang authored
      * avoid LDS data hazard in gemm_softmax_gemm pipeline
      
      * trivial refactors
      
      * comments
      
      * shrink blockwise gemm v2 thread buffer size
      
* reclaim A-block LDS space during 2nd gemm
      
      * amend
      
      * amend
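For orientation on the hazard class fixed above: when a block reuses one LDS buffer across pipeline stages, a barrier must separate the last read of a stage from the next stage's first write (write-after-read), or a stage's reads can observe the next stage's data. A hedged HIP-style sketch, assuming a block size of 256; this is illustrative, not the CK pipeline:

```cpp
#include <hip/hip_runtime.h>

__global__ void lds_reuse_pipeline(const float* src, float* dst)
{
    __shared__ float lds_buf[256];
    const int gid = blockIdx.x * 256 + threadIdx.x;

    lds_buf[threadIdx.x] = src[gid]; // stage 1: write LDS
    __syncthreads();                 // RAW barrier before any thread reads

    const float v = lds_buf[(threadIdx.x + 1) % 256]; // stage 1: read neighbor

    __syncthreads(); // WAR barrier: omit it and the write below races the read

    lds_buf[threadIdx.x] = v * 2.f; // stage 2: reclaim the same LDS space
    __syncthreads();

    dst[gid] = lds_buf[(threadIdx.x + 255) % 256];
}
```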
• Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320) · 53ea4713
      Qianfeng authored
      * Implement multiple-reduction in one kernel (kernels, device ops, examples)
      
      * Add generic elementwise kernel and device interface
      
      * Add generator for normal-distributed data initialization
      
* Add host reference implementation of batchnorm-forward and batchnorm-infer
      
      * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
      
* Remove unneeded include in batchnorm example

* Rename generic_elementwise to elementwise in kernel and device classes/functions
      
      * Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
      
* Change example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
      
      * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
      
      * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
      
      * Removing and renaming files
      
      * Update to synchronize gemm_layernorm client example to the generic element-wise device op API
      
      * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
      
      * Merge two static member functions in device_elementwise.hpp
      
      * Remove unary_elementwise_1d kernel and device
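The batchnorm-infer reference added above computes, per channel c, y = gamma[c] * (x - mean[c]) / sqrt(var[c] + eps) + beta[c] using pre-computed running statistics. A hedged host-side sketch assuming NHWC layout:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

void BatchnormInferNHWC(const std::vector<float>& x, std::vector<float>& y,
                        const std::vector<float>& gamma, const std::vector<float>& beta,
                        const std::vector<float>& mean, const std::vector<float>& var,
                        std::size_t NHW, std::size_t C, float eps = 1e-5f)
{
    for(std::size_t i = 0; i < NHW; ++i)
        for(std::size_t c = 0; c < C; ++c)
        {
            const float inv_std = 1.f / std::sqrt(var[c] + eps);
            y[i * C + c] = gamma[c] * (x[i * C + c] - mean[c]) * inv_std + beta[c];
        }
}
```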
  21. 13 Aug, 2022 3 commits
• fix build issue (#357) · 5ee30459
      Chao Liu authored
      * fix build
      
* exclude example_gemm_max_xdl_fp16 from testing due to random failure on gfx908
• Change all device operations to use add_instance_library (#338) · fb1cbf02
      cloudhan authored
      
      * Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
      
      * update DeviceMem
Co-authored-by: Chao Liu <chao.liu2@amd.com>
• Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
* Fix bug when ThreadSliceK > 1

* Fix bug when merging two empty sets
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait on both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
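For orientation on the Welford rework: each thread keeps a running (count, mean, M2) state, and partial states are merged across threads/blocks. Note the count == 0 guard in the merge; combining two empty states must not divide by zero, which is the kind of case the "merging two empty sets" fix above addresses. A hedged sketch, not the CK implementation:

```cpp
#include <cstddef>

struct WelfordState
{
    std::size_t count = 0;
    float mean = 0.f;
    float m2   = 0.f; // running sum of squared deviations from the mean
};

// One-value update (threadwise Welford).
void WelfordUpdate(WelfordState& s, float x)
{
    ++s.count;
    const float delta = x - s.mean;
    s.mean += delta / s.count;
    s.m2 += delta * (x - s.mean);
}

// Combine two partial states (blockwise Welford).
WelfordState WelfordMerge(const WelfordState& a, const WelfordState& b)
{
    WelfordState out;
    out.count = a.count + b.count;
    if(out.count == 0)
        return out; // both inputs empty: avoid 0/0
    const float delta = b.mean - a.mean;
    out.mean = a.mean + delta * b.count / out.count;
    out.m2   = a.m2 + b.m2 + delta * delta * a.count * b.count / out.count;
    return out;
}
// variance = m2 / count, as layernorm uses.
```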