Commits · pipeline_selector · liyinrong / composable_kernel

03 Oct, 2022 3 commits
- Enable the instances · 6da1d4a1
  Rosty Geyyer authored 2 years ago
  
  6da1d4a1
- Merge branch 'develop' into pipeline_selector · 3c05d24f
  Rostyslav Geyyer authored 2 years ago
  
  3c05d24f
- update document: Readme, contributors, citation, (#463) · 473ba5bc
  Chao Liu authored 2 years ago
```
* update cmake script

* update readme

* Update README.md

* add citation

* add images

* Update README.md

* update

* Update README.md

* Update CONTRIBUTORS.md

* Update README.md

* Update CITATION.cff

* Update README.md

* Update CITATION.cff
```
  473ba5bc
01 Oct, 2022 1 commit

Allow setting ROCM version, activate cchache, etc. (#462) · 7fc3ed76

Illia Silin authored 2 years ago

* enable ccache and decouple it from MIOpen ccache use

* fix the ccache check script

* use another method to get server name

* fix syntax

* add quotes around the server name variable

* use check_host as function

* change syntax

* fix syntax

* test if server name is parsed correctly

* try different syntax

* check the env var value

* test new check node function

* add ROCMVERSION parameter and fix script syntax

* fix script syntax

* add missing instances of rocm version

* install ccache in the docker image

* do not check GPU in clang format stage, clean up old code

* update defaults and clean up

7fc3ed76

27 Sep, 2022 1 commit

Fix build issues, set new compiler default, etc. (#451) · b8825547

Illia Silin authored 2 years ago

* add an option to select specific compiler commit

* change the logic of forcing building a docker

* add check for compiler commit in dockerfile

* compiler check syntax fix

* change compiler selection logic

* fix the new compiler build issue

* set new compiler as default, update dev-requirements

* fix jenkins syntax

* fix docker syntax

* get rid of hipcc.pl editing in jenkinsfile

* fix the hipcc.pl in both places

* try to fix the 10738 compiler linking bug

* fix syntax

* use dockerhub to store images

* use newer amd-stg-open commit as default

b8825547

26 Sep, 2022 1 commit
- Disable flag check · e9e1ee3b
  Rosty Geyyer authored 2 years ago
  
  e9e1ee3b
23 Sep, 2022 4 commits
- Test disable flag check · ed9b5faa
  Rosty Geyyer authored 2 years ago
  
  ed9b5faa
- Merge branch 'develop' into pipeline_selector · 5abe4cde
  Rostyslav Geyyer authored 2 years ago
  
  5abe4cde
- Add flags to disable added instances · 87c4afb2
  Rosty Geyyer authored 2 years ago
  
  87c4afb2
- Fix device instance libarary to include all instances (#418) · 2c6d63d0
  JD authored 2 years ago
```
* fix device instance library to add all instances

* remove cppcheck from requirements.txt
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
  2c6d63d0
22 Sep, 2022 4 commits
- Merge branch 'develop' into pipeline_selector · 689d1cbb
  Rostyslav Geyyer authored 2 years ago
  
  689d1cbb
- fix build (#434) · e9d4e893
  Chao Liu authored 2 years ago
```
* fix

* fix

* add instance
```
  e9d4e893
- Merge branch 'develop' into pipeline_selector · a4bc5fad
  Rostyslav Geyyer authored 2 years ago
  
  a4bc5fad
- Replace the obsolete offload-arch flags with GPU_TARGETS and fix a bug. (#437) · aa0b0515
  Illia Silin authored 2 years ago
```
* replace obsolete offload-arch flags with GPU_TARGETS

* fix a build error for client app

* replace commma with semicolon in GPU_TARGETS
```
  aa0b0515
21 Sep, 2022 4 commits

Updated the supported components (#435) · 7acbf104
Lixun Zhang authored 2 years ago

7acbf104

Build the CK targets only once. (#433) · 85b0920d

Illia Silin authored 2 years ago

* build CK only once, use deb package in all subsequent stages

* update jenkins file

* change prefix for build_CK stage

* update writing deb metadata to control file

* update ubuntu source for docker, script syntax for deb package metadata

* try different way to create deb metadata

* clean up DEBIAN before creating one

* fix the CI folder names, fix splitK qa

* use correct docker in all stages, separate tests for splitK verification and performance

* clean old comments, change dir before packaging

* use different package syntax

* change packaging syntax

* package with cmake

* remove unnecessary build prefix

* get rid of unnecessary paths

* change paths during unpacking

* change script syntax while unpacking

* get rid of unneccesary steps

* get rid of comments in the scripts

* use double quotes for scripts

* add ccache during build, try dpkg -x

* pull and install each package separately

* use full package names

* try to use stashing for packages

* change stash/unstash syntax

* move unstash out of shell, run tests on any gpu node

* unpack each package separately

* try re-using existing workspace

* merge the build and test stages, only stash ckProfiler

* merge the build and test stages, only stash zipped ckProfiler

* fix syntax

* add GPU check before build and test, rename docker to usual name

85b0920d

Merge branch 'develop' into pipeline_selector · ff519fc3
Rostyslav Geyyer authored 2 years ago

ff519fc3
fixed G offset calc for long_index (#428) · 01876afa
zjing14 authored 2 years ago

01876afa

20 Sep, 2022 7 commits

fix build (#427) · 567f70f5
Chao Liu authored 2 years ago
```
* fix build

* fix build
```
567f70f5

MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425) · ebab84b6

Shaojie WANG authored 2 years ago


* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format
Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

ebab84b6

Merge branch 'develop' into pipeline_selector · bf94b0b2
Rostyslav Geyyer authored 2 years ago

bf94b0b2
use rocm5.2 compiler as default, use same flags for amd-stg-open as for release (#426) · 9f7c1930
Illia Silin authored 2 years ago

9f7c1930

Group norm (#417) · 4eba345f

rocking5566 authored 2 years ago


* Add groupnorm example by layernorm
1.  Reference is not ready
2. shape of gamma and beta need to be fix

* Let shape of gamma and beta can be same as x

* Modify test, instance and client example

* [What] Fix bug of layernorm for greater than 2 dimension.
[Why] We need to get upper length from merge transform instead of embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* upadte profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging
Co-authored-by: Chao Liu <chao.liu2@amd.com>

4eba345f

Add 'Permute' device op & example (#408) · f584ab0c

Po Yen Chen authored 2 years ago

* Add example folder for 'DeviceElementwise'

* Re-structure example files

* Move common parts into common.hpp

* Use more strict input

* Add more helper methods in 'DeviceElementwise'

* Use more specific method to write example

* Allow specify problem through command line argument

* Allow specify problem 'axes' through command line argument

* Add check to template type argument

* Add transpose_shape() to generalize shape permute

* Generalize transpose utility functions

* Use better name for tensor indices

* Add checks in helper functions

* Remove debug messages

* Refine error message for check_err()

* Generalize variable naming in example code

* Add device op 'DevicePermute'

This device op is clone of 'DeviceElementwise'

* Use 'DevicePermute' device op in example

* Remove 'elementwise' from identifiers

* Remove 'elementwise' from file paths

* Remove base class of 'DevicePermute'

* Let 'DevicePermute' inherit from 'BaseOperator'

* Add simple type traits to validate device op type

* Add static_assert() to check type constraints

* Create 'DevicePermuteBase' to generate methods

* Use indirect base type to generate methods

* Remove 'is_device_op<>' type traits

* Only accept single-input-single-output for 'DervicePermute'

* Simplify 'DevicePermute' interface

* Re-format 'DeviceElementwise'

* Use CRTP to generate overridden virtual method

* Remove unnecessary include directives

* Distinguish input & output shape in 'DevicePermute'

* Passing 'axes' to 'DevicePermute'

* Use more reasonable return value for Invoker::Run()

* Add 'GridwisePermute' kernel

This kernel is a clone of 'GridwiseElementwise_1D'

* Remove no-longer used type argument

* Check if input/output shape meet the requirement

* Remove no-longer used method

* Remove never-entered-if-clause

* Change problem description for 'DevicePermute'

* Transform descriptor into 3 dimensions

* Add debug code the verify result

* Add comment to indicate template argument location

* Add N/H/WPerBlock template parameter to 'DevicePermute'

* Rename 'GridwisePermute' to 'GridwiseCopy'

* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'

* Add missing include directive

* Add 'BlockSize' parameter to 'DevicePermute'

* Remove no-longer used method

* Add 'BlockToTileMap' for 'GridwiseCopy'

* Use the normal Block2TileMap convention

* Rename 'BlockToTileMap' as 'Block2TileMap'

* Fix most of compilation errors

* Let 'Block2TileMap' map block to 2d coordinate

* Allow data transfer in 'GridwiseCopy'

* Fix wrong output descriptor for 2nd blockwise copy

* Rename 'GridwiseCopy' as 'GridwisePermute'

* Remove '1d' in identifiers

* Remove commented-out codes

* Remove 'MPerThread' template parameter

* Seperate template parameters

* Unify variable namming convention

* Use more verbose way to create expressions

* Add template parameter 'InBlockLdsExtraW'

* Release the constraint on In/OutGridDesc

* Use date type directly as template argument

* Re-arrange template arguments for blockwise copy

* Remove no-longer used template parameters

* Embed layout in the variable names

* Add GridwisePermute::CheckValidity()

* Extract local types as template parameters

* Rename local type alias

* Add more template parameters (vector width related)

* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions

* Fill tensor values start from 1

* Re-formate example code

* Avoid too-large block id

* Add comment

* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'

* Add check for the 'VectorDim' & 'ScalarPerVector' template params

* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc

* Remove no-longer used template parameter 'NPerBlock'

* Fix wrong descriptor creation logics

* Specify problem in each examples

* Use better example name

* Add new example 'example_permute_NxHxW_fp32'

* Add example for demonstrating bundle multiple elems in tensor

* Add support to permute multiple elements together

* Change the default problem size

* Add span<> class template

* Use span<> to generalize check_err() interface

* Fix ambiguous ctor call

* Avoid create necessary objects

* Use helper functions to simplify example code

* Add example for 4xfp16 permute

* Disable failed-to-compile example

* Add check for the NUM_ELEMS_IN_BUNDLE

* Remove redundant parameter in helper lambda function

* Add check for the input tensor type's byte-size

* Check scalar-per-vector with padded length

* Use more verbose name to avoid name collision

* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS

* Embed shape info in name of descriptor constructor

* Rename example folder '36_permute' into '37_permute'

* Avoid using too-large LDS in kernel code

* Remove redundant example

* Usw switch() to group similar codes

* Add const to the span<> type arguement

* Simply initialize tensor with floating point values

* Use fp16 as data type in all examples

* Enlarge tensor size in example

* Enalrge N-dim in example

* Add check for the bundled type in example

* Use more stricter error threshold

* Remove global load/store loop in kernel code

* Measure execution time by default

* Use faster device op config for example 'NxHxW_fp16'

* Use faster device op config for example '1xHxW_fp16'

* Use faster device op config for example 'HxWx4_fp16'

* Remove cmd arg parsing logics

* Rename functions

* Extract bundle permutation logic out

* Simplify permute bundle example

* Add Tensor<>::GetElementSpaceSizeInBytes()

* Add Tensor<>::data()

* Use new methods to simplify code

* Use type alias to replace duplicated code

* Use existing method to shorten code

* Allow FillUniformDistribution accept range arugment

* Intialize random values in range

* Add Tensor<>::size()

* Use more meaningful names in permute bundle example

* Use more meaningful names in permute element examples

* Use rangified copy() to copy elements

* Use function return value directly to eliminate variables

* Add to_array() conversion tool to eliminate more variables

* Add Tensor<>::AsSpan<>() to create view of tensor values

* Use AsSpan() to shorten check_err() calls

* Remove no-longer-used 'using' directives

* Move 'using' directive to proper code position

* Remove redudant variables

* Remove useless static_assert()

* Add check for range types

* Declare variable right before first use

* Move long return type as tailing return type

* Add BaseInvokerCRTP<> class template to generate method

* Create new base type for 'DervicePermute' implementations

* Move 'NumDim' template param to the first

* Rename 'DevicePermute' to 'DevicePermuteImpl'

* Add 'noexcept' specifier to CRTP generated method

* Move 'Block2TileMap' definition into 'GridwisePermute'

* Use type alias to reduce code

* Unify naming style in 'DevicePermute'

* Add comments in 'GridwisePermute'

* Rename permute example folder

* Use std::cerr to report error

* Use larger shape in examples

* Rename '38_permute' to '39_permute'

* Make sure we use unsigned type for shape & indices

* Remove opt-ed out assertion

* Remove template BaseInvokerCRTP<>

f584ab0c

Add batched attention special kernel instances (#424) · 7c788e10
Anthony Chang authored 2 years ago
```
* sanity check

* add attribution

* add irrgular k tile size for batched attention

* format
```
7c788e10

19 Sep, 2022 3 commits

work around inline asm potential hazard using intrinsic (#416) · c6b8b472
Anthony Chang authored 2 years ago

c6b8b472

Grouped batched attention + permute (#412) · 9287b7c6

Anthony Chang authored 2 years ago

* grouped attn without batch validates; now move toward grouped batched attn

* grouped batched attention

* working

* remove debug logging

clean up

clean up

* reintroduce g_ prefix back to host tensor variables

* format

* rename file

* restore old file

* rename

* consolidate padded/non-padded attention example

* harmonize padding specialization in attn examples

9287b7c6

Conv bwd data multiple d (#404) · 27858374

Shaojie WANG authored 2 years ago


* init commit of convnd bwd data

* begin compiling example

* have a first version that produce a right result

* refine device level launch kernel code

* add more instances in example and get right results

* clang-format

* format example file

* add more instances

* fix instances

* adding conv_bwd_data multile_d

* adding conv_bwd_data multile_d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* refactor

* refactor

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* refactor

* update conv fwd's bias impl

* refactor

* reorg file

* clean up cmake

* clean

* clean

* clean
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

27858374

16 Sep, 2022 5 commits
- Fix the merge conflict · 3c687364
  Rosty Geyyer authored 2 years ago
  
  3c687364
- Merge branch 'develop' into pipeline_selector · 0583e012
  Rostyslav Geyyer authored 2 years ago
  
  0583e012
- Format · f49472b1
  Rosty Geyyer authored 2 years ago
  
  f49472b1
- Update instances · adab015f
  Rosty Geyyer authored 2 years ago
  
  adab015f
- disable print for group conv multiple D (#421) · 43c898f6
  Chao Liu authored 2 years ago
  
  43c898f6
15 Sep, 2022 2 commits

Add enum PipelineVersion · 06f6f168
Rosty Geyyer authored 2 years ago

06f6f168

use defualt loop scheduling for supported gemm ops · 37f0fc64

Anthony Chang authored 2 years ago

for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI

37f0fc64

14 Sep, 2022 2 commits

batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c

ltqin authored 2 years ago


* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first all code,not verify

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for lunch kernel ignore parametes

* add pad and fix about bug

* mutiple d0

* add dynamic d0_element_op

* change profiler and  instance to mutiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>

370efa6c

set up inter-wave configuration · cbf68933
Anthony Chang authored 2 years ago

cbf68933

13 Sep, 2022 1 commit

Upgrade the OS and ROCM versions. (#411) · b22ebd44

Illia Silin authored 2 years ago

* upgrade the OS and ROCM versions in CK docker

* add cxx flags to link code with rocm5.2 and ck-9110 compiler

* rename the docker image

* run ONNX gemms using init=1

b22ebd44

09 Sep, 2022 1 commit

embedding fuse layernorm (#405) · efd1d257

carlushuang authored 2 years ago


* add gridwise/device sparse embedding

* update code

* update code

* remove useless makefile

* code fix

* workable

* work properly

* emb add

* add more instance

* format

* remove useless code

* fix format

* fix clang-tidy

* clean

* fix a compile error
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

efd1d257

08 Sep, 2022 1 commit

Fix gemm-softmax-gemm-permute padding cases (#409) · d6709dc3

Anthony Chang authored 2 years ago

* fix example; make padding on by default in example; fix argument checks

* fix Gemm1KPacK which has since regressed from PR #399

d6709dc3

GitLab

Menu