1. 03 Oct, 2022 3 commits
  2. 01 Oct, 2022 1 commit
    • Allow setting ROCM version, activate ccache, etc. (#462) · 7fc3ed76
      Illia Silin authored
      * enable ccache and decouple it from MIOpen ccache use
      
      * fix the ccache check script
      
      * use another method to get server name
      
      * fix syntax
      
      * add quotes around the server name variable
      
      * use check_host as function
      
      * change syntax
      
      * fix syntax
      
      * test if server name is parsed correctly
      
      * try different syntax
      
      * check the env var value
      
      * test new check node function
      
      * add ROCMVERSION parameter and fix script syntax
      
      * fix script syntax
      
      * add missing instances of rocm version
      
      * install ccache in the docker image
      
      * do not check GPU in clang format stage, clean up old code
      
      * update defaults and clean up
  3. 27 Sep, 2022 1 commit
    • Fix build issues, set new compiler default, etc. (#451) · b8825547
      Illia Silin authored
      * add an option to select specific compiler commit
      
      * change the logic of forcing building a docker
      
      * add check for compiler commit in dockerfile
      
      * compiler check syntax fix
      
      * change compiler selection logic
      
      * fix the new compiler build issue
      
      * set new compiler as default, update dev-requirements
      
      * fix jenkins syntax
      
      * fix docker syntax
      
      * get rid of hipcc.pl editing in jenkinsfile
      
      * fix the hipcc.pl in both places
      
      * try to fix the 10738 compiler linking bug
      
      * fix syntax
      
      * use dockerhub to store images
      
      * use newer amd-stg-open commit as default
  4. 26 Sep, 2022 1 commit
  5. 23 Sep, 2022 4 commits
  6. 22 Sep, 2022 4 commits
  7. 21 Sep, 2022 4 commits
    • Updated the supported components (#435) · 7acbf104
      Lixun Zhang authored
    • Build the CK targets only once. (#433) · 85b0920d
      Illia Silin authored
      * build CK only once, use deb package in all subsequent stages
      
      * update jenkins file
      
      * change prefix for build_CK stage
      
      * update writing deb metadata to control file
      
      * update ubuntu source for docker, script syntax for deb package metadata
      
      * try different way to create deb metadata
      
      * clean up DEBIAN before creating one
      
      * fix the CI folder names, fix splitK qa
      
      * use correct docker in all stages, separate tests for splitK verification and performance
      
      * clean old comments, change dir before packaging
      
      * use different package syntax
      
      * change packaging syntax
      
      * package with cmake
      
      * remove unnecessary build prefix
      
      * get rid of unnecessary paths
      
      * change paths during unpacking
      
      * change script syntax while unpacking
      
      * get rid of unnecessary steps
      
      * get rid of comments in the scripts
      
      * use double quotes for scripts
      
      * add ccache during build, try dpkg -x
      
      * pull and install each package separately
      
      * use full package names
      
      * try to use stashing for packages
      
      * change stash/unstash syntax
      
      * move unstash out of shell, run tests on any gpu node
      
      * unpack each package separately
      
      * try re-using existing workspace
      
      * merge the build and test stages, only stash ckProfiler
      
      * merge the build and test stages, only stash zipped ckProfiler
      
      * fix syntax
      
      * add GPU check before build and test, rename docker to usual name
    • Rostyslav Geyyer's avatar
      ff519fc3
    • fixed G offset calc for long_index (#428) · 01876afa
      zjing14 authored
  8. 20 Sep, 2022 7 commits
    • fix build (#427) · 567f70f5
      Chao Liu authored
      * fix build
      
      * fix build
    • MNKO padding support on bmm+masking+scale+softmax+bmm+permute (#425) · ebab84b6
      Shaojie WANG authored
      
      * add lower triangle bmm
      
      * init code for tile skipping
      
      * functionality right with lower triangle mask
      
      * add decoder lower triangular mask calculation
      
      * use 7*13 group
      
      * fix n2 compute error
      
      * attention with lower triangle mask with tile skipping
      
      * add template to distinguish masking kernel
      
      * rename template and remove default template value
      
      * remove lower triangle gemm reference struct
      
      * add some comments on example
      
      * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels
      
      * add test
      
      * add test file
      
      * add gtest for bmm masking scale softmax bmm permute
      
      * clang-format
      
      * fix compile error
      
      * check left bottom corner for tile skipping
      
      * fix error: check left bottom corner for tile skipping
      
      * add k padding
      
      * add test and instance for MNK padding
      
      * passing a mask struct
      
      * fix instances
      
      * delete used comments
      
      * format
      Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Rostyslav Geyyer's avatar
      bf94b0b2
    • Illia Silin's avatar
    • Group norm (#417) · 4eba345f
      rocking5566 authored
      
      * Add groupnorm example by layernorm
      1. Reference is not ready
      2. Shape of gamma and beta needs to be fixed
      
      * Allow the shape of gamma and beta to be the same as x
      
      * Modify test, instance and client example
      
      * [What] Fix bug of layernorm for greater than 2 dimension.
      [Why] We need to get upper length from merge transform instead of embed transform.
      
      * Add reference for groupnorm
      
      * Fuse sigmoid after groupnorm
      
      * [What] Rename original layernorm into layernorm2d
      [Why] Prepare to add groupnorm using layernorm5d
      
      * clang-format
      
      * Add groupnorm test
      
      * Refine error message
      
      * Add groupnorm ckProfiler
      
      * Test groupnorm kernel from device_instance
      
      * update example
      
      * update profiler
      
      * Fix test naming
      
      * Fix argc number
      
      * Move descriptor and sweeponce to argument for quick debugging
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Add 'Permute' device op & example (#408) · f584ab0c
      Po Yen Chen authored
      * Add example folder for 'DeviceElementwise'
      
      * Re-structure example files
      
      * Move common parts into common.hpp
      
      * Use more strict input
      
      * Add more helper methods in 'DeviceElementwise'
      
      * Use more specific method to write example
      
      * Allow specify problem through command line argument
      
      * Allow specify problem 'axes' through command line argument
      
      * Add check to template type argument
      
      * Add transpose_shape() to generalize shape permute
      
      * Generalize transpose utility functions
      
      * Use better name for tensor indices
      
      * Add checks in helper functions
      
      * Remove debug messages
      
      * Refine error message for check_err()
      
      * Generalize variable naming in example code
      
      * Add device op 'DevicePermute'
      
      This device op is clone of 'DeviceElementwise'
      
      * Use 'DevicePermute' device op in example
      
      * Remove 'elementwise' from identifiers
      
      * Remove 'elementwise' from file paths
      
      * Remove base class of 'DevicePermute'
      
      * Let 'DevicePermute' inherit from 'BaseOperator'
      
      * Add simple type traits to validate device op type
      
      * Add static_assert() to check type constraints
      
      * Create 'DevicePermuteBase' to generate methods
      
      * Use indirect base type to generate methods
      
      * Remove 'is_device_op<>' type traits
      
      * Only accept single-input-single-output for 'DevicePermute'
      
      * Simplify 'DevicePermute' interface
      
      * Re-format 'DeviceElementwise'
      
      * Use CRTP to generate overridden virtual method
      
      * Remove unnecessary include directives
      
      * Distinguish input & output shape in 'DevicePermute'
      
      * Passing 'axes' to 'DevicePermute'
      
      * Use more reasonable return value for Invoker::Run()
      
      * Add 'GridwisePermute' kernel
      
      This kernel is a clone of 'GridwiseElementwise_1D'
      
      * Remove no-longer used type argument
      
      * Check if input/output shape meet the requirement
      
      * Remove no-longer used method
      
      * Remove never-entered-if-clause
      
      * Change problem description for 'DevicePermute'
      
      * Transform descriptor into 3 dimensions
      
      * Add debug code to verify result
      
      * Add comment to indicate template argument location
      
      * Add N/H/WPerBlock template parameter to 'DevicePermute'
      
      * Rename 'GridwisePermute' to 'GridwiseCopy'
      
      * Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
      
      * Add missing include directive
      
      * Add 'BlockSize' parameter to 'DevicePermute'
      
      * Remove no-longer used method
      
      * Add 'BlockToTileMap' for 'GridwiseCopy'
      
      * Use the normal Block2TileMap convention
      
      * Rename 'BlockToTileMap' as 'Block2TileMap'
      
      * Fix most of compilation errors
      
      * Let 'Block2TileMap' map block to 2d coordinate
      
      * Allow data transfer in 'GridwiseCopy'
      
      * Fix wrong output descriptor for 2nd blockwise copy
      
      * Rename 'GridwiseCopy' as 'GridwisePermute'
      
      * Remove '1d' in identifiers
      
      * Remove commented-out codes
      
      * Remove 'MPerThread' template parameter
      
      * Separate template parameters
      
      * Unify variable naming convention
      
      * Use more verbose way to create expressions
      
      * Add template parameter 'InBlockLdsExtraW'
      
      * Release the constraint on In/OutGridDesc
      
      * Use data type directly as template argument
      
      * Re-arrange template arguments for blockwise copy
      
      * Remove no-longer used template parameters
      
      * Embed layout in the variable names
      
      * Add GridwisePermute::CheckValidity()
      
      * Extract local types as template parameters
      
      * Rename local type alias
      
      * Add more template parameters (vector width related)
      
      * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
      
      * Fill tensor values start from 1
      
      * Re-format example code
      
      * Avoid too-large block id
      
      * Add comment
      
      * Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
      
      * Add check for the 'VectorDim' & 'ScalarPerVector' template params
      
      * Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc
      
      * Remove no-longer used template parameter 'NPerBlock'
      
      * Fix wrong descriptor creation logics
      
      * Specify problem in each examples
      
      * Use better example name
      
      * Add new example 'example_permute_NxHxW_fp32'
      
      * Add example for demonstrating bundle multiple elems in tensor
      
      * Add support to permute multiple elements together
      
      * Change the default problem size
      
      * Add span<> class template
      
      * Use span<> to generalize check_err() interface
      
      * Fix ambiguous ctor call
      
      * Avoid creating unnecessary objects
      
      * Use helper functions to simplify example code
      
      * Add example for 4xfp16 permute
      
      * Disable failed-to-compile example
      
      * Add check for the NUM_ELEMS_IN_BUNDLE
      
      * Remove redundant parameter in helper lambda function
      
      * Add check for the input tensor type's byte-size
      
      * Check scalar-per-vector with padded length
      
      * Use more verbose name to avoid name collision
      
      * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
      
      * Embed shape info in name of descriptor constructor
      
      * Rename example folder '36_permute' into '37_permute'
      
      * Avoid using too-large LDS in kernel code
      
      * Remove redundant example
      
      * Use switch() to group similar codes
      
      * Add const to the span<> type argument
      
      * Simply initialize tensor with floating point values
      
      * Use fp16 as data type in all examples
      
      * Enlarge tensor size in example
      
      * Enlarge N-dim in example
      
      * Add check for the bundled type in example
      
      * Use stricter error threshold
      
      * Remove global load/store loop in kernel code
      
      * Measure execution time by default
      
      * Use faster device op config for example 'NxHxW_fp16'
      
      * Use faster device op config for example '1xHxW_fp16'
      
      * Use faster device op config for example 'HxWx4_fp16'
      
      * Remove cmd arg parsing logics
      
      * Rename functions
      
      * Extract bundle permutation logic out
      
      * Simplify permute bundle example
      
      * Add Tensor<>::GetElementSpaceSizeInBytes()
      
      * Add Tensor<>::data()
      
      * Use new methods to simplify code
      
      * Use type alias to replace duplicated code
      
      * Use existing method to shorten code
      
      * Allow FillUniformDistribution to accept range argument
      
      * Initialize random values in range
      
      * Add Tensor<>::size()
      
      * Use more meaningful names in permute bundle example
      
      * Use more meaningful names in permute element examples
      
      * Use rangified copy() to copy elements
      
      * Use function return value directly to eliminate variables
      
      * Add to_array() conversion tool to eliminate more variables
      
      * Add Tensor<>::AsSpan<>() to create view of tensor values
      
      * Use AsSpan() to shorten check_err() calls
      
      * Remove no-longer-used 'using' directives
      
      * Move 'using' directive to proper code position
      
      * Remove redundant variables
      
      * Remove useless static_assert()
      
      * Add check for range types
      
      * Declare variable right before first use
      
      * Move long return type as trailing return type
      
      * Add BaseInvokerCRTP<> class template to generate method
      
      * Create new base type for 'DevicePermute' implementations
      
      * Move 'NumDim' template param to the first
      
      * Rename 'DevicePermute' to 'DevicePermuteImpl'
      
      * Add 'noexcept' specifier to CRTP generated method
      
      * Move 'Block2TileMap' definition into 'GridwisePermute'
      
      * Use type alias to reduce code
      
      * Unify naming style in 'DevicePermute'
      
      * Add comments in 'GridwisePermute'
      
      * Rename permute example folder
      
      * Use std::cerr to report error
      
      * Use larger shape in examples
      
      * Rename '38_permute' to '39_permute'
      
      * Make sure we use unsigned type for shape & indices
      
      * Remove opt-ed out assertion
      
      * Remove template BaseInvokerCRTP<>
    • Add batched attention special kernel instances (#424) · 7c788e10
      Anthony Chang authored
      * sanity check
      
      * add attribution
      
      * add irregular k tile size for batched attention
      
      * format
  9. 19 Sep, 2022 3 commits
    • Anthony Chang's avatar
    • Grouped batched attention + permute (#412) · 9287b7c6
      Anthony Chang authored
      * grouped attn without batch validates; now move toward grouped batched attn
      
      * grouped batched attention
      
      * working
      
      * remove debug logging
      
      clean up
      
      clean up
      
      * reintroduce g_ prefix back to host tensor variables
      
      * format
      
      * rename file
      
      * restore old file
      
      * rename
      
      * consolidate padded/non-padded attention example
      
      * harmonize padding specialization in attn examples
    • Conv bwd data multiple d (#404) · 27858374
      Shaojie WANG authored
      
      * init commit of convnd bwd data
      
      * begin compiling example
      
      * have a first version that produce a right result
      
      * refine device level launch kernel code
      
      * add more instances in example and get right results
      
      * clang-format
      
      * format example file
      
      * add more instances
      
      * fix instances
      
      * adding conv_bwd_data multiple_d
      
      * adding conv_bwd_data multiple_d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * refactor
      
      * refactor
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * refactor
      
      * update conv fwd's bias impl
      
      * refactor
      
      * reorg file
      
      * clean up cmake
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  10. 16 Sep, 2022 5 commits
  11. 15 Sep, 2022 2 commits
  12. 14 Sep, 2022 2 commits
    • batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c
      ltqin authored
      
      * refactor
      
      * start
      
      * add device gemm file
      
      * add BatchStrideD0
      
      * add stridd0
      
      * add gridwise file
      
      * add d0 parameters to gridwise gemm
      
      * add c layout transformer
      
      * add d0 threadwise copy
      
      * init kernel
      
      * init kernel
      
      * regular code
      
      * nm desc put to out
      
      * kernel parameter can not use reference
      
      * host add bias+gelu
      
      * run right for bias+gelu
      
      * change AddFastGelu into another file
      
      * interface add d1 bias parameters
      
      * add d1 parameter to argument
      
      * add d1 parameter to gridwise
      
      * first all code,not verify
      
      * gelu change to relu and GetElementSpaceSize bug
      
      * add instance
      
      * start add to ckprofiler
      
      * ckprofiler finish code
      
      * change input parameter for ckProfiler
      
      * fix host bias+gelu bug
      
      * show help for ckProfiler
      
      * fix bug for launch kernel ignore parameters
      
      * add pad and fix about bug
      
      * multiple d0
      
      * add dynamic d0_element_op
      
      * change profiler and instance to multiple d0
      
      * example have 2 d0
      
      * remove some comments not using
      
      * change 2 d0 have self  parameters
      
      * change d element_op name
      
      * change class name(multiple_d)
      
      * fix bug
      
      * fix bug that don't find file
      
      * update profiler
      
      * refactor
      
      * update profiler
      
      * clean
      
      * revert example change
      
      * add gon layout
      
      * optimize parameter for gno
      
      * add gon to gemm+gemm
      
      * change helping input parameters
      
      * change to GemmPadder_v2
      
      * using ForEach
      
      * fix gb_per_sec
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      Co-authored-by: ltqin <letaoqin@amd.com>
    • set up inter-wave configuration · cbf68933
      Anthony Chang authored
  13. 13 Sep, 2022 1 commit
    • Upgrade the OS and ROCM versions. (#411) · b22ebd44
      Illia Silin authored
      * upgrade the OS and ROCM versions in CK docker
      
      * add cxx flags to link code with rocm5.2 and ck-9110 compiler
      
      * rename the docker image
      
      * run ONNX gemms using init=1
  14. 09 Sep, 2022 1 commit
  15. 08 Sep, 2022 1 commit