# How to build and run
# Docker
```
docker run \
-it \
--rm \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/tensorflow:rocm4.2-tf2.4-dev \
/bin/bash
```
# Install Boost for online compilation
https://www.boost.org/doc/libs/1_66_0/more/getting_started/unix-variants.html#easy-build-and-install
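For convenience, the linked guide reduces to roughly the following steps (a minimal sketch: the Boost version matches the 1.66 guide linked above, and the install prefix is an assumption; pick any location you like):
```
# after downloading and unpacking boost_1_66_0 from boost.org
cd boost_1_66_0
# configure the build; the install prefix below is an assumption
./bootstrap.sh --prefix=/usr/local
# build and install the Boost libraries
./b2 install
```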
# Build
Change the GPU target ID in the source code; the example below uses gfx908
https://github.com/asroy/modular_convolution/blob/aafb5eb18781f1ac9e06a17c3e53d968dd53dcc0/composable_kernel/include/utility/config.amd.hpp.in#L16-L23
Add the Boost library path
```
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
```
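If Boost was installed to a non-default prefix, CMake may also need a hint to locate it; ``BOOST_ROOT`` is the standard FindBoost hint variable (the prefix below is an assumption matching the sketch above):
```
# point CMake's FindBoost at the Boost install prefix (assumed /usr/local here)
export BOOST_ROOT=/usr/local
```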
```
mkdir build && cd build
# the target ID must be set manually; the example below uses gfx908
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D DEVICE_BACKEND=AMD \
-D CMAKE_CXX_FLAGS="-O3 --amdgpu-target=gfx908 -mllvm --amdgpu-spill-vgpr-to-agpr=0 -gline-tables-only -save-temps=$CWD" \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=ON \
..
```
Build drivers: \
``conv_driver_v2`` is the offline-compilation driver for forward convolution, \
``conv_bwd_data_driver_v2`` is the offline-compilation driver for backward-data convolution, \
``conv_driver_v2_olc`` is the online-compilation driver for forward convolution.
```
make -j conv_driver_v2
make -j conv_bwd_data_driver_v2
make -j conv_driver_v2_olc
```
# Run
* layout: 0 = NCHW; 1 = NHWC
* algo:
* Forward convolution: https://github.com/asroy/modular_convolution/blob/aafb5eb18781f1ac9e06a17c3e53d968dd53dcc0/driver/conv_driver_v2.cpp#L38
* Backward data convolution: https://github.com/asroy/modular_convolution/blob/aafb5eb18781f1ac9e06a17c3e53d968dd53dcc0/driver/conv_bwd_data_driver_v2.cpp#L22
* verify: 0 = no verification; 1 = do verification
* init: initialization method, 0 ~ 3
* log: 0 = no log; 1 = print log
* repeat: number of times the kernel is launched
```
########################### layout algo verify init log repeat N__ K___ C___ Y X Hi_ Wi__ Strides Dilations LeftPads RightPads
./conv_driver_v2 0 6 0 3 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./conv_driver_v2 0 6 0 3 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./conv_driver_v2 1 9 0 3 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./conv_driver_v2 1 9 0 3 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./conv_bwd_data_driver_v2 1 1 0 3 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
```
# Result
Forward convolution, FP16, NCHW
```
./conv_driver_v2 0 6 0 3 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {128, 192, 71, 71}, strides {967872, 5041, 71, 1}
wei: dim 4, lengths {256, 192, 3, 3}, strides {1728, 9, 3, 1}
out: dim 4, lengths {128, 256, 36, 36}, strides {331776, 1296, 36, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_dynamic_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{216, 256, 8}
b_k0_n_k1_grid_desc{216, 165888, 8}
c_m_n_grid_desc{ 256, 165888}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.4155 ms, 103.686 TFlop/s
```
Forward convolution, FP16, NCHW
```
./conv_driver_v2 0 6 0 3 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {256, 256, 14, 14}, strides {50176, 196, 14, 1}
wei: dim 4, lengths {1024, 256, 3, 3}, strides {2304, 9, 3, 1}
out: dim 4, lengths {256, 1024, 14, 14}, strides {200704, 196, 14, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_dynamic_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{288, 1024, 8}
b_k0_n_k1_grid_desc{288, 50176, 8}
c_m_n_grid_desc{ 1024, 50176}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.21357 ms, 106.959 TFlop/s
```
Forward convolution, FP16, NHWC
```
./conv_driver_v2 1 9 0 3 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {128, 71, 71, 192}, strides {967872, 13632, 192, 1}
wei: dim 4, lengths {256, 3, 3, 192}, strides {1728, 576, 192, 1}
out: dim 4, lengths {128, 36, 36, 256}, strides {331776, 9216, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_dynamic_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{216, 165888, 8}
b_k0_n_k1_grid_desc{216, 256, 8}
c_m_n_grid_desc{ 165888, 256}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.12014 ms, 131.025 TFlop/s
```
Forward convolution, FP16, NHWC
```
./conv_driver_v2 1 9 0 3 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
wei: dim 4, lengths {1024, 3, 3, 256}, strides {2304, 768, 256, 1}
out: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_dynamic_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.86877 ms, 126.693 TFlop/s
```
Backward data convolution, FP16, NHWC
```
./conv_bwd_data_driver_v2 1 1 0 3 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
wei: dim 4, lengths {256, 3, 3, 1024}, strides {9216, 3072, 1024, 1}
out: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_dynamic_convolution_backward_data_implicit_gemm_v4r1r2_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.22461 ms, 106.428 TFlop/s
```