    Reduction in Composable Kernel (#82) · e17c0d80
    Qianfeng authored
    * Initial adding of generic reduction
    
    * Initial adding of generic reduction ...
    
    * Updates to get compilation working
    
    * clang-format all files
    
    * clang-format some files again
    
    * Renaming in profiler/include/profile_reduce.hpp
    
    * Updates to make BlockWise cases pass
    
    * Updates to make ThreadWise and MultiBlockTwoCall cases pass
    
    * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
    
    * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
    
    * format
    
    * adding pooling
    
    * added max and average pooling
    
    * comment out cout and kernel timing
    
    * Tiny simplification in profiler/reduce_profiler.cpp
    
    * Add example for reduce_blockwise
    
    * Tiny updates
    
    * Change to pass the ElementWiseOp from device layer to kernel
    
    * Fix the vectorDim and vectorSize in Device layer
    
    * Enable vector load on both dim0 and dim1 for Threadwise method
    
    * Tiny updates...
Dockerfile 2.95 KB