template<typename T, int BLOCK_THREADS>
class cub::BlockReduce< T, BLOCK_THREADS >
BlockReduce provides variants of parallel reduction across a CUDA threadblock.
- Overview
- A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
- For convenience, BlockReduce exposes a spectrum of entrypoints that differ by:
- Granularity (single vs. multiple items per thread)
- Input (full tile vs. partial-tile having some undefined elements)
- Template Parameters
-
| T | The reduction input/output element type |
| BLOCK_THREADS | The threadblock size in threads |
- Algorithm
- BlockReduce entrypoints have O(n) work complexity and are implemented in three phases:
- Sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
- A single warp performs a raking upsweep across the per-thread partial reductions shared by each thread in the threadblock.
- A warp-synchronous Kogge-Stone style reduction within the raking warp to produce the total aggregate.
(Figure: data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.)
- Usage Considerations
- Supports non-commutative reduction operators
- Supports partially-full threadblocks (i.e., the most-significant thread ranks may hold undefined values).
- Assumes a blocked arrangement of elements across threads
- The threadblock-wide scalar reduction output is only considered valid in thread0
- After any operation, a subsequent __syncthreads() barrier is required if the supplied BlockReduce::SmemStorage is to be reused or repurposed by the threadblock
- Performance Considerations
- Very efficient (only one synchronization barrier).
- Zero bank conflicts for most types.
- Computation is slightly more efficient (i.e., has lower instruction overhead) when:
  - T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
  - BLOCK_THREADS is a multiple of the architecture's warp size
  - Every thread has a valid input (i.e., full vs. partial tiles)
- Examples
- Example 1. Perform a simple reduction of 512 integer keys that are partitioned in a blocked arrangement across a 128-thread threadblock (where each thread holds 4 keys).
#include <cub.cuh>
__global__ void SomeKernel(...)
{
    // Parameterize BlockReduce on int across 128 threads
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;
    int data[4];   // four consecutive keys per thread
    ...
    // Thread0 obtains the threadblock-wide sum of all 512 keys
    int aggregate = BlockReduce::Sum(smem_storage, data);
    ...
}
- Example 2: Perform a guarded reduction of only num_elements keys that are partitioned in a partially-full blocked arrangement across BLOCK_THREADS threads.
#include <cub.cuh>
template <int BLOCK_THREADS>
__global__ void SomeKernel(..., int num_elements)
{
    // Parameterize BlockReduce on int across BLOCK_THREADS threads
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;
    int data;
    if (threadIdx.x < num_elements) data = ...;
    // Thread0 obtains the sum of the first num_elements inputs
    int aggregate = BlockReduce::Sum(smem_storage, data, num_elements);
    ...
}
- Static Public Methods

| template<typename ReductionOp > static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op) | Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element. |
| template<int ITEMS_PER_THREAD, typename ReductionOp > static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op) | Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements. |
| template<typename ReductionOp > static __device__ __forceinline__ T | Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads) | Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads threads each contribute one input element. |
| static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T input) | Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element. |
| template<int ITEMS_PER_THREAD> static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD]) | Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements. |
| static __device__ __forceinline__ T | Sum (SmemStorage &smem_storage, T input, const unsigned int &valid_threads) | Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads threads each contribute one input element. |
template<typename T , int BLOCK_THREADS>
template<typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op)   [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Template Parameters
-
| ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input |
| [in] | reduction_op | Binary associative reduction functor |
template<typename T , int BLOCK_THREADS>
template<int ITEMS_PER_THREAD, typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)   [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Template Parameters
-
| ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
| ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | inputs | Calling thread's input segment |
| [in] | reduction_op | Binary associative reduction functor |
template<typename T , int BLOCK_THREADS>
template<typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads)   [inline, static]
Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads threads each contribute one input element.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Template Parameters
-
| ReductionOp | [inferred] Binary reduction functor type (a model of Binary Function). |
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input |
| [in] | reduction_op | Binary associative reduction functor |
| [in] | valid_threads | Number of threads containing valid elements (may be less than BLOCK_THREADS) |
template<typename T , int BLOCK_THREADS>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T input)   [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input |
template<typename T , int BLOCK_THREADS>
template<int ITEMS_PER_THREAD>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T (&inputs)[ITEMS_PER_THREAD])   [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Template Parameters
-
| ITEMS_PER_THREAD | [inferred] The number of consecutive items partitioned onto each thread. |
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | inputs | Calling thread's input segment |
template<typename T , int BLOCK_THREADS>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (SmemStorage &smem_storage, T input, const unsigned int &valid_threads)   [inline, static]
Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads threads each contribute one input element.
The return value is undefined in threads other than thread0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
- Parameters
-
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input |
| [in] | valid_threads | Number of threads containing valid elements (may be less than BLOCK_THREADS) |