cub::BlockReduce< T, BLOCK_THREADS > Class Template Reference

Detailed description

template<typename T, int BLOCK_THREADS>
class cub::BlockReduce< T, BLOCK_THREADS >

BlockReduce provides variants of parallel reduction across a CUDA threadblock.

Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
For convenience, BlockReduce exposes a spectrum of entrypoints that differ by:
  • Granularity (single vs. multiple items per thread)
  • Input (full tile vs. partial tile with some undefined elements)
Template Parameters
    T                The reduction input/output element type
    BLOCK_THREADS    The threadblock size in threads
Algorithm
BlockReduce entrypoints have O(n) work complexity and are implemented in three phases (a simplified sketch follows the figure below):
  1. Sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
  2. A single warp performs a raking upsweep across the partial reductions shared by each thread in the threadblock.
  3. A warp-synchronous Kogge-Stone style reduction within the raking warp to produce the total aggregate.
    Figure: Data flow for a hypothetical 16-thread threadblock and 4-thread raking warp (block_reduce.png).
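The following standalone sketch illustrates the three phases for the figure's hypothetical configuration (a 16-thread threadblock raked by a 4-thread warp). It is illustrative only and not BlockReduce's actual implementation: the kernel name ReduceSketch and its pointer parameters are hypothetical, and plain barriers stand in for the warp-synchronous steps.
#define BLOCK_THREADS   16
#define RAKING_THREADS  4
#define SEGMENT         (BLOCK_THREADS / RAKING_THREADS)   // partials per raking thread

__global__ void ReduceSketch(const int *d_in, int *d_out)
{
    __shared__ int partials[BLOCK_THREADS];
    __shared__ int raked[RAKING_THREADS];

    // Phase 1: each thread places its partial into shared memory (with
    // multiple items per thread, a sequential reduction in registers
    // would happen first)
    partials[threadIdx.x] = d_in[threadIdx.x];
    __syncthreads();

    // Phase 2: each raking thread serially reduces one segment of partials
    if (threadIdx.x < RAKING_THREADS)
    {
        int sum = partials[threadIdx.x * SEGMENT];
        for (int i = 1; i < SEGMENT; ++i)
            sum += partials[threadIdx.x * SEGMENT + i];
        raked[threadIdx.x] = sum;
    }
    __syncthreads();

    // Phase 3: log-step reduction across the raking partials (BlockReduce
    // performs this warp-synchronously in Kogge-Stone style; barriers are
    // used here for clarity)
    for (int offset = RAKING_THREADS / 2; offset > 0; offset /= 2)
    {
        if (threadIdx.x < offset)
            raked[threadIdx.x] += raked[threadIdx.x + offset];
        __syncthreads();
    }

    // The total aggregate lands in raked[0] (thread0's result)
    if (threadIdx.x == 0) *d_out = raked[0];
}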
Usage Considerations
  • Supports non-commutative reduction operators
  • Supports partially-full threadblocks (i.e., where the most-significant thread ranks have undefined values).
  • Assumes a blocked arrangement of elements across threads
  • The threadblock-wide scalar reduction output is only considered valid in thread0
  • After any operation, a subsequent __syncthreads() barrier is required if the supplied BlockReduce::SmemStorage is to be reused or repurposed by the threadblock (see the sketch following this list)
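A minimal sketch of that reuse rule, assuming a 128-thread threadblock and the Sum entrypoint documented below; the kernel name TwoSumsKernel and its pointer parameters are hypothetical:
#include <cub.cuh>

__global__ void TwoSumsKernel(const int *d_a, const int *d_b, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // First reduction
    int sum_a = BlockReduce::Sum(smem_storage, d_a[threadIdx.x]);

    // Barrier before smem_storage is reused by a second reduction
    __syncthreads();

    // Second reduction reusing the same storage
    int sum_b = BlockReduce::Sum(smem_storage, d_b[threadIdx.x]);

    if (threadIdx.x == 0) *d_out = sum_a + sum_b;
}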
Performance Considerations
  • Very efficient (only one synchronization barrier).
  • Zero bank conflicts for most types.
  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:
    • T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
    • BLOCK_THREADS is a multiple of the architecture's warp size
    • Every thread has a valid input (i.e., full vs. partial-tiles)
Examples
Example 1: Perform a simple reduction of 512 integer keys that are partitioned in a blocked arrangement across a 128-thread threadblock (where each thread holds 4 keys).
#include <cub.cuh>

__global__ void SomeKernel(...)
{
    // Parameterize BlockReduce for the parallel execution context
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // Declare shared memory for BlockReduce
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // A segment of consecutive input items per thread
    int data[4];

    // Obtain items in blocked order
    ...

    // Compute the threadblock-wide sum for thread0
    int aggregate = BlockReduce::Sum(smem_storage, data);

    ...
Example 2: Perform a guarded reduction of only num_elements keys that are partitioned in a partially-full blocked arrangement across BLOCK_THREADS threads.
#include <cub.cuh>

template <int BLOCK_THREADS>
__global__ void SomeKernel(..., int num_elements)
{
    // Parameterize BlockReduce for use with BLOCK_THREADS threads on type int
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;

    // Declare shared memory for BlockReduce
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Guarded load
    int data;
    if (threadIdx.x < num_elements) data = ...;

    // Compute the threadblock-wide sum of valid elements in thread0
    int aggregate = BlockReduce::Sum(smem_storage, data, num_elements);

    ...
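For completeness, here is a hedged, self-contained variant of Example 2 with the elided pieces filled in by assumption (the kernel name GuardedSumKernel and the pointers d_in and d_aggregate are hypothetical):
#include <cub.cuh>

template <int BLOCK_THREADS>
__global__ void GuardedSumKernel(const int *d_in, int *d_aggregate, int num_elements)
{
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Guarded load: only the first num_elements threads have defined inputs
    int data = 0;
    if (threadIdx.x < num_elements) data = d_in[threadIdx.x];

    // Threadblock-wide sum of the valid elements (defined only in thread0)
    int aggregate = BlockReduce::Sum(smem_storage, data, num_elements);

    if (threadIdx.x == 0) *d_aggregate = aggregate;
}

// Host-side launch: one 128-thread threadblock reducing 100 valid elements
// GuardedSumKernel<128><<<1, 128>>>(d_in, d_aggregate, 100);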

Public Types

typedef _SmemStorage SmemStorage
 The operations exposed by BlockReduce require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
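A hedged sketch of the union pattern mentioned above, assuming a 128-thread threadblock; the kernel name, pointer parameters, and the scratch member are hypothetical:
#include <cub.cuh>

__global__ void UnionKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;

    // SmemStorage shares space with a scratch buffer used at a different time
    __shared__ union
    {
        typename BlockReduce::SmemStorage reduce;
        int scratch[128];
    } smem;

    // Stage data through scratch so each thread can read a neighbor's value
    smem.scratch[threadIdx.x] = d_in[threadIdx.x];
    __syncthreads();
    int neighbor = smem.scratch[(threadIdx.x + 1) % 128];

    // Barrier before the union is repurposed as reduction storage
    __syncthreads();

    int aggregate = BlockReduce::Sum(smem.reduce, neighbor);
    if (threadIdx.x == 0) *d_out = aggregate;
}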
 

Static Public Methods

Generic reductions
template<typename ReductionOp >
static __device__ __forceinline__ T Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op)
 Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
 
template<int ITEMS_PER_THREAD, typename ReductionOp >
static __device__ __forceinline__ T Reduce (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)
 Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
 
template<typename ReductionOp >
static __device__ __forceinline__ T Reduce (SmemStorage &smem_storage, T input, ReductionOp reduction_op, const unsigned int &valid_threads)
 Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads threads each contribute one input element.
 
Summation reductions
static __device__ __forceinline__ T Sum (SmemStorage &smem_storage, T input)
 Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
 
template<int ITEMS_PER_THREAD>
static __device__ __forceinline__ T Sum (SmemStorage &smem_storage, T(&inputs)[ITEMS_PER_THREAD])
 Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
 
static __device__ __forceinline__ T Sum (SmemStorage &smem_storage, T input, const unsigned int &valid_threads)
 Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads threads each contribute one input element.
 

Member Function Documentation

template<typename T , int BLOCK_THREADS>
template<typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (
        SmemStorage &  smem_storage,
        T              input,
        ReductionOp    reduction_op )

Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
    ReductionOp    [inferred] Binary reduction functor type (a model of Binary Function).
Parameters
    [in] smem_storage     Shared reference to opaque SmemStorage layout
    [in] input            Calling thread's input
    [in] reduction_op     Binary associative reduction functor
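A hedged sketch of calling this overload with a user-defined functor; the Max functor, kernel name, and pointers below are hypothetical:
#include <cub.cuh>

// Hypothetical user-defined functor (a model of Binary Function)
struct Max
{
    __device__ __forceinline__ int operator()(int a, int b) const
    {
        return (a > b) ? a : b;
    }
};

__global__ void MaxKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Threadblock-wide max (defined only in thread0)
    int block_max = BlockReduce::Reduce(smem_storage, d_in[threadIdx.x], Max());

    if (threadIdx.x == 0) *d_out = block_max;
}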
template<typename T , int BLOCK_THREADS>
template<int ITEMS_PER_THREAD, typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (
        SmemStorage &  smem_storage,
        T              (&inputs)[ITEMS_PER_THREAD],
        ReductionOp    reduction_op )

Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
    ITEMS_PER_THREAD    [inferred] The number of consecutive items partitioned onto each thread.
    ReductionOp         [inferred] Binary reduction functor type (a model of Binary Function).
Parameters
    [in] smem_storage     Shared reference to opaque SmemStorage layout
    [in] inputs           Calling thread's input segment
    [in] reduction_op     Binary associative reduction functor
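A hedged sketch of this overload for a 128-thread threadblock where each thread holds 4 consecutive items (ITEMS_PER_THREAD is inferred from the array type); the Min functor, kernel name, and pointers are hypothetical:
#include <cub.cuh>

// Hypothetical user-defined functor (a model of Binary Function)
struct Min
{
    __device__ __forceinline__ int operator()(int a, int b) const
    {
        return (a < b) ? a : b;
    }
};

__global__ void MinKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Each thread loads 4 consecutive items (blocked arrangement)
    int items[4];
    for (int i = 0; i < 4; ++i)
        items[i] = d_in[(threadIdx.x * 4) + i];

    // Threadblock-wide min over all 512 items (defined only in thread0)
    int block_min = BlockReduce::Reduce(smem_storage, items, Min());

    if (threadIdx.x == 0) *d_out = block_min;
}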
template<typename T , int BLOCK_THREADS>
template<typename ReductionOp >
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Reduce (
        SmemStorage &         smem_storage,
        T                     input,
        ReductionOp           reduction_op,
        const unsigned int &  valid_threads )

Computes a threadblock-wide reduction for thread0 using the specified binary reduction functor. The first valid_threads threads each contribute one input element.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
    ReductionOp    [inferred] Binary reduction functor type (a model of Binary Function).
Parameters
    [in] smem_storage      Shared reference to opaque SmemStorage layout
    [in] input             Calling thread's input
    [in] reduction_op      Binary associative reduction functor
    [in] valid_threads     Number of threads containing valid elements (may be less than BLOCK_THREADS)
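A hedged sketch of this guarded overload, reusing the hypothetical Max functor shown earlier; all names are again hypothetical:
#include <cub.cuh>

struct Max
{
    __device__ __forceinline__ int operator()(int a, int b) const
    {
        return (a > b) ? a : b;
    }
};

template <int BLOCK_THREADS>
__global__ void GuardedMaxKernel(const int *d_in, int *d_out, int num_valid)
{
    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Guarded load: only the first num_valid threads have defined inputs
    int data = 0;
    if (threadIdx.x < num_valid) data = d_in[threadIdx.x];

    // Max over the first num_valid inputs (defined only in thread0)
    int block_max = BlockReduce::Reduce(smem_storage, data, Max(), num_valid);

    if (threadIdx.x == 0) *d_out = block_max;
}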
template<typename T , int BLOCK_THREADS>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (
        SmemStorage &  smem_storage,
        T              input )

Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
    [in] smem_storage     Shared reference to opaque SmemStorage layout
    [in] input            Calling thread's input
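A minimal hedged sketch of this one-input-per-thread overload; the kernel name and pointers are hypothetical:
#include <cub.cuh>

__global__ void SumKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockReduce<int, 128> BlockReduce;
    __shared__ typename BlockReduce::SmemStorage smem_storage;

    // Each thread contributes one element; the sum is defined only in thread0
    int aggregate = BlockReduce::Sum(smem_storage, d_in[threadIdx.x]);

    if (threadIdx.x == 0) *d_out = aggregate;
}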
template<typename T , int BLOCK_THREADS>
template<int ITEMS_PER_THREAD>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (
        SmemStorage &  smem_storage,
        T              (&inputs)[ITEMS_PER_THREAD] )

Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
    ITEMS_PER_THREAD    [inferred] The number of consecutive items partitioned onto each thread.
Parameters
    [in] smem_storage     Shared reference to opaque SmemStorage layout
    [in] inputs           Calling thread's input segment
template<typename T , int BLOCK_THREADS>
static __device__ __forceinline__ T cub::BlockReduce< T, BLOCK_THREADS >::Sum (
        SmemStorage &         smem_storage,
        T                     input,
        const unsigned int &  valid_threads )

Computes a threadblock-wide reduction for thread0 using addition (+) as the reduction operator. The first valid_threads threads each contribute one input element.

The return value is undefined in threads other than thread0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
    [in] smem_storage      Shared reference to opaque SmemStorage layout
    [in] input             Calling thread's input
    [in] valid_threads     Number of threads containing valid elements (may be less than BLOCK_THREADS)
