CUB
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Groups Pages
Public Types | List of all members
cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS > Class Template Reference

Detailed description

template<typename T, int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
class cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >

WarpScan provides variants of parallel prefix scan across a CUDA warp.

warp_scan_logo.png
.
Overview
Given a list of input elements and a binary reduction operator, a prefix scan produces an output list where each element is computed to be the reduction of the elements occurring earlier in the input list. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction includes the ith input. The term exclusive indicates the ith input is not computed into the ith output reduction.
For convenience, WarpScan exposes a spectrum of entrypoints that differ by:
  • Operator (generic scan vs. prefix sum for numeric types)
  • Output ordering (inclusive vs. exclusive)
  • Warp-wide prefix (identity vs. call-back functor)
  • Output (scanned elements only vs. scanned elements and the total aggregate)
Template Parameters
TThe scan input/output element type
WARPSThe number of "logical" warps performing concurrent warp scans
LOGICAL_WARP_THREADS[optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 warps for SM20).
Usage Considerations
  • Supports non-commutative scan operators
  • Supports "logical" warps smaller than the physical warp size (e.g., a logical warp of 8 threads)
  • Warp scans are concurrent if more than one warp is participating
  • Any warp-wide scalar inputs and outputs (e.g., warp_prefix_op and warp_aggregate) are only considered valid in lane0
  • After any operation, a subsequent __syncthreads() barrier is required if the supplied WarpScan::SmemStorage is to be reused or repurposed by the threadblock
Performance Considerations
  • Uses special instructions when applicable (e.g., warp SHFL)
  • Uses synchronization-free communication between warp lanes when applicable
  • Zero bank conflicts for most types.
  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:
    • Prefix sum variants (vs. generic scan)
    • Exclusive variants (vs. inclusive)
    • Basic scan variants that don't require scalar inputs and outputs (e.g., warp_prefix_op and warp_aggregate)
    • Scan parameterizations where T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
    • Scan parameterizations where LOGICAL_WARP_THREADS is a multiple of the architecture's warp size
Algorithm
These parallel prefix scan variants implement a warp-synchronous Kogge-Stone algorithm having O(logn) steps and O(nlogn) work complexity, where n = LOGICAL_WARP_THREADS (which defaults to the warp size associated with the CUDA Compute Capability targeted by the compiler).

kogge_stone_scan.png
Data flow within a 16-thread Kogge-Stone scan construction. Junctions represent binary operators.

Examples
Example 1. Perform a simple exclusive prefix sum for one warp
#include <cub.cuh>
__global__ void SomeKernel(...)
{
// A parameterized int-based WarpScan type for use with one warp.
typedef cub::WarpScan<int, 1> WarpScan;
// Opaque shared memory for WarpScan
__shared__ typename WarpScan::SmemStorage smem_storage;
// Perform prefix sum of threadIds in first warp
if (threadIdx.x < 32)
{
int input = threadIdx.x;
int output;
WarpScan::ExclusiveSum(smem_storage, input, output);
printf("tid(%d) output(%d)\n\n", threadIdx.x, output);
}
Printed output:
tid(0) output(0)
tid(1) output(0)
tid(2) output(1)
tid(3) output(3)
tid(4) output(6)
...
tid(31) output(465)
Example 2. Perform an exclusive prefix sum for one warp seeded with a warp-wide prefix
#include <cub.cuh>
struct WarpPrefixOp
{
int warp_prefix;
__device__ WarpPrefixOp(int warp_prefix) : warp_prefix(warp_prefix) {}
__device__ int operator(int warp_aggregate)
{
int old_prefix = warp_prefix;
warp_prefix += warp_aggregate;
return old_prefix;
}
}
__global__ void SomeKernel(...)
{
// A parameterized int-based WarpScan type for use with one warp.
typedef cub::WarpScan<int, 1> WarpScan;
// Opaque shared memory for WarpScan
__shared__ typename WarpScan::SmemStorage smem_storage;
// Perform prefix sum of 2s, all seeded with a warp prefix value of 10
if (threadIdx.x < 32)
{
int input = 2;
int output;
int warp_aggregate;
WarpPrefixOp warp_prefix_op(10);
WarpScan::ExclusiveSum(smem_storage, input, output,
warp_aggregate, warp_prefix_op);
printf("tid(%d) output(%d)\n\n", threadIdx.x, output);
if (threadIdx.x == 0)
printf("updated aggregate(%d) and warp_prefix(%d)\n",
aggregate, warp_prefix_op.warp_prefix);
}
Printed output:
tid(0) output(10)
tid(1) output(12)
tid(2) output(14)
tid(3) output(16)
tid(4) output(18)
...
tid(31) output(72)
updated aggregate(74) and warp_prefix(84)

Public Types

typedef _SmemStorage SmemStorage
 The operations exposed by WarpScan require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
 

Static Public Methods

Inclusive prefix sums
static __device__
__forceinline__ void 
InclusiveSum (SmemStorage &smem_storage, T input, T &output)
 Computes an inclusive prefix sum in each logical warp. More...
 
static __device__
__forceinline__ void 
InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate)
 Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More...
 
template<typename WarpPrefixOp >
static __device__
__forceinline__ void 
InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op)
 Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. More...
 
Exclusive prefix sums
static __device__
__forceinline__ void 
ExclusiveSum (SmemStorage &smem_storage, T input, T &output)
 Computes an exclusive prefix sum in each logical warp. More...
 
static __device__
__forceinline__ void 
ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate)
 Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More...
 
template<typename WarpPrefixOp >
static __device__
__forceinline__ void 
ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op)
 Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. More...
 
Inclusive prefix scans
template<typename ScanOp >
static __device__
__forceinline__ void 
InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op)
 Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. More...
 
template<typename ScanOp >
static __device__
__forceinline__ void 
InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate)
 Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More...
 
template<typename ScanOp , typename WarpPrefixOp >
static __device__
__forceinline__ void 
InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op)
 Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. More...
 
Exclusive prefix scans
template<typename ScanOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. More...
 
template<typename ScanOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More...
 
template<typename ScanOp , typename WarpPrefixOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. More...
 
Exclusive prefix scans (without supplied identity)
template<typename ScanOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. More...
 
template<typename ScanOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. More...
 
template<typename ScanOp , typename WarpPrefixOp >
static __device__
__forceinline__ void 
ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op)
 Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op value from thread-thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for thread-thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. More...
 

Member Function Documentation

template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveSum ( SmemStorage smem_storage,
input,
T &  output 
)
inlinestatic

Computes an inclusive prefix sum in each logical warp.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveSum ( SmemStorage smem_storage,
input,
T &  output,
T &  warp_aggregate 
)
inlinestatic

Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.

The warp_aggregate is undefined in threads other than thread-lane0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename WarpPrefixOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveSum ( SmemStorage smem_storage,
input,
T &  output,
T &  warp_aggregate,
WarpPrefixOp &  warp_prefix_op 
)
inlinestatic

Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.

The warp_aggregate is undefined in threads other than thread-lane0.

The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
WarpPrefixOp[inferred] Call-back functor type having member T operator()(T warp_aggregate)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items, exclusive of the warp_prefix_op value
[in,out]warp_prefix_op[warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveSum ( SmemStorage smem_storage,
input,
T &  output 
)
inlinestatic

Computes an exclusive prefix sum in each logical warp.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveSum ( SmemStorage smem_storage,
input,
T &  output,
T &  warp_aggregate 
)
inlinestatic

Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.

The warp_aggregate is undefined in threads other than thread-lane0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename WarpPrefixOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveSum ( SmemStorage smem_storage,
input,
T &  output,
T &  warp_aggregate,
WarpPrefixOp &  warp_prefix_op 
)
inlinestatic

Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.

The warp_aggregate is undefined in threads other than thread-lane0.

The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
WarpPrefixOp[inferred] Call-back functor type having member T operator()(T warp_aggregate)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value).
[in,out]warp_prefix_op[warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op 
)
inlinestatic

Computes an inclusive prefix sum using the specified binary scan functor in each logical warp.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op,
T &  warp_aggregate 
)
inlinestatic

Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.

The warp_aggregate is undefined in threads other than thread-lane0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::InclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op,
T &  warp_aggregate,
WarpPrefixOp &  warp_prefix_op 
)
inlinestatic

Computes an inclusive prefix sum using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.

The warp_aggregate is undefined in threads other than thread-lane0.

The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
WarpPrefixOp[inferred] Call-back functor type having member T operator()(T warp_aggregate)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value).
[in,out]warp_prefix_op[warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
const T &  identity,
ScanOp  scan_op 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]identityIdentity value
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
const T &  identity,
ScanOp  scan_op,
T &  warp_aggregate 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.

The warp_aggregate is undefined in threads other than thread-lane0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]identityIdentity value
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
const T &  identity,
ScanOp  scan_op,
T &  warp_aggregate,
WarpPrefixOp &  warp_prefix_op 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.

The warp_aggregate is undefined in threads other than thread-lane0.

The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
WarpPrefixOp[inferred] Call-back functor type having member T operator()(T warp_aggregate)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]identityIdentity value
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value).
[in,out]warp_prefix_op[warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op,
T &  warp_aggregate 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.

The warp_aggregate is undefined in threads other than thread-lane0.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items.
template<typename T , int WARPS, int LOGICAL_WARP_THREADS = DeviceProps::WARP_THREADS>
template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void cub::WarpScan< T, WARPS, LOGICAL_WARP_THREADS >::ExclusiveScan ( SmemStorage smem_storage,
input,
T &  output,
ScanOp  scan_op,
T &  warp_aggregate,
WarpPrefixOp &  warp_prefix_op 
)
inlinestatic

Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op value from thread-thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for thread-thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.

The warp_aggregate is undefined in threads other than thread-lane0.

The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. This functor is expected to return a warp-wide prefix to be applied to all inputs. The functor will be invoked by the entire warp of threads, however the input and output are undefined in threads other than warp-lane0. Can be stateful.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Template Parameters
ScanOp[inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
WarpPrefixOp[inferred] Call-back functor type having member T operator()(T warp_aggregate)
Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in]inputCalling thread's input item.
[out]outputCalling thread's output item. May be aliased with input.
[in]scan_opBinary scan operator having member T operator()(const T &a, const T &b)
[out]warp_aggregate[warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value).
[in,out]warp_prefix_op[warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs.

The documentation for this class was generated from the following file: