CUB
WarpScan provides variants of parallel prefix scan across a CUDA warp.
| T | The scan input/output element type |
| WARPS | The number of "logical" warps performing concurrent warp scans |
| LOGICAL_WARP_THREADS | [optional] The number of threads per "logical" warp (may be less than the number of hardware warp threads). Default is the warp size associated with the CUDA Compute Capability targeted by the compiler (e.g., 32 threads for SM20). |
Warp-wide scalar inputs and outputs (e.g., warp_prefix_op and warp_aggregate) are only considered valid in lane0.
A subsequent __syncthreads() barrier is required if the supplied WarpScan::SmemStorage is to be reused or repurposed by the threadblock.
… SHFL)
… warp_prefix_op and warp_aggregate)
… T is a built-in C++ primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
… LOGICAL_WARP_THREADS is a multiple of the architecture's warp size
… LOGICAL_WARP_THREADS (which defaults to the warp size associated with the CUDA Compute Capability targeted by the compiler).
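The WARPS and LOGICAL_WARP_THREADS parameters can be pictured with a serial host-side model. The sketch below is not CUB code and does not touch the GPU; the function name ModelLogicalWarpInclusiveSum is hypothetical. It assumes the threads are laid out as consecutive logical warps and that each logical warp scans independently, starting from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial host-side model (an illustration, not CUB's implementation).
// `inputs` holds one item per thread, laid out as consecutive logical warps of
// `logical_warp_threads` lanes each. Every logical warp gets its own
// independent inclusive prefix sum.
std::vector<int> ModelLogicalWarpInclusiveSum(const std::vector<int>& inputs,
                                              std::size_t logical_warp_threads) {
    assert(logical_warp_threads > 0);
    assert(inputs.size() % logical_warp_threads == 0);

    std::vector<int> outputs(inputs.size());
    for (std::size_t base = 0; base < inputs.size(); base += logical_warp_threads) {
        int running = 0;  // each logical warp starts again from 0
        for (std::size_t lane = 0; lane < logical_warp_threads; ++lane) {
            running += inputs[base + lane];
            outputs[base + lane] = running;
        }
    }
    return outputs;
}
```

With eight inputs of 1 and a logical warp size of 4, the model yields 1, 2, 3, 4 for each of the two logical warps rather than a single run from 1 to 8.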
Public Types | |
| typedef _SmemStorage | SmemStorage |
The operations exposed by WarpScan require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse. | |
Static Public Methods | |
Inclusive prefix sums | |
| static __device__ __forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output) |
| Computes an inclusive prefix sum in each logical warp. | |
| static __device__ __forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) |
Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. | |
| template<typename WarpPrefixOp > | |
| static __device__ __forceinline__ void | InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. | |
Exclusive prefix sums | |
| static __device__ __forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output) |
| Computes an exclusive prefix sum in each logical warp. | |
| static __device__ __forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) |
Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. | |
| template<typename WarpPrefixOp > | |
| static __device__ __forceinline__ void | ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. | |
Inclusive prefix scans | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) |
| Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. | |
| template<typename ScanOp , typename WarpPrefixOp > | |
| static __device__ __forceinline__ void | InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. | |
Exclusive prefix scans | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op) |
| Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. | |
| template<typename ScanOp , typename WarpPrefixOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. | |
Exclusive prefix scans (without supplied identity) | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. | |
| template<typename ScanOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. | |
| template<typename ScanOp , typename WarpPrefixOp > | |
| static __device__ __forceinline__ void | ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) |
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op value from thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate. | |
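static __device__ __forceinline__ void InclusiveSum (SmemStorage &smem_storage, T input, T &output) [inline, static]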
Computes an inclusive prefix sum in each logical warp.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
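static __device__ __forceinline__ void InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) [inline, static]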
Computes an inclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.
The warp_aggregate is undefined in threads other than thread-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
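The lane0-only validity of warp_aggregate can be made concrete with a serial host-side model. This is a sketch of the documented semantics only, not CUB code; LaneSumResult and ModelInclusiveSumWithAggregate are hypothetical names, and the aggregate_defined flag stands in for "undefined in other lanes".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What a single lane holds after the call in this model (not CUB code).
struct LaneSumResult {
    int output;              // inclusive prefix sum for this lane
    bool aggregate_defined;  // true only for lane 0
    int warp_aggregate;      // meaningful only when aggregate_defined is true
};

// Serial model of one logical warp: inputs[lane] is that lane's input item.
std::vector<LaneSumResult> ModelInclusiveSumWithAggregate(const std::vector<int>& inputs) {
    assert(!inputs.empty());  // a logical warp has at least one lane

    std::vector<LaneSumResult> lanes(inputs.size());
    int running = 0;
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        running += inputs[lane];
        lanes[lane].output = running;
        lanes[lane].aggregate_defined = false;
        lanes[lane].warp_aggregate = 0;  // placeholder; undefined in the documented API
    }
    // Only lane 0 receives the warp-wide total.
    lanes[0].aggregate_defined = true;
    lanes[0].warp_aggregate = running;
    return lanes;
}
```

Code that needs the total in every lane would have to broadcast it from lane0 itself; the model deliberately leaves the other lanes flagged as undefined.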
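template<typename WarpPrefixOp >
static __device__ __forceinline__ void InclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) [inline, static]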
Computes an inclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.
The warp_aggregate is undefined in threads other than thread-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor is expected to return a warp-wide prefix to be applied to all inputs. It is invoked by the entire warp of threads; however, its input and output are undefined in threads other than warp-lane0. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
| [in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
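A stateful prefix functor is typically used to carry a running total across successive tiles of input. The serial host-side sketch below models that behavior under stated assumptions: RunningPrefixOp and ModelInclusiveSumWithPrefix are hypothetical names, the model collapses the warp into a loop, and it represents only what lane0 would observe for warp_aggregate and the functor state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stateful call-back functor with the documented shape
// T operator()(T warp_aggregate). It returns the prefix to apply to the
// current tile and then folds the tile's aggregate into its running total.
struct RunningPrefixOp {
    int running_total;
    explicit RunningPrefixOp(int initial) : running_total(initial) {}
    int operator()(int warp_aggregate) {
        int old_prefix = running_total;
        running_total += warp_aggregate;
        return old_prefix;
    }
};

// Serial model of one logical warp (not CUB code).
template <typename WarpPrefixOp>
std::vector<int> ModelInclusiveSumWithPrefix(const std::vector<int>& inputs,
                                             int& warp_aggregate,
                                             WarpPrefixOp& warp_prefix_op) {
    // The aggregate covers the inputs only, exclusive of the prefix.
    warp_aggregate = 0;
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        warp_aggregate += inputs[lane];
    }
    // The functor sees the aggregate and supplies the seed for this tile.
    int running = warp_prefix_op(warp_aggregate);

    std::vector<int> outputs;
    outputs.reserve(inputs.size());
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        running += inputs[lane];
        outputs.push_back(running);
    }
    return outputs;
}
```

Starting the functor at 10 and scanning the tile 1, 2, 3, 4 gives outputs 11, 13, 16, 20 with an aggregate of 10; a second tile then continues from a prefix of 20 because the functor has absorbed the first aggregate.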
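static __device__ __forceinline__ void ExclusiveSum (SmemStorage &smem_storage, T input, T &output) [inline, static]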
Computes an exclusive prefix sum in each logical warp.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
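The difference from the inclusive variant is that each lane's output excludes its own input. A minimal serial host-side model, not CUB code and with the hypothetical name ModelExclusiveSum, looks like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model of an exclusive prefix sum over one logical warp (not CUB code).
// outputs[lane] is the sum of the inputs held by all lower-numbered lanes.
std::vector<int> ModelExclusiveSum(const std::vector<int>& inputs) {
    std::vector<int> outputs(inputs.size());
    int running = 0;  // 0 acts as the warp-wide prefix
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        outputs[lane] = running;   // written before this lane's input is added
        running += inputs[lane];
    }
    return outputs;
}
```

For inputs 1, 2, 3, 4 the model returns 0, 1, 3, 6, whereas an inclusive sum over the same inputs would return 1, 3, 6, 10.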
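static __device__ __forceinline__ void ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate) [inline, static]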
Computes an exclusive prefix sum in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.
The warp_aggregate is undefined in threads other than thread-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
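template<typename WarpPrefixOp >
static __device__ __forceinline__ void ExclusiveSum (SmemStorage &smem_storage, T input, T &output, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) [inline, static]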
Computes an exclusive prefix sum in each logical warp. Instead of using 0 as the warp-wide prefix, the call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.
The warp_aggregate is undefined in threads other than thread-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor is expected to return a warp-wide prefix to be applied to all inputs. It is invoked by the entire warp of threads; however, its input and output are undefined in threads other than warp-lane0. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
| [in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
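template<typename ScanOp >
static __device__ __forceinline__ void InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) [inline, static]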
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
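Any functor with the member T operator()(const T &a, const T &b) can stand in for addition. The serial host-side sketch below models an inclusive scan with a maximum operator. It is not CUB code; MaxOp and ModelInclusiveScan are hypothetical names, and the choice to pass the running value as the left operand is an assumption of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical binary scan functor with the documented member shape.
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// Serial model of an inclusive scan over one logical warp (not CUB code).
template <typename T, typename ScanOp>
std::vector<T> ModelInclusiveScan(const std::vector<T>& inputs, ScanOp scan_op) {
    assert(!inputs.empty());  // a logical warp has at least one lane

    std::vector<T> outputs;
    outputs.reserve(inputs.size());
    T running = inputs[0];  // lane 0's output is simply its own input
    outputs.push_back(running);
    for (std::size_t lane = 1; lane < inputs.size(); ++lane) {
        // Lower lanes are accumulated on the left, this lane's input on the right.
        running = scan_op(running, inputs[lane]);
        outputs.push_back(running);
    }
    return outputs;
}
```

Scanning 2, 7, 1, 8, 2 with MaxOp produces the running maximum 2, 7, 7, 8, 8.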
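template<typename ScanOp >
static __device__ __forceinline__ void InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) [inline, static]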
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.
The warp_aggregate is undefined in threads other than thread-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
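template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void InclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) [inline, static]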
Computes an inclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.
The warp_aggregate is undefined in threads other than thread-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor is expected to return a warp-wide prefix to be applied to all inputs. It is invoked by the entire warp of threads; however, its input and output are undefined in threads other than warp-lane0. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
| [in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
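template<typename ScanOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op) [inline, static]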
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | identity | Identity value |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
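The identity parameter supplies the value that lane0 receives, since no lower lanes exist to contribute to it. The serial host-side sketch below models this with a minimum operator whose identity is the largest int. It is not CUB code; MinOp and ModelExclusiveScan are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical binary scan functor with the documented member shape.
struct MinOp {
    int operator()(const int& a, const int& b) const { return (b < a) ? b : a; }
};

// Serial model of an exclusive scan with a caller-supplied identity (not CUB code).
template <typename T, typename ScanOp>
std::vector<T> ModelExclusiveScan(const std::vector<T>& inputs,
                                  const T& identity,
                                  ScanOp scan_op) {
    std::vector<T> outputs;
    outputs.reserve(inputs.size());
    T running = identity;  // lane 0 has no predecessors, so it sees the identity
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        outputs.push_back(running);
        running = scan_op(running, inputs[lane]);
    }
    return outputs;
}
```

For inputs 5, 3, 8, 1 with std::numeric_limits<int>::max() as the identity, lane0 receives the identity and the remaining lanes receive 5, 3, 3. The identity must be neutral for the chosen operator (0 for addition, 1 for multiplication, the largest value for minimum), otherwise the later outputs are skewed as well.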
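template<typename ScanOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate) [inline, static]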
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.
The warp_aggregate is undefined in threads other than thread-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | identity | Identity value |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
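template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, const T &identity, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) [inline, static]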
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The call-back functor warp_prefix_op is invoked to provide the "seed" value that logically prefixes the warp's scan inputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.
The warp_aggregate is undefined in threads other than thread-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor is expected to return a warp-wide prefix to be applied to all inputs. It is invoked by the entire warp of threads; however, its input and output are undefined in threads other than warp-lane0. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | identity | Identity value |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
| [in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
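template<typename ScanOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op) [inline, static]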
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
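Without an identity there is nothing meaningful to hand to lane0, which is why its output is documented as invalid. The serial host-side sketch below makes that explicit with a validity flag. It is not CUB code; LaneOutput, MaxOp and ModelExclusiveScanNoIdentity are hypothetical names, and the placeholder stored for lane0 is an artifact of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical binary scan functor with the documented member shape.
struct MaxOp {
    int operator()(const int& a, const int& b) const { return (b > a) ? b : a; }
};

// One lane's output in this model; `valid` is false where the result is undefined.
template <typename T>
struct LaneOutput {
    bool valid;
    T value;
};

// Serial model of an exclusive scan with no identity value (not CUB code).
template <typename T, typename ScanOp>
std::vector<LaneOutput<T> > ModelExclusiveScanNoIdentity(const std::vector<T>& inputs,
                                                         ScanOp scan_op) {
    assert(!inputs.empty());  // a logical warp has at least one lane

    std::vector<LaneOutput<T> > outputs(inputs.size());
    outputs[0].valid = false;  // lane 0 has no predecessors and no identity
    outputs[0].value = T();    // placeholder only; must not be consumed

    T running = inputs[0];
    for (std::size_t lane = 1; lane < inputs.size(); ++lane) {
        outputs[lane].valid = true;
        outputs[lane].value = running;  // scan of all lower lanes' inputs
        running = scan_op(running, inputs[lane]);
    }
    return outputs;
}
```

For inputs 4, 9, 2, 6 with MaxOp, lanes 1 through 3 receive 4, 9, 9 and lane0 is flagged invalid. A caller of the real method would need to overwrite or ignore lane0's output before using it.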
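template<typename ScanOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate) [inline, static]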
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. Because no identity value is supplied, the output computed for thread-lane0 is invalid. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0.
The warp_aggregate is undefined in threads other than thread-lane0.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items. |
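template<typename ScanOp , typename WarpPrefixOp >
static __device__ __forceinline__ void ExclusiveScan (SmemStorage &smem_storage, T input, T &output, ScanOp scan_op, T &warp_aggregate, WarpPrefixOp &warp_prefix_op) [inline, static]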
Computes an exclusive prefix scan using the specified binary scan functor in each logical warp. The warp_prefix_op value from thread-lane0 is applied to all scan outputs. Also computes the warp-wide warp_aggregate of all inputs for thread-lane0. The warp_prefix_op is further updated by the value of warp_aggregate.
The warp_aggregate is undefined in threads other than thread-lane0.
The warp_prefix_op functor must implement a member function T operator()(T warp_aggregate). The functor's input parameter warp_aggregate is the same value also returned by the scan operation. The functor is expected to return a warp-wide prefix to be applied to all inputs. It is invoked by the entire warp of threads; however, its input and output are undefined in threads other than warp-lane0. The functor can be stateful.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.
| ScanOp | [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) |
| WarpPrefixOp | [inferred] Call-back functor type having member T operator()(T warp_aggregate) |
| [in] | smem_storage | Shared reference to opaque SmemStorage layout |
| [in] | input | Calling thread's input item. |
| [out] | output | Calling thread's output item. May be aliased with input. |
| [in] | scan_op | Binary scan operator having member T operator()(const T &a, const T &b) |
| [out] | warp_aggregate | [warp-lane0 only] Warp-wide aggregate reduction of input items (exclusive of the warp_prefix_op value). |
| [in,out] | warp_prefix_op | [warp-lane0 only] Call-back functor for specifying a warp-wide prefix to be applied to all inputs. |
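Because the prefix is applied to all scan outputs, lane0 ends up with a defined result even though no identity is supplied. The serial host-side sketch below illustrates this with string concatenation, chosen only because it makes operand order visible. It is not CUB code; ConcatOp, RunningConcatPrefixOp and ModelExclusiveScanWithPrefix are hypothetical names, and combining the prefix as the left operand is an assumption that the text above does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical binary scan functor: concatenation keeps operand order visible.
struct ConcatOp {
    std::string operator()(const std::string& a, const std::string& b) const {
        return a + b;
    }
};

// Hypothetical stateful prefix functor with the shape T operator()(T warp_aggregate).
struct RunningConcatPrefixOp {
    std::string running;
    explicit RunningConcatPrefixOp(const std::string& initial) : running(initial) {}
    std::string operator()(std::string warp_aggregate) {
        std::string old_prefix = running;
        running = running + warp_aggregate;  // absorb this tile's aggregate
        return old_prefix;
    }
};

// Serial model of one logical warp (not CUB code).
template <typename T, typename ScanOp, typename WarpPrefixOp>
std::vector<T> ModelExclusiveScanWithPrefix(const std::vector<T>& inputs,
                                            ScanOp scan_op,
                                            T& warp_aggregate,
                                            WarpPrefixOp& warp_prefix_op) {
    assert(!inputs.empty());  // a logical warp has at least one lane

    // Aggregate of the inputs only, exclusive of the prefix.
    warp_aggregate = inputs[0];
    for (std::size_t lane = 1; lane < inputs.size(); ++lane) {
        warp_aggregate = scan_op(warp_aggregate, inputs[lane]);
    }

    // The prefix seeds the scan, so lane 0's output is the prefix itself.
    T running = warp_prefix_op(warp_aggregate);
    std::vector<T> outputs;
    outputs.reserve(inputs.size());
    for (std::size_t lane = 0; lane < inputs.size(); ++lane) {
        outputs.push_back(running);
        running = scan_op(running, inputs[lane]);
    }
    return outputs;
}
```

With a functor seeded with "X" and the inputs "a", "b", "c", the model returns "X", "Xa", "Xab" and an aggregate of "abc"; a following tile then starts from the prefix "Xabc".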