CUB
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Groups Pages
Classes | Public Types | List of all members
cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG > Class Template Reference

Detailed description

template<typename KeyType, int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
class cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >

BlockRadixSort provides variants of parallel radix sorting across a CUDA threadblock.

sorting_logo.png
.
Overview
The radix sorting method relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
BlockRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: unsigned char, int, double, etc. Within each key, the implementation treats fixed-length bit-sequences of RADIX_BITS as radix digit places. Although the direct radix sorting method can only be applied to unsigned integral types, BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations that ensure lexicographic key ordering.
For convenience, BlockRadixSort exposes a spectrum of entrypoints that differ by:
  • Value association (keys-only vs. key-value-pairs)
  • Input/output data arrangements (combinations of blocked and striped arrangements)
Template Parameters
KeyTypeKey type
BLOCK_THREADSThe threadblock size in threads
ITEMS_PER_THREADThe number of items per thread
ValueType[optional] Value type (default: cub::NullType)
RADIX_BITS[optional] The number of radix bits per digit place (default: 5 bits)
SMEM_CONFIG[optional] Shared memory bank mode (default: cudaSharedMemBankSizeFourByte)
Usage Considerations
  • After any sorting operation, a subsequent __syncthreads() barrier is required if the supplied BlockRadixSort::SmemStorage is to be reused or repurposed by the threadblock.
  • BlockRadixSort can only accommodate one associated tile of values. To "truck along" more than one tile of values, simply perform a key-value sort of the keys paired with a temporary value array that enumerates the key indices. The reordered indices can then be used as a gather-vector for exchanging other associated tile data through shared memory.
Performance Considerations
  • The operations are most efficient (lowest instruction overhead) when:
    • BLOCK_THREADS is a multiple of the architecture's warp size
    • KeyType is an unsigned integral type
    • Keys are partitioned across the threadblock in a blocked arrangement
Algorithm
BlockRadixSort is based on the method presented by Merrill et al.[1]. The implementation has O(n) work complexity and iterates over digit places using rounds constructed of
Examples
Example 1. Perform a radix sort over a tile of 512 integer keys that are partitioned in a blocked arrangement across a 128-thread threadblock (where each thread holds 4 keys).
#include <cub.cuh>
__global__ void SomeKernel(...)
{
// Parameterize BlockRadixSort for the parallel execution context
// Declare shared memory for BlockRadixSort
__shared__ typename BlockRadixSort::SmemStorage smem_storage;
// A segment of consecutive input items per thread
int keys[4];
// Obtain keys in blocked order
...
// Sort keys in ascending order
BlockRadixSort::SortBlocked(smem_storage, keys);
Example 2. Perform a key-value radix sort over the lower 20-bits of a tile of 32-bit integer keys paired with floating-point values. The data are partitioned in a striped arrangement across the threadblock.
#include <cub.cuh>
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void SomeKernel(...)
{
// Parameterize BlockRadixSort for the parallel execution context
// Declare shared memory for BlockRadixSort
__shared__ typename BlockRadixSort::SmemStorage smem_storage;
// Input keys and values per thread (striped across the threadblock)
int keys[ITEMS_PER_THREAD];
float values[ITEMS_PER_THREAD];
// Obtain keys and values in striped order
...
// Sort pairs in ascending order (using only the lower 20 distinguishing key bits)
BlockRadixSort::SortStriped(smem_storage, keys, values, 0, 20);
}

Public Types

typedef _SmemStorage SmemStorage
 The operations exposed by BlockRadixSort require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
 

Static Public Methods

Keys-only sorting
static __device__
__forceinline__ void 
SortBlocked (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a threadblock-wide radix sort over a blocked arrangement of keys. More...
 
static __device__
__forceinline__ void 
SortBlockedToStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a radix sort across a blocked arrangement of keys, leaving them in a striped arrangement. More...
 
static __device__
__forceinline__ void 
SortStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a radix sort across a striped arrangement of keys. More...
 
Key-value pair sorting
static __device__
__forceinline__ void 
SortBlocked (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a radix sort across a blocked arrangement of keys and values. More...
 
static __device__
__forceinline__ void 
SortBlockedToStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement. More...
 
static __device__
__forceinline__ void 
SortStriped (SmemStorage &smem_storage, KeyType(&keys)[ITEMS_PER_THREAD], ValueType(&values)[ITEMS_PER_THREAD], unsigned int begin_bit=0, const unsigned int &end_bit=sizeof(KeyType)*8)
 Performs a radix sort across a striped arrangement of keys and values. More...
 

Member Function Documentation

template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlocked ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a threadblock-wide radix sort over a blocked arrangement of keys.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison
template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlockedToStriped ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a radix sort across a blocked arrangement of keys, leaving them in a striped arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison
template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortStriped ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a radix sort across a striped arrangement of keys.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison
template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlocked ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
ValueType(&)  values[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a radix sort across a blocked arrangement of keys and values.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in,out]valuesValues to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison
template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortBlockedToStriped ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
ValueType(&)  values[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a radix sort across a blocked arrangement of keys and values, leaving them in a striped arrangement.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in,out]valuesValues to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison
template<typename KeyType , int BLOCK_THREADS, int ITEMS_PER_THREAD, typename ValueType = NullType, int RADIX_BITS = 5, cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte>
static __device__ __forceinline__ void cub::BlockRadixSort< KeyType, BLOCK_THREADS, ITEMS_PER_THREAD, ValueType, RADIX_BITS, SMEM_CONFIG >::SortStriped ( SmemStorage smem_storage,
KeyType(&)  keys[ITEMS_PER_THREAD],
ValueType(&)  values[ITEMS_PER_THREAD],
unsigned int  begin_bit = 0,
const unsigned int &  end_bit = sizeof(KeyType) * 8 
)
inlinestatic

Performs a radix sort across a striped arrangement of keys and values.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the supplied smem_storage is to be reused or repurposed by the threadblock.

Parameters
[in]smem_storageShared reference to opaque SmemStorage layout
[in,out]keysKeys to sort
[in,out]valuesValues to sort
[in]begin_bit[optional] The beginning (least-significant) bit index needed for key comparison
[in]end_bit[optional] The past-the-end (most-significant) bit index needed for key comparison

The documentation for this class was generated from the following file: