CUB Documentation

Table of Contents

   Download CUB!
   Browse or fork CUB at GitHub!
   Join the cub-users discussion forum!

(1) What is CUB?

CUB is a library of high-performance parallel primitives and other utilities for constructing CUDA kernel software. CUB enhances productivity, performance, and portability by providing an abstraction layer over complex block-level, warp-level, and thread-level operations.
CUB's primitives are not bound to any particular width of parallelism or to any particular data type. This allows them to be flexible and tunable to fit your kernels' needs. Thus CUB is CUDA Unbound.
[Figure: cub_overview.png — overview of CUB's collections of kernel primitives]
Browse our collections of:
  • Cooperative primitives, including:
    • Thread block operations (e.g., radix sort, prefix scan, reduction, etc.)
    • Warp operations (e.g., prefix scan)
  • SIMT utilities, including:
    • Tile-based I/O utilities (e.g., for performing {vectorized, coalesced} data movement of {blocked, striped} data tiles)
    • Low-level thread I/O using cache-modifiers
    • Abstractions for thread block work distribution (e.g., work-stealing, even-share, etc.)
  • Host utilities, including:
    • Caching allocator for quick management of device temporaries
    • Device reflection

(2) Recent news

  • CUB v0.9.1 (03/09/2013). Initial "preview" release. CUB is the first durable, high-performance library of cooperative block-level, warp-level, and thread-level primitives for CUDA kernel programming. More primitives and examples coming soon!

(3) A simple example

The following code snippet illustrates a simple CUDA kernel for sorting a thread block's data:
#include <cub.cuh>

// A tile-sorting CUDA kernel
template <
    int         BLOCK_THREADS,              // Threads per block
    int         ITEMS_PER_THREAD,           // Items per thread
    typename    T>                          // Numeric data type
__global__ void TileSortKernel(T *d_in, T *d_out)
{
    using namespace cub;
    const int TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD;

    // Parameterize cub::BlockRadixSort for the parallel execution context
    typedef BlockRadixSort<T, BLOCK_THREADS> BlockRadixSort;

    // Declare the shared memory needed by BlockRadixSort
    __shared__ typename BlockRadixSort::SmemStorage smem_storage;

    // A segment of data items per thread
    T data[ITEMS_PER_THREAD];

    // Load a tile of data using vector-load instructions
    BlockLoadVectorized(data, d_in + (blockIdx.x * TILE_SIZE));

    // Sort data in ascending order
    BlockRadixSort::SortBlocked(smem_storage, data);

    // Store the sorted tile using vector-store instructions
    BlockStoreVectorized(data, d_out + (blockIdx.x * TILE_SIZE));
}
The cub::BlockRadixSort type performs a cooperative radix sort across the thread block's data items. Its implementation is parameterized by the number of threads per block and the aggregate data type T and is specialized for the underlying architecture.
Once instantiated, the cub::BlockRadixSort type exposes an opaque cub::BlockRadixSort::SmemStorage member type. The thread block uses this storage type to allocate the shared memory needed by the primitive. This storage type can be aliased or union'd with other types so that the shared memory can be reused for other purposes.
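For example, a kernel that uses several primitives in sequence can overlap their shared memory through a union (a minimal sketch; the BlockScan parameterization here simply mirrors the fragment shown in Section 6.2):

    // Sketch: one shared allocation reused by two primitives in turn
    typedef cub::BlockRadixSort<int, 128>   BlockRadixSort;
    typedef cub::BlockScan<int, 128>        BlockScan;

    __shared__ union
    {
        typename BlockRadixSort::SmemStorage    sort_storage;
        typename BlockScan::SmemStorage         scan_storage;
    } smem_storage;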
Furthermore, the kernel uses CUB's primitives for vectorizing global loads and stores. For example, lower-level ld.global.v4.s32 PTX instructions will be generated when T = int and ITEMS_PER_THREAD is a multiple of 4.
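On the host side, the kernel above might be launched as follows (illustrative only; num_tiles, d_in, and d_out are assumed to have been set up by the caller):

    // Each block sorts a tile of 512 ints (128 threads x 4 items per thread);
    // because ITEMS_PER_THREAD is a multiple of 4, int loads can be vectorized
    TileSortKernel<128, 4, int><<<num_tiles, 128>>>(d_in, d_out);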

(4) Why do you need CUB?

CUDA kernel software is where the complexity of parallelism is expressed. Programmers must reason about deadlock, livelock, synchronization, race conditions, shared memory layout, plurality of state, granularity, throughput, latency, memory bottlenecks, etc. Constructing and fine-tuning kernel code is perhaps the most challenging, time-consuming aspect of CUDA programming.
However, there are few (if any) other software libraries of reusable kernel primitives; in the CUDA ecosystem, CUB is unique in this regard. As a SIMT library and software abstraction layer, CUB provides:
  1. Simplicity of composition. Parallel CUB primitives can be simply sequenced together in kernel code. (This convenience is analogous to programming with Thrust primitives in the host program.)
  2. High performance. CUB simplifies high-performance kernel development by implementing and making available the fastest known algorithms, strategies, and techniques.
  3. Performance portability. CUB primitives are specialized to match the target hardware. Furthermore, the CUB library continually evolves to accommodate new algorithmic developments, hardware instructions, etc.
  4. Simplicity of performance tuning. CUB primitives provide parallel abstractions whose performance behavior can be statically tuned. For example, most CUB primitives support alternative algorithmic strategies and variable grain sizes (threads per block, items per thread, etc.).
  5. Robustness and durability. CUB primitives are designed to function properly for arbitrary data types and widths of parallelism (not just for the built-in C++ types or for powers-of-two threads per block).

(5) Where is CUB positioned in the CUDA ecosystem?

CUDA's programming model embodies three different levels of program execution, each engendering its own abstraction layer in the CUDA software stack:
  • CUDA kernel. A single CPU thread invokes a CUDA kernel to perform some data-parallel function. The incorporation of entire kernels (and their corresponding invocation stubs) into libraries is the most common form of code reuse for CUDA.
  • Thread blocks (SIMT). Each kernel invocation comprises some number of parallel threads. Threads are grouped into blocks, and the entire block of threads invokes some cooperative function in which they communicate and synchronize with each other. There has historically been very little reuse of cooperative SIMT software within CUDA kernels; CUB is positioned at this level.
  • CUDA thread. A single CUDA thread invokes some sequential function. This is the finest-grained level of CUDA software abstraction and requires no consideration for the scheduling or synchronization of parallel threads.

(6) How does CUB work?

CUB leverages the following programming idioms:
  1. C++ templates
  2. Reflective type structure
  3. Flexible data mapping

6.1    C++ templates

As a SIMT library, CUB must be flexible enough to accommodate a wide spectrum of parallel execution contexts, i.e., specific:
  • Data types
  • Widths of parallelism (threads per block)
  • Grain sizes (data items per thread)
  • Underlying architectures (special instructions, warp size, rules for bank conflicts, etc.)
  • Tuning requirements (e.g., latency vs. throughput)
To provide this flexibility, CUB is implemented as a C++ template library: its algorithms and data structures are written generically and specialized at compile time for each execution context. There is no need to build CUB separately. You simply #include the cub.cuh header file into your .cu CUDA C++ sources and compile with NVIDIA's nvcc compiler.
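For example, the same primitive can be specialized for entirely different execution contexts purely through its template parameters (an illustrative sketch using the cub::BlockRadixSort type from Section 3; the specific values are arbitrary):

    // Two compile-time specializations of the same primitive
    typedef cub::BlockRadixSort<float, 64>  SmallBlockSort;     // 64 threads, float keys
    typedef cub::BlockRadixSort<int, 256>   LargeBlockSort;     // 256 threads, int keys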

6.2    Reflective type structure

Cooperation within a thread block requires shared memory for communicating between threads. However, the specific size and layout of the memory needed by a given primitive will be specific to the details of its parallel execution context (e.g., how many threads are calling into it, how many items are processed per thread, etc.). Furthermore, this shared memory must be allocated outside of the component itself if it is to be reused elsewhere by the thread block.
    // Parameterize a BlockScan type for use with 128 threads
    // and 4 items per thread
    typedef cub::BlockScan<int, 128> BlockScan;

    // Declare shared memory for BlockScan
    __shared__ typename BlockScan::SmemStorage smem_storage;

    // A segment of consecutive input items per thread
    int data[4];

    // Obtain data in blocked order
    ...

    // Perform an exclusive prefix sum across the tile of data
    BlockScan::ExclusiveSum(smem_storage, data, data);
To address this issue, we encapsulate cooperative procedures within reflective type structure (C++ classes). As illustrated in the cub::BlockScan example above, these primitives are C++ classes with interfaces that expose both:
  • Procedural entrypoints for a block of threads to invoke
  • An opaque shared memory type needed for the operation of those methods
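Putting these pieces together, a complete kernel built around the fragment above might look like this (a sketch under the same assumptions: 128 threads, 4 items per thread, and the vectorized I/O helpers from Section 3):

    // Sketch: an exclusive prefix-sum kernel (one 512-int tile per block)
    __global__ void PrefixSumKernel(int *d_in, int *d_out)
    {
        // Parameterize the primitive and allocate its opaque shared memory
        typedef cub::BlockScan<int, 128> BlockScan;
        __shared__ typename BlockScan::SmemStorage smem_storage;

        // Load four consecutive items per thread in blocked order
        int data[4];
        cub::BlockLoadVectorized(data, d_in + (blockIdx.x * 512));

        // Cooperative exclusive prefix sum across the tile
        BlockScan::ExclusiveSum(smem_storage, data, data);

        // Store the scanned tile
        cub::BlockStoreVectorized(data, d_out + (blockIdx.x * 512));
    }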

6.3    Flexible data mapping

We often design kernels such that each thread block is assigned a "tile" of data items for processing.
[Figure: tile.png — a tile of eight ordered data items]
When the tile size equals the thread block size, the mapping of data onto threads is straightforward (one datum per thread). However, there are often performance advantages for processing more than one datum per thread. For these scenarios, CUB primitives support the following alternatives for partitioning data items across the block of threads:
  • Blocked arrangement. The aggregate tile of items is partitioned evenly across threads in "blocked" fashion, with thread_i owning the i-th segment of consecutive elements. Blocked arrangements are often desirable for algorithmic benefits (where long sequences of items can be processed sequentially within each thread).
[Figure: blocked.png — blocked arrangement across four threads (emphasis on items owned by thread_0)]
  • Striped arrangement. The aggregate tile of items is partitioned across threads in "striped" fashion, i.e., the ITEMS_PER_THREAD items owned by each thread have logical stride BLOCK_THREADS between them. Striped arrangements are often desirable for data movement through global memory (where read/write coalescing is an important performance consideration).
[Figure: striped.png — striped arrangement across four threads (emphasis on items owned by thread_0)]
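The two arrangements amount to a simple difference in index arithmetic. For a tile of BLOCK_THREADS * ITEMS_PER_THREAD items, the j-th item owned by the calling thread is located at:

    // Location within the tile of the calling thread's j-th item
    int blocked_idx = (threadIdx.x * ITEMS_PER_THREAD) + j;    // blocked arrangement
    int striped_idx = (j * BLOCK_THREADS) + threadIdx.x;       // striped arrangement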
The benefits of processing multiple items per thread (a.k.a., register blocking, granularity coarsening, etc.) include:
  • Algorithmic efficiency. Sequential work over multiple items in thread-private registers is cheaper than synchronized, cooperative work through shared memory spaces.
  • Data occupancy. The number of items that can be resident on-chip in thread-private register storage is often greater than the number of schedulable threads.
  • Instruction-level parallelism. Processing multiple items per thread also facilitates greater ILP for improved throughput and utilization.
Finally, cub::BlockExchange provides operations for converting between blocked and striped arrangements.
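For example, a kernel might read its tile in striped order (for coalesced global loads) and then convert to blocked order for sequential per-thread work. The sketch below assumes cub::BlockExchange follows the same SmemStorage pattern as the other primitives on this page; its parameterization and the StripedToBlocked method name are assumptions, not confirmed here:

    // Sketch: convert a striped arrangement into a blocked arrangement
    // (interface assumed; see the cub::BlockExchange documentation)
    typedef cub::BlockExchange<int, 128, 4> BlockExchange;
    __shared__ typename BlockExchange::SmemStorage smem_storage;

    int items[4];                                   // striped: stride of 128 between items
    ...                                             // load items in striped order
    BlockExchange::StripedToBlocked(smem_storage, items);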

(7) Contributors

CUB is developed as an open-source project by NVIDIA Research. The primary contributor is Duane Merrill.

(8) Open Source License

CUB is available under the "New BSD" open-source license:
Copyright (c) 2011, Duane Merrill. All rights reserved.
Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.