CUB
cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER > Class Template Reference

Detailed description

template<typename OutputIterator, int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockStorePolicy POLICY = BLOCK_STORE_DIRECT, PtxStoreModifier MODIFIER = PTX_STORE_NONE>
class cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >

BlockStore provides data movement operations for writing a blocked arrangement of data to global memory.


BlockStore provides a single tile-storing abstraction whose performance behavior can be statically tuned. In particular, BlockStore implements several alternative cub::BlockStorePolicy strategies catering to different granularity sizes (i.e., number of items per thread).

Template Parameters
OutputIterator: The output iterator type (may be a simple pointer type).
BLOCK_THREADS: The threadblock size in threads.
ITEMS_PER_THREAD: The number of consecutive items partitioned onto each thread.
POLICY: [optional] cub::BlockStorePolicy tuning policy enumeration. Default = cub::BLOCK_STORE_DIRECT.
MODIFIER: [optional] cub::PtxStoreModifier cache modifier. Default = cub::PTX_STORE_NONE.
Algorithm
BlockStore can be (optionally) configured to use one of three alternative methods:
  1. cub::BLOCK_STORE_DIRECT. A blocked arrangement of data is written directly to memory.
  2. cub::BLOCK_STORE_VECTORIZE. A blocked arrangement of data is written directly to memory using CUDA's built-in vectorized stores as a coalescing optimization.
  3. cub::BLOCK_STORE_TRANSPOSE. A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
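The difference between writing a blocked arrangement directly and transposing it into a striped arrangement first can be illustrated with a small host-side model. This is a sketch only, not CUB's implementation: each "thread" is a loop iteration, the staging vector stands in for shared memory, and the names Tile, DirectOffset, StripedOffset, StoreDirect, and StoreTransposed are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: each "thread" is a loop iteration, not a CUDA thread.
// items[t][i] is the i-th item held by thread t in a blocked arrangement.
using Tile = std::vector<std::vector<int>>;

// Offset written by thread t for its item i when a blocked arrangement is stored directly.
std::size_t DirectOffset(std::size_t t, std::size_t i, std::size_t items_per_thread) {
    return t * items_per_thread + i;
}

// Offset written by thread t for its item i once the tile is in a striped arrangement.
std::size_t StripedOffset(std::size_t t, std::size_t i, std::size_t block_threads) {
    return i * block_threads + t;
}

// Model of a direct store: every thread writes its own consecutive segment.
std::vector<int> StoreDirect(const Tile& items) {
    const std::size_t threads = items.size();
    const std::size_t ipt = threads ? items[0].size() : 0;
    std::vector<int> out(threads * ipt);
    for (std::size_t t = 0; t < threads; ++t) {
        assert(items[t].size() == ipt);  // all threads hold the same number of items
        for (std::size_t i = 0; i < ipt; ++i)
            out[DirectOffset(t, i, ipt)] = items[t][i];
    }
    return out;
}

// Model of a transposed store: blocked -> staging buffer -> striped -> memory.
std::vector<int> StoreTransposed(const Tile& items) {
    const std::size_t threads = items.size();
    const std::size_t ipt = threads ? items[0].size() : 0;

    // Step 1: every thread deposits its blocked items in the staging buffer
    // (the stand-in for shared memory).
    std::vector<int> staging(threads * ipt);
    for (std::size_t t = 0; t < threads; ++t) {
        assert(items[t].size() == ipt);
        for (std::size_t i = 0; i < ipt; ++i)
            staging[DirectOffset(t, i, ipt)] = items[t][i];
    }

    // Step 2: every thread picks up the items it owns in the striped arrangement.
    Tile striped(threads, std::vector<int>(ipt));
    for (std::size_t t = 0; t < threads; ++t)
        for (std::size_t i = 0; i < ipt; ++i)
            striped[t][i] = staging[StripedOffset(t, i, threads)];

    // Step 3: at each step i, adjacent threads write adjacent offsets.
    std::vector<int> out(threads * ipt);
    for (std::size_t t = 0; t < threads; ++t)
        for (std::size_t i = 0; i < ipt; ++i)
            out[StripedOffset(t, i, threads)] = striped[t][i];
    return out;
}
```

Both functions produce the same memory contents. What differs is the distance between the offsets that adjacent threads write at the same step: ITEMS_PER_THREAD elements for the direct model, one element for the striped model.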
Usage Considerations
  • After any operation, a subsequent __syncthreads() barrier is required if the supplied BlockStore::SmemStorage is to be reused or repurposed by the threadblock.
Performance Considerations
Examples
Example 1. Have a 128-thread threadblock directly store a blocked arrangement of four consecutive integers per thread:

#include <cub.cuh>

template <int BLOCK_THREADS>
__global__ void SomeKernel(int *d_out, ...)
{
    // Parameterize BlockStore for the parallel execution context
    typedef cub::BlockStore<int*, 128, 4> BlockStore;

    // Declare shared memory for BlockStore
    __shared__ typename BlockStore::SmemStorage smem_storage;

    // A segment of consecutive items per thread
    int data[4];

    // Store a tile of data
    BlockStore::Store(smem_storage, d_out + blockIdx.x * 128 * 4, data);

    ...
}

Example 2. Have a threadblock store a blocked arrangement of ITEMS_PER_THREAD consecutive integers per thread using vectorized stores and global-only caching:

#include <cub.cuh>

template <int BLOCK_THREADS>
__global__ void SomeKernel(int *d_out, ...)
{
    const int ITEMS_PER_THREAD = 4;

    // Parameterize BlockStore for the parallel execution context
    // (cub::PTX_STORE_CG is assumed here to be the global-only "cg" cache modifier)
    typedef cub::BlockStore<int*, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_STORE_VECTORIZE, cub::PTX_STORE_CG> BlockStore;

    // Declare shared memory for BlockStore
    __shared__ typename BlockStore::SmemStorage smem_storage;

    // A segment of consecutive items per thread
    int data[ITEMS_PER_THREAD];

    // Store a tile of data using vector-store instructions if possible
    BlockStore::Store(smem_storage, d_out + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, data);

    ...
}


Public Types

typedef _SmemStorage SmemStorage
 The operations exposed by BlockStore require shared memory of this type. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated shared memory or union'd with other types to facilitate shared memory reuse.
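The shared memory reuse described above can be sketched on the host with ordinary C++ types. StoreStorage and ReduceStorage below are hypothetical stand-ins for two opaque storage types (they are not CUB types, and their sizes are made up); the point is only that a union occupies as much space as its largest member rather than the sum of its members.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the opaque storage of two block-wide operations.
struct StoreStorage  { int exchange[128 * 4]; };  // e.g. one slot per tile item
struct ReduceStorage { int partials[128]; };      // e.g. one slot per thread

// Overlaying both in a union lets the two phases share the same bytes.
union SharedStorage {
    StoreStorage  store;
    ReduceStorage reduce;
};

// Footprint when the two storages are overlaid.
constexpr std::size_t UnionBytes() {
    return sizeof(SharedStorage);
}

// Footprint when the two storages are declared side by side.
constexpr std::size_t SeparateBytes() {
    return sizeof(StoreStorage) + sizeof(ReduceStorage);
}

static_assert(UnionBytes() < SeparateBytes(), "the union should be the smaller layout");
```

In a kernel, such a union would be declared with the __shared__ keyword, and a __syncthreads() barrier would separate the use of one member from the use of the other, as noted under Usage Considerations.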
 

Static Public Methods

static __device__ __forceinline__ void Store (SmemStorage &smem_storage, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD])
 Store a tile of items across a threadblock.
 
template<typename SizeT>
static __device__ __forceinline__ void Store (SmemStorage &smem_storage, OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD])
 Store a tile of items across a threadblock, guarded by range.
 

Member Function Documentation

template<typename OutputIterator, int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockStorePolicy POLICY = BLOCK_STORE_DIRECT, PtxStoreModifier MODIFIER = PTX_STORE_NONE>
static __device__ __forceinline__ void cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Store (SmemStorage &smem_storage, OutputIterator block_itr, T(&items)[ITEMS_PER_THREAD]) [inline, static]

Store a tile of items across a threadblock.

Parameters
[in] smem_storage: Shared reference to opaque SmemStorage layout
[in] block_itr: The threadblock's base output iterator for storing to
[in] items: Data to store

template<typename OutputIterator, int BLOCK_THREADS, int ITEMS_PER_THREAD, BlockStorePolicy POLICY = BLOCK_STORE_DIRECT, PtxStoreModifier MODIFIER = PTX_STORE_NONE>
template<typename SizeT>
static __device__ __forceinline__ void cub::BlockStore< OutputIterator, BLOCK_THREADS, ITEMS_PER_THREAD, POLICY, MODIFIER >::Store (SmemStorage &smem_storage, OutputIterator block_itr, const SizeT &guarded_items, T(&items)[ITEMS_PER_THREAD]) [inline, static]

Store a tile of items across a threadblock, guarded by range.

Template Parameters
SizeT: [inferred] Integer type for offsets
Parameters
[in] smem_storage: Shared reference to opaque SmemStorage layout
[in] block_itr: The threadblock's base output iterator for storing to
[in] guarded_items: Number of valid items in the tile
[in] items: Data to store
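
The effect of guarded_items can be illustrated with a host-side model of a blocked, direct store. This is a sketch under stated assumptions rather than CUB's implementation: GuardedStore is a hypothetical helper, "threads" are loop iterations, and the model assumes that an item is written only when its offset within the tile is less than guarded_items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only. items[t][i] is the i-th item of thread t (blocked arrangement).
// block_offset plays the role of the threadblock's base output iterator.
void GuardedStore(std::vector<int>& out,
                  std::size_t block_offset,
                  std::size_t guarded_items,
                  const std::vector<std::vector<int>>& items) {
    const std::size_t threads = items.size();
    const std::size_t ipt = threads ? items[0].size() : 0;

    // The valid part of the tile must fit in the output range.
    assert(block_offset + std::min(guarded_items, threads * ipt) <= out.size());

    for (std::size_t t = 0; t < threads; ++t) {
        for (std::size_t i = 0; i < ipt; ++i) {
            const std::size_t tile_offset = t * ipt + i;
            // Items at or beyond guarded_items are skipped, so nothing is
            // written past the valid range of the tile.
            if (tile_offset < guarded_items)
                out[block_offset + tile_offset] = items[t][i];
        }
    }
}
```

A typical use would be the last, partially full tile of an output sequence, where guarded_items is the number of items remaining after the tile's base offset.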

The documentation for this class was generated from the following file: