Parallel Transforms (cudaFlow)
cudaFlow provides a template function that applies the given function to a range of items and stores the result in another range.
Iterator-based Parallel Transforms
Iterator-based parallel-transform applies the given transform function to a range of items and stores the result in another range specified by two iterators, first and last. The two iterators are typically two raw pointers to the first element and the next to the last element in the range in GPU memory space. The task created by tf::cudaFlow::transform represents a parallel execution of the following loop:
while (first != last) {
  *first++ = callable(*src1++, *src2++, *src3++, ...);
}
The following example creates a transform kernel that assigns each element, from gpu_data to gpu_data + 1000, to the sum of the corresponding elements at gpu_data_x, gpu_data_y, and gpu_data_z.
taskflow.emplace([](tf::cudaFlow& cf){
  // ... create gpu tasks
  // create a kernel for performing the following parallel transforms:
  // gpu_data[i] = gpu_data_x[i] + gpu_data_y[i] + gpu_data_z[i]
  tf::cudaTask task = cf.transform(
    gpu_data, gpu_data + 1000,
    [] __device__ (int& xi, int& yi, int& zi) { return xi + yi + zi; },
    gpu_data_x, gpu_data_y, gpu_data_z
  );
});
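For context, the snippet below sketches how the surrounding setup might look: allocating the device arrays, creating the taskflow and executor, and running the graph. The header name, allocation sizes, and the executor.run(...).wait() synchronization are illustrative assumptions rather than part of the original example; the exact header that exposes tf::cudaFlow may vary across Taskflow versions.
#include <taskflow/cudaflow.hpp>  // assumed header; some versions expose cudaFlow via <taskflow/taskflow.hpp>

int main() {
  const size_t N = 1000;
  int *gpu_data, *gpu_data_x, *gpu_data_y, *gpu_data_z;

  // allocate the device arrays (error checking omitted for brevity)
  cudaMalloc(&gpu_data,   N * sizeof(int));
  cudaMalloc(&gpu_data_x, N * sizeof(int));
  cudaMalloc(&gpu_data_y, N * sizeof(int));
  cudaMalloc(&gpu_data_z, N * sizeof(int));

  tf::Taskflow taskflow;
  tf::Executor executor;

  // capture the raw device pointers by value into the cudaFlow task
  taskflow.emplace([=](tf::cudaFlow& cf){
    // gpu_data[i] = gpu_data_x[i] + gpu_data_y[i] + gpu_data_z[i]
    cf.transform(
      gpu_data, gpu_data + N,
      [] __device__ (int& xi, int& yi, int& zi) { return xi + yi + zi; },
      gpu_data_x, gpu_data_y, gpu_data_z
    );
  });

  executor.run(taskflow).wait();  // block until the GPU work completes

  cudaFree(gpu_data);
  cudaFree(gpu_data_x);
  cudaFree(gpu_data_y);
  cudaFree(gpu_data_z);
}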
Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on a GPU, it must be declared with a __device__ specifier.
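Conceptually, the execution model resembles the hand-written kernel sketched below, in which each thread computes one element. This is only an illustration of the one-thread-per-iteration mapping under an assumed launch configuration, not Taskflow's actual implementation; note also that __device__ lambdas require compiling with nvcc's extended-lambda support.
// Illustration only: a hand-written kernel with the same effect as the
// parallel transform above; the 256-thread block size is an arbitrary choice.
__global__ void transform_kernel(int* out, const int* x, const int* y, const int* z, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = x[i] + y[i] + z[i];  // one thread handles one iteration of the loop
  }
}

// possible launch from host code:
// transform_kernel<<<(1000 + 255) / 256, 256>>>(gpu_data, gpu_data_x, gpu_data_y, gpu_data_z, 1000);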
Miscellaneous Items
The parallel-transform algorithm is also available in tf::cudaFlowCapturerBase::transform.
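A minimal sketch of the capturer-based form is shown below, assuming the capturer exposes the same iterator-based transform signature as tf::cudaFlow::transform; the variables reuse those from the earlier example, and whether transform is called directly on tf::cudaFlowCapturer or through a base class may depend on the Taskflow version.
// sketch: the same parallel transform expressed through a cudaFlow capturer
taskflow.emplace([=](tf::cudaFlowCapturer& capturer){
  // assumption: the capturer provides the same transform signature as cf.transform above
  capturer.transform(
    gpu_data, gpu_data + 1000,
    [] __device__ (int& xi, int& yi, int& zi) { return xi + yi + zi; },
    gpu_data_x, gpu_data_y, gpu_data_z
  );
});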