Learning from Examples » k-means Clustering

We study a fundamental clustering problem in unsupervised learning, k-means clustering. We will begin by discussing the problem formulation and then learn how to write a parallel k-means algorithm.

Problem Formulation

k-means clustering uses centroids, k different randomly initialized points in the data, and assigns every data point to the nearest centroid. After every point has been assigned, each centroid is moved to the average of all the points assigned to it. We describe the k-means algorithm in the following steps:

  • Step 1: initialize k random centroids
  • Step 2: for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it
  • Step 3: for every centroid, move the centroid to the average of the points assigned to that centroid
  • Step 4: go to Step 2 until converged (no more changes in the last few iterations) or maximum iterations reached

The algorithm is illustrated as follows:

[Figure: illustration of the k-means algorithm]

A sequential implementation of k-means is described as follows:

// sequential implementation of k-means on a CPU
// requires <algorithm>, <iomanip>, <iostream>, <limits>, and <vector>
// N: number of points
// K: number of clusters
// M: number of iterations
// px/py: 2D point vector 
void kmeans_seq(
  int N, int K, int M, const std::vector<float>& px, const std::vector<float>& py
) {

  std::vector<int> c(K);
  std::vector<float> sx(K), sy(K), mx(K), my(K);

  // initial centroids
  std::copy_n(px.begin(), K, mx.begin());
  std::copy_n(py.begin(), K, my.begin());
  
  // k-means iteration
  for(int m=0; m<M; m++) {

    // clear the storage
    std::fill_n(sx.begin(), K, 0.0f);
    std::fill_n(sy.begin(), K, 0.0f);
    std::fill_n(c.begin(), K, 0);

    // find the best k (cluster id) for each point
    for(int i=0; i<N; ++i) {
      float x = px[i];
      float y = py[i];
      float best_d = std::numeric_limits<float>::max();
      int best_k = 0;
      for (int k = 0; k < K; ++k) {
        const float d = L2(x, y, mx[k], my[k]);
        if (d < best_d) {
          best_d = d;
          best_k = k;
        }
      }
      sx[best_k] += x;
      sy[best_k] += y;
      c [best_k] += 1;
    }

    // update the centroid
    for(int k=0; k<K; k++) {
      const int count = std::max(1, c[k]);  // turn 0/0 to 0/1
      mx[k] = sx[k] / count;
      my[k] = sy[k] / count;
    }
  }

  // print the k centroids found
  for(int k=0; k<K; ++k) {
    std::cout << "centroid " << k << ": " << std::setw(10) << mx[k] << ' '
                                          << std::setw(10) << my[k] << '\n';
  }
}
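The listing above calls L2 without defining it. A minimal sketch follows, assuming L2 is meant to return the squared Euclidean distance between two 2D points; the signature is an assumption inferred from the call sites, not a definition taken from the original code.

```cpp
// Hypothetical helper: squared Euclidean distance between (x1, y1) and (x2, y2).
// The square root is skipped because it does not change which centroid is nearest.
inline float L2(float x1, float y1, float x2, float y2) {
  const float dx = x1 - x2;
  const float dy = y1 - y2;
  return dx * dx + dy * dy;
}
```

For example, L2(0.0f, 0.0f, 3.0f, 4.0f) evaluates to 25.0f. To call the same helper from a CUDA kernel, it would additionally need a device qualifier such as __host__ __device__.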

Parallel k-means using CPUs

The second step of the k-means algorithm, assigning every point to the nearest centroid, is highly parallelizable across individual points. We can create a parallel-for task to run the iterations in parallel.

std::vector<int> best_ks(N);  // nearest centroid of each point

unsigned P = 12;  // 12 partitioned tasks

// update cluster
taskflow.for_each_index(0, N, 1, [&](int i){
  float x = px[i];
  float y = py[i];
  float best_d = std::numeric_limits<float>::max();
  int best_k = 0;
  for (int k = 0; k < K; ++k) {
    const float d = L2(x, y, mx[k], my[k]);
    if (d < best_d) {
      best_d = d;
      best_k = k;
    }
  }
  best_ks[i] = best_k;
});
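To make the idea of partitioned tasks concrete, here is a minimal sketch of one way an index range could be split into P nearly equal chunks. The helper chunk_range is hypothetical and is not part of Taskflow; the library's scheduler may partition the range differently.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: half-open index range [first, last) handled by chunk p
// when [0, N) is split into P nearly equal chunks.
inline std::pair<int, int> chunk_range(int N, int P, int p) {
  const int base = N / P;   // minimum chunk size
  const int extra = N % P;  // the first `extra` chunks get one more index
  const int first = p * base + std::min(p, extra);
  const int last = first + base + (p < extra ? 1 : 0);
  return {first, last};
}
```

Because each iteration i writes only to best_ks[i], chunks never touch the same element, so the assignment step needs no synchronization however the range is divided.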

The third step, moving every centroid to the average of its points, is also parallelizable across individual centroids. However, since k is typically small, a single task that performs this update is sufficient.

taskflow.emplace([&](){
  // sum of points
  for(int i=0; i<N; i++) {
    sx[best_ks[i]] += px[i];
    sy[best_ks[i]] += py[i];
    c [best_ks[i]] += 1;
  }
  
  // average of points
  for(int k=0; k<K; ++k) {
    auto count = std::max(1, c[k]);  // turn 0/0 to 0/1
    mx[k] = sx[k] / count;
    my[k] = sy[k] / count;
  }
});

To describe M iterations, we create a condition task that loops back to the second step of the algorithm M times. A return value of zero selects the first successor, which we will later connect to the task of the second step; any other value ends k-means.

taskflow.emplace([m=0, M]() mutable {
  return (m++ < M) ? 0 : 1;
});
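One detail worth checking is how many passes this counter yields when the condition task is evaluated after the loop body rather than before it. The following is a plain C++ emulation, not Taskflow code, and count_passes is a hypothetical name used only for illustration.

```cpp
// Emulates a loop body followed by the condition task from the text.
// Returns how many times the body runs for a given M.
inline int count_passes(int M) {
  auto condition = [m = 0, M]() mutable { return (m++ < M) ? 0 : 1; };
  int passes = 0;
  do {
    ++passes;                  // stands in for the k-means loop body
  } while (condition() == 0);  // 0 loops back, 1 exits
  return passes;
}
```

Under that assumption, the body runs once before the first check and then once more for each zero returned, so count_passes(3) is 4. If exactly M passes are required, the counter could start at 1 instead of 0.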

The entire code of CPU-parallel k-means is shown below. Here we use an additional storage, best_ks, to record the nearest centroid of a point at an iteration.

// N: number of points
// K: number of clusters
// M: number of iterations
// px/py: 2D point vector 
void kmeans_par(
  int N, int K, int M, const std::vector<float>& px, const std::vector<float>& py
) {

  unsigned P = 12;  // 12 partitions of the parallel-for graph

  tf::Executor executor;
  tf::Taskflow taskflow("K-Means");

  std::vector<int> c(K), best_ks(N);
  std::vector<float> sx(K), sy(K), mx(K), my(K);

  // initial centroids
  tf::Task init = taskflow.emplace([&](){
    for(int i=0; i<K; ++i) {
      mx[i] = px[i];
      my[i] = py[i];
    }
  }).name("init");

  // clear the storage
  tf::Task clean_up = taskflow.emplace([&](){
    for(int k=0; k<K; ++k) {
      sx[k] = 0.0f;
      sy[k] = 0.0f;
      c [k] = 0;
    }
  }).name("clean_up");

  // update cluster
  tf::Task pf = taskflow.for_each_index(0, N, 1, [&](int i){
    float x = px[i];
    float y = py[i];
    float best_d = std::numeric_limits<float>::max();
    int best_k = 0;
    for (int k = 0; k < K; ++k) {
      const float d = L2(x, y, mx[k], my[k]);
      if (d < best_d) {
        best_d = d;
        best_k = k;
      }
    }
    best_ks[i] = best_k;
  }).name("parallel-for");

  tf::Task update_cluster = taskflow.emplace([&](){
    for(int i=0; i<N; i++) {
      sx[best_ks[i]] += px[i];
      sy[best_ks[i]] += py[i];
      c [best_ks[i]] += 1;
    }

    for(int k=0; k<K; ++k) {
      auto count = std::max(1, c[k]);  // turn 0/0 to 0/1
      mx[k] = sx[k] / count;
      my[k] = sy[k] / count;
    }
  }).name("update_cluster");
  
  // convergence check
  tf::Task condition = taskflow.emplace([m=0, M]() mutable {
    return (m++ < M) ? 0 : 1;
  }).name("converged?");

  init.precede(clean_up);

  clean_up.precede(pf);
  pf.precede(update_cluster);

  condition.precede(clean_up)
           .succeed(update_cluster);

  executor.run(taskflow).wait();
}

The taskflow consists of two parts, a clean_up task and a parallel-for graph. The former clears the storage sx, sy, and c that is used to average points for new centroids, and the latter parallelizes the search for nearest centroids across individual points using 12 tasks (this may vary depending on the machine). If the iteration count is smaller than M, the condition task returns 0 to let the execution path go back to clean_up. Otherwise, it returns 1 to stop (i.e., there is no successor task at index 1). The taskflow graph is illustrated below:

[Figure: taskflow graph. init -> clean_up -> parallel-for (subflow of tasks pfg_0 to pfg_12) -> update_cluster -> converged?; edge 0 of converged? returns to clean_up]

The scheduler starts with init, moves on to clean_up, and then enters the parallel-for task, which spawns a subflow of 12 workers to perform parallel iterations. When parallel-for completes, the scheduler updates the cluster centroids and checks whether they have converged through a condition task. If not, the condition task informs the scheduler to go back to clean_up and then parallel-for; otherwise, it returns an index with no successor, which stops the scheduler.
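The condition task in the listing only counts iterations. A minimal sketch of an actual convergence test is given here, assuming the centroids from the previous iteration are saved before each update; the function name converged, the old_mx/old_my snapshots, and the tolerance eps are all hypothetical and do not appear in the original code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical check: true when no centroid moved more than eps along either axis.
inline bool converged(
  const std::vector<float>& old_mx, const std::vector<float>& old_my,
  const std::vector<float>& mx, const std::vector<float>& my, float eps
) {
  float max_shift = 0.0f;
  for (std::size_t k = 0; k < mx.size(); ++k) {
    max_shift = std::max(max_shift, std::fabs(mx[k] - old_mx[k]));
    max_shift = std::max(max_shift, std::fabs(my[k] - old_my[k]));
  }
  return max_shift <= eps;
}
```

A condition task could then return 1 when either converged(...) holds or the iteration cap M is reached, and 0 otherwise.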

Parallel k-means using GPUs

We observe that Step 2 and Step 3 of the algorithm are parallelizable across individual points and centroids, respectively, which lets us harness the power of the GPU:

  1. for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it
  2. for every centroid, move the centroid to the average of the points assigned to that centroid.

At a fine-grained level, we request one GPU thread to work on one point for Step 2 and one GPU thread to work on one centroid for Step 3.

// px/py: 2D points
// N: number of points
// mx/my: centroids
// K: number of clusters
// sx/sy/c: storage to compute the average
__global__ void assign_clusters(
  float* px, float* py, int N, 
  float* mx, float* my, float* sx, float* sy, int K, int* c
) {
  const int index = blockIdx.x * blockDim.x + threadIdx.x;

  if (index >= N) {
    return;
  }

  // Make global loads once.
  float x = px[index];
  float y = py[index];

  float best_d = FLT_MAX;
  int best_k = 0;
  for (int k = 0; k < K; ++k) {
    float d = L2(x, y, mx[k], my[k]);
    if (d < best_d) {
      best_d = d;
      best_k = k;
    }   
  }

  atomicAdd(&sx[best_k], x); 
  atomicAdd(&sy[best_k], y); 
  atomicAdd(&c [best_k], 1); 
}

// mx/my: centroids, sx/sy/c: storage to compute the average
__global__ void compute_new_means(
  float* mx, float* my, float* sx, float* sy, int* c
) {
  int k = threadIdx.x;
  int count = max(1, c[k]);  // turn 0/0 to 0/1
  mx[k] = sx[k] / count;
  my[k] = sy[k] / count;
}
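When debugging kernels like these, it can help to compare their output against a host-side mirror of one iteration. The sketch below is one such reference, assuming a squared Euclidean distance for L2; kmeans_step_host is a hypothetical name, not part of the original code.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical host-side mirror of assign_clusters followed by compute_new_means.
// Updates the centroids mx/my in place from the points px/py.
inline void kmeans_step_host(
  const std::vector<float>& px, const std::vector<float>& py,
  std::vector<float>& mx, std::vector<float>& my
) {
  const std::size_t K = mx.size();
  std::vector<float> sx(K, 0.0f), sy(K, 0.0f);
  std::vector<int> c(K, 0);

  // assign_clusters: one loop iteration plays the role of one GPU thread
  for (std::size_t i = 0; i < px.size(); ++i) {
    float best_d = std::numeric_limits<float>::max();
    std::size_t best_k = 0;
    for (std::size_t k = 0; k < K; ++k) {
      const float dx = px[i] - mx[k];
      const float dy = py[i] - my[k];
      const float d = dx * dx + dy * dy;  // squared distance, assumed metric
      if (d < best_d) { best_d = d; best_k = k; }
    }
    sx[best_k] += px[i];  // plain adds suffice here: the loop is sequential
    sy[best_k] += py[i];
    c[best_k] += 1;
  }

  // compute_new_means: one loop iteration per centroid
  for (std::size_t k = 0; k < K; ++k) {
    const int count = std::max(1, c[k]);  // turn 0/0 into 0/1
    mx[k] = sx[k] / count;
    my[k] = sy[k] / count;
  }
}
```

Since floating-point atomic additions on the GPU may be applied in any order, a comparison against this reference should allow a small tolerance rather than require bit-exact equality.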

When we recompute the cluster centroids to be the mean of all points assigned to a particular centroid, multiple GPU threads may update the same entries of the sum arrays, sx and sy, and the count array, c. To avoid data races, we use atomicAdd. Based on the two kernels, the entire code of CPU-GPU collaborative tasking is described as follows:

// N: number of points
// K: number of clusters
// M: number of iterations
// h_px/h_py: 2D point vector 
void kmeans_gpu(
  int N, int K, int M, const std::vector<float>& h_px, const std::vector<float>& h_py
) {

  std::vector<float> h_mx, h_my;
  float *d_px, *d_py, *d_mx, *d_my, *d_sx, *d_sy;
  int *d_c;  // the kernels take the count array as int*

  for(int i=0; i<K; ++i) {
    h_mx.push_back(h_px[i]);
    h_my.push_back(h_py[i]);
  }

  // create a taskflow graph
  tf::Executor executor;
  tf::Taskflow taskflow("K-Means");
  
  // allocate GPU memory
  tf::Task allocate_px = taskflow.emplace([&](){ 
    cudaMalloc(&d_px, N*sizeof(float)); 
  }).name("allocate_px");

  tf::Task allocate_py = taskflow.emplace([&](){ 
    cudaMalloc(&d_py, N*sizeof(float)); 
  }).name("allocate_py");

  tf::Task allocate_mx = taskflow.emplace([&](){ 
    cudaMalloc(&d_mx, K*sizeof(float)); 
  }).name("allocate_mx");

  tf::Task allocate_my = taskflow.emplace([&](){ 
    cudaMalloc(&d_my, K*sizeof(float)); 
  }).name("allocate_my");

  tf::Task allocate_sx = taskflow.emplace([&](){ 
    cudaMalloc(&d_sx, K*sizeof(float)); 
  }).name("allocate_sx");

  tf::Task allocate_sy = taskflow.emplace([&](){ 
    cudaMalloc(&d_sy, K*sizeof(float)); 
  }).name("allocate_sy");

  tf::Task allocate_c = taskflow.emplace([&](){ 
    cudaMalloc(&d_c, K*sizeof(int)); 
  }).name("allocate_c");
  
  // transfer data from the host to the GPU
  tf::Task h2d = taskflow.emplace([&](tf::cudaFlow& cf){
    cf.copy(d_px, h_px.data(), N).name("h2d_px");
    cf.copy(d_py, h_py.data(), N).name("h2d_py");
    cf.copy(d_mx, h_mx.data(), K).name("h2d_mx");
    cf.copy(d_my, h_my.data(), K).name("h2d_my");
  }).name("h2d");
  
  // GPU task graph of the main k-means body
  tf::Task kmeans = taskflow.emplace([&](tf::cudaFlow& cf){

    tf::cudaTask zero_c = cf.zero(d_c, K).name("zero_c");
    tf::cudaTask zero_sx = cf.zero(d_sx, K).name("zero_sx");
    tf::cudaTask zero_sy = cf.zero(d_sy, K).name("zero_sy");

    tf::cudaTask cluster = cf.kernel(
      (N+1024-1) / 1024, 1024, 0, 
      assign_clusters, d_px, d_py, N, d_mx, d_my, d_sx, d_sy, K, d_c
    ).name("cluster");

    tf::cudaTask new_centroid = cf.kernel(
      1, K, 0, compute_new_means, d_mx, d_my, d_sx, d_sy, d_c
    ).name("new_centroid");

    cluster.precede(new_centroid)
           .succeed(zero_c, zero_sx, zero_sy);
  }).name("update_means");
  
  // condition task to check convergence
  tf::Task condition = taskflow.emplace([i=0, M] () mutable {
    return i++ < M ? 0 : 1;
  }).name("converged?");
  
  // transfer the result of clusters from GPU to host
  tf::Task stop = taskflow.emplace([&](tf::cudaFlow& cf){
    cf.copy(h_mx.data(), d_mx, K).name("d2h_mx");
    cf.copy(h_my.data(), d_my, K).name("d2h_my");
  }).name("d2h");
  
  // deallocate GPU memory
  tf::Task free = taskflow.emplace([&](){
    cudaFree(d_px);
    cudaFree(d_py);
    cudaFree(d_mx);
    cudaFree(d_my);
    cudaFree(d_sx);
    cudaFree(d_sy);
    cudaFree(d_c);
  }).name("free");

  // build up the dependency
  h2d.succeed(allocate_px, allocate_py, allocate_mx, allocate_my);

  kmeans.succeed(allocate_sx, allocate_sy, allocate_c, h2d)
        .precede(condition);

  condition.precede(kmeans, stop);

  stop.precede(free);
  
  // dump the taskflow without expanding GPU task graphs
  taskflow.dump(std::cout);

  // run the taskflow
  executor.run(taskflow).wait();
  
  // dump the entire taskflow
  taskflow.dump(std::cout);
}

The first dump, taken before executing the taskflow, produces the following diagram. The condition task introduces a cycle between itself and update_means. Each time execution goes back to update_means, the cudaFlow is reconstructed with the parameters captured in the closure and offloaded to the GPU.

[Figure: taskflow graph before execution. allocate_px, allocate_py, allocate_mx, and allocate_my -> h2d -> update_means; allocate_sx, allocate_sy, and allocate_c -> update_means; update_means -> converged?; converged? -> update_means on 0 and -> d2h on 1; d2h -> free]

The second dump after executing the taskflow produces the following diagram, with all cudaFlows expanded:

[Figure: the same taskflow graph after execution, with cudaFlows expanded. h2d contains h2d_px, h2d_py, h2d_mx, and h2d_my; update_means contains zero_c, zero_sx, and zero_sy -> cluster -> new_centroid; d2h contains d2h_mx and d2h_my]

The main cudaFlow task, update_means, must not run before all required data has been allocated and copied to the GPU. It precedes a condition task that loops back to update_means until we reach M iterations. When the iterations complete, the condition task directs the execution path to the cudaFlow task d2h, which copies the resulting centroids to h_mx and h_my, after which all GPU memory is deallocated.
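The cluster kernel is launched with (N+1024-1) / 1024 blocks of 1024 threads, which is a ceiling division that guarantees at least one thread per point. A minimal sketch of that arithmetic follows; num_blocks is a hypothetical helper name, not part of the original code.

```cpp
// Hypothetical helper: number of blocks needed so that
// blocks * block_size >= n (ceiling division).
constexpr int num_blocks(int n, int block_size) {
  return (n + block_size - 1) / block_size;
}

static_assert(num_blocks(1, 1024) == 1, "a single point still needs one block");
static_assert(num_blocks(1024, 1024) == 1, "exactly one full block");
static_assert(num_blocks(1025, 1024) == 2, "one extra point spills into a second block");
```

The surplus threads in the last block are the reason assign_clusters returns early when index >= N. The new_centroid kernel, by contrast, is launched as a single block of K threads, so it presumably relies on K not exceeding the per-block thread limit of the device.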

Built-in Predicate

We observe the GPU task graph parameters remain unchanged across all k-means iterations. In this case, we can leverage tf::cudaFlow::offload_until or tf::cudaFlow::offload_n to run it repeatedly without conditional tasking.

tf::Task kmeans = taskflow.emplace([&](tf::cudaFlow& cf){

  tf::cudaTask zero_c = cf.zero(d_c, K).name("zero_c");
  tf::cudaTask zero_sx = cf.zero(d_sx, K).name("zero_sx");
  tf::cudaTask zero_sy = cf.zero(d_sy, K).name("zero_sy");

  tf::cudaTask cluster = cf.kernel(
    (N+1024-1) / 1024, 1024, 0,
    assign_clusters, d_px, d_py, N, d_mx, d_my, d_sx, d_sy, K, d_c
  ).name("cluster");

  tf::cudaTask new_centroid = cf.kernel(
    1, K, 0,
    compute_new_means, d_mx, d_my, d_sx, d_sy, d_c
  ).name("new_centroid");

  cluster.precede(new_centroid)
         .succeed(zero_c, zero_sx, zero_sy);
  
  // we ask the executor to launch the cudaFlow M times
  cf.offload_n(M);
}).name("update_means");

// ...

// build up the dependency
h2d.succeed(allocate_px, allocate_py, allocate_mx, allocate_my);

kmeans.succeed(allocate_sx, allocate_sy, allocate_c, h2d)
      .precede(stop);

stop.precede(free);

At the last line of the cudaFlow closure, we call cf.offload_n(M) to ask the executor to run the cudaFlow M times. Compared with the version using conditional tasking, the cudaFlow here is created only once, which reduces the overhead.

[Figure: taskflow graph using offload_n, with cudaFlows expanded. allocate_px, allocate_py, allocate_mx, and allocate_my -> h2d -> update_means -> d2h -> free; allocate_sx, allocate_sy, and allocate_c -> update_means; h2d contains h2d_px, h2d_py, h2d_mx, and h2d_my; update_means contains zero_c, zero_sx, and zero_sy -> cluster -> new_centroid; d2h contains d2h_mx and d2h_my]

We can see from the above taskflow that the condition task has been removed.

Benchmarking

We run three versions of k-means (sequential CPU, parallel CPU, and one GPU) on a machine with 6 Intel i7-8700 CPU cores at 3.20 GHz and an Nvidia RTX 2080 GPU, using various numbers of 2D points and iterations.

N        K    M        CPU Sequential   CPU Parallel   GPU (conditional tasking)   GPU (with predicate)
10       5    10       0.14 ms          77 ms          1 ms                        1 ms
100      10   100      0.56 ms          86 ms          7 ms                        1 ms
1000     10   1000     10 ms            98 ms          55 ms                       13 ms
10000    10   10000    1006 ms          713 ms         458 ms                      183 ms
100000   10   100000   102483 ms        49966 ms       7952 ms                     4725 ms

Once the number of points reaches 10K, both the parallel CPU and GPU implementations start to outperform the sequential version. We can also see that using the built-in predicate of cudaFlow is roughly two times faster than conditional tasking at the larger problem sizes.
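The original text does not say how these timings were collected. As a minimal sketch, assuming wall-clock time around a single call is the quantity of interest, a helper like the hypothetical time_ms below could be used to produce comparable measurements.

```cpp
#include <chrono>

// Hypothetical helper: runs f once and returns the elapsed wall-clock time in ms.
template <typename F>
double time_ms(F&& f) {
  const auto beg = std::chrono::steady_clock::now();
  f();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - beg).count();
}
```

A call such as time_ms([&]{ kmeans_seq(N, K, M, px, py); }) would then time the sequential version. Note that the listings above also print the centroids, so the output cost is included unless the printing is removed.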