2 of 23

自己紹介

独立系ロボットエンジニア

neka-nat

https://twitter.com/neka_nat

フリーでロボティクス・画像処理関連のソフトウェアのお仕事やってます！
前職は某大手JTC社員
好きなロボット技術

マニピュレータ、プランニング、点群処理

ここ最近の活動

3 of 23

前回のロボゼミでの発表

ロボティクスにおけるGPUを使った計算についてお話させていただきました！

画像処理、点群処理、プランニングなどにGPUが使える
ざっくりGPUの使い方
自作OSS(cupoch)の紹介

今後はより実装に近い内容をお話ししたいと思います！

4 of 23

今回の内容

より簡単にCUDAを使ってパフォーマンスUPするには...？

5 of 23

今回の内容

より簡単にCUDAを使ってパフォーマンスUPするには...？
世の中にある便利なライブラリを活用しよう！

GPUでのメモリプールの利用

flann

GPU版KDTree

stdgpu

GPU版のunordered_mapとかが使える

Thrust

stdのコンテナやalgorithmライクな並列計算ライブラリ

6 of 23

今回の内容

より簡単にCUDAを使ってパフォーマンスUPするには...？
世の中にある便利なライブラリを活用しよう！

GPUでのメモリプールの利用

flann

GPU版KDTree

stdgpu

GPU版のunordered_mapとかが使える

Thrust

stdのコンテナやalgorithmライクな並列計算ライブラリ

7 of 23

(私がやってる)GPU計算できるようにするまでの流れ

もとの(CPU版での)アルゴリズムを理解する

論文や解説記事を見る
GithubでCPU版の実装を調べる

アルゴリズムの並列化

元のアルゴリズムのfor文に注目して並列化

さらなるチューニング

アルゴリズムを変えたり、キャッシュを利用したり…

8 of 23

Thrustとは？

stlライクなコンテナやalgorithmを使った並列計算を実装できるライブラリ
CUDAをインストールするとついてくる
生カーネルを記述するより多少パフォーマンスは落ちるが、GPU並列化による高速化としては十分機能する

https://developer.nvidia.com/thrust

9 of 23

並列計算で重要な演算①〜transformとreduce〜

最も基本的な並列演算
transform

全ての要素に同じ関数を適用する
返り値がコンテナに格納される

reduce

全ての要素を1つに集約する(sum, max, min, …など)
返り値はスカラ
logN回の同期が発生する(Nは要素数)

// transformのサンプル

int data[10] = {-5, 0, 2, -3, 2, 4, 0, -1, 2, 8};

thrust::negate<int> op;

thrust::transform(thrust::host, data, data + 10, data, op); // in-place transformation

// data is now {5, 0, -2, 3, -2, -4, 0, 1, -2, -8};

// reduceのサンプル

int data[6] = {1, 0, 2, 2, 1, 3};

int result = thrust::reduce(thrust::host, data, data + 6);

// result == 9

10 of 23

並列計算で重要な演算①〜transformとreduce〜

reduceのアルゴリズム

https://eximia.co/implementing-parallel-reduction-in-cuda/

11 of 23

並列計算で重要な演算②〜イテレータ〜

transformやreduceに与える入力

基本はコンテナやポインタをそのまま入力する

よく使うイテレータ

counting_iterator
constant_iterator
zip_iterator
transform_iterator

transformなどと組み合わせることでいろんな計算ができる

12 of 23

並列計算で重要な演算②〜イテレータ〜

zip_iterator

int main() {

thrust::device_vector<int> int_in(3), int_out(3);

int_in[0] = 0;

int_in[1] = 1;

int_in[2] = 2;

thrust::device_vector<float> float_in(3), float_out(3);

float_in[0] = 0.0f;

float_in[1] = 10.0f;

float_in[2] = 20.0f;

thrust::copy(thrust::make_zip_iterator(thrust::make_tuple(int_in.begin(), float_in.begin())),

thrust::make_zip_iterator(thrust::make_tuple(int_in.end(), float_in.end())),

thrust::make_zip_iterator(thrust::make_tuple(int_out.begin(),float_out.begin())));

// int_out is now [0, 1, 2]

// float_out is now [0.0f, 10.0f, 20.0f]

return 0;

}

13 of 23

並列計算で重要な演算②〜イテレータ〜

transform_iterator

struct square : public thrust::unary_function<float,float> {

__device__ float operator()(float x) const {

return x * x;

}

};

int main() {

thrust::device_vector<float> v(4);

v[0] = 1.0f;

v[1] = 2.0f;

v[2] = 3.0f;

v[3] = 4.0f;

float sum_of_squares =

thrust::reduce(thrust::make_transform_iterator(v.begin(), square()),

thrust::make_transform_iterator(v.end(), square()));

std::cout << "sum of squares: " << sum_of_squares << std::endl;

return 0;

}

14 of 23

並列計算で重要な演算③〜sortとscan〜

scan演算とは？

いわゆる累積和計算
[0, 1, 2, 3, 4, 5] -> [0, 1, 3, 6, 10, 15]
分割された配列の先頭インデックスを得たい時に使ったりする

sort

cubのRadixソートを使っている
GPUのソートアルゴリズムいろいろ

Mergeソート
Radixソート
Bitonicソート

時間があれば並列ソートアルゴリズムについても理解したい…

1	1	1	2	2	2	2	3	3	4	4	…

15 of 23

並列計算で重要な演算④〜その他の特殊演算〜

reduce系の便利なやつ(transform_iterator + reduce)

inner_product
transform_reduce

uniqueとdifference

データのsortが必要

stlに無いけどThrustで実装されてるやつ(boostにあるものもある)

xxx_by_key

sort_by_key, reduce_by_key, …

gatherとscatter
…

16 of 23

並列計算で重要な演算④〜その他の特殊演算〜

reduce_by_key

同じキーを持つバリューをreduceする
同じキーかどうかは隣り合うものしか見ない
全領域に渡ってキーのreduceを行うには予めソートが必要

...

const int N = 7;

int A[N] = {1, 3, 3, 3, 2, 2, 1}; // input keys

int B[N] = {9, 8, 7, 6, 5, 4, 3}; // input values

int C[N]; // output keys

int D[N]; // output values

thrust::pair<int*,int*> new_end;

new_end = thrust::reduce_by_key(thrust::host, A, A + N, B, C, D);

// The first four keys in C are now {1, 3, 2, 1} and new_end.first - C is 4.

// The first four values in D are now {9, 21, 9, 3} and new_end.second - D is 4.

17 of 23

点群処理のVoxelGridFilterをThrustで並列化してみる

VoxelGridFilterのアルゴリズム

点群に対して、ボクセル化した空間を作る
各ボクセル毎で次の計算を行う

ボクセル内にある点を集める
集めた点の平均値を計算

各ボクセルで計算した平均値の点群を新たな点群とする

元の点群の情報を比較的残しながらダウンサンプルすることができる

18 of 23

VoxelGridFilterの並列化

主に登場する並列化のためのカーネルは4つ

点群の各点からボクセルのキー(3次元の整数)の計算
ボクセルキーと点群のソート
同じキーの点の総和を取る
各ボクセルの点の和を点の数で割る

19 of 23

VoxelGridFilterの並列化

主に登場する並列化のためのカーネルは4つ

点群の各点からボクセルのキー(3次元の整数)の計算→transform
ボクセルキーと点群のソート→sort_by_key
同じキーの点の総和を取る→reduce_by_key
各ボクセルの点の和を点の数で割る→transform

20 of 23

Thrustを使ったVoxelGridFilterの並列化

// CUDA版VoxelGridFilterのコード抜粋(簡略化しておりこのままでは動かない)

// 入力の点群 points, フィルタ後の出力点群 out_points

// 各点からボクセルキーの計算

thrust::device_vector<Eigen::Vector3i> keys(points.size());

thrust::transform(points.begin(), points.end(), keys.begin(), compute_key_fn);

// ボクセルキーに基づいてソート(ボクセルキーと点群がソートされる)

thrust::device_vector<Eigen::Vector3f> sorted_points = points;

thrust::sort_by_key(keys.begin(), keys.end(), sorted_points.begin());

// 各キー毎の点の和を取ってから点の数で割る

thrust::device_vector<int> counts(points.size());

auto end = thrust::reduce_by_key(

keys.begin(), keys.end(),

thrust::make_zip_iterator(thrust::make_tuple(sorted_point.begin(), thrust::make_constant_iterator(1))),

thrust::make_discard_iterator(),

thrust::make_zip_iterator(thrust::make_tuple(out_points.begin(), counts.begin()));

thrust::transform(out_points.begin(), out_points.end(), counts.begin(), out_points.begin(), div_fn);

21 of 23

計算してみたが、パフォーマンス悪い。。

約10万点の点群を数千点にフィルタリング
Open3DのGPU版の約2倍かかっている

CPU	Open3D	32ms
GPU	Cupoch	16ms
GPU	Open3D	7.3ms

1 of 23

2 of 23

3 of 23

4 of 23

5 of 23

6 of 23

7 of 23

8 of 23

9 of 23

10 of 23

11 of 23

12 of 23

13 of 23

14 of 23

15 of 23

16 of 23

17 of 23

18 of 23

19 of 23

20 of 23

21 of 23

22 of 23

23 of 23