1 of 41

CUDA optimization

FABRIC DRAPING

CS 750 HIGH PERFORMANCE COMPUTING – COURSE PROJECT

2 of 41

Contents

  • Draping Introduction
  • Blossom Polynomials
  • Problem definition and Inputs
  • Parallelized Reduction Algorithm
  • CUDA basics
  • Results
  • References

3 of 41

Draping

  • Draping of virtual characters is done using cloth simulation models.
  • The fabric needs to be rendered and simulated in a time-efficient manner.
  • CUDA optimization can help lower the time required to achieve this.
  • For maximum performance and control over the process, we choose C++ as the programming language.

4 of 41

Rendering process

CAD model of the target

Target feature points

Grid of points for fabric

Physical simulation

Fabric geometry

Fully Rendered Fabric on Target

5 of 41

Our scope

  • Converting a grid of control points into a fabric geometry

Scope of project

Grid of points for fabric

Fabric geometry

6 of 41

B-Spline surface

7 of 41

Blossoming polynomials

8 of 41

Bi-variate Blossom: Quadratic B-spline
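As a concrete illustration (a minimal sketch, not the project's actual code), the bi-variate blossom F(u1, u2) of one quadratic B-spline segment can be evaluated by two stages of affine interpolation against the knots; the knot labels t1..t4 and de Boor points d0..d2 below are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>

// Blossom F(u1, u2) of one quadratic B-spline segment with knots
// t1..t4 and de Boor points d0 = F(t1,t2), d1 = F(t2,t3), d2 = F(t3,t4).
// Each stage replaces one knot argument with a parameter value via
// affine interpolation; the result is symmetric in (u1, u2).
double quadraticBlossom(double u1, double u2,
                        double t1, double t2, double t3, double t4,
                        double d0, double d1, double d2) {
    // Stage 1: insert u1 -> e0 = F(u1, t2), e1 = F(u1, t3)
    double e0 = ((t3 - u1) * d0 + (u1 - t1) * d1) / (t3 - t1);
    double e1 = ((t4 - u1) * d1 + (u1 - t2) * d2) / (t4 - t2);
    // Stage 2: insert u2 -> F(u1, u2)
    return ((t3 - u2) * e0 + (u2 - t2) * e1) / (t3 - t2);
}
```

Setting u1 = u2 = t recovers the curve point b(t), and evaluating with the two arguments swapped gives the same value, which is a quick sanity check on the symmetry property.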

9 of 41

Tri-variate Blossom: Cubic B-spline

10 of 41

B-spline Construction

11 of 41

B-spline Construction

Blossom Construction

12 of 41

Surface blossoms

13 of 41

Grid of Control Points + u,v grid

14 of 41

Inputs


Grid of points for fabric

Fabric geometry

u, v coordinates

15 of 41

CUDA Intro

CUDA virtualizes the physical hardware into threads and blocks

Threads

  • A thread is a virtualized scalar processor.

Blocks

  • A thread block is a virtualized streaming multiprocessor.
  • Thread blocks must be independent of one another.
  • Each block runs to completion.
  • The order in which blocks execute is undefined.
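To make the thread/block virtualization concrete, here is a small host-side C++ sketch of how a 1-D launch maps indices (a CPU emulation, not CUDA device code): each thread's global index is blockIdx * blockDim + threadIdx.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of CUDA's 1-D index mapping: the outer loop plays
// the role of the grid of blocks, the inner loop the threads in a block.
std::vector<int> globalIndices(int gridDim, int blockDim) {
    std::vector<int> out;
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)          // one virtual SM per block
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)  // one scalar processor per thread
            out.push_back(blockIdx * blockDim + threadIdx);         // global thread index
    return out;
}
```

On real hardware the blocks may run in any order, which is exactly why they must be independent.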

16 of 41

B-spline basis Construction
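The basis-construction figures did not survive extraction; as a stand-in, here is a sketch of the four cubic B-spline basis weights on a single span, assuming a uniform knot vector (the project's actual knots may differ):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Uniform cubic B-spline basis weights for local parameter t in [0, 1].
// The four weights blend the 4 control points that influence the span,
// and they always sum to 1 (partition of unity).
std::array<double, 4> cubicBasis(double t) {
    double t2 = t * t, t3 = t2 * t;
    return {
        (1 - 3 * t + 3 * t2 - t3) / 6.0,      // N0 = (1 - t)^3 / 6
        (4 - 6 * t2 + 3 * t3) / 6.0,          // N1
        (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0,  // N2
        t3 / 6.0                              // N3 = t^3 / 6
    };
}
```

The partition-of-unity property is what makes a constant control grid reproduce a constant surface, a useful correctness check later.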

17 of 41

Thread (i, j): across the domain


18 of 41

Block (ib, jb): 16 threads per block = 1 grid point (u, v)


19 of 41

 

20 of 41


21 of 41

Parallel Reduction: Sequential Addressing


22 of 41

Warps (Scheduling unit)

Each warp runs its threads in lock-step fashion.

23 of 41

Data transfer rates

24 of 41

25 of 41

Memory hierarchy

26 of 41

Code walk through

  • A __global__ function is called from the host and runs on the device.

  • Pointer (*) parameters are used for input and output data.

  • __shared__ variables are shared across the threads within a block.

27 of 41

Thread and Block IDs obtained

  • Thread IDs (i, j) select the control points for each reduction operation.
  • Block IDs (ib, jb) select the (u, v) grid coordinate.
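Putting the IDs together, here is a host-side sketch of one block's work (our assumption of the decomposition described on these slides: a 4×4 thread block reduces the 16 tensor-product terms for one (u, v) grid point, with uniform cubic B-spline weights):

```cpp
#include <cassert>
#include <cmath>

// One block's work: thread (i, j) contributes the term N_i(u) * N_j(v) * P[i][j];
// the block then reduces the 16 terms to a single surface point S(u, v).
double surfacePoint(double u, double v, const double P[4][4]) {
    auto basis = [](double t, int k) {  // uniform cubic B-spline weight N_k(t)
        double t2 = t * t, t3 = t2 * t;
        switch (k) {
            case 0:  return (1 - 3 * t + 3 * t2 - t3) / 6.0;
            case 1:  return (4 - 6 * t2 + 3 * t3) / 6.0;
            case 2:  return (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0;
            default: return t3 / 6.0;
        }
    };
    double s = 0.0;
    for (int i = 0; i < 4; ++i)        // thread index i within the block
        for (int j = 0; j < 4; ++j)    // thread index j within the block
            s += basis(u, i) * basis(v, j) * P[i][j];  // one term per thread
    return s;                          // the in-block reduction result
}
```

On the GPU the 16 terms land in shared memory and the sequential-addressing reduction collapses them; here a plain double-loop sum stands in for that step.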

28 of 41

Barriers and Thread Synchronization

  • __syncthreads() ensures that all threads within a block have reached the barrier.
  • It is used to prevent memory conflicts and race conditions.
  • Atomic operations and keywords such as volatile also help prevent race conditions.

29 of 41

CUDA memory management.

  • Memory Allocation

  • Data transfer from host to device.

  • Data transfer from device to host.
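The three steps above follow the standard CUDA host pattern; since no GPU is assumed here, this sketch mirrors the sequence with plain C++ buffers, naming the corresponding CUDA runtime calls in the comments (the doubling "kernel" is an illustrative stand-in, not the project's real computation):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Host-side mirror of the CUDA memory-management sequence.
std::vector<double> runPipeline(const std::vector<double>& host_in) {
    size_t bytes = host_in.size() * sizeof(double);

    // 1. Memory allocation        (CUDA: cudaMalloc(&d_buf, bytes))
    std::vector<double> dev(host_in.size());

    // 2. Host-to-device transfer  (CUDA: cudaMemcpy(..., cudaMemcpyHostToDevice))
    std::memcpy(dev.data(), host_in.data(), bytes);

    // "Kernel" stands in for the launched __global__ function.
    for (double& x : dev) x *= 2.0;

    // 3. Device-to-host transfer  (CUDA: cudaMemcpy(..., cudaMemcpyDeviceToHost))
    std::vector<double> host_out(host_in.size());
    std::memcpy(host_out.data(), dev.data(), bytes);
    // (CUDA: cudaFree(d_buf) when done)
    return host_out;
}
```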

30 of 41

31 of 41

32 of 41

33 of 41

34 of 41

35 of 41

Nsight: Visual Studio

36 of 41

NVVP

37 of 41

NVVP

38 of 41

Results

Sample duration (without yarn info):

  • CUDA time (ms) = 1.227840
  • Sequential time (ms) = 4.081682

  • Speed-up factor: ≈ 3.32
  • Tolerance < 1e-4

39 of 41

CUDA streams for asynchronous data transfer

40 of 41

Profiler

41 of 41

References

CUDA

  • NVIDIA, GPU Programming Guide, Version 8.0.
  • http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
  • CS 759 Slides: High Performance Computing applications for Engineering
  • Jason Sanders and Edward Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010
  • Lecture1.pdf (bu.edu)

Graphics

  • ME 535 slides : CAGD
  • Yuksel, C., Kaldor, J. M., James, D. L., & Marschner, S. (2012). Stitch meshes for modeling knitted clothing with yarn-level detail. ACM Transactions on Graphics (TOG), 31(4), 1-12.