1 of 41

CUDA optimization

FABRIC DRAPING

CS 750 HIGH PERFORMANCE COMPUTING – COURSE PROJECT

2 of 41

Contents

  • Draping Introduction
  • Blossom Polynomials
  • Problem definition and Inputs
  • Parallelized Reduction Algorithm
  • CUDA basics
  • Results
  • References

3 of 41

Draping

  • Draping of virtual characters is done using cloth simulation models.
  • The fabric needs to be rendered and simulated in a time-efficient manner.
  • CUDA optimization can help lower the time required to achieve this.
  • For maximum performance and control over the process, we choose C++ as the programming language.

4 of 41

Rendering process

CAD model of the target

Target feature points

Grid of points for fabric

Physical simulation

Fabric geometry

Fully Rendered Fabric on Target

5 of 41

Our scope

  • Converting a grid of control points into a fabric geometry

Scope of project

Grid of points for fabric

Fabric geometry

6 of 41

B-Spline surface

7 of 41

Blossoming polynomials

8 of 41

Bi-variate Blossom: Quadratic B-spline
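As a concrete illustration (a minimal sketch, not the project's actual code), the bi-variate blossom F(u1, u2) of one quadratic B-spline segment can be evaluated by two stages of affine interpolation against the knots; the knot labels t1..t4 and de Boor points d0..d2 below are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>

// Blossom F(u1, u2) of one quadratic B-spline segment with knots
// t1..t4 and de Boor points d0 = F(t1,t2), d1 = F(t2,t3), d2 = F(t3,t4).
// Each stage replaces one knot argument with a parameter value via
// affine interpolation; the result is symmetric in (u1, u2).
double quadraticBlossom(double u1, double u2,
                        double t1, double t2, double t3, double t4,
                        double d0, double d1, double d2) {
    // Stage 1: insert u1 -> e0 = F(u1, t2), e1 = F(u1, t3)
    double e0 = ((t3 - u1) * d0 + (u1 - t1) * d1) / (t3 - t1);
    double e1 = ((t4 - u1) * d1 + (u1 - t2) * d2) / (t4 - t2);
    // Stage 2: insert u2 -> F(u1, u2)
    return ((t3 - u2) * e0 + (u2 - t2) * e1) / (t3 - t2);
}
```

Setting u1 = u2 = t recovers the curve point b(t), and evaluating with the two arguments swapped gives the same value, which is a quick sanity check on the symmetry property.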

9 of 41

Tri-variate Blossom: Cubic B-spline

10 of 41

B-spline Construction

11 of 41

B-spline Construction

Blossom Construction

12 of 41

Surface blossoms

13 of 41

Grid of Control Points + u,v grid

14 of 41

Inputs


Grid of points for fabric

Fabric geometry

u, v coordinates

15 of 41

CUDA Intro

CUDA virtualizes the physical hardware into threads and blocks

Threads

  • A thread is a virtualized scalar processor.

Blocks

  • A thread block is a virtualized streaming multiprocessor.
  • Thread blocks must be independent of one another.
  • Each block runs to completion.
  • The order in which blocks execute is undefined.
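To make the thread/block virtualization concrete, here is a small host-side C++ sketch of how a 1-D launch maps indices (a CPU emulation, not CUDA device code): each thread's global index is blockIdx * blockDim + threadIdx.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of CUDA's 1-D index mapping: the outer loop plays
// the role of the grid of blocks, the inner loop the threads in a block.
std::vector<int> globalIndices(int gridDim, int blockDim) {
    std::vector<int> out;
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)          // one virtual SM per block
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)  // one scalar processor per thread
            out.push_back(blockIdx * blockDim + threadIdx);         // global thread index
    return out;
}
```

On real hardware the blocks may run in any order, which is exactly why they must be independent.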

16 of 41

B-spline basis Construction
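The basis-construction figures did not survive extraction; as a stand-in, here is a sketch of the four cubic B-spline basis weights on a single span, assuming a uniform knot vector (the project's actual knots may differ):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Uniform cubic B-spline basis weights for local parameter t in [0, 1].
// The four weights blend the 4 control points that influence the span,
// and they always sum to 1 (partition of unity).
std::array<double, 4> cubicBasis(double t) {
    double t2 = t * t, t3 = t2 * t;
    return {
        (1 - 3 * t + 3 * t2 - t3) / 6.0,      // N0 = (1 - t)^3 / 6
        (4 - 6 * t2 + 3 * t3) / 6.0,          // N1
        (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0,  // N2
        t3 / 6.0                              // N3 = t^3 / 6
    };
}
```

The partition-of-unity property is what makes a constant control grid reproduce a constant surface, a useful correctness check later.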

17 of 41

Thread (i, j): across the domain


18 of 41

Block (ib, jb): 16 threads per block = 1 grid point (u, v)


19 of 41

 

20 of 41


21 of 41

Parallel Reduction: Sequential Addressing


22 of 41

Warps (Scheduling unit)

Each warp runs its threads in lock-step fashion.

23 of 41

Data transfer rates

24 of 41

25 of 41

Memory hierarchy

26 of 41

Code walk through

  • A __global__ function is called from the host and runs on the device.

  • Pointer (*) parameters are used for input and output data.

  • __shared__ variables are shared across the threads within a block.

27 of 41

Thread and Block IDs obtained

  • Thread IDs (i, j) select the control points for each reduction operation.
  • Block IDs (ib, jb) select the (u, v) grid coordinate.
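Putting the IDs together, here is a host-side sketch of one block's work (our assumption of the decomposition described on these slides: a 4×4 thread block reduces the 16 tensor-product terms for one (u, v) grid point, with uniform cubic B-spline weights):

```cpp
#include <cassert>
#include <cmath>

// One block's work: thread (i, j) contributes the term N_i(u) * N_j(v) * P[i][j];
// the block then reduces the 16 terms to a single surface point S(u, v).
double surfacePoint(double u, double v, const double P[4][4]) {
    auto basis = [](double t, int k) {  // uniform cubic B-spline weight N_k(t)
        double t2 = t * t, t3 = t2 * t;
        switch (k) {
            case 0:  return (1 - 3 * t + 3 * t2 - t3) / 6.0;
            case 1:  return (4 - 6 * t2 + 3 * t3) / 6.0;
            case 2:  return (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0;
            default: return t3 / 6.0;
        }
    };
    double s = 0.0;
    for (int i = 0; i < 4; ++i)        // thread index i within the block
        for (int j = 0; j < 4; ++j)    // thread index j within the block
            s += basis(u, i) * basis(v, j) * P[i][j];  // one term per thread
    return s;                          // the in-block reduction result
}
```

On the GPU the 16 terms land in shared memory and the sequential-addressing reduction collapses them; here a plain double-loop sum stands in for that step.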

28 of 41

Barriers and Thread Synchronization

  • __syncthreads() ensures that all threads within a block have reached the barrier.
  • It is used to prevent memory conflicts and race conditions.
  • Atomic operations and keywords such as volatile also help prevent race conditions.

29 of 41

CUDA memory management.

  • Memory Allocation

  • Data transfer from host to device.

  • Data transfer from device to host.
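The three steps above follow the standard CUDA host pattern; since no GPU is assumed here, this sketch mirrors the sequence with plain C++ buffers, naming the corresponding CUDA runtime calls in the comments (the doubling "kernel" is an illustrative stand-in, not the project's real computation):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Host-side mirror of the CUDA memory-management sequence.
std::vector<double> runPipeline(const std::vector<double>& host_in) {
    size_t bytes = host_in.size() * sizeof(double);

    // 1. Memory allocation        (CUDA: cudaMalloc(&d_buf, bytes))
    std::vector<double> dev(host_in.size());

    // 2. Host-to-device transfer  (CUDA: cudaMemcpy(..., cudaMemcpyHostToDevice))
    std::memcpy(dev.data(), host_in.data(), bytes);

    // "Kernel" stands in for the launched __global__ function.
    for (double& x : dev) x *= 2.0;

    // 3. Device-to-host transfer  (CUDA: cudaMemcpy(..., cudaMemcpyDeviceToHost))
    std::vector<double> host_out(host_in.size());
    std::memcpy(host_out.data(), dev.data(), bytes);
    // (CUDA: cudaFree(d_buf) when done)
    return host_out;
}
```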

30 of 41

31 of 41

32 of 41

33 of 41

34 of 41

35 of 41

Nsight: Visual Studio

36 of 41

NVVP

37 of 41

NVVP

38 of 41

Results

Sample duration (without yarn info):

  • CUDA time (ms) = 1.227840
  • Sequential time (ms) = 4.081682

  • Speed-up factor: ≈ 3.32
  • Tolerance < 1e-4

39 of 41

CUDA streams for asynchronous data transfer

40 of 41

Profiler

41 of 41

References

CUDA

  • NVIDIA, GPU Programming Guide, Version 8.0.
  • http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
  • CS 759 Slides: High Performance Computing applications for Engineering
  • Jason Sanders and Edward Kandrot: CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010
  • Lecture1.pdf (bu.edu)

Graphics

  • ME 535 slides : CAGD
  • Yuksel, C., Kaldor, J. M., James, D. L., & Marschner, S. (2012). Stitch meshes for modeling knitted clothing with yarn-level detail. ACM Transactions on Graphics (TOG), 31(4), 1-12.