
571e: Harnessing Massively Parallel Computing Platforms

Instructor: Matei Ripeanu

Schedule: Mondays, 5:00-8:00pm.

Location: KAIS4018


[11/30]        Your reviews: R10.2

[11/28]        Take a look at Google’s TensorFlow

[11/26]        Papers for next time: BlueDBM: An Appliance for Big Data Analytics, Jun et al., ISCA’15 [pdf][project] and Willow: A User-Programmable SSD, S. Seshadri et al., OSDI'14 [link]

[11/23]        Your reviews: R9.1 and R9.2

[11/16]        Papers for next time: Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers, ASPLOS’15 [paper], Integrated 3D-stacked server designs for increasing physical density of key-value stores, A. Gutierrez et al., ASPLOS'14 [pdf]

[11/16]        Your reviews: R8.1 and R8.2

[11/10]        Topic for next time: fault tolerance. Quantitative evaluation of soft error injection techniques for robust system design, Cho et al., DAC'13 [pdf], Dual Execution for On the Fly Fine Grained Execution Comparison, ASPLOS’15 [pdf]. See more related papers below.

[11/09]        Your reviews:  R7.1 and R7.2

[11/03]        Your midterm reports.

[11/03]        Topic for next week: large-scale graph processing. Papers: NUMA-aware Graph-structured Analytics, K. Zhang et al., PPoPP’15 [pdf] and Chaos: Scale-out Graph Processing from Secondary Storage, Roy et al., SOSP’15 [pdf]

[11/02]        Your reviews:  R6.1 and R6.2

[10/27]        Papers for next week:  A domain specific approach to heterogeneous parallelism, H. Chafi et al., PPoPP’11 [pdf][slides] and Green-Marl: A DSL for Easy and Efficient Graph Analysis, ASPLOS’12, [pdf][slides][project website][experience report]

[10/26]        Reviews: R5.1

[10/22]        Paper for next week: VirtCL: A Framework for OpenCL Device Abstraction and Management, Ping et al., PPoPP’15 [pdf]

[10/19]        Reviews: R4.1 and R4.2.  

[10/19]        Your project proposals.

[10/14]        Papers for next week: A theoretical framework for algorithm-architecture co-design, K. Czechowski, R. Vuduc, IPDPS’13 [pdf]  and A roofline model of energy, Choi et al., IPDPS 2013 [link][slides].  

[10/05]        Reviews for last week and this week

[09/29]        Paper to review for next time (Monday 5th) (Discussion leader: Augustine): An Auto-Tuning Framework for Parallel Multicore Stencil Computations, S. Kamil et al., IPDPS'10, [pdf].  The review form is here  

[09/29]        Project: based on what we have discussed in class, send me a one-page project description structured as objectives, success criteria, methodology, and workplan. Deadline: end of Friday, October 2.

[09/29]        Scalability! But at what COST? or ‘my laptop is faster than your cluster’ [here][blog]

[09/23]        Paper to read and review for next week posted below. Please send your reviews by noon on Monday.  

[09/16]        Readings to motivate you: You and Your Research, R. W. Hamming [html], Technology and Courage, I. Sutherland [pdf]

[09/01]        We’ll use Piazza for discussions. Please subscribe here. 

Course description

Efficiently harnessing today’s massively parallel computing platforms and providing high and predictable levels of service remain outstanding challenges.  This graduate-level course uses an inclusive definition of massively parallel platforms that covers a diverse set of silicon-level designs (e.g., massively multi‑core chips such as IBM’s Cell processor, nVidia’s Graphics Processing Units, AMD’s Fusion architecture, and Intel’s MIC architecture).  What all these platforms have in common is that they aggregate a large number of processing elements to offer huge computing potential. Yet the issues that make it difficult to fully realize this potential are numerous: minimizing the computational overheads of parallel applications, efficiently exploiting the platforms’ heterogeneous nature, providing predictable performance at multiple levels of the computing stack, energy efficiency, usability (e.g., programming-language support for data-parallel applications), reliability (understanding and mitigating the impact of faults), and maintainability (e.g., the ability to efficiently identify and repair problems).

The course will cover the fundamentals of massively multi-core processor architecture; operating system and programming language support for parallel hardware and parallel applications; system support for debugging parallel applications; and the impact of emerging hardware trends on large-scale data processing system design. Advances in all these directions are key ingredients of recent efforts to build cyber‑infrastructure. While students will be exposed to a range of technologies, the focus will be on multi-core processors (nVidia GPUs), the software stack and tool-chains used to support them (based on CUDA or OpenCL), application integration issues in these systems, and support for data-parallel applications.  The course will also include an introduction to programming models, languages, and tools for GPU architectures.

Course format. The course is structured to provide (i) an in-depth understanding of current topics in distributed/parallel systems research; (ii) experience with reviewing and presenting advanced technical material; and (iii) practice in writing and critically reviewing research papers. The class workload has a participation component and a final project (see slides).

Participation. In each class we discuss two or more research papers. You will have to read the papers before class (be an efficient reader!) and write a review for each paper that includes the following:

1. Summarize the main contribution of the paper

2. Critique the main contribution.

a.          Discuss the significance of the paper.  Also rate on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (junk). More importantly: motivate your rating in a paragraph or two.

b.         Discuss how convincing the methodology is. You may consider some of the following questions (use what is relevant): Do the claims and conclusions follow from the experiments? Are the assumptions realistic? Are the experiments well designed? Are there different experiments that would be more convincing? Are there other alternatives the authors should have considered? And, of course, is the paper free of methodological errors?

c.          Discuss the most important limitations of the approach.

3. What are the two strongest and/or most interesting ideas in the paper?

4. What are the two most striking weaknesses in the paper?

5. Name two questions that you would like to ask the authors.

6. Detail an extension of the work not mentioned in the future work section.

7. Optional comments on the paper that you’d like to see discussed in class.

Reviews must be submitted by noon on the day of the class. Papers and your reviews are then discussed in class. Discussions will be led by one student and may include a brief (10-minute) presentation of the paper. Discussion leaders do not need to submit reviews, but they need to prepare a discussion plan.

Project: The final project is an opportunity for hands-on research in parallel/distributed systems. It involves a literature survey, programming, running experiments or analytical modeling, analyzing results, and writing a roughly 6-page report (ACM format; use appendices if your report is much longer). A list of project ideas will be posted, but students are highly encouraged to propose topics of their own interest that match the course focus. Teams of two or three students are highly recommended. Please see me if you want to form a larger team.  Some past project reports are available here.


Schedule (will be continuously updated)


Topic / Project steps

Research Papers / Other links

(Split as required/optional reading)



Course mechanics. [slides]

Introduction. Overview of current research problems. Amdahl’s Law.


[Project: Introduction. Possible project themes]


  • Amdahl's Law in the Multicore Era, M. Hill, M. Marty, IEEE Computer 2008 [pdf]
  • Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
  • A View of the Parallel Computing Landscape, K. Asanović et al., Communications of the ACM, 2009 [link]
  • The Unreasonable Effectiveness of Data, A. Halevy, P. Norvig, F. Pereira, Communication of the ACM, 2009. [link]
  • Scaling the Power Wall: A Path to Exascale, O. Villa et al., SC’14 [link]
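Amdahl’s Law, a topic of this session, has a one-line form worth internalizing. A minimal numeric sketch (the function name is ours, chosen for illustration):

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup of a program in which `parallel_fraction` of the serial
    runtime is perfectly parallelizable across `n_cores`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Even with 95% of the work parallelized, 1024 cores yield under 20x,
# since the asymptotic limit is 1 / serial_fraction = 20x.
print(round(amdahl_speedup(0.95, 1024), 1))  # 19.6
```

The Hill and Marty paper above revisits exactly this limit for asymmetric multicore designs, where the serial fraction can run on one larger, faster core.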



Understanding performance reports. Performance limits of GPU computation.

[Project: discussion of project ideas]




Required:   (student reviews)

  • Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,  V. Lee et al., ISCA'10, [pdf]
  • Roofline: An Insightful Visual Performance Model for Multicore Architectures, S. Williams et al., CACM 2009 (52) [link] [slides]


  • On the Limits of GPU Acceleration, HotPar'10, [pdf][slides]
  • 12 Ways to fool the masses …, D. Bailey, Supercomputing Review, 1991  [paper] [update]
  • More on Roofline model: [TR] [website] [1][2][3][4]
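The Roofline model from the Williams et al. paper above bounds attainable throughput by the lower of two ceilings: peak compute, and peak memory bandwidth times the kernel’s arithmetic intensity. A minimal sketch (the device numbers below are illustrative placeholders, not measurements of any particular GPU):

```python
def roofline_gflops(peak_gflops: float, peak_gb_per_s: float,
                    flops_per_byte: float) -> float:
    """Attainable performance: compute-bound or memory-bound,
    whichever ceiling is lower."""
    return min(peak_gflops, peak_gb_per_s * flops_per_byte)

# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
# The ridge point separating the two regimes is 1000/200 = 5 flops/byte.
print(roofline_gflops(1000, 200, 0.25))  # 50.0  (memory-bound, SpMV-like)
print(roofline_gflops(1000, 200, 8.0))   # 1000  (compute-bound)
```

Plotting attainable GFLOP/s against arithmetic intensity on log-log axes gives the characteristic slanted-then-flat “roofline” shape discussed in the paper.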



Regular/irregular parallelism

GPU architecture.


[Project: 5-min presentation of your project idea]

Required:  (Discussion leader: Hassan)

  • How Much Parallelism is There in Irregular Applications?, M. Kulkarni, M. Burtscher, R. Inkulu, K. Pingali, C. Cascaval, PPoPP’09 [pdf][slides]


  • Amorphous parallelism: Galois project, tutorial
  • A quantitative study of irregular programs on GPUs, IEEE International Symposium on Workload Characterization (IISWC), 2012 [pdf]



More optimizations for regular and irregular applications: tiling, privatization, regularization, compaction, binning, data-layout transformation, thread management.

(plus auto-tuning)

GPU programming model


[Project: send a short project description (one page max)]

Required (Discussion leader: Augustine)

  • An Auto-Tuning Framework for Parallel Multicore Stencil Computations, S. Kamil et al., IPDPS'10, [pdf]


  • Benchmarking GPUs to tune dense linear algebra, V. Volkov, J. Demmel, SC08 [pdf] [slides]
  • Microarchitecture-independent workload characterization, IEEE Micro’07 [pdf]
  • Implementing sparse matrix-vector multiplication on throughput-oriented processors, N. Bell, M. Garland, SC’09. [pdf]
  • A Tuning Framework for Software-Managed Memory Hierarchies, M. Ren, J. Y.Park, M. Houston, A. Aiken, W. Dally, PACT 2008 [pdf]
  • Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications, I. Sung, J. Stratton, W. Hwu, PACT’10 [pdf]
  • Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU Platforms, S. Venkatasubramanian, R. Vuduc, ICS’09 [pdf]
  • Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures, A. Chandramowlishwaran et al., IPDPS’10 [pdf]
  • Improving Parallelism and Locality with Asynchronous Algorithms, Lixia Liu, Zhiyuan Li, PPoPP’10 [pdf]
  • On-the-fly elimination of dynamic irregularities for GPU computing, E. Zhang, Y. Jiang, Z. Gao, K. Tian, X. Shen, ASPLOS'11 [pdf] [slides]
  • Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU, B. Wu, Z. Zhao, E. Zhang, Y. Jiang, X. Shen, PPoPP’13 [pdf]
  • Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures, SC’08, [pdf]
  • Others: OSKI, ATLAS, optimizations, PetaBricks, slides.


 No class - Thanksgiving

[Project: finalized project description (goals, success criteria, methodology, timeline)]



Balanced architectures, co-design




Required (Discussion leaders: Augustine, Andrew)

  • A theoretical framework for algorithm-architecture co-design, K. Czechowski, R. Vuduc, IPDPS’13 [pdf][video]
  • A roofline model of energy, Choi et al., IPDPS 2013 [link][slides]


  • Balance Principles for Algorithm-Architecture Co-Design, K. Czechowski, C. Battaglino, C. McClanahan, A. Chandramowlishwaran, R. Vuduc, HotPar’11 [link][slides]
  • Algorithmic time, energy, and power on candidate HPC compute building blocks, J. Choi, M. Dukhan, X. Liu, R. Vuduc, IPDPS’14 [pdf]

More energy related papers

  • A Probabilistic Graphical Model-based Approach for Minimizing Energy Under Performance Constraints, ASPLOS’15 [pdf]
  • Evaluating the Effectiveness of Model-Based Power Characterization, J. McCullough et al., USENIX ATC 2011 [pdf][slides]
  • On Understanding the Energy Consumption of ARM-based Multicore Servers, B.M. Tudor, Y.M. Teo, SIGMETRICS 2013 [pdf]
  • Power Modelling and Characterization of Computing Devices: A survey, Reda et al. 2013 [pdf], more modelling [slides], Polfliet et al., 2011 [pdf]
  • Analyzing the Energy Efficiency of a Database Server, SIGMOD 2010 [pdf]
  • The Energy Case for Graph Processing on Hybrid CPU and GPU Systems, A. Gharaibeh, E. Santos-Neto, L. Beltrão Costa, M. Ripeanu. IA3 workshop 2013 [pdf] [slides]
  • Totally green: evaluating and designing servers for lifecycle environmental impact, Chang et al., ASPLOS’12 [pdf]
  • Underprovisioning backup power infrastructure for datacenters, Di Wang et al., ASPLOS'14  [pdf]
  • Price theory based power management for heterogeneous multi-cores, T. Muthukaruppan et al., ASPLOS'14 [pdf]
  • Parasol project: harnessing renewable energy [homepage][summary][slides] 
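The energy papers above (in particular “A roofline model of energy”) build on a simple additive cost model: total energy is work times per-flop energy plus memory traffic times per-byte energy. A minimal sketch; the constants below are our illustrative assumptions, not measured values, and the paper’s constant/idle-power term is omitted:

```python
def energy_joules(flops: float, bytes_moved: float,
                  e_flop: float = 1e-9, e_byte: float = 20e-9) -> float:
    """Toy energy model: compute cost plus memory-traffic cost.
    e_flop and e_byte are placeholder per-operation energies."""
    return flops * e_flop + bytes_moved * e_byte

# 1 GFLOP of work moving 100 MB of data under the toy constants:
print(energy_joules(1e9, 1e8))  # 3.0 J: 1 J compute + 2 J traffic
# The "energy balance" point, analogous to the roofline ridge point,
# is e_byte / e_flop = 20 flops/byte with these constants.
```

As in the time roofline, a kernel whose arithmetic intensity falls below this balance point spends most of its energy moving data rather than computing.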



Runtime support

Required  (Discussion leader: Arthur)

  • VirtCL: A Framework for OpenCL Device Abstraction and Management, Ping et al., PPoPP’15 [pdf]


  • Gdev: First-Class GPU Resource Management in the Operating System, Kato et al., USENIX ATC’12 [pdf][slides]
  • Rhythm: harnessing data parallel hardware for server workloads, S. Agrawal et al., ASPLOS 2014 [pdf]
  • Improving GPGPU concurrency with elastic kernels, S. Pai et al, ASPLOS'13 [pdf]
  • Portable performance on heterogeneous architectures, P. Phothilimthana et al., ASPLOS 2013 [pdf][slides]
  • PTask: operating system abstractions to manage GPUs as compute devices, C. Rossbach et al., SOSP'11, [pdf][slides][talk]
  • TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments, S. Kato et al., USENIX ATC'11 [link]



Language support

Project: (by 11/01, end of day) Midterm report and presentations: submit a midterm report of up to four pages. This should include: methodology, related work, and progress to date.

Required  (Discussion leaders: Augustine, Andrew)


  • A Heterogeneous Parallel Framework for Domain-Specific Languages, PACT’11 [pdf]
  • Simplifying Scalable Graph Processing with a Domain-Specific Language, S. Hong et al., CGO'14 [pdf]
  • Singe: Leveraging Warp Specialization for High Performance on GPUs, ASPLOS’14 [pdf]
  • Others: GPS, Giraph, Diderot, Vivaldi, AnyDSL,  
  • Slides: 1



Applications: graph analysis for massive scale graphs


Required  (Discussion leaders: Pawan, Hassan)

  • NUMA-aware Graph-structured Analytics, K. Zhang et al., PPoPP’15 [pdf] [web]
  • Chaos: Scale-out Graph Processing from Secondary Storage, Roy et al., SOSP’15 [pdf]


  • SYNC or ASYNC: Time to Fuse for Distributed Graph-parallel Computation, C. Xie, PPoPP’15 [pdf]
  • Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates, R. Pearce et al., SC'14 [pdf]
  • Of Hammers and Nails: An Empirical Comparison of Three Paradigms for Processing Large Graphs, M. Najork, WSDM’2012 [pdf]
  • Gunrock: A High-Performance Graph Processing Library on the GPU – Yangzihao [link]
  • Optimization of Asynchronous Graph Processing on GPU with Hybrid Coloring Model, poster at PPoPP’15 [pdf]
  • GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning, USENIX’15 [pdf]
  • GraphQ: Graph Query Processing with Abstraction Refinement—Scalable and Programmable Analytics over Very Large Graphs on a Single PC, USENIX’15 [pdf]
  • Accelerating CUDA Graph Algorithms at Maximum Warp, S. Hong et al., PPoPP’11 [pdf] [slides]
  • Scalable Graph Exploration on Multicore Processors, V. Agarwal et al., SC’10 [pdf]
  • Traversing Trillions of Edges in Real-time: Graph Exploration on Large-scale Parallel Machines, Checconi et al., IPDPS'14 [pdf]
  • Scalable GPU graph traversal, Merill et al., PPoPP’13, [pdf]
  • Connected Components in MapReduce and Beyond, K. Riveris et al., SoCC'14 [slides]
  • Others: Center for Adaptive Supercomputing Software; MTAAP workshop ‘14, ‘13, ‘12; GRADES workshop: ‘15, ‘14, ‘13




Required  (discussion leaders: Li, Justin)

  • Quantitative evaluation of soft error injection techniques for robust system design, Cho et al., DAC'13 [pdf]
  • Dual Execution for On the Fly Fine Grained Execution Comparison, ASPLOS’15 [pdf]


  • The Soft Error Problem: An Architectural Perspective, S. Mukherjee et al., [pdf]
  • Avoiding Pitfalls in Fault-Injection Based Comparison of Program Susceptibility to Soft Errors, DSN’15  [pdf]
  • Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults, DSN’14 [pdf]
  • Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability,  DSN’14 [pdf]
  • ExaScale: (i) Toward Exascale Resilience: 2014 Update [pdf]; (ii) Addressing failures in exascale computing [report]
  • Related on duplicated execution: (i) VARAN the Unbelievable An Efficient N-version Execution Framework, ASPLOS’15 [pdf], (ii) Operating System Support for Redundant Multithreading [pdf]



It’s (again) a new world!

Required  (discussion leaders: Augustine, Tanuj)

  • Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers, ASPLOS’15 [paper]
  • Integrated 3D-stacked server designs for increasing physical density of key-value stores, A. Gutierrez et al., ASPLOS'14 [pdf]


  • Clearing the clouds: a study of emerging scale-out workloads on modern hardware, M. Ferdman et al., ASPLOS 2012 [pdf]
  • Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached, K. Lim et al., ISCA'13 [pdf]
  • More accelerated memcached: 1 2 3 4 5 6 7



Custom accelerators / Bring processing close to data


  • BlueDBM: An Appliance for Big Data Analytics, Jun et al., ISCA’15 [pdf][project]
  • Willow: A User-Programmable SSD, S. Seshadri et al., OSDI'14 [link]


  • Q100: The Architecture and Design of a Database Processing Unit, Wu et al., ASPLOS’14 [pdf][slides][project]
  • DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, T. Chen et al., ASPLOS'14 [link]
  • 10 x 10 project 
  • Architecture Support for Domain-Specific Accelerator-Rich CMPs. ACM Transactions on Embedded Computing Systems (TECS), Apr 2014.
  • BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing, HPCA 2015
  • DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs, ASPLOS'15 [pdf]



Misc, Dark silicon, Flash memory, NV-RAM

  • The rise and fall of dark silicon, Hardavellas, Usenix Login 2013 [pdf]
  • A Landscape of the New Dark Silicon Design Regime, Michael Taylor, IEEE Micro 2013 [pdf][slides]
  • Harmonia: Balancing compute and memory power in high-performance GPUs
  • Better Flash Access via Shapeshifting Virtual Memory Pages, Trios’13 [paper]
  • How Persistent Memory Will Change Software Systems, A. Badam,  IEEE Computer, 2013 [link]
  • Storage Management in the NVRAM, VLDB’14 [pdf][slides]
  • NVM duet: unified working memory and persistent store architecture, Ren-Shuo Liu et al., ASPLOS'14 [pdf]
  • Exploring hybrid memory for GPU energy efficiency through software-hardware co-design, Wang et al., PACT'13 [pdf]
  • SSDAlloc:  NSDI’11 [paper] [slides]
  • OS Support for Non-Volatile Main Memory, I. Moraru, Trios'13 [pdf][slides]
  • SDF: software-defined flash for web-scale internet storage systems, J. Ouyang et al., ASPLOS'14 [pdf]
  • Others: 1



[Project: project presentations and wrap-up]



Previous years’ course schedules can be found here: 2014 - Harnessing Massively Parallel Processors; 2011 - Harnessing Massively Parallel Processors; 2010 - Autonomic Computing; 2009 - Massively Distributed/Parallel Computing Platforms; 2008 - Quality of Service; 2007 - Data-intensive Computing Systems.

Textbook (recommended): Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk, Wen-mei W. Hwu, 2010 [link]

Other: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd edition, 2013 [pdf]

From a NSF CFP: “A key driver of continued performance improvement is ending: semiconductor technology is facing fundamental physical limits and single-processor performance has plateaued. Two reports, "21st Century Computer Architecture" (CCC, 2012) and "The Future of Computing Performance: Game Over or Next Level?" (NRC, 2011) highlight this development and its impact on science, the economy, and society. The reports pose the question of how to enable the computational systems that will support emerging applications without the benefit of near-perfect performance scaling (termed Dennard scaling) from hardware improvements. Although Moore's law may produce another few generations of smaller transistors, the end of Dennard scaling means that smaller transistors do not necessarily improve performance or energy efficiency. NSF's Advanced Computing Infrastructure: Vision and Strategic Plan (2012)  describes strategies that address this challenge for NSF and the research community. Furthermore, the National Strategic Computing Initiative (NSCI) outlines the need to establish "over the next 15 years, a viable path forward for future high-performance computing (HPC) systems even after the limits of current semiconductor technology are reached (the 'post-Moore's Law era')".

To continue improving performance and to support the move of parallelism to applications at multiple levels from mobile devices to desktops to exascale systems and the cloud, we need a new era of parallel computing, driven by novel, groundbreaking research in all areas impacting parallel performance and scalability. Achieving the needed breakthroughs will require a collaborative, cross-layer effort among researchers representing all areas from the application layer down to the micro-architecture, and will be built on new concepts and new foundational principles. Vertical integration, that is, tight linking between researchers in two or more layers of the hardware-software stack, is critical as abstraction layers evolve in two directions -- bottom-up (driven by foundations and principles) and top-down (driven by applications). Such cross-layer integration is more likely to yield consistent and coherent outcomes from these simultaneous evolutions.“