1034 Winsor Ave, Piedmont, CA, 94610, 650.391.4990, firstname.lastname@example.org
Speed and Impact. Seeking technical role in SW acceleration, performance, architecture, compilers, optimization, tools, analysis. From single binary to datacenter, frontend to backend, SW to HW, commodity chip to custom ASIC, from most challenging to super interesting ;-) Open to “Step Up” opportunities only.
Google, Principal Engineer, Oct 13 - present
Platforms (Q4, 2013-)
SW / Tech Lead for accelerators in the datacenter (compilers, tools, libraries, simulators, auto-tuners, programming models, APIs, performance, etc). Built, staffed, and lead teams of 70+.
- TL for XLA: Open-Source TensorFlow JIT / AOT compiler targeting CPU, GPU, TPU.
- SW Lead for TPU (Tensor Processing Unit), an ASIC project, announced Google IO 16. Functional simulator(s), device driver, user-space driver, customer API, compilation, performance, tools, debuggers, testing. Ran 1st workload 3 days (!) after power-on, in datacenter 20 days later. 10-40x performance/TCO gains compared to commodity platforms.
- SW Lead for Cloud TPU, an ML supercomputer, announced Google IO 17. Similar to above, but much, much bigger, 11 PFlops.
- CloudML on TPU, SW lead for Platforms
- Manager (only) for a scalable ML IP block for mobile
- Uber TL for Platforms Performance Team (NPI, TCO, Compute, Storage, Networking, !x86)
- Uber TL for open-source, LLVM-based CUDA compiler. Beating NVCC on 60+ benchmark kernels by geomean 0%-5%. Win by geomean 20%+ on internal benchmarks (up to 50%). Compile time up to 2.4x faster. Support for C++11/14. Transferred to Production.
- Uber TL for open-source, high performance BLAS, FFT, DNN libraries to replace NVIDIA’s proprietary binaries. Beat NVIDIA everywhere, except on SASS optimized code (BLAS-3). Employ auto-tuning.
- Another LLVM-based compiler (backend/codegen) for proprietary, tiny, RISC-like, 32-bit networking processor. Beat previous, gcc-based toolchain in performance (0-20%), and code size (16%). Transferred to Production.
- Research in staged compilation, generative programming to achieve performance portability across heterogeneous architectures.
Awards and Recognitions:
- Program Chair 2015 / Steering Committee for external CGO Conference (www.cgo.org)
- Technical Infrastructure Award Q2’16: TPU
- Technical Infrastructure Award Q1’15: PCIe Expansion/Accelerator related SW.
- Technical Infrastructure Award Q1’13: Delivering unprecedented C++ performance.
- Chrome&Apps PA Award, Q3’12: Reducing Gmail Memory usage by 80%+ at the 99%’ile
- Alan Eustace’s Top 10, Q3’11: Gmail turned profitable
- Alan Eustace’s Top 10, Q3’09: C/C++ compiler 10% faster, producing 25% smaller binaries
- Google Excellent Paper 2011: Impact of Memory Resource Sharing on Datacenter Performance
- IEEE Micro, Top Picks 2012 with “Bubble-Up: Increasing Utilization...”, Micro ‘11
Google, Senior Staff Software Engineer, Oct 08 - Sept 13
Google, Staff Software Engineer, Oct 07 - Oct 08
Google, Senior Software Engineer, Jan 07 - Sept 07
Tech Lead Manager for Apps/Gmail Performance Team:
- Apps/Gmail Client - Hands-on mgmt with 4 people (+2 scholars, and a few interns)
- Identified redundancies and a way to reduce Gmail’s CSS delivery by 70%
- Identified actual latency impact of JS/CSS delivery and developed novel “delta delivery” method that was accepted by Gmail, G+, several of the Apps. Filed >10 patents.
- Developed tools and methodology to analyze, and then fixed, almost all memory leaks; tracked down several Chrome GC issues. 99%’ile down from >2GB to ~250MB, 50%’ile down by >50%.
- Greatly improved cache hit rates and efficiency for highest frequency requests. Several latencies down by >15%.
- Lots of novel analysis techniques and studies in a whole range of (Web Performance) areas, e.g., Dart, impact of DOM complexity on latency and scroll speeds, study of browser-induced latency variability, application delivery, e.g., impact of JS PreParser, cost of 1K of JS/CSS, impact of V8 optimizations, how to improve Gmail’s dynamic MOD’ing, critical latency path analysis using Dapper RPC request traces, RPC stall time analysis (leads to under-configured thread pools), many more.
- Previously: Gmail Backend (10 reports)
- Improve CPUsec/sec by >20% via job-to-core mapping
- Establish baseline, performance methodology (difficult), regression framework
- Several traditional (backend) performance analyses and optimizations
Tech Lead Manager for several performance/compiler projects. 28 reports by end of Q3/10 (personal record):
- TLM for x86 server compiler optimization effort. Performance year 1: 12%, year 2: 10%, year 3: 5%. Several novel optimizations: LIPO, MAO, Live Range Shrinking, Integrated Register Allocator (with Briggs), locality based string opts. Traditional compiler optimization/tuning. FDO deployment.
- TLM for ARM/Android compiler optimization effort. Code size reduction: ~15%. Performance: ~10%.
- Manager for Google-wide profiling infrastructure. Published in IEEE Micro.
- TLM for datacenter performance analysis and tools team. Good impact. Search latency: -35%. Edge server machine requirements: -15%. Locking codes: 3-4x throughput. Memory allocation: Saved equivalent of 25 SWEs. Authentication code: Improved SLAs by 2 orders of magnitude. Gmail: 15% CPU reduction. Ongoing.
- Heading compiler research and university collaborations. Grants to UVA, Princeton, Rochester, Tsinghua, Stanford, many others.
- Several other interesting (== not yet successful) projects: Scala @ Google, Micro-Architectural Optimizations (MAO, published at CGO '11), Contention-Aware Execution (CGO '09, '10), Latex on Google Docs, Sampled FDO (CGO '10), others
Culture, Froyodel in Chief, Sept 08 - Jan 2012
Created Organic Frozen Yogurt Store with wife and biz partner on California Ave in Palo Alto. Super delicious, only highest quality. Now teaching the partners how to run a business - successfully. Unfortunately, the business failed due to a falling-out with the co-owner.
Hewlett-Packard, Compiler Eng, July 02 to Dec 06
CLTL, Cupertino Languages and Tools Lab, High Level Optimizer (HLO)
- Work in compiler infrastructure and high-level loop optimizer, scalar optimizer and inter-procedural optimizer for the HP-UX / Linux compiler backend for Itanium.
- Data layout optimizations, such as structure splitting, dead structure field removal, affinity based structure field reordering, and helper transformations, such as malloc coalescing. Inter-procedural propagation of edge weight estimates for non-profile guided compilations. Started automatic pool allocation / pointer compression.
- Leading Automatic Parallelization project. Bring-up of new loop optimizer in record time performing loop interchange, fusion, distribution, unswitching, inner and outer loop unrolling, cloning and rerolling. Loop recognition, canonicalization, and lowering/raising to various IR levels. SSA-based induction variable (IV) recognition. Index set splitting. Scalar and array renaming.
- Work on scalar optimizations. memset/memcpy optimization (replacement of copy iterations with faster milli-code routines), CFG optimizations, IV register promotion, IR lowering, other scalar optimizations. Contributions to bring up of new SSA framework and partial redundancy elimination (SSA-PRE).
- Bring-up of new infrastructure and IR for scalable cross-module optimizations. New intermediate file design for fastest access times, parallelization of code generation phase, new facilities for triaging and performance monitoring. Part of an effort to retire an existing non-scaling HLO.
Hewlett-Packard, Software Engineer, Jan 00 to Jun 02
CLTL, Cupertino Languages and Tools Lab, Performance Tool Caliper
- Technical authority for dynamic binary instrumentation on IA-64. Design and implementation of HP Caliper, a performance tool combining dynamic binary instrumentation and PMU supported sampling for HP-UX, Linux on IA-64. Caliper operates at the intersection of the IA-64 hardware, run-time architecture, OS, micro loader, virtual memory system, user-space, compiler, linker, and dynamic loader.
- Design and implementation of cdb, a simple yet powerful debugger for instrumented and non-instrumented code for IA-64.
- Design and implementation of all aspects of instrumentation for C++ programs, as well as JVMs. Dynamic handling of IA-64 unwind information, providing full support for C++ exceptions and general stack unwinding through dynamically generated code. Lazy and precise unwind generation via interception of interaction between unwinder and dynamic loader.
- Full support for all aspects of process handling, including un-instrumentation of in-line instrumented code via IA-64 call chain cleanup, OS related aspects of fork/vfork/exec/exit, and PBO compiled executables.
Spectra Precision TerraSat, Senior Software Engineer
Co-Owner, Jun 95 to Dec 99
- Design, implementation, and hands-on management of GeoGenius, SPT’s main program system consisting of ~1 million LOC. Windows NT/9x, VC++, Prolog and WATCOM FORTRAN, distributed in 7 languages including Japanese in 3 different versions (2 OEMs). GeoGenius processes, manages, adjusts, visualizes and manipulates data from various satellite receivers, geodetic instruments, and other sources. 10 developers, close cooperation with multiple international partners.
- MacroStudio, an interactive debugging environment for TCL with syntax coloring and remote debugging. MacroStudio itself is programmable via TCL and can be used to debug itself.
- Vis-à-vis, a program for analysis and visualization of satellite visibility. Designed for automated analysis tasks. Written with WATCOM C++, ZINC interface library and TCL for Windows and OS/2.
- Thesis Guidance: A Program Generator for a COM-based Application Framework, Margit Wieser,
- Design, primary development, and supervision of GeoMotion, an OpenGL application for animated display of satellite geometry and movement in earth-fixed, satellite-fixed or inertial frame. Written with VC++, MFC for Windows
TerraSat GmbH, Senior Software Engineer
Co-Owner, Apr 92 to May 95
- Design and implementation of GPS-Base, an extended DOS application for monitoring and analysis of the GPS system. Control of satellite receivers, atomic clock and meteorological stations. Written with Symantec/Zortech C++, ZINC, X32 DOS extender and Assembler (MASM).
- PACO, an LALR(1) parser compiler consisting of 60% BISON, 20% YACC and 20% Robert. My idea was to add a debugging interface allowing viewing of parser stack, shifts, reduces and corresponding grammar lines during parsing. Used to create DSDL, a Data Structure Description Language.
- Topas Turbo, the predecessor to GeoGenius, DOS based with own graphical user-interface, data-processing and least squares network adjustment, written with Borland C and FORTRAN. TerraSat’s main product during this period.
University FAF Munich, Research Assistant, 89 - 92
- Development of graphical user interface in Turbo Pascal. Port of TOPAS GPS processor from VMS to PC (8086 ;-). Implementation of assembler routines to substitute missing system calls. Start of Topas Turbo project.
Technical University Munich, Germany, 1992
Diplom Univ. in Informatik (~Masters in Computer Science), Minor in Physics
Thesis: Parallelization of Programs via Program Graph Transformations
Technical University Munich, Germany, 1989
Vordiplom (~Bachelor in Computer Science)
Thesis: Code generator for x86 Processors with Integrated Peephole Optimizer
CGO Steering Committee, Open64 Steering Group (chair), Steering Committee Open64 Workshop, WBIA 2005 Workshop (PC), CGO 2008 (PC), PACT 2008 (PC), Open64 2008 Workshop (PC), CGO 2009 (PC, WEB/Pub chair), Open64 2009 Workshop (PC), WISH 2009 Workshop (PC), SMART 2009 Workshop (PC), CC 2010 (PC), HiPEAC 2010 (ext PC), WBIA 2010 Workshop (PC), Open64 Workshop (PC), CGO 2011 (PC), ASPLOS 2011 (ext PC), SMART 2011, EXADAPT 2011, 2012 (Organizer), CGO 2012 (Workshop Chair, Steering Committee), CGO 2013 Sponsorship chair, PLDI 2012 (ext PC), ISPASS (PC), HiPeac/TACO (distinguished reviewer) 2011, 2012, MICRO EC 2014, CGO program chair 2015, MICRO 2015 ERC, ISCA 2016 ERC, PACT 2016 PC, MICRO 2016 ERC, CGO 2016 PC, sponsorship chair, CGO 2017 PC, ISCA 2017 ERC.
US6795964, US6817014, US6851110, US6898785, US6918110, US6934943, US6957421, US6957742, US6993750, US6996810, US7017153, US7103878, US7131115, US7165162, US7185320, US7249349, US7360207, more pending, stopped tracking, ~25 total.
Captain of 1st Bundesliga indoor volleyball team in 1987 (Germany). Played European indoor club volleyball championships 1988. Won US Open Gold in 2006 (40's) with Burgess VBC, Menlo Park, CA. Bronze in 2009 (45's). Won NCVA in 2010 with Last Call. Finished #15 with Last Call at Level AA at US Open 2011.
In-Datacenter Performance Analysis of a Matrix Processing Unit
Dave Patterson, Norm Jouppi, …, Robert Hundt, … (the whole team)
GPUCC - An Open-Source GPGPU Compiler
Jingyue Wu, …, Robert Hundt
Jacques Pienaar, Robert Hundt
Optimizing Google's Warehouse Scale Computers: The NUMA Experience
Lingjia Tang, Jason Mars, Xiao Zhang, Robert Hagmann, Robert Hundt and Eric Tune
Heterogeneity in “Homogeneous” Warehouse-Scale Computers: A Performance Opportunity
Jason Mars, Lingjia Tang, Robert Hundt
IEEE Computer Architecture Letters, Dec 2011
Bubble-Up: Increasing Sensible Co-locations for Improved Utilization in Modern Warehouse Scale Computers
Jason Mars, Lingjia Tang, Mary Lou Soffa, Robert Hundt
IEEE Micro Top Picks 2012
Loop Recognition in C++, Java, Scala, Go
Proceedings of Scala Days 2011 (also on “The Register”)
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications,
Lingjia Tang, Jason Mars, Neil Vachharajani, Robert Hundt, Mary Lou Soffa,
Google Excellent Paper of 2011
MAO - an extensible Micro-Architectural Optimizer
Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani.
RACEZ: A lightweight and non-invasive race detection tool for production applications
Tianwei Shen, Neil Vachharajani, Stephane Eranian, Robert Hundt, et al.
Google-Wide Profiling: A Continuous Profiling Infrastructure for Datacenters
Gang Ren, Tipp Moseley, Eric Tune, Silvius Rus, Robert Hundt.
IEEE Micro 2010
Lightweight Feedback-Directed Cross-Module Optimization
David Xinliang Li, Raksit Ashok, Robert Hundt
Taming Hardware Event Samples for FDO Compilation
Dehao Chen, Neil Vachharajani, Robert Hundt
Contention Aware Execution: Online Contention Detection and Response
Best Student Presentation
Jason Mars, Neil Vachharajani, Mary Lou Soffa, Robert Hundt
Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations
Jason Mars and Robert Hundt
Feedback-Directed Optimizations with Estimated Edge Profiles from Hardware Event Sampling
Vinodha Ramasamy, Dehao Chen, Paul Yuan, Robert Hundt.
gcc summit 2008, PLDI 2008 poster
Structure Layout Optimization for Multi-Threaded Programs
Easwaran Raman, Robert Hundt, Sandya Mannarswamy.
Whole Program Optimization of Global Variable Layout
Nathaniel McIntosh, Robert Hundt, Sandya Mannarswamy.
Practical Structure Layout Optimization and Advice
Robert Hundt, Dhruva R. Chakrabarti, Sandya Mannarswamy
International Symposium on Code Generation and Optimization (CGO-2006)
Scalable High Performance Cross-Module Inlining
Dhruva R. Chakrabarti, Luis A. Lozano, Xinliang D. Li, Robert Hundt, Shin-Ming Liu
13th International Conference on Parallel Architecture and Compilation Techniques, 2004 (PACT'04)
SYZYGY - A Framework for Scalable Cross-Module IPO
Dhruva R. Chakrabarti, Luis A. Lozano, Xinliang D. Li, Robert Hundt, Shin-Ming Liu
2004 International Symposium on Code Generation and Optimization (CGO-2004)
Dynamic Binary Instrumentation in IA-64
Vinodha Ramasamy, Robert Hundt
EPIC-1 Workshop with MICRO 2001
HP Caliper - A Framework for Performance Analysis Tools
IEEE Concurrency Magazine 2001
HP Caliper - An Architecture for Performance Analysis Tools
First Workshop on Industrial Experience with Systems Software, WIESS-2001
Aircraft Positioning and Guidance with the Global Positioning System
Dr. Herbert Landau, Robert Hundt et al,
KIS94 in Banff, Canada, 1994
A GPS-based High-precision Positioning and Guidance System
Dr. Herbert Landau, Robert Hundt, Christian Pagls, and Dr. Ulrich Vollath, terraSat GmbH; Bo Granstedt, Saab Instruments AB, pp. 1099-1106
ION GPS-94 Proceedings, 7th International Technical Meeting of The Satellite Division of The Institute of Navigation, September 20-23, 1994
A GPS Monitoring System: Concept, Implementation and Experiences
Landau H., Hundt R., Mueller A. (1994),
Proceedings of the Institute of Navigation Satellite Meeting Salt Lake City, Utah, 1321-1327