Welcome
Whether you're curious about teaching your first performance-engineering module or eager to share the success of an existing SPE class, you are welcome to join Fastcode's SPE instructors community.
Helpful resources
For starters
Take a look at this post by John Owens, "My experience teaching software performance engineering," about adapting MIT 6.106 to create his own SPE class at UC Davis. John also coauthored the paper "Helping faculty teach software performance engineering," published at EduPar-24. EduPar and its sibling workshop EduHPC (held annually at IPDPS and SC, respectively) are organized through CDER, which develops and curates many useful resources for teaching parallel and distributed computing. See also Get started with SPE.
To help you develop your own course or module on SPE, below is an incomplete list of relevant classes and workshops. Several listings include links to lecture PDFs and, in some cases, videos.
Do you have your own class or module to add to our list? Please let us know.
When you join the Fastcode instructors community, you get access to editable slide decks and LaTeX documents, and you are welcome to join SPE-instructor discussions to share ideas and questions with like-minded faculty.
List of classes and workshops
UC Berkeley
CS267: Applications of Parallel Computers
CS267 was originally designed to teach students how to program parallel computers to efficiently solve challenging problems in science and engineering, where very fast computers are required either to perform complex simulations or to analyze enormous datasets.... While this general outline remains, a large change in the computing world started in the mid-2000s: not only are the fastest computers parallel, but nearly all computers are becoming parallel.... Students in CS267 will get an overview of the parallel architecture space, gain experience using some of the most popular parallel programming tools, and be exposed to a number of open research questions.
Cornell
CS 5220: Applications of Parallel Computers
CS 5220 is an introduction to performance tuning and parallelization, particularly for scientific codes. Topics include:
- Single-processor architecture, caches, and serial performance tuning
- Basics of parallel machine organization
- Distributed memory programming with MPI
- Shared memory programming with OpenMP (see the sketch after this list)
- Parallel patterns: data partitioning, synchronization, and load balancing
- Examples of parallel numerical algorithms
- Applications from science and engineering
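To give a flavor of the shared-memory topic above, here is a minimal OpenMP sketch in C; it is illustrative and not drawn from the course materials. It parallelizes a dot product with a reduction.

```c
/* A dot product parallelized with OpenMP.
 * Compile with, e.g.: gcc -O2 -fopenmp dot.c -o dot */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double sum = 0.0;
    /* Each thread accumulates a private partial sum; the reduction
     * clause combines them safely, avoiding a data race on sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("dot = %.1f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```

The reduction clause is the standard way to avoid a race on the shared accumulator while still letting every thread work independently.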
CMU
15-418/15-618: Parallel Computer Architecture and Programming
From smartphones to multicore CPUs and GPUs to the world's largest supercomputers, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach parallel programming techniques necessary to effectively utilize these machines. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course will cover hardware design and how that affects software design.
UC Davis
Hands-on, project-based introduction to building scalable and high-performance software systems. Topics include performance analysis, algorithmic techniques for high performance, instruction-level optimizations, caching optimizations, parallel programming, and building scalable systems. The course programming language is C. Links to lecture slides are below.
- Intro and Matrix Multiplication (PDF)
- Bentley Rules for Optimizing Work (PDF)
- Bit Hacks (PDF; see the sketch after this list)
- Computer Architecture (PDF)
- C to Assembly (PDF)
- What Compilers Can and Cannot Do (PDF)
- Multicore Programming (PDF)
- Races and Parallelism (PDF)
- Analysis of Parallel Algorithms I (PDF)
- Analysis of Parallel Algorithms II (PDF)
- Measurement and Timing (PDF)
- Cheetah: The Cilk Runtime (PDF)
- Storage Allocation (PDF)
- Parallel Storage Allocation (PDF)
- Cache-Efficient Algorithms (PDF)
- Cache-Oblivious Algorithms (PDF)
- Nondeterministic Parallel Programming (PDF)
- Synchronization without Locks (PDF)
- Potpourri (PDF)
- Speculative Parallelism (PDF)
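As a taste of the "Bit Hacks" lecture, here is a classic branchless minimum in C. The sketch is ours, not drawn from the slides.

```c
/* Branchless minimum of two signed integers. The comparison (x < y)
 * yields 0 or 1; negating it gives a mask of all zeros or all ones,
 * which selects between the operands without a branch. */
#include <stdio.h>

static inline long branchless_min(long x, long y) {
    long mask = -(long)(x < y);      /* -1 if x < y, else 0 */
    return y ^ ((y ^ x) & mask);     /* x if mask == -1, else y */
}

int main(void) {
    printf("%ld\n", branchless_min(3, 7));   /* prints 3 */
    printf("%ld\n", branchless_min(10, -4)); /* prints -4 */
    return 0;
}
```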
IIT Dharwad
CS601 Software Development for Scientific Computing
This course focuses on software development skills in the context of dominant algorithmic patterns found in scientific computing. Topics include:
- Exploring tools that cross most disciplines (build tools, version control tools, compilers, debugging tools, profiling tools etc.)
- Exploring dominant algorithmic patterns found in dense and sparse linear algebra, structured and unstructured grid methods, tree-based codes, particle methods, FFTs, and PDEs (see the sketch after this list)
- Selected topics in C++ programming, asymptotic analysis, and performance tuning
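To illustrate one dominant pattern named above, here is a minimal sparse matrix-vector multiply in compressed sparse row (CSR) format. The sketch is in C (the course itself uses C++), and the tiny example matrix is ours, not the course's.

```c
/* y = A*x for an m-row sparse matrix in CSR format:
 * row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i. */
#include <stdio.h>

void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* A = [2 0; 1 3] stored in CSR. */
    int row_ptr[] = {0, 1, 3};
    int col_idx[] = {0, 0, 1};
    double val[]  = {2.0, 1.0, 3.0};
    double x[] = {1.0, 1.0}, y[2];
    spmv_csr(2, row_ptr, col_idx, val, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]); /* y = [2, 4] */
    return 0;
}
```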
GaTech
GaTech lists two relevant courses. The first provides a comprehensive introduction to parallel algorithms and parallel programming, with a strong emphasis on the design of parallel algorithms and their rigorous analysis. Exposure to parallel programming is provided through programming assignments using MPI. Throughout the course, the design of algorithms is interlaced with the programming techniques useful in coding them.
The second teaches hands-on, practical performance engineering on high-performance computer systems, from single-processor multicore platforms up to large-scale distributed systems. Topics include cache-friendly algorithm analysis and implementation, parallel programming, concurrent data structures, tools for effective performance engineering, and modern high-performance computing (HPC) applications.
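Since the first course's assignments use MPI, here is a minimal sketch of distributed-memory programming in C; the program is illustrative, not from the course. Each rank computes a local value, and MPI_Reduce combines them on rank 0.

```c
/* Build and run with, e.g.: mpicc sum.c -o sum && mpirun -np 4 ./sum */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes its rank number; the reduction
     * yields 0 + 1 + ... + (size-1) on rank 0. */
    int local = rank, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```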
Johns Hopkins
EN 601.420/620 Parallel Computing for Data Science
This course studies parallelism in data science, drawing examples from data analytics, statistical programming, and machine learning. It focuses mostly on the Python programming ecosystem, but uses C/C++ to accelerate Python and Java to explore shared-memory threading. It explores parallelism at all levels, including instruction-level parallelism (pipelining and vectorization), shared-memory multicore, and distributed computing. Concepts from computer architecture and operating systems are developed in support of parallelism, including Moore's law, the memory hierarchy, caching, processes/threads, and concurrency control. The course covers modern data-parallel programming frameworks, including Dask, Spark, Hadoop, and Ray. It does not cover GPU deep-learning frameworks or CUDA. The course is suitable for second-year undergraduate CS majors and for graduate students from other science and engineering disciplines who have prior programming experience.
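One common route for accelerating Python with C, of the kind the description mentions, is to compile a small C kernel into a shared library and call it via ctypes. The sketch below is ours, not the course's; names like saxpy and libsaxpy.so are illustrative.

```c
/* saxpy.c: y[i] += a * x[i], a simple kernel the compiler can
 * pipeline and auto-vectorize at -O2/-O3.
 * Build: cc -O2 -shared -fPIC -o libsaxpy.so saxpy.c
 * Call from Python (illustrative):
 *   import ctypes
 *   lib = ctypes.CDLL("./libsaxpy.so")
 *   lib.saxpy.argtypes = [ctypes.c_long, ctypes.c_float,
 *                         ctypes.POINTER(ctypes.c_float),
 *                         ctypes.POINTER(ctypes.c_float)]
 */
void saxpy(long n, float a, const float *x, float *y) {
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```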
MIT
6.106 (formerly 6.172) is an 18-unit class that provides a hands-on, project-based introduction to building scalable and high-performance software systems. Topics include performance analysis, algorithmic techniques for high performance, instruction-level optimizations, caching optimizations, parallel programming, and building scalable systems. The course programming language is C. Links to lectures are below.
- Introduction & Matrix Multiplication (PDF, video; see the sketch after this list)
- Bentley Rules for Optimizing Work (PDF, video)
- Bit Hacks (PDF, video)
- Assembly Language and Computer Architecture (PDF, video)
- C to Assembly (PDF, video)
- Multicore Programming (PDF, video)
- Races and Parallelism (PDF, video)
- Analysis of Multithreaded Algorithms (PDF, video)
- What Compilers Can and Cannot Do (PDF, video)
- Measurement and Timing (PDF, video)
- Storage Allocation (PDF, video)
- Parallel Storage Allocation (PDF, video)
- The Cilk Runtime System (PDF, video)
- Caching and Cache-Efficient Algorithms (PDF, video)
- Cache-Oblivious Algorithms (PDF, video)
- Nondeterministic Parallel Programming (PDF, video)
- Synchronization Without Locks (PDF, video)
- Domain Specific Languages and Autotuning (PDF, video)
- Leiserchess Codewalk (PDF, video)
- Speculative Parallelism & Leiserchess (PDF, video)
- Tuning a TSP Algorithm (PDF, video)
- Graph Optimization (PDF, video)
- High Performance in Dynamic Languages (PDF, video)
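The first lecture uses square matrix multiplication as its running example of how much performance a naive program leaves on the table. One of its simplest lessons is that loop order changes the memory-access pattern; here is an illustrative C sketch (sizes and names are ours, not the lecture's):

```c
/* Square matrix multiply, C += A * B, in i-k-j loop order. Hoisting
 * A[i*N + k] and walking B and C along rows gives unit-stride
 * inner-loop accesses, far friendlier to the cache than the textbook
 * i-j-k order, whose inner loop strides down a column of B. */
#include <stddef.h>

#define N 1024

void matmul_ikj(const double *A, const double *B, double *C) {
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double a = A[i * N + k];
            for (size_t j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}
```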
Modern algorithms workshop: parallel algorithms
Originally created in 2018 as a single full-day class, this workshop comprises an introduction and eight modules, listed below.
- Introduction (PDF)
- Cilk model (PDF; see the sketch after this list)
- Detecting nondeterminism (PDF)
- What is parallelism? (PDF)
- Scheduling theory primer (PDF)
- Analysis of parallel loops (PDF)
- Case study: matrix multiplication (PDF)
- Case study: Jaccard similarity (PDF)
- Post-Moore software (PDF)
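The "Cilk model" module covers fork-join parallelism as expressed in Cilk. Here is a minimal OpenCilk sketch in C (ours, not the workshop's):

```c
/* Parallel Fibonacci in the fork-join (Cilk) model.
 * Compile with OpenCilk: clang -fopencilk -O2 fib.c -o fib */
#include <cilk/cilk.h>
#include <stdio.h>

long fib(long n) {
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);  /* may run in parallel... */
    long b = fib(n - 2);             /* ...with this call */
    cilk_sync;                       /* wait for the spawned child */
    return a + b;
}

int main(void) {
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```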
Join the Fastcode instructors community for access to editable slide decks.