We take inspiration from Amdahl's Law to give a more "authoritative" introduction to the basic concepts of multithreaded execution — work, span, and parallelism.
I’m constantly amazed how many seemingly well-educated computer technologists
bandy about the word parallelism without really knowing what they’re talking
about. I can’t tell you how many articles and books I’ve read on parallel
computing that use the term over and over without ever defining it. Many of
these “authoritative” sources cite Amdahl’s Law¹, originally
proffered by Gene Amdahl in 1967, but they seem blissfully unaware of the more
general and precise quantification of parallelism provided by theoretical
computer science. Since the theory really isn’t all that hard, it’s curious
that it isn’t better known. Maybe it needs a better name — “Law” sounds so
authoritative. In this blog, I’ll give a brief introduction to this theory,
which incidentally provides a foundation for the efficiency of the OpenCilk
runtime system.
Amdahl’s Law
First, let’s look at Amdahl’s Law and see what it says and what it doesn’t
say. Amdahl made what amounts to the following observation. Suppose that
$50\%$ of a computation can be parallelized and $50\%$ can’t. Then, even if the
$50\%$ that is parallel took no time at all to execute, the total time is cut at most
in half, leaving a speedup of less than $2$. In general, if a fraction $p$ of a
computation can be run in parallel and the rest must run serially, Amdahl’s Law
upper-bounds the speedup by $1/(1-p)$.
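To make the bound concrete, here is a minimal C sketch (my own illustration, not anything from Amdahl’s paper) that evaluates $1/(1-p)$ for a few parallel fractions:

```c
#include <stdio.h>

/* Amdahl's upper bound on speedup: if a fraction p of the computation
   can be parallelized, no number of processors yields a speedup of
   1/(1-p) or more. */
static double amdahl_bound(double p) {
    return 1.0 / (1.0 - p);
}

int main(void) {
    const double fractions[] = { 0.5, 0.9, 0.99 };
    for (int i = 0; i < 3; i++) {
        printf("parallel fraction %.2f => speedup < %g\n",
               fractions[i], amdahl_bound(fractions[i]));
    }
    return 0;
}
```

Even with $99\%$ of the computation parallelizable, the bound caps the speedup below $100$, no matter how many processors you throw at the problem.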
This argument was used in the 1970s and 1980s to make the case that parallel
computing, which was in its infancy back then, was a bad idea — the implication
being that most applications have long, inherently serial subcomputations that
limit speedup. We now know from numerous examples that there are plenty of
applications that can be effectively sped up by parallel computers, but
Amdahl’s Law doesn’t really help in understanding how much speedup you can
expect from your application. After all, few applications can be decomposed so
simply into just a serial part and a parallel part. Theory to the rescue!
A model for multithreaded execution
As with much of theoretical computer science, we need a model of multithreaded
execution in order to give a precise definition of parallelism. We can use the
dag model for multithreading, which I talked about in my blog, “Are
determinacy-race bugs lurking in your multicore application?” (A dag is a
directed acyclic graph.)
The dag model views the execution of a multithreaded
program as a set of instructions (the vertices of the dag) with graph edges
indicating dependencies between instructions. We say that an instruction
$x$ precedes an instruction $y$, sometimes denoted $x \prec y$, if $x$ must complete
before $y$ can begin. In a diagram for the dag, $x \prec y$ means that there is a
positive-length path from $x$ to $y$. If neither $x \prec y$ nor $y \prec x$, we say
the instructions are in parallel, denoted $x \parallel y$. The figure at right
illustrates a multithreaded dag that indicates, for example, $1 \prec 2$, $6 \prec 12$, and $4 \parallel 9$.
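To see how these relations can be checked mechanically, here is a short C sketch. The four-instruction diamond dag in it is made up for illustration (it is not the dag in the figure); precedence is decided by searching for a positive-length path:

```c
#include <stdio.h>
#include <stdbool.h>

#define N 4  /* instructions 0..3 of a hypothetical diamond dag */

/* edge[x][y] means instruction x immediately precedes instruction y.
   The dag has edges 0->1, 0->2, 1->3, 2->3. */
static const bool edge[N][N] = {
    /*         0      1      2      3   */
    /* 0 */ { false, true,  true,  false },
    /* 1 */ { false, false, false, true  },
    /* 2 */ { false, false, false, true  },
    /* 3 */ { false, false, false, false },
};

/* x precedes y iff the dag has a positive-length path from x to y.
   The recursion terminates because the graph is acyclic. */
static bool precedes(int x, int y) {
    if (edge[x][y]) return true;
    for (int z = 0; z < N; z++)
        if (edge[x][z] && precedes(z, y)) return true;
    return false;
}

/* x and y are in parallel iff neither precedes the other. */
static bool in_parallel(int x, int y) {
    return !precedes(x, y) && !precedes(y, x);
}

int main(void) {
    printf("0 precedes 3?    %s\n", precedes(0, 3) ? "yes" : "no");    /* yes */
    printf("1 parallel to 2? %s\n", in_parallel(1, 2) ? "yes" : "no"); /* yes */
    return 0;
}
```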
Just by eyeballing, what would you guess is the parallelism of the dag? About
$4$? About $5$? It turns out that two measures of the dag, called work and
span, allow us to define parallelism precisely, as well as to provide some key
bounds on performance. I’m going to christen these bounds “Laws,” so as to
compete with the Amdahl cognoscenti. If I’ve learned anything about business,
it’s the importance of marketing!
Work
The first important measure is work, which is what you get when you add up
the total amount of time for all the instructions. Assuming for simplicity that
it takes unit time to execute an instruction, the work for the example dag is
$18$, because there are 18 vertices in the dag. The literature contains
extensions to this theoretical model to handle nonunit instruction times,
caching, etc., but for now, dealing with these other effects will only
complicate matters.
Let’s adopt a simple notation. Let $T_P$ be the fastest possible execution time
of the application on $P$ processors. Since the work corresponds to the
execution time on $1$ processor, we denote it by $T_1$. Among the reasons that
work is an important measure is that it provides a bound — Oops, I mean Law
— on any $P$-processor execution time:

Work Law: $T_P \ge T_1/P$

The Work Law holds because, in our model, each processor executes at most $1$
instruction per unit time, and hence $P$ processors can execute at most $P$
instructions per unit time. Thus, to do all the work on $P$ processors, it must
take at least $T_1/P$ time.
We can interpret the Work Law in terms of speedup. Using our notation, the
speedup on $P$ processors is just $T_1/T_P$, which is how much faster the
application runs on $P$ processors than on $1$ processor. Rewriting the Work
Law, we obtain $T_1/T_P \le P$, which is to say that the speedup on $P$ processors
can be at most $P$. If the application obtains speedup proportional to $P$, we
say that the speedup is linear. If it obtains speedup exactly $P$ (which is
the best we can do in our model), we say that the application exhibits perfect
linear speedup. If the application obtains speedup greater than $P$ (which
can’t happen in our model due to the work bound, but can happen in models that
incorporate caching and other processor effects), we say that the application
exhibits superlinear speedup.
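As a quick illustration of the Work Law at work, here is a hypothetical C sketch (the running times in it are invented) that computes a speedup and checks it against the bound $P$:

```c
#include <stdio.h>

int main(void) {
    /* Invented measurements, in consistent time units. */
    double t1 = 18.0;  /* 1-processor execution time (the work) */
    double tp = 10.0;  /* measured P-processor execution time */
    int    p  = 2;

    double speedup = t1 / tp;
    printf("speedup on %d processors: %.2f\n", p, speedup);

    /* Work Law: T_P >= T_1/P, or equivalently speedup <= P. */
    if (speedup > p)
        printf("superlinear -- impossible in the simple model\n");
    else
        printf("consistent with the Work Law (speedup <= %d)\n", p);
    return 0;
}
```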
Span
The second important measure is span, which is the length of the longest path of
dependencies in the dag. The span of the dag in the figure is $9$, which
corresponds to the path $1 \to 2 \to 3 \to 6 \to 7 \to 8 \to 11 \to 12 \to 18$. This path is
sometimes called the critical path of the dag, and span is sometimes referred
to in the literature as critical-path length or depth. Since the span is
the theoretically fastest time the dag could be executed on a computer with an
infinite number of processors (assuming no overheads for communication,
scheduling, etc.), we denote it by $T_\infty$.
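Both measures can be computed mechanically from the dag. The C sketch below uses a made-up six-instruction fork-join dag (again, not the figure’s dag) and, assuming unit-time instructions, counts vertices to get the work and runs a longest-path recursion to get the span:

```c
#include <stdio.h>

#define N 6  /* a hypothetical 6-instruction dag */

/* succ[x][y] == 1 means an edge x -> y.  This fork-join dag is
   0 -> {1,2,3} -> 4 -> 5, so work = 6 and span = 4. */
static const int succ[N][N] = {
    {0, 1, 1, 1, 0, 0},
    {0, 0, 0, 0, 1, 0},
    {0, 0, 0, 0, 1, 0},
    {0, 0, 0, 0, 1, 0},
    {0, 0, 0, 0, 0, 1},
    {0, 0, 0, 0, 0, 0},
};

/* Number of vertices on the longest path starting at x.
   (Fine for a toy dag; a real implementation would memoize.) */
static int longest_from(int x) {
    int best = 1;  /* vertex x itself takes unit time */
    for (int y = 0; y < N; y++) {
        if (succ[x][y]) {
            int len = 1 + longest_from(y);
            if (len > best) best = len;
        }
    }
    return best;
}

int main(void) {
    int work = N;  /* unit-time instructions: work = vertex count */
    int span = 0;
    for (int x = 0; x < N; x++) {  /* maximize over all start vertices */
        int len = longest_from(x);
        if (len > span) span = len;
    }
    printf("work T1 = %d, span Tinf = %d\n", work, span);
    return 0;
}
```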
Like work, span also provides a bou…, uhhh, Law on $P$-processor execution
time:

Span Law: $T_P \ge T_\infty$

The Span Law holds for the simple reason that a finite number of processors
cannot outperform an infinite number of processors, because the
infinite-processor machine could just ignore all but $P$ of its processors and
mimic a $P$-processor machine exactly.
Parallelism
Parallelism is defined as the ratio of work to span, or $T_1/T_\infty$. Why does
this definition make sense? There are several ways to understand it:
- The parallelism $T_1/T_\infty$ is the average amount of work along each step of
the critical path.
- The parallelism $T_1/T_\infty$ is the maximum possible speedup that can be
obtained by any number of processors.
- Perfect linear speedup cannot be obtained for any number of processors
greater than the parallelism $T_1/T_\infty$. To see this third point, suppose
that $P > T_1/T_\infty$, in which case the Span Law implies that the
speedup satisfies $T_1/T_P \le T_1/T_\infty < P$. Since the speedup is
strictly less than $P$, it cannot be perfect linear speedup. Note also that
if $P \gg T_1/T_\infty$, then $T_1/T_P \ll P$ — the more processors you have beyond
the parallelism, the less “perfect” the speedup.
For our example, the parallelism is $18/9 = 2$. Thus, no matter how many
processors execute the program, the greatest speedup that can be attained is
only $2$, which frankly isn’t much. Somehow, to my eye, it looks like more, but
the math doesn’t lie.
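Putting the two Laws together, any $P$-processor execution time is bounded below by $\max(T_1/P, T_\infty)$, so the speedup is bounded above by $\min(P, T_1/T_\infty)$. Here is that arithmetic in a minimal C sketch, plugging in $T_1 = 18$ and $T_\infty = 9$ from the example:

```c
#include <stdio.h>

int main(void) {
    const double t1   = 18.0;  /* work of the example dag */
    const double tinf =  9.0;  /* span of the example dag */

    for (int p = 1; p <= 8; p *= 2) {
        /* Work Law: T_P >= T_1/P.  Span Law: T_P >= T_inf. */
        double lower = (t1 / p > tinf) ? t1 / p : tinf;
        /* Best speedup consistent with both Laws. */
        printf("P = %d: T_P >= %4.1f, speedup <= %.1f\n",
               p, lower, t1 / lower);
    }
    return 0;
}
```

The printout shows the speedup bound flattening at $2$ as soon as $P$ exceeds the parallelism.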
Amdahl’s Law Redux
Amdahl’s Law for the case where a fraction $p$ of the application is parallel
and a fraction $1-p$ is serial simply amounts to the special case where
$T_\infty \ge (1-p)T_1$. In this case, the maximum possible speedup is
$T_1/T_\infty \le 1/(1-p)$. Amdahl’s Law is simple, but the Work and Span Laws are far more
powerful.
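The reasoning behind this special case deserves one line: the serial fraction must execute as a chain of dependent instructions, so it lower-bounds the span. In symbols (a sketch of the standard argument, not a quote from Amdahl):

$$T_\infty \;\ge\; (1-p)\,T_1 \quad\Longrightarrow\quad \frac{T_1}{T_P} \;\le\; \frac{T_1}{T_\infty} \;\le\; \frac{1}{1-p}.$$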
In particular, the theory of work and span has led to an excellent
understanding of multithreaded scheduling, at least for those who know the
theory. As it turns out, scheduling a multithreaded computation to achieve
optimal performance is NP-complete, which in lay terms means that it is
computationally intractable. Nevertheless, there are practical scheduling
algorithms based on work and span that can schedule any multithreaded
computation near-optimally. The OpenCilk runtime system contains such a near-optimal
scheduler. I’ll talk about multithreaded scheduling in another blog, where I’ll
show how the Work and Span Laws really come into play.
¹Amdahl, Gene M. “Validity of the single processor approach to achieving
large scale computing capabilities.” Proceedings of the AFIPS Spring Joint
Computer Conference, April 1967, pp. 483–485.