2017-10-19
The bakery algorithm was proposed by Leslie Lamport as a solution to Dijkstra’s concurrent programming problem. In the problem, Dijkstra had first identified the need for mutual exclusion among a group of concurrently executing processes.
We want to run N processes concurrently, each needing exclusive access to a
shared resource. Modern-day languages solve this with help from hardware. But
the bakery algorithm achieves mutual exclusion purely in software, by making
sure that at most one process is executing its critical section, while other
non-faulty processes are either in their non-critical section, or spinning in
place, waiting for their turn at the critical section. In other words, mutual
exclusion in the critical section is achieved in software. This is
inefficient, since it involves processes “spinning” in a loop while waiting
for a chance to execute the critical section.
What is remarkable about this algorithm (as Lamport points out multiple times
in the paper) is that even though reads from memory may not be atomic (which
matters when a read overlaps a write to the same location), the algorithm
still works, since it does not care what exact value such a read returns, and
makes do with a distinction between zero and nonzero values. Here’s an
implementation followed by a description (the code is on GitHub:
go get github.com/yati-sagade/bakery):
package main

import (
    "flag"
    "fmt"
    "time"
)

// max returns the largest element of xs; it panics on an empty slice.
func max(xs []int) int {
    if len(xs) < 1 {
        panic("max() on empty slice")
    }
    ret := xs[0]
    for i := 1; i < len(xs); i++ {
        if xs[i] > ret {
            ret = xs[i]
        }
    }
    return ret
}

func main() {
    n := flag.Int("nodes", 5, "number of participating nodes")
    iters := flag.Int("iters", 100000, "number of iterations")
    debug := flag.Bool("debug", false, "print debug trace")

    flag.Parse()

    // 1. process i is in the doorway while choosing[i] == true
    // 2. process i is in the bakery from when it resets choosing[i] to false till
    //    either it fails, or finishes its critical section.
    choosing := make([]bool, *n+1, *n+1) // one extra for the monitoring process
    numbers := make([]int, *n+1, *n+1)

    sharedMap := map[string]int{
        "last_updated_by": -1,
    }
    sharedSlice := make([]int, 0)

    lock := func(id int) {
        choosing[id] = true

        /* At the doorway */
        if *debug {
            fmt.Printf("%d: choosing\n", id)
        }
        numbers[id] = 1 + max(numbers)
        if *debug {
            fmt.Printf("%d: chose number %d\n", id, numbers[id])
        }
        choosing[id] = false

        /* Entered bakery */
        for k := 0; k <= *n; k++ { /* L1 */
            printed := false
            for choosing[k] { /* L2 */
                // spin
                if *debug && !printed {
                    fmt.Printf("%d: spinning for %d (waiting for it to choose: %v)\n",
                        id, k, choosing)
                    printed = true
                }
                // this is important, as tight spinning here would cause only
                // the first $n_core goroutines to be scheduled, starving
                // all others. this would in turn cause the whole algorithm
                // to stop making any progress.
                time.Sleep(10 * time.Millisecond)
            }
            printed = false
            for numbers[k] != 0 &&
                ((numbers[k] < numbers[id]) ||
                    (numbers[k] == numbers[id] && k < id)) { /* L3 */
                // spin
                if *debug && !printed {
                    fmt.Printf("%d: spinning for %d (waiting for it to finish CS: %v)\n",
                        id, k, numbers)
                    printed = true
                }
                time.Sleep(10 * time.Millisecond)
            }
        }
    }

    unlock := func(id int) {
        if *debug {
            fmt.Printf("%d: unlock: %v\n", id, numbers)
        }
        numbers[id] = 0
    }

    // proc repeatedly locks, mutates the shared data, and unlocks, until told
    // to stop.
    proc := func(id int, stop chan struct{}, ack chan struct{}) {
    out:
        for {
            select {
            case <-stop:
                break out
            default:
                lock(id)
                fmt.Printf("%d: entering critical section\n", id)
                // critical section
                sharedMap["last_updated_by"] = id
                sharedSlice = append(sharedSlice, id)
                sharedSlice = sharedSlice[1:]
                // end critical section
                fmt.Printf("%d: done critical section\n", id)
                unlock(id)
            }
        }
        ack <- struct{}{}
    }

    stops := make([]chan struct{}, 0)
    acks := make([]chan struct{}, 0)
    for i := 0; i < *n; i++ {
        c := make(chan struct{})
        d := make(chan struct{})
        stops = append(stops, c)
        acks = append(acks, d)
        go proc(i, c, d)
    }

    // Monitor and quit after some time
    for i := 1; i < *iters; i++ {
        lock(*n)
        if *debug {
            fmt.Printf("monitor lock acquired: %v\n", sharedSlice)
        }
        if len(sharedSlice) > 1 {
            panic(fmt.Sprintf("found concurrent access with %v", sharedSlice))
        }
        // Go will crash the program when sharedMap is written to by two goroutines
        unlock(*n)
        if *debug {
            fmt.Printf("monitor lock released: %v\n", sharedSlice)
        }
    }

    for _, stop := range stops {
        fmt.Println("stopping")
        stop <- struct{}{}
    }
    for _, ack := range acks {
        <-ack
    }
}
So we have n goroutines, and each wants to mutate a shared map (by updating
the single value in it) and a shared slice (by appending the process id to it,
and then reslicing it to maintain a length of one). There is also the main
monitoring goroutine that checks whether the shared slice ever contains more
than one item. Simultaneous mutation of a map by more than one goroutine is
detected by the Go runtime, which crashes the program in such a case.
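As an aside, here is roughly what that detection looks like. The snippet below is a hypothetical demo, separate from the bakery repository: two goroutines writing to the same map with no synchronization will typically make the runtime abort the program with a “concurrent map writes” fatal error.

package main

// Hypothetical demo, not part of the bakery code: two goroutines write to the
// same map with no synchronization. The Go runtime's concurrent map access
// detector will typically abort the program with
// "fatal error: concurrent map writes".
func main() {
    m := map[string]int{"last_updated_by": -1}
    done := make(chan struct{})
    go func() {
        for i := 0; i < 1000000; i++ {
            m["last_updated_by"] = 1
        }
        close(done)
    }()
    for i := 0; i < 1000000; i++ {
        m["last_updated_by"] = 2
    }
    <-done
}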
The key idea of the algorithm is that, just like in an old-fashioned bakery (or a bank), each participant gets a token, or a number, roughly in the order of arrival. Participants then get serviced in order of their token numbers. The catch is that in our situation there is no single receptionist to hand these tokens out to participant processes. So our solution is to have processes pick their own numbers, which means two processes can end up picking the same number; such a tie is broken by letting the process with the lower id enter the critical section first.
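As a minimal sketch (the helper below is hypothetical, not part of the repository), the “who goes first” test amounts to a lexicographic comparison of (number, id) pairs. Loop L3 in the code spins exactly while the other process k, by this comparison, goes before the current process and still holds a nonzero number.

// goesBefore reports whether a process with token ni and id i should be
// serviced before a process with token nk and id k: the lower token number
// goes first, and ties are broken by the lower process id.
func goesBefore(ni, i, nk, k int) bool {
    if ni != nk {
        return ni < nk
    }
    return i < k
}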
The lock routine is the meat of the bakery algorithm. There are two shared
slices, choosing and numbers, and process i only writes to choosing[i] and
numbers[i], but can read from any other indices in choosing and numbers.
Our system with N processes (goroutines, here) functions such that at any
given time, at most one process is in the critical section. To prove this,
let’s see the execution from process i’s perspective, and first define some
terms:
- \(t^{ik}_{L2}\): the time at which process i completes the last iteration of loop L2 for process k. This is when it sees choosing[k] == false.
- \(t^{ik}_{L3}\): the time at which process i completes the last iteration of loop L3 for process k. This happens when process i sees a 0 in numbers[k], or if it thinks it should be serviced before process k owing to a lower token number.
- \(t^k_d\): the time at which process k completes setting choosing[k] = true, i.e., enters the doorway.
- \(t^k_e\): the time at which process k completes setting choosing[k] = false after choosing a number, i.e., enters the bakery.

We can now make some observations:
- \(t^k_d \lt t^k_e\).
- At time \(t^k_e\), numbers[k] is completely written, and a read of numbers[k] at or after this point will not return a partial result.
- At \(t^{ik}_{L2}\), process k is either just out of the doorway (it has set choosing[k] = false) and in the bakery (waiting for other processes, or executing its critical section); or it is neither in the doorway nor in the bakery (perhaps executing some non-critical section code). These correspond to \(t^{ik}_{L2} \gt t^k_e\) and \(t^{ik}_{L2} \lt t^k_d\), respectively.
- When \(t^{ik}_{L2} \gt t^k_e\), we also have \(t^{ik}_{L3} \gt t^k_e\). This means that process i found (at time \(t^{ik}_{L3}\)) either numbers[i] < numbers[k], or, if the two numbers are equal, that i < k. Note that k must have picked a number at least as large as numbers[i] in this case.
- When \(t^{ik}_{L2} \lt t^k_d\), it must be that numbers[i] < numbers[k], since when k does enter the doorway (after the moment \(t^k_d\)), it will pick a number that is at least one larger than numbers[i], which by then has been completely written.
Hence, in both these cases, process i does not wait for process k.
Generalizing this over all other processes k that a given process i shares
resources with, it can be seen that at most one process at a time will be able
to execute its critical section.
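Summarizing the two cases in one place (this is just a restatement of the observations above, in the same notation):

\[
\begin{aligned}
t^{ik}_{L2} > t^k_e \;&\Rightarrow\; t^{ik}_{L3} > t^k_e
  \text{ and } i \text{ saw } \texttt{numbers}[i] < \texttt{numbers}[k],
  \text{ or } \texttt{numbers}[i] = \texttt{numbers}[k] \text{ with } i < k,\\
t^{ik}_{L2} < t^k_d \;&\Rightarrow\;
  \texttt{numbers}[k] \ge \texttt{numbers}[i] + 1 > \texttt{numbers}[i].
\end{aligned}
\]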
If a process fails after setting choosing[k] = true, and then keeps failing
and recovering, never resetting choosing[k], a deadlock can be reached, as
other processes will keep spinning, waiting on the failed process.
If we do not yield to the runtime by calling time.Sleep() in every loop
iteration, only as many goroutines as there are cores on our machine will be
able to run, causing the algorithm to stop making any progress.
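An alternative way to yield (a sketch under my own assumptions, not what the repository does) is runtime.Gosched(), which hands the processor back to the Go scheduler without sleeping. The standalone snippet below uses an atomic flag only so that this example itself is race-free; the spin loops in lock could yield the same way.

package main

import (
    "fmt"
    "runtime"
    "sync/atomic"
)

func main() {
    var ready int32

    go func() {
        atomic.StoreInt32(&ready, 1)
    }()

    // Spin until the other goroutine signals readiness, yielding to the
    // scheduler on every iteration instead of sleeping, so that the spinner
    // does not monopolize its OS thread.
    for atomic.LoadInt32(&ready) == 0 {
        runtime.Gosched()
    }
    fmt.Println("done spinning")
}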