Termination detection for fine-grained message-passing architectures
Matthew Naylor (1), Simon W. Moore (1), Andrey Mokhov (2), David Thomas (3), Jonathan R. Beaumont (3), Shane Fleming (4), A. Theodore Markettos (1), Thomas Bytheway (1) and Andrew Brown (5)
(1) University of Cambridge, UK; (2) Newcastle University and Jane Street, UK; (3) Imperial College London, UK; (4) Microsoft Research, UK; (5) University of Southampton, UK
Barrier primitives provided by standard parallel programming APIs are the primary means by which applications
implement global synchronisation. Typically, these primitives are fully committed to synchronisation in the sense that, once a
barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what
happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not
interact with message-passing in any useful way.
In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive,
efficient, well-defined API. It has a clear semantics based on termination detection, and supports the development of both
globally-synchronous and asynchronous parallel applications. To evaluate the new primitive, we implement it in a prototype
large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for
the primitive leads to a highly efficient implementation, capable of synchronisation rates an order of magnitude higher
than those achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range
of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.
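
The idea of a barrier whose semantics rest on termination detection can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions, not the paper's API or its hardware implementation: the class `RefutableBarrierSim` and its methods are hypothetical names, and the model assumes that a message arriving at a thread withdraws that thread from the barrier, and that the barrier completes only when every thread has entered it and no messages remain undelivered.

```python
from collections import deque


class RefutableBarrierSim:
    """Toy model of a refutable barrier (all names are hypothetical)."""

    def __init__(self, num_threads):
        self.in_barrier = [False] * num_threads
        self.inbox = [deque() for _ in range(num_threads)]
        self.in_flight = 0  # messages sent but not yet consumed

    def send(self, dest, payload):
        """Deliver a message; this refutes the destination's barrier entry."""
        self.inbox[dest].append(payload)
        self.in_flight += 1
        self.in_barrier[dest] = False

    def receive(self, tid):
        """Consume the oldest pending message for thread `tid`."""
        payload = self.inbox[tid].popleft()
        self.in_flight -= 1
        return payload

    def enter_barrier(self, tid):
        """Enter the barrier, but only if the thread has no pending messages."""
        if not self.inbox[tid]:
            self.in_barrier[tid] = True

    def terminated(self):
        """Assumed completion rule: all threads waiting, nothing in flight."""
        return all(self.in_barrier) and self.in_flight == 0


sim = RefutableBarrierSim(2)
sim.enter_barrier(0)
sim.send(0, "work")           # pulls thread 0 back out of the barrier
sim.enter_barrier(1)
before = sim.terminated()     # thread 0 still has work pending
sim.receive(0)
sim.enter_barrier(0)
after = sim.terminated()      # all threads idle and no messages in flight
```

In this model a synchronous application would treat each completed barrier as the end of a time step, while an asynchronous one would treat it as the signal that the whole computation has finished.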