AMDGPU Execution Synchronization
This document covers different ways of synchronizing execution of threads on AMD GPUs.
Note
This document is not exhaustive. There may be more ways of synchronizing execution that are not covered by this document.
Barriers
This section covers execution synchronization using barrier-style primitives.
Execution Model
This section contains a formal execution model that can be used to model the behavior of barriers on AMDGPU targets.
Note
The barrier execution model is experimental and subject to change.
Threads can synchronize execution by performing barrier operations on barrier objects as described below:
Each barrier object has the following state:
An unsigned positive integer expected count: counts the number of arrive operations expected for this barrier object.
An unsigned non-negative integer arrive count: counts the number of arrive operations already performed on this barrier object.
The initial value of arrive count is zero.
When an operation causes arrive count to be equal to expected count, the barrier is completed, and the arrive count is reset to zero.
Barrier objects exist within a scope (see AMDHSA LLVM Sync Scopes), and each instance of a barrier object can only be accessed by threads in the same scope instance.
Barrier-mutually-exclusive is a symmetric relation between barrier objects that share resources in a way that restricts how a thread can use them at the same time.
Barrier operations are performed on barrier objects. A barrier operation is a dynamic instance of one of the following:
Barrier init
Barrier init takes an additional unsigned positive integer argument k.
Sets the expected count of the barrier object to k.
Resets the arrive count of the barrier object to zero.
Barrier join.
Allows the thread that executes the operation to wait on a barrier object.
Barrier drop.
Decrements expected count of the barrier object by one.
Barrier arrive.
Increments the arrive count of the barrier object by one.
If supported, an additional argument to arrive can also update the expected count of the barrier object before the arrive count is incremented; the new expected count cannot be less than or equal to the arrive count, otherwise the behavior is undefined.
Barrier wait.
Introduces execution dependencies between threads; this operation depends on other barrier operations to complete.
Barrier modification operations are barrier operations that modify the barrier object state:
Barrier init.
Barrier drop.
Barrier arrive.
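The state and the three modification operations above can be sketched as a small Python model. This is only an illustrative toy, not a hardware or compiler API: the class name `BarrierObject`, the `completed_phases` counter, and the choice to never treat an expected count of zero as a completion are assumptions of this sketch.

```python
class BarrierObject:
    """Toy model of the barrier object state (hypothetical, for illustration)."""

    def __init__(self):
        self.expected_count = None  # unset until a barrier init is performed
        self.arrive_count = 0       # initial value of arrive count is zero
        self.completed_phases = 0   # bookkeeping for this sketch only

    def init(self, k):
        if k <= 0:
            raise ValueError("k must be a positive integer")
        self.expected_count = k
        self.arrive_count = 0

    def drop(self):
        self._require_init()
        self.expected_count -= 1
        self._maybe_complete()

    def arrive(self):
        self._require_init()
        self.arrive_count += 1
        self._maybe_complete()

    def _require_init(self):
        # The model makes this undefined behavior; the sketch raises instead.
        if self.expected_count is None:
            raise RuntimeError("barrier object is not initialized")

    def _maybe_complete(self):
        # Sketch assumption: an expected count of zero never completes.
        if self.expected_count > 0 and self.arrive_count == self.expected_count:
            self.arrive_count = 0
            self.completed_phases += 1
```

Note that in this model a drop can complete the barrier just like an arrive can, because either operation may make the two counts equal.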
Thread-barrier-order<BO> is the subset of program-order that only relates barrier operations performed on a barrier object BO.
All barrier modification operations on a barrier object BO occur in a strict total order called barrier-modification-order<BO>; it is the order in which BO observes barrier operations that change its state. For any valid barrier-modification-order<BO>, the following must be true:
Let A and B be two barrier modification operations where A -> B in thread-barrier-order<BO>, then A -> B is also in barrier-modification-order<BO>.
The first element in barrier-modification-order<BO> is always a barrier init, otherwise the behavior is undefined.
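The two validity rules for barrier-modification-order<BO> can be checked mechanically. The sketch below assumes a hypothetical encoding in which each operation is a `(thread, index, kind)` tuple, where `index` is the operation's position in its thread's thread-barrier-order<BO>; the function name is also made up for this example, and an empty order is treated as vacuously valid.

```python
def is_valid_modification_order(mod_order):
    """Check a candidate barrier-modification-order for one barrier object.

    mod_order: list of (thread, index, kind) tuples (hypothetical encoding).
    """
    if not mod_order:
        return True  # sketch assumption: nothing to check
    # Rule: the first element is always a barrier init.
    if mod_order[0][2] != "init":
        return False
    last_seen = {}
    for thread, index, kind in mod_order:
        # Only modification operations belong in this order.
        if kind not in ("init", "drop", "arrive"):
            return False
        # Rule: the total order must agree with each thread's own order.
        if thread in last_seen and index <= last_seen[thread]:
            return False
        last_seen[thread] = index
    return True
```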
barrier-participates-in relates barrier operations to the barrier waits that depend on them to complete. A barrier operation X barrier-participates-in a barrier wait W if and only if all of the following are true:
X and W are both performed on the same barrier object BO.
X is a barrier arrive or drop operation.
X does not barrier-participate-in another distinct barrier wait W' in the same thread as W.
W -> X is not in thread-barrier-order<BO>.
All dependent constraints and relations are satisfied as well. [0]
For the set S consisting of all barrier operations that barrier-participate-in a barrier wait W for some barrier object BO:
The elements of S all exist in a continuous, uninterrupted interval of barrier-modification-order<BO>.
The arrive count of BO is zero before the first operation of S in barrier-modification-order<BO>.
The arrive count and expected count of BO are equal after the last operation of S in barrier-modification-order<BO>. The arrive count and expected count of BO cannot be equal at any other point in S.
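The three constraints on the set S can be illustrated by replaying a modification order and tracking both counts. This is a rough sketch under stated assumptions: operations are encoded as the hypothetical tuples `("init", k)`, `("arrive",)` and `("drop",)`, S is given as a set of positions in the order, an empty S is rejected, and the order is assumed to start with an init.

```python
def check_wait_set(mod_order, s_indices):
    """Check the interval rules for the set S of one barrier wait."""
    idx = sorted(s_indices)
    if not idx or idx[-1] >= len(mod_order):
        return False
    # Only arrive and drop operations can be members of S.
    if any(mod_order[i][0] not in ("arrive", "drop") for i in idx):
        return False
    # Rule 1: S occupies one contiguous interval of the order.
    if idx != list(range(idx[0], idx[-1] + 1)):
        return False
    expected, arrive = None, 0
    for i, op in enumerate(mod_order):
        # Rule 2: arrive count is zero before the first operation of S.
        if i == idx[0] and arrive != 0:
            return False
        if op[0] == "init":
            expected, arrive = op[1], 0
        elif op[0] == "arrive":
            arrive += 1
        elif op[0] == "drop":
            expected -= 1
        equal = arrive == expected
        if i == idx[-1]:
            # Rule 3: the counts are equal after the last operation of S.
            return equal
        if i >= idx[0] and equal:
            # The counts may not be equal anywhere else inside S.
            return False
        if equal:
            arrive = 0  # completion resets the arrive count
    return False
```

For an order made of an init with k = 2 followed by four arrives, positions {1, 2} and {3, 4} each form a valid S, while {2, 3} does not because the arrive count is already one before position 2.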
A barrier join J is barrier-joined-before a barrier operation X if and only if all of the following are true:
J -> X in thread-barrier-order<BO>.
X is not a barrier join.
There is no barrier join or drop JD where J -> JD -> X in thread-barrier-order<BO>.
There is no barrier join J' on a distinct barrier object BO' such that J -> J' -> X in program-order, and BO barrier-mutually-exclusive BO'.
A barrier operation A barrier-executes-before another barrier operation B if any of the following is true:
A -> B in program-order.
A -> B in barrier-participates-in.
A barrier-executes-before some barrier operation X, and X barrier-executes-before B.
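Because the third rule is recursive, barrier-executes-before is the transitive closure of program-order combined with barrier-participates-in. A minimal sketch, assuming both relations are given as sets of `(a, b)` pairs over hypothetical operation labels:

```python
def barrier_executes_before(program_order, participates_in):
    """Return the transitive closure of the two input relations.

    Both arguments are sets of (a, b) pairs meaning "a -> b".
    """
    closure = set(program_order) | set(participates_in)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # Chain a -> b and b -> d into a -> d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```

For example, if thread 0 performs an arrive `A0` that barrier-participates-in thread 1's wait `W1`, and thread 1 later performs `N1`, the closure contains `("A0", "N1")` even though the two operations are in different threads.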
Barrier-executes-before is consistent with barrier-modification-order<BO> for every barrier object BO.
For every barrier drop D performed on a barrier object BO:
There is a barrier join J such that J -> D in barrier-joined-before; otherwise, the behavior is undefined.
D cannot cause the expected count of BO to become negative; otherwise, the behavior is undefined.
For every pair of barrier arrive A and barrier drop D performed on a barrier object BO, such that A -> D in thread-barrier-order<BO>, one of the following must be true:
A does not barrier-participate-in any barrier wait.
A barrier-participates-in at least one barrier wait W such that W -> D in barrier-executes-before.
For every barrier wait W performed on a barrier object BO:
There is a barrier join J such that J -> W in barrier-joined-before, and J must barrier-execute-before at least one operation X that barrier-participates-in W; otherwise, the behavior is undefined.
barrier-phase-with is a symmetric relation over barrier operations defined as the transitive closure of: barrier-participates-in and its inverse relation.
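Since barrier-phase-with is the transitive closure of a relation together with its inverse, it groups operations into connected components. The following sketch computes those groups with a small union-find; the function name and the string labels for operations are assumptions made for illustration.

```python
def barrier_phases(participates_in):
    """Group operations related by barrier-phase-with.

    participates_in: set of (operation, wait) pairs.
    Returns a list of sets, one per group, in a deterministic order.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Joining both ends of every pair covers the relation and its inverse.
    for op, wait in participates_in:
        parent[find(op)] = find(wait)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Two operations are barrier-phase-with each other exactly when they land in the same returned set.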
For every barrier operation A that barrier-participates-in a barrier wait W on a barrier object BO:
There is no barrier operation X on BO such that A -> X -> W in barrier-executes-before, and X barrier-phase-with a non-empty set of operations that does not include W.
Note
Barriers only synchronize execution and do not affect the visibility of memory operations between threads. Refer to the execution barriers memory model to determine how to synchronize memory operations through barrier-executes-before.
Informational Notes
Informally, we can deduce from the above formal model that execution barriers behave as follows:
Barrier-executes-before relates the dynamic instances of operations from different threads together. For example, if A -> B in barrier-executes-before, then the execution of A must complete before the execution of B can complete.
This property can also be combined with program-order. For example, given two (non-barrier) operations X and Y where X -> A and B -> Y in program-order, we know that the execution of X completes before the execution of Y does.
Barriers do not complete “out-of-thin-air”; a barrier wait W cannot depend on a barrier operation X to complete if W -> X in barrier-executes-before.
It is undefined behavior to operate on an uninitialized barrier object.
It is undefined behavior for a barrier wait to never complete.
It is not mandatory to drop a barrier after joining it.
A thread may not arrive and then drop a barrier object unless the barrier completes before the barrier drop. Incrementing the arrive count and decrementing the expected count directly after may cause undefined behavior.
Joining a barrier is only useful if the thread will wait on that same barrier object later.
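The note above about arriving and then dropping can be made concrete by replaying a modification order and flagging drops issued by a thread whose arrive still belongs to an incomplete phase. This is a deliberately conservative approximation of the formal rule, not an exact check: it also flags arrives that never participate in any wait. The tuple encoding `(thread, kind)` or `(thread, "init", k)` is hypothetical, and the order is assumed to start with an init.

```python
def pending_arrive_drops(mod_order):
    """Return positions of drops that follow a still-pending arrive
    from the same thread (conservative approximation)."""
    expected, arrive = None, 0
    pending = set()  # threads with an arrive in the current, incomplete phase
    flagged = []
    for i, (thread, kind, *args) in enumerate(mod_order):
        if kind == "init":
            expected, arrive = args[0], 0
            pending.clear()
        elif kind == "arrive":
            arrive += 1
            pending.add(thread)
        elif kind == "drop":
            if thread in pending:
                flagged.append(i)
            expected -= 1
        if arrive == expected:
            # The barrier completed: the phase is over for every thread.
            arrive = 0
            pending.clear()
    return flagged
```

With an expected count of two, a thread that arrives and immediately drops makes both counts equal to one, so the barrier would appear complete without the second thread ever arriving; the sketch flags that drop.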
Barrier Implementations on AMDGPU Targets
s_barrier
The s_barrier instructions are the primary barrier implementation on AMD GPUs.
s_barrier instructions can only be used to synchronize threads at a wavefront granularity.
s_barrier instructions are convergent within a wave, and thus can only be performed
in wave-uniform control flow.
The s_barrier family of instructions is available in some form on all GFX targets,
and has evolved over time. The sub-sections below cover the capabilities offered by every major
iteration of this feature separately.
GFX6-11
Targets from GFX6 through GFX11 inclusive do not have the “split barrier” feature.
The barrier arrive and barrier wait operations cannot be performed independently
using s_barrier.
There is only one workgroup barrier object of workgroup scope that is implicitly used
by all s_barrier instructions.
The following code sequences can be used to implement the barrier operations defined by the
execution synchronization model using
s_barrier on GFX6 through GFX11:
| Barrier Operation(s) | Barrier Object | AMDGPU Machine Code |
|---|---|---|
| Init, Join and Drop | | |
| init | | Automatically initialized by the hardware when a workgroup is launched. The expected count of this barrier is set to the number of waves in the workgroup. |
| join | | Any thread launched within a workgroup automatically joins this barrier object. |
| drop | | When a thread ends, it automatically drops this barrier object if it had previously joined it. |
| Arrive and Wait | | |
| arrive then wait | | BackOffBarrier: s_barrier. No BackOffBarrier: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0), then s_waitcnt_vscnt null, 0x0, then s_barrier. |
| arrive | | Not available separately, see arrive then wait |
| wait | | Not available separately, see arrive then wait |
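
As a loose host-side analogy for the fused "arrive then wait" behavior, Python's `threading.Barrier.wait()` likewise arrives and waits in a single call. The sketch below is not GPU code and says nothing about the hardware; the function name `run_workgroup` and the idea of one Python thread per wave are assumptions made only for illustration.

```python
import threading


def run_workgroup(num_waves):
    """Run num_waves threads that each hit one fused arrive-then-wait."""
    barrier = threading.Barrier(num_waves)  # expected count = number of waves
    log = []

    def wave():
        log.append("before")
        barrier.wait()  # arrive and wait cannot be separated here
        log.append("after")

    threads = [threading.Thread(target=wave) for _ in range(num_waves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Every "before" entry is appended before any "after" entry, because no thread returns from `wait()` until all of them have called it.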
GFX12
GFX12 targets have the split-barrier feature, and also allow s_barrier instructions to use
one of multiple barrier objects available per workgroup. s_barrier instructions use the
barrier ID operand to determine the barrier object they operate on.
GFX12.5 additionally introduces new barrier objects that offer more flexibility for synchronizing the execution
of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster, via
s_barrier.
Note
Check the table below to determine
which barrier IDs are available to s_barrier instructions on a given target.
The following code sequences can be used to implement the barrier operations defined by the
execution synchronization model using
s_barrier on GFX12.0 and up:
| Barrier Operation(s) | Barrier ID | AMDGPU Machine Code |
|---|---|---|
| Init, Join and Drop | | |
| init | | Automatically initialized by the hardware when a workgroup is launched. The expected count of this barrier is set to the number of waves in the workgroup. |
| init | | Automatically initialized by the hardware when a workgroup is launched as part of a workgroup cluster. The expected count of this barrier is set to the number of workgroups in the workgroup cluster. |
| init | | Automatically initialized by the hardware and always available. This barrier object is opaque and immutable as all operations other than barrier join are no-ops. |
| init | | s_barrier_init <N> |
| join | | Any thread launched within a workgroup automatically joins this barrier object. |
| join | | Any thread launched within a workgroup cluster automatically joins this barrier object. |
| join | | s_barrier_join <N> |
| drop | | s_barrier_leave |
| drop | | When a thread ends, it automatically drops this barrier object if it had previously joined it. |
| Arrive and Wait | | |
| arrive | | s_barrier_signal <N> or s_barrier_signal_isfirst <N> |
| wait | | |
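
The benefit of a split barrier is that a thread can arrive, do unrelated work, and only wait later. The following host-side Python sketch models that pattern; it is an analogy, not GPU code. The `SplitBarrier` class, its `signal` and `wait` methods, and the phase token passed between them are all assumptions of this sketch and do not describe how the hardware tracks phases.

```python
import threading


class SplitBarrier:
    """Toy split barrier: arrive (signal) and wait are separate calls."""

    def __init__(self, expected):
        self._cond = threading.Condition()
        self._expected = expected
        self._arrived = 0
        self._phase = 0

    def signal(self):
        """Arrive, and return the phase to pass to wait() later."""
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self._expected:
                self._arrived = 0  # completion resets the arrive count
                self._phase += 1
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        """Block until the phase that signal() arrived in has completed."""
        with self._cond:
            self._cond.wait_for(lambda: self._phase > phase)


def run_waves(num_waves):
    barrier = SplitBarrier(num_waves)
    log = []

    def wave():
        log.append("pre")
        token = barrier.signal()
        log.append("independent")  # work that does not need the barrier
        barrier.wait(token)
        log.append("post")

    threads = [threading.Thread(target=wave) for _ in range(num_waves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the resulting log, every "pre" entry precedes the first "post" entry, while "independent" entries may interleave freely with the other threads' arrivals.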
The following barrier IDs are available:
| Barrier ID | Scope | Availability | Description |
|---|---|---|---|
| | | GFX12.5 | Cluster trap barrier; cluster barrier object for use by all workgroups of a workgroup cluster. Dedicated for the trap handler and only available in privileged execution mode (not accessible by the shader). |
| | | GFX12.5 | Cluster user barrier; cluster barrier object for use by all workgroups of a workgroup cluster. |
| | | GFX12 (all) | Workgroup trap barrier, dedicated for the trap handler and only available in privileged execution mode (not accessible by the shader). |
| | | GFX12 (all) | Workgroup barrier. |
| 0 | | GFX12.5 | NULL named barrier object. Barrier-mutually-exclusive with barriers [1, 16]. |
| [1, 16] | | GFX12.5 | Named barrier object. All barrier objects in this range are barrier-mutually-exclusive with other barriers in |
Informally, we can note that:
All operations on the NULL named barrier object other than join are no-ops.
As the NULL named barrier object (barrier ID 0) is barrier-mutually-exclusive with all other named barrier objects (barrier IDs [1, 16]), a thread can use a join on the NULL barrier as a way to “unjoin” a named barrier (break barrier-joined-before) without having to use a drop operation.
When a thread ends, it does not implicitly drop any named barrier objects (barrier IDs [0, 16]) it has joined.
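The “unjoin” effect of joining the NULL named barrier follows from the barrier-joined-before definition in the execution model. The sketch below evaluates that definition for one thread's program order, assuming a hypothetical encoding of operations as `(kind, barrier_id)` tuples. The `mutually_exclusive` helper encodes only the relation stated in the notes above (ID 0 against IDs 1 through 16).

```python
NULL_BARRIER = 0
NAMED_BARRIERS = range(1, 17)  # barrier IDs [1, 16]


def mutually_exclusive(a, b):
    """NULL named barrier versus any other named barrier."""
    return (a == NULL_BARRIER and b in NAMED_BARRIERS) or (
        b == NULL_BARRIER and a in NAMED_BARRIERS
    )


def is_joined_before(ops, join_index, x_index):
    """Is ops[join_index] barrier-joined-before ops[x_index]?

    ops: one thread's program order as (kind, barrier_id) tuples.
    """
    kind, bo = ops[join_index]
    if kind != "join" or join_index >= x_index:
        return False
    x_kind, x_bo = ops[x_index]
    # X must be on the same barrier object and must not be a join.
    if x_bo != bo or x_kind == "join":
        return False
    for k, other in ops[join_index + 1 : x_index]:
        # An intervening join or drop on the same barrier object breaks it.
        if other == bo and k in ("join", "drop"):
            return False
        # So does a join on a mutually exclusive barrier object.
        if k == "join" and other != bo and mutually_exclusive(bo, other):
            return False
    return True
```

For instance, in `[("join", 3), ("join", 0), ("wait", 3)]` the join on barrier 3 is no longer barrier-joined-before the wait, because the join on the NULL barrier sits between them.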
