Hello, please allow me one comment.
Indeed, a single 32-bit read or write access translates into one instruction from the CPU's perspective.
However, I believe writes might be grouped or reordered in the AXI write buffers.
Also, an interrupt can occur in the middle of a read / modify / write sequence (used to take or release a mutex). That can lead to a race or even a lock-up if the same mutex is accessed by two different "threads" or routines.
I would rather recommend using the load / store exclusive instructions for a "bare metal" mutex:
infocenter.arm.com/.../index.jsp
At the end of that chapter there are even some implementation examples / code snippets.
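
To illustrate the idea, here is a minimal sketch of such a mutex in C. It is not taken from the ARM document: the names bm_mutex_t, bm_mutex_init, bm_mutex_trylock, bm_mutex_lock and bm_mutex_unlock are made up for this example, and it assumes a toolchain with C11 <stdatomic.h>. On ARM cores that support exclusives, GCC and Clang typically lower the compare-exchange to a load-exclusive / store-exclusive retry loop, but that depends on the target and compiler flags, so please check the generated assembly.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bare-metal mutex (illustrative names): 0 = free, 1 = held. */
typedef struct {
    atomic_uint state;
} bm_mutex_t;

static inline void bm_mutex_init(bm_mutex_t *m)
{
    atomic_init(&m->state, 0u);
}

/* Single attempt to take the mutex; returns true if it was acquired.
 * The compare-exchange is atomic, so an interrupt cannot slip in
 * between the "read" and the "write" of the lock word. */
static inline bool bm_mutex_trylock(bm_mutex_t *m)
{
    unsigned expected = 0u;
    return atomic_compare_exchange_strong_explicit(
        &m->state, &expected, 1u,
        memory_order_acquire,   /* on success: later accesses stay after the lock */
        memory_order_relaxed);  /* on failure: nothing was acquired */
}

/* Busy-wait until the mutex is free. Do not call this from an interrupt
 * handler that may preempt the current holder: it would spin forever. */
static inline void bm_mutex_lock(bm_mutex_t *m)
{
    while (!bm_mutex_trylock(m)) {
        /* spin */
    }
}

/* Release the mutex; the release ordering makes earlier writes
 * visible before the lock word is seen as free. */
static inline void bm_mutex_unlock(bm_mutex_t *m)
{
    atomic_store_explicit(&m->state, 0u, memory_order_release);
}
```

In an interrupt routine, bm_mutex_trylock is probably the safer choice: if it returns false, the routine can defer its work instead of spinning on a mutex held by the code it interrupted.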