Direct Memory Access (DMA) is a mechanism in which a dedicated hardware controller transfers data between memory regions, or between memory and a peripheral, without continuous CPU involvement. The CPU configures the transfer parameters and the DMA controller then handles the bus transactions autonomously, freeing the CPU to execute other code or enter a low-power state.
In practice
DMA is most commonly used to offload bulk data movement that would otherwise consume significant CPU cycles in polling or ISR loops. Typical use cases include streaming ADC samples into a buffer, feeding a DAC or PWM peripheral from memory, driving SPI/I2C/UART transfers, and moving data between RAM regions. On most MCUs, a DMA transfer is set up by configuring at minimum a source address, destination address, transfer count, and data width (byte, halfword, or word), though many controllers also require or support additional fields such as increment modes, request/trigger selection, priority, and scatter/gather descriptors. A completion or half-transfer interrupt notifies the CPU when the buffer is ready.
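The minimum set of fields described above can be sketched as a small host-side model. This is an illustration only: `dma_desc_t`, its field names, and `dma_model_run()` are hypothetical and do not correspond to any vendor's register layout or HAL. On real hardware the loop body is performed by the controller, not by the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor; names are illustrative, not from a vendor header. */
typedef enum {
    DMA_WIDTH_BYTE     = 1,
    DMA_WIDTH_HALFWORD = 2,
    DMA_WIDTH_WORD     = 4
} dma_width_t;

typedef struct {
    const void *src;     /* source address */
    void       *dst;     /* destination address */
    size_t      count;   /* number of data items (not bytes) */
    dma_width_t width;   /* size of each item in bytes */
    int         src_inc; /* nonzero: advance source after each item */
    int         dst_inc; /* nonzero: advance destination after each item */
} dma_desc_t;

/* Host-side model of what a controller does once the channel is enabled.
 * Returns the number of bytes moved. */
size_t dma_model_run(const dma_desc_t *d)
{
    assert(d != NULL && d->src != NULL && d->dst != NULL);

    const uint8_t *s = (const uint8_t *)d->src;
    uint8_t       *t = (uint8_t *)d->dst;

    for (size_t i = 0; i < d->count; i++) {
        memcpy(t, s, (size_t)d->width);   /* one bus "beat" of the given width */
        if (d->src_inc) s += d->width;
        if (d->dst_inc) t += d->width;
    }
    return d->count * (size_t)d->width;
}
```

With both increment flags set, the model behaves like a memory-to-memory copy. With `src_inc` cleared it repeatedly reads one fixed address, which is how a peripheral data register is typically drained into a RAM buffer.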
A common design pattern is the double-buffer (ping-pong) arrangement: the DMA fills one buffer while the CPU processes the other. This keeps both the CPU and the peripheral busy without stalling either. Many DMA controllers support continuous or chained operation through mechanisms such as circular mode or linked-list descriptors, though these are distinct features that differ across implementations. For example, the RP2040 DMA lets one channel trigger another on completion (channel chaining), which can be used for linked-list-style operation, while STM32 DMA2 streams offer a separate hardware double-buffer mode as well as circular mode.
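The ping-pong hand-off can be sketched as a host-side simulation. The names `pingpong_t`, `pp_init()`, and `pp_dma_write()` are hypothetical, and the block length is arbitrary. In firmware, the role swap would normally happen in a transfer-complete interrupt handler or be done by the controller itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PP_BLOCK_LEN 4u  /* samples per buffer; illustrative size */

typedef struct {
    uint16_t buf[2][PP_BLOCK_LEN];
    unsigned dma_idx; /* buffer the (simulated) DMA is currently filling */
    size_t   fill;    /* samples written into that buffer so far */
} pingpong_t;

void pp_init(pingpong_t *pp)
{
    assert(pp != NULL);
    pp->dma_idx = 0;
    pp->fill    = 0;
}

/* Simulates the DMA storing one sample. When the active buffer becomes
 * full, the two buffers swap roles and the completed buffer is returned
 * for the CPU to process; otherwise NULL is returned. */
const uint16_t *pp_dma_write(pingpong_t *pp, uint16_t sample)
{
    assert(pp != NULL);

    pp->buf[pp->dma_idx][pp->fill++] = sample;
    if (pp->fill < PP_BLOCK_LEN) {
        return NULL;
    }

    const uint16_t *done = pp->buf[pp->dma_idx];
    pp->dma_idx ^= 1u; /* DMA moves on to the other buffer */
    pp->fill = 0;
    return done;
}
```

The CPU must finish with the returned buffer before the DMA fills the other one, otherwise the next swap hands the DMA a buffer that is still being read.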
Cache coherency is a significant pitfall on MCUs with data caches, such as Cortex-M7 based parts like the STM32H7 or i.MX RT series. If the CPU writes to a buffer that is cached, the DMA may read stale data from RAM; conversely, if the DMA writes to RAM, the CPU may read from a stale cache line. The usual fixes are either to place DMA buffers in non-cacheable memory regions (via the MPU) or to perform explicit cache maintenance around each transfer: clean the cache before the DMA reads a buffer the CPU has written, and invalidate it before the CPU reads a buffer the DMA has written. On Cortex-M0/M3/M4 devices, which typically lack data caches, this specific cache coherency issue generally does not arise, though memory ordering and visibility considerations may still apply depending on the SoC.
DMA is also useful in DSP and signal-processing pipelines where continuous, deterministic data flow is critical, for instance when feeding samples to a codec or capturing from an isolated sigma-delta modulator. Because the CPU is freed from the transfer itself, it can perform computation on completed blocks with tighter and more predictable timing.
Frequently asked
Does using DMA always result in faster throughput than CPU-driven transfers?
Not always. For very short transfers (a few bytes), the overhead of configuring the DMA controller can exceed the cost of a simple CPU copy or tight ISR loop. DMA pays off most on transfers of tens of bytes or more, or when the goal is to free the CPU rather than reduce raw latency. On small 8-bit MCUs with limited DMA (or none at all), hand-coded copy loops or ISR-driven byte shuffling may be the only practical option.
What is a DMA burst, and does it matter for my design?
A burst is a configurable number of consecutive bus beats the DMA controller performs before releasing the bus to other masters. Larger bursts improve throughput efficiency but can increase bus latency for the CPU or other peripherals. On some STM32 DMA2 streams, for example, the burst size is configurable across multiple beat widths, though the exact options vary by controller variant and stream mode. Choosing the burst size involves balancing throughput against worst-case bus latency for other masters.
How do I handle DMA and cache coherency on a Cortex-M7 (e.g., STM32H7)?
The safest approach is to declare DMA buffers in a non-cacheable memory region, typically by placing them in a dedicated MPU region with the cacheable attribute cleared (on STM32H7, for example, SRAM3 or SRAM4 are often used for this purpose). Note that the default Cortex-M7 memory map treats ordinary SRAM as cacheable once the data cache is enabled, so a region should not be assumed non-cacheable unless the MPU configuration makes it so. Alternatively, call SCB_CleanDCache_by_Addr() before starting a memory-to-peripheral transfer, and call SCB_InvalidateDCache_by_Addr() after a peripheral-to-memory transfer completes and before the CPU reads the buffer. Forgetting this step is one of the most common sources of intermittent data corruption on high-performance Cortex-M parts.
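Cache maintenance by address works on whole cache lines, so a buffer that shares a line with unrelated data can lose that data on invalidate. The helper below sketches the alignment arithmetic for the 32-byte Cortex-M7 data cache line. `cache_range_t`, `cache_align_range()`, and `cache_range_is_exclusive()` are hypothetical names, and the code deliberately calls no CMSIS function so that it can be checked on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t start; /* address rounded down to a cache line boundary */
    size_t    len;   /* length rounded up to whole cache lines */
} cache_range_t;

/* Expands [addr, addr + len) outward to whole cache lines. */
cache_range_t cache_align_range(uintptr_t addr, size_t len)
{
    assert(len > 0u);

    const uintptr_t mask  = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    const uintptr_t start = addr & ~mask;
    const uintptr_t end   = (addr + len + mask) & ~mask;

    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* True when the buffer occupies whole cache lines by itself, so an
 * invalidate cannot discard a neighbouring variable's pending writes. */
int cache_range_is_exclusive(uintptr_t addr, size_t len)
{
    return (addr % DCACHE_LINE_SIZE == 0u) && (len % DCACHE_LINE_SIZE == 0u);
}
```

In firmware, the resulting `start` and `len` would be the arguments passed to the CMSIS maintenance calls named above. The exact pointer type those functions expect differs between CMSIS versions, so check the header in use. Declaring DMA buffers with 32-byte alignment and a size that is a multiple of 32 makes `cache_range_is_exclusive()` hold by construction.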
What is circular (ring) DMA mode and when should I use it?
In circular mode, when the DMA reaches the end of the configured buffer it automatically wraps back to the start and continues without CPU intervention. This is well suited for continuous streaming applications such as capturing audio samples from an I2S peripheral, logging ADC data, or feeding a DAC. On controllers that provide half-transfer and transfer-complete interrupts in circular mode, the CPU can process the first and second halves of the buffer alternately, forming a software ping-pong arrangement without needing to reconfigure the channel. Note that this interrupt-based technique is distinct from a dedicated hardware double-buffer mode offered by some controllers.
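The half-transfer / transfer-complete pattern can be sketched as a host-side simulation. `ring_dma_t`, `ring_dma_write()`, and `ring_ready_half()` are hypothetical names, and the buffer length is arbitrary. On hardware, the two events would arrive as interrupts rather than return values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_LEN 8u /* total items in the circular buffer; must be even */

typedef enum { EV_NONE = 0, EV_HALF, EV_COMPLETE } ring_event_t;

typedef struct {
    uint16_t buf[RING_LEN];
    size_t   pos; /* index the (simulated) DMA writes next */
} ring_dma_t;

/* Simulates the DMA writing one item in circular mode and reports which
 * event, if any, the write would raise. */
ring_event_t ring_dma_write(ring_dma_t *r, uint16_t sample)
{
    assert(r != NULL && r->pos < RING_LEN);

    r->buf[r->pos++] = sample;
    if (r->pos == RING_LEN / 2u) {
        return EV_HALF;      /* first half is now complete */
    }
    if (r->pos == RING_LEN) {
        r->pos = 0;          /* wrap without reconfiguring anything */
        return EV_COMPLETE;  /* second half is now complete */
    }
    return EV_NONE;
}

/* Returns the half of the buffer that is safe for the CPU to read after
 * the given event, or NULL when no half has just completed. */
const uint16_t *ring_ready_half(const ring_dma_t *r, ring_event_t ev)
{
    assert(r != NULL);
    if (ev == EV_HALF)     return &r->buf[0];
    if (ev == EV_COMPLETE) return &r->buf[RING_LEN / 2u];
    return NULL;
}
```

The CPU has until the next event to finish with the half it was handed; after that the DMA starts overwriting it.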
Can DMA transfers interfere with each other or with the CPU on a shared bus?
Yes. DMA controllers and the CPU share the same memory buses, and simultaneous accesses cause arbitration. On STM32 devices, the AHB bus matrix and DMA arbitration logic assign priority levels to each channel; lower-priority channels can be starved under heavy bus load. On tightly constrained real-time systems, it is worth checking the worst-case DMA latency in the reference manual and ensuring that time-critical CPU code or peripherals are not blocked by a long DMA burst.
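Starvation under fixed-priority arbitration can be illustrated with a simplified model. This sketch assumes a plain "highest priority wins, lowest index breaks ties" rule. `dma_chan_t` and `dma_arbitrate()` are hypothetical, and real arbiters (including the bus matrix itself) are more involved, so treat this as an illustration rather than a description of any specific controller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned priority; /* larger value wins; illustrative encoding */
    int      pending;  /* nonzero when the channel has a request waiting */
} dma_chan_t;

/* Returns the index of the channel granted the bus, or -1 if no channel
 * is pending. Equal priorities are resolved in favour of the lower index. */
int dma_arbitrate(const dma_chan_t *ch, size_t n)
{
    assert(ch != NULL);

    int winner = -1;
    for (size_t i = 0; i < n; i++) {
        if (!ch[i].pending) {
            continue;
        }
        if (winner < 0 || ch[i].priority > ch[(size_t)winner].priority) {
            winner = (int)i;
        }
    }
    return winner;
}
```

In this model a low-priority channel is granted the bus only when no higher-priority channel is pending, which is exactly how a continuously busy high-priority stream can starve it.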
Differentiators vs similar concepts
DMA is sometimes confused with memory-mapped I/O (MMIO), where the CPU itself reads or writes peripheral registers directly. The key distinction is agency: with MMIO the CPU executes every load/store instruction, while with DMA the controller performs those bus transactions autonomously. DMA is also distinct from a CPU cache, which transparently buffers memory accesses to improve CPU performance but is not used to transfer data to or from peripherals. A related but separate concept is a peripheral's built-in FIFO: a FIFO buffers data locally within the peripheral to tolerate burst timing, while the DMA controller (DMAC) moves data between that FIFO and system memory without CPU involvement. The two are frequently used together.