A digital signal processor (DSP) is a processor architecture optimized for repetitive, numerically intensive operations on sampled data streams, such as filtering, FFTs, convolution, and modulation/demodulation. DSPs often achieve this through hardware multiply-accumulate (MAC) units, circular buffer addressing, zero-overhead looping, and fixed-point or floating-point datapaths tuned for throughput over general-purpose flexibility, though the specific features present vary across families.
In practice
In embedded systems, the term "DSP" refers both to dedicated standalone processors (TI C6000/C5000 series, Analog Devices SHARC/Blackfin, Cadence Tensilica HiFi) and to DSP-capable instruction extensions found in general-purpose MCUs and SoCs. ARM Cortex-M4 and M7 cores, for example, include a MAC-style datapath and SIMD instructions via the DSP extension, allowing many signal-processing workloads to run on the same MCU handling system control; Cortex-M33 also defines a DSP extension, but it is an optional implementation choice rather than universally present. Higher-end application processors, particularly those in SoCs targeting multimedia or communications, often include dedicated DSP or DSP-adjacent coprocessors alongside the main CPU.
The defining hardware feature in many DSP architectures is the MAC unit, which computes A += B * C and can sustain one result per cycle under ideal conditions. This is critical for FIR/IIR filters, FFTs, and PID controllers. Dedicated DSPs also commonly provide Harvard-style memory buses (separate instruction and data buses, sometimes with two data buses) so coefficients and samples can be fetched simultaneously in many configurations, reducing the memory bottleneck that would otherwise dominate a tight filter loop; the exact benefit depends on architecture and where data is placed in memory. The blog post "Data Types for Control & DSP" covers the fixed-point versus floating-point trade-offs that directly affect how well a workload maps to these datapaths.
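To make the MAC pattern concrete, the inner loop of a direct-form FIR filter can be sketched in portable C as below. The function name fir_mac and the newest-first sample ordering are assumptions chosen for illustration; whether a compiler maps the loop body to a single-cycle MAC depends on the target and on optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR output for one sample instant:
 *   y = sum over k of coeffs[k] * history[k]
 * history[0] is assumed to be the newest sample and
 * history[num_taps - 1] the oldest (an illustrative convention). */
float fir_mac(const float *coeffs, const float *history, size_t num_taps)
{
    assert(coeffs != NULL && history != NULL);

    float acc = 0.0f;
    for (size_t k = 0; k < num_taps; ++k) {
        /* One multiply-accumulate per tap: acc += B * C.
         * Each iteration needs one coefficient and one sample,
         * which is why dual data buses help on dedicated DSPs. */
        acc += coeffs[k] * history[k];
    }
    return acc;
}
```

Each iteration reads two operands and performs one multiply-accumulate, so a filter with N taps costs roughly N MACs per output sample before loop and memory overhead are counted.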
Fixed-point arithmetic is still common on DSP cores where power budget or cost is tight. Q-format representations (Q15, Q31 on 16/32-bit parts) allow efficient use of the MAC unit, but introduce scaling, overflow, and saturation concerns that do not arise with floating-point. Many embedded DSP bugs stem from mismatched Q-formats or accumulator widths -- a topic covered in the blog post "Debugging DSP code." Some modern DSP cores (SHARC variants, C674x variants, and many Cortex-M7-based designs with an FPU) provide hardware floating-point to reduce this burden, at the cost of higher power and silicon area.
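The accumulator-width concern can be illustrated with a Q15 dot product. This is a minimal sketch, not any vendor library's implementation: the name q15_dot and the q15_t typedef are assumptions, and the right shift relies on the compiler performing an arithmetic shift on negative values, which is implementation-defined in C but typical of embedded toolchains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t q15_t;  /* assumed alias: signed 16-bit word, 15 fractional bits */

/* Dot product of two Q15 vectors with a Q15 result.
 * Each product is Q30. Summing in a 64-bit accumulator leaves enough
 * headroom that intermediate sums cannot wrap for practical lengths;
 * a 32-bit accumulator could overflow after only a few large products. */
q15_t q15_dot(const q15_t *a, const q15_t *b, size_t len)
{
    assert(a != NULL && b != NULL);

    int64_t acc = 0;
    for (size_t i = 0; i < len; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];   /* Q15 * Q15 -> Q30 */
    }

    acc += (int64_t)1 << 14;   /* round to nearest before dropping 15 bits */
    acc >>= 15;                /* Q30 -> Q15 (assumes arithmetic shift) */

    /* Saturate instead of wrapping when the sum exceeds the Q15 range. */
    if (acc > INT16_MAX) return INT16_MAX;
    if (acc < INT16_MIN) return INT16_MIN;
    return (q15_t)acc;
}
```

With eight products of 0.5 * 0.5 the true sum is 2.0, which cannot be represented in Q15; the sketch saturates to the largest Q15 value, whereas the same sum would have wrapped in a 32-bit accumulator.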
On projects where even a DSP-extended MCU cannot meet throughput requirements, FPGAs are an alternative worth considering, since they allow true parallel datapath instantiation rather than sequential MAC execution. The blog post "How FPGAs work, and why you'll buy one" discusses this trade-off. For algorithm selection upstream of implementation, "Modulation Alternatives for the Software Engineer" illustrates how DSP-centric choices (modulation scheme, filter order, FFT size) directly drive the compute budget that the hardware must satisfy.
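A first-order estimate of that compute budget can be sketched as follows. The function names and the idea of a sustained "MACs per cycle" figure are assumptions for illustration; the estimate covers only the MAC count of direct-form FIR filtering and ignores memory stalls, loop overhead, and everything else the core must do.

```c
#include <assert.h>
#include <stdint.h>

/* MACs per second for a bank of direct-form FIR filters:
 * one MAC per tap, per output sample, per channel. */
uint64_t fir_macs_per_second(uint32_t num_taps,
                             uint32_t sample_rate_hz,
                             uint32_t channels)
{
    return (uint64_t)num_taps * sample_rate_hz * channels;
}

/* Fraction of the core consumed by those MACs, given a clock rate and an
 * idealized sustained MAC rate (hypothetical parameter; 1.0 means one MAC
 * every cycle, which real code rarely reaches). */
double fir_core_load(uint64_t macs_per_second,
                     uint64_t clock_hz,
                     double macs_per_cycle)
{
    assert(clock_hz > 0 && macs_per_cycle > 0.0);
    return (double)macs_per_second / ((double)clock_hz * macs_per_cycle);
}
```

For example, 128 taps at 48 kHz across 8 channels comes to 49,152,000 MACs per second, or roughly a quarter of a 200 MHz core under the idealized one-MAC-per-cycle assumption. Doubling the filter order or channel count doubles the figure, which is how algorithm-level choices translate into hardware requirements.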
Frequently asked
What separates a DSP architecture from a general-purpose MCU running signal-processing code?
The main differences are hardware MAC units (single-cycle multiply-accumulate without pipeline stalls), Harvard memory buses that allow simultaneous coefficient and data fetch, circular/modulo addressing for sample buffers, and zero-overhead hardware loops. A general-purpose MCU can run the same algorithms, but on architectures without these features a multi-cycle multiply and separate loop-overhead instructions reduce throughput and increase power consumption for the same workload. Cortex-M4/M7 and similar cores blur this line by incorporating DSP extensions, so the distinction is one of degree rather than a hard boundary.
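As an illustration of what circular/modulo addressing hardware saves, here is a minimal software delay line in C. The type and function names and the power-of-two length are assumptions made for the sketch; on a core without modulo addressing, the index wrap shown here costs explicit instructions on every access.

```c
#include <assert.h>
#include <stddef.h>

#define DELAY_LEN 4u  /* assumed power of two, so the wrap is a bit mask */

typedef struct {
    float  samples[DELAY_LEN];
    size_t head;              /* index of the most recent sample */
} delay_line_t;

/* Store a new sample, overwriting the oldest one. */
void delay_push(delay_line_t *d, float x)
{
    assert(d != NULL);
    /* The wrap below is done in software; a DSP with modulo addressing
     * performs it as part of the address update at no extra cost. */
    d->head = (d->head + 1u) & (DELAY_LEN - 1u);
    d->samples[d->head] = x;
}

/* Return the sample pushed `age` steps ago (0 = newest). */
float delay_get(const delay_line_t *d, size_t age)
{
    assert(d != NULL && age < DELAY_LEN);
    return d->samples[(d->head + DELAY_LEN - age) & (DELAY_LEN - 1u)];
}
```

After pushing 1, 2, 3, 4, 5 into a zero-initialized delay line, delay_get with age 0 returns 5 and age 3 returns 2; the value 1 has been overwritten, which is the intended behavior for a fixed-length sample history.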
Can a Cortex-M MCU replace a dedicated DSP for audio or motor-control filtering?
Often yes, for moderate workloads. Cortex-M4 and M7 cores with the DSP extension and FPU handle audio codecs, biquad filter banks, and FOC motor-control loops in many production designs. Where a dedicated DSP tends to win is at higher sample rates, larger filter orders, or multi-channel workloads that saturate even a Cortex-M7 at 200-500 MHz, or in applications with strict power constraints where a purpose-built DSP core running at lower voltage offers better MIPS-per-mW.
What is a MAC unit and why does it matter for DSP?
A multiply-accumulate unit computes accumulator += A * B in a single instruction cycle (or in one pipeline slot on pipelined cores). Almost every DSP algorithm -- FIR filters, FFTs, matrix operations, PID controllers -- reduces to a series of MAC operations, so a hardware MAC unit directly sets the maximum throughput. Without one, a multiply followed by an add takes multiple cycles and additional registers, which cuts throughput and increases code size for tight inner loops.
What are Q-format numbers and why do they appear in DSP code?
Q-format is a fixed-point representation where an integer word holds a fractional value by implicitly placing the binary point at a fixed position. Q15, for example, stores a value in [-1, 1) as a signed 16-bit integer, and Q31 does the same in 32 bits. DSP libraries use Q-format because the MAC unit operates on integers, yet filter coefficients and audio samples are naturally fractional. The pitfall is that the programmer must track scaling manually: multiplying two Q15 numbers produces a Q30 result in a 32-bit accumulator, and failing to shift or saturate correctly causes overflow or loss of precision. The blog post "Data Types for Control & DSP" covers these trade-offs in detail.
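The scaling bookkeeping can be sketched in C as follows. All names here (q15_t, float_to_q15, q15_to_float, q15_mul_q30, q30_to_q15) are illustrative assumptions rather than any library's API, and the right shift assumes an arithmetic shift on negative values, which is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;  /* assumed alias: value = integer / 32768 */

/* Convert a float to Q15, saturating outside [-1, 1). */
q15_t float_to_q15(float x)
{
    assert(x == x);  /* NaN has no fixed-point representation */
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)  return INT16_MAX;   /* +1.0 is not representable */
    if (scaled <= -32768.0f) return INT16_MIN;
    /* Round half away from zero, then truncate. */
    return (q15_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert Q15 back to float. */
float q15_to_float(q15_t q)
{
    return (float)q / 32768.0f;
}

/* Q15 * Q15 gives a Q30 product held in 32 bits. */
int32_t q15_mul_q30(q15_t a, q15_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Bring a Q30 value back to Q15: round, shift by 15, saturate. */
q15_t q30_to_q15(int32_t p)
{
    int64_t r = ((int64_t)p + (1 << 14)) >> 15;  /* assumes arithmetic shift */
    if (r > INT16_MAX) return INT16_MAX;
    if (r < INT16_MIN) return INT16_MIN;
    return (q15_t)r;
}
```

In this sketch 0.5 becomes 16384 in Q15, and 0.5 * 0.5 yields 268435456 in Q30, which shifts back to 8192 (0.25) in Q15. The one product that does not fit is -1 * -1: the exact answer is +1.0, which lies just outside the Q15 range, so the result saturates to 32767.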
When should a designer consider an FPGA instead of a DSP processor for signal processing?
FPGAs become attractive when the algorithm requires massive parallelism that no sequential MAC pipeline can sustain (e.g., many simultaneous FFT channels, radar pulse compression, real-time SDR at wide bandwidths), when latency must be deterministic to the clock cycle, or when the datapath is unconventional and does not map well to a standard MAC unit. The trade-off is higher development effort and cost per unit compared to a DSP chip or DSP-extended MCU. The blog post "How FPGAs work, and why you'll buy one" provides a practical overview of where FPGAs make sense.
Differentiators vs similar concepts
DSP (architecture) vs. DSP (algorithm domain): the term "DSP" is used both for a hardware architecture and for the broader discipline of digital signal processing (the mathematics of sampling, filtering, and spectral analysis). A Cortex-M4 running an FFT is doing DSP work but is not a DSP processor in the architectural sense. Context usually disambiguates, but embedded engineers should be precise: "DSP core," "DSP instruction extensions," or "DSP algorithm" avoids ambiguity.
DSP vs. FPGA for signal processing: a DSP executes instructions sequentially through a MAC-centric pipeline; an FPGA implements datapaths in parallel reconfigurable logic. DSPs offer faster software development, lower cost in volume, and simpler tooling; FPGAs offer higher throughput for parallelizable workloads, lower latency, and flexibility for non-standard arithmetic. Many high-performance systems combine both.