EmbeddedRelated.com

How to Deploy Local LLMs for Embedded Software Development: Terminology and Motivation

Mohammed Billoo, May 12, 2026

About six months ago, I purchased a Framework Desktop (https://frame.work/desktop) with the intention of building a fully offline, local LLM setup to see whether it could compete with "frontier" models. In this upcoming series of blog posts, I will outline the journey that began with the Framework Desktop and led me to a setup I use daily. In this first post, I will outline some personal reasons for this undertaking, along with terminology that is important to understand before we dive into the implementation details in future posts.

Motivation

The original reasons I wanted to create a completely local and offline LLM setup are outlined below:

  • Price: Prior to the release of the Framework Desktop, I had been experimenting with running models offline to transcribe meeting notes on my M3 Max MacBook Pro. This worked so well that I started looking for an SoC with sufficient unified RAM and the ability to run Linux, so I could use a local LLM in my embedded software development workflow. When I saw the Framework Desktop released with the AMD Ryzen AI Max+ 395 and 128 GB of unified RAM at an impressive price point, I immediately purchased it.
  • Client confidentiality: I'm not comfortable exposing sensitive client code to cloud-based models. My clients expect me not to share their IP, including their embedded software and firmware.
  • Data sovereignty: Regardless of client expectations, I'm not comfortable exposing the code I develop and use on a day-to-day basis to a tool that reaches out to the Internet (NB: In retrospect, this is probably an exaggeration, considering I have so much of my life in the cloud already, but it hits closer to home as an embedded software engineer using a command line tool that theoretically can access everything on my development machine).
  • Offline operation: There have been months when Claude was down for multiple hours in a day (even entire days), making it difficult to rely on it for consistent, day-to-day work.
  • Cost: Token pricing can be opaque and can vary over time (I have seen reports from organizations and users that they exhausted their token allowance much more quickly in certain weeks than in previous weeks). I don't have to worry about variable token pricing if I run my own service.

Terminology

In this section, we will cover some relevant definitions. 

  • Large Language Model (LLM): This is the big one, and the one that most people get confused by without even knowing! A large language model is essentially a neural network trained to predict the next token in a text sequence. "Large" here refers to the number of weights in the neural network. Modern models range from a few billion to hundreds of billions of weights. Many people assume that the "claude" CLI tool (in the case of Claude Code) is the model itself. In fact, the tool does a lot more in the background, which we will get into in a later blog post (where we will also create our own pipeline). Ultimately, however, the Claude Code CLI tool passes tokens to a particular LLM, such as Opus 4.x or Sonnet, and receives tokens back.
  • Training: The process used by the model vendor to generate the model weights. This process happens once and is extremely expensive. It can take weeks or months and consume enormous amounts of compute and memory. We (as users) aren't interested in training.
  • Inference: The process through which we (as users) interact with LLMs. We give the model a prompt and it generates a response. The weights are frozen (from training), so the model is not learning during inference.
  • Parameters: The learned weights that define the model's behavior. Everything that the model "knows" is encoded in these floating-point values. When we see "7B" or "70B" associated with a model, that number represents the number of parameters in the model. The reason parameters matter for hardware is that they all have to "go" somewhere. A 7B parameter model at 16-bit precision (2 bytes per parameter) requires roughly 14 GB of memory. A 70B model needs roughly 140 GB. The parameter count is the main limiting factor in hardware selection.
  • Active parameters: The Mixture-of-Experts (MoE) architecture was originally proposed in 1991 to enable specialization and modularity, reflecting how the brain behaves. However, as language models scaled, researchers realized that an MoE architecture has a very useful property: a model can have a huge total parameter count while activating only a small subset of those parameters for each token. The total parameter count can therefore be very large while the number of active parameters stays small, which significantly reduces the compute and memory bandwidth needed per token. Note that all of the weights still have to be stored somewhere, so the total parameter count continues to drive memory capacity.
  • Context window: The maximum number of tokens the model can consider at once, including both the input prompt and the generated response combined. The context window size bounds the complexity of your question and the amount of code and documentation that can be fed to the model in a single query. The size of the KV cache grows directly with the context window.
  • KV cache: During inference, the model computes intermediate representations called keys and values for every token in the context window. These intermediate results are stored in the KV cache, which consumes memory that would otherwise be available for model weights. This is a trade of memory for compute: without the cache, the keys and values for all previous tokens would have to be recomputed for every newly generated token. With the KV cache, we pay the computation cost once and reuse the result on subsequent accesses. The memory cost of the KV cache scales with both context length and model size. For long-context workloads, such as querying a large codebase, the KV cache can consume as much memory as the model itself. Thus, it's important to keep the KV cache (and context window) in mind when sizing your GPU. It's not just the model weights that are important.
  • Quantization: LLMs are typically trained at 16-bit or 32-bit floating-point precision, commonly referred to as FP16 and FP32, respectively. Quantization reduces precision (e.g., INT8 for 8-bit integers, INT4 for 4-bit integers) to reduce the model's memory footprint, usually at the cost of some output quality.
  • VRAM: Video RAM is the dedicated high-bandwidth memory on a discrete GPU. GPU architecture makes it the ideal compute element for LLM inference. Before the explosion of LLMs, GPUs were commonly used for traditional machine learning workflows. Nothing is preventing us from using "traditional" system RAM to load the LLM weights. However, performance is significantly degraded.
  • Unified memory: In this architecture, the CPU and GPU share the same physical memory pool rather than having separate allocations. The AMD Ryzen AI Max+ 395 in the Framework Desktop and Apple Silicon both use this architecture. For LLM inference, the consequence is significant: there's no traditional VRAM limit. A system with 128 GB of unified memory can make (nearly) all 128 GB available to the GPU for model weights and KV cache. The tradeoff is bandwidth. Unified memory is no match for the dedicated VRAM on a discrete GPU in terms of memory bandwidth. However, the price per GB of unified memory is much lower than that of VRAM on a discrete GPU.
  • Token speed: The rate at which the model can ingest and output tokens. Token speed is directly tied to memory bandwidth, because the weights have to be read from memory for each generated token. Higher token speed can be important for chatbots. However, for us as embedded software engineers, token speed should not be the primary metric.
  • Frontier models: Cloud-based models offered by Anthropic, OpenAI, Google, etc. Although the exact number of parameters isn't disclosed, I personally believe they can't be run on consumer-grade hardware. It is important to keep in mind that we will not be able to achieve anything close to the quality of these models on consumer hardware.
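
The memory arithmetic behind the Parameters, Quantization, KV cache, and Token speed entries above can be sketched in a few lines of Python. This is a rough, back-of-the-envelope sketch rather than a sizing tool: the function names are made up for illustration, the model shape (32 layers, 8 KV heads, head dimension 128) and the 200 GB/s bandwidth figure are hypothetical placeholders rather than measurements of any particular model or machine, and the token-rate estimate is a common rule of thumb that assumes every active weight is read from memory once per generated token. Real-world numbers will differ because of runtime overhead, quantization metadata, and caching effects.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_memory_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                       context_tokens: int, bits_per_value: int = 16) -> float:
    """Approximate KV cache size in GB.

    Each token stores one key vector and one value vector (hence the
    factor of 2) per layer and per KV head.
    """
    bytes_per_value = bits_per_value / 8
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


def decode_tokens_per_second_ceiling(bandwidth_gb_s: float,
                                     active_weight_gb: float) -> float:
    """Rule-of-thumb upper bound on generation speed.

    Assumes each generated token requires reading all active weights
    from memory once; actual throughput will be lower.
    """
    return bandwidth_gb_s / active_weight_gb


if __name__ == "__main__":
    # Figures from the text: 7B and 70B parameters at 16-bit precision.
    print(weight_memory_gb(7e9, 16))    # 14.0
    print(weight_memory_gb(70e9, 16))   # 140.0
    # The same 70B model quantized to 4 bits per parameter.
    print(weight_memory_gb(70e9, 4))    # 35.0

    # Hypothetical model shape with a 32,768-token context window.
    print(round(kv_cache_memory_gb(32, 8, 128, 32_768), 2))  # 4.29

    # Hypothetical 200 GB/s of memory bandwidth, 35 GB of active weights.
    print(round(decode_tokens_per_second_ceiling(200.0, 35.0), 1))  # 5.7
```

The last estimate also illustrates why active parameters matter: for an MoE model, only the active weights enter the bandwidth calculation, even though the full weight set still has to fit in memory.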

Conclusion

The goal of the next series of blog posts is to develop a fully offline AI pipeline that achieves an acceptable quality as close as possible to that of Anthropic's Sonnet 4.6 model. The operative word here is "acceptable". We have to accept that we will not be able to match the quality of frontier models. We have to work around the limitations of local models. In the next blog post, we will map the above terms to hardware selection and better understand the trade-offs.

