Implementing client-side RAG

After integrating the Chrome Prompt API with diodesign.org, I quickly ran into an interesting brick wall of a challenge: the current 6,144-token context window of the on-device Gemini Nano model. While 6K tokens might sound generous for a quick chat, it's nowhere near enough to hold the site's entire archive for the model to interrogate.

If I just dumped every log entry into the combined user prompt and system prompt, the model would simply choke or truncate the most important bits.

The solution is a client-side Retrieval-Augmented Generation (RAG) architecture. Instead of giving the AI everything, I only give it the snippets, or "chunks", that are actually relevant to the user's query.

The indexing pipeline

The first step happens at build time. I updated the build.py script to process every Markdown file and "chunk" the text into 5,000-character segments. I also made sure to inject each entry's metadata, such as its byline and summary, into every chunk so the search engine knows who wrote what and why.

These chunks are exported into a search-index.json file, which is fetched by the browser when the "Lab AI" initializes.
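
To give a feel for what the index holds, here's an illustrative entry; the actual field names and layout produced by build.py are assumptions on my part.

```js
// Illustrative shape of one search-index.json entry (field names are assumptions).
const exampleChunk = {
  title: 'Implementing client-side RAG',   // which log the chunk came from
  byline: 'the lab',                       // metadata injected at build time
  summary: 'Fitting the site archive into a 6,144-token window',
  content: '…up to 5,000 characters of the entry text…',
  url: '/logs/client-side-rag/',
};
```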

Retrieval: Orama and the fail-safe fallback

For the retrieval engine, I'm using Orama, a lightweight, dependency-free search library that runs entirely in the browser. It handles the heavy lifting of indexing the site's content on the fly and performing ranked keyword searches.
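
A rough sketch of that setup, reusing the chunk fields assumed above and a bundled @orama/orama module; the real initialization code on the site may differ:

```js
import { create, insertMultiple, search } from '@orama/orama';

// Build the in-browser index from the pre-chunked archive.
async function buildIndex() {
  const chunks = await (await fetch('/search-index.json')).json();

  const db = await create({
    schema: {
      title: 'string',
      byline: 'string',
      summary: 'string',
      content: 'string',   // the 5,000-character chunk text
      url: 'string',
    },
  });

  await insertMultiple(db, chunks);
  return db;
}

// The primary retrieval path: a ranked keyword search across all chunks.
async function primarySearch(db, query, limit = 8) {
  const results = await search(db, { term: query, limit });
  return results.hits.map((hit) => hit.document);
}
```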

However, natural language is messy. I found that a strict search engine sometimes penalizes conversational queries. To combat this, I implemented a two-stage retrieval pipeline:

  1. As the primary search, Orama attempts a ranked search across all chunks.
  2. As a keyword fallback, if Orama returns zero hits, the system triggers a fail-safe manual scan. It breaks the query into individual keywords and ranks every chunk by how many unique terms it matches.

This ensures that if you ask about a specific data point, such as "which riots did I cover while working in the media?", the system will find it even if the rest of the query is noisy.
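
The fallback stage is plain JavaScript. A minimal sketch, again using the chunk fields assumed earlier; the scoring on the site itself may be more involved:

```js
// Fail-safe fallback: split the query into keywords and rank every chunk
// by how many unique terms it contains.
function keywordFallback(chunks, query, limit = 8) {
  // Drop very short tokens as noise; this cut-off is an arbitrary choice.
  const terms = [...new Set(
    query.toLowerCase().split(/\W+/).filter((t) => t.length > 2)
  )];

  return chunks
    .map((chunk) => {
      const haystack = `${chunk.title} ${chunk.summary} ${chunk.content}`.toLowerCase();
      const matches = terms.filter((t) => haystack.includes(t)).length;
      return { chunk, matches };
    })
    .filter((entry) => entry.matches > 0)
    .sort((a, b) => b.matches - a.matches)
    .slice(0, limit)
    .map((entry) => entry.chunk);
}
```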

Staying within the token budget

Once the relevant chunks are found, I have to ensure they actually fit in the 6,144-token window. The system now performs a token-budget check before every inference (a rough sketch follows the list):

  • It uses session.countPromptTokens() to measure the size of the combined prompt.
  • The limit itself comes from session.maxTokens, less a safety margin.
  • If the measured count exceeds that limit, it iteratively drops the least relevant chunks from the context until the prompt fits.
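
Something along these lines; countPromptTokens() and maxTokens are the experimental Prompt API surface mentioned above, while the prompt layout and the 256-token safety margin are purely illustrative:

```js
// Trim the retrieved chunks until the combined prompt fits the model's window.
async function fitToBudget(session, question, rankedChunks, margin = 256) {
  let selected = [...rankedChunks];   // most relevant first

  while (true) {
    const prompt = [
      'Context from the lab archives:',
      ...selected.map((c) => `[${c.title}] ${c.content}`),
      `Question: ${question}`,
    ].join('\n\n');

    const used = await session.countPromptTokens(prompt);
    if (used <= session.maxTokens - margin || selected.length === 0) {
      return prompt;
    }
    selected.pop();   // drop the least relevant chunk and re-measure
  }
}
```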

Citing sources

Finally, I tuned the system prompt to make the AI more academic. Instead of vague summaries, it's now instructed to cite its sources directly, such as: "According to the [Title] log..." This makes the answers feel less like hallucination and more like the output of a genuine research assistant tapping into the lab's archives.
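
For flavour, something in the spirit of that instruction; this wording is mine, not the exact prompt on the site:

```js
// Illustrative system prompt; the live version is worded differently.
const systemPrompt = `You are the research assistant for this site's lab archives.
Answer only from the context supplied with each question.
Cite your sources inline, for example: "According to the [Title] log...".
If the context does not contain the answer, say so rather than guessing.`;
```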

The result is a search experience that's private, kinda fast on my laptop, and hopefully knows what it's talking about.

Integrating the Chrome Prompt API

For fun, I implemented a local AI search interface on diodesign.org called "ask the lab". It uses the experimental Prompt API built natively into Chrome, starting with Chrome Canary.

Essentially, it uses an on-device Large Language Model (Gemini Nano) to answer visitors' queries using the website's static archives, completely eliminating cloud latency and privacy concerns. You ask the site a question, and it answers using an LLM and the site's content. It works by combining the archives with the user's query and a system prompt, then sending that payload to the model and displaying the answer in the page.
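
Stripped of the lifecycle handling described below, the core flow amounts to something like this; the element selector and function names are placeholders of mine:

```js
// Bare-bones request: combine the archive context and the visitor's question,
// then ask the on-device model. Assumes a session already exists, with the
// system prompt baked in at creation time.
async function askTheLab(session, archiveContext, question) {
  const prompt = `${archiveContext}\n\nVisitor's question: ${question}`;
  const answer = await session.prompt(prompt);   // single-shot; streaming is shown later
  document.querySelector('#lab-ai-answer').textContent = answer;   // placeholder element id
}
```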

The primary challenge was managing the lifecycle of the 4GB neural weight component (OptimizationGuideOnDeviceModel) tied to the browser's profile directory. I wanted to avoid making people download the model over and over, so I implemented logic to persist the model across sessions. Initial implementations ran afoul of both the API namespace changing overnight (from window.LanguageModel to window.ai.languageModel) and silent schema changes that caused the browser's ML execution service to fatally hang during initialization.

The final architecture uses a graceful fallback approach (roughly sketched after the list):

  1. Probe the API namespace to ensure the device even supports the local model. There are hardware requirements, and if these aren't met, then the query box never even appears.
  2. Intercept the downloading status to provide user feedback while Chrome silently fetches the 4GB of weights in the background.
  3. Manage the initial "cold start" latency (telling the visitor the page is hydrating the weights from disk to RAM) when the AI is first invoked.
  4. Stream the output chunks directly into the UI alongside a dynamic DOM collapse, creating a chat experience that's all too familiar these days.
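
Here's roughly how steps 1, 2 and 4 hang together. The experimental API keeps shifting, so the names below (LanguageModel, availability(), the downloadprogress event, promptStreaming()) reflect one recent revision and may not match whichever build you're running; showStatus() and the options passed to create() are placeholders of mine.

```js
// Probe the namespace, surface download progress, then stream answers into the UI.
async function initLabAI(systemPrompt) {
  // 1. Feature detection: if the device can't run the local model, bail out
  //    and the query box never appears.
  const API = globalThis.LanguageModel ?? globalThis.ai?.languageModel;
  if (!API) return null;

  const availability = await API.availability?.();
  if (availability === 'unavailable') return null;

  // 2. Report download progress while Chrome fetches the weights in the background.
  const session = await API.create({
    systemPrompt,
    monitor(m) {
      m.addEventListener('downloadprogress', (e) => {
        // e.loaded is a 0-1 fraction in recent revisions.
        showStatus(`Downloading model… ${Math.round(e.loaded * 100)}%`);
      });
    },
  });

  return session;
}

// 4. Stream output chunks straight into the page as they arrive.
//    (Depending on the API revision, chunks are deltas or the cumulative text.)
async function streamAnswer(session, prompt, outputEl) {
  for await (const chunk of session.promptStreaming(prompt)) {
    outputEl.textContent += chunk;
  }
}
```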

If you have Chrome Canary suitably configured, you can test it by clicking the search box on the homepage. If your browser or system doesn't support the API, the site appears exactly as it did before.

As I said, this was for fun, to get a handle on the new API and add a bit more functionality to the site for free.

Migrating Diosix to Zig

I've decided to move the Diosix RISC-V hypervisor codebase from Rust to Zig. While I appreciate Rust's safety, and it is a fantastic language, I find Zig's approach to low-level systems programming more productive. This isn't a critique of Rust's capabilities, but rather a preference for a different development flow. Zig lets me write code the way I think about the overall design, giving me mechanisms for safety without the restrictiveness of a mandatory borrow checker.

You can find this ongoing Zig work in the zig branch of the Diosix git repository.

The road ahead for Diosix is focused on demonstrating its architecture as a mesh of autonomous gossiping nodes.

The next phase of the project will, I hope, see Diosix evolve into a self-organizing mesh, where localized reasoning within a sovereign root VM allows it and its peer VMs to negotiate resources and maintain resilience across the mesh without needing a central controller. This shift toward decentralized intent is going to be an interesting challenge to pull off, and I'm curious to see how this new environment handles the stress of 1,000 or more VMs.

I'll be pushing these experiments out as they stabilize; I'm wondering what others might build on top of a system that manages itself.