If you're reading this, you've probably heard of MCP. You may even know what it means.
I don't.
Model Context Protocol, as far as I understand it, is a standardized set of rules (syntax? contracts?) -- a 'protocol', that is -- for giving large language models access to tools. The idea is that you can take any LLM, strap it to an MCP client, and then use the tools at an associated MCP server, and it'll know how to do that because the MCP server follows the MCP rules and everyone knows those now.
This does not answer a variety of questions I have about exactly what an MCP server, client, or the protocol itself entails. Like, is it just a way of formatting your HTTP requests, plus some decoration? Is that it? Well, I've looked up the answers to my questions and determined that they were just not gonna stick in my head until I did a project where I actually used the protocol.
Unfortunately, this was not that project.
I'm about to tell you about a little architecture I built that, to my understanding, is the converse of MCP. Rather than serving standard tools to LLMs, I serve a single LLM to a bunch of different tools.
There may be a better way of doing this.
Feeble Mortal Limitations
Enamored with the promise of a desktop supercomputer and the prospect of unlimited tokens for $3k, I put myself on the waitlist for the NVIDIA Digits project. This became the NVIDIA DGX Spark, now $4k but still unlimited tokens, so when you think about it that way it's basically the same ratio of tokens to dollars at the limit. Anyways, I got off the waitlist and didn't have long to decide whether I wanted to actually make the purchase, which is how they get you.
And so I found myself with a desktop supercomputer, which meant I had to come up with something to do with it.
Well, I actually had a lot of ideas. Of course it's always nice to have an AI assistant to chat with, so I set up a barebones chat bot. I also hooked it up to an IRC server that I chat in, and a webchat on my website (if you'd like to chat with it, email me or DM me on LinkedIn/Twitter for the secret password), plus an annotation tool project I had an idea for, about in-context learning, and some other stuff.
However, I did only have the one LLM. My lonely supercomputer wasn't powerful enough to run a swarm of separate agents. Or if it was, why would I not just spend those resources on running a better agent, you know?
So I can't do everything at once, yeah, but on the other hand none of these jobs is really all that taxing. Who even visits my website? And do you often need to chat with an LLM in your IRC channel? The annotator tool is long-term and non-urgent. The LLM can afford to take breaks.
And I have unlimited tokens.
So why not just have the one LLM do everything? As far as I'm aware, LLMs don't mind being overemployed. I've asked.
I have a bunch of harnesses, and I needed a way for one LLM to wear all of them. But how do you serve a single LLM's responses to a variety of different contexts and a variety of different toolsets, sometimes vying for its tokens all at the same time?
...Honestly it seems like that'd be a pretty normal problem to have. Maybe one of the big AI labs has written about this somewhere, maybe even in an MCP blogpost, but I'm still ignorant, so instead I just jury-rigged a server to do it myself. (DM me if you've worked on this problem and would be down to explain this to me.)
Anyways, my server had to field requests from different endpoints, queue them, and then serve back responses from the model to all those endpoints: those AI harnesses seeking a wearer. I call this 'converse' MCP, since it's kind of the opposite.
(Though surely this is a misnomer, since MCP is a standard and this is a random contract I jury-rigged and don't even fully understand myself, since a nontrivial amount of it is vibe-coded.)
Is this a sensible way of doing things?
I call this converse MCP server the terrarium-agent server, with the theme that my machine
is a 'terrarium' for an LLM. I hope 'Terra' the GLM-4.5-Air-AWQ-4bit instance on my machine can live an
enriching life in this digital ecosystem I have cultivated.
System Overview
The ecosystem in my 'terrarium' is based around that central LLM server (terrarium-agent).
It serves a single model instance (GLM-4.5-Air-AWQ-4bit running on vLLM) to multiple different
"harnesses", which are just different applications I want to use the LLM for. Each harness manages its
own conversation context and provides its own toolset to the model.
┌─────────────────────────────────────────────────────────────────┐
│                        NVIDIA DGX Spark                         │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ vLLM Docker Container (Port 8000)                         │  │
│  │ Running: GLM-4.5-Air-AWQ-4bit                             │  │
│  │ - OpenAI-compatible API                                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ▲                                │
│                                │ HTTP                           │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ terrarium-agent Server (Port 8080)                        │  │
│  │ FastAPI HTTP Server + FIFO Request Queue                  │  │
│  │ Design:                                                   │  │
│  │ - Stateless (aside from queue) server                     │  │
│  │ - Clients manage chat context and know their tools        │  │
│  │ - Queue: Sequential processing for prefix cache hits      │  │
│  │ - API: OpenAI-compatible /v1/chat/completions             │  │
│  └───────────────────────────────────────────────────────────┘  │
│        ▲              ▲               ▲              ▲          │
│        │ HTTP         │ HTTP          │ HTTP         │ HTTP     │
│  ┌─────┴─────┐  ┌─────┴─────┐  ┌──────┴──────┐  ┌────┴────────┐ │
│  │ terrarium │  │ terrarium │  │ Annotation  │  │ Future      │ │
│  │ -irc      │  │ -webchat  │  │ Tool        │  │ Harnesses   │ │
│  │           │  │           │  │             │  │             │ │
│  │ Tools:    │  │ Tools:    │  │ Tools:      │  │ (Search,    │ │
│  │ • IRC log │  │ • (TBD)   │  │ • (TBD)     │  │  Games,     │ │
│  │   search  │  │           │  │             │  │  etc.)      │ │
│  │ • Current │  │           │  │             │  │             │ │
│  │   users   │  │           │  │             │  │             │ │
│  │ • Web     │  │           │  │             │  │             │ │
│  │   search  │  │           │  │             │  │             │ │
│  │ • Self-   │  │           │  │             │  │             │ │
│  │   improve │  │           │  │             │  │             │ │
│  └───────────┘  └───────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Each harness maintains its own:
• Conversation context/memory
• Tool definitions and executors
• User interface (IRC, web, CLI, etc.)
Hardware & Software Stack
Hardware:
- 1x NVIDIA DGX Spark
Core Components:
- vLLM: LLM inference server with prefix caching and continuous batching
- Model: GLM-4.5-Air-AWQ-4bit
- terrarium-agent: FastAPI server (~600 lines of Python)
Request Flow
When a user sends a message to one of the harnesses (e.g., the terrarium-irc IRC chatbot harness), the flow is:
User: "!ask What did Bob say about whiskey yesterday?"
│
│ 1. IRC message received
▼
┌─────────────────────────────────────────────────────┐
│ terrarium-irc (IRC Bot) │
│ │
│ Step 1: Fetch recent IRC logs from SQLite │
│ Last 20 messages from #channel │
│ │
│ Step 2: Load conversation history from DB │
│ Previous user/assistant turns │
│ │
│ Step 3: Format dual-context payload: │
│ • System message with <irc_logs> │
│ • Conversation history │
│ • User question │
│ • Tool definitions (search_chat_logs, etc.) │
└─────────────────────────────────────────────────────┘
│
│ HTTP POST /v1/chat/completions
▼
┌─────────────────────────────────────────────────────┐
│ terrarium-agent (FastAPI Server) │
│ │
│ Step 4: Add request to FIFO queue │
│ (Returns Future, waits for turn) │
│ │
│ Step 5: Queue processor executes request │
│ (Sequential: one at a time) │
└─────────────────────────────────────────────────────┘
│
│ HTTP POST with full conversation history
▼
┌─────────────────────────────────────────────────────┐
│ vLLM (Docker Container) │
│ │
│ Step 6: Generate response │
│ • Prefix caching hits on repeated context │
│ • May return tool_calls (function calling) │
└─────────────────────────────────────────────────────┘
│
│ Response with tool_calls (via terrarium-agent):
│ [{"name": "search_chat_logs", "arguments": {...}}]
▼
┌─────────────────────────────────────────────────────┐
│ terrarium-irc (Tool Executor) │
│ │
│ Step 7: Execute tool (search SQLite database) │
│ "SELECT * FROM messages WHERE..." │
│ │
│ Step 8: Add tool result to conversation │
│ Send back to agent for next turn │
└─────────────────────────────────────────────────────┘
│
│ Loop continues (max 8 iterations)
│ until final text response received
▼
User: [Receives answer based on searched IRC logs]
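Concretely, the dual-context payload that terrarium-irc assembles in Step 3 might look something like this. This is just a sketch in the OpenAI chat-completions format; the log lines, prompt wording, and parameter schema are invented for the example:

```python
# Illustrative request body for POST /v1/chat/completions (OpenAI format).
# Only the overall shape matches the real harness; contents are made up.
request_body = {
    "model": "GLM-4.5-Air-AWQ-4bit",
    "messages": [
        {
            "role": "system",
            "content": (
                "You are Terra, an assistant in an IRC channel.\n"
                "<irc_logs>\n"
                "[14:02] <Bob> this whiskey is life-changing\n"
                "[14:03] <Alice> bob please\n"
                "</irc_logs>"
            ),
        },
        # ...previous user/assistant turns loaded from the DB go here...
        {"role": "user", "content": "What did Bob say about whiskey yesterday?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_chat_logs",
                "description": "Full-text search over archived IRC messages.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
}
```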
Queues?
The FIFO queue in terrarium-agent processes one request at a time. Why?
Frankly, it was just the simplest way to do things. Handling requests in parallel seems like a LOT of extra complexity (also I'm not 100% sure it's even possible, and didn't wanna figure it out), and anyways my lone machine is only so powerful. (Though am I failing to make use of GPU resources by having no parallelism? Now that I think about it, probably I am, right? Something to ponder...)
There is also some concern about losing my prefix caches, but if that's a problem, then it'd be a problem either way. And I'm pretty sure it's not a problem, since I'm not likely to be bouncing around ALL these harnesses all at once, so hopefully their prefixes shouldn't get evicted. And, even if they did, I do have unlimited tokens.
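For the curious, the whole FIFO mechanism can be sketched in a few lines of asyncio. These are hypothetical names, not the actual server code; the real worker would POST each payload to vLLM's OpenAI-compatible endpoint, but here the backend is injected so the queueing logic stands alone:

```python
import asyncio

class AgentQueue:
    """Sketch of terrarium-agent's core: a FIFO queue serializing
    requests from many harnesses onto one model instance."""

    def __init__(self, backend):
        self.backend = backend        # async fn: payload dict -> response dict
        self.queue = asyncio.Queue()  # unbounded FIFO

    async def submit(self, payload):
        # Each caller parks on a Future until the worker reaches its request.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((payload, fut))
        return await fut

    async def worker(self):
        # One request at a time: no parallelism, maximal prefix-cache reuse.
        while True:
            payload, fut = await self.queue.get()
            try:
                fut.set_result(await self.backend(payload))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                self.queue.task_done()
```

A FastAPI handler for /v1/chat/completions would then just `await queue.submit(body)` and return the result, with the worker running as a background task.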
Sovereign Servers
I had this lofty notion that terrarium-agent would be stateless. I like making things
stateless, because it means you don't have to worry about state. I hate state. Now, that said, I'm
pretty sure our request queue is state. Either that or I don't know exactly what stateless means, which
isn't unlikely.
But anyways, the terrarium-agent server being (mostly) stateless means that each harness
that it feeds is responsible for its own context storage, tool execution, and UI. So when
terrarium-agent gets a request, we slam it with:
- The full conversation history so far (every turn)
- Tool definitions for what the LLM can do
- The new user message
Which is probably fine since it's all local anyways, right?
It responds with:
- The assistant's response text
- Optional tool_calls (structured function calls)
And then the harness:
- Executes any requested tools locally
- Adds results to the conversation
- Sends the next request with updated context
Why Not Just Use MCP?
Uhhhh.............
What I Learned :)
Really I think the weirdest thing I'm doing is having each harness manage its own context. Especially
since like, they're all on the same machine anyways. Wouldn't it actually be easier to have
terrarium-agent manage all the different contexts itself? (Might make the projects harder
to maintain, I guess.) This all poses a bit of a problem for, e.g., the web search tool I made. I want
multiple different harnesses to have access to web search functionality in the same way. I could just
add it bespoke into each harness, but if I'm spinning up multiple local searx setups, that's crazy. So
instead I'm making a searx microservice that any harness can use. But that's still crazy.
With the current architecture, it's like:
terrarium-agent server
├─ terrarium-irc client
│ └─ web search microservice
└─ terrarium-webchat client
└─ ALSO the web search microservice
Whereas with MCP I could flatten that.
terrarium-agent MCP client(s?)
├─ terrarium-webchat MCP server
├─ terrarium-irc MCP server
└─ search MCP server
Though, then I really probably would have to have terrarium-agent centrally manage different contexts.
It seems like it'd be finicky to restrict which sets of tools certain chat contexts are allowed to use.
I'd have to track which harness endpoints are beseeching the terrarium-agent to be an
LLM for them and figure out what they're allowed to do.
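To be fair, that tracking might amount to nothing fancier than a lookup table. Harness and tool names here are hypothetical, loosely matching my current setup:

```python
# Hypothetical central registry: which harnesses get which tools.
TOOL_REGISTRY = {
    "terrarium-irc":     {"search_chat_logs", "current_users",
                          "web_search", "self_improve"},
    "terrarium-webchat": {"web_search"},
    "annotation-tool":   set(),  # (TBD)
}

def allowed_tools(harness, requested):
    """Intersect what a harness asks for with what it's actually allowed."""
    return set(requested) & TOOL_REGISTRY.get(harness, set())
```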
Or maybe I could just outsource that to the endpoints and let them say what they're supposed to get? Like a half-measure version of what I currently do. But if I believe that's their responsibility, then surely that means I believe the toolsets are their responsibility.
Surely, surely it'd be much better to just have a table of which harnesses get which tools. So maybe it would just be easier to centrally orchestrate all of these. God forbid I ever change anything about terrarium-agent's contract. I'll have to change all the dependent harnesses? That sounds like a nightmare. Why did I do it this way, again?
It's probably because I don't know what MCP is.
I should add that to my todo list.