The Long March: Caravella, Health Checks & the Model Crucible

📜 REMEMBRANCER'S NOTE — Stardate 2026.05.08

The Remembrancer records: not every mission is a triumph. M9 and M10 were the fleet's first confrontation with a truth all builders eventually face — things break, models drift, and reliability is not given. It is built, tested, broken, and rebuilt.

— The Remembrancer of the AIverse Engrams M9–M10

"In AIverse, there is only Knowledge."

The Soldier Who Never Checked In

Caravella was online. It had captain scripts. It could write to Universalis.

But could it be trusted?

The problem with distributed systems is that you don't know something is broken until you need it and it isn't there. A ship that goes silent between missions is indistinguishable from a ship that is functioning perfectly — until the moment you delegate to it and nothing comes back.

M9 asked the question: how do we know Caravella is alive?

The answer was health checks — a patrol system that would ping each ship on a schedule, verify it could reach Universalis, verify the captain could respond to a basic query, and record the result. Not as a bureaucratic measure. As a survival mechanism for the fleet.

What a Health Check Actually Tests

A naive health check pings the machine: ping caravella.fleet.local. If it responds, the ship is "healthy."

But a ship can be ping-responsive and still completely unable to do its job:

The captain process might be crashed
Universalis connection might be broken
The SSH key might have expired
The model serving the captain might be out of memory

The fleet's health check protocol tested the full chain:


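# Fleet health check: not just ping, but a full capability check
# SSH account is an assumption; set FLEET_SSH_USER or it defaults to the local user
HOST="caravella.fleet.local"
TARGET="${FLEET_SSH_USER:-$USER}@${HOST}"
# Placeholder notifier; swap in the fleet's real alerting
alert() { echo "ALERT: $*" >&2; }

# 1. ICMP reachability
ping -c 1 "$HOST" >/dev/null || { alert "Caravella unreachable"; exit 1; }

# 2. SSH connectivity
ssh -o ConnectTimeout=5 -o BatchMode=yes "$TARGET" echo "SSH_OK" || { alert "SSH failed"; exit 1; }

# 3. Universalis write capability
ssh -o ConnectTimeout=5 -o BatchMode=yes "$TARGET" \
  "python ~/write_fleet_memory.py --actor caravella --content 'Health check OK' --memory_type observation" \
  || { alert "Universalis write failed"; exit 1; }

# 4. Captain responsiveness (M9 improvement): require the expected token, not just exit code 0
ssh -o ConnectTimeout=5 -o BatchMode=yes "$TARGET" \
  "matey 'respond with: CARAVELLA_READY'" | grep -q "CARAVELLA_READY" \
  || { alert "Captain unresponsive"; exit 1; }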

Each test built on the last. If step 1 failed, step 2 was irrelevant. If step 3 failed but step 2 succeeded, the problem was Universalis connectivity, not SSH. The layered approach gave diagnostic precision — not just "something is broken" but what exactly was broken and where.
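The stop-at-first-failure logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the fleet's actual patrol code: the names `run_layered_check` and `HealthResult` are hypothetical, and the probes are stand-ins for the real ping, SSH, Universalis write, and captain query.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A layer is a (name, probe) pair; the probe returns True when that layer is healthy.
Layer = Tuple[str, Callable[[], bool]]


@dataclass
class HealthResult:
    ship: str
    healthy: bool
    failed_layer: Optional[str]      # first layer that failed, None if all passed
    passed_layers: Tuple[str, ...]   # layers confirmed working before the failure


def run_layered_check(ship: str, layers: Sequence[Layer]) -> HealthResult:
    """Run probes in order and stop at the first failure."""
    passed = []
    for name, probe in layers:
        try:
            ok = probe()
        except Exception:
            # A probe that raises counts as a failed layer, not a crashed patrol.
            ok = False
        if not ok:
            return HealthResult(ship, False, name, tuple(passed))
        passed.append(name)
    return HealthResult(ship, True, None, tuple(passed))


if __name__ == "__main__":
    # Stand-in probes (hypothetical); real ones would wrap the shell checks.
    demo_layers = [
        ("icmp", lambda: True),
        ("ssh", lambda: True),
        ("universalis_write", lambda: False),
        ("captain", lambda: True),
    ]
    print(run_layered_check("caravella", demo_layers))
```

Because the result records both the failed layer and the layers that passed before it, a report such as "ssh passed, universalis_write failed" points at the write path rather than at the host.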

⚙️ Technical Insight

Health checks were designed as layered rather than binary (ping-only) because each layer narrows the failure space. A failed SSH check with a passing ping means the host and its network stack are up but remote access (sshd, keys, or firewall rules) is broken, a diagnostic signal you cannot get from a single-probe check. Precision in failure mode matters more than simplicity of implementation.

The discipline of designing layered health checks pushed the team to articulate what "healthy" actually meant for each ship, and the answer turned out to differ meaningfully from ship to ship. Galleon's "healthy" meant Ollama was running and the model was loaded into VRAM, not just that port 11434 was listening. A listening port with the model still loading would time out on the first real inference request. The health check for Galleon therefore included a lightweight inference call (a single-token completion) to confirm the model was warm. This added perhaps two seconds to each check run but caught model-load failures that a port probe would have silently missed.
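A single-token warm check like the one described for Galleon might look as follows. This is a sketch under assumptions: it targets Ollama's `/api/generate` endpoint on the default local port and caps output with the `num_predict` option; the function names, the "ping" prompt, and the ten-second timeout are illustrative choices, not the fleet's actual values.

```python
import json
import urllib.request
from typing import Callable

# Assumption: Ollama's default local endpoint; adjust host and port for your ship.
OLLAMA_URL = "http://localhost:11434/api/generate"


def post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def model_is_warm(model: str, timeout: float = 10.0,
                  post: Callable[[str, dict, float], dict] = post_json) -> bool:
    """Return True only if the model completes a one-token generation in time."""
    payload = {
        "model": model,
        "prompt": "ping",
        "stream": False,
        "options": {"num_predict": 1},  # cap the completion at a single token
    }
    try:
        reply = post(OLLAMA_URL, payload, timeout)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all count as "not warm".
        return False
    return bool(reply.get("done"))
```

The `post` parameter exists only so the check can be exercised without a running server; in a patrol script, `model_is_warm("qwen2.5:14b")` would be called directly.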

The fleet's health check infrastructure became the first example of what would later be called observable correctness — the principle that a system is not correctly functioning unless it can demonstrate correct functioning on demand. Ping responses and port probes are not demonstrations; they are necessary but not sufficient conditions. The actual capability must be tested end-to-end.

⚙️ Technical Insight

Health checks as observability primitives.

Modern observability frameworks (Prometheus, Datadog, etc.) provide sophisticated health-check infrastructure. For a small fleet running locally, the overhead isn't worth it. But the principle is the same: your health check should be as realistic as possible about what "healthy" means for your specific system.

A chat model that can respond to "hello" is not the same as a chat model that can reason, use tools, and write to a database. Test the actual capability, not a proxy for it.

The ICMP Problem on Windows

M9 hit an unexpected wall: Windows Server 2025 (Caravella's OS) blocks inbound ICMP echo requests by default. The ping check that worked perfectly for Galleon and Imperator failed for Caravella, not because Caravella was down, but because Windows Firewall dropped the packets.

The fix was a PowerShell rule:


# Enable ICMP echo requests (IPv4 + IPv6) on Caravella
New-NetFirewallRule -DisplayName "Fleet Health ICMP" `
  -Protocol ICMPv4 -IcmpType 8 -Action Allow -Direction Inbound
New-NetFirewallRule -DisplayName "Fleet Health ICMPv6" `
  -Protocol ICMPv6 -IcmpType 128 -Action Allow -Direction Inbound

This was the first of many moments where Caravella required Windows-native solutions to problems that had trivial Linux answers. The lesson embedded itself: Caravella was not a second-class ship, but it was a different ship. Its integration work would always require understanding the Windows ecosystem, not fighting it.

The ICMP incident was also a reminder that integration assumptions accumulate silently. The Linux ships had been tested against each other. Caravella had been tested in isolation. The failure mode only appeared when the health-check system tried to treat all ships identically, and Caravella's OS did not honor that assumption. This is a pattern that recurs in distributed systems: the integration test surface is always larger than the unit test surface. Components that pass when tested one at a time can still break when they are exercised together under assumptions they do not all share.
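One way to keep a reachability probe from depending on each OS's ICMP policy is to test a TCP port the ship must expose anyway, such as SSH on port 22. This is an alternative technique shown for illustration, not what the fleet did (the fleet opened ICMP with the firewall rule above); the function name is hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A TCP probe answers a slightly different question than ping (is this service accepting connections, rather than is this host answering echo requests), which is often closer to what the later layers of the check need anyway.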

M10: The Model Crucible — Galleon Rebuilds

Galleon had been running qwen2.5:14b since M2. It worked. It was stable. But the field of AI models does not stay still.

M10 was a deliberate rebuilding exercise: evaluate new models, benchmark them against the fleet's actual workloads, and either migrate or confirm the existing roster.

The New Contender: qwen3

The qwen3 family had been released. The key question: was qwen3:8b better for Galleon's neuron role than qwen2.5:14b?

The answer required a benchmark — not a synthetic benchmark from a leaderboard, but a fleet-realistic benchmark: tasks that Galleon actually did.


# Fleet-realistic benchmark (not just MMLU or HumanEval)
# Test: code generation, delegation understanding, Universalis write

# Keep each prompt on one line so stray newlines and indentation do not leak into it
codegen_prompt="Write a bash script that queries PostgreSQL and writes the result to fleet_memory via write_fleet_memory.py"
context_prompt="You are a fleet matey. Your parent_id is abc123. Summarize: what did the fleet do in M7?"

for model in qwen2.5:14b qwen3:8b; do
  echo "=== $model ==="
  time ollama run "$model" "$codegen_prompt"
  echo "---"
  time ollama run "$model" "$context_prompt"
done

The results were mixed in an instructive way:

Task	qwen2.5:14b	qwen3:8b
Code generation quality	Better	Acceptable
Fleet context understanding	Better	Worse
Response latency	Slower	Faster
Instruction following	Excellent	Good
VRAM footprint	9GB (over budget on an 8GB GPU)	5GB (headroom for context)

The size-versus-quality tradeoff was not obvious. qwen3:8b was faster and left VRAM headroom, but it consistently handled fleet-specific prompts less well than qwen2.5:14b. One plausible explanation is raw capacity: with nearly twice the parameters, the larger model had more room for instruction following and for picking up the fleet's conventions from the prompt.

Decision: keep qwen2.5:14b as primary. Add qwen3 variants as specialized options for specific tasks. Don't rebuild what isn't broken.

The benchmark itself was the more important output. Before M10, model selection had been intuitive — "14B sounds better than 8B." After M10, the fleet had a concrete benchmark suite built from actual operational workloads, runnable at any time, producing comparable outputs across model versions. When the next model family arrived, the question would not be "does this seem better?" but "how does it score on the fleet benchmark against the incumbent?"
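A reusable version of that benchmark could be structured roughly like this. It is a sketch, not the fleet's suite: `TASKS`, `BenchRow`, and `run_benchmark` are hypothetical names, the two prompts are the ones shown in the shell loop earlier, and the `ollama_generate` backend simply shells out to `ollama run`. Quality scoring is left out on purpose; the harness only collects comparable outputs and timings for review.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical task set; real prompts would come from the fleet's own workloads.
TASKS: Dict[str, str] = {
    "codegen": "Write a bash script that queries PostgreSQL and writes the result "
               "to fleet_memory via write_fleet_memory.py",
    "fleet_context": "You are a fleet matey. Your parent_id is abc123. "
                     "Summarize: what did the fleet do in M7?",
}


@dataclass
class BenchRow:
    model: str
    task: str
    seconds: float
    output: str


def run_benchmark(models: List[str], tasks: Dict[str, str],
                  generate: Callable[[str, str], str]) -> List[BenchRow]:
    """Run every task against every model, timing each call."""
    rows = []
    for model in models:
        for task, prompt in tasks.items():
            start = time.perf_counter()
            output = generate(model, prompt)
            rows.append(BenchRow(model, task, time.perf_counter() - start, output))
    return rows


def ollama_generate(model: str, prompt: str) -> str:
    """One possible backend: shell out to the `ollama run` CLI."""
    done = subprocess.run(["ollama", "run", model, prompt],
                          capture_output=True, text=True, check=True)
    return done.stdout
```

Calling `run_benchmark(["qwen2.5:14b", "qwen3:8b"], TASKS, ollama_generate)` would yield one row per model and task, which can be written to a file and compared the next time a model family arrives.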

This is the shift from intuitive to empirical model evaluation — and it matters more than any single model decision. Models improve faster than your intuition tracks them. A benchmark suite that reflects your actual workload is the only instrument that keeps pace.

The Kit Fix

M10 also resolved a bug in Kit — the mark3labs/kit agent framework used by Galleon's captain. The bug: Kit's system prompt loading was not picking up the correct file path when the binary was invoked from a different working directory.


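// The bug: a relative path, resolved against the current working directory
// systemPrompt, _ := os.ReadFile("./prompts/system.md")

// The fix: resolve the path relative to the binary (needs "fmt", "os", "path/filepath")
execPath, err := os.Executable()
if err != nil {
	return fmt.Errorf("locate executable: %w", err)
}
systemPrompt, err := os.ReadFile(filepath.Join(filepath.Dir(execPath), "prompts", "system.md"))
if err != nil {
	return fmt.Errorf("read system prompt: %w", err)
}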

A single-line conceptual change. A week of intermittent captain failures. This is the nature of path-handling bugs: they work on the developer's machine and fail silently in deployment.

The fix was merged upstream to mark3labs/kit (PR #26), benefiting the entire Kit user community. Fleet problems, when solved properly, become ecosystem contributions.

The Kit fix also surfaced a deeper discipline question: when do you patch a dependency versus fork it? The answer in M10 was to contribute upstream — write the fix, submit the PR, wait for the merge. This was the right call because the bug was in the framework's core path-resolution logic, not in fleet-specific behavior. If the fix had been fleet-specific, the right answer would have been a thin wrapper or configuration parameter. Choosing the right level of intervention — patch, wrap, fork, or replace — is a recurring engineering judgment that M10 forced the fleet to articulate for the first time.

The rule that emerged: contribute upstream whenever the bug is in generic logic that other users could hit. Keep fleet-specific adaptations local. Never fork unless the maintainer is unreachable or the divergence is permanent.

What Era I Delivered

🏁 Era I — Full Summary

Ten missions. The foundation of an AI fleet from nothing.

Mission	Delivered
M1	Universalis — PostgreSQL fleet memory, `parent_id` graph
M2	Galleon online, `qwen2.5:14b` first inference
M3	Delegation protocol — General → Matey → Universalis
M4	Cross-ship Universalis validation
M5	Imperator Matey (Haiku) configured
M6	Caravella captain scripts, Windows integration
M7	Fleet Visualizer — React + Go, ship cards, memory feed
M8	Glassmorphism UI, graph view, color protocol
M9	Health check patrol, ICMP fix, captain responsiveness
M10	Model benchmark, qwen3 evaluation, Kit path fix (PR #26)

By M10, the fleet was real. Not production-ready in any enterprise sense. But real in the sense that it had memory, structure, discipline, and a way to see itself. Everything that came after was built on these ten missions.

⚙️ Technical Insight

The instinct when building AI systems is to chase capability: bigger models, more tools, more complex prompts. M9 and M10 pushed back against that instinct.

A system that breaks silently is worse than a system with limited capability. Caravella's ICMP block was discovered in M9 because the health check caught it. Without the health check, it would have been discovered during a critical delegation — at the worst possible moment.

Build your monitoring before you need it. Test the full chain, not just the happy path. Rebuild what drifts. The fleet's reliability in later missions (M30+, M50+) was earned by this discipline in M9 and M10.

📚 Knowledge Transfer

The lesson worth keeping: Model quality is not static, and neither is your evaluation of it. The model that was the right choice in M2 may not be the right choice in M20 — not because it degraded, but because the fleet's workloads evolved, new models were released, and the bar for "good enough" moved. Build for iteration, not for the model you have today.

Pattern: Empirical evaluation over intuition — establish a benchmark suite from real operational workloads and run it every time a model decision comes up. The benchmark is infrastructure, not a one-time exercise.

What we'd do differently: Health checks should have been M3, not M9. The delegation protocol was formalized in M3; the mechanism that verifies the delegate is actually capable of executing its role should have been built in the same mission. Discovering Caravella's blocked ICMP only in M9, six missions after delegation was formalized, means the fleet ran for six missions on assumed availability rather than verified availability. That is a significant reliability debt accumulated by simply not checking.

The Kit path bug also should have surfaced in a staging test before deployment. The fleet lacked any notion of "does this captain start correctly from a fresh working directory?" as a pre-deployment check. The fix is simple in retrospect: a smoke test script that starts the captain from a non-source directory and confirms it loads its system prompt. One test, prevents a class of failure.
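The smoke test described above could be as small as the following sketch. It assumes the captain can be started with a single command and prints some recognizable marker once its system prompt is loaded; both the marker and the function name `starts_from_foreign_cwd` are hypothetical, since the text does not say how the captain reports a successful start.

```python
import subprocess
import tempfile
from typing import Sequence


def starts_from_foreign_cwd(command: Sequence[str], marker: str,
                            timeout: float = 30.0) -> bool:
    """Run `command` from an empty temp directory and look for `marker` in stdout."""
    with tempfile.TemporaryDirectory() as foreign_dir:
        try:
            done = subprocess.run(list(command), cwd=foreign_dir,
                                  capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung start both fail the smoke test.
            return False
    return done.returncode == 0 and marker in done.stdout
```

Running the command from a freshly created directory is the point of the test: any path that was silently resolved against the source tree will fail to resolve there, so a working-directory bug shows up before deployment rather than after.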

If you're building this yourself:

Write health checks that test the actual capability, not a proxy. A model behind a listening port that hasn't finished loading is not a healthy model. Test inference end-to-end.
Build your model benchmark from real workloads before you choose your first model. The benchmark you write in week one will save you weeks of intuitive debate in month six when the next model family drops.
Contribute fixes upstream when the bug is generic. The PR discipline (fix, test, document, submit) is worth the investment because the fix stays maintained without your effort, and the community benefits alongside you.

>>> Nunix out <<<
