LLM Benchmarking | Michael W. Danko

Abstract

This page compares a set of local language models in two modes. First, each model was hosted locally and evaluated by talking to it directly through its model server. Then the same model was kept local, but accessed through Hermes instead, which adds the surrounding agent setup needed for longer, more structured, tool-using work.

That lets the benchmark answer two related but different questions. One is whether a local model can give a good answer when used directly. The other is whether the same model still holds up once it has to operate inside a real local agent workflow.

The benchmark shows two things at once: local AI is already practical for a lot of everyday work, and reliable multi-step agent behavior is still much harder than a good single-prompt demo.

Summary

Local inference is no longer a fringe experiment. On modern personal hardware, several models are already fast enough to feel useful for summarization, drafting, note work, private assistance, and other tightly scoped tasks.

The benchmark becomes more revealing when the model has to complete a job rather than answer a prompt. In real settings, that means retaining context, making decisions across multiple turns, using tools correctly, and returning something complete enough to trust.

Core conclusion: local models are ready for meaningful everyday use. Reliable local agents remain a much tougher engineering and model-selection problem.

Why this matters

A lot of public AI discussion still focuses on how quickly or fluently a model can answer a single prompt. That view is useful, but it is not the full story. Most real work is not one prompt and one answer. It is keeping the thread of the task, handling more context, using tools correctly, and staying coherent long enough to finish something useful.

This distinction matters especially for local AI. The point of using a personal MacBook Pro here was not just that it happened to be the machine available. The goal was to test whether local AI is becoming realistic for the kind of computer an individual professional or consumer might actually own, and to show how much performance can change once a model is placed inside an agent harness rather than used as a simple chatbot.

Framing the benchmark this way makes the results easier to relate to everyday use. A benchmark on specialized hardware can be interesting, but it does not answer the same question. This benchmark is really asking: how far can a normal personal machine go, and what do you give up when you ask that machine to run a more capable, more structured AI assistant locally?

Methodology

The direct benchmark track measures raw model-server behavior on Hermes-mirrored cases. Each model is hosted locally and queried directly through its serving endpoint with no agent layer in between. For the tool-based jobs, the direct track embeds the same fixture evidence directly into the prompt because a raw endpoint cannot actually read files or execute terminal commands.

The Hermes benchmark track runs those same underlying jobs with the model still hosted locally, but accessed through Hermes instead of called directly. That adds the surrounding agent stack: real profiles, larger system prompts, tool schemas, file-reading tasks, terminal-enabled debugging tasks, and compliance checks that more closely resemble actual agent usage.

As a result, the comparison is now much closer to an apples-to-apples benchmark by job. It is still not execution-identical. The direct track turns the evidence into a raw prompt, while the Hermes track makes the agent read files, run commands, and work through a real tool-using workflow. That makes this a much cleaner view of the cost of turning a raw model into a real local agent.

In the full benchmark, I also tracked whether each Hermes run actually completed the required job steps, such as reading the right files, using the right tools, and returning the required sections. I left those scores out of the public tables here to keep the page easier to read, but they still informed the written analysis and model recommendations.

There is also an important runtime distinction in the model list. MLX refers to Apple’s machine learning framework and the surrounding tooling built to run models efficiently on Apple Silicon. In practice, it is the more Mac-native path in this benchmark. GGUF is a portable model file format used heavily with tools such as llama.cpp, especially for quantized local models that are meant to be easy to distribute and run across many setups.

Two models can behave differently because of their size or quality and the local runtime used to serve them. In practice, that can affect startup time, memory usage, and how responsive the model feels during longer tasks.

What the benchmark cases actually ask the model to do

One weakness of benchmark tables is that they often hide the job behind a short label. In this benchmark, each case is meant to resemble a recognizable kind of work rather than a vague prompt category.

Long conversation memory and context growth: a 22-turn discussion about local-model benchmarking, prompt overhead, user-perceived slowness, and measurement strategy. The model has to keep revising earlier claims, reuse prior context, and end with a concise handoff.
Recall-heavy conversation: a 12-turn conversation where the model must remember exact benchmark dimensions and evaluation constraints, transform them across later turns, and finish with an operator checklist that uses the same terms correctly.
File-only docs review: synthesize a product spec, meeting notes, incident log, and pricing policy into four sections: summary, risks, open questions, and launch recommendation.
Bug debugging workflow: inspect a failing billing scenario, review `pricing.py`, `subscription.py`, test/check scripts, and expected totals, then explain the failure, root cause, minimum fix, and regression coverage.
Triage report workflow: combine failing test output, profiling output, product/policy docs, and fixture evidence into a concise go/no-go release recommendation.
Schema-heavy variants: run the same underlying docs-review or bug-debugging job, but inside a much wider Hermes tool menu so the benchmark captures the overhead of larger tool schemas.
Skill-augmented triage: run the same triage job with additional benchmark-owned operating instructions loaded first, so the model has to work through extra reusable guidance before doing the task.

In the direct track, those jobs are represented by embedding the same fixture evidence directly into the prompt. In the Hermes track, the agent has to work through the live workflow instead. That difference is a big reason the workflow cases separate the models more clearly.

Primary findings

1. Direct speed and practical usefulness are not the same thing

Some of the strongest raw direct benchmark results came from smaller models. That speed advantage often weakened once those same models had to manage tools, longer context windows, and multi-step workflow requirements. In other words, the fastest endpoint was not always the most useful system.

2. Workflow reliability often mattered more than raw latency

A model that is moderately slower but remains stable under workflow pressure is usually more valuable than a faster model that truncates, stalls, skips tools, or produces incomplete work. The strongest Hermes candidates were not simply the fastest. They were the models that remained more dependable as the workflow became more demanding.

3. Larger models held up better once orchestration pressure increased

The better-performing Gemma variants, especially Gemma 4 26B MLX, offered the strongest balance between acceptable direct performance and credible Hermes behavior. This does not mean that larger is always better, but it does suggest that once the workflow becomes more agentic, capability and stability begin to matter more than isolated speed wins.

Interpretation by model family

This is still a small set of models, so I do not want to overstate the conclusions. But even with a limited sample, a few patterns show up clearly enough to be useful. The most obvious one is the contrast between the Qwen runs and the Gemma runs once the benchmark moves beyond direct prompts and into longer, more structured workflows.

The smaller Qwen models demonstrate the central lesson of the benchmark. They can appear very strong when measured directly, particularly on short and simple cases. However, once moved into agent-style workflows, they become less compelling because workflow reliability declines and the performance gap grows sharply.

Qwen 3.6 35B A3B MLX shows a more capable profile than the earlier Qwen models and performs much better in Hermes on the conversation-oriented cases. Even so, the harder workflow cases still reveal meaningful operational risk, including failures on some of the most demanding tasks.

The Gemma family produced the most balanced story in this small sample. Gemma 4 E4B MLX was respectable directly, though less convincing on harder agent tasks. Gemma 4 26B MLX looked like the strongest overall balance point, combining solid direct usability with the most persuasive Hermes behavior in the set. The GGUF variants were slower in some direct measurements, but in several cases they remained surprisingly competitive once workflow reliability became the main criterion.

Practical implications

If the goal is a local chatbot, a personal drafting assistant, or a private summarization tool, local AI is already in a good place. Multiple models in this benchmark appear viable for those kinds of controlled tasks.

If the goal is a local AI worker that can reliably handle multi-step tasks, use tools, and stay coherent through longer workflows, the field narrows quickly. At that point, model selection becomes much more important, and the orchestration layer becomes part of the performance story, not just background plumbing. Put more simply, the direct benchmarks mostly answer whether a model is fast enough, while the Hermes benchmarks answer whether it stays trustworthy once real workflow pressure is introduced.

Conclusion

The practical takeaway is straightforward. Local AI is already useful on a personal machine, but usefulness is not the same thing as reliability in a local agent. As soon as context grows, tools enter the picture, and workflow correctness matters, the differences between models become much easier to see.

Even so, the overall result is encouraging. For simpler and more controlled use cases, the current generation already looks strong. For more autonomous, multi-step agentic work, local AI is promising, but you have to be much more selective about the model and runtime you choose.

Technical findings

The sections above provide the narrative interpretation. The tables below show all seven benchmark case families directly. Hover over the column labels for a quick explanation of what each metric represents.

Conversation cases

ModelThe specific model and runtime variant used in the benchmark row.	Direct long conversationTotal elapsed time for the 22-turn direct benchmark conversation.	Hermes long conversationTotal elapsed time for the same long conversation job when run through Hermes.	Direct recall conversationTotal elapsed time for the 12-turn direct recall-heavy conversation.	Hermes recall conversationTotal elapsed time for the same recall-heavy conversation when run through Hermes.
Qwen 3.5 0.8B	0m 50.7s	6m 14.0s	0m 24.5s	2m 56.1s
Qwen 3.5 9B	3m 16.1s	24m 33.3s	0m 54.5s	13m 10.6s
Qwen 3.6 35B A3B MLX	2m 9.2s	42m 38.1s	0m 49.3s	11m 27.0s
Gemma 4 E4B MLX	1m 55.9s	6m 3.7s	0m 26.0s	2m 23.7s
Gemma 4 26B MLX	2m 41.6s	18m 5.2s	0m 45.6s	7m 24.8s
Gemma 4 E4B GGUF	2m 42.6s	1m 41.9s	0m 34.1s	0m 29.5s
Gemma 4 26B A4B GGUF	3m 32.8s	18m 31.0s	1m 51.8s	12m 0.8s

Core task cases

ModelThe specific model and runtime variant used in the benchmark row.	Direct docs reviewAverage time for the direct docs-review task, where the fixture evidence is embedded directly in the prompt.	Hermes docs reviewElapsed time for the same docs-review job through Hermes.	Direct bug debugAverage time for the direct bug-debug case, where failing evidence is embedded into the prompt instead of gathered with tools.	Hermes bug debugElapsed time for the same debugging job through Hermes.	Direct triageAverage time for the direct triage-report case.	Hermes triageElapsed time for the same triage-report job through Hermes.
Qwen 3.5 0.8B	0m 3.9s	0m 24.1s	0m 3.2s	0m 27.2s	0m 3.6s	0m 21.2s
Qwen 3.5 9B	0m 11.7s	1m 45.6s	0m 10.4s	2m 8.9s	0m 11.7s	1m 50.7s
Qwen 3.6 35B A3B MLX	0m 9.8s	3m 56.3s	0m 8.4s	2m 11.4s	0m 9.5s	2m 10.5s
Gemma 4 E4B MLX	0m 8.7s	1m 45.8s	0m 7.4s	0m 57.4s	0m 8.3s	1m 8.7s
Gemma 4 26B MLX	0m 10.0s	1m 54.4s	0m 8.3s	1m 23.3s	0m 9.9s	1m 47.9s
Gemma 4 E4B GGUF	0m 13.6s	0m 53.8s	0m 13.3s	0m 56.4s	0m 10.9s	1m 7.0s
Gemma 4 26B A4B GGUF	0m 14.0s	1m 27.2s	0m 13.7s	2m 4.3s	0m 12.2s	3m 14.4s

Schema-heavy and skill-augmented cases

ModelThe specific model and runtime variant used in the benchmark row.	Direct schema-heavy docsAverage time for the direct schema-heavy docs-review variant.	Hermes schema-heavy docsElapsed time for the schema-heavy docs-review case in Hermes.	Direct schema-heavy bugAverage time for the direct schema-heavy bug-debug variant.	Hermes schema-heavy bugElapsed time for the schema-heavy bug-debug case in Hermes.	Direct skill-augmented triageAverage time for the direct skill-augmented triage variant.	Hermes skill-augmented triageElapsed time for the skill-augmented triage case in Hermes.
Qwen 3.5 0.8B	0m 3.9s	0m 26.3s	0m 3.2s	0m 43.4s	0m 3.9s	0m 57.9s
Qwen 3.5 9B	0m 12.4s	4m 50.6s	0m 10.3s	3m 57.0s	0m 12.6s	2m 32.0s
Qwen 3.6 35B A3B MLX	0m 9.8s	5m 2.0s	0m 8.4s	0m 30.4s	0m 10.3s	0m 33.1s
Gemma 4 E4B MLX	0m 8.6s	2m 34.8s	0m 7.2s	1m 53.8s	0m 8.9s	1m 17.8s
Gemma 4 26B MLX	0m 11.3s	2m 31.2s	0m 11.1s	2m 49.3s	0m 16.4s	1m 48.2s
Gemma 4 E4B GGUF	0m 14.2s	2m 22.2s	0m 14.3s	2m 10.3s	0m 11.6s	1m 29.3s
Gemma 4 26B A4B GGUF	0m 13.0s	4m 10.2s	0m 13.5s	4m 47.0s	0m 11.8s	3m 18.7s

Runtime cost and practical read

ModelThe specific model and runtime variant used in the benchmark row.	Direct peak memoryPeak memory observed during the direct benchmark run for that model. This gives a rough sense of local runtime cost.	Direct startup timeHow long the raw model server took to become ready before benchmarking began.	Practical readA short recommendation-style takeaway based on the direct and Hermes results together.
Qwen 3.5 0.8B	1.9 GB	0m 3.6s	Excellent for lightweight local experimentation, but not the best choice for dependable multi-step agent work.
Qwen 3.5 9B	5.3 GB	0m 4.0s	Respectable as a raw model, but still a weak fit for serious local orchestration.
Qwen 3.6 35B A3B MLX	2.6 GB	0m 10.4s	Promising high-end option, but not yet the cleanest all-around Hermes recommendation.
Gemma 4 E4B MLX	4.6 GB	0m 4.5s	Usable for simpler local workflows, but less convincing for robust agentic execution.
Gemma 4 26B MLX	13.3 GB	0m 6.4s	The strongest all-around local Hermes candidate in this benchmark.
Gemma 4 E4B GGUF	9.0 GB	0m 5.8s	A strong value option if workflow reliability matters more than raw speed.
Gemma 4 26B A4B GGUF	20.8 GB	0m 18.2s	Powerful, but it carries a meaningful cost in memory, startup time, and operational weight.

Can local AI do real work?

Abstract

Summary

Why this matters

Methodology

What the benchmark cases actually ask the model to do

Primary findings

1. Direct speed and practical usefulness are not the same thing

2. Workflow reliability often mattered more than raw latency

3. Larger models held up better once orchestration pressure increased

Interpretation by model family

Practical implications

Conclusion

Technical findings

Conversation cases

Core task cases

Schema-heavy and skill-augmented cases

Runtime cost and practical read