Independent Analysis

Can local AI do real work?

The short answer is yes. Local language models are already viable for a meaningful range of practical tasks on a personal computer. The harder question is whether they can operate as dependable agents inside multi-step, tool based workflows.

Abstract

This page compares a set of local language models in two modes. First, each model was hosted locally and evaluated by talking to it directly through its model server. Then the same model was kept local, but accessed through Hermes instead, which adds the surrounding agent setup needed for longer, more structured, tool-using work.

That lets the benchmark answer two related but different questions. One is whether a local model can give a good answer when used directly. The other is whether the same model still holds up once it has to operate inside a real local agent workflow.

The benchmark shows two things at once: local AI is already practical for a lot of everyday work, and reliable multi-step agent behavior is still much harder than a good single-prompt demo.

Summary

Local inference is no longer a fringe experiment. On modern personal hardware, several models are already fast enough to feel useful for summarization, drafting, note work, private assistance, and other tightly scoped tasks.

The benchmark becomes more revealing when the model has to complete a job rather than answer a prompt. In real settings, that means retaining context, making decisions across multiple turns, using tools correctly, and returning something complete enough to trust.

Core conclusion: local models are ready for meaningful everyday use. Reliable local agents remain a much tougher engineering and model-selection problem.

Why this matters

A lot of public AI discussion still focuses on how quickly or fluently a model can answer a single prompt. That view is useful, but it is not the full story. Most real work is not one prompt and one answer. It is keeping the thread of the task, handling more context, using tools correctly, and staying coherent long enough to finish something useful.

This distinction matters especially for local AI. The point of using a personal MacBook Pro here was not just that it happened to be the machine available. The goal was to test whether local AI is becoming realistic for the kind of computer an individual professional or consumer might actually own, and to show how much performance can change once a model is placed inside an agent harness rather than used as a simple chatbot.

Framing the benchmark this way makes the results easier to relate to everyday use. A benchmark on specialized hardware can be interesting, but it does not answer the same question. This benchmark is really asking: how far can a normal personal machine go, and what do you give up when you ask that machine to run a more capable, more structured AI assistant locally?

Methodology

The direct benchmark track measures raw model-server behavior on Hermes-mirrored cases. Each model is hosted locally and queried directly through its serving endpoint with no agent layer in between. For the tool-based jobs, the direct track embeds the same fixture evidence directly into the prompt because a raw endpoint cannot actually read files or execute terminal commands.

The Hermes benchmark track runs those same underlying jobs with the model still hosted locally, but accessed through Hermes instead of called directly. That adds the surrounding agent stack: real profiles, larger system prompts, tool schemas, file-reading tasks, terminal-enabled debugging tasks, and compliance checks that more closely resemble actual agent usage.

As a result, the comparison is now much closer to an apples-to-apples benchmark by job. It is still not execution-identical. The direct track turns the evidence into a raw prompt, while the Hermes track makes the agent read files, run commands, and work through a real tool-using workflow. That makes this a much cleaner view of the cost of turning a raw model into a real local agent.

In the full benchmark, I also tracked whether each Hermes run actually completed the required job steps, such as reading the right files, using the right tools, and returning the required sections. I left those scores out of the public tables here to keep the page easier to read, but they still informed the written analysis and model recommendations.

There is also an important runtime distinction in the model list. MLX refers to Apple’s machine learning framework and the surrounding tooling built to run models efficiently on Apple Silicon. In practice, it is the more Mac-native path in this benchmark. GGUF is a portable model file format used heavily with tools such as llama.cpp, especially for quantized local models that are meant to be easy to distribute and run across many setups.

Two models can behave differently because of their size or quality and the local runtime used to serve them. In practice, that can affect startup time, memory usage, and how responsive the model feels during longer tasks.

What the benchmark cases actually ask the model to do

One weakness of benchmark tables is that they often hide the job behind a short label. In this benchmark, each case is meant to resemble a recognizable kind of work rather than a vague prompt category.

  • Long conversation memory and context growth: a 22-turn discussion about local-model benchmarking, prompt overhead, user-perceived slowness, and measurement strategy. The model has to keep revising earlier claims, reuse prior context, and end with a concise handoff.
  • Recall-heavy conversation: a 12-turn conversation where the model must remember exact benchmark dimensions and evaluation constraints, transform them across later turns, and finish with an operator checklist that uses the same terms correctly.
  • File-only docs review: synthesize a product spec, meeting notes, incident log, and pricing policy into four sections: summary, risks, open questions, and launch recommendation.
  • Bug debugging workflow: inspect a failing billing scenario, review `pricing.py`, `subscription.py`, test/check scripts, and expected totals, then explain the failure, root cause, minimum fix, and regression coverage.
  • Triage report workflow: combine failing test output, profiling output, product/policy docs, and fixture evidence into a concise go/no-go release recommendation.
  • Schema-heavy variants: run the same underlying docs-review or bug-debugging job, but inside a much wider Hermes tool menu so the benchmark captures the overhead of larger tool schemas.
  • Skill-augmented triage: run the same triage job with additional benchmark-owned operating instructions loaded first, so the model has to work through extra reusable guidance before doing the task.

In the direct track, those jobs are represented by embedding the same fixture evidence directly into the prompt. In the Hermes track, the agent has to work through the live workflow instead. That difference is a big reason the workflow cases separate the models more clearly.

Primary findings

1. Direct speed and practical usefulness are not the same thing

Some of the strongest raw direct benchmark results came from smaller models. That speed advantage often weakened once those same models had to manage tools, longer context windows, and multi-step workflow requirements. In other words, the fastest endpoint was not always the most useful system.

2. Workflow reliability often mattered more than raw latency

A model that is moderately slower but remains stable under workflow pressure is usually more valuable than a faster model that truncates, stalls, skips tools, or produces incomplete work. The strongest Hermes candidates were not simply the fastest. They were the models that remained more dependable as the workflow became more demanding.

3. Larger models held up better once orchestration pressure increased

The better-performing Gemma variants, especially Gemma 4 26B MLX, offered the strongest balance between acceptable direct performance and credible Hermes behavior. This does not mean that larger is always better, but it does suggest that once the workflow becomes more agentic, capability and stability begin to matter more than isolated speed wins.

Interpretation by model family

This is still a small set of models, so I do not want to overstate the conclusions. But even with a limited sample, a few patterns show up clearly enough to be useful. The most obvious one is the contrast between the Qwen runs and the Gemma runs once the benchmark moves beyond direct prompts and into longer, more structured workflows.

The smaller Qwen models demonstrate the central lesson of the benchmark. They can appear very strong when measured directly, particularly on short and simple cases. However, once moved into agent-style workflows, they become less compelling because workflow reliability declines and the performance gap grows sharply.

Qwen 3.6 35B A3B MLX shows a more capable profile than the earlier Qwen models and performs much better in Hermes on the conversation-oriented cases. Even so, the harder workflow cases still reveal meaningful operational risk, including failures on some of the most demanding tasks.

The Gemma family produced the most balanced story in this small sample. Gemma 4 E4B MLX was respectable directly, though less convincing on harder agent tasks. Gemma 4 26B MLX looked like the strongest overall balance point, combining solid direct usability with the most persuasive Hermes behavior in the set. The GGUF variants were slower in some direct measurements, but in several cases they remained surprisingly competitive once workflow reliability became the main criterion.

Practical implications

If the goal is a local chatbot, a personal drafting assistant, or a private summarization tool, local AI is already in a good place. Multiple models in this benchmark appear viable for those kinds of controlled tasks.

If the goal is a local AI worker that can reliably handle multi-step tasks, use tools, and stay coherent through longer workflows, the field narrows quickly. At that point, model selection becomes much more important, and the orchestration layer becomes part of the performance story, not just background plumbing. Put more simply, the direct benchmarks mostly answer whether a model is fast enough, while the Hermes benchmarks answer whether it stays trustworthy once real workflow pressure is introduced.

Conclusion

The practical takeaway is straightforward. Local AI is already useful on a personal machine, but usefulness is not the same thing as reliability in a local agent. As soon as context grows, tools enter the picture, and workflow correctness matters, the differences between models become much easier to see.

Even so, the overall result is encouraging. For simpler and more controlled use cases, the current generation already looks strong. For more autonomous, multi-step agentic work, local AI is promising, but you have to be much more selective about the model and runtime you choose.

Technical findings

The sections above provide the narrative interpretation. The tables below show all seven benchmark case families directly. Hover over the column labels for a quick explanation of what each metric represents.

Conversation cases

ModelThe specific model and runtime variant used in the benchmark row. Direct long conversationTotal elapsed time for the 22-turn direct benchmark conversation. Hermes long conversationTotal elapsed time for the same long conversation job when run through Hermes. Direct recall conversationTotal elapsed time for the 12-turn direct recall-heavy conversation. Hermes recall conversationTotal elapsed time for the same recall-heavy conversation when run through Hermes.
Qwen 3.5 0.8B0m 50.7s6m 14.0s0m 24.5s2m 56.1s
Qwen 3.5 9B3m 16.1s24m 33.3s0m 54.5s13m 10.6s
Qwen 3.6 35B A3B MLX2m 9.2s42m 38.1s0m 49.3s11m 27.0s
Gemma 4 E4B MLX1m 55.9s6m 3.7s0m 26.0s2m 23.7s
Gemma 4 26B MLX2m 41.6s18m 5.2s0m 45.6s7m 24.8s
Gemma 4 E4B GGUF2m 42.6s1m 41.9s0m 34.1s0m 29.5s
Gemma 4 26B A4B GGUF3m 32.8s18m 31.0s1m 51.8s12m 0.8s

Core task cases

ModelThe specific model and runtime variant used in the benchmark row. Direct docs reviewAverage time for the direct docs-review task, where the fixture evidence is embedded directly in the prompt. Hermes docs reviewElapsed time for the same docs-review job through Hermes. Direct bug debugAverage time for the direct bug-debug case, where failing evidence is embedded into the prompt instead of gathered with tools. Hermes bug debugElapsed time for the same debugging job through Hermes. Direct triageAverage time for the direct triage-report case. Hermes triageElapsed time for the same triage-report job through Hermes.
Qwen 3.5 0.8B0m 3.9s0m 24.1s0m 3.2s0m 27.2s0m 3.6s0m 21.2s
Qwen 3.5 9B0m 11.7s1m 45.6s0m 10.4s2m 8.9s0m 11.7s1m 50.7s
Qwen 3.6 35B A3B MLX0m 9.8s3m 56.3s0m 8.4s2m 11.4s0m 9.5s2m 10.5s
Gemma 4 E4B MLX0m 8.7s1m 45.8s0m 7.4s0m 57.4s0m 8.3s1m 8.7s
Gemma 4 26B MLX0m 10.0s1m 54.4s0m 8.3s1m 23.3s0m 9.9s1m 47.9s
Gemma 4 E4B GGUF0m 13.6s0m 53.8s0m 13.3s0m 56.4s0m 10.9s1m 7.0s
Gemma 4 26B A4B GGUF0m 14.0s1m 27.2s0m 13.7s2m 4.3s0m 12.2s3m 14.4s

Schema-heavy and skill-augmented cases

ModelThe specific model and runtime variant used in the benchmark row. Direct schema-heavy docsAverage time for the direct schema-heavy docs-review variant. Hermes schema-heavy docsElapsed time for the schema-heavy docs-review case in Hermes. Direct schema-heavy bugAverage time for the direct schema-heavy bug-debug variant. Hermes schema-heavy bugElapsed time for the schema-heavy bug-debug case in Hermes. Direct skill-augmented triageAverage time for the direct skill-augmented triage variant. Hermes skill-augmented triageElapsed time for the skill-augmented triage case in Hermes.
Qwen 3.5 0.8B0m 3.9s0m 26.3s0m 3.2s0m 43.4s0m 3.9s0m 57.9s
Qwen 3.5 9B0m 12.4s4m 50.6s0m 10.3s3m 57.0s0m 12.6s2m 32.0s
Qwen 3.6 35B A3B MLX0m 9.8s5m 2.0s0m 8.4s0m 30.4s0m 10.3s0m 33.1s
Gemma 4 E4B MLX0m 8.6s2m 34.8s0m 7.2s1m 53.8s0m 8.9s1m 17.8s
Gemma 4 26B MLX0m 11.3s2m 31.2s0m 11.1s2m 49.3s0m 16.4s1m 48.2s
Gemma 4 E4B GGUF0m 14.2s2m 22.2s0m 14.3s2m 10.3s0m 11.6s1m 29.3s
Gemma 4 26B A4B GGUF0m 13.0s4m 10.2s0m 13.5s4m 47.0s0m 11.8s3m 18.7s

Runtime cost and practical read

ModelThe specific model and runtime variant used in the benchmark row. Direct peak memoryPeak memory observed during the direct benchmark run for that model. This gives a rough sense of local runtime cost. Direct startup timeHow long the raw model server took to become ready before benchmarking began. Practical readA short recommendation-style takeaway based on the direct and Hermes results together.
Qwen 3.5 0.8B1.9 GB0m 3.6sExcellent for lightweight local experimentation, but not the best choice for dependable multi-step agent work.
Qwen 3.5 9B5.3 GB0m 4.0sRespectable as a raw model, but still a weak fit for serious local orchestration.
Qwen 3.6 35B A3B MLX2.6 GB0m 10.4sPromising high-end option, but not yet the cleanest all-around Hermes recommendation.
Gemma 4 E4B MLX4.6 GB0m 4.5sUsable for simpler local workflows, but less convincing for robust agentic execution.
Gemma 4 26B MLX13.3 GB0m 6.4sThe strongest all-around local Hermes candidate in this benchmark.
Gemma 4 E4B GGUF9.0 GB0m 5.8sA strong value option if workflow reliability matters more than raw speed.
Gemma 4 26B A4B GGUF20.8 GB0m 18.2sPowerful, but it carries a meaningful cost in memory, startup time, and operational weight.