Best LLMs for OpenCode - Tested Locally

OpenCode LLM test — coding and accuracy stats


I have tested how OpenCode works with several LLMs hosted locally on Ollama, and for comparison added some free models from OpenCode Zen.

OpenCode is one of the most promising tools in the AI developer tools ecosystem right now.


TL;DR - OpenCode Best LLMs

Clear winner for local: Qwen 3.5 27b IQ3_XXS on llama.cpp

The 27b at IQ3_XXS quantization delivered a complete, working Go project with all 8 unit tests passing, full README, and 34 tokens/sec on my 16GB VRAM setup (CPU+GPU mixed). Five stars, no caveats. This is my go-to for local OpenCode sessions.

Qwen 3.5 35b on llama.cpp — fast for coding, but validate everything

The 35b is excellent for quick agentic coding tasks — but my migration map tests exposed a serious reliability problem. Across two IQ3_S runs it produced 63–73% slug mismatches, and in the IQ4_XS quantization it forgot to include page slugs entirely, generating category paths that would map 8 different pages to the same URL. The coding quality on the IndexNow task was genuinely good, so this model is worth using — just never trust its output on structured, rule-following tasks without checking it. Validation is not optional.

Surprisingly good: Bigpicle (from OpenCode Zen)

The fastest to complete the task — 1m 17s. More importantly, it was the only model that paused before coding to actually search for the IndexNow protocol spec using Exa Code Search. It found all the correct endpoints on the first try. If you have access to OpenCode Zen, this one punches well above its weight.

Good, but only with high thinking: GPT-OSS 20b

In default mode GPT-OSS 20b fails — it hits dead-end WebFetch calls and stops. Switch to high thinking mode and it becomes a genuinely capable coding assistant: full flag parsing, correct batching logic, passing unit tests, all done fast. Keep that in mind before writing it off, though even in high mode it still failed on structured tasks.

Skip for agentic coding: GPT-OSS 20b (default), Qwen 3 14b, devstral-small-2:24b

These used to be my favorites for speed in chat and generation tasks. But in agentic mode they all have real problems. Qwen 3 14b hallucinates documentation rather than admitting it can’t find something. GPT-OSS 20b (default) stalls when WebFetch fails. Devstral gets confused with basic file operations. For OpenCode specifically, instruction-following and tool-calling quality matters far more than raw speed.

About this test

I gave each model running in OpenCode two tasks/prompts:

  1. Create for me a cli tool in Go, that would call bing and other search engines' indexnow endpoints to notify about changes on my website.
  2. Prepare a website migration map.

You know what the IndexNow protocol is, right? (In short: a simple API that lets a website ping participating search engines when its pages change, authenticated by a key file hosted on the site.)

For the second task: I plan to migrate some old posts on this website from the blogging URL format (for example https://www.glukhov.org/post/2024/10/digital-detox/) to topic clusters (like this article's URL: https://www.glukhov.org/ai-devtools/opencode/llms-comparison/). So I asked each LLM in OpenCode to prepare a migration map for me, according to my strategy.

I ran most of the LLMs on locally hosted Ollama, and some others on locally hosted llama.cpp. Bigpicle and the other very large models came from OpenCode Zen.

Each model result

qwen3.5:9b

Complete failure on the first task. The model went through its thinking process — correctly identifying the relevant services (Google Sitemap, Bing Webmaster, Baidu IndexNow, Yandex) — but never actually called any tools. It produced a “Build” summary without touching a single file. No tool call whatsoever.

qwen3.5:9b-q8_0

A step up from the default quantization: it at least created a go.mod and a main.go. But then it immediately got stuck, admitted it needed to add missing imports, tried to rewrite the whole file using a shell heredoc — and failed. Build time was 1m 27s for something that didn’t work.

Qwen 3 14b

Classic hallucination under pressure. It tried to fetch IndexNow documentation three times in a row, each time hitting a 404 from a wrong URL (github.com/Bing/search-indexnow). Rather than admitting it couldn’t find anything, it fabricated a confident-sounding answer — wrong API endpoint, wrong authentication method. When I pushed it to search again, it produced a second fabricated answer pointing to yet another URL that also returns 404. The information it reported was incorrect. This is the failure mode I most want to avoid.

GPT-OSS 20b

At least the behavior was honest and methodical. It tried a long chain of WebFetch calls — indexnow.org, various GitHub repos, Bing’s own pages — and hit 404s or Cloudflare blocks on almost everything. It documented each failure transparently. In the end, it still couldn’t gather enough information to build a working tool, but unlike Qwen 3 14b, it didn’t make things up. Just couldn’t push through.

GPT-OSS 20b (high thinking)

A meaningfully different story from the default mode. With high thinking enabled, the model recovered from the same dead-end fetches and managed to build a complete, working tool — with proper flag parsing (--file, --host, --key, --engines, --batch, --verbose), GET for single URLs and POST batches for multiple, per the IndexNow spec.

When I asked for docs and unit tests, it delivered both. Tests passed:

=== RUN   TestReadURLsFile
--- PASS: TestReadURLsFile (0.00s)
=== RUN   TestReadURLsNoProtocol
--- PASS: TestReadURLsNoProtocol (0.00s)
ok  	indexnow-cli	0.002s

Fast, too — initial build in 22.5s. High thinking makes gpt-oss:20b actually usable.

qwen3-coder:30b

The most interesting failure. It actually compiled and ran the tool against real endpoints, saw real API errors back from Bing, Google, and Yandex, and started fixing them:

Error notifying Bing: received status code 400 ... "The urlList field is required."
Error notifying Google: received status code 404 ...
Error notifying Yandex: received status code 422 ... "Url list has to be an array"

That’s good instinct. The problem: it was running at 720% CPU and only 7% GPU — extremely inefficient for a 22 GB model. It took 11m 39s and the final output was still “not quite what is expected,” though it did create a README.md, which is a nice touch. Not a bad model, just very slow on my setup, and it didn’t fully nail the IndexNow protocol format.

qwen3.5:35b (Ollama)

Solid results but slow. It created a proper Go project, wrote tests, and all of them passed:

=== RUN   TestHashIndexNowPublicKey/non-empty_key
--- PASS
=== RUN   TestGetPublicKeyName/standard_root
--- PASS
=== RUN   TestGetPublicKeyName/custom_root
--- PASS

The downside: 19m 11s build time. For a 27 GB model running 45%/55% CPU/GPU split, that’s too slow for interactive use. The quality is there, but the latency kills the workflow.

Bigpicle (big-pickle)

The standout performer for the first task. Before writing a single line of code, it used Exa Code Search to actually research the IndexNow protocol:

◇ Exa Code Search "IndexNow protocol API endpoint how to notify search engines"

And it found the right endpoints:

  • Global: https://api.indexnow.org/indexnow
  • Bing: https://www.bing.com/indexnow
  • Yandex: https://webmaster.yandex.com/indexnow
  • Yep: https://indexnow.yep.com/indexnow
  • Amazon: https://indexnow.amazonbot.amazon/indexnow

It resolved the cobra import issue cleanly (go mod tidy), and the tool was done in 1m 17s. The rate-limit response it got back from Bing during testing was actually expected behavior for an invalid test key — the model correctly identified this as “the tool is working.” Impressive.

devstral-small-2:24b

Got confused at a basic level: it tried to write shell commands (go mod init indexnowcli, go mod tidy) directly into the go.mod file, triggering parse errors. Somehow it still managed to build a binary (7.9M), but the resulting CLI was far too simple — just indexnowcli <url> <key> with no flag handling, no multi-engine support, nothing. Took 2m 59s + 1m 28s to get a tool that wasn’t really useful.

qwen3.5:27b (llama.cpp, IQ3_XXS quantization)

This one impressed me the most of all the local runners. Running as Qwen3.5-27B-UD-IQ3_XXS.gguf on llama.cpp (mostly CPU), it created a complete tool with full test coverage — all 8 tests passing — and a proper README with installation instructions and protocol explanation:

PASS    indexnow    0.003s

Supported engines: Bing, Yandex, Mojeek, Search.io. Build time: 1m 12s for the tool, 1m 27s for tests and docs. Speed: 34 tokens/sec. Quality: 5 stars. Incredible result for a quantized model running on CPU+GPU.

qwen3.5:35b (llama.cpp, IQ3_S quantization)

Running as Qwen3.5-35B-A3B-UD-IQ3_S.gguf on llama.cpp. My notes here are short: “excellent!” — which says it all. The larger model at the same quantization level delivered at least as good results as the 27b variant, if not better.

Migration map results

For the second task I ran a separate batch — 7 models, all given the same instructions, site structure, and list of pages. The constraint was explicit: the slug (last path segment) must stay the same. For example, /post/2024/04/reinstall-linux/ must become /.../reinstall-linux/, not something else.

I measured how many slug mismatches each model produced — cases where the generated target slug differed from the source slug.

Model                                        Lines  Slug mismatches  Error rate
minimax-m2.5-free                              80         4            5.0%
Nemotron 3                                     78         4            5.1%
Qwen 3.5 27b IQ3_XXS (llama.cpp)               80         4            5.0%
Qwen 3.5 27b Q3_M (llama.cpp)                  81         6            7.4%
Bigpicle                                       81         9           11.1%
mimo-v2-flash-free                             80        42           52.5%
Qwen 3.5 35b IQ3_S, second run (llama.cpp)     81        51           63.0%
Qwen 3.5 35b IQ4_XS (llama.cpp)                80        79           98.8%

One thing all 7 models did identically: old 2022-format URLs had a month prefix baked into the slug (e.g., /post/2022/06-git-cheatsheet/ → slug 06-git-cheatsheet). Every model stripped that prefix and used git-cheatsheet as the new slug. That’s 4 consistent mismatches in the top three models — so their baseline is 4 errors each, not zero.

The real divergence starts above that baseline. minimax-m2.5-free, Nemotron 3, and Qwen 3.5 27b IQ3_XXS only ever violated the slug rule on those 4 legacy paths — nothing else. Qwen 3.5 27b Q3_M added 2 more (renamed the cognee article slug and lowercased Base64). Bigpicle added 5 more on top of the 4, mostly by shortening long slugs.

The outliers are in a different category. Qwen 3.5 35b IQ3_S consistently rewrote slugs from page titles rather than preserving the source (e.g., executable-as-a-service-in-linux → run-any-executable-as-a-service-in-linux, file-managers-for-linux-ubuntu → context-menu-in-file-managers-for-ubuntu-24). The second run was slightly better (51 mismatches vs 59) but showed the same behaviour. It also hallucinated a source path: it quietly changed comparing-go-orms-gorm-ent-bun-sqlc to comparing-go-orms-gorm-ent-bun-sql (dropped the c), so both sides of that line agreed with each other — but were both wrong. mimo-v2-flash-free was similarly aggressive at shortening: gnome-boxes-linux-virtual-machines-manager → gnome-boxes, vm-manager-multipass-cheatsheet → multipass.

The IQ4_XS quantization of the 35b is a different failure category entirely: 98.8% slug mismatches — but the problem isn’t wrong slugs, it’s that the model forgot to include slugs at all. Instead of /new-section/page-slug/, it produced category paths like /developer-tools/terminals-shell/ and /rag/architecture/. Eight source pages ended up mapped to /developer-tools/terminals-shell/ — all pointing to the same URL, which would be catastrophic if used. /developer-tools/ alone collected five separate pages. The output is completely unusable.

For this task, the smaller quantized Qwen 3.5 27b Q3_XXS matched the best performers — while two 35b quantizations failed badly, each in its own way.

Takeaway

I am happy running both Qwen 3.5 35b and Qwen 3.5 27b locally on llama.cpp with quantized weights — the hardware constraints are real (16GB VRAM means IQ3/IQ4 quantizations), but the workflow is solid.

The 27b Q3_XXS is the reliable daily driver. It follows instructions precisely, produces complete output, and is fast enough for interactive use. On the migration map task it matched the best cloud models.

The 35b is capable but unpredictable on structured tasks. For open-ended coding — write me a tool, build this — it performs well. But when the task has strict rules (like “the slug must stay the same”), it hallucinates freely across all quantizations I tested: rewriting slugs from titles, dropping slugs entirely, even quietly editing the source paths to make its wrong output look self-consistent. If you use the 35b for anything that produces structured output you plan to use directly, build a validation step into your workflow. Don’t assume the output is correct just because it looks plausible.

If you are curious about the speed at which these LLMs perform, check out Best LLMs for Ollama on 16GB VRAM GPU.