16 GB VRAM LLM benchmarks with llama.cpp (speed and context)
llama.cpp token speed on 16 GB VRAM (tables).
Here I am comparing the speed of several LLMs running on a GPU with 16GB of VRAM, and choosing the best one for self-hosting.
LLM speed test on RTX 4080 with 16GB VRAM
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark shows exactly what to expect from 14 popular LLMs running via Ollama on an RTX 4080.
GPT-OSS 120b benchmarks on three AI platforms
I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
Specialized chips are making AI inference faster, cheaper
The future of AI isn’t just about smarter models - it’s about smarter silicon. Specialized hardware for LLM inference is driving a revolution similar to Bitcoin mining’s shift to ASICs.
Comparing speed, parameters and performance of Qwen3:30b and GPT-OSS:20b
Here is a comparison between Qwen3:30b and GPT-OSS:20b, focusing on instruction following, performance parameters, specs and speed.
Not very nice.
Ollama’s GPT-OSS models have recurring issues handling structured output, especially when used with frameworks like LangChain, OpenAI SDK, vllm, and others.
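For context, a typical structured-output request looks roughly like the sketch below. It assumes Ollama's JSON-schema "format" field on the /api/chat endpoint; the model tag, prompt and schema are illustrative placeholders, not taken from the post.

```python
# Sketch: ask a GPT-OSS model for JSON constrained by a schema via Ollama's
# "format" field (structured outputs). Model tag and schema are placeholders.
import json
import urllib.request

payload = json.dumps({
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Name one GPU and its VRAM in GB."}],
    "stream": False,
    "format": {  # JSON schema the reply must conform to
        "type": "object",
        "properties": {
            "gpu": {"type": "string"},
            "vram_gb": {"type": "integer"},
        },
        "required": ["gpu", "vram_gb"],
    },
}).encode()

req = urllib.request.Request("http://localhost:11434/api/chat", data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

# The answer arrives as a JSON string in message.content; parsing it is
# exactly the step that breaks when structured output misbehaves.
print(json.loads(reply["message"]["content"]))
```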
My own test of Ollama model scheduling
Here I am comparing how much VRAM the new version of Ollama allocates for a model versus the previous version. The new version is worse.
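If you want to run the same check yourself, here is a minimal sketch (my own, not from the post) that asks a locally running Ollama server how much of each loaded model sits in VRAM versus system RAM; it assumes the /api/ps endpoint and its size/size_vram fields.

```python
# Sketch: compare how much of each loaded model is in VRAM vs system RAM.
# Assumes a local Ollama server; /api/ps lists loaded models with
# "size" (total bytes) and "size_vram" (bytes resident in GPU memory).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for m in data.get("models", []):
    total = m["size"]
    vram = m.get("size_vram", 0)
    print(f'{m["name"]}: {vram / 2**30:.1f} GiB in VRAM '
          f'of {total / 2**30:.1f} GiB total '
          f'({100 * vram / total:.0f}% on GPU)')
```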
Thinking of installing a second GPU for LLMs?
How do PCIe lanes affect LLM performance? It depends on the task. For training and multi-GPU inference, the performance drop is significant.
Ollama on Intel CPU: Efficient vs Performance cores
I've got a theory to test: would utilising ALL cores on an Intel CPU raise the speed of LLMs? It bugs me that the new gemma3 27B model (gemma3:27b, 17GB on Ollama) doesn't fit into the 16GB VRAM of my GPU and partially runs on the CPU.
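As a rough illustration of the experiment (not code from the post), the thread count Ollama uses for the CPU-resident part of a model can be varied per request via the num_thread option, and the server's own eval metrics give tokens per second. Model tag, prompt and thread counts below are placeholders.

```python
# Sketch: time the same prompt with different CPU thread counts.
# Assumes a local Ollama server; num_thread is passed per request via the
# "options" field of /api/generate. Thread counts are placeholders
# (e.g. P-cores only vs all P+E cores).
import json
import urllib.request

URL = "http://localhost:11434/api/generate"
MODEL = "gemma3:27b"  # partially offloaded to CPU on a 16 GB card

def tokens_per_second(num_thread: int) -> float:
    payload = json.dumps({
        "model": MODEL,
        "prompt": "Explain PCIe lanes in one paragraph.",
        "stream": False,
        "options": {"num_thread": num_thread},
    }).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # eval_count / eval_duration (ns) is Ollama's own generation-speed metric
    return body["eval_count"] / (body["eval_duration"] / 1e9)

for threads in (6, 8, 14, 20):
    print(f"{threads} threads: {tokens_per_second(threads):.1f} tok/s")
```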
AI requires a lot of power...
In the midst of the modern world's turmoil, here I am comparing the tech specs of different cards suitable for AI tasks (Deep Learning, Object Detection and LLMs). They are all incredibly expensive though.
Understand Ollama concurrency, queueing, and how to tune OLLAMA_NUM_PARALLEL for stable parallel requests.
This guide explains how Ollama handles parallel requests (concurrency, queuing, and resource limits), and how to tune it using the OLLAMA_NUM_PARALLEL environment variable (and related knobs).
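As a minimal sketch (not from the guide itself), the effect of OLLAMA_NUM_PARALLEL can be seen by firing several requests at a local Ollama server at once: up to the configured limit run concurrently, the rest wait in Ollama's queue. The variable is set in the server's environment, not by the client; model tag and prompts here are placeholders.

```python
# Sketch: send several generation requests to a local Ollama server at once.
# Assumes the server was started with e.g. OLLAMA_NUM_PARALLEL=2 in its
# environment. Model tag and prompts are placeholders.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # placeholder model tag

def ask(prompt: str) -> float:
    """Send one non-streaming request and return its wall-clock time."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.time() - start

if __name__ == "__main__":
    prompts = [f"Count to {n}" for n in range(1, 5)]
    # With OLLAMA_NUM_PARALLEL=2, roughly two requests run concurrently
    # and the remaining ones are queued by the server.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for secs in pool.map(ask, prompts):
            print(f"finished in {secs:.1f}s")
```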
Next round of LLM tests
Mistral Small was released not long ago. Let's catch up and test how it performs compared to other LLMs.
Testing logical fallacy detection
Recently, several new LLMs were released. Exciting times. Let's test and see how they perform at detecting logical fallacies.
8 llama3 (Meta+) and 5 phi3 (Microsoft) LLM versions
Testing how models with different numbers of parameters and quantization levels behave.
Let's test the LLMs' speed on GPU vs CPU
Comparing the prediction speed of several LLM versions: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google), and mistral (open source) on CPU and GPU.