LLM Performance

16 GB VRAM LLM benchmarks with llama.cpp (speed and context)

Here I compare the speed of several LLMs running on a GPU with 16GB of VRAM and choose the best one for self-hosting.

Comparing LLMs performance on Ollama on 16GB VRAM GPU

Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark reveals exactly what one can expect from 14 popular LLMs on Ollama on an RTX 4080.

NVIDIA DGX Spark vs Mac Studio vs RTX-4080: Ollama Performance Comparison

I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).

The Rise of LLM ASICs: Why Inference Hardware Matters

The future of AI isn’t just about smarter models - it’s about smarter silicon. Specialized hardware for LLM inference is driving a revolution similar to Bitcoin mining’s shift to ASICs.

Comparison: Qwen3:30b vs GPT-OSS:20b

Here is a comparison of Qwen3:30b and GPT-OSS:20b, focusing on instruction following, performance parameters, specs, and speed.

Ollama GPT-OSS Structured Output Issues

Ollama’s GPT-OSS models have recurring issues handling structured output, especially when used with frameworks like LangChain, OpenAI SDK, vLLM, and others.

Memory allocation and model scheduling in Ollama new version - v0.12.1

Here I compare how much VRAM the new version of Ollama allocates for a model versus the previous version. The new version is worse.

LLM Performance and PCIe Lanes: Key Considerations

How do PCIe lanes affect LLM performance? It depends on the task. For training and multi-GPU inference, the performance drop is significant.

Test: How Ollama is using Intel CPU Performance and Efficient Cores

I’ve got a theory to test: would utilising ALL cores on an Intel CPU raise the speed of LLMs? It is bugging me that the new gemma3 27B model (gemma3:27b, 17GB on Ollama) does not fit into the 16GB of VRAM on my GPU and runs partially on the CPU.

Comparing NVidia GPU suitability for AI

In the midst of the modern world’s turmoil, here I’m comparing the tech specs of different cards suitable for AI tasks (Deep Learning, Object Detection and LLMs). They are all incredibly expensive, though.

How Ollama Handles Parallel Requests

This guide explains how Ollama handles parallel requests (concurrency, queuing, and resource limits), and how to tune it using the OLLAMA_NUM_PARALLEL environment variable (and related knobs).

Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, LLama3 and Phi - LLM Test

Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.

Gemma2 vs Qwen2 vs Mistral Nemo vs...

Several new LLMs have been released recently. Exciting times. Let’s test how they perform at detecting logical fallacies.

Comparing LLM Summarising Abilities

Testing how models with different numbers of parameters and quantization levels behave.

Large Language Models Speed Test

Comparing the prediction speed of several LLMs: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google), and mistral (open source) on CPU and GPU.
