LLM Performance

How Ollama Handles Parallel Requests

Understand Ollama concurrency, queueing, and how to tune OLLAMA_NUM_PARALLEL for stable parallel requests.

This guide explains how Ollama handles parallel requests (concurrency, queueing, and resource limits) and how to tune that behavior with the OLLAMA_NUM_PARALLEL environment variable and related settings.

Gemma2 vs Qwen2 vs Mistral Nemo vs...

Testing logical fallacy detection

Several new LLMs have been released recently. Exciting times. Let’s test how well they detect logical fallacies.

Large Language Models Speed Test

Let's test the LLMs' speed on GPU vs CPU

Comparing the prediction speed of several LLMs on CPU and GPU: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google), and mistral (open source).