LLM - Page 4 - Rost Glukhov | Personal site and technical blog

Oh My Opencode QuickStart for OpenCode: Install, Configure, Run

Oh My Opencode turns OpenCode into a multi-agent coding harness: an orchestrator delegates work to specialist agents that run in parallel.

llama.cpp Quickstart with CLI and Server

I keep coming back to llama.cpp for local inference: it gives you control that Ollama and others abstract away, and it just works. It is easy to run GGUF models interactively with llama-cli or to expose an OpenAI-compatible HTTP API with llama-server.

AI Developer Tools: The Complete Guide to AI-Powered Development

Artificial Intelligence is reshaping how software is written, reviewed, deployed, and maintained. From AI coding assistants to GitOps automation and DevOps workflows, developers now rely on AI-powered tools across the entire software lifecycle.

OpenCode Quickstart: Install, Configure, and Use the Terminal AI Coding Agent

OpenCode is an open source AI coding agent you can run in the terminal (TUI + CLI) with optional desktop and IDE surfaces. This is the OpenCode Quickstart: install, verify, connect a model/provider, and run real workflows (CLI + API).

Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp

LLM inference looks like “just another API” — until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.

OpenClaw Quickstart: Install with Docker (Ollama GPU or Claude + CPU)

OpenClaw is a self-hosted AI assistant designed to run with local LLM runtimes like Ollama or with cloud-based models such as Claude Sonnet.

OpenClaw: Examining a Self-Hosted AI Assistant as a Real System

Most local AI setups start the same way: a model, a runtime, and a chat interface.

Implementing Workflow Applications with Temporal in Go: A Complete Guide

Temporal is an open-source, enterprise-grade workflow engine that enables developers to build durable, scalable, and fault-tolerant workflow applications using familiar programming languages like Go.

Observability for LLM Systems: Metrics, Traces, Logs, and Testing in Production

LLM systems fail in ways that traditional API monitoring cannot surface — queues fill silently, GPU memory saturates long before CPU looks busy, and latency blows up at the batching layer rather than the application layer.

Chunking Strategies in RAG Comparison: Alternatives, Trade‑offs, and Examples

Chunking is the most underestimated hyperparameter in Retrieval-Augmented Generation (RAG): it silently determines what your LLM “sees”, how expensive ingestion becomes, and how much of the LLM’s context window you burn per answer.

Observability in Production: Monitoring, Metrics, Prometheus & Grafana Guide (2026)

Observability is the foundation of reliable production systems.

Without metrics, dashboards, and alerting, Kubernetes clusters drift, AI workloads fail silently, and latency regressions go unnoticed until users complain.

Retrieval-Augmented Generation (RAG) Tutorial: Architecture, Implementation, and Production Guide

Production-focused guide to building RAG systems: chunking, vector stores, hybrid retrieval, reranking, evaluation, and when to choose RAG over fine-tuning.

LLM Hosting in 2026: Local, Self-Hosted and Cloud Infrastructure Compared

Strategic guide to hosting large language models locally with Ollama, llama.cpp, vLLM, or in the cloud. Compare tools, performance trade-offs, and cost considerations.

LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization

A performance engineering hub for running LLMs efficiently: runtime behavior, bottlenecks, benchmarks, and the real constraints that shape throughput and latency.

Self-hosting LLMs keeps data, models, and inference under your control, a practical path to AI sovereignty for teams, enterprises, and nations.

Comparing LLM Performance on Ollama on a 16GB VRAM GPU

Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark reveals exactly what one can expect from 14 popular LLMs on Ollama on an RTX 4080.