TGI - Text Generation Inference - Install, Config, Troubleshoot
Install TGI, ship fast, debug faster
Text Generation Inference (TGI) has a very specific energy. It is not the newest entrant on the inference block, but it is the one that has already learned how production breaks, then baked those lessons into its defaults. If your goal is “serve an LLM behind HTTP and keep it running”, TGI is a pragmatic piece of kit.
If you are still weighing where to run models, this comparison of LLM hosting in 2026 pulls local, self-hosted, and cloud setups together so you can place TGI in context.
A reality check first. As of 2026, TGI is in maintenance mode and the upstream repository has been archived as read-only. That sounds like bad news until you look at it from an ops perspective. A stable engine can be a feature, especially when the real churn is in models, prompts, and product requirements.

This guide focuses on four things that matter on day zero and day thirty: install paths, a quickstart that actually works, configuration that changes real behaviour, and a troubleshooting mindset that saves time.
Why TGI still matters in 2026
It is easy to treat inference servers as interchangeable. For a tool-by-tool survey of common local stacks, start from Ollama vs vLLM vs LM Studio: Best Way to Run LLMs Locally in 2026?.
In practice, there are only three questions that matter.
First, how does it behave under load? TGI is built around continuous batching and token streaming, so it can prioritise throughput while still giving users the illusion of responsiveness.
Second, can it speak the dialect your tooling already speaks? TGI supports its own “custom API” and also a Messages API that is compatible with the OpenAI Chat Completions schema. That means tooling that expects an OpenAI shaped endpoint can often be pointed at TGI with minimal change.
Third, can you observe it without guessing? TGI exposes Prometheus metrics and supports distributed tracing via OpenTelemetry, which is the difference between “I think it is slow” and “prefill is saturating, queue time is growing, and batch token budget is too high”.
Install paths and prerequisites
TGI can be approached via Docker or via a local install from source. The Docker route is the path most people mean when they say “install TGI”, because it packages the router, model server, and kernels into an image that can run with a single command.
Under the hood, TGI is a system with distinct components: a router that accepts HTTP and performs batching, a launcher that orchestrates one or more model server processes, and the model server that loads the model and performs inference. That separation explains a lot of the “why” behind configuration flags and common failure modes.
Two practical prerequisites show up again and again: GPU access from containers, and a sane cache strategy for model weights. GPU access for Nvidia typically means the Nvidia Container Toolkit is installed, and caching means mapping a host volume to the container so model weights do not re-download every time.
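Before starting TGI itself, it is worth confirming that containers can actually see the GPU. A minimal sanity check, assuming the Nvidia Container Toolkit is installed; the CUDA image tag below is illustrative, so substitute one that matches your driver stack:

```shell
# Confirm that Docker can pass the GPU through to a container.
# If this fails, fix the container runtime before debugging TGI.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If nvidia-smi lists your GPUs here, the same --gpus all flag will work for the TGI container.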
Local install from source
A source install exists, but it is opinionated toward developers and kernel builders. It expects Rust, Python 3.9+, and build tooling, and it is usually a slower first step than running the container. Useful when you need to modify internals, test patches, or integrate with a very specific environment.
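As a sketch only, the source route looks roughly like the following. The exact build targets can change between releases, so treat the repository's own README as authoritative:

```shell
# Rough shape of a source install (targets may vary by release).
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
# Build the router, launcher, and Python model server; extensions add custom kernels.
BUILD_EXTENSIONS=True make install
# Boot a model with the launcher directly (the launcher defaults to port 3000).
text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta
```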
Quickstart with Docker
The canonical quickstart is short, which is exactly the point. Pick a model id, mount a cache volume, expose a port, and run the container.
Nvidia GPU quickstart
This is a minimal pattern that works well for the first boot.
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id $model
That one command implicitly answers a frequent FAQ, “How do you run TGI with Docker on an Nvidia GPU?”, by showing the three non-negotiables: --gpus all, a port mapping, and a model id.
A subtle but important point is the port mapping. The container is typically configured to serve HTTP on port 80, so you map host 8080 to container 80. If you run TGI outside Docker, the default port for the launcher is often 3000, which is why port confusion is such a common first-day bug.
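A quick way to confirm the mapping is to hit the server's health and info routes, which TGI exposes alongside the generation endpoints:

```shell
# Returns HTTP 200 once the model is loaded and the server is ready.
curl -i 127.0.0.1:8080/health
# Prints model metadata and the effective server limits as JSON.
curl -s 127.0.0.1:8080/info
```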
First request using the custom API
TGI exposes a simple JSON “generate” style API. A streaming request looks like this.
curl 127.0.0.1:8080/generate_stream \
-X POST \
-H 'Content-Type: application/json' \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'
If you prefer a single response, use the non streaming endpoint.
curl 127.0.0.1:8080/generate \
-X POST \
-H 'Content-Type: application/json' \
-d '{"inputs":"Explain continuous batching in one paragraph.","parameters":{"max_new_tokens":120}}'
First request using the Messages API
If your ecosystem expects OpenAI style chat requests, use the Messages API. This directly answers another FAQ, “How can you use TGI with OpenAI compatible chat clients?”.
curl 127.0.0.1:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "Give a one sentence definition of tensor parallelism."}
],
"stream": false,
"max_tokens": 60
}'
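The same endpoint streams server-sent events when "stream" is set to true, which is what most OpenAI compatible clients use under the hood:

```shell
curl -N 127.0.0.1:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [{"role": "user", "content": "Count to five."}],
"stream": true,
"max_tokens": 30
}'
```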
Serving gated or private models
If you have ever asked “How do you serve gated or private Hugging Face models with TGI?”, the answer is boring by design: provide a Hub token via HF_TOKEN.
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
token=hf_your_read_token_here
docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id $model
The failure mode here is also boring: missing permissions, invalid token scopes, or trying to pull a model that requires acceptance of a licence.
AMD ROCm quickstart
TGI also has ROCm images and a different device setup. If you are on AMD GPUs, the boot shape changes.
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data
docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5-rocm \
--model-id $model
CPU only runs
CPU runs exist, but CPUs are not the platform TGI was designed to excel on. When you run on CPU anyway, disabling custom kernels avoids some hardware specific issues.
model=gpt2
volume=$PWD/data
docker run --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id $model \
--disable-custom-kernels
Configuration that actually moves the needle
TGI has a lot of flags. Most of them are not worth memorising. A few are worth understanding, because they answer the most searched question in this space: “Which TGI settings control GPU memory and request limits?”.
Memory budget is max total tokens
The single most important concept in TGI configuration is that the server needs a token budget to plan batching and to avoid memory blow-ups.
There are two caps that define request shape:
max_input_tokens and max_total_tokens.
max_total_tokens acts like a per request memory budget because it bounds input tokens plus generated tokens. If it is too high, each request becomes expensive, batching becomes awkward, and memory pressure grows. If it is too low, users hit length limits early, and the server rejects otherwise valid workloads.
A configuration that makes the budget explicit looks like this.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--max-input-tokens 2048 \
--max-total-tokens 3072
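After boot, the limits the server actually applied can be read back from the /info route. The exact field names can vary slightly across TGI versions, so inspect the JSON rather than assuming them:

```shell
# Inspect the effective token limits the running server applied.
curl -s 127.0.0.1:8080/info | python3 -m json.tool
```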
Batching knobs that matter
Once token budgets are set, batching control is the next lever.
max_batch_prefill_tokens limits prefill work, which is often the most memory heavy and compute bound phase.
max_batch_total_tokens sets how many tokens the server tries to fit into a batch overall. This is one of the real throughput controls.
The interesting knob is waiting_served_ratio. It encodes a policy decision, not a hardware constraint. It controls when the server pauses running decode work to bring waiting requests into a new prefill so they can join the decode group. Low values tend to favour existing requests, high values tend to reduce tail latency for newly queued requests, and both can be “correct” depending on traffic shape.
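Putting the batching levers together, a boot might look like the following sketch; the numbers are illustrative starting points rather than recommendations, and the right values depend on your GPU memory and traffic shape:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--max-batch-prefill-tokens 4096 \
--max-batch-total-tokens 16384 \
--waiting-served-ratio 1.2
```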
Sharding, num shard, and why NCCL shows up
If your model does not fit on one GPU, or you want higher throughput via tensor parallelism, sharding is the next step.
The mental model is simple: --sharded true enables sharding, and --num-shard controls the shard count. The server can use all visible GPUs by default, or use a subset.
A useful pattern on multi GPU hosts is splitting GPUs into groups and running multiple TGI replicas, each replica sharded across its own GPU subset. That spreads load while keeping the sharding topology simple.
This is also where the FAQ “Why does TGI fail with NCCL or shared memory errors on multiple GPUs?” becomes relevant. Multi GPU setups rely on collective communication, and containers need enough shared memory for safe operation when SHM fallback is used.
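A two GPU sharded boot follows the same pattern as the single GPU quickstart; the 7B model below fits on one GPU, so here sharding is for throughput rather than capacity:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--sharded true \
--num-shard 2
```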
Quantisation choices, and what they trade
Quantisation is the most misunderstood “make it fit” setting because it mixes two different goals: memory reduction and speed.
TGI supports pre quantised weights for schemes like GPTQ and AWQ, and also on the fly quantisation for certain methods like bitsandbytes and EETQ. Some methods reduce memory but are slower than native half precision, which is why quantisation should not be treated as a free performance upgrade.
A simple on the fly 8 bit quantisation example looks like this.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--quantize bitsandbytes
And a 4 bit variant looks like this.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--quantize bitsandbytes-nf4
API shaping and basic guard rails
TGI can be run as an internal service, or exposed more broadly. If exposure is possible, two flags matter:
max_concurrent_requests and api_key.
max_concurrent_requests provides backpressure. It makes the server refuse excess requests rather than letting everything queue and time out.
An API key provides a coarse authentication barrier. It is not a full auth system, but it stops accidental public usage.
CORS is also configurable via cors_allow_origin, which matters if a browser based UI calls the server directly.
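Combined, the guard rails from this section look like the sketch below. The key and origin values are placeholders; check your version's --help output to confirm the exact flag set before relying on it:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--max-concurrent-requests 128 \
--api-key change-me \
--cors-allow-origin https://app.example.com
```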
Operations and observability
This section answers the real operator question: “Where can you scrape Prometheus metrics from a TGI server?”.
OpenAPI docs and interactive docs
TGI exposes its OpenAPI schema and a Swagger UI under the /docs route, which is handy when you want to quickly confirm request and response shapes or test endpoints without writing a client.
Prometheus metrics
TGI exports Prometheus metrics on the /metrics endpoint. These metrics cover queue size, request latency, token counts, and batch level timings. The result is that you can observe whether the system is limited by prefill, decode, or queueing.
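Scraping is just an HTTP GET against the running server. TGI's metric names are prefixed with tgi_, which makes a quick manual inspection easy before wiring up Prometheus properly:

```shell
# List a sample of the TGI specific metrics the server currently exports.
curl -s 127.0.0.1:8080/metrics | grep '^tgi_' | head -20
```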
End-to-end production monitoring for these stacks, including PromQL, Grafana dashboards, alerts, and Docker or Kubernetes scrape layouts, is covered in Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp.
Tracing and structured logs
TGI supports distributed tracing via OpenTelemetry. Logs can also be emitted in JSON, which makes log pipelines easier.
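Both are enabled through launcher flags; the collector address below is a placeholder for your own OTLP endpoint:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--json-output \
--otlp-endpoint http://otel-collector:4317
```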
Troubleshooting playbook
TGI failures tend to cluster into a few buckets, and each bucket has a very different fix.
The container runs but no GPU is detected
The most common cause is that the container runtime is not configured for GPU passthrough. On Nvidia, this often correlates with missing Nvidia Container Toolkit support, or running on a host driver stack that does not match expectations.
Model download failures and permission errors
If the server cannot download model files, the usual culprits are a missing auth token for gated models, a token without model read permissions, or rate limits. Setting HF_TOKEN correctly resolves the gated model case.
CUDA out of memory or sudden restarts under load
The most common cause is overly permissive token budgets. If max_total_tokens is large and clients request long generations, the server will reserve memory for worst case requests. Reduce the budget, reduce concurrency, or choose a quantisation method that fits your constraints.
Multi GPU NCCL errors, hangs, or extreme slowdowns
When sharding across multiple GPUs, shared memory and NCCL matter. Insufficient shared memory inside containers often creates instability. Increasing the shared memory allocation, or disabling SHM sharing via NCCL_SHM_DISABLE, can change behaviour, with a performance trade-off.
NCCL issues also become easier to debug when NCCL debug logging is enabled, because the error reports are more explicit.
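Both knobs are environment variables passed to the container, not launcher flags, so they go before the image name:

```shell
# NCCL_DEBUG=INFO makes NCCL print topology and error detail to the logs.
# NCCL_SHM_DISABLE=1 forces transport off shared memory, trading speed for stability.
docker run --gpus all --shm-size 1g \
-e NCCL_DEBUG=INFO \
-e NCCL_SHM_DISABLE=1 \
-p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id HuggingFaceH4/zephyr-7b-beta \
--sharded true --num-shard 2
```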
Weird kernel errors on non A100 hardware
Some models use custom kernels that were tested on specific hardware first. If you see unexplained kernel failures, --disable-custom-kernels is frequently the simplest way to confirm whether custom kernels are involved.
Port confusion and “it runs but I cannot reach it”
A classic footgun is mixing the Docker port mapping model with the local default port model. In Docker examples, the container commonly serves on 80, while local runs default to 3000. If you map the wrong port, your curl requests land on nothing, and the system looks broken when it is actually just unreachable.
Closing note
TGI feels like infrastructure. That is the compliment. It is a system designed to make text generation boring enough to operate, measurable enough to debug, and flexible enough to fit into existing OpenAI shaped client stacks.