Building a Self-Hosted LLM Server for Your Team (vLLM)
How to build a self-hosted LLM server for teams: vLLM concurrency, KV cache VRAM headroom, dual-GPU 70B sizing, and cost vs cloud APIs.
If you have already run a model on a single workstation, your instinct for what makes a good box is probably wrong for a team. A single user cares about latency: how fast the first token comes back and how quickly the rest stream out. The GPU sits idle most of the time, waiting for that one person to read, think, and type the next prompt. Put ten people on the same machine and the question inverts. The GPU is rarely idle now, and what hurts is not single-stream latency but how many requests it can serve at once before everyone slows down. The priority moves from single-stream speed to concurrent throughput, and that single shift changes which hardware you buy, how you size memory, and whether the box pays for itself against a cloud API.
This guide covers building a self-hosted LLM server for teams of roughly five to fifteen people using vLLM, a widely used open-source inference engine for production serving. The short version: budget VRAM not just for the model weights but for a large pool of simultaneous conversations, plan for a machine that runs hot and headless around the clock, and run the numbers honestly against a metered API before you commit.
Why a team server is a different machine than a single-user rig
On a single-user setup, the weights load into VRAM once and the GPU handles one request at a time. Most of the silicon is idle between keystrokes, so you can buy the cheapest card that fits the model and still feel like it is fast, because you are the only one in the queue.
A team server has the opposite problem. Five or ten people sending prompts throughout the day means requests overlap constantly. If the server handled them one at a time, the tenth person would wait behind the other nine, and a snappy single-user experience would turn into a frustrating line. The metric that matters becomes aggregate tokens per second across all active requests, and the throughput you can sustain at a given number of concurrent users. A box that streams 80 tokens per second to one person but collapses under five simultaneous requests is worse, for a team, than one that streams 40 to each of eight people at once. This is why you cannot size a team server from single-user benchmarks: those numbers say almost nothing about how the machine behaves when the queue is full, which is the only state that matters once a team depends on it.
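The arithmetic behind that comparison is short enough to write down. This is a minimal sketch, not a benchmark: the function names are hypothetical, and the 500-token reply length is an assumed figure chosen only to make the queueing delay concrete.

```python
def aggregate_tokens_per_second(per_user_tps: float, concurrent_users: int) -> float:
    """Total tokens/s delivered when every active user streams at per_user_tps."""
    return per_user_tps * concurrent_users


def serial_wait_seconds(position: int, tokens_per_reply: int, tokens_per_second: float) -> float:
    """Seconds a request waits before it starts if replies are generated one at a time."""
    return (position - 1) * tokens_per_reply / tokens_per_second


# One fast stream versus eight slower streams served at once.
single_stream = aggregate_tokens_per_second(80, 1)
eight_streams = aggregate_tokens_per_second(40, 8)

# Last of ten requests in a strictly serial queue (assumed 500-token replies).
last_in_line_wait = serial_wait_seconds(position=10, tokens_per_reply=500, tokens_per_second=80)

print(single_stream, eight_streams, last_in_line_wait)
```

Under these assumptions the slower-per-stream box delivers four times the total output, and the last person in a serial queue waits close to a minute before seeing a first token.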
How vLLM serves many users at once, and why it needs VRAM headroom
vLLM exists to solve exactly this problem, and how it works tells you how much VRAM to buy. Naive batching waits for a fixed group of requests, runs them together, and only starts the next group when the slowest one finishes, wasting the GPU whenever requests are different lengths. vLLM uses continuous batching instead: after every forward pass, its scheduler adds newly arrived requests to the running batch and removes finished ones, so the GPU stays saturated and a new request slots in the moment another completes rather than waiting for a whole batch to drain.
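The difference between the two scheduling styles can be seen in a toy simulation. This is a sketch of the general idea only, not vLLM's scheduler: it assumes every request is already waiting at time zero, that each step produces one token per running request, and it ignores prefill and memory limits. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each fixed batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when free slots are refilled before every step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step generates one token for every running request.
        running = [remaining - 1 for remaining in running]
        # Drop finished requests so their slots are free next step.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Alternating short and long replies, two slots available.
request_lengths = [2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(request_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=2)
print(static_steps, continuous_steps)
```

In the static case each short request's slot sits empty until its long batch-mate finishes; in the continuous case the slot is reused immediately, so the same work completes in fewer steps.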
The catch is memory. Every active request needs a KV cache, the running store of attention keys and values for every token it has seen. PagedAttention, vLLM’s memory manager, breaks that cache into small fixed-size blocks and allocates them on demand as a conversation grows, freeing them instantly when the request finishes. This cuts the fragmentation waste of older systems and lets the same GPU hold far more simultaneous conversations. But the caches still have to live somewhere, and that somewhere is the VRAM left over after the weights are loaded.
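The block-on-demand idea can be illustrated with a toy allocator. This is a simplified sketch of paged allocation in general, not vLLM's implementation: the class name, method names, and the 16-token block size are all assumptions made for illustration.

```python
class PagedKVPool:
    """Toy block allocator: a fixed pool of equal-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size               # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}                      # request id -> list of block ids
        self.token_count = {}                      # request id -> tokens stored so far

    def append_token(self, request_id: str) -> bool:
        """Store one more token; return False if a new block is needed but none is free."""
        count = self.token_count.get(request_id, 0)
        if count % self.block_size == 0:           # first token, or current block is full
            if not self.free_blocks:
                return False
            self.block_table.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_count[request_id] = count + 1
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_table.pop(request_id, []))
        self.token_count.pop(request_id, None)


pool = PagedKVPool(num_blocks=4, block_size=16)
for _ in range(20):
    pool.append_token("chat-a")                    # 20 tokens need two 16-token blocks
blocks_used = len(pool.block_table["chat-a"])
pool.release("chat-a")                             # blocks go straight back to the pool
print(blocks_used, len(pool.free_blocks))
```

A request only ever wastes part of its last block, and a finished conversation's memory is reusable by the next request right away.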
That leftover pool is the real concurrency ceiling. vLLM’s gpu-memory-utilization setting caps the fraction of VRAM it uses for weights, activations, and KV cache combined, and whatever remains after the weights and activations becomes the KV pool. A 70B model at Q4 eats roughly 40GB just for weights; on a box with 96GB total, the rest (less framework overhead) is divided among everyone’s active conversations. Run out of pool and new requests queue instead of running. So for a team, VRAM headroom above the weights is not optional slack. It is the budget that decides how many people can use the server at once before it makes them wait.
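As a rough budget, the pool is what is left of the capped VRAM after weights and overhead. The sketch below uses the 96GB and 40GB figures from this section; the 0.90 utilization setting and the 4GB allowance for activations and framework overhead are assumptions for illustration, and the function name is hypothetical.

```python
def kv_pool_gb(total_vram_gb: float, gpu_memory_utilization: float,
               weights_gb: float, overhead_gb: float = 0.0) -> float:
    """Approximate VRAM left for KV cache under a utilization cap."""
    budget = total_vram_gb * gpu_memory_utilization   # what the engine may use in total
    return max(0.0, budget - weights_gb - overhead_gb)


# 2 x 48GB cards, ~40GB of 4-bit weights, assumed 0.90 cap and 4GB overhead.
pool_gb = kv_pool_gb(total_vram_gb=96, gpu_memory_utilization=0.90,
                     weights_gb=40, overhead_gb=4)
print(f"{pool_gb:.1f} GB available for KV cache")
```

Real numbers will differ once the engine profiles the model at startup, so treat the result as a planning estimate rather than what the server will report.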
Sizing the box: dual-GPU 70B at Q4 for 5 to 15 people
For a team of five to fifteen, a dual-GPU box running a 70B model quantized to roughly 4 bits is a sensible default. A 70B model at Q4 lands around 38 to 40GB of weights, more than any single consumer card holds, so you split it across two GPUs with tensor parallelism. Two 48GB cards give you 96GB total: about 40GB for weights and the rest, minus framework overhead, as KV cache pool.
How many people that pool supports depends heavily on context length, because the KV cache grows with every token in the conversation. On a 70B model with grouped-query attention and a 16-bit cache, a request with a few thousand tokens of context costs on the order of 1 to 2GB, while a long-context one (32K tokens or more) costs around 10GB or more, and architectures without grouped-query attention need several times that. So the same box might comfortably serve a dozen people doing short chat-style turns and only a handful doing long-document analysis. Size for your actual workload, not a best case.
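Per-request cost follows directly from the model's attention layout. The sketch below assumes Llama-3-70B-style dimensions (80 layers, 8 key/value heads, head size 128) and a 16-bit cache; those figures, the 40 GiB pool, and the function names are assumptions for illustration, and it takes the worst case where every request fills its whole context.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(pool_gib: int, context_tokens: int, bytes_per_token: int) -> int:
    """How many full-length requests fit in the pool at once."""
    per_request_bytes = context_tokens * bytes_per_token
    return int(pool_gib * 1024**3 // per_request_bytes)


PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

for context in (4096, 8192, 32768):
    request_gib = context * PER_TOKEN / 1024**3
    fits = max_concurrent_requests(40, context, PER_TOKEN)
    print(f"{context:>6} tokens: {request_gib:.2f} GiB per request, {fits} fit in a 40 GiB pool")
```

Most chat turns use far less than their maximum context, so real concurrency is usually better than this worst case, but the scaling with context length holds.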
- Total VRAM: aim for at least 2x to 2.5x the quantized weight size so the KV cache pool is large, not an afterthought (e.g., ~96GB total for a ~40GB 70B-at-Q4 model).
- Context length: long contexts multiply KV cache cost per user; if your team analyzes big documents, budget far more headroom per concurrent request.
- Quantization: Q4 roughly halves weight size versus 8-bit and frees VRAM for more concurrency, at a small, usually acceptable quality cost for most team tasks.
- GPU interconnect: tensor parallelism shards each layer across the cards, which must exchange results on every forward pass, so fast inter-GPU links (NVLink where available) matter in a way they do not on a single-card box.
- Concurrency targets: pick a realistic peak concurrent-user number and validate it under load before rollout, not from single-user benchmarks.
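Validating a concurrency target means firing that many requests at once and measuring aggregate throughput. The harness below is a minimal sketch: `fake_request` is a hypothetical stand-in that just sleeps, and in practice you would replace it with a coroutine that calls your server's API and returns the number of tokens generated.

```python
import asyncio
import time


async def fake_request(tokens: int = 50, seconds: float = 0.01) -> int:
    """Stand-in for a real call to the server; returns tokens generated."""
    await asyncio.sleep(seconds)
    return tokens


async def load_test(send_request, concurrent_users: int, rounds: int = 3) -> float:
    """Run `rounds` waves of simultaneous requests; return aggregate tokens per second."""
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(rounds):
        # Launch one request per simulated user and wait for the whole wave.
        results = await asyncio.gather(*(send_request() for _ in range(concurrent_users)))
        total_tokens += sum(results)
    return total_tokens / (time.perf_counter() - start)


throughput = asyncio.run(load_test(fake_request, concurrent_users=8))
print(f"{throughput:.0f} tokens/s aggregate at 8 concurrent users")
```

Repeating the run at increasing user counts, with prompts that match your real context lengths, shows where aggregate throughput stops growing and queueing begins.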
Always-on, headless, and running hot: the rack-ready build
A team server is infrastructure, not a workstation, and it has to be built like one. It runs headless in a closet or rack with no monitor attached, reachable over the network and ideally managed remotely. That changes the build in concrete ways single-user guides skip.
The biggest difference is heat. A team box under continuous batching is not idle between prompts; with a full queue it can sit near full TDP for hours, two high-end GPUs together pulling well over 600 watts of sustained load. Cooling has to be sized for that steady state, not for bursts, which usually means strong directional airflow, blower-style or server-grade cards rather than open-air gaming coolers, and a chassis built to move air continuously. The power supply needs real margin above the combined peak draw of both GPUs and the rest of the system, because GPUs spike hard and an undersized PSU is the fastest way to cause mysterious crashes under load.
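A quick way to sanity-check the power supply is to add up sustained draw and apply a margin. Every number here is an assumption for illustration: the 350W per GPU, the 250W for the rest of the system, and the 30 percent margin are placeholders, so substitute your cards' rated figures and your PSU vendor's guidance.

```python
def psu_estimate_watts(gpu_watts: float, gpu_count: int,
                       rest_of_system_watts: float, margin: float = 1.3):
    """Return (sustained draw, suggested PSU rating) in watts under an assumed margin."""
    sustained = gpu_watts * gpu_count + rest_of_system_watts
    return sustained, sustained * margin


# Hypothetical build: two 350W GPUs plus 250W for CPU, drives, and fans.
sustained_watts, suggested_psu_watts = psu_estimate_watts(350, 2, 250)
print(f"sustained ~{sustained_watts:.0f} W, PSU rating ~{suggested_psu_watts:.0f} W or more")
```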
Treat the rest like a small server too: uninterruptible power (a UPS), network access controls so only your team reaches it, and remote management so you are not walking to a closet to reboot. None of this is exotic, but skipping it turns a capable inference box into something that overheats, throttles, or falls over exactly when the whole team depends on it.
Cost versus a cloud API, and why privacy often decides it
The honest case for a self-hosted box is steady, predictable load. Cloud APIs bill per token and cost nothing when idle, which is unbeatable for spiky or occasional use. A team server flips that math: it has a fixed up-front cost and a fixed power bill whether it runs at 5 percent or 95 percent utilization. The breakeven depends almost entirely on how busy you keep it. A box that sits idle most of the day loses to a metered API; one kept near saturation during working hours, every working day, can pay back its hardware in months and then serve for the cost of electricity.
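The breakeven is easy to estimate once you know your monthly token volume. The sketch below is illustrative only: the hardware price, power cost, API price, and token volumes are hypothetical placeholders, not quotes, and the function name is an assumption.

```python
def breakeven_months(hardware_cost: float, monthly_power_cost: float,
                     tokens_per_month: float, api_price_per_million: float):
    """Months until the box pays for itself, or None if it never does at this usage."""
    api_monthly_cost = tokens_per_month / 1_000_000 * api_price_per_million
    monthly_saving = api_monthly_cost - monthly_power_cost
    if monthly_saving <= 0:
        return None  # the metered API is cheaper than the power bill alone
    return hardware_cost / monthly_saving


# Hypothetical figures: $12,000 box, $150/month power, $3 per million API tokens.
busy = breakeven_months(12_000, 150, tokens_per_month=500_000_000, api_price_per_million=3.0)
idle = breakeven_months(12_000, 150, tokens_per_month=30_000_000, api_price_per_million=3.0)
print(busy, idle)
```

With these placeholder numbers the busy box pays back in under a year while the lightly used one never does, which is the utilization effect described above.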
The other half of the case is not financial. For many teams the real driver is data privacy and control: prompts, source code, customer records, and internal documents never leave your network, there is no third-party retention policy to vet, and you are not exposed to a vendor changing prices, deprecating a model, or rate-limiting you mid-project. Run the cost comparison carefully, but weigh it against the value of keeping sensitive data in-house, which for regulated or security-conscious teams is often the deciding factor on its own. The team server and pro dual-GPU builds elsewhere on this site are sized around exactly this tradeoff.
Frequently asked questions
- How many people can one server actually handle?
- There is no fixed number, because it depends on context length more than headcount. A dual-GPU 70B-at-Q4 box can comfortably serve roughly five to fifteen people doing short, chat-style turns, but far fewer if everyone runs long-context document analysis, since a single long conversation can consume many gigabytes of KV cache, a large slice of the pool on its own. Always validate your target concurrency under realistic load before rolling it out.
- Why not just buy one big GPU instead of two?
- A 70B model at 4-bit quantization needs around 38 to 40GB just for weights, which exceeds what a single consumer card holds, and you still need substantial VRAM left over for the KV cache pool that handles concurrent users. Two cards let you both fit the model via tensor parallelism and reserve a large enough pool that requests run in parallel instead of queueing.
- Does vLLM continuous batching slow down individual users?
- Slightly, but it is the right tradeoff for a team. Continuous batching keeps the GPU saturated by running many requests together, so any single user may see marginally lower tokens per second than they would alone on the box. In exchange, the server sustains far higher total throughput and avoids the long queue waits a one-request-at-a-time setup would create once several people use it at once.
- Is a self-hosted server cheaper than a cloud API?
- Only if you keep it busy. The server is a fixed cost in hardware and power whether it runs idle or saturated, so a lightly used box loses to a metered API that bills per token and costs nothing when idle. A box kept near saturation throughout the working day can pay back its hardware in months. For many teams, though, data privacy and control are the deciding factor regardless of the exact breakeven.
- What about cooling and power for an always-on box?
- Size both for sustained, not burst, load. Under a full queue the GPUs can sit near full TDP for hours, with two high-end cards pulling well over 600 watts together, so you need strong continuous airflow, blower-style or server-grade cards rather than open-air gaming coolers, and a power supply with real margin above combined peak draw. Treat it like a small server: uninterruptible power (a UPS), network access controls, and remote management.
Some links in this article are affiliate links. If you buy through them we may earn a commission at no extra cost to you. See our affiliate disclosure.