
How to Build a vLLM Container Image Using Docker?

Introduction

As large language models (LLMs) power increasingly advanced AI systems, inference performance is becoming critical. Whether you are deploying chatbots, summarization tools, or other LLM-powered applications, faster inference translates directly into a better user experience and lower costs.

vLLM is an open-source LLM inference engine that is highly memory-efficient and delivers high throughput. Here, we will guide you through the complete step-by-step process of creating a container image for vLLM. Let’s go through it!

What is vLLM?

vLLM (Virtual Large Language Model) is a highly optimized inference engine for large language models. Its main features are:

  • Maximizes throughput through efficient request-scheduling algorithms.
  • Minimizes memory consumption with PagedAttention, a novel approach to attention memory management.
  • Supports dynamic (continuous) batching to serve many concurrent requests with minimal latency.

vLLM also supports popular models like LLaMA, GPT-J, Falcon, and many others, and can serve them via a Python API as well as a RESTful HTTP API, making it a useful backend for online LLM applications.
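
As an illustration, once a vLLM API server is running (for example on port 8000), you can query it over HTTP. The request below is a minimal sketch that assumes the OpenAI-compatible server entrypoint and an example model name; adjust both to match your deployment.

# Minimal sketch: query a running vLLM OpenAI-compatible server (endpoint and model are assumptions)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-1.3b", "prompt": "Hello, vLLM!", "max_tokens": 32}'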

Why Containerize vLLM?

Here are the main reasons to containerize vLLM:

  • Portability: The same container image runs on development, staging, and production.
  • Reproducibility: No more “it works on my machine” problems.
  • Scalability: Deploying vLLM on cloud platforms, Kubernetes clusters, or cloud GPU servers is a scalable and production-ready approach.
  • Simplicity: All dependencies, configuration, and scripts live in one reusable package.

What are the prerequisites for building a vLLM container?

Before starting to build your vLLM container, there are some concepts you should understand.

  • Docker Basics: A basic understanding of Dockerfiles and of building and running containers, images, volumes, and networks. If you’re using Ubuntu, you can follow our guide to install Docker on Ubuntu 24.04.
  • Python & ML Environments: Familiarity with Python virtual environments, installing packages with pip, and machine learning frameworks such as PyTorch.
  • System Requirements: Since vLLM is built for high-performance inference, it should run on a machine with (a quick verification sketch follows this list):
    – Docker installed
    – An NVIDIA GPU with CUDA support
    – The NVIDIA Container Toolkit configured
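
To confirm each requirement is in place, you can run a few quick checks. This is a minimal sketch; the CUDA base image tag is only an example and should match your installed driver version.

# Confirm Docker is installed and the daemon responds
docker --version
docker info

# Confirm the NVIDIA driver can see the GPU
nvidia-smi

# Confirm the NVIDIA Container Toolkit exposes the GPU inside containers
# (the CUDA image tag is an example; pick one matching your driver)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi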

How do you build a vLLM container image?

Let’s have a look at the step-wise tutorial to build a vLLM container image!

Before getting started, you will need to:

  • Access the server over SSH.
  • Start the Docker service (a short sketch covering both steps follows the command below).
sudo systemctl start docker
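
If you also want Docker to start automatically after a reboot, here is a minimal sketch, assuming a systemd-based distribution such as Ubuntu and placeholder SSH details:

# Log in to the server (replace user and host with your own)
ssh user@your-server-ip

# Start Docker now and enable it at boot
sudo systemctl enable --now docker

# Confirm the service is active
sudo systemctl status docker --no-pager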

Server Setup Configurations

Step 1: Create a new directory you will use to store your vLLM project files.

mkdir vllm-project

Step 2: Change directory to the new directory.

cd vllm-project

Step 3: Clone the vLLM project with Git.

git clone https://github.com/vllm-project/vllm/

Step 4: List the files and verify that a new vllm directory has been created.

ls

Step 5: Change directory to the vLLM project directory.

cd vllm

Next, list the directory files and verify you have the necessary Dockerfile resources.

ls

Output:

benchmarks       Dockerfile       MANIFEST.in               requirements-neuron.txt
cmake            Dockerfile.cpu   pyproject.toml            rocm_patch
CMakeLists.txt   Dockerfile.rocm  README.md                 setup.py
collect_env.py   docs             requirements-common.txt   vllm
CONTRIBUTING.md  examples         requirements-cpu.txt
csrc             LICENSE          requirements-dev.txt

The vLLM project directory consists of the following Dockerfile resources:

  • Dockerfile: The main build context for vLLM on systems with NVIDIA GPUs.
  • Dockerfile.cpu: The build context for CPU-only systems.
  • Dockerfile.rocm: The build context for AMD GPU (ROCm) systems.

You can now use whichever of these resources matches your hardware to create a GPU- or CPU-based container image.

How to Build a vLLM Container Image for CPU Systems?

To build a new vLLM container image using Dockerfile.cpu, follow the steps below. This Dockerfile includes all of the packages and dependencies relevant to CPU-based systems.

Step 1: Build a new container image using Dockerfile.cpu, with the project working directory as the build context. Replace vllm-cpu-image with your preferred image name.

docker build -f Dockerfile.cpu -t vllm-cpu-image .

Step 2: View all the Docker images on the server and confirm that the new vllm-cpu-image is available.

docker images

Output:

REPOSITORY          TAG       IMAGE ID       CREATED         SIZE
vllm-cpu-image      latest    a1b2c3d4e5f6   10 seconds ago  6.2GB
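
To try the freshly built image, you can start a container from it. The command below is a minimal sketch that assumes the image's default entrypoint launches the vLLM API server; the model name is only an example and will be downloaded from Hugging Face on first run.

# Minimal sketch: serve a small example model from the CPU image on port 8000
docker run -p 8000:8000 vllm-cpu-image --model facebook/opt-125m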

For NVIDIA GPU (CUDA) Systems

If your server has an NVIDIA GPU and you want to leverage GPU acceleration, you will need to build the vLLM container image using the main Dockerfile. This setup makes sure that all GPU dependencies, such as the CUDA libraries, are included in the container, enabling vLLM to run smoothly on NVIDIA GPUs.

Now, you can build the GPU-enabled container image by using the command below:

docker build -f Dockerfile -t vllm-gpu-image .

Verify the images

docker images

Note: When running this container, use the --gpus all flag to give the container access to the GPU.
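
For example, a minimal run command might look like the sketch below. The model name and the Hugging Face cache mount are assumptions for illustration; it also assumes the image's default entrypoint starts the vLLM API server.

# Minimal sketch: serve an example model on an NVIDIA GPU (paths and model are illustrative)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-gpu-image --model facebook/opt-1.3b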

For AMD GPU (ROCm) Systems

If you’re working with an AMD GPU, you’ll need to make sure your container image has the ROCm libraries included to enable GPU acceleration. The vLLM repository has a special Dockerfile.rocm that helps you create a ROCm-enabled container, allowing you to leverage AMD GPUs for your inference tasks and other GPU Computing workloads.

Build the ROCm-enabled container image using

docker build -f Dockerfile.rocm -t vllm-rocm-image .

Verify the image

docker images

Note: Make sure that the ROCm drivers are properly installed and available on your host system so that the container can work as intended.
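
When starting a ROCm-based container, the GPU is typically exposed through the /dev/kfd and /dev/dri device nodes rather than the --gpus flag. The command below is only an illustrative sketch with assumed flags and model name; check the vLLM ROCm documentation for the options recommended for your setup.

# Minimal sketch: expose an AMD GPU to the container via ROCm device nodes
docker run -p 8000:8000 \
  --device /dev/kfd --device /dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  vllm-rocm-image --model facebook/opt-1.3b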

After verifying your images, you can remove any unused Docker images to free up disk space; see our detailed article on removing Docker images for the exact steps.

What are common issues when building vLLM containers?

1. CUDA compatibility errors

Fix these by matching the CUDA version used in the image with the GPU driver version installed on the host.

2. Out-of-memory errors

Reduce the batch size, lower the maximum context length, or use a smaller model (see the example flags after this list).

3. Slow inference

Confirm that the GPU is actually being used, for example by watching nvidia-smi while the server handles requests.
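
As an illustration of how memory-related issues are often addressed at the serving layer, the flags below form a sketch of options accepted by the vLLM server; the specific values are assumptions and should be tuned for your GPU and model.

# Minimal sketch: memory- and length-related serving flags (values are examples only)
docker run --gpus all -p 8000:8000 vllm-gpu-image \
  --model facebook/opt-1.3b \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --dtype half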

Conclusion

Containerizing vLLM is an effective and efficient way to deploy large language models when you need portability and consistency across environments. In summary, the steps are:

  • Configure a CUDA-enabled Docker environment.
  • Install the system and Python dependencies that vLLM needs.
  • Build and run a GPU-enabled Docker image that serves your model.
  • Test and validate the deployment of your large language model via real API calls.

This foundational image can now serve as the basis of your LLM inference infrastructure whether you choose to deploy it on a local server, in a cloud instance, or within a container orchestration platform like Kubernetes.

FAQs

What models work with vLLM?

vLLM is compatible with popular models such as:

LLaMA / LLaMA 2
Falcon
MPT
GPT-J / GPT-NeoX
OPT

Can I run vLLM on a CPU-only machine?

vLLM is optimized for GPU inference and performs best with a CUDA-compatible NVIDIA GPU. A CPU-only image can be built with Dockerfile.cpu (as shown above), but throughput will be much lower than on a GPU.

How do I change the model that is being served in the container?

Update the command in your Dockerfile, or override the command when starting the container (assuming the image does not already define a conflicting entrypoint), for example:

docker run --gpus all -p 8000:8000 vllm-container \
  python -m vllm.entrypoints.api_server --model facebook/opt-1.3b

You could also utilize environment variables or a shell script as the entrypoint.

Can I use vLLM with Kubernetes?

Yes, it works well with Kubernetes for scalable deployments.

What is the fastest way to deploy vLLM?

Using a prebuilt Docker image with GPU support is the fastest method.

Is it possible to run multiple models in the same container?

vLLM is designed to serve one model per process, but you can:

  • Run multiple containers, each serving a separate model (see the sketch after this list).
  • Use a load balancer or API gateway to route requests to the right model.
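
A minimal sketch of that pattern, where the ports, image name, and model names are all assumptions for illustration:

# One container per model, each published on its own host port
docker run -d --gpus all -p 8000:8000 vllm-gpu-image --model facebook/opt-1.3b
docker run -d --gpus all -p 8001:8000 vllm-gpu-image --model tiiuae/falcon-7b

# A reverse proxy or API gateway can then route requests to the right port.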