Local CUDA vLLM Setup for Python-Only Development Using a Precompiled Wheel

This guide shows how to:

- run vllm directly from a local clone
- make Python-only code changes in the repo
- use precompiled native libraries from a nightly wheel

This is useful when you want:

- editable Python code from your clone
- working native extensions (`vllm._C`, etc.)
- no full source compilation

Install `uv`

`curl -LsSf https://astral.sh/uv/install.sh | sh`

Clone `vllm`

`git clone https://github.com/vllm-project/vllm.git`
`cd vllm`

Create and activate a virtual environment

From the repo root:

`uv venv --python 3.12 --seed`
`source .venv/bin/activate`

Inspect available nightly wheel variants

Check the nightly index:

`curl -L https://wheels.vllm.ai/nightly/`

Example output:

<!DOCTYPE html>
<html>
  <!-- Generated on 2026-05-19T03:30:43.127052 (commit 287471b99442b44c5a16c4d70b0f3e178dd52732) -->
  <meta name="pypi:repository-version" content="1.0">
  <body>
    <a href="cpu/">cpu/</a><br/>
    <a href="cu129/">cu129/</a><br/>
    <a href="cu130/">cu130/</a><br/>
    <a href="vllm/">vllm/</a><br/>
  </body>
</html>

Notes:

- `cu129` = CUDA 12.9 wheel variant
- `cu130` = CUDA 13.0 wheel variant
- the HTML comment records the commit of the latest nightly build
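
The same information can be pulled out of the index page programmatically. The following is a minimal sketch, assuming the page keeps the layout shown in the example output; `parse_nightly_index` is a hypothetical helper name, not part of vLLM or its tooling.

```python
import re


def parse_nightly_index(html):
    """Extract the build commit and the CUDA variant directories from the index HTML."""
    # Assumption: the commit sits in a "(commit <40 hex chars>)" HTML comment.
    commit_match = re.search(r"\(commit ([0-9a-f]{40})\)", html)
    # Each directory is a relative link such as href="cu130/".
    directories = re.findall(r'href="([^"/]+)/"', html)
    return {
        "commit": commit_match.group(1) if commit_match else None,
        # Keep only names that look like CUDA variants (cu129, cu130, ...).
        "cuda_variants": [d for d in directories if re.fullmatch(r"cu\d+", d)],
    }


sample = """<!-- Generated on 2026-05-19T03:30:43.127052 (commit 287471b99442b44c5a16c4d70b0f3e178dd52732) -->
<a href="cpu/">cpu/</a><br/>
<a href="cu129/">cu129/</a><br/>
<a href="cu130/">cu130/</a><br/>
<a href="vllm/">vllm/</a><br/>"""

info = parse_nightly_index(sample)
print(info["commit"])
print(info["cuda_variants"])
```

Regular expressions are enough here because the page is a flat list of links; a more elaborate index would call for a real HTML parser.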

Pick a CUDA variant

For a server with CUDA 13.0 support, use:

cu130

To inspect that variant:

`curl -L https://wheels.vllm.ai/nightly/cu130/vllm/metadata.json`

Example output:

[
  {
    "package_name": "vllm",
    "version": "0.21.1rc1.dev89+g287471b99",
    "python_tag": "cp38",
    "abi_tag": "abi3",
    "platform_tag": "manylinux_2_24_aarch64",
    "filename": "vllm-0.21.1rc1.dev89+g287471b99-cp38-abi3-manylinux_2_24_aarch64.whl",
    "path": "../../../287471b99442b44c5a16c4d70b0f3e178dd52732/vllm-0.21.1rc1.dev89%2Bg287471b99-cp38-abi3-manylinux_2_24_aarch64.whl"
  },
  {
    "package_name": "vllm",
    "version": "0.21.1rc1.dev89+g287471b99",
    "python_tag": "cp38",
    "abi_tag": "abi3",
    "platform_tag": "manylinux_2_24_x86_64",
    "filename": "vllm-0.21.1rc1.dev89+g287471b99-cp38-abi3-manylinux_2_24_x86_64.whl",
    "path": "../../../287471b99442b44c5a16c4d70b0f3e178dd52732/vllm-0.21.1rc1.dev89%2Bg287471b99-cp38-abi3-manylinux_2_24_x86_64.whl"
  }
]

Choose the entry matching your machine.

For a typical Linux x86_64 server, select:

`"platform_tag": "manylinux_2_24_x86_64"`
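
Choosing the entry can also be scripted. This is a rough sketch that assumes `metadata.json` has the structure shown in the example output and that the platform tag ends with the value reported by `platform.machine()` (for example `x86_64` or `aarch64` on Linux); `pick_wheel_entry` is a hypothetical helper name.

```python
import platform


def pick_wheel_entry(entries, machine=None):
    """Return the first metadata entry whose platform tag matches the CPU architecture."""
    machine = machine or platform.machine()
    for entry in entries:
        # Assumption: tags look like "manylinux_2_24_<machine>".
        if entry["platform_tag"].endswith("_" + machine):
            return entry
    raise LookupError(f"no wheel entry for architecture {machine!r}")


# Trimmed copy of the example metadata above.
entries = [
    {"platform_tag": "manylinux_2_24_aarch64",
     "filename": "vllm-0.21.1rc1.dev89+g287471b99-cp38-abi3-manylinux_2_24_aarch64.whl"},
    {"platform_tag": "manylinux_2_24_x86_64",
     "filename": "vllm-0.21.1rc1.dev89+g287471b99-cp38-abi3-manylinux_2_24_x86_64.whl"},
]

print(pick_wheel_entry(entries, machine="x86_64")["filename"])
```

Passing `machine` explicitly makes it possible to pick a wheel for a different host than the one running the script.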

Python tag note

You may see:

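`cp38-abi3`

That is fine for Python 3.12: `cp38-abi3` means the wheel targets CPython's stable ABI (abi3) with Python 3.8 as the minimum version, so it also loads on later CPython 3 releases.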
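
The compatibility rule can be written down as a small check. This sketch only covers `abi3` wheels with a `cpXY` Python tag; `abi3_wheel_supports` is a hypothetical helper, and real installers apply a more complete set of tag rules.

```python
import sys


def abi3_wheel_supports(python_tag, version=None):
    """Check whether an abi3 wheel tagged e.g. 'cp38' can load on the given CPython 3.x."""
    version = tuple(version or sys.version_info[:2])
    if not python_tag.startswith("cp3"):
        return False
    # "cp38" -> (3, 8), "cp312" -> (3, 12)
    minimum = (3, int(python_tag[3:]))
    # The stable ABI is forward compatible within CPython 3.
    return version[0] == 3 and version >= minimum


print(abi3_wheel_supports("cp38", (3, 12)))
print(abi3_wheel_supports("cp313", (3, 12)))
```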

Download the wheel file manually

Using the x86_64 entry above, substitute that entry's `path` value for `<path>`:

`curl -L "https://wheels.vllm.ai/nightly/cu130/vllm/<path>" -o /tmp/vllm-cu130.whl`
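
Because the `path` field is relative to the location of `metadata.json`, the final download URL can be computed with the standard library. A minimal sketch, assuming the relative path format shown in the example metadata; `wheel_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

METADATA_URL = "https://wheels.vllm.ai/nightly/cu130/vllm/metadata.json"


def wheel_url(entry_path, metadata_url=METADATA_URL):
    """Resolve a relative 'path' value from metadata.json into an absolute URL."""
    # urljoin collapses the "../" segments and leaves percent-encoding (%2B) untouched.
    return urljoin(metadata_url, entry_path)


path = (
    "../../../287471b99442b44c5a16c4d70b0f3e178dd52732/"
    "vllm-0.21.1rc1.dev89%2Bg287471b99-cp38-abi3-manylinux_2_24_x86_64.whl"
)
print(wheel_url(path))
```

The printed URL can then be handed to `curl -L ... -o /tmp/vllm-cu130.whl` in place of the templated one.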

Sanity-check the wheel contents

`python -m zipfile -l /tmp/vllm-cu130.whl | head -40`

You should see native libraries such as:

vllm/_C.abi3.so
vllm/_C_stable_libtorch.abi3.so
vllm/_moe_C.abi3.so
vllm/_flashmla_C.abi3.so
vllm/cumem_allocator.abi3.so
vllm/spinloop.abi3.so

If the command ends with a `BrokenPipeError`, that is harmless: `head` closes the pipe after 40 lines while `python -m zipfile` is still writing the listing.
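
An alternative that avoids the pipe entirely is to filter the archive listing in Python. This is a small sketch using the standard `zipfile` module; `native_libraries` is a hypothetical helper, and the `.so` suffix filter is an assumption based on the file names listed above.

```python
import os
import zipfile


def native_libraries(wheel_path):
    """Return the sorted names of shared libraries (.so files) inside a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(".so"))


WHEEL = "/tmp/vllm-cu130.whl"
if os.path.exists(WHEEL):  # only list the wheel if it has been downloaded
    for name in native_libraries(WHEEL):
        print(name)
```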

Install vLLM editable using the exact wheel file

Set the environment variables:

`export VLLM_USE_PRECOMPILED=1`
`export VLLM_PRECOMPILED_WHEEL_LOCATION=/tmp/vllm-cu130.whl`

Then install:

`uv pip install -U -e . --torch-backend=cu130`

Why use VLLM_PRECOMPILED_WHEEL_LOCATION?

- it bypasses flaky automatic wheel selection logic
- it forces the build to use the exact wheel you downloaded
- it still installs the package in editable mode from your local repo
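
If the install is driven from a script, the two variables can be set on a copied environment, with an early failure when the wheel path is wrong. A sketch under those assumptions; `precompiled_env` is a hypothetical helper, and the install command is the one from this guide, left commented out.

```python
import os
from pathlib import Path


def precompiled_env(wheel_path, base_env=None):
    """Return a copy of the environment with the precompiled-wheel variables set."""
    wheel = Path(wheel_path)
    if not wheel.is_file():
        # Fail before the build starts rather than partway through it.
        raise FileNotFoundError(f"wheel not found: {wheel}")
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_USE_PRECOMPILED"] = "1"
    env["VLLM_PRECOMPILED_WHEEL_LOCATION"] = str(wheel)
    return env


INSTALL_COMMAND = ["uv", "pip", "install", "-U", "-e", ".", "--torch-backend=cu130"]
# From the repo root, with the virtual environment active:
# import subprocess
# subprocess.run(INSTALL_COMMAND, env=precompiled_env("/tmp/vllm-cu130.whl"), check=True)
```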

Verify that Python loads from your clone

`python -c "import vllm, inspect; print(inspect.getfile(vllm))"`

Expected result (the exact path depends on where you cloned the repo):

`/home/ubuntu/vllm/vllm/__init__.py`

This confirms that Python code is coming from your local clone.
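
Comparing the path by eye works, but the check can also be made explicit. A minimal sketch; `is_loaded_from` is a hypothetical helper, and the paths in the example are the ones from this guide.

```python
from pathlib import Path


def is_loaded_from(module_file, repo_root):
    """Return True if module_file lives somewhere under repo_root."""
    module_path = Path(module_file).resolve()
    root = Path(repo_root).resolve()
    return root in module_path.parents


print(is_loaded_from("/home/ubuntu/vllm/vllm/__init__.py", "/home/ubuntu/vllm"))

# With vLLM installed, the same check from the repo root would look like:
# import inspect, vllm
# print(is_loaded_from(inspect.getfile(vllm), "."))
```

A result of `False` would suggest that Python is importing a copy from `site-packages` instead of the clone.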

Verify that the native extension loads

`python -c "import vllm._C; print('ok')"`

Expected output:

ok

This confirms that the precompiled native library is usable.

Verify the CLI works

`vllm --help`

If this works, the setup is ready.

Run the server

Example:

`CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8000`

In another shell, test it:

`curl http://localhost:8000/v1/models`
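
The same request can be made from Python when the check is part of a script. This sketch assumes the endpoint returns an OpenAI-style list of the form `{"object": "list", "data": [{"id": ...}]}`; `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def model_ids(payload):
    """Pull the model identifiers out of a /v1/models response body."""
    # Assumption: models are listed under "data", each with an "id" field.
    return [model["id"] for model in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000"):
    """Query a running server and return the ids of the models it serves."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return model_ids(json.load(response))


# With the server from the previous step running:
# print(fetch_model_ids())
```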

Development workflow

If you modify Python files under the repo, for example:

- `vllm/entrypoints/...`
- `vllm/engine/...`
- `vllm/core/...`
- other Python modules

then restart the process and your changes should be picked up.

You do not need to reinstall for normal Python-only changes.
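
To confirm that a change set really is Python-only, the changed files can be listed and filtered. A rough sketch: `non_python_changes` and `changed_paths` are hypothetical helpers, the example paths are made up, and a non-Python file in the result is only a prompt to look closer, not proof that a reinstall is required.

```python
import subprocess


def non_python_changes(paths):
    """Return the changed paths that are not Python source files."""
    return [path for path in paths if path and not path.endswith(".py")]


def changed_paths(repo_root="."):
    """List files that differ from HEAD, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Hypothetical change set for illustration:
example = ["vllm/entrypoints/example.py", "csrc/example.cu"]
print(non_python_changes(example))

# In the clone itself:
# print(non_python_changes(changed_paths(".")))
```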


https://jifengwu2k.github.io/2026/05/19/Local-CUDA-vLLM-Setup-for-Python-Only-Development/

Author

Jifeng Wu

Posted on

May 19, 2026
