Set Up an LLM Dev Environment on Windows 11

Craig Li, Ph.D

Welcome to this follow-up to my previous article (you can read it here).

In this article, we’ll explore how to set up an LLM development environment on a Windows 11 PC with an NVIDIA GPU (mine is a 3080 Ti) using NVIDIA Docker and WSL 2. All the steps are based on the official NVIDIA guide: https://docs.nvidia.com/cuda/wsl-user-guide/index.html.

Initial Setup

1. Install NVIDIA GPU Drivers

Visit NVIDIA’s driver download page and select the driver corresponding to your GPU model. This is the only driver you need to install during this whole process; do not install a separate Linux GPU driver inside WSL.

2. Install WSL 2

On Windows 11, the wsl command is available out of the box. Open Windows Terminal, Command Prompt, or PowerShell, then install the default Linux distribution (Ubuntu) and make sure WSL itself is up to date:

wsl --install
wsl --update
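
After the install completes (a reboot may be required), you can confirm your distribution is running under WSL 2 rather than WSL 1:

wsl -l -v

The VERSION column should show 2 for your distribution.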

Installing CUDA on WSL

1. Enter WSL

Use the following command to enter your WSL distribution:

wsl

2. Install the CUDA Toolkit

First, remove the outdated GPG key:

sudo apt-key del 7fa2af80

Then set up the CUDA repository for WSL-Ubuntu and install the toolkit (12.4.1 at the time of writing; check NVIDIA’s download page for the current release):

# Pin the CUDA repository so its packages take priority
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
# Download and register the local repository package
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
# Install the repository signing key, then the toolkit itself
sudo cp /var/cuda-repo-wsl-ubuntu-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
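
The toolkit is not added to your PATH automatically. As a quick sanity check (assuming the default install location of /usr/local/cuda; adjust the path if yours differs), you can call the compiler directly:

/usr/local/cuda/bin/nvcc --version

If that prints the CUDA 12.4 compiler version, the toolkit is in place; you can optionally add /usr/local/cuda/bin to your PATH in ~/.bashrc.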

3. Verify the Installation

Check that WSL can see your NVIDIA driver and CUDA version with:

nvidia-smi

The output is a table showing your driver version, the CUDA version it supports, and your GPU (the RTX 3080 Ti in my case).

To verify that Docker is installed in WSL, enter:

docker --version

It should print a Docker version string.

If Docker isn’t installed, follow the official Docker installation guide.
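
As a minimal sketch (assuming Ubuntu under WSL; Docker Desktop with the WSL 2 backend is the other common route), the distribution’s own package is the quickest option:

# Install Docker from Ubuntu's repositories and start the daemon
sudo apt-get update
sudo apt-get install -y docker.io
sudo service docker start

For the --gpus flag used below to work, you may also need NVIDIA’s Container Toolkit (the nvidia-container-toolkit package, installed from NVIDIA’s apt repository per their documentation).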

Setting Up Jupyter Notebook

1. Start Jupyter via Docker

Here I used the TensorFlow Docker image tensorflow/tensorflow:2.14.0-gpu-jupyter, because this was the release that supported the CUDA version reported on my machine (12.0). Change the tag to match your own CUDA version.

Run the GPU-enabled TensorFlow image, mounting your code directory into the container and publishing Jupyter’s port:

sudo docker run -v /mnt/<path-to-your-code>:/tf/projects --gpus all -p 8888:8888 tensorflow/tensorflow:2.14.0-gpu-jupyter

Here --gpus all exposes your GPU to the container, -v mounts your code directory, and -p 8888:8888 publishes the notebook port. After the image downloads and the container starts, the startup log ends with a URL (including an access token) for Jupyter Notebook; open it in your browser.
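
If Jupyter starts but you want to rule out GPU-passthrough problems, a quick smoke test is to run nvidia-smi inside a throwaway container (the CUDA image tag here is just an example; any recent nvidia/cuda tag works):

sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the same driver table you saw in WSL, containers can see the GPU.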

2. Install Dependencies

Install the packages you need from Jupyter notebook code cells, for example:

! pip install -q -U torch
! pip install -q -U transformers
! pip install -q -U accelerate

Restart the kernel afterwards so the newly installed packages are picked up.
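
Before loading a model, it is worth confirming that PyTorch can see the GPU from inside the container; a minimal check in a notebook cell:

import torch

# Both of these assume the container was started with --gpus all
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name your GPU, e.g. the 3080 Ti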

Run an LLM from Hugging Face

Here I used Microsoft’s Phi-3-mini-4k-instruct, which is small and does not require authentication to download. The code below is from its Hugging Face model card: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

# Load the model onto the GPU; torch_dtype="auto" uses the checkpoint's dtype
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "system", "content": "You are a helpful digital assistant. Please provide safe, ethical and accurate information to the user."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: temperature is ignored when do_sample=False
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

The model responds with a step-by-step solution to the equation, and GPU memory usage is high while the cell runs, which confirms the response was generated on the GPU.
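
To watch this live, run the following from a second WSL terminal while the cell executes:

watch -n 1 nvidia-smi

GPU memory usage should jump by several gigabytes while the model is loaded and generating.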

Summary

This setup lets you use your GPU efficiently when running Large Language Models on Windows 11, providing a capable platform for machine learning work. That said, I cannot guarantee it will work on every machine; the most reliable way to experiment with LLMs is still on native Linux or in the cloud (Colab, for example).
