[Tutorial] Running Google’s Gemma-4b locally with Google ADK and dual A40 GPUs For Windows 7,8,10,11-Winpcsoft.com

Der Google Agent Development Kit (ADK) is a modular framework for building agent workflows. Although it is deeply integrated into Gemini, its model-agnostic design allows you to connect local models deployed via vLLM.

Get that LiteLLM configuration Jedoch, getting this setup right can be surprisingly difficult. To save you the trial and error, this guide provides a proven approach to connecting local inference with Google’s agent tools.

In this tutorial you will learn how to serve Gemma 3 (4B) with vLLM on high-performance GPUs (z.B. single or dual NVIDIA A40s) and integrate it with Google ADK using a custom LiteLLM configuration.

I have also used this approach in a more complex project whose You can find the source code here; What we cover in this tutorial is just the prototype version.

🛠️ Phase 0: Installation and Environment Setup (with UV)

To efficiently handle the extensive dependencies for vLLM and Google ADK, we use uv, the extremely fast Python package manager and resolver.

1. Install uv and create the environment

First, make sure UV is installed and create a synchronized virtual environment from the terminal.

# Install uv if you haven't already
curl -LsSf [https://astral.sh/uv/install.sh](https://astral.sh/uv/install.sh) | sh
# Create a new project environment
uv venv --python 3.11
source .venv/bin/activate

2. Install dependencies

We need the Google ADK for the agent logic and perhaps for the inference server.

# Install the core stack
uv pip install google-adk vllm huggingface_hub pydantic

🏗️ Phase 1: Deploying Gemma 3 with vLLM

Gemma 3 4B is a high performance multimodal model. To use it for agent tasks (z.B. tool calls), we need to expose it via an OpenAI compatible API and specify the tool parser.

1. Preparation of the environment

Before starting the server, configure your environment to be optimized for the A40s. On enterprise GPUs like the A40, standard P2P/IB communication can sometimes cause NCCL timeouts depending on your PCIe topology. We disable these for maximum stability.

export VLLM_USE_V1=0 
export NCCL_P2P_DISABLE=1 
export NCCL_IB_DISABLE=1

2. Download the model weights

According to the official Hugging Face documentation for Gemma 3You should use the Huggingface CLI to get the weights.

💡 Since Gemma 3 is a closed model, make sure you have accepted the license agreement on the model card and are authenticated in your terminal.

3. Starting the inference server

We will use the downloaded weights. If you are using two A40s, we set tensor-parallel-size to 2 to split the model between both GPUs.

💡 To ensure that the server keeps running even if your SSH connection is lost, we use nohup to run the process in the background. Jedoch, this is of course optional.

nohup vllm serve "google/gemma-3-4b-it" \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser openai \
--distributed-executor-backend mp > gemma-4b.log 2>&1

Notiz: You can monitor the logs in real time tail -f vllm.log.

4. Verification

To make sure everything is wired correctly, you can do a quick test curl from your terminal to the vLLM server, independent of the ADK:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-3-4b-it",
        "messages": [{"role": "user", "content": "What is the best city for tech in Germany?"}],
        "temperature": 0.7
    }'

🐍 Phase 2: Connect Google ADK to local vLLM

The Google ADK uses LiteLlm as a bridge to various backends. To force the agent to use your local vLLM instance instead of a cloud API, we use a helper function to configure the model parameters, including Guided JSON for structured expenses.

1. The model configuration function

Create a file named agent.py.

💡 Note the use of extra_body to enforce schema compliance via guided_json and is completely optional if you are just creating a prototype.

import asyncio
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from pydantic import BaseModel
# 1. Define the expected response structure for Guided Decoding
class AgentResponse(BaseModel):
    answer: str
    confidence: float
# 2. Custom LiteLLM configuration function
def _litellm(
    model: str,
    api_base: str,
    max_tokens: int,
    temperature: float = 0.7
) -> LiteLlm:
    return LiteLlm(
        model=model,
        api_base=api_base,
        custom_llm_provider="openai",
        api_key="not-needed", # vLLM doesn't require a key by default
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        extra_body={
            "guided_json": AgentResponse.model_json_schema(),
        },
    )
# 3. Initialize the Model
# Point this to your local vLLM instance
local_model = _litellm(
    model="google/gemma-3-4b-it",
    api_base="http://localhost:8000/v1",
    max_tokens=8192,
)
# 4. Initialize the Agent
# Notiz: Use 'instruction' (singular) for the system prompt
root_agent = LlmAgent(
    name="GemmaLocalAgent",
    model=local_model,
    instruction="You are travel expert. Answer the user's query." # this is a placeholder instruction
)
# 5. Simple Runner for testing
async def main():
    print("Querying local Gemma 3...")
    # ADK uses async streaming or simple run
    async for event in root_agent.run_async("What is the capital of Germany?"):
        if event.is_final_response():
             print(f"\nResponse: {event.content.parts[0].text}")
if __name__ == "__main__":
    asyncio.run(main())

Troubleshooting tips

OOM (Out of Memory): If you are working on a single A40, set –tensor-parallel-size to 1. If the model still doesn’t fit, reduce –max-model-len.
Problems calling the tool: If the agent does not trigger tools, check the vLLM logs. Gemma 3 requires special prompt formatting; Make sure your –tool-call-parser matches the expected format of the model.
Port conflicts: If port 8000 is busy, use –port 8080 in the vLLM command and update the api_base in your Python code accordingly.

Are you ready to build the API layer? Head over to my FastAPI + ADK deep dive to learn how to connect everything we’ve built today with a production-quality backend.
Although this tutorial focuses on that Gemma-4byou can use any open source HugingFace model supported by LiteLLM.
I have also used this approach in a more complex project whose source code can be found Here; What we cover in this tutorial is just the prototype version.

✨ If you like the article, please subscribe to my latest.
To get in touch, contact me at LinkedIn or about ashmibanerjee.com.

GenAI Usage Disclosure: GenAI models were used to check the blog for grammatical inconsistencies and refine the text for clarity. The authors take full responsibility for the content presented in this blog.

[Tutorial] Running Google’s Gemma-4b locally with Google ADK and two A40 GPUs was originally published in Google Developer Experts on Medium, where people are continuing the discussion by highlighting and responding to this story.