[Tutorial] Running Google’s Gemma-4b locally with Google ADK and dual A40 GPUs

The Google Agent Development Kit (ADK) is a modular framework for building agentic workflows. Although it is deeply integrated with Gemini, its model-agnostic design allows you to connect local models deployed via vLLM.

However, getting the LiteLLM configuration right can be surprisingly difficult. To save you the trial and error, this guide provides a proven approach to connecting local inference with Google’s agent tools.

In this tutorial, you will learn how to serve Gemma 3 (4B) with vLLM on high-performance GPUs (e.g., single or dual NVIDIA A40s) and integrate it with Google ADK using a custom LiteLLM configuration.

I have also used this approach in a more complex project whose source code you can find here; what we cover in this tutorial is just the prototype version.

🛠️ Phase 0: Installation and Environment Setup (with UV)

To efficiently handle the extensive dependencies for vLLM and Google ADK, we use uv, the extremely fast Python package manager and resolver.

1. Install uv and create the environment

First, make sure uv is installed, then create a virtual environment from the terminal.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a new project environment
uv venv --python 3.11
source .venv/bin/activate

2. Install dependencies

We need the Google ADK for the agent logic and vLLM for the inference server.

# Install the core stack
uv pip install google-adk vllm huggingface_hub pydantic
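Before moving on, it can help to confirm that the packages actually landed in the active environment. The snippet below is a minimal sketch: missing_modules is a helper name made up for this post, and the import names in REQUIRED are assumed to match the packages installed above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names assumed for the packages installed above.
REQUIRED = ["google.adk", "vllm", "huggingface_hub", "pydantic"]

print("Missing modules:", missing_modules(REQUIRED) or "none")
```

If anything shows up as missing, re-run the install command inside the activated .venv.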

🏗️ Phase 1: Deploying Gemma 3 with vLLM

Gemma 3 4B is a high-performance multimodal model. To use it for agent tasks (e.g., tool calls), we need to expose it via an OpenAI-compatible API and specify the tool parser.

1. Preparation of the environment

Before starting the server, configure your environment to be optimized for the A40s. On enterprise GPUs like the A40, standard P2P/IB communication can sometimes cause NCCL timeouts depending on your PCIe topology. We disable these for maximum stability.

export VLLM_USE_V1=0 
export NCCL_P2P_DISABLE=1 
export NCCL_IB_DISABLE=1

2. Download the model weights

According to the official Hugging Face documentation for Gemma 3, you should use the Hugging Face CLI to get the weights.

💡 Since Gemma 3 is a gated model, make sure you have accepted the license agreement on the model card and are authenticated in your terminal.
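If you prefer to stay in Python, the weights can also be fetched with the huggingface_hub package we installed earlier. This is a sketch of that alternative, not the CLI route itself: download_weights and its download_fn hook are names invented for this example, while snapshot_download is the library function that does the real work.

```python
def download_weights(repo_id="google/gemma-3-4b-it", download_fn=None):
    """Download a model snapshot into the local Hugging Face cache.

    `download_fn` is a hypothetical hook so the function can be exercised
    without network access. By default it uses huggingface_hub.snapshot_download,
    which picks up the HF_TOKEN environment variable or the token saved at login.
    """
    if download_fn is None:
        # Imported lazily so this file loads even before dependencies are installed.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    return download_fn(repo_id=repo_id)


# Example (needs network access and an accepted Gemma license):
# local_path = download_weights()
# print("Weights cached at:", local_path)
```

vLLM reads from the same Hugging Face cache, so once the snapshot is there the server should not need to download it again.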

3. Starting the inference server

This step uses the downloaded weights. If you are using two A40s, set --tensor-parallel-size to 2 to split the model across both GPUs; on a single A40, set it to 1.
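To get a feel for how much room the model takes on an A40, here is a rough back-of-the-envelope sketch. It assumes about 4 billion parameters (rounded from the model name), 2 bytes per parameter (16-bit weights), and even sharding across GPUs. The function names are made up for illustration, and real usage also includes the KV cache, activations, and runtime overhead.

```python
def weights_gib_per_gpu(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GiB, assuming even sharding."""
    return num_params * bytes_per_param / tensor_parallel_size / 2**30


def vllm_budget_gib(gpu_memory_gib, gpu_memory_utilization):
    """Memory vLLM is allowed to claim on one GPU."""
    return gpu_memory_gib * gpu_memory_utilization


A40_MEMORY_GIB = 48      # an A40 has 48 GB; treated as GiB for this rough estimate
ASSUMED_PARAMS = 4e9     # assumption: ~4 billion parameters, rounded from the model name

for tp in (1, 2):
    weights = weights_gib_per_gpu(ASSUMED_PARAMS, tensor_parallel_size=tp)
    budget = vllm_budget_gib(A40_MEMORY_GIB, 0.90)
    print(f"TP={tp}: ~{weights:.1f} GiB of weights, ~{budget - weights:.1f} GiB left per GPU")
```

Under these assumptions the weights fit on a single card with plenty to spare, so the second GPU mostly buys extra headroom for long contexts and concurrent requests.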

💡 To ensure that the server keeps running even if your SSH connection is lost, we use nohup to run the process in the background. However, this is of course optional.

nohup vllm serve "google/gemma-3-4b-it" \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--tensor-parallel-size 2 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser openai \
--distributed-executor-backend mp > gemma-4b.log 2>&1 &

Note: You can monitor the logs in real time with tail -f gemma-4b.log.
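Loading the weights can take a while, so scripts that talk to the server should wait until it responds. Below is a small polling sketch using only the standard library. The helper names (http_ok, wait_until_ready) and the retry defaults are my own choices; the /v1/models route is part of the OpenAI-compatible API that vLLM exposes.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=60, delay=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return True on success, False on timeout."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example: wait up to roughly five minutes while vLLM loads the weights.
# ready = wait_until_ready("http://localhost:8000/v1/models")
```

The probe and sleep parameters are injectable so the loop can be tested without a running server.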

4. Verification

To make sure everything is wired correctly, you can do a quick test curl from your terminal to the vLLM server, independent of the ADK:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "google/gemma-3-4b-it",
        "messages": [{"role": "user", "content": "What is the best city for tech in Germany?"}],
        "temperature": 0.7
    }'
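The same check can be done from Python, which is handy if you want to script it. This is a standard-library sketch that mirrors the curl call above; build_chat_request and first_message_text are helper names invented here, and the response layout assumed is the usual OpenAI chat completions shape (choices, then message, then content).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_message_text(response_body):
    """Extract the assistant text from a chat completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example (requires the vLLM server from the previous step to be running):
# request = build_chat_request("http://localhost:8000/v1", "google/gemma-3-4b-it",
#                              "What is the best city for tech in Germany?")
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(first_message_text(response.read()))
```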

🐍 Phase 2: Connect Google ADK to local vLLM

The Google ADK uses LiteLlm as a bridge to various backends. To force the agent to use your local vLLM instance instead of a cloud API, we use a helper function to configure the model parameters, including guided JSON for structured outputs.

1. The model configuration function

Create a file named agent.py.

💡 Note the use of extra_body to enforce schema compliance via guided_json; this is completely optional if you are just creating a prototype.

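import asyncio
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from pydantic import BaseModel
# 1. Define the expected response structure for Guided Decoding
class AgentResponse(BaseModel):
    answer: str
    confidence: float
# 2. Custom LiteLLM configuration function
def _litellm(
    model: str,
    api_base: str,
    max_tokens: int,
    temperature: float = 0.7,
) -> LiteLlm:
    return LiteLlm(
        model=model,
        api_base=api_base,
        custom_llm_provider="openai",
        api_key="not-needed",  # vLLM doesn't require a key by default
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=0.95,
        extra_body={
            "guided_json": AgentResponse.model_json_schema(),
        },
    )
# 3. Initialize the Model
# Point this to your local vLLM instance
local_model = _litellm(
    model="google/gemma-3-4b-it",
    api_base="http://localhost:8000/v1",
    max_tokens=8192,
)
# 4. Initialize the Agent
# Note: Use 'instruction' (singular) for the system prompt
root_agent = LlmAgent(
    name="GemmaLocalAgent",
    model=local_model,
    instruction="You are a travel expert. Answer the user's query.",  # this is a placeholder instruction
)
# 5. Simple Runner for testing
# An agent's run_async expects an invocation context rather than a plain string,
# so the query is sent through a Runner with an in-memory session.
APP_NAME, USER_ID, SESSION_ID = "gemma_local_app", "local_user", "local_session"  # placeholder IDs
async def main():
    print("Querying local Gemma 3...")
    session_service = InMemorySessionService()
    # create_session is a coroutine in ADK 1.x; drop the await on older releases
    await session_service.create_session(app_name=APP_NAME, user_id=USER_ID, session_id=SESSION_ID)
    runner = Runner(agent=root_agent, app_name=APP_NAME, session_service=session_service)
    message = types.Content(role="user", parts=[types.Part(text="What is the capital of Germany?")])
    async for event in runner.run_async(user_id=USER_ID, session_id=SESSION_ID, new_message=message):
        if event.is_final_response() and event.content and event.content.parts:
            print(f"\nResponse: {event.content.parts[0].text}")
if __name__ == "__main__":
    asyncio.run(main())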
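With guided_json enabled, the final response text should be a JSON object with the answer and confidence fields, so it is worth parsing it before using it downstream. The sketch below is dependency-free and purely illustrative: ParsedResponse and parse_agent_reply are names made up here, and the sample reply string is invented, not real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedResponse:
    """Plain-Python mirror of the AgentResponse schema (answer, confidence)."""
    answer: str
    confidence: float


def parse_agent_reply(text):
    """Parse the model's JSON reply and check the two expected fields."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    answer = data.get("answer")
    confidence = data.get("confidence")
    if not isinstance(answer, str):
        raise ValueError("'answer' must be a string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("'confidence' must be a number")
    return ParsedResponse(answer=answer, confidence=float(confidence))


# Invented sample reply, only to show the expected shape.
reply = parse_agent_reply('{"answer": "Berlin", "confidence": 0.93}')
print(reply.answer, reply.confidence)
```

In agent.py itself, where pydantic is already imported, AgentResponse.model_validate_json(text) does the same job in one call.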

Troubleshooting tips

OOM (Out of Memory): If you are working on a single A40, set --tensor-parallel-size to 1. If the model still doesn’t fit, reduce --max-model-len.
Problems calling tools: If the agent does not trigger tools, check the vLLM logs. Gemma 3 requires special prompt formatting; make sure your --tool-call-parser matches the expected format of the model.
Port conflicts: If port 8000 is busy, use --port 8080 in the vLLM command and update the api_base in your Python code accordingly.
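For the port-conflict case, a quick way to see whether something is already listening is to try connecting to the port. This is a minimal standard-library sketch; port_is_free and first_free_port are helper names invented for this example, and it only checks the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) != 0


def first_free_port(candidates):
    """Return the first free port from `candidates`, or None if all are busy."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


print("Suggested port:", first_free_port([8000, 8080]))
```

Whatever port you end up with, pass the same value to vLLM and to api_base in agent.py.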

Are you ready to build the API layer? Head over to my FastAPI + ADK deep dive to learn how to connect everything we’ve built today with a production-quality backend.
Although this tutorial focuses on Gemma-4b, you can use any open-source Hugging Face model supported by LiteLLM.
I have also used this approach in a more complex project whose source code can be found here; what we cover in this tutorial is just the prototype version.

✨ If you liked the article, please subscribe to my latest posts.
To get in touch, reach out on LinkedIn or via ashmibanerjee.com.

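GenAI usage disclosure: GenAI models were used to check the blog for grammatical inconsistencies and to refine the text for clarity. The author takes full responsibility for the content presented in this blog.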

[Tutorial] Running Google’s Gemma-4b locally with Google ADK and two A40 GPUs was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.