I wanted to build a medical Q&A fine-tuning project that was truly TPU-native on Kaggle. This project uses Google's `gemma-3-1b-it`, KerasHub, the JAX backend, and a Kaggle TPU to fine-tune on ChatDoctor's medical conversation data. I also compared the result to a much larger Gemma 4 model run remotely via API.
The short version is:
– The Gemma 3 training pipeline ran entirely on TPU
– Completed LoRA fine-tuning on a `TPU v5e-8` in under a minute
– Qualitative answers improved more than benchmark accuracy
– Gemma 4 remained much stronger on MedMCQA in zero-shot mode
It's worth noting that successful fine-tuning runs and benchmark gains are not always the same thing.
You can check out the Kaggle notebook here.
Medical LLM fine-tuning on TPU with Keras + JAX
What I built
A pipeline that does five things:
1. Verifies a Kaggle TPU environment where Keras is bound to the JAX backend.
2. Loads and formats the ChatDoctor dataset into Gemma Chat-style prompt/response pairs.
3. Evaluates the base Gemma 3 1B model on a fixed 100-question MedMCQA subset.
4. Fine-tunes Gemma 3 1B with LoRA on the TPU.
5. Compares the final model to Gemma 4 via OpenRouter to obtain a zero-shot reference point.
The local training path is fully TPU-native for Gemma 3. The Gemma 4 section is intentionally separate and remote, so I refer to it explicitly as an API comparison, not TPU-local inference.
Environment and TPU setup
This run was done in Kaggle using:
Python `3.12.13`
NumPy `2.4.3`
datasets `4.8.3`
Keras `3.13.2`
JAX `0.9.2`
and `8` TPU devices visible to JAX. The crucial setup detail was to ensure that Keras saw `KERAS_BACKEND=jax` before the first import:
```python
import os
os.environ["KERAS_BACKEND"] = "jax"

import keras
import jax

print(keras.backend.backend())  # "jax"
print(jax.devices())            # 8 TPU devices
```
This sounds insignificant, but in notebook environments it is often the difference between a clean TPU run and a confusing backend discrepancy.
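To make that failure mode loud instead of silent, the check can be turned into a guard that fails fast at the top of the notebook. This is a minimal sketch; the function name and defaults are my own, not from the notebook:

```python
import os

# The backend must be set before the first `import keras` anywhere in the
# process; once Keras has initialized, changing the variable has no effect.
os.environ.setdefault("KERAS_BACKEND", "jax")

def validate_runtime(backend: str, num_devices: int,
                     expected: str = "jax", min_devices: int = 8) -> bool:
    """Fail fast unless the runtime matches what the TPU run expects.

    `backend` and `num_devices` are what keras.backend.backend() and
    len(jax.devices()) report at startup.
    """
    if backend != expected:
        raise RuntimeError(f"Keras backend is {backend!r}, expected {expected!r}")
    if num_devices < min_devices:
        raise RuntimeError(f"Only {num_devices} device(s) visible; "
                           f"expected at least {min_devices}")
    return True
```

Calling `validate_runtime(keras.backend.backend(), len(jax.devices()))` right after the imports turns a backend mix-up into an immediate, readable error instead of a confusing failure later in training.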
Dataset selection: ChatDoctor for fine-tuning, MedMCQA for benchmarking
I used two datasets with different roles:
– `LinhDuong/chatdoctor-200k` for conversational medical fine-tuning
– `medmcqa` for a medical multiple-choice benchmark
From the last run:
– Raw ChatDoctor examples: `207,408`
– Valid formatted examples: `207,405`
– Train split used: `1,800`
– Validation split used: `200`
– MedMCQA benchmark subset: `100` questions
Each ChatDoctor example was reformatted into Gemma chat turns:
```python
# patient_msg / doctor_msg come from one ChatDoctor record
text = (
    f"<start_of_turn>user\n"
    f"You are a helpful medical assistant. Answer the patient's question clearly and safely.\n\n"
    f"{patient_msg}<end_of_turn>\n"
    f"<start_of_turn>model\n"
    f"{doctor_msg}<end_of_turn>"
)
```
This structure allowed me to continue preprocessing in KerasHub instead of manually building a separate tokenization training pipeline.
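The formatting step can be sketched as a single mapping function that also drops unusable records, which is presumably how the run went from 207,408 raw to 207,405 valid examples. The field names `input` and `output` are my assumptions about the ChatDoctor schema:

```python
def format_chatdoctor(example, field_in="input", field_out="output"):
    """Map one raw ChatDoctor record to a Gemma chat-formatted string.

    Returns None for records with an empty patient or doctor message so
    they can be filtered out before training.
    """
    patient_msg = (example.get(field_in) or "").strip()
    doctor_msg = (example.get(field_out) or "").strip()
    if not patient_msg or not doctor_msg:
        return None
    return (
        "<start_of_turn>user\n"
        "You are a helpful medical assistant. "
        "Answer the patient's question clearly and safely.\n\n"
        f"{patient_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{doctor_msg}<end_of_turn>"
    )
```

With the Hugging Face `datasets` library, this maps cleanly over the raw split and the `None` results are filtered out in one pass.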
Baseline: Gemma 3 1B before fine-tuning
As a baseline, I loaded `google/gemma-3-1b-it` directly via KerasHub:
```python
baseline_preprocessor = keras_hub.models.Gemma3CausalLMPreprocessor.from_preset(
    "hf://google/gemma-3-1b-it",
    sequence_length=256,
)
baseline_gemma_lm = keras_hub.models.Gemma3CausalLM.from_preset(
    "hf://google/gemma-3-1b-it",
    preprocessor=baseline_preprocessor,
)
baseline_gemma_lm.compile(sampler="greedy")
```
On the 100-question MedMCQA subset, the base model scored 30/100, i.e. 30.0% accuracy. This gave me the before-fine-tuning reference point for the rest of the project.
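The benchmark score comes down to pulling an option letter out of each generated reply and counting exact matches against the gold answers. A minimal scoring sketch; the regex-based letter extraction is my assumption, not necessarily what the notebook does:

```python
import re

def extract_choice(generated: str):
    """Pull the first standalone A-D option letter out of a model reply."""
    match = re.search(r"\b([A-D])\b", generated.upper())
    return match.group(1) if match else None

def score_mcq(predictions, answers):
    """Exact-match accuracy over parallel lists of replies and gold letters."""
    correct = sum(
        extract_choice(pred) == gold
        for pred, gold in zip(predictions, answers)
    )
    return correct, correct / len(answers)
```

The main practical pitfall is parsing: small instruction-tuned models often wrap the letter in prose ("The answer is B."), so a forgiving extractor matters as much as the model itself.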
Fine-tuning setup: LoRA on TPU
The model has `999,885,952` total parameters. With LoRA enabled, only `2,609,152` parameters were trainable, which makes this practical on a Kaggle TPU without attempting to update the full model.
The LoRA step was simple:
```python
gemma_lm.backbone.enable_lora(rank=16)
```
Training configuration from the last run:
sequence length: `256`
train samples: `1,800`
validation samples: `200`
batch size: `1`
epochs: `1`
learning rate: `1e-4`
LoRA rank: `16`
The model was compiled with SGD plus sparse categorical cross-entropy:
```python
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")],
    sampler="greedy",
)
```
TPU training results
The training run completed successfully on the TPU, and the first batch behaved exactly as TPU users expect: initially slower due to XLA compilation, then much faster steady-state execution.
Final TPU training stats from notebook:
hardware: `Google Cloud TPU (8 devices)`
model: `google/gemma-3-1b-it`
framework: `Keras 3.13.2 + JAX backend`
method: `LoRA (rank=16)`
train examples: `1800`
batch size: `1`
epochs: `1`
total train time: `0.7 minutes`
throughput: `~10512 tokens/sec`
Training metrics:
train loss: `1.8927`
validation loss: `1.7304`
train token accuracy: `0.3061`
validation token accuracy: `0.3129`
Token-level accuracy is a language model training signal, not a metric for “accuracy in answering medical questions.” Therefore, I treated it as a sanity check rather than a headline finding.
Did the fine-tuning improve the model?
This is where the story gets interesting. On MedMCQA the answer was: not really.
The fine-tuned Gemma 3 model also scored 30/100, i.e. 30.0% accuracy. That's exactly flat against the baseline on this benchmark.
But the qualitative results still changed. In the side-by-side examples, the fine-tuned model became somewhat more direct and safety-oriented. For example, for a question about chest pain and shortness of breath:
– The base model's answer was cautious and explanatory
– The fine-tuned answer leaned more clearly toward an urgent medical evaluation
This pattern was also evident in the other examples. The model didn't suddenly become a benchmark leader, but it did become more aligned with the tone of the medical guidance in the fine-tuning data.
This is an important practical lesson: fine-tuning on domain conversation data can meaningfully change response style and safety emphasis without significantly moving a multiple-choice benchmark.
Held-out ROUGE-L check
I also ran a simple check on 20 held-out validation samples from ChatDoctor, measuring ROUGE-L between model outputs and the reference responses.
Final result: ROUGE-L = `0.106`
I wouldn’t overstate this number. It is best treated as a crude similarity signal rather than a substitute for clinical quality assessment. Still, it adds another perspective beyond MedMCQA.
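For reference, ROUGE-L reduces to a longest-common-subsequence F1 over tokens. A self-contained sketch, assuming simple whitespace tokenization; real implementations also handle stemming and casing, so scores will differ slightly from library output:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """Token-level ROUGE-L F1 via longest common subsequence (LCS)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ct == rt
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Averaging this over the 20 held-out pairs gives the kind of aggregate number reported above; a score near 0.1 mostly reflects lexical overlap, not clinical correctness.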
Gemma 4 comparison via OpenRouter
For a stronger reference model, I added an optional Gemma 4 section via OpenRouter:
– Model: `google/gemma-4-26b-a4b-it`
– Execution mode: remote API inference, not TPU-local
The Gemma 4 comparison was useful, but it was not part of the TPU training path. It answered a practical question: how does a fine-tuned small model compare to a larger, newer model out of the box?
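The remote call is a standard OpenAI-style chat completion against OpenRouter. A sketch of the request body; the system prompt wording is my assumption, and the model id is the one used in this run:

```python
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(question: str, model: str = "google/gemma-4-26b-a4b-it"):
    """Build the JSON body for an OpenRouter chat-completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer the multiple-choice question with a single "
                        "letter (A, B, C, or D)."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.0,  # deterministic answers for benchmarking
    }

# Hypothetical dispatch with the `requests` library and an API key:
# resp = requests.post(OPENROUTER_URL, json=build_request(q),
#                      headers={"Authorization": f"Bearer {api_key}"})
```

Keeping temperature at zero and reusing the same letter-extraction logic as the local benchmark keeps the two scores comparable.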
On the same 100 MedMCQA questions, Gemma 4 scored 68/100, i.e. 68.0% accuracy. That's a big gap over both the base and the fine-tuned Gemma 3 runs.
The qualitative results were also stronger. In the chest-pain example, Gemma 4 went straight to emergency instructions, specifically telling the user to treat the situation as a potential emergency and seek medical attention immediately. In the fatigue, thirst, and frequent urination example, it clearly recognized the classic symptom triad associated with diabetes.
Final results
Additional metrics:
– ROUGE-L on held-out ChatDoctor samples: `0.106`
– Trainable parameters with LoRA: `2,609,152`
– Total parameters: `999,885,952`
– End-to-end TPU training time: `0.7 minutes`
What this project actually shows
The biggest takeaway is that fine-tuning is not automatically better than a stronger base model; my final run simply does not support that claim.
What it shows is:
– Keras with the JAX backend is a convenient TPU training stack on Kaggle
– Gemma 3 1B can be LoRA fine-tuned entirely on a TPU with a very small trainable-parameter budget
– Fine-tuning medical dialogue can change qualitative behavior even if benchmark accuracy remains the same
– A much larger newer model can still dominate factual medical MCQ scoring in zero-shot mode
Final Thoughts
The Gemma 3 training path worked exactly how I wanted it to, i.e. easy setup, clean TPU execution, small trainable footprint, and quick iteration. The results were also a good reminder that not every successful training run results in an eye-catching benchmark jump. Sometimes true success comes from building a solid pipeline, understanding what has changed, and being honest about what hasn’t changed.
Overall, I successfully fine-tuned Gemma 3 1B for medical dialogue on a Kaggle TPU with Keras and JAX, observed qualitative improvements in response style, and confirmed that a remote Gemma 4 baseline still significantly outperforms it on MedMCQA.
Fine-Tuning Gemma 3 on TPU for Medical Q&A with Keras and JAX was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.
