K Health enhances its AI physician by fine-tuning Gemma 3 with real-world clinical data

Gemma 3 helps medical professionals deliver better patient care through an AI physician with more natural and effective chat capabilities

K Health offers high-quality, accessible healthcare through a 24/7 virtual AI primary care platform that covers services including urgent care, chronic condition management, medical weight loss, and mental health support.

To optimize care and provide personalized services, the organization created an AI physician to assist its licensed medical providers. The developers at K Health migrated their AI physician model to Gemma 3, hosted on Google Cloud's Vertex AI platform, to improve the patient intake process and reduce inference costs.

The challenge

K Health’s AI physician gathers comprehensive medical information from patients via chat during the intake process. However, the team observed that the existing intake chat struggled to feel conversational, empathetic, and professional. The team’s goals were to enhance the chat experience for patients, incorporate new symptoms and medical knowledge into the training process, and reduce inference costs.

The team operated on the core philosophy that a smaller model, trained correctly, can often outperform larger ones. Consequently, their strategy prioritized teaching the model decision-making logic rather than simple content generation.

The solution

After reviewing Llama and other open models, the team identified Gemma 3 on Vertex AI as the optimal solution for balancing computational performance with inference cost-efficiency. To support large-scale training, the team established a structured procurement process for 16-node H100 GPU clusters and developed reusable scripts to streamline training and inference. This infrastructure enabled efficient training of the Gemma 3 4B, 12B, and 27B parameter variants, alongside the domain-specific MedGemma 27B model, to meet the needs of the AI physician.

The team used direct preference optimization (DPO) and created an evaluation function to review conversation quality. Conversations were evaluated on three key criteria: accuracy of medical details, conversational coherence, and clinical outcomes (such as whether the chat led to in-person referrals, lab tests, or prescriptions). For each case, 10 synthetic chats were generated and scored with the evaluation function; the best and worst chats were then paired to teach the model the decision-making logic behind a successful patient interaction.
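As a minimal sketch of that selection step, assuming hypothetical generate_chat and evaluate_chat helpers (the criterion weights below are invented for illustration, not K Health's), each case's sampled chats are scored and the top and bottom chats become the chosen/rejected pair for DPO:

```python
# Illustrative sketch of building DPO preference pairs from scored chats.
# `generate_chat` is a hypothetical stand-in for the chat generator, and
# the criterion weights below are invented, not K Health's evaluation function.

def evaluate_chat(chat: dict) -> float:
    """Score a chat on the three criteria described above.

    Each criterion is assumed to be pre-rated on a 0-1 scale."""
    return (
        0.4 * chat["medical_accuracy"]
        + 0.3 * chat["coherence"]
        + 0.3 * chat["clinical_outcome"]
    )

def build_preference_pairs(cases, generate_chat, n_samples=10):
    """For each case, sample n_samples synthetic chats and pair the
    best-scoring one (chosen) with the worst (rejected) for DPO."""
    pairs = []
    for case in cases:
        chats = [generate_chat(case) for _ in range(n_samples)]
        ranked = sorted(chats, key=evaluate_chat)
        pairs.append({
            "prompt": case["intake_prompt"],
            "chosen": ranked[-1]["text"],   # highest-scoring chat
            "rejected": ranked[0]["text"],  # lowest-scoring chat
        })
    return pairs
```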

While DPO training is highly effective, the team needed to find the optimal number of epochs (training rounds) to avoid degrading model performance. After identifying the ideal epoch count for each parameter size, the team found that Gemma 3 4B showed the most significant improvement from DPO training: its “business score,” a key internal metric, rose from 0.48 to 0.76 with 10 epochs. Gemma 3 12B achieved the highest overall score of 0.81 after 20 training epochs.

Chart: Business scores for Gemma 3 4B, Gemma 3 12B, Gemma 3 27B, and MedGemma 27B across four tuning states: base, 4-bit DPO (10 epochs), 8-bit DPO (10 epochs), and 8-bit DPO (20 epochs). Gemma 3 4B rises sharply from a 0.48 base to 0.76 with 4-bit DPO; Gemma 3 12B reaches the highest overall score of 0.81 with 8-bit DPO at 20 epochs; Gemma 3 27B improves from a 0.56 base to 0.68 with 8-bit DPO at 10 epochs; and MedGemma 27B scores 0.71 with 8-bit DPO at 10 epochs, outperforming its 20-epoch version (0.65).
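The chart's DPO variants combine an epoch budget with a quantized training setup. K Health trained with Axolotl on its own infrastructure, but as a rough, assumed illustration of how an 8-bit DPO run at a fixed epoch count can be wired up, here is a minimal sketch using Hugging Face TRL's DPOTrainer; the text-only Gemma 3 1B checkpoint, the toy preference pair, and the batch settings are placeholders, not K Health's configuration.

```python
# Minimal DPO sketch with Hugging Face TRL; an assumed stand-in, not
# K Health's actual Axolotl pipeline. Preference pairs use the
# prompt/chosen/rejected format produced by the selection step above.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# Text-only 1B checkpoint keeps the sketch small; K Health tuned the
# 4B, 12B, and 27B variants on multi-node H100 clusters.
model_id = "google/gemma-3-1b-it"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # "8-bit DPO"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A single toy pair; in practice this comes from build_preference_pairs.
train_dataset = Dataset.from_list([
    {
        "prompt": "Patient: I've had a sore throat and mild fever since yesterday.",
        "chosen": "I'm sorry to hear that. How high has the fever been, and do you have any trouble swallowing?",
        "rejected": "Fever how high?",
    }
])

args = DPOConfig(
    output_dir="gemma3-dpo",
    num_train_epochs=10,          # the epoch budget swept per model size
    gradient_checkpointing=True,  # trades recompute for memory on long chats
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    # LoRA adapters are required to train on top of an 8-bit base model.
    peft_config=LoraConfig(task_type="CAUSAL_LM", target_modules="all-linear"),
)
trainer.train()
```

Quantizing the base model and enabling gradient checkpointing are the same memory-saving levers the team credits below for getting multi-billion-parameter training to fit.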

MedGemma 27B achieved a promising business score of 0.71 after DPO tuning, but it would incur higher inference costs. Ran Ilan Ber, VP of Data Science at K Health, noted that the result is competitive with the general-purpose Gemma models and highlights the potential for further improvement with more comprehensive tuning data and optimized infrastructure.

The impact

Ultimately, the team highlighted the success of the Gemma 3 4B model because it validated their hypothesis regarding efficiency, demonstrating that a smaller, general-purpose model fine-tuned on high-quality decision data can surpass a larger, domain-specific model.

During the process, the team discovered that their optimal training setup was Axolotl with Accelerate on a custom multi-node virtual machine, reducing model training time by 66%, from 4.5 hours to just 1.5 hours. They overcame significant technical hurdles in training by implementing gradient checkpointing and 8-bit precision; these mitigations let them reach a training configuration that achieved 90–95% accuracy while effectively preventing overfitting. A self-reflection mechanism enabled the model to assess its own output for factual consistency and conversational flow, and with this additional optimization the average number of API calls per chat fell from 100 to 60.
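The post doesn't detail the self-reflection mechanism. As a minimal sketch, assuming a hypothetical llm.generate(prompt) client and an invented PASS/FAIL critique format, a draft reply can be self-checked and regenerated only when the critique fails, which is one plausible way redundant calls get trimmed:

```python
# Minimal self-reflection sketch, assuming a hypothetical llm.generate(prompt)
# client and an invented PASS/FAIL critique format; not K Health's actual
# implementation. Drafts are regenerated only when the self-check fails.

REFLECTION_PROMPT = (
    "Review the draft reply below for factual consistency with the patient "
    "record and for natural conversational flow. Start your answer with "
    "PASS or FAIL, then explain briefly.\n\n"
    "Patient record:\n{record}\n\nDraft reply:\n{draft}"
)

def reply_with_reflection(llm, record: str, question: str, max_attempts: int = 3) -> str:
    """Draft a reply, self-critique it, and retry only on failure."""
    draft = llm.generate(f"{record}\nPatient: {question}\nPhysician:")
    for _ in range(max_attempts):
        verdict = llm.generate(REFLECTION_PROMPT.format(record=record, draft=draft))
        if verdict.strip().upper().startswith("PASS"):
            return draft
        # Feed the critique back so the retry is targeted, not a blind resample.
        draft = llm.generate(
            f"{record}\nPatient: {question}\n"
            f"A previous draft was rejected for this reason: {verdict}\nPhysician:"
        )
    return draft  # fall back to the latest draft if no attempt passes
```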

These efficiency gains, coupled with Gemma’s lower inference costs, resulted in substantial savings. With the DPO training, the AI physician now demonstrates far more natural conversational capabilities than its previous iteration. “This successful pilot has validated K Health’s vision for an AI-driven patient intake system and has set a new benchmark for creating efficient, effective, and truly conversational AI in clinical settings,” concluded Ran.


More from the Gemmaverse

The Ministry of Economy, Ecology and Agriculture of Ukraine digitizes licensing process with Gemma

Adaptive ML trains Gemma 3 for exceptional multilingual results

Quarks improves user experiences with Gemma 2 and Gemma 3

Sarvam AI built a translation model with Gemma 3 to translate all 22 officially recognized Indian languages

Institute of Science Tokyo creates powerful Japanese-focused LLM with Gemma 2

Introducing GAIA, a Brazilian Portuguese Gemma 3 model developed with ABRIA, CEIA, Nama, and Amadeus AI