MedGemma Setup Guide for Local Development¶
This guide explains how to set up MedGemma 1.5 4B for use with MedExpertMatch via OpenAI-compatible APIs only. The application does not use Ollama natively; all AI providers must expose an OpenAI-compatible endpoint (e.g. LiteLLM, vLLM, or Ollama's OpenAI-compatible API if available).
Overview¶
MedGemma is Google's collection of open models for medical text and image comprehension, built on Gemma 3. MedExpertMatch uses OpenAI-compatible providers only; Ollama is excluded as a direct provider. To run MedGemma locally, the options are:
- LiteLLM proxy (recommended) - Exposes an OpenAI-compatible API; can route to Ollama, vLLM, or other backends
- Ollama's OpenAI-compatible endpoint (if available) - Some Ollama versions expose /v1/chat/completions; use it only if it is fully compatible
Reference: MedGemma Documentation
Prerequisites¶
- A way to run MedGemma behind an OpenAI-compatible API, for example:
- LiteLLM (recommended): Python 3.8+ with pip; can proxy to Ollama, vLLM, or remote APIs
- Ollama (optional): Only if using Ollama as the backend for LiteLLM, or if your Ollama version exposes an OpenAI-compatible endpoint
- Sufficient disk space (~2.5 GB for Q4_K_M quantized model)
- GPU recommended (but not required)
Step 1: Download MedGemma GGUF Model¶
Download the MedGemma 1.5 4B GGUF model from Hugging Face:
# Option 1: Using wget (Q4_K_M quantization - smaller, lower quality)
wget https://huggingface.co/unsloth/medgemma-1.5-4b-it-GGUF/resolve/main/medgemma-1.5-4b-it-Q4_K_M.gguf
# Alternative: Using wget (Q8_0 quantization - larger, higher quality)
wget https://huggingface.co/unsloth/medgemma-1.5-4b-it-GGUF/resolve/main/medgemma-1.5-4b-it-Q8_0.gguf
# Option 2: Using huggingface-cli
pip install huggingface-hub
huggingface-cli download unsloth/medgemma-1.5-4b-it-GGUF medgemma-1.5-4b-it-Q4_K_M.gguf --local-dir ./models
Model Information:
- Model: MedGemma 1.5 4B Instruction-Tuned (GGUF quantization)
- Quantizations Available:
- Q4_K_M: ~2.49 GB (balanced size/performance)
- Q8_0: ~4.37 GB (larger, higher quality)
- Architecture: Gemma 3 decoder-only transformer
- Modalities: Multimodal (text + images)
- Context length: 128K tokens
- Source: Hugging Face - unsloth/medgemma-1.5-4b-it-GGUF
Step 2: Create Ollama Modelfile¶
Create a file named Modelfile in the same directory as the downloaded GGUF file. The Modelfile differs depending on which quantization you chose:
For Q4_K_M model (default, smaller size):¶
cat > Modelfile << 'EOF'
FROM ./medgemma-1.5-4b-it-Q4_K_M.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
EOF
For Q8_0 model (higher quality, larger size):¶
cat > Modelfile-Q8_0 << 'EOF'
FROM ./medgemma-1.5-4b-it-Q8_0.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
EOF
Note: Adjust the FROM path if your GGUF file is in a different location.
Step 3: Import Model into Ollama¶
Import the model into Ollama (choose the appropriate command for your quantization):
For Q4_K_M model (default, smaller size):¶
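The import command below uses the same model name referenced in the rest of this guide (and in the Troubleshooting section):
ollama create medgemma:1.5-4b -f Modelfile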
For Q8_0 model (higher quality, larger size):¶
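ollama create medgemma:1.5-4b-q8 -f Modelfile-Q8_0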
Verify the model is available:
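ollama list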
You should see medgemma:1.5-4b (and/or medgemma:1.5-4b-q8) in the list.
Step 4: Configure OpenAI-Compatible Access¶
MedExpertMatch requires OpenAI-compatible APIs. Since Ollama is already running as a service, you have two options:
Option A: Use Ollama's OpenAI-Compatible Endpoint (If Available)¶
Some Ollama versions support OpenAI-compatible endpoints. Check if your Ollama service exposes /v1/chat/completions:
# Test if Ollama has OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "medgemma:1.5-4b", "messages": [{"role": "user", "content": "test"}]}'
If this works, you can use Ollama directly without LiteLLM; configure the environment variables as shown in Step 5, Option A.
Note: Ollama's OpenAI-compatible endpoint may not be fully compatible. If you encounter issues, use Option B.
Option B: Use LiteLLM Proxy (Recommended for Full Compatibility)¶
If Ollama doesn't have OpenAI-compatible endpoints, or if you need full compatibility, use LiteLLM as a proxy:
Install LiteLLM:
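A typical installation; the proxy extra provides the litellm server CLI (exact extras can vary by LiteLLM version):
pip install 'litellm[proxy]'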
Start LiteLLM with MedGemma (on a different port to avoid conflict with Ollama):
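litellm --model ollama/medgemma:1.5-4b --port 8000
# OR for Q8_0 model:
# litellm --model ollama/medgemma:1.5-4b-q8 --port 8000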
Important: Use port 8000 (or another port) since Ollama is already using 11434.
Then configure the environment variables as shown in Step 5, Option B.
Step 5: Configure MedExpertMatch¶
The application-local.yml file is already configured for MedGemma. Choose your configuration based on whether you're using Ollama directly or the LiteLLM proxy:
Option A: Using Ollama's OpenAI-Compatible Endpoint (If Available)¶
If Ollama supports OpenAI-compatible endpoints, configure:
export CHAT_PROVIDER=openai
export CHAT_BASE_URL=http://localhost:11434 # Ollama service
export CHAT_MODEL=hf.co/unsloth/medgemma-27b-text-it-GGUF:IQ3_XXS # or medgemma:1.5-4b
export CHAT_API_KEY=ollama
export CHAT_TEMPERATURE=0.7
export CHAT_MAX_TOKENS=6000
export EMBEDDING_PROVIDER=openai
export EMBEDDING_BASE_URL=http://localhost:11434
export EMBEDDING_API_KEY=ollama
export EMBEDDING_MODEL=nomic-embed-text
export EMBEDDING_DIMENSIONS=768
export RERANKING_PROVIDER=openai
export RERANKING_BASE_URL=http://localhost:11434
export RERANKING_API_KEY=ollama
export RERANKING_MODEL=hf.co/unsloth/medgemma-27b-text-it-GGUF:IQ3_XXS # or medgemma:1.5-4b
export RERANKING_TEMPERATURE=0.1
export TOOL_CALLING_PROVIDER=openai
export TOOL_CALLING_BASE_URL=http://localhost:11434
export TOOL_CALLING_API_KEY=ollama
export TOOL_CALLING_MODEL=functiongemma
export TOOL_CALLING_TEMPERATURE=0.7
export TOOL_CALLING_MAX_TOKENS=4096
Important: Pull FunctionGemma for tool calling (see Step 7).
Option B: Using LiteLLM Proxy (Recommended)¶
If using LiteLLM proxy (for full OpenAI compatibility), configure:
export CHAT_PROVIDER=openai
export CHAT_BASE_URL=http://localhost:8000 # LiteLLM proxy (different port)
export CHAT_MODEL=medgemma:1.5-4b # or medgemma:1.5-4b-q8 for Q8_0 model
export CHAT_API_KEY=ollama
export CHAT_TEMPERATURE=0.7
export CHAT_MAX_TOKENS=6000
export EMBEDDING_PROVIDER=openai
export EMBEDDING_BASE_URL=http://localhost:8000
export EMBEDDING_API_KEY=ollama
export EMBEDDING_MODEL=nomic-embed-text
export EMBEDDING_DIMENSIONS=768
export RERANKING_PROVIDER=openai
export RERANKING_BASE_URL=http://localhost:8000
export RERANKING_API_KEY=ollama
export RERANKING_MODEL=medgemma:1.5-4b # or medgemma:1.5-4b-q8 for Q8_0 model
export RERANKING_TEMPERATURE=0.1
export TOOL_CALLING_PROVIDER=openai
export TOOL_CALLING_BASE_URL=http://localhost:8000
export TOOL_CALLING_API_KEY=ollama
export TOOL_CALLING_MODEL=functiongemma
export TOOL_CALLING_TEMPERATURE=0.7
export TOOL_CALLING_MAX_TOKENS=4096
Start LiteLLM with MedGemma (on a different port to avoid conflict with Ollama):
litellm --model ollama/medgemma:1.5-4b --port 8000
# OR for Q8_0 model:
# litellm --model ollama/medgemma:1.5-4b-q8 --port 8000
Step 6: Pull Embedding Model (Required)¶
MedGemma doesn't include embeddings, so you'll need a separate embedding model:
# Pull nomic-embed-text (recommended, 768 dimensions, good quality)
ollama pull nomic-embed-text
# OR pull alternative embedding models
ollama pull qwen3-embedding:0.6b # 1024 dimensions, faster, less memory
ollama pull qwen3-embedding:8b # 1024 dimensions, best quality
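To sanity-check the embedding model, you can call the OpenAI-style embeddings endpoint. This sketch assumes Ollama exposes /v1/embeddings on port 11434; use port 8000 instead if you route embeddings through LiteLLM:
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "test"}'
# The data[0].embedding array should contain 768 values, matching EMBEDDING_DIMENSIONS=768.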
Step 7: Pull FunctionGemma for Tool Calling (Required)¶
MedExpertMatch uses FunctionGemma for tool/function calling operations because MedGemma 1.5 4B doesn't support tools:
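A pull command consistent with the TOOL_CALLING_MODEL value used in Step 5; the exact tag in your Ollama registry may differ, so adjust if needed:
ollama pull functiongemma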
Why FunctionGemma?
- MedGemma 1.5 4B does NOT support tool calling
- FunctionGemma DOES support tool calling
- The MedicalAgentConfiguration uses FunctionGemma (toolCallingChatModel) for agent operations
- Regular chat operations still use MedGemma (primaryChatModel)
Step 8: Run MedExpertMatch¶
Start MedExpertMatch with the local profile:
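MedExpertMatch is a Spring Boot application (it ships application-local.yml and a Swagger UI), so assuming a Maven wrapper is present, a typical invocation looks like this; adjust if the project uses Gradle or runs from a packaged JAR:
./mvnw spring-boot:run -Dspring-boot.run.profiles=local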
Or set the profile via environment variable:
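# SPRING_PROFILES_ACTIVE is Spring's standard profile variable; the run command again assumes a Maven wrapper
export SPRING_PROFILES_ACTIVE=local
./mvnw spring-boot:run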
Verification¶
- Check the Ollama service is running
- Check the MedGemma model appears in Ollama's model list
- Test the OpenAI-compatible endpoint (if using Ollama directly)
- Test LiteLLM (if using the LiteLLM proxy)
- Test FunctionGemma (if using Ollama directly)
- Test MedExpertMatch: access the Swagger UI at http://localhost:8094/swagger-ui.html (port 8094 for the local profile) and test the API endpoints
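Example commands for these checks, assuming the names and ports used earlier in this guide (medgemma:1.5-4b in Ollama, LiteLLM on port 8000, functiongemma for tool calling); adjust them to match your setup:
# Check the Ollama service and the imported models
systemctl status ollama
ollama list
# Test Ollama's OpenAI-compatible endpoint (if available)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "medgemma:1.5-4b", "messages": [{"role": "user", "content": "test"}]}'
# Test the LiteLLM proxy (add an Authorization header matching CHAT_API_KEY if your proxy requires one)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "medgemma:1.5-4b", "messages": [{"role": "user", "content": "test"}]}'
# Test FunctionGemma directly in Ollama
ollama run functiongemma "test"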
MedGemma Capabilities¶
MedGemma 1.5 4B supports:
- Medical image interpretation: Chest X-rays, CT scans, MRI, histopathology, dermatology, ophthalmology
- Medical text comprehension: Clinical reasoning, medical Q&A, document understanding
- EHR interpretation: Electronic health record understanding and extraction
- 3D medical imaging: CT and MRI volume interpretation
- Whole-slide histopathology: Multi-patch WSI interpretation
- Longitudinal imaging: Comparing current vs historical scans
- Anatomical localization: Bounding box detection in medical images
Troubleshooting¶
Model Not Found¶
If Ollama can't find the model:
# Check model location
ollama list
# Re-import if needed (for Q4_K_M model)
ollama create medgemma:1.5-4b -f Modelfile
# Re-import if needed (for Q8_0 model)
# ollama create medgemma:1.5-4b-q8 -f Modelfile-Q8_0
LiteLLM Connection Issues¶
If LiteLLM can't connect to Ollama:
- Verify the Ollama service is running: systemctl status ollama or ollama list
- Check LiteLLM logs for connection errors
- Ensure LiteLLM uses a different port (e.g., 8000) since Ollama uses 11434
- Verify Ollama is accessible: curl http://localhost:11434/api/tags
Using Ollama Directly (Without LiteLLM)¶
If your Ollama version supports OpenAI-compatible endpoints:
- Test the endpoint: curl http://localhost:11434/v1/chat/completions ...
- If it works, configure CHAT_BASE_URL=http://localhost:11434 directly
- No need for the LiteLLM proxy
Note: Not all Ollama versions offer full OpenAI compatibility; LiteLLM provides a fully OpenAI-compatible interface.
Model Performance¶
- Slow inference: Consider using GPU acceleration or a smaller quantization (Q3_K_M)
- Out of memory: Use a smaller quantization or reduce the context window (num_ctx)
Alternative: Use Pre-converted Ollama Model¶
If MedGemma becomes available as a pre-converted Ollama model:
# Pull directly from Ollama (if available)
ollama pull medgemma:4b
# Then use in LiteLLM
litellm --model ollama/medgemma:4b --port 8000
Update CHAT_MODEL to medgemma:4b in your configuration.
Model Alternatives¶
Higher Quality Quantization (Q8_0)¶
For better quality with larger file size:
- Download the Q8_0 quantized model instead of Q4_K_M
- Follow the same import process but use the Modelfile-Q8_0
- Update CHAT_MODEL=medgemma:1.5-4b-q8 in configuration
MedGemma 1 27B (Better Quality)¶
For better performance (requires more resources):
- Download MedGemma 1 27B GGUF (if available)
- Follow the same import process
- Update CHAT_MODEL=medgemma:27b in configuration
Other Medical Models¶
You can also use other medical LLMs available in GGUF format:
- Follow the same process with different model files
- Update the model name in configuration accordingly
References¶
- MedGemma Official Docs: https://developers.google.com/health-ai-developer-foundations/medgemma
- Hugging Face Model: https://huggingface.co/unsloth/medgemma-1.5-4b-it-GGUF
- Q4_K_M Model File: https://huggingface.co/unsloth/medgemma-1.5-4b-it-GGUF?show_file_info=medgemma-1.5-4b-it-Q4_K_M.gguf
- Q8_0 Model File: https://huggingface.co/unsloth/medgemma-1.5-4b-it-GGUF?show_file_info=medgemma-1.5-4b-it-Q8_0.gguf
- LiteLLM Documentation: https://github.com/BerriAI/litellm
- Ollama Documentation: https://ollama.ai/docs
Configuration File¶
The complete configuration is available in src/main/resources/application-local.yml. This file is gitignored and should be customized for your local setup.
MedExpertMatch policy: The application uses OpenAI-compatible providers only. Ollama is excluded as a direct provider. Use LiteLLM or another OpenAI-compatible proxy when running MedGemma or other models locally.