A hands-on mini course on multimodal large language models and their clinical applications - from architecture to fine-tuning, inference, and evaluation with MedGemma 1.5.
University of Toronto · University Health Network
Multimodal LLMs combine images, text, video, and audio in a single model - and they are rapidly becoming practical tools in clinical workflows. This mini course walks through the complete pipeline for a medical image-text-to-text task: understanding how MLLMs align modalities, dissecting the MedGemma 1.5 architecture, then preparing the FLARE-MLLM-2D dataset, fine-tuning with QLoRA, running inference, and evaluating report generation with CRIMSON and GREEN scores.
Every hands-on step runs in an interactive Jupyter notebook on Colab, and everything is mirrored in the course repository on GitHub.
| 01 | BackgroundIntroduction to MLLM + clinical applications | Jump → |
| 02 | Model ArchitectureDetailed description of the MedGemma 1.5 model architecture | Jump → |
| 03 | Data PreparationPrepare the FLARE-MLLM-2D dataset for fine-tuning | Jump → |
| 04 | Fine-tuningFine-tune MedGemma 1.5 4B on the preprocessed dataset | Jump → |
| 05 | InferenceInfer with the base and fine-tuned MedGemma 1.5 4B | Jump → |
| 06 | EvaluationEvaluate MedGemma 1.5 4B’s performance on report generation using CRIMSON score and GREEN score | Jump → |
| 07 | ConclusionSummary of what we have covered in the mini course | Jump → |
| 08 | QuizTest your knowledge with the provided NotebookLM | Jump → |
Module 01
Introduction to MLLM + clinical applications.
MLLM emphasizes the alignment of latent spaces between different modalities. We are extremely familiar with the classic encoder–decoder networks like UNet - so it is very intuitive that we can merge multiple encoders and decoders in different combinations to perform different types of tasks.
Based on the input and output modalities, we can classify MLLMs into categories like the ones on the right. Hugging Face also uses these categories.
Recall the training process of a naive UNet. It is essentially training a pair of mappings: encoder (input → latent) and decoder (latent → output). The optimization process is inherently stochastic, which makes the latent space nondeterministic across trials. Therefore, two independently trained encoders and decoders do not share the same latent space. To align them, there are several approaches, shown below.
Align vision features to an existing LLM
Compress vision through learned queries
Inject vision through cross-attention
Train a unified multimodal autoregressive model
Figure credit: NVIDIA. (n.d.). Multimodal large language models. NVIDIA Glossary. Retrieved from nvidia.com/en-us/glossary/multimodal-large-language-models · Extended reading: Multimodal Large Language Models - NVIDIA
MLLMs are powerful in a clinical workflow for perception, reasoning, documentation, triage, and patient-facing support.
Below are five concrete clinical applications (CAs), each illustrating one way an MLLM slots into practice.
In a report generation task, the MLLM:
Figure credit: Luo, X., Huang, X., Liang, X. et al. Towards Automated Reporting: A Bronchoscopy Report Dataset for Enhancing Multimodality Large Language Models. Sci Data 13, 339 (2026). doi.org/10.1038/s41597-026-06692-8
In a longitudinal comparison task, the MLLM:
Figure credit: Zhang, X., Meng, Z., Lever, J., & Ho, E. S. (2025, July). Libra: Leveraging temporal images for biomedical radiology analysis. In Findings of the Association for Computational Linguistics: ACL 2025 (pp. 17275–17303). doi.org/10.48550/arXiv.2411.19378
In a multi-class classification task, the MLLM outputs one label for several classes. It is the same as naive classification, but with outputs in the form of text - we use string parsers to convert the textual class ids into integers. Some common clinical examples are BI-RADS category, tumor subtype, disease stage, and dermatology diagnosis category.

Figure credit: Wikipedia - Cat (Cat_August_2010-4.jpg)
In a multi-label classification task, the MLLM outputs multiple labels at once. Common clinical examples include chest X-ray findings: edema, consolidation, atelectasis, cardiomegaly, and pleural effusion.
Figure credit: Sharma, G. (2021, February 7). Multi-label classification. Medium; Analytics Vidhya. medium.com/analytics-vidhya/multi-label-classification
In a regression task, the MLLM outputs a continuous value. It is the same as naive regression, but with outputs in the form of text - we use string parsers to convert the strings into floats. Some common clinical examples are ejection fraction, tumor size, organ volume, lab value prediction, risk score, and survival time.

Figure credit: Wikipedia - Cat (Cat_August_2010-4.jpg)
In this mini course, we are going to use MedGemma 1.5 4B as an example. The MedGemma family consists of LLaVA-style vision-language models (VLMs) designed for image-text-to-text tasks, specialized in medical images.
Throughout the mini course, we will be learning hands-on examples with MedGemma 1.5, covering a complete pipeline for the report generation task using the FLARE-MLLM-2D dataset.
Dataset: FLARE-MLLM-2D · Model: MedGemma 1.5 4B
Module 02
Detailed description of the MedGemma 1.5 model architecture.
Figure credit: Sellergren, A., Gao, C., Mahvar, F., Kohlberger, T., Jamil, F., Traverse, M., ... & Golden, D. (2026). MedGemma 1.5 technical report. arXiv preprint arXiv:2604.05081. doi.org/10.48550/arXiv.2604.05081
MedGemma 1.5 4B is based on Gemma 3 with the same general architecture, using a 400M MedSigLIP vision encoder as the visual front end and a decoder-only Transformer LLM as the text generator. Images are normalized to 896×896 and encoded into 256 visual tokens per image.
The key improvement between MedGemma 1 and MedGemma 1.5 is the long context window, with which we are able to feed multiple uniformly sampled slices (up to 85, modeled as a time sequence) to represent a 3D volume. However, in this mini course, we only deal with 2D images.
Note that it processes each 896×896 image as four 448×448 tiles, and each tile yields 64 visual tokens.
Cited: Sellergren, A., Gao, C., Mahvar, F., Kohlberger, T., Jamil, F., Traverse, M., ... & Golden, D. (2026). MedGemma 1.5 technical report. arXiv preprint arXiv:2604.05081. doi.org/10.48550/arXiv.2604.05081
Module 03
Prepare the FLARE-MLLM-2D dataset for fine-tuning.
The FLARE-MLLM-2D dataset is a multimodal dataset for the MICCAI FLARE challenge. It can be downloaded (with permissions) through Hugging Face.
Since we use Hugging Face’s transformers as the backend of the pipeline, we want to convert the dataset format into Hugging Face’s supervised fine-tuning (SFT) records’ format.
Please find the interactive Jupyter Notebook attached on Colab. You can also find everything in the course repository on GitHub.
Module 04
Fine-tune MedGemma 1.5 4B on the preprocessed dataset.
Even with foundation models trained on massive amounts of data, it is still extremely common that the application dataset is out of distribution (OoD). To adapt to the target domain, we need to apply fine-tuning. There are multiple ways to perform fine-tuning; in this mini course, we will focus on parameter-efficient fine-tuning (PEFT) - specifically QLoRA.
Figure credit: Görner, M. (2025, March 13). Are you still using LoRA to fine-tune your LLM? Towards Data Science. towardsdatascience.com/are-you-still-using-lora-to-fine-tune-your-llm
In this mini course, we will fine-tune for only 0.25 epochs to save time.
Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.
Module 05
Infer with the base and fine-tuned MedGemma 1.5 4B.
Now let us infer on 16 samples (to save time) in the validation set, with both the base and the fine-tuned model.
Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.
Module 06
Evaluate MedGemma 1.5 4B’s performance on report generation using CRIMSON score and GREEN score.
Figure credit: Ostmeier, S., Xu, J., Chen, Z., Varma, M., Bluethgen, C., Michalson, A. E., Moseley, M., Langlotz, C., Chaudhari, A. S., & Delbrouck, J.-B. (2024, May). GREEN: Generative radiology report evaluation and error notation. Stanford University. stanford-aimi.github.io/green.html
Let us evaluate the inference outputs using CRIMSON score and GREEN score.
Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.
Module 07
Module 08
A NotebookLM has been prepared with the course materials - quiz yourself on everything covered in this mini course.
Open the NotebookLM QuizUniversity of Toronto · University Health Network