Mini Course

Multimodal LLM (MLLM)

A hands-on mini course on multimodal large language models and their clinical applications - from architecture to fine-tuning, inference, and evaluation with MedGemma 1.5.

University of Toronto · University Health Network

About this mini course

Multimodal LLMs combine images, text, video, and audio in a single model - and they are rapidly becoming practical tools in clinical workflows. This mini course walks through the complete pipeline for a medical image-text-to-text task: understanding how MLLMs align modalities, dissecting the MedGemma 1.5 architecture, then preparing the FLARE-MLLM-2D dataset, fine-tuning with QLoRA, running inference, and evaluating report generation with CRIMSON and GREEN scores.

Every hands-on step runs in an interactive Jupyter notebook on Colab, and everything is mirrored in the course repository on GitHub.

Course outline

01 BackgroundIntroduction to MLLM + clinical applications Jump →
02 Model ArchitectureDetailed description of the MedGemma 1.5 model architecture Jump →
03 Data PreparationPrepare the FLARE-MLLM-2D dataset for fine-tuning Jump →
04 Fine-tuningFine-tune MedGemma 1.5 4B on the preprocessed dataset Jump →
05 InferenceInfer with the base and fine-tuned MedGemma 1.5 4B Jump →
06 EvaluationEvaluate MedGemma 1.5 4B’s performance on report generation using CRIMSON score and GREEN score Jump →
07 ConclusionSummary of what we have covered in the mini course Jump →
08 QuizTest your knowledge with the provided NotebookLM Jump →

Module 01

Background

Introduction to MLLM + clinical applications.

Introduction to MLLM

MLLM emphasizes the alignment of latent spaces between different modalities. We are extremely familiar with the classic encoder–decoder networks like UNet - so it is very intuitive that we can merge multiple encoders and decoders in different combinations to perform different types of tasks.

Based on the input and output modalities, we can classify MLLMs into categories like the ones on the right. Hugging Face also uses these categories.

MLLM task categories

  • Audio-text-to-text
  • Image-text-to-text
  • Image-text-to-image
  • Image-text-to-video
  • Video-text-to-text
  • Any-to-any

Recall the training process of a naive UNet. It is essentially training a pair of mappings: encoder (input → latent) and decoder (latent → output). The optimization process is inherently stochastic, which makes the latent space nondeterministic across trials. Therefore, two independently trained encoders and decoders do not share the same latent space. To align them, there are several approaches, shown below.

LLaVA-style

Align vision features to an existing LLM

+Simple projector, cheap to train, reuses a full off-the-shelf LLM
-Every visual token sits in the input sequence, so long/many images inflate context length

BLIP-2-style

Compress vision through learned queries

+Fixed, small number of visual tokens regardless of image size or resolution
-Extra Q-Former to train, and compression can lose fine-grained visual detail

Flamingo-style

Inject vision through cross-attention

+Handles many interleaved images/video without growing the text sequence
-Requires splicing new cross-attention layers into the LLM, more invasive and harder to adapt with lightweight fine-tuning

Kosmos-style

Train a unified multimodal autoregressive model

+One shared representation space by design, flexible across input/output modalities
-Most data- and compute-hungry option, can't just bolt onto an existing pretrained LLM
General MLLM architecture: modality encoder, input projector, LLM backbone, output projector, modality generator
The general anatomy of an MLLM: modality encoders project non-text inputs into the LLM backbone; projectors and generators map latents back out to other modalities.

Figure credit: NVIDIA. (n.d.). Multimodal large language models. NVIDIA Glossary. Retrieved from nvidia.com/en-us/glossary/multimodal-large-language-models  ·  Extended reading: Multimodal Large Language Models - NVIDIA

Clinical applications of MLLMs

MLLMs are powerful in a clinical workflow for perception, reasoning, documentation, triage, and patient-facing support.

Below are five concrete clinical applications (CAs), each illustrating one way an MLLM slots into practice.

CA1Report generation

In a report generation task, the MLLM:

MLLM caption generation process producing a bronchoscopy examination report with human revision

Figure credit: Luo, X., Huang, X., Liang, X. et al. Towards Automated Reporting: A Bronchoscopy Report Dataset for Enhancing Multimodality Large Language Models. Sci Data 13, 339 (2026). doi.org/10.1038/s41597-026-06692-8

CA2Longitudinal comparison

In a longitudinal comparison task, the MLLM:

Comparison of MLLM radiology answers with and without a prior study

Figure credit: Zhang, X., Meng, Z., Lever, J., & Ho, E. S. (2025, July). Libra: Leveraging temporal images for biomedical radiology analysis. In Findings of the Association for Computational Linguistics: ACL 2025 (pp. 17275–17303). doi.org/10.48550/arXiv.2411.19378

CA3Multi-class classification

In a multi-class classification task, the MLLM outputs one label for several classes. It is the same as naive classification, but with outputs in the form of text - we use string parsers to convert the textual class ids into integers. Some common clinical examples are BI-RADS category, tumor subtype, disease stage, and dermatology diagnosis category.

Photo of a tabby cat
Vision Encoder
“What animal is shown in the image? Answer in a single {class id} only: 0 for cat, 1 for dog, and 2 for owl.”
Text Encoder
LLM
0

Figure credit: Wikipedia - Cat (Cat_August_2010-4.jpg)

CA4Multi-label classification

In a multi-label classification task, the MLLM outputs multiple labels at once. Common clinical examples include chest X-ray findings: edema, consolidation, atelectasis, cardiomegaly, and pleural effusion.

Multi-class versus multi-label classification examples with one-hot and multi-hot label vectors

Figure credit: Sharma, G. (2021, February 7). Multi-label classification. Medium; Analytics Vidhya. medium.com/analytics-vidhya/multi-label-classification

CA5Regression

In a regression task, the MLLM outputs a continuous value. It is the same as naive regression, but with outputs in the form of text - we use string parsers to convert the strings into floats. Some common clinical examples are ejection fraction, tumor size, organ volume, lab value prediction, risk score, and survival time.

Photo of a tabby cat
Vision Encoder
“What percentage of the image area does the cat occupy?”
Text Encoder
LLM
0.43

Figure credit: Wikipedia - Cat (Cat_August_2010-4.jpg)

Course setup

In this mini course, we are going to use MedGemma 1.5 4B as an example. The MedGemma family consists of LLaVA-style vision-language models (VLMs) designed for image-text-to-text tasks, specialized in medical images.

Throughout the mini course, we will be learning hands-on examples with MedGemma 1.5, covering a complete pipeline for the report generation task using the FLARE-MLLM-2D dataset.

The hands-on pipeline

  • Data preparation
  • Fine-tuning
  • Inference
  • Evaluation

Dataset: FLARE-MLLM-2D · Model: MedGemma 1.5 4B

Module 02

Model architecture

Detailed description of the MedGemma 1.5 model architecture.

MedGemma overview

MedGemma collection: MedGemma 1.5 4B, MedGemma 27B, and the MedSigLIP vision encoder across 2D imaging, text, and advanced imaging

Figure credit: Sellergren, A., Gao, C., Mahvar, F., Kohlberger, T., Jamil, F., Traverse, M., ... & Golden, D. (2026). MedGemma 1.5 technical report. arXiv preprint arXiv:2604.05081. doi.org/10.48550/arXiv.2604.05081

MedGemma 1.5

3D CT Scan
2D Slices
MedSigLIPEncoder
Gemma 3Decoder
Text Outputs

Inside MedSigLIPEncoder-only Vision Transformer (ViT) · SigLIP-400M tuned on medical data

448×448 tileone of four tiles cut from the 896×896 input image
Patch embedding14×14 pixel patches → 32×32 = 1,024 patch tokens + position embeddings
× 27 blocks
Bidirectional self-attention16 heads · width 1,152 · every patch attends to every patch - no causal mask
MLPGELU · hidden 4,304 · LayerNorm around each sub-layer
4×4 average pooling1,024 patch tokens → 64 visual tokens per tile
Multimodal projectornorm + linear map into Gemma 3’s 2,560-d embedding space
64 visual tokens per tile256 per 896×896 image

MedSigLIP is the 400M-parameter SigLIP image encoder further trained on medical image–text pairs with SigLIP’s sigmoid contrastive loss, so its visual features are aligned with medical language before the LLM ever sees them.

Inside Gemma 3 (4B)Decoder-only Transformer LLM · generates text autoregressively

Input sequencetext tokens (262K SentencePiece vocab) with visual tokens spliced in at each image position
Token embeddingsdmodel = 2,560 · shared with the output layer
× 34 blocks
Grouped-query attention + RoPE8 query heads share 4 KV heads · head dim 256 · RMSNorm before & after
GeGLU feed-forwardhidden dimension 10,240

5 local sliding-window attention layers (1,024-token window) for every 1 global layer → a 128K-token context at manageable KV-cache cost

Final RMSNorm + LM headprobability distribution over the 262K vocabulary
Next tokenappended to the sequence and fed back in until the report is complete

Attention is causal over text - each token sees only its past - but all visual tokens belonging to the same image attend to each other bidirectionally.

MedGemma 1.5 4B is based on Gemma 3 with the same general architecture, using a 400M MedSigLIP vision encoder as the visual front end and a decoder-only Transformer LLM as the text generator. Images are normalized to 896×896 and encoded into 256 visual tokens per image.

The key improvement between MedGemma 1 and MedGemma 1.5 is the long context window, with which we are able to feed multiple uniformly sampled slices (up to 85, modeled as a time sequence) to represent a 3D volume. However, in this mini course, we only deal with 2D images.

Note that it processes each 896×896 image as four 448×448 tiles, and each tile yields 64 visual tokens.

Cited: Sellergren, A., Gao, C., Mahvar, F., Kohlberger, T., Jamil, F., Traverse, M., ... & Golden, D. (2026). MedGemma 1.5 technical report. arXiv preprint arXiv:2604.05081. doi.org/10.48550/arXiv.2604.05081

Module 03

Data preparation

Prepare the FLARE-MLLM-2D dataset for fine-tuning.

1

Download FLARE-MLLM-2D ≈ 2 minutes

The FLARE-MLLM-2D dataset is a multimodal dataset for the MICCAI FLARE challenge. It can be downloaded (with permissions) through Hugging Face.

2

Preprocess ≈ 5 minutes

Since we use Hugging Face’s transformers as the backend of the pipeline, we want to convert the dataset format into Hugging Face’s supervised fine-tuning (SFT) records’ format.

Try it out yourself!

Please find the interactive Jupyter Notebook attached on Colab. You can also find everything in the course repository on GitHub.

Module 04

Fine-tuning

Fine-tune MedGemma 1.5 4B on the preprocessed dataset.

Fine-tuning protocols

Even with foundation models trained on massive amounts of data, it is still extremely common that the application dataset is out of distribution (OoD). To adapt to the target domain, we need to apply fine-tuning. There are multiple ways to perform fine-tuning; in this mini course, we will focus on parameter-efficient fine-tuning (PEFT) - specifically QLoRA.

LoRA: frozen weight matrix W0 plus trainable low-rank matrices A and B
LoRA keeps W₀ frozen and learns a low-rank update AB; QLoRA applies the same adapters on top of 4-bit quantized base weights.

Figure credit: Görner, M. (2025, March 13). Are you still using LoRA to fine-tune your LLM? Towards Data Science. towardsdatascience.com/are-you-still-using-lora-to-fine-tune-your-llm

3

Run fine-tuning ≈ 15 minutes, depending on GPU

In this mini course, we will fine-tune for only 0.25 epochs to save time.

Try it out yourself!

Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.

Module 05

Inference

Infer with the base and fine-tuned MedGemma 1.5 4B.

4

Run inference ≈ 5 minutes, depending on GPU

Now let us infer on 16 samples (to save time) in the validation set, with both the base and the fine-tuned model.

Try it out yourself!

Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.

Module 06

Evaluation

Evaluate MedGemma 1.5 4B’s performance on report generation using CRIMSON score and GREEN score.

Evaluation metrics

CRIMSON Score

  • Checks chest X-ray reports
  • Compares generated report with reference report
  • Finds missing, false, and attribute-level findings
  • Weights errors by clinical severity

GREEN Score

  • Checks radiology reports broadly
  • Uses an LLM to detect and explain errors
  • Produces score + qualitative error descriptions
  • Aligns with radiologist error judgment
GREEN evaluation compared with BLEU, ROUGE-L, BERTScore, and F1RadGraph on a pleural effusion example
Unlike n-gram metrics, GREEN separates clinically correct from clinically wrong candidates and explains the errors it finds.

Figure credit: Ostmeier, S., Xu, J., Chen, Z., Varma, M., Bluethgen, C., Michalson, A. E., Moseley, M., Langlotz, C., Chaudhari, A. S., & Delbrouck, J.-B. (2024, May). GREEN: Generative radiology report evaluation and error notation. Stanford University. stanford-aimi.github.io/green.html

5

Run evaluation ≈ 5 minutes, depending on GPU

Let us evaluate the inference outputs using CRIMSON score and GREEN score.

Try it out yourself!

Please find the interactive Jupyter Notebook attached on Colab. You would need a GPU runtime with ≥40 GB VRAM: A100, H100, or G4. Everything is also in the course repository on GitHub.

Module 07

Conclusion - key takeaways

  1. MLLMs combine multiple modalities such as images and text; the core challenge is modality alignment, since vision encoders and language models do not naturally share the same latent representation space.
  2. Different architectures solve alignment differently - LLaVA-style projection, BLIP-2 learned queries, Flamingo-style cross-attention, and Kosmos-style unified modeling - enabling clinical tasks from report generation and longitudinal comparison to classification and regression.
  3. The course’s main example, MedGemma 1.5 4B, is a LLaVA-style model that pairs a MedSigLIP vision encoder with a Gemma 3 decoder, turning medical images into visual tokens the LLM uses to generate text.
  4. The hands-on workflow uses the FLARE-MLLM-2D dataset, downloaded, preprocessed, and converted into Hugging Face supervised fine-tuning format.
  5. Fine-tuning uses QLoRA - training small low-rank adapters on 4-bit weights - and results are evaluated with CRIMSON and GREEN scores that check radiology reports for clinical errors, missing findings, and severity-weighted mistakes.

Module 08

Quiz

Test your learning outcomes

A NotebookLM has been prepared with the course materials - quiz yourself on everything covered in this mini course.

Open the NotebookLM Quiz

Instructor

Tianhao (Terry) Fu

University of Toronto · University Health Network