This research explores the potential of combining Multimodal Artificial Intelligence (AI) with Citizen Science data to overcome critical challenges in marine biodiversity monitoring, particularly for coral reefs.

Knowledge gap

Global marine conservation efforts are severely hampered by a pervasive lack of current, consistent data and chronic underfunding. Monitoring biodiversity in remote, complex ecosystems like coral reefs is traditionally expensive, requiring specialized equipment and expert taxonomic knowledge. This results in:

Massive Data Deficiencies: Less than 6% of described species have been assessed for extinction risk by the IUCN, and up to 17,000 of those are “data deficient.” This gap is most pronounced in marine ecosystems.
Slow Data Processing: Citizen Science programs like the invaluable Reef Life Survey (RLS) generate high-quality images faster than expert taxonomists can label them. This bottleneck limits the real-world impact of citizen efforts.

Main approach

We hypothesized that Multimodal Large Language Models (MLLMs), AI models that can process both image and text data, could effectively leverage vast Citizen Science datasets to automate the laborious task of benthic image classification. Our approach focused on:

Leveraging Citizen Science Data: Using tens of thousands of seabed images and their multi-label annotations from the Reef Life Survey (RLS) project.
Multimodal AI: Utilizing the Qwen-2.5-VL MLLM family, whose models can interpret images of the seafloor and the associated text labels simultaneously.
Advanced Fine-Tuning: Applying efficient fine-tuning techniques (QLoRA) to specialize a pre-trained, general-purpose MLLM for the complex, domain-specific task of classifying benthic organisms and substrate.
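
The parameter efficiency behind QLoRA's low-rank adapters can be sketched with simple arithmetic: instead of updating a full weight matrix, LoRA trains two small matrices whose product approximates the update. The dimensions below are hypothetical for illustration, not the actual Qwen-2.5-VL configuration.

```python
# Illustrative sketch of why LoRA (the adapter scheme inside QLoRA) is
# parameter-efficient. Dimensions are hypothetical, not the actual
# Qwen-2.5-VL configuration.

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank update delta_W = B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return d_out * rank + rank * d_in

def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated by full fine-tuning of one weight matrix."""
    return d_out * d_in

# A single hypothetical 4096x4096 attention projection:
full = full_trainable_params(4096, 4096)      # 16,777,216
lora = lora_trainable_params(4096, 4096, 16)  # 131,072

print(f"full fine-tuning: {full:,} params")
print(f"LoRA rank-16:     {lora:,} params ({100 * lora / full:.2f}%)")
```

At rank 16, the adapter holds under 1% of the original matrix's parameters, which is what makes fine-tuning multi-billion-parameter models tractable on modest hardware.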

Technological challenges - how we tackled them

The research was a rigorous exploration of optimizing MLLMs for this specific task. We addressed two main technological challenges:

Optimizing Untrained Model Performance (Prompt Engineering): Before fine-tuning, we tested various prompt engineering techniques (crafting better instructions for the AI) to get the best possible results from the general-purpose model.
Achieving Domain Expertise (Fine-Tuning): We employed QLoRA to efficiently adapt the MLLMs to the RLS dataset. Our experiments compared different model sizes (3B vs. 7B parameters) and the effect of adding contextual data (such as geographic coordinates) during training.
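
A prompt-engineering variant for zero-shot multi-label benthic classification can be sketched as below. The label list, wording, and coordinates are hypothetical examples, not the exact prompts or sites used in the study.

```python
# Sketch of one prompt-engineering variant for zero-shot multi-label
# benthic classification. Label names, wording, and coordinates are
# hypothetical, not the study's actual prompts.

CANDIDATE_LABELS = [
    "Ecklonia radiata",
    "bare rock",
    "sand",
    "turfing algae",
    "hard coral",
]

def build_prompt(labels, latitude=None, longitude=None):
    """Assemble an instruction prompt; optional coordinates add the kind
    of geographic context tested in the contextual-data experiments."""
    lines = [
        "You are a marine ecology expert annotating seafloor imagery.",
        "Identify every class visible in the image from this list:",
    ]
    lines += [f"- {label}" for label in labels]
    if latitude is not None and longitude is not None:
        lines.append(f"The photo was taken at ({latitude}, {longitude}).")
    lines.append("Answer with a comma-separated list of matching labels only.")
    return "\n".join(lines)

print(build_prompt(CANDIDATE_LABELS, latitude=-42.88, longitude=147.33))
```

Constraining the answer format ("comma-separated list ... only") is a common trick to make free-text MLLM output parseable into multi-label predictions.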

Main finding

The top-performing model, the 7B-parameter MLLM fine-tuned on the largest dataset, achieved an average F1-score of 0.45. This score is directly comparable to previous high-performing, task-specific Convolutional Neural Network (CNN) models that have been integrated into professional marine monitoring workflows, such as those used by NOAA.

Key Takeaways from the Final Model:

High-Frequency Labels: The model excelled at identifying common, well-represented classes like Ecklonia radiata (a type of kelp), achieving an F1-score of 0.706.
Imbalanced Data Challenge: The model struggled with rare species (low-frequency classes), which is a common limitation when dealing with real-world, highly imbalanced biodiversity datasets.
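
The gap between common and rare classes falls out of how per-class F1 is computed: with few positives, even a handful of misses collapses recall. The confusion counts below are invented for illustration; they are not the study's results.

```python
# Toy illustration of why rare classes drag down macro-averaged F1.
# Confusion counts are invented, not the study's actual numbers.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class counts (true positives, false positives, false negatives):
classes = {
    "Ecklonia radiata (common)": (600, 250, 250),  # many training examples
    "rare sponge sp. (rare)":    (2, 5, 18),       # very few training examples
}

scores = {name: f1_score(*counts) for name, counts in classes.items()}
macro_f1 = sum(scores.values()) / len(scores)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
print(f"macro-averaged F1 = {macro_f1:.3f}")
```

Because the macro average weights every class equally, a long tail of rare, poorly predicted classes pulls the overall score well below the per-class score of frequent labels, which is the pattern seen in the final model.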

Main implications

This research validates Multimodal Large Language Models as a scalable, high-performing alternative to traditional Computer Vision models (like CNNs) for ecological classification. The combined approach of Citizen Science and MLLMs offers a clear path to:

Unlock the RLS Data Archive: Significantly accelerate the labeling of the vast, high-quality RLS photo archives, turning raw data into actionable biodiversity insights far more quickly.
Scale Global Monitoring: Create a foundation for global, cost-effective monitoring programs that can leverage publicly submitted imagery for real-time assessment of coral reef health.
Advance AI in Ecology: Establish MLLMs and the transformer architecture as strong contenders in marine and terrestrial ecology, fields currently dominated by CNNs.

With further research into expanding training data, scaling to even larger models (32B/72B), and optimizing for low-frequency species, MLLMs could plausibly exceed current benthic classification models, substantially enhancing our ability to protect the world's oceans.