Syngenta

Improving plant detection and identification with PaliGemma 2

Agritech leader Syngenta moves beyond traditional computer vision to achieve higher accuracy with the PaliGemma 2 transformer network

Syngenta, a global agricultural technology company operating in over 90 countries, provides growers with the latest innovations in seeds and crop protection products. To be effective in the field, AI solutions must distinguish subtle variations in weed species, crop growth stages and yield potential, and disease and pest markers.

Looking to move beyond rigid object detection, the team explored the PaliGemma 2 vision-language model (VLM) as a more flexible alternative that would allow them to handle multiple field operations, including spatial reasoning, crop identification, and counting, on a single GPU deployment. Their fine-tuned model achieved 11% higher precision and 26% higher recall than a traditional YOLO v5 model trained on the same amount of data, as well as a zero-hallucination rate across 100 runs.

Beyond rigid object detection

Object detection models perform well when identifying what they were explicitly trained to see, but farming is rarely rigid: plants overlap, early-stage leaf diseases are microscopic, and in some cases rare crops are outnumbered by dominant weeds. Reasoning models can easily identify common crops, but they too struggle with heavy class imbalance in domain-specific environments.

Seeking a solution that could handle both spatial complexity and imbalanced data, the team chose PaliGemma 2 as a more flexible alternative to convolutional neural networks (CNNs), experimenting with different model sizes and fine-tuning parameters to test multiple field operations. Syngenta’s pipeline utilized a “hot-swapping” architecture:

Fine-tuning: Used QLoRA (4-bit quantization), freezing the vision tower (SigLIP) and multimodal projector while training only the Gemma decoder layers
Serving: Deployed via vLLM containers on Vertex AI for high-throughput and continuous batching
Tooling: Leveraged the Hugging Face ecosystem (PEFT, Accelerate), uv for dependency management, and Dataloop SDK for data ingestion
Resolution: Supported 896x896 resolution, preserving 16x more pixels than a standard 224x224 input to detect minute agricultural details
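With QLoRA, the 4-bit base weights stay frozen everywhere and only low-rank adapters are trained, so "training only the Gemma decoder layers" amounts to attaching adapters to decoder modules and nowhere else. The following is a minimal sketch of that selection rule, not Syngenta's code. The module names (vision_tower, multi_modal_projector, language_model) are assumptions modeled on the Hugging Face PaliGemma implementation, and the example parameter names are hypothetical.

```python
# Module names are assumptions based on the Hugging Face PaliGemma layout;
# verify them against the checkpoint you actually load.
FROZEN_MODULES = {"vision_tower", "multi_modal_projector"}
ADAPTED_MODULE = "language_model"


def gets_adapter(param_name):
    """Return True if this parameter belongs to the Gemma decoder."""
    parts = param_name.split(".")
    if FROZEN_MODULES.intersection(parts):
        return False  # SigLIP vision tower and projector stay frozen
    return ADAPTED_MODULE in parts


def split_parameters(param_names):
    """Partition parameter names into (adapted, frozen) lists."""
    adapted = [name for name in param_names if gets_adapter(name)]
    frozen = [name for name in param_names if not gets_adapter(name)]
    return adapted, frozen


# Hypothetical parameter names, for illustration only.
example_names = [
    "vision_tower.encoder.layers.0.self_attn.q_proj.weight",
    "multi_modal_projector.linear.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
    "language_model.layers.0.mlp.down_proj.weight",
]
adapted, frozen = split_parameters(example_names)
```

In a real training script, the same rule would typically be expressed through the adapter library's target-module setting rather than by filtering names by hand.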

This resulted in a modular system that allows a single deployment on an NVIDIA L4 GPU to switch between tasks in milliseconds, reducing compute costs.
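The hot-swapping idea can be sketched as a small router that keeps one shared base model and switches lightweight task adapters by name. This is an illustrative sketch only: the class, the task names, and the adapter paths are hypothetical, and a real deployment would delegate the switch to the serving stack rather than to a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AdapterRouter:
    """Tracks which task adapter is active on top of one shared base model."""

    base_model_name: str
    adapters: Dict[str, str] = field(default_factory=dict)  # task -> adapter path
    active_task: Optional[str] = None

    def register(self, task, adapter_path):
        """Make an adapter available for a task without loading a new base model."""
        self.adapters[task] = adapter_path

    def activate(self, task):
        """Switch tasks; only the adapter reference changes, not the base model."""
        if task not in self.adapters:
            raise KeyError(f"no adapter registered for task {task!r}")
        self.active_task = task
        return self.adapters[task]


# Hypothetical task names and adapter paths.
router = AdapterRouter(base_model_name="paligemma2-3b")
router.register("detect", "adapters/detect")
router.register("count", "adapters/count")
router.activate("count")
```

Because the base weights are shared, switching tasks touches only the small adapter, which is what makes millisecond-scale switching on a single GPU plausible.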

Unlocking higher accuracy with text-based bounding boxes

Unlike the regression-based approach of CNNs, the team trained PaliGemma 2 to treat object detection as a structured text task. Instead of relying on predefined anchor boxes, the model represents spatial coordinates as text tokens (e.g., <loc0000><loc1023>). This allows it to reason about object relationships and predict the precise location of overlapping or partially obscured plants, a common challenge in early-season scouting that often causes traditional detectors to fail.
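To make the text-based format concrete, here is a minimal sketch of decoding such an output into pixel boxes. It assumes the public PaliGemma convention: four-digit tokens from <loc0000> to <loc1023>, four tokens per box in the order y_min, x_min, y_max, x_max, a label after the tokens, and ";" between detections. The sample string and labels are made up for illustration.

```python
import re

# Four location tokens followed by a label; detections are separated by ";".
_DETECTION = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")
_LOC = re.compile(r"<loc(\d{4})>")


def parse_detections(text, image_width, image_height, bins=1024):
    """Convert '<loc...>x4 label' segments into pixel-space bounding boxes."""
    results = []
    for match in _DETECTION.finditer(text):
        # Assumed token order: y_min, x_min, y_max, x_max, each in [0, bins).
        y_min, x_min, y_max, x_max = (
            int(value) / bins for value in _LOC.findall(match.group(1))
        )
        results.append({
            "label": match.group(2).strip(),
            "box": (  # (x_min, y_min, x_max, y_max) in pixels
                round(x_min * image_width),
                round(y_min * image_height),
                round(x_max * image_width),
                round(y_max * image_height),
            ),
        })
    return results


sample = "<loc0256><loc0128><loc0512><loc0384> weed dicot ; <loc0000><loc0000><loc1023><loc1023> crop"
detections = parse_detections(sample, image_width=896, image_height=896)
```

Since each coordinate is a discrete token from a fixed vocabulary, the output is easy to validate with a regular expression, unlike free-form text.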

Furthermore, PaliGemma 2’s SigLIP-So400m encoder processes the micro-details of the crop canopy without compression artifacts. The model leverages sliding window attention to generate the dense sequences of coordinate tokens needed to map hundreds of overlapping plants in a single field. By training their model to output these structured text tokens rather than open-ended language, the team achieved a zero-hallucination rate, receiving identical outputs across 100 runs.
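One way such a consistency claim could be checked is sketched below: call the model repeatedly on the same input, confirm every output is identical, and confirm each output matches the expected token grammar. This is a hypothetical harness, not the team's evaluation code. The generate callable is a stand-in for a deployed model, and the four-digit <locNNNN> grammar is an assumption based on the public PaliGemma format.

```python
import re

# One detection segment: four location tokens, then a free-text label.
_SEGMENT = re.compile(r"\s*(?:<loc\d{4}>){4}\s*[^;<]+")


def is_well_formed(output):
    """True if every ';'-separated segment is four location tokens plus a label."""
    return all(_SEGMENT.fullmatch(part) for part in output.split(";"))


def outputs_are_identical(generate, prompt, runs=100):
    """Call generate(prompt) `runs` times; True if every output equals the first."""
    first = generate(prompt)
    return all(generate(prompt) == first for _ in range(runs - 1))


def fake_generate(prompt):
    # Hypothetical stand-in for a model served with deterministic decoding.
    return "<loc0256><loc0128><loc0512><loc0384> weed dicot"


stable = outputs_are_identical(fake_generate, "detect weed")
valid = is_well_formed(fake_generate("detect weed"))
```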

A close-up image showing a hand next to young plants in a seedling tray, with blue bounding boxes overlaying the plants, labeling them as "weed dicot" for object detection and identification.

Overcoming class imbalances for plant identification

To overcome plant class imbalances in the training data, the team sorted the labels in the prompt so the rarest plants appeared first. This enabled their model to prioritize identification of rare plants before considering the common ones. As a fallback, they also added parent-level classes (“monocot” for grasses and “dicot” for broadleaf plants), an important step that is difficult to apply to a CNN model at inference time.
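The label-ordering step can be sketched as follows. This is an assumption-laden illustration: the "detect a ; b ; c" prompt shape follows the public PaliGemma detection convention, the species names and counts are invented, and the source does not say how ties were broken or where exactly the parent classes were placed.

```python
from collections import Counter

PARENT_CLASSES = ["monocot", "dicot"]  # parent-level fallbacks named in the text


def build_detect_prompt(training_labels, parents=PARENT_CLASSES):
    """Order class labels rarest-first, then append parent-level fallbacks."""
    counts = Counter(training_labels)
    # Rarest first; ties are broken alphabetically so the prompt is reproducible.
    ordered = sorted(counts, key=lambda label: (counts[label], label))
    ordered += [parent for parent in parents if parent not in counts]
    return "detect " + " ; ".join(ordered)


# Hypothetical, heavily imbalanced label distribution.
labels = ["ryegrass"] * 50 + ["pigweed"] * 5 + ["velvetleaf"] * 2
prompt = build_detect_prompt(labels)
```

Because the class list lives in the prompt rather than in the network head, the ordering and the fallback classes can be changed at inference time without retraining.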

Next, the team evaluated how base model size impacts performance in crowded fields, comparing a versatile PaliGemma 2 3B model against the 10B version. The 10B model yielded higher precision scores across the board. However, when faced with denser imagery (6-20 bounding boxes), the 10B model’s weed recall dropped to 66.2%. Conversely, when trained with a higher Low-Rank Adaptation (LoRA) rank (20 versus 10), the 3B model achieved an 80.3% weed recall. This suggests that for denser imagery, targeted LoRA fine-tuning is more critical than raw base model size.
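For intuition on what raising the LoRA rank changes: LoRA adds two low-rank matrices to each adapted weight, so the trainable parameter count grows linearly with rank. The sketch below shows that arithmetic. The 2048x2048 layer size is a hypothetical placeholder, not a PaliGemma 2 dimension.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the total is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical layer dimensions, for illustration only.
rank_10 = lora_param_count(2048, 2048, rank=10)
rank_20 = lora_param_count(2048, 2048, rank=20)
ratio = rank_20 / rank_10
```

Doubling the rank from 10 to 20 doubles the adapter capacity per layer while leaving the frozen base model untouched.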

Table: Plant identification: Impact of LoRA rank on crowded imagery

Image density | Model | Weed Parent Precision | Weed Parent Recall | Crop Precision | Crop Recall
5 or fewer bounding boxes | PaliGemma 2 3B Mix (LoRA Rank: 10) | 81.6% | 76.7% | 70.9% | 75.0%
5 or fewer bounding boxes | PaliGemma 2 10B PT (LoRA Rank: 10) | 85.3% | 79.7% | 80.9% | 75.7%
6 to 20 bounding boxes | PaliGemma 2 10B PT (LoRA Rank: 10) | 87.8% | 66.2% | 87.1% | 72.0%
6 to 20 bounding boxes | PaliGemma 2 3B Mix (LoRA Rank: 20) | 82.6% | 80.3% | 80.4% | 77.1%

High-fidelity crop counting with edge-optimized models

For many agricultural use cases, such as yield estimation or herbicide volume calculations, growers only need to know how many plants are present. The team therefore simplified the task, pivoting from spatial localization with bounding boxes to identification and counting. By fine-tuning the model to output structured JSON, they achieved F1 scores of up to 86.7%.
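A structured JSON output is straightforward to validate before it reaches downstream calculations. The sketch below assumes a flat schema of class name to count, such as {"weed": 7, "crop": 12}. That schema and the class names are hypothetical, since the source does not publish the actual format.

```python
import json


def parse_counts(raw_output, expected_classes=("weed", "crop")):
    """Parse and validate a JSON count object emitted by the model."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping class names to counts")

    counts = {}
    for name in expected_classes:
        value = data.get(name, 0)  # a missing class is treated as zero plants
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"count for {name!r} must be a non-negative integer")
        counts[name] = value
    return counts


counts = parse_counts('{"weed": 7, "crop": 12}')
```

Rejecting malformed output up front keeps a bad generation from silently turning into a wrong herbicide or yield estimate.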

They also tested whether this simplified counting task could be deployed directly to field equipment. Achieving comparable F1 scores with the Gemma 3 4B IT model, they demonstrated that this approach to high-fidelity counting tasks is feasible for edge devices.

Table: Plant counting: Comparing PaliGemma 2 3B vs. Gemma 3 4B fine-tuned models

Model | Weed Parent Precision | Weed Parent Recall | Crop Precision | Crop Recall
PaliGemma 2 3B Mix (LoRA Rank: 20) | 87.5% | 86% | 86.5% | 83.4%
Gemma 3 4B IT (LoRA Rank: 16) | 80.7% | 85.6% | 78.1% | 84.2%
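F1 is the harmonic mean of precision and recall, so it can be derived from the plant-counting figures above. The short sketch below does that arithmetic. The weed-parent result for PaliGemma 2 3B comes out at 86.7%, which appears consistent with the "up to 86.7%" figure quoted earlier, though the source does not state which row that figure refers to.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the plant-counting table.
table = {
    "PaliGemma 2 3B Mix (LoRA Rank: 20)": {"weed": (87.5, 86.0), "crop": (86.5, 83.4)},
    "Gemma 3 4B IT (LoRA Rank: 16)": {"weed": (80.7, 85.6), "crop": (78.1, 84.2)},
}

f1_table = {
    model: {cls: round(f1_score(p, r), 1) for cls, (p, r) in rows.items()}
    for model, rows in table.items()
}
```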

Maximizing ROI: infrastructure consolidation vs. inference costs

A major finding of the project was the trade-off between inference costs and overall project economics. While the inference cost per image is higher (approximately $500 per 1M images for PaliGemma 2 vs. $10 for YOLO), VLMs required less training data to achieve over 85% recall and precision. In addition, by using hot-swappable adapters instead of separate models, the team can run different field tasks on a single GPU deployment, helping to offset the higher inference price tag. As a potential avenue to further reduce the burden of data collection, image models can be tested for synthetic data creation.
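The trade-off can be made concrete with a rough break-even calculation. In the sketch below, only the two per-million inference prices come from the text. The number of annotations saved and the cost per annotation are hypothetical placeholders, and the savings from consolidating several models onto one GPU are ignored.

```python
VLM_COST_PER_MILLION = 500.0   # PaliGemma 2, USD per 1M images (from the text)
YOLO_COST_PER_MILLION = 10.0   # YOLO, USD per 1M images (from the text)


def inference_cost(images, cost_per_million):
    """Inference spend in USD for a given number of images."""
    return images / 1_000_000 * cost_per_million


def break_even_images(labels_saved, cost_per_label):
    """Images processed before extra VLM inference spend equals annotation savings."""
    extra_per_image = (VLM_COST_PER_MILLION - YOLO_COST_PER_MILLION) / 1_000_000
    return labels_saved * cost_per_label / extra_per_image


# Hypothetical scenario: 10,000 fewer annotations at $0.10 each.
images_until_break_even = break_even_images(labels_saved=10_000, cost_per_label=0.10)
```

Under these made-up inputs the extra inference cost only overtakes the annotation savings after roughly two million images, which illustrates why lower data requirements can outweigh a higher per-image price.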

This is intended to be used as a more economic path for scaling computer vision at Syngenta. By reducing the effort previously required for data collection and manual annotation, the improved economics are now possible.

Rob Lind

Computer Vision and AI Fellow, Syngenta

Syngenta’s adoption of PaliGemma 2 shows that reasoning can complement detection. Moving forward, the team is expanding this architecture to more complex scenarios and broader agricultural challenges, such as pest detection and identifying early-stage disease markers. To address the challenges of high-density imagery where many weeds are clustered together, Syngenta is implementing sliding window inference techniques to ensure higher recall and more granular mapping.
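Sliding window inference generally means splitting a large image into overlapping tiles, running the model on each tile, and shifting the resulting boxes back into full-image coordinates. The sketch below shows the tiling and coordinate mapping only. It is a generic illustration rather than Syngenta's implementation: the 896-pixel window mirrors the resolution mentioned earlier, and the 128-pixel overlap is an arbitrary assumption.

```python
def tile_origins(length, window, stride):
    """Start offsets whose windows together cover the range [0, length)."""
    if window >= length:
        return [0]
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        origins.append(length - window)  # last window sits flush with the edge
    return origins


def sliding_windows(width, height, window=896, overlap=128):
    """Return (x_min, y_min, x_max, y_max) tiles covering the whole image."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    stride = window - overlap
    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in tile_origins(height, window, stride)
        for x in tile_origins(width, window, stride)
    ]


def to_global(box, tile):
    """Shift a box from tile coordinates into full-image coordinates."""
    x0, y0 = tile[0], tile[1]
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)


tiles = sliding_windows(width=2000, height=896)
```

Plants that fall in the overlap between tiles will be detected twice, so a full pipeline would also need a de-duplication step such as non-maximum suppression after mapping boxes back.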

While the team experiments with larger models to further improve accuracy for rare and complex plant species, the ultimate goal is to move these solutions from the cloud directly into the field, deploying the system on agricultural machinery, quadrupedal robots, or mobile devices for real-time scouting. By treating object detection as a text generation task, they gain the flexibility that traditional CNNs lack and give the agricultural community tools that are more accurate, modular, and faster to deploy.

Acknowledgements: Rohit Naidu, Patrick Nestler, Federico Patota