Improving plant detection and identification with PaliGemma 2
Agritech leader Syngenta moves beyond traditional computer vision to achieve higher accuracy with the PaliGemma 2 transformer network
Syngenta, a global agricultural technology company operating in over 90 countries, provides growers with the latest innovations in seeds and crop protection products. For AI solutions to be effective in this domain, they must be able to distinguish between subtle variations in weed species, crop growth stages and yield potential, and disease and pest markers.
Looking to move beyond rigid object detection, the team explored the PaliGemma 2 vision-language model (VLM) as a more flexible alternative that would allow them to handle multiple field operations, including spatial reasoning, crop identification, and counting, on a single GPU deployment. Their fine-tuned model achieved 11% higher precision and 26% higher recall than a traditional YOLOv5 model trained on the same amount of data, as well as a zero-hallucination rate across 100 runs.
Beyond rigid object detection
Object detection models perform well when identifying what they were explicitly trained to see, but farming is rarely rigid; plants overlap, early-stage leaf diseases are microscopic, and in some cases rare crops are outnumbered by dominant weeds. While reasoning models can easily identify common crops, they also struggle with heavy class imbalance in domain-specific environments.
Seeking a solution that could handle both spatial complexity and imbalanced data, the team chose PaliGemma 2 as a more flexible alternative to convolutional neural networks (CNNs), experimenting with different model sizes and fine-tuning parameters to test multiple field operations. Syngenta’s pipeline utilized a “hot-swapping” architecture:
- Fine-tuning: Used QLoRA (4-bit quantization), freezing the vision tower (SigLIP) and multimodal projector while training only the Gemma decoder layers
- Serving: Deployed via vLLM containers on Vertex AI for high-throughput and continuous batching
- Tooling: Leveraged the Hugging Face ecosystem (PEFT, Accelerate), uv for dependency management, and Dataloop SDK for data ingestion
- Resolution: Supported 896x896 resolution, preserving 16x more pixels than a standard 224x224 input to detect minute agricultural details
This resulted in a modular system that allows a single deployment on an NVIDIA L4 GPU to switch between tasks in milliseconds, reducing compute costs.
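One way to picture the hot-swapping idea is a single frozen base weight shared by every task, plus a small low-rank adapter per task that is selected at request time. The following is a toy sketch in plain Python, not Syngenta's serving code; the class names, task names, and matrix values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass

Matrix = list[list[float]]


def matvec(m: Matrix, v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass(frozen=True)
class LoRAAdapter:
    """Low-rank update: delta = scale * up @ (down @ x). Hypothetical toy type."""
    down: Matrix  # shape (rank, d_in)
    up: Matrix    # shape (d_out, rank)
    scale: float = 1.0


class HotSwapLayer:
    """One frozen base weight shared by all tasks, one small adapter per task."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight          # never modified after construction
        self.adapters: dict[str, LoRAAdapter] = {}

    def register(self, task: str, adapter: LoRAAdapter) -> None:
        self.adapters[task] = adapter

    def forward(self, x: list[float], task: str) -> list[float]:
        # Switching tasks is a dictionary lookup; the base weight is reused as-is.
        adapter = self.adapters[task]
        base_out = matvec(self.base_weight, x)
        delta = matvec(adapter.up, matvec(adapter.down, x))
        return [b + adapter.scale * d for b, d in zip(base_out, delta)]


# Toy values (assumptions for illustration only).
layer = HotSwapLayer(base_weight=[[1.0, 0.0], [0.0, 1.0]])
layer.register("detect", LoRAAdapter(down=[[1.0, 0.0]], up=[[0.5], [0.0]]))
layer.register("count", LoRAAdapter(down=[[0.0, 1.0]], up=[[0.0], [1.0]]))

detect_out = layer.forward([1.0, 2.0], task="detect")
count_out = layer.forward([1.0, 2.0], task="count")
```

Because only the adapter matrices differ between tasks, the large base weights stay resident in GPU memory while tasks change, which is the property the single-GPU deployment relies on.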
Unlocking higher accuracy with text-based bounding boxes
Unlike the regression-based approach of CNNs, the team trained PaliGemma 2 to treat object detection as a structured text task. Instead of relying on predefined anchor boxes, the model represents spatial coordinates as location tokens drawn from a fixed vocabulary (<loc0000> through <loc1023>), with four tokens describing each bounding box. This allows it to reason about object relationships and predict the precise location of overlapping or partially obscured plants, a common challenge in early-season scouting that often causes traditional detectors to fail.
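To make the text-based box format concrete, here is a minimal parsing sketch. It assumes the convention documented for PaliGemma: four-digit location tokens in the order y_min, x_min, y_max, x_max, each an integer from 0 to 1023 that is divided by 1024 and scaled by the image size, followed by a label, with detections separated by " ; ". The labels and the sample string are made up for illustration.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Four location tokens followed by a label (assumed PaliGemma-style output).
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int) -> list[Box]:
    """Convert '<locYYYY><locXXXX><locYYYY><locXXXX> label' spans to pixel boxes."""
    boxes = []
    for match in _DETECTION.finditer(text):
        # Token order is y_min, x_min, y_max, x_max on a 0..1023 grid.
        y0, x0, y1, x1 = (int(g) / 1024 for g in match.groups()[:4])
        label = match.group(5).strip()
        boxes.append(Box(label, x0 * width, y0 * height, x1 * width, y1 * height))
    return boxes


# Hypothetical model output for an 896x896 image.
sample = "<loc0256><loc0128><loc0768><loc0896> weed ; <loc0000><loc0000><loc0512><loc0512> maize"
boxes = parse_detections(sample, width=896, height=896)
```

Since every box is just a short run of tokens, overlapping plants simply become consecutive entries in the output sequence rather than competing anchor assignments.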
Furthermore, PaliGemma 2’s SigLIP-So400m encoder processes the micro-details of the crop canopy without compression artifacts. The model leverages sliding window attention to generate the dense sequences of coordinate tokens needed to map hundreds of overlapping plants in a single field. By training their model to output these structured text tokens rather than open-ended language, the team achieved a zero-hallucination rate, receiving identical outputs across 100 runs.
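A repeatability check like the one described above could be sketched as follows. The `generate` callable is a stand-in for whatever inference call is used (an assumption; the article does not show the team's harness), and the report simply counts how many distinct outputs appear across repeated runs of the same prompt.

```python
from collections import Counter
from typing import Callable


def consistency_report(
    generate: Callable[[str], str], prompt: str, runs: int = 100
) -> dict[str, float]:
    """Call `generate` repeatedly and summarize how consistent the outputs are."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = Counter(generate(prompt) for _ in range(runs))
    _, top_count = outputs.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(outputs),   # 1 means every run was identical
        "agreement": top_count / runs,      # share of runs matching the mode
    }


# Hypothetical deterministic stub standing in for a real model call.
def fake_generate(prompt: str) -> str:
    return "<loc0256><loc0128><loc0768><loc0896> weed"


report = consistency_report(fake_generate, "detect weed", runs=100)
```

A report with `distinct_outputs` equal to 1 corresponds to the "identical outputs across 100 runs" result; anything higher would flag nondeterminism worth investigating.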
Overcoming class imbalances for plant identification
To overcome plant class imbalances in the training data, the team sorted the labels in the prompt so the rarest plants appeared first. This enabled their model to prioritize identification of rare plants before moving on to the common ones. As a fallback, they also added parent-level classes (“monocot” for grasses and “dicot” for broadleaf plants), an important step that is difficult to apply to a CNN model at inference time.
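The label-ordering idea can be sketched as a small prompt builder. This assumes a PaliGemma-style "detect a ; b ; c" prompt and assumes the parent classes are appended after the specific labels; the species names and counts are hypothetical, and the article does not state the exact prompt the team used.

```python
PARENT_CLASSES = ("monocot", "dicot")  # fallback classes named in the article


def build_detection_prompt(
    label_counts: dict[str, int],
    parents: tuple[str, ...] = PARENT_CLASSES,
) -> str:
    """Order labels rarest-first, then append parent-level fallback classes."""
    # Sort by training-set frequency (ascending); break ties alphabetically
    # so the prompt is stable from run to run.
    rare_first = sorted(label_counts, key=lambda label: (label_counts[label], label))
    fallbacks = [p for p in parents if p not in label_counts]
    return "detect " + " ; ".join(rare_first + fallbacks)


# Hypothetical annotation counts per class in the training data.
counts = {"maize": 5000, "velvetleaf": 40, "pigweed": 900}
prompt = build_detection_prompt(counts)
```

Because the ordering lives in the prompt rather than in the model's output head, it can be changed at inference time without retraining, which is the flexibility the paragraph above contrasts with a CNN.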
Next, the team evaluated how base model size impacts performance in crowded fields, comparing a versatile PaliGemma 2 3B model against the 10B version. The 10B model yielded higher precision scores across the board; however, when faced with denser imagery (6-20 bounding boxes), its weed recall dropped to 66.2%. Conversely, when trained with a higher Low-Rank Adaptation (LoRA) parameter, the 3B model achieved an 80.3% weed recall. This suggests that for denser imagery, targeted LoRA fine-tuning is more critical than raw base model size.
High-fidelity crop counting with edge-optimized models
For many agricultural use cases, such as yield estimation or herbicide volume calculations, growers only need to know how many plants are present. The team simplified the task, pivoting from spatial localization with bounding boxes to simple identification and counting. By fine-tuning the model to output structured JSON, they achieved F1 scores of up to 86.7%.
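As an illustration of how JSON count outputs might be scored, the sketch below parses a per-class count object and computes a count-based F1. The scoring rule is an assumption (the article does not define how F1 was computed for counting): per class, matched counts are true positives, over-counts are false positives, and under-counts are false negatives. The JSON shape and class names are hypothetical.

```python
import json


def counting_f1(predicted_json: str, truth: dict[str, int]) -> float:
    """Score a JSON object of per-class counts against ground-truth counts."""
    predicted = json.loads(predicted_json)
    tp = fp = fn = 0
    for label in set(predicted) | set(truth):
        p = int(predicted.get(label, 0))
        t = truth.get(label, 0)
        tp += min(p, t)          # plants counted that really exist
        fp += max(p - t, 0)      # over-counted plants
        fn += max(t - p, 0)      # missed plants
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model output and ground truth for one image.
model_output = '{"maize": 10, "weed": 3}'
score = counting_f1(model_output, {"maize": 12, "weed": 2})
```

A structured output also fails loudly: if the model emits anything that is not valid JSON, `json.loads` raises an error instead of silently producing a wrong count.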
They also tested whether this simplified counting task could be deployed directly to field equipment. Achieving comparable F1 scores with the Gemma 3 4B IT model, they demonstrated that this approach to high-fidelity counting tasks is feasible for edge devices.
Maximizing ROI: infrastructure consolidation vs. inference costs
A major finding of the project was the trade-off between inference cost and overall project economics. The inference cost per image is higher, approximately $500 per 1M images for PaliGemma 2 vs. $10 for YOLO, but the VLMs required less training data to achieve over 85% recall and precision. In addition, by using hot-swappable adapters instead of separate models, the team can run different field tasks on a single GPU deployment, effectively offsetting the higher inference price tag. As a potential avenue to further reduce the burden of data collection, image models can be tested for synthetic data creation.
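The trade-off can be made tangible with a little arithmetic. The two per-million prices below come from the figures above; the savings amount passed to the break-even helper is a purely hypothetical placeholder, since the article gives no dollar value for reduced data collection and annotation. The sketch also ignores the consolidation benefit of running several tasks on one GPU.

```python
VLM_COST_PER_MILLION = 500.0   # USD per 1M images, PaliGemma 2 (from the article)
YOLO_COST_PER_MILLION = 10.0   # USD per 1M images, YOLO (from the article)


def inference_cost(images: int, cost_per_million: float) -> float:
    """Total inference cost in USD for a given number of images."""
    return images / 1_000_000 * cost_per_million


def cost_ratio() -> float:
    """How many times more expensive VLM inference is per image."""
    return VLM_COST_PER_MILLION / YOLO_COST_PER_MILLION


def break_even_images(upfront_savings_usd: float) -> float:
    """Images processed before extra inference spend equals upfront savings."""
    extra_per_image = (VLM_COST_PER_MILLION - YOLO_COST_PER_MILLION) / 1_000_000
    return upfront_savings_usd / extra_per_image


ratio = cost_ratio()
# Hypothetical: suppose reduced annotation effort saved 4,900 USD upfront.
images_to_break_even = break_even_images(4_900.0)
```

Under these assumptions the VLM is cheaper overall until the image volume passes the break-even point, so the balance depends on how much annotation effort is actually avoided and how many images are processed.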
The approach is intended as a more economical path for scaling computer vision at Syngenta. Reducing the effort previously required for data collection and manual annotation is what makes the improved economics possible.
Syngenta’s adoption of PaliGemma 2 shows that reasoning can complement detection. Moving forward, the team is expanding this architecture to even more complex scenarios and broader agricultural challenges, such as pest detection and identifying early-stage disease markers. To address the challenges of high-density imagery where many weeds are clustered together, Syngenta is implementing sliding window inference techniques to ensure higher recall and more granular mapping.
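Sliding window inference generally means splitting a large image into overlapping tiles, running the model on each tile, and merging the results. The sketch below covers only the tiling step; the tile size and overlap are assumptions rather than values reported by the team, and the function names are hypothetical.

```python
def tile_origins(length: int, tile: int, stride: int) -> list[int]:
    """Start offsets along one axis so tiles cover [0, length) completely."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        origins.append(length - tile)  # final tile is shifted back to fit the edge
    return origins


def sliding_windows(
    width: int, height: int, tile: int = 896, overlap: int = 128
) -> list[tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) crops in row-major order."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, stride)
        for x in tile_origins(width, tile, stride)
    ]


# Hypothetical 2000x896 field image split into 896-pixel tiles.
windows = sliding_windows(width=2000, height=896)
```

Each tile contains fewer plants than the full frame, which keeps the per-inference box count low. Detections from neighboring tiles still need to be shifted back into full-image coordinates and de-duplicated in the overlap regions, a step not shown here.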
While the team experiments with larger models to further improve accuracy for rare and complex plant species, the ultimate goal is to move these solutions from the cloud directly into the field, deploying the system on agricultural machinery, quadrupedal robots, or mobile devices for real-time scouting. By treating object detection as a text generation task, they gain the flexibility that traditional CNNs lack and give the agricultural community tools that are more accurate, modular, and faster to deploy.
Acknowledgements: Rohit Naidu, Patrick Nestler, Federico Patota