Overview
This project focused on developing an end-to-end automated image processing pipeline to support AI model training in agriculture. The system was designed to collect and validate high-resolution images of crop diseases, nutrient deficiencies, pests, and fungi.
By automating data acquisition and applying research-aligned filtering, the pipeline enabled the client to assemble a structured, consistent crop disease dataset for use in machine learning agriculture models aimed at plant health diagnostics.
Client Background
The client operates in the agritech AI sector and specializes in developing tools for AI crop diagnostics. They required a scalable and repeatable method to build a high-quality agricultural image dataset for machine learning.
Previously, image collection was handled manually, which introduced inconsistencies and slowed down dataset development. The new system needed to support image curation at scale while adhering to standards common in academic and applied AI contexts.

Challenges & Objectives
Challenges
- Manual sourcing of agricultural images was inefficient, inconsistent, and unscalable
- Ensuring image quality and relevance without automated validation was time-intensive
- No consistent framework existed for data labeling, categorization, and audit trails
- Preparing a dataset that meets the quality standards expected in supervised learning workflows
Objectives
- Build an image scraping tool for agricultural AI training datasets
- Organize images into standardized categories (deficiencies, pests, diseases, fungi)
- Implement automated validation and deduplication using a research-aligned image filtering tool
- Provide metadata tracking and transparency through structured logging
- Enable scalable, continuous data collection via server-based deployment

Approach & Implementation
The approach was implemented in six stages:
Search Query Construction
The system began by defining detailed search queries mapped to common crop issues, such as “rice leaf potassium deficiency” and “soybean rust symptoms.” Filters for image resolution (minimum 480p), location, and recency were applied to reflect the freshness and quality standards typical of AI training datasets.
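The exact query vocabulary is not published; the sketch below illustrates how crop and symptom terms might be combined, with illustrative CROPS and ISSUES lists and a FILTERS dict standing in for the real resolution and recency settings:

```python
from itertools import product

# Illustrative vocabularies; the production lists were far larger.
CROPS = ["rice leaf", "soybean", "maize", "tomato"]
ISSUES = ["potassium deficiency", "rust symptoms", "leaf blight", "aphid infestation"]

# Assumed filter values; 480 px height matches the pipeline's 480p floor.
FILTERS = {"min_width": 640, "min_height": 480, "recency_days": 365}

def build_queries():
    """Yield a (query_string, filters) pair for every crop/issue combination."""
    for crop, issue in product(CROPS, ISSUES):
        yield f"{crop} {issue}", FILTERS

for query, filters in build_queries():
    print(query, filters)
```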
SERP Scraping Module
Selenium and search engine APIs (Google, Bing, Yahoo) were used to retrieve image URLs, page sources, and metadata. A retry mechanism handled rate limits to ensure uninterrupted extraction. This module served as the core of the image scraping tool for agricultural AI training datasets and supported robust, high-volume collection.
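The retry logic itself is not shown in the case study; the following is a minimal sketch of the kind of backoff loop described, using requests and an HTTP 429 check purely for illustration:

```python
import time

import requests

def fetch_with_retry(url: str, max_retries: int = 5, backoff: float = 2.0) -> requests.Response:
    """Fetch a URL, backing off exponentially when the server rate-limits (HTTP 429)."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 429:      # rate-limited: wait, then retry
            time.sleep(backoff ** attempt)
            continue
        resp.raise_for_status()          # surface other HTTP errors immediately
        return resp
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```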
Image Download and Storage
Image files were retrieved either as direct binary (blob) downloads or by decoding base64-embedded sources, then stored in cloud repositories such as Google Drive and AWS S3. A hierarchical folder structure categorized the images into deficiencies, pests, diseases, and fungi. This structure was built to support downstream tasks such as annotation, model validation, and class balancing, which are critical steps in ensuring model generalizability.
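A simplified sketch of the dual download path and folder layout follows; the save_image helper and local dataset/ root are hypothetical, and in production the folders were mirrored to Google Drive and S3 (for the latter, e.g., via boto3's upload_file):

```python
import base64
from pathlib import Path

import requests

ROOT = Path("dataset")  # layout: dataset/<category>/<filename>, e.g. dataset/pests/img_001.jpg

def save_image(src: str, category: str, filename: str) -> Path:
    """Store an image under its category folder, handling both URL and base64 sources."""
    dest = ROOT / category / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    if src.startswith("data:image"):      # base64-embedded image (data URI)
        payload = src.split(",", 1)[1]
        dest.write_bytes(base64.b64decode(payload))
    else:                                 # plain URL: download the binary blob
        resp = requests.get(src, timeout=30)
        resp.raise_for_status()
        dest.write_bytes(resp.content)
    return dest
```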
Quality and Relevance Filtering
An AI-based content validation layer, supported by perceptual hashing (phash), was used to detect and eliminate duplicates. Content relevance was assessed using predefined visual cues. Only images meeting clarity and context standards were retained. Filtered-out samples were logged for audit, promoting dataset transparency and adherence to data sanitization best practices. These steps helped preserve the consistency and usability of the plant health AI training data.
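The deduplication step can be illustrated with the imagehash library, which implements the phash algorithm the pipeline relied on; the MAX_DISTANCE threshold below is an assumed value, not the project's actual setting, and the relevance-scoring layer is omitted:

```python
from pathlib import Path

import imagehash
from PIL import Image

MAX_DISTANCE = 5  # Hamming-distance threshold for "near-duplicate"; tune per dataset

def deduplicate(folder: str):
    """Keep one copy per perceptual-hash cluster; return (kept, dropped) path lists."""
    seen, kept, dropped = [], [], []
    for path in sorted(Path(folder).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        if any(h - prev <= MAX_DISTANCE for prev in seen):
            dropped.append(path)   # near-duplicate: log for the audit trail
        else:
            seen.append(h)
            kept.append(path)
    return kept, dropped
```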
Metadata Logging with Google Sheets
A synchronized Google Sheets interface was used to log filenames, sources, categories, and filtering status. This created a live audit trail and helped align data science workflows with agronomic review processes. The traceability also simplified quality checks, dataset updates, and collaboration.
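A sketch of what such a logging step could look like with the gspread library; the credential file, sheet name, and column layout are assumptions for illustration:

```python
from datetime import datetime, timezone

import gspread

# Placeholder credential path and sheet name.
gc = gspread.service_account(filename="service_account.json")
log = gc.open("crop-dataset-log").sheet1

def log_image(filename: str, source_url: str, category: str, status: str) -> None:
    """Append one audit row per processed image: when, what, where from, and verdict."""
    log.append_row([
        datetime.now(timezone.utc).isoformat(),
        filename,
        source_url,
        category,
        status,  # e.g. "kept" or "filtered: duplicate"
    ])
```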
Server Deployment and End-to-End Testing
The entire system was deployed to a cloud server, enabling scheduled runs and continuous data collection. Each pipeline component, from search query execution to dataset export, was tested to ensure reliability and alignment with training data requirements in machine learning agriculture projects.
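As a bare-bones sketch of a scheduled runner, the loop below triggers the pipeline at a fixed interval; the run_pipeline.py entry point and daily cadence are hypothetical, and cron or systemd timers would serve the same purpose:

```python
import subprocess
import time

RUN_INTERVAL = 24 * 60 * 60  # seconds between runs (daily, as an example)

def main() -> None:
    """Re-run the full pipeline on a fixed schedule."""
    while True:
        subprocess.run(["python", "run_pipeline.py"], check=False)  # hypothetical entry point
        time.sleep(RUN_INTERVAL)

if __name__ == "__main__":
    main()
```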

Results & Outcomes
The project produced the following outcomes:
- The platform enabled automated, scalable collection of thousands of high-resolution crop images
- Images were consistently categorized, validated, and prepared for use in AI development
- Manual image vetting was eliminated, sharply reducing dataset preparation time
- Images were organized into usable classes for model training
- Substandard and redundant images were filtered out before ingestion
- The system remained operational with minimal manual intervention
- Metadata logging improved dataset management and accountability
Key Takeaways
The project yielded the following key takeaways:
- A structured pipeline combining automated image processing and content validation is essential for building reproducible AI datasets
- Using perceptual hashing and relevance scoring ensured dataset quality and reduced noise
- Metadata tracking supported review, debugging, and retraining workflows
- Aligning the pipeline with academic dataset practices supported long-term research and commercial deployment goals

Conclusion
Relu Consultancy delivered a scalable, research-informed solution that transformed fragmented image sourcing into a reliable and automated process. The final crop disease dataset enabled the client to accelerate AI model development with cleaner, labeled data that met training standards. By integrating scraping, relevance filtering, audit logging, and storage into a seamless workflow, the system laid a strong foundation for future work in AI crop diagnostics and machine learning agriculture.