Real World Deep Learning: Neural Networks for Smart Crops

The advances in image classification, object detection, and semantic segmentation using deep Convolutional Neural Networks, which spawned the availability of open source tools such as Caffe and TensorFlow (to name a couple) to easily manipulate neural network graphs... made a very strong case in favor of CNNs for our classifier.

By Andres Milioto, University of Bonn.

Header image
Autonomous ground robot spraying FLOURISH using the selective spraying tool, as seen from UAV.

To produce high-quality food and feed a growing world population with the given amount of arable land, we must develop new methods of sustainable farming that increase yield while minimizing chemical inputs such as fertilizers, herbicides, and pesticides. My colleagues and I are working on robotics-centered approaches to address this grand challenge. My name is Andres Milioto, and I am a research assistant and Ph.D. student in robotics at the Photogrammetry and Robotics Lab at the University of Bonn, Germany. Together with Philipp Lottes, Nived Chebrolu, and our supervisor Prof. Dr. Cyrill Stachniss, I am developing adaptable ground and aerial robots for smart farming in the context of the EC-funded project “Flourish”, where we collaborate with several other universities and industry partners across Europe.

The Flourish consortium is committed to developing new robotic methods for sustainable farming that minimize chemical inputs such as fertilizers, herbicides, and pesticides in order to reduce the side effects on our environment. Our precision agriculture techniques address this challenge by monitoring key indicators of crop health and applying treatment only to the plants or infested areas that need it. The development of these methods is a very active area of research, and the main goal of the project is to bridge the gap between the current and desired capabilities of agricultural robots by developing an adaptable robotic solution for precision farming. While conventional weed control systems treat the whole field uniformly with the same dose of herbicide, novel perception-controlled weeding systems offer the potential to treat each plant individually, for example by selective spraying or mechanical weeding.

Automatically treating plants at an individual level requires a plant classification system that can analyze the sensor data perceived by the robots in the field in real time, detect individual plants, and reliably distinguish crops from weeds. Our team is responsible for the perception aspect of the approach.

Milioto image
From left to right: Full pipeline from raw RGB+NIR images to classification output.

Milioto image
Left: Keypoint extraction and area around keypoint where features are calculated.
Right: Classification of the Keypoints and results.

We focus on per-plant detection to estimate the number of crop plants as well as of the various weed species, as part of an autonomous robotic perception system. We are working on a vision-based classification system for identifying crops and weeds in both RGB-only imagery and RGB combined with near-infrared (NIR) imagery.

At the beginning of the project, we implemented the crop vs. weed detector using a random forest classifier over 500 statistical, shape, and geometrical features computed around different image keypoints containing vegetation, and then applied a Markov Random Field to smooth the results by taking neighbor information into consideration. This approach works extraordinarily well, reaching more than 95% precision and recall for both crops and weeds. We further improved this approach to detect crops and weeds at a very early growth stage, i.e., at a leaf size smaller than 0.5 cm^2. A big challenge with purely vision-based approaches is transferring a learned classification model to unseen field environments, where the soil conditions, the growth stages of the plants, and the weed types may have changed. We solved this generalization problem by exploiting the spatial structure of the crop plants, which are sown in rows with a certain lattice distance in between. We developed a geometry-based classification system that interacts with the vision-based system in a semi-supervised manner, exploiting the time-invariant geometry information to retrain the vision classifier online while the robot discovers new field environments. Furthermore, we introduced a method for initializing the whole classification system with a labeling effort of only 1 minute while still achieving state-of-the-art performance (see publications by Lottes et al., 2016, 2017).
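To make the row-structure idea concrete, here is a minimal sketch of how geometry alone could produce labels for retraining a vision classifier. It is not the system from the publications cited above: the function names, the assumption that the row spacing is already known, and the fixed distance tolerance are all hypothetical simplifications for illustration.

```python
def distance_to_nearest_row(offset, row_spacing, first_row_offset=0.0):
    """Distance from a plant to the closest crop-row line.

    `offset` is the plant position measured perpendicular to the rows,
    in the same unit as `row_spacing` (assumed known here).
    """
    relative = (offset - first_row_offset) % row_spacing
    return min(relative, row_spacing - relative)


def geometric_pseudo_labels(offsets, row_spacing, tolerance,
                            first_row_offset=0.0):
    """Label plants close to a row line as crops, everything else as weeds.

    The fixed `tolerance` is a hypothetical stand-in for a proper
    geometric model; the labels could then serve as training targets
    for a vision classifier in a new field.
    """
    labels = []
    for offset in offsets:
        distance = distance_to_nearest_row(offset, row_spacing,
                                           first_row_offset)
        labels.append("crop" if distance <= tolerance else "weed")
    return labels


# Rows every 0.5 m; plants within 5 cm of a row line count as crops.
print(geometric_pseudo_labels([0.02, 0.25, 0.49, 0.74], 0.5, 0.05))
```

The point of the sketch is that the row layout does not depend on soil color, lighting, or growth stage, so labels derived from it stay usable when the visual appearance of the field changes.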

The experience we gained during the last two years of research and development on the classification system led to the following conclusions. First, designing hand-crafted features is a very time-consuming task that requires a very high level of specialization. Second, implementing the feature extraction in GPU code (using CUDA for NVIDIA GPUs) to make the system run in real time requires significant effort and expertise; it took us several months. Third, even when computing all features efficiently in parallel, the execution time was not close to the frame rate of a camera, and our approach still relied heavily on the availability of near-infrared information, which comes at a high cost.

The advances in image classification, object detection, and semantic segmentation using deep Convolutional Neural Networks (CNNs) made a very strong case in favor of CNNs for our classifier. These advances spawned open source tools such as Caffe and TensorFlow (to name a couple) that make it easy to manipulate neural network graphs and to quickly prototype, train, and deploy using off-the-shelf GPUs. Another thing that made a deep learning approach possible was that, during the last two years of the project, we were able to gather a large dataset (on the order of 10^5 labeled images). A big part of this dataset has been made publicly available so that other institutions can benefit from it and compare algorithmic results (see publications by Nived Chebrolu et al.).

Our first attempt was an object detection pipeline, using an approach similar to R-CNN. The key difference in our approach is that, given our specific problem, it is possible to pre-segment the vegetation from the soil using well-known vegetation indexes such as the NDVI (when NIR information is present) and/or Excess Green (for RGB-only). Therefore, instead of using selective search for our region proposals, we were able to use the connected components in this vegetation mask, which speeds up the process significantly. This new approach achieved results similar to the random forest based one, but it also had two big advantages: the main one is that it runs at over 5 Hz on an NVIDIA Jetson TX2; the second is that it took less than 2 months to implement, including the time to learn the tools. At the time, we didn’t even have dedicated hardware to train the networks and were working off of a notebook GPU, and the lightweight network worked very well after 3 days of training (see publications by Andres Milioto et al., 2017).

From left to right: RGB, NIR images. Vegetation Segmentation. Blob extraction. Classification of blobs. Overlay of classified masks to original image
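The pre-segmentation and blob extraction steps can be sketched in a few lines. This is a simplified illustration in plain Python, not our production code: the NDVI and Excess Green formulas are the standard definitions, while the function names, the threshold value, and the use of 4-connectivity are assumptions made for the example.

```python
from collections import deque


def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red + eps)


def excess_green(r, g, b, eps=1e-6):
    """Excess Green index (2g - r - b) on chromatic coordinates."""
    return (2 * g - r - b) / (r + g + b + eps)


def vegetation_mask(index_image, threshold):
    """Binary mask: True where the vegetation index exceeds the threshold."""
    return [[value > threshold for value in row] for row in index_image]


def connected_components(mask):
    """Bounding boxes (row_min, col_min, row_max, col_max) of each
    4-connected blob in the mask, found by breadth-first search."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy 3x4 example with two plants: high NIR and low red on vegetation.
nir = [[0.8, 0.8, 0.2, 0.2], [0.8, 0.2, 0.2, 0.9], [0.2, 0.2, 0.2, 0.9]]
red = [[0.1, 0.1, 0.3, 0.3], [0.1, 0.3, 0.3, 0.1], [0.3, 0.3, 0.3, 0.1]]
index = [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
         for nir_row, red_row in zip(nir, red)]
print(connected_components(vegetation_mask(index, threshold=0.3)))
```

Each bounding box plays the role of a region proposal: the image patch inside it is what a CNN would then classify as crop or weed. For RGB-only data, `excess_green` would replace `ndvi` when building the index image.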

This motivated us to pursue this direction further, and we therefore acquired a dedicated GPU computer with 4 NVIDIA GTX 1080 Ti GPUs. To give a sense of how much deep learning is affecting the lab: roughly 3 months after receiving that first machine, we have already ordered an identical second one, because we are having trouble scheduling GPU time.

Milioto image
Full semantic segmentation pipeline.

Coming back to the approach, we have now dropped the object detection pipeline in favor of a full semantic segmentation one. This eliminates some of the limitations of the old approach, such as its inability to handle overlapping plants and its reliance on the pre-segmentation, which needed hyperparameter tuning when transferring to a new field. In addition, the new approach lets us get rid of the costly NIR information that we needed for an accurate pre-segmentation, and run in constant time per image. To help the architecture generalize well to different fields and to obtain an accurate segmentation, we add to the RGB inputs several remappings and alternative representations of them that had already proven useful when designing features for the random forest classifier, such as vegetation indexes, different color spaces, gradients, and edges. With these, we have been able to significantly increase the performance of the semantic segmentation without hyperparameter re-tuning, and since the architecture is designed with real-time performance in mind, we are able to run at close to the frame rate of a commercial camera.
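As a rough illustration of what adding such representations to the RGB input can look like, the sketch below stacks a few derived channels per pixel. The particular choice of channels (Excess Green, HSV, and a gradient magnitude) and the function names are assumptions for this example, not the exact set we feed to the network, and a real pipeline would use array operations rather than Python lists.

```python
import colorsys
import math


def excess_green(r, g, b):
    """Excess Green index on chromatic coordinates; 0.0 for black pixels."""
    total = r + g + b
    return (2 * g - r - b) / total if total > 0 else 0.0


def gradient_magnitude(gray, y, x):
    """Central-difference gradient magnitude; borders reuse the edge pixel."""
    height, width = len(gray), len(gray[0])
    gx = (gray[y][min(x + 1, width - 1)] - gray[y][max(x - 1, 0)]) / 2
    gy = (gray[min(y + 1, height - 1)][x] - gray[max(y - 1, 0)][x]) / 2
    return math.hypot(gx, gy)


def build_network_input(image):
    """Turn an H x W image of (r, g, b) tuples in [0, 1] into an
    H x W grid of 8-channel pixels: R, G, B, ExG, H, S, V, gradient."""
    gray = [[(r + g + b) / 3 for (r, g, b) in row] for row in image]
    stacked = []
    for y, row in enumerate(image):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            out_row.append([r, g, b, excess_green(r, g, b), h, s, v,
                            gradient_magnitude(gray, y, x)])
        stacked.append(out_row)
    return stacked


# A 1x3 intensity ramp: the middle pixel sees the strongest gradient.
ramp = [[(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]]
for pixel in build_network_input(ramp)[0]:
    print(pixel)
```

The derived channels carry no information that is not already in the RGB values, but they present it in forms that vary less with soil and lighting, which is one plausible reason they help when moving to a new field.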

The perception solution developed as part of our project goals has already reached maturity and delivers high performance, and we have been working hard to bridge the last gap towards a system that is applicable in the real world. Closing this last gap means getting the algorithms to run with zero re-labeling effort when they are applied to different fields and crop species. We are exploring generative models and several unsupervised learning approaches to achieve this, all of which would use the autonomous capabilities of the robots to gather new data in the new field and allow the system to intelligently retrain itself for the new task at hand. We are also continually working on getting the models to run faster while using fewer resources and less power, for example by using newly available hardware accelerators for neural networks.

It has been a very exciting path to walk, and we are very eager to see what the future holds! Check our website for updates on this exciting journey!


Bio: Andres Milioto is a PhD Candidate in Robotics and AI at The University of Bonn. His drive is to help people lead better lives through technology by creating algorithms for cool robots that can perceive and interact with the world around us.