Machine Learning for Crop Yield Prediction
1
Dept of ISE, MSRIT and Research Scholar-Dept of CSE, AMCEC, Bangalore,
India and affiliated to Visvesvaraya Technological University, Belagavi, Karnataka,
India
2
Department of CSE, AMCEC, Bangalore, India and affiliated to Visvesvaraya
Technological University, Belagavi, Karnataka, India
E-mail: ashwitha.a.1990@[Link]; drlathaca@[Link]
∗
Corresponding Author
Abstract
In developing countries such as India, agriculture is one of the most
widely practised occupations and contributes substantially to the country’s
economy. Agriculture provides the primary source of food, income,
livelihood and employment for the majority of the rural population in India.
Many crops are destroyed every year due to a lack of technical knowledge
and unpredictable weather conditions such as temperature, rainfall and other
atmospheric parameters, which play a major role in deciding crop yield
and profit. Therefore, choosing the right crop to grow and enhancing
crop yield are essential to improving real-life farming.
One aim of this work is to collect and integrate agricultural data from
specific regions so that it can be used to identify the optimal crop and
estimate its yield. The novelty of this work lies in using simple crop, soil
and weather parameters: the crop, the area under cultivation, the nitrogen,
phosphorus and potassium content of the soil, the season, and the average
rainfall and temperature of a district in Karnataka, India. The user can
predict the most suitable crop and its estimated yield for a chosen year.
The model uses classification techniques such as random forest, k-NN,
logistic regression, decision tree, XGBoost, SVM and gradient boosting to
determine the most suitable crop, and regression algorithms such as linear
regression, k-NN, DBSCAN, random forest and ANN to estimate the yield
of the crop identified. The algorithm with the least mean error is chosen
for prediction and estimation, which gives better results than committing
to any single machine learning algorithm in advance. A web interface is
provided for ease of use by end users. The project thus assists farmers in
choosing a suitable crop for a particular region during a specific season or
period, estimating its yield, and predicting whether the recommended crop
is profitable. It thereby saves farmers time by supporting their
decision-making.
1 Introduction
Agriculture is the lifeline of the Indian economy, as it employs around
70–75 per cent of the people directly or indirectly. India’s economic growth
is closely tied to its agricultural growth. However, Indian farmers face
various obstacles arising from constraints such as changing climatic
conditions, poor soil conditions, soil erosion, land drifting and floods.
When agriculture fails, the productivity of other industries suffers
significantly, which harms the country’s gross domestic product. India’s
agricultural growth has remained stagnant, and the government requires a
technological revolution to meet the demands of the country’s expanding
population. It is therefore essential to adopt new technologies in
agriculture and enhance crop production, which will help farmers to avoid
or minimise losses and achieve a profit.
Traditionally, crop and yield predictions were based on farmers’
experience in a particular region. Farmers often lack adequate knowledge
of changing weather patterns, of the phosphorus, potassium and nitrogen
content of the soil, and of the soil pH. This can lead to soil
acidification, reduced yield and damage to the topsoil through the
application of insufficient or improper amounts of nutrients and the
absence of crop rotation. Crop yield depends mainly on environmental
changes, weather conditions, water resources and rainfall.
Crop Recommendation and Yield Estimation Using Machine Learning 863
Hence farmers may not achieve the expected or a higher yield.
Machine learning, data mining and deep learning are essential analytical
technologies that support accurate decision-making in crop yield prediction,
including decisions on which crop to grow and on how to manage the crops
on the agricultural land during the growing season.
2 Literature Survey
Various machine learning, data analytics, data mining and deep learning
techniques have previously been used to analyse agricultural data and draw
valuable conclusions.
The authors of [1] developed a web page that uses data mining to determine
crop yield from several climatic input parameters. The C4.5 algorithm
864 A. Ashwitha and C. A. Latha
was used to find the climatic input parameter with the greatest influence
on crop yield in specific districts of Madhya Pradesh. Environmental data
from previous years, such as temperature and rainfall, were collected. The
work focuses on showing how crop growth is affected by the various climatic
input parameters. In [2], a Java application was developed to determine the
crop yield rate using two methods, namely the K-nearest neighbours method
and the Naïve Bayes method. It helps in achieving the maximum rate of yield
for the crops, and also assists in choosing an optimal crop suitable for a
specific region and its environmental conditions. The accuracy of the two
methods is compared on the input dataset, and the better method is chosen
for prediction. The system proposed in [3] uses data mining techniques to
predict the harvest yield of a crop. This forecast helps farmers to know
their requirements and plan accordingly. The system uses the Random Forest
algorithm to predict crop yield. The authors of [4] analysed agricultural
data using data analytics and machine learning to assist farmers by
predicting crop productivity in order to maximise crop production. They
used three algorithms, namely (1) Naïve Bayes, (2) K-means clustering and
(3) the Apriori algorithm, for final data accumulation and to predict the
crop and its estimated yield. The model developed in [5] uses the ID3
algorithm to predict crop yield, with the aim of obtaining better quality
and improved yield for the tomato crop. The proposed model is implemented
in PHP. The parameters used in that study are area, temperature, humidity
and tomato crop production. The paper in [6] describes various machine
learning techniques for predicting rainfall and crop yield. It reports the
efficiency and performance of machine learning algorithms such as SVM,
decision tree, linear regression and KNN, and concludes that the SVM
algorithm performs best for rainfall prediction.
3 Proposed Approach
This paper proposes a method for crop recommendation and yield estimation,
both of which are very important in agriculture. The outcome is presented
through a user-friendly web interface.
The overview and workflow of the project are shown in Figure 1, which
lays out the steps involved and the order in which they have to be
performed to achieve the final results. The design of the proposed crop
recommendation and yield estimation system involves the following steps:
A. Data Collection and integration: This is the most crucial step. The
proposed model begins with the collection of the necessary agricultural
data, which consists of various parameters. It requires integrating
data from multiple sources: soil parameters, namely the nitrogen,
phosphorus and potash values that the crops need, and weather
parameters such as the season, rainfall and temperature, together with
the pH, for a particular region and its crops.
B. Data Pre-Processing: The next step after data collection and integration
is data pre-processing, that is, cleaning the data and preparing it for
modelling. The first part involves handling missing values, removing
noisy data and handling outliers. The next part is encoding the
categorical values as numerical values using a label encoder, which
makes the data easier to model. The final part is scaling or
normalising the data using the MinMaxScaler or StandardScaler.
C. Splitting the data: Once the data is pre-processed, it is divided
into two parts: the training dataset, which is used for training the
model, and the test dataset, which is used to test the model.
D. Applying Machine Learning Algorithms: In this step, we apply
various classification and regression machine learning algorithms and
evaluate each model against the performance metrics.
giving the required parameters for crop recommendation and yield esti-
mation, the user will know the best crop suitable for a particular land
and its estimated yield in an optimised manner.
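The scaling and splitting steps described in B and C can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `min_max_scale` and `train_test_split`, the 80/20 split and the fixed seed are not taken from the paper, and a real implementation would more likely rely on a library such as scikit-learn.

```python
import random

def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Rainfall values from the first five rows of the sample dataset in the Results.
rainfall = [123.21, 82.34, 59.12, 18.67, 17.34]
scaled = min_max_scale(rainfall)

# Pair raw and scaled values, then hold out 20% of the rows for testing.
train, test = train_test_split(list(zip(rainfall, scaled)), test_fraction=0.2)
```

Fitting the scaling bounds on the training rows only, and reusing them on the test rows, would avoid leaking test information into training; the sketch scales everything at once purely for brevity.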
Decision Tree
A decision tree is a decision support tool that uses a tree-like model in
which each branch represents a decision. Its results can be unreliable on
small datasets because of the number of options available at each branch,
and a deep tree overfits easily [7, 10, 12, 17, 18, 21, 23–25].
Logistic Regression
The logistic model is used in statistics to model the probability of a
particular class or event occurring, such as pass or fail, win or lose, or
positive or negative. The model can be extended to several classes of
events, such as determining whether an image contains a cow, buffalo, horse
or pony. Each candidate class is then assigned a probability between zero
and one, and these probabilities add up to one [14, 22].
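To illustrate how class probabilities that add up to one can be obtained in the multi-class case, the following sketch applies the softmax function to a set of per-class scores. The scores are hypothetical stand-ins for the linear outputs of a trained model; nothing here is taken from the authors' implementation.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cow", "buffalo", "horse", "pony"]
scores = [2.0, 1.0, 0.5, -1.0]  # hypothetical per-class scores, not real model output

probs = softmax(scores)
best = classes[probs.index(max(probs))]  # class with the highest probability
```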
XGBoost
XGBoost is a decision-tree-based algorithm that uses a gradient boosting
framework. It can handle missing values, and its built-in regularisation
helps to prevent overfitting. As a boosting method it is nevertheless more
prone to overfitting than bagging, so it requires a good-quality dataset
and carefully chosen hyperparameters to achieve improved accuracy [22].
Gradient Boosting
Gradient boosting is a technique that combines weak learners, such as
shallow decision trees, into a strong model for regression, classification
and other problems. A gradient-boosted tree ensemble can outperform a
random forest in many scenarios. It builds the model in a stage-wise
manner, as earlier boosting approaches do, and generalises them by allowing
the optimisation of any differentiable loss function [23].
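The stage-wise construction can be made concrete with a small sketch for the squared-error loss, where the negative gradient at each stage is simply the residual. The weak learner here is a one-split regression stump on a single feature; the function names, learning rate, number of stages and toy data are all illustrative assumptions, not the configuration used in this work.

```python
def fit_stump(xs, residuals):
    """Find the single threshold whose two-sided means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Stage-wise boosting for the loss 0.5 * (y - F(x))**2."""
    base = sum(ys) / len(ys)  # initial constant model
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_stages):
        # For this loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy one-dimensional data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
model = gradient_boost(xs, ys)
```

Swapping in a different differentiable loss only changes how the residual-like targets are computed at each stage, which is the generalisation the paragraph above refers to.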
Naïve Bayes
The naive assumption made in building this classifier is that the presence
of one feature or attribute is independent of the presence of any other
feature. It works particularly well with categorical input variables,
compared with numerical ones. It can estimate the class of a test dataset
accurately and operates quickly. Categories that were not observed in the
training dataset are assigned zero probability, which can significantly
distort the outcomes unless this is corrected. The underlying limitation is
the assumption that the presence of one attribute is independent of the
presence of the others, which rarely holds in practice [12].
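The zero-probability problem for unseen categories is commonly handled with additive (Laplace) smoothing. The sketch below is a minimal categorical Naïve Bayes with that correction; the function names, the smoothing constant `alpha` and the toy season/rainfall data are hypothetical and serve only to show the mechanism.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-feature category frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> category counts
    categories = defaultdict(set)          # feature index -> categories seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            categories[i].add(value)
    return class_counts, feature_counts, categories, alpha

def predict_nb(model, row):
    """Return the label with the highest smoothed log-probability."""
    class_counts, feature_counts, categories, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)
        for i, value in enumerate(row):
            count = feature_counts[(i, label)][value]
            # Count an unseen value as one extra category so it gets a small,
            # non-zero probability instead of zero.
            k = len(categories[i] | {value})
            log_p += math.log((count + alpha) / (n + alpha * k))
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical training data: (season, rainfall level) -> crop.
rows = [("Kharif", "high"), ("Kharif", "high"), ("Rabi", "low"), ("Rabi", "low")]
labels = ["rice", "rice", "gram", "gram"]
model = train_nb(rows, labels)

seen = predict_nb(model, ("Kharif", "high"))
unseen = predict_nb(model, ("Summer", "low"))  # "Summer" never appeared in training
```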
Linear Regression
Linear regression attempts to model the relationship between two variables
by fitting a linear equation to the observed data. One variable is
explanatory, while the other is considered to be dependent. For example, a
person might use a linear regression model to relate people’s weight to
their height [8, 14, 16, 19, 24, 26].
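For a single explanatory variable, the least-squares line has a closed-form solution: the slope is the covariance of the two variables divided by the variance of the explanatory one. The sketch below applies it to the height and weight example; the numbers are invented and deliberately lie on an exact line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 74, 82]

slope, intercept = fit_line(heights, weights)
predicted_weight = slope * 175 + intercept  # prediction for a height of 175 cm
```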
K-Nearest Neighbours
A point in n-dimensional space is classified according to the classes of
its k nearest data points. The method is simple to implement and extends
naturally to multi-class problems. Because it has no separate training
phase, it adapts as new data are added. The accuracy of the algorithm
depends on the quality of the data, and the amount of computation grows
rapidly as the dataset grows, which makes it a computationally expensive
algorithm [12, 15, 20, 22, 23].
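A brute-force version of the k-nearest-neighbours rule fits in a few lines: compute the distance from the query to every training point, keep the k closest, and take a majority vote. The (temperature, pH) points and crop labels below are made up for illustration, and the choice of Euclidean distance and k = 3 is an assumption.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (temperature, pH) observations and the crop grown.
points = [(25.0, 5.3), (26.0, 5.5), (25.5, 5.2), (30.0, 6.8), (31.0, 7.0), (30.5, 6.9)]
labels = ["maize", "maize", "maize", "jowar", "jowar", "jowar"]

prediction = knn_predict(points, labels, query=(25.8, 5.4), k=3)
```

Every prediction scans the whole training set, which is why the cost of the method rises with the size of the data. Features on different scales should be normalised first so that no single feature dominates the distance.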
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is used for density-based clustering of data points in n-dimensional
space. It can handle clusters of diverse shapes and sizes, and it is
robust to noise. However, the DBSCAN algorithm copes poorly with data of
varying spatial density, and it does not perform well on high-dimensional
data [13].
Random Forest
Random forest is an ensemble algorithm that constructs many decision trees
and combines their outputs. It reduces overfitting and enhances accuracy,
and it works effectively with both continuous and categorical data. The
results are difficult to interpret because of the large number of decision
trees. When dealing with climatic variability, selecting the appropriate
ranges of values on which to base decisions is critical [22, 23].
    else
        # Render page without results
        # Ask the user to fill in the form
        input(data)
        submit()
        if valid(data) == True:
            # Predict crop
            pred = predict_crop(data)
            # Redirect with output
            predict(pred)
        else
            # Show error message
        end if
    end if

yield_predict(pred)
    # Renders the yield prediction page
    if pred != None:
        # Render page with result
    else
        # Render page without results
        # Ask the user to fill in the form
        input(data)
        submit()
        if valid(data) == True:
            # Predict yield
            pred = predict_yield(data)
            # Redirect with output
            yield_predict(pred)
        else
            # Show error message
        end if
    end if

about()
    # Render the About Us page
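One possible reading of the form-handling branch in the pseudocode above is sketched here as a plain Python function, independent of any web framework. The field names in `REQUIRED_FIELDS`, the shape of the returned dictionary and the `predict_crop` callable are all hypothetical; in the actual application this logic would presumably sit inside a Flask route and render a template rather than return a dictionary.

```python
# Hypothetical form fields; the real form may use different names.
REQUIRED_FIELDS = ("district", "season", "area", "temperature", "rainfall", "ph")

def valid(data):
    """A form is valid when every required field is present and non-empty."""
    return all(str(data.get(field, "")).strip() for field in REQUIRED_FIELDS)

def handle_crop_form(data, predict_crop):
    """Decide what the crop page should show for a submitted form (or None)."""
    if data is None:
        # Nothing submitted yet: render the page without results.
        return {"page": "crop", "result": None, "error": None}
    if not valid(data):
        # Show an error message and ask the user to complete the form.
        return {"page": "crop", "result": None, "error": "Please fill in all fields."}
    # Valid input: predict and render the page with the result.
    return {"page": "crop", "result": predict_crop(data), "error": None}

# Usage with a stand-in predictor that always answers "maize".
form = {"district": "Udupi", "season": "Rabi", "area": "39",
        "temperature": "25.15", "rainfall": "123.21", "ph": "5.3"}
response = handle_crop_form(form, predict_crop=lambda d: "maize")
```

Passing the predictor in as an argument keeps the validation logic testable without loading a trained model; the yield page could reuse the same function with a different predictor.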
Flowchart
The flow diagram of the Methodology is depicted in Figure 2 below.
Figure 2 Flowchart.
Results
Data Collection and Integration
The data used in this work cover the years 1997 to 2014 for 30 districts
of Karnataka, India.
A sample of the integrated dataset is shown in Figure 3.
State = Karnataka
District        Year  Season  Area   Temp   pH   Rainfall  N     P    K    Crop
Udupi           2008  Rabi    39     25.15  5.3  123.21    45.3  8.1  6.2  maize
Chickballapura  2012  Kharif  134    26.32  6.5  82.34     56.8  9.4  4.5  Lobia
Bagalkot        2007  Kharif  1345   27.51  5.6  59.12     34.9  0    3.1  sunflower
Chickmaglur     2002  Kharif  3456   28.72  6.7  18.67     76.1  0    2.2  Tur
Bellary         2013  Kharif  7868   23.91  5.2  17.34     23.2  1.6  9.1  Tur
Dharwad         2001  whole   4565   34.22  6.3  34.15     59.3  7.8  5.3  Onion
Gulbarg         2007  Rabi    6786   27.43  5.4  23.34     20.4  9.0  8.5  sunflower
Bijapur         2012  Rabi    12     28.69  6.2  25.11     34.5  4.4  3.3  urad
Raichur         2011  Kharif  123    28.34  5.3  67.55     35.6  8.2  2.2  Mustard
Kolar           2002  Rabi    235    29.12  6.5  58.23     37.7  2.1  9.7  Mustard
Haveri          2002  Kharif  4345   30.23  5.8  23.34     53.2  0    1.   Sugarcane
Koppal          2007  Rabi    6778   26.44  6.9  37.19     41.3  0    4.3  moong
Bangalore       2007  Rabi    345    25.55  5.7  45.2      78.3  3.5  4.7  jowar
Hasan           2012  Kharif  3556   27.45  6.8  52.4      34.4  5.7  6.8  raagi
Mysore          1998  Rabi    768    28.31  5.8  98.8      23.9  8.5  7.6  rice
Belgaum         2000  Rabi    997    29.12  7.5  79.2      25.3  9.3  3.2  urad
Dharwad         2008  Kharif  675    24.23  6.9  83.2      37.1  7.2  2.1  linseed
Uttarkarnataka  2003  Kharif  5678   23.34  5.2  62.3      59.2  6.2  1.2  moong
Koppal          2005  Rabi    3536   26.12  7.1  13.5      19.4  5.1  4.3  rice
kodagu          2011  Rabi    34346  28.67  5.4  92.8      28.7  0    8.3  gram
Yadgiri         2010  Kharif  3677   29.90  7.9  21.2      31.2  0    7.9  jowar
Data Pre-Processing
The collected dataset had no null, missing or duplicate values, so it did
not require cleaning. After verifying this, the categorical values were
encoded as numerical values, as shown in Figure 4.
State = Karnataka
District  Year  Area   Production  Temp  pH    Rainfall  N     P      K      Crop
2         1997  11268  0.16        34.7  6.71  62.9      4.38  12.56  12.56  0
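The encoding of categorical columns such as District and Crop into integer codes can be sketched as follows. This hand-written encoder assigns codes in sorted order of the distinct values, which is also how scikit-learn's `LabelEncoder` behaves; whether the authors' codes were produced exactly this way is an assumption, and the helper names are illustrative.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {category: code for code, category in enumerate(classes)}

def encode(values, mapping):
    """Replace every category by its integer code."""
    return [mapping[v] for v in values]

# District names taken from the sample dataset.
districts = ["Udupi", "Bagalkot", "Bellary", "Udupi"]
mapping = fit_label_encoder(districts)
encoded = encode(districts, mapping)

# Keep the inverse mapping so predictions can be shown as names again.
inverse = {code: category for category, code in mapping.items()}
```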
Data Analysis
After analysing the graphs in Figures 6 to 8, which plot predicted yield
against temperature for MOONG (GREEN GRAM), we can identify that the
optimal average annual temperature range for maximum yield lies between
33 and 34.5 degrees Celsius. In contrast, the least favourable temperature
range is 31 to 32.5 degrees Celsius.
Figure 6 Predicted yield vs temperature over the years for MOONG (GREEN GRAM) in
HASSAN.
Figure 7 Predicted yield vs temperature over the years for MOONG (GREEN GRAM) in
BELGAUM.
Figure 8 Predicted yield vs temperature over the years for MOONG (GREEN GRAM) in
BIJAPUR.
The performance metric is accuracy for crop prediction and Root Mean
Square Error (RMSE) for yield prediction. After comparing all the models,
as shown in Figure 9, we concluded that the Random Forest algorithm gave
the highest accuracy among the algorithms used for crop prediction and the
lowest RMSE among the algorithms used for yield prediction. Random Forest
achieved an accuracy of 99.94% and an RMSE of 0.2716.
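For reference, the metrics named in this section can be computed as in the sketch below. The label and yield values are invented to keep the arithmetic easy to follow and have no connection to the reported results.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared error; penalises large errors more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical crop labels (classification) and yields (regression).
acc = accuracy(["rice", "urad", "rice", "gram"], ["rice", "urad", "gram", "gram"])
mae = mean_absolute_error([2.0, 3.0, 5.0], [2.5, 3.0, 4.0])
rmse = root_mean_square_error([2.0, 3.0, 5.0], [2.5, 3.0, 4.0])
```

RMSE is never smaller than the mean absolute error on the same predictions, so the two should not be compared directly across tables.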
Figure 9 The Mean absolute error comparison of the Machine learning models.
The Random Forest Regressor has the lowest mean absolute error of all
the machine learning algorithms, as shown in Figure 9. It is therefore the
best suited for predicting crop yield.
Table 1 The mean absolute error comparison of the machine learning models for yield
estimation
Machine Learning Models    Mean Absolute Error
Random Forest Regressor    0.270
Artificial Neural Net      0.3052
Linear Regression          0.478
KNN Regressor              0.303
DBSCAN                     1.054
SGD Regressor              0.478
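Choosing the model with the least error, as the approach prescribes, reduces to taking a minimum over the values in Table 1. A minimal sketch, with the dictionary simply transcribing the table:

```python
# Mean absolute errors as listed in Table 1.
mae_by_model = {
    "Random Forest Regressor": 0.270,
    "Artificial Neural Net": 0.3052,
    "Linear Regression": 0.478,
    "KNN Regressor": 0.303,
    "DBSCAN": 1.054,
    "SGD Regressor": 0.478,
}

# The model with the smallest error is selected for yield estimation.
best_model = min(mae_by_model, key=mae_by_model.get)

# Full ranking from best to worst, useful for reporting.
ranking = sorted(mae_by_model, key=mae_by_model.get)
```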
Figure 10 Accuracy Comparison of the classification models of machine learning for Crop
Recommendation.
Figure 11 shows close agreement between the actual and predicted data.
The crop recommendation and yield estimation features are provided on a
web page that is easy for farmers to access and use. Pickle is used to
serialise the chosen models so that they can be loaded into the Flask
application that serves the web application. The models chosen are those
that performed best in crop recommendation and yield estimation.
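The role of Pickle here is to save a trained model to disk once and load it again when the web application starts. The sketch below shows that round trip with the standard library; the dictionary standing in for a trained model, the file name `yield_model.pkl` and the `predict` helper are all hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: the parameters of a fitted line (hypothetical).
trained_model = {"kind": "linear", "slope": 0.8, "intercept": -70.0}

def predict(model, x):
    """Apply the stored linear parameters to a new input."""
    return model["slope"] * x + model["intercept"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "yield_model.pkl")

    # Training side: serialise the model once.
    with open(path, "wb") as f:
        pickle.dump(trained_model, f)

    # Web application side: load the model at start-up.
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

estimate = predict(loaded_model, 175)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.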
The developed online application is divided into four web pages: home
page, crop recommendation, yield estimation, and about us page.
Crop Recommendation – This page proposes the most appropriate crop and its
projected yield. The user is required to supply the input parameters:
region selection, season, area, temperature, precipitation, pH, nitrogen
and potash values. After the user enters suitable inputs, the
predicted crop and estimated yield are displayed as shown in Figure 12.
References
[1] Veenadhari, S., Bharat Misra, and C. D. Singh. “Machine learning
approach for forecasting crop yield based on climatic parameters.”
International Conference on Computer Communication and Informatics.
IEEE, 2014.
[2] Medar, Ramesh, Vijay S. Rajpurohit, and Shweta Shweta. “Crop yield
prediction using machine learning techniques.” IEEE 5th International
Conference for Convergence in Technology (I2CT). IEEE, 2019.
Biographies