Estimation of Maize grain yield using multispectral satellite data sets (SPOT 5) and the random forest algorithm

Crop yield estimation is a very important aspect in food production as it provides information to policy and decision makers that can guide food supply not only to a nation but also influence its import and export dynamics. Remote sensing has the ability to provide the given tool for crop yield predictions before harvesting. This study utilised canopy reflectance from a multispectral sensor to develop vegetation indices that serve as input variables into an empirical pre-harvest maize (Zea mays) yield prediction model in the north eastern section in Free State province of South Africa. Some fields in this region that were grown of maize under rain-fed conditions were monitored and the grain harvested after 7-8 months with actual yields measured. The acquisition of suitable medium resolution SPOT 5 images over this area was in March and June before the grains were harvested in July of 2014. A number of well known spectral indices were developed using the visible and near infrared bands. Through the random forest algorithm predictive models, maize grain yields were estimated successfully from the March images. The accuracies of these models were of an R2 of 0.92 (RMSEP = 0.11, MBE = -0.08) for the Agnes field and for Cairo the R2 was 0.9 (RMSEP = 0.03, MBE = 0.004). These results were produced by the SAVI and NDVI respectively for both fields. It was therefore evident that the predictive model applied in this study was site specific and would be interesting to be tested for an optimal period during the plant life cycle to predict grain yields of maize in South Africa.


Background to study
Crop yield prediction is production estimates that are made a couple of months or weeks depending on the crop in question before the actual harvest. This is frequently done through computer programmes that utilize agro-meteorological data, soil data, remotely sensed and agricultural statistics to describe quantitatively the plant-environment interactions (Zere et al., 2004). In some instances, meteorological data is included to run some of the yield models. The meteorological data is usually generated from weather stations and cover a given area. Hence, crop yield can be described as involving the effect of biotic and abiotic factors cumulatively which could however vary not just across fields but among fields and seasons alike (Bullock, 2004).
The traditional methods turn to be time-consuming and cannot consider yield variations over a field or space; therefore they are prone to large errors due to incomplete ground observations, leading to poor crop yield assessment and crop area estimations or predictions (Reynolds & Yittayew, 2000;Sau et al., 2004). In the light of these limitations, remote sensing methods were introduced (Mo et al., 2005). While remote sensing methods seemed to have responded to the above challenges, they were not without problems among which availability of suitable satellite data is enlisted. These challenges have led to the continuous seeking of improvements in yield estimation through either repeated application of already existing methods in different fields with different satellite data and crop types (Ngie et al., 2014).
Over the years remotely sensed data has proven worthy through its extracted spectral information to give information that relates statistically to crop yields and mapping of the spatial variability across regions as well as fields (Sun, 2000;Li et al., 2007). The frequently researched field crops have included wheat (Singh et al., 2002;Thenkabail, 2003;Bullock, 2004;Kastens et al., 2005;Ren et al., 2008), potatoes (Al-Gaadi et al., 2016) rice (Casanova et al., 1998;Noureldin et al., 2013), soybeans (Kastens et al., 2005;Li et al., 2007, You et al., 2017 and maize (Lewis et al., 1998;Shanahan et al., 2001;Baez-Gonzalez et al., 2002;Ferencz et al., 2004;Baez-Gonzalez et al., 2005;Kastens et al., 2005;Kogan et al., 2005;Mkhabela et al., 2005;Li et al., 2007;Inman et al., 2007;Salazar et al., 2008;Panda et al., 2010;Bognár et al., 2011). Most of these studies made used of the normalised difference vegetation index (NDVI) generated from the coarse resolution sensors such as the Advanced Very High Resolution Radiometer (AVHRR) and the Moderate Resolution Imaging Spectroradiometer (MODIS) to model yields. The use of these sensors was instrumental since the research focus was mostly at regional or county levels. Among the above cited studies, Singh et al. (2002) and Thenkabail (2003) worked on wheat fields at local levels and for the maize, only Inman et al. (2007) used a handheld sensor to collect remotely sensed data at field level for yield estimation.
However, the listed studies above can only show the trend and importance to maize yield predictions which cannot be overemphasized as the crop is relevant not just for food security but other economic sectors like energy.
The use of remote sensing to estimate biological crop yields is being explored in many countries such as the United States, China and India, and likely will become the keystone of agricultural statistics in the future (Zhao et al., 2007). The fact that crop productivity vary greatly across climatic regions since it depends on agroclimatic conditions, the application of remote sensing in this field would be necessary to show the differences. The variability of these conditions warrants models to be developed based on the conditions of different areas where the crops are planted. Moreover, there is room to improve on methodologies and principles such as applying non-linear statistical algorithms.
South Africa is among the top ten maize producers in the world and a major player on the African continent, which makes it necessary to monitor productivity through quick and reliable methods such as remote sensing. The medium resolution Satellite Pour l'Observation de la Terre (SPOT 5) images were accessed through the South African Space Agency (SANSA). This study also sets out to test a non-linear statistical method in analysing canopy reflectance values for precise or fairly accurate prediction of maize grains models for crops grown under field conditions.

Description of study area
The fields cultivated with maize in the 2011/2012 farming season by the farming group with which collaboration was reached were around Sasolburg and Parys. The latter is the major town in the Metsimaholo local municipality (Metsimaholo LM) and the former is of the Ngwathe local municipality (Ngwathe LM), all in the northeastern section of the Free State province of South Africa ( Figure 1). However, the two fields for this study where access was granted by the farmer were located within the Ngwathe local municipality (Figure 1). This area is located within the "Maize Triangle" of South Africa that is seated within two other provinces being the North West and Gauteng. This region where the fields were located is fairly flat with an altitude of less than 1500 m above sea level and mostly covered by the grassland ecosystem. The area is well watered by some of the main river systems of South Africa such as the Vaal and Orange Rivers. Rainfall (500 mm per annum) over this region is during the summer months (October to April) followed by the winter season which can get frosty. 1 It is made up of rich soils and greatly covered by commercial farms in grains (maize, soybeans, sunflower and sorghum). The Cairo field was made up of 97.83 hectares and the Agnes was 109.89 hectares.

Satellite image acquisition and pre-processing
The SPOT 5 L3 data set of path/row 133/404 acquired in March and June 2012 was obtained from the fundisa disc provided to universities by the South African National Space Agency (SANSA). The multispectral digital imagery has a 10 m pixel value with the green band range of 500 -590 nm, red band range of 610 -680 nm, near infrared (NIR) band range of 780 -890 nm and the shortwave infrared (SWIR) band range of 1.58 -1.75 nm. The green and red bands make up the visible region of the electromagnetic spectrum for this sensor. For this study, the interest was with the visible and NIR bands which are vital in measuring the vigour or photosynthetic capacity of the plants.

Field data acquisition
Field visits were conducted firstly to ascertain accessibility to farms with proper harvesting systems that record the yields across the fields. Secondly, the visits to the identified fields were to ascertain the conditions of the plants as well as the plots. The fields were planted with maize and treated under the normal field conditions for maximal production as marketed by the seed company.
After maturity, the plants were left on the field to lose over 90% of its moisture to enable grain harvest and storage. The fields were harvested using a combine harvester that recorded the grain weight per hectare within 20 m X 20 m ranges at (kg/ha). The harvester has an onboard system with a GPS and records the coordinates of the plots against the dry weight of the harvested grain.
In order to avoid the effects from the field boundaries, only plots that were completely within the fields were selected for the extraction of reflectance values from the developed indices (Thenkabail, 2003). The demarcated polygons were comprised of 4 pixels each to correspond with the harvest area of 20 m X 20 m plots. The reflectance value from the polygons was an average of the 4 pixels and the value per polygon served as sample observation (n) for the yield prediction models. The sampling of plots across the field was crucial since spatial variability was evident and replicates at other locations were considered based on the actual yield provided by the farmer after harvest. The plots were randomly chosen but considering areas of high and low actual grain weight.

Spectral vegetation indices
The principal reason for using spectral vegetation indices in crop studies has been to compensate the effects of factors of disturbance between the spectral reflectance measured from the vegetation and its characteristics such as canopy biomass or vegetation type (Bouman, 1995). The indices obtained from optical sensors have been valuable in crop production estimates through leaf interception media being mainly the leaf area index (LAI) (Tucker, 1979). The main index from which this interception medium is derived is the NDVI thereby making it a building block towards crop yield estimation using remote sensing. However, this relationship between the NDVI and the LAI is not without vices as it saturates with fully covered plant canopy (Pontailler et al., 2003).
There are some disturbing factors such as soil background to measured-reflectance where other distance-based vegetation indices were included to this analysis such as the Soil adjusted vegetation index (SAVI) (Huete, 1988) to overcome. The SAVI has the ability to completely cancel or reduce the effect of soil brightness wherever pixels have a combination of soil and vegetation reflectance (Huete & Jackson, 1988). It is a calibration factor in the NDVI equation that accounts for the first order soil-vegetation optical interactions and its potential has been successfully proven (Huete, 1988).
All the spectral vegetation index images were recreated from the identified indices (Table 1)

GVI or NDVIgreen or GNDVI
Green vegetation index which determines nitrogen influences from the green colour of the leaf through green reflectance Gitelson et al. (1996)

GRDI
Green red difference index which is the visible light normalised index and can relate to plant pigment content

VI
Vegetation index is sensitive to the green leaf material or the photosynthetically active biomass in plant canopy RNIR-RRED Tucker (1979) GDI Green difference index which is a nonnormalised index. Could therefore be used when the impact of factors such as slope and aspect is not pronounced

Random forest (RF) algorithm
The RF operates on the principle of constructing through recursive partitioning to split data into homogenous regression trees independently to maximum size without pruning and averages the results of all trees. There are two important parameters in the construction of the RF algorithm namely: the number of trees (ntree), and number of variables randomly chosen at each split (mtry) (Breiman, 2001). The robustness of the RF causes it to fit against the challenge of over-fitting that is experienced in linear models (Prasad et al., 2006;Palmer et al., 2007). The RF regression algorithm operates through bootstrapping samples from randomly divided original data set into the two third (2/3) training sample and one third (1/3) testing sample. There was a variation of the values for the ntree and mtry parameters accordingly. Liaw and Weiner (2002) in their study recommended the optimum number of mtry to be defined by one third of the total number of the input variables (32 for Agnes and 36 for Cairo). Meanwhile the ntree was regularised through the model for selection from 500 up to 2500 at 500 intervals (Prasad et al., 2006). The algorithm was performed for each growth stage (March and June) for the two fields being calibrated on training sample, n = 64 for Agnes field; n = 72 for Cairo and testing sample with n = 32 for the Agnes; n = 36 for Cairo fields. The calibration was evaluated through the root mean square error of calibration (RMSEC). The same sample plots were used for both the March and June sample dates or growth stages of the maize plants.

Selection of variables (vegetation indices)
The RF algorithm also has the ability to calculate variable importance (varimp) (Breiman, 2001).
This function has been critiqued for its bias nature of selecting variables (Strobl et al., 2007) which would be able to extract relevant indices to maize yield prediction. The conditional forest (cforest) has been proposed for such variable selection analysis. The cforest function reduces the level of biased selection of variables in individual classification trees unlike the variable selection embedded in the original RF by Breiman (2001) (Strobl et al., 2008;Strobl et al., 2009). The cforest is included within the development of the regression script for the RF algorithm to run simultaneously.
The RF ensemble uses the out-of-bag (OOB) error estimates to rank variables and is derived by predicting the data that are in each tree being considered (error of prediction). Prasad et al. (2006) then describes the variable importance as an evaluation of how worse the prediction becomes when the data for a variable were randomly permuted.
For the evaluation of different variables (selected spectral vegetation indices from varimp) in the performance of the RF regression algorithms, the predicted and actual or measured maize grain yields were related in one-to-one sets. During the relationship match, the coefficient of determination (R 2 ), root mean squared error of prediction (RMSEP) and the mean bias error (MBE) were calculated for every model run with the selected variables (spectral vegetation index).

Measured maize yields
The actual or measured yields of the maize grains from the Agnes and Cairo field for 2012 harvesting season from the combined harvester were recorded. The descriptive analysis indicated the Cairo field more productive with a higher mean yield of 4077.26 kg/ha (4.08 t/ha) as opposed to the 3207.1 kg/ha (3.21 t/ha) for the Agnes field (Table 2).

Spectral index of importance selection
The RF prediction models for maize grain yield including all the developed spectral vegetation indices in this study proved successful through the varied values of the parameters. The optimum performing parameter values were identified (Table 3). However, the contribution of the various spectral vegetation indices to the success would have varied according to their relationships with the grain productive parameters of the maize plants such as chlorophyll content, water content and others. indices ranked at top three included NDVI, SAVI and GVI (Figure 2; Figure 3).  The observed results with the NDVI as variable of importance in both fields, was in confirmation of its importance in relating the chlorophyll concentration. This is through the absorption levels of the red light and reflection levels in the NIR region of the electromagnetic spectrum of healthy plants (Rouse et al., 1974). The GVI was among the top three selected indices by the cforest which confirms its importance in maize yield estimation as reported in previous studies (Shanahan et al., 2001) where it highly correlated with maize grain yields. The relevance of the SAVI in maize yield predictions was also confirmed to Panda et al. (2010) where it was considered successful at 95.04% (R 2 of 0.53) equally with the NDVI. It should be clear that as the maize plants matured towards harvesting of the grains, soil background had a significant effect on the maize spectral features. Then it accounts for identification of the SAVI as an important index to maize yield prediction at such growth stage.

Random forest regression algorithm for maize yields using selected indices
The June not just because of the distinctive summer and winter seasons but also growth stages where March was the vegetative stage and June the reproductive stage. The NDVI which is developed from the difference of the region of maximum chlorophyll absorption (red region) and the corresponding region of maximum reflectance of incident light (NIR) has proven successful in all maize yield estimation studies applying remote sensing (Ngie et al., 2014) and this study was not left out. It proved to be the best out of the indices for both growth stages though with varying accuracies as well as the two fields. The accuracies of predictions obtained for this study however, do vary from previously recorded studies (Ngie et al., 2014) which could be accounted for by the different sensors used which ranged from coarse to medium and to fine resolution satellite images. The choice of the sensor also depends on the spatial extent of the study area. In some situation also where the same sensor was used, other factors such as soil characteristics, climatic conditions, cultivar types and the growth stage at which the yields were predicted varied, thereby causing discrepancies in the accuracy. For instance the results of this study which comprised of the same cultivar planted under same field conditions and monitored with the same sensor as well as dates did not provide the same accuracies in yield predictions. The difference in accuracies with R 2 of 0.91 and 0.89 for Agnes and Cairo fields obtained from NDVI in March (Figure 4; Figure 6) was however slightly lower but looks insignificant.
The NDVI and SAVI variables performed better in the model than the GVI which has problems of underestimation and overestimation. The GVI even though resulting with a high R 2 (over 70%) and a low RMSEP (0.18) overestimated the grain yield at higher grain output (Figure 4 (c)) which could be a challenge to stakeholders requiring adequate information for financial planning.  Once again in the Cairo field, the model was more precise with the NDVI and SAVI variables than the GVI which was challenged with overestimation as the productivity increased ( Figure 6).  (Table 3). The evaluation of the accuracies was noted through the lower RMSEP and MBE, and the increased R 2 values for both fields as well as the growth stages (March and June).
There was an improvement in the performance of the yield prediction algorithms that was as a result of using the random forest selection function to identify relevant indices (variables of importance). These results illustrated a good performance of this algorithm in prediction and feature selection analysis as earlier noted by Prasad et al. 2006. The selected spectral vegetation indices related to the amount of green materials (NDVI, GVI or VI) in the maize plants (Rouse et al., 1974;Tucker, 1979). Spectral vegetation indices that are responsive to the green pigments are excellent indicators for vegetation quantity and status or vigour (Salazar et al., 2008), and confirms the selection of the GVI in the top three indices as ranked by the OOB error estimates in this study as relevant to maize grain yield prediction.
The results from the two dates of data acquisition illustrated March as a period of better grain yield prediction for maize in this area of South Africa than the June. Even though a more robust study would need to be conducted to ascertain the optimum period of yield estimates or predictions with more monitoring dates throughout the growing season, the March period which was about four months to harvest (mid cropping season for maize in this region) was in conformity to another study in the United States of America (USA) where at 3-4 months before harvest predictions resulted in an estimation error of 3% (Kogan et al., 2012). In another previous study, it was 2-3 weeks before and after tasseling (Kogan et al., 2005) that was identified as optimum growth stage for yield predictions.
Meanwhile an 8-leaf stage was optimal for predicting maize grain yields and the 3-leaf stage was optimal for biomass estimations for Islam et al. (2011). According to Panda et al. (2010), the mid cropping season of maize was ideal growth stage to predict grain yields of the crop in North Dakota.
These discrepancies therefore suggest that optimal growth stage depends on the geographical region as well as the cultivar type being investigated as some might have a longer life span.
The observed inconsistency in maize grain yield predictability in the different fields (Agnes and Cairo in this study) (Appendix A) could also have been explained by the complexity in crop yield that depends on other non-imagery factors, such as nutrient stresses, or water availability. These factors were not considered in developing the random forest regression algorithms used in this study.
High performance in crop yields could be obtained if their production parameters remain consistent throughout the season until harvest (Panda et al., 2010) which is hardly the situation and therefore contribute to discrepancies in crop yield estimate results. Hence, these algorithms are site specific and applying them to other areas might not produce same accuracies but should still perform well considering all parameters.

Conclusions
In-field maize yield estimation or prediction using multispectral satellite imagery of medium resolution over rain-fed fields proved successful through the use of vegetation spectral indices. The spectral indices derived from SPOT 5 imageries were used as input variables into the random forest algorithm for regression analysis in predicting the grain yield by weight of maize across both fields with a good accuracy of high coefficient of determination (R²) values and low RMSEP as well as MBE values. However, the prediction was more accurate earlier on in the season (vegetative growth stage in March) than later for the reproductive stage in June. This could be attributed to the fact that the green pigment in maize leaves is largely responsible for its yields through photosynthetic activities. The NDVI was amongst the most important indices relating to maize yields for this study as proven through the random forest non-linear regression algorithm and selected by the cforest ranking them through the OOB errors.
It would be of interest for future research to engage in the optimisation of the grain prediction period in maize over the "Maize Triangle" of South Africa to ascertain in a timely manner the production of this important grain. Also as a result of the climatic challenges in obtaining real-time satellite optical data during the growing season, the potential of radar data such as Sentinel 1 could be exploited in monitoring crop productivity. There could also be the possibility to integrate climatic factors essential for maize productivity such as rainfall and temperature, and linking with soil nutrient data together with the vegetation spectral indices for a mechanical predictive model in future research.
In that way, climatic-related periods of critical development that controls productivity would be established thereby providing information for decisive measures to be taken.