Improving the accuracy of estimation of eutrophication state index using a remote sensing data-driven method: A case study of Chaohu Lake, China

Trophic Level Index (TLI) is often used to assess the general eutrophication state of inland lakes in water science, technology, and engineering. In this paper, a data-driven inland-lake eutrophication assessment method is proposed that uses an artificial neural network (ANN) to build relationships between remote sensing data and in-situ TLI sampling. To train the network, Moderate Resolution Imaging Spectroradiometer (MODIS, which has a revisit cycle of 4 times per day) data were combined with in-situ observations. Results demonstrate that the TLI obtained directly from remote-sensing images using the data-driven method is more accurate than the TLI calculated from the water quality factors retrieved from remote-sensing images using a multivariate regression method. Spatially continuous and quasi-real time results were retrieved by using MODIS data. This method provides an efficient way to map the TLI spatial distribution in inland lakes, and provides a scheme for increased automation in TLI estimation.


INTRODUCTION
Some countries use the Trophic State Index (TSI) to assess the trophic condition of a lake and associated changes (Carlson, 1977; Carlson and Simpson, 1996). The Trophic Level Index (TLI) is another approach, developed by the Ministry for the Environment of New Zealand (Burns and Bryers, 2000; Verburg et al., 2010). TLI is recommended by the China National Environmental Monitoring Centre for measuring the eutrophication levels of inland lakes (CNEMC, 2009). Currently, TLI is calculated either from water samples or from in-situ hyper-spectral measurements (Li et al., 2006), and its spatial distribution is usually obtained by interpolating between these observation points. However, as demonstrated by Kuster (2004), nutrients in inland lakes have various shapes and complex spatial distributions. Some of the nutrient clusters even form strips less than 30 m wide. Furthermore, the complicated non-linear relation is simplified as a linear relationship (Jiang et al., 2013). This hydrological complexity makes the retrieval of TLI by interpolation from only a few ground observation points unreliable.
To avoid laborious sampling and time-consuming chemical analysis, Yao et al. (2009) used remote-sensing techniques, which are fast, wide-ranging, low-cost, and capable of periodic, dynamic monitoring. However, this method still cannot satisfy regional monitoring needs. TLI synthesizes many nutrition targets, and the synthetic eutrophication index is computed from many nutrition targets retrieved from remote-sensing images. Some previous studies obtained TLI by building a multivariate regression between remote-sensing data and a eutrophication index derived from in-situ measurements (He et al., 2009). All these studies provide a time-saving way of acquiring the spatial distribution of TLI. However, there are two deficiencies in such studies. Firstly, the revisit cycle of the remote-sensing satellite data used in these studies is usually several days, which makes it almost impossible to collect the in-situ data at exactly the same time. Considering bad weather and other imaging conditions, the data were not suitable for daily monitoring. Secondly, the regression approach for retrieving TLI was a linear retrieval method, but it has been shown that a non-linear relationship exists (NSCEP, 1976) between satellite bands and the five water quality factors of interest. Studies (NSCEP, 1976; Yang et al., 2006) suggest that a non-linear model may be better adapted to retrieve results.
To overcome the accuracy and efficiency problems, we propose a rapid data-driven method for monitoring the TLI distribution of Chaohu Lake. A data-driven method (Darema, 2004; Darema et al., 2005) is appropriate for rapidly retrieving TLI information directly from satellite images (Li, 2007; Ei Serafy et al., 2007; Song et al., 2014b). CBERS satellite data were replaced by Moderate Resolution Imaging Spectroradiometer (MODIS) data because the MODIS revisit cycle is 4 times per day (Freeborn et al., 2011) and the spectrum covered by the images is wider. Multivariate regression (linear analysis) was replaced by artificial neural network regression (non-linear analysis) to better retrieve TLI.
This paper introduces the data acquisition and processing procedures and reviews the feasibility of using ANN and MODIS to map the TLI distribution. The results of ANN regression and the time-series TLI distribution maps are presented and discussed, concluding with prospects for future improvement.

Study area
Chaohu Lake is a typical large, shallow, subtropical lake. With a mean water depth of 2.69 m and an area of 780 km², the lake is located between 30°58´ and 32°06´N and 116°24´ and 118°00´E. Its northwestern border is less than 10 km from Hefei, the capital city of Anhui Province, and its eastern border is very near Chaohu City (Fig. 1). Because the lake is one of the drinking water catchment areas for both cities, its water quality is of the utmost importance. Some remote-sensing inversion work on water quality has been done in Chaohu (Mei et al., 2008b; Xie et al., 2010).
The lake is managed and monitored by two different administrations. Each administration takes responsibility for 6 observation points. These sample points are used to obtain observation data on surface water quality every other day from May to October and once a week for the rest of the year. The current 12 sampling points are the final choice of the Ministry of Environmental Protection and the local environmental protection department, based on rigorous analysis and verification synthesized from many different factors, such as lake area, lake basin form, condition of recharge, effluent and water intake, the location and scale of sewage disposal facilities, pollutant circulation, and migration and transformation of algae in water. The current 12 sampling points are considered satisfactory for water quality monitoring and verification in Chaohu Lake. A more homogeneous distribution or more sample sites might not lead to better verification; this depends on how representative the sample sites are of the characteristics of the water. In fact, in the early monitoring of Chaohu Lake in 2009, the sample sites were homogeneously distributed, as shown in Fig. 2.
A nutrient-level distribution for Chaohu Lake was drawn based on the sample sites, and the area and distribution of chlorophyll, which is the main indicator of nutrient levels in lake water, were computed (Fig. 3). The central part and north central area are known to be consistently eutrophic, and therefore few sample sites were chosen in this region. More sample sites will be deployed in Chaohu Lake in the future to monitor changes in nutrient distribution and improve the accuracy of water quality data verification.
To demonstrate the feasibility of our method, high temporal and spectral resolution MODIS data from April 2 to July 13, 2013, were acquired. MODIS Surface-Reflectance Products (MOD09) were downloaded from the National Aeronautics and Space Administration (NASA) (Vermote, 2013). MOD09 is the Level 2 product automatically generated from the MODIS Level 1B land bands and is intended to estimate the surface spectral reflectance. Atmospheric effects are largely removed in MOD09, though some articles report deficiencies in MOD09 (Guang et al., 2013).
In-situ samples were obtained from 2 April to 13 July 2013, at a 7-day interval from April to May and a 1-day interval from June to July. The sites were accessed by motor boat and water samples were taken from 0.5 m below the water surface using plastic samplers. A volume of 1.5 to 3 ℓ of water was taken. Water quality was also assessed using an EXO2 multifunctional measuring instrument, which was lowered 1-2 cm into the water. The samples were placed in a box filled with ice and stored in the dark for a short period before laboratory analysis. All laboratory analyses were done at the local environmental monitoring centres. A spectrophotometric method was used to determine chlorophyll a according to Wei et al. (2002). A dichromate method was used to determine COD according to National Standard GB/T11914-89 (Yin, 1989). An ammonium molybdate spectrophotometry method was used to determine TP according to National Standard GB11893-89 (Yuan and Yao, 1989). An alkaline potassium persulfate digestion UV spectrophotometric method was used to determine TN according to National Standard HJ636-2012 (DLMEMC, 2012). The measured results were returned 2 days after the water samples were taken to the laboratory. The Environmental Protection Monitoring Station of the Chaohu Management Bureau has specialized departments and personnel responsible for daily equipment calibration, and the manufacturer's engineers perform regular equipment calibration. Three replicate samples were collected for each sampling point, and the average for each sampling site and time was reported.
Training data selection was based on the following principles:
• Sampling points close to the shore were excluded from the dataset. Since the resolutions of MOD09 are 250 m (Band 1 and Band 2) and 500 m (Band 3 to Band 7), off-shore water will be mixed with coastal water, which will cause spectral distortion.
• Cloud-contaminated pixels were excluded using the MODIS cloud mask product (MOD35).
Pixel values for each sampling point were extracted. Only Bands 1 to 5 were used in this study in order to avoid redundancy; Bands 6 and 7 have rarely been reported as being suitable for monitoring water. A total of 63 samples of in-situ data concurrent with MOD09 were obtained and the dataset was randomly divided into 3 groups: the training set (45 points), the validation set (9 points) and the testing set (9 points).

Feasibility of mapping TLI spatial distribution from MODIS
Chaohu Lake is a phosphorus-controlled lake, and TLI is used by the managers of the lake as an indicator for assessing the trophic level (Zhang et al., 2013; Wang et al., 2007). TLI is a chlorophyll-based index that aims to characterize, both qualitatively and quantitatively, the level of eutrophication of lakes. It is a weighted sum based on chlorophyll a and several other substances. Currently, the TLI of Chaohu is calculated using the national standard and the following equations (Wang et al., 2002; Jin et al., 1990):

$$\sum \mathrm{TLI} = \sum_{j=1}^{m} W_j \cdot \mathrm{TLI}(j) \tag{1}$$

$$W_j = \frac{r_{ij}^2}{\sum_{j=1}^{m} r_{ij}^2} \tag{2}$$

where TLI(j) is the jth composite indicator with the corresponding weight $W_j$, and $m$ is the number of indicators (five here). The $r_{ij}$ value given in Table 1 is the correlation coefficient for the relationship between the reference chlorophyll concentration and each indicator. The TLI of all the observation points was calculated from 5 components: chemical oxygen demand (COD), total phosphorus (TP), total nitrogen (TN), chlorophyll a (Chl-a), and Secchi depth (SD). Formulas for the TLI of each component are given below:

$$\mathrm{TLI}(\text{Chl-a}) = 10\,(2.5 + 1.086 \ln(\text{Chl-a})) \tag{3}$$

$$\mathrm{TLI}(\mathrm{TP}) = 10\,(9.436 + 1.624 \ln(\mathrm{TP})) \tag{4}$$

$$\mathrm{TLI}(\mathrm{TN}) = 10\,(5.453 + 1.694 \ln(\mathrm{TN})) \tag{5}$$

$$\mathrm{TLI}(\mathrm{SD}) = 10\,(5.118 - 1.94 \ln(\mathrm{SD})) \tag{6}$$

$$\mathrm{TLI}(\mathrm{COD}) = 10\,(0.109 + 2.661 \ln(\mathrm{COD})) \tag{7}$$

Equations 3 to 7 are empirical regression equations based on a survey of eutrophication levels of more than 20 lakes in China (Jin et al., 1990). The SD term carries a negative coefficient because a greater Secchi depth indicates clearer, less eutrophic water. According to Specifications for Lake Eutrophication Survey (2nd Edition) (Jin et al., 1990), these 20 lakes are representative of a range of depths, sizes and meteorological conditions. For example, the sample includes Aydingkol Lake with a depth of less than 1 m, and Nam-Co Lake with a depth of 125 m; as well as Tianchi Lake with a surface area of less than 10 km², and Qinghai Lake with a surface area of 4 500 km². Climatic conditions represented range from those of plain to highland and mountainous regions. Most of these lakes were highly eutrophic. Therefore, the method presented is reliable in lakes with a moderate to high eutrophication status. The empirical coefficients can be used in other lakes only after determining the nutrition composition of a typical lake in the region of interest.
The units for each component are given in Table 1. The scores for TN and TP were adjusted to the range from 0 to 100 according to the international standard (Jin et al., 1990) for lake eutrophication levels and were analysed for correlations between the eutrophication index and the water quality parameters. Also taken into consideration was that TN and TP are generally higher in Chinese lakes than in those of developed countries. Trophic status is categorised according to score intervals of ∑TLI.
From the definition of ∑TLI, it has a linear relationship with the natural logarithm of the five indicator values. Both band ratio combination methods (Gons et al., 2002; Song et al., 2013), namely the bio-optic method (Ma et al., 2006; D'Alimonte et al., 2012), and non-linear algorithms (Schwarz et al., 2002; Wu et al., 2009; Keiner, 1999; Zhang et al., 2002) were applied in retrieving the five indicators from satellite images. Therefore, it was theoretically possible to map the spatial distribution of TLI from satellite images and in-situ observations, because there was a close relationship between the combination of satellite image bands and TLI values. However, there were two reasons why TLI retrieval was hindered when using the white box methods (bio-optic and multivariate regression). Firstly, current standard atmospheric correction procedures, especially developed for inland waters, were unable to remove the atmospheric effects on the data. Atmospheric effects cause anomalies in bands over lakes, and this leads to systematic errors in retrieving indicators. Secondly, errors caused by interactions will increase when the turbidity of the water rises.
For example, the Chl-a estimation algorithm, based on the response at 469 nm, is more likely to be affected by concentrations of suspended sediments and coloured dissolved organic matter (CDOM), which can be more prevalent in shallow waters like Chaohu Lake (Darecki and Stramski, 2004; Shutler, 2007). Therefore, linear regression and the white box method are not suitable for retrieving the five components of TLI because systematic errors will be introduced into the final result. Machine learning is an effective way to solve non-linear problems in general. As a practical machine learning algorithm, artificial neural networks are well suited to fitting non-linear relationships in attribute-value pair problems (Wu et al., 2005; Mitchell, 1997).
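The composite index defined by Eqs 1 to 7 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. Several things are assumptions: the correlation coefficients `R_IJ` are the values commonly quoted for Chinese lakes, since Table 1 is not reproduced in this text. The SD term uses a negative coefficient and the COD term uses the coefficients 0.109 and 2.661, as in the commonly cited form of the standard. The function names and the input units are hypothetical.

```python
import math

# ASSUMPTION: correlation coefficients r_ij between each indicator and the
# reference Chl-a concentration; Table 1 is not available here, so these are
# the values commonly quoted for Chinese lakes.
R_IJ = {"chla": 1.0, "tp": 0.84, "tn": 0.82, "sd": -0.83, "cod": 0.83}


def weights(r=R_IJ):
    """Eq. 2: W_j = r_ij^2 / sum over j of r_ij^2."""
    total = sum(value * value for value in r.values())
    return {name: value * value / total for name, value in r.items()}


def component_tli(chla, tp, tn, sd, cod):
    """Eqs 3 to 7. Assumed units: Chl-a in mg/m^3, SD in m, TP/TN/COD in mg/L."""
    return {
        "chla": 10 * (2.5 + 1.086 * math.log(chla)),
        "tp": 10 * (9.436 + 1.624 * math.log(tp)),
        "tn": 10 * (5.453 + 1.694 * math.log(tn)),
        # Negative SD coefficient: clearer water gives a lower score (assumed form).
        "sd": 10 * (5.118 - 1.94 * math.log(sd)),
        # COD coefficients are assumed from the commonly cited standard.
        "cod": 10 * (0.109 + 2.661 * math.log(cod)),
    }


def composite_tli(chla, tp, tn, sd, cod):
    """Eq. 1: weighted sum of the five component scores."""
    w = weights()
    scores = component_tli(chla, tp, tn, sd, cod)
    return sum(w[name] * scores[name] for name in w)
```

Calling `composite_tli(chla=20.0, tp=0.1, tn=2.0, sd=0.5, cod=5.0)` returns a single score for one sampling point. Because every component is linear in the logarithm of its indicator, so is the composite, which is the property the feasibility argument above relies on.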

Neural networks
Neural networks are a kind of machine learning algorithm that imitates brain processes and were originally developed to solve non-linear problems such as fitting, pattern recognition, clustering, and time-series prediction (Mitchell, 1997). They have been successfully applied in environmental sciences (Sattari et al., 2012; Krasnopolsky and Chevallier, 2003; Krasnopolsky and Schiller, 2003; Song et al., 2014a). Figure 4 (a) shows that traditional TLI retrieval requires all water quality factors to be calculated from remote-sensing data by means of neural networks. In this study, a neural network was used as a transfer function to link MODIS bands and in-situ TLI values. The neural network used in this study comprised 3 layers (input layer, hidden layer, and output layer) and was a feed-forward type trained with back-propagation of errors. The flowchart of the satellite bands to the TLI neural network is presented in Fig. 4 (b). Traditional TLI retrieval requires an intermediate process to retrieve all water quality factors, but the remote sensing data-driven method can omit this intermediate process.
The red dashed box in Fig. 4 (a) shows the difference between traditional TLI retrieval and the remote-sensing data-driven retrieval method.
In this study, the input layer used the first five bands of the surface reflectance product from MODIS (MOD09). For each observation point, the corresponding reflectance values were extracted from MOD09 as well as the ancillary data. Band 3, Band 4 and Band 5 (500-m resolution) were interpolated to the same pixel size as Band 1 and Band 2 (250-m resolution) by using the nearest neighbour algorithm. The output of the network was the TLI calculated by in-situ measurement of the corresponding pixel. In the hidden layer:

$$\mathrm{Net}_j = f\left(\sum_{i} v_{ij}\, x_i\right)$$

where $\mathrm{Net}_j$ represents the jth node in the hidden layer, $x_i$ is the ith input (band reflectance), and $v_{ij}$ represents the weight for unit j corresponding to the ith input. $f$ represents the activation function of a node. In this study we used a sigmoid function as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$

In the output layer, ∑TLI is calculated as follows:

$$\sum \mathrm{TLI} = f_0\left(\sum_{j} \omega_j\, \mathrm{Net}_j\right)$$

where $\omega_j$ is the corresponding weight for each hidden node and $f_0$ is the activation function. In this study, we used the commonly used linear function. Both $\omega_j$ and $v_{ij}$ were initially assigned random values, and then modified by the delta rule according to the learning samples.
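The three-layer mapping described above (five band inputs, a sigmoid hidden layer, a linear output) can be illustrated with a short forward-pass sketch. It is a sketch under stated assumptions rather than the network used in the study. The text does not mention bias terms, so none are included. The initial weight range and the function names `init_weights` and `forward` are hypothetical, and training (weight updates) is deliberately left out.

```python
import math
import random


def sigmoid(x):
    """Hidden-layer activation f."""
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in=5, n_hidden=4, seed=0):
    """Random initial weights; the range [-0.5, 0.5] is an assumption."""
    rng = random.Random(seed)
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    omega = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return v, omega


def forward(bands, v, omega):
    """Map band reflectances to a TLI estimate.

    bands: reflectances of MODIS Bands 1 to 5 for one pixel
    v[i][j]: weight from input i to hidden node j
    omega[j]: weight from hidden node j to the output
    """
    n_hidden = len(omega)
    hidden = [
        sigmoid(sum(bands[i] * v[i][j] for i in range(len(bands))))
        for j in range(n_hidden)
    ]
    # Linear output activation f_0: the weighted sum is returned unchanged.
    return sum(omega[j] * hidden[j] for j in range(n_hidden))
```

With untrained weights the output is meaningless as a TLI value; the point is only to show how one pixel's five reflectances pass through four hidden nodes to a single number.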
Out of a total of 63 points, 45 were training data, 9 were validation data, and 9 were testing data. The Levenberg-Marquardt approximation method was employed to minimize errors in each trial. In order to avoid over-training, hidden layers with between 2 and 5 nodes were tested, and results showed that 4 nodes was the optimal number. We also implemented a multivariate regression (MR) method as a comparison experiment. Correlations of the five bands, MR results, and neural network (Nodes 2 to 5) to TLI are given in Table 2. Two commonly used indexes of accuracy were utilized for testing the performance of the method, i.e., the coefficient of determination (R²) and the mean square error (MSE):

$$R^2 = 1 - \frac{\sum_{k=1}^{n} (y_k - \hat{y}_k)^2}{\sum_{k=1}^{n} (y_k - \bar{y})^2}$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{k=1}^{n} (y_k - \hat{y}_k)^2$$

where $y_k$ is the in-situ TLI of sample k, $\hat{y}_k$ is the retrieved TLI, $\bar{y}$ is the mean of the in-situ values, and $n$ is the number of samples.
The number of nodes in the hidden layer depends on the complexity of the relationship between input and output. Using enough nodes ensures that the neural network is sufficiently non-linear to fit the data. However, excessive nodes will lead to over-fitting, in which the network learns not only the real model but also the noise. In order to find the optimal node number, R² and MSE were calculated for each candidate. R² indicates how well data fit a statistical model, and normally ranges from 0 to 1; the closer R² is to 1, the better the model fits the data. Negative values of R² may nevertheless occur when fitting non-linear functions to data. We trained each neural network with node numbers from 2 to 5 and found that the network with 4 nodes had the minimum MSE and highest correlation. In addition, we used the thresholds of each level of TLI to classify the results and compared them with the in-situ values. The comparison of the neural network and multivariate regression is presented in Table 3.
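The data split and the two accuracy measures used to choose the number of hidden nodes can be sketched as follows. This is an illustrative outline, not the study's implementation. R² is written in its 1 − SS_res/SS_tot form, which is an assumption, although it is consistent with the remark that negative values can occur. The helper names `split_indices` and `select_nodes` are hypothetical.

```python
import random


def split_indices(n=63, n_train=45, n_val=9, seed=0):
    """Randomly split sample indices into training, validation and testing sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def mse(observed, predicted):
    """Mean square error between in-situ and retrieved TLI."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (can be negative)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def select_nodes(results):
    """Pick the node count with the lowest MSE.

    results: {node_count: (observed, predicted)} for each trained network.
    """
    return min(results, key=lambda nodes: mse(*results[nodes]))
```

In this outline each candidate network (2 to 5 hidden nodes) would be trained on the 45 training points, and `select_nodes` would be given its predictions on held-out points.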

RESULTS
The coefficient of determination (R²) and RMSE of the neural network were much better than those of the MR results. This is due to the non-linear nature of the neural network in transferring the satellite bands into the five components of TLI and the final TLI value. Table 3 shows that the accuracy of classification using the MR method was not satisfactory. The difference between the retrieved result (59.5) and the actual result (60.5) was judged to indicate a classification error. Errors in classification occur because classification uses thresholds to separate results directly. Classification is done according to the score interval for each eutrophication level. Accuracy of classification is evaluated using the actual observed values for the 12 water quality sample sites. The closer the retrieved value from the remote sensing imagery is to the observed value, the higher the accuracy of the classification result. Nevertheless, the large number of uncertainties in the water body environment of lakes leads to fuzziness in the categorisation and standardization of the different indicators, which may lead to classification errors. Even when the retrieved value is very close to the observed value, the result is judged as a classification error if the two values fall in different classes. Consequently, an improved evaluation method based on fuzzy mathematics should be used to evaluate the water quality level. Figure 5 shows a comparison of neural network results (4 nodes), MR results, and the TLI calculated from sampled data. The MSE of the 4-node hidden layer neural network appears to be much lower than that of the MR results.
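The threshold effect described above, where 59.5 and 60.5 fall into different classes despite differing by only one point, can be reproduced with a small sketch. The class boundaries (30, 50, 60, 70) and class names are assumptions based on the intervals commonly used with the Chinese TLI standard; they are not listed in this text. The function names are hypothetical.

```python
def trophic_class(tli):
    """Assign a trophic class from a TLI score (assumed score intervals)."""
    if tli < 30:
        return "oligotrophic"
    if tli <= 50:
        return "mesotrophic"
    if tli <= 60:
        return "light eutrophic"
    if tli <= 70:
        return "middle eutrophic"
    return "hyper eutrophic"


def classification_accuracy(observed, retrieved):
    """Fraction of samples whose retrieved class matches the observed class."""
    hits = sum(
        trophic_class(o) == trophic_class(r) for o, r in zip(observed, retrieved)
    )
    return hits / len(observed)


# A one-point difference across the assumed threshold at 60 counts as an error.
print(trophic_class(59.5), "vs", trophic_class(60.5))
```

Hard thresholds of this kind are why a regression that is numerically close can still score poorly on classification, which motivates the fuzzy evaluation suggested in the text.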
The results show that the application of neural networks to TLI retrieval from MODIS images outperforms multivariate regression in both regression accuracy and retrieval stability. Thus, the neural network based TLI mapping method can be used as a complement to the conventional data sampling method.
These three groups of images were obtained on 2 April 2013, 3 June 2013, and 11 July 2013, respectively.

DISCUSSION
As Fig. 6 indicates, the neural network-retrieved spatial distribution of TLI outperformed the MR approach in two ways:
• The range of the TLI retrieved was set between the minimum and maximum of the training data. In this way, extreme values were avoided. Even though higher TLI values could be retrieved, this conservative algorithm maintained stability and controllability. By contrast, the MR results extended well beyond both the minimum and maximum of the training data.
• Results calculated from the neural network data-driven method show great spatial heterogeneity, which means that more eutrophication state information was mined from the satellite images. By contrast, the results of MR are spatially smooth. The linear MR method involves moderate- and low-resolution satellite images, which share the characteristic of having mixed pixels; this leads to similarity between neighbouring pixels and a spatially smooth result. This means that the neural network results are indeed pixel-based, while MR results suffer from the mixed-pixel effect.
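The first point above, that retrieved TLI values stay between the minimum and maximum of the training data, can be pictured as a simple clamping step. The text does not say how the range was enforced, so this is only one plausible reading, and `clamp_to_training_range` is a hypothetical helper.

```python
def clamp_to_training_range(predictions, training_targets):
    """Limit retrieved TLI values to the range seen in the training data.

    ASSUMPTION: the range restriction is applied as a post-processing clamp.
    """
    low, high = min(training_targets), max(training_targets)
    return [min(max(p, low), high) for p in predictions]
```

A linear regression has no such bound, so its output can drift outside the observed range wherever the band values differ from those at the sampling sites.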
One obstacle that affects the performance of both neural networks and MR is the atmosphere. MOD09 is generated by the Second Simulation of a Satellite Signal in the Solar Spectrum, Vector (6S) model with several atmospheric parameters taken either from the National Centres for Environmental Prediction (NCEP) (ozone, pressure) or from the MODIS data (aerosol, water vapour) (Vermote and Vermeulen, 1999). Future work might focus on atmospheric correction; a better algorithm has recently been reported (Guang et al., 2013), which suggests that future improvement in regression accuracy might depend on the new method. Other factors that influence the retrieval capability of a neural network are the spatial resolution of MODIS imagery and the disturbance of water during sampling.
The low resolution of MODIS imagery means that the pixel value is a mixture of the sampling area and the surrounding area. As Kuster points out, water surface algae are distributed in patches less than 30 m wide (Kuster, 2004). This may also apply to the TLI distribution. The ground truth is a point value, while the corresponding pixel is a mixture. In addition, water sampling disturbs the surface water and pushes scum and subsurface aggregations away from the ship. Together, these two factors bring uncertainty to the final results.
Another very important question concerns the vertical stratification of water. Vertical water column structure may lead to changes in concentration. Chaohu Lake is, however, an expansive shallow lake, and so the results obtained from sampling will be unaffected by vertical stratification. For other types of lakes, we are developing a three-dimensional model based on hydrodynamics, which will consider the influence of water column structure on water quality. In the future, the remote-sensing data will be assimilated with the three-dimensional hydrodynamic model and can be used to monitor and analyse the results of model simulation.
Finally, because the number of sample points is not sufficient, the accuracy of the model is bound to be affected, especially for the non-sampled area. Guided by the error between the measured values and the values retrieved by the neural network model, more sampling points can be deployed in the areas of Chaohu Lake where accuracy is poor. This will improve the accuracy of the model while keeping the number of sampled points suitable.

CONCLUSIONS
We implemented a rapid data-driven method for monitoring the TLI distribution of Chaohu Lake from MODIS satellite remote-sensing data on the basis of an artificial neural network. Advantages and potential future improvements of this innovative method are listed below:
• Results demonstrate that water quality distribution can be predicted with information retrieved from satellite remote-sensing images. From the perspective of information theory, reducing the intermediate steps may improve accuracy because less information is lost in the transformation process.
• The TLI distribution mapping interval can be improved. If weather permits, the mapping of TLI can be done twice a day. Compared with sampling method requiring access by …
• Inaccuracy caused by surface water disturbance can be avoided to some extent. However, this depends on the accuracy of the training data acquired by the boat sampling method. This problem can be avoided if automatic water sampling stations are created in Chaohu Lake.
• Higher spatial resolution images with an appropriate revisit cycle may be used to improve the mapping result. To address the limitations of weather and data acquisition, high temporal resolution MODIS data were used in this study. Landsat imagery, with its better spatial resolution, will provide better detail for determining TLI. Pixels that match the observation points more accurately can then be extracted from the image, which will lead to much better regression performance and more detailed information for areas with diffuse blue-green algae.
• This method demonstrates that ANN is much more suited to TLI determination than the MR method. The key reason for the improved performance lies in the non-linearity of neural networks. Various machine-learning algorithms have been extensively studied and successfully applied in fitting problems. Future work may focus on replacing the back-propagation neural network with other algorithms. For example, genetic programming has been applied for better performance in chlorophyll a concentration retrieval (Chang et al., 2013).

Figure 1. Location of sampling sites within Chaohu Lake
Figure 4. Architecture of satellite bands to TLI neural network
Figure 5. Comparison of neural network results (4 nodes), MR results, and TLI calculated from observation data
Figure 6. The retrieval results contrasting the neural network and MR