Transferability of decision trees for land cover classification in a heterogeneous area

As the value of accurate land cover becomes more apparent, methods to decrease the costs associated with supervised land cover mapping are investigated. One such method is to use training data captured in one scene and apply it to a different scene through a process known as signature extension. This paper attempts to derive classification rules from training data of four Landsat-8 scenes by using the classification and regression tree (CART) implementation of the decision tree algorithm. The transferability of the ruleset was evaluated by classifying two adjacent scenes. The classification of the four mosaicked scenes achieved an overall accuracy of 80.6%, while the two adjacent scenes achieved 64.1% and 83.7% respectively. The low accuracy of the first adjacent scene can be ascribed to a misclassification of graminoids, urban and bare areas, attributed to the temporal changes of grasslands throughout the year. In an attempt to improve the results, a normalised difference vegetation index (NDVI) threshold was applied to each scene. This increased the accuracy of the first adjacent scene but decreased the accuracy of the second. We conclude that signature extension using CART is unreliable. However, simple rules can be added to improve the results.


Introduction
There are many applications of remotely sensed imagery, but the most common is undoubtedly land cover mapping (Gray and Song, 2013; Hu et al., 2015). Moreover, the importance of accurate and up-to-date land cover and land use information is increasing as its significance becomes recognised by the international scientific community (Rodriguez-Galiano and Chica-Olmo, 2012).
One method to produce land cover information is through the supervised classification of remotely sensed imagery. Generally, supervised classification takes place on a scene-by-scene basis (Gray and Song, 2013; Knorn et al., 2009), which is problematic as the selection of training data can be time-consuming and expensive and requires the expertise of a skilled photo-interpreter (Knorn et al., 2009; Richards and Jia, 2006). These factors could lower the quality and frequency of land cover products.
The costs associated with the collection of training data could be reduced if the training data are applied to a different area or time period (Giri, 2012). When confronted with a large area to be classified, one approach, initially developed in the mid-1970s, is signature extension, also known as generalization (Pax-Lenney et al., 2001). This method involves the use of spectral signatures created in one area and applied to another scene, to a different sensor or at a different time (Hu et al., 2015; Knorn et al., 2009). Spatial signature extension was initially found to be ineffectual, mainly due to poor radiometric calibration and normalisation between scenes (Olthof et al., 2005). The success of signature extension was further hindered by spatial heterogeneity and phenological differences (McDermid et al., 2005), specifically in a north-south direction, due to significant changes in vegetation (Laborte et al., 2010).
Owing to variations in topography, phenology (Knorn et al., 2009), the angle of the sun and atmospheric conditions (Hu et al., 2015), accuracies of land cover maps generated through signature extension tend to decline by as much as 13% when classifying nearby scenes (Pax-Lenney et al., 2001). This decline is, however, not yet fully understood as the patterns affecting it are complex (Pax-Lenney et al., 2001). Using only two land cover classes as a method of monitoring forest change, Woodcock et al. (2001) noted that spatial extension is possible, and in fact comparable to other methods, but only when used for nearby scenes. This observation supports the findings of other authors who noted that the success of land cover classification is hampered when an area is heterogeneous, as geographical complexity can have a negative effect on the spectral separability of classes and can reduce classification accuracy (Okubo et al., 2010). Spectral separability measures have been used to provide an indication of the potential accuracy of land cover classifications (Su et al., 2010) in heterogeneous areas. For instance, Verhulp and Van Niekerk (2016) used spectral separability measures in four individual and mosaicked Landsat-8 scenes to evaluate the potential of signature extension for supervised land cover mapping. They noted that the overall spectral separability was substantially lower in highly heterogeneous scenes compared to less complex scenes. They also showed that the use of multi-temporal imagery can improve the separability of classes in heterogeneous areas as it better represents the phenological stages of vegetation. The value of multi-temporal imagery for land cover classification was also demonstrated by Verhulp and Van Niekerk (2016) and Brown de Colstoun et al. (2003). The inclusion of multi-temporal imagery could possibly improve the accuracies associated with signature extension.
An alternative form of spatial classifier extension, one that does not rely on spectral signatures, is to create a decision tree. A decision tree uses binary rules to classify pixels based on both spectral and ancillary information (Chuvieco and Huete, 2010). Each tree has a root node, a series of splits and terminal nodes known as leaves (Pal and Mather, 2001). The decision tree method is very popular in land cover classifications thanks to its flexibility, simplicity and ease of interpretation (Brown de Colstoun et al., 2003). The accuracies obtained through decision trees are also similar to or better than those of other classification methods (Brown de Colstoun et al., 2003; Zhai et al., 2012).
Traditionally, the creation of the decision tree and the identification of the splitting criteria were based on an expert interpreter's knowledge (known as expert rules) or on statistical approaches (Chuvieco and Huete, 2010). Recently, algorithms have been developed to automatically generate decision trees (Chuvieco and Huete, 2010). One such algorithm is classification and regression trees (CART). CART recursively splits the data into two nodes according to the independent variables (the spectral and ancillary data), until the land cover classes within each node are consistent. Since decision trees tend to overfit the data (i.e. produce trees with poor generalizability to other datasets), CART uses an independent dataset to test the classification and prune the tree to an optimal size, known as the best tree, which balances predicted accuracy against complexity (Steinberg and Colla, 2001). Ideally, the best tree is "less complex yet has superior predictive capabilities" (Brown de Colstoun et al., 2003: 317).
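The recursive binary splitting described above can be illustrated with a minimal sketch of a single split on one variable. It assumes Gini impurity as the splitting criterion, which is a common choice for CART but is not stated in this paper. The function names and the NDVI values and labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one variable.

    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        threshold = (low + high) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical NDVI values and reference classes for six sample points.
ndvi = [0.05, 0.10, 0.15, 0.45, 0.60, 0.70]
classes = ["bare", "bare", "bare", "trees", "trees", "trees"]
threshold, impurity = best_split(ndvi, classes)
print(threshold, impurity)
```

A full tree repeats this search over every input variable and then recurses into the two resulting nodes. Pruning against an independent dataset, as described above, is a separate step that this sketch does not cover.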
Despite the simple and transparent nature of decision trees, very little research has been conducted on the transferability of the resulting rules to different scenes. Wentz et al. (2008) adapted expert rules, originally designed for the classification of Phoenix, Arizona, and applied them to Delhi, India. They achieved an overall accuracy (OC) of 80.0%, but noted that certain classes had been hardcoded by the expert system, resulting in 100% accuracy for those classes. Zhai et al. (2012) classified Landsat Enhanced Thematic Mapper (ETM) scenes, having used only a few of the scenes to develop rules. Using spectral data from one season, as well as normalised difference vegetation index (NDVI) and tasseled cap components, an accuracy of 78.87% was achieved. They concluded that it is possible to classify large areas using decision trees, and that sample selection in every scene is not necessary.
This paper aims to investigate the accuracy and robustness (transferability) of decision tree rules to classify a large, highly heterogeneous area into land cover classes. The potential to spatially transfer the rules to two adjacent Landsat-8 scenes is investigated and the results are interpreted in the context of finding cost-effective operational solutions for monitoring land cover in complex areas. The area selected for this study is particularly complex owing to the great variation in elevation, climate, environmental patterns and vegetation (Verhulp and Van Niekerk, 2016).

Study area
The study area is made up of six Landsat-8 scenes situated primarily in the Eastern Cape Province of South Africa (Figure 1). The scenes were acquired from the United States Geological Survey (USGS) archive and stretch from 22°20' E to 30° E, with four scenes positioned along the coastline and two positioned inland.
The western portion of the study area is separated by the east-west oriented Baviaanskloof Mountain range, with a maximum elevation of 2 130 m. The eastern portion of the study area rises steadily from the coast to the Winterberge and Drakensberg, with a maximum elevation of 2 743 m. This rise towards the mountains is accompanied by a decrease in the average temperature and rainfall. The western interior is drier than the coast and experiences hot summers and cold winters (Schulze and Maharaj, 2006). In the central interior, north of the Winterberge Mountain range, the temperatures are much cooler, especially during winter.
The extent of the geographical area, along with the large variation in climate, has resulted in a highly diverse vegetation structure. According to Mucina and Rutherford (2006), the study area contains nine of the ten vegetation biomes found in South Africa. A biome is a "high-level hierarchical unit having similar vegetation structures exposed to similar microclimatic patterns, often linked to characteristic levels of disturbance such as grazing and fire" (Mucina and Rutherford, 2006: 32). The inland scenes are dominated by the grassland and nama-karoo biomes, while the coastal areas host a complex mixture of Albany thicket, fynbos, savanna, succulent karoo, azonal vegetation, Indian Ocean coastal belt and forests (Figure 2).
The study area was selected because the complex nature and heterogeneity of vegetation within the province would ensure that any methodologies developed should be applicable to less complex, more homogeneous areas.

Satellite imagery
Landsat-8 imagery was collected for early spring of 2013 (August-September) and late summer of 2014 (February-April). The imagery was pre-processed by applying atmospheric and radiometric corrections. This is essential when multi-temporal and multi-scene images are utilised (Hu et al., 2015), as it allows the user to compare the digital numbers across both space and time (Chuvieco and Huete, 2010). The corrections were implemented using the Atmospheric and Topographic Correction (ATCOR) procedure within the Interactive Data Language (IDL) environment. The 30 m resolution imagery was pansharpened to 15 m, while the two thermal bands were converted to a single surface temperature and resampled to 15 m. The final result of the pre-processing workflow was a 12-bit, 15 m TIFF image with seven bands: blue, green, red, near infrared (NIR), short wave infrared 1 (SWIR1), short wave infrared 2 (SWIR2) and surface temperature.

Training and reference data
The four coastal scenes were mosaicked and treated as a single entity. Training data for the coastal scenes were collected manually using a combination of Landsat-8, SPOT-5 and Google Earth imagery. A total of 1 464 polygons were collected for seven land cover classes (Table 1). The two inland scenes were kept separate and used to test classification via spatial extension of the developed ruleset. A total of 180 samples to be used as ground truthing (reference) data for scenes 170/082 and 171/082 were collected from SPOT-6 imagery (Table 1).

Auxiliary data

Principal component analysis and texture measures
Texture is characterised by the spatial variation of the spectral brightness within an image, and can be included to increase the classification accuracy (Rodriguez-Galiano and Chica-Olmo, 2012; Berberoglu et al., 2007). A popular method used to determine texture is the grey-level co-occurrence matrix, which evaluates the arrangement of grey values within a specified window in order to determine textural variation (Chuvieco and Huete, 2010; Berberoglu et al., 2007; Coburn and Roberts, 2004). The use of textural features may, however, dramatically increase the dimension of the data as the calculation is applied to each image. Additionally, many of these features may be redundant or highly correlated (Pacifici et al., 2009).
Principal component analysis (PCA) is a feature selection procedure that results in the maximum amount of information for all bands condensed into a single band (Campbell and Wynne, 2012), made possible by the high correlation between bands. PCA is achieved through a linear transformation where the data axes are rotated in order to realign them with the maximum data variance (Giri, 2012). The first axis, which contains the maximum information in a single band, is known as the first principal component or PCA1. To avoid the increased complexity that results from a larger number of bands and to reduce redundancy, texture is usually extracted from the PCA1 band (Berberoglu et al., 2007).
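As an illustration of how the first principal component and its share of the data variance can be obtained, the following minimal sketch applies power iteration to the band covariance matrix. It is not the software procedure used in this study. The function name and the two-band sample values are hypothetical.

```python
import math


def first_principal_component(pixels, iterations=200):
    """Return (axis, explained_fraction) of the first principal component.

    pixels: list of pixels, each a list of band values.
    Assumes at least two pixels and a non-zero covariance matrix.
    """
    n, k = len(pixels), len(pixels[0])
    means = [sum(p[j] for p in pixels) / n for j in range(k)]
    centred = [[p[j] - means[j] for j in range(k)] for p in pixels]
    # Sample covariance matrix between the bands.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    axis = [1.0] * k
    for _ in range(iterations):
        product = [sum(cov[a][b] * axis[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    eigenvalue = sum(axis[a] * sum(cov[a][b] * axis[b] for b in range(k))
                     for a in range(k))
    total_variance = sum(cov[a][a] for a in range(k))
    return axis, eigenvalue / total_variance


# Hypothetical pixels with two perfectly correlated bands.
pixels = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
axis, explained = first_principal_component(pixels)
print(axis, explained)
```

Because the two hypothetical bands are perfectly correlated, the first component captures all of the variance here. With real multi-season bands the explained fraction would be lower.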
In this study, PCA1 was extracted from both the spring and summer bands, and contained 87% of the data variance. Six texture measures were applied to PCA1: homogeneity, angular second moment, contrast, entropy, correlation and standard deviation. These six measures are generally accepted to be the most important measures for analysing images (Kayitakire et al., 2006). A 3x3 window produces the highest classification accuracy as well as the highest Kappa value (Chet et al., 2004), and was therefore used for calculating the texture measures.
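The grey-level co-occurrence matrix and two of the texture measures listed above (contrast and homogeneity) can be sketched for a single 3x3 window as follows. This is a simplified illustration that assumes a horizontal one-pixel offset and grey values already quantised to a small number of levels. The function names and window values are hypothetical, and the offset and quantisation used in this study are not specified.

```python
def glcm(window, levels, offset=(0, 1)):
    """Symmetric, normalised grey-level co-occurrence matrix for one pixel offset."""
    d_row, d_col = offset
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = window[r][c], window[r2][c2]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Weights co-occurrences by the squared grey-level difference."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


def homogeneity(p):
    """High when co-occurring grey levels are similar."""
    size = len(p)
    return sum(p[i][j] / (1.0 + (i - j) ** 2) for i in range(size) for j in range(size))


# Hypothetical 3x3 windows quantised to four grey levels (0 to 3).
flat = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
striped = [[0, 3, 0], [0, 3, 0], [0, 3, 0]]
for name, window in (("flat", flat), ("striped", striped)):
    matrix = glcm(window, levels=4)
    print(name, contrast(matrix), homogeneity(matrix))
```

The flat window gives zero contrast and maximum homogeneity, while the striped window gives high contrast and low homogeneity. This is the kind of difference that texture measures are intended to expose.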

Spectral indices
Spectral indices are formulas designed to extract quantitative information about each pixel (Chuvieco and Huete, 2010) and enhance latent or hidden information in the image data (Campbell and Wynne, 2012). Vegetation indices take advantage of the strong reflectance and absorption of chlorophyll in the NIR and red bands respectively (Chuvieco and Huete, 2010). The soil adjusted vegetation index (SAVI), second modified soil adjusted vegetation index (MSAVI-2) and the enhanced vegetation index (EVI) (Jensen, 2005) are all variations of the NDVI, and make use of this relationship. Water indices, such as the normalised difference water index (NDWI_MF) proposed by McFeeters (1996) or the modified NDWI proposed by Xu (2006) (NDWI_XU), attempt to identify water and reduce shadow noise. Built-up and bare soil indices aim to emphasise non-vegetated features including urban areas, rock and bare soil. Examples of such indices include the enhanced built-up and bareness index (EBBI) (As-syakur, 2012), the index-based built-up index (IBI) (Xu, 2008), the soil index (Waqar, 2012), the normalised difference bareness index (NDBAI) (Waqar, 2012) and NDBAI-2. These eleven indices were calculated for both the spring and summer image sets and included as additional input variables.
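Four of the indices named above can be written out using their commonly published forms. The sketch below assumes surface reflectance values between 0 and 1 and a SAVI soil adjustment factor of 0.5, neither of which is stated in this paper. The function names, the handling of a zero denominator and the sample reflectances are illustrative choices only.

```python
def normalised_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when the denominator is zero (a sketch choice)."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalised difference vegetation index."""
    return normalised_difference(nir, red)


def savi(nir, red, soil_factor=0.5):
    """Soil adjusted vegetation index; soil_factor = 0.5 is an assumed default."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def ndwi_mf(green, nir):
    """Normalised difference water index after McFeeters (1996)."""
    return normalised_difference(green, nir)


def ndwi_xu(green, swir1):
    """Modified normalised difference water index after Xu (2006)."""
    return normalised_difference(green, swir1)


# Hypothetical reflectances for a vegetated pixel and a water pixel.
vegetation = {"green": 0.08, "red": 0.10, "nir": 0.50, "swir1": 0.25}
water = {"green": 0.10, "red": 0.06, "nir": 0.02, "swir1": 0.01}
print(ndvi(vegetation["nir"], vegetation["red"]))
print(savi(vegetation["nir"], vegetation["red"]))
print(ndwi_mf(water["green"], water["nir"]))
print(ndwi_xu(water["green"], water["swir1"]))
```

In this hypothetical example the vegetated pixel yields a high NDVI and the water pixel yields high values for both water indices, which is the behaviour the indices are designed to produce.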

Ancillary data
The inclusion of topographic data as ancillary data in land cover mapping can improve classification accuracies (Ren, 2009). For this study, the 30 m Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM) covering the area of interest was obtained from the USGS. Slope gradient and aspect values were calculated and incorporated into the classification as additional features.
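Slope gradient and aspect can be derived from a DEM in several ways. The following minimal sketch uses simple central differences at an interior cell, which is an assumption, as the algorithm used in this study is not specified. It further assumes that row 0 is the northern edge of the grid, and the function name and elevation values are hypothetical.

```python
import math


def slope_aspect(dem, cell_size, row, col):
    """Slope (degrees) and aspect (degrees clockwise from north) at an interior cell.

    Assumes row 0 is the northern edge and column 0 the western edge of the grid.
    Aspect is None for a flat cell, where it is undefined.
    """
    # Rate of elevation change towards the east and towards the north.
    dz_east = (dem[row][col + 1] - dem[row][col - 1]) / (2.0 * cell_size)
    dz_north = (dem[row - 1][col] - dem[row + 1][col]) / (2.0 * cell_size)
    slope = math.degrees(math.atan(math.hypot(dz_east, dz_north)))
    if dz_east == 0 and dz_north == 0:
        return slope, None
    # Aspect is the compass bearing of the downslope direction.
    aspect = math.degrees(math.atan2(-dz_east, -dz_north)) % 360.0
    return slope, aspect


# Hypothetical 3x3 elevation grids (metres) with 30 m cells.
rising_east = [[0, 30, 60], [0, 30, 60], [0, 30, 60]]
rising_north = [[60, 60, 60], [30, 30, 30], [0, 0, 0]]
print(slope_aspect(rising_east, 30.0, 1, 1))   # surface faces west
print(slope_aspect(rising_north, 30.0, 1, 1))  # surface faces south
```

GIS packages typically use a weighted 3x3 kernel rather than plain central differences, so values from operational software may differ slightly from this sketch.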

Data preparation and CART application
The four coastal scenes were mosaicked and treated as a single entity in order to produce a decision tree that incorporates scene-to-scene variations. No colour calibration, feathering or dodging parameters were selected during the mosaicking process.
All of the polygon training samples were converted to a series of vector points at a 15m sample distance, with an attribute representing the reference land cover class. This resulted in 625 939 sample points. An equally proportioned random subset of 89 238 sample points was created to produce the decision tree as CART performs best with an equal ratio (Campbell and Wynne, 2012).
For each point, the underlying pixel value of the Landsat-8 image features as well as the ancillary data was extracted. The image features consisted of the seven bands and eleven indices described in Section 3.3.2, as well as the six texture variables discussed in Section 3.3.1. The ancillary data consisted of slope gradient and aspect. The attribute data was exported for input into CART. Half of the points were used to build the initial tree, while the remaining points were used for pruning and obtaining the predicted classification accuracy.
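The sampling steps described above (an equally proportioned random subset, followed by an even division into tree-building and pruning sets) can be sketched as follows. The function names, the fixed random seed and the toy sample points are hypothetical, and the exact sampling procedure used in this study may differ.

```python
import random
from collections import defaultdict


def balanced_subset(points, per_class, seed=0):
    """Draw the same number of points from every class.

    points: list of (features, label) tuples.
    A class with fewer than per_class points contributes all of its points.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for point in points:
        by_class[point[1]].append(point)
    subset = []
    for label in sorted(by_class):
        members = by_class[label]
        subset.extend(rng.sample(members, min(per_class, len(members))))
    rng.shuffle(subset)  # mix the classes before the subset is divided
    return subset


def split_half(points):
    """Divide the points into a tree-building half and a pruning half."""
    middle = len(points) // 2
    return points[:middle], points[middle:]


# Hypothetical, deliberately unbalanced sample: 100 graminoid and 20 water points.
points = [((i,), "graminoids") for i in range(100)]
points += [((i,), "water") for i in range(20)]
subset = balanced_subset(points, per_class=10)
build, prune = split_half(subset)
print(len(subset), len(build), len(prune))
```

Shuffling before the division keeps both halves mixed, although this simple approach does not guarantee identical class proportions in each half.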
CART often generates complex trees containing a large number of nodes that are not easily programmable or transferable. It is consequently common practice to limit the depth of the tree or the maximum number of nodes. However, such limitations generally have a negative effect on the resulting tree's predictive accuracy (Steinberg and Colla, 2001). Another approach is to manually prune the tree to the desired complexity. In this study, two tree complexity reduction methods were implemented and evaluated according to their resulting predicted accuracy. In the first scenario (Scenario 1), the number of nodes was limited during the tree-building phase. In the second scenario (Scenario 2), no limits were specified during the tree-building phase, but the tree was manually pruned. The predictive accuracies of these two scenarios were recorded for each tree size (from 900 to 20 nodes). The impact of merging different land cover classes was also investigated by repeating the tree complexity reduction scenarios on different sets of classification schemes.
The classification rules derived from the decision tree with the smallest number of terminal nodes and the highest predictive accuracy were implemented using ERDAS Imagine's Knowledge Engineer Classifier. The ruleset was used to produce three land cover maps. The first map covered the four coastal scenes from which the rules had been derived, while the second and third maps were generated by implementing the ruleset on the two inland scenes. The purpose of the latter two maps was to test the ability of image extension as no training samples were collected in these areas.
An independent set of reference samples was used to determine the accuracy of the resulting maps. The points were randomly selected from the ground truthing polygons discussed in Section 3.2. As there were many clouds in scene 170/082, which could not be masked out, a cloud mask was created. No points inside the mask were used for the accuracy assessment, as this could negatively affect the result. A confusion matrix was used for the accuracy assessment. The user's and producer's accuracy, OC and the kappa index of agreement (KIA) coefficient were calculated from the confusion matrix.
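The accuracy measures named above follow directly from the confusion matrix. The sketch below uses the standard definitions and assumes that rows hold the mapped classes and columns hold the reference classes. The function name and the two-class matrix are hypothetical and are not results from this study.

```python
def accuracy_metrics(matrix):
    """Return (overall, users, producers, kappa) from a square confusion matrix.

    matrix[i][j] = number of samples mapped as class i whose reference class is j.
    Assumes every class occurs at least once in both the map and the reference data.
    """
    k = len(matrix)
    total = sum(map(sum, matrix))
    correct = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    overall = correct / total
    # User's accuracy: share of each mapped class that is correct.
    users = [matrix[i][i] / row_totals[i] for i in range(k)]
    # Producer's accuracy: share of each reference class that was found.
    producers = [matrix[j][j] / col_totals[j] for j in range(k)]
    # Agreement expected by chance, from the row and column totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, users, producers, kappa


# Hypothetical two-class confusion matrix (rows: mapped, columns: reference).
matrix = [[45, 5],
          [10, 40]]
overall, users, producers, kappa = accuracy_metrics(matrix)
print(overall, users, producers, kappa)
```

The kappa coefficient discounts the agreement that would be expected by chance, which is why it is lower than the overall accuracy in this example.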

Results
CART produced an initial decision tree with 975 terminal nodes. Allowing the tree to grow and then manually pruning it (Scenario 2) achieved a higher predicted accuracy than when a maximum number of nodes was imposed prior to tree-building (Scenario 1). However, even with manual pruning, a relatively large number of terminal nodes (42) was required to achieve a predicted accuracy above 80%. Pruning the tree further resulted in a sharp drop in predictive accuracy, mainly because of a poor distinction between trees and bushes on the one hand and the urban and bare classes on the other.
It is known that the accuracy of classifying urban features with the spatial resolution of Landsat-8 is low (Moran, 2010; Kahya et al., 2010) and that urban and barren land features are easily confused due to the similarity in their spectral signatures (Zhang, 2014). A third scenario (Scenario 3) was consequently tested in which the bare and urban classes were merged into a single class and its predictive accuracies tested (Figure 3). When manual pruning was applied to this simplified classification scheme, a predicted accuracy of 80.77% was achieved with only 21 terminal nodes. Limiting the number of nodes to 21 during the tree-building phase produced a predictive accuracy of 75.45%.
Figure 3. The predicted accuracy compared to the number of terminal nodes when the maximum number of nodes is specified prior to tree-building (Scenario 1), manual pruning is applied (Scenario 2), and when manual pruning was applied after the urban and bare classes were combined (Scenario 3).
Under Scenario 3 the pruning process discarded the attributes that were not beneficial to the classification and only eleven attributes were retained. The attributes that remained included the blue, NIR, SWIR1 and thermal bands, as well as both water indices from the spring season; the blue and thermal bands and the NDVI and EBBI indices from the summer season; and contrast as a texture measure. All other attributes were deemed unnecessary to achieve an accuracy over 80%. The decision tree in Figure 4 was applied to both the scenes from which the rules had been derived and two independent scenes to produce land cover maps. The accuracy of these maps is described in the following subsections.
The land cover classification of the mosaicked coastal scenes (Figure 5) achieved an OC of 80.58% with a KIA of 0.76. Table 2 shows the confusion matrix along with the user's and producer's accuracy of the resulting map. Bushes and trees could not be clearly distinguished from one another (66.9% and 64.3% producer's and user's accuracy respectively). Urban and bare areas were slightly over-classified, with 1 915 pixels (15%) of the bushes, forbs and graminoids samples being incorrectly classified as urban and bare.
A random sample of 40 002 points was used to test the accuracy of scene 170/082. A KIA of 0.55 and an OC of 64.1% were obtained for the scene. Table 3 shows the confusion matrix and user's and producer's accuracy for the scene. Trees have a high user's accuracy (98.4%), but a large portion (22%) of trees was classified as bush. This scene did not contain any forbs, which meant that any classification thereof was inherently incorrect. Nearly 1 000 (2%) samples were incorrectly classified as forbs, with 48% of them verified as being urban and bare, while 40% of them were meant to represent graminoids. The urban and bare class was substantially over-classified: 46% of the pixels classified as urban and bare were graminoids according to the reference data. Graminoids were also confused with bush, with a further 25% being classified as such.
In an attempt to reduce the misclassification of vegetation as bare areas and vice versa, an additional rule was manually added. The rule considers all pixels classified as urban and bare and applies a threshold to reclassify pixels with NDVI values higher than 0.2 as graminoids. This improved the overall classification accuracy to 70.4% and the KIA to 0.63. The user's accuracy of the bare and urban class increased considerably (from 48.8% to 86.9%).
The classification of scene 171/082 was distinctly better than that of scene 170/082 and achieved an OC of 83.7% with a KIA of 0.80. Table 4 shows the confusion matrix and user's and producer's accuracy for the classification of the scene. Trees, bushes, urban and bare, and water classes all had accuracies above 80%. Graminoids had a very low producer's accuracy (30.2%), with 69.6% of graminoids classified as either forbs or urban and bare. The inclusion of an NDVI mask as in scene 170/082 only served to reduce the OC of the classification.
Confusion between bushes and trees is common when only using spectral information (Geerling, 2007). The inclusion of height data (such as LiDAR data) could possibly assist with discriminating between the two, as they have different structures (Geerling, 2007). Research on the use of LiDAR data for discriminating these and similar land covers is recommended.
The inaccurate classification of pixels as forbs in scene 170/082 (where none were present) is a known limitation of supervised classification, as spectral classes are forced to be classified in terms of operator-defined classes (Campbell and Wynne, 2012). A possible solution is to amend the decision tree so that all forbs are classified as either grasslands or bare. However, operational issues must be considered. Furthermore, the analyst may not be aware of the absence of a certain class within a specific scene.
Scene 170/082 contains predominantly grassland, while scene 171/082 is primarily made up of nama-karoo vegetation, as well as grassland and Albany thicket. This large proportion of grassland in scene 170/082 may be the cause of its reduced accuracy. The producer's accuracy of graminoids was poor in both scenes (39.3% and 30.2% for 170/082 and 171/082 respectively), with over 30% being classified as urban and bare in each case.
The over-classification of bare and urban areas, prevalent in both inland scenes, could result from the point sampling process. Urban areas often contain large quantities of vegetation (Zhang et al., 2014), and this contamination of the training areas could have affected the classification result. Another reason for the extent of the misclassification of grassland and bare is that sensitive areas may be bare during the dry season, but contain vegetation during the wet season. This temporal complexity would then confuse the classifier when using dual-season imagery. The grassland biome is particularly seasonal, with strong summer rainfall and droughts in winter (Mucina and Rutherford, 2006). Figure 6 shows the difference in vegetation between the wet and dry season, specifically the transformation from grassland to bare areas.
Figure 6. The substantial difference between the wet season (Parts (a) and (b)) (lush and green vegetation) and the dry season (Parts (c) and (d)).
The application of a masked NDVI or similar vegetation index to the urban and bare area at a specific season could reduce this misclassification and clarify the actual land cover class. A second mask, outlining the urban edge, could then separate the urban and bare class into two. The addition of other ancillary data, such as a biome or vegetation map, can also improve the classification, but may complicate the decision tree development.
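The NDVI rule that was added for scene 170/082 (pixels classified as urban and bare with an NDVI higher than 0.2 are reclassified as graminoids) can be sketched as follows. The function name, the class label strings and the sample values are hypothetical, and the season from which the NDVI is taken is left open because the text does not specify it.

```python
def apply_ndvi_rule(labels, ndvi_values, threshold=0.2):
    """Reclassify 'urban and bare' pixels with NDVI above the threshold as graminoids.

    labels and ndvi_values are parallel lists with one entry per pixel.
    Pixels of every other class are returned unchanged.
    """
    return [
        "graminoids" if label == "urban and bare" and value > threshold else label
        for label, value in zip(labels, ndvi_values)
    ]


# Hypothetical classified pixels and their NDVI values.
labels = ["urban and bare", "urban and bare", "water", "bushes"]
ndvi_values = [0.35, 0.10, 0.50, 0.60]
print(apply_ndvi_rule(labels, ndvi_values))
```

Only the first pixel changes class in this example, as the rule is restricted to the urban and bare class and leaves all other classes untouched regardless of their NDVI.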

Conclusion
This study evaluated the transferability of decision tree rules for land cover classification. A sample of 89 238 points was used to develop a decision tree ruleset. The information attributed to each point included the spectral information of the Landsat-8 images from two seasons, various indices, as well as elevation and texture information. The decision tree was pruned so as to reduce the complexity of the ruleset, while maintaining a predicted accuracy of above 80%. The ruleset was then applied to two adjacent scenes to test the transferability.
The results of this study provided new insight into the extent to which a decision tree ruleset can be transferred to adjacent scenes. The accuracy of scene 171/082 was 83.7%, while scene 170/082 only achieved an accuracy of 64.1%. The inclusion of an NDVI mask, however, improved the classification accuracy of scene 170/082 to 70.4%. Although the use of a decision tree via image extension for classification is possible, more insight into factors affecting the accuracy is needed, especially when complex, heterogeneous areas are involved. As noted by Pax-Lenney et al. (2001), a single classification test of spatial extension is insufficient to draw concrete conclusions.
This study showed that it is possible to transfer decision rules in complex areas, but that the accuracy varies depending on vegetation and distance from the original scene. Further research on the transferability of decision tree rules in complex, heterogeneous areas is needed, specifically on improving class-specific accuracies and determining the optimal distance over which rules can be transferred.