Clustering of groundwaters by Q-mode factor analysis according to their hydrogeochemical origin: A case study of the Cariri Valley (Northern Brazil) wells

Factor analysis was applied to 56 groundwater samples collected from wells located in the Araripe Sedimentary Basin, in the north-east of Brazil. The parameters are a set of 9 physicochemical, chemical, and isotope data, constituted by electrical conductivity (EC), ionic concentrations of Ca²⁺, Mg²⁺, Na⁺, K⁺, Cl⁻ and SO₄²⁻, alkalinity and δ¹⁸O (‰). In R-mode factor analysis, the first 3 factors explain 62% of the variance, and their loadings allow the interpretation of hydrogeochemical processes that take place in the area. Q-mode factor analysis on the 56 water samples decreases space dimensionality to 6, explaining 93% of the total database information. With the aid of a scalar and angular measurement method, objects were clustered, resulting in 11 groups classified according to their inherent characteristics, related to their hydrogeological origin.


Introduction
The ever-increasing demand for potable water requires knowledge of the quality of stored waters, as well as of the natural and anthropogenic processes that influence it. Waters stored in the same aquifer system can differ in their chemical composition due to internal and external processes, and suitable methodologies are needed for their identification.
A wide range of parameters is used in water research for assessing water quality, pollution, evaporation, flow dynamics and chemical evolution through the water cycle. Thus, large amounts of data are generated. In order to gain insight into the relationships between the parameters associated with a given set of objects, multivariate techniques have been applied to reveal affinities that are hidden in the database and undetectable by other means. Mathematically, these methods reduce space dimensionality by a suitable choice of new dimensions constructed as linear combinations of the original ones, simplifying the representation of the data set and facilitating its interpretation.
Many multivariate analysis techniques have been applied in hydrological studies: R-mode analysis in groundwater quality studies (Grande et al., 1996; Liu et al., 2003; Panagopoulos et al., 2004; Garcia-Rodriguez et al., 2007); R-mode, Q-mode and cluster analysis to assess surface/groundwater interaction and groundwater mixing (Reghunath et al., 2002); R-mode and cluster analysis to study groundwater quality in the Blue Nile basin (Hussein, 2004); principal component analysis (PCA), cluster, and discriminant analysis to evaluate spatial and temporal variations in river waters (Wunderlin et al., 2001; Singh et al., 2004); PCA and R-mode factor analysis to understand the origin and variation of each solute in natural waters (Anazawa et al., 2005). Geochemical data were used to test the influence of different factor-analysis techniques on the results extracted (Reimann et al., 2002).

Material and methods
A set of 56 groundwater samples was analysed for 9 physical and chemical parameters comprising major ion concentrations (Ca²⁺, Mg²⁺, Na⁺, K⁺, Cl⁻, SO₄²⁻), alkalinity (alka), electrical conductivity (EC) and the isotope oxygen-18 (δ¹⁸O, ‰). The samples were taken from wells located in the Cariri valley, part of the Araripe sedimentary basin, Brazil, embedded in Precambrian basement rock. This basin is divided between the Federal States of Ceará, Pernambuco (Pe) and Piauí (Pi). The greatest part is in Ceará and comprises the Araripe plateau and the Cariri valley, containing the most important groundwater storage of the State. Figure 1 shows the region under study, enclosing the towns of Crato, Juazeiro do Norte, and Barbalha and the areas of the formations Exu, Arajara, Santana, and Rio da Batateira. Water samples were collected from the Rio da Batateira aquifer, from wells that provide water for industrial, rural and urban use.

Factor analysis
Factor analysis is a multivariate statistical method which, through a linear dependence model constructed in an abstract space called factor score space, searches for correlations among measured variables that characterise a set of objects/samples. Its main feature is to decrease space dimensionality through the construction of a new dimensional base that preserves the essential information contained in the original database. Linear dependencies of variables are measured in that new space, where new variables are defined by the column vectors of a so-called factor-loading matrix (A) in the space spanned by the column vectors of the factor score matrix (F).
R-mode factor analysis searches for interrelationships among variables. The mathematical model is $Y = FA' + E$, where Y is the data matrix in deviate form, $y_{ij} = x_{ij} - \langle x_j \rangle$ (with $x_{ij}$ representing parameter j of object/sample i and $\langle x_j \rangle$ the mean of variable j), or in standardised form, $y_{ij} = (x_{ij} - \langle x_j \rangle)/s_j$ ($s_j$ being the standard deviation of variable j); F is the factor score matrix, A′ is A transposed and E is the residual matrix.
The maximum likelihood estimation method was used to compute estimates for A by a numerical iterative procedure (Jöreskog, 1967; 1977; Davis, 1986).
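The deviate and standardised forms of the data matrix used in the R-mode model can be sketched as follows. This is a minimal illustration with NumPy; the toy matrix `X` and the function names are hypothetical and not taken from the study, and the maximum likelihood estimation of A itself is not reproduced here.

```python
import numpy as np

def deviate_form(X):
    """Centre each column (variable) on its mean: y_ij = x_ij - <x_j>."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def standardised_form(X):
    """Centre each column and divide by its sample standard deviation s_j."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical toy data: 4 samples (rows) by 3 variables (columns).
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 14.0, 3.0],
              [3.0, 18.0, 4.0],
              [4.0, 22.0, 8.0]])

Z = standardised_form(X)
# With standardised data, Z'Z / (N - 1) is the correlation matrix of the
# variables, which is the matrix an R-mode analysis works on.
R = Z.T @ Z / (X.shape[0] - 1)
```

Standardising puts variables measured in different units (for example EC and δ¹⁸O) on the same scale before their correlations are examined.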
In R-mode factor analysis, to define the best dimensionality (k) of space, we calculated chi-square ($\chi_k^2$) and the number of degrees of freedom ($d_k$) for every factor space dimensionality. A measure of the relative importance of increasing the number of dimensions by one is defined by $\Delta_k = (\chi_k^2 - \chi_{k+1}^2)/(d_k - d_{k+1})$, the ratio between the differences in chi-square and in degrees of freedom.
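As an illustration of this dimensionality criterion, the sketch below computes the ratio of the chi-square difference to the degrees-of-freedom difference for successive dimensionalities. It assumes that the criterion is this ratio of differences, and it uses the standard textbook expression for the degrees of freedom of a maximum likelihood factor model with p variables and k factors; whether the study used exactly this expression is an assumption. The chi-square values are hypothetical and are not those of Table 1.

```python
def ml_degrees_of_freedom(p, k):
    """Degrees of freedom of the ML factor model: ((p - k)^2 - (p + k)) / 2."""
    return ((p - k) ** 2 - (p + k)) // 2

def delta(chi2, dof):
    """Ratio of differences between successive dimensionalities k and k + 1."""
    if len(chi2) != len(dof):
        raise ValueError("chi2 and dof must have the same length")
    return [(chi2[i] - chi2[i + 1]) / (dof[i] - dof[i + 1])
            for i in range(len(chi2) - 1)]

p = 9                                   # number of measured variables
ks = [1, 2, 3, 4]                       # candidate dimensionalities
dof = [ml_degrees_of_freedom(p, k) for k in ks]
chi2 = [100.0, 40.0, 20.0, 15.0]        # hypothetical chi-square values
deltas = delta(chi2, dof)               # one value per step k -> k + 1
```

A value well above 1 suggests that the added factor carries real information, whereas a value below 1 suggests that the extra dimension mostly fits noise.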
Q-mode factor analysis is a multivariate technique intended to classify objects according to the interrelations among them, so that each object (row) in the data matrix is understood as a combination of hypothetical or real objects with specific parameter values. The technique consists of measuring the resemblance among objects (index of proportional similarity) after normalising the data matrix rows (objects) so that measured variables can be interpreted as proportions, $w_{nj} = x_{nj} / (\sum_{l=1}^{p} x_{nl}^2)^{1/2}$. Imbrie and Purdy (1962) defined the similarity coefficient as the cosine of the angle between any two data matrix row vectors (objects n and m), $\cos\theta_{nm} = w_n w_m'$, where $w_n = [w_{n1}\ w_{n2}\ \dots\ w_{np}]$ is a row vector of matrix W. The model is expressed as the product of a factor-loading matrix ($A_{N \times k}$) and a factor score matrix ($F_{p \times k}$), $W_{N \times p} \approx A_{N \times k} F'_{k \times p}$, and the similarity matrix can be written as $H = WW' = AF'FA'$. This matrix can be factorised (Reyment et al., 1996) to find F and A.
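The Q-mode steps described above (row normalisation, cosine similarity matrix, factorisation) can be sketched with NumPy as below. This is a minimal sketch that factorises H by a symmetric eigendecomposition and scales the loadings by the square roots of the eigenvalues, so that F′F = I and H ≈ AA′; the function name `q_mode` and the toy data are hypothetical, and the study's own numerical procedure may differ in detail.

```python
import numpy as np

def q_mode(X, k):
    """Q-mode factor model W ~ A F' with k factors (unrotated)."""
    X = np.asarray(X, dtype=float)
    # Normalise every row (object) to unit length.
    W = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Cosine-theta similarity between every pair of objects.
    H = W @ W.T
    # H is symmetric, so eigh applies; keep the k largest eigenvalues.
    eigval, eigvec = np.linalg.eigh(H)
    order = np.argsort(eigval)[::-1][:k]
    lam = eigval[order]
    A = eigvec[:, order] * np.sqrt(lam)   # N x k factor loadings
    F = W.T @ A / lam                     # p x k factor scores, F'F = I
    explained = lam / H.shape[0]          # trace(H) = N for unit-length rows
    return W, H, A, F, explained

# Hypothetical toy data: 5 objects (rows) by 3 variables (columns).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.5],
              [0.5, 3.0, 1.0],
              [4.0, 1.0, 2.0],
              [1.0, 1.0, 1.0]])
W, H, A, F, explained = q_mode(X, k=3)
```

Because the rows of W have unit length, the diagonal of H equals 1 and each eigenvalue divided by N gives the share of information carried by the corresponding factor.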
To achieve simplicity (with the elements of factor-loading vectors approaching 0 or 1) varimax orthogonal rotation, designed by Kaiser (1958) so as to maximise the variance of the factors, was applied to the calculated factor-loading matrices.
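A varimax rotation can be sketched with the widely used iterative SVD scheme shown below. This is an illustrative implementation, not necessarily the routine used in the study; the function name, the tolerance, the iteration limit and the toy loading matrix are assumptions.

```python
import numpy as np

def varimax(A, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix A (rows: variables or objects)."""
    A = np.asarray(A, dtype=float)
    p, k = A.shape
    R = np.eye(k)          # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient-like matrix of the varimax criterion.
        G = A.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt         # nearest orthogonal matrix to G
        d_new = s.sum()
        if d_new < d * (1.0 + tol):
            break          # criterion no longer improving
        d = d_new
    return A @ R, R

# Hypothetical loadings: a simple structure rotated by 30 degrees.
t = np.radians(30.0)
rot = np.array([[np.cos(t), -np.sin(t)],
                [np.sin(t),  np.cos(t)]])
S = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
A0 = S @ rot
B, R = varimax(A0)
```

Because R is orthogonal, the rotation redistributes the loadings among factors without changing the communality (row sum of squared loadings) of any variable.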

R-mode
The parameters computed for our data set are listed in Table 1. Considerable information is gained when dimensionality increases from 1 to 2 (∆k = 9.74) and from 2 to 3 (∆k = 2.59), but not from 3 to 4 (∆k = 0.74). So the best choice for dimensionality is 3, with 74% of accumulated information.
The varimax rotated factor-loading matrix is shown in Table 2 (where only loadings with modulus greater than 0.24 are represented). The first factor explains 26% of total variance, the second 20% and the third 16%, the total accumulated variance being 62%. This is the percentage of variance explained (in the entire database) without overestimating the amount of information available, according to the chi-square analysis. Space dimensionality decreased from the original 9 variables to only 3, so that, with the aid of multivariate statistical analysis, 3 main hydrogeochemical processes can explain the complexity of Cariri valley waters, as presented below.
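For standardised variables, the share of total variance explained by a factor is the sum of its squared loadings divided by the number of variables. The sketch below illustrates this arithmetic; the loading values are hypothetical and are not those of Table 2.

```python
import numpy as np

def explained_variance(loadings):
    """Per-factor and cumulative fraction of total variance explained."""
    L = np.asarray(loadings, dtype=float)
    p = L.shape[0]                         # number of standardised variables
    per_factor = (L ** 2).sum(axis=0) / p  # column sums of squared loadings
    return per_factor, per_factor.cumsum()

# Hypothetical loadings: 4 variables (rows) by 2 factors (columns).
loadings = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.7],
                     [0.2, 0.6]])
per_factor, cumulative = explained_variance(loadings)
```

With these toy numbers the two factors explain 37.5% and 22.5% of the variance, or 60% together, in the same way that 26%, 20% and 16% add up to 62% in the text.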

Figure 1
Study area location and map with outcropping areas of Araripe sedimentary basin formations
Figure 2 is a bar diagram illustrating the relative importance of the variables in the factor-loading vectors from Table 2. All 3 factors have high EC loadings. The 1st factor, explaining 26% of the entire sample set variance, shows high correlation between Ca²⁺, Mg²⁺, SO₄²⁻, alkalinity and EC. As limestone and gypsum are common minerals in the Santana formation (Ponte and Appi, 1990), this factor indicates that hydrogeochemical reactions relating precipitation/dissolution processes with calcite, dolomite, and gypsum minerals are important in water quality evolution in this area.
The 2nd factor, corresponding to 20% of total variance, is related to δ¹⁸O, Na⁺, Ca²⁺, SO₄²⁻ and alkalinity. High correlation with Na⁺ and, to a lesser extent, with Ca²⁺, SO₄²⁻, and alkalinity can be associated with ion exchange by clay minerals, abundant in the Rio da Batateira formation.
The 3rd factor, responsible for 16% of the total variance, shows high correlations with EC, Mg²⁺, K⁺ and Cl⁻ and inverse correlation with alkalinity, and so it could represent contamination of waters by domestic sewage.

Q-mode
As in the R-mode case, Q-mode factor analysis also needs to define the dimensionality of factor score space. Table 3 defines space dimensionality and shows the information carried by each factor and the total accumulated information as space dimensionality increases. With 6 factors, 93% of the information is accumulated; from the second factor on, information is more uniformly distributed among factors. The information carried by Factor 7, in 7 dimensions, and by Factors 7 and 8, in 8 dimensions, is of low importance and can be discarded.
Results from the varimax rotated factor-loading matrix calculation are given in Table 4, together with object (well) identification.
Application of the selection criteria described below to the 56 elements in factor space resulted in 11 groups (Table 5). In order to interpret the groups' characteristics, parameter means for each group were calculated (Table 6). As dimensionality is greater than 3, it is impossible to visualise the results from this procedure graphically in our three-dimensional visual space. Instead of being grouped by visual inspection, objects were analysed with respect to the angular and scalar distances between each member of the set, represented by a vector in a six-dimensional space, and the group's centroid (defined by the mean vector, with unitary modulus, calculated considering all the elements in the group). If the angular separation and the scalar distance between a given vector (object) and the group centroid in this factor score space are both less than or equal to the respective predefined cut-off values, then the analysed element becomes an element of this group.
In our analysis, a cut-off angle ($\theta_c$) of 45° was chosen, together with a cut-off scalar distance ($d_c$) equal to the distance between unitary vectors separated by the cut-off angle, i.e. $d_c = \sqrt{2[1 - \cos(\theta_c)]}$.
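The grouping rule can be sketched as below. The Euclidean distance between two unit vectors separated by an angle θ is √(2(1 − cos θ)), which gives the scalar cut-off for a given cut-off angle. The text does not state how groups are seeded or in which order objects are visited, so the sequential assignment used here (join the angularly closest admissible group, otherwise start a new one) is an assumption, as are the function names and the toy vectors.

```python
import numpy as np

def chord_cutoff(theta_c_deg):
    """Distance between two unit vectors separated by theta_c degrees."""
    t = np.radians(theta_c_deg)
    return np.sqrt(2.0 * (1.0 - np.cos(t)))

def unit_centroid(V):
    """Mean vector of the rows of V, rescaled to unit modulus."""
    m = V.mean(axis=0)
    return m / np.linalg.norm(m)

def cluster(vectors, theta_c_deg=45.0):
    """Assign each vector to a group using angular and scalar cut-offs."""
    V = np.asarray(vectors, dtype=float)
    d_c = chord_cutoff(theta_c_deg)
    groups = []                      # each group is a list of row indices
    for i, v in enumerate(V):
        best, best_angle = None, None
        for g, members in enumerate(groups):
            c = unit_centroid(V[members])
            cos = np.clip(v @ c / np.linalg.norm(v), -1.0, 1.0)
            angle = np.degrees(np.arccos(cos))
            dist = np.linalg.norm(v - c)
            admissible = angle <= theta_c_deg and dist <= d_c
            if admissible and (best is None or angle < best_angle):
                best, best_angle = g, angle
        if best is None:
            groups.append([i])       # no admissible group: start a new one
        else:
            groups[best].append(i)
    return groups

# Hypothetical two-dimensional factor-space positions.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [-1.0, 0.0]]
groups = cluster(points, theta_c_deg=45.0)
```

A larger cut-off angle merges neighbouring groups and lowers the discrimination power, while a smaller one splits the set into more groups.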
Figure 3 shows bi-dimensional plots of elements' factor score space positions, marked by geometric symbols according to groups. All graphs have Factor 1 as abscissa. To avoid overloading the graphs, only group centroids are shown. Ordinates, representing the 2nd dimension, are Factors 2 to 6, respectively. Values approaching ±1 imply increasing importance.
Table 6 shows that Group 1 (star) waters are only slightly saline and have a δ¹⁸O mean value of −3.1‰, very close to the rainwater value (≈ −3.2‰; Santiago et al., 1997). These waters represent recent recharge derived directly from rainfall. The fact that this group is the most numerous is not surprising, because its member wells exploit the uppermost unconfined aquifer in the Cariri valley. Factor 1 is the most important one to discriminate this group.

Group 2 water samples (circle) have high salinity and the lowest values of δ¹⁸O (−3.9‰). High concentration values of Ca²⁺, SO₄²⁻ and alkalinity imply that gypsum and limestone dissolution/precipitation processes are involved. These minerals are characteristic of the Santana formation lithology. Thus, this group was interpreted as recharge waters from the top of the Araripe plateau that percolated through the Santana formation, and the low δ¹⁸O could be due to the altitude effect on rainfall and/or to the presence of palaeo-waters because of the long transit time through that aquitard.
Water samples in Group 3 (diamond) show high EC combined with very high Ca²⁺ and high Mg²⁺ and SO₄²⁻ concentrations as well as high alkalinity. Major ions' mean values are similar to those of Group 2, indicating the same geochemical environment. δ¹⁸O, however, is higher (−3.0‰), pointing to an origin from rainfall at lower altitude and/or slight enrichment by evaporation during runoff. We interpret these waters as recharge that leached Araripe plateau cliff matter.
Group 4 (square) waters have as principal characteristics high δ¹⁸O values (−2.9‰) and low Ca²⁺, Na⁺ and K⁺ concentrations and EC, implying fast infiltration to the aquifer. The elevated δ¹⁸O indicates slightly evaporated water. Factor 3 is important in discriminating this group.
Group 5 (triangle) waters are characterised by very high EC and Cl⁻, high K⁺ but low Ca²⁺ and SO₄²⁻ concentrations. As Cariri valley natural waters have low Cl⁻ concentration, these waters, from urban areas, are associated with chloride pollution through residential wastewater, which is a major source of Cl⁻. The δ¹⁸O value of −3.5‰ shows that these waters are mixed with palaeo-waters (rising due to a reduction of hydraulic heads in the upper aquifer, caused by excessive pumping in well-fields for public supply). Factor 2 is of high importance in this group's discrimination.
The water samples in Group 6 (triangle) have mean parameter values near the universal mean. However, SO₄²⁻ concentrations are the smallest of all groups. The δ¹⁸O value of −3.2‰ indicates recent, fast recharge without evaporation. Factor 6 better discriminates this water type.
Group 7 waters (arrows to the right) show very low concentration of K⁺ and the lowest one for Cl⁻. The mean δ¹⁸O value of −3.1‰ indicates rainfall-derived recent recharge waters. Like Group 5, this group is discriminated by Factor 2, but with negative values near −1. In this sense, it is the opposite of Group 5.
Water samples in Group 8 (arrows to the right) have very low alkalinity, low Ca²⁺ and Mg²⁺ concentrations, and low EC. The high δ¹⁸O value (−2.7‰) reveals recent recharge waters that underwent evaporation before infiltration. Factors 1 and 4 (with positive correlation) best discriminate this group.
Group 10, with 2 elements, and Group 11, with only 1, could not be interpreted hydrogeologically, but one can see that Group 9 is near Group 10, and Group 11 is near Groups 1 and 8. If a larger cut-off angle had been adopted, thereby reducing the 'discrimination power', these groups would have been integrated into their respective neighbouring groups.

Conclusions
Multivariate statistical methods of factor analysis are shown to be an important tool for characterising hydrogeochemical processes and clustering groundwaters according to their shared hydrochemical characteristics. The 3 principal factors identified by R-mode factor analysis correspond to 3 principal processes taking place in the study area: precipitation/dissolution processes of calcium carbonate and gypsum, cation exchange processes occurring in clay layers, and processes related to anthropogenic contamination with chloride. Q-mode analysis grouped all 56 samples collected in the study area into 11 groups, detecting similarities.
The relatively high number of groups found shows the wide variety of these groundwaters. Nevertheless, the methodology applied was efficient enough to permit association of factors and groups with hydrogeological environmental features of the research area.