Comparative investigation of two different self-organizing map-based wavelength selection approaches for analysis of binary mixtures with strongly overlapping spectral lines

Purpose: To demonstrate the ability and investigate the performance of two wavelength selection approaches based on the self-organizing map (SOM) technique in partial least-squares (PLS) regression for the analysis of pharmaceutical binary mixtures with strongly overlapping spectra. Methods: Two variable selection methods, SOM1-PLS and SOM2-PLS, were compared. The main differences between them lay in the structure of the neurons in the input layer and in the algorithm for variable selection. The adjustable parameters of each technique were optimized to allow a fair comparison. Predictive ability was statistically verified using both synthetic mixtures and a real combination product of sulfamethoxazole (SMX) and trimethoprim (TMP), whose spectral lines overlap strongly. Results: SOM2-PLS was more efficient than SOM1-PLS, with 30 and 6 % improvement in predictive ability for SMX and TMP, respectively. Furthermore, the mean difference between the results obtained by the SOM2-PLS method and those obtained by the official method was not statistically significant (p > 0.01). Conclusion: Although the SOM2-PLS method is more efficient than the SOM1-PLS method for the analysis of pharmaceutical binary mixtures with severely overlapping spectra, it has drawbacks, including the difficult computation of some parameters.


INTRODUCTION
In modern analytical methods, multicomponent spectral analysis plays an important role. Many wavelengths are measured per test with minimal effort, yielding multivariate information. Several reports describe improved predictive ability when wavelength selection is carried out before multivariate calibration [1][2][3], and others describe different strategies of wavelength selection [4][5][6]. Over the past two decades, the self-organizing map (SOM) has stood out among variable selection approaches.
The self-organizing map (commonly called the Kohonen neural network) was first developed by Kohonen in 1975 [7]. The SOM algorithm is based on competitive ('winner takes all'), unsupervised learning. Its underlying concept is to map similar inputs to the same or nearby map units (neuron positions or nodes). Two wavelength selection approaches based on the SOM technique have been found useful in the pharmaceutical field [8][9][10][11]. In general, the SOM strategy can be applied in both spaces (sample space and variable space). The first method (SOM1) selects wavelengths by clustering the variables, whereas the second method (SOM2) selects wavelengths according to the measure of topological relevance (MTR) on the SOM array generated by grouping a set of samples.
The wavelength selection strategies for SOM1 and SOM2 were introduced by Todeschini et al [8,9] and Corona et al [10,11], respectively. Because they rapidly compute analytical results for large numbers of samples, chemometric methods for spectroscopy remain a common multivariate approach in in-process control in the pharmaceutical industry. This work was intended to support the analyst's decisions when developing new analytical methods. It also demonstrated the ability and investigated the performance of these two SOM methods for wavelength selection, using partial least-squares (PLS) as the linear regression method and a co-trimoxazole formulation as an example of pharmaceutical binary mixtures with strongly overlapping spectra. Co-trimoxazole formulation is commonly used for the treatment of pneumocystis pneumonia and for prophylaxis against this disease in people with HIV/AIDS-defining infections [12]. Co-trimoxazole preparation also provides almost 100 % protection against malaria [13].

EXPERIMENTAL
Reagents and chemicals
Analytical-grade reagents and pharmaceutical-grade chemicals (SMX 99.45 % and TMP 99.41 %) were used for all investigations.
The compounds and co-trimoxazole tablets were provided by Sea Pharm Co. Ltd, Thailand.
The stock solutions of SMX (100 mg/L) and TMP (25 mg/L) were accurately prepared in 95 % ethanol.

Apparatus and software
All absorption measurements were recorded on a double-beam UV spectrophotometer (UV160A, Shimadzu) attached to a computer loaded with UVPC software, using quartz cuvettes with a 10-mm path length. The instrument parameters used to record the absorption spectra of all standard and test solutions were as follows: wavelength range, 200 - 400 nm; spectral sampling interval, 1 nm. The measured data were processed on a workstation equipped with an Intel Core i3 processor, 2 GB RAM and the Windows 7 operating system. PLS modelling, SOM1 wavelength selection and SOM2 wavelength selection were performed with PLS_Toolbox [14], kohonen_cpnn_toolbox 2.0 [15] and SOM_Toolbox [16], respectively. All toolboxes were run under MATLAB [17].

Calibration and validation sets
Twenty-two standard samples in the calibration set were prepared by combining 12 samples corresponding to a central composite design (CCD) with four center points and 10 samples of pure substances.
For model validation, a validation set of nine samples was also built by randomization.
The standard concentrations used in the calibration set and validation set were spread over the linearity ranges of 8.00 - 24.23 and 2.00 - 6.22 mg/L for SMX and TMP, respectively. They are shown in Table 1.
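As an illustration of how a two-factor CCD with four center points yields 12 design points, the following minimal sketch generates the coded levels using the star-point distance of 1.414 given in the footnote of Table 1. The mapping from coded levels to concentrations (the `decode` helper) assumes that the star points coincide with the ends of the stated linearity ranges; this is an assumption for illustration only, and the actual concentrations are those listed in Table 1.

```python
import itertools
import math

def central_composite_design(n_center=4, alpha=math.sqrt(2)):
    """Return coded levels for a two-factor central composite design."""
    # 2^2 factorial points at the coded levels -1 and +1
    factorial = [list(p) for p in itertools.product((-1.0, 1.0), repeat=2)]
    # Four star (axial) points at distance alpha (about 1.414) from the center
    star = [[-alpha, 0.0], [alpha, 0.0], [0.0, -alpha], [0.0, alpha]]
    # Replicated center points
    center = [[0.0, 0.0] for _ in range(n_center)]
    return factorial + star + center

def decode(level, low, high, alpha=math.sqrt(2)):
    """Map a coded level to a concentration.

    Assumption (not from the paper): -alpha maps to `low` and +alpha to `high`.
    """
    mid = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return mid + level * half_range / alpha

design = central_composite_design()
for smx_level, tmp_level in design:
    smx = decode(smx_level, 8.00, 24.23)  # SMX range in mg/L
    tmp = decode(tmp_level, 2.00, 6.22)   # TMP range in mg/L
    print(f"{smx_level:+.3f} {tmp_level:+.3f} -> SMX {smx:6.2f} mg/L, TMP {tmp:5.2f} mg/L")
```

The design contains 4 factorial, 4 star and 4 center points; the 10 pure-substance samples of the calibration set are not part of this sketch.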

Commercial sample preparation
Commercial co-trimoxazole tablets containing a combination of 80 mg TMP and 400 mg SMX were studied. Twenty tablets were weighed, finely powdered and blended in a mortar. An accurately weighed quantity of the powder equivalent to about 50 mg of SMX was transferred to a 50-ml volumetric flask containing approximately 30 ml of 95 % ethanol. After sonication for 30 min, the volume was made up to 50 ml with the same solvent and the solid was filtered off. Then, 1 ml of the clear filtrate was pipetted into a 25-ml volumetric flask containing 5 ml of ammonia-ammonium chloride buffer (pH 10.64) and made up to volume with water. The preparation was analyzed in six replicates.

Wavelength selection by SOM methods
SOM is a kind of artificial neural network that provides a topology-preserving mapping from a space of many dimensions onto a space of few dimensions, ordinarily two dimensions (2D). In this study, the SOM comprised two layers of neurons (nodes or cells): a one-dimensional input layer and a two-dimensional output layer (also called the output map). The input layer received and transmitted the input data, while the output layer served as the competitive layer that held all weight vectors. The two layers were connected in a feed-forward manner. The SOM methods comprised the following steps.

Step 1: Input data preparation and preprocessing by mean-centering
According to the SOM1 method, each wavelength across all samples in the calibration set was assigned to one input vector, leading to an input layer made of 101 input vectors (210 - 310 nm). Therefore, each input vector was composed of 22 components. The output map was a square toroidal lattice consisting of a network of N² neurons, where N is the number of cells on each side of the network.
According to the SOM2 method, the absorbance values at all wavelengths, concatenated with the concentration value, were assigned to one input vector (observation sample), leading to an input layer made of 22 input vectors. Hence, each input vector contained 102 components (101 wavelengths + 1 concentration). The output map was produced using a hexagonal toroidal grid of neurons initialized in the space spanned by the eigenvectors corresponding to the two largest eigenvalues of the covariance matrix of the data [11]. The map size was computed from the ratio between these eigenvalues. The schema of the input vectors and output maps of SOM1 and SOM2 is shown in Figure 1.
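The two input arrangements differ only in how the same calibration matrix is sliced. The sketch below illustrates this with placeholder random numbers in place of measured spectra; the variable names, the random data and the choice to mean-center each column across samples are assumptions made for illustration, not details taken from the toolboxes used in this work.

```python
import random

N_SAMPLES = 22
WAVELENGTHS = list(range(210, 311))  # 210 - 310 nm at 1-nm steps: 101 wavelengths

random.seed(0)
# Placeholder data standing in for measured absorbances and concentrations
absorbance = [[random.random() for _ in WAVELENGTHS] for _ in range(N_SAMPLES)]
concentration = [random.uniform(8.00, 24.23) for _ in range(N_SAMPLES)]

def mean_center_columns(rows):
    """Subtract the column mean (taken across samples) from every column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

# SOM1: one input vector per wavelength, each with one component per sample
som1_inputs = [list(col) for col in zip(*mean_center_columns(absorbance))]

# SOM2: one input vector per sample, spectrum concatenated with concentration
augmented = [row + [c] for row, c in zip(absorbance, concentration)]
som2_inputs = mean_center_columns(augmented)

print(len(som1_inputs), len(som1_inputs[0]))  # 101 vectors of 22 components
print(len(som2_inputs), len(som2_inputs[0]))  # 22 vectors of 102 components
```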

Step 2: SOM algorithm processing by SOM software
Five major steps were executed within each SOM software. First, in the initiation step, the weight vectors were initialized either through a principal component analysis applied to the data or with small random values. Subsequently, all weights and input vectors were normalized to unit length.
Second, in the competitive learning step, an input vector was selected at random from the input data set, and the Euclidean distances between this input vector and all weight vectors (neurons) were calculated. The neuron with the shortest Euclidean distance was designated as the winner. Third, in the cooperation step, the winning node defined the spatial area of a topological neighborhood of excited neurons, thereby providing the basis for cooperation among neighboring neurons. Fourth, in the adaptation step, the weights of the winning neuron and of all neurons in the neighborhood around the winner were updated. Finally, in the iteration step, the next input vector was fed and the process was repeated. The weights were modified until the output map no longer changed or another termination condition was met. At the end of the SOM process, the output map was built up.
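The five steps can be sketched in a few lines of code. The version below is a simplified illustration and not the implementation used by either toolbox: it assumes a rectangular, non-toroidal grid, a Gaussian neighborhood, linearly decaying learning rate and radius, and random initialization, and it omits the normalization to unit length. The function names and all parameter values are hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Competition: return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initiation: small random weights (PCA initialization is the alternative)
    weights = {(r, c): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed shrinking radius
        # Iteration: present every input vector once per epoch, in random order
        for x in rng.sample(data, len(data)):
            winner = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Cooperation: Gaussian neighborhood centered on the winner
                grid_d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Adaptation: move the weight vector toward the input
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Toy data: two well-separated groups of two-dimensional points
rng = random.Random(1)
data = [[rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2)] for _ in range(10)]
data += [[5 + rng.uniform(-0.2, 0.2), 5 + rng.uniform(-0.2, 0.2)] for _ in range(10)]
weights = train_som(data, rows=3, cols=3)
print(best_matching_unit(weights, [0.0, 0.0]), best_matching_unit(weights, [5.0, 5.0]))
```

On this toy data the two groups are mapped to different neurons, which is the clustering behavior that both selection strategies rely on.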

Step 3: Subset selection of wavelengths among the 101 full spectrum wavelengths
In accordance with the SOM1 method, wavelengths that fell in the same place in the output map carried the same information. As described in [9], the wavelength closest to each neuron centroid was selected as the representative of all the wavelengths within the same neuron and was subjected to regression analysis.
In accordance with the SOM2 method, the unified distance matrix (U-matrix) of each component variable (wavelength and concentration variables) was calculated independently along each direction of the data space [10]. The MTR, or distance (D), between each wavelength-concentration pair of component U-matrices was estimated using the Frobenius norm as in Eq 1 [10].
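The SOM1 selection rule can be illustrated as follows. This sketch assumes that the "neuron centroid" is the trained weight vector of the neuron and that neurons attracting no wavelength contribute no representative; both are interpretations rather than details confirmed by the source. The function name and the toy values are hypothetical.

```python
import math

def select_representatives(weights, vectors):
    """Pick, for each occupied neuron, the wavelength closest to its weight vector.

    weights: {neuron_id: weight_vector}
    vectors: {wavelength_nm: input_vector}
    """
    best = {}  # neuron_id -> (distance, wavelength)
    for wavelength, vec in vectors.items():
        # Assign the wavelength to its winning neuron
        neuron = min(weights, key=lambda n: math.dist(weights[n], vec))
        d = math.dist(weights[neuron], vec)
        # Keep only the closest wavelength seen so far for that neuron
        if neuron not in best or d < best[neuron][0]:
            best[neuron] = (d, wavelength)
    return sorted(wavelength for _, wavelength in best.values())

# Toy example: two neurons, four wavelengths
toy_weights = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
toy_vectors = {210: [0.10, 0.0], 211: [0.05, 0.0], 212: [0.9, 1.0], 213: [1.3, 1.0]}
selected = select_representatives(toy_weights, toy_vectors)
print(selected)  # [211, 212]
```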

$$D(x_j, y) = \lVert U_{x_j} - U_y \rVert_F \qquad (1)$$
where U_xj is the U-matrix of the j-th wavelength component, U_y is the U-matrix of the concentration component, and ||.||_F is the Frobenius norm. The Frobenius norm was used to determine the closeness between U-matrices. The closer the distance is to zero, the more relevant the two components are to each other [10], which means that the wavelength is suitable for reconstructing the concentration. As stated by Corona et al, MTR values should be inverted (giving what is called the topological relevance index) in order to represent the relevancies clearly, with larger values indicating stronger relevancy. The median of the topological relevance index was used as the cut-off for wavelength selection. More details of the SOM1 and SOM2 methods can be found in [8,9] and [10,11], respectively.
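The selection by topological relevance can be sketched as follows, taking the component U-matrices as given. The inversion is implemented here as a reciprocal with a small constant to avoid division by zero; this particular form of inversion is an assumption, since the text only states that the MTR values are inverted. The function names and the toy U-matrices are hypothetical.

```python
import math
import statistics

def frobenius_distance(u_a, u_b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(u_a, u_b)
                         for a, b in zip(row_a, row_b)))

def select_by_relevance(u_wavelengths, u_conc, eps=1e-12):
    """Return wavelengths whose relevance index exceeds the median cut-off.

    u_wavelengths: {wavelength_nm: component U-matrix (list of rows)}
    u_conc: component U-matrix of the concentration variable
    """
    # Relevance index: assumed reciprocal of the distance D(x_j, y)
    index = {wl: 1.0 / (frobenius_distance(u, u_conc) + eps)
             for wl, u in u_wavelengths.items()}
    cutoff = statistics.median(index.values())
    return sorted(wl for wl, value in index.items() if value > cutoff)

# Toy 2 x 2 U-matrices (illustrative values only)
u_y = [[0, 1], [1, 0]]
u_x = {
    210: [[0, 1], [1, 0]],  # D = 0
    211: [[0, 1], [1, 1]],  # D = 1
    212: [[1, 0], [0, 1]],  # D = 2
    213: [[0, 0], [0, 0]],  # D = sqrt(2)
}
chosen = select_by_relevance(u_x, u_y)
print(chosen)  # [210, 211]
```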

PLS modeling
The flowchart of the development process of the SOM1-PLS and SOM2-PLS models is shown in Figure 2. In the calibration phase, the absorbance data from each SOM-based wavelength selection strategy were processed by the PLS_Toolbox program to develop each PLS calibration model. Before processing, a suitable number of factors was sought by the leave-one-out cross-validation method described by Haaland and Thomas [18]. In the prediction phase, each PLS calibration model with the optimum number of factors was applied to estimate each compound in the validation samples and pharmaceutical tablets. The performance of each calibration model was then measured by the mean percentage recovery and the root mean square error of prediction (RMSEP). The value of RMSEP was calculated as in Eq 2.

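$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \qquad (2)$$
where n is the sample size of the validation set, y_i is the actual concentration of sample i and ŷ_i is the concentration predicted by the model.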
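For reference, the two performance measures can be computed as in the short sketch below, which uses the usual definition of RMSEP (square root of the mean squared difference between predicted and actual concentrations). The function names and the numerical values are illustrative only and are not data from this study.

```python
import math

def rmsep(actual, predicted):
    """Root mean square error of prediction over the validation samples."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

def mean_recovery(actual, predicted):
    """Mean percentage recovery: 100 x predicted / actual, averaged over samples."""
    return sum(100.0 * p / a for a, p in zip(actual, predicted)) / len(actual)

# Illustrative concentrations in mg/L (not measured values)
actual = [10.0, 20.0]
predicted = [11.0, 19.0]
print(rmsep(actual, predicted))          # close to 1.0
print(mean_recovery(actual, predicted))  # close to 102.5
```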

RESULTS
Spectral characteristics
Before data collection, the impact of noise and solvent on the absorption spectra of the active ingredients and their mixtures was studied. Figure 3 shows the strongly overlapping spectra of solutions of SMX and TMP over the UV region between 200 and 400 nm, which might cause problems for regression analysis. Hence, the informative wavelength region should be selected before analysis.
Moreover, in the preliminary step, the absorption spectra in the wavelength range between 210 and 310 nm were selected, because points beyond the lower and upper ends of this range were affected by the solvent cut-off and by low absorbance values (under 0.05).

SOM-PLS modeling
To allow a fair comparison, SOM was applied in advance to select the wavelengths for PLS analysis, and the adjustable parameters of SOM, namely the number of training loops (epochs) and the SOM map size, were optimized. A test set of five synthetic mixtures was employed to optimize these parameters by trial and error.
To optimize and develop the SOM1-PLS model, the map size was varied (4 x 4, 8 x 8, 12 x 12 and 16 x 16) and, for each map size, the number of training epochs was varied (200, 500, 1000 and 2000). In this manner, 16 networks were produced. The subset of wavelengths chosen with each set of parameters was passed to the PLS algorithm. SOM1-PLS calibration models for each substance were developed and applied to the test set. The mean percentage recovery and RMSEP were calculated to measure the performance of each calibration model, and the results are illustrated in Figure 4. The sets of parameters giving the minimum RMSEP and a percentage recovery closest to 100 were chosen. The specifications of the SOM1-PLS models are summarized in Table 2. The distribution of the 101 wavelengths over the 8 x 8 map after 2000 training epochs is shown in Figure 5. The 39 selected wavelengths used to construct the models of both compounds are presented in Table 3.
To optimize and develop the SOM2-PLS model, the grid size was varied according to the ratio between the eigenvalues of the training data: 13 x 1, 26 x 2, 39 x 3 and 52 x 4 for SMX, and 12 x 1, 24 x 2, 36 x 3 and 48 x 4 for TMP. The number of training epochs was again varied (200, 500, 1000 and 2000) for each grid size. A total of 16 networks for each drug were designed in this way. For each network, the quantization error and topographic error were calculated by SOM Toolbox, and the results are illustrated in Figure 6. The sets of parameters with low quantization error and minimum topographic error were chosen to construct the output maps used to calculate the U-matrix of each component variable and the topological relevance index. Wavelengths whose topological relevance index (the inverted D(x_j, y)) was larger than the median cut-off line were considered strongly relevant and were selected to create the SOM2-PLS models. The values of the topological relevance index and the median cut-off line for SMX and TMP are shown in Figure 7. A set of 51 wavelengths from the 39 x 3 map at 500 epochs and a set of 53 wavelengths from the 36 x 3 map at 200 epochs were collected for the analysis of SMX and TMP, respectively. These sets are presented in Table 4 and Table 5. Once the wavelengths were selected, they were passed to the PLS algorithm.
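The two map-quality measures used to choose the SOM2 parameters can be illustrated with the following sketch. It uses common definitions (quantization error as the mean distance between each input and its best-matching unit; topographic error as the fraction of inputs whose first and second best-matching units are not adjacent on the grid) and, for simplicity, assumes a rectangular non-toroidal grid rather than the hexagonal toroidal grid used in this study. It is not the SOM Toolbox implementation, and the function name and toy map are hypothetical.

```python
import math

def map_quality(weights, data):
    """Return (quantization error, topographic error) for a trained map.

    weights: {(row, col): weight_vector}
    data: list of input vectors
    """
    total_distance = 0.0
    n_errors = 0
    for x in data:
        # Rank neurons by distance to the input vector
        ranked = sorted(weights, key=lambda pos: math.dist(weights[pos], x))
        first, second = ranked[0], ranked[1]
        total_distance += math.dist(weights[first], x)
        # Assumed adjacency rule: neighbors differ by at most one grid step
        adjacent = max(abs(first[0] - second[0]), abs(first[1] - second[1])) == 1
        if not adjacent:
            n_errors += 1
    return total_distance / len(data), n_errors / len(data)

# Toy 1 x 3 map that is deliberately "folded" (weights out of grid order)
toy_map = {(0, 0): [0.0], (0, 1): [2.0], (0, 2): [1.0]}
toy_data = [[0.1], [1.9]]
qe, te = map_quality(toy_map, toy_data)
print(qe, te)  # about 0.1 and 0.5
```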
The specifications of the SOM2-PLS models are summarized in Table 2. The percentage recovery values of SMX and TMP obtained from the SOM1-PLS and SOM2-PLS models are presented in Table 6.
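The trial-and-error search over map size and number of epochs amounts to evaluating every combination of the two parameter lists. The sketch below enumerates the 16 SOM1 networks and picks the combination with the lowest RMSEP, breaking ties by the recovery closest to 100; this tie-breaking order is an assumption. The `evaluate` callback stands for the whole train-select-calibrate-predict pipeline, and `toy_evaluate` is a made-up scoring function used only to make the example run; it does not reproduce the results of this study.

```python
import itertools

MAP_SIZES = [4, 8, 12, 16]        # side length N of the N x N SOM1 map
EPOCHS = [200, 500, 1000, 2000]   # number of training epochs

def grid_search(evaluate):
    """Evaluate every (map size, epochs) pair and return the best one.

    evaluate: callable(map_size, epochs) -> (rmsep, mean_recovery)
    """
    results = {(size, epochs): evaluate(size, epochs)
               for size, epochs in itertools.product(MAP_SIZES, EPOCHS)}
    # Lowest RMSEP first; ties resolved by recovery closest to 100 %
    best = min(results, key=lambda k: (results[k][0], abs(results[k][1] - 100.0)))
    return best, results

def toy_evaluate(map_size, epochs):
    """Made-up score for demonstration; NOT data from the paper."""
    return abs(map_size - 12) + abs(epochs - 500) / 1000.0, 100.0

best, results = grid_search(toy_evaluate)
print(len(results), best)  # 16 networks; best is (12, 500) for this toy score
```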

Application to pharmaceutical dosage form
The calibration models constructed from the wavelengths chosen earlier by the SOM techniques were applied to the determination of the active ingredients in co-trimoxazole tablets in accordance with the preceding procedures. With the SOM1-PLS models, the quantities of SMX and TMP in the samples ranged from 97.59 to 98.66 % LA (percent of the labeled amount) and from 105.95 to 108.29 % LA, respectively, while with the SOM2-PLS models, the quantities of SMX and TMP ranged from 99.20 to 100.60 % LA and from 100.48 to 102.42 % LA, respectively. The acceptance limits for both drugs according to the official method in the United States Pharmacopeia 39th ed. (USP 39) are between 93.0 and 107.0 % LA [19]. The Student's t-test data (at the 0.01 significance level) for the statistical significance of the difference between the means of the SOM-PLS methods and the HPLC official method are provided in Table 7.
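The reported ranges can be compared with the USP acceptance limits by simple arithmetic, as in the sketch below. The values are those quoted in the paragraph above; the data structure and function name are illustrative, and the check concerns only the reported minimum and maximum, not the individual replicates.

```python
USP_LIMITS = (93.0, 107.0)  # acceptance limits in % LA, as quoted from USP 39

# Reported (minimum, maximum) assay results in % LA
reported = {
    ("SOM1-PLS", "SMX"): (97.59, 98.66),
    ("SOM1-PLS", "TMP"): (105.95, 108.29),
    ("SOM2-PLS", "SMX"): (99.20, 100.60),
    ("SOM2-PLS", "TMP"): (100.48, 102.42),
}

def within_limits(low, high, limits=USP_LIMITS):
    """True when the whole reported range lies inside the acceptance limits."""
    return limits[0] <= low and high <= limits[1]

status = {key: within_limits(*value) for key, value in reported.items()}
for (model, drug), ok in status.items():
    print(f"{model} {drug}: {'within' if ok else 'outside'} limits")
```

Under this check, only the upper end of the SOM1-PLS range for TMP (108.29 % LA) lies above the 107.0 % LA limit.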

DISCUSSION
The findings of this study showed that SOM was an efficient option for wavelength selection when establishing new analytical approaches for quality assurance in the pharmaceutical industry. The outcomes in Table 6 indicated that both the SOM1-PLS and SOM2-PLS models offered good recoveries. In terms of RMSEP in particular, the SOM2-PLS models provided a 30 % enhancement of predictive ability for SMX and 6 % for TMP compared with the SOM1-PLS models. This suggests that the SOM2 strategy achieved better selection because of its supervised criterion for wavelength selection. Furthermore, from Table 7, there was a statistically significant difference between the mean scores of the SOM1-PLS method and the official method (p < 0.01), while there was no statistically significant difference (p > 0.01) between the mean scores of the SOM2-PLS method and the official method.
Two reasons may explain why the SOM2-PLS method provided better estimates than the SOM1-PLS method. First, the SOM2 algorithm chose informative wavelengths on the basis of the relation between the wavelength variables and the concentration of the analytes, whereas the SOM1 algorithm selected wavelengths by clustering collinear wavelengths without considering the concentration variable. Second, the SOM2-PLS model reduced noise because it required a lower optimum number of PLS factors than the SOM1-PLS model (Table 2). Hence, the number of factor loadings decreased.
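The percentage enhancement quoted above is presumably a relative reduction in RMSEP. Under that assumption (the source does not state the formula), it would be computed as:

```latex
\text{Improvement (\%)} = \frac{\mathrm{RMSEP}_{\mathrm{SOM1\text{-}PLS}} - \mathrm{RMSEP}_{\mathrm{SOM2\text{-}PLS}}}{\mathrm{RMSEP}_{\mathrm{SOM1\text{-}PLS}}} \times 100
```

so that a value of 30 % would mean that the SOM2-PLS prediction error is 0.70 times that of SOM1-PLS.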

CONCLUSION
A comparison of the SOM1 and SOM2 methods as wavelength selection techniques in PLS regression for the analysis of pharmaceutical binary mixtures with severely overlapping spectra has been successfully undertaken. Although the SOM2 method is more efficient than the SOM1 method, it has drawbacks, including the difficult computation of the topological relevance index values and of the distance values between each pair of input and output variables estimated from the U-matrices.

Figure 2: Flow chart showing development process of two different SOM-PLS models

Figure 6: Quantization error and topographic error of SOM2 models. a) SMX; b) TMP

Figure 7: Topological relevance index and median cut-off line obtained from SOM2 models. a) SMX; b) TMP

Table 1: Combination of SMX and TMP in calibration set corresponding to central composite design
a Star point of 2 factors in CCD (= ±1.414)

Table 3: Set of 39 wavelengths for SOM1-PLS models to analyze SMX and TMP

Table 4: Set of 51 wavelengths for SOM2-PLS model to analyze SMX

Table 5: Set of 53 wavelengths for SOM2-PLS model to analyze TMP

Table 6: Results from applying SOM1-PLS and SOM2-PLS models to the validation samples
a Relative standard deviation

Table 7: Statistical results obtained from the proposed and official methods