RESEARCH ARTICLE
Year : 2009  Volume : 5  Issue : 20  Page : 279-286

Discrimination of Radix Pseudostellariae According to Geographical Origin by FT-NIR Spectroscopy and Supervised Pattern Recognition

Bangxing Han^{1}, Naifu Chen^{2}, Yong Yao^{3}

^{1} School of Pharmacy, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013; Engineering Technology Research Center of Plant Cell Engineering, Anhui Province, Lu’an, 237012, China
^{2} Engineering Technology Research Center of Plant Cell Engineering, Anhui Province, Lu’an, 237012, China
^{3} Xuancheng Jinquan Ecological Agriculture Co. Ltd., Xuancheng 242000, China

Radix Pseudostellariae is one of the most popular Traditional Chinese Medicines (TCM), with a long history of use in China and some other Asian countries for promoting the immune system and treating asthenia after illness. Rapid discrimination of R. Pseudostellariae according to geographical origin is crucial to the control of its pharmacodynamic action. In this work, FT-NIR spectroscopy combined with supervised pattern recognition was used to discriminate R. Pseudostellariae according to geographical origin. LDA, ANN and SVM discrimination models were each constructed on PCA scores. The number of PCs and the model parameters were optimized by cross-validation during model construction, and the performances of the three discrimination models were compared. Experimental results showed that the SVM model performed best among the three models. The optimal SVM model was achieved when 5 PCs were used, with discrimination rates of 100% in the training set and 88% in the prediction set. The overall results demonstrate that FT-NIR spectroscopy, coupled with an appropriate supervised pattern recognition technique, has high potential for qualitatively discriminating R. Pseudostellariae according to geographical origin.
Introduction

Radix Pseudostellariae, the root of Pseudostellaria heterophylla (Miq.) Pax ex Pax et Hoffm, known as 'Taizishen', has a long history of use in Asian countries such as China and Korea and is now classified as a traditional Chinese herbal medicine in common use [1]. Studies have demonstrated multiple pharmacological effects, such as antioxidant, immune-promoting, antidepressant and antifatigue activities, as well as the treatment of night sweating and asthenia after illness [2],[3]. However, because ecological factors such as soil, temperature, illumination and moisture content differ between regions, R. Pseudostellariae of different geographical origins differs in composition and properties [4]. R. Pseudostellariae is widely distributed and cultivated throughout China, including Anhui, Fujian, Jiangsu, Henan, Zhejiang, Shandong and Guizhou provinces. For this reason, R. Pseudostellariae of different geographical origins has often been confused, and the authenticity of R. Pseudostellariae based medicine is compromised. It is also difficult to select R. Pseudostellariae from famous regions for curing diseases. The discrimination of R. Pseudostellariae according to geographical origin therefore remains a focus of current research.

However, it is not easy to determine geographical origin by evaluating external appearance. The current discrimination of R. Pseudostellariae is restricted to a few chemical analysis tools such as TLC, UV, GC-MS and HPLC [5],[6],[7]. Chemical differentiation of R. Pseudostellariae is of great importance in the science of TCM, especially when the origin is to be verified, because R. Pseudostellariae is a complex mixture of organic as well as inorganic compounds, the composition of which is influenced by many varying factors. In the holistic theory of TCM, a medicine takes effect in curing diseases as a whole. R.
Pseudostellariae is composed of tens of major components, such as polysaccharides, saponins, flavones, cyclopeptides, amino acids and microelements [4],[5],[6],[7],[8]. We cannot select only a limited number of specific components as essential screening criteria. Furthermore, these chemical analysis methods are all time-consuming, labor-intensive and expensive, and they require large amounts of organic reagents. We must thus conclude that R. Pseudostellariae cannot yet be discriminated and identified very well using only a conventional method. Therefore, a rapid, reliable, accurate and nondestructive analytical method is required to discriminate the different habitats for the quality control of R. Pseudostellariae.

Fourier transform near-infrared (FT-NIR) spectroscopy, a nondestructive, rapid, cost-effective and integrity-emphasizing method, has important practical utility in identifying and distinguishing TCM according to geographical origin [9],[10],[11],[12]. NIR is an optical technique that involves measurements between the visible and the mid-IR regions of the electromagnetic spectrum; it measures overtones and combinations of fundamental vibrations from the mid-IR region: O-H, N-H, S-H and C-H [13]. It presents an intriguing alternative, requiring no sample preparation while offering rapid (seconds rather than minutes), noninvasive and nondestructive sample analysis; moreover, it does not require organic reagents, particularly when solid samples are used [14]. In particular, this method follows the holistic principle of traditional Chinese medicine and preserves the original natural properties and compatibility of TCM [9]. For these reasons, NIR techniques have found widespread application in TCM and the pharmaceutical sciences over the past several years [9],[10],[15],[16]. The works mentioned above show that the FT-NIR spectroscopy technique has high potential for the quantitative analysis of some active components in TCM.
Supervised pattern recognition refers to techniques in which a priori knowledge about the category membership of samples is used for classification. The classification model is developed on a training set of samples with known categories. The model performance is then evaluated on samples from a prediction set by comparing their predicted categories with their true categories. FT-NIR spectroscopy combined with supervised pattern recognition is also used to tackle classification problems [17]. Recently, the NIR spectroscopy technique has been applied to the discrimination of Fritillary, Ganoderma lucidum, and trace elements in Italian virgin olive oils according to geographical origin [9],[10],[18],[19]. However, no studies have so far been reported on the application of FT-NIR spectroscopy to the identification of the cultivation origins of R. Pseudostellariae.

In the present study, a rapid method for classifying R. Pseudostellariae samples by geographical origin was investigated for the first time using FT-NIR spectroscopy combined with pattern recognition techniques. Three well-known supervised pattern recognition algorithms were used to develop the discrimination models: Linear Discriminant Analysis (LDA), Artificial Neural Network (ANN), and Support Vector Machine (SVM). Among them, LDA is a linear method, whereas ANN and SVM are nonlinear methods. Principal component analysis (PCA) was conducted on the NIR data to extract principal components (PCs) as the inputs of the supervised pattern recognition models. Three spectral preprocessing methods, namely Standard Normal Variate Transformation (SNV), Multiplicative Scatter Correction (MSC), and derivatives (first and second), were compared. The number of PCs was optimized by cross-validation.

Materials and Methods

Materials. All of the samples were collected from local drug shops.
Their fresh roots, however, originated from four provinces of PR China (i.e., Jiangsu, Henan, Fujian and Guizhou), except for the samples from Anhui Province, which were provided by Xuancheng Jinquan Ecological Agriculture Co., Ltd. (Anhui, China). All the samples were dried in a forced-draught oven from Shanghai Jinghong Pharmacy Machine Co. (Shanghai, China) at 100 °C for about 10 h upon acquisition. In view of the heterogeneity of the samples, the R. Pseudostellariae materials were crushed into powder with a pulverizer made by Wuyi Yili Pharmacy Machine Co. (Zhejiang, China) and sieved to below 80 mesh before spectra collection; the sieved powders were used for further analysis. All of the roots were identified by Professor Dequn Wang of Anhui College of TCM. Voucher specimens are deposited in the Pharmacognosy Laboratory, School of Pharmacy, Jiangsu University.

Spectra collection. The NIR spectra were scanned on an Antaris II near-infrared spectrophotometer (Thermo Electron Co., USA) with an integrating sphere. The NIR measurements were performed within the region 4000 to 10,000 cm^{-1}. Each spectrum was the average of 32 scans, and the raw data were measured at 3.856 cm^{-1} intervals, which resulted in 1557 variables. About 1.0 g of each powdered sample was filled into a glass sample cup. Each sample spectrum was collected three times, and the mean of the three spectra collected from the same sample was used for further analysis. The temperature was kept around 25 °C, while the humidity was kept at an ambient level in the laboratory. All spectra were recorded as log (1/R), where R is the relative reflectance.

Spectra preprocessing. [Figure 1]a shows the raw spectral profile of R. Pseudostellariae.
NIR spectra are affected by both the concentration of the chemical constituents and the physical properties of the analyzed product; the latter account for the majority of the variance among spectra, while the variance due to chemical composition is considered to be small [20]. It is therefore necessary to perform mathematical pretreatments to reduce systematic noise such as baseline variation, light scattering and path length differences. In this study, three spectral preprocessing methods were compared: Standard Normal Variate Transformation (SNV), Multiplicative Scatter Correction (MSC), and derivatives (first and second). SNV is a mathematical transformation of the log (1/R) spectra used to remove slope variation and to correct for scatter effects. MSC is used to correct additive and multiplicative effects in the spectra. First and second derivatives eliminate baseline drifts and enhance small spectral differences [21]. When the results obtained with the preprocessing methods were compared, SNV was as good as MSC and much better than the first and second derivatives. Powdered R. Pseudostellariae roots are particulate solids that readily scatter light during spectra collection, and SNV can remove the slope variation and correct the light scatter caused by different particle sizes. Therefore, SNV preprocessing was used in this work. The NIR spectra after SNV preprocessing are shown in [Figure 1]b.

Software. All algorithms were implemented in Matlab V7.0 (Mathworks, USA) under Windows XP for data processing. Result software (Antaris II System, Thermo Electron Co., USA) was used for NIR spectral data acquisition.

Results and Discussion

Principal component analysis. PCA is often the first step of data analysis, used to detect patterns in the measured data.
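The pretreatment and PCA steps described so far can be illustrated with a minimal sketch. It is not the authors' Matlab code: the function names (`snv`, `msc`, `explained_variance_ratio`) are hypothetical, the spectra are synthetic, and MSC is assumed to use the mean spectrum as its reference, which the article does not specify.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference (assumed: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)   # fit s ~ intercept + slope * ref
        corrected[i] = (s - intercept) / slope
    return corrected

def explained_variance_ratio(X):
    """Fraction of the total variance carried by each principal component of X."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)

# Synthetic stand-in for log(1/R) spectra (NOT data from the study):
# a smooth baseline plus one band whose height varies between samples,
# distorted by a random offset and gain that mimic scatter effects.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 3.0, 50)
base = np.sin(axis) + 1.0
band = np.exp(-((axis - 1.5) ** 2) / 0.02)
conc = rng.uniform(0.0, 1.0, size=(10, 1))
gains = rng.uniform(0.8, 1.2, size=(10, 1))
offsets = rng.uniform(-0.2, 0.2, size=(10, 1))
raw = offsets + gains * (base + conc * band)

ratios = explained_variance_ratio(snv(raw))
print(ratios[:3])   # variance share of PC1, PC2, PC3 for the synthetic set
```

Both corrections map two spectra that differ only by an offset and a gain onto the same curve, which is the sense in which they remove scatter effects while leaving compositional differences for PCA to pick up.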
Although PCA is an unsupervised pattern recognition method, it can reveal data trends in a low-dimensional space that can be visualized. To visualize the cluster trends of these samples, a scatter plot was obtained using the top three principal components (i.e., PC1, PC2 and PC3) from PCA. [Figure 2] shows a 3D plot constructed from PC1, PC2 and PC3, in which each R. Pseudostellariae sample is labeled according to its geographical origin (i.e., Anhui, Jiangsu, Henan, Fujian and Guizhou Province). All R. Pseudostellariae samples appear clustered along the three principal component axes, confirming the presence of five groups. PC1 explains 82.2% of the variance, PC2 explains 9.8%, and PC3 explains 3.8%. The total cumulative contribution of PC1, PC2 and PC3 is 95.8%. Therefore, the 3D representation of the PC1, PC2 and PC3 scores for the 300 samples explains 95.8% of the raw spectral information from all samples. [Figure 2] shows that the five groups are separated in the 3D space represented by the first three principal components. Such good classification in this 3D space can be explained by the chemical background of R. Pseudostellariae and by the PCA method. R. Pseudostellariae can exhibit considerable differences in chemical characteristics according to geographical origin, and these differences can reasonably be distinguished in NIR spectra. Therefore, NIR spectroscopy data can exhibit the cluster trend of R. Pseudostellariae samples according to geographical origin by means of PCA.

Discrimination model of supervised pattern recognition. Geometrical exploration of the 3D plot from PCA only gives the cluster trend of the samples. Moreover, it is not perfect: the lack of a definite index describing the sample differences lowers the credibility of the results. Therefore, actual discrimination of R.
Pseudostellariae according to geographical origin, by means of NIR spectral data and supervised pattern recognition, was carried out in the following studies. In this work, all 300 samples were divided into two subsets. One subset, called the training set, was used to build the model; the other, called the prediction set, was used to test the reliability of the model. The training set contained 200 samples, and the remaining 100 samples constituted the prediction set. Before the discrimination models were developed, principal component vectors were extracted by PCA as the model inputs. Three supervised pattern recognition algorithms (LDA, ANN and SVM) were each used to develop a discrimination model. The number of PCs was optimized by cross-validation.

Linear Discriminant Analysis. Linear Discriminant Analysis (LDA) is a linear, parametric method with discriminating character. LDA focuses on finding optimal boundaries between classes. The number of principal component factors is crucial to the performance of the LDA discrimination model, so the discrimination rates obtained by cross-validation were used to optimize the number of PCs. [Figure 3] shows the cross-validation discrimination rates of the LDA model for different numbers of PCs. The optimal number of PCs is the one that gives the highest discrimination rate in cross-validation. As shown in [Figure 3], the optimal LDA model is achieved when PCs = 4. The discrimination rate is 88% in the training set and 86% in the prediction set.

Artificial Neural Networks. The linear model did not provide a complete solution to the classification problem. Therefore, a nonlinear approach, the artificial neural network (ANN), was also used in this work. ANNs are widely used for discrimination, and many studies have shown that the ANN is an effective and powerful discrimination model [22]. Since the first simple neural network was developed by McCulloch and Pitts in 1943 [23], many types of ANN have been proposed.
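The PCA-plus-LDA workflow described above, with the number of PCs chosen by cross-validation, can be sketched as follows. This is a minimal illustration and not the authors' Matlab implementation: the helper names (`pca_fit`, `lda_fit`, `lda_predict`, `cv_rate`) are hypothetical, the LDA variant assumes a pooled covariance matrix with equal class priors, five-fold cross-validation is an assumption (the article does not state the scheme), and the data are synthetic.

```python
import numpy as np

def pca_fit(X, n_pcs):
    """Return the column mean of X and its first n_pcs loading vectors."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pcs].T

def lda_fit(T, y):
    """Fit LDA on scores T, assuming a pooled covariance and equal priors."""
    classes = np.unique(y)
    means = np.array([T[y == c].mean(axis=0) for c in classes])
    centered = T - means[np.searchsorted(classes, y)]
    pooled = centered.T @ centered / (len(y) - len(classes))
    W = np.linalg.solve(pooled, means.T)       # one weight column per class
    b = -0.5 * np.sum(means.T * W, axis=0)     # class-specific offsets
    return classes, W, b

def lda_predict(model, T):
    """Assign each row of T to the class with the largest discriminant score."""
    classes, W, b = model
    return classes[np.argmax(T @ W + b, axis=1)]

def cv_rate(X, y, n_pcs, n_folds=5, seed=0):
    """Cross-validated discrimination rate: share of held-out samples assigned correctly."""
    idx = np.random.default_rng(seed).permutation(len(y))
    hits = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        mean, P = pca_fit(X[train], n_pcs)     # PCA is refit inside each fold
        model = lda_fit((X[train] - mean) @ P, y[train])
        hits += int(np.sum(lda_predict(model, (X[fold] - mean) @ P) == y[fold]))
    return hits / len(y)

# Synthetic stand-in for spectra (NOT data from the study):
# 5 origins x 20 samples x 50 variables.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(5), 20)
class_profiles = rng.normal(scale=3.0, size=(5, 50))
X = class_profiles[y] + rng.normal(scale=0.3, size=(100, 50))

rates = {n: cv_rate(X, y, n) for n in range(1, 8)}
best_n = max(rates, key=rates.get)   # first (smallest) n among ties
print(rates, best_n)
```

Refitting PCA inside each fold keeps the held-out samples out of the loading estimation, so the cross-validated rate is not optimistically biased by the projection step.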
The Back Propagation Artificial Neural Network (BPANN) is the most widely used ANN model and is used in this study. As with other supervised pattern recognition methods, many parameters influence the performance of BPANN models to some extent. These parameters include the number of neurons in the hidden layer, the scale function, the learning rate factor, the momentum factor and the initial weights. In our modeling, a three-layer BPANN with a single hidden layer was used. The parameters of the BPANN models were optimized by cross-validation as follows: the number of neurons in the hidden layer was set to 3, the learning rate factor and momentum factor were set to 0.1, the initial weight was set to 0.3, and the scale function was set to the 'tanh' function. It is crucial to select the appropriate number of PCs in constructing an ANN model. [Table 1] shows the cross-validation discrimination rates of the ANN model according to the number of PCs. The optimal ANN model is obtained when 5 PCs are used. The discrimination rate of this BPANN model is 97% in the training set and 98% in the prediction set.

Support vector machine. The support vector machine (SVM) is a supervised learning technique based on the statistical learning theory proposed by Vapnik and Chervonenkis [24], and it has been successfully applied to mid- and near-infrared classification tasks such as material identification [25] and food discrimination [26]. SVM originated from two-class classification problems, in which it creates an 'optimal' boundary (hyperplane) between two classes in a vector space, independently of the probability distributions of the two data sets. If a linear boundary in the low-dimensional input space is not enough to separate the two classes properly, it is possible to create a hyperplane that allows linear separation in a higher-dimensional feature space.
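One common way to obtain such a higher-dimensional feature space implicitly is a kernel function. The following minimal sketch computes a Gaussian (RBF) kernel matrix. It assumes the widely used form K(x, z) = exp(-gamma * ||x - z||^2); the article does not state which parameterization its software used, and the function name `rbf_kernel`, the value of `gamma` and the example points are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix with K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

# Four XOR-like points: no straight line separates {0, 1} from {2, 3} in the
# input plane, which is the situation a nonlinear kernel is meant to handle.
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
K = rbf_kernel(points, points, gamma=0.5)
print(np.round(K, 3))
```

Each entry of K acts as a similarity between two samples: it equals 1 for identical points and decays with distance at a rate set by gamma, so gamma controls how local, and therefore how complex, the resulting decision boundary can become.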
Readers can find more detailed information about SVM in the references and tutorials [27],[28]. SVM was also used in this work. Optimization of the parameters is the key step in SVM, as their combined values determine the boundary complexity and thus the classification performance [29]. The kernel function must be selected first. There are several classical kernel functions, such as the Gaussian kernel function (also called the RBF kernel function), the polynomial kernel function and the linear kernel function. As a nonlinear kernel function, the RBF kernel function is more capable than the linear kernel function of handling nonlinear relationships between the characteristic response signals and the results; moreover, its concise structure reduces the complexity of the training process. In general, the RBF kernel function is the preferred choice when no prior knowledge is available. To obtain good performance from the SVM model, the parameters of the kernel function (the regularization parameter C and γ) also have to be optimized. [Figure 4] is a contour plot of the optimization of parameters C and γ for the model using the RBF kernel. The optimal SVM model is achieved when the optimized values of C and γ2 are 800 and 2.5, respectively. After the two parameters of the SVM model were determined, the number of PCs was also optimized by cross-validation, with the optimal number again being the one that gives the highest cross-validation discrimination rate. The optimal SVM model is achieved when PCs = 5. The discrimination rates of this optimal SVM model are 100% in the training set and 88% in the prediction set.

Conclusions

The results described in this research open the possibility of discriminating R. Pseudostellariae according to geographical origin using FT-NIR spectroscopy and supervised pattern recognition, such as LDA, ANN and SVM models.
[Table 2] shows the discrimination results of the LDA, ANN and SVM models in the training and prediction sets. As shown in [Table 2], the discrimination rates of the LDA model are 88% in the training set and 86% in the prediction set when PCs = 4, and those of the ANN model are 97% in the training set and 98% in the prediction set when PCs = 5. By comparison, the discrimination rates of the SVM model are 100% in the training set and 88% in the prediction set when PCs = 5. Judged by the discrimination results in the training sets, the SVM model is the best. Judged by the results in the prediction sets, however, the ANN model is the best, and both nonlinear models are superior to the linear model. In general, nonlinear methods are stronger than linear methods in self-learning and self-adjustment, which is consistent with these prediction results. The SVM model nevertheless performs markedly worse in the prediction set than in the training set; the reason is possibly that the parameters of the kernel function (the regularization parameter C and γ) have to be optimized further to construct a better SVM model. It can be concluded that the FT-NIR spectroscopy technique combined with pattern recognition has high potential for discriminating other TCM according to geographical origin. Further research will be devoted to ultimately removing the misclassifications. Multi-identification is expected to be the simplest solution to overcome this limitation, because the identification probability is high enough to classify.

Acknowledgements

This work has been financially supported by the Natural and Science Foundation of Anhui Educational Committee of China (Grant No. KJ2008B330) and the Innovation Fund for small and medium-sized enterprises of Anhui Province of China (Grant No. cz3401122).

References


