Principal Component Analysis

The multivariate statistical method of Principal Component Analysis (PCA) is a very useful tool for reducing the number of variables in a data set and for obtaining useful two-dimensional views of a multi-dimensional data set. As explained above, the data matrix consisting of fifteen elements can be considered to exist in fifteen-dimensional space (since this would be the number of dimensions required to simultaneously plot all of the variables against one another). For a data set with a multivariate normal distribution, the plotted data points form a "cloud" which has an oval to circular cross-section in any particular direction. A three-dimensional version can be pictured as a (flattened) football-shaped cloud of data points.

A Principal Component Analysis of the data set will determine a set of mutually perpendicular axes (called eigenvectors) within the space defined by the dimensions of the data set. There will be the same number of axes as variables/dimensions; the longest axis is the First Principal Component (PC1), the next longest axis is the Second Principal Component (PC2), etc. In the example of a three-dimensional football shape, PC1 is the axis running through the football tip to tip, and PC2 and PC3 are two equal perpendicular axes through the equator of the football. If the football is deflated and flattened a bit, then PC2 and PC3 are no longer equal; PC2 is by definition the longer of the two.

The nature of the Principal Components does not change if the data set is not multivariate normal, or consists of several subgroups of data. PC1 will still be the longest possible axis running through the data points (the direction of greatest variance), PC2 will be the longest possible axis perpendicular to PC1, and so forth.
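The geometric description above corresponds to an eigendecomposition of the covariance matrix. The following is a minimal sketch in Python with NumPy, not the SAS procedure used in this study; the synthetic three-variable "football" data and the function name `principal_axes` are assumptions made purely for illustration.

```python
import numpy as np

def principal_axes(data):
    """Return eigenvalues (largest first) and the matching eigenvectors
    (as columns) of the covariance matrix of `data`.
    Rows of `data` are samples, columns are variables."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort so that PC1 comes first
    return eigvals[order], eigvecs[:, order]

# Hypothetical "football": long along x, shorter along y, shortest along z.
rng = np.random.default_rng(0)
football = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 1.0])

eigvals, eigvecs = principal_axes(football)
```

In this sketch the first column of `eigvecs` points nearly along the x direction (tip to tip), and the eigenvalues give the variance along each axis, so their order defines PC1, PC2, and PC3.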

The advantage of defining these Principal Component axes is that the axes can now be used to define planar sections through the data set. If one makes a slice through the cloud of data points using the plane defined by PC1 and PC2 and projects all of the data points onto this plane, then it becomes a two-dimensional representation of the data retaining the maximum variation (and hopefully information) contained in the multivariate data set. In many cases, this is the best two-dimensional representation of the multi-dimensional system. Similarly, if the multi-dimensional data contain multiple separate clusters of points, the plane of PC1 and PC2 will often provide a view of the maximum two-dimensional separation between them. Depending upon the distribution of the data set and the intended goal of the analysis, this will often be the best two-dimensional representation of a set of multi-dimensional data clusters.
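The projection onto the PC1-PC2 plane described above can be sketched as follows. This is an illustrative Python/NumPy example under stated assumptions: the five-variable synthetic cloud and the helper name `project_onto_pcs` are hypothetical and are not part of the original analysis.

```python
import numpy as np

def project_onto_pcs(data, n_components=2):
    """Project mean-centred data onto its first `n_components` principal axes.
    Returns the projected coordinates and the fraction of variance retained."""
    centred = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = centred @ eigvecs[:, order]            # coordinates in the PC plane
    retained = eigvals[order].sum() / eigvals.sum()
    return scores, retained

# Hypothetical five-variable cloud with two dominant directions of spread.
rng = np.random.default_rng(1)
cloud = rng.normal(size=(300, 5)) * np.array([4.0, 3.0, 1.0, 0.5, 0.5])

scores, retained = project_onto_pcs(cloud)
```

Here `scores` holds one (PC1, PC2) coordinate pair per sample, ready for a scatter plot, and `retained` is the share of the total variance that survives the reduction to two dimensions.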

Furthermore, each variable in the analyzed data set can be assessed concerning its contribution to the overall distribution of the data set. This is done by correlating the direction of maximum spread of each variable with the direction of each Principal Component axis (eigenvector). If one particular variable has a much larger range of values than others (for example, if it is responsible for stretching out a 3-D sphere of data points into an elongate football shape), then the direction of maximum spread for this variable will strongly correspond to PC1. A high correlation between PC1 and a variable indicates that the variable is associated with the direction of the maximum amount of variation in the data set. More than one variable may correspond strongly with PC1; that is, more than one variable may have a strong influence on the distribution of the data. Similarly, if the whole data set contains two data clusters and a single variable corresponds strongly with PC1, then that variable may be responsible for the separation and unique definition of the two data groups. A strong correlation between a variable and PC2 indicates that the variable is responsible for the next largest variation in the data perpendicular to PC1, and so on.

Conversely, if a variable does not correspond to any PC axis, or corresponds only with high-number PC axes, this usually suggests that the variable has little or no influence on the distribution of the data set. Therefore, Principal Component Analysis may often indicate which variables in a data set are important and which ones may be of little consequence. Some of these low-performance variables might therefore be "weeded out" and removed from consideration in order to simplify the overall analyses.
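The link between a variable's spread and its correlation with PC1 can be illustrated with a small sketch. The data here are an assumption (three independent synthetic variables, the first with a much larger range), and the calculation uses the covariance matrix so that the wide-ranging variable is free to dominate.

```python
import numpy as np

rng = np.random.default_rng(2)
# Assumed data: three independent variables, the first with a much larger spread.
data = rng.normal(size=(400, 3)) * np.array([6.0, 1.0, 1.0])

centred = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]       # direction of greatest variance
pc1_scores = centred @ pc1                 # position of each sample along PC1

# Correlation of each original variable with the PC1 scores.
corr_with_pc1 = np.array(
    [np.corrcoef(centred[:, j], pc1_scores)[0, 1] for j in range(data.shape[1])]
)
```

In this example the first variable correlates almost perfectly with PC1 (in absolute value; the sign of an eigenvector is arbitrary), while the other two show correlations near zero and would be candidates for "weeding out".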

For PCA, the eigenvectors can be calculated from either the covariance matrix or the correlation matrix of the data set. The latter is commonly used when different variables in the data set are measured in different units, or if different variables have strongly different variances. Using the correlation matrix effectively rescales all of the variables so that their variances are equal. This can be a significant concern with geochemical data, since some elements typically have a much broader range of concentrations than others in the samples.
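The difference between the two choices can be sketched numerically. The example below is illustrative only: the two synthetic variables on very different scales are assumptions, not values from the chert data set.

```python
import numpy as np

rng = np.random.default_rng(3)
# Assumed data: two variables on very different scales.
raw = rng.normal(size=(200, 2)) * np.array([1000.0, 0.1])

# Standardise each variable to zero mean and unit variance.
standardised = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

corr_matrix = np.corrcoef(raw, rowvar=False)
cov_of_standardised = np.cov(standardised, rowvar=False)

# PCA on the correlation matrix is PCA on the covariance matrix
# of the standardised variables.
same = np.allclose(corr_matrix, cov_of_standardised)

cov_eigvals = np.linalg.eigvalsh(np.cov(raw, rowvar=False))
corr_eigvals = np.linalg.eigvalsh(corr_matrix)
```

With the covariance matrix, the large-scale variable accounts for essentially all of the variance; with the correlation matrix, the eigenvalues sum to the number of variables and each variable starts on an equal footing.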

This concludes a quick overview of Principal Component Analysis and its potential benefits. PCA was conducted for the Prairie du Chien data set using the SAS procedure princomp.

 

Element Selection

Out of a total of 55 elements analyzed in the INAA process, only fifteen were consistently present in both standard and unknowns. However, it is quite possible that out of the fifteen elements, only a few provide most of the meaningful information to be found in the geochemical data. Also, some elements may have high degrees of intercorrelation, in essence producing redundant information. Principal Component Analysis is in many cases a good method for addressing these possibilities. The elements which provide little information to the data set are likely to be adding "noise" to the analyses. Therefore it might be desirable to narrow down the number of variables.

The SAS procedure princomp was run using the correlation matrix of the geochemical data set. Part of the output is in the form of the eigenvector/element correlation matrix found in Table 3. As explained above, those elements which correlate highly with the first two or three eigenvectors (Principal Components) are the variables with the greatest variability in the data set. As can be seen in the table, PC1 correlates moderately well with five elements (Ce, Fe, La, Sc, and Sm), and PC2 correlates moderately well with four elements (Al, Cr, Cs, and Eu); two of these nine elements (Al and Eu) also correlate moderately with PC3. Therefore these nine elements were chosen as the best variables to represent the Prairie du Chien data set, and the others were excluded from further Principal Component analyses.

 

            PC1        PC2        PC3        PC4        PC5        PC6        PC7        PC8
Al     0.091030   0.313974   0.553262  -0.149033   0.191024   0.191860  -0.154082   0.112446
Mn     0.210519  -0.214412   0.032990  -0.357778   0.322334   0.431976   0.270668   0.383394
Na     0.197319   0.256733   0.393208  -0.145528   0.398377  -0.048703   0.059673  -0.441724
Br     0.237593  -0.133207  -0.148348  -0.114827   0.366633  -0.629558   0.134138   0.280716
Ce     0.310789  -0.153204   0.174277   0.424389  -0.059743   0.020621   0.182071  -0.108815
Co     0.268686  -0.217477  -0.325007  -0.264940  -0.001150   0.286614   0.085416  -0.289723
Cr     0.148723   0.500147  -0.190339   0.011252  -0.159376   0.178156  -0.204580   0.543233
Cs     0.204563   0.464098  -0.224263  -0.105064  -0.170290   0.007983  -0.028185  -0.262404
Eu     0.198263  -0.352826   0.335338   0.166123  -0.320455   0.212371  -0.254025   0.046827
Fe     0.336150  -0.162813  -0.223426  -0.245212  -0.040553   0.161223  -0.230638  -0.120152
Hf     0.245628   0.117812   0.187192  -0.022610  -0.441582  -0.070881   0.716860   0.130534
La     0.383947  -0.028602   0.059491   0.091231   0.056412  -0.220394  -0.270395   0.174274
Sc     0.347886   0.154995  -0.079725  -0.128461  -0.208561  -0.189837  -0.058653  -0.191865
Sm     0.366569  -0.069342  -0.018169   0.367004   0.106307  -0.023817  -0.186690   0.045654
U      0.061745   0.202401  -0.293312   0.553354   0.388851   0.319683   0.235231  -0.061397

 

            PC9       PC10       PC11       PC12       PC13       PC14       PC15
Al     0.273871   0.226325  -0.365249  -0.438376   0.053102   0.008869  -0.020738
Mn    -0.502323   0.066884   0.101067   0.017929  -0.046715   0.009282  -0.015977
Na     0.070520  -0.234085   0.358430   0.410554  -0.031145  -0.024325  -0.035334
Br     0.272840   0.406799   0.094090   0.030315   0.109288  -0.033175  -0.017663
Ce    -0.070764   0.230318  -0.208930   0.101732  -0.494912  -0.512046  -0.035925
Co     0.479409   0.029526  -0.026347  -0.110678  -0.412214   0.328918   0.097719
Cr     0.259105  -0.078915   0.269854   0.230909  -0.255200  -0.160268  -0.083473
Cs    -0.341502   0.533979  -0.224025   0.261461   0.094973   0.222355   0.062881
Eu     0.123397   0.361713   0.425305   0.146247   0.273152   0.147027   0.192559
Fe     0.142489  -0.176042  -0.277230   0.143084   0.491887  -0.494641  -0.141186
Hf     0.155503  -0.218647  -0.101690   0.038961   0.237964   0.135518  -0.021239
La    -0.164339  -0.376178  -0.267374   0.061434  -0.082016   0.198826   0.628192
Sc    -0.253153  -0.063205   0.438529  -0.640663  -0.032365  -0.200978   0.005357
Sm    -0.108080  -0.152286  -0.073319  -0.048589   0.034110   0.432387  -0.668027
U      0.110299   0.020353   0.120190  -0.183817   0.333090  -0.033231   0.278594

Table 3: Correlation of Principal Component Analysis eigenvectors with elements. Numbers with high absolute values indicate that the corresponding element has a strong influence on the given eigenvector (Principal Component).
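The element selection can be retraced from the first three columns of Table 3. The sketch below is one possible reading, not the author's stated procedure: the cutoff of 0.3 on the absolute value is an assumption (the text gives no numerical threshold), chosen here because it happens to reproduce the nine elements named above. The helper name `elements_above` is likewise hypothetical.

```python
# PC1, PC2, and PC3 values transcribed from Table 3.
table3 = {
    "Al": (0.091030, 0.313974, 0.553262),
    "Mn": (0.210519, -0.214412, 0.032990),
    "Na": (0.197319, 0.256733, 0.393208),
    "Br": (0.237593, -0.133207, -0.148348),
    "Ce": (0.310789, -0.153204, 0.174277),
    "Co": (0.268686, -0.217477, -0.325007),
    "Cr": (0.148723, 0.500147, -0.190339),
    "Cs": (0.204563, 0.464098, -0.224263),
    "Eu": (0.198263, -0.352826, 0.335338),
    "Fe": (0.336150, -0.162813, -0.223426),
    "Hf": (0.245628, 0.117812, 0.187192),
    "La": (0.383947, -0.028602, 0.059491),
    "Sc": (0.347886, 0.154995, -0.079725),
    "Sm": (0.366569, -0.069342, -0.018169),
    "U": (0.061745, 0.202401, -0.293312),
}

THRESHOLD = 0.3  # assumed cutoff; the text does not state one

def elements_above(pc_index, threshold=THRESHOLD):
    """Elements whose absolute value on the given PC (0-based) exceeds threshold."""
    return sorted(el for el, row in table3.items() if abs(row[pc_index]) > threshold)

pc1_elements = elements_above(0)   # Ce, Fe, La, Sc, Sm
pc2_elements = elements_above(1)   # Al, Cr, Cs, Eu
selected = sorted(set(pc1_elements) | set(pc2_elements))
```

Under this assumed cutoff, the only members of `selected` that also exceed the threshold on PC3 are Al and Eu, which agrees with the description in the text.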

 

Element Correlations

With the number of element variables reduced to the nine listed above, the Principal Component Analysis was run again on the chert data set; the output can be seen in Tables 4-6. Table 4 shows the calculated correlations of the elements with each other. Values near 1 indicate that a pair of elements in the data set behave nearly identically; that is, they increase and decrease together when plotted. Conversely, a value near -1 indicates that a pair of elements behave in opposite manners: one increases whenever the other decreases. A value near zero indicates that the two elements behave independently of one another. These values can also be expressed as percentages by multiplying by 100.

 

          Al       Ce       Cr       Cs       Eu       Fe       La       Sc       Sm
Al    1.0000
Ce    0.0956   1.0000
Cr    0.2173   0.0069   1.0000
Cs    0.1571   0.1199   0.5687   1.0000
Eu    0.1047   0.5503  -0.1384  -0.1655   1.0000
Fe   -0.0396   0.3493   0.1705   0.3052   0.3154   1.0000
La    0.1797   0.6058   0.2510   0.2875   0.3648   0.6023   1.0000
Sc    0.1200   0.3875   0.3743   0.5478   0.2159   0.5480   0.6174   1.0000
Sm    0.0658   0.7158   0.2024   0.2445   0.4386   0.5196   0.8035   0.5305   1.0000

Table 4: Correlation matrix between elements.

It is notable that cerium, lanthanum, and samarium all correlate rather highly with one another: cerium and lanthanum 61%; cerium and samarium 72%; and lanthanum and samarium 80%. Also, lanthanum correlates rather well with scandium, at 62%. This is not surprising, since Ce, La, and Sm are all Rare Earth Elements, and Sc behaves similarly to the Rare Earth Elements in sedimentary processes (McLennan, 1989). This information is useful, since researchers have sometimes found that elements with high correlations may reduce the amount of random "noise" in a data set when their ratios are used as input rather than their individual concentrations (e.g. Aspinall and Feather, 1972). Therefore the ratios of these four element pairs were later incorporated into some of the analyses rather than the individual element concentrations.
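The four pairs named above are the four strongest correlations in Table 4, which can be checked with a short sketch. The ranking approach is an illustration only; the text does not say how the pairs were picked out.

```python
elements = ["Al", "Ce", "Cr", "Cs", "Eu", "Fe", "La", "Sc", "Sm"]

# Lower triangle of Table 4, one row per element.
lower = [
    [1.0000],
    [0.0956, 1.0000],
    [0.2173, 0.0069, 1.0000],
    [0.1571, 0.1199, 0.5687, 1.0000],
    [0.1047, 0.5503, -0.1384, -0.1655, 1.0000],
    [-0.0396, 0.3493, 0.1705, 0.3052, 0.3154, 1.0000],
    [0.1797, 0.6058, 0.2510, 0.2875, 0.3648, 0.6023, 1.0000],
    [0.1200, 0.3875, 0.3743, 0.5478, 0.2159, 0.5480, 0.6174, 1.0000],
    [0.0658, 0.7158, 0.2024, 0.2445, 0.4386, 0.5196, 0.8035, 0.5305, 1.0000],
]

# Every off-diagonal pair as (correlation, element, element).
pairs = [
    (lower[i][j], elements[j], elements[i])
    for i in range(len(elements))
    for j in range(i)
]

# The four most strongly correlated pairs, strongest first.
top_four = sorted(pairs, reverse=True)[:4]
```

The result lists La-Sm, Ce-Sm, La-Sc, and Ce-La in that order. The next pair in the ranking, Fe-La at 0.6023, falls just below them.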

As can be seen in Table 5, PC1 correlates moderately strongly with the elements lanthanum (La) and samarium (Sm); PC2 correlates moderately strongly with chromium (Cr) and cesium (Cs), and has a negative correlation with europium (Eu); PC3 has a strong correlation with aluminum (Al). This indicates that these six elements are most strongly associated with the directions of maximum variance in the data set. It is also notable that PC1 has positive correlations with all the elements, which would indicate that, to a certain extent, PC1 is controlled by an overall increase in element concentrations.

 

           PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9
Al     0.10117   0.16075   0.88629   0.30581  -0.23900   0.02054   0.08519   0.06168   0.12765
Ce     0.37140  -0.31400   0.11965  -0.46040  -0.00336   0.28223   0.28683   0.59236  -0.15606
Cr     0.19295   0.54315   0.10582  -0.26687   0.47139  -0.55412  -0.06136   0.22686  -0.02921
Cs     0.24651   0.53793  -0.08201  -0.07799   0.10661   0.52052   0.43337  -0.39660  -0.09826
Eu     0.25015  -0.46846   0.21536   0.20593   0.71968   0.06179  -0.04083  -0.31523  -0.07988
Fe     0.36536  -0.01541  -0.34299   0.64328  -0.04415  -0.24578   0.43605   0.26604   0.10744
La     0.45108  -0.06161   0.00777  -0.02988  -0.38181  -0.27942  -0.17926  -0.29858  -0.66830
Sc     0.39982   0.20732  -0.13101   0.21580   0.02046   0.39996  -0.70350   0.24095   0.14171
Sm     0.44096  -0.15722  -0.03173  -0.33694  -0.20784  -0.19915   0.00655  -0.34526   0.68133

Table 5: Correlation of eigenvectors and elements.
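A note on scale may help when reading Table 5. Each column of the table has unit length, which is characteristic of eigenvector coefficients. For a PCA on the correlation matrix, the correlation between an element and a component is commonly obtained by multiplying the coefficient by the square root of that component's eigenvalue. The sketch below applies this scaling to the PC1 and PC2 columns, using the eigenvalues reported in Table 6; it is offered as an aid to interpretation under that assumption, not as part of the original SAS output.

```python
import math

# PC1 and PC2 values from Table 5; eigenvalues for PC1 and PC2 from Table 6.
table5 = {
    "Al": (0.10117, 0.16075),
    "Ce": (0.37140, -0.31400),
    "Cr": (0.19295, 0.54315),
    "Cs": (0.24651, 0.53793),
    "Eu": (0.25015, -0.46846),
    "Fe": (0.36536, -0.01541),
    "La": (0.45108, -0.06161),
    "Sc": (0.39982, 0.20732),
    "Sm": (0.44096, -0.15722),
}
eigenvalues = (3.82631, 1.79340)

# Length of the PC1 column; a value of about 1 marks unit-length eigenvectors.
pc1_norm = math.sqrt(sum(row[0] ** 2 for row in table5.values()))

# Assumed scaling: correlation = coefficient * sqrt(eigenvalue of that PC).
correlations = {
    el: tuple(c * math.sqrt(ev) for c, ev in zip(row, eigenvalues))
    for el, row in table5.items()
}
```

The ranking of elements within each component is unchanged by this scaling, so the elements singled out in the text remain the most prominent; only the magnitudes differ (for example, La on PC1 becomes roughly 0.88).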

 

Plotting Principal Components

The information in Table 6 indicates that PC1 accounts for about 43% of the variation in the chert data set; PC2 accounts for 20%, and the percentages drop off gradually for the rest of the Principal Components. Cumulatively, PC1 and PC2 together account for 62% of the variation in the data set. A two-dimensional view (of the nine-dimensional data set) can be created by projecting all of the data points onto the plane defined by the axes PC1 and PC2. This two-dimensional view will retain 62% of the information from the nine-dimensional plot. As described earlier for Principal Component Analysis in general, this plane derived from the axes PC1 and PC2 will provide a two-dimensional plot retaining the maximum possible amount of information from the multivariate plot. This plot of PC1 versus PC2 can be seen in Figure 7. The scale on the axes of the Principal Component plots is in increments of standard deviations from the variable means (this is done to equalize the scale at which each variable is plotted and compared to the others).

 

       Eigenvalue  Difference  Proportion  Cumulative
PC1       3.82631     2.03291    0.425146     0.42515
PC2       1.79340     0.74682    0.199266     0.62441
PC3       1.04657     0.43845    0.116286     0.74070
PC4       0.60812     0.09401    0.067569     0.80827
PC5       0.51411     0.03907    0.057124     0.86539
PC6       0.47504     0.12711    0.052782     0.91817
PC7       0.34793     0.11654    0.038659     0.95683
PC8       0.23139     0.07427    0.025710     0.98254
PC9       0.15712           .    0.017458     1.00000

Table 6: Eigenvalues of the correlation matrix. The Proportion value indicates what percentage of the total variation in the data set is accounted for by each Principal Component. PC1 accounts for almost 43% of the variation by itself, and the rest of the scores drop off gradually in significance.
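The Difference, Proportion, and Cumulative columns of Table 6 follow directly from the eigenvalues. A short sketch of the arithmetic, using the eigenvalues from the table (the variable names are illustrative):

```python
eigenvalues = [3.82631, 1.79340, 1.04657, 0.60812, 0.51411,
               0.47504, 0.34793, 0.23139, 0.15712]

# For a correlation matrix the eigenvalues sum to the number of variables (9).
total = sum(eigenvalues)

# Difference: each eigenvalue minus the next one (undefined for the last PC).
differences = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

# Proportion: share of the total variance carried by each PC.
proportions = [ev / total for ev in eigenvalues]

# Cumulative: running sum of the proportions.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)
```

The first two entries of `cumulative` give the roughly 43% and 62% figures quoted in the text for PC1 alone and for PC1 and PC2 together.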

 

The overall plot for the entire Prairie du Chien data set is very dense and close to having a multivariate normal distribution. That is, it is nearly elliptical in distribution with few outliers. In fact, the overall grouping is so tight that there is a great deal of overlap between the sample groupings from individual sample locations. This is somewhat unfortunate, though not surprising given that they are, after all, derived from essentially the same geologic formation. Had the samples come from multiple formations, they probably would appear as two or more individual groupings, separate or somewhat overlapping. The tight clustering and overlap between individual locations in this case prevent the identification or definition of distinctive provenance sources, and indicate that the locations will probably not be easily distinguished geochemically by any means.


Figure 7: Plot of PC1 versus PC2. Each closed figure represents the range of values from one sample location.

A second PCA plot was made for the data, using the four element ratios identified above. This was a test to see whether the ratios might remove any noise from the data and reveal otherwise unseen trends. Once again, the samples were plotted according to sample location. The element ratios did not do any better at separating sample locations, but produced an interesting trend in about half of the individual sample plots. The samples started splitting into two groups within many of the sample locations, both outcrop and stream deposits. When each data point was labeled with its corresponding sample number, however, there was no evident reason for the separation. The groups do not correspond to Shakopee and Oneota, or to any other stratigraphic rule; nor do they correspond to any textural or other visible macroscopic feature. It seems that either there is an unknown process influencing the rare earth element ratios throughout the Prairie du Chien, or else there is a quirk of the statistical analytical process which produces these apparent separations.

In any case, there is still no evident means of distinguishing different source regions using PCA. Therefore methods other than PCA must be employed.

 




The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.