Principal Components Analysis Using SPSS

We consider the following data found in Johnson & Wichern (2002), p. 470. The data are census information on a few key variables. In practice, performing a principal components analysis on such a small data set may not seem worthwhile, but for illustrative purposes, it serves well to demonstrate the SPSS procedure. First, a look at the data as it should appear in the SPSS Data Editor:

The variables in this data are:

TRACT - independent areas of census-taking
TOTPOP - total population (in thousands)
MEDSHYEAR - median school years
TOTEMPLY - total employment (thousands)
HEALTH - health services employment (hundreds)
MEDHOME - median home value ($10,000s)

Before we conduct the PCA, it is wise to have a look at our data. Let's obtain some descriptives:

Let's also have a look at the covariance matrix for these data. To obtain the covariance matrix, simply proceed as if you were computing bivariate correlations between all variables (except for TRACT):

After you've moved the variables you want to analyze (TOTPOP through MEDHOME) from left to right, click the "Options" button:

Once the "Options" window opens, select "Cross-product deviations and covariances," then click "Continue" and then "OK" to run the bivariate correlation procedure:

SPSS returns a 5x5 correlation matrix, along with the sums of squares and cross-products and the covariances. For example, the covariance for variables TOTPOP and MEDSHYEAR is 1.684 (highlighted in green), and the covariance for variables TOTPOP and TOTEMPLY is 1.803. Notice that the reported covariance for TOTPOP with itself is equal to 4.308. This isn't really a covariance; it's simply the variance of the variable (or, if we must, the covariance of a variable with itself).
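The point about the diagonal can be checked outside SPSS. The following is a minimal Python sketch, assuming NumPy is available. The data are randomly generated placeholders, not the Johnson & Wichern census values, and the 20 x 3 array size is arbitrary.

```python
import numpy as np

# Hypothetical data: 20 cases by 3 variables. These values are made up for
# illustration and are NOT the census data analyzed in this tutorial.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 3))

# rowvar=False treats each column as a variable; np.cov divides by n - 1.
cov = np.cov(data, rowvar=False)

# Sample variance of each column, using the same n - 1 denominator.
variances = data.var(axis=0, ddof=1)

print(np.round(cov, 3))
# The diagonal of the covariance matrix holds the variances.
print(np.allclose(np.diag(cov), variances))
```

The off-diagonal entries are the pairwise covariances, and the matrix is symmetric, which is why SPSS shows each covariance twice.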

Running the PCA

Let's now review how to run the actual PCA. In SPSS, select "Analyze - Data Reduction - Factor":

Next, the factor analysis window opens. We select the five variables on which we wish to perform the PCA and move them into the "Variables" box:

Next, click on the "Extraction" button. Since this is a principal components analysis, the "Method" should be set to "Principal components." We'll analyze the covariance matrix, so be sure that option is selected. We'll also request the "Unrotated factor solution" and the "Scree plot" to help us decide how many components to retain. Under "Extract," we have the option of extracting only components whose eigenvalues exceed a given value (e.g., 1). However, for this example, since we seek to show that PCA always extracts as many components as there are initial variables, we'll select "Number of factors" and set it equal to 5. "Maximum Iterations for Convergence" should have a default of 25, which is usually more than enough for the procedure to run. If the maximum has been set lower than that, simply increase it to 25 or more. Click "Continue," then, once back in the Factor Analysis window, click "OK".
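For readers who want to see what the extraction amounts to numerically, here is a rough Python sketch of a PCA on a covariance matrix, assuming NumPy. It is not SPSS's implementation. The data are randomly generated stand-ins for five variables, so the eigenvalues will not match the census output.

```python
import numpy as np

# Hypothetical data: 30 cases by 5 correlated variables (made up for
# illustration; not the census data used in this tutorial).
rng = np.random.default_rng(1)
data = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))

cov = np.cov(data, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so we re-sort from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# One component per variable is always produced.
n_components = eigenvalues.size
print(n_components)

# The component variances (eigenvalues) add up to the total variance,
# i.e., the sum of the diagonal of the covariance matrix.
print(np.isclose(eigenvalues.sum(), np.trace(cov)))
```

Deciding how many of those components to keep is a separate step, which is what the scree plot and the eigenvalue cutoffs are for.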

Principal Components Output

We're now ready to interpret the PCA. Depending on how you have your copy of SPSS configured, the first thing reported will be the syntax used to execute the requested procedure. If you do not see syntax but would like to, select "Edit" from the main SPSS Editor window, then scroll down to "Options." This will open up the tab below. Select "Viewer." Make sure "Log" is selected below "Initial Output State - Item." Then, check off the box "Display commands in the log." This will ensure that your syntax is recorded with each procedure you run.

The syntax for this particular PCA is shown below.


The first bit of real output for the procedure is the "Communalities" box. Because we're performing a principal components analysis, and not a factor analysis, the "Initial" communalities are equal to 1.0 for each variable. We will discuss the extraction communalities later. 

Next, take a look at the "Total Variance Explained" table below. The reported eigenvalues are the variances of the components, and one is reported for each component extracted. We see that component 1 has a variance of 6.931, component 2 has a variance of 1.785, and component 3 has a variance of .390. The "% of Variance" column shows the percentage of variance accounted for by each component, and is easily computed by dividing the given eigenvalue by the sum of the eigenvalues. For component 1, the computation is 6.931/(6.931 + 1.785 + .390 + .230 + .014) = 6.931/9.350 = 0.74128, or about 74.13%, which is within rounding error of the value reported by SPSS.
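The same arithmetic can be carried out for all five components at once. Below is a small Python sketch that uses only the eigenvalues quoted above. Because those values are rounded to three decimals, the percentages may differ from SPSS's in the last digit.

```python
from itertools import accumulate

# Eigenvalues as reported (rounded) in the "Total Variance Explained" table.
eigenvalues = [6.931, 1.785, 0.390, 0.230, 0.014]

total = sum(eigenvalues)

# Percentage of variance for each component, and the running total.
pct = [100 * ev / total for ev in eigenvalues]
cumulative = list(accumulate(pct))

for i, (p, c) in enumerate(zip(pct, cumulative), start=1):
    print(f"Component {i}: {p:7.3f}%  cumulative {c:7.3f}%")
```

The first line of output corresponds to the hand computation for component 1, roughly 74.13%.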

Look closely at the scaled components on the right side of the above component matrix. What do we notice? We notice that the first component extracted loads highly on TOTPOP, TOTEMPLY, and HEALTH. Component 2 loads highly on MEDSHYEAR and HEALTH. Component 3 loads highly on MEDHOME. Components 4 and 5, although reported, do not have large loadings on any of the variables, and in conjunction with the scree plot information, it's safe to disregard these two components. Recall that the fact that they were extracted doesn't mean much, since principal components analysis will always extract as many components as there are variables.

Issues & Specifications

1. In the "Descriptives" window, there is an option for selecting "KMO & Bartlett's Test of Sphericity." Bartlett's test evaluates the null hypothesis that the correlation matrix is an identity matrix. That is, it tests the null hypothesis that all pairwise bivariate correlations among variables are equal to zero (leaving only 1's on the main diagonal, which is why it's an identity matrix).

For our data, let's interpret the output for the test:

Because the test is statistically significant (p < .001), we can reject the null hypothesis that the correlation matrix is an identity matrix, and infer the alternative hypothesis that it is not an identity matrix. Consequently, we may proceed with the principal components analysis under the rough assumption that there is "enough" correlation among variables to make it worthwhile.
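To make the test less of a black box, here is a minimal Python sketch of the usual textbook form of Bartlett's statistic, chi-square = -[(n - 1) - (2p + 5)/6] ln|R|, with p(p - 1)/2 degrees of freedom, where R is the correlation matrix. The function name bartlett_sphericity and the data are hypothetical. Whether SPSS uses exactly this form internally is an assumption here, not something the output confirms.

```python
import numpy as np

def bartlett_sphericity(data):
    """Return (chi-square statistic, degrees of freedom) for the textbook
    form of Bartlett's test of sphericity. `data` is cases by variables."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # slogdet returns (sign, log|R|); working with the log avoids underflow
    # when the determinant is very small.
    _, log_det = np.linalg.slogdet(corr)
    statistic = -((n - 1) - (2 * p + 5) / 6) * log_det
    df = p * (p - 1) // 2
    return statistic, df

# Hypothetical data: 50 cases by 5 correlated variables (not the census data).
rng = np.random.default_rng(3)
data = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))

statistic, df = bartlett_sphericity(data)
print(round(statistic, 3), df)
# A p-value would come from the upper tail of a chi-square distribution
# with `df` degrees of freedom.
```

If the correlation matrix were exactly an identity matrix, its determinant would be 1, the log would be 0, and the statistic would be 0. The more the variables correlate, the smaller the determinant and the larger the statistic.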

Determinant of Covariance Matrix

Checking off "Determinant" in the "Descriptives" window returns the determinant of the covariance (or correlation, whichever you're analyzing) matrix. If the determinant is very low (e.g., 0.0001 or so), it could suggest an issue of singularity, which implies multicollinearity in your data. The covariance matrix in this case yields a determinant of .016, making it acceptable to proceed with the PCA.
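The determinant ties back to the eigenvalues: for a symmetric matrix such as a covariance matrix, the determinant equals the product of the eigenvalues. A short Python check, using the rounded eigenvalues from the "Total Variance Explained" table (so only approximate agreement should be expected):

```python
import math

# Rounded eigenvalues of the covariance matrix, as reported by SPSS.
eigenvalues = [6.931, 1.785, 0.390, 0.230, 0.014]

# The determinant of a symmetric matrix is the product of its eigenvalues.
determinant = math.prod(eigenvalues)

print(round(determinant, 3))  # about 0.016
```

This also shows why a near-zero determinant signals trouble: it takes only one eigenvalue close to zero (one nearly redundant dimension) to pull the whole product toward zero.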

Inverse of Covariance Matrix

Selecting "Inverse" yields the inverse of the covariance matrix. This matrix is not important for applied purposes, but it is related to the determinant just mentioned. If the determinant were zero (a singular matrix), the covariance matrix could not be inverted, and any statistic that depends on the inverse could not be computed; a determinant very close to zero makes the inverse numerically unstable. These issues (determinant, inverse) relate more generally to the issue of multicollinearity.

Had we used the correlation matrix instead

If we had used the correlation matrix instead of the covariance matrix, then the sum of the eigenvalues across all components would equal the number of variables entered into the analysis. Each standardized variable contributes one "unit" of variance to the PCA, which is why the sum of the eigenvalues is equal to the number of variables. For the present data, consider the following output produced by analyzing the correlation matrix:

Notice that under "Initial Eigenvalues," the sum is 3.029 + 1.291 + .572 + .095 + .012 = 4.999, which rounds to 5.0, the number of variables entered into the analysis. It seems reasonable, then, that if the eigenvalue for any given component is greater than the 1.0 "entry" of a single original variable, the component should be considered somewhat important or valuable (at least statistically). This idea, that a component should be retained if its eigenvalue is greater than 1.0, is known as the Kaiser criterion. In our data, we can see that the first two components have eigenvalues greater than 1.0, and hence, according to the Kaiser criterion, should be retained. The scree plot (shown earlier) likewise suggests their retention.
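Applied to the eigenvalues quoted above, the Kaiser criterion reduces to a simple comparison. A minimal Python sketch (the 1.0 cutoff is the conventional value for a correlation-matrix analysis):

```python
# Eigenvalues from the correlation-matrix analysis, as reported (rounded).
eigenvalues = [3.029, 1.291, 0.572, 0.095, 0.012]

# With standardized variables the eigenvalues sum to the number of
# variables, apart from rounding in the reported values.
total = sum(eigenvalues)
print(round(total, 3))

# Kaiser criterion: retain components whose eigenvalue exceeds 1.0.
retained = [i for i, ev in enumerate(eigenvalues, start=1) if ev > 1.0]
print(retained)  # components 1 and 2
```

The criterion is a rule of thumb, so it is usually read alongside the scree plot rather than applied mechanically.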

Computing Factor/Component Scores

After conducting a principal components analysis, you may wish to use the results to produce so-called "component scores" for the components you wish to retain. You would want to do this if you plan on using these components as variables in an ensuing analysis. To obtain factor/component scores in SPSS, click on the "Scores" button in the "Factor Analysis" window:

This will bring up a new window. Select "Save as variables," then under "Method," choose "Regression." You may also wish to select "Display factor score coefficient matrix," though it is mostly of theoretical rather than applied interest:

Once you run the factor analysis, the actual factor scores will not appear in the output of your analysis with the other output. Rather, the estimated scores will appear in the actual data file of the SPSS Data Editor Window:

Orthogonality of Factors

Recall that one of the goals of PCA was to extract components that were uncorrelated (i.e., orthogonal) with one another. If this was accomplished, then it stands that the bivariate pairwise correlations between estimated factor scores should be equal to 0. To demonstrate this, we can run a simple bivariate correlation procedure of FAC1_1 through FAC5_1 in SPSS and observe the resulting correlation matrix:

Notice that all pairwise correlations among estimated factor scores are equal to 0, as they should be. However, this assumes that we didn't rotate our factor solution or, at minimum, that the rotation method used was one that produces orthogonal components (e.g., varimax or quartimax, which are orthogonal rotation methods). Had we selected "Direct Oblimin," for instance, the resulting correlation matrix of estimated factor scores would not have been the identity matrix (i.e., 1's along the main diagonal with zeros everywhere else). Using "Direct Oblimin" for the present data would have resulted in the following estimated factor score correlation matrix:
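The orthogonality of unrotated component scores can also be verified numerically. The Python sketch below assumes NumPy and uses randomly generated data, not the census data. It computes standardized component scores by projecting the centered data onto the eigenvectors and dividing by the square root of each eigenvalue. Treating these as equivalent to SPSS's saved FAC1_1 through FAC5_1 variables is an assumption of the sketch.

```python
import numpy as np

# Hypothetical data: 40 cases by 5 correlated variables.
rng = np.random.default_rng(2)
data = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigen-decomposition of the covariance matrix, largest eigenvalue first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Unrotated component scores, scaled to unit variance.
scores = centered @ eigenvectors / np.sqrt(eigenvalues)

# Correlations among the scores: 1's on the diagonal, 0's elsewhere
# (up to floating-point error).
score_corr = np.corrcoef(scores, rowvar=False)
print(np.round(score_corr, 3))
print(np.allclose(score_corr, np.eye(5), atol=1e-6))
```

An oblique rotation would replace the eigenvector matrix with a non-orthogonal transformation, which is why the score correlations would then depart from zero.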

Questions & Answers

1. How big of a sample size do I need to conduct a PCA?

There is no definite answer to this question. There are only general guidelines. One useful guideline is the following rule: the minimum sample size should be at least 100 subjects and/or five times the number of variables being analyzed (Streiner, 1994). For our data above, we had 5 variables, so the five-times rule gives a minimal sample size of N = 25. Although that is very small, in practice a basic PCA could be done on such data. Of course, we would not usually want to perform a PCA on such a small number of variables anyway, so we will rarely want to run the procedure on samples of fewer than N = 100. Sample sizes of 300 or more are preferred, especially if the pairwise correlations among variables are not tremendously strong. If the pairwise correlations are quite high, that somewhat reduces the need for a very large sample size. Again, these are only very general guidelines.

2. What are the general assumptions I should consider before running a PCA?

Before running a PCA, you should ensure that your data are generally at the level of interval or ratio measurement. Determining levels of measurement can sometimes be a "fuzzy" issue, but in general, you should have data that can more or less be considered continuous in nature. If you have dichotomous, categorical, or nominal data, you should not run a traditional PCA. Latent class analysis (LCA) may be an option you wish to consider, with more specialized software than SPSS (you may require Mplus or Latent Gold to run your procedure). Latent Gold is a very useful package that is especially suited to analyzing survey data and establishing profiles of respondents. In essence, it performs a cluster analysis on nominal or ordinal (or continuous) data. 

As noted by O'Rourke, Hatcher & Stepanski (2005), your data should also be a random sample drawn from the given population, should exhibit at minimum pairwise linearity among variables, and each pair of variables should follow a bivariate normal distribution.

3. What is the difference between factor analysis and principal components analysis?

There are two ways of answering this question: one is technical, the other is substantive or conceptual. Technically, the difference between factor analysis and principal components analysis is that in FA, communalities are used in the main diagonal of the correlation matrix, whereas in PCA, variances (1's) are used instead. In this sense, FA is interested in accounting for the shared (i.e., "common") variance among variables, whereas PCA is interested in accounting for the total variance of the variables. Does it matter what is put in the diagonal? Nunnally (1978) suggests that if you have at least 20 variables in your analysis, whether you use communalities or 1's will not make much of a difference. As a general rule, if communalities are relatively high (e.g., .70 and higher), and the number of variables is greater than 20-25, whether you do a principal components analysis or a factor analysis will not make much of a difference. If communalities are low and/or the number of variables is low, then you're likely to see a difference in results between a PCA and an FA. Again, these are practical guidelines.

The other, and probably more powerful, way of appreciating the difference between FA and PCA is to consider their substantive/research purposes. If you theorize the existence of an underlying dimension that gave rise to the correlations among your observed variables, then this is a problem of factor analysis. However, if you hold no such assumptions, and wish simply to account for your observed variables by reducing their dimensionality, without imposing any kind of "give rise to" definition, then PCA is usually the procedure of choice. Most often, an FA and a PCA will provide similar findings, but you should be aware that the choice of procedure is usually grounded in your theory about what you wish to do.

References & Readings

Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall.

O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). A Step-by-Step Approach to Using SAS for Univariate & Multivariate Statistics. Cary, NC: SAS Institute Inc.

DATA & DECISION, Copyright 2010, Daniel J. Denis, Ph.D. Department of Psychology, University of Montana.