We consider the following data found in Johnson & Wichern (2002), p. 470. The data are census information on a few key variables. In practice, performing a principal components analysis on such a small data set may not seem worthwhile, but for illustrative purposes, it serves well to demonstrate the SPSS procedure. First, a look at the data as it should appear in the SPSS Data Editor:
The variables in this data set are:
TRACT - independent areas of census-taking
TOTPOP - total population (in thousands)
MEDSHYEAR - median school years
TOTEMPLY - total employment (in thousands)
HEALTH - health services employment (in hundreds)
MEDHOME - median home value (in $10,000s)
Before we conduct the PCA, it is wise to have a look at our data. Let's obtain some descriptives:
Let's also have a look at the covariance matrix for these data. To obtain the covariance matrix, simply proceed as if you were computing bivariate correlations between all variables (except for TRACT):
After you've moved the variables you want to include in the principal components analysis (TOTPOP through MEDHOME) from left to right, select the "Options" window:
Once you've selected the "Options" tab, select "Cross-product deviations and covariances," then click the "Continue" tab and then "Ok" to run the bivariate correlation procedure:
SPSS returns a 5x5 correlation matrix, but along with it are the sums of squares and cross-products, as well as the covariances. For example, the covariance for variables TOTPOP and MEDSHYEAR is 1.684 (highlighted in green). The covariance for variables TOTPOP and TOTEMPLY is 1.803, and so on. Notice that the reported covariance for variables TOTPOP and TOTPOP is equal to 4.308. This isn't really a covariance; it's simply the variance of the variable (or, if we must, the covariance of a variable with itself).
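To make the last point concrete, here is a minimal Python sketch of how a covariance is built from cross-product deviations, and why the "covariance" of a variable with itself is just its variance. The values of `x` and `y` are hypothetical and chosen only for illustration; they are not the census data.

```python
# Hypothetical values for illustration only (not the census data).
x = [2.0, 4.0, 4.0, 5.0, 7.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0]


def covariance(a, b):
    """Sample covariance: sum of cross-product deviations divided by n - 1."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cross_products = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return cross_products / (n - 1)


def variance(a):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(a)
    mean_a = sum(a) / n
    return sum((ai - mean_a) ** 2 for ai in a) / (n - 1)


cov_xy = covariance(x, y)  # an off-diagonal entry of the covariance matrix
cov_xx = covariance(x, x)  # a diagonal entry: covariance of x with itself
var_x = variance(x)

print(cov_xy, cov_xx, var_x)
```

The diagonal entry `cov_xx` and the variance `var_x` come out identical, which is the same relationship SPSS shows for TOTPOP with TOTPOP.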
Running the PCA
Let's now review how to run the actual PCA. In SPSS, select "Analyze - Data Reduction - Factor":
Next, the factor analysis window opens. We select the five variables that we wish to include in the PCA and move them into the "Variables" box:
Next, click on the "Extraction" tab. Since this is a principal components analysis, the "Method" should be set to "Principal components." We'll analyze the covariance matrix, so be sure that box is checked. We'll also request the "Unrotated factor solution" and the "Scree plot" to help us decide how many components to retain. Under "Extract," we have the option of extracting only components with eigenvalues greater than a given value (e.g., 1). However, for this example, since we seek to show that PCA always extracts the same number of components as there are initial variables, we'll select "Number of factors" and set it equal to 5. "Maximum Iterations for Convergence" should have a default of 25, which is usually more than enough for the procedure to run. If the maximum has been set lower than that, simply increase it to 25 or more. Click "Continue," then once back in the Factor Analysis window, click "Ok".
Principal Components Output
We're now ready to interpret the PCA. Depending on how you have your copy of SPSS configured, the first thing reported will be the syntax used to execute the requested procedure. If you do not see syntax, but would like to see it, select "Edit" from the main SPSS Editor window, then scroll down to "Options." This will open up the tab below. Select "Viewer." Make sure "Log" is selected below "Initial Output State - Item." Then, check off the box "Display commands in the log." This will ensure that your syntax is recorded with each procedure you run.
The syntax for this particular PCA is shown below.
FACTOR
/VARIABLES TOTPOP MEDSHYEAR TOTEMPLY HEALTH MEDHOME /MISSING LISTWISE
/ANALYSIS TOTPOP MEDSHYEAR TOTEMPLY HEALTH MEDHOME
/PRINT INITIAL EXTRACTION ROTATION
/PLOT EIGEN
/CRITERIA FACTORS(5) ITERATE(25)
/EXTRACTION PC /ROTATION NOROTATE
/METHOD=COVARIANCE.
The first bit of real output for the procedure is the "Communalities" box. Because we're performing a principal components analysis, and not a factor analysis, the "Initial" communalities are equal to 1.0 for each variable. We will discuss the extraction ...
Next, take a look at the "Total Variance Explained" table below. The reported eigenvalues are the variances of the components, one for each component extracted. We see that component 1 has a variance of 6.931, component 2 has a variance of 1.785, and component 3 has a variance of .390. The "% of Variance" column shows the percentage of variance accounted for by each component, and is easily computed by dividing the given eigenvalue by the sum of the eigenvalues. For component 1, the computation is 6.931/(6.931 + 1.785 + .390 + .230 + .014) = 6.931/9.35 = 0.74128, or about 74.13%, which is within rounding error of the value reported by SPSS.
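The same arithmetic can be repeated for every component. The short Python sketch below uses the five eigenvalues quoted above and reproduces the "% of Variance" column along with a running cumulative percentage; the variable names are ours and are not SPSS output labels.

```python
# Eigenvalues (component variances) quoted in the text for the covariance-matrix PCA.
eigenvalues = [6.931, 1.785, 0.390, 0.230, 0.014]

total_variance = sum(eigenvalues)

# Percentage of variance for each component: eigenvalue / sum of eigenvalues.
percent_of_variance = [100 * ev / total_variance for ev in eigenvalues]

# Running total, analogous to a cumulative-percentage column.
cumulative = []
running = 0.0
for pct in percent_of_variance:
    running += pct
    cumulative.append(running)

rows = zip(eigenvalues, percent_of_variance, cumulative)
for i, (ev, pct, cum) in enumerate(rows, start=1):
    print(f"Component {i}: eigenvalue = {ev:.3f}, % of variance = {pct:.3f}, cumulative % = {cum:.3f}")
```

Because the eigenvalues are entered to three decimals, the percentages can differ from the SPSS table in the last digit or so.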
Look closely at the scaled components on the right side of the above component matrix. What do we notice? We notice that the first component extracted loads highly on TOTPOP, TOTEMPLY, and HEALTH. Component 2 loads highly on MEDSHYEAR and HEALTH. Component 3 loads highly on MEDHOME. Components 4 and 5, although reported, do not have great loadings on any of the variables, and in conjunction with the scree plot information, it's safe to disregard these two components. Recall that the mere fact that they were extracted doesn't mean much, since principal components analysis will always extract as many components as there are variables.
Issues & Specifications
1. In the "Descriptives" tab, there is an option for selecting "KMO & Bartlett's Test of Sphericity." Bartlett's test evaluates the null hypothesis that the correlation matrix is an identity matrix. That is, it tests the null hypothesis that the pairwise bivariate correlations among variables are all equal to zero (leaving only 1's in the main diagonal, which is why it's an identity matrix).
For our data, let's interpret the output for the test:
Because the test is significant (p < .001), we can reject the null hypothesis that the matrix is an identity matrix, and infer the alternative hypothesis that it is not an identity matrix. Consequently, we may proceed with the principal components analysis with the rough assumption that there is "enough" correlation among variables to make it worthwhile.
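For readers curious about what sits behind the test, the following Python sketch computes the usual chi-square approximation for Bartlett's test, chi-square = -[(n - 1) - (2p + 5)/6] ln|R| with p(p - 1)/2 degrees of freedom, where |R| is the determinant of the correlation matrix. Several things here are assumptions: the function name `bartlett_sphericity` is ours, the determinant is approximated as the product of the rounded correlation-matrix eigenvalues reported later in this tutorial, and the sample size `n_cases` is purely hypothetical, so the printed statistic is illustrative and should not be read as the SPSS result.

```python
import math


def bartlett_sphericity(det_r, n_cases, n_variables):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    multiplier = (n_cases - 1) - (2 * n_variables + 5) / 6
    statistic = -multiplier * math.log(det_r)
    df = n_variables * (n_variables - 1) // 2
    return statistic, df


# The determinant of a correlation matrix equals the product of its eigenvalues.
# These are the rounded eigenvalues reported for the correlation-matrix analysis.
det_r = math.prod([3.029, 1.291, 0.572, 0.095, 0.012])

n_cases = 50  # HYPOTHETICAL sample size, used only to make the sketch runnable

statistic, df = bartlett_sphericity(det_r, n_cases, n_variables=5)
print(f"chi-square = {statistic:.2f} on {df} degrees of freedom")
```

When the correlation matrix is exactly an identity matrix, its determinant is 1, ln(1) = 0, and the statistic is zero. The further the determinant falls below 1, the larger the statistic becomes.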
Determinant of Covariance Matrix
Checking off "Determinant" in the "Descriptives" window returns the determinant of the covariance (or correlation, whichever you're analyzing) matrix. If the determinant is very low (e.g., 0.0001 or so), it could suggest an issue of singularity, which implies multicollinearity in your data. The covariance matrix in this case yields a determinant of .016, making it acceptable to proceed with the PCA.
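As a quick check on the reported determinant, recall that the determinant of a covariance matrix equals the product of its eigenvalues. The sketch below multiplies the five rounded eigenvalues from the "Total Variance Explained" table. It is only a consistency check on the numbers quoted in this tutorial, not a replacement for the SPSS output.

```python
import math

# Rounded eigenvalues from the covariance-matrix PCA reported earlier.
eigenvalues = [6.931, 1.785, 0.390, 0.230, 0.014]

# Determinant of a symmetric matrix = product of its eigenvalues.
determinant = math.prod(eigenvalues)

print(round(determinant, 3))
```

The product is roughly .0155, which rounds to the .016 reported by SPSS. It also shows why a near-zero eigenvalue (such as .014 here) pulls the determinant toward zero.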
Inverse of Covariance Matrix
The "Inverse" tab yields the inverse covariance matrix. This matrix is not important for applied purposes, but it is related to the determinant just mentioned. If the determinant were low enough to suggest singularity, we would not be able to invert the covariance matrix, and hence would not be able to conduct the PCA. These issues (determinant, inverse) relate more generally to the issue of multicollinearity.
Had we used the correlation matrix instead
If we had used the correlation matrix instead of the covariance matrix, then the sum of the eigenvalues across components would equal the number of variables entered into the analysis. Each variable contributes one "unit" of variance into the PCA, which is why the sum of eigenvalues is equal to the number of variables. For the present data, consider the following output produced by analyzing the correlation matrix:
Notice that the eigenvalues sum to 3.029 + 1.291 + .572 + .095 + .012 = 4.999, rounded to 5.0, which equals the number of variables entered into the analysis. It seems reasonable then that if the eigenvalue for any given component is greater than its original variable "entry" of 1.0, it be considered important or valuable (at least statistically). This idea, that a component should be retained if its eigenvalue is greater than 1.0, is known as the Kaiser criterion. In our data, we can see that the first two components have eigenvalues greater than 1.0, and hence, according to the Kaiser criterion, should be retained. The scree plot (shown earlier above) as well ...
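The Kaiser criterion is simple enough to express in a few lines. The Python sketch below uses the five correlation-matrix eigenvalues quoted above, confirms that they sum to (approximately) the number of variables, and lists the components whose eigenvalues exceed 1.0. The variable names are ours.

```python
# Rounded eigenvalues from the correlation-matrix PCA quoted in the text.
eigenvalues = [3.029, 1.291, 0.572, 0.095, 0.012]

n_variables = len(eigenvalues)
eigenvalue_sum = sum(eigenvalues)

# Kaiser criterion: retain components whose eigenvalue is greater than 1.0,
# i.e. components that account for more variance than a single variable.
retained = [i for i, ev in enumerate(eigenvalues, start=1) if ev > 1.0]

print(f"Sum of eigenvalues: {eigenvalue_sum:.3f} (number of variables: {n_variables})")
print(f"Components retained under the Kaiser criterion: {retained}")
```

The sum falls slightly short of 5 only because the eigenvalues are rounded to three decimals.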
Computing Factor/Component Scores
After conducting a principal components analysis, you may wish to use the results to produce so-called "component scores" for the components you wish to retain. You would want to do this if you plan on using these components as variables in an ensuing analysis. To obtain factor/component scores in SPSS, click on the "Scores" tab in the "Factor Analysis" window:
This will bring up a new window. Select "Save as variables," then under "Method," choose "Regression." Also, you may wish to select "Display factor score coefficient matrix" as well, though it is mostly for theoretical interest, not applied:
Once you run the analysis, the actual factor scores will not appear with the other output of your analysis. Rather, the estimated scores will appear in the actual data file of the SPSS Data Editor window:
Orthogonality of Factors
Recall that one of the goals of PCA was to extract components that were uncorrelated (i.e., orthogonal) with one another. If this was accomplished, then it stands that the bivariate pairwise correlations between estimated factor scores should be equal to 0. To demonstrate this, we can run a simple bivariate correlation procedure of FAC1_1 through FAC5_1 in SPSS and observe the resulting correlation matrix:
Notice that all pairwise correlations among estimated factor scores are equal to 0, as they should be. However, this assumes we didn't rotate our factor solution, or at minimum, that the rotation method used was one that produced orthogonal components (e.g., varimax or quartimax, which are orthogonal rotation methods). Had we selected "Direct Oblimin," for instance, the resulting correlation matrix of estimated factor scores would not have been an identity matrix (i.e., 1's along the main diagonal with zeros everywhere else). Using "Direct Oblimin" for the present data would have resulted in the following estimated factor score correlation matrix:
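The zero correlations in the unrotated case can be reproduced by hand for a small problem. The Python sketch below carries out a two-variable PCA in closed form and checks that the two sets of component scores are uncorrelated. The data are hypothetical, and the scores are plain projections of the centered data rather than the standardized regression-method scores SPSS saves. Standardizing would rescale the scores but would not change their correlation.

```python
import math

# Hypothetical two-variable data set, for illustration only.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]


def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def covariance(a, b):
    """Sample covariance of two lists that are already mean-centered."""
    return sum(ai * bi for ai, bi in zip(a, b)) / (len(a) - 1)


xc, yc = center(x), center(y)
var_x = covariance(xc, xc)
var_y = covariance(yc, yc)
cov_xy = covariance(xc, yc)

# For a 2 x 2 covariance matrix, the first principal axis lies at angle theta,
# where tan(2 * theta) = 2 * cov_xy / (var_x - var_y).
theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)

# Unrotated component scores: projections onto the two orthogonal axes.
scores_1 = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(xc, yc)]
scores_2 = [-math.sin(theta) * a + math.cos(theta) * b for a, b in zip(xc, yc)]

# Projections of centered data are themselves centered, so the helper applies.
var_1 = covariance(scores_1, scores_1)
var_2 = covariance(scores_2, scores_2)
correlation = covariance(scores_1, scores_2) / math.sqrt(var_1 * var_2)

print(f"Component variances: {var_1:.4f}, {var_2:.4f}")
print(f"Absolute correlation between component scores: {abs(correlation):.6f}")
```

The two component variances add up to the total variance of the original variables, and the correlation between the scores is zero up to floating-point error. An oblique rotation gives up exactly this property.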
Questions & Answers
1. How big of a sample size do I need to conduct a PCA?
There is no definite answer to this question. There are only general guidelines. One useful guideline is to apply the following rule: the minimal number should be at least 100 and/or five times the number of variables being analyzed (Streiner). So, for our data above, we had 5 variables, and by the five-times rule our minimal sample size would be N = 25. Although very small, in practice, a basic PCA could be done on this data. Of course, we would not usually want to run a PCA on such a small number of variables anyway, so rarely will we want to run the procedure on samples of less than N = 100. Sample sizes upward of 300+ are preferred, especially if the pairwise correlations among variables are not tremendously strong. If the correlations are quite high, then that reduces somewhat the necessity of having a very large sample size. Again, these are only very general guidelines.
2. What are the general assumptions I should consider before running a PCA?
Before running a PCA, you should ensure that your data are generally at the level of interval or ratio measurement. Determining levels of measurement can sometimes be a "fuzzy" issue, but in general, you should have data that can more or less be considered continuous in nature. If you have dichotomous, categorical, or nominal data, you should not run a traditional PCA. Latent class analysis (LCA) may be an option you wish to consider, with more specialized software than SPSS (you may require Mplus or Latent Gold to run your procedure). Latent Gold is a very useful package that is especially suited to analyzing survey data and establishing profiles of respondents. In essence, it performs a cluster analysis on nominal or ordinal (or continuous) data.
As noted by O'Rourke, Hatcher & Stepanski (2005), your data should also be a random sample drawn from a given population, should exhibit at minimum pairwise linearity among variables, and should follow a pairwise bivariate normal density.
3. What is the difference between factor analysis and principal components analysis?
There are two ways of answering this question: one is technical, the other is substantive or conceptual. Technically, the difference between factor analysis and principal components analysis is that in FA, communalities are used in the main diagonal of the correlation matrix, whereas in PCA, variances (1's) are used instead. In this sense, FA is interested in accounting for the shared (i.e., "common") variance among variables, whereas PCA is interested in accounting for the total variance of the variables. Does it matter what is put in the diagonal? Nunnally (1978) suggests that if you have at least 20 variables in your analysis, whether you use communalities or 1's will not make much of a difference. As a general rule, if communalities are relatively high (e.g., .70 and higher), and the number of variables is greater than 20-25, whether you do a principal components analysis or factor analysis will not make much of a difference. If communalities are low and/or the number of variables is low, then you're likely to see a difference in results from a PCA vs. FA. Again, these are practical guidelines.
The other probably more powerful way of appreciating the difference between FA and PCA is to consider their substantive/research purposes. If you theorize the existence of an underlying dimension that gave rise to the correlations among your observed variables, then this is a problem of factor analysis. However, if you hold no such assumptions, and wish to simply account for your observed variables by reducing their dimensionality, without imposing any kind of "give rise to" definition, then PCA is usually the procedure of choice. Most often results from an FA or PCA will provide similar findings, but you should be aware that the use of each procedure is usually grounded in your theory about what you wish to do.
Johnson, R. A., & Wichern, D. W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall: New Jersey.
O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). A Step-by-Step Approach to Using SAS for Univariate & Multivariate Statistics. SAS Institute Inc., Cary, NC, USA.
DATA & DECISION, Copyright 2010, Daniel J. Denis, Ph.D. Department of Psychology, University of Montana. Contact Daniel J. Denis by e-mail firstname.lastname@example.org.