Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA 93740
Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav which is a subset of the 2014 General Social Survey. Some of the variables in the GSS have been recoded to make them easier to use and some new variables have been created. The data have been weighted according to the instructions from the National Opinion Research Center. This exercise uses FREQUENCIES in SPSS to explore measures of central tendency and dispersion. A good reference on using SPSS is SPSS for Windows Version 23.0 A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth Nelson. The online version of the book is on the Social Science Research and Instructional Council's Website. You have permission to use this exercise and to revise it to fit your needs. Please send a copy of any revision to the author. Included with this exercise (as separate files) are more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output for the exercise (SPSS output file). Please contact the author for additional information.
I’m attaching the following files.
- Data subset (.sav format)
- Extended notes for instructors (MS Word; docx format)
- Syntax file (.sps format)
- Output file (.spv format)
- This page (MS Word; docx format)
Goals of Exercise
The goal of this exercise is to explore measures of central tendency (mode, median, and mean) and dispersion (range, interquartile range, standard deviation, and variance). The exercise also gives you practice in using FREQUENCIES in SPSS.
Part I – Measures of Central Tendency
Data analysis always starts with describing variables one-at-a-time. Sometimes this is referred to as univariate (one-variable) analysis. Central tendency refers to the center of the distribution.
There are three commonly used measures of central tendency – the mode, median, and mean of a distribution. The mode is the most common value or values in a distribution. The median is the middle value of a distribution. The mean is the sum of all the values divided by the number of values.
Run FREQUENCIES in SPSS for the variable d9_sibs. (See Chapter 4, Frequencies in the online SPSS book mentioned on page 1.) Once you have selected this variable click on the “Statistics” button and check the boxes for mode, median, and mean. Then click on “Continue” and click on the “Charts” button. Select “Histogram” and check the box for “Show normal curve on histograms.” Then click on “Continue.” That will take you back to the screen where you selected the variable. Click on “OK” and SPSS will open the Output window and display the results that you requested.
Your output will display the frequency distribution for d9_sibs and a box showing the mode, median, and mean with the following values displayed.
- Mode = 2 meaning that two brothers and sisters was the most common answer (19.4%) from the 2,531 respondents who answered this question. However, not far behind are those with one sibling (18.6%) and those with three siblings (17.9%). So while technically two siblings is the mode, what you really found is that the most common values are one, two, and three siblings. Another part of your output is the histogram which is a chart or graph of the frequency distribution. The histogram clearly shows that one, two, and three are the most common values (i.e., the highest bars in the histogram). So we would want to report that these three categories are the most common responses.
- Median = 3 which means that three siblings is the middle category in this distribution. The middle category is the category that contains the 50th percentile which is the value that divides the distribution into two equal parts. In other words, it’s the value that has 50% of the cases above it and 50% of the cases below it. The cumulative percent column of the frequency distribution tells you that 41.4% of the cases have two or fewer siblings and that 59.3% of the cases have three or fewer siblings. So the middle case (i.e., the 50th percentile) falls somewhere in the category of three siblings. That is the median category.
- Mean = 3.74 which is the sum of all the values in the distribution divided by the number of responses. If you were to sum all these values that sum would be 9,476. Dividing that by the number of responses or 2,531 will give you the mean of 3.74.
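If you want to check these calculations outside SPSS, the following Python sketch works through the same three steps on a small made-up frequency table. The counts are hypothetical, not the GSS data. They were chosen only so that the pattern resembles d9_sibs (a mode of 2, a median category of 3, and a mean above the median). The last line simply redoes the division reported above.

```python
# Hypothetical frequency table: value -> number of cases.
# These counts are invented for illustration; they are NOT the GSS data.
freq = {0: 5, 1: 19, 2: 20, 3: 18, 4: 12, 5: 10, 6: 8, 7: 5, 8: 3}

n = sum(freq.values())  # total number of cases

# Mode: the value with the largest count.
mode = max(freq, key=freq.get)

# Median category: the first value whose cumulative percent reaches 50%.
cumulative = 0
median = None
for value in sorted(freq):
    cumulative += freq[value]
    if cumulative / n >= 0.5:
        median = value
        break

# Mean: sum of all the values divided by the number of cases.
mean = sum(value * count for value, count in freq.items()) / n

# The division reported in the text for d9_sibs: 9,476 / 2,531.
reported_mean = round(9476 / 2531, 2)

print(mode, median, mean, reported_mean)
```

With this table the cumulative percent is 44% through two and 62% through three, so the 50th percentile falls in the category of three, which is the same reasoning used for d9_sibs.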
Part II – Deciding Which Measure of Central Tendency to Use
The first thing to consider is the level of measurement (nominal, ordinal, interval, ratio) of your variable (see Exercise STAT1S).
- If the variable is nominal, you have only one choice. You must use the mode.
- If the variable is ordinal, you could use the mode or the median. You should report both measures of central tendency since they tell you different things about the distribution. The mode tells you the most common value or values while the median tells you where the middle of the distribution lies.
- If the variable is interval or ratio, you could use the mode or the median or the mean. Now it gets a little more complicated. There are several things to consider.
- How skewed is your distribution? Go back and look at the histogram for d9_sibs. Notice that there is a long tail to the right of the distribution. Most of the values are at the lower level – one, two, and three siblings. But there are quite a few respondents who report having four or more siblings and about 5% said they have ten or more siblings. That’s what we call a positively skewed distribution where there is a long tail towards the right or the positive direction. Now look at the median and mean. The mean (3.74) is larger than the median (3.0). The respondents with lots of siblings pull the mean up. That’s what happens in a skewed distribution. The mean is pulled in the direction of the skew. The opposite would happen in a negatively skewed distribution. The long tail would be towards the left and the mean would be lower than the median. In a heavily skewed distribution the mean is distorted and pulled considerably in the direction of the skew. So consider reporting only the median in a heavily skewed distribution. That’s why you almost always see median income reported and not mean income. Imagine what would happen if your sample happened to include Bill Gates. The income distribution would have this very, very large value which would pull the mean up but not affect the median.
- Is there more than one clearly defined peak in your distribution? The number of siblings has one clearly defined peak – one, two, and three siblings. But what if there is more than one clearly defined peak? For example, consider a hypothetical distribution of 100 cases in which there are 50 cases with a value of two and 50 cases with a value of eight. The median and mean would both be five but there are really two centers of this distribution – two and eight. The median and the mean aren’t telling the correct story about the center. You’re better off reporting the two clearly defined peaks of this distribution and not reporting the median and mean.
- If your distribution is normal in appearance then the mode, median, and mean will all be about the same. A normal distribution is a perfectly symmetrical distribution with a single peak in the center. No empirical distribution is perfectly normal but distributions often are approximately normal. Here we would report all three measures of central tendency. Go back to your SPSS output and look at the histogram for d9_sibs. When you told SPSS to give you the histogram you checked the box that said “Show normal curve on histograms.” SPSS then superimposed the normal curve on the histogram. The normal curve doesn’t fit the histogram perfectly, particularly at the lower end, but the histogram does roughly approximate a normal curve, particularly at the upper end.
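The first two points above can be illustrated with a short Python sketch. All of the numbers are hypothetical: the income figures are invented, and the one very large income stands in for the Bill Gates example. The second half reproduces the two-peak distribution of 50 twos and 50 eights.

```python
from statistics import mean, median, multimode

# Hypothetical incomes (invented for illustration).
incomes = [30_000, 40_000, 50_000, 50_000, 60_000, 70_000]
# The same sample plus one extremely large income.
with_outlier = incomes + [1_000_000_000]

# The single extreme value pulls the mean far upward, but the median stays put.
print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))

# Two clearly defined peaks: 50 cases with a value of 2 and 50 with a value of 8.
two_peaks = [2] * 50 + [8] * 50

# Mean and median are both 5, a value that no case actually has.
print(mean(two_peaks), median(two_peaks))
# multimode reports every value that ties for most common.
print(multimode(two_peaks))
```

In the first example the median is 50,000 with or without the extreme case, while the mean jumps from 50,000 to over 140 million. In the second, reporting the two modes describes the distribution better than reporting a center of five.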
Run FREQUENCIES for the following variables. Once you have selected the variables click on the “Statistics” button and check the boxes for mode, median, and mean. Then click on “Continue” and click on the “Charts” button. Select “Histogram” and check the box for “Show normal curve on histograms.” Then click on “Continue.” That will take you back to the screen where you selected the variables. Click on “OK” and SPSS will open the Output window and display the results of what you requested. For each variable write a sentence or two indicating which measure(s) of central tendency would be appropriate to use to describe the center of the distribution and what the values of those statistics mean.
Part III – Measures of Dispersion or Variation
Dispersion or variation refers to the degree that values in a distribution are spread out or dispersed. The measures of dispersion that we’re going to discuss are appropriate for interval and ratio level variables (see Exercise STAT1S). We’re going to discuss four such measures – the range, the interquartile range, the variance, and the standard deviation.
The range is the difference between the highest and the lowest values in the distribution. Run FREQUENCIES for d1_age and compute the range by looking at the frequency distribution. You can also ask SPSS to compute it for you. Click on “Statistics” and then click on “Range.” You should get 71 which is 89 – 18. The range is not a very stable measure since it depends on the two most extreme values – the highest and lowest values. These are the values most likely to change from sample to sample.
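To see why the range is unstable, here is a small Python sketch using a hypothetical list of twelve ages (not the GSS data). The lowest and highest values were chosen to match the 18 and 89 mentioned above, so the range comes out to 71. Dropping just the oldest case changes the range considerably.

```python
# Hypothetical ages for a small sample (invented; NOT the GSS data).
ages = [18, 22, 25, 33, 38, 41, 47, 52, 60, 64, 71, 89]

# Range: highest value minus lowest value.
age_range = max(ages) - min(ages)

# Remove only the single oldest respondent and recompute.
ages_without_oldest = sorted(ages)[:-1]
range_without_oldest = max(ages_without_oldest) - min(ages_without_oldest)

print(age_range, range_without_oldest)
```

One case out of twelve moves the range from 71 to 53, which is the kind of sample-to-sample change the text describes.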
A more stable measure of dispersion is the interquartile range which is the difference between the third quartile (Q3) and the first quartile (Q1). The third quartile is the same thing as the seventy-fifth percentile which is the value that has 25% of the cases above it and 75% of the cases below it. The first quartile is the same as the twenty-fifth percentile which is the value that has 75% of the cases above it and 25% of the cases below it. SPSS will calculate Q3 and Q1 for you. Click on the “Statistics” button and then click on “Quartiles” in the “Percentiles” box in the upper left. Once you know Q3 and Q1 you can calculate the interquartile range by subtracting Q1 from Q3. Since it’s not based on the most extreme values it will be more stable from sample to sample. Go back to SPSS and calculate Q3 and Q1 for d1_age and then calculate the interquartile range. Q3 will equal 60 and Q1 will equal 33 and the interquartile range will equal 60 – 33 or 27.
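The same idea can be sketched in Python with the standard library's statistics.quantiles function. The ages below are hypothetical, not the GSS data. Statistical packages use slightly different interpolation rules for percentiles, so quartiles computed this way may not match SPSS exactly, especially in small samples. Treat this as an illustration of the idea rather than a reproduction of the SPSS output.

```python
from statistics import quantiles

# Hypothetical ages for a small sample (invented; NOT the GSS data).
ages = [18, 22, 25, 33, 38, 41, 47, 52, 60, 64, 71, 89]

# n=4 asks for the three cut points that split the data into quarters:
# Q1 (25th percentile), Q2 (the median), and Q3 (75th percentile).
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

# Change only the most extreme value (89 becomes 99) and recompute.
ages_changed = ages[:-1] + [99]
q1_changed, _, q3_changed = quantiles(ages_changed, n=4)
iqr_changed = q3_changed - q1_changed

print(q1, q3, iqr)
print(iqr_changed, max(ages_changed) - min(ages_changed))
```

Changing the oldest case leaves the interquartile range untouched while the range grows from 71 to 81. For d1_age itself the calculation is simply 60 - 33 = 27.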
The variance is the sum of the squared deviations from the mean divided by the number of cases minus 1 and the standard deviation is just the square root of the variance. Your instructor may want to go into more detail on how to calculate the variance by hand. SPSS will also calculate it for you. Click on the “Statistics” button and then click on “Variance” and on “Standard deviation.” The variance should equal 297.29 and the standard deviation will equal 17.242.
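For readers who want to see the hand calculation spelled out, the following Python sketch follows the definition step by step on a small hypothetical set of values and then checks the result against the standard library. The eight values are invented for illustration. The last line confirms that the standard deviation reported for d1_age is the square root of the reported variance.

```python
from math import sqrt
from statistics import stdev, variance

# Hypothetical values (invented for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Step 1: the mean.
mean = sum(values) / n

# Step 2: each value's squared deviation from the mean.
squared_deviations = [(v - mean) ** 2 for v in values]

# Step 3: sum the squared deviations and divide by the number of cases minus 1.
variance_by_hand = sum(squared_deviations) / (n - 1)

# Step 4: the standard deviation is the square root of the variance.
sd_by_hand = sqrt(variance_by_hand)

# The library functions use the same n - 1 definition.
print(variance_by_hand, variance(values))
print(sd_by_hand, stdev(values))

# Check on the figures reported for d1_age: sqrt(297.29) rounds to 17.242.
print(round(sqrt(297.29), 3))
```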
The variance and the standard deviation can never be negative. A value of 0 means that there is no variation or dispersion at all in the distribution. All the values are the same. The more variation there is, the larger the variance and standard deviation.
So what do the variance (297.29) and the standard deviation (17.242) of the age distribution mean? That’s hard to answer because you don’t have anything to compare them to. But if you knew the standard deviation for both men and women you would be able to determine whether men or women have more variation. Instead of comparing the standard deviations for men and women directly you would compute a statistic called the Coefficient of Relative Variation (CRV). CRV is equal to the standard deviation divided by the mean of the distribution. A CRV of 2 means that the standard deviation is twice the mean and a CRV of 0.5 means that the standard deviation is one-half of the mean. You would compare the CRVs for men and women to see whether men or women have more variation relative to their respective means.
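A brief Python sketch may help show why the CRV is useful. The two groups below are hypothetical, and the helper function crv is defined here only for illustration. Both groups have exactly the same standard deviation, but because their means differ, their relative variation is quite different.

```python
from statistics import mean, stdev

def crv(values):
    """Coefficient of Relative Variation: standard deviation divided by the mean.

    Illustrative helper; it assumes the mean is not zero.
    """
    return stdev(values) / mean(values)

# Two hypothetical groups (invented for illustration).
group_a = [10, 12, 14, 16, 18]       # mean of 14
group_b = [100, 102, 104, 106, 108]  # mean of 104

# The values in each group are spread out by the same amount...
print(stdev(group_a), stdev(group_b))

# ...but relative to its mean, group A varies much more than group B.
print(round(crv(group_a), 3), round(crv(group_b), 3))
```

Here the standard deviations are identical, yet the CRV for group A is about 0.23 and the CRV for group B is about 0.03.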
You might also have wondered why you need both the variance and the standard deviation when the standard deviation is just the square root of the variance. You’ll just have to take my word for it that you will need both as you go further in statistics.
Run FREQUENCIES for the following variables. Once you have selected the variables click on the “Statistics” button and check the boxes for quartiles, range, variance, standard deviation, and mean. Then click on “Continue.” That will take you back to the screen where you selected the variables. Click on “OK” and SPSS will open the Output window and display the results of what you requested. For each variable write a sentence or two indicating what the values of these statistics are and what those values mean. Compare the relative variation for the number of male sex partners since the age of 18 (s1_nummen) and the number of female sex partners (s2_numwomen) by comparing the CRVs for each variable.
 Frequency distributions can be grouped or ungrouped. Think of age. We could have a distribution that lists all the ages in years of the respondents to our survey. One of the variables (d1_age) in our data set does this. But we could also divide age into a series of categories such as under 30, 30 to 39, 40 to 49, 50 to 59, 60 to 69, and 70 and older. In a grouped frequency distribution the mode would be the most common category or categories.
 In a grouped frequency distribution the median would be the category that contains the middle value.
 See Exercise STAT3S for a more thorough discussion of skewness.
The Index of Qualitative Variation can be used to measure variation for nominal variables.