STAT16S: Exercise Using SPSS to Explore Dummy Variable Regression

Author:   Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA 93740
Email:  ednelson@csufresno.edu

Note to the Instructor: The data set used in this exercise is gss14_subset_for_classes_STATISTICS.sav which is a subset of the 2014 General Social Survey. Some of the variables in the GSS have been recoded to make them easier to use and some new variables have been created.  The data have been weighted according to the instructions from the National Opinion Research Center.  This exercise uses LINEAR REGRESSION in SPSS to explore dummy variable regression and also uses FREQUENCIES, SELECT CASES, and COMPUTE.  A good reference on using SPSS is SPSS for Windows Version 23.0 A Basic Tutorial by Linda Fiddler, John Korey, Edward Nelson (Editor), and Elizabeth Nelson.  The online version of the book is on the Social Science Research and Instructional Council's Website.  You have permission to use this exercise and to revise it to fit your needs.  Please send a copy of any revision to the author. Included with this exercise (as separate files) are more detailed notes to the instructors, the SPSS syntax necessary to carry out the exercise (SPSS syntax file), and the SPSS output for the exercise (SPSS output file).  Please contact the author for additional information.

Goals of Exercise

The goal of this exercise is to introduce dummy variable regression.  The exercise also gives you practice using LINEAR REGRESSION, FREQUENCIES, SELECT CASES, and COMPUTE in SPSS.

Part I – Dummy Variables

We’re going to use the General Social Survey (GSS) for this exercise.  The GSS is a national probability sample of adults in the United States conducted by the National Opinion Research Center (NORC).  The GSS started in 1972 and has been an annual or biennial survey ever since.  For this exercise we’re going to use a subset of the 2014 GSS.  Your instructor will tell you how to access this data set, which is called gss14_subset_for_classes_STATISTICS.sav.

In a previous exercise (STAT14S) we considered linear regression for one independent and one dependent variable, which is often referred to as bivariate linear regression.  Multiple linear regression (STAT15S) expands the analysis to include multiple independent variables.  In both of these exercises the variables in the regression analysis were interval or ratio (STAT1S).  What do you do if you want to include a nominal or ordinal variable as one of your independent variables in the regression?

The answer is to create dummy variables.  Consider the respondent’s sex.  The variable d5_sex has two categories – 1 for males and 2 for females.  What we do is to create two dummy variables – one for males and the other for females.  Here’s how we do it:

  • d5_sex_males = 1 if male and 0 if female, and
  • d5_sex_females = 1 if female and 0 if male.

If there are k categories, then you use k – 1 of the dummy variables in your regression analysis.  The category that you omit becomes your comparison group.  We’re going to enter d5_sex_males into the analysis and omit d5_sex_females.  That means that females will be the comparison group.

What if you had more than two categories?  For example, the region where the respondent lives (d25_region) has nine categories.  You could create nine dummy variables and omit one of them from the regression, but you don’t actually need to create all nine since one will be omitted.  If you decide to omit the category for the Pacific region (value 9), then you would create eight dummy variables, one for each of the other categories, and the Pacific region would be your comparison group.
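If it helps to see the k – 1 rule outside of SPSS, here is a minimal Python sketch of the same idea.  It is only an illustration: the helper name make_dummies and the five sample region codes are made up, and only the coding scheme (nine categories, Pacific = 9 omitted) comes from the text above.

```python
# Illustrative sketch only: make_dummies and the sample rows are hypothetical.
def make_dummies(values, categories, omit):
    """Return {category: [0/1, ...]} for every category except the omitted one."""
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != omit
    }

regions = [1, 9, 3, 9, 5]  # made-up d25_region codes for five respondents
dummies = make_dummies(regions, categories=range(1, 10), omit=9)

print(len(dummies))   # 8 dummy variables for 9 categories
print(dummies[3])     # [0, 0, 1, 0, 0]
```

Notice that a respondent in the omitted category (Pacific) has 0 on all eight dummy variables, which is what makes that category the comparison group.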

Neither of these two variables (d5_sex and d25_region) has missing data.  But if the variable for which you are creating dummy variables does have missing data, you need to be careful to exclude the cases with missing data from the analysis.  In particular, make sure they don’t end up coded as 0 or 1 on one of your dummy variables.
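The point about missing data can be sketched in a few lines of Python.  This is a hedged illustration, not part of the exercise: the function name dummy_or_missing is made up, and using None to stand for a missing value is an assumption of the sketch (the variables in this exercise have no missing data).

```python
# Illustrative sketch only: None stands in for a missing value (an assumption).
def dummy_or_missing(value, target):
    """Code 1/0 for valid values; keep missing values missing."""
    if value is None:
        return None          # do not silently turn a missing case into a 0
    return 1 if value == target else 0

sex = [1, 2, None, 1]        # made-up values with one missing case
males = [dummy_or_missing(v, target=1) for v in sex]

print(males)                 # [1, 0, None, 1]
```

If the missing case had been coded 0 instead, it would have been counted as part of the comparison group.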

Let’s use tv1_tvhours as our dependent variable as we did in the previous two exercises (STAT14S and STAT15S).  Run FREQUENCIES to get the frequency distribution for tv1_tvhours.  (See Chapter 3, Frequencies in the online SPSS book mentioned on page 1.)  In the previous exercises we discussed outliers and noted that there are a few individuals (i.e., 14) who watched a lot of television, which we defined as 14 or more hours per day.  We also noted that outliers can affect our statistical analysis, so we decided to remove these outliers from our analysis.

To remove these outliers, click on “Data” in the menu bar in SPSS and then click on “Select Cases” in the dropdown menu.  (See Chapter 3, Select Cases in the online SPSS book.)  Click on the circle next to “If condition is satisfied” and then click on the “If” button directly below it.  Scroll down the list of variables on the left, select tv1_tvhours, and click on the arrow pointing to the right to move it into the box in the upper right.  Then click on the <= button, the 1 button, and the 3 button so the expression in the box reads “tv1_tvhours <= 13”.  Finally click on “Continue” and then on “OK”.  Now use FREQUENCIES again to get the frequency distribution for tv1_tvhours and make sure that you correctly removed the outliers.  You should not see any cases with more than 13 hours.[1]
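The selection rule itself is simple, and a short Python sketch may make it concrete.  The six values below are made up for illustration.  The only thing taken from the exercise is the condition tv1_tvhours <= 13.

```python
# Illustrative sketch only: the data values are made up.
tv1_tvhours = [0, 2, 5, 13, 14, 24]

# Keep the cases that satisfy the condition; the original list is untouched,
# which mirrors the fact that the outliers are only filtered out temporarily.
selected = [h for h in tv1_tvhours if h <= 13]

print(selected)          # [0, 2, 5, 13]
print(len(tv1_tvhours))  # 6 (nothing was deleted from the data)
```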

To create the dummy variable for males (d5_sex_males), click on “Transform” in the menu bar in SPSS and then click on “Compute Variable.”  (See Chapter 3, Compute in the online SPSS book.)  Enter the variable name, d5_sex_males, in the “Target Variable” box and enter 0 in the “Numeric Expression” box.  Then click on “OK.”  This will assign the value 0 to all cases.  Then open “Compute Variable” again, keep d5_sex_males as the target variable, and enter the value 1 in the “Numeric Expression” box.  This time click on the “If” button in the lower left of the dialog box.  Select the button labelled “Include if case satisfies condition” and enter the condition in the box below the button.  The condition will be “d5_sex = 1”.  You can select the variable from the list on the left and click on the arrow pointing to the right.  Click on the equal sign and then click on 1.  Now click on “Continue” and then on “OK.”  SPSS will ask you if it is OK to change the existing variable and you will click on “Yes.”  Now all males will have the value 1 and all females will have the value 0.
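The two passes through “Compute Variable” amount to the following logic.  This is a minimal Python sketch for checking your understanding, not SPSS syntax, and the five d5_sex values are made up.

```python
# Illustrative sketch only: the d5_sex values are made up (1 = male, 2 = female).
d5_sex = [1, 2, 2, 1, 2]

# Pass 1: assign 0 to every case.
d5_sex_males = [0] * len(d5_sex)

# Pass 2: overwrite with 1 only where the condition d5_sex = 1 holds.
for i, sex in enumerate(d5_sex):
    if sex == 1:
        d5_sex_males[i] = 1

print(d5_sex_males)  # [1, 0, 0, 1, 0]
```

Because every case starts at 0, this approach is only safe when the original variable has no missing data, which is true of d5_sex.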

Part II – Regression with Dummy Variables

Click on “Analyze” in the menu bar of SPSS and then click on “Regression” which will open another dropdown menu.  Click on “Linear” in the menu.  Your dependent variable will be tv1_tvhours.  Enter the dummy variable for males (d5_sex_males) as your independent variable.  Remember that you are omitting the dummy variable for females (d5_sex_females) so this becomes your comparison group. 

Let’s look at the output box that contains your unstandardized regression coefficients.  From this you can see that your regression equation for predicting tv1_tvhours is 2.679 + .129 X where X is your dummy variable.  Remember that your dummy variable, d5_sex_males, equals 1 if the person is male and 0 if the person is female.  So for males the predicted number of hours watching television is 2.679 + .129 (1) or 2.808.  For females the predicted number of hours is 2.679 + .129 (0) or 2.679.  Since we left the dummy variable for females (d5_sex_females) out of the regression equation, females become our comparison group.  The unstandardized regression coefficient (0.129) is the mean number of hours that males watch television (2.81) minus the mean for females (2.68), which is 0.13.[2]
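You can convince yourself that the slope for a dummy variable is just the difference between the two group means with a small Python sketch.  The seven cases below are made up and unweighted (so they will not reproduce the GSS numbers), and ols_simple is a hypothetical helper that applies the ordinary least squares formulas for one independent variable.

```python
# Illustrative sketch only: made-up, unweighted data; ols_simple is hypothetical.
from statistics import mean

def ols_simple(x, y):
    """Least squares intercept and slope for one independent variable."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

males = [1, 1, 1, 0, 0, 0, 0]   # dummy variable: 1 = male, 0 = female
hours = [3, 4, 2, 1, 2, 3, 2]   # hours of television per day

intercept, slope = ols_simple(males, hours)
mean_m = mean(h for h, m in zip(hours, males) if m == 1)
mean_f = mean(h for h, m in zip(hours, males) if m == 0)

print(round(intercept, 3), round(slope, 3))  # 2.0 1.0
print(mean_f, mean_m - mean_f)               # 2 1
```

The intercept equals the mean for the comparison group (females) and the slope equals the male mean minus the female mean.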

SPSS will also calculate t tests to test the null hypotheses that the regression coefficients in the population are equal to 0.  Normally we’re only interested in the slope.  The t value is 1.300 and the significance value is .194.  This means that we can’t reject the null hypothesis.  In other words, we have no basis for asserting that the population slope is different from zero.  The Pearson Correlation Coefficient Squared (Coefficient of Determination) tells us that the dummy variable for sex explains virtually none of the variation in the dependent variable.

Part III – Now it’s Your Turn

Use SPSS to get the frequency distribution for d6_race.  There are three categories for this variable – white (value 1), black (value 2), and other (value 3).  We want to compare whites with non-whites.  This means that there will be two dummy variables:

  • d6_race_white – equals 1 if the person is white and 0 if the person is non-white, and
  • d6_race_nonwhite – equals 1 if the person is non-white and 0 if the person is white.

Let’s make non-whites our comparison group so that means that we’ll leave d6_race_nonwhite out of the regression equation.  Use COMPUTE to create the d6_race_white dummy variable.

Now run the regression analysis with tv1_tvhours as your dependent variable and d6_race_white as your independent variable.

  • Write out the regression equation.
  • What do the unstandardized regression coefficients tell you?
  • What are the values of R and R² and what do they tell you?
  • What are the different tests of significance that you can carry out and what do they tell you?

Part IV – Multiple Regression with Dummy Variables

In STAT15S you did a regression analysis with tv1_tvhours as your dependent variable and d1_age, d24_paeduc, and d4_educ as your independent variables.  This time we’re going to add d5_sex_males into the analysis.  Use SPSS to carry out the regression analysis for this model.

The regression equation for predicting tv1_tvhours is 3.372 + .021 (d1_age) - .055 (d24_paeduc) - .086 (d4_educ) + .223 (d5_sex_males).  The unstandardized regression coefficients show the average change in the dependent variable when the independent variable increases by one unit after statistically adjusting for the other independent variables in the equation.  As age increases, television viewing increases, but as the respondent’s education and father’s education increase, television viewing goes down.  Males watch more television than females.  The t tests show that all the unstandardized coefficients are statistically significant, meaning that we can reject the null hypotheses that they are zero in the population.  The Pearson Multiple Correlation Coefficient Squared (Coefficient of Determination) tells us that together the independent variables explain or account for 9.6% of the variation in television viewing.  The Adjusted Squared Correlation Coefficient adjusts for the number of independent variables and is slightly lower (9.3%).  The F test in the analysis of variance table is also statistically significant, meaning that we can reject the null hypothesis that our independent variables explain none of the variation in the dependent variable.  The Beta values show the relative importance of the independent variables in predicting television viewing and tell us that age is the most important and sex the least important, with both respondent’s education and father’s education in between.
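To see how the equation is used, here is a minimal Python sketch that plugs values into it.  The coefficients are the ones reported above.  The function name predict_tvhours and the example respondent (age 40, father with 12 years of education, respondent with 16 years) are made up for illustration.

```python
# Illustrative sketch only: coefficients come from the text above; the function
# name and the example respondent are hypothetical.
def predict_tvhours(d1_age, d24_paeduc, d4_educ, d5_sex_males):
    """Predicted hours of television per day from the Part IV equation."""
    return (3.372
            + 0.021 * d1_age
            - 0.055 * d24_paeduc
            - 0.086 * d4_educ
            + 0.223 * d5_sex_males)

male = predict_tvhours(40, 12, 16, d5_sex_males=1)
female = predict_tvhours(40, 12, 16, d5_sex_males=0)

print(round(male, 3))           # 2.399
print(round(male - female, 3))  # 0.223
```

The difference between the two predictions is the coefficient for d5_sex_males, which is the predicted difference between males and females who have the same age, education, and father’s education.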

Part V – Now it’s Your Turn Again

Repeat the regression analysis you did in Part IV, but instead of adding d5_sex_males into the analysis, this time add the dummy variable you created in Part III (d6_race_white).  This means you will have four independent variables: d1_age, d24_paeduc, d4_educ, and d6_race_white.

  • Write out the regression equation.
  • What do the unstandardized multiple regression coefficients tell you?
  • What do the standardized regression coefficients (Beta) tell you?
  • What are the values of R and R² and what do they tell you?
  • What are the different tests of significance that you can carry out and what do they tell you?

[1] You didn’t permanently remove the outliers from the data.  Rather you temporarily removed them.

[2] The difference between .129 and 0.13 is due to the fact that SPSS calculated the means to two decimal places and the regression coefficient to three decimal places.