Interpreting Statistics

Edited 11/09/04.



The simplest approach to this is to use the chi-square statistic:

Chi-square (Pearson) = 61.63 (p = 0.00)

What this says is that this distribution is very unlikely to have occurred by chance: the probability is less than .001 (the printout rounds it to 0.00). The typical social science study uses a significance level of .05. This means that there is very likely a relationship between our two variables (abany and relig). We still have to look at the pattern in the table to see exactly what that relationship might be.
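To see where a chi-square number like the one above comes from, here is a minimal sketch of the calculation in Python. The counts below are invented for illustration; they are not the actual abany-by-relig table.

```python
# Chi-square test of independence on a hypothetical 2x3 cross-tabulation.
# All counts below are made up for illustration.
observed = [
    [30, 20, 10],   # e.g. "abortion OK" across three religion categories
    [10, 25, 35],   # e.g. "abortion not OK"
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total.
# Chi-square sums (observed - expected)^2 / expected over all cells.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi-square = {chi_sq:.2f} with {df} degrees of freedom")
```

A statistical program such as SPSS then converts the chi-square and degrees of freedom into the p value reported above.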



Inductive Generalizations

There are many computer programs for performing statistical analysis on data. The most popular are SPSS, SAS, Minitab, and Stata.

Descriptive statistics: normally the first task in analysis of data. Describing single variables means determining their distributional characteristics, such as central tendency; where possible, visual techniques (box plots, stem-and-leaf displays, etc.) are used to make the central tendency and distribution clear. Summary statistics for the variables are calculated: central tendency (mean, median, mode) and distributional characteristics (frequency distributions, percentages in classifications, range, IQR, standard deviation, etc.). This information on the variables allows one to (1) describe the data, (2) decide if a hypothesis is testable using the data one has, and (3) select appropriate statistics to test the hypothesis.
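The summary statistics named above can all be computed with Python's standard statistics module. A minimal sketch, using a small made-up sample of weights:

```python
# Descriptive statistics on a small hypothetical sample.
import statistics

weights = [150, 165, 165, 170, 180, 195, 210]  # made-up data

mean = statistics.mean(weights)                # central tendency
median = statistics.median(weights)
mode = statistics.mode(weights)
stdev = statistics.stdev(weights)              # sample standard deviation
value_range = max(weights) - min(weights)      # range

# Quartile cut points, for the interquartile range (IQR)
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1
```

Checking these numbers before testing a hypothesis is exactly the "first task" described above: it tells you what the variables look like and whether they vary at all.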

Inferential statistics: inferring is the process of making a guess (hopefully an educated one) about a population (the target) based on information collected in a sample. There are many different statistics that could be used, but some form of significance test is commonly applied.

Basic concepts:

Hypothesis, an educated guess about a possible relationship between two variables (1. amount of food intake determines one's weight; 2. gender determines how much financial support one receives from friends and relatives to go to college; 3. a person's religious belief is related to their attitude about abortion).

Independent Variable, the cause: prior in time, the determiner of the other variable in your hypothesis (1. food intake; 2. gender; 3. religious belief).

Dependent Variable, the effect: later in time, determined by the other variable in your hypothesis (1. weight; 2. amount of financial help one receives from relatives and friends to go to college; 3. attitude about abortion).

Control Variable, a variable you have reason to believe is also related to your variables. You control for this variable to examine its relationship to your other variables (1. amount of exercise, metabolic rate, height; 2. age, marital status, socioeconomic status of respondent, friends, and relatives; 3. gender of respondent, age, education).

Significance (statistical), if a distribution could have occurred by chance (randomly or accidentally), then the distribution is not significant. If a distribution is not likely to have occurred by chance, it is significant. Say you are rolling dice for real bucks and your opponent makes his points every time. Such a run is not likely to have occurred by chance (that is what is meant by statistical significance); probably he or she is cheating (that is your personal significance). Typical statistical significance levels for social science studies are .05 or less, meaning that what occurred in the table distribution, correlation, etc. could have occurred by chance in only 5 out of 100 replications of your study. This means there is a likely relationship between the variables you are examining: one variable is "likely a cause" of the other. Statistical reports always say "likely," since statistics are probabilities and there are other possibilities, such as a third variable having caused changes in both of the variables you are examining.
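The idea of "could have occurred by chance" can be made concrete by simulation: run a fair process many times and count how often an outcome as extreme as the observed one appears. A minimal sketch, using coin flips rather than dice (the numbers are invented for illustration):

```python
# Simulating a p value: how often does a fair process produce a result
# at least as extreme as the one observed? Numbers here are made up.
import random

random.seed(42)  # fixed seed so the run is reproducible

observed_heads = 9          # say we observed 9 heads in 10 flips
trials = 100_000
at_least_as_extreme = 0

for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(10))
    if heads >= observed_heads:
        at_least_as_extreme += 1

p_value = at_least_as_extreme / trials
# The exact probability is 11/1024, about .011, which is below the
# .05 cutoff, so this result would be called statistically significant.
```

This is the same logic applied to the dice example above: a streak this unlikely under pure chance points toward something other than chance.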

Representativeness and bias, the sample should be created so as to represent the population (target) as closely as possible. The best mathematical way to ensure this is by collecting a random sample (a very precise technique, the Equal Probability of Selection Method, in which each member of the target has the same likelihood of selection).
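A minimal sketch of an equal-probability random sample, using Python's standard library; the "population" of student ID numbers below is made up:

```python
# Equal Probability of Selection Method (EPSEM) sketch: random.sample
# draws without replacement, giving every member the same chance.
import random

random.seed(7)  # fixed seed so the run is reproducible

population = list(range(1, 5001))        # e.g. 5,000 student ID numbers
sample = random.sample(population, 100)  # each ID is equally likely
```

In a real survey the hard part is building the complete list (the sampling frame) to draw from; the draw itself is this simple.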

Sample Size, in general, the bigger the better, but 1,500 cases is an adequate number and is used for most national surveys.

Error Margin, the range within which the best statistical guess occurs. Examples include:

Guessing people's weight at the fair to within 3 lbs (this error margin actually covers a range of 5 lbs, as seen below, where my weight is 165):

163 164 165 166 167

Predicting election results (for example, saying Gray Davis was predicted to obtain 60% to 68% of the vote for Governor).

Confidence Level, a measure of the strength of your conviction that the real occurrence will fall within your error margin. Generally this is the 95% level of confidence, stated as "We are 95% confident that the election results will be between 60% and 68% [the error margin] for Gray Davis for Governor," or "We are 95% confident that our weight guess of 165 lbs is within 3 lbs."
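Error margin and confidence level come together in the familiar survey formula: for a percentage, the 95% margin of error is roughly 1.96 standard errors. A minimal sketch with made-up numbers:

```python
# 95% error margin for a survey percentage, using the usual
# normal-approximation formula. All numbers are invented.
import math

p = 0.64      # sample proportion, e.g. 64% say they will vote for a candidate
n = 1500      # sample size, as in typical national surveys
z = 1.96      # z-score for the 95% confidence level

margin = z * math.sqrt(p * (1 - p) / n)
low, high = p - margin, p + margin
# With n = 1500 the margin is about 2.4 percentage points, so we are
# 95% confident the true figure is between roughly 61.6% and 66.4%.
```

This is also why 1,500 cases is adequate for national surveys, as noted above: the margin of error shrinks with the square root of the sample size, and at 1,500 cases it is already down to a few percentage points.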

Typically for a computer statistical analysis of survey data, you:

  1. obtain a frequency distribution of your independent and dependent variables first to make sure they do indeed vary in your data; in other words, make sure all your cases do not fall into a single classification, such as being all male or all female.
  2. create a table with your dependent and independent variables (see Table conventions). There are other possibilities, such as determining the correlation between two variables.
  3. interpret the distributions in the table and the meaning of the statistics calculated.
  4. look at the appropriate significance statistics to determine whether your table distributions are significant (not likely to have occurred by chance).
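Step 1 above can be sketched in a few lines: build a frequency distribution and check that the variable actually varies. The responses below are made up:

```python
# Step 1 sketch: frequency distribution of a variable, checked for
# variation, using collections.Counter on made-up survey responses.
from collections import Counter

gender = ["male", "female", "female", "male", "female", "male", "female"]
freq = Counter(gender)

# If every case fell into one category, the variable would not vary
# and could not be used to test a relationship.
varies = len(freq) > 1
```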

The Logic of Inferential Statistics



Inferential statistics techniques are designed to specify estimates, and confidence in those estimates, of a population based on data collected from a sample. Estimates of the distribution of all possible samples, the sampling distribution, are made from the sample's variability (samples A, B, and C in the graphic above). One can then estimate the likelihood of finding a difference equal to or greater than that found in the sample data. For example, a survey might collect a random sample of 100 surveys from a university containing 5,000 students; the sample's central tendency (mean, median, or mode) is used to estimate the central tendency of the population of 5,000.
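The university example can be sketched as a simulation: a simulated population of 5,000 students, a random sample of 100, and the sample mean used as the estimate of the population mean. All numbers below are invented:

```python
# Estimating a population mean from a sample of 100 out of 5,000.
# The "ages" are simulated, not real survey data.
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

population = [random.gauss(21, 3) for _ in range(5000)]  # simulated ages
sample = random.sample(population, 100)                  # EPSEM draw

estimate = statistics.mean(sample)       # our best guess for the population
true_mean = statistics.mean(population)  # known here only because we simulated
# In a real survey we never see true_mean; the error margin and
# confidence level describe how far off the estimate is likely to be.
```

Rerunning this with different seeds mimics samples A, B, and C in the graphic: each sample gives a slightly different estimate, and the spread of those estimates is the sampling distribution.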