This page outlines the basic principles needed to organize, prepare, and conduct statistical analyses. We will be using one of the most popular statistical analysis packages, SPSS for Windows, to illustrate this process, but the principles we describe apply to virtually any statistical analysis package.
SPSS for Windows is designed to run on IBM-compatible personal computers running the Windows operating system. A variety of other statistical packages exist, some designed to run on mainframe computers (e.g., BMDP, SPSS-X, and Minitab) and others designed to run on personal computers (e.g., SPSS for Windows, Statistica). Although each statistical package uses different commands, they tend to operate similarly.
Programs designed to run on mainframe computers (large, centralized computers that you access from remote terminals) usually operate in batch mode. This means that you set up a series of command statements describing the data and the analyses you want and send these statements to the computer in one batch to be run. Some of the early statistical analysis packages designed for personal computers (e.g., SPSS-PC+) also could be run in batch mode. However, most packages running on personal computers take advantage of the fact that you have exclusive access to the computer. They are structured to perform each step immediately after you select it, giving you immediate feedback of results before you select the next step or analysis.
Most statistical analysis programs are documented in manuals, which can be purchased at most university bookstores. Many university computer centers sponsor free or low-cost courses on using particular statistical packages. The trend is to make statistical analysis packages increasingly easy to operate. That is as it should be. However, even modern statistical analysis packages cannot decide which statistical procedures are appropriate or how to interpret the results of the analysis. There is no substitute for solid training in statistics and research design.
The first step in any data analysis is to organize the data and enter them into a file accessible to the data-analysis program. Most statistical packages designed to run on personal computers have an integrated data entry system. Statistical analysis packages written to run on mainframe computers typically require that the data be entered into a computer file in a specified format.
We will not be covering the process of entering data for mainframe computers in this overview, but the manuals for any statistical package that runs on a mainframe will describe the procedures. Before you begin the process of data entry, always consult the manual for the statistical package that you intend to use to see what restrictions apply to the data organization and entry.
Virtually all statistical analysis packages follow the same conventions for data entry. For example, the order of the variables has to be identical for each participant. If you have six variables measured on each participant (age, gender, IQ score, the condition the participant was assigned to, and pretest and posttest measures on your dependent variable), those six variables must appear in the same order for each participant. In addition, each line (called a record) contains data from a single participant. The only exception to this principle is that, when using a matched-subjects design, you should place the data from the set of matched participants on a single record.
Statistical analysis packages usually arrange data by field and record, although a given package may use different terms for these concepts. A record consists of all the data for a given participant and is represented as a row in your data entry sheet. Within each record are individual fields, each of which contains the participant's score on a given variable. In our previous example, there are six fields in each record (age, gender, IQ, condition, pretest, posttest). In practice, we often add a seventh field to provide an identification number for each participant.
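The record-and-field layout described above can be sketched in a few lines of Python. This is only an illustration, not how SPSS or any other package stores data: the field names, the comma-separated layout, and the three records of made-up values are all assumptions introduced for the example. The sketch reads one record per line and rejects any record that does not have exactly seven fields, so every record is known to follow the same field order.

```python
import csv
import io

# Hypothetical field order (identification number plus the six variables);
# every record must list its fields in exactly this order.
FIELDS = ["id", "age", "gender", "iq", "condition", "pretest", "posttest"]

# Made-up values for illustration only; one line (record) per participant.
RAW = """\
1,34,F,102,treatment,27,14
2,41,M,98,control,25,24
3,29,F,110,treatment,30,12
"""

def read_records(text):
    """Parse each line into a dict keyed by field name, checking the field count."""
    records = []
    for line_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"record {line_number} has {len(row)} fields, expected {len(FIELDS)}"
            )
        records.append(dict(zip(FIELDS, row)))
    return records

records = read_records(RAW)
print(len(records), "records,", len(FIELDS), "fields each")
```

Checking the field count at entry time catches a skipped or doubled score before it silently shifts every later value into the wrong variable.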
It is best to organize the data before you begin data entry. One way to do this is to construct a data matrix table that resembles the data file. Another way to organize the data is to have data sheets for each participant, which organize the variables and have a place for the participant's score on each variable. Click here to see a data sheet for a rather complex hypothetical study. This example is for a treatment study of severe depression. A properly constructed data sheet will not only organize the data entry process, but can also help in the coding of the data (discussed below).
Many variables produce numerical scores, which require no coding before data entry. Examples are reaction times, scores on psychological tests, the number of responses in a given time interval, and so on.
Other variables require coding or scoring before the data are suitable for analysis. Sometimes the codes are simple. For example, the gender of the participant can be coded as M or F for male and female. Some statistical analysis programs operate more efficiently with numerical codes than alphabetic codes, so you may want to use 1 and 2 as the codes for male and female, respectively. The numerical code in this case is arbitrary. When arbitrary codes are used, special care must be exercised to label the output so that the codes are easily identifiable. Most statistical analysis packages have routine procedures for incorporating such labels.
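As a rough illustration of why labels matter when codes are arbitrary, the Python sketch below keeps the code-to-label mapping next to the data and uses it when producing output. The mapping follows the 1 and 2 convention mentioned above; the five coded scores and the function name are made up for the example, and real packages provide their own labeling procedures.

```python
# Arbitrary numeric codes for gender, following the 1/2 convention in the text.
GENDER_LABELS = {1: "male", 2: "female"}

# Made-up coded scores for five hypothetical participants.
gender_codes = [1, 2, 2, 1, 2]

def frequency_table(codes, labels):
    """Count each code and report it under its label rather than the bare number."""
    counts = {label: 0 for label in labels.values()}
    for code in codes:
        if code not in labels:
            # An unlabeled code usually signals a data entry error.
            raise ValueError(f"unknown code: {code!r}")
        counts[labels[code]] += 1
    return counts

table = frequency_table(gender_codes, GENDER_LABELS)
for label, count in table.items():
    print(f"{label}: {count}")
```

Because the output is keyed by label, a reader never has to remember which number stood for which category.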
In some situations, coding the data is an involved process that needs to be carefully thought out if the data analysis is to go smoothly. Suppose that you are coding a diagnostic interview for indications of psychopathology. To give an overly simplified example, suppose you asked four questions: (1) Do you experience obsessions? (2) Do you experience compulsions? (3) If obsessions and/or compulsions are experienced, do they seem unreasonable to you? (4) Do they cause marked distress and/or take up excessive time (over one hour per day)?
Of course, the questions would be worded differently to define for the participant exactly what you mean by each term, and some of these questions would actually involve a series of questions. We have simplified the questions here for the sake of illustration. The information from these four basic questions is sufficient to decide whether the participant suffers from obsessive-compulsive disorder, or OCD for short (DSM-IV-TR, American Psychiatric Association, 2000).
How could you code such data to give yourself maximum flexibility in the data analysis? You could take the information from the four questions, make the diagnosis, and code your diagnosis directly (e.g., 1 for OCD present; 2 for OCD not present). To get a 1 code (OCD present), you would have to have a yes to either or both of questions 1 and 2, as well as a yes to both questions 3 and 4. That means that there are several patterns of responses that would receive a diagnosis code of 1 (yes, yes, yes, yes; yes, no, yes, yes; and no, yes, yes, yes). There are also several ways in which you could get a diagnosis code of 2.
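The simplified decision rule can be checked by listing every possible pattern of answers. The Python sketch below is an illustration only: the function name is hypothetical, and it treats the four answers as independent yes/no values, which ignores the fact that question 3 is only asked when obsessions or compulsions are reported.

```python
from itertools import product

def diagnosis_code(obsessions, compulsions, unreasonable, distress):
    """Return 1 (OCD present) or 2 (OCD not present) under the simplified rule."""
    has_symptom = obsessions or compulsions
    return 1 if (has_symptom and unreasonable and distress) else 2

# All 16 combinations of yes (True) and no (False) for the four questions.
patterns = list(product([True, False], repeat=4))
present = [p for p in patterns if diagnosis_code(*p) == 1]
absent = [p for p in patterns if diagnosis_code(*p) == 2]

for p in present:
    print(", ".join("yes" if answer else "no" for answer in p))
print(len(present), "patterns give code 1;", len(absent), "give code 2")
```

Under these assumptions, the three patterns printed match the three listed in the text, and the remaining 13 combinations all collapse into the single code 2.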
By coding only the diagnosis (OCD either present or absent), you have lost information that might have been useful in an analysis. For example, unless you go back and recode, you no longer know which participants reported obsessions and which reported compulsions.
A carefully constructed coding scheme allows you to record the original data with all of the detail intact and still be able to simplify the data if you wish.
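One possible way to do this, sketched here in Python, is to enter the answer to each question as its own field and compute the diagnosis as a derived field. The field names, the 1/0 codes for yes and no, and the four made-up participants are assumptions for illustration; the diagnosis rule is the simplified one described earlier.

```python
# Made-up interview answers; 1 = yes, 0 = no for each of the four questions.
participants = [
    {"id": 1, "obsessions": 1, "compulsions": 0, "unreasonable": 1, "distress": 1},
    {"id": 2, "obsessions": 0, "compulsions": 1, "unreasonable": 1, "distress": 1},
    {"id": 3, "obsessions": 1, "compulsions": 1, "unreasonable": 0, "distress": 1},
    {"id": 4, "obsessions": 0, "compulsions": 0, "unreasonable": 0, "distress": 0},
]

def add_diagnosis(record):
    """Return a copy of the record with a derived diagnosis field.

    1 = OCD present, 2 = OCD not present; the original answers are kept.
    """
    symptom = record["obsessions"] or record["compulsions"]
    present = symptom and record["unreasonable"] and record["distress"]
    return {**record, "diagnosis": 1 if present else 2}

coded = [add_diagnosis(r) for r in participants]

# The simplified variable and the detailed variables are both available.
diagnosed = [r["id"] for r in coded if r["diagnosis"] == 1]
with_obsessions = [r["id"] for r in coded if r["obsessions"] == 1]
print("OCD present:", diagnosed)
print("reported obsessions:", with_obsessions)
```

Because the diagnosis is computed from the stored answers instead of replacing them, you can still ask which participants reported obsessions without going back to the interviews to recode.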