This section will cover the basics of selecting the sample, measuring the variable(s), organizing the information into tables, and creating basic graphs to represent the data.
In the previous section, we noted that random sampling is the best way to obtain a representative sample, but that random sampling is rarely done in psychological research. Chapter 9 of your research methods textbook explains why random sampling is so rare, so we will not cover that issue here. Nevertheless, some aspects of sampling are important to understand from a statistical perspective, and those aspects are covered here.
Although true random sampling may not be possible in psychology, you want to avoid sampling procedures that create obviously biased samples. A biased sample is one that is not representative of the population on one or more dimensions.
For example, suppose you want to conduct market research for an automobile company to determine which characteristics of a car are most important in determining which car people are likely to buy. If you were to sample individuals who attended a hot rod show, you would likely get a biased sample of automobile buyers. In general, such a sample would over-represent males who like cars with a lot of horsepower and a sleek profile. It would under-represent both males and females who are interested in other characteristics, such as spaciousness, dependability, good gas mileage, and safety features.
It may not be possible to get a random sample of all car buyers, but it would be important to try to get as unbiased a sample as possible. For example, you might make random calls to people, ask whether they expect to buy a car in the next two years, and then ask those who say yes what is important to them in their car-buying decision. You could also sample individuals who stop by dealerships to buy a car, ideally at several dealerships in several locations. You would want to sample from multiple locations because people in different locations often vary on critical variables, such as how much money they make, and variables like income are likely to affect what people look for in a car.
Statisticians make a distinction between sampling with replacement and sampling without replacement. Sampling with replacement means that each person or item sampled is returned to the population and has the potential to be sampled a second time. Although it may not be obvious, this means that every sampled item or individual is drawn from the same population, because the population is returned to its former state by replacing each sampled item or individual before the next one is drawn.
This concept is easier to understand with an example. Assume that the population consists of the 50 states in the United States, and that you want to sample 10 of those states randomly. Suppose you do the sampling by writing each state name on a card, shuffling the cards, and then randomly selecting a card from the pile. If you are sampling with replacement, you return the selected card to the deck before you shuffle and select the next state. With just 50 states and a sample of 10, there is a good chance that you will select at least one state more than once. In contrast, if you are sampling without replacement, you shuffle the deck of 50 state names and randomly select 10 of the cards. There is no possibility of selecting the same state twice, because once you select a state, the deck no longer has that state in it.
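The card-drawing procedure can be sketched in a few lines of Python. This is only an illustration, not part of the original example: the state labels are placeholders ("State 1" through "State 50") rather than real state names, and the seed is arbitrary.

```python
import random

rng = random.Random(0)  # arbitrary seed so the sketch is reproducible

# Placeholder labels standing in for the 50 state names (an assumption).
states = [f"State {i}" for i in range(1, 51)]

# With replacement: each card goes back in the deck before the next draw,
# so the same state can be drawn more than once.
with_replacement = rng.choices(states, k=10)

# Without replacement: 10 cards are dealt from the shuffled deck,
# so every drawn state is different.
without_replacement = rng.sample(states, 10)

print("distinct states, with replacement:   ", len(set(with_replacement)))
print("distinct states, without replacement:", len(set(without_replacement)))
```

The second count is always 10. The first count can be smaller than 10 whenever a state happens to be drawn more than once.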
You might wonder why anyone would select with replacement, because such a procedure makes it possible that someone could be represented more than once in the sample. Technically, all statistical procedures are based on the assumption of random sampling, and true random sampling is sampling with replacement. However, if the population size is much larger than the sample size, sampling with replacement and sampling without replacement are likely to provide the same results. For example, if the population size is 1,000,000 and the sample size is 10, it is extremely unlikely that one of those 1,000,000 people in the population will be sampled twice in a random sample with replacement, so the two sampling methods will give you the same results. It is extremely rare that statistical procedures are based on samples that are close to the size of the population. Consequently, we can almost always ignore this subtle distinction.
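To see how quickly the distinction stops mattering, you can compute the chance that a sample drawn with replacement contains at least one repeat. The sketch below is an illustration under the assumption that every member of the population is equally likely to be drawn on every draw. The function name prob_repeat is made up for this example.

```python
def prob_repeat(population_size, sample_size):
    """Chance that sampling with replacement draws at least one member twice."""
    p_all_distinct = 1.0
    for k in range(sample_size):
        # After k distinct draws, (population_size - k) members are still unseen.
        p_all_distinct *= (population_size - k) / population_size
    return 1.0 - p_all_distinct

print(prob_repeat(50, 10))         # 10 states drawn from 50
print(prob_repeat(1_000_000, 10))  # 10 people drawn from 1,000,000
```

Under this assumption, the chance of a repeat is about 0.62 for 10 draws from 50 states, but only about 0.000045 for 10 draws from 1,000,000 people.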
A common research design is to draw more than one sample from the population and apply different manipulations to the different samples. For example, a drug company may want to compare a new drug against an existing drug on its effectiveness at reducing a medical problem. You will learn in the research methods text that the norm is to draw a single sample large enough for all of the groups you want to use in your study and then randomly assign each person in the sample to one of the groups.
Technically, if the population is very large and the samples you want to draw are relatively small by comparison, you could draw each sample separately and dispense with the random assignment to groups. In practice, however, the people actually available to most researchers are only a small subset of the population. For example, the population may include 1,000,000 people, but only 200 of them may live close enough to the researcher to be available for the study. Because the researcher would be sampling without replacement (each person can be in only one treatment condition), selecting the first sample would have too large an effect on who could be selected for the other sample. This is why sampling from the population and randomly assigning participants to groups are done as two separate steps.
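The two-step procedure can be sketched as follows. This is a minimal illustration, not a prescribed method: the participant IDs, the sample size of 40, the seed, and the helper name assign_to_groups are all assumptions made for the example.

```python
import random

def assign_to_groups(participants, n_groups, rng):
    """Randomly assign participants to n_groups groups of (nearly) equal size."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled participants out like cards, one to each group in turn.
    return [shuffled[i::n_groups] for i in range(n_groups)]

rng = random.Random(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical IDs for the 200 people available to the researcher.
available = [f"P{i:03d}" for i in range(1, 201)]

# Step 1: draw one sample, without replacement, large enough for both groups.
sample = rng.sample(available, 40)

# Step 2: randomly assign each sampled person to one of the two conditions.
new_drug, existing_drug = assign_to_groups(sample, 2, rng)

print(len(new_drug), len(existing_drug))
```

Because both groups come from the same shuffled sample, no person can end up in both conditions, and neither group gets first pick of the available participants.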
Measuring a variable means assigning numbers to represent the amount of that variable. This process requires an operational definition of the variable, which is a detailed set of procedures to follow in making those measurements. This process is discussed in Chapter 4 of the research methods textbook.
That chapter also explains that measurement procedures may produce numbers that are a good match to the abstract number system or numbers that are a relatively poor match. We describe the degree of matching by specifying the scale of measurement. Chapter 4 identifies four scales: nominal, ordinal, interval, and ratio. Because the chapter discusses these scales in some detail, that discussion will not be repeated here.
The primary reason that we note the scale of measurement for a variable is that it affects what statistics are appropriate to compute on that variable. We will discuss this issue as we discuss each of the descriptive and inferential statistics covered on this website.
The next section will discuss how to organize data in distribution tables.