Collecting Data
About $12-15\%$ of the questions on your AP Statistics exam will cover the category of Collecting Data.
Planning a Study
The entire set of people, items, or subjects of interest to us is called a population. Because it is often not feasible to collect data from a population, a sample, or smaller subset, is selected from the population. One of the goals of statistics is to use sample data to make reliable inferences about populations.
Once a sample is selected, data collection must take place. In an experiment, the participants or subjects are explicitly assigned to two or more different conditions, or treatments. For example, a medical study investigating a new cold medication might assign half of the people in the study to a group that receives the medication, and the other half to a group that receives an older medication.
Experiments are the only way to determine causal relationships between variables. In the experiment just described, the manufacturer of the medication would like to be able to state that taking their medication causes a reduced duration of the cold.
When experiments are not possible to do for logistical or ethical reasons, observational studies often take their place. In an observational study, treatments are not assigned. Rather, data that already exists is collected and analyzed. As noted, an observational study can never be used to determine causality.
Whether a study is experimental or observational, it is important to keep in mind that the results can only be generalized to the population from which the sample was selected.
Data Collection
The methods used in collecting data play a large role in determining what conclusions can be drawn from statistical analysis of the data. A sampling method is a technique, or plan, used in selecting a sample from a population.
When a sampling method allows for the possibility of an item being selected more than once, the sampling is said to be done with replacement. If that is not possible, so that each item can be selected at most once, the sampling is without replacement.
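The distinction can be illustrated with a minimal Python sketch. The population of ten labelled items, the sample size of 5, and the fixed seed are all assumptions made for this example.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
population = list(range(1, 11))  # hypothetical population of 10 labelled items

# With replacement: each draw is made from the full population,
# so the same item may appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a drawn item is not available again,
# so every item appears at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Repeats can show up in the first list but never in the second.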
A random sample is one in which every item from the population has an equal chance of being chosen for the sample. A simple random sample, or SRS, is one in which every group of a given size has an equal chance of being chosen. Every simple random sample is also random, but the opposite is not true: some sampling techniques lead to random samples that are not simple random samples.
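For a very small population, the definition of an SRS can be checked by listing every possible sample. The sketch below assumes a hypothetical population of five items and a sample size of 2; it is an illustration of the definition, not a practical sampling procedure.

```python
import random
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical population
n = 2                                   # assumed sample size

# Every possible group of size n (order does not matter).
all_samples = list(combinations(population, n))

# In an SRS, each of these groups is equally likely to be the sample.
prob_each_sample = Fraction(1, len(all_samples))

# Chance that a given item ends up in the sample: add up the
# probabilities of all samples that contain it.
inclusion = {
    item: sum(prob_each_sample for s in all_samples if item in s)
    for item in population
}

# Drawing one SRS in practice.
srs = random.sample(population, n)

print(len(all_samples), "possible samples, each with probability", prob_each_sample)
print("inclusion probabilities:", inclusion)
print("one SRS:", srs)
```

Because every group is equally likely, every item also has the same chance of inclusion, which is why an SRS is always a random sample.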
In a stratified sample, the population is first divided into groups, or strata, based on some shared characteristic. A sample is then selected from within each stratum, and these are combined into a single larger sample. A stratified sample may be random, but it will never be an SRS.
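A stratified sample might be drawn as in the following sketch. The population of 100 students, the class-year strata, and the choice of 5 students per stratum are assumptions made for illustration.

```python
import random

rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Hypothetical population: (student_id, class_year) pairs.
population = [(i, "freshman" if i < 60 else "sophomore") for i in range(100)]

# Step 1: divide the population into strata by the shared characteristic.
strata = {}
for unit in population:
    strata.setdefault(unit[1], []).append(unit)

# Step 2: select a sample within each stratum, then combine them.
per_stratum = 5  # assumed number of units taken from each stratum
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, per_stratum))

print(stratified_sample)
```

This sample always contains exactly 5 freshmen and 5 sophomores. A group of 10 freshmen can never be chosen, so not every group of size 10 is equally likely, which is why the result is not an SRS.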
Another kind of sample is called a cluster sample. As with a stratified sample, the population is first divided into groups, called clusters. A sample of clusters is then chosen, and every item within each of the chosen clusters is used as part of the larger sample. Here again, a cluster sample may be random, but it will never be an SRS.
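The contrast with stratified sampling is easier to see in code. In this sketch the six rooms of four students each, and the choice of 2 clusters, are hypothetical.

```python
import random

rng = random.Random(2)  # fixed seed (an assumption) for repeatability

# Hypothetical population already divided into clusters (rooms).
clusters = {
    f"room_{r}": [f"room_{r}_student_{s}" for s in range(4)]
    for r in range(6)
}

# Step 1: choose a random sample of clusters, not of individuals.
chosen_rooms = rng.sample(sorted(clusters), k=2)

# Step 2: include every item from each chosen cluster.
cluster_sample = [student for room in chosen_rooms for student in clusters[room]]

print(chosen_rooms)
print(cluster_sample)
```

In a stratified sample every group contributes some items; here only the chosen clusters contribute, and they contribute all of their items.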
A systematic random sample consists of choosing a random starting point within a population and then selecting every item at a fixed periodic interval. For example, perhaps every 10th item in a list is chosen. Again, this kind of sample is not an SRS.
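A short sketch of the "every 10th item" example follows. The list of 100 items and the fixed seed are assumptions.

```python
import random

rng = random.Random(3)  # fixed seed (an assumption) for repeatability

population = list(range(1, 101))  # hypothetical ordered list of 100 items
k = 10                            # interval: take every 10th item

# Random starting point among the first k positions of the list.
start = rng.randrange(k)

# From the starting point, step through the list k items at a time.
systematic_sample = population[start::k]

print(systematic_sample)
```

Only 10 different samples are possible here, one for each starting point, so most groups of 10 items can never be selected. That is why this is not an SRS.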
Each of these sampling methods has pros and cons that depend on the population from which they are drawn, as well as the kind of study being done.
Problems with Sampling
There are many potential problems with sampling that can lead to unreliable statistical conclusions. Bias occurs when certain values or responses are more likely to be obtained than others. Examples of bias include:
- Voluntary response bias, which occurs when a sample consists of people who choose to participate
- Undercoverage bias, which happens when some segment of the population has a smaller chance of being included in a sample
- Nonresponse bias, which happens when data cannot be obtained from some part of the chosen sample
- Question wording bias, which is the result of confusing or leading questions
A random sample, and specifically a simple random sample, is an important tool in helping to avoid bias, though it certainly does not guarantee that bias will not occur.
Experimental Design
A well-designed experiment is the only kind of statistical study that can lead to a claim of a causal relationship. A sample is broken into two or more groups, and each group is assigned a treatment. The results of the data collection that follows show the effect that the treatment had on the subjects.
In an experiment, the experimental units are the individuals that are assigned one of the treatments being investigated; these may or may not be people. When they are people, they are also called participants or subjects. The explanatory variable in an experiment is whatever variable is being manipulated by the experimenter, and the different values that it takes on are called treatments. The response variable is the outcome that is measured to determine what effects, if any, the treatments had. A potential problem in any experiment is the existence of confounding variables.
A confounding variable has an effect on the response variable, and may create the impression of a relationship between the explanatory and response variable even where none exists. When possible, confounding variables should be controlled for by careful design of treatments and data collection. Even when they cannot be entirely controlled for, they should be acknowledged as potentially having an effect on the results of the experiment.
A well-designed experiment should always consist of at least two treatment groups, so that the treatment under investigation can be compared to something else. Often, it is compared to a control group, whose sole purpose is to provide comparison data. The control group receives either no treatment or treatment with an inactive substance called a placebo. It is important to realize, however, that there is a well-documented phenomenon called the placebo effect, in which subjects respond to treatment with a placebo even though it is inactive.
Blinding is a precaution taken to ensure that the subjects and/or the researcher do not know which treatment is being given to a particular individual. In a single-blind experiment, either the subject or the researcher does have this information, but the other does not. In a double-blind experiment, neither party has this information.
The experimental units should always be randomly assigned to the different treatment groups; if they are not, bias of the sort discussed in the previous section is likely to be an issue. In a completely randomized design, experimental units are assigned to treatment groups completely at random. This is usually done using random number generators, or some other technique for generating random choices. This design is most useful for controlling confounding variables.
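One way to carry out a completely randomized assignment with a random number generator is sketched below. The 20 subjects, the two group names, and the fixed seed are assumptions for illustration.

```python
import random

rng = random.Random(4)  # fixed seed (an assumption) for repeatability

# Hypothetical experimental units.
units = [f"subject_{i}" for i in range(1, 21)]

# Shuffle a copy of the list, then split it in half. Every unit is
# equally likely to land in either group.
shuffled = units[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2

groups = {
    "treatment": shuffled[:half],
    "control": shuffled[half:],
}

for name, members in groups.items():
    print(name, members)
```

Shuffling and then splitting guarantees equal group sizes. Flipping a coin for each unit would also be random, but the groups could end up unequal in size.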
In a randomized block design, the experimental units are first grouped, or blocked, based on a blocking variable. The members of each block are then randomly assigned to treatment groups. This means that all the values of the blocking variable are represented in each treatment group, which helps ensure that it does not act as a confounding variable in the experiment. A matched pairs design is a particular kind of block design in which the experimental units are first arranged into pairs based on factors relevant to the experiment. Each pair is then randomly split into the two treatment groups.
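A randomized block design can be sketched by repeating the random assignment separately inside each block. The subjects, the age-group blocking variable, and the group names below are all hypothetical.

```python
import random

rng = random.Random(5)  # fixed seed (an assumption) for repeatability

# Hypothetical units, each tagged with a value of the blocking variable.
units = [
    (f"subject_{i}", "under_40" if i % 2 == 0 else "40_plus")
    for i in range(20)
]
block_of = dict(units)  # lookup: subject name -> block

# Step 1: group the units into blocks by the blocking variable.
blocks = {}
for name, age_group in units:
    blocks.setdefault(age_group, []).append(name)

# Step 2: randomly assign the members of each block to the treatments.
assignment = {"treatment": [], "control": []}
for members in blocks.values():
    rng.shuffle(members)
    half = len(members) // 2
    assignment["treatment"] += members[:half]
    assignment["control"] += members[half:]

for group, members in assignment.items():
    counts = {b: sum(block_of[m] == b for m in members) for b in blocks}
    print(group, counts)
```

Each treatment group ends up with the same number of units from each block, so the blocking variable is balanced across treatments by construction. A matched pairs design corresponds to the case where every block contains exactly two units.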
Free Response Tip
When a free response question asks you to describe an experimental design, be sure to explain why you are making the choices you are. For example, if you are blocking the experimental units, explicitly state why the variable you are blocking on might be confounding.
Suggested Reading
- Starnes & Tabor. The Practice of Statistics. 6th edition. Chapter 4. New York, NY: Macmillan.
- Larson & Farber. Elementary Statistics: Picturing the World. 7th edition. Chapter 1. New York, NY: Pearson.
- Bock, Velleman, De Veaux, & Bullard. Stats: Modeling the World. 5th edition. Chapters 10-12. New York, NY: Pearson.
- Sullivan. Statistics: Informed Decisions Using Data. 5th edition. Chapter 1. New York, NY: Pearson.
- Peck, Short, & Olsen. Introduction to Statistics and Data Analysis. 6th edition. Chapter 2. Boston, MA: Cengage Learning.
Sample Collecting Data Questions
A four-year liberal arts college is deciding whether or not to begin a new graduate degree program. They wish to assess the opinion of alumni of the college. The Alumni Affairs Department decides to mail a questionnaire to a random sample of 3500 alumni from the past 30 years. Of the 3500 mailed, 679 were returned, and of these, 218 supported the launching of a new graduate degree program.
Which of the following statements is true?
A. The population of this study consists of the 218 respondents who favor the graduate degree program.
B. The 3500 alumni who were randomly mailed a questionnaire is a representative sample of all alumni of the college for the past 30 years.
C. The population of this study consists of the 679 alumni who mailed back a response.
D. The 3500 alumni receiving the questionnaire constitute the population of the study.
E. Current students are part of the population of this study.
Explanation:
The correct answer is B. The 3500 alumni were selected at random from the alumni of the past 30 years, and this is a very large sample for a small liberal arts college, so it is representative of that population. Choice A is incorrect because 218 is simply the number of respondents in the sample who had this opinion. Choice C is incorrect because 679 is simply the number of alumni in the sample who responded to the questionnaire. Choice D is incorrect because 3500 is simply the size of the sample. In each of these cases, the population is the broader group about which we are trying to infer an opinion on the matter. Choice E is incorrect because only the opinion of the alumni was of interest in this study.
Suppose a simple random sample of size 50 is selected from a population. Which of the following is true of such a sample?
I. It is selected so that every set of 50 subjects in the population has an equal chance of being the sample chosen.
II. It is drawn in such a manner so that every subject has the same chance of being selected.
III. Some members of the population have no chance of being selected, but those that can be selected have the same chance of being selected.
A. I only
B. II only
C. III only
D. I and II only
E. II and III only
Explanation:
The correct answer is D. Statement I is the definition of a simple random sample: every set of 50 subjects in the population has an equal chance of being the sample chosen. So, I must be true. Statement II follows from I: if every possible set of 50 is equally likely, then every subject has the same chance of being selected. So, II is true. Statement III is false; ALL members of the population must have the same chance of being selected in order for the sample to be random.
A college admissions officer wishes to compare the SAT scores for the incoming freshmen class to the current sophomore class. Which of the following is the most appropriate technique for gathering the data needed to make this comparison?
A. observational study
B. experiment
C. census
D. sample survey
E. a double-blind experiment
Explanation:
The correct answer is C. Making this comparison requires that you collect data for all members satisfying a certain characteristic (here, being an incoming freshman or a current sophomore). This is precisely what is done in a census. Choice A is incorrect because you are not trying to make inferences about the effect of a treatment on a group of subjects. Choices B and E are incorrect because you are not conducting an experiment: no treatments are being assigned. Choice D is incorrect because there is no reason to take only a sample. All of the data is available for this set of people, so a census is more appropriate when trying to make the described comparison.