AP Statistics - Unit 2: Exploring Two-Variable Data Summary Notes


Exploring Two-Variable Data
On your AP exam, 5‒7% of questions will fall under the topic of Exploring Two-Variable Data.

Two Categorical Variables
When a data set involves two categorical variables, a contingency table can show how the data points are distributed across the categories. For example, suppose 600 high school students were asked whether or not they enjoy school. The students could be separated by grade level and by their answer to the question. The data might be organized as follows:

Totals can be calculated for the rows and columns, along with a grand total for the entire table. The entries can be given as relative frequencies by representing the value in each cell as a percentage of either the row or column total. For example, the preceding data is shown below as relative frequencies based on the column totals:

Note that since the percentages are relative to the column totals, each column now has a total of $100 \%$. The row totals are shown as a percentage of the table total and are referred to as a marginal distribution. If the entries are instead given as relative frequencies by dividing by the total for the entire table, rather than by the row or column totals, they are referred to as joint relative frequencies.
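The three kinds of relative frequency can be sketched in a few lines of Python. The counts below are hypothetical (the original table is not reproduced here) and were chosen only so that they sum to 600; the variable names are illustrative as well.

```python
# Hypothetical counts for 600 students (not the original table).
counts = {
    "Yes": {"Grade 9": 96, "Grade 10": 72, "Grade 11": 60, "Grade 12": 102},
    "No":  {"Grade 9": 64, "Grade 10": 78, "Grade 11": 90, "Grade 12": 38},
}
grades = ["Grade 9", "Grade 10", "Grade 11", "Grade 12"]
answers = list(counts)

col_totals = {g: sum(counts[a][g] for a in answers) for g in grades}
row_totals = {a: sum(counts[a].values()) for a in answers}
grand_total = sum(row_totals.values())

# Each cell as a share of its column total (every column sums to 100%).
col_relative = {a: {g: counts[a][g] / col_totals[g] for g in grades} for a in answers}

# Marginal distribution of the answer: row totals as a share of the grand total.
marginal_answer = {a: row_totals[a] / grand_total for a in answers}

# Joint relative frequencies: each cell as a share of the grand total.
joint = {a: {g: counts[a][g] / grand_total for g in grades} for a in answers}

for a in answers:
    cells = [f"{col_relative[a][g]:.1%}" for g in grades]
    print(a, cells, f"marginal: {marginal_answer[a]:.1%}")
```

With these made-up counts, the column percentages in each grade add to 100%, while the joint relative frequencies add to 100% across the whole table.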

Two Quantitative Variables
When data consist of two quantitative variables, they can be represented as a scatterplot, which shows the relationship between the two variables. The variables are assigned to the $x$- and $y$-axes, and each observation is then represented by a point on the $x y$-plane. The variable that is chosen for the $x$-axis is often referred to as the explanatory variable, while the variable represented on the $y$-axis is the response variable.

A scatterplot shows what kind of association, if any, exists between the two variables. The direction of the association can be described as positive or negative; positive means that as one variable increases, the other increases as well, while negative means that as one variable increases, the other decreases.

The form of an association describes the shape that the points make. In particular, we are generally most interested in whether or not the association is linear. When it is non-linear, it may also be described as having another form, such as exponential or quadratic.

The strength of an association is determined by how closely the points in the scatterplot follow a pattern (whether the pattern is linear or not). In the previous two examples, the nonlinear plot shows a much stronger association than the linear plot, since the points more closely follow a particular curve.

Finally, a scatterplot might have some unusual features. Just as with data involving a single variable, these features include clusters and outliers.

Correlation
The correlation between two variables is a single number, $r$, that quantifies the direction and strength of a linear association:
$
r=\frac{1}{n-1} \sum\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
$

In this formula, $s_x$ and $s_y$ denote the sample standard deviations of the $x$ and $y$ variables, respectively. Although it is possible to calculate $r$ by hand, doing so is impractical for all but the smallest data sets.
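As a minimal sketch, the formula can be translated directly into Python using the standard library's `statistics` module. The function name `correlation` and the data values are illustrative assumptions, not part of the original notes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of standardized products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical data for illustration.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = correlation(hours, scores)
print(round(r, 4))
```

For this small made-up data set, $r$ comes out to roughly 0.77, a moderately strong positive linear association.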

The correlation is always between -1 and 1. The sign of $r$ indicates the direction of the association, and the absolute value is a measure of its strength: values close to 0 indicate a weak association, and the strength increases as the values move toward -1 or 1. If $r$ is 0, there is absolutely no linear relationship between the variables, whereas an $r$ of -1 or 1 indicates a perfect linear relationship.

It is important to note that a value close to -1 or 1 does not, by itself, imply that a linear model is appropriate for the data set. On the other hand, a value close to 0 does indicate that a linear model is probably not appropriate.

Regression and Residuals
A linear regression model is a linear equation that relates the explanatory and response variables of a data set. The model is given by $\hat{y}=a+b x$, where $a$ is the $y$-intercept, $b$ is the slope, $x$ is the value of the explanatory variable, and $\hat{y}$ is the predicted value of the response variable.

The purpose of the linear regression model is to predict a $y$ given an $x$ that does not appear within the data set used to construct the model. If the $x$ used is outside of the range of $x$-values of the original data set, using the model for prediction is called extrapolation. This tends to yield less reliable predictions than interpolation, which is the process of predicting $y$ values for $x$-values that are within the range of the original data set.
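The distinction can be made concrete with a small helper that labels each prediction. This is only a sketch: the function `predict`, the model $\hat{y}=2.2+0.6x$, and the observed $x$-range of 1 to 5 are all hypothetical.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y_hat = a + b*x and report whether x lies inside the observed x-range."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Hypothetical model y_hat = 2.2 + 0.6x, fitted to x-values between 1 and 5.
for x in (3.5, 12):
    y_hat, kind = predict(2.2, 0.6, x, x_min=1, x_max=5)
    print(f"x = {x}: predicted y = {y_hat:.1f} ({kind})")
```

The model produces a number in both cases; the label is a reminder that the second prediction relies on the linear pattern continuing well beyond the data that were actually observed.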

Since regression models are rarely perfect, we need methods to analyze the prediction errors that occur. The difference between an actual $y$ and the predicted $y$, written $y-\hat{y}$, is called a residual. When the residuals for every data point are calculated and plotted versus the explanatory variable, $x$, the resulting scatterplot is called a residual plot.

A residual plot gives useful information about the appropriateness of a linear model. In particular, any obvious pattern or trend in the residuals indicates that a linear model is probably inappropriate. When a linear model is appropriate, the points on the residual plot should appear random.
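The following sketch shows what a patterned set of residuals looks like numerically. The data are hypothetical points that lie on the curve $y=x^2$, and the line $\hat{y}=-2+4x$ is the least-squares line for those five points; the correlation is about 0.96, yet the residuals are clearly not random.

```python
# Hypothetical data that follow a curve (y = x^2), not a line.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

# Least-squares line for these five points: y_hat = -2 + 4x.
a, b = -2, 4

# Residual = actual y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(f"x = {x}: residual = {res:+d}")
```

The residuals run positive, negative, negative, negative, positive. Plotted against $x$ they trace a U shape, which is the kind of obvious pattern that signals a linear model is inappropriate.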

The most common method for creating a linear regression model is called least-squares regression. The least squares model is defined by two features: it minimizes the sum of the squares of the residuals, and it passes through the point $(\bar{x}, \bar{y})$.

The slope $b$ of the least-squares regression line is given by the formula $b=r \cdot \frac{s_y}{s_x}$. The slope of the line is best interpreted as the predicted change in $y$ for every one-unit increase in $x$.

Once the slope is known, the $y$-intercept, $a$, can be determined by ensuring that the line contains the point $(\bar{x}, \bar{y}): a=\bar{y}-b \bar{x}$.
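A short sketch of this two-step calculation, using $b=r \cdot s_y / s_x$ for the slope and $a=\bar{y}-b \bar{x}$ for the intercept. The data values are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean, stdev

# Hypothetical data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations

# Correlation from the standardized products.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # intercept, so the line passes through (x_bar, y_bar)

print(f"y_hat = {a:.2f} + {b:.2f}x")
```

For these values the sketch gives a slope of 0.6 and an intercept of 2.2, and substituting $\bar{x}=3$ into the line returns $\bar{y}=4$, as required.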

The $y$-intercept represents the predicted value of $y$ when $x$ is 0. Depending on the type of data under consideration, however, this may or may not have a reasonable interpretation. It always helps to define the line, but it does not necessarily have contextual significance.

The square of the correlation $r$, or $r^2$, is also called the coefficient of determination. Its interpretation is subtle, but it is usually explained as the proportion of the variation in $y$ that is explained by its linear relationship to $x$, as given by the model.
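One way to see the "proportion of variation explained" reading is to compare $r^2$ with $1-\frac{\text{SSE}}{\text{SST}}$, where SSE is the sum of squared residuals and SST is the total squared variation of $y$ about $\bar{y}$. The sketch below does this for a hypothetical data set; the names `sse` and `sst` are illustrative.

```python
# Hypothetical data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares slope and intercept (equivalent to b = r * s_y / s_x).
b = s_xy / s_xx
a = y_bar - b * x_bar

r = s_xy / (s_xx * s_yy) ** 0.5
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
sst = s_yy                                                 # total variation in y

print(round(r ** 2, 4), round(1 - sse / sst, 4))
```

Both quantities agree (0.6 here), so about 60% of the variation in these made-up $y$-values is accounted for by the least-squares line.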
There are three ways to classify unusual points in the context of linear regression:

A point that has a particularly large residual is called an outlier.
A point whose $x$-value is much larger or much smaller than those of the other points is called a high-leverage point.
An influential point is any point that, if removed, would cause a significant change in the regression model.

Outliers and high-leverage points are usually also influential.
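Influence can be checked directly by fitting the line with and without the point in question. The sketch below assumes a hypothetical data set and a single added point with an unusually large $x$-value; `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(points):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

base = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
unusual = (20, 2)  # x-value far from the rest: high leverage

a_without, b_without = fit_line(base)
a_with, b_with = fit_line(base + [unusual])

print(f"without the point: y_hat = {a_without:.2f} + {b_without:.2f}x")
print(f"with the point:    y_hat = {a_with:.2f} + {b_with:.2f}x")
```

In this made-up example the slope changes from positive to slightly negative when the single point is added, so the point is influential.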
There are situations in which transforming one of the variables results in a linear model of increased strength compared to the original data. For example, consider the following scatterplot, associated least-squares line, and residual plot:

Although the coefficient of determination is high, the residual plot shows a clear lack of randomness. This indicates that a linear model is not appropriate, despite the relatively high correlation. Here are the results of performing the same analysis on the data after taking the logarithm of all the $y$-values:

Not only is the correlation even higher now, but the residual plot also does not show any obvious patterns. This means that the data were successfully transformed for the purposes of fitting a linear model.

There are many other transformations that can be tried, including squaring or taking the square root of one of the variables.
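The effect of a logarithmic transformation can be sketched with hypothetical data that grow exactly exponentially. Real data would be noisier, so the correlation after transforming would be high rather than exactly 1; the `correlation` helper is illustrative.

```python
import math

def correlation(xs, ys):
    """Correlation r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical exponential data: y = 2 * 3^x.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 3 ** x for x in xs]
log_ys = [math.log10(y) for y in ys]  # transform the response variable

r_raw = correlation(xs, ys)
r_log = correlation(xs, log_ys)
print(f"r before transforming: {r_raw:.3f}")
print(f"r after taking log(y): {r_log:.3f}")
```

Because $\log y$ is a linear function of $x$ whenever $y$ grows exponentially, the transformed points fall on a straight line and the correlation increases.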

Free Response Tip
If a free response question asks you to justify the use of a linear model for relating two variables, you can mention a correlation near -1 or 1. However, that is not a full justification on its own. You must also analyze the residuals as described in this section.

Suggested Reading

Starnes $\&$ Tabor. The Practice of Statistics. $6^{\text {th }}$ edition. Chapter 3. New York, NY: Macmillan.
Triola. Essentials of Statistics. $6^{\text {th }}$ edition. Chapter 9.
Bock, Velleman, De Veaux, $\&$ Bullard. Stats: Modeling the World. $5^{\text {th }}$ edition. Chapters 6-9. New York, NY: Pearson.
Sullivan. Statistics: Informed Decisions Using Data. $5^{\text {th }}$ edition. Chapter 4. New York, NY: Pearson.
Peck, Short, $\&$ Olsen. Introduction to Statistics and Data Analysis. $6^{\text {th }}$ edition. Chapter 5. Boston, MA: Cengage Learning.

Sample Exploring Two-Variable Data Questions
For new trees of a certain variety between the ages of 6 months and 30 months, there is approximately a linear relationship between height and age. This relationship can be described by $y=15.4035+0.35 x$, where $y$ represents the height (in inches) and $x$ represents the age (in months). The tree you planted in the front yard is 16.4 months old and is 23 inches tall. What is its residual according to this model?
A. 5.7400
B. 44.1435
C. -1.8565
D. 1.8565
E. 21.1435

Explanation:
The correct answer is $\mathbf{D}$. The residual is the actual value minus the predicted value given by the linear model at the age of 16.4 months. This yields:
$
23-(15.4035+0.35(16.4))=23-21.1435=1.8565
$

Choice $A$ is only the slope term, $0.35(16.4)$, which is the predicted growth of the tree over 16.4 months. Choice $B$ is incorrect because you added the actual height and the predicted height at an age of 16.4 months instead of subtracting them. Choice $\mathrm{C}$ is incorrect because this is the negative of the correct value, so you subtracted in the wrong order. Choice $E$ is the predicted height for the age of 16.4 months provided by the linear model. You must subtract this from the actual height of the tree to get the residual.

The effects of a nutritional supplement on hamsters were examined by feeding hamsters various concentrations of the supplement in their daily water supply (measured in $\mathrm{mg}$ per liter). The time (in days) until the hamsters exhibited an increase in activity was recorded. A total of 21 different experiments were performed. A preliminary plot of the data showed that the relationship of time versus concentration was approximately linear. The output appears below:

Which of the following is the best fit regression line?
A. $y=0.36+3.415 x$
B. $y=3.415+3.6 x$
C. $y=3.415+0.36 x$
D. $y=4.932+0.84 x$
E. $y=0.36 x$

Explanation:
The correct answer is $\mathbf{C}$. This choice is the result of correctly extracting the slope and intercept from the table and inserting them in the model $y=b_0+b_1 x$. Choice $A$ is the result of switching the slope and intercept. Choice B is incorrect because the slope is off by a factor of 10. Choice D is incorrect because you used the test statistics instead of the actual estimates of the slope and intercept provided. Choice E is incorrect because you neglected to include the intercept.

Consider the following three scatterplots:

Which of the following statements, if any, are true?
I. The intercept for the line of best fit for the data in scatterplot A will be positive.
II. The slope for the line of best fit for the data in scatterplot $B$ will be negative.
III. There is no discernible relationship between the variables $x$ and $y$ in scatterplot $C$.
A. I only
B. II only
C. III only
D. II and III only
E. I and II only

Explanation:
The correct answer is $\mathrm{E}$. Statement I is true because the best fit line is a horizontal line above the $x$-axis, so it crosses the $y$-axis at a positive value. Statement II is true because the best fit line has the same slope as the parallel lines along which the data in the scatterplot fall. Since these lines fall from left to right, the slope is negative. Statement III is false because there is a discernible relationship between $x$ and $y$ in scatterplot C; it is simply nonlinear.
