IB DP Math AA Topic: SL 4.4 Linear correlation of bivariate data. Pearson’s product HL Paper 2

Question

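A class is given two tests, Test A and Test B. Each test is scored out of a total of 100 marks. The scores of the students are shown in the following table.

$$\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|}
\hline
\text{Student} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\
\hline
\text{Test A} & 52 & 71 & 100 & 93 & 81 & 80 & 88 & 100 & 70 & 61 \\
\hline
\text{Test B} & 58 & 80 & 92 & 98 & 90 & 82 & 100 & 100 & 65 & 74 \\
\hline
\end{array}$$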

Let x be the score on Test A and y be the score on Test B.

The teacher finds that the equation of the regression line of y on x for these scores is $y = 0.822x + 18.4$.

(a) Find the value of Pearson’s product-moment correlation coefficient, r.

Giovanni was absent for Test A and Paulo was absent for Test B.

The teacher uses the regression line of y on x to estimate the missing scores.

Paulo scored 10 on Test A.

The teacher estimated his score on Test B to be 27 to the nearest integer using the following calculation:

$$y = 0.822(10) + 18.4 \approx 27$$

(b) Give a reason why this method is not appropriate for Paulo.

Giovanni scored 90 on Test B.

The teacher estimated his score on Test A to be 87 to the nearest integer using the following calculation:

$$90 = 0.822x + 18.4, \text{ so } x = \frac{90 - 18.4}{0.822} \approx 87$$

(c) (i) Give a reason why this method is not appropriate for Giovanni.

(ii) Use an appropriate method to show that the estimated Test A score for Giovanni is 86 to the nearest integer.

Answer/Explanation

Detailed solution

 (a) Find the value of Pearson’s product-moment correlation coefficient, \( r \).

Step 1: Identify the data for students who took both tests.
 Students 1 through 10 sat both Test A and Test B, so all 10 pairs are used.
 Giovanni and Paulo each missed one test, so they contribute no pair to the calculation.
 \( (x, y) \): (52, 58), (71, 80), (100, 92), (93, 98), (81, 90), (80, 82), (88, 100), (100, 100), (70, 65), (61, 74)

Step 2: Compute the means \( \bar{x} \) and \( \bar{y} \).
 \( x \): 52, 71, 100, 93, 81, 80, 88, 100, 70, 61
 \( \sum x = 52 + 71 + 100 + 93 + 81 + 80 + 88 + 100 + 70 + 61 = 796 \)
 \( \bar{x} = \frac{796}{10} = 79.6 \)
 \( y \): 58, 80, 92, 98, 90, 82, 100, 100, 65, 74
 \( \sum y = 58 + 80 + 92 + 98 + 90 + 82 + 100 + 100 + 65 + 74 = 839 \)
 \( \bar{y} = \frac{839}{10} = 83.9 \)

Step 3: Use the regression line to find \( r \).
 The regression line of \( y \) on \( x \) is \( y = 0.822x + 18.4 \).
 The slope of the regression line \( b_{y|x} = 0.822 \).
 The formula for the slope of the regression line is:
\[
b_{y|x} = r \cdot \frac{s_y}{s_x}
\]
where \( s_x \) and \( s_y \) are the standard deviations of \( x \) and \( y \), and \( r \) is the correlation coefficient.
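
For readers who want to see where this slope formula comes from, a short derivation is sketched here. The symbols \( S_{xx} \), \( S_{yy} \) and \( S_{xy} \) are notation introduced for this sketch and do not appear in the question.
\[
S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})
\]
\[
b_{y|x} = \frac{S_{xy}}{S_{xx}}, \qquad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
\]
\[
b_{y|x} = r \cdot \sqrt{\frac{S_{yy}}{S_{xx}}} = r \cdot \frac{\sqrt{S_{yy}/n}}{\sqrt{S_{xx}/n}} = r \cdot \frac{s_y}{s_x}
\]
The factor \( n \) cancels, so the result is the same whether the standard deviations are computed with \( n \) or \( n - 1 \), provided both use the same convention.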

Step 4: Compute the standard deviations \( s_x \) and \( s_y \).
 \( s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \)
 \( x_i - \bar{x} \):
 52 - 79.6 = -27.6
 71 - 79.6 = -8.6
 100 - 79.6 = 20.4
 93 - 79.6 = 13.4
 81 - 79.6 = 1.4
 80 - 79.6 = 0.4
 88 - 79.6 = 8.4
 100 - 79.6 = 20.4
 70 - 79.6 = -9.6
 61 - 79.6 = -18.6
 \( (x_i - \bar{x})^2 \):
 \((-27.6)^2 = 761.76\)
 \((-8.6)^2 = 73.96\)
 \(20.4^2 = 416.16\)
 \(13.4^2 = 179.56\)
 \(1.4^2 = 1.96\)
 \(0.4^2 = 0.16\)
 \(8.4^2 = 70.56\)
 \(20.4^2 = 416.16\)
 \((-9.6)^2 = 92.16\)
 \((-18.6)^2 = 345.96\)
 \( \sum (x_i - \bar{x})^2 = 761.76 + 73.96 + 416.16 + 179.56 + 1.96 + 0.16 + 70.56 + 416.16 + 92.16 + 345.96 = 2358.4 \)
 \( s_x = \sqrt{\frac{2358.4}{10}} = \sqrt{235.84} \approx 15.36 \)
 \( s_y \):
 \( y_i - \bar{y} \):
 58 - 83.9 = -25.9
 80 - 83.9 = -3.9
 92 - 83.9 = 8.1
 98 - 83.9 = 14.1
 90 - 83.9 = 6.1
 82 - 83.9 = -1.9
 100 - 83.9 = 16.1
 100 - 83.9 = 16.1
 65 - 83.9 = -18.9
 74 - 83.9 = -9.9
 \( (y_i - \bar{y})^2 \):
 \((-25.9)^2 = 670.81\)
 \((-3.9)^2 = 15.21\)
 \(8.1^2 = 65.61\)
 \(14.1^2 = 198.81\)
 \(6.1^2 = 37.21\)
 \((-1.9)^2 = 3.61\)
 \(16.1^2 = 259.21\)
 \(16.1^2 = 259.21\)
 \((-18.9)^2 = 357.21\)
 \((-9.9)^2 = 98.01\)
 \( \sum (y_i - \bar{y})^2 = 670.81 + 15.21 + 65.61 + 198.81 + 37.21 + 3.61 + 259.21 + 259.21 + 357.21 + 98.01 = 1964.9 \)
 \( s_y = \sqrt{\frac{1964.9}{10}} = \sqrt{196.49} \approx 14.02 \)
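
The hand calculation in Step 4 can be cross-checked with a few lines of Python. This is only a sketch: it assumes the ten score pairs listed in Step 1 and uses `statistics.pstdev`, which divides by \( n \) as the formula above does.

```python
from statistics import pstdev

# Scores of the 10 students who sat both tests (copied from Step 1).
x = [52, 71, 100, 93, 81, 80, 88, 100, 70, 61]   # Test A
y = [58, 80, 92, 98, 90, 82, 100, 100, 65, 74]   # Test B

# pstdev is the population standard deviation (divides by n, not n - 1),
# which matches the formula used in Step 4.
s_x = pstdev(x)
s_y = pstdev(y)

print(round(s_x, 2), round(s_y, 2))
```

The printed values should agree with \( s_x \approx 15.36 \) and \( s_y \approx 14.02 \) found above.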

Step 5: Solve for \( r \).

\( b_{y|x} = 0.822 = r \cdot \frac{s_y}{s_x} = r \cdot \frac{14.02}{15.36} \)
\( \frac{14.02}{15.36} \approx 0.9128 \)
\( 0.822 = r \cdot 0.9128 \)
\( r = \frac{0.822}{0.9128} \approx 0.901 \)

Answer (a): \( r \approx 0.901 \) (to 3 significant figures).
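
In an exam this value would normally be read straight from a calculator. As a rough cross-check, the sketch below computes \( r \) directly from the ten score pairs rather than from the rounded slope 0.822. The helper name `pearson_r` is chosen here for illustration only.

```python
from math import sqrt

# Scores of the 10 students who sat both tests (copied from Step 1).
x = [52, 71, 100, 93, 81, 80, 88, 100, 70, 61]   # Test A
y = [58, 80, 92, 98, 90, 82, 100, 100, 65, 74]   # Test B


def pearson_r(xs, ys):
    """Pearson's r from the sums of squared deviations and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(x, y)
print(round(r, 3))
```

Working from the raw data avoids the small rounding error that comes from dividing the 3 s.f. slope by rounded standard deviations.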

 (b) Paulo’s Test A score is \( x = 10 \).
– The regression line \( y = 0.822x + 18.4 \) is used to estimate his Test B score:
\[
y = 0.822(10) + 18.4 = 8.22 + 18.4 = 26.62 \approx 27
\]

– The regression line is based on the data of students who scored between 52 and 100 on Test A (from the table). Paulo’s score of 10 is far outside this range.
– Using a regression line to predict values outside the range of the data (extrapolation) can lead to unreliable results because the relationship between \( x \) and \( y \) may not be linear or consistent outside the observed data range.
– Here, a score of 10 on Test A is much lower than the lowest observed score (52), so the estimated Test B score of 27 may not accurately reflect Paulo’s performance.

Answer (b): The method is not appropriate because Paulo’s Test A score of 10 is well outside the range of the data used to create the regression line (52 to 100), making extrapolation unreliable.
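
The range check behind this answer can be written as a small sketch. The function name `is_extrapolation` is hypothetical, and the data are the Test A scores listed in part (a).

```python
# Test A scores of the 10 students used to fit the regression line.
x = [52, 71, 100, 93, 81, 80, 88, 100, 70, 61]


def is_extrapolation(value, observed):
    """Return True if value lies outside the range of the observed data."""
    return value < min(observed) or value > max(observed)


print(min(x), max(x))            # range of Test A scores used for the line
print(is_extrapolation(10, x))   # Paulo's Test A score
```

Paulo's score of 10 falls below the smallest observed Test A score, so the check flags his prediction as an extrapolation.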

(c) (i) Giovanni’s Test B score is \( y = 90 \).
 The teacher uses the regression line \( y = 0.822x + 18.4 \) to estimate his Test A score:
\[
90 = 0.822x + 18.4 \implies 90 - 18.4 = 0.822x \implies 71.6 = 0.822x \implies x = \frac{71.6}{0.822} \approx 87
\]

The given regression line is for \( y \) on \( x \): it is fitted to minimise the squared errors in \( y \) (Test B), so it is designed to predict \( y \) from \( x \) (Test A). Rearranging it to predict \( x \) from \( y \), as done here, does not give the line that minimises the errors in \( x \).
To predict \( x \) from \( y \), the regression line of \( x \) on \( y \) should be used, not the regression line of \( y \) on \( x \).
The two lines coincide only when the correlation is perfect (\( r = \pm 1 \)); here \( r \approx 0.901 \), so they give different estimates.

Answer (c)(i): The method is not appropriate because the regression line \( y = 0.822x + 18.4 \) predicts Test B scores (\( y \)) from Test A scores (\( x \)), not the other way around. To estimate Giovanni’s Test A score from his Test B score, the regression line of \( x \) on \( y \) should be used.
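
A short calculation shows how far apart the two slopes are. The symbols \( S_{xx} \), \( S_{yy} \) and \( S_{xy} \) (sums of squared deviations and of cross-products) are notation assumed for this sketch, not taken from the question.
\[
b_{y|x} = \frac{S_{xy}}{S_{xx}}, \qquad b_{x|y} = \frac{S_{xy}}{S_{yy}}, \qquad b_{y|x} \cdot b_{x|y} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2
\]
\[
\frac{1}{b_{y|x}} = \frac{b_{x|y}}{r^2}
\]
Rearranging the line of \( y \) on \( x \) therefore uses the slope \( \frac{1}{0.822} \approx 1.22 \), whereas the line of \( x \) on \( y \) has a slope of about 0.987. The two agree only if \( r^2 = 1 \).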

 (c) (ii) Using an appropriate method to show that the estimated Test A score for Giovanni is 86 to the nearest integer.

Step 1: Compute the regression line of \( x \) on \( y \).
 We need the regression line of the form \( x = a + b y \), where \( x \) is predicted from \( y \).
 Slope \( b_{x|y} = r \cdot \frac{s_x}{s_y} \).
 From part (a): \( r \approx 0.901 \), \( s_x \approx 15.36 \), \( s_y \approx 14.02 \).
 \( \frac{s_x}{s_y} = \frac{15.36}{14.02} \approx 1.096 \)
 \( b_{x|y} = 0.901 \cdot 1.096 \approx 0.9875 \)
 The regression line is:
\[
x - \bar{x} = b_{x|y} (y - \bar{y})
\]
 \( \bar{x} = 79.6 \), \( \bar{y} = 83.9 \).
 \( x - 79.6 = 0.9875 (y - 83.9) \)
 \( x = 0.9875 y - 0.9875 \cdot 83.9 + 79.6 \)
 \( 0.9875 \cdot 83.9 \approx 82.85 \)
 \( x = 0.9875 y - 82.85 + 79.6 = 0.9875 y - 3.25 \)

Step 2: Estimate Giovanni’s Test A score.
– Giovanni’s Test B score \( y = 90 \):
\[
x = 0.9875 \cdot 90 - 3.25 = 88.875 - 3.25 = 85.625 \approx 86
\]

Using the regression line of \( x \) on \( y \), Giovanni’s estimated Test A score is 86 to the nearest integer.
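
The same estimate can be reproduced from the raw data, which avoids carrying rounded values of \( r \), \( s_x \) and \( s_y \) through the working. The sketch below assumes the ten score pairs from part (a); the variable names are illustrative.

```python
# Scores of the 10 students who sat both tests (copied from part (a)).
x = [52, 71, 100, 93, 81, 80, 88, 100, 70, 61]   # Test A
y = [58, 80, 92, 98, 90, 82, 100, 100, 65, 74]   # Test B

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Regression of x on y: slope = S_xy / S_yy, and the line passes through the means.
s_yy = sum((b - mean_y) ** 2 for b in y)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
slope = s_xy / s_yy
intercept = mean_x - slope * mean_y

# Giovanni scored 90 on Test B.
estimate = slope * 90 + intercept
print(round(slope, 3), round(intercept, 2), round(estimate))
```

The slope and intercept should agree with the markscheme's \( x = 0.987y - 3.22 \), and the estimate rounds to 86.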

Markscheme

Solution: –

(a) r = 0.901017… or 0.901

(b) Student 11 Test B: should not extrapolate

(c) (i) Student 12 Test A: should not use line of y on x to predict x from y (or equivalent)

(ii) attempt to find the equation of the regression line of x on y

$(x =)\ 0.987124\ldots y - 3.21970\ldots \quad ((x =)\ 0.987y - 3.22)$
$(x =)\ 0.987124\ldots(90) - 3.21970\ldots \ (= 85.6214\ldots)$
= 86 to nearest integer.