IBDP Maths AI: Topic: SL 4.4: Linear correlation of bivariate data: IB style Questions HL Paper 1

Question

After taking a mathematics test, Fatima wonders how many more marks she would have achieved if she had spent an extra 1.5 hours studying.
To find out, she randomly selects five students from her class who took the same test and asks them how many hours ($t$) they spent studying for the test and the marks ($m$) they achieved. Their responses are shown in the following table.

Hours studied, $t$: 0, 1.2, 1.6, 2.5, 4
Marks obtained, $m$: 45, 54, 61, 72, 86

(a) (i) Find the Pearson’s product moment correlation coefficient, $r$, for this data.

(ii) Find the least squares regression line of $m$ on $t$ for this data.

(b) According to her model, find how many more marks Fatima would have achieved if she spent an extra 1.5 hours studying.

(c) State one reason why the value obtained in part (b) might not be valid.

Answer/Explanation

Detailed Solution

Given Data:

Hours studied, $t$: 0, 1.2, 1.6, 2.5, 4

Marks obtained, $m$: 45, 54, 61, 72, 86

(a) (i) Find Pearson’s product-moment correlation coefficient $r$

The formula for Pearson’s correlation coefficient is:

$$r = \frac{n \sum t_i m_i - \sum t_i \sum m_i}{\sqrt{\left[n \sum t_i^2 - \left(\sum t_i\right)^2\right] \left[n \sum m_i^2 - \left(\sum m_i\right)^2\right]}}$$

where:

  • $t_i$ and $m_i$ are the given values,

  • $n = 5$ (number of data points).

Let’s calculate $r$.

Pearson’s product-moment correlation coefficient $r$ is approximately 0.995, indicating a very strong positive correlation between study time and marks obtained.
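The value of $r$ can be reproduced directly from the table. The following is a minimal sketch in plain Python; the function name `pearson_r` and the variable names are illustrative choices, not part of the original solution.

```python
import math

# Data from the table: hours studied (t) and marks obtained (m).
t = [0, 1.2, 1.6, 2.5, 4]
m = [45, 54, 61, 72, 86]


def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums in the formula above."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(t, m)
print(round(r, 3))  # 0.995
```

In an exam the same value would normally come straight from the calculator's two-variable statistics mode; the code only makes the intermediate sums explicit.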

(a) (ii) Find the least squares regression line of $m$ on $t$

The least squares regression line is given by:

$$m = a + bt$$

where:

  • $b = \frac{n \sum t_i m_i - \sum t_i \sum m_i}{n \sum t_i^2 - \left(\sum t_i\right)^2}$ (slope)

  • $a = \frac{\sum m_i}{n} - b \frac{\sum t_i}{n}$ (intercept)

Let’s calculate $a$ and $b$.

The least squares regression line of $m$ on $t$ is:

$$m = 43.88 + 10.60t$$
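For reference, the arithmetic behind these coefficients can be written out from the table sums $\sum t_i = 9.3$, $\sum m_i = 318$, $\sum t_i^2 = 26.25$ and $\sum t_i m_i = 686.4$. This is a worked sketch of the step the solution leaves to the calculator:

```latex
\begin{aligned}
b &= \frac{5(686.4) - (9.3)(318)}{5(26.25) - (9.3)^2}
   = \frac{3432 - 2957.4}{131.25 - 86.49}
   = \frac{474.6}{44.76} \approx 10.6032 \\[4pt]
a &= \frac{318}{5} - 10.6032 \times \frac{9.3}{5}
   = 63.6 - 19.7220 \approx 43.8780
\end{aligned}
```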

(b) Predicting Additional Marks for Fatima

If Fatima studied an extra 1.5 hours, the increase in marks can be estimated using the slope $b$ from our regression equation:

$$\text{Additional marks} = b \times 1.5 = 10.6032… \times 1.5 \approx 15.90$$

The intercept cancels when the two predictions are subtracted, so only the slope matters. If Fatima had studied an extra 1.5 hours, she would have achieved approximately 15.90 more marks.

(c) One reason the value 15.9 might not be valid is that the linear regression model assumes a constant increase in marks per hour of study. In reality, additional study time might yield diminishing returns due to factors like fatigue or saturation, and the model does not account for individual variability in Fatima’s study effectiveness.

Markscheme

Solution:

(a) (i) r = 0.995 (0.994705…)

(ii) m = 10.6t + 43.9 (10.6032…t + 43.8780…)

(b) EITHER
$10.6032…\times1.5$
OR
$(10.6032…(t+1.5)+43.8780…)-(10.6032…t+43.8780…)$
THEN

15.9 (marks) (15.9048…)

(c) Accept any valid reason

e.g. The students in the sample might not be of equal ability / she has not controlled for ability. She might have originally obtained close to full marks, so an extra 15.9 would not be possible.

Question

Observations on 12 pairs of values of the random variables X , Y yielded the following results.

\(\sum x = 76.3,\ \sum x^2 = 563.7,\ \sum y = 72.2,\ \sum y^2 = 460.1,\ \sum xy = 495.4\)

(a) (i) Calculate the value of r, the product moment correlation coefficient of the sample.

(ii) Assuming that the distribution of X, Y is bivariate normal with product moment correlation coefficient ρ, calculate the p-value of your result when testing the hypotheses H0 : ρ = 0; H1 : ρ > 0.

(iii) State whether your p-value suggests that X and Y are independent. [7]

(b) Given a further value x = 5.2 from the distribution of X, Y, predict the corresponding value of y. Give your answer to one decimal place. [3]
Answer/Explanation

Ans:

(a)

(i) use of the formula for r gives r = 0.809 (0.80856…)

(ii)

t = 0.80856… \(\sqrt{\frac{10}{1-0.80856…^2}}\)

= 4.345…

p-value = \(7.27 \times 10^{-4}\)

(iii) this value indicates that X,Y are not independent
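Parts (i) and (ii) can be checked numerically from the summary statistics alone. The sketch below uses only the Python standard library; the helper names `t_pdf` and `upper_tail` are hypothetical, and the tail probability is approximated with Simpson's rule rather than taken from a statistics package, so it is an illustration of the calculation, not the markscheme's method.

```python
import math

# Summary statistics quoted in the question (n = 12 pairs).
n = 12
sum_x, sum_xx = 76.3, 563.7
sum_y, sum_yy = 72.2, 460.1
sum_xy = 495.4

# Corrected sums of squares and products.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n

r = s_xy / math.sqrt(s_xx * s_yy)
df = n - 2
t_stat = r * math.sqrt(df / (1 - r ** 2))


def t_pdf(x, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    coeff = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coeff * (1 + x * x / v) ** (-(v + 1) / 2)


def upper_tail(t_value, v, steps=20000):
    """P(T > t_value) for t_value >= 0, using Simpson's rule on [0, t_value]."""
    h = t_value / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, v) + t_pdf(t_value, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3


p_value = upper_tail(t_stat, df)  # one-tailed, matching H1: rho > 0
print(round(r, 3), round(t_stat, 2), f"{p_value:.2e}")
```

The one-tailed probability is used because the alternative hypothesis is ρ > 0; a two-sided test would double it.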

(b)

use of the regression line of y on x

putting x = 5.2 gives y = 5.5
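The prediction can be reconstructed from the summary statistics as follows. This is a worked sketch using the standard least squares formulas; the notation \(S_{xy}\), \(S_{xx}\) is introduced here for convenience and does not appear in the original answer.

```latex
\begin{aligned}
S_{xy} &= 495.4 - \frac{(76.3)(72.2)}{12} \approx 36.328, \qquad
S_{xx} = 563.7 - \frac{76.3^2}{12} \approx 78.559 \\[4pt]
b &= \frac{S_{xy}}{S_{xx}} \approx 0.46243, \qquad
a = \frac{72.2}{12} - b \times \frac{76.3}{12} \approx 3.0764 \\[4pt]
y &= 3.0764 + 0.46243 \times 5.2 \approx 5.48 \approx 5.5 \ \text{(1 d.p.)}
\end{aligned}
```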

Question

Jim is investigating the relationship between height and foot length in teenage boys.

A sample of 13 boys is taken and the height and foot length of each boy are measured.

The results are shown in the table.

You may assume that this is a random sample from a bivariate normal distribution.

Jim wishes to determine whether or not there is a positive association between height and foot length.

a. Calculate the product moment correlation coefficient. [2]

b. Find the \(p\)-value. [2]

c. Interpret the \(p\)-value in the context of the question. [1]

d. Find the equation of the regression line of \(y\) on \(x\). [2]

e. Estimate the foot length of a boy of height 170 cm. [2]

 
Answer/Explanation

Markscheme

Note: In all parts accept answers which round to the correct 2sf answer.

a. \(r = 0.806\)     A2

b. \(4.38 \times 10^{-4}\)     A2

c. \(p\)-value represents strong evidence to indicate a (positive) association between height and foot length     A1

Note: FT the \(p\)-value

d. \(y = 0.103x + 12.3\)     A2

e. attempted substitution of \(x = 170\)     (M1)

\(y = 29.7\)     A1

Note: Accept \(y = 29.8\)
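As a quick check on part e, substituting into the rounded line from part d gives the alternative value that the markscheme accepts. That 29.7 arises from unrounded calculator coefficients is an assumption here, since the data table is not reproduced on this page.

```latex
\[
y = 0.103 \times 170 + 12.3 = 17.51 + 12.3 = 29.81 \approx 29.8
\]
```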

Question

Bill is investigating whether or not there is a positive association between the heights and weights of boys of a certain age. He defines the hypotheses
\[{\rm{H}}_0:\rho = 0;\quad {\rm{H}}_1:\rho > 0,\]
where \(\rho \) denotes the population correlation coefficient between heights and weights of boys of this age. He measures the height, \(h\) cm, and weight, \(w\) kg, of each of a random sample of \(20\) boys of this age and he calculates the following statistics.
\[\sum w = 340,\quad \sum h = 2002,\quad \sum w^2 = 5830,\quad \sum h^2 = 201124,\quad \sum hw = 34150\]

a. (i)     Calculate the correlation coefficient for this sample.

(ii)     Calculate the \(p\)-value of your result and interpret it at the \(1\% \) level of significance. [8]

b. (i)     Calculate the equation of the least squares regression line of \(w\) on \(h\).

(ii)     The height of a randomly selected boy of this age is \(90\) cm. Estimate his weight. [3]

 
Answer/Explanation

Markscheme

a. (i)     \(r = \frac{34150 - 340 \times \frac{2002}{20}}{\sqrt{\left(5830 - \frac{340^2}{20}\right)\left(201124 - \frac{2002^2}{20}\right)}}\)     (M1)(A1)

Note: Accept equivalent formula.

\( = 0.610\)     A1

(ii)     (\(T = R \times \sqrt {\frac{n - 2}{1 - R^2}} \) has the t-distribution with \(n - 2\) degrees of freedom)

\(t = 0.6097666 \ldots \sqrt {\frac{18}{1 - 0.6097666{ \ldots ^2}}} \)     M1

\( = 3.2640 \ldots \)     A1

\({\rm{DF}} = 18\)     A1

\(p{\rm{\text{-}value}} = 0.00215 \ldots \)     A1

this is less than \(0.01\), so we conclude that there is a positive association between heights and weights of boys of this age     R1

[8 marks]

b. (i)     the equation of the regression line of \(w\) on \(h\) is

\(w - \frac{340}{20} = \left( \frac{20 \times 34150 - 340 \times 2002}{20 \times 201124 - 2002^2} \right)\left( h - \frac{2002}{20} \right)\)     M1

\(w = 0.160h + 0.957\)     A1

(ii) putting \(h = 90\), \(w = 15.4\) (kg)     A1

Note: Award M0A0A0 for calculation of \(h\) on \(w\).

[3 marks]
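The regression line in part b can be verified from the summary statistics. The following is a minimal sketch in plain Python; the names `slope`, `intercept` and `predict_weight` are illustrative and not taken from the markscheme.

```python
# Summary statistics from the question (n = 20 boys).
n = 20
sum_w, sum_h = 340, 2002
sum_hh = 201124
sum_hw = 34150

# Slope and intercept of the regression line of w on h.
slope = (n * sum_hw - sum_w * sum_h) / (n * sum_hh - sum_h ** 2)
intercept = sum_w / n - slope * sum_h / n


def predict_weight(height_cm):
    """Estimated weight in kg from the fitted line w = slope * h + intercept."""
    return slope * height_cm + intercept


print(round(slope, 3), round(intercept, 3))  # 0.16 0.957
print(round(predict_weight(90), 1))          # 15.4
```

Regressing \(w\) on \(h\) (not \(h\) on \(w\)) matters here: the note in the markscheme awards no marks for the reversed line.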

Question

The random variables \(X\), \(Y\) follow a bivariate normal distribution with product moment correlation coefficient \(\rho \). The following table gives a random sample from this distribution.

(a)     Determine the value of \(r\), the product moment correlation coefficient of this sample.

(b)     (i)     Write down hypotheses in terms of \(\rho \) which would enable you to test whether or not \(X\) and \(Y\) are independent.

(ii)     Determine the p-value of the above sample and state your conclusion at the 5% significance level. Justify your answer.

(c)     (i)     Determine the equation of the regression line of \(y\) on \(x\).

(ii)     State whether or not this equation can be used to obtain an accurate prediction of the value of \(y\) for a given value of \(x\). Give a reason for your answer.

Answer/Explanation

Markscheme

(a)     \(r = -0.163\)     A2

[2 marks]

 

(b)     (i)     \({{\text{H}}_0}:\rho = 0;\ {{\text{H}}_1}:\rho \ne 0\)     A1

(ii)     \(t = r\sqrt {\frac{n - 2}{1 - r^2}} = -0.468 \ldots \)     (A1)

\({\text{DF}} = 8\)     (A1)

\(p{\text{-value}} = 2 \times 0.326 \ldots  = 0.652\)   A1

since \(0.652 > 0.05\), we accept \({{\text{H}}_0}\)     R1

Note: Award (A1)(A1)A0 if the p-value is given as \(0.326\) without prior working.

Note: Follow through their p-value for the R1.

[5 marks]

 

(c)     (i)     \(y =  – 0.257x + 5.22\)     A1

Note: Accept answers which round to \(-0.26\) and \(5.2\).

(ii)     no, because \(X\) and \(Y\) have been shown to be independent (or equivalent)     A1

[2 marks]

Question

[Maximum mark: 6]
Consider the following data

The regression line for y on x is y = 2.2x – 0.5
(a) Solve the equation above for x to find an expression in the form x = ay+b [2]
(b) Find the equation x = cy+d of the regression line for x on y. [2]
(c) Describe the advantage of the linear equation in (b). [2]

Answer/Explanation

Ans:
(a) y = 2.2x – 0.5 ⇔ y + 0.5 = 2.2x ⇔ x = 0.455 y + 0.227
(b) x = 0.423y + 0.385
(c) The relation in (a) is in fact the inverse function of the line y = 2.2x – 0.5.
If y is given, the answer in (b) gives a more reliable estimation of x.
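The difference between rearranging the y-on-x line and fitting x on y directly can be seen numerically. Because this question's data table is not reproduced here, the sketch below uses the five (t, m) pairs from the first question on this page as stand-in data; that substitution and the helper name `regression` are assumptions made purely for illustration.

```python
# Stand-in data: the (t, m) pairs from the first question on this page,
# used only because this question's own table is not reproduced.
xs = [0, 1.2, 1.6, 2.5, 4]
ys = [45, 54, 61, 72, 86]


def regression(us, vs):
    """Least squares line v = slope * u + intercept; returns (slope, intercept)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    s_uu = sum((u - mean_u) ** 2 for u in us)
    slope = s_uv / s_uu
    return slope, mean_v - slope * mean_u


b_yx, a_yx = regression(xs, ys)  # y on x: minimises vertical residuals
c_xy, d_xy = regression(ys, xs)  # x on y: minimises horizontal residuals

# Rearranging the y-on-x line for x gives slope 1 / b_yx, which differs from c_xy.
print(round(1 / b_yx, 4), round(c_xy, 4))
```

The two slopes multiply to \(r^2\), so the rearranged line and the x-on-y line coincide only when the points are perfectly collinear. Otherwise the x-on-y line is the one fitted to minimise errors in x, which is why it is preferred when predicting x from y.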
