IBDP MAI: Topic 4: Statistics and probability - AHL 4.13 - Non-linear regression. Exam Style Questions Paper 3

Question

This question is about modelling the spread of a computer virus to predict the number of computers in a city which will be infected by the virus.

A systems analyst defines the following variables in a model:
– $t$ is the number of days since the first computer was infected by the virus.
– $Q(t)$ is the total number of computers that have been infected up to and including day $t$.

The following data were collected:

A model for the early stage of the spread of the computer virus suggests that
$$
Q^{\prime}(t)=\beta N Q(t)
$$
where $N$ is the total number of computers in a city and $\beta$ is a measure of how easily the virus is spreading between computers. Both $N$ and $\beta$ are assumed to be constant.

The data above are taken from city $\mathrm{X}$ which is estimated to have 2.6 million computers.
The analyst looks at data for another city, Y. These data indicate a value of $\beta=9.64 \times 10^{-8}$.

An estimate for $Q^{\prime}(t), t \geq 5$, can be found by using the formula:
$$
Q^{\prime}(t) \approx \frac{Q(t+5)-Q(t-5)}{10}
$$

The following table shows estimates of $Q^{\prime}(t)$ for city $\mathrm{X}$ at different values of $t$.

An improved model for $Q(t)$, which is valid for large values of $t$, is the logistic differential equation
$$
Q^{\prime}(t)=k Q(t)\left(1-\frac{Q(t)}{L}\right)
$$
where $k$ and $L$ are constants.
Based on this differential equation, the graph of $\frac{Q^{\prime}(t)}{Q(t)}$ against $Q(t)$ is predicted to be a straight line.
a.i. Find the equation of the regression line of $Q(t)$ on $t$.
a.ii. Write down the value of $r$, Pearson’s product-moment correlation coefficient.
a.iii. Explain why it would not be appropriate to conduct a hypothesis test on the value of $r$ found in (a)(ii).
b.i. Find the general solution of the differential equation $Q^{\prime}(t)=\beta N Q(t)$.
b.ii. Using the data in the table, write down the equation for an appropriate non-linear regression model.
b.iii. Write down the value of $R^2$ for this model.
b.iv. Hence comment on the suitability of the model from (b)(ii) in comparison with the linear model found in part (a).
b.v. By considering large values of $t$, write down one criticism of the model found in (b)(ii).
c. Use your answer from part (b)(ii) to estimate the time taken for the number of infected computers to double.
d. Find in which city, $\mathrm{X}$ or $\mathrm{Y}$, the computer virus is spreading more easily. Justify your answer using your results from part (b).
e. Determine the value of $a$ and of $b$. Give your answers correct to one decimal place.
f.i. Use linear regression to estimate the value of $k$ and of $L$.
f.ii. The solution to the differential equation is given by
$$
Q(t)=\frac{L}{1+C \mathrm{e}^{-k t}}
$$
where $C$ is a constant.
Using your answer to part (f)(i), estimate the percentage of computers in city $X$ that are expected to have been infected by the virus over a long period of time.

▶️Answer/Explanation

a.i. $Q(t)=3090 t-54000(3094.27 \ldots t-54042.3 \ldots)$
A1A1
Note: Award at most A1A0 if answer is not an equation. Award A1A0 for an answer including either $x$ or $y$.
[2 marks]
a.ii. $0.755(0.754741 \ldots)$
A1
[1 mark]
a.iii. $t$ is not a random variable OR it is not a (bivariate) normal distribution
OR data is not a sample from a population
OR data appears nonlinear
OR $r$ only measures linear correlation
R1
Note: Do not accept “$r$ is not large enough”.
[1 mark]
b.i. attempt to separate variables
(M1)
$$
\begin{aligned}
& \int \frac{1}{Q} \mathrm{~d} Q=\int \beta N \mathrm{~d} t \\
& \ln |Q|=\beta N t+c
\end{aligned}
$$

A1A1A1

Note: Award $\boldsymbol{A 1}$ for LHS, $\boldsymbol{A 1}$ for $\beta N t$, and $\boldsymbol{A 1}$ for $+c$.
Award full marks for $Q=\mathrm{e}^{\beta N t+c}$ OR $Q=A \mathrm{e}^{\beta N t}$.
Award M1A1A1A0 for $Q=\mathrm{e}^{\beta N t}$
[4 marks]

b.ii. attempt at exponential regression
(M1)
$$
Q=1.15 \mathrm{e}^{0.292 t}\left(Q=1.14864 \ldots \mathrm{e}^{0.292055 \ldots t}\right)
$$
A1
OR
attempt at exponential regression
(M1)
$$
Q=1.15 \times 1.34^t\left(Q=1.14864 \ldots \times 1.33917 \ldots^t\right)
$$
A1
Note: Condone answers involving $y$ or $x$. Condone absence of “$Q=$”. Award M1A0 for an incorrect answer in correct format.
[2 marks]
b.iii. $0.999(0.999431 \ldots)$
A1
[1 mark]
b.iv. comparing something to do with $R^2$ and something to do with $r \quad \boldsymbol{M 1}$
Note: Examples of where the $\boldsymbol{M 1}$ should be awarded:
$$
\begin{aligned}
& R^2>r \\
& R>r \\
& 0.999>0.755 \\
& 0.999>0.755^2 \quad(=0.563)
\end{aligned}
$$

The “correlation coefficient” in the exponential model is larger.
Model B has a larger $R^2$
Examples of where the $\boldsymbol{M 1}$ should not be awarded:
The exponential model shows better correlation (since not clear how it is being measured)
Model 2 has a better fit
Model 2 is more correlated
an unambiguous comparison between $R^2$ and $r^2$ or $R$ and $r$ leading to the conclusion that the model in part (b) is more suitable / better

Note: Condone candidates claiming that $R$ is the “correlation coefficient” for the non-linear model.
[2 marks]

b.v. it suggests that there will be more infected computers than the entire population
R1
Note: Accept any response that recognizes unlimited growth.
[1 mark]
c. $1.15 \mathrm{e}^{0.292 t}=2.3$ OR $1.15 \times 1.34^t=2.3$ OR $t=\frac{\ln 2}{0.292}$ OR using the model to find two specific times with values of $Q(t)$ which double
M1
$t=2.37$ (days)
A1
Note: Do not $\boldsymbol{F T}$ from a model which is not exponential. Award M0A0 for an answer of 2.13 which comes from using (10, 20) from the data or any other answer which finds a doubling time from figures given in the table.
[2 marks]
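
As a numerical check of part (c): for any exponential model $Q=A \mathrm{e}^{r t}$ the doubling time is $\frac{\ln 2}{r}$, whatever the value of $A$. The sketch below assumes the 3 sf growth rate 0.292 from part (b)(ii); the variable names are illustrative and are not part of the mark scheme.

```python
import math

# Growth rate from the exponential regression in part (b)(ii): Q = 1.15 e^{0.292 t}
growth_rate = 0.292

# Doubling time solves e^{growth_rate * t} = 2, so the coefficient 1.15 cancels
doubling_time = math.log(2) / growth_rate

print(f"doubling time = {doubling_time:.2f} days")
```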
d. an attempt to calculate $\beta$ for city $\mathrm{X}$
(M1)
$$
\begin{aligned}
\beta & =\frac{0.292055 \ldots}{2.6 \times 10^6} \text { OR } \beta=\frac{\ln 1.33917 \ldots}{2.6 \times 10^6} \\
& =1.12328 \ldots \times 10^{-7} \quad \text { A1 }
\end{aligned}
$$
this is larger than $9.64 \times 10^{-8}$ so the virus spreads more easily in city X $\quad \boldsymbol{R 1}$

Note: It is possible to award M1A0R1.
Condone “so the virus spreads faster in city X” for the final $\boldsymbol{R 1}$.
[3 marks]
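
The comparison in part (d) can be reproduced by matching the exponent of the regression model with $\beta N$ from part (b)(i). This is a minimal sketch using the unrounded exponent quoted in the mark scheme; the variable names are illustrative.

```python
# Exponent of the exponential regression for city X, which equals beta * N
growth_rate_x = 0.292055
# Estimated number of computers in city X
computers_x = 2.6e6
# Value of beta quoted in the question for city Y
beta_y = 9.64e-8

beta_x = growth_rate_x / computers_x

# A larger beta means the virus spreads more easily between computers
spreads_more_easily_in = "X" if beta_x > beta_y else "Y"
print(f"beta_X = {beta_x:.5e}, spreads more easily in city {spreads_more_easily_in}")
```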
e. $a=38.3, b=3086.1$
A1A1
Note: Award $\mathbf{A 1 A 0}$ if values are correct but not to $1 \mathrm{dp}$.
[2 marks]

f.i. $\frac{Q^{\prime}}{Q}=0.42228-2.5561 \times 10^{-6} Q$
(A1)(A1)
Note: Award $\mathbf{A 1}$ for each coefficient seen – not necessarily in the equation. Do not penalize seeing in the context of $y$ and $x$.
identifying that the constant is $k$ OR that the gradient is $-\frac{k}{L}$
(M1)
therefore $k=0.422(0.422228 \ldots)$
A1
$\frac{k}{L}=2.5561 \times 10^{-6}$
$L=165000(165205)$
A1
Note: Accept a value of $L$ of 164843 from use of $3 \mathrm{sf}$ value of $k$, or any other value from plausible pre-rounding.
Allow follow-through within the question part, from the equation of their line to the final two $\boldsymbol{A 1}$ marks.
[5 marks]
f.ii. recognizing that their $L$ is the eventual number of infected
(M1)
$$
\frac{165205 \ldots}{2600000}=6.35 \%
$$
(6.35403…\%)
A1

Note: Accept any final answer consistent with their answer to part (f)(i) unless their $L$ is less than 120146 in which case award at most M1A0.
[2 marks]
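
Dividing the logistic equation by $Q$ gives $\frac{Q^{\prime}}{Q}=k-\frac{k}{L} Q$, so the intercept of the regression line is $k$ and its gradient is $-\frac{k}{L}$. The sketch below recovers $k$, $L$ and the long-term percentage from the two coefficients quoted in part (f)(i). It assumes those coefficients as given, since the underlying table is not reproduced here.

```python
# Coefficients of the regression line Q'/Q = intercept + gradient * Q from part (f)(i)
intercept = 0.42228
gradient = -2.5561e-6

k = intercept          # the constant term equals k
L = -k / gradient      # the gradient equals -k/L, so L = -k / gradient

# As t grows, Q(t) tends to L, so L is the eventual number of infected computers
computers_in_city_x = 2.6e6
percentage_infected = 100 * L / computers_in_city_x

print(f"k = {k:.3f}, L = {L:.0f}, long-term percentage = {percentage_infected:.2f}%")
```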

 

 

Question

In this question you will explore possible models for the spread of an infectious disease

An infectious disease has begun spreading in a country. The National Disease Control Centre (NDCC) has compiled the following data after receiving alerts from hospitals.

A graph of $n$ against $d$ is shown below.

The NDCC want to find a model to predict the total number of people infected, so they can plan for medicine and hospital facilities. After looking at the data, they think an exponential function in the form $n=a b^d$ could be used as a model.

Use your answer to part (a) to predict

The NDCC want to verify the accuracy of these predictions. They decide to perform a $\chi^2$ goodness of fit test.

The predictions given by the model for the first five days are shown in the table.

In fact, the first day when the total number of people infected is greater than 1000 is day 14, when a total of 1015 people are infected.

Based on this new data, the NDCC decide to try a logistic model in the form $n=\frac{L}{1+c e^{-k d}}$.

Use the data from days $1-5$, together with day 14, to find the value of
a. Use an exponential regression to find the value of $a$ and of $b$, correct to 4 decimal places.
b.i. the number of new people infected on day 6.
b.ii. the day when the total number of people infected will be greater than 1000.
c. Use your answer to part (a) to show that the model predicts 16.7 people will be infected on the first day.
d.i. Explain why the number of degrees of freedom is 2.
d.ii. Perform a $\chi^2$ goodness of fit test at the $5 \%$ significance level. You should clearly state your hypotheses, the p-value, and your conclusion.
e. Give two reasons why the prediction in part (b)(ii) might be lower than 14.
f.i. $L$.
f.ii. $c$.
f.iii. $k$.
g. Hence predict the total number of people infected by this disease after several months.
h. Use the logistic model to find the day when the rate of increase of people infected is greatest.

▶️Answer/Explanation

a. $a=9.7782, b=1.7125 \quad$ M1A1A1
[3 marks]
b.i. $n(6)=247$
A1
number of new people infected $=247-140=107$ M1A1
[3 marks]
b.ii. use of graph or table M1
day 9 A1
[2 marks]
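
The predictions in parts (b) and (c) can be checked by evaluating $n=a b^d$ directly. This is a sketch under two assumptions: the regression values from part (a) are used as quoted, and the figure 140 subtracted in part (b)(i) is taken from the mark scheme as given, because the data table is not reproduced here.

```python
# Parameters of the exponential regression n = a * b**d from part (a)
a, b = 9.7782, 1.7125

def predicted_total(day):
    """Total number of people infected by the given day, according to the model."""
    return a * b ** day

# Part (b)(i): predicted total on day 6, minus the figure 140 used in the mark scheme
new_infected_day6 = round(predicted_total(6)) - 140

# Part (b)(ii): first whole day on which the predicted total exceeds 1000
first_day_over_1000 = next(d for d in range(1, 100) if predicted_total(d) > 1000)

print(round(predicted_total(1), 1), new_infected_day6, first_day_over_1000)
```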
c. $9.7782(1.7125)^1 \quad$ M1
$=16.7$ people $\boldsymbol{A G}$
[1 mark]
d.i. 2 parameters $(a$ and $b$ ) were estimated from the data. $\quad \boldsymbol{R 1}$
$v=5-1-2 \quad M 1$
$=2 \quad A G$
[2 marks]
d.ii. $H_0$ : data is modeled by $n(d)=9.7782(1.7125)^d$ and $H_1$ : data is not modeled by $n(d)=9.7782(1.7125)^d$
A1
p-value $=0.893 \quad$ A2
Since $0.893>0.05 \quad \boldsymbol{R 1}$
Insufficient evidence to reject $H_0$. So data is modeled by $n(d)=9.7782(1.7125)^d \quad$ A1
[5 marks]
e. vaccine or medicine might slow down rate of infection $\boldsymbol{R 1}$
People become more aware of disease and take precautions to avoid infection
R1
Accept other valid reasons
f.i. 1060
M1A1
[2 marks]
f.ii. 108
A1
[1 mark]
f.iii. 0.560 A1
[1 mark]
g. As $d \rightarrow \infty \quad$ M1
$$
n \rightarrow 1060
$$
[2 marks]
h. sketch of $\frac{\mathrm{d} n}{\mathrm{~d} d}$ or solve $\frac{\mathrm{d}^2 n}{\mathrm{~d} d^2}=0 \quad \boldsymbol{M 1}$
$$
d=8.36
$$
A1
Day 8
A1
[3 marks]
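
For a logistic curve $n=\frac{L}{1+c \mathrm{e}^{-k d}}$ the rate of increase is greatest at the point of inflexion, where $c \mathrm{e}^{-k d}=1$ and $n=\frac{L}{2}$, which gives $d=\frac{\ln c}{k}$. The sketch below assumes the 3 sf parameter values from part (f) and checks the value of $d$ in part (h).

```python
import math

# Logistic parameters from part (f), rounded to 3 sf
L, c, k = 1060, 108, 0.560

def logistic(d):
    """Total number of people infected by day d under the logistic model."""
    return L / (1 + c * math.exp(-k * d))

# The rate of increase is greatest where c * e^{-k d} = 1
d_peak = math.log(c) / k

print(f"d = {d_peak:.2f}, n = {logistic(d_peak):.0f}")
```

At this value of $d$ the model predicts half of the limiting total $L$.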

 

Question

This question explores methods to determine the area bounded by an unknown curve.
The curve $y=f(x)$ is shown in the graph, for $0 \leqslant x \leqslant 4.4$.

The curve $y=f(x)$ passes through the following points.

It is required to find the area bounded by the curve, the $x$-axis, the $y$-axis and the line $x=4.4$.

One possible model for the curve $y=f(x)$ is a cubic function.

A second possible model for the curve $y=f(x)$ is an exponential function, $y=p \mathrm{e}^{q x}$, where $p, q \in \mathbb{R}$.
a.i. Use the trapezoidal rule to find an estimate for the area.
a.ii. With reference to the shape of the graph, explain whether your answer to part (a)(i) will be an over-estimate or an underestimate of the area.
b.i. Use all the coordinates in the table to find the equation of the least squares cubic regression curve.
b.ii. Write down the coefficient of determination.
c.i. Write down an expression for the area enclosed by the cubic function, the $x$-axis, the $y$-axis and the line $x=4.4$.
c.ii. Find the value of this area.
d.i. Show that $\ln y=q x+\ln p$.
d.ii. Hence explain how a straight line graph could be drawn using the coordinates in the table.
d.iii. By finding the equation of a suitable regression line, show that $p=1.83$ and $q=0.986$.
d.iv. Hence find the area enclosed by the exponential function, the $x$-axis, the $y$-axis and the line $x=4.4$.

▶️Answer/Explanation

a.i. Area $=\frac{1.1}{2}(2+2(5+15+47)+148) \quad$ M1A1
Area $=156$ units $^2$
A1
[3 marks]
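
The trapezoidal rule calculation can be reproduced in a few lines. The table of points is not reproduced above, so the ordinates 2, 5, 15, 47 and 148 at spacing 1.1 are read off from the mark scheme expression and should be treated as an assumption about the table.

```python
# Assumed data points, inferred from the mark scheme working for part (a)(i)
xs = [0.0, 1.1, 2.2, 3.3, 4.4]
ys = [2.0, 5.0, 15.0, 47.0, 148.0]

# Sum the areas of the trapezoids between consecutive points
area = sum(
    (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
    for i in range(len(xs) - 1)
)

print(f"trapezoidal estimate = {area:.1f}")
```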
a.ii. The graph is concave up, $\quad \boldsymbol{R 1}$
so the trapezoidal rule will give an overestimate.
A1
[2 marks]
b.i. $f(x)=3.88 x^3-12.8 x^2+14.1 x+1.54 \quad$ M1A2
[3 marks]
b.ii. $R^2=0.999$
A1
[1 mark]
c.i. Area $=\int_0^{4.4}\left(3.88 x^3-12.8 x^2+14.1 x+1.54\right) d x \quad$ A1A1
[2 marks]
c.ii. Area $=145$ units $^2 \quad$ (Condone 143-145 units $^2$, using rounded values.) A2
[2 marks]
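
The condoned range in part (c)(ii) can be seen by integrating the cubic term by term with the 3 sf coefficients from part (b)(i). This sketch only has access to those rounded coefficients, so it lands at the lower end of the range rather than on 145.

```python
# Rounded coefficients of the cubic from part (b)(i): x^3, x^2, x, constant
coeffs = [3.88, -12.8, 14.1, 1.54]

def antiderivative(x):
    """Term-by-term antiderivative of the cubic, with constant of integration 0."""
    degree = len(coeffs) - 1
    total = 0.0
    for i, coefficient in enumerate(coeffs):
        power = degree - i
        total += coefficient * x ** (power + 1) / (power + 1)
    return total

area = antiderivative(4.4) - antiderivative(0.0)
print(f"area with rounded coefficients = {area:.1f}")
```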
d.i. $\ln y=\ln \left(p \mathrm{e}^{q x}\right) \quad$ M1
$\ln y=\ln p+\ln \left(\mathrm{e}^{q x}\right) \quad \boldsymbol{A 1}$
$\ln y=q x+\ln p \quad$ AG
[2 marks]
d.ii. Plot $\ln y$ against $x . \quad \boldsymbol{R 1}$
[1 mark]
d.iii. Regression line is $\ln y=0.986 x+0.602 \quad$ M1A1
So $q=$ gradient $=0.986 \quad \boldsymbol{R 1}$
$p=e^{0.602}=1.83 \quad$ M1A1
d.iv. Area $=\int_0^{4.4} 1.83 \mathrm{e}^{0.986 x} \mathrm{~d} x=140$ units $^2 \quad$ M1A1
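
Parts (d)(iii) and (d)(iv) can be checked together: fit a least squares line to $\ln y$ against $x$, read off $q$ and $\ln p$, then integrate the exponential exactly. The data points below are inferred from the trapezoidal rule working in part (a)(i), since the table is not reproduced here, so they are an assumption.

```python
import math

# Assumed data points, inferred from the mark scheme working for part (a)(i)
xs = [0.0, 1.1, 2.2, 3.3, 4.4]
ys = [2.0, 5.0, 15.0, 47.0, 148.0]

# Linearise the model y = p e^{q x} as ln y = q x + ln p
log_ys = [math.log(y) for y in ys]
n = len(xs)
mean_x = sum(xs) / n
mean_log_y = sum(log_ys) / n

# Least squares gradient and intercept of ln y on x
s_xy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
q = s_xy / s_xx
p = math.exp(mean_log_y - q * mean_x)

# Exact integral of p e^{q x} from 0 to 4.4
area = p / q * (math.exp(4.4 * q) - 1)

print(f"q = {q:.3f}, p = {p:.2f}, area = {area:.0f}")
```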

 

Question

This question explores models for the height of water in a cylindrical container as water drains out.

The diagram shows a cylindrical water container of height 3.2 metres and base radius 1 metre. At the base of the container is a small circular valve, which enables water to drain out.

Eva closes the valve and fills the container with water.
At time $t=0$, Eva opens the valve. She records the height, $h$ metres, of water remaining in the container every 5 minutes.

Eva first tries to model the height using a linear function, $h(t)=a t+b$, where $a, b \in \mathbb{R}$.

Eva uses the equation of the regression line of $h$ on $t$, to predict the time it will take for all the water to drain out of the container.

Eva thinks she can improve her model by using a quadratic function, $h(t)=p t^2+q t+r$, where $p, q, r \in \mathbb{R}$.

Eva uses this equation to predict the time it will take for all the water to drain out of the container and obtains an answer of $k$ minutes.

Let $V$ be the volume, in cubic metres, of water in the container at time $t$ minutes.
Let $R$ be the radius, in metres, of the circular valve.

Eva does some research and discovers a formula for the rate of change of $V$.
$$
\frac{\mathrm{d} V}{\mathrm{~d} t}=-\pi R^2 \sqrt{70560 h}
$$

Eva measures the radius of the valve to be 0.023 metres. Let $T$ be the time, in minutes, it takes for all the water to drain out of the container.

Eva wants to use the container as a timer. She adjusts the initial height of water in the container so that all the water will drain out of the container in 15 minutes.

Eva has another water container that is identical to the first one. She places one water container above the other one, so that all the water from the highest container will drain into the lowest container. Eva completely fills the highest container, but only fills the lowest container to a height of 1 metre, as shown in the diagram.

At time $t=0$ Eva opens both valves. Let $H$ be the height of water, in metres, in the lowest container at time $t$.
a.i. Find the equation of the regression line of $h$ on $t$.
a.ii. Interpret the meaning of parameter $a$ in the context of the model.
a.iii. Suggest why Eva’s use of the linear regression equation in this way could be unreliable.
b.i. Find the equation of the least squares quadratic regression curve.
b.ii. Find the value of $k$.
b.iii. Hence, write down a suitable domain for Eva’s function $h(t)=p t^2+q t+r$.
c. Show that $\frac{\mathrm{d} h}{\mathrm{~d} t}=-R^2 \sqrt{70560 h}$.
d. By solving the differential equation $\frac{\mathrm{d} h}{\mathrm{~d} t}=-R^2 \sqrt{70560 h}$, show that the general solution is given by $h=17640\left(c-R^2 t\right)^2$, where $c \in \mathbb{R}$.
e. Use the general solution from part (d) and the initial condition $h(0)=3.2$ to predict the value of $T$.
f. Find this new height.
g.i. Show that $\frac{\mathrm{d} H}{\mathrm{~d} t} \approx 0.2514-0.009873 t-0.1405 \sqrt{H}$, where $0 \leq t \leq T$.
g.ii. Use Euler’s method with a step length of 0.5 minutes to estimate the maximum value of $H$.

▶️Answer/Explanation

a.i. $h(t)=-0.134 t+3.1$
A1A1
Note: Award $\mathbf{A 1}$ for an equation in $h$ and $t$ and $\boldsymbol{A 1}$ for the coefficient -0.134 and constant 3.1.
[2 marks]
a.ii. EITHER
the rate of change of height (of water in metres per minute)
A1
Note: Accept “rate of decrease” or “rate of increase” in place of “rate of change”.

OR
the (average) amount that the height (of the water) decreases each minute
A1
[1 mark]
a.iii. EITHER
unreliable to use $h$ on $t$ equation to estimate $t$
A1
OR
unreliable to extrapolate from original data
A1
OR
rate of change (of height) might not remain constant (as the water drains out)
A1
[1 mark]
b.i. $h(t)=0.002 t^2-0.174 t+3.2$
A1
[1 mark]
b.ii. $0.002 t^2-0.174 t+3.2=0$
(M1)
$26.4(26.4046 \ldots)$
A1
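
The value of $k$ is the smaller positive root of the quadratic, since that is the first time the modelled height reaches zero. The sketch below uses the rounded coefficients from part (b)(i) and the quadratic formula; the variable names are illustrative.

```python
import math

# Rounded coefficients of h(t) = p t^2 + q t + r from part (b)(i)
p, q, r = 0.002, -0.174, 3.2

discriminant = q * q - 4 * p * r
roots = sorted([
    (-q - math.sqrt(discriminant)) / (2 * p),
    (-q + math.sqrt(discriminant)) / (2 * p),
])

# The container is first empty at the smaller root
k = roots[0]
print(f"k = {k:.1f} minutes")
```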

b.iii. EITHER
$$
(0 \leq) t \leq 26.4 \quad(t \leq 26.4046 \ldots)
$$
A1
OR
$(0 \leq) t \leq 20$ (due to range of original data / interpolation)
A1
[1 mark]
c. $V=\pi(1)^2 h$
(A1)
EITHER
$$
\frac{\mathrm{d} V}{\mathrm{~d} t}=\pi \frac{\mathrm{d} h}{\mathrm{~d} t} \quad \boldsymbol{M 1}
$$

OR
attempt to use chain rule M1
$$
\frac{\mathrm{d} h}{\mathrm{~d} t}=\frac{\mathrm{d} h}{\mathrm{~d} V} \times \frac{\mathrm{d} V}{\mathrm{~d} t}
$$

THEN
$$
\begin{aligned}
& \frac{\mathrm{d} h}{\mathrm{~d} t}=\frac{1}{\pi} \times-\pi R^2 \sqrt{70560 h} \quad \text { A1 } \\
& \frac{\mathrm{d} h}{\mathrm{~d} t}=-R^2 \sqrt{70560 h} \quad \text { AG }
\end{aligned}
$$
[3 marks]

d. attempt to separate variables
M1
$$
\begin{aligned}
& \int \frac{1}{\sqrt{70560 h}} \mathrm{~d} h=\int-R^2 \mathrm{~d} t \\
& \frac{2 \sqrt{h}}{\sqrt{70560}}=-R^2 t+c
\end{aligned}
$$
A1
A1A1
Note: Award $\boldsymbol{A 1}$ for each correct side of the equation.
$\sqrt{h}=\frac{\sqrt{70560}}{2}\left(c-R^2 t\right)$
A1
Note: Award the final $\mathbf{A 1}$ for any correct intermediate step that clearly leads to the given equation.
$$
h=17640\left(c-R^2 t\right)^2 \quad A G
$$
[5 marks]
e. $t=0 \Rightarrow 3.2=17640 c^2$
(M1)
$$
c=0.0134687 \ldots
$$
substituting $h=0$ and their non-zero value of $c$
(M1)
$T=\frac{c}{R^2}=\frac{0.0134687 \ldots}{0.023^2}$
$=25.5$ (minutes) $(25.4606 \ldots)$
A1
[4 marks]
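
The two substitutions in part (e) can be checked numerically. This is a minimal sketch using the general solution $h=17640\left(c-R^2 t\right)^2$ and the measured valve radius; the variable names are illustrative.

```python
import math

R = 0.023   # radius of the valve in metres
h0 = 3.2    # initial height of water in metres

# Initial condition h(0) = 3.2 gives 3.2 = 17640 c^2; take the positive root
c = math.sqrt(h0 / 17640)

# The container is empty when c - R^2 t = 0
T = c / R ** 2

print(f"c = {c:.7f}, T = {T:.1f} minutes")
```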
f.
$$
\begin{aligned}
& h=0 \Rightarrow c=R^2 t \\
& c=0.023^2 \times 15(=0.007935) \\
& t=0 \Rightarrow h=17640\left(0.023^2 \times 15\right)^2 \\
& h=1.11 \text { (metres) }(1.11068 \ldots)
\end{aligned}
$$
(M1)
A1
[3 marks]
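
Part (f) runs the same solution in reverse: fix the draining time at 15 minutes, find $c$, then evaluate $h$ at $t=0$. A short sketch under the same assumptions as the mark scheme:

```python
R = 0.023           # radius of the valve in metres
drain_time = 15     # required draining time in minutes

# h = 0 at t = drain_time gives c = R^2 * drain_time
c = R ** 2 * drain_time

# Initial height is h(0) = 17640 c^2
initial_height = 17640 * c ** 2

print(f"c = {c:.6f}, initial height = {initial_height:.2f} metres")
```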

g.i. let $h$ be the height of water in the highest container; from parts (d) and (e) we get
$$
\begin{aligned}
& \frac{\mathrm{d} h}{\mathrm{~d} t}=-35280 R^2\left(0.0134687 \ldots-R^2 t\right) \quad \text { (M1)(A1) } \\
& \text { so } \frac{\mathrm{d} H}{\mathrm{~d} t}=35280 R^2\left(0.0135-R^2 t\right)-R^2 \sqrt{70560 H} \quad \text { M1A1 } \\
& \left(\frac{\mathrm{d} H}{\mathrm{~d} t}=18.6631 \ldots(0.0134687 \ldots-0.000529 t)-0.000529 \sqrt{70560 H}\right) \\
& \left(\frac{\mathrm{d} H}{\mathrm{~d} t}=0.251367 \ldots-0.00987279 \ldots t-0.140518 \ldots \sqrt{H}\right) \\
& \frac{\mathrm{d} H}{\mathrm{~d} t} \approx 0.2514-0.009873 t-0.1405 \sqrt{H} \quad \text { AG }
\end{aligned}
$$
M1A1
[4 marks]
g.ii. evidence of using Euler’s method correctly
e.g. $y_1=1.05545 \ldots$
(A1)
maximum value of $H=1.45$ (metres) (at 8.5 minutes)
A2
(1.44678 . . . metres)
[3 marks]
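
The Euler iteration in part (g)(ii) can be sketched as follows, using the approximate differential equation from part (g)(i), the initial height $H(0)=1$ and a step length of 0.5 minutes. Stopping after 50 steps is an assumption made here so that $t$ stays within $0 \leq t \leq T$.

```python
import math

def dH_dt(t, H):
    """Approximate rate of change of height in the lowest container, from part (g)(i)."""
    return 0.2514 - 0.009873 * t - 0.1405 * math.sqrt(H)

step = 0.5        # step length in minutes
H = 1.0           # initial height in the lowest container, in metres
heights = [H]     # heights[i] is the Euler estimate at t = i * step

# T is about 25.46 minutes, so 50 steps of 0.5 reach t = 25 and stay within 0 <= t <= T
for i in range(50):
    t = i * step
    H = H + step * dH_dt(t, H)
    heights.append(H)

# Locate the largest estimate and the time at which it occurs
max_index = max(range(len(heights)), key=heights.__getitem__)
max_H = heights[max_index]
t_at_max = max_index * step

print(f"first step: {heights[1]:.5f}")
print(f"maximum H = {max_H:.2f} metres at t = {t_at_max} minutes")
```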

 

Question

Juliet is a sociologist who wants to investigate if income affects happiness amongst doctors. This question asks you to review Juliet’s methods and conclusions.

Juliet obtained a list of email addresses of doctors who work in her city. She contacted them and asked them to fill in an anonymous questionnaire. Participants were asked to state their annual income and to respond to a set of questions. The responses were used to determine a happiness score out of 100. Of the 415 doctors on the list, 11 replied.

Juliet’s results are summarized in the following table.

For the remaining ten responses in the table, Juliet calculates the mean happiness score to be 52.5.

Juliet decides to carry out a hypothesis test on the correlation coefficient to investigate whether increased annual income is associated with greater happiness.

Juliet wants to create a model to predict how changing annual income might affect happiness scores. To do this, she assumes that annual income in dollars, $X$, is the independent variable and the happiness score, $Y$, is the dependent variable.

She first considers a linear model of the form
$$
Y=a X+b .
$$

Juliet then considers a quadratic model of the form
$$
Y=c X^2+d X+e
$$

After presenting the results of her investigation, a colleague questions whether Juliet’s sample is representative of all doctors in the city.

A report states that the mean annual income of doctors in the city is $\$ 80000$. Juliet decides to carry out a test to determine whether her sample could realistically be taken from a population with a mean of $\$ 80000$.

a.i. Describe one way in which Juliet could improve the reliability of her investigation.
a.ii. Describe one criticism that can be made about the validity of Juliet’s investigation.
b. Juliet classifies response $\mathrm{K}$ as an outlier and removes it from the data. Suggest one possible justification for her decision to remove it.
c.i. Calculate the mean annual income for these remaining responses.
c.ii. Determine the value of $r$, Pearson’s product-moment correlation coefficient, for these remaining responses.
d.i. State why the hypothesis test should be one-tailed.
d.ii. State the null and alternative hypotheses for this test.
d.iii. The critical value for this test, at the $5 \%$ significance level, is 0.549. Juliet assumes that the population is bivariate normal.
Determine whether there is significant evidence of a positive correlation between annual income and happiness. Justify your answer.
e.i. Use Juliet’s data to find the value of $a$ and of $b$.
e.ii. Interpret, referring to income and happiness, what the value of $a$ represents.
e.iii. Find the value of $c$, of $d$ and of $e$.
e.iv. Find the coefficient of determination for each of the two models she considers.
e.v. Hence compare the two models.
e.vi. Juliet decides to use the coefficient of determination to choose between these two models.
Comment on the validity of her decision.
f.i. State the name of the test which Juliet should use.
f.ii. State the null and alternative hypotheses for this test.
f.iii.Perform the test, using a $5 \%$ significance level, and state your conclusion in context.

▶️Answer/Explanation

a.i. Any one from:
R1
increase sample size / increase response rate / repeat process
check whether sample is representative
test-retest participants or do a parallel test
use a stratified sample
use a random sample

Note: Do not condone:
Ask different types of doctor
Ask for proof of income
Ask for proof of being a doctor
Remove anonymity
Remove response K.
[1 mark]
a.ii. Any one from:
R1
non-random sampling means a subset of population might be responding
self-reported happiness is not the same as happiness
happiness is not a constant / cannot be quantified / is difficult to measure
income might include external sources
Juliet is only sampling doctors in her city
correlation does not imply causation
sample might be biased
Note: Do not condone the following common but vague responses unless they make a clear link to validity:
Sample size is too small
Result is not generalizable
There may be other variables Juliet is ignoring
Sample might not be representative
[1 mark]

b. because the income is very different / implausible / clearly contrived
R1
Note: Answers must explicitly reference “income” to get credit.
[1 mark]
c.i. (\$) 90200
(M1)A1
[2 marks]
c.ii. $r=0.558(0.557723 \ldots)$
A2
[2 marks]
d.i. EITHER
only looking for change in one direction
R1
OR
only looking for greater happiness with greater income
R1
OR
only looking for evidence of positive correlation
R1
[1 mark]
d.ii. $\mathrm{H}_0: \rho=0 ; \mathrm{H}_1: \rho>0$
A1A1
Note: Award A1 for $\rho$ seen (do not accept $r$ ), A1 for both correct hypotheses, using their $\rho$ or $r$. Accept an equivalent statement in words, however reference to “correlation for the population” or “association for the population” must be explicit for the first $\boldsymbol{A 1}$ to be awarded.
Watch out for a null hypothesis in words similar to “Annual income is not associated with greater happiness”. This is effectively saying $\rho \leq 0$ and should not be condoned.
[2 marks]

d.iii. METHOD 1 – using critical value of $r$
$0.558>0.549$ R1
(therefore significant evidence of) a positive correlation
A1
Note: Do not award R0A1.

METHOD 2 – using $p$-value
$0.0469<0.05(0.0469463 \ldots<0.05)$
A1

Note: Follow through from their $r$-value from part (c)(ii).
(therefore significant evidence of) a positive correlation
A1
Note: Do not award A0A1.
[2 marks]
e.i. $a=0.000126(0.000125842 \ldots), \quad b=41.1(41.1490 \ldots)$
A1
[1 mark]
e.ii.EITHER
the amount the happiness score increases for every $\$ 1$ increase in (annual) income
A1
OR
rate of change of happiness with respect to (annual) income
A1
Note: Accept equivalent responses e.g. an increase of 1.26 in happiness for every $\$ 10000$ increase in salary.
[1 mark]

e.iii.
$$
\begin{aligned}
c & =-2.06 \times 10^{-9}\left(-2.06191 \ldots \times 10^{-9}\right) \\
d & =7.05 \times 10^{-4}\left(7.05272 \ldots \times 10^{-4}\right) \\
e & =12.6(12.5878 \ldots)
\end{aligned}
$$
A1
[1 mark]
e.iv. for quadratic model: $R^2=0.659(0.659145 \ldots)$
A1
for linear model: $R^2=0.311(0.311056 \ldots)$
A1
Note: Follow through from their $r$ value from part (c)(ii).
[2 marks]
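
For a linear regression of $Y$ on $X$, the coefficient of determination equals the square of Pearson's $r$, which is why the note allows follow-through from part (c)(ii). The sketch below checks this with the unrounded values quoted in the mark scheme; the quadratic value is copied from the mark scheme as given, because Juliet's data table is not reproduced here.

```python
# Pearson's r for the ten remaining responses, from part (c)(ii)
r = 0.557723

# For the linear model, R^2 is the square of r
r_squared_linear = r ** 2

# Value quoted in the mark scheme for the quadratic model (not recomputed here)
r_squared_quadratic = 0.659145

print(f"linear R^2 = {r_squared_linear:.3f}, quadratic R^2 = {r_squared_quadratic:.3f}")
```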
e.v. EITHER
quadratic model is a better fit to the data / more accurate
A1
OR
quadratic model explains a higher proportion of the variance
A1
[1 mark]
e.vi. EITHER
not valid, $R^2$ not a useful measure to compare models with different numbers of parameters
A1
OR
not valid, quadratic model will always have a better fit than a linear model
A1
Note: Accept any other sensible critique of the validity of the method. Do not accept any answers which focus on the conclusion rather than the method of model selection.
[1 mark]
f.i. (single sample) $t$-test
A1
[1 mark]
f.ii. EITHER

$$
\mathrm{H}_0: \mu=80000 ; \mathrm{H}_1: \mu \neq 80000
$$
A1
OR
$\mathrm{H}_0$ : (sample is drawn from a population where) the population mean is $\$ 80000$
$\mathrm{H}_1$ : the population mean is not $\$ 80000$
A1

Note: Do not allow FT from an incorrect test in part (f)(i) other than a $z$-test.
[1 mark]
f.iii. $p=0.610(0.610322 \ldots)$
A1
Note: For a $z$-test follow through from part (f)(i), either 0.578 (from biased estimate of variance) or 0.598 (from unbiased estimate of variance).
$0.610>0.05$
R1
EITHER
no (significant) evidence that mean differs from $\$ 80000$
A1
OR
the sample could plausibly have been drawn from the quoted population
A1
Note: Allow R1FTA1FT from an incorrect $p$-value, but the final A1 must still be in the context of the original research question.
[3 marks]

 
