IBDP MAI: Topic 4: Statistics and probability - AHL 4.12 - Definition of reliability and validity. Exam Style Questions Paper 3

Question

Two IB schools, A and B, follow the IB Diploma Programme but have different teaching methods. A research group tested whether the different teaching methods lead to a similar final result.

For the test, a group of eight students were randomly selected from each school. Both samples were given a standardized test at the start of the course and a prediction for total IB points was made based on that test; this was then compared to their points total at the end of the course.

Previous results indicate that both the predictions from the standardized tests and the final IB points can be modelled by a normal distribution.
It can be assumed that:
– the standardized test is a valid method for predicting the final IB points
– variations from the prediction can be explained through the circumstances of the student or school.

The data for school $\mathrm{A}$ is shown in the following table.

For each student, the change from the predicted points to the final points $(f-p)$ was calculated.

The data for school B is shown in the following table.

School A also gives each student a score for effort in each subject. This effort score is based on a scale of 1 to 5 where 5 is regarded as outstanding effort.

It is claimed that the effort put in by a student is an important factor in improving upon their predicted IB points.

A mathematics teacher in school A claims that the comparison between the two schools is not valid because the sample for school B contained mainly girls and that for school A, mainly boys. She believes that girls are likely to show a greater improvement from their predicted points to their final points.

She collects more data from other schools, asking them to class their results into four categories as shown in the following table.

a. Identify a test that might have been used to verify the null hypothesis that the predictions from the standardized test can be modelled by a normal distribution.
b. State why comparing only the final IB points of the students from the two schools would not be a valid test for the effectiveness of the two different teaching methods.
c.i. Find the mean change.
c.ii. Find the standard deviation of the changes.
d. Use a paired $t$-test to determine whether there is significant evidence that the students in school A have improved their IB points since the start of the course.
e.i. Use an appropriate test to determine whether there is evidence, at the $5 \%$ significance level, that the students in school B have improved more than those in school A.
e.ii. State why it was important to test that both sets of points were normally distributed.
f.i. Perform a test on the data from school A to show it is reasonable to assume a linear relationship between effort scores and improvements in IB points. You may assume effort scores follow a normal distribution.
f.ii. Hence, find the expected improvement between predicted and final points for an increase of one unit in effort grades, giving your answer to one decimal place.
g. Use an appropriate test to determine whether showing an improvement is independent of gender.
h. If you were to repeat the test performed in part (e) intending to compare the quality of the teaching between the two schools, suggest two ways in which you might choose your sample to improve the validity of the test.

▶️Answer/Explanation

a. $\chi^2$ (goodness of fit)
A1
[1 mark]
b. EITHER
because aim is to measure improvement
OR
because the students may be of different ability in the two schools $\boldsymbol{R 1}$
[1 mark]
c.i. 0.1875 (accept $0.188$, $0.19$) $\quad$ A1
[1 mark]
c.ii. 2.46
(M1)A1
Note: Award (M1)A0 for 2.63.
[2 marks]
d. $\mathrm{H}_0$ : there has been no improvement
$\mathrm{H}_1$ : there has been an improvement
A1
attempt at a one-tailed paired $t$-test
(M1)
$p$-value $=0.423$
A1
there is no significant evidence that the students have improved
R1
Note: If the hypotheses are not stated, award a maximum of A0M1A1R0.
[4 marks]
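The $p$-value in part (d) can be checked from the summary statistics alone. The following is a minimal sketch, assuming the 2.46 in part (c)(ii) is the population-style standard deviation and that the 2.63 in the note is the $n-1$ estimate a $t$-test uses. The helper names `t_pdf` and `t_upper_tail` are illustrative, and the tail probability is found by simple numerical integration rather than a calculator's built-in test.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n = 8                 # students sampled from school A
mean_change = 0.1875  # part (c)(i)
sd_population = 2.46  # part (c)(ii)

# Convert to the unbiased (n - 1) estimate used by the t-test.
sd_sample = sd_population * math.sqrt(n / (n - 1))

# One-tailed paired t-test: H0 "no improvement" against H1 "improvement".
t_stat = mean_change / (sd_sample / math.sqrt(n))
p_value = t_upper_tail(t_stat, n - 1)

print(round(sd_sample, 2), round(p_value, 3))
```

Under these assumptions the result should agree with the quoted $p$-value of 0.423 to within rounding.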
e.i. $\mathrm{H}_0$ : there is no difference between the schools
$\mathrm{H}_1$ : school B did better than school $\mathrm{A}$
A1
one-tailed 2 sample $t$-test
(M1)
$p$-value $=0.0984$
A1
$0.0984>0.05$ (not significant at the $5 \%$ level) so do not reject the null hypothesis
R1A1

Note: The final $\boldsymbol{A 1}$ cannot be awarded following an incorrect reason. The final R1A1 can follow through from their incorrect $p$-value. Award a maximum of A1(M1)A0R1A1 for $p$-value $=0.0993$.
[5 marks]
e.ii. sample too small for the central limit theorem to apply (and $t$-tests assume normal distribution)
R1

[1 mark]
f.i. $\mathrm{H}_0: \rho=0$
$$
\mathrm{H}_1: \rho>0
$$
A1
Note: Allow hypotheses to be expressed in words.
$$
p \text {-value }=0.00157
$$
A1
$(0.00157<0.01)$ there is significant evidence of a (linear) correlation between effort and improvement (so it is reasonable to assume a linear relationship) $\quad \boldsymbol{R 1}$
[3 marks]
f.ii. (gradient of line of regression =) 6.6
A1
[1 mark]
g. $\mathrm{H}_0$ : improvement and gender are independent
$\mathrm{H}_1$ : improvement and gender are not independent
A1
choice of $\chi^2$ test for independence
(M1)
groups first two columns as expected values in first column less than $5 \quad$ M1
new observed table

$$
p \text {-value }=0.581
$$
no significant evidence that gender and improvement are dependent
R1
[6 marks]
h. For example:
larger samples / include data from whole school
take equal numbers of boys and girls in each sample
have a similar range of abilities in each sample
(if possible) have similar ranges of effort
R1R1
Note: Award $\boldsymbol{R 1}$ for each reasonable suggestion to improve the validity of the test.
[2 marks]

Question

Juliet is a sociologist who wants to investigate if income affects happiness amongst doctors. This question asks you to review Juliet’s methods and conclusions.

Juliet obtained a list of email addresses of doctors who work in her city. She contacted them and asked them to fill in an anonymous questionnaire. Participants were asked to state their annual income and to respond to a set of questions. The responses were used to determine a happiness score out of 100. Of the 415 doctors on the list, 11 replied.

Juliet’s results are summarized in the following table.

For the remaining ten responses in the table, Juliet calculates the mean happiness score to be 52.5.

Juliet decides to carry out a hypothesis test on the correlation coefficient to investigate whether increased annual income is associated with greater happiness.

Juliet wants to create a model to predict how changing annual income might affect happiness scores. To do this, she assumes that annual income in dollars, $X$, is the independent variable and the happiness score, $Y$, is the dependent variable.

She first considers a linear model of the form
$$
Y=a X+b .
$$

Juliet then considers a quadratic model of the form
$$
Y=c X^2+d X+e
$$

After presenting the results of her investigation, a colleague questions whether Juliet’s sample is representative of all doctors in the city.

A report states that the mean annual income of doctors in the city is $\$ 80000$. Juliet decides to carry out a test to determine whether her sample could realistically be taken from a population with a mean of $\$ 80000$.

a.i. Describe one way in which Juliet could improve the reliability of her investigation.
a.ii. Describe one criticism that can be made about the validity of Juliet’s investigation.
b. Juliet classifies response $\mathrm{K}$ as an outlier and removes it from the data. Suggest one possible justification for her decision to remove it.
c.i. Calculate the mean annual income for these remaining responses.
c.ii. Determine the value of $r$, Pearson’s product-moment correlation coefficient, for these remaining responses.
d.i. State why the hypothesis test should be one-tailed.
d.ii. State the null and alternative hypotheses for this test.
d.iii. The critical value for this test, at the $5 \%$ significance level, is 0.549. Juliet assumes that the population is bivariate normal.
Determine whether there is significant evidence of a positive correlation between annual income and happiness. Justify your answer.
e.i. Use Juliet’s data to find the value of $a$ and of $b$.
e.ii.Interpret, referring to income and happiness, what the value of $a$ represents.
e.iii. Find the value of $c$, of $d$ and of $e$.
e.iv. Find the coefficient of determination for each of the two models she considers.
e.v. Hence compare the two models.
e.vi. Juliet decides to use the coefficient of determination to choose between these two models.
Comment on the validity of her decision.
f.i. State the name of the test which Juliet should use.
f.ii. State the null and alternative hypotheses for this test.
f.iii.Perform the test, using a $5 \%$ significance level, and state your conclusion in context.

▶️Answer/Explanation

a.i. Any one from:
R1
increase sample size / increase response rate / repeat process
check whether sample is representative
test-retest participants or do a parallel test
use a stratified sample
use a random sample

Note: Do not condone:
Ask different types of doctor
Ask for proof of income
Ask for proof of being a doctor
Remove anonymity
Remove response K.
[1 mark]
a.ii. Any one from:
R1
non-random sampling means a subset of population might be responding
self-reported happiness is not the same as happiness
happiness is not a constant / cannot be quantified / is difficult to measure
income might include external sources
Juliet is only sampling doctors in her city
correlation does not imply causation
sample might be biased
Note: Do not condone the following common but vague responses unless they make a clear link to validity:
Sample size is too small
Result is not generalizable
There may be other variables Juliet is ignoring
Sample might not be representative
[1 mark]

b. because the income is very different / implausible / clearly contrived
R1
Note: Answers must explicitly reference “income” to get credit.
[1 mark]
c.i. (\$) 90200
(M1)A1
[2 marks]
c.ii. $r=0.558(0.557723 \ldots)$
A2
[2 marks]
d.i. EITHER
only looking for change in one direction
R1
OR
only looking for greater happiness with greater income
R1
OR
only looking for evidence of positive correlation
R1
[1 mark]
d.ii. $\mathrm{H}_0: \rho=0 ; \mathrm{H}_1: \rho>0$
A1A1
Note: Award A1 for $\rho$ seen (do not accept $r$ ), A1 for both correct hypotheses, using their $\rho$ or $r$. Accept an equivalent statement in words, however reference to “correlation for the population” or “association for the population” must be explicit for the first $\boldsymbol{A 1}$ to be awarded.
Watch out for a null hypothesis in words similar to “Annual income is not associated with greater happiness”. This is effectively saying $\rho \leq 0$ and should not be condoned.
[2 marks]

d.iii. METHOD 1 – using critical value of $r$
$0.558>0.549$ R1
(therefore significant evidence of) a positive correlation
A1
Note: Do not award R0A1.

METHOD 2 – using $p$-value
$0.0469<0.05(0.0469463 \ldots<0.05)$
A1

Note: Follow through from their $r$-value from part (c)(ii).
(therefore significant evidence of) a positive correlation
A1
Note: Do not award A0A1.
[2 marks]
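Both methods in part (d)(iii) can be reproduced from $r$ and the sample size. The sketch below assumes the standard result that, under $\mathrm{H}_0: \rho=0$ with a bivariate normal population, $t=r \sqrt{(n-2) /\left(1-r^2\right)}$ follows a $t$-distribution with $n-2$ degrees of freedom. The helper names are illustrative and the tail area is found by numerical integration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

r = 0.557723            # unrounded value from part (c)(ii)
n = 10                  # responses remaining after removing K
critical_value = 0.549  # given in the question

# Method 1: compare r directly with the critical value.
method_1_significant = r > critical_value

# Method 2: convert r to a t-statistic and find the one-tailed p-value.
t_stat = r * math.sqrt((n - 2) / (1 - r * r))
p_value = t_upper_tail(t_stat, n - 2)
method_2_significant = p_value < 0.05

print(method_1_significant, method_2_significant, round(p_value, 4))
```

The two methods should always agree, because the critical value of $r$ is the value whose one-tailed $p$-value is exactly 0.05.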
e.i. $a=0.000126(0.000125842 \ldots), \quad b=41.1(41.1490 \ldots)$
A1
[1 mark]
e.ii. EITHER
the amount the happiness score increases for every $\$ 1$ increase in (annual) income
A1
OR
rate of change of happiness with respect to (annual) income
A1
Note: Accept equivalent responses e.g. an increase of 1.26 in happiness for every $\$ 10000$ increase in salary.
[1 mark]
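The equivalent response quoted in the note comes from scaling the unrounded gradient from part (e)(i) by $\$ 10000$:
$$
0.000125842 \ldots \times 10000 \approx 1.26
$$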

e.iii. $c=-2.06 \times 10^{-9}\left(-2.06191 \ldots \times 10^{-9}\right), \quad d=7.05 \times 10^{-4}\left(7.05272 \ldots \times 10^{-4}\right), \quad e=12.6(12.5878 \ldots)$
A1
[1 mark]
e.iv. for quadratic model: $R^2=0.659(0.659145 \ldots)$
A1
for linear model: $R^2=0.311(0.311056 \ldots)$
A1
Note: Follow through from their $r$ value from part (c)(ii).
[2 marks]
e.v. EITHER
quadratic model is a better fit to the data / more accurate
A1
OR
quadratic model explains a higher proportion of the variance
A1
[1 mark]
e.vi. EITHER
not valid, $R^2$ not a useful measure to compare models with different numbers of parameters
A1
OR
not valid, quadratic model will always have a better fit than a linear model
A1
Note: Accept any other sensible critique of the validity of the method. Do not accept any answers which focus on the conclusion rather than the method of model selection.
[1 mark]
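To illustrate why raw $R^2$ favours the model with more parameters, the sketch below computes the adjusted $R^2$, one common way of penalizing extra terms. This statistic is not part of the mark scheme and is shown only as an assumption-labelled illustration. The function name `adjusted_r2` is hypothetical, and $n=10$ is the number of responses left after removing K.

```python
def adjusted_r2(r2, n, predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - predictors - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - predictors - 1)

n = 10  # responses remaining after removing K

# R^2 values from part (e)(iv).
r2_linear = 0.311056     # Y = aX + b, one predictor term (X)
r2_quadratic = 0.659145  # Y = cX^2 + dX + e, two predictor terms (X, X^2)

adj_linear = adjusted_r2(r2_linear, n, predictors=1)
adj_quadratic = adjusted_r2(r2_quadratic, n, predictors=2)

print(round(adj_linear, 3), round(adj_quadratic, 3))
```

Both values drop below the raw $R^2$, and the penalty is larger for the quadratic model. This is the effect the critique in part (e)(vi) points to.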
f.i. (single sample) $t$-test
A1
[1 mark]
f.ii. EITHER

$$
\mathrm{H}_0: \mu=80000 ; \mathrm{H}_1: \mu \neq 80000
$$
A1
OR
$\mathrm{H}_0$ : (sample is drawn from a population where) the population mean is $\$ 80000$
$\mathrm{H}_1$ : the population mean is not $\$ 80000$
A1

Note: Do not allow FT from an incorrect test in part (f)(i) other than a $z$-test.
[1 mark]
f.iii. $p=0.610(0.610322 \ldots)$
A1
Note: For a $z$-test follow through from part (f)(i), either 0.578 (from biased estimate of variance) or 0.598 (from unbiased estimate of variance).
$0.610>0.05$
R1
EITHER
no (significant) evidence that mean differs from $\$ 80000$
A1
OR
the sample could plausibly have been drawn from the quoted population
A1
Note: Allow R1FTA1FT from an incorrect $p$-value, but the final A1 must still be in the context of the original research question.
[3 marks]

 

Question

A firm wishes to review its recruitment processes. This question considers the validity and reliability of the methods used.

Every year an accountancy firm recruits new employees for a trial period of one year from a large group of applicants.
At the start, all applicants are interviewed and given a rating. Those with a rating of either Excellent, Very good or Good are recruited for the trial period. At the end of this period, some of the new employees will stay with the firm.

It is decided to test how valid the interview rating is as a way of predicting which of the new employees will stay with the firm.
Data is collected and recorded in a contingency table.

The next year’s group of applicants are asked to complete a written assessment which is then analysed. From those recruited as new employees, a random sample of size 18 is selected.

The sample is stratified by department. Of the 91 new employees recruited that year, 55 were placed in the national department and 36 in the international department.

At the end of their first year, the level of performance of each of the 18 employees in the sample is assessed by their department manager. They are awarded a score between 1 (low performance) and 10 (high performance).

The marks in the written assessment and the scores given by the managers are shown in both the table and the scatter diagram.

The firm decides to find a Spearman’s rank correlation coefficient, $r_s$, for this data.

The same seven employees are given the written assessment a second time, at the end of the first year, to measure its reliability. Their marks are shown in the table below.

The written assessment is in five sections, numbered 1 to 5 . At the end of the year, the employees are also given a score for each of five professional attributes: V, W, X, Y and Z.

The firm decides to test the hypothesis that there is a correlation between the mark in a section and the score for an attribute.
They compare marks in each of the sections with scores for each of the attributes.
a. Use an appropriate test, at the $5 \%$ significance level, to determine whether a new employee staying with the firm is independent of their interview rating. State the null and alternative hypotheses, the $p$-value and the conclusion of the test.
b. Show that 11 employees are selected for the sample from the national department.
c.i. Without calculation, explain why it might not be appropriate to calculate a correlation coefficient for the whole sample of 18 employees.
c.ii. Find $r_s$ for the seven employees working in the international department.
c.iii. Hence comment on the validity of the written assessment as a measure of the level of performance of employees in this department. Justify your answer.
d.i. State the name of this type of test for reliability.
d.ii. For the data in this table, test the null hypothesis, $\mathrm{H}_0: \rho=0$, against the alternative hypothesis, $\mathrm{H}_1: \rho>0$, at the $5 \%$ significance level. You may assume that all the requirements for carrying out the test have been met.
d.iii. Hence comment on the reliability of the written assessment.
e.i. Write down the number of tests they carry out.
e.ii. The tests are performed at the $5 \%$ significance level.
Assuming that:
– there is no correlation between the marks in any of the sections and scores in any of the attributes,
– the outcome of each hypothesis test is independent of the outcome of the other hypothesis tests,
find the probability that at least one of the tests will be significant.
e.iii. The firm obtains a significant result when comparing section 2 of the written assessment and attribute X. Interpret this result.

▶️Answer/Explanation

a. Use of $\chi^2$ test for independence
(M1)
$\mathrm{H}_0$ : Staying (or leaving) the firm and interview rating are independent.
$\mathrm{H}_1$ : Staying (or leaving) the firm and interview rating are not independent
A1
Note: For $\mathrm{H}_1$ accept ‘…are dependent’ in place of ‘…not independent’.
$p$-value $=0.487(0.487221 \ldots)$
A2
Note: Award $\mathbf{A 1}$ for $\chi^2=1.438 \ldots$ if $p$-value is omitted or incorrect.
$0.487>0.05$
R1
(the result is not significant at the $5 \%$ level)
insufficient evidence to reject $\mathrm{H}_0$ (or “accept $\mathrm{H}_0$”)
A1
Note: Do not award R0A1. The final R1A1 can follow through from their incorrect $p$-value.
[6 marks]
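The quoted $p$-value can be checked against the quoted $\chi^2$ statistic. The sketch below assumes the contingency table has three rating rows (Excellent, Very good, Good) and two outcome columns (stay, leave), giving $(3-1)(2-1)=2$ degrees of freedom. The table itself is not reproduced here, so that shape is an assumption. With 2 degrees of freedom, the upper-tail probability of the $\chi^2$ distribution is exactly $e^{-x / 2}$.

```python
import math

def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

rows, cols = 3, 2                   # ASSUMED table shape: 3 ratings x 2 outcomes
dof = (rows - 1) * (cols - 1)       # degrees of freedom for an independence test

chi2_stat = 1.438                   # statistic quoted in the mark scheme note
p_value = chi2_upper_tail_df2(chi2_stat)

print(dof, round(p_value, 3))
```

This shortcut only applies when there are 2 degrees of freedom. A different table shape would need the general $\chi^2$ distribution.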
b. $\frac{55}{91} \times 18=10.9(10.8791 \ldots)$
M1A1
Note: Award A1 for anything that rounds to 10.9.
$\approx 11$
AG
[2 marks]
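The proportional allocation in part (b) can be sketched as follows. The function name `proportional_allocation` is illustrative. Simple rounding happens to give a total of 18 here, which is not guaranteed for every set of strata.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Share a sample across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    return {name: sample_size * size / total for name, size in strata_sizes.items()}

# Department sizes and sample size from the question.
strata = {"national": 55, "international": 36}
exact = proportional_allocation(strata, 18)

# Round each share to a whole number of employees.
rounded = {name: round(value) for name, value in exact.items()}

print(exact)
print(rounded)
```

The international department's share rounds to 7, which matches the seven employees referred to in parts (c)(ii) and (d).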
c.i. there seems to be a difference between the two departments
(A1)
the international department manager seems to be less generous than the national department manager
R1
Note: The $\boldsymbol{A 1}$ is for commenting that there is a difference between the two departments and the $\boldsymbol{R 1}$ is for correctly commenting on the direction of the difference.
[2 marks]

c.ii.

Note: Award (M1) for an attempt to rank the data, and (A1) for correct ranks for both variables. Accept either set of rankings in reverse.
$$
r_s=0.909(0.909241 \ldots)
$$
(M1)(A1)

Note: The (M1) is for calculating the PMCC for their ranks.
Note: If a final answer of 0.9107 is seen, from use of $1-\frac{6 \Sigma d^2}{n\left(n^2-1\right)}$, award (M1)(A1)A1.
Accept -0.909 if one set of ranks has been ordered in reverse.
[4 marks]
c.iii. EITHER
there is a (strong) association between the written assessment mark and the manager scores.
A1
OR
there is a (strong) agreement in the rank order of the written assessment marks and the rank order of the manager scores.
A1
OR
there is a (strong linear) correlation between the rank order of the written assessment marks and the rank order of the manager scores.
A1
Note: Follow through from their value of $r_s$ in part (c)(ii).

THEN
the written assessment is likely to be a valid measure (of the level of employee performance)
R1
[2 marks]
d.i. test-retest
A1
[1 mark]
d.ii. $p$-value $=0.00209(0.0020939 \ldots)$
A2
$0.00209<0.05$
R1
(the result is significant at the $5 \%$ level)
(there is sufficient evidence to) reject $\mathrm{H}_0$
A1

Note: Do not award R0A1. Accept “accept $\mathrm{H}_1$”. The final R1A1 can follow through from their incorrect $p$-value.

d.iii. the test seems reliable
A1
Note: Follow through from their answer in part (d)(ii). Do not award if there is no conclusion in d(ii).
[1 mark]
e.i. 25
A1
[1 mark]
e.ii.probability of significant result given no correlation is 0.05
(M1)
probability of at least one significant result in 25 tests is
$$
1-0.95^{25}
$$
(M1)(A1)

Note: Award (M1) for use of $1-\mathrm{P}(0)$ or the binomial distribution with any value of $p$.
$$
=0.723(0.722610 \ldots)
$$
A1
[4 marks]
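The calculations in parts (e)(i) and (e)(ii) can be sketched as follows, under the same two assumptions stated in the question (no true correlations, and independent test outcomes). The variable names are illustrative.

```python
sections = 5      # sections of the written assessment
attributes = 5    # professional attributes V, W, X, Y, Z
alpha = 0.05      # significance level of each test

# Every section is compared with every attribute.
tests = sections * attributes

# P(at least one significant result) = 1 - P(no significant results).
p_at_least_one = 1 - (1 - alpha) ** tests

# Expected number of significant results arising purely by chance.
expected_false_positives = tests * alpha

print(tests, round(p_at_least_one, 3), expected_false_positives)
```

With more than one chance-only significant result expected on average, a single significant pairing carries little weight on its own.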
e.iii. (though the result is significant) it is very likely that one significant result would be achieved by chance, so it should be disregarded or further evidence sought
R1
[1 mark]

 
