
IBDP MAI : Topic 4 Statistics and probability - AHL 4.13 Non-linear regression AI HL Paper 3

Question : Modelling the Spread of a Computer Virus [18 marks]

This question is about modelling the spread of a computer virus to predict the number of infected computers in a city.
A systems analyst defines variables and collects data to model the spread of a computer virus over time, exploring linear and non-linear regression models, differential equations, and logistic growth to analyze infection rates.

Question a [4 marks] – Linear Regression Analysis

The analyst collects the following data on the total number of infected computers, Q(t), over time t (days):

Infection Data Table

(i) Find the equation of the regression line of Q(t) on t:

(ii) Write down the value of r, Pearson’s product-moment correlation coefficient:

(iii) Explain why it would not be appropriate to conduct a hypothesis test on the value of r:

Solution

(i) Q(t) = 3090t – 54000 (3094.27…t – 54042.3…)

Detailed Solution:

  • Objective: Find the linear regression line \( Q(t) = mt + c \) using the least squares method.
  • Data: The table is not reproduced here. The points \( (20, 8000), (25, 23000), (30, 38500), (35, 54000) \) are illustrative only; on their own they give \( \bar{t} = 27.5 \), \( \bar{Q} = 30,875 \) and a slope of 3070, so they do not reproduce the quoted coefficients.
  • Slope Calculation: \( m = \frac{\sum (t_i - \bar{t})(Q_i - \bar{Q})}{\sum (t_i - \bar{t})^2} \); applied to the full table (in practice with a GDC) this yields \( m \approx 3094.27 \).
  • Intercept Calculation: \( c = \bar{Q} - m \bar{t} \approx -54042.3 \).
  • Result: Rounded to three significant figures, \( Q(t) = 3090t - 54000 \).
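
The slope and intercept formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator procedure the exam expects: the function name `least_squares_line` is hypothetical, and the four points are the illustrative ones from the solution text rather than the actual exam table, so the output differs from the quoted 3094.27 and -54042.3.

```python
def least_squares_line(ts, qs):
    """Return (slope, intercept) of the least-squares line of q on t."""
    n = len(ts)
    t_bar = sum(ts) / n
    q_bar = sum(qs) / n
    # Sum of cross-deviations and sum of squared t-deviations
    s_tq = sum((t - t_bar) * (q - q_bar) for t, q in zip(ts, qs))
    s_tt = sum((t - t_bar) ** 2 for t in ts)
    slope = s_tq / s_tt
    intercept = q_bar - slope * t_bar
    return slope, intercept


# Illustrative points only (assumption: NOT the real exam table).
ts = [20, 25, 30, 35]
qs = [8000, 23000, 38500, 54000]

m, c = least_squares_line(ts, qs)
print(f"Q(t) = {m:.2f} t + ({c:.2f})")
```

Replacing `ts` and `qs` with the full table should reproduce the coefficients quoted in the answer.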

(ii) r = 0.755 (0.754741…)

Detailed Solution:

  • Formula: Pearson’s coefficient \( r = \frac{\sum (t_i - \bar{t})(Q_i - \bar{Q})}{\sqrt{\sum (t_i - \bar{t})^2 \sum (Q_i - \bar{Q})^2}} \).
  • Computation: Entering the full table into a GDC gives \( r \approx 0.754741 \). The illustrative points in part (i) are almost collinear and would give \( r \) close to 1, so they cannot be used to check this value.
  • Result: Rounded to \( r = 0.755 \), indicating moderate linear correlation.
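
For readers who want to see the formula evaluated, here is a minimal Python sketch. The function name `pearson_r` is hypothetical, and the data are the illustrative points from part (i), not the exam table; because those points are nearly collinear, the result is close to 1 rather than 0.755.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative points only (assumption: NOT the real exam table).
r = pearson_r([20, 25, 30, 35], [8000, 23000, 38500, 54000])
print(f"r = {r:.6f}")
```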

(iii) t is not a random variable OR data appears nonlinear OR r only measures linear correlation.

Detailed Solution:

  • Requirement: Hypothesis testing for \( r \) requires both variables to be random and normally distributed.
  • Issue 1: \( t \) (time) is a controlled variable, not random.
  • Issue 2: Data may exhibit nonlinearity (e.g., exponential growth).
  • Issue 3: \( r \) only measures the strength of a linear relationship, so a test on \( r \) says little about data that follow a curve.

Question b [5 marks] – Differential Equation Model

A model suggests Q'(t) = βNQ(t), where N is the total number of computers and β is a constant. Using the data:

(i) Find the general solution of the differential equation Q'(t) = βNQ(t):

(ii) Write down the equation for an appropriate non-linear regression model:

(iii) Write down the value of R² for this model:

(iv) Comment on the suitability of this model compared to the linear model:

(v) Write down one criticism of the model for large values of t:

Solution

(i) Q(t) = Ae^(βNt), where A is a constant.

Detailed Solution:

  • Equation: Solve \( \frac{dQ}{dt} = \beta N Q \).
  • Separation: Rewrite as \( \frac{dQ}{Q} = \beta N dt \).
  • Integration: \( \int \frac{dQ}{Q} = \int \beta N dt \), giving \( \ln|Q| = \beta N t + C \).
  • Solution: Exponentiate: \( Q = e^{\beta N t + C} = A e^{\beta N t} \), where \( A = e^C \).
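
As a quick check, differentiating the general solution recovers the original equation:

\[ \frac{d}{dt}\left( A e^{\beta N t} \right) = \beta N \, A e^{\beta N t} = \beta N \, Q(t). \]

Since the derivation sets \( A = e^C \), it produces \( A > 0 \), which is the case relevant to a count of infected computers.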

(ii) Q(t) = 0.00447e^(0.200t) (example fit).

Detailed Solution:

  • Model: Fit \( Q(t) = A e^{kt} \) to the data.
  • Relation: From (i), \( k = \beta N \).
  • Regression: Using assumed data, \( k \approx 0.200 \), \( A \approx 0.00447 \) fits initial conditions (e.g., small \( Q(0) \)).
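
One common way to fit \( Q(t) = A e^{kt} \) is to take logarithms, so that \( \ln Q = \ln A + kt \) is a straight line in \( t \), and then apply ordinary linear regression. The sketch below assumes that approach; a calculator's built-in exponential regression may use a different method. The function name `fit_exponential`, the time points, and the data are all hypothetical: the data are generated from the example fit itself, purely to show that the procedure recovers \( A \) and \( k \).

```python
import math


def fit_exponential(ts, qs):
    """Fit Q = A * exp(k * t) by least squares on ln(Q) against t."""
    logs = [math.log(q) for q in qs]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(logs) / n
    k = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, logs))
         / sum((t - t_bar) ** 2 for t in ts))
    A = math.exp(y_bar - k * t_bar)  # intercept of the log-line is ln(A)
    return A, k


# Synthetic data built from the example model (assumption: not exam data).
ts = [10, 15, 20, 25, 30, 35, 40]
qs = [0.00447 * math.exp(0.200 * t) for t in ts]

A, k = fit_exponential(ts, qs)
print(f"Q(t) = {A:.5f} * exp({k:.3f} t)")
```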

(iii) R² ≈ 0.99 (example value).

Detailed Solution:

  • Formula: \( R^2 = 1 - \frac{\sum (Q_i - \hat{Q}_i)^2}{\sum (Q_i - \bar{Q})^2} \).
  • Fit: For the exponential model, residuals are minimal.
  • Result: \( R^2 \approx 0.99 \), indicating a strong fit.
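
The \( R^2 \) formula above translates directly into code. In this minimal sketch the function name `r_squared` is hypothetical and the observed and predicted values are made up purely to exercise the formula; they are not taken from the exam table.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up values (assumption) where the model tracks the data closely.
observed = [2.0, 4.1, 7.9, 16.2]
predicted = [2.0, 4.0, 8.0, 16.0]

value = r_squared(observed, predicted)
print(f"R^2 = {value:.4f}")
```

A value near 1 means the residuals are small compared with the overall spread of the data.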

(iv) The higher R² indicates that the exponential model fits the data better than the linear model.

Detailed Solution:

  • Linear Fit: \( r^2 \approx (0.755)^2 = 0.57 \).
  • Exponential Fit: \( R^2 \approx 0.99 > 0.57 \).
  • Conclusion: Exponential model better captures the data’s growth.

(v) Predicts unlimited growth, unrealistic as Q(t) should plateau.

Detailed Solution:

  • Behavior: As \( t \to \infty \), \( Q(t) \to \infty \).
  • Limitation: Infections should be capped by \( N \), making the model unrealistic for large \( t \).
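
To make the criticism concrete, one can ask when the example exponential model would predict more infected computers than exist. The sketch below assumes the example fit \( A = 0.00447 \), \( k = 0.200 \) from (b)(ii) and takes \( N = 2.6 \) million, the City X figure from part (d), as the cap. Solving \( A e^{kt} = N \) gives \( t = \ln(N/A)/k \).

```python
import math

A = 0.00447        # example fit from (b)(ii)
k = 0.200          # example fit from (b)(ii)
N = 2_600_000      # total computers (assumption: City X figure from part (d))

# Time at which the exponential model reaches the total number of computers
t_exceed = math.log(N / A) / k
print(f"Model exceeds N after about {t_exceed:.1f} days")
```

With these example values the model passes \( N \) after roughly 101 days, beyond which its predictions are impossible.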

Question c [2 marks] – Doubling Time

Using the model from (b)(ii), estimate the time taken for the number of infected computers to double:

Solution

t = ln(2)/0.200 ≈ 3.47 days.

Detailed Solution:

  • Model: From \( Q(t) = 0.00447 e^{0.200t} \).
  • Doubling: \( 2Q_0 = Q_0 e^{0.200t} \).
  • Simplify: \( 2 = e^{0.200t} \).
  • Solve: \( \ln 2 = 0.200t \), so \( t = \frac{\ln 2}{0.200} \approx \frac{0.693}{0.200} \approx 3.47 \) days.
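
The same calculation as a short Python sketch, using the example growth rate \( k = 0.200 \) from (b)(ii). The function name `doubling_time` is hypothetical. The initial value \( Q_0 \) cancels, so only \( k \) is needed.

```python
import math


def doubling_time(k):
    """Time for A * exp(k * t) to double; independent of A."""
    return math.log(2) / k


t_double = doubling_time(0.200)
print(f"Doubling time = {t_double:.2f} days")
```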

Question d [2 marks] – Virus Spread Comparison

City X has 2.6 million computers. City Y has β = 9.64 × 10⁻⁸. Determine in which city the virus spreads more easily:

Solution

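City X: β ≈ 7.69 × 10⁻⁸ (from βN = 0.200). City Y has the larger β (9.64 × 10⁻⁸), so with this example fit the virus spreads more easily in City Y.

Detailed Solution:

  • City X: \( \beta N = 0.200 \), \( N = 2,600,000 \), so \( \beta = \frac{0.200}{2,600,000} \approx 7.69 \times 10^{-8} \).
  • City Y: \( \beta = 9.64 \times 10^{-8} \) is given directly; the number of computers in City Y is not stated.
  • Comparison: \( \beta \) describes how easily the virus passes between computers, independent of the size of the city. Since \( 9.64 \times 10^{-8} > 7.69 \times 10^{-8} \), the virus spreads more easily in City Y. Equivalently, at City X's size, City Y's \( \beta \) would give \( \beta N \approx 0.251 > 0.200 \).
  • Note: This conclusion depends on the example value \( k = 0.200 \) from (b)(ii); a different fitted \( k \) changes City X's \( \beta \) and may reverse the comparison.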
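
Under the example fit, the comparison comes down to the two \( \beta \) values. A minimal sketch, assuming \( \beta N = 0.200 \) for City X as in (b)(ii); the variable names are hypothetical:

```python
k_x = 0.200          # example value of beta * N for City X, from (b)(ii)
n_x = 2_600_000      # computers in City X
beta_x = k_x / n_x   # implied beta for City X
beta_y = 9.64e-8     # given for City Y

easier = "City Y" if beta_y > beta_x else "City X"
print(f"beta_X = {beta_x:.3e}, beta_Y = {beta_y:.3e} -> spreads more easily in {easier}")
```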

Question e [2 marks] – Rate of Change Estimation

Using Q'(t) ≈ [Q(t+5) – Q(t-5)]/10, determine the values of a and b from the table:

Rate Table

a =

b =

Solution

a = 38.3, b = 12012.7 (example values based on table).

Detailed Solution:

  • Method: Use \( Q'(t) \approx \frac{Q(t+5) - Q(t-5)}{10} \).
  • Data: Assume table values, e.g., \( Q(20) = 8000, Q(30) = 8383 \) for \( t = 25 \). These are placeholders and do not match the illustrative points used in part (a).
  • a Calculation: \( a = \frac{8383 - 8000}{10} = 38.3 \).
  • b Calculation: For \( t = 40 \), \( Q(35) = 54000, Q(45) = 174127 \), so \( b = \frac{174127 - 54000}{10} = 12012.7 \).
  • Note: Adjust based on actual table data.
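
The central-difference estimate can be written as a small helper. This is a sketch using the placeholder values from the solution text, not the actual exam table; the function name `central_difference` is hypothetical.

```python
def central_difference(q_before, q_after, h=5):
    """Estimate Q'(t) from Q(t - h) and Q(t + h)."""
    return (q_after - q_before) / (2 * h)


# Placeholder values from the worked solution (assumption: not exam data).
a = central_difference(8000, 8383)        # Q(20), Q(30) -> estimate at t = 25
b = central_difference(54000, 174127)     # Q(35), Q(45) -> estimate at t = 40
print(f"a = {a:.1f}, b = {b:.1f}")
```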

Question f [3 marks] – Logistic Model

For the logistic model Q'(t) = kQ(t)(1 – Q(t)/L):

(i) Estimate k and L using linear regression on Q'(t)/Q(t) vs Q(t):

k = L =

(ii) Estimate the percentage of computers infected over a long period:

Solution

(i) k ≈ 0.2, L ≈ 2,600,000 (example values).

Detailed Solution:

  • Transformation: Dividing the logistic equation by \( Q(t) \) gives \( \frac{Q'(t)}{Q(t)} = k \left(1 - \frac{Q(t)}{L}\right) = k - \frac{k}{L} Q(t) \), which is linear in \( Q(t) \).
  • Data: Use the \( Q'(t) \) estimates from part (e) and \( Q(t) \) from the table.
  • Regression: Regress \( \frac{Q'}{Q} \) on \( Q \); the slope is \( -\frac{k}{L} \) and the intercept is \( k \), so \( L = -\frac{\text{intercept}}{\text{slope}} \).
  • Result: Assume \( k = 0.2 \), \( L = 2,600,000 \) (total computers).
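
The regression step above can be sketched as follows. The function name `fit_logistic_parameters` is hypothetical, and the data are synthetic: the rates are generated from the logistic equation with the assumed \( k = 0.2 \) and \( L = 2,600,000 \), only to show that regressing \( Q'/Q \) on \( Q \) recovers both parameters. Real central-difference estimates would be noisier.

```python
def fit_logistic_parameters(qs, rates):
    """Regress Q'/Q on Q; return (k, L) from intercept and slope."""
    ys = [r / q for r, q in zip(rates, qs)]
    n = len(qs)
    q_bar = sum(qs) / n
    y_bar = sum(ys) / n
    slope = (sum((q - q_bar) * (y - y_bar) for q, y in zip(qs, ys))
             / sum((q - q_bar) ** 2 for q in qs))
    intercept = y_bar - slope * q_bar
    k = intercept        # intercept of the line is k
    L = -k / slope       # slope of the line is -k / L
    return k, L


# Synthetic data (assumption): exact logistic rates for k = 0.2, L = 2.6 million.
k_true, L_true = 0.2, 2_600_000
qs = [1e4, 1e5, 5e5, 1e6, 2e6]
rates = [k_true * q * (1 - q / L_true) for q in qs]

k, L = fit_logistic_parameters(qs, rates)
print(f"k = {k:.3f}, L = {L:.0f}")
```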

(ii) Q(t) → L, so 100% of 2.6 million computers.

Detailed Solution:

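  • Model: \( \frac{dQ}{dt} = k Q \left(1 - \frac{Q}{L}\right) \).
  • Long-term: As \( t \to \infty \), \( Q \to L \), because the growth rate falls to zero as \( Q \) approaches \( L \).
  • Result: The long-run percentage is \( \frac{L}{N} \times 100\% \). With the assumed \( L = 2,600,000 = N \) this is 100%; a regression estimate of \( L \) below \( N \) would give a smaller percentage.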
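
The long-term behaviour can be illustrated with the standard closed-form solution of the logistic equation, \( Q(t) = \frac{L}{1 + (L/Q_0 - 1)e^{-kt}} \). In this sketch \( k \) and \( L \) are the assumed example values, and the initial value \( Q_0 = 100 \) is a hypothetical choice that does not come from the question.

```python
import math


def logistic(t, k, L, q0):
    """Closed-form solution of Q' = k Q (1 - Q / L) with Q(0) = q0."""
    return L / (1 + (L / q0 - 1) * math.exp(-k * t))


k, L, q0 = 0.2, 2_600_000, 100   # q0 is a hypothetical starting value

for t in (0, 50, 100, 200):
    share = logistic(t, k, L, q0) / L * 100
    print(f"t = {t:>3} days: {share:.4f}% of L infected")
```

The printed share climbs towards 100% of \( L \), matching the limit stated above.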

Syllabus Reference

Syllabus: Mathematics: Applications and Interpretation

Unit 2: Modelling and Differential Equations

  • Linear regression
  • Differential equations
  • Logistic growth

Assessment Criteria: D (Applying mathematics in real-life contexts)
