Question
The length of stay in a hospital after receiving a particular treatment is of interest to the patient, the hospital, and insurance providers. Of particular interest are unusually short or long lengths of stay. A random sample of 50 patients who received the treatment was selected, and the length of stay, in number of days, was recorded for each patient. The results are summarized in the following table and are shown in the dotplot.
(a) Determine the five-number summary of the distribution of length of stay.
(b) Consider two rules for identifying outliers, method A and method B. Let method A represent the \(1.5 \times \mathrm{IQR}\) rule, and let method \(\mathrm{B}\) represent the 2 standard deviations rule.
(i) Using method A, determine any data points that are potential outliers in the distribution of length of stay. Justify your answer.
(ii) The mean length of stay for the sample is 7.42 days with a standard deviation of 2.37 days. Using method B, determine any data points that are potential outliers in the distribution of length of stay. Justify your answer.
(c) Explain why method A might identify more data points as potential outliers than method B for a distribution that is strongly skewed to the right.
▶️Answer/Explanation
Ans:
mean: 7.42 days
\(\min : 5\) days
max: 21 days
mode:” days is
range: 16 days
(b)(i) \(\begin{aligned} & \mathrm{IQR}=2 \text { days } \\ & 2 \cdot 1.5=3 \\ & 7 \pm 3=4,10 \quad \text { outliers } \rightarrow 12,21\end{aligned}\)
(ii) standard deviation \(=2.37\) days
$
\begin{aligned}
2.37 \cdot 2 & =4.74 \text { days } \\
7.42 \pm 4.74 & =(2.68,12.16)
\end{aligned}
$
The only data point that method \(B\) identifies as an outlier is 21 days.
(c) Method \(A\) identified more data points as outliers because the mean is skewed toward the high end and the standard deviation is larger than it would be in a normally distributed dotplot.
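The two outlier rules can be compared with a short Python sketch. The mean, standard deviation, and IQR come from the text above. The quartile values `q1` and `q3` are assumptions, since the data table is not reproduced here; they are chosen only to be consistent with an IQR of 2 days. The sketch measures the 1.5 × IQR fences from the quartiles, which is how the rule is usually applied.

```python
# Summary values quoted in the question and answer.
mean, sd = 7.42, 2.37

# ASSUMPTION: the quartiles are not listed in the text. These illustrative
# values are chosen only so that Q3 - Q1 matches the stated IQR of 2 days.
q1, q3 = 6.0, 8.0


def fences_iqr(q1, q3, k=1.5):
    """Method A: lower and upper fences k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fences_sd(mean, sd, k=2.0):
    """Method B: bounds k standard deviations either side of the mean."""
    return mean - k * sd, mean + k * sd


def flag(values, low, high):
    """Return the values that fall outside the interval [low, high]."""
    return [v for v in values if v < low or v > high]


# Values named in the answer: the minimum and the two largest stays.
candidates = [5, 12, 21]

method_a = flag(candidates, *fences_iqr(q1, q3))
method_b = flag(candidates, *fences_sd(mean, sd))

print("Method A fences:", fences_iqr(q1, q3), "flags", method_a)
print("Method B bounds:", fences_sd(mean, sd), "flags", method_b)
```

Under these assumptions, method A flags both 12 and 21 days, while method B flags only 21 days, because the large values that create the skew also inflate the standard deviation and widen method B's interval.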
Question
Researchers will conduct a year-long investigation of walking and cholesterol levels in adults. They will select a random sample of 100 adults from the target population to participate as subjects in the study.
(a) One aspect of the study is to record the number of miles each subject walks per day. The researchers are deciding whether to have subjects wear an activity tracker to record the data or to have subjects keep a daily journal of the miles they walk each day. Describe what bias could be introduced by keeping the daily journal instead of wearing the activity tracker.
During the course of the study, the subjects will have their cholesterol levels measured each month by a doctor. The researchers will perform a significance test at the end of the study to determine whether the average cholesterol level for subjects who walk fewer miles each day is greater than for those who walk more miles each day.
(b) Selecting a random sample creates a reasonable representative sample of the target population. Explain the benefit of using a representative sample from the population.
(c) Suppose the researchers conduct the test and find a statistically significant result. Would it be valid to claim that increased walking causes a decrease in average cholesterol levels for adults in the target population? Explain your reasoning.
▶️Answer/Explanation
Ans:
(a) Using a journal to track mileage is likely to underestimate the true miles walked each day because the adults will likely not record the small distances they walk each day, like walking from the kitchen to the couch.
(b) Using a representative sample makes it easier to collect data within the study, and that data can be used to make an accurate conclusion about the relationship between miles walked and cholesterol levels in the target population.
(c) No. Because this was an observational study and not an experiment, we cannot determine causation between miles walked and cholesterol levels.
Question
To increase morale among employees, a company began a program in which one employee is randomly selected each week to receive a gift card. Each of the company’s 200 employees is equally likely to be selected each week, and the same employee could be selected more than once. Each week’s selection is independent from every other week.
(a) Consider the probability that a particular employee receives at least one gift card in a 52-week year.
(i) Define the random variable of interest and state how the random variable is distributed.
(ii) Determine the probability that a particular employee receives at least one gift card in a 52 -week year. Show your work.
(b) Calculate and interpret the expected value for the number of gift cards a particular employee will receive in a 52-week year. Show your work.
(c) Suppose that Agatha, an employee at the company, never receives a gift card for an entire 52-week year. Based on her experience, does Agatha have a strong argument that the selection process was not truly random? Explain your answer.
▶️Answer/Explanation
Ans:
\(x=\) the number of gift cards an employee receives.
The random variable is binomial with \(p=0.005\) and \(n=52\).
\(B=\) Binary (receives or does not receive)
\(I=\) “Each week’s selection is independent from every other week”
\(N=\) fixed number \(=52\)
\(S \rightarrow\) Set probability \(\rightarrow 0.005\)
(ii) \(\begin{aligned} P(x \geq 1) & =1-P(x=0) \\ & =1-\left(\begin{array}{c}52 \\ 0\end{array}\right)(0.005)^{0}(1-0.005)^{52} \\ & =1-0.7705 \\ & =0.2295\end{aligned}\)
The probability of receiving at least one gift card in a 52-week year is 0.2295.
(b)
$
\begin{aligned}
& E(x)=n p=52(0.005) \\
& E(x)=0.26
\end{aligned}
$
If many, many random samples of 52-week years were chosen, a particular employee would receive an average of approximately 0.26 gift cards per year.
(c) \(\begin{aligned} P(x=0) & =\left(\begin{array}{c}52 \\ 0\end{array}\right)(0.005)^{0}(1-0.005)^{52} \\ & =0.7705\end{aligned}\)
There is approximately a 0.7705 chance of not getting a gift card for an entire 52-week year. This is a likely occurrence, so Agatha doesn’t have a strong argument that the selection process is not truly random.
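The binomial calculations in this answer can be checked with a minimal Python sketch. It uses only the values given in the question (200 employees, 52 weeks); the helper name `binom_pmf` is introduced here for illustration.

```python
from math import comb

n = 52        # weeks in the year (number of independent selections)
p = 1 / 200   # chance a particular employee is selected in a given week


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_none = binom_pmf(0, n, p)      # P(X = 0): no gift card all year
p_at_least_one = 1 - p_none      # P(X >= 1): complement of P(X = 0)
expected = n * p                 # E(X) = np

print(f"P(X = 0)  = {p_none:.4f}")
print(f"P(X >= 1) = {p_at_least_one:.4f}")
print(f"E(X)      = {expected:.2f}")
```

Because receiving no gift card has a probability of about 0.77, Agatha's experience is the most likely single outcome under a truly random process.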
Question
The manager of a large company that sells pet supplies online wants to increase sales by encouraging repeat purchases. The manager believes that if past customers are offered \(\$ 10\) off their next purchase, more than 40 percent of them will place an order. To investigate the belief, 90 customers who placed an order in the past year are selected at random. Each of the selected customers is sent an e-mail with a coupon for \(\$ 10\) off the next purchase if the order is placed within 30 days. Of those who receive the coupon, 38 place an order.
(a) Is there convincing statistical evidence, at the significance level of \(\alpha=0.05\), that the manager’s belief is correct? Complete the appropriate inference procedure to support your answer.
(b) Based on your conclusion from part (a), which of the two errors, Type I or Type II, could have been made? Interpret the consequence of the error in context.
▶️Answer/Explanation
Ans:
\(H_0\) : if given coupon, customer order rate \(=40 \%\)
\(H_a\) : if given coupon, customer order rate \(>40 \%\)
1-prop z-test
- random sample used
- 90 customers is likely \(<10 \%\) of all customers last year
\(\rightarrow\) assume independent
$
\begin{aligned}
& 0.4 \cdot 90=36>10 \quad 0.6 \cdot 90=54>10 \\
& n p,\ n q \geq 10
\end{aligned}
$
\(\rightarrow\) assume normal
$
z=\frac{\frac{38}{90}-\frac{36}{90}}{\sqrt{\frac{0.4 \cdot 0.6}{90}}}=0.4303
$
\(P(z>0.4303)=0.333\)
No, there is not convincing statistical evidence at the significance level of \(\alpha=0.05\), as our p-value \(=0.333\) is greater than 0.05. Therefore we fail to reject the null hypothesis; we do not have convincing evidence that more than 40 percent of customers who are given a coupon will place an order.
A Type II error could have been made, as we failed to reject the null hypothesis. This could result in us believing that a coupon would not raise the customers’ order rate above 40 percent, when it in fact may.
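The test statistic and p-value above can be reproduced with a short Python sketch of a one-sided, one-proportion z-test. It uses only the figures from the question (38 orders out of 90 customers, null proportion 0.40, significance level 0.05) and the standard library's `NormalDist` for the normal tail area.

```python
from math import sqrt
from statistics import NormalDist

n = 90           # customers who received the coupon
successes = 38   # customers who placed an order
p0 = 0.40        # order rate under the null hypothesis
alpha = 0.05     # significance level

p_hat = successes / n

# Large-counts check: expected successes and failures under H0.
assert n * p0 >= 10 and n * (1 - p0) >= 10

# The standard error uses the null proportion, as in the answer above.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# One-sided (upper-tail) p-value because Ha is "rate > 40%".
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.4f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```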
Question
A research center conducted a national survey about teenage behavior. Teens were asked whether they had consumed a soft drink in the past week. The following table shows the counts for three independent random samples from major cities.
(a) Suppose one teen is randomly selected from each city’s sample. A researcher claims that the likelihood of selecting a teen from Baltimore who consumed a soft drink in the past week is less than the likelihood of selecting a teen from either one of the other cities who consumed a soft drink in the past week because Baltimore has the least number of teens who consumed a soft drink. Is the researcher’s claim correct? Explain your answer.
(b) Consider the values in the table. (i) Construct a segmented bar chart of relative frequencies based on the information in the table.
(ii) Which city had the smallest proportion of teens who consumed a soft drink in the previous week? Determine the value of the proportion.
(c) Consider the inference procedure that is appropriate for investigating whether there is a difference among the three cities in the proportion of all teens who consumed a soft drink in the past week.
(i) Identify the appropriate inference procedure.
(ii) Identify the hypotheses of the test.
▶️Answer/Explanation
Ans:
The researcher’s claim is incorrect. Just because the count of Baltimore teens who consumed a soft drink in the past week is the lowest doesn’t mean it is less likely to select a teen from Baltimore who answered yes than from the other cities. In reality, the likelihood of a Baltimore teen answering yes is \(\frac{727}{904}=0.804\), which is higher than Detroit \(\left(\frac{1233}{1662}=0.742\right)\) and San Diego \(\left(\frac{1482}{2280}=0.650\right)\). There were differing numbers of teens surveyed in the three cities, so the counts cannot be compared directly; they must be converted to proportions first.
(b)
San Diego \(\rightarrow p=\frac{1482}{2280}=0.65\)
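The comparison of proportions can be sketched in a few lines of Python. The counts are the ones quoted in this answer (teens who answered yes, and sample size, for each city); the dictionary layout is just one convenient way to hold them.

```python
# (teens who consumed a soft drink, total teens sampled) per city,
# as quoted in the answer above.
counts = {
    "Baltimore": (727, 904),
    "Detroit": (1233, 1662),
    "San Diego": (1482, 2280),
}

# Convert each count to a relative frequency so cities with
# different sample sizes can be compared fairly.
proportions = {city: yes / total for city, (yes, total) in counts.items()}

lowest_city = min(proportions, key=proportions.get)

for city, prop in proportions.items():
    print(f"{city}: {prop:.3f}")
print("Smallest proportion:", lowest_city)
```

Baltimore has the smallest count but the largest proportion, which is why raw counts are misleading here.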
(c) The appropriate inference procedure is a \(\chi^2\)-test for homogeneity.
\(H_0:\) All three cities have the same proportion of teens who consumed a soft drink in the past week.
\(H_a\) : The proportion of teens who had a soft drink in the past week differs among the three cities.
Question
Attendance at games for a certain baseball team is being investigated by the team owner. The following boxplots summarize the attendance, measured as average number of attendees per game, for 47 years of the team’s existence. The boxplots include the 30 years of games played in the old stadium and the 17 years played in the new stadium.
[Boxplots of average attendance: Old Stadium, New Stadium]
(a) Compare the distributions of average attendance between the old and new stadiums.
The following scatterplot shows average attendance versus year.
(b) Compare the trends in average attendance over time between the old and new stadium.
▶️Answer/Explanation
Ans:
Shape: The distribution of avg attendance at the old stadium is roughly uniform while at the new stadium is skewed to the left.
Center: The median avg attendance at the old stadium is 16,000 attendees while much higher at 25,000 attendees at the new stadium.
Spread: The range of avg attendance at the new stadium (about 12,000 ) is greater than at the old stadium (about 8,000).
Outliers: There is at least one potential outlier of avg attendance at the new stadium at approximately 16,000 attendees.
(b) There has been no trend of increasing or decreasing average attendance over time at the old stadium, but average attendance has rapidly increased over time at the new stadium, although the rate of increase is slowing down. The new stadium has seen rapid growth in attendance over the years while the old stadium has seen no significant change in trends over time.
(c)(i) There is a strong positive linear association between the number of games won and average attendance for each year.
(ii) Graph II does not suggest a change in rate for games in the new stadium compared to the old stadium because there is only a very minor shift in rate between the two clusters of old vs new stadium years and the same regression line could reasonably represent the whole graph.
(d) The number of games won could be a confounding variable in the relationship between average attendance and year or stadium. While the graphs have shown a clear association between year and average attendance (average attendance rapidly increases in 2000 and on) and between stadium and average attendance (far more people on average attended games at the new stadium than the old stadium), the number of games won in a year is a lurking variable for both of these relationships. In the new stadium, the baseball team won far more games per year than they did at the old stadium, but the rate at which average attendance increased with the number of games won didn’t change when the stadium changed, suggesting the new stadium didn’t cause the increase in average attendance. The average attendance did increase as the years went on, but for most of the years with high attendance (at the new stadium), the team also won more games than before. Therefore, the number of games won per year confounded the relationships between average attendance and year or stadium and explained the variation in average attendance.