AP Statistics Unit 1: Exploring One-Variable Data Summary Notes


Exploring One-Variable Data
On your AP exam, 15‒23% of questions will fall under the topic of Exploring One-Variable Data.

Variables and Frequency Tables
A variable is a characteristic or quantity that potentially differs between individuals in a group. A categorical variable is one that classifies an individual by group or category, while a quantitative variable takes on a numerical value that can be measured or counted.

It is important to recognize that it is possible for a categorical variable to look, superficially, like a number. For example, despite being composed of numbers, a zip code is categorical data. It does not represent any quantity or count; rather, it’s simply a label for a location.

Quantitative variables can be further classified as discrete or continuous. A discrete variable can take on only countably many values. The number of possible values is either finite or countably infinite. In contrast, a continuous variable can take on uncountably many values. An important characteristic of a continuous variable is that between any two possible values another value can be found.

Graphs for Categorical Variables
A categorical variable can be represented in a frequency table, which shows how many individual items in a population fall into each category. For example, suppose a student was interested in which color of car is most popular. He collects data from the parking lot at school, and his results are shown in the following frequency table:

A relative frequency table gives the proportion of the total that is accounted for by each category. For example, in the previous data, 14 of the 50 cars, or $28 \%$, were black. The full relative frequency table is as follows:

Note that the percentages add up to $100 \%$, since all of the cars were of one of the colors represented in the table.
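As a rough illustration, a relative frequency table can be computed from a frequency table as sketched below. Only the black count (14 of 50) comes from the text; the other colors and counts are hypothetical values chosen so that the total is 50.

```python
# Hypothetical frequency table: only "black: 14" (out of 50 cars) is from the text.
counts = {"black": 14, "white": 12, "silver": 10, "red": 8, "blue": 6}

total = sum(counts.values())  # total number of cars observed

# Relative frequency = category count divided by the total count.
relative = {color: n / total for color, n in counts.items()}

for color, share in relative.items():
    print(f"{color:<8}{counts[color]:>3}{share:>7.0%}")
```

Because every car falls into exactly one category, the relative frequencies sum to 1 (that is, $100 \%$).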

A bar chart is a graph that represents the frequencies, or relative frequencies, of a categorical variable. The categories are organized along a horizontal axis, with a bar rising above each category. The height of the bar corresponds to the number of observations of that category. The vertical axis may be labeled with frequencies or with relative frequencies, as in the following examples.

A bar chart representing data from more than one set is useful for comparing the frequencies across the sets. For example, suppose that the day after collecting the initial data on car colors, the student collected the same information from a parking lot at a nearby school. The results can be compared using the following bar chart, which shows the relative frequencies for each color, separated by school:

Graphs for Quantitative Variables
A histogram is related to a bar chart but is used for quantitative data. The data is split into intervals, or bins, and the number of data points in each interval is counted. The horizontal axis contains the different intervals, which are adjacent to each other, as they form a number line. The vertical axis shows the count for each interval. The following histogram represents the scores that 50 students received on a test:

How the data is split into intervals can have a big impact on the appearance of the histogram. Two histograms that represent the same data can show different characteristics, depending on the choice of interval width.
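The effect of interval width can be seen by counting the same data with two different bin widths. The sketch below uses made-up scores (not the test score data) and a hypothetical helper, `bin_counts`, that labels each bin by its left endpoint.

```python
from collections import Counter

def bin_counts(values, width):
    """Count the values falling in each interval [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Made-up scores for illustration only.
scores = [52, 58, 61, 64, 67, 69, 71, 72, 74, 75, 77, 78, 81, 83, 88, 94]

wide = bin_counts(scores, 20)   # three wide bins: one tall bar in the middle
narrow = bin_counts(scores, 5)  # nine narrow bins: more detail, lower bars

print(wide)
print(narrow)
```

With a width of 20 the data look like a single block centered in the 60s and 70s; with a width of 5 the same data show a more gradual rise and fall.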

A stem-and-leaf plot is another graphical representation of a quantitative variable. Each data value is split into a stem (one or more digits) and a leaf (the last digit). The stems are arranged in a column, and the leaves are listed alongside the stem to which they belong. The test score data is shown in the following stem-and-leaf plot:
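The stem-and-leaf construction can also be sketched in code. The example below uses a small set of made-up two-digit scores, not the test score data, and a hypothetical helper named `stem_and_leaf`.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

# Made-up scores for illustration only.
scores = [67, 72, 75, 81, 83, 83, 88, 90, 94]

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Note that this sketch skips stems that have no leaves; a hand-drawn plot would normally list every stem between the smallest and largest so that gaps remain visible.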

In a dotplot, each data value is represented by a dot placed above a horizontal axis. The height of a column of dots shows how many repetitions there are of that value. The following is a subset of the test score data:

The Distribution of a Quantitative Variable
The distribution of quantitative data is described by reference to shape, center, variability, and unusual features such as outliers, clusters, and gaps.

When a distribution has a longer tail on either the right or left, the distribution is said to be skewed in that direction. If the right and left sides are approximately mirror images, the distribution is symmetric. A distribution with a single peak is unimodal; if it has two distinct peaks, it is bimodal. A distribution without any noticeable peaks is uniform.

An outlier is a value that is unusually large or small. A gap is a significant interval that contains no data points, and a cluster is an interval that contains a high concentration of data points. In many cases, a cluster will be surrounded by gaps.

Free Response Tip
If you are asked to compare two distributions, be sure to address both their similarities and differences. For example, perhaps they are both unimodal, but one is skewed while the other is symmetric. Perhaps one has an outlier while the other does not. In particular, be sure to note if one has greater variability than the other, even if you cannot quantify the difference.

Summary Statistics and Outliers

A statistic is a value that summarizes and is derived from a sample. Measures of center and position include the mean, median, quartiles, and percentiles. The commonly used measures of variability are variance, standard deviation, range, and IQR.

The mean of a sample is denoted $\bar{x}$, and is defined as the sum of the values divided by the number of values. That is, $\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i$. The median is the value in the center when the data points are in order. In case the number of values is even, the median is usually taken to be the mean of the two middle values. The first quartile, $Q_1$, and the third quartile, $Q_3$, are the medians of the lower and upper halves of the data set.
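A minimal sketch of these definitions, using made-up data, is shown below. The `quartiles` helper is hypothetical and follows the rule stated above: $Q_1$ and $Q_3$ are the medians of the lower and upper halves, with the middle value left out of both halves when the number of values is odd. Other software may use slightly different quartile conventions.

```python
from statistics import mean, median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. When n is odd, the middle value
    is excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[len(data) - half:]
    return median(lower), median(upper)

# Made-up data for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

x_bar = mean(data)         # sum of the values divided by n
med = median(data)         # middle value of the sorted data
q1, q3 = quartiles(data)   # medians of [2, 4, 4, 5] and [9, 10, 12, 15]

print(x_bar, med, q1, q3)
```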

The ideas behind the first and third quartiles can be generalized to the notion of percentiles. The $p^{\text {th }}$ percentile is the data point that has $p \%$ of the data less than or equal to it. With this terminology, the first and third quartiles are the $25^{\text {th }}$ and $75^{\text {th }}$ percentiles, respectively.

The range of a data set is the difference between the maximum and minimum values, and the interquartile range, or $I Q R$, is the difference between the first and third quartiles. That is, $I Q R=Q_3-Q_1$.

Variance is defined in terms of the squares of the differences between the data points and the mean. More precisely, the variance $s^2$ is given by the formula $s^2=\frac{1}{n-1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^2$. The standard deviation is then simply the square root of the variance: $s=\sqrt{\frac{1}{n-1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^2}$.
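The formula can be checked numerically. The sketch below computes $s^2$ directly from the definition for a small made-up data set and compares it with the `variance` and `stdev` functions in Python's standard `statistics` module, which also divide by $n-1$.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Made-up data: the mean is 5, the squared deviations sum to 24.
data = [2, 4, 4, 5, 7, 8]

s_squared = sample_variance(data)  # 24 / 5
s = sqrt(s_squared)

print(s_squared, s)
```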

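When units of measurement are changed, summary statistics behave in predictable ways that depend on the type of operation done. Adding a constant to every value shifts measures of center and position (mean, median, quartiles) by that constant but leaves measures of variability unchanged. Multiplying every value by a constant multiplies measures of center and position by that constant and multiplies the standard deviation, range, and IQR by its absolute value.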
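One way to see the effect of a unit change is to convert a small data set and compare the statistics before and after. The sketch below uses made-up temperatures and the conversion $F = 1.8C + 32$, which both multiplies by a constant and adds a constant.

```python
from statistics import mean, median, stdev

# Made-up temperatures in degrees Celsius.
celsius = [10.0, 12.0, 15.0, 19.0, 24.0]

# Change of units: multiply by 1.8, then add 32.
fahrenheit = [1.8 * c + 32 for c in celsius]

# Measures of center follow the full transformation...
print(mean(fahrenheit), 1.8 * mean(celsius) + 32)
print(median(fahrenheit), 1.8 * median(celsius) + 32)

# ...while the standard deviation is only scaled; the added 32 has no effect.
print(stdev(fahrenheit), 1.8 * stdev(celsius))
```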

There are many possible ways to define an outlier. There are two methods commonly used in AP Statistics, depending on what statistic is being used to describe the spread of the distribution.

When the IQR is used to describe the spread, the 1.5IQR rule is used to define outliers. Under this rule, a value is considered an outlier if it lies more than $1.5 \times I Q R$ away from one of the quartiles. Specifically, an outlier is a value that is either less than $Q_1-1.5 \times I Q R$ or greater than $Q_3+1.5 \times I Q R$.

On the other hand, if the standard deviation is being used to describe the variation of the distribution, then any value that is more than 2 standard deviations away from the mean is considered an outlier. In other words, a value is an outlier if it is less than $\bar{x}-2 s$ or greater than $\bar{x}+2 s$.
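Both outlier rules can be sketched in a few lines. The helper names and the data below are made up for illustration, and the quartiles are taken as the medians of the lower and upper halves of the sorted data (excluding the middle value when the count is odd).

```python
from statistics import mean, median, stdev

def iqr_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    fence = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - fence or x > q3 + fence]

def two_sd_outliers(values):
    """Values more than 2 standard deviations away from the mean."""
    x_bar, s = mean(values), stdev(values)
    return [x for x in values if abs(x - x_bar) > 2 * s]

# Made-up data with one unusually large value.
data = [4, 5, 5, 6, 7, 8, 9, 10, 30]

print(iqr_outliers(data))     # Q1 = 5, Q3 = 9.5, so the upper fence is 16.25
print(two_sd_outliers(data))  # the standard deviation is 8, so the cutoff is 16 from the mean
```

For this data set the two rules agree, but they need not flag the same values in general.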

If the existence of an outlier does not have a significant effect on the value of a certain statistic, we say that statistic is resistant (or robust). The median and IQR are examples of resistant statistics. On the other hand, some statistics, including mean, standard deviation, and range, are changed significantly by an outlier. These statistics are called nonresistant (or nonrobust).
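Resistance can be illustrated by replacing the largest value of a data set with an extreme one and recomputing the statistics. The data below are made up for this purpose.

```python
from statistics import mean, median

# Made-up data without and with an extreme value.
base = [4, 5, 5, 6, 7, 8, 9, 10, 11]
with_outlier = base[:-1] + [110]  # replace the maximum, 11, with 110

# The median only depends on the middle of the sorted data, so it is unchanged.
print(median(base), median(with_outlier))

# The mean uses every value, so the single outlier pulls it upward.
print(mean(base), mean(with_outlier))

# The range is also nonresistant: it depends directly on the maximum.
print(max(base) - min(base), max(with_outlier) - min(with_outlier))
```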

Related to the idea of robustness is the relationship between mean and median in skewed distributions. If a distribution is close to symmetric, the mean and median will be approximately equal to each other. On the other hand, in a skewed distribution the mean will usually be pulled in the direction of the skew. That is, if the distribution is skewed right, the mean will usually be greater than the median, while if the distribution is skewed left, the mean will usually be less than the median.

Graphs of Summary Statistics
The five-number summary of a data set is composed of the following five values, in order: minimum, first quartile, median, third quartile, and maximum. A boxplot is a graphical representation of the five-number summary that can be drawn vertically or horizontally along a number line. In a boxplot, a box is constructed that spans the distance between the quartiles. A line, representing the median, cuts the box in two.

Lines, often called whiskers, connect the ends of the box with the maximum and minimum points. If the set contains one or more outliers, the whiskers end at the most extreme values that are not outliers, and the outliers themselves are indicated by stars or dots.

Note that the two sections of the box, along with the two whiskers, each represent a section of the number line that contains approximately $25 \%$ of the values.
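The numbers needed to draw a boxplot can be sketched as follows, using made-up data and hypothetical helper names. Quartiles are computed as the medians of the lower and upper halves, and outliers are identified with the $1.5 \times I Q R$ rule so that the whiskers stop at the most extreme non-outlying values.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    return data[0], q1, median(data), q3, data[-1]

def whisker_ends(values):
    """Most extreme values that are not outliers under the 1.5*IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    inside = [x for x in values if q1 - fence <= x <= q3 + fence]
    return min(inside), max(inside)

# Made-up data with one high outlier.
data = [4, 5, 5, 6, 7, 8, 9, 10, 30]

summary = five_number_summary(data)
whiskers = whisker_ends(data)
print(summary)
print(whiskers)
```

Here the upper whisker ends at 10 rather than at the maximum, and the value 30 would be drawn as a separate point.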

Boxplots can be used to compare two or more distributions to each other. The relative positions and sizes of the sections of the box and the whiskers can demonstrate differences in the center and spread of the distributions.

The Normal Distribution
A normal distribution is unimodal and symmetric. It is often described as a bell curve. In fact, there are infinitely many normal distributions. Any single one is described by two parameters: the mean, $\mu$, and the standard deviation, $\sigma$. The mean is the center of the distribution, and the standard deviation determines whether the peak is relatively tall and narrow or short and wide.

The empirical rule gives guidelines for how much of a normally distributed data set is located within certain distances from the center. In particular, approximately $68 \%$ of the data points are within 1 standard deviation of the mean, approximately $95 \%$ are within 2 standard deviations of the mean, and approximately $99.7 \%$ are within 3 standard deviations of the mean.
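The percentages in the empirical rule are rounded values of areas under the normal curve. As a quick check, the sketch below computes those areas with `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The result does not depend on the particular values of $\mu$ and $\sigma$, which is why the rule applies to every normal distribution.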

In practice, many sets of data that arise in statistics can be described as approximately normal: they are well modeled by a normal distribution, although it is rarely perfect.

The standardized score, or $z$-score, of a data point is the number of standard deviations above or below the mean at which it lies. The formula is $z=\frac{x-\mu}{\sigma}$. It is analogous to a percentile in the sense that it describes the relative position of a point within a data set. If the $z$-score is positive, the value is greater than the mean, while if it is negative, the value is less than the mean. In either case, the absolute value of the $z$-score describes how far away the value is from the center of the distribution.
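A short sketch of the $z$-score formula follows. The mean of 70 and standard deviation of 8 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical distribution: mean 70, standard deviation 8.
above = z_score(86, mu=70, sigma=8)   # 16 points above the mean
below = z_score(62, mu=70, sigma=8)   # 8 points below the mean
center = z_score(70, mu=70, sigma=8)  # exactly at the mean

print(above, below, center)
```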

Suggested Reading

Starnes $\&$ Tabor. The Practice of Statistics. $6^{\text {th }}$ edition. Chapters 1 and 2. New York, NY: Macmillan.
Larson $\&$ Farber. Elementary Statistics: Picturing the World. $7^{\text {th }}$ edition. Chapter 2. New York, NY: Pearson.
Bock, Velleman, De Veaux, $\&$ Bullard. Stats: Modeling the World. $5^{\text {th }}$ edition. Chapters 1-5. New York, NY: Pearson.
Sullivan. Statistics: Informed Decisions Using Data. $5^{\text {th }}$ edition. Chapters 2 and 3. New York, NY: Pearson.
Peck, Short, $\&$ Olsen. Introduction to Statistics and Data Analysis. $6^{\text {th }}$ edition. Chapters 3 and 4. Boston, MA: Cengage Learning.

Sample Exploring One-Variable Data Questions

Consider the following output obtained when analyzing the percent nitrogen composition of soil collected in neighborhoods near a water treatment facility in 2019. Which of the following statements is true?
$
\begin{aligned}
& \text { NumCases }=55 \\
& \text { Mean }=23.01 \\
& \text { Median }=24.26 \\
& \text { StdDev }=4.131 \\
& \text { Min }=12.05 \\
& \text { Max }=31.49 \\
& 75^{\text {th }} \% \text { ile }=30.12
\end{aligned}
$
A. The 25th percentile must be about 18.4.

B. Some outliers appear to be present.

C. The IQR is 19.44.

D. About $10 \%$ of the values are in the range 30.12 to 31.49.

E. Soil levels at $11 \%$ exist in the sample, but are not prevalent.


Explanation:
The correct answer is B. An outlier is typically taken to be a data point that is more than two standard deviations from the mean. Computing mean + 2(standard deviation) gives $23.01+2(4.131)=31.272$. Since the maximum, 31.49, is larger than this value, there must be at least one outlier in the data. Choice A is incorrect because the data need not be uniformly spaced, so the way the data is dispersed to the left of the median may not be the same as the way it is dispersed to the right. Choice C is incorrect because 19.44 is the range, not the IQR. Choice D is incorrect because 30.12 is the $75^{\text {th }}$ percentile, so about $25 \%$ of the values are within this range. Choice E is incorrect because the minimum value in this data set is 12.05, so no value as low as $11 \%$ appears in the sample.
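The arithmetic behind this explanation can be checked with a few lines of code. The sketch below uses only the summary values given in the output above and applies the two-standard-deviation rule for outliers.

```python
# Summary values taken from the output in the question.
mean, std_dev = 23.01, 4.131
minimum, maximum = 12.05, 31.49

lower = mean - 2 * std_dev  # lower cutoff, about 14.748
upper = mean + 2 * std_dev  # upper cutoff, about 31.272

has_high_outlier = maximum > upper  # the maximum lies beyond the upper cutoff
has_low_outlier = minimum < lower   # the minimum lies beyond the lower cutoff

data_range = maximum - minimum      # about 19.44, the value quoted in choice C

print(round(lower, 3), round(upper, 3), has_high_outlier, has_low_outlier)
print(round(data_range, 2))
```

Under this rule the minimum is also flagged, which is consistent with the statement that some outliers appear to be present.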

A researcher is interested in the age at which adolescents get their first paying job. She surveyed a simple random sample of 150 adolescents who have had at least one paying job before the age of 19. The distribution of the ages was found to be approximately normal with a mean of 15.2 years and a standard deviation of 1.6 years. According to the empirical rule, between which two ages do approximately $95 \%$ of the adolescents get their first paying job?

A. 13.2 years and 17.2 years

B. 15.2 years and 18.4 years

C. 12 years and 15.2 years

D. 12 years and 18.4 years

E. 13.6 years and 16.8 years


Explanation:
The correct answer is D. Let $X$ be the age at which an adolescent gets his or her first paying job. Since $X$ is assumed to be normal with mean 15.2 and standard deviation 1.6, the empirical rule states that about $95 \%$ of the data will be within 2 standard deviations of the mean. Here $15.2-2(1.6)=12$ and $15.2+2(1.6)=18.4$. So, about $95 \%$ of adolescents get their first paying job between the ages of 12 years and 18.4 years. Choice A is incorrect because you used 2 instead of 2 times the standard deviation, $2(1.6)$, as the distance from the mean. Choice B is incorrect because you forgot to subtract $2(1.6)$ from the mean to get the left endpoint. Choice C is incorrect because you forgot to add $2(1.6)$ to the mean to get the right endpoint. Choice E is incorrect because you used $1(1.6)$ as the distance from the mean instead of $2(1.6)$. As such, this is the range in which approximately $68 \%$ of adolescents get their first paying job.
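The endpoints in this explanation can be reproduced directly from the mean and standard deviation given in the question, as in the short sketch below.

```python
# Values given in the question.
mu, sigma = 15.2, 1.6

# Empirical rule: about 95% of values lie within 2 standard deviations of the mean.
low_95, high_95 = mu - 2 * sigma, mu + 2 * sigma

# For comparison, about 68% lie within 1 standard deviation (choice E).
low_68, high_68 = mu - sigma, mu + sigma

print(round(low_95, 1), round(high_95, 1))
print(round(low_68, 1), round(high_68, 1))
```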

Thirty-six students completed an algebra exam consisting of 40 questions. The score distribution is described by the following stem-and-leaf plot:

The first quartile of the score distribution is equal to which of the following?
A. 17
B. 7
C. 36
D. 17.5
E. 29


Explanation:
The correct answer is A. Since there are 36 scores in the stem-and-leaf plot, the first quartile (the $25^{\text {th }}$ percentile) is the score in position $0.25(36)=9$, counting up from the lowest score. The score in the $9^{\text {th }}$ position is 17. Choice B is incorrect because you likely forgot to include the stem "1" when reporting the score. Choice C is incorrect because this is the third quartile, not the first. Remember, the first quartile is the $25^{\text {th }}$ percentile and the third quartile is the $75^{\text {th }}$ percentile. Choice D is incorrect because you averaged the $9^{\text {th }}$ and $10^{\text {th }}$ scores. But the position of the first quartile, or $25^{\text {th }}$ percentile, is $0.25(36)=9$, an integer, so there is no need to average two scores. Choice E is incorrect because it is the median.
