# Maths: Statistics 1 (S1)

**D1** Types of variable:

- **Qualitative**: non-numerical (e.g. colour)
- **Quantitative**: numerical (e.g. length)
- **Continuous**: can take any value (e.g. age)
- **Discrete**: can only take certain values (e.g. cost)
- **Categorical**: listed by a category/property and not a number

**D2 - 8** Data presentation:

- Know how to use and create pie charts, bar charts, line charts, histograms, stem and leaf diagrams, box and whisker plots and cumulative frequency charts
- In a box and whisker plot (boxplot), an outlier is at least 1.5 x IQR from the nearest quartile

**D9** Skewness:

- **Skewness** can be described as positive, symmetrical or negative
- A **positive skew** has most of the distribution concentrated on the left, with a longer tail to the right (a right-skew)
- A **negative skew** has most of the distribution concentrated on the right, with a longer tail to the left (a left-skew)
- A distribution can have many types of shape, including **unimodal** (one peak), **bimodal** (two peaks, sometimes, but not always, of the same height) and **uniform** (constant)

**D10** Measures of central tendency:

- The **mean** is calculated with `∑x ÷ n`, where n is the number of items
- The symbol for the mean is x̄, *x bar*
- The position of the **median** in the ordered list is found with `(n + 1) ÷ 2` (the list must be in ascending order). If n is odd, there is a single middle value. If n is even, the two middle values are added and divided by two to give the median
- The **mode** is the value which appears most frequently. The data is **bimodal** if two values occur more often than the rest, which suggests that the data may have been taken from two populations
- The **mid-range** is the value midway between the upper extreme and lower extreme (highest and lowest values), calculated by adding them and dividing by 2
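
As a rough check of the four measures above, the short Python sketch below works them out for a small made-up sample. The data values are hypothetical and are not taken from these notes; `mean`, `median` and `multimode` come from Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# Hypothetical sample, chosen only for illustration
data = [3, 7, 7, 2, 9, 4]

n = len(data)
ordered = sorted(data)              # the median needs the data in ascending order
median_position = (n + 1) / 2       # 3.5 here, so average the 3rd and 4th values

data_mean = mean(data)              # sum of the values divided by n
data_median = median(data)          # handles the odd and even cases
data_modes = multimode(data)        # list of every value with the highest frequency
mid_range = (max(data) + min(data)) / 2

print(ordered, median_position)
print(data_mean, data_median, data_modes, mid_range)
```

`multimode` returns a list, so bimodal data would show up as a list with two entries.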

**D11** Usefulness of each measure of central tendency:

- The mean is the best known average and makes use of all of the data, but is affected by extreme values and cannot be obtained graphically
- The median is not influenced by extreme values and can be obtained even if some of the data values are unknown, but it can often only be estimated and it cannot be used for further statistical calculations like the mean can
- The mode is also unaffected by extreme values and is easy to calculate, but there may be more than one and sometimes it cannot be determined exactly

**D13** Ranges and percentiles:

- The range is found with `largest value - smallest value`, or `x_{max} - x_{min}`
- The three quartiles are found at the following positions in the ordered data:
  - Q_{1}: `1(n + 1) ÷ 4`
  - Q_{2}: `2(n + 1) ÷ 4`
  - Q_{3}: `3(n + 1) ÷ 4`
- The **interquartile range** (IQR) is equal to `Q_{3} - Q_{1}`
- To find the position of the x^{th} **percentile**, use `x(n + 1) ÷ 100`, in the same way as the quartiles (Q_{1} = the 25th percentile, Q_{2} = the 50th percentile/the median etc)
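
The position rule above can be sketched in Python as follows. The sample data and the function names are hypothetical, and the linear interpolation used when a position is not a whole number is an assumption; check which convention your exam board expects.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data (assumes 1 <= position <= n).

    Interpolates linearly between neighbours when the position is fractional
    (an assumed convention, not something stated in the notes).
    """
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def percentile(data, x):
    """x-th percentile using the position x(n + 1) / 100."""
    ordered = sorted(data)
    position = x * (len(ordered) + 1) / 100
    return value_at_position(ordered, position)


# Hypothetical sample with n = 7, so the quartile positions are 2, 4 and 6
data = [2, 4, 5, 7, 8, 10, 13]
q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q2, q3, iqr, data_range)
```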

**D14** Measures of spread:

- The **sum of squares** (or S_{xx}) is calculated with `S_{xx} = ∑(x - x̄)^{2}` or `S_{xx} = ∑x^{2} - nx̄^{2}`
  - Therefore, to calculate S_{xx}, find the sum of all values squared and subtract n multiplied by the mean squared
  - For example, for the data {1, 2, 10}, ∑x^{2} = 1^{2} + 2^{2} + 10^{2} = 105
  - n × mean^{2} = 3 × (13/3)^{2} = 56.33
  - Therefore, S_{xx} = 105 - 56.33 = 48.7
  - Remember not to use a rounded mean when calculating S_{xx}
- **Mean square deviation** (MSD) = `S_{xx} ÷ n`
- **Root mean square deviation** (RMSD) is the root of the mean square deviation
- **Variance** (s^{2}) = `S_{xx} ÷ (n - 1)`
- **Sample standard deviation** (s) is the root of the variance
  - Sample standard deviation (just called *standard deviation* in the exam) uses the symbol *sx* in Casio calculators
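
The measures of spread above can be checked with a few lines of Python. This is only a sketch using the {1, 2, 10} example; the variable names are my own, and `stdev` from the standard `statistics` module is used purely as a cross-check of the divisor n - 1.

```python
from math import sqrt
from statistics import stdev

data = [1, 2, 10]
n = len(data)
mean = sum(data) / n                      # kept unrounded, as the notes advise

# Two equivalent ways of finding the sum of squares S_xx
s_xx_definition = sum((x - mean) ** 2 for x in data)
s_xx_shortcut = sum(x ** 2 for x in data) - n * mean ** 2

msd = s_xx_shortcut / n                   # mean square deviation
rmsd = sqrt(msd)                          # root mean square deviation
variance = s_xx_shortcut / (n - 1)        # sample variance, s squared
std_dev = sqrt(variance)                  # sample standard deviation, s

library_std_dev = stdev(data)             # standard library value for comparison

print(round(s_xx_shortcut, 1), round(msd, 2), round(rmsd, 2))
print(round(variance, 2), round(std_dev, 2), round(library_std_dev, 2))
```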

**D15** Linear coding:

- Adding/subtracting a constant from all of the data will change the mean by this amount and will not affect the standard deviation
  - For example, the data {2, 3, 4, 5, 6} has a mean of 4 and standard deviation of 1.58
  - Adding 1 to each item (to make {3, 4, 5, 6, 7}) will increase the mean by 1 but the standard deviation will remain at 1.58
- Multiplying/dividing all of the data by a positive constant will multiply/divide both the mean and the standard deviation by this constant
  - Using the above example, multiplying each item by 2 will result in {4, 6, 8, 10, 12}
  - The mean will now be 2 × 4 = 8 and the standard deviation will be 2 × 1.58 = 3.16
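
The effect of coding on the {2, 3, 4, 5, 6} example can be confirmed numerically. The sketch below uses Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, stdev

data = [2, 3, 4, 5, 6]
shifted = [x + 1 for x in data]    # add 1 to every item
scaled = [2 * x for x in data]     # multiply every item by 2

# Adding a constant moves the mean but leaves the spread alone
print(mean(data), round(stdev(data), 2))
print(mean(shifted), round(stdev(shifted), 2))

# Multiplying by a constant scales both the mean and the standard deviation
print(mean(scaled), round(stdev(scaled), 2))
```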

**D16** Outliers:

- An **outlier** is a value at least 2 standard deviations from the mean, or at least 1.5 × IQR beyond the nearest quartile
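
Both outlier rules can be sketched in Python as below. The sample and the function names are hypothetical, and the quartiles are passed in by hand (here Q_{1} = 4 and Q_{3} = 10, the 2nd and 6th of the 7 ordered values) rather than computed.

```python
from statistics import mean, stdev


def sd_outliers(data):
    """Values at least 2 sample standard deviations from the mean."""
    centre, spread = mean(data), stdev(data)
    return [x for x in data if abs(x - centre) >= 2 * spread]


def iqr_outliers(data, q1, q3):
    """Values at least 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x <= low or x >= high]


# Hypothetical sample with one suspiciously large value
data = [2, 4, 5, 7, 8, 10, 40]

by_sd = sd_outliers(data)
by_iqr = iqr_outliers(data, q1=4, q3=10)
print(by_sd, by_iqr)
```

The two rules do not always agree, so it is worth stating which one has been used.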

**u1- u4** Probability - notation:

- **P(A)** is the probability of A
- **P(B)** is the probability of B
- **P(A∩B)** is the probability of A **and** B occurring
- **P(A∪B)** is the probability of A occurring, B occurring or both occurring
- **P(A')** is the probability of A **not** occurring
- **P(A|B)** is the probability of A occurring given that B has already happened

**u5** Mutually exclusive and independent events:

- **Independent** events are not affected by one another
  - If two events are independent, then `P(A) × P(B) = P(A∩B)`
  - If P(B|A) = P(B), then A and B are independent
- **Mutually exclusive** events **cannot** both happen at the same time - e.g. getting heads and tails in one flip
  - If two events are mutually exclusive, `P(A∩B) = 0` and `P(A∪B) = P(A) + P(B)`
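
These rules can be tested on a small sample space. The sketch below assumes one roll of a fair six-sided die, which is my own illustration and not an example from the notes; events are written as Python sets and `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}
at_most_two = {1, 2}
odd = {1, 3, 5}

# Independence: P(A ∩ B) equals P(A) × P(B)
are_independent = prob(even & at_most_two) == prob(even) * prob(at_most_two)

# Mutually exclusive: P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B)
are_exclusive = prob(even & odd) == 0
union_rule_holds = prob(even | odd) == prob(even) + prob(odd)

print(are_independent, are_exclusive, union_rule_holds)
```

Note that `even` and `odd` are mutually exclusive but not independent, since P(even) × P(odd) is not 0.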

**u6, u7** Calculating outcomes:

- For two mutually exclusive events, the probability of either A or B occurring is equal to P(A) + P(B)
- The probability of two independent events both occurring (e.g. two coins flipped, both getting tails) is equal to P(A) × P(B), which is 0.5 × 0.5 = 0.25 in this example
**TO DO: Anything I've missed? Check back over spec**

**R1, R2** Discrete random variables:

- A **probability distribution** is a table of values showing the probabilities of various outcomes, for example:

| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| P(X = x) | ^{2}/_{10} | ^{1}/_{10} | ^{2}/_{10} | ^{3}/_{10} | ^{2}/_{10} |

- Here, the probability of x = 0 is ^{2}/_{10}, x = 3 is ^{3}/_{10} etc
- The probabilities will **always** sum to 1
- X is known as a **discrete random variable** because it can only take a set of values (here only 0, 1, 2, 3 and 4) and the probabilities sum to 1

**R3** Expectation:

- **Expectation** is the mean value, written as µ or E(*X*)
- It is calculated with `∑xP(X = x)` (so sum each value multiplied by its probability)
- In the above table, the expectation = (0 × ^{2}/_{10}) + (1 × ^{1}/_{10}) + (2 × ^{2}/_{10}) + (3 × ^{3}/_{10}) + (4 × ^{2}/_{10}) = 2.2
  - Therefore, the mean value is 2.20

**R4** Variance:

- **Variance** is a measure of spread and is the square of standard deviation
- It is represented with Var(*X*)
- From a probability distribution, it can be calculated with `Var(X) = E(X^{2}) - E(X)^{2}`
  - Another way of writing this is `Var(X) = E((X - µ)^{2})`
- For the above table:
  - E(X^{2}) = (0^{2} × ^{2}/_{10}) + (1^{2} × ^{1}/_{10}) + (2^{2} × ^{2}/_{10}) + (3^{2} × ^{3}/_{10}) + (4^{2} × ^{2}/_{10}) = 6.80
  - E(X)^{2} = 2.2 (calculated in the expectation section) squared = 4.84
  - Therefore, Var(X) = 6.80 - 4.84 = 1.96
  - Standard deviation is the root of this, 1.40 to 3 sig. figs.
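
The expectation and variance formulas can be sketched in Python as below. To keep the example separate from the table in these notes, it uses a different, hypothetical distribution (the number of heads in three tosses of a fair coin); `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical distribution: number of heads in three tosses of a fair coin
distribution = {
    0: Fraction(1, 8),
    1: Fraction(3, 8),
    2: Fraction(3, 8),
    3: Fraction(1, 8),
}

total_probability = sum(distribution.values())     # must be 1 for a valid distribution

expectation = sum(x * p for x, p in distribution.items())               # E(X)
expectation_of_square = sum(x ** 2 * p for x, p in distribution.items())  # E(X^2)
variance = expectation_of_square - expectation ** 2                     # Var(X)
std_dev = sqrt(variance)

print(total_probability, expectation, expectation_of_square, variance)
print(round(std_dev, 3))
```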

**H1 - H3** Binomial distributions:

- A **binomial distribution** is applicable with a fixed number of independent and repeated trials, each of which is a *success* (probability p) or *fail* (probability q)
  - Therefore `p + q = 1`
- If X has a binomial distribution B(n, p), this can be written as **X ~ B(n, p)**
- The number of trials (the sample size) is denoted by *n*
- `P(X = r) = ^{n}C_{r} p^{r} q^{n-r}`
- Example: *A coin is tossed ten times. What is the probability of it coming down heads five times and tails five times?*
  - p = 0.5 and q = 0.5 since it is a fair coin
  - n = 10 because there are ten tosses
  - Consider heads a *success* and tails a *failure*
  - r = 5 because this is the number of successes (heads) we are testing for
  - Therefore, the probability of 5 successes and 5 failures is equal to ^{10}C_{5} × 0.5^{5} × 0.5^{10-5} = 0.246
- Probabilities can be added. So for the above example of coin tosses, the probability of getting exactly 4 or 5 heads is equal to ^{10}C_{5} × 0.5^{5} × 0.5^{10-5} + ^{10}C_{4} × 0.5^{4} × 0.5^{10-4} = 0.451
- Remember that all probabilities will add to 1. This can sometimes be used to reduce the number of calculations - for example to find the probability of getting 0 to 9 heads, it is quicker to subtract the probability of 10 heads from 1 than to add all of the probabilities from 0 up to 9
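
The coin-toss calculations above can be reproduced with a small Python function. `binomial_pmf` is a name chosen for this sketch; `math.comb` (Python 3.8+) gives ^{n}C_{r}.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p), using nCr * p^r * q^(n - r) with q = 1 - p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


p_five_heads = binomial_pmf(5, 10, 0.5)
p_four_or_five = binomial_pmf(4, 10, 0.5) + binomial_pmf(5, 10, 0.5)

# Complement shortcut: P(0 to 9 heads) = 1 - P(10 heads)
p_zero_to_nine = 1 - binomial_pmf(10, 10, 0.5)

print(round(p_five_heads, 3), round(p_four_or_five, 3), round(p_zero_to_nine, 4))
```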

## Binomial probability tables:

- A **cumulative binomial table** shows P(X ≤ x) when X ~ B(n, p)
  - For example, if a die is rolled 8 times (n = 8), the probability of getting 0, 1, 2 or 3 sixes (p = ^{1}/_{6} and x = 3) can be read from the table as P(X ≤ 3)
  - Therefore it is equal to 0.9693
  - These tables start on page 12 of the exam formula booklet
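
A table entry can be checked by adding up the individual binomial probabilities. The sketch below does this for the die example; `binomial_cdf` is a hypothetical helper name, not a library function.

```python
from math import comb


def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p), summing P(X = r) for r = 0, 1, ..., x."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(x + 1))


# Die rolled 8 times, counting sixes: X ~ B(8, 1/6)
p_at_most_three_sixes = binomial_cdf(3, 8, 1 / 6)
print(round(p_at_most_three_sixes, 4))
```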

**H4, H5** n!:

- n! (n factorial) is used to calculate the number of arrangements of a set of objects/digits etc, without removing or adding any
- For example, there are 3! = 3 x 2 x 1 = 6 ways of arranging the letters A, B and C, only using each once

**H3** ^{n}C_{r}, combinations:

- ^{n}C_{r} is the number of ways to select r objects from n. For example, with the letters A, B, C and D, there are ^{4}C_{2} = 6 possible ways of selecting two of the four letters (AB, AC, AD, BC, BD, CD)
- The formula is ^{n}C_{r} = n! ÷ ((n - r)! × r!)
- **Combination** is used when order **does not** matter

^{n}P_{r} Permutations:

- **Permutations** are used when order does matter
- The formula is ^{n}P_{r} = n! ÷ (n - r)!
  - Therefore, for the example used in *combinations*, there would be 24 ÷ 2 = 12 permutations (AB, AC, AD, BC, BD, CD, BA, CA, DA, CB, DB, DC)
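
Factorials, combinations and permutations can all be counted, and listed, with Python's standard library. This is a sketch for the A, B, C, D example; `math.comb` and `math.perm` need Python 3.8 or later.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

letters = "ABCD"

arrangements_of_three = factorial(3)     # ways of arranging A, B and C
n_combinations = comb(4, 2)              # 4! / (2! x 2!), order does not matter
n_permutations = perm(4, 2)              # 4! / 2!, order matters

# List the selections themselves rather than just counting them
combination_list = ["".join(pair) for pair in combinations(letters, 2)]
permutation_list = ["".join(pair) for pair in permutations(letters, 2)]

print(arrangements_of_three, n_combinations, n_permutations)
print(combination_list)
print(permutation_list)
```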

**H6** mean = np:

- If X ~ B(n, p), `E(X) = np = number of trials × probability of success`
  - For example, if a fair coin is tossed twenty times, the expected number of heads is equal to n × p = 20 × 0.5 = 10
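
The result E(X) = np can be compared with the general definition of expectation, ∑rP(X = r). The sketch below does this for the twenty-toss example; the variable names are illustrative.

```python
from math import comb

n, p = 20, 0.5

expected_by_formula = n * p

# Same quantity from the definition: sum of r x P(X = r) over every possible r
expected_by_definition = sum(
    r * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)
)

print(expected_by_formula, round(expected_by_definition, 6))
```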

**TO DO: H7 - be able to calculate the expected frequencies of the various possible outcomes from a series of binomial trials**

**H8 - 13** Hypothesis testing:

- The **null hypothesis**, H_{0}, is the hypothesis which is being tested. It is assumed to be true unless the data give enough evidence against it
- The **alternative hypothesis**, H_{1}, is the hypothesis that is accepted if H_{0} is rejected
- The **significance level** is the probability threshold used to decide whether to reject the null hypothesis. For example, to test whether a coin is biased towards heads at a 5% significance level, the null hypothesis is that p = 0.5 and the alternative hypothesis is that p > 0.5 (heads is more likely). H_{0} is rejected if, assuming p = 0.5, the probability of getting at least as many heads as were observed is less than 5%
- The **critical region** is the set of values of the test statistic for which the null hypothesis is rejected
- The **acceptance region** is the opposite to the critical region - the set of values for which the null hypothesis is accepted
- The **critical value** is the value separating the regions of acceptance and rejection
- A **1-tail test** is a test where only one side is being tested for (e.g. a die is biased towards 1s)
- A **2-tail test** is a test for both sides (e.g. finding if a die is biased but without the side being stated)
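
To make the critical region concrete, the sketch below finds it for the biased-coin test, assuming the coin is tossed 20 times (the number of tosses is my assumption, it is not given in the notes). It treats the critical region as the values k with P(X ≥ k) strictly below the significance level; some texts allow equality.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(k, n + 1))


n, p_null, significance = 20, 0.5, 0.05   # H0: p = 0.5, H1: p > 0.5

# Smallest number of heads whose upper-tail probability falls below 5%
critical_value = next(
    k for k in range(n + 1) if upper_tail(k, n, p_null) < significance
)
critical_region = list(range(critical_value, n + 1))

print(critical_value, round(upper_tail(critical_value, n, p_null), 4))
print(critical_region)
```

Under these assumptions the coin would need to land on heads at least 15 times out of 20 before H_{0} is rejected.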

## Example question:

- A manufacturer produces titanium bicycle frames. The bicycle frames are tested before use and on average 5% of them are found to be faulty. A cheaper manufacturing process is introduced and the manufacturer wishes to check whether the proportion of faulty bicycle frames has increased. A random sample of 18 bicycle frames is selected and it is found that 4 of them are faulty. Carry out a hypothesis test at the 5% significance level to investigate whether the proportion of faulty bicycle frames has increased.
- **Let p =** the probability that a randomly selected frame is faulty
- **H_{0}**: p = 0.05
- **H_{1}**: p > 0.05
- Under H_{0}, the number of faulty frames in the sample is X ~ B(18, 0.05)
- P(X ≥ 4) = 1 - P(X ≤ 3) = 0.0109
- *Note: the sign in P(X ≥ 4) points in the same direction as the H_{1} sign. This is always the case*
- 0.0109 < 0.05 (*significance level = 0.05*) ∴ reject H_{0}
- *There is evidence to suggest that the proportion of faulty frames has increased.*
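
The probability used in the test above can be checked without tables. The Python sketch below recomputes P(X ≥ 4) for X ~ B(18, 0.05) and compares it with the significance level; the variable names are my own.

```python
from math import comb

n, p_null = 18, 0.05          # sample size and the value of p under H0
observed = 4                  # number of faulty frames found
significance = 0.05

# P(X <= 3): add up P(X = r) for r = 0, 1, 2, 3
p_at_most_three = sum(
    comb(n, r) * p_null ** r * (1 - p_null) ** (n - r) for r in range(observed)
)
p_at_least_four = 1 - p_at_most_three     # P(X >= 4)

reject_null = p_at_least_four < significance
print(round(p_at_least_four, 4), reject_null)
```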

**(8 marks)**

**Full marks solution:**