Statistics Terminologies
#statistics
Setup Environment¶
Python Package¶
In [ ]:
!pip install matplotlib
import pandas as pd
Test Data¶
In [ ]:
prices = pd.Series([8, 12, 11, 10, 12, 9, 13, 15, 20, 30], name="Price (Million)")
prices.plot.hist(bins=5)
Out[ ]:
<AxesSubplot:ylabel='Frequency'>
Percentile¶
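The $k$-th percentile $P_k$ is a value that splits the sorted data so that roughly the lowest $k\%$ of the data lie at or below it and the remaining $(100-k)\%$ lie at or above it:
$$
\text{lowest } k\% \text{ of data} \le P_k \le \text{highest } (100-k)\% \text{ of data}
$$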
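As a rough sketch of how such a value can be computed, the snippet below interpolates linearly between neighbouring sorted values, which matches the default `interpolation='linear'` of pandas' `Series.quantile`. The helper name `percentile_linear` is made up for illustration; other conventions for picking a percentile exist and give slightly different results on small samples.

```python
import math
import pandas as pd

prices = pd.Series([8, 12, 11, 10, 12, 9, 13, 15, 20, 30], name="Price (Million)")

def percentile_linear(values, k):
    """k-th percentile (0 <= k <= 100) via linear interpolation (illustrative helper)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * k / 100   # fractional index into the sorted data
    lo = math.floor(pos)                 # index of the neighbour below
    hi = math.ceil(pos)                  # index of the neighbour above
    frac = pos - lo                      # how far we are between the two neighbours
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac

p25 = percentile_linear(prices, 25)
p75 = percentile_linear(prices, 75)
print(f"manual P25 = {p25}, pandas P25 = {prices.quantile(0.25)}")
print(f"manual P75 = {p75}, pandas P75 = {prices.quantile(0.75)}")
```

For the test data the 25th percentile falls a quarter of the way between the third and fourth sorted values (10 and 11).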
Five-Number Summary¶
In [ ]:
q0 = prices.quantile(0)
q1 = prices.quantile(0.25)
q2 = prices.quantile(0.5)
q3 = prices.quantile(0.75)
q4 = prices.quantile(1)
print(f" 0th percentile = {q0:<5} (Minimum)")
print(f" 25th percentile = {q1:<5} (1st quartile / Q1)")
print(f" 50th percentile = {q2:<5} (2nd quartile / Q2 / Median)")
print(f" 75th percentile = {q3:<5} (3rd quartile / Q3)")
print(f"100th percentile = {q4:<5} (Maximum)")
  0th percentile = 8.0   (Minimum)
 25th percentile = 10.25 (1st quartile / Q1)
 50th percentile = 12.0  (2nd quartile / Q2 / Median)
 75th percentile = 14.5  (3rd quartile / Q3)
100th percentile = 30.0  (Maximum)
Box Plot¶
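A box plot illustrates the five-number summary. Note that matplotlib automatically draws outliers as separate points; by default these are values more than 1.5 × IQR below Q1 or above Q3, and the whiskers stop at the most extreme non-outlier values rather than at the minimum and maximum.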
In [ ]:
prices.plot.box()
Out[ ]:
<AxesSubplot:>
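To see which points end up drawn as outliers, here is a minimal sketch that recomputes the cut-offs by hand. It assumes matplotlib's default whisker setting (`whis=1.5`); the names `lower_fence` and `upper_fence` are only illustrative.

```python
import pandas as pd

prices = pd.Series([8, 12, 11, 10, 12, 9, 13, 15, 20, 30], name="Price (Million)")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Assumed rule (matplotlib default whis=1.5): points beyond these fences are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"outliers: {outliers.tolist()}")
```

With the test data only the value 30 falls outside the fences, so it is the single point plotted beyond the upper whisker.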
Interquartile Range (IQR)¶
The interquartile range (IQR) is defined as
$$
IQR = Q3 - Q1
$$
In [ ]:
print(f"IQR = Q3 - Q1 = {q3 - q1}")
IQR = Q3 - Q1 = 4.25
Mean and Standard Deviation¶
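For numbers $x_1, x_2, \ldots, x_n$, the mean (average) $\bar{x}$ is defined as
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i $$
and the standard deviation $\sigma_x$ as
$$ \sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} $$
Note that pandas' `Series.std()` defaults to `ddof=1`, i.e. it divides by $n-1$ (sample standard deviation); pass `ddof=0` to divide by $n$ as in the formula above.
In [ ]:
print(f"mean: {prices.mean()}")
print(f"stdev: {prices.std()}")
mean: 14.0
stdev: 6.5659052011974035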
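The divisor matters on a sample this small. The following sketch evaluates the definitions directly in plain Python and compares them with pandas, once dividing by $n$ (`ddof=0`) and once by $n-1$ (pandas' default `ddof=1`). The variable names are chosen here for illustration.

```python
import math
import pandas as pd

prices = pd.Series([8, 12, 11, 10, 12, 9, 13, 15, 20, 30], name="Price (Million)")

n = len(prices)
mean = sum(prices) / n
squared_dev = sum((x - mean) ** 2 for x in prices)   # sum of squared deviations

population_std = math.sqrt(squared_dev / n)          # divides by n, as in the formula
sample_std = math.sqrt(squared_dev / (n - 1))        # divides by n - 1

print(f"population std: {population_std} (pandas ddof=0: {prices.std(ddof=0)})")
print(f"sample std:     {sample_std} (pandas default: {prices.std()})")
```

The sample value is always the larger of the two, and the gap shrinks as $n$ grows.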
Measure with Mean & $\sigma$ or Median & IQR?¶
The mean and standard deviation are sensitive to extreme values. If that is a concern, use the median and interquartile range (IQR) instead.
In [ ]:
print(f"Median: {prices.median()}")
print(f"IQR: {prices.quantile(0.75) - prices.quantile(0.25)}")
Median: 12.0
IQR: 4.25
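A quick way to see this sensitivity is to exaggerate one value and compare the summaries. The sketch below uses a hypothetical variant of the test data in which the largest price, 30, is replaced by 300; the `summary` helper is made up for this illustration.

```python
import pandas as pd

prices = pd.Series([8, 12, 11, 10, 12, 9, 13, 15, 20, 30], name="Price (Million)")

# Hypothetical data: same prices, but the maximum (30) becomes 300
extreme = prices.replace(30, 300)

def summary(s):
    """Collect both pairs of location/spread measures for a Series."""
    return {
        "mean": s.mean(),
        "std": s.std(),
        "median": s.median(),
        "iqr": s.quantile(0.75) - s.quantile(0.25),
    }

before = summary(prices)
after = summary(extreme)
for key in before:
    print(f"{key:<6} before = {before[key]:<20} after = {after[key]}")
```

Here the mean and standard deviation shift substantially, while the median and IQR stay the same because the changed value lies above Q3 in both versions.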