Statistics and probability play a crucial role in analyzing data and making informed decisions. This guide covers essential statistical concepts, including measures of central tendency (mean, median, mode), data dispersion (range, standard deviation), and probability fundamentals. Understanding these concepts helps in interpreting data accurately, identifying patterns, and making predictions in various fields, from finance to machine learning. Whether you're a student or a professional, mastering these principles provides a strong foundation for data analysis and decision-making.
The mean, also known as the average, is a simple way to find the central value of a set of numbers. It helps us understand the overall trend of the data.
Mean = (Sum of all values) ÷ (Number of values)
Let's say we have the numbers [2, 4, 6, 8]. To find the mean:
(2 + 4 + 6 + 8) ÷ 4 = 20 ÷ 4 = 5
So, the mean of these numbers is 5.
The mean helps us understand the central tendency of data. It gives a single value that represents the overall dataset. For example, if we calculate the mean score of students in a class, we can quickly see the average performance of the class.
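To make this concrete, here is a minimal Python sketch of the same calculation, using only built-ins:

```python
# Compute the mean as the sum of values divided by their count.
values = [2, 4, 6, 8]
mean = sum(values) / len(values)
print(mean)  # 5.0
```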
The median is the middle value of a dataset when the numbers are arranged in order. It helps find the central point of data, especially when there are extreme values (outliers) that might distort the mean.
Consider these numbers: [3, 1, 7, 5, 9]
Step 1: Arrange them in ascending order → [1, 3, 5, 7, 9]
Step 2: The middle value is 5, so the median is 5.
Now, for an even-numbered dataset: [2, 4, 6, 8, 10, 12]
Step 1: Arrange them in order (already sorted).
Step 2: The two middle numbers are 6 and 8.
Step 3: Find the average: (6 + 8) ÷ 2 = 7
So, the median is 7.
The median is a great way to find the center of a dataset without being affected by very high or very low values. For example, if a few students in a class score extremely high or low, the median gives a better idea of the typical score than the mean.
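The two cases above (odd and even counts) can be captured in a short Python sketch:

```python
# Median: middle value of the sorted data, or the average of the
# two middle values when the count is even.
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 7, 5, 9]))       # 5
print(median([2, 4, 6, 8, 10, 12]))  # 7.0
```

(The standard library's statistics.median does the same thing.)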
The mode is the value that appears most frequently in a dataset. It’s useful for understanding the most common or popular value in a set of data.
Consider the numbers: [2, 4, 4, 6, 7, 8, 8, 8]
The number 8 appears three times, more than any other number. So, the mode is 8.
Now, consider these numbers: [1, 2, 2, 3, 3, 4, 5]
Here, both 2 and 3 appear twice, so this dataset has two modes: 2 and 3.
And for these numbers: [5, 7, 9, 10, 12]
Since no number repeats, there is no mode in this dataset.
The mode is great when you want to know which value appears most often. For example, if you’re looking at the most common shoe size in a store, the mode would tell you the most popular size.
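A small Python sketch that handles all three cases above (one mode, multiple modes, no mode):

```python
from collections import Counter

# Return every value tied for the highest frequency; an empty list
# means nothing repeats, so there is no mode.
def modes(values):
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [v for v, c in counts.items() if c == highest]

print(modes([2, 4, 4, 6, 7, 8, 8, 8]))  # [8]
print(modes([1, 2, 2, 3, 3, 4, 5]))     # [2, 3]
print(modes([5, 7, 9, 10, 12]))         # []
```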
Outliers are values in a dataset that are significantly different from most of the other values. These values can be much higher or much lower than the rest of the data. Outliers can impact the results of statistical analyses and are important to identify.
Consider the numbers: [1, 2, 3, 4, 5, 100]
The number 100 is much larger than the others, so it’s an outlier in this dataset.
Now, consider these numbers: [50, 52, 53, 51, 49, 200]
The number 200 is much higher than the rest of the values, so it’s an outlier.
Outliers are important to identify because they can distort results. For example, in a test where most students score between 50 and 80, a single score of 200 would pull the mean upward and make it unrepresentative of typical performance. In such cases, it's important to consider whether to remove or adjust outliers based on the context of the data.
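One common (though not the only) convention for flagging outliers is the 1.5 × IQR rule. The sketch below applies it to the first example; the 1.5 threshold is a modeling choice, not part of the definition of an outlier:

```python
import statistics

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where IQR = Q3 - Q1.
def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([1, 2, 3, 4, 5, 100]))  # [100]
```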
The range is a simple way to measure how spread out the values are in a dataset. It gives you an idea of the difference between the highest and lowest values.
Range = (Largest value) - (Smallest value)
Consider the numbers: [4, 7, 2, 9, 5]
Step 1: Find the largest value, which is 9.
Step 2: Find the smallest value, which is 2.
Step 3: Subtract the smallest value from the largest value: 9 - 2 = 7.
So, the range of this dataset is 7.
Now, consider the numbers: [15, 18, 25, 12, 30]
Step 1: The largest value is 30, and the smallest value is 12.
Step 2: Subtract 12 from 30: 30 - 12 = 18.
So, the range is 18.
The range is a quick way to get an idea of the spread or variability in your data. It tells you how wide the values are between the smallest and largest. For example, if you were measuring the temperature in a city over a week, a large range would suggest big temperature fluctuations, while a small range would indicate consistent temperatures.
Although the range is easy to calculate, it’s sensitive to outliers. A very high or low value can drastically change the range, even if it’s not typical of the data. So, while it’s helpful, it may not always give the full picture of how data is distributed.
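In Python, the range is a one-liner:

```python
# Range: largest value minus smallest value.
values = [15, 18, 25, 12, 30]
print(max(values) - min(values))  # 18
```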
Average deviation measures how far each value in a dataset is from the mean (average). It tells you how spread out or clustered the data points are around the mean. The smaller the average deviation, the more consistent the values are, and vice versa.
The formula for average deviation is:
\[ \text{Average Deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
Where \( x_i \) is each value in the dataset, \( \bar{x} \) is the mean of the dataset, and \( n \) is the number of values.
Consider the numbers: [3, 7, 5, 10]
Step 1: Find the mean (average) of the dataset:
\[ \bar{x} = \frac{3 + 7 + 5 + 10}{4} = \frac{25}{4} = 6.25 \]
Step 2: Find the absolute differences between each value and the mean:
\[ |3 - 6.25| = 3.25, \quad |7 - 6.25| = 0.75, \quad |5 - 6.25| = 1.25, \quad |10 - 6.25| = 3.75 \]
Step 3: Find the average of these absolute differences:
\[ \frac{3.25 + 0.75 + 1.25 + 3.75}{4} = \frac{9}{4} = 2.25 \]
So, the average deviation is 2.25.
Average deviation gives you an idea of how spread out the data points are around the mean. If the average deviation is small, it means the values are close to the mean, and if it's large, the values are more spread out. This can be useful for understanding how consistent or varied the data is in fields like quality control or financial analysis.
While average deviation is a helpful measure of spread, it doesn’t take into account the direction of the differences (whether they’re above or below the mean). Additionally, it might be less commonly used than other measures like standard deviation, which gives more insight into the data's variability.
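A short sketch of the calculation above:

```python
# Average deviation: mean of the absolute differences from the mean.
def average_deviation(values):
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

print(average_deviation([3, 7, 5, 10]))  # 2.25
```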
Absolute deviation measures how far a value in a dataset is from a reference point, usually the mean or median. It looks at the absolute differences, meaning it doesn't consider whether a value is above or below the reference point—it only measures how far the value is, regardless of direction.
The formula for absolute deviation is:
\[ \text{Absolute Deviation} = |x_i - \bar{x}| \]
Where \( x_i \) is a value in the dataset and \( \bar{x} \) is the reference point (here, the mean).
Consider the numbers [3, 7, 5, 10], using the mean (average) as the reference point.
Step 1: Find the mean of the dataset:
\[ \bar{x} = \frac{3 + 7 + 5 + 10}{4} = \frac{25}{4} = 6.25 \]
Step 2: Find the absolute deviation for each value by subtracting the mean and taking the absolute value:
\[ |3 - 6.25| = 3.25, \quad |7 - 6.25| = 0.75, \quad |5 - 6.25| = 1.25, \quad |10 - 6.25| = 3.75 \]
So, the absolute deviations for the dataset are: 3.25, 0.75, 1.25, and 3.75.
Absolute deviation is useful when you want to measure how spread out data points are around a reference point without considering direction. It helps in understanding the consistency of data, especially when you want to focus on the size of the deviation rather than whether the values are above or below the mean.
Absolute deviation is simple to calculate, but it doesn't provide as much insight into variability as other measures, like standard deviation, because it doesn't take into account the overall spread of the data. It also doesn't work as well when comparing data that is heavily influenced by extreme values or outliers.
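The per-value deviations above fall out of a single list comprehension:

```python
# Absolute deviation of each value from the mean.
values = [3, 7, 5, 10]
mean = sum(values) / len(values)
print([abs(x - mean) for x in values])  # [3.25, 0.75, 1.25, 3.75]
```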
Squared deviation is a measure of how far each data point is from the mean, but with a twist: it squares the differences between the data points and the mean. This gives more weight to larger deviations, making it useful for measuring variance and understanding how spread out the data is.
The formula for squared deviation is:
\[ \text{Squared Deviation} = (x_i - \bar{x})^2 \]
Where \( x_i \) is a value in the dataset and \( \bar{x} \) is the mean of the dataset.
Consider the numbers [4, 6, 8, 10]; we'll calculate the squared deviation of each value from the mean of the dataset.
Step 1: Find the mean (average) of the dataset:
\[ \bar{x} = \frac{4 + 6 + 8 + 10}{4} = \frac{28}{4} = 7 \]
Step 2: Find the squared deviation for each value by subtracting the mean and squaring the result:
\[ (4 - 7)^2 = 9, \quad (6 - 7)^2 = 1, \quad (8 - 7)^2 = 1, \quad (10 - 7)^2 = 9 \]
So, the squared deviations for the dataset are: 9, 1, 1, and 9.
Squared deviation is helpful because it makes large deviations stand out more. This is useful when we want to measure how spread out the data is, and it is a key part of calculating the variance and standard deviation, which are commonly used in statistics to understand the variability in a dataset.
While squared deviation is useful for emphasizing larger deviations, it can be sensitive to extreme values (outliers). Because it squares the differences, large deviations can have a disproportionately large effect on the final result. This is why squared deviation is often used in combination with other measures like variance or standard deviation to get a more complete picture of the data’s spread.
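The same list-comprehension pattern works for squared deviations:

```python
# Squared deviation of each value from the mean.
values = [4, 6, 8, 10]
mean = sum(values) / len(values)
print([(x - mean) ** 2 for x in values])  # [9.0, 1.0, 1.0, 9.0]
```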
Standard deviation is a measure of how spread out the values in a dataset are. It tells you how much each value deviates (or differs) from the mean (average) of the dataset. The larger the standard deviation, the more spread out the data points are. The smaller the standard deviation, the closer the data points are to the mean.
The formula for standard deviation is:
\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Where \( \sigma \) is the standard deviation, \( x_i \) is each value in the dataset, \( \bar{x} \) is the mean, and \( n \) is the number of values.
Consider the numbers: [2, 4, 6, 8, 10]. Let's calculate the standard deviation for this dataset.
Step 1: Find the mean (average) of the dataset:
\[ \bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6 \]
Step 2: Find the squared deviation for each value:
\[ (2 - 6)^2 = 16, \quad (4 - 6)^2 = 4, \quad (6 - 6)^2 = 0, \quad (8 - 6)^2 = 4, \quad (10 - 6)^2 = 16 \]
Step 3: Find the average of the squared deviations (variance):
\[ \text{Variance} = \frac{16 + 4 + 0 + 4 + 16}{5} = \frac{40}{5} = 8 \]
Step 4: Take the square root of the variance to find the standard deviation:
\[ \sigma = \sqrt{8} \approx 2.83 \]
So, the standard deviation of the dataset is approximately 2.83.
Standard deviation is widely used because it gives you a clear idea of how much variation there is in your data. In simple terms, it tells you whether the data points are mostly close to the mean (low standard deviation) or spread out over a wide range (high standard deviation). This is particularly helpful in fields like finance, science, and engineering where understanding the consistency or variability of data is important.
While standard deviation is a great measure of spread, it is sensitive to outliers (extreme values). A few extreme values can cause the standard deviation to be much higher than it would be otherwise. In these cases, other measures of spread, like interquartile range (IQR), may be more useful.
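Here is a sketch that follows the four steps above. It computes the population standard deviation (dividing by n), matching the formula in this section:

```python
import math

# Population standard deviation: square root of the mean squared deviation.
def std_dev(values):
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

print(std_dev([2, 4, 6, 8, 10]))  # ~2.83
```

The standard library's statistics.pstdev gives the same population figure; statistics.stdev divides by n - 1 instead (the sample version).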
Probability theory is all about understanding and predicting uncertainty. It helps us figure out the likelihood of different events happening, whether we're flipping a coin, rolling a die, or predicting the weather. It’s a fundamental area of mathematics that is widely used in statistics, machine learning, and decision-making when outcomes are uncertain.
Probability is a number between 0 and 1 that measures how likely an outcome is. Example: If we flip a fair coin, the probability of getting heads is 0.5, or 50%. This means that, on average, you'll get heads half of the time over many flips.
An event is a specific outcome (or set of outcomes) of an experiment. Example: If we roll a die, the event could be rolling a 6, which is one specific outcome from the roll.
The sample space is the set of all possible outcomes. Example: If we roll a six-sided die, the sample space is {1, 2, 3, 4, 5, 6}, because these are all the possible outcomes of a die roll.
Events are independent when the outcome of one does not affect the outcome of another. Example: Flipping a coin multiple times is an example of independent events. The outcome of one flip (e.g., heads or tails) does not affect the outcome of the next flip.
Conditional probability is the probability of an event given that another event has already occurred. Example: If you know it's cloudy outside, the probability of rain may be higher. The probability of rain, given that it's cloudy, is an example of conditional probability.
Probability theory forms the foundation of many fields like statistics, data science, and machine learning. It helps us model uncertainty, make predictions, and inform decision-making under uncertainty. Whether you're assessing the chances of a stock price going up or predicting the weather, probability plays a key role in understanding the world around us.
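A quick way to build intuition is to simulate. The sketch below estimates the probability of heads for a fair coin; the estimate should approach 0.5 as the number of flips grows:

```python
import random

# Simulate many fair coin flips and estimate P(heads) empirically.
flips = 100_000
heads = sum(random.random() < 0.5 for _ in range(flips))
print(heads / flips)  # close to 0.5
```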
A vector is an ordered list of numbers that can represent different things like positions, directions, or data points. It is a fundamental concept in mathematics, physics, and machine learning.
\[ \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} \]
This is called a 3-dimensional (3D) vector because it has three elements. If \( v = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \), then its magnitude is: \[ |v| = \sqrt{a^2 + b^2 + c^2} \]
Example: For \( v = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} \),
\[ |v| = \sqrt{2^2 + 3^2 + 5^2} = \sqrt{4 + 9 + 25} = \sqrt{38} \]
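With NumPy, the magnitude is a single call:

```python
import numpy as np

# Euclidean norm of the 3D vector from the example above.
v = np.array([2, 3, 5])
print(np.linalg.norm(v))  # sqrt(38) ≈ 6.164
```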
Vector addition is the process of combining two or more vectors to get a new vector. In machine learning, vectors are often used to represent data points, feature sets, or weights in models. Adding vectors helps in operations like updating model parameters and aggregating feature representations.
For two 2D vectors \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \) and \( B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \):
\[ A + B = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \end{bmatrix} \]
In machine learning, a data point can be represented as a vector of features. Suppose we have two feature vectors:
\[ X_1 = \begin{bmatrix} 1.2 \\ 3.5 \end{bmatrix}, \quad X_2 = \begin{bmatrix} 2.8 \\ -1.5 \end{bmatrix} \]
Find the combined feature representation:
\[ X_1 + X_2 = \begin{bmatrix} 1.2 + 2.8 \\ 3.5 + (-1.5) \end{bmatrix} = \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix} \]
Result: The new feature vector is \( \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix} \).
Word embeddings in NLP (Natural Language Processing) represent words as vectors. If:
\[ W_{\text{happy}} = \begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix}, \quad W_{\text{excited}} = \begin{bmatrix} 0.6 \\ 0.9 \end{bmatrix} \]
Find the combined representation:
\[ W_{\text{happy}} + W_{\text{excited}} = \begin{bmatrix} 0.5 + 0.6 \\ 0.7 + 0.9 \end{bmatrix} = \begin{bmatrix} 1.1 \\ 1.6 \end{bmatrix} \]
Result: The new vector \( \begin{bmatrix} 1.1 \\ 1.6 \end{bmatrix} \) represents a combined sentiment.
During training, weight updates in a neural network can be represented as vectors. If the weight update in two iterations is:
\[ W_{\text{update 1}} = \begin{bmatrix} -0.1 \\ 0.3 \end{bmatrix}, \quad W_{\text{update 2}} = \begin{bmatrix} 0.05 \\ -0.2 \end{bmatrix} \]
Find the net update:
\[ W_{\text{update 1}} + W_{\text{update 2}} = \begin{bmatrix} -0.1 + 0.05 \\ 0.3 + (-0.2) \end{bmatrix} = \begin{bmatrix} -0.05 \\ 0.1 \end{bmatrix} \]
Result: The overall weight update is \( \begin{bmatrix} -0.05 \\ 0.1 \end{bmatrix} \).
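All three additions above reduce to element-wise sums, which NumPy performs directly:

```python
import numpy as np

# Element-wise vector addition, as in the feature and weight-update examples.
x1, x2 = np.array([1.2, 3.5]), np.array([2.8, -1.5])
print(x1 + x2)  # [4. 2.]

w1, w2 = np.array([-0.1, 0.3]), np.array([0.05, -0.2])
print(w1 + w2)  # [-0.05  0.1]
```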
Vector subtraction is the process of finding the difference between two vectors. In machine learning, vector subtraction is used in feature scaling, computing error values, and comparing data points.
For two 2D vectors \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \) and \( B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \):
\[ A - B = \begin{bmatrix} a_1 - b_1 \\ a_2 - b_2 \end{bmatrix} \]
In machine learning, feature vectors can be compared using subtraction. Suppose we have two feature vectors:
\[ X_1 = \begin{bmatrix} 3.2 \\ 5.4 \end{bmatrix}, \quad X_2 = \begin{bmatrix} 1.8 \\ 2.9 \end{bmatrix} \]
Find the difference:
\[ X_1 - X_2 = \begin{bmatrix} 3.2 - 1.8 \\ 5.4 - 2.9 \end{bmatrix} = \begin{bmatrix} 1.4 \\ 2.5 \end{bmatrix} \]
Result: The difference vector is \( \begin{bmatrix} 1.4 \\ 2.5 \end{bmatrix} \), showing how the first feature vector differs from the second.
In machine learning, prediction errors are calculated using vector subtraction. Suppose:
\[ \text{Actual Output} = \begin{bmatrix} 4.5 \\ 3.2 \end{bmatrix}, \quad \text{Predicted Output} = \begin{bmatrix} 3.9 \\ 2.8 \end{bmatrix} \]
Find the error vector:
\[ E = \text{Actual Output} - \text{Predicted Output} = \begin{bmatrix} 4.5 - 3.9 \\ 3.2 - 2.8 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} \]
Result: The error vector \( \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} \) tells us how much the prediction deviates from the actual values.
In NLP, subtracting word embeddings can capture relationships between words. If:
\[ W_{\text{"king"}} = \begin{bmatrix} 1.2 \\ 2.3 \end{bmatrix}, \quad W_{\text{"man"}} = \begin{bmatrix} 0.8 \\ 1.5 \end{bmatrix} \]
Find the semantic difference:
\[ W_{\text{"king"}} - W_{\text{"man"}} = \begin{bmatrix} 1.2 - 0.8 \\ 2.3 - 1.5 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.8 \end{bmatrix} \]
Result: The difference vector \( \begin{bmatrix} 0.4 \\ 0.8 \end{bmatrix} \) represents the unique attributes of "king" compared to "man" (like power or royalty).
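The error-vector example translates directly to NumPy:

```python
import numpy as np

# Error vector: actual output minus predicted output.
actual = np.array([4.5, 3.2])
predicted = np.array([3.9, 2.8])
print(actual - predicted)  # [0.6 0.4]
```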
Scalar multiplication is the process of multiplying a vector by a single number (a scalar). This operation scales the vector by stretching or shrinking it while keeping its direction the same (or reversing it if multiplied by a negative scalar).
In machine learning, scalar multiplication is used in operations like scaling feature vectors, adjusting learning rates, and updating weights in optimization algorithms.
For a scalar \( k \) and a 2D vector \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \):
\[ kA = \begin{bmatrix} k \cdot a_1 \\ k \cdot a_2 \end{bmatrix} \]
Feature vectors often need to be scaled for better performance in ML models. Suppose we have:
\[ X = \begin{bmatrix} 2.5 \\ 4.0 \end{bmatrix} \]
Scale the vector by \( k = 0.5 \):
\[ 0.5 \cdot X = \begin{bmatrix} 0.5 \times 2.5 \\ 0.5 \times 4.0 \end{bmatrix} = \begin{bmatrix} 1.25 \\ 2.0 \end{bmatrix} \]
Result: The vector is reduced to half its original size.
Gradient descent updates weights using a learning rate \( \alpha \). If the gradient vector is:
\[ G = \begin{bmatrix} -0.2 \\ 0.5 \end{bmatrix} \]
And learning rate \( \alpha = 0.1 \), compute the weight update:
\[ \alpha G = \begin{bmatrix} 0.1 \times (-0.2) \\ 0.1 \times 0.5 \end{bmatrix} = \begin{bmatrix} -0.02 \\ 0.05 \end{bmatrix} \]
Result: The weight update is a smaller step in the direction of the gradient.
In NLP, reversing a vector can change its meaning. If the word embedding for "happy" is:
\[ W_{\text{happy}} = \begin{bmatrix} 0.3 \\ 0.7 \end{bmatrix} \]
Multiplying by \( k = -1 \) flips its direction:
\[ -1 \cdot W_{\text{happy}} = \begin{bmatrix} -0.3 \\ -0.7 \end{bmatrix} \]
Result: The flipped vector may now represent an opposite sentiment.
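The gradient-scaling example above looks like this in NumPy:

```python
import numpy as np

# Scale a gradient vector by the learning rate, as in gradient descent.
gradient = np.array([-0.2, 0.5])
alpha = 0.1
print(alpha * gradient)  # [-0.02  0.05]
```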
A matrix is a powerful mathematical tool that organizes numbers in a structured way. It's widely used in various fields like physics, computer science, machine learning, and graphics.
For example, here is a 2×3 matrix (2 rows, 3 columns):
\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]
Matrix addition is performed element-wise:
\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix} \]
Scalar multiplication multiplies every entry by the scalar:
\[ 2 \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 6 & 8 \end{bmatrix} \]
Matrix multiplication combines rows of the first matrix with columns of the second:
\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} (1\times5 + 2\times7) & (1\times6 + 2\times8) \\ (3\times5 + 4\times7) & (3\times6 + 4\times8) \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} \]
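All three operations are built into NumPy; note that @ (not *) performs true matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)  # element-wise addition: [[ 6  8] [10 12]]
print(2 * A)  # scalar multiplication: [[2 4] [6 8]]
print(A @ B)  # matrix multiplication: [[19 22] [43 50]]
```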
The transpose of a matrix is a fundamental operation in linear algebra, especially important in machine learning for matrix manipulations, such as feature transformation, data normalization, and more.
The transpose of a matrix involves swapping the rows and columns.
If \( A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \), then the transpose of \( A \) is denoted as \( A^T \) and is given by:
\[ A^T = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix} \]
So, the element at row \( i \) and column \( j \) in matrix \( A \) becomes the element at row \( j \) and column \( i \) in \( A^T \).
Example Problems
Given the matrix:
\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]
The transpose of \( A \), denoted as \( A^T \), is:
\[ A^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \]
Result: We swapped rows and columns of the original matrix.
Given the matrix:
\[ B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \]
The transpose of \( B \), denoted as \( B^T \), is:
\[ B^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} \]
Result: The 3x2 matrix becomes a 2x3 matrix after transposing.
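In NumPy, transposing is the .T attribute; the shape change from the example above is easy to verify:

```python
import numpy as np

# Transposing a 3x2 matrix yields a 2x3 matrix.
B = np.array([[1, 2], [3, 4], [5, 6]])
print(B.T)        # [[1 3 5] [2 4 6]]
print(B.T.shape)  # (2, 3)
```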
An identity matrix has 1s on the main diagonal and 0s elsewhere:
\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]
A diagonal matrix has non-zero entries only on the main diagonal:
\[ \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix} \]
A symmetric matrix equals its own transpose (\( A = A^T \)):
\[ \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix} \]
The inverse of a square matrix \( A \), written \( A^{-1} \), is the matrix that satisfies:
\[ A \times A^{-1} = I \]
If \( A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \), then \( A^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \).
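NumPy can both compute the inverse and confirm the defining property \( A \times A^{-1} = I \):

```python
import numpy as np

# Invert A and check that A @ A_inv recovers the identity matrix.
A = np.array([[2, 1], [1, 1]])
A_inv = np.linalg.inv(A)
print(A_inv)      # [[ 1. -1.] [-1.  2.]]
print(A @ A_inv)  # identity matrix (up to floating-point error)
```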
Here are some key supervised learning algorithms: linear regression, logistic regression, decision trees, support vector machines, and k-nearest neighbors.
Here are some key unsupervised learning algorithms: k-means clustering, hierarchical clustering, and principal component analysis (PCA).
Here are some key reinforcement learning algorithms: Q-learning, SARSA, and policy-gradient methods.
In supervised learning, it's like teaching with examples. The algorithm gets both the input and the correct output, so it knows what the right answer is. It learns by making connections between the two, so it can predict the correct answer for new data.

In unsupervised learning, there are no correct answers given. The algorithm just gets the input data and has to figure out patterns or groupings all on its own. It's like trying to make sense of a puzzle without knowing what the final picture looks like.

Finally, in reinforcement learning, the algorithm learns by doing. It takes actions, gets feedback in the form of rewards or penalties, and then adjusts to make better choices. It's like learning by trial and error, where it gets better the more it tries.

So, while supervised learning needs both input and output, unsupervised learning only has input and figures things out, and reinforcement learning learns by interacting and improving from feedback.