Fundamental Concepts of Statistics and Probability

Statistics and probability play a crucial role in analyzing data and making informed decisions. This guide covers essential statistical concepts, including measures of central tendency (mean, median, mode), data dispersion (range, standard deviation), and probability fundamentals. Understanding these concepts helps in interpreting data accurately, identifying patterns, and making predictions in various fields, from finance to machine learning. Whether you're a student or a professional, mastering these principles provides a strong foundation for data analysis and decision-making.

Mean (Average)

The mean, also known as the average, is a simple way to find the central value of a set of numbers. It helps us understand the overall trend of the data.

  • What it is: The mean is found by adding up all the values in a dataset and then dividing by the number of values.
  • Formula:

    Mean = (Sum of all values) ÷ (Number of values)

  • Example:

    Let's say we have the numbers [2, 4, 6, 8]. To find the mean:

    (2 + 4 + 6 + 8) ÷ 4 = 20 ÷ 4 = 5

    So, the mean of these numbers is 5.

  • Why it’s useful:

    The mean helps us understand the central tendency of data. It gives a single value that represents the overall dataset. For example, if we calculate the mean score of students in a class, we can quickly see the average performance of the class.
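
The calculation above can be sketched in a few lines of Python (the helper name `mean` is just illustrative):

```python
# Minimal sketch: compute the mean by summing the values
# and dividing by how many there are.
def mean(values):
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))  # → 5.0
```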

Median

The median is the middle value of a dataset when the numbers are arranged in order. It helps find the central point of data, especially when there are extreme values (outliers) that might distort the mean.

  • How to find it:
    • If the dataset has an odd number of values: The median is the middle value.
    • If the dataset has an even number of values: The median is the average of the two middle values.
  • Example:

    Consider these numbers: [3, 1, 7, 5, 9]

    Step 1: Arrange them in ascending order → [1, 3, 5, 7, 9]

    Step 2: The middle value is 5, so the median is 5.

    Now, for an even-numbered dataset: [2, 4, 6, 8, 10, 12]

    Step 1: Arrange them in order (already sorted).

    Step 2: The two middle numbers are 6 and 8.

    Step 3: Find the average: (6 + 8) ÷ 2 = 7

    So, the median is 7.

  • Why it’s useful:

    The median is a great way to find the center of a dataset without being affected by very high or very low values. For example, if a few students in a class score extremely high or low, the median gives a better idea of the typical score than the mean.
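
The two-case procedure above can be sketched in Python (the helper name `median` is illustrative):

```python
def median(values):
    # Sort the data, then take the middle value (odd count)
    # or the average of the two middle values (even count).
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 7, 5, 9]))       # → 5
print(median([2, 4, 6, 8, 10, 12]))  # → 7.0
```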

Mode

The mode is the value that appears most frequently in a dataset. It’s useful for understanding the most common or popular value in a set of data.

  • How to find it:
    • If one value appears more often than any other, it is the mode.
    • If several values are tied for the highest number of occurrences, the dataset has more than one mode (two modes make it bimodal; in general, such datasets are called multimodal).
    • If no value repeats, the dataset has no mode.
  • Example:

    Consider the numbers: [2, 4, 4, 6, 7, 8, 8, 8]

    The number 8 appears three times, more than any other number. So, the mode is 8.

    Now, consider these numbers: [1, 2, 2, 3, 3, 4, 5]

    Here, both 2 and 3 appear twice, so this dataset has two modes: 2 and 3.

    And for these numbers: [5, 7, 9, 10, 12]

    Since no number repeats, there is no mode in this dataset.

  • Why it’s useful:

    The mode is great when you want to know which value appears most often. For example, if you’re looking at the most common shoe size in a store, the mode would tell you the most popular size.
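
All three cases above (one mode, several modes, no mode) can be handled with a small sketch built on the standard library's `Counter` (the helper name `modes` is illustrative):

```python
from collections import Counter

def modes(values):
    # Count occurrences, then keep every value tied for the highest count.
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats → no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 4, 6, 7, 8, 8, 8]))  # → [8]
print(modes([1, 2, 2, 3, 3, 4, 5]))     # → [2, 3]
print(modes([5, 7, 9, 10, 12]))         # → []
```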

Outliers

Outliers are values in a dataset that are significantly different from most of the other values. These values can be much higher or much lower than the rest of the data. Outliers can impact the results of statistical analyses and are important to identify.

  • What it is: An outlier is a data point that is far away from the majority of other data points in a dataset.
  • How to identify outliers:
    • Outliers are usually much smaller or much larger than the rest of the data.
    • They can be identified visually in a graph or by using statistical methods like calculating the interquartile range (IQR).
    • For example, any value that is more than 1.5 times the IQR above the third quartile or below the first quartile could be considered an outlier.
  • Example:

    Consider the numbers: [1, 2, 3, 4, 5, 100]

    The number 100 is much larger than the others, so it’s an outlier in this dataset.

    Now, consider these numbers: [50, 52, 53, 51, 49, 200]

    The number 200 is much higher than the rest of the values, so it’s an outlier.

  • Why it’s important:

    Outliers are important to identify because they can distort results. For example, in a test where most students score between 50 and 80, a single student who scores 200 can pull the average score upward, making it unrepresentative of how the class typically performed. In such cases, it's important to consider whether to remove or adjust outliers based on the context of the data.

  • What to do with outliers:
    • Outliers can be removed or adjusted if they are errors or if they don't represent typical data.
    • However, sometimes outliers are real data points that need to be kept, especially if they provide valuable information.
  • How Do Outliers Impact Statistical Measures?
    • Outliers can distort statistical measures, making data appear different from reality.
    • Impact on Mean: Since the mean considers every value, an outlier can drag it up or down, making it unrepresentative of the majority of the data.
    • Impact on Standard Deviation: Outliers increase the spread of data, making the standard deviation larger than it would otherwise be. This gives the impression that the data is more spread out than it actually is.
    • In real-world analysis, outliers are often removed or adjusted to prevent misleading conclusions.
    • Example: Imagine five friends' pocket money per week:
      • ₹50, ₹60, ₹55, ₹65, ₹500 (one rich friend).
      • Mean without outlier (₹50, ₹60, ₹55, ₹65) → ₹57.5
      • Mean with outlier (₹50, ₹60, ₹55, ₹65, ₹500) → ₹146 (Much higher!)
      • Standard deviation also jumps up because ₹500 is too far from the rest.
      • Learning: Outliers can make a dataset misleading, making it seem higher or more spread out than it really is!
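
The 1.5×IQR rule and the pocket-money example above can be checked with the standard library's `statistics` module. This is a minimal sketch; note that `method='inclusive'` is just one of several common conventions for computing quartiles, and the helper name `iqr_outliers` is illustrative:

```python
import statistics

def iqr_outliers(values):
    # 1.5×IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

pocket_money = [50, 60, 55, 65, 500]
print(iqr_outliers(pocket_money))         # → [500]
print(statistics.mean(pocket_money[:4]))  # mean without the outlier → 57.5
print(statistics.mean(pocket_money))      # mean with the outlier → 146
```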

Range

The range is a simple way to measure how spread out the values are in a dataset. It gives you an idea of the difference between the highest and lowest values.

  • Formula:

    Range = (Largest value) - (Smallest value)

  • Example:

    Consider the numbers: [4, 7, 2, 9, 5]

    Step 1: Find the largest value, which is 9.

    Step 2: Find the smallest value, which is 2.

    Step 3: Subtract the smallest value from the largest value: 9 - 2 = 7.

    So, the range of this dataset is 7.

    Now, consider the numbers: [15, 18, 25, 12, 30]

    Step 1: The largest value is 30, and the smallest value is 12.

    Step 2: Subtract 12 from 30: 30 - 12 = 18.

    So, the range is 18.

  • Why it’s useful:

    The range is a quick way to get an idea of the spread or variability in your data. It tells you how wide the values are between the smallest and largest. For example, if you were measuring the temperature in a city over a week, a large range would suggest big temperature fluctuations, while a small range would indicate consistent temperatures.

  • Limitations:

    Although the range is easy to calculate, it’s sensitive to outliers. A very high or low value can drastically change the range, even if it’s not typical of the data. So, while it’s helpful, it may not always give the full picture of how data is distributed.
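
The range calculation above is a one-liner in Python (the helper name `value_range` is illustrative):

```python
def value_range(values):
    # Range = largest value minus smallest value.
    return max(values) - min(values)

print(value_range([4, 7, 2, 9, 5]))       # → 7
print(value_range([15, 18, 25, 12, 30]))  # → 18
```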

Average Deviation

Average deviation measures how far each value in a dataset is from the mean (average). It tells you how spread out or clustered the data points are around the mean. The smaller the average deviation, the more consistent the values are, and vice versa.

  • What it is: Average deviation is the average of the absolute differences between each data point and the mean of the dataset.
  • Formula:

    The formula for average deviation is:

    \[ \text{Average Deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]

    Where:

    • \(n\) is the number of data points.
    • \(x_i\) represents each individual data point.
    • \(\bar{x}\) is the mean (average) of the data.
    • \(|x_i - \bar{x}|\) is the absolute difference between each data point and the mean.

  • Example:

    Consider the numbers: [3, 7, 5, 10]

    Step 1: Find the mean (average) of the dataset:

    \[ \bar{x} = \frac{3 + 7 + 5 + 10}{4} = \frac{25}{4} = 6.25 \]

    Step 2: Find the absolute differences between each value and the mean:

    • |3 - 6.25| = 3.25
    • |7 - 6.25| = 0.75
    • |5 - 6.25| = 1.25
    • |10 - 6.25| = 3.75

    Step 3: Find the average of these absolute differences:

    \[ \frac{3.25 + 0.75 + 1.25 + 3.75}{4} = \frac{9}{4} = 2.25 \]

    So, the average deviation is 2.25.

  • Why it’s useful:

    Average deviation gives you an idea of how spread out the data points are around the mean. If the average deviation is small, it means the values are close to the mean, and if it's large, the values are more spread out. This can be useful for understanding how consistent or varied the data is in fields like quality control or financial analysis.

  • Limitations:

    While average deviation is an intuitive measure of spread, it discards the direction of each difference (whether a value sits above or below the mean), so it says nothing about which side of the mean the data leans toward. It is also less mathematically convenient than squared measures, which is why the standard deviation is used far more often in practice.
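
The three steps above (mean, absolute differences, average of those differences) can be sketched as (the helper name `average_deviation` is illustrative):

```python
def average_deviation(values):
    # Mean of the absolute differences between each value and the mean.
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

print(average_deviation([3, 7, 5, 10]))  # → 2.25
```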

Absolute Deviation

Absolute deviation measures how far a value in a dataset is from a reference point, usually the mean or median. It looks at the absolute differences, meaning it doesn't consider whether a value is above or below the reference point—it only measures how far the value is, regardless of direction.

  • What it is: Absolute deviation is the distance between each data point and a reference point (typically the mean or median) in a dataset. It's always a positive value since we ignore the direction of the deviation.
  • Formula:

    The formula for absolute deviation is:

    \[ \text{Absolute Deviation} = |x_i - \bar{x}| \]

    Where:

    • \(x_i\) represents each individual data point.
    • \(\bar{x}\) is the reference point, often the mean or median.
    • \(|x_i - \bar{x}|\) is the absolute difference between the data point and the reference point.

  • Example:

    Consider the numbers: [3, 7, 5, 10] and we’ll use the mean (average) as the reference point.

    Step 1: Find the mean of the dataset:

    \[ \bar{x} = \frac{3 + 7 + 5 + 10}{4} = \frac{25}{4} = 6.25 \]

    Step 2: Find the absolute deviation for each value by subtracting the mean and taking the absolute value:

    • |3 - 6.25| = 3.25
    • |7 - 6.25| = 0.75
    • |5 - 6.25| = 1.25
    • |10 - 6.25| = 3.75

    So, the absolute deviations for the dataset are: 3.25, 0.75, 1.25, and 3.75.

  • Why it’s useful:

    Absolute deviation is useful when you want to measure how spread out data points are around a reference point without considering direction. It helps in understanding the consistency of data, especially when you want to focus on the size of the deviation rather than whether the values are above or below the mean.

  • Limitations:

    Absolute deviation is simple to calculate, but a single absolute deviation describes only one data point; to summarize a whole dataset, the individual deviations still need to be aggregated (as in the average deviation). Absolute measures are also less mathematically convenient than the squared measures that underlie variance and standard deviation, which is why they are used less often in practice.

Squared Deviation

Squared deviation is a measure of how far each data point is from the mean, but with a twist: it squares the differences between the data points and the mean. This gives more weight to larger deviations, making it useful for measuring variance and understanding how spread out the data is.

  • What it is: Squared deviation is the square of the difference between each data point and the mean of the dataset. By squaring the differences, we emphasize larger deviations, which helps to highlight data points that are far away from the mean.
  • Formula:

    The formula for squared deviation is:

    \[ \text{Squared Deviation} = (x_i - \bar{x})^2 \]

    Where:

    • \(x_i\) represents each individual data point.
    • \(\bar{x}\) is the mean (average) of the dataset.
    • \((x_i - \bar{x})^2\) is the squared difference between the data point and the mean.

  • Example:

    Consider the numbers: [4, 6, 8, 10] and we’ll calculate the squared deviation based on the mean of the dataset.

    Step 1: Find the mean (average) of the dataset:

    \[ \bar{x} = \frac{4 + 6 + 8 + 10}{4} = \frac{28}{4} = 7 \]

    Step 2: Find the squared deviation for each value by subtracting the mean and squaring the result:

    • (4 - 7)² = (-3)² = 9
    • (6 - 7)² = (-1)² = 1
    • (8 - 7)² = (1)² = 1
    • (10 - 7)² = (3)² = 9

    So, the squared deviations for the dataset are: 9, 1, 1, and 9.

  • Why it’s useful:

    Squared deviation is helpful because it makes large deviations stand out more. This is useful when we want to measure how spread out the data is, and it is a key part of calculating the variance and standard deviation, which are commonly used in statistics to understand the variability in a dataset.

  • Limitations:

    While squared deviation is useful for emphasizing larger deviations, it can be sensitive to extreme values (outliers). Because it squares the differences, large deviations can have a disproportionately large effect on the final result. This is why squared deviation is often used in combination with other measures like variance or standard deviation to get a more complete picture of the data’s spread.

Standard Deviation

Standard deviation is a measure of how spread out the values in a dataset are. It tells you how much each value deviates (or differs) from the mean (average) of the dataset. The larger the standard deviation, the more spread out the data points are. The smaller the standard deviation, the closer the data points are to the mean.

  • What it is: Standard deviation is the square root of the average of the squared deviations from the mean. It’s a common measure used in statistics to express the variability or dispersion of a dataset.
  • Formula:

    The formula for standard deviation is:

    \[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

    Where:

    • \(n\) is the number of data points.
    • \(x_i\) represents each individual data point.
    • \(\bar{x}\) is the mean (average) of the dataset.
    • \((x_i - \bar{x})^2\) is the squared deviation for each data point.
    • \(\sigma\) is the standard deviation.

  • Example:

    Consider the numbers: [2, 4, 6, 8, 10]. Let's calculate the standard deviation for this dataset.

    Step 1: Find the mean (average) of the dataset:

    \[ \bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6 \]

    Step 2: Find the squared deviation for each value:

    • (2 - 6)² = (-4)² = 16
    • (4 - 6)² = (-2)² = 4
    • (6 - 6)² = (0)² = 0
    • (8 - 6)² = (2)² = 4
    • (10 - 6)² = (4)² = 16

    Step 3: Find the average of the squared deviations (variance):

    \[ \text{Variance} = \frac{16 + 4 + 0 + 4 + 16}{5} = \frac{40}{5} = 8 \]

    Step 4: Take the square root of the variance to find the standard deviation:

    \[ \sigma = \sqrt{8} \approx 2.83 \]

    So, the standard deviation of the dataset is approximately 2.83.

  • Why it’s useful:

    Standard deviation is widely used because it gives you a clear idea of how much variation there is in your data. In simple terms, it tells you whether the data points are mostly close to the mean (low standard deviation) or spread out over a wide range (high standard deviation). This is particularly helpful in fields like finance, science, and engineering where understanding the consistency or variability of data is important.

  • Limitations:

    While standard deviation is a great measure of spread, it is sensitive to outliers (extreme values). A few extreme values can cause the standard deviation to be much higher than it would be otherwise. In these cases, other measures of spread, like interquartile range (IQR), may be more useful.
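
The four steps above can be sketched in Python (the helper name `population_std_dev` is illustrative; it uses the population formula, dividing by \(n\)):

```python
import math

def population_std_dev(values):
    # Square root of the mean of the squared deviations (population formula).
    m = sum(values) / len(values)
    variance = sum((x - m) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

print(round(population_std_dev([2, 4, 6, 8, 10]), 2))  # → 2.83
```

The standard library's `statistics.pstdev` computes the same population value; `statistics.stdev` instead uses the sample formula, which divides by \(n - 1\).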

Probability Theory

Probability theory is all about understanding and predicting uncertainty. It helps us figure out the likelihood of different events happening, whether we're flipping a coin, rolling a die, or predicting the weather. It’s a fundamental area of mathematics that is widely used in statistics, machine learning, and decision-making when outcomes are uncertain.

  • What it is: Probability theory studies how likely events are to happen. It involves calculating the chance of different outcomes based on certain conditions or assumptions. By using probability, we can quantify uncertainty and make informed predictions.
  • Key Concepts:
    1. Probability: A number between 0 and 1 that represents how likely an event is to happen. A probability of 0 means the event will never happen, and a probability of 1 means the event will definitely happen.
       • Example: If we flip a fair coin, the probability of getting heads is 0.5, or 50%. This means that, on average, you’ll get heads half of the time over many flips.

    2. Event: A specific outcome or set of outcomes from an experiment or trial. Events can be simple or complex, depending on what we are looking for.
       • Example: If we roll a die, the event could be rolling a 6, which is one specific outcome of the roll.

    3. Sample Space: The set of all possible outcomes of an experiment. It includes every potential result you could get.
       • Example: If we roll a six-sided die, the sample space is {1, 2, 3, 4, 5, 6}, because these are all the possible outcomes of a die roll.

    4. Independent Events: Two events are independent if the outcome of one does not affect the outcome of the other. The probability of independent events occurring together is the product of their individual probabilities.
       • Example: Repeated coin flips are independent events: the outcome of one flip (heads or tails) does not affect the outcome of the next.

    5. Conditional Probability: The probability of an event occurring given that another event has already occurred. It’s useful when you want to know the chance of something happening under certain conditions.
       • Example: If you know it’s cloudy outside, the probability of rain may be higher. The probability of rain, given that it’s cloudy, is an example of conditional probability.

  • Why it’s useful:

    Probability theory forms the foundation of many fields like statistics, data science, and machine learning. It helps us model uncertainty, make predictions, and inform decision-making under uncertainty. Whether you're assessing the chances of a stock price going up or predicting the weather, probability plays a key role in understanding the world around us.
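
A quick way to build intuition for these concepts is simulation. The sketch below, using only the standard library (the seed value is arbitrary, chosen just to make the run repeatable), estimates the probability of heads empirically and reads one probability off a die's sample space:

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

# Estimate P(heads) for a fair coin by simulating many flips;
# the observed frequency should be close to the true probability, 0.5.
flips = [random.choice("HT") for _ in range(100_000)]
print(flips.count("H") / len(flips))  # ≈ 0.5

# The sample space of a six-sided die, and P(rolling a 6).
sample_space = {1, 2, 3, 4, 5, 6}
print(1 / len(sample_space))  # ≈ 0.167
```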

Review of Linear Algebra

Vectors

A vector is an ordered list of numbers that can represent different things like positions, directions, or data points. It is a fundamental concept in mathematics, physics, and machine learning.

  • Example: The vector below has three numbers:

    \[ \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} \]

    This is called a 3-dimensional (3D) vector because it has three elements.
  • Key Properties of Vectors:
    • Dimension: The number of elements in a vector determines its dimension.
      • \(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\) → 2D vector (2 elements).
      • \(\begin{bmatrix} 4 \\ 7 \\ 9 \end{bmatrix}\) → 3D vector (3 elements).
    • Magnitude (Length): The size of the vector, found using the Pythagorean theorem:

      If \( v = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \), then its magnitude is: \[ |v| = \sqrt{a^2 + b^2 + c^2} \]

      Example: For \( v = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} \),
      \[ |v| = \sqrt{2^2 + 3^2 + 5^2} = \sqrt{4 + 9 + 25} = \sqrt{38} \]

    • Direction: A vector has a direction, meaning where it "points" in space. For example:
      • \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\) points along the x-axis.
      • \(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\) points along the y-axis.
  • Uses of Vectors:
    • Representing data points:
      • For example, in real estate, a vector could represent a house with features like:
        \[ \text{House} = \begin{bmatrix} \text{Size (sq. ft.)} \\ \text{Price} \\ \text{Location (latitude, longitude)} \end{bmatrix} \]
    • Representing directions and forces:
      • In physics, vectors represent forces, velocity, or acceleration.
      • For example, a wind blowing east at 10 m/s can be represented as: \[ \begin{bmatrix} 10 \\ 0 \end{bmatrix} \] (10 m/s in the x-direction, 0 m/s in the y-direction).
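
The magnitude formula above can be checked with a few lines of Python (the helper name `magnitude` is illustrative):

```python
import math

def magnitude(v):
    # |v| = square root of the sum of squared components
    # (the Pythagorean theorem extended to n dimensions).
    return math.sqrt(sum(x * x for x in v))

print(magnitude([2, 3, 5]))  # → sqrt(38) ≈ 6.16
```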

Vector Operations

Vector Addition

Vector addition is the process of combining two or more vectors to get a new vector. In machine learning, vectors are often used to represent data points, feature sets, or weights in models. Adding vectors helps in operations like updating model parameters and aggregating feature representations.

  • Methods of Vector Addition:
    • Component-wise Addition (Algebraic Method)
      • Add the corresponding components of the vectors.
      • If \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \) and \( B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \), then:

        \[ A + B = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \end{bmatrix} \]

    • Graphical Interpretation:
      • Each vector can be visualized as an arrow in space.
      • Adding vectors means shifting one vector so that it starts where the other ends.
      • The resultant vector represents the overall change when combining both.

Example Problems

  • Example 1: Adding Feature Vectors

    In machine learning, a data point can be represented as a vector of features. Suppose we have two feature vectors:

    \[ X_1 = \begin{bmatrix} 1.2 \\ 3.5 \end{bmatrix}, \quad X_2 = \begin{bmatrix} 2.8 \\ -1.5 \end{bmatrix} \]

    Find the combined feature representation:

    \[ X_1 + X_2 = \begin{bmatrix} 1.2 + 2.8 \\ 3.5 + (-1.5) \end{bmatrix} = \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix} \]

    Result: The new feature vector is \( \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix} \).

  • Example 2: Combining Word Embeddings

    Word embeddings in NLP (Natural Language Processing) represent words as vectors. If:

    \[ W_{\text{happy}} = \begin{bmatrix} 0.5 \\ 0.7 \end{bmatrix}, \quad W_{\text{excited}} = \begin{bmatrix} 0.6 \\ 0.9 \end{bmatrix} \]

    Find the combined representation:

    \[ W_{\text{happy}} + W_{\text{excited}} = \begin{bmatrix} 0.5 + 0.6 \\ 0.7 + 0.9 \end{bmatrix} = \begin{bmatrix} 1.1 \\ 1.6 \end{bmatrix} \]

    Result: The new vector \( \begin{bmatrix} 1.1 \\ 1.6 \end{bmatrix} \) represents a combined sentiment.

  • Example 3: Summing Weight Updates in a Model

    During training, weight updates in a neural network can be represented as vectors. If the weight update in two iterations is:

    \[ W_{\text{update 1}} = \begin{bmatrix} -0.1 \\ 0.3 \end{bmatrix}, \quad W_{\text{update 2}} = \begin{bmatrix} 0.05 \\ -0.2 \end{bmatrix} \]

    Find the net update:

    \[ W_{\text{update 1}} + W_{\text{update 2}} = \begin{bmatrix} -0.1 + 0.05 \\ 0.3 + (-0.2) \end{bmatrix} = \begin{bmatrix} -0.05 \\ 0.1 \end{bmatrix} \]

    Result: The overall weight update is \( \begin{bmatrix} -0.05 \\ 0.1 \end{bmatrix} \).
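
Component-wise addition, as used in Example 1 above, can be sketched as (the helper name `vector_add` is illustrative):

```python
def vector_add(a, b):
    # Component-wise addition: add matching components
    # of two vectors with the same dimension.
    assert len(a) == len(b), "vectors must have the same dimension"
    return [x + y for x, y in zip(a, b)]

print(vector_add([1.2, 3.5], [2.8, -1.5]))  # → [4.0, 2.0]
```

In practice, libraries such as NumPy perform the same element-wise addition directly on array objects.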

Vector Subtraction

Vector subtraction is the process of finding the difference between two vectors. In machine learning, vector subtraction is used in feature scaling, computing error values, and comparing data points.

  • Component-wise Subtraction (Algebraic Method):
    • Subtract corresponding components of the vectors.
    • If \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \) and \( B = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \), then:

      \[ A - B = \begin{bmatrix} a_1 - b_1 \\ a_2 - b_2 \end{bmatrix} \]

  • Graphical Interpretation:
    • Subtracting a vector is the same as adding its negative.
    • If \( B \) is subtracted from \( A \), we reverse the direction of \( B \) and then add it to \( A \).

Example Problems

  • Example 1: Subtracting Feature Vectors

    In machine learning, feature vectors can be compared using subtraction. Suppose we have two feature vectors:

    \[ X_1 = \begin{bmatrix} 3.2 \\ 5.4 \end{bmatrix}, \quad X_2 = \begin{bmatrix} 1.8 \\ 2.9 \end{bmatrix} \]

    Find the difference:

    \[ X_1 - X_2 = \begin{bmatrix} 3.2 - 1.8 \\ 5.4 - 2.9 \end{bmatrix} = \begin{bmatrix} 1.4 \\ 2.5 \end{bmatrix} \]

    Result: The difference vector is \( \begin{bmatrix} 1.4 \\ 2.5 \end{bmatrix} \), showing how the first feature vector differs from the second.

  • Example 2: Computing Error in Predictions

    In machine learning, prediction errors are calculated using vector subtraction. Suppose:

    \[ \text{Actual Output} = \begin{bmatrix} 4.5 \\ 3.2 \end{bmatrix}, \quad \text{Predicted Output} = \begin{bmatrix} 3.9 \\ 2.8 \end{bmatrix} \]

    Find the error vector:
    E = Actual Output - Predicted Output

    \[ = \begin{bmatrix} 4.5 - 3.9 \\ 3.2 - 2.8 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} \]

    Result: The error vector \( \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix} \) tells us how much the prediction deviates from the actual values.

  • Example 3: Subtracting Word Embeddings for Semantic Difference

    In NLP, subtracting word embeddings can capture relationships between words. If:

    \[ W_{\text{"king"}} = \begin{bmatrix} 1.2 \\ 2.3 \end{bmatrix}, \quad W_{\text{"man"}} = \begin{bmatrix} 0.8 \\ 1.5 \end{bmatrix} \]

    Find the semantic difference:

    \[ W_{\text{"king"}} - W_{\text{"man"}} = \begin{bmatrix} 1.2 - 0.8 \\ 2.3 - 1.5 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.8 \end{bmatrix} \]

    Result: The difference vector \( \begin{bmatrix} 0.4 \\ 0.8 \end{bmatrix} \) represents the unique attributes of "king" compared to "man" (like power or royalty).

Scalar Multiplication of Vectors

Scalar multiplication is the process of multiplying a vector by a single number (a scalar). This operation scales the vector by stretching or shrinking it while keeping its direction the same (or reversing it if multiplied by a negative scalar).

In machine learning, scalar multiplication is used in operations like scaling feature vectors, adjusting learning rates, and updating weights in optimization algorithms.

  • Component-wise Multiplication:
    • Each element of the vector is multiplied by the scalar.
    • If \( A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \) and scalar \( k \), then:

      \[ kA = \begin{bmatrix} k \cdot a_1 \\ k \cdot a_2 \end{bmatrix} \]

  • Effects of Scalar Multiplication:
    • If \( k > 1 \), the vector stretches (gets longer).
    • If \( 0 < k < 1 \), the vector shrinks (gets shorter).
    • If \( k < 0 \), the vector flips direction.

Example Problems

  • Example 1: Scaling a Feature Vector

    Feature vectors often need to be scaled for better performance in ML models. Suppose we have:

    \[ X = \begin{bmatrix} 2.5 \\ 4.0 \end{bmatrix} \]

    Scale the vector by \( k = 0.5 \):

    \[ 0.5 \cdot X = \begin{bmatrix} 0.5 \times 2.5 \\ 0.5 \times 4.0 \end{bmatrix} = \begin{bmatrix} 1.25 \\ 2.0 \end{bmatrix} \]

    Result: The vector is reduced to half its original size.

  • Example 2: Adjusting Learning Rate in Gradient Descent

    Gradient descent updates weights using a learning rate \( \alpha \). If the gradient vector is:

    \[ G = \begin{bmatrix} -0.2 \\ 0.5 \end{bmatrix} \]

    And learning rate \( \alpha = 0.1 \), compute the weight update:

    \[ \alpha G = \begin{bmatrix} 0.1 \times (-0.2) \\ 0.1 \times 0.5 \end{bmatrix} = \begin{bmatrix} -0.02 \\ 0.05 \end{bmatrix} \]

    Result: The weight update is a smaller step in the direction of the gradient.

  • Example 3: Flipping a Word Embedding

    In NLP, reversing a vector can change its meaning. If the word embedding for "happy" is:

    \[ W_{\text{happy}} = \begin{bmatrix} 0.3 \\ 0.7 \end{bmatrix} \]

    Multiplying by \( k = -1 \) flips its direction:

    \[ -1 \cdot W_{\text{happy}} = \begin{bmatrix} -0.3 \\ -0.7 \end{bmatrix} \]

    Result: The flipped vector may now represent an opposite sentiment.
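
Scalar multiplication, including the shrinking and flipping effects described above, can be sketched as (the helper name `scale` is illustrative):

```python
def scale(k, v):
    # Multiply every component of the vector by the scalar k.
    return [k * x for x in v]

print(scale(0.5, [2.5, 4.0]))  # → [1.25, 2.0]  (vector shrinks)
print(scale(-1, [0.3, 0.7]))   # → [-0.3, -0.7] (vector flips direction)
```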

Matrices

A matrix is a powerful mathematical tool that organizes numbers in a structured way. It's widely used in various fields like physics, computer science, machine learning, and graphics.

  • What is a Matrix?
    • A matrix is a rectangular grid of numbers arranged in rows and columns. It is like a table of numbers where each number has a specific position.
    • Example: The following is a 2 × 3 matrix (2 rows, 3 columns):
    • \[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]

  • Key Properties:
    • Shape: The number of rows and columns. A matrix with \(m\) rows and \(n\) columns is called an m × n matrix.
    • Elements: Each number in a matrix is called an element. For example, in the matrix above, 5 is the element in the 2nd row and 2nd column.
  • Uses of Matrices:
    • Representing datasets (e.g., each row is a data point, and each column is a feature).
    • Used in graphics for transformations like rotating, scaling, or translating objects.
  • Matrix Operations: Matrices can be manipulated using various operations:
    • Addition and Subtraction:
      • We can add or subtract two matrices element by element, but they must have the same shape.
      • Example:
      • \[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix} \]

    • Scalar Multiplication:
      • When a matrix is multiplied by a single number (scalar), each element of the matrix is multiplied by that number.
      • Example:
      • \[ 2 \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 6 & 8 \end{bmatrix} \]

    • Matrix Multiplication:
      • To multiply two matrices, the number of columns in the first matrix must match the number of rows in the second matrix.
      • Example:
      • \[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} (1\times5 + 2\times7) & (1\times6 + 2\times8) \\ (3\times5 + 4\times7) & (3\times6 + 4\times8) \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} \]
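
The addition and multiplication rules above can be sketched with plain nested lists (the helper names `mat_add` and `mat_mul` are illustrative; libraries like NumPy provide these operations directly):

```python
def mat_add(A, B):
    # Element-wise sum; the matrices must have the same shape.
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    # Entry (i, j) of the product is the dot product of
    # row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(mat_add(A, B))  # → [[6, 8], [10, 12]]
print(mat_mul(A, B))  # → [[19, 22], [43, 50]]
```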

Transpose of a Matrix

The transpose of a matrix is a fundamental operation in linear algebra, especially important in machine learning for matrix manipulations, such as feature transformation, data normalization, and more.

The transpose of a matrix involves swapping the rows and columns.

  • Definition:

    If \( A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \), then the transpose of \( A \) is denoted as \( A^T \) and is given by:

    \[ A^T = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix} \]

    So, the element at row \( i \) and column \( j \) in matrix \( A \) becomes the element at row \( j \) and column \( i \) in \( A^T \).

  • Properties of Transpose:
    • Transpose of a transpose: \[ (A^T)^T = A \]
    • Transpose of a sum of matrices: \[ (A + B)^T = A^T + B^T \]
    • Transpose of a product of matrices: \[ (A \cdot B)^T = B^T \cdot A^T \]
    • Transpose of a scalar multiplied matrix: \[ (kA)^T = kA^T \]

Example Problems

  • Example 1: Transpose of a 2x2 Matrix

    Given the matrix:

    \[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \]

    The transpose of \( A \), denoted as \( A^T \), is:

    \[ A^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \]

    Result: We swapped rows and columns of the original matrix.

  • Example 2: Transpose of a 3x2 Matrix

    Given the matrix:

    \[ B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \]

    The transpose of \( B \), denoted as \( B^T \), is:

    \[ B^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} \]

    Result: The 3x2 matrix becomes a 2x3 matrix after transposing.
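Both examples, along with the product property from the list above, can be checked in NumPy, where `.T` gives the transpose:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 2], [3, 4], [5, 6]])

print(A.T)         # [[1 3] [2 4]]
print(B.T.shape)   # (2, 3): the 3x2 matrix becomes 2x3

# Property check: the transpose of a product reverses the order
C = np.array([[5, 6], [7, 8]])
print(np.array_equal((A @ C).T, C.T @ A.T))   # True
```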

Special Matrices:

  • Identity Matrix: A square matrix with 1s on the diagonal and 0s elsewhere.
  • \[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

  • Diagonal Matrix: A square matrix in which all off-diagonal elements are zero.
  • \[ \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix} \]

  • Symmetric Matrix: A matrix that is equal to its transpose.
  • \[ \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix} \]
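NumPy has helpers for the first two of these special matrices, and symmetry can be tested directly against the transpose, as a quick sketch shows:

```python
import numpy as np

I = np.eye(3, dtype=int)     # identity: 1s on the diagonal, 0s elsewhere
D = np.diag([2, 3, 5])       # diagonal: off-diagonal entries are zero
S = np.array([[1, 2, 3],
              [2, 4, 5],
              [3, 5, 6]])    # symmetric: equal to its own transpose

print(np.array_equal(S, S.T))   # True
```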

Matrix Inverse:

  • The inverse of a matrix \( A \), denoted as \( A^{-1} \), satisfies:
  • \[ A \times A^{-1} = A^{-1} \times A = I \] where \( I \) is the identity matrix.

  • Example:
  • If \[ A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \] then \[ A^{-1} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix} \]

  • Note: Not all matrices have an inverse. A matrix must be square and have a non-zero determinant to have an inverse.
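The example above can be reproduced with NumPy's `np.linalg.inv`, which also raises an error for singular (zero-determinant) matrices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_inv = np.linalg.inv(A)

print(A_inv)                              # ≈ [[ 1. -1.] [-1.  2.]]
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# A singular matrix (determinant 0) has no inverse:
# np.linalg.inv(np.array([[1., 2.], [2., 4.]])) raises LinAlgError
```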

Introduction to Machine Learning

1. Supervised Learning

  • Supervised learning is a machine learning approach where the algorithm is trained using labeled data. This means that for every input, the correct output is already known, allowing the model to learn patterns and relationships.
  • The process starts with feeding the algorithm structured data, where inputs and their corresponding outputs are provided. Over time, it analyzes this data and learns to make accurate predictions on new, unseen inputs.
  • This method is widely used in various applications, such as:
    • Determining property prices based on factors like location, size, and amenities.
    • Detecting spam emails by analyzing patterns in subject lines and content.
    • Interpreting handwritten numbers in banking or postal systems.
  • Since the model learns from labeled examples, its accuracy improves as more data is provided, making it a powerful tool for predictive tasks.

Here are some key supervised learning algorithms:

  • Linear Regression:
    • What it does: It is used to predict continuous values, such as house prices or temperatures.
    • How it works: It tries to fit the best possible straight line through the given data points, minimizing the difference between predicted and actual values.
    • Example: Suppose we want to predict house prices based on the size of the house. Linear regression will find a line that best represents how house size affects price.
  • Logistic Regression:
    • What it does: It is used for classification problems where the output is either one category or another (e.g., Yes/No, Spam/Not Spam).
    • How it works: Instead of a straight line, it uses a curve (logistic function) that squashes values between 0 and 1, representing probabilities.
    • Example: Email spam detection – logistic regression helps decide whether an email is spam (1) or not spam (0) based on features like the number of links, special characters, and sender information.
  • Decision Trees:
    • What it does: It makes predictions by breaking the data into smaller and smaller groups based on certain conditions.
    • How it works: Imagine a flowchart where each question leads to a different path. The tree splits the data at each step based on the most important features, eventually leading to a final decision.
    • Example: Predicting whether a customer will buy a product based on age, income, and browsing history. The tree might start by checking age, then income, and so on.
  • Support Vector Machines (SVM):
    • What it does: It is mainly used for classification, finding the best possible boundary between different categories.
    • How it works: Imagine drawing a line that best separates two groups of data points. SVM finds the "widest possible street" that keeps the groups apart.
    • Example: Classifying images of cats and dogs. The algorithm finds the best dividing line (or curve) that separates cat images from dog images.
  • Neural Networks:
    • What it does: It is inspired by how the human brain works and is used to model complex relationships between inputs and outputs.
    • How it works: It consists of layers of "neurons" that process information and adjust their connections based on errors.
    • Example: Recognizing handwritten digits – the network learns from thousands of digit images and then can identify new handwritten numbers.
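As a sketch of the linear regression idea above, the snippet below fits a line to a few hypothetical house sizes and prices (the numbers are made up for illustration) using NumPy least squares:

```python
import numpy as np

# Hypothetical data: house sizes (sq. m) and prices (thousands)
sizes = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
prices = np.array([150.0, 200.0, 260.0, 310.0, 370.0])

# Fit price = slope * size + intercept via least squares
X = np.column_stack([sizes, np.ones_like(sizes)])
slope, intercept = np.linalg.lstsq(X, prices, rcond=None)[0]

# Predict the price of an unseen 100 sq. m house
print(slope * 100 + intercept)   # ≈ 285.5
```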

2. Unsupervised Learning

  • Unsupervised learning is a type of machine learning where the algorithm works with data that has no predefined labels or categories. Instead of being told what the correct answers are, the model identifies patterns and relationships on its own.
  • The algorithm analyzes the data and groups similar items together based on shared characteristics. It is often used for discovering hidden structures in large datasets without human intervention.
  • Common real-world applications include:
    • Grouping customers with similar shopping behaviors for targeted marketing.
    • Detecting unusual patterns in financial transactions to identify fraud.
    • Organizing news articles into different categories based on content similarity.
  • This learning method is especially useful when dealing with vast amounts of data where predefined labels are unavailable, making it an essential tool for data analysis and pattern recognition.

Here are some key unsupervised learning algorithms:

  • K-Means Clustering:
    • What it does: It groups similar data points into a fixed number of clusters (k).
    • How it works: It assigns each data point to the nearest cluster center (centroid) and keeps adjusting the clusters to minimize the distance between points and their assigned centroid.
    • Example: Businesses use K-Means to group customers based on their shopping behavior. For instance, some customers may prefer luxury products, while others go for budget-friendly items.
  • Hierarchical Clustering:
    • What it does: Instead of fixing the number of clusters beforehand, it builds a tree-like structure (dendrogram) to show how data points are related.
    • How it works: It either starts by treating each data point as its own cluster and merging similar ones (bottom-up) or starts with all data points in one cluster and keeps splitting them (top-down).
    • Example: Biologists use hierarchical clustering to group species based on genetic similarity, creating a family tree of related organisms.
  • Principal Component Analysis (PCA):
    • What it does: It reduces the number of variables in a dataset while preserving the most important information.
    • How it works: It transforms the data into a new set of uncorrelated features (principal components) that capture the most variance.
    • Example: Imagine you have a dataset with 100 features, making it hard to visualize. PCA helps reduce it to 2D or 3D while keeping the key patterns intact, making it easier to analyze.
  • Apriori Algorithm:
    • What it does: It finds frequently occurring groups of items in large datasets.
    • How it works: It follows a "bottom-up" approach, starting with individual items and combining them into larger sets based on how often they appear together.
    • Example: In market basket analysis, this algorithm helps stores find product pairs that are often bought together, like bread and butter. This is how online stores recommend "Customers who bought this also bought..." items.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
    • What it does: It groups data points based on how densely packed they are and marks outliers as noise.
    • How it works: Instead of assuming a fixed number of clusters, it identifies high-density areas and considers points in low-density regions as noise.
    • Example: It is used in fraud detection to spot unusual financial transactions that do not fit into normal spending patterns.
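To make the K-Means idea concrete, here is a minimal from-scratch sketch (not a production implementation; the two point groups are invented for illustration): assign each point to its nearest centroid, recompute the centroids, and repeat.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means: assign points to the nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious groups: one near (1, 1), another near (8, 8)
pts = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.0],
                [8.0, 8.0], [8.5, 9.0], [7.5, 8.0]])
labels, _ = kmeans(pts, k=2)
# The first three points should share one label, the last three the other
print(labels)
```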

3. Reinforcement Learning

  • Reinforcement Learning (RL) is a machine learning approach where an agent learns by interacting with an environment. Instead of being given direct answers, the agent takes actions and receives feedback in the form of rewards or penalties.
  • The goal is to maximize rewards over time by improving decision-making based on past experiences. The agent continuously adjusts its strategy to achieve better outcomes.
  • Real-world applications of reinforcement learning include:
    • Teaching robots how to walk or perform tasks by learning from trial and error.
    • Training AI to play video games and improve its strategies through repeated gameplay.
    • Optimizing stock trading strategies by learning from market fluctuations.
  • This method is highly effective for solving complex problems where the best solution is not immediately obvious, making it widely used in automation, gaming, and robotics.

Here are some key reinforcement learning algorithms:

  • Q-Learning:
    • What it does: It helps an agent learn the best actions to take in different situations by estimating their future rewards.
    • How it works: It maintains a Q-table, which stores the expected reward for each action in each state. The table is updated over time as the agent explores and learns.
    • Example: A robot learning to navigate a maze. It tries different paths, learns which ones lead to the goal, and eventually finds the shortest way.
  • Deep Q-Networks (DQN):
    • What it does: It improves Q-learning by using deep neural networks to handle complex environments with many possible states.
    • How it works: Instead of storing Q-values in a table, it uses a neural network to approximate them, making it more scalable.
    • Example: AI playing Atari games. It watches the game screen, learns which moves score the most points, and eventually masters the game.
  • Policy Gradient Methods:
    • What it does: It directly learns the best strategy (policy) for choosing actions to maximize rewards.
    • How it works: Instead of estimating Q-values, it continuously adjusts its policy using gradient ascent, improving its performance over time.
    • Example: Teaching a robot to walk. The algorithm refines the robot's movements based on how well it balances and moves forward.
  • Actor-Critic Methods:
    • What it does: It combines two roles, a critic that evaluates actions and an actor that selects them, to improve stability and efficiency.
    • How it works: The actor makes decisions, and the critic provides feedback on whether those decisions were good or bad, helping the actor improve.
    • Example: Training self-driving cars. The actor decides how to steer, and the critic evaluates whether the car is following the safest route.
  • Monte Carlo Methods:
    • What it does: It learns by completing entire sequences of actions (episodes) before updating its knowledge.
    • How it works: Instead of learning step-by-step, it waits until the end of an episode and then adjusts its strategy based on the total reward received.
    • Example: Learning to play board games like chess. The agent plays full games, then updates its strategy based on whether it won or lost.
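The Q-learning update described above can be sketched on a toy problem. The corridor world below (states 0 to 4, goal at state 4) is an invented example, not a standard benchmark; the agent fills in a Q-table and learns that moving right is the better action everywhere.

```python
import random

# Tiny corridor world: states 0..4, goal at state 4, actions 0 = left, 1 = right
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """Move left or right; reaching the goal gives reward 1."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q-table: one row per state
for _ in range(500):                        # episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action choice (ties broken randomly)
        if random.random() < EPSILON or Q[s][0] == Q[s][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "right" should score higher than "left" in every non-goal state
print([Q[s][1] > Q[s][0] for s in range(GOAL)])
```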

In supervised learning, it's like teaching with examples. The algorithm gets both the input and the correct output, so it knows what the right answer is. It learns by making connections between the two, so it can predict the correct answer for new data.

In unsupervised learning, there are no correct answers given. The algorithm just gets the input data and has to figure out patterns or groupings all on its own, like trying to make sense of a puzzle without knowing what the final picture looks like.

In reinforcement learning, the algorithm learns by doing. It takes actions, gets feedback in the form of rewards or penalties, and then adjusts to make better choices, improving through trial and error.

So, while supervised learning needs both input and output, unsupervised learning only has input and figures things out, and reinforcement learning learns by interacting and improving from feedback.