Data Mining Techniques
│
├── 1. Association Rules
│   ├── From Transaction Databases
│   ├── From Relational Databases
│   └── Correlation Analysis
│
├── 2. Classification and Prediction
│   └── Using Decision Tree Induction
│
└── 3. Clustering Techniques
    ├── Introduction to Clustering
    ├── Partition Method
    └── Hierarchical Method
Consider a transaction database of a retail grocery store, where each transaction records a set of items purchased by a customer during a single visit.
A sample of such a database is shown below:
Transaction ID | Items Bought |
---|---|
1 | Bread, Milk |
2 | Bread, Eggs, Beer |
3 | Milk, Eggs, Beer, Cola |
4 | Bread, Milk, Eggs, Beer |
5 | Bread, Milk, Eggs, Cola |
From this dataset, we can identify recurring combinations of items. For example, customers who purchase both Bread and Milk also tend to purchase Eggs. This relationship is expressed as an association rule:
\( \text{Bread} \land \text{Milk} \Rightarrow \text{Eggs} \)
The usefulness and strength of such a rule are measured using two metrics: support (the fraction of transactions that contain all the items in the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). These metrics help determine whether a rule is meaningful or merely coincidental.
Now that we understand what association rules are and how they reveal patterns in data, the next question is: How do we find these rules?
That’s where the Apriori Algorithm comes in. It is a foundational technique used to automatically generate frequent itemsets and discover association rules from large datasets efficiently.
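Before turning to the algorithm, here is a minimal sketch in plain Python (the variable names and the rule chosen are our own illustration) that computes support and confidence for Bread ∧ Milk ⇒ Eggs on the five transactions above:

```python
# Sample transactions from the grocery-store table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Eggs", "Beer"},
    {"Milk", "Eggs", "Beer", "Cola"},
    {"Bread", "Milk", "Eggs", "Beer"},
    {"Bread", "Milk", "Eggs", "Cola"},
]

antecedent = {"Bread", "Milk"}   # left-hand side of the rule
consequent = {"Eggs"}            # right-hand side of the rule

n = len(transactions)
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # fraction of all transactions containing Bread, Milk and Eggs
confidence = both / ante  # of the transactions with Bread and Milk, how many also have Eggs
print(f"support = {support:.0%}, confidence = {confidence:.1%}")
```

On this data the rule holds in 2 of 5 transactions (support 40%) and in 2 of the 3 transactions that contain both Bread and Milk (confidence ≈ 66.7%).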
Association Rule Mining is commonly used in Market Basket Analysis to discover patterns such as “customers who buy Bread also tend to buy Butter.” Given transaction data, the goal is to generate association rules of the form:
\( X \Rightarrow Y \)
This means: If X is purchased, then Y is also likely purchased.
Transaction ID | Items |
---|---|
T1 | Bread, Butter, Milk |
T2 | Bread, Butter, Jam |
T3 | Bread, Milk, Juice |
T4 | Curd, Milk, Juice |
T5 | Bread, Butter, Milk, Juice |
Total Transactions: 5
Note: An itemset just means a group of items that appear
together in a transaction.
– A 1-itemset is a single item (like Bread).
– A 2-itemset is a pair of items (like Bread and Milk).
– A 3-itemset is a group of three items (like Bread, Milk, and Juice).
We look for these sets in the data and check how often they appear (called support). If they
appear frequently, we keep them. If not, we drop them.
Step 1: First, we list all individual items (1-itemsets) and count how many transactions each appears in.
Support is calculated using the formula:
\( \text{Support} = \left(\frac{\text{Item Count}}{\text{Total Transactions}}\right) \times 100\% \)
Total Transactions = 5
Minimum Support = 50%. So we remove items with less than 50% support.
Frequent 1-Itemsets: Bread (80%), Butter (60%), Milk (80%), Juice (60%)
These are called frequent because they appear in at least 50% of transactions.
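The counting in this step takes only a few lines of Python; the transaction list and the 50% threshold mirror the tables above, while the code itself is our own sketch:

```python
from collections import Counter

# Transactions T1–T5 from the table above
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk", "Juice"},
    {"Curd", "Milk", "Juice"},
    {"Bread", "Butter", "Milk", "Juice"},
]
n = len(transactions)
min_support = 0.5  # 50%

# Count every individual item, then keep those meeting the minimum support
counts = Counter(item for t in transactions for item in t)
frequent_1 = {item: count / n for item, count in counts.items() if count / n >= min_support}
print(frequent_1)  # Bread: 0.8, Butter: 0.6, Milk: 0.8, Juice: 0.6 (Jam and Curd are dropped)
```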
Step 2: Now we combine the frequent 1-itemsets into pairs and check how often each pair occurs in the transactions.
Only combinations of previously frequent 1-itemsets are considered (Bread, Butter, Milk, Juice).
Again, we apply the minimum support = 50%.
Frequent 2-Itemsets: {Bread, Milk} (60%), {Milk, Juice} (60%), {Bread, Butter} (60%)
Step 3: Now we form groups of 3 items from the frequent 2-itemsets.
Only valid combinations built from the previous frequent sets are checked: {Bread, Butter, Milk} appears in 2 of 5 transactions (40%), and {Bread, Milk, Juice} appears in 2 of 5 transactions (40%). Both are below 50%, so they are not frequent.
Frequent 3-Itemsets: None
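A sketch of how Steps 2 and 3 can be counted in Python; for brevity it enumerates combinations of the frequent 1-items rather than performing the full Apriori join, and the helper name `frequent_itemsets` is our own:

```python
from itertools import combinations

def frequent_itemsets(transactions, items, k, min_support):
    """Count every k-item combination of the given items and keep those
    whose support (fraction of transactions containing them) meets the threshold."""
    n = len(transactions)
    result = {}
    for candidate in combinations(sorted(items), k):
        support = sum(1 for t in transactions if set(candidate) <= t) / n
        if support >= min_support:
            result[candidate] = support
    return result

transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
    {"Bread", "Milk", "Juice"},
    {"Curd", "Milk", "Juice"},
    {"Bread", "Butter", "Milk", "Juice"},
]
frequent_items = {"Bread", "Butter", "Milk", "Juice"}  # result of Step 1

print(frequent_itemsets(transactions, frequent_items, 2, 0.5))
# {('Bread', 'Butter'): 0.6, ('Bread', 'Milk'): 0.6, ('Juice', 'Milk'): 0.6}
print(frequent_itemsets(transactions, frequent_items, 3, 0.5))
# {} -> no 3-itemset reaches 50%, matching the result above
```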
Step 4: Now we take the frequent 2-itemsets and generate rules of the form X → Y.
Confidence is calculated using:
\( \text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \times 100\% \)
Rule | Confidence | Interpretation |
---|---|---|
Bread → Milk | 75% | 75% of the time, when Bread is bought, Milk is also bought |
Milk → Bread | 75% | 75% of the time, when Milk is bought, Bread is also bought |
Milk → Juice | 75% | 75% of the time, when Milk is bought, Juice is also bought |
Juice → Milk | 100% | 100% of the time, when Juice is bought, Milk is also bought |
This process uncovers strong purchasing patterns and aids in data-driven decisions for retail and e-commerce.
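To tie the final step together, here is a sketch that derives rules and their confidence from the frequent 2-itemsets found above; the support values are hard-coded from the earlier steps. Note that it also emits Bread → Butter (75%) and Butter → Bread (100%) from the pair {Bread, Butter}, which the table above leaves out:

```python
# Supports found in the earlier steps (fractions of the 5 transactions)
support = {
    ("Bread",): 0.8, ("Butter",): 0.6, ("Milk",): 0.8, ("Juice",): 0.6,
    ("Bread", "Milk"): 0.6, ("Juice", "Milk"): 0.6, ("Bread", "Butter"): 0.6,
}

# For each frequent 2-itemset {X, Y}, emit X -> Y and Y -> X with
# confidence = support(X ∪ Y) / support(X)
for pair in [p for p in support if len(p) == 2]:
    for x, y in (pair, pair[::-1]):
        confidence = support[pair] / support[(x,)]
        print(f"{x} -> {y}: confidence = {confidence:.0%}")
```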
In data mining, we often find patterns using association rules. These rules tell us which items are frequently bought together. However, frequent co-occurrence does not always mean a real relationship exists between the items. This is where correlation analysis comes in.
The Pearson correlation coefficient (denoted \( \gamma \) or \( r \)) measures the strength and direction of a linear relationship between two continuous variables, and ranges from -1 to +1.
\[ \gamma(A,B) = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1) \cdot \sigma_A \cdot \sigma_B} \]
Value | Interpretation | Strength |
---|---|---|
+1 | Perfect positive correlation | Perfect |
+0.7 to +0.9 | Strong positive correlation | Strong |
+0.3 to +0.7 | Moderate positive correlation | Moderate |
0 to +0.3 | Weak positive correlation | Weak |
0 | No correlation | None |
0 to -0.3 | Weak negative correlation | Weak |
-0.3 to -0.7 | Moderate negative correlation | Moderate |
-0.7 to -0.9 | Strong negative correlation | Strong |
-1 | Perfect negative correlation | Perfect |
First, find the average of each variable:
\( \bar{A} = \frac{\sum A}{n} = \frac{A_1 + A_2 + ... + A_n}{n} \)
\( \bar{B} = \frac{\sum B}{n} = \frac{B_1 + B_2 + ... + B_n}{n} \)
For each data point, calculate how far it is from the mean:
For A: \( d_{A_i} = A_i - \bar{A} \)
For B: \( d_{B_i} = B_i - \bar{B} \)
Some deviations will be positive (above mean) and some negative (below mean).
Calculate the standard deviation for each variable:
\[ \sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n - 1}} \]
\[\sigma_A = \sqrt{\frac{(A_1 - \bar{A})^2 + (A_2 - \bar{A})^2 + ... + (A_n - \bar{A})^2}{n - 1}} \]
\[ \sigma_B = \sqrt{\frac{\sum (B - \bar{B})^2}{n - 1}} \]
\[ \sigma_B = \sqrt{\frac{(B_1 - \bar{B})^2 + (B_2 - \bar{B})^2 + ... + (B_n - \bar{B})^2}{n - 1}} \]
Note: We use (n-1) instead of n for sample standard deviation (Bessel's correction).
Calculate the sum of products of deviations (this measures covariance):
\[ \text{Numerator} = \sum (A - \bar{A})(B - \bar{B}) = (A_1 - \bar{A})(B_1 - \bar{B}) + (A_2 - \bar{A})(B_2 - \bar{B}) + ... + (A_n - \bar{A})(B_n - \bar{B}) \]
Plug in all calculated values into the main formula:
\[ \gamma(A,B) = \frac{\text{Numerator}}{(n-1) \cdot \sigma_A \cdot \sigma_B} \]
Given Data:
Point | A | B |
---|---|---|
1 | 20 | 8 |
2 | 12 | 34 |
3 | 9 | 4 |
Step 1: Calculate means
\( \bar{A} = \frac{20 + 12 + 9}{3} = \frac{41}{3} \approx 13.67 \), \( \bar{B} = \frac{8 + 34 + 4}{3} = \frac{46}{3} \approx 15.33 \)
Step 2: Calculate deviations from mean
Point | A | B | \(A - \bar{A}\) | \(B - \bar{B}\) |
---|---|---|---|---|
1 | 20 | 8 | 20 - 13.67 = 6.33 | 8 - 15.33 = -7.33 |
2 | 12 | 34 | 12 - 13.67 = -1.67 | 34 - 15.33 = 18.67 |
3 | 9 | 4 | 9 - 13.67 = -4.67 | 4 - 15.33 = -11.33 |
Step 3: Calculate standard deviations
First, calculate squared deviations:
Point | \((A - \bar{A})^2\) | \((B - \bar{B})^2\) |
---|---|---|
1 | \((6.33)^2 = 40.07\) | \((-7.33)^2 = 53.73\) |
2 | \((-1.67)^2 = 2.79\) | \((18.67)^2 = 348.57\) |
3 | \((-4.67)^2 = 21.81\) | \((-11.33)^2 = 128.37\) |
Sum | 64.67 | 530.67 |
Now calculate standard deviations:
\[ \sigma_A = \sqrt{\frac{64.67}{3-1}} = \sqrt{\frac{64.67}{2}} = \sqrt{32.34} = 5.69 \]
\[ \sigma_B = \sqrt{\frac{530.67}{3-1}} = \sqrt{\frac{530.67}{2}} = \sqrt{265.34} = 16.29 \]
Step 4: Calculate products of deviations
Point | \((A - \bar{A})\) | \((B - \bar{B})\) | \((A - \bar{A})(B - \bar{B})\) |
---|---|---|---|
1 | 6.33 | -7.33 | \((6.33)(-7.33) = -46.40\) |
2 | -1.67 | 18.67 | \((-1.67)(18.67) = -31.18\) |
3 | -4.67 | -11.33 | \((-4.67)(-11.33) = 52.91\) |
Sum | | | -24.67 |
Step 5: Apply correlation formula
\[ \gamma(A,B) = \frac{-24.67}{(3-1) \cdot 5.69 \cdot 16.29} \]
\[ = \frac{-24.67}{2 \cdot 5.69 \cdot 16.29} = \frac{-24.67}{185.38} = -0.133 \]
A value of -0.133 falls in the weak range of the interpretation table, so A and B show only a very weak negative linear relationship.
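As a quick arithmetic check, here is a small Python sketch that follows exactly the same steps (means, sample standard deviations with n - 1, and the sum of deviation products):

```python
import math

A = [20, 12, 9]
B = [8, 34, 4]
n = len(A)

mean_a, mean_b = sum(A) / n, sum(B) / n

# Sample standard deviations (Bessel's correction: divide by n - 1)
sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in A) / (n - 1))
sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in B) / (n - 1))

# Sum of products of deviations (the numerator of the formula)
numerator = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B))

r = numerator / ((n - 1) * sd_a * sd_b)
print(round(r, 3))  # -0.133 -> very weak negative correlation
```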
A decision tree classifies data by testing attribute values at internal nodes and assigning a class label at the leaves:

[Root Node]  (test on an attribute)
      |
      +-----------+-----------+
      |           |           |
[Decision Nodes]  (tests on further attributes)
      |           |
[Leaf Nodes] → Final decisions (Yes/No)
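Decision tree induction is usually automated. The sketch below uses scikit-learn; the tiny Age/Income dataset and the Yes/No ("buys") labels are our own illustration, not part of this section:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative records: [Age, Income] -> buys (1 = Yes, 0 = No)
X = [[25, 30], [45, 80], [35, 60], [22, 20], [52, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

# Entropy-based splitting, in the spirit of ID3/C4.5-style induction
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

# Internal nodes test attributes; leaves carry the final Yes/No decision
print(export_text(tree, feature_names=["Age", "Income"]))
print(tree.predict([[40, 70]]))  # e.g. [1] -> Yes
```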
Partition methods construct k partitions (clusters) of the data, where each data point belongs to exactly one cluster. The goal is to find the optimal grouping of data points such that points within the same cluster are highly similar to one another, while points in different clusters are dissimilar.
K-Means is one of the most popular clustering algorithms used to group similar data points into clusters. It is mainly used in clustering problems where we don't have labeled data. The goal is to divide the data into K groups (clusters) based on similarity.
When using just one feature, like Shoe Size, the distance between two points is simply the absolute difference:
\( d = |x_2 - x_1| \)
Consider the following dataset of clients and their shoe sizes:
Client | Shoe Size |
---|---|
Alice | 6 |
Bob | 9 |
Charlie | 7 |
Diana | 5 |
Edward | 8 |
Fiona | 10 |
George | 11 |
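To make the clustering concrete, here is a small sketch of K-Means with K = 2 on this single feature; the value of K and the starting centroids (6 and 10) are our own illustrative choices:

```python
# Simple 1D K-Means (K = 2) on the shoe sizes above, using |x2 - x1| as distance
sizes = {"Alice": 6, "Bob": 9, "Charlie": 7, "Diana": 5,
         "Edward": 8, "Fiona": 10, "George": 11}

centroids = [6.0, 10.0]  # illustrative initial centroids
for _ in range(10):  # assign points, update centroids, stop when stable
    clusters = {0: [], 1: []}
    for name, x in sizes.items():
        nearest = min((0, 1), key=lambda k: abs(x - centroids[k]))
        clusters[nearest].append(name)
    new_centroids = [sum(sizes[m] for m in clusters[k]) / len(clusters[k]) for k in (0, 1)]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(clusters, centroids)
# {0: ['Alice', 'Charlie', 'Diana', 'Edward'], 1: ['Bob', 'Fiona', 'George']} [6.5, 10.0]
```

On this data the loop stabilizes after two passes: the smaller sizes (5–8) form one cluster and the larger sizes (9–11) the other.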
This formula is used to measure how far two points are from each other in a 2D space (like age and amount):
\( d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \)
Let’s consider the following dataset:
Client | Age | Amount |
---|---|---|
C1 | 20 | 500 |
C2 | 40 | 1000 |
C3 | 30 | 800 |
C4 | 18 | 300 |
C5 | 28 | 1200 |
C6 | 35 | 1400 |
C7 | 45 | 1800 |
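As a quick numeric check of the formula on this table (client labels as above):

```python
import math

# Clients as (Age, Amount) points from the table above
clients = {"C1": (20, 500), "C2": (40, 1000), "C3": (30, 800), "C4": (18, 300)}

def euclidean(p, q):
    """2D Euclidean distance between two (age, amount) points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(round(euclidean(clients["C1"], clients["C2"]), 1))  # sqrt(20^2 + 500^2) ≈ 500.4
print(round(euclidean(clients["C1"], clients["C4"]), 1))  # sqrt(2^2 + 200^2) ≈ 200.0
```

Because Amount is on a much larger numeric scale than Age, it dominates these distances; in practice the features are usually normalized before running K-Means on such data.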
Hierarchical clustering builds a hierarchy of clusters. Unlike K-Means, it does not require choosing the number of clusters (K) beforehand. There are two main approaches: agglomerative (bottom-up), which starts with every point as its own cluster and repeatedly merges the closest clusters, and divisive (top-down), which starts with all points in one cluster and recursively splits it.
Given 1D data points: [1, 5, 8, 10, 19, 20]
Step-by-step using single linkage (always merge the two clusters whose closest members are nearest):
1. Start with each point as its own cluster: {1}, {5}, {8}, {10}, {19}, {20}.
2. Merge {19} and {20} (distance 1).
3. Merge {8} and {10} (distance 2).
4. Merge {5} into {8, 10} (distance 3, between 5 and 8).
5. Merge {1} into {5, 8, 10} (distance 4, between 1 and 5).
6. Finally merge {1, 5, 8, 10} with {19, 20} (distance 9, between 10 and 19).
A dendrogram is a tree-like diagram that shows the sequence of merges (or splits). You can “cut” the dendrogram at any level to get the desired number of clusters.
This method starts with all points in one big cluster and splits them recursively. It's less common and computationally heavier than agglomerative.
Steps:
1. Start with all points in a single cluster.
2. Split the cluster into two sub-clusters, for example by separating the most dissimilar points.
3. Recursively repeat the splitting on each sub-cluster until every point stands alone or the desired number of clusters is reached.
Use libraries like scipy.cluster.hierarchy in Python to create dendrograms and perform clustering automatically.
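For example, here is a minimal sketch of single-linkage clustering on the 1D points from the agglomerative example; cutting the dendrogram into 2 clusters is an illustrative choice:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# The 1D points from the agglomerative example, as an (n, 1) observation matrix
points = np.array([1, 5, 8, 10, 19, 20]).reshape(-1, 1)

# Agglomerative clustering with single linkage
Z = linkage(points, method="single")
print(Z)  # each row records which two clusters were merged and at what distance

# "Cut" the dendrogram so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 1 2 2] -> {1, 5, 8, 10} and {19, 20}

# dendrogram(Z) draws the merge tree when combined with matplotlib
```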