× back
Next Topic → ← Previous Topic

High-Level Database Design & Normalization

Features of Relational Database Design

When a database designer is planning to design a database, they consider the process, key points, and methodologies involved in the construction of a relational database.

1. Semantics of Attributes:

  • This aspect revolves around the meaning of each individual attribute within the database.
  • Guideline: Design a relation schema to facilitate easy explanation of its meaning; names should align with the data stored in them.
  • Do not combine attributes from multiple entity types to avoid confusion.

2. Reducing Redundant Information from Tuples:

  • Duplicate values should not exist in multiple tuples.
  • Guideline: Design the database to eliminate insertion, deletion, and updation anomalies.

3. Reducing Null Values in Attributes:

  • The value of an attribute should not be frequently null.
  • Guidelines: Avoid including attributes with frequent null values in the relation.

4. Disallowing the Possibility of Generating Spurious Tuples

  • This refers to avoiding the generation of spurious tuples, meaning undesired results when joining tables.
  • Guidelines: Design the relational schema to enable joining with equality conditions on common attributes.

Anomalies in Relational Database:-

Insertion Anomaly

  • An insertion anomaly occurs when there is a desire to insert information, but it cannot be added due to certain constraints or restrictions.

Example:

                    
+-------------------------------------------+
| EID |   Name  | Salary | DID | Department |
+-------------------------------------------+
|  1  | Paul    | 10,000 |  1  | A/C        |  
|  2  | John    | 6,000  |  1  | A/C        |  
|  3  | Sharav  | 7,000  |  3  | Sales      |
|  4  | Nike    | 6,000  |  2  | Marketing  |
|  5  | Hudson  | 10,000 |  2  | Marketing  |
|  6  | Smith   | 20,000 |  1  | A/C        |  
|  7  | Paula   | 50,000 |  2  | Marketing  |
|  8  | Maithli | 15,000 |  1  | A/C        |
|     |         |        |  5  | Finance    | 
+-------------------------------------------+
                    
                
  • We can't add Finance and DID 5 due to EID constraints, as EID (primary key) cannot be left blank or null.

Deletion Anomaly

  • A deletion anomaly occurs when we remove some existing information from a database, and this action unintentionally leads to the deletion of other related but necessary data.
  • Example:
  •         
    +-------------------------------------------+
    | EID |   Name  | Salary | DID | Department |
    +-------------------------------------------+
    |  1  | Paul    | 10,000 |  1  | A/C        |  
    |  2  | John    | 6,000  |  1  | A/C        |  
    |  3  | Sharav  | 7,000  |  2  | Sales      |
    +-------------------------------------------+
            
        
    • If we delete the tuple with EID 3, which is the only occurrence of the "Sales" department, we lose the definition of the "Sales" department entirely.

Updation Anomaly

  • An updation anomaly occurs when we attempt to update data in a way that introduces inconsistency into the dataset.
  • Example: Suppose we want to change the Department ID (DID) from 1 to 2 for the A/C department:
  •                         
    +-------------------------------------------+
    | EID |   Name  | Salary | DID | Department |
    +-------------------------------------------+
    |  1  | Paul    | 10,000 |  2  | A/C        |  
    |  2  | John    | 6,000  |  1  | A/C        |  
    |  3  | Sharav  | 7,000  |  3  | Sales      |
    +-------------------------------------------+
                            
                        
    • This update would create inconsistency because now the Department A/C is associated with both DID 1 and DID 2, making it unreliable and confusing.
    • For accurate updates, one would need to change the Department ID for all instances of A/C, leading to a more complex and error-prone process.

Normalization: Introduction

Normalization is a systematic process aimed at decomposing or dividing a relational database into multiple relations to eliminate anomalies and improve its overall structure.

Benefits of Normalization

  1. Smaller tables with smaller rows:
    • Normalization results in breaking down large tables into smaller, more manageable ones, reducing redundancy and improving efficiency.
  2. Searching, sorting & creating indexes are faster:
    • Normalized tables simplify data retrieval operations, leading to faster search, sort, and index creation processes.
  3. Small number of null values & less redundant data:
    • Normalization aims to minimize null values and redundant data, ensuring a cleaner and more efficient database design.
  4. Data modification anomalies are reduced:
    • Normalization helps mitigate issues such as insertion, update, and deletion anomalies, promoting data integrity.
  5. Easier to maintain & change as the needs change:
    • A normalized database is more adaptable to changes in requirements, making it easier to maintain and modify over time.
  6. Reduces data dependency:
    • Normalization minimizes data redundancy, leading to reduced dependency and ensuring that changes in one part of the database do not adversely affect other parts.

Pros and Cons of Normalization in Databases

Advantages

  • Reduces data redundancy, minimizing storage space.
  • Enhances overall organization of the database.
  • Promotes data consistency within the database.
  • Facilitates a more flexible database design.
  • Reduces the number of indexes per table, speeding up maintenance tasks.
  • Allows for selective table joins, improving query efficiency.
  • Enhances database security management.
  • Increases storage efficiency.
  • Speeds up data access.

Disadvantages

  • Database design must align with user requirements before development.
  • Normalization can consume significant CPU, memory, and I/O resources, affecting performance.
  • Requires more joins for desired results; poorly written queries can impact database performance.
  • Normalizing relations of a higher degree is time-consuming and challenging.
  • Careless decomposition may lead to poorly designed databases and serious issues.

Types of Normal Forms:

  1. First Normal Form (1NF):
    • Focuses on eliminating duplicate data by ensuring that each attribute in a table contains only atomic (indivisible) values.
  2. Second Normal Form (2NF):
    • Builds on 1NF by addressing partial dependencies. All non-prime attributes must be fully functionally dependent on the primary key.
  3. Third Normal Form (3NF):
    • Further refines data integrity by removing transitive dependencies. No non-prime attribute should be transitively dependent on the primary key.
  4. Boyce-Codd Normal Form (BCNF):
    • Aims to eliminate all non-trivial functional dependencies. Every non-trivial functional dependency is a superkey.
  5. Fourth Normal Form (4NF) based on Multivalued Dependency:
    • Addresses situations where a table has multiple independent multi-valued attributes. It ensures data is organized efficiently.
  6. Fifth Normal Form (5NF) based on Join Dependency:
    • Deals with scenarios where a table contains multiple overlapping candidate keys, and it aims to eliminate redundancy through the use of join dependencies.

The first four depend on the concept of functional dependency.

What are Functional Dependencies?

Functional dependencies result from the interrelationship between attributes or tuples in any relation.

Example:

                
+-----------+
|  X  |  Y  |
+-----------+
|  a  |  1  |
|  b  |  22 |
|  c  |  65 |
|  a  |  1  |
+-----------+
                
            

If X → Y, then for any two tuples (t1 and t2) where t1[X] = t2[X], it follows that t1[Y] = t2[Y]. For every unique value of X, we get a unique value from Y.
a → 1
X is the determinant, and Y is dependent.

Question: Is X → Y?

                
+-----------+
|  X  |  Y  |
+-----------+
|  a  |  2  |
|  b  |  2  |
|  a  |  2  |
|  b  |  2  |
|  c  |  2  |
+-----------+
                
            

For every 'a' in X, we get Y = 2, and the same holds for 'b'; however, 'c' won't affect anything.
a → 2, b → 2, c → 2

Another example:

                    
+---------------------------+
|  EID  |   Name  |  Salary |
+---------------------------+
|   1   | Aditya  |  15000  |
|   2   |  Manoj  |  16000  |
|   3   | Sandeep |  9000   |
|   4   |  Vikas  |  10000  |
|   5   |  Manoj  |  9000   |
+---------------------------+
                    
                

Consider the functional dependency:

First Normal Form - 1NF

Examples:

        
+-------------------------------------+
|  ID  |  Name  |  Age  |     Phone   |
+-------------------------------------+
|   1  |    x   |  32   |     666     |
|   2  |    y   |  65   |     777     |
|   3  |    z   |  42   |  888, 222   |
+-------------------------------------+
                
            

ID 3 has two phone numbers, and we shouldn't have multiple values in a single attribute. To convert this to 1NF form, we can repeat the row like this ↓

                    
+-------------------------------------+
|  ID  |  Name  |  Age  |     Phone   |
+-------------------------------------+
|   1  |    x   |  32   |     666     |
|   2  |    y   |  65   |     777     |
|   3  |    z   |  42   |     888     |
|   3  |    z   |  42   |     222     |
+-------------------------------------+
                    
                

Now, each attribute contains only atomic values, adhering to the 1NF requirements.

We can't even have composite attributes as it violates the atomic property for 1NF. For example ↓

                    
+----------------------------------------+
|  ID  |  Name  |  Address               |
+----------------------------------------+
|   1  |    x   |  74A, Block 7, Delhi   |
|   2  |    y   |  664b, Block 8, Delhi  |
+----------------------------------------+ 
                    
                

Converting it to 1NF form by distributing the address attribute into sub-attributes.

                    
+--------------------------------------------+
|  ID  |  Name  |  Street  |  Block | State  |
+--------------------------------------------+
|   1  |    x   |   74A    | Block 7 | Delhi |
|   2  |    y   |   664b   | Block 8 | Delhi |
+--------------------------------------------+ 
                    
                

Working with derived attributes:

As derived attribute values can be easily calculated at runtime, they are removed in 1NF form.

                    
+-----------------------------------------+
|  ID  |  Name  |      DOJ       |  WExp  |
+-----------------------------------------+
|   1  |    x   |   16 May 2018  |   3    |
|   2  |    y   |   14 Aug 2019  |   2    |
+-----------------------------------------+ 
                    
                

Converting to 1NF form ↓

                    
+-------------------------------+
|  ID  |  Name  |      DOJ      |
+-------------------------------+
|   1  |    x   |   16 May 2018 |
|   2  |    y   |   14 Aug 2019 |
+-------------------------------+ 
                    
                

Prime and Non-Prime Attributes

Prime or Key Attribute

  • For a given relation R = {A1, A2, A3, A4, ..., An}, an attribute a is a prime attribute if A is part of any candidate key of R.
  • Alternatively, an attribute is considered prime if it is part of a key.

Non-Prime or Non-Key Attribute:

  • If an attribute is not part of any key, it is classified as a non-prime attribute.

Example:

Relation Schema:

  • Relation Schema: Student(Rollno, S-Name, S-Address, S-DOB, S-Fees, S-Course)
  • Now, let's determine which attributes can serve as candidate keys.
    • Since Rollno uniquely identifies each student, it qualifies as a candidate key: Rollno → Ck.
    • The combination [S-Name + S-Address] can also uniquely identify a student: [S-Name + S-Address] → Ck (This combination can also identify a unique value).
    • Therefore, Rollno, S-Name, and S-Address are considered prime attributes as they are either individually or collectively part of a candidate key.
    • On the other hand, S-DOB, S-Fees, and S-Course are classified as non-prime key attributes since they do not contribute to the identification of unique tuples.

Fully Functional Dependency:

Example:

Consider a relation named "Student" with prime key attributes (rollno & games) and non-prime key attributes (grade, name, and fees).

For illustration purposes:

This exemplifies a fully functional dependency, where certain non-prime key attributes rely on the entirety of the prime key for their uniqueness.

Partial Dependency

Example:

Consider a relation named "Student" with prime key attributes (rollno & games) and non-prime key attributes (grade, name, and fees).

This provides a more comprehensive view of partial dependency within the context of a relational database, emphasizing that certain non-prime key attributes depend on only a subset of the prime key attributes for their uniqueness.

Trivial and Non-trivial Dependency

Trivial Functional Dependency

  • If A → B and B ⊆ A, then this dependency is a Trivial Functional Dependency. In simpler terms, if the right-hand side (B) is a subset of the left-hand side (A), the dependency is considered trivial.
  • A → A and B → B are also categorized as Trivial Dependencies. These cases express that an attribute is functionally dependent on itself, which is inherently evident.
  • Example: Let's consider a relation where Rollno → Rollno and Rollno, S-name → S-Name. In the first case, Rollno → Rollno is trivial because it merely states that you can find the value of Rollno from itself, which is self-evident and doesn't provide new information. The second case, Rollno, S-name → S-Name, is also trivial as it essentially says that knowing both Rollno and S-name, you can find the value of S-Name. These examples highlight situations where the dependency is obvious and doesn't contribute additional insights.
  • Trivial dependencies are characterized by their predictability and lack of novelty, making them less informative in the context of database relationships.

Non-trivial Functional Dependency

  • If A → B and B is not a subset of A, then this dependency is a Non-trivial Functional Dependency. In other words, the right-hand side (B) provides new information beyond what is already known from the left-hand side (A).
  • Example: Consider a relation where Rollno → S-Name. This is a non-trivial dependency because knowing the Rollno uniquely determines the value of S-Name, and the relationship is not self-evident or redundant. The dependency introduces new information about the dataset.
  • Non-trivial dependencies are essential in database design as they contribute to the normalization process, helping to eliminate redundancy and ensure data integrity.
  • Unlike trivial dependencies, non-trivial dependencies play a crucial role in providing valuable insights and maintaining the accuracy of data relationships within a relational database.

Transitive and Non-transitive Dependency

Transitive Dependency

  • A transitive dependency occurs when there is an indirect relationship between two attributes through a third attribute. In other words, if A → B and B → C, then A → C is a transitive dependency.
  • Example: Consider a relation where EmployeeID → DepartmentID and DepartmentID → DepartmentName. Here, EmployeeID → DepartmentName is a transitive dependency, as the department name is indirectly determined by the employee ID through the relationship with the department ID.
  • Transitive dependencies can lead to data redundancy and are typically addressed through normalization techniques.

Non-transitive Dependency

  • A non-transitive dependency, on the other hand, exists when there is a direct relationship between two attributes without involving a third attribute.
  • Example: In a relation where StudentID → StudentName, this is a non-transitive dependency. The student name is directly determined by the student ID without the need for an intermediary attribute.
  • Non-transitive dependencies are desirable in database design as they help maintain simplicity and reduce the risk of data anomalies.
  • Identifying and eliminating transitive dependencies is a key step in the normalization process to ensure a well-structured and efficient relational database.

Second Normal Form (2NF)

Example:

Consider a relation named "Student" with prime key attributes (rollno & games) and non-prime key attributes (grade, name, and fees).

In the new structure, each table represents a distinct functional dependency, ensuring that non-prime key attributes are fully dependent on the primary key attributes. This adheres to the principles of the second normal form, promoting data integrity and reducing redundancy in the relational database.

Dependencies & Logical Implication

We can utilize Armstrong's Axioms to further understand these dependencies:

  1. Reflexivity: If Y is a subset of X, then X → Y, signifying the unique identification of Y using X. This axiom also implies that X → X.
  2. Augmentation: If X → Y holds and Z is a set of attributes, then ZX → ZY.
  3. Transitivity: If X → Y holds and Y → Z, then X → Z holds.

Additional Axioms:

  1. Additivity or Union: If X → Y and X → Z, then X → YZ holds.
  2. Projectivity or Decomposition: If X → YX holds, then X → Y and X → Z also hold.
  3. Pseudo-Transitivity: If X → Y and ZY → W hold, then XZ → W holds.

These axioms play a crucial role in determining closures, which, in turn, assist us in identifying keys, particularly in the context of achieving Third Normal Form (3NF).

Closure of Functional Dependencies

The closure, (AB)+, essentially represents the set of all attributes that can be functionally determined by the attributes A and B, considering the given set of dependencies.

Q→ R = {X, Y, Z, H, D}
F.D = X → YZ, DX → W, Y → H then find closure F+ of FD's.

Solution:
Given Functional Dependencies (FD):

  • X → YZ
  • DX → W
  • Y → H

We need to find the closure F+ of these FDs.

  1. Applying X → YZ, we get X → Y and X → Z.
  2. Now, considering X → Y and Y → H, applying transitivity, we find X → H.
  3. No other dependencies can be derived from the given FDs.

Therefore, the closure F+ is: {X, Y, H}

Q→ R = {A, B, C, D, E}
F.D = A → BC, CD → E, B → D, E → A
Find F+

Solution:
Given Functional Dependencies (FD):

  • A → BC
  • CD → E
  • B → D
  • E → A

We need to find the closure F+ of these FDs.

  1. Starting with A → BC, we get A → B and A → C.
  2. Considering CD → E, no new attributes are added.
  3. Considering B → D, we add B → D.
  4. Considering E → A, we add E → A.
  5. No more attributes can be added. The closure F+ is: {A, B, C, D, E}

Q→ R = {A, B, C, D, E, F, G}
F.D = A → B, BC → DE, AEF → G, E → C, A → E

Solution:
Given Functional Dependencies (FD):

  • A → B
  • BC → DE
  • AEF → G
  • E → C
  • A → E

We need to find the closure F+ of these FDs.

  1. Starting with A → B, we get A → B.
  2. Considering BC → DE, we add BC → D and BC → E.
  3. Considering AEF → G, we add AEF → G.
  4. Considering E → C, we add E → C.
  5. Considering A → E, we add A → E.
  6. No more attributes can be added. The closure F+ is: {A, B, C, D, E, F, G}

Finding the Candidate Key of a Relation by Closure

  • Candidate key: It is the minimal set of attributes whose attribute closure includes all attributes of the relation; this set is called the candidate key of the relation.

Example: R{A, B, C, D, E}
F.D = AB → C, CD → E, DE → B
Find the candidate key.

Solution:
According to the rules, we don't consider attributes on the right-hand side of the functional dependencies. In this case, B, C, and E are ignored, leaving us with A and D.
Now, using the AD combination, we will create more combinations such as ADC, ABC, ADE. For each combination, we will find the attribute closure.

  • For {ADC}+ = {A, D, C}
    = {A, C, D, E} as CD → E
    = {A, B, C, D, E} as DE → B
    So, Ck = {ADC}, and it is a candidate key.
  • For {ADB}+ = {A, D, B}
    = {A, B, C, D}
    = {A, B, C, D, E}
    So, Ck = {ADB} is also a candidate key.
  • For {ADE}+ = {A, D, E}
    = {A, B, D, E}
    = {A, B, C, D, E}
    So, Ck = {ADE} is yet another candidate key.

Convert Relation to 2NF

R(A, B, C, D)
A, B prime key attributes (denoted as Ck)
C, D non-prime key attributes
F.D = AB → D, B → C

In 2NF, all non-prime attributes should be fully functionally dependent on the prime key attributes. Let's analyze the given relation:

  • On the right side, we have AB, so we will find the closure (AB)+ = {A, B, C, D}.
    As we are obtaining all the attributes, it qualifies as a candidate key.
  • Now, the goal is to convert the relation into 2NF form.
  • Observing that D is completely dependent on AB, but C is not completely dependent on AB; it is only dependent on B. This implies that the relation is not in 2NF, and decomposition is required.
  • We will create two relations:
    R1 (ABD), which is in 2NF as all non-prime attributes are fully functionally dependent on the prime key.
    R2 (B, C), where all non-prime attributes are dependent on the prime key according to the 2NF rule.

Third Normal Form (3NF)

Example:

Properties of Normalization

Normalization is a crucial process in database design, aiming to minimize redundancy and dependency issues within relation schemas. While the first three normal forms (1NF, 2NF, 3NF) focus on decomposing relations based on functional dependencies, achieving a good relational schema requires considering additional properties.

Lossless Join Property

Picture the Lossless Join Property as taking a big table, splitting it into smaller ones, and then being able to merge them back together without losing any data. In the database world, this property ensures that when we decompose a table into smaller tables, we can recreate the original table without any missing information.

  • No Missing Rows: Suppose we have a 'Employees' table split into 'Employee Names' and 'Employee Departments.' The Lossless Join Property guarantees that combining 'Employee Names' and 'Employee Departments' brings us back to the full 'Employees' table, making sure no employee details are left out.
  • Essential in Normalization: It's a crucial rule in the process of normalization. When we organize our database, we want to be certain that no data is lost as we break it down and piece it back together.

Dependency Preservation Property

Now, think of the Dependency Preservation Property as ensuring that the relationships between different parts of our data remain clear, even when we split them. It's about preserving the dependencies.

  • Preserve Relationships: Consider a 'Orders' table with 'Order ID' and 'Customer Name.' If we decompose this into 'Order ID' and 'Customer Details,' Dependency Preservation ensures that the connection between 'Order ID' and 'Customer Name' is maintained.
  • Not Always Guaranteed: Unlike Lossless Join, Dependency Preservation isn't always guaranteed. Sometimes, when we break down our tables, we might lose some connections between pieces. It's like making sure the dependencies between tables stay intact - it doesn't always happen perfectly.

Boyce Codd Normal Form (BCNF)

BCNF is a more refined version of the Third Normal Form (3NF) in database normalization. A relation is said to be in BCNF if and only if all of its determinants are candidate keys.

Example:

Let's take a relation EmployeeSkills(EmployeeID, Skill, Proficiency) with the following functional dependencies:

In BCNF, both EmployeeID and Skill must be candidate keys. This ensures that the determinants (EmployeeID and Skill) are capable of uniquely identifying each tuple in the relation.

Multivalued Dependency (MVD)

Multivalued Dependency (MVD) is a concept in database normalization that deals with relationships involving multiple values. It occurs when one attribute in a table uniquely determines another attribute, independent of the other attributes in the table.

Example:

Consider a relation StudentCourses(StudentID, Course, Instructor) with the following multivalued dependency:

In this example, StudentID determines both the Course and the Instructor independently, representing a multivalued dependency.

Fourth Normal Form (4NF)

Fourth Normal Form (4NF) is an advanced level of normalization that addresses certain complex scenarios involving multivalued dependencies and composite keys.

Example:

Consider a relation EmployeeProjects(EmployeeID, Project, Responsibilities) with the multivalued dependency:

In 4NF, the relation should be in BCNF and free from non-trivial multivalued dependencies.

5NF

Fifth Normal Form (5NF)

Fifth Normal Form (5NF) is a further extension of normalization that addresses cases where information can be derived from the combination of multiple multivalued dependencies.

Example:

Consider a relation SupplierParts(SupplierID, Part, Location) with the multivalued dependencies:

In 5NF, the relation should be in 4NF, and there should be no non-trivial join dependencies, ensuring a high level of data integrity and structure.