Data Warehouse and Data Mining

1)Data Mining: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

The most commonly used techniques in data mining are:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique; a minimal sketch follows this list.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
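As a concrete illustration of the nearest neighbor method above, here is a minimal k-nearest-neighbour sketch in Python. The feature vectors, class labels, and the choice k = 3 are invented for the example; real data mining tools implement far more elaborate versions of this idea.

```python
from collections import Counter

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote of the k most similar historical records.
    `history` is a list of (feature_vector, class_label) pairs."""
    def distance(a, b):
        # Euclidean distance between two numeric feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # Take the k historical records closest to the query point
    neighbours = sorted(history, key=lambda rec: distance(rec[0], query))[:k]
    # Majority vote among the classes of those neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical historical data: (age, average purchase) -> customer segment
history = [((25, 40.0), "low"), ((32, 95.0), "high"), ((28, 50.0), "low"),
           ((45, 120.0), "high"), ((38, 80.0), "high")]
print(knn_classify(history, (30, 70.0)))  # prints "high"
```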

 

[Figure: the KDD process]

Data Mining Architecture:

2)Data Warehouse:

A data warehouse is a:

  • subject-oriented
  • integrated
  • time-variant
  • non-volatile

collection of data in support of the management’s decision-making process. A data warehouse is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.

 

Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

There are two approaches to data warehousing, top down and bottom up. The top down approach spins off data marts for specific groups of users after the complete data warehouse has been created. The bottom up approach builds the data marts first and then combines them into a single, all-encompassing data warehouse.

Slice and dice refers to a strategy for segmenting, viewing, and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts and repeating this process until they arrive at the right level of detail for analysis. Slicing and dicing provides a closer view of the data and presents it from new and diverse perspectives. The term is typically used with OLAP databases, which present information to the user in the form of multidimensional cubes similar to a 3D spreadsheet.
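To make slicing and dicing concrete, the sketch below builds a tiny cube with pandas and then slices and dices it. The column names, values, and the use of pandas are assumptions for illustration only, not features of any particular OLAP product.

```python
import pandas as pd

# A tiny fact table: one row per sale (all values are made up)
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "region":  ["North", "South", "North", "South", "North"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100, 150, 120, 90, 200],
})

# The "cube": revenue aggregated over year x region x product
cube = sales.groupby(["year", "region", "product"])["revenue"].sum()

# Slice: fix a single dimension (year = 2023)
print(cube.xs(2023, level="year"))

# Dice: restrict several dimensions at once (year 2023 and region North)
print(sales[(sales.year == 2023) & (sales.region == "North")]
      .groupby("product")["revenue"].sum())
```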

[Figure: data warehouse architecture]

ETL process

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

Extracting the data from source systems (SAP, ERP, and other operational systems): data from the different source systems is converted into one consolidated data warehouse format which is ready for transformation processing.

Transforming the data may involve the following tasks:

  •   applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
  •   cleaning (e.g., mapping NULL to 0 or “Male” to “M” and “Female” to “F” etc.),
  •   filtering (e.g., selecting only certain columns to load),
  •   splitting a column into multiple columns and vice versa,
  •   joining together data from multiple sources (e.g., lookup, merge),
  •   transposing rows and columns,
  •   applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)

Loading the data into a data warehouse, a data repository, or other reporting applications.
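A minimal, hypothetical transform step is sketched below in Python. The NULL-to-0 and gender-code mappings follow the examples above; the function, field names, and sample records are invented for illustration.

```python
def transform(row):
    """Apply simple cleaning, derivation, and validation rules to one extracted record."""
    # Validation: reject rows with no customer id
    if not row.get("customer_id"):
        return None
    out = {"customer_id": row["customer_id"]}
    # Cleaning: map NULL (None) to 0 and long gender labels to single-letter codes
    out["quantity"] = row.get("quantity") or 0
    out["gender"] = {"Male": "M", "Female": "F"}.get(row.get("gender"), row.get("gender"))
    # Derivation (business rule): a new measure computed from source fields
    out["revenue"] = out["quantity"] * (row.get("unit_price") or 0.0)
    return out

extracted = [
    {"customer_id": 1, "gender": "Female", "quantity": 3, "unit_price": 9.5},
    {"customer_id": 2, "gender": "Male", "quantity": None, "unit_price": 4.0},
    {"customer_id": None, "gender": "Male", "quantity": 1, "unit_price": 2.0},
]
# Rows that survive the transform step are ready to be loaded into the warehouse
loaded = [r for r in (transform(row) for row in extracted) if r is not None]
print(loaded)
```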

[Figure: differences between OLTP and OLAP]




Normalization

Normalization: It is the process of removing redundant data from your tables in order to improve storage efficiency, data integrity, and scalability. This improvement is balanced against an increase in complexity and potential performance losses from joining the normalized tables at query time. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both are worthy goals, as they reduce the amount of space a database consumes and ensure that data is logically stored. Normalization is also called a “bottom-up approach”, because the technique requires full knowledge of every participating attribute and its dependencies on the key attributes; if you add new attributes after normalization is done, the normal form of the database design may change.

Redundancy: Dependencies between attributes within a relation cause redundancy.

Problems without normalization: Without normalization, it becomes difficult to handle and update the database without risking data loss. Insertion, update, and deletion anomalies are very frequent if the database is not normalized, because redundant information ends up being stored; a small example follows the list below.

  • Insertion Anomaly – Certain facts cannot be recorded until other, unrelated data is available, because NULL values in key attributes must be avoided. This kind of anomaly can seriously damage a database.
  • Update Anomaly – Caused by data redundancy, i.e., multiple occurrences of the same value in a column; updating one copy but not the others leaves the data inconsistent and is inefficient.
  • Deletion Anomaly – Deleting a row can also remove facts that are not stored elsewhere, resulting in the loss of vital data.
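The following small sketch, with made-up columns and values, shows the three anomalies in an unnormalised table that mixes student facts with course facts:

```python
# One table mixing student facts with course facts (the course fee is repeated)
enrolments = [
    {"student": "Asha",  "course": "DBMS",     "course_fee": 500},
    {"student": "Ravi",  "course": "DBMS",     "course_fee": 500},
    {"student": "Meena", "course": "Networks", "course_fee": 400},
]

# Update anomaly: changing the DBMS fee in one row leaves the other copy stale.
enrolments[0]["course_fee"] = 550   # Ravi's row still says 500 -> inconsistent data

# Deletion anomaly: removing the only Networks student also loses the fact
# that the Networks course costs 400.
enrolments = [r for r in enrolments if r["student"] != "Meena"]

# Insertion anomaly: a brand-new course with no students cannot be recorded
# at all without inventing a placeholder (NULL) student.
```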

When normalization decomposes a relation into smaller relations with fewer attributes, the resulting relations, whenever joined, must produce the same relation without any extra rows, and the join operations can be performed in any order. This is known as lossless-join decomposition. The relations (tables) obtained on normalization should also possess properties such as: each row is identified by a unique key, there are no repeating groups, columns are homogeneous, and each column is assigned a unique name.

Functional Dependency: An attribute of a table is said to functionally determine another attribute when its value uniquely identifies the value of that other attribute in the same table. If column A of a table uniquely identifies column B of the same table, this is represented as A → B (attribute B is functionally dependent on attribute A).
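Operationally, A → B holds when no two rows agree on A but differ on B. The sketch below checks this for a hypothetical table held as a list of dictionaries; the function and column names are invented for the example.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        if key in seen and seen[key] != value:
            return False  # same determinant value, different dependent value
        seen[key] = value
    return True

rows = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}, {"A": 1, "B": "x"}]
print(fd_holds(rows, ["A"], ["B"]))  # True: A -> B holds in this sample
```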

  • Partial Functional Dependency: A functional dependency in which an attribute depends on only part of a composite candidate key. Let us assume a relation R with attributes A, B, C, and D, and the following set F of functional dependencies that hold on R:

                  F = {A → B, D → C}.

From the set of functional dependencies F, we can derive the key. For R, the key can be (A, D), a composite primary key. That means AD → BC: AD can uniquely identify B and C. But in this case the whole of A and D together is not required to identify B or C: to identify B, attribute A alone is enough, and to identify C, attribute D alone is enough. The functional dependencies AD → B and AD → C are therefore called partial functional dependencies.

  • Trivial Dependency: The dependency of an attribute on a set of attributes is known as a trivial dependency if the set of attributes includes that attribute.

Consider a table with two columns, Student_Id and Student_Name. {Student_Id, Student_Name} → Student_Id is a trivial functional dependency because Student_Id is a subset of {Student_Id, Student_Name}. That makes sense: if we know the values of Student_Id and Student_Name, then the value of Student_Id can be uniquely determined. Likewise, Student_Id → Student_Id and Student_Name → Student_Name are trivial dependencies too.

  • Non-Trivial Dependency: If a functional dependency X → Y holds true where Y is not a subset of X, then the dependency is called a non-trivial dependency.

An employee table with three attributes: emp_id, emp_name, emp_address.
The following functional dependencies are non-trivial:
emp_id -> emp_name (emp_name is not a subset of emp_id)
emp_id -> emp_address (emp_address is not a subset of emp_id)

On the other hand, the following dependencies are trivial:
{emp_id, emp_name} -> emp_name [emp_name is a subset of {emp_id, emp_name}]

Normalization has the following normal forms:

a)1NF

b)2NF

c)3NF

d)BCNF

e)4NF

f)5NF

a)1NF: A relation is considered to be in first normal form if all of its attributes have domains that are indivisible, or atomic. A small sketch follows the conditions below.

A table is in 1NF if and only if it satisfies the following five conditions:

  • There is no top-to-bottom ordering to the rows.
  • There is no left-to-right ordering to the columns.
  • There are no duplicate rows.
  • Every row-and-column intersection contains exactly one value from the applicable domain (each attribute holds only a single, atomic value).
  • All columns are regular.
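As a small, hypothetical example, a phone column holding several numbers violates 1NF; splitting the repeating values out restores atomic values:

```python
# Not in 1NF: the phones attribute holds a list, i.e. a non-atomic value
student = {"student_id": 1, "name": "Asha", "phones": ["111-222", "333-444"]}

# In 1NF: one atomic phone value per row, kept in a separate relation
students = [{"student_id": 1, "name": "Asha"}]
student_phones = [
    {"student_id": 1, "phone": "111-222"},
    {"student_id": 1, "phone": "333-444"},
]
```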

b)2NF: 

  • Table is in 1NF (First normal form)
  • No non-prime attribute is dependent on any proper subset of any candidate key of the table.
  • It is based on full functional dependency: every non-prime attribute must be fully functionally dependent on every candidate key.

An attribute that is not part of any candidate key is known as a non-prime attribute. A small decomposition sketch is shown below.
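A small hypothetical case: in an order-line table whose candidate key is (order_id, product_id), the attribute product_name depends on product_id alone, a proper subset of the key. Decomposing the table removes the partial dependency:

```python
# Violates 2NF: product_name depends only on product_id, not on the whole key
order_lines = [
    {"order_id": 1, "product_id": 10, "product_name": "Pen", "qty": 2},
    {"order_id": 2, "product_id": 10, "product_name": "Pen", "qty": 5},
]

# 2NF decomposition: the partially dependent attribute moves to its own relation
products = [{"product_id": 10, "product_name": "Pen"}]
order_items = [
    {"order_id": 1, "product_id": 10, "qty": 2},
    {"order_id": 2, "product_id": 10, "qty": 5},
]
```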

c)3NF: A functional dependency is said to be transitive if it is indirectly formed by two functional dependencies. For example, X → Z is a transitive dependency if the following three functional dependencies hold true:

  • X → Y
  • Y does not determine X (Y → X does not hold)
  • Y → Z

A table design is said to be in 3NF if both the following conditions hold:

  • Table must be in 2NF
  • Transitive functional dependencies of non-prime attributes on any super key should be removed.

An attribute that is not part of any candidate key is known as non-prime attribute.

In other words, 3NF can be explained like this: a table is in 3NF if it is in 2NF and, for each functional dependency X → Y, at least one of the following conditions holds:

  • X is a super key of the table
  • Y is a prime attribute of the table

An attribute that is a part of one of the candidate keys is known as a prime attribute. A small example of removing a transitive dependency follows.
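A hypothetical transitive dependency: emp_id → dept_id and dept_id → dept_name, so dept_name depends on the key only through dept_id. 3NF places the department facts in their own relation:

```python
# Violates 3NF: dept_name is transitively dependent on emp_id via dept_id
employees = [
    {"emp_id": 1, "name": "Asha", "dept_id": 7, "dept_name": "Sales"},
    {"emp_id": 2, "name": "Ravi", "dept_id": 7, "dept_name": "Sales"},
]

# 3NF decomposition: the transitive dependency is moved out
employees_3nf = [{"emp_id": 1, "name": "Asha", "dept_id": 7},
                 {"emp_id": 2, "name": "Ravi", "dept_id": 7}]
departments = [{"dept_id": 7, "dept_name": "Sales"}]
```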

d)BCNF: A relational schema R is considered to be in Boyce–Codd normal form (BCNF) if it is in 3NF and, for every one of its dependencies X → Y, one of the following conditions holds true:

  • X → Y is a trivial functional dependency (i.e., Y is a subset of X)
  • X is a superkey for schema R

BCNF is more restrictive than 3NF. While decomposing relations to bring them into BCNF we may lose some dependencies, i.e., BCNF does not guarantee the dependency-preservation property.

Note: A relation with only two attributes is always in BCNF. A classic example of a relation that is in 3NF but not in BCNF is sketched below.
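The sketch below uses made-up values. Assume each teacher teaches exactly one course, so the dependencies are {student, course} → teacher and teacher → course:

```python
# Candidate keys: {student, course} and {student, teacher}.
# teacher -> course satisfies 3NF (course is a prime attribute)
# but violates BCNF (teacher is not a superkey).
teaches = [
    {"student": "Asha", "course": "DBMS", "teacher": "Rao"},
    {"student": "Ravi", "course": "DBMS", "teacher": "Rao"},
]

# BCNF decomposition into (teacher, course) and (student, teacher).
# The dependency {student, course} -> teacher can no longer be enforced
# inside a single table, illustrating the loss of dependency preservation.
teacher_course  = [{"teacher": "Rao", "course": "DBMS"}]
student_teacher = [{"student": "Asha", "teacher": "Rao"},
                   {"student": "Ravi", "teacher": "Rao"}]
```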

e)4NF

  • It should meet all the requirements of 3NF.
  • It should have no non-trivial multi-valued dependencies: a single value of one attribute must not be associated with multiple independent values of another attribute in the same table.
  • Every relation in 4NF is also in BCNF.

f)5NF

Fifth normal form (5NF), also known as project-join normal form (PJ/NF) is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. A table is said to be in the 5NF if and only if every non-trivial join dependency in it is implied by the candidate keys.

A join dependency *{A, B, … Z} on R is implied by the candidate key(s) of R if and only if each of A, B, …, Z is a superkey for R.

 

 

 




Entity Relationship Model Part-2

Relationship: A relationship is an association among several entities.

Relationship Set: A relationship set is a set of relationships of the same type.

Relationship Type: A relationship type defines a set of associations among entities of the different entity types.

Two Types of Relationship Constraints:

a)Cardinality Ratio (the degree of a relationship is also called its cardinality)

b)Participation Constraint

a)Cardinality Ratio: Specifies the number of relationship instances that an entity can participate in. For binary relationships, the possible cardinality ratios are 1:1 (one-to-one), 1:N (one-to-many), N:1 (many-to-one), and M:N (many-to-many).

b)Participation Constraint: The participation constraint specifies whether the existence of an entity depends on its being related to another entity via the relationship type. There are two types of participation constraints:

1)Total Participation Constraint (existence dependency): The participation of an entity set E in a relationship set R is said to be total if every entity in E participates in at least one relationship in R. This participation is displayed as a double-line connection.

2)Partial Participation: If only some entities in E participate in relationships in R, the participation of entity set E in relationship set R is said to be partial. This participation is displayed as a single-line connection.


Extended E-R Features:

1)Specialization:

  • Top down design process
  • We take a higher-level entity and add new attributes to it to produce lower-level entities. The lower-level entities inherit the characteristics of the higher-level entity.
  • In terms of ER diagram, specialization is depicted by a triangle component labeled ISA.

Consider an entity set person, with attributes name, street, and city. A person may be further classified as one of the following:

customer

employee

2)Generalization:

  • Bottom-up design approach
  • Union of lower entity types to produce higher entity types.

3)Aggregation:

Aggregation is a process in which a relationship between two entities is treated as a single entity. Here, for example, the relationship between Student and Course acts as an entity in a relationship with Subject.





Entity Relationship Model Part-1

Database Model:
The logical structure of a database; it fundamentally determines the manner in which data can be stored, organized, and manipulated.

1)Hierarchical Model:

  • Data is organized in a tree-like structure, implying a single parent for each record.
  • Allows one-to-many relationships.

2)Network Model:

  • Allows many-to-many relationships in a graph-like structure that permits multiple parents.
  • Organizes data using two fundamental concepts called records and sets.

3)Relational Data Model:
Collection of tables to represent data and the relationship among those data. Eg: Oracle, Sybase.

4)Object Oriented Data Model:
Data and their relationships are organized or contained in a single structure known as an object.

The hierarchical, network, and relational data models are types of record-based models.


ENTITY RELATIONSHIP MODEL DESIGN

1)Entity: It is a “thing” or “object” in the real world that is distinguishable from all other objects. An entity has a set of properties, and the values of some of those properties may uniquely identify it.

2)Entity Set:
Collection of entities, all having the same properties or attributes.

3)Attributes:
Each entity is described by a set of attributes/properties. Attributes are descriptive properties possessed by each member of an entity set.
For each attribute, there is a set of permitted values called the domain or value set of the attribute.

Types of attributes:

1)Simple Attribute: Not divided into subparts, e.g., a unique number like 1234.
2)Composite Attribute: Divided into subparts, e.g., Name is divided into first name, middle name, and last name.
3)Single-Valued Attribute: A single value for a particular entity, e.g., order_id.
4)Multivalued Attribute: More than one value for a particular entity, e.g., phone number.
5)Derived Attribute: The value is derived from some other attribute, e.g., Age (derived from date of birth).

Null Value: The entity does not have a value for the attribute.

Keys:
Keys play an important role in a relational database; they are used for identifying unique rows in a table and also establish relationships among tables.
Types of Key:
1)Primary Key
2)Composite Key
3)Super Key
4)Candidate Key
5)Secondary Key
6)Foreign key
1)Primary key:
A primary key is a column or set of columns in a table that uniquely identifies the tuples (rows) in that table.
A relation may contain many candidate keys. When the designer selects one of them to identify a tuple in the relation, it becomes the primary key. If there is only one candidate key, it is automatically selected as the primary key.
2)Composite key
A key that consists of two or more attributes that together uniquely identify an entity occurrence is called a composite key. The individual attributes that make up a composite key are not simple keys in their own right.
3)Super Key
A super key is the most general type of key: a set of one or more columns (attributes) that uniquely identifies the rows in a table. A super key is a superset of a candidate key.
4)Candidate key
A candidate key is a minimal super key, i.e., a super key from which no column can be removed without losing uniqueness. Candidate keys qualify for the uniqueness of each row/tuple. Every table must have at least one candidate key but can have several.
5)Secondary key
Out of all the candidate keys, only one is selected as the primary key; the remaining keys are known as alternate or secondary keys.
6)Foreign key
A FOREIGN KEY in one table points to a PRIMARY KEY in another table. Foreign keys act as a cross-reference between tables; a minimal sketch follows.
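A minimal sqlite3 sketch of a primary key and a foreign key acting as a cross-reference; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""CREATE TABLE department (
                    dept_id   INTEGER PRIMARY KEY,
                    dept_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE employee (
                    emp_id  INTEGER PRIMARY KEY,
                    name    TEXT,
                    dept_id INTEGER REFERENCES department(dept_id))""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 1)")  # valid cross-reference
# Inserting an employee with dept_id = 99 would raise sqlite3.IntegrityError,
# because no matching primary key exists in department.
```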



Introduction of Database

Data: Facts, figures, statistics etc.
Record: Collection of related data items.
Table or Relation: Collection of related records.
Database: Collection of related relations/data. In a database, data is organized strictly in row-and-column format. The columns are called fields, attributes, or domains. The rows are called tuples or records.

Features of Data In a Database:
1)Security
2)Consistency
3)Non-Redundancy
4)Shared
5)Independence
6)Persistence
DBMS (Database Management System): It is software that allows the creation, definition, and manipulation of a database. It is a middle layer between the data and the programs.
File System:
  • Stores permanent records in various files
  • Needs application programs to access and manipulate the data.

Disadvantages of File System:

  • Data Redundancy
  • Data Inconsistency
  • Difficulty in accessing data
  • Data integrity problems
  • Low Security

Data redundancy: Data redundancy is the repetition or superfluity of data. It is a common issue in computer data storage and database systems. Repetition may occur either when a field is repeated in two or more tables or when a field is repeated within the same table, and data can appear multiple times in a database for a variety of reasons. A positive kind of redundancy works to safeguard data and promote consistency, and many developers consider it acceptable for data to be stored in multiple places; the key is to have a central, master field or space for this data, so that every place where the data is duplicated can be updated through one central access point. Otherwise, data redundancy can lead to big problems with data inconsistency, where one update does not automatically propagate to the other copies. For example, a shop may have the same customer’s name appearing several times if that customer has bought several different products on different dates.

Disadvantages Of Data Redundancy:
1)Increases the size of the database unnecessarily.
2)Causes data inconsistency.
3)Decreases efficiency of database.
4)May cause data corruption.

Data Isolation: Each transaction must execute as if it were the only one operating on the database; no transaction should have an adverse effect on the data seen by others, and if the database was in a consistent state before the execution of a transaction, it must remain consistent after the execution of the transaction as well. As an example, if two people are updating the same catalog item, it is not acceptable for one person’s changes to be “clobbered” when the second person saves a different set of changes. Both users should be able to work in isolation, each working as though he or she were the only user, and each set of changes must be isolated from those of the other users.

Data Integrity is the assurance that information is unchanged from its source and has not been accidentally (e.g., through programming errors) or maliciously (e.g., through breaches or hacks) modified, altered, or destroyed. In other words, it concerns the completeness, soundness, and wholeness of the data, in line with the intention of the data’s creators. It is a logical property of the database, independent of the actual data.

Data Consistency refers to the usability of the data and is mostly discussed in a single-site environment. Even in a single-site environment, consistency problems may arise during recovery activities, when original data is replaced by backup copies; you have to make sure the data remains usable while backing it up.

Data Abstraction: To simplify the interaction between users and the database, the DBMS hides information that is not of interest to the user; this is called data abstraction. The developer hides complexity from users and shows an abstract view of the data.

DBMS Architecture/3-Tier Architecture:
1)External/View Level: It is the user’s view of the database. This level describes the part of the database that is relevant to each user.
2)Conceptual/Logical Level:
  • Describes what data is stored in the database and the relationship among the data.
  • Represent all entities, their attributes and their relationship
  • Constraints on the data
  • Security and Integrity information
3)Physical/Internal Level:
  • Describes how the data is stored in the database
  • Storage Space allocation for data and indexes
  • File System
  • Data compression and Data encryption techniques
  • Record Placement


Schemas: 

  • It is the overall description of the database. In the three-level architecture, there is one schema at each level.
  • Does not specify relationship among files.
Instances:
Collection of information stored in the database at a particular moment.
Sub-schema: It is a subset of the schema and inherits the same properties that the schema has. It is an application programmer’s or user’s view of the data item types and record types which he or she uses.
Data Independence in DBMS:
Upper levels are unaffected by changes in lower levels. There are two types of data independence:
a)Physical Data Independence:

  • Physical storage structures or devices can be changed without affecting the conceptual schema.
  • Modifications are done to improve performance.
  • It provides independence to the conceptual and external schemas.

b)Logical Data Independence:

  • The conceptual schema can be changed without affecting the external schemas.
  • The logical structure of the database is altered when modifications are made to the conceptual schema.
  • It provides independence to the external schemas.

DBMS Components:
1)Hardware

  • Processor/main memory(used for execution)
  • Secondary Storage devices(for physical storage)

2)Data
3)Software
4)Users
5)Procedures(Set of rules for database management)

Types of Users:
a)Naive Users:
End Users of the database who work through menu driven application programs, where the type and range of response is always indicated to the users.

b)Online Users:
Those users who may communicate with database directly through an online terminal.

c)Application Programmer:
Those users who are responsible for developing the application program.

d)DBA(Database Administrator)
DBA(Database Administrator):
DBA directs or performs all activities related to maintaining a successful database environment.
Function of DBA:

  • Defining Conceptual Schema
  • Physical Database Design
  • Tuning database performance
  • Security and Integrity Check
  • Back up and Recovery Strategies
  • Improving query processing performance
  • Granting User Access

 

Database Languages:
1)DDL(Data Definition Language):

  • Deals with database schemas and descriptions of how the data should reside in the database.
  • Used to alter/modify a database or table structure and schema.

Command used in DDL:

  • Create
  • Alter
  • Drop
  • Rename
  • Truncate
  • Comment

2)DML(Data Manipulation Language)

  • Deals with data manipulation
  • These statements affect the records in a table.

Command used in DML:

  • Update
  • Select
  • Insert
  • Delete
  • Merge
  • Call
  • Lock Table

Two Types of DML:
a)Procedural DML (non-declarative): specifies how the data is to be fetched.
b)Non-Procedural DML (declarative): specifies what data is to be fetched.

3)DCL(Data Control Language)
Control the level of access that users have on database objects.
Command used in DCL:

  • Grant
  • Revoke

4)Transaction Language:
Controls and manages transactions to maintain the integrity of data within SQL statements. A short example covering DDL, DML, and transaction control commands follows the command list below.
Command used in Transaction Language:

  • Set Transaction
  • Commit
  • Savepoint
  • Rollback
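A short sqlite3 session illustrating the language categories above with made-up table and column names. DCL commands (GRANT/REVOKE) are omitted because SQLite has no user accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define and then alter the structure
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("ALTER TABLE student ADD COLUMN city TEXT")

# DML: manipulate the records
cur.execute("INSERT INTO student (id, name, city) VALUES (1, 'Asha', 'Pune')")
cur.execute("UPDATE student SET city = 'Delhi' WHERE id = 1")
print(cur.execute("SELECT * FROM student").fetchall())

# Transaction control: make the changes permanent (or undo them)
conn.commit()      # COMMIT
# conn.rollback()  # ROLLBACK would discard uncommitted changes instead
```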
