Data Warehouse and Data Mining


1) Data Mining: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

The most commonly used techniques in data mining are:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique; a small sketch of this idea follows the list.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
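
A minimal sketch of the nearest neighbor idea in plain Python; the historical records, feature names, and class labels below are invented purely for illustration:

```python
from collections import Counter
import math

def knn_classify(record, history, k=3):
    """Classify `record` by the majority class of its k most similar
    records in `history`, a list of (features, class_label) pairs."""
    # Rank historical records by Euclidean distance to the new record
    ranked = sorted(history, key=lambda item: math.dist(record, item[0]))
    # Vote among the k nearest neighbors
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical history: (annual_spend, visits_per_month) -> customer segment
history = [
    ((1200.0, 2), "occasional"),
    ((900.0, 1), "occasional"),
    ((5300.0, 8), "loyal"),
    ((4800.0, 7), "loyal"),
]
print(knn_classify((5000.0, 6), history, k=3))  # -> "loyal"
```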

 

[Figure: the data mining / KDD (Knowledge Discovery in Databases) process]

Data Mining Architecture:

2) Data Warehouse:

A data warehouse is a:

  • subject-oriented
  • integrated
  • time-variant
  • non-volatile

collection of data in support of the management’s decision-making process. A data warehouse is a centralized repository that stores data from multiple information sources and transforms it into a common, multidimensional data model for efficient querying and analysis.

 

Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
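
As a sketch of such a subject-oriented query, the “best customer for this item last year” question could be answered against a simple sales table with pandas; the table, column names, and values are made up for illustration:

```python
import pandas as pd

# Hypothetical sales data focused on the "sales" subject area
sales = pd.DataFrame({
    "customer": ["Acme", "Beta", "Acme", "Gamma"],
    "item":     ["widget", "widget", "widget", "gear"],
    "year":     [2023, 2023, 2023, 2023],
    "amount":   [500.0, 300.0, 700.0, 200.0],
})

# "Who was our best customer for this item last year?"
best_customer = (
    sales[(sales["item"] == "widget") & (sales["year"] == 2023)]
    .groupby("customer")["amount"].sum()
    .idxmax()
)
print(best_customer)  # -> "Acme"
```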

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
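
A tiny illustration of that kind of integration work, with invented source systems and field names: one source reports weight in pounds under one column name, another in kilograms under a different name, and the warehouse standardizes both:

```python
# Hypothetical records for the same product from two source systems
record_a = {"product_id": 17, "wt_lbs": 4.4}      # source A: pounds
record_b = {"product_id": 17, "weight_kg": 2.0}   # source B: kilograms

def integrate(record):
    """Resolve the naming conflict and unit inconsistency: always
    return the weight in kilograms under a single column name."""
    weight_kg = record.get("weight_kg")
    if weight_kg is None and "wt_lbs" in record:
        weight_kg = record["wt_lbs"] * 0.45359237  # pounds -> kilograms
    return {"product_id": record["product_id"], "weight_kg": weight_kg}

print(integrate(record_a))  # {'product_id': 17, 'weight_kg': 1.995...}
print(integrate(record_b))  # {'product_id': 17, 'weight_kg': 2.0}
```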

Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant: In order to discover trends in business, analysts need large amounts of historical data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

There are two approaches to data warehousing: top-down and bottom-up. The top-down approach spins off data marts for specific groups of users after the complete data warehouse has been created. The bottom-up approach builds the data marts first and then combines them into a single, all-encompassing data warehouse.

Slice and dice refers to a strategy for segmenting, viewing, and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts and repeating this process until they arrive at the right level of detail for analysis. Slicing and dicing provides a closer view of the data for analysis and presents it from new and diverse perspectives. The term is typically used with OLAP databases, which present information to the user in the form of multidimensional cubes similar to a 3D spreadsheet.
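
A small pandas sketch of slicing and dicing; the regions, products, and figures are invented, and a pivot table plays the role of a simple cube:

```python
import pandas as pd

# Hypothetical data with three dimensions: region, product, quarter
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "revenue": [100, 150, 120, 90, 80, 110],
})

# "Dice": view revenue across two dimensions at once (region x quarter)
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")

# "Slice": fix one dimension (product == "A") and look at a smaller cut
slice_a = (sales[sales["product"] == "A"]
           .pivot_table(index="region", columns="quarter",
                        values="revenue", aggfunc="sum"))

print(cube)
print(slice_a)
```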

[Figure: data warehouse architecture]

ETL process

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

Extracting the data from source systems (SAP, ERP, and other operational systems); data from the different source systems is converted into one consolidated data warehouse format that is ready for transformation processing.

Transforming the data may involve the following tasks:

  •   applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
  •   cleaning (e.g., mapping NULL to 0 or “Male” to “M” and “Female” to “F” etc.),
  •   filtering (e.g., selecting only certain columns to load),
  •   splitting a column into multiple columns and vice versa,
  •   joining together data from multiple sources (e.g., lookup, merge),
  •   transposing rows and columns,
  •   applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing).

Loading the data into a data warehouse, data repository, or other reporting applications. A minimal sketch of all three steps appears below.
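
A minimal end-to-end sketch of the three ETL steps in Python; the source rows, field names, and cleaning rules below are invented to mirror the examples above:

```python
# --- Extract: raw rows pulled from a hypothetical source system ---
source_rows = [
    {"id": "1", "gender": "Male",   "revenue": "1200", "country": "US"},
    {"id": "2", "gender": "Female", "revenue": None,   "country": "DE"},
    {"id": "3", "gender": None,     "revenue": "300",  "country": None},
]

def transform(row):
    """Apply simple business rules to one extracted row."""
    # Cleaning: map NULL to 0 and "Male"/"Female" to "M"/"F"
    revenue = float(row["revenue"]) if row["revenue"] is not None else 0.0
    gender = {"Male": "M", "Female": "F"}.get(row["gender"])
    # Validation: reject rows that are missing a country
    if row["country"] is None:
        return None
    # Filtering: keep only the columns the warehouse needs
    return {"id": int(row["id"]), "gender": gender,
            "revenue": revenue, "country": row["country"]}

# --- Transform and Load: write the cleaned rows into the target store ---
warehouse = [t for t in (transform(r) for r in source_rows) if t is not None]
print(warehouse)
```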

[Table: differences between OLTP and OLAP]