DATA WAREHOUSING
A data warehouse is a store of information organized in a unified data model. It is a relational database designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. A data warehouse separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
In this paper we explain the characteristics, architectures and processes involved in data warehousing.
The characteristics of a data warehouse are:
• Subject oriented
• Integrated
• Non-volatile
• Time variant
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
• Data warehouse Architecture (Basic)
• Data warehouse Architecture (With a staging area)
• Data warehouse Architecture (With a staging area and Data Marts)
Data warehousing also involves data pre-processing; those methods are explained in this paper as well. Thus this paper gives an overview of data warehousing concepts.
Introduction
A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise used for the purpose of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives such as order and payment transactions, although such data would also usually become part of a data warehouse.
A data warehouse is, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP). Data warehousing is not meant for current, "live" data.
Characteristics of a data warehouse:
• Subject oriented
• Integrated
• Non-volatile
• Time variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
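As a minimal sketch of such a subject-oriented query, here it is in Python using the standard-library sqlite3 module; the sales table and its columns are invented for illustration, not taken from any particular warehouse product.

import sqlite3

# A tiny stand-in for a sales-subject warehouse table; in practice
# this data would be loaded from the operational systems.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL, year INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Acme", "widget", 1200.0, 2023),
    ("Blake", "widget", 800.0, 2023),
    ("Acme", "gadget", 300.0, 2023),
])

# "Who was our best customer for this item last year?"
best = con.execute("""
    SELECT customer, SUM(amount) AS total
    FROM sales
    WHERE item = 'widget' AND year = 2023
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(best)  # ('Acme', 1200.0)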
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
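As a hedged sketch of what such integration looks like, suppose one source reports weights in pounds under a column named wt while another reports kilograms under weight_kg (both names and the conversion factor are assumptions made for this example):

# Records from two hypothetical source systems with a naming
# conflict and inconsistent units of measure.
source_a = [{"wt": 2.2}, {"wt": 4.4}]    # pounds, column named "wt"
source_b = [{"weight_kg": 3.0}]          # kilograms

LB_PER_KG = 2.20462

def integrate(record):
    # Resolve the naming conflict and convert everything to kilograms.
    if "wt" in record:
        return {"weight_kg": round(record["wt"] / LB_PER_KG, 3)}
    return {"weight_kg": record["weight_kg"]}

unified = [integrate(r) for r in source_a + source_b]
print(unified)  # one consistent format for all three records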
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
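To make the nonvolatile and time-variant properties concrete, here is a small illustrative sketch in Python (the record layout and column names are assumptions): an OLTP system overwrites a customer's current address, while the warehouse appends a dated snapshot so history remains analyzable.

from datetime import date

# OLTP style: one mutable row per customer; an update destroys history.
oltp_customer = {"id": 42, "address": "12 Elm St"}
oltp_customer["address"] = "9 Oak Ave"   # the old value is gone

# Warehouse style: nonvolatile, time-variant rows; a change appends
# a new snapshot instead of modifying existing data.
history = [
    {"id": 42, "address": "12 Elm St", "valid_from": date(2022, 1, 1)},
    {"id": 42, "address": "9 Oak Ave", "valid_from": date(2024, 6, 1)},
]

# "Where did customer 42 live in early 2023?" is only answerable
# from the time-variant history.
as_of = date(2023, 3, 15)
rows = [r for r in history if r["valid_from"] <= as_of]
print(max(rows, key=lambda r: r["valid_from"])["address"])  # 12 Elm St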
Data warehouse architecture:
Data warehouse architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. The three common architectures are:
• Data warehouse Architecture (Basic)
• Data warehouse Architecture (With a staging area)
• Data warehouse Architecture (With a staging area and Data Marts)
Data from the operational systems are
• Extracted
• Cleansed
• Transformed
• Loaded & Propagated
Extraction
Data Warehouse Manager supports standard SQL-style inner, outer and exception file joins, as well as joins based on derived fields, one-to-many relationships, file members, unique key fields, and file unions. You may extract data from both local and distributed databases when OS/400 Distributed Data Management (DDM) is in use.
Record selection and new field calculation criteria may be defined in the extraction object or left for ad hoc entry at run-time. CL programs may also be used to pass ad hoc criteria into the extraction object. NGS provides sample CL source programs to illustrate this capability. Data Warehouse Manager also supports full Boolean selection logic and a timesaving copy facility for the entry of complex selection criteria.
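Product specifics aside, the core of extraction is a join plus record selection. The following generic sketch uses Python's sqlite3; the tables and the Boolean criteria are invented for illustration and do not reflect Data Warehouse Manager's actual interface.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, cust_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    INSERT INTO orders VALUES (1, 10, 500.0), (2, 11, 75.0);
    INSERT INTO customers VALUES (10, 'Acme', 'EAST'), (11, 'Blake', 'WEST');
""")

# Extraction: join the operational tables and apply record
# selection criteria (Boolean logic over region and order total).
rows = con.execute("""
    SELECT c.name, o.total
    FROM orders o
    JOIN customers c ON c.id = o.cust_id         -- inner join
    WHERE c.region = 'EAST' OR o.total > 100.0   -- record selection
""").fetchall()
print(rows)  # [('Acme', 500.0)]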
Transformation
The data transformation capabilities of Data Warehouse Manager convert numeric codes for inventory items, employees, departments and other business terms into meaningful, descriptive values. You can also convert dates into consistent formats and generate summary fields such as gross-to-net, gross margin and more. Other transformation functions allow you to create fields representing items such as integer keys, elapsed days and time, absolute values, remainders, etc.
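As an illustration of these kinds of transformations, here is a sketch in Python; the code table, date format and derived field below are assumptions, not the product's built-in functions.

from datetime import datetime

DEPT_NAMES = {"D01": "Accounting", "D02": "Shipping"}  # hypothetical lookup table

def transform(record):
    out = dict(record)
    # Numeric/alphanumeric code -> meaningful, descriptive value.
    out["dept_name"] = DEPT_NAMES.get(record["dept_code"], "UNKNOWN")
    # Inconsistent source date -> one consistent format.
    out["ship_date"] = datetime.strptime(record["ship_date"], "%m/%d/%y").date().isoformat()
    # Derived summary field (a gross-to-net style calculation).
    out["net"] = record["gross"] - record["deductions"]
    return out

print(transform({"dept_code": "D02", "ship_date": "1/5/24",
                 "gross": 1000.0, "deductions": 150.0}))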
Cleansing
Data Warehouse Manager can handle many data cleansing requirements and provides Application Program Interfaces (APIs) for custom cleansing programs or third-party tools.
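What a custom cleansing program plugged in behind such an API might do, as a minimal sketch (the validity rules are invented for this example):

def cleanse(record):
    # Return a cleaned copy of the record, or None to reject it.
    rec = dict(record)
    # A missing or blank name makes the row unusable: reject it.
    if not rec.get("name", "").strip():
        return None
    # A negative quantity is invalid in this (assumed) domain: repair it.
    if rec.get("qty", 0) < 0:
        rec["qty"] = 0
    # Remove stray whitespace introduced at data entry.
    rec["name"] = rec["name"].strip().title()
    return rec

raw = [{"name": "  acme corp ", "qty": 5},
       {"name": "", "qty": 2},
       {"name": "blake", "qty": -1}]
clean = [r for r in (cleanse(x) for x in raw) if r is not None]
print(clean)  # two cleaned rows; the blank-name row is rejected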
Loading and Propagation
Data Warehouse Manager builds DB2 UDB for iSeries database files, which may be accessed by numerous IBM and third-party software tools. These files may be generated in summary or detail as required. Key fields may be assigned to support star schema warehouse architectures. Depending on the object's attributes, the output may create, replace or append DB2 UDB files.
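The create/replace/append distinction is generic to any load step. Here is a sketch using sqlite3 in place of DB2 UDB; the target table and the mode names are illustrative.

import sqlite3

def load(con, rows, mode="append"):
    # "replace" rebuilds the target table; "append" adds to it;
    # the first run effectively creates it.
    if mode == "replace":
        con.execute("DROP TABLE IF EXISTS fact_sales")
    con.execute("CREATE TABLE IF NOT EXISTS fact_sales (item TEXT, amount REAL)")
    con.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    con.commit()

con = sqlite3.connect(":memory:")
load(con, [("widget", 100.0)], mode="replace")
load(con, [("gadget", 40.0)], mode="append")
print(con.execute("SELECT * FROM fact_sales").fetchall())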
Data pre-processing:
Data pre-processing is necessary because:
• Real-world data is dirty
• Data is inconsistent
Data preprocessing tasks:
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Concept hierarchy generation for numeric data
Data cleaning:
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of the data.
Data Quality Problems:
• Misspellings during data entry, missing information or other invalid data
• These problems are present even in single collections such as files and databases
• They increase when multiple sources need to be integrated
Data cleaning is important because:
• Poor data quality can result in loss of money
$2 billion of U.S. federal loan money has been lost due to poor data quality at a single agency.
• Data warehouses must provide high-level quality of data and service as decision support information systems.
• The probability of dirty data is high.
• A significant portion of cleaning and transformation work is done manually or by low-level programs that are difficult to maintain.
Data Cleaning Approach:
• Detect and remove all major errors and inconsistencies in individual and multiple data sources
• Should be supported by tools to limit manual inspection and programming effort
• Should not be performed in isolation but with schema-related data transformations based on comprehensive metadata
• Mapping functions for data cleaning and other data transformations should be reusable for other data sources, as in the sketch below
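A minimal sketch of such a reusable, metadata-driven mapping function (the rule table and field names are assumptions):

# The cleaning "mapping" is described as data, so the same function
# can be reused across sources by swapping the rule table.
RULES = [
    ("strip", "name"),
    ("upper", "country"),
    ("default", ("qty", 0)),
]

def apply_rules(record, rules=RULES):
    rec = dict(record)
    for op, arg in rules:
        if op == "strip":
            rec[arg] = rec.get(arg, "").strip()
        elif op == "upper":
            rec[arg] = rec.get(arg, "").upper()
        elif op == "default":
            field, value = arg
            rec.setdefault(field, value)
    return rec

print(apply_rules({"name": " acme ", "country": "us"}))
# {'name': 'acme', 'country': 'US', 'qty': 0}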
Tools:
ETL (Extraction, Transformation, Loading) Tool:
• Typically support the ETL process for data warehouses in a comprehensive way
• Typically have little built-in data cleaning capabilities but allow the user to specify cleaning functionality via a proprietary API
• There is usually no data analysis support to automatically detect data errors and inconsistencies.
• Provide transformation libraries that cover many data transformation and cleaning needs
• Such as data type conversions (e.g., date reformatting)
• String functions (e.g., split, merge, replace, sub-string search)
• Arithmetic, scientific and statistical functions
• Typically cover if-then and case constructs that help handle exceptions in data values, such as misspellings, abbreviations, missing or cryptic values, and values outside of range
• These problems are also addressed by using a table lookup construct and join functionality, as in the sketch following this list
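For instance, a table-lookup construct that repairs misspelled or cryptic city values might be sketched like this in Python (the lookup table is invented):

# Hypothetical lookup table mapping dirty values to canonical ones.
CITY_LOOKUP = {
    "N.Y.": "New York",
    "NYC": "New York",
    "Sna Francisco": "San Francisco",  # a known misspelling
}

def fix_city(value):
    # if-then/case-style exception handling: canonical values pass
    # through, known-dirty values are corrected, the rest are flagged.
    if value in CITY_LOOKUP.values():
        return value
    if value in CITY_LOOKUP:
        return CITY_LOOKUP[value]
    return "UNKNOWN"

print([fix_city(v) for v in ["NYC", "New York", "Bostn"]])
# ['New York', 'New York', 'UNKNOWN']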
Examples of these tools include:
• Copy Manager (Information Builders)
• DataStage (Informix/Ardent)
• Extract (ETI)
• PowerMart (Informatica)
• DecisionBase (CA/Platinum)
• Data Transformation Services (Microsoft)
• MetaSuite (Minerva/Carleton)
• Sagent Solution Platform (Sagent)
• Warehouse Administrator (SAS)
Data integration and transformation:
This includes:
• Schema integration
• Redundancy
Schema integration:
Multiple data sources may provide data on the same entity types.
For example, metadata from two music servers.
Approach to data integration:
• Use mapping rules to handle structural differences (a sketch follows this list)
• Ambient contains a mapping rule manager capable of:
– Finding appropriate sets of mapping rules
– Rewriting queries based on a set of mapping rules
– Handling different versions of mapping rules
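As a hedged illustration (the field names are assumptions, and this is not the actual rule syntax of the system mentioned above), a mapping rule can be as simple as a source-field to unified-field table:

# Two music servers describe the same track with different schemas.
server_a = {"artist": "Miles Davis", "title": "So What"}
server_b = {"performer": "Miles Davis", "track_name": "So What"}

# A mapping rule: rename source fields into the unified schema.
MAPPING_RULES = {
    "server_a": {},  # already matches the unified schema
    "server_b": {"performer": "artist", "track_name": "title"},
}

def to_unified(record, source):
    rule = MAPPING_RULES[source]
    return {rule.get(k, k): v for k, v in record.items()}

print(to_unified(server_b, "server_b") == to_unified(server_a, "server_a"))  # True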
Redundancy:
Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal, or duplication. This can have a negative connotation, especially in rhetoric (superfluous or repetitive), or a positive implication, especially in engineering (serving as a duplicate for preventing failure of an entire system).
Data reduction:
• A data warehouse may store terabytes of data, so complex data analysis and mining may take a very long time to run on the complete data set.
• Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. A sampling sketch follows.
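One simple reduction strategy is to analyze a random sample instead of the full data set. A minimal sketch (the data and sample size are illustrative):

import random
import statistics

random.seed(7)
# Stand-in for a huge fact table: one numeric measure per row.
full_data = [random.gauss(100.0, 15.0) for _ in range(100_000)]

# Reduced representation: a 1% simple random sample.
sample = random.sample(full_data, k=1_000)

# The sample supports (almost) the same analytical result.
print(round(statistics.mean(full_data), 2),
      round(statistics.mean(sample), 2))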
Data cube aggregation:
• The lowest level of a data cube holds the aggregated data for an individual entity of interest
o e.g., a customer in a phone-calling data warehouse
• Multiple levels of aggregation in data space further reduce the size of the data
• Reference the appropriate level: use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using the data cube when possible (a sketch of multi-level aggregation follows this list)
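A minimal sketch of multi-level aggregation in plain Python (the call-record fields and the two levels are assumptions):

from collections import defaultdict

# Stand-in detail data: (customer, year, minutes) call records.
calls = [("alice", 2023, 12), ("alice", 2023, 7),
         ("alice", 2024, 3), ("bob", 2023, 9)]

def aggregate(rows, key):
    # Roll the detail rows up to a coarser level of the cube.
    totals = defaultdict(int)
    for cust, year, minutes in rows:
        totals[key(cust, year)] += minutes
    return dict(totals)

# Lowest cube level: totals per customer per year.
print(aggregate(calls, lambda c, y: (c, y)))
# Higher level: totals per customer across all years (smaller still).
print(aggregate(calls, lambda c, y: c))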
Purpose:
The main purpose of a data warehouse is to support decision-making.
• Data is collected from a number of different sources.
• It is organized to support advanced analysis.
The main goal of data warehousing is to generate front-end analytics that will support business executives and operational managers.