DATA MINING AND DATA WAREHOUSING
Many software projects accumulate a great deal of data, so we need effective ways to maintain that data and retrieve it from the database. Two technologies that address these concerns are data mining and data warehousing.
Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behavior that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery, specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
Data warehousing takes a relatively simple idea and incorporates it into the technological underpinnings of a company. The idea is that a unified view of all data that a company collects will help improve operations. If hiring data can be combined with sales data, the idea is that it might be possible to discover and exploit patterns in the combined entity.
This paper presents an overview of the different processes and advanced techniques involved in data mining and data warehousing.
1. Introduction to Data Mining:
Data mining can be defined as "a decision support process in which we search for patterns of information in data." This search may be done by the user alone, i.e. simply by performing queries, but that approach is laborious and in most cases not comprehensive enough to reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases - patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc.
1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity: discovery, predictive modeling and forensic analysis, as shown in the figure below.
Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are, without the user thinking of the relevant questions first.
In predictive modeling, patterns discovered from the database are used to predict the future. Predictive modeling thus allows the user to submit records with some unknown field values, and the system guesses the unknown values based on patterns previously discovered from the database. While discovery finds patterns in data, predictive modeling applies the patterns to guess values for new data items.
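As a minimal sketch of this idea, the Python snippet below "discovers" a very simple pattern (the most frequent value of one field for each value of another) and then uses it to guess a missing field in a new record. The field names, the sample records and the helper names learn_pattern and predict are all hypothetical and chosen only for illustration; real predictive models are far richer.

```python
from collections import Counter, defaultdict

# Hypothetical historical records in which every field is known.
history = [
    {"age_group": "young", "buys_catalog": "no"},
    {"age_group": "young", "buys_catalog": "no"},
    {"age_group": "young", "buys_catalog": "yes"},
    {"age_group": "senior", "buys_catalog": "yes"},
    {"age_group": "senior", "buys_catalog": "yes"},
]

def learn_pattern(records, known_field, target_field):
    """Discovery step: most frequent target value for each known value."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[known_field]][rec[target_field]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(pattern, record, known_field, default=None):
    """Prediction step: guess the unknown field from the learned pattern."""
    return pattern.get(record[known_field], default)

pattern = learn_pattern(history, "age_group", "buys_catalog")

# A new record whose "buys_catalog" value is unknown.
new_record = {"age_group": "senior", "buys_catalog": None}
new_record["buys_catalog"] = predict(pattern, new_record, "age_group")
print(new_record)
```

The split into two functions mirrors the distinction made above: learn_pattern plays the role of discovery, while predict applies the discovered pattern to a new data item.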
Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first find what the norm is, and then we detect those items that deviate from the norm by more than a given threshold. Discovery helps us find "usual knowledge," but forensic analysis looks for unusual and specific cases.
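The "find the norm, then flag deviations" step can be sketched as follows. This is only an illustration under stated assumptions: the norm is taken to be the mean, deviation is measured in standard deviations, and the function name find_anomalies, the threshold of 2.0 and the sample amounts are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    norm = mean(values)      # what is "usual"
    spread = stdev(values)   # how much values normally vary
    if spread == 0:
        return []            # all values identical: nothing is unusual
    return [v for v in values if abs(v - norm) / spread > threshold]

# Hypothetical transaction amounts; one of them is clearly out of line.
amounts = [102, 98, 101, 97, 100, 103, 99, 500]
print(find_anomalies(amounts))  # prints [500]
```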
1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of users - executives, end users and analysts.
Executives need top-level insights and spend far less time with computers than the other groups.
End users are sales people, market researchers, scientists, engineers, physicians, etc.
Analysts may be financial analysts, statisticians, consultants, or database designers.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and continuous data mining.
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to understand this data set, or use it for prediction on new marketing campaigns. Analysts usually perform episodic mining.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of specific measures such as profitability.
In continuous mining we try to understand how the world has changed within a given time period and try to gain an understanding of the factors that influence change.
1.3 Data Mining Applications:
Virtually any process can be studied, understood, and improved using data mining. The top three end uses of data mining are, not surprisingly, in the marketing area.
Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogs, direct-mail marketers can reduce expenses and increase their sales. Targeting specific promotions to existing and potential customers offers similar benefits.
Market-basket analysis helps retailers understand which products are purchased together or by an individual over time. With data mining, retailers can determine which products to stock in which stores, and even how to place them within a store. Data mining can also help assess the effectiveness of promotions and coupons.
Another common use of data mining in many organizations is to help manage customer relationships. By determining the characteristics of customers who are likely to leave for a competitor, a company can take action to retain them, which is usually far less expensive than acquiring new customers.
Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. The aggregate total for fraud losses is enormous. But with data mining, these companies can identify potentially fraudulent transactions and contain the damage.
Financial companies use data mining to determine market and industry characteristics as well as predict individual company and stock performance. Another interesting niche application is in the medical field: Data mining can help predict the effectiveness of surgical procedures, diagnostic tests, medications, service management, and process control.
1.4 Data Mining Techniques:
Data mining has three major components: clustering or classification, association rules, and sequence analysis.
1.4.1 Classification:
The clustering techniques analyze a set of data and generate a set of grouping rules that can be used to classify future data. The mining tool automatically identifies the clusters by studying the patterns in the training data. Once the clusters are generated, classification can be used to identify the particular cluster to which an input belongs. For example, one may classify diseases and provide the symptoms that describe each class or subclass.
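The second step, assigning a new input to one of the groups, can be sketched with a nearest-centroid classifier. Note the assumptions: the groups here are given as labeled training data rather than discovered automatically, each group is summarized by its average point, and the disease names, the two features (body temperature and days of coughing) and all values are invented for illustration.

```python
from math import dist

# Hypothetical training data: (temperature in deg C, days of coughing) per disease.
training = {
    "flu": [(39.0, 2.0), (38.5, 3.0), (39.5, 1.0)],
    "cold": [(37.0, 5.0), (37.2, 6.0), (36.8, 7.0)],
}

def centroids(groups):
    """Summarize each group by the average of its points."""
    result = {}
    for label, points in groups.items():
        n = len(points)
        result[label] = tuple(sum(axis) / n for axis in zip(*points))
    return result

def classify(point, centers):
    """Assign the point to the group whose centroid is closest."""
    return min(centers, key=lambda label: dist(point, centers[label]))

centers = centroids(training)
print(classify((38.8, 2.5), centers))  # prints flu
```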
1.4.2 Association:
An association rule is a rule that implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.
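Two measures commonly used to judge an association rule are its support (how often the items occur together) and its confidence (how often the consequent occurs when the antecedent does). The sketch below computes both for a rule of the form "fever and cough imply flu"; the patient records and function names are hypothetical, and the multi-level abstraction mentioned above is not modeled.

```python
# Hypothetical patient records: each set holds the observations for one patient.
records = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "headache"},
    {"cough", "flu"},
    {"fever", "cough"},
]

def support(itemset, data):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in data) / len(data)

def confidence(antecedent, consequent, data):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    base = support(antecedent, data)
    both = support(set(antecedent) | set(consequent), data)
    return both / base if base else 0.0

rule_support = support({"fever", "cough", "flu"}, records)
rule_confidence = confidence({"fever", "cough"}, {"flu"}, records)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data the rule holds for two of the five patients, and for two of the three patients who have both fever and cough.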
1.4.3 Sequential Analysis:
In sequential analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction, as in the case of association), e.g. a shopper buys item A in the first week of the month and then buys item B in the second week.
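The shopper example can be sketched as a count of how many customers bought item A in one transaction and item B in a later one. The customer histories and the function names follows and sequence_support are hypothetical; each history is assumed to be already sorted by time.

```python
# Hypothetical purchase histories: one time-ordered list of baskets per customer.
histories = {
    "c1": [{"A"}, {"B"}],
    "c2": [{"A", "C"}, {"C"}, {"B"}],
    "c3": [{"B"}, {"A"}],
    "c4": [{"A"}],
}

def follows(history, first, then):
    """True if `first` appears in a transaction strictly before one containing `then`."""
    seen_first = False
    for basket in history:
        if seen_first and then in basket:
            return True
        if first in basket:
            seen_first = True
    return False

def sequence_support(all_histories, first, then):
    """Fraction of customers whose history contains `first` followed later by `then`."""
    hits = sum(follows(h, first, then) for h in all_histories.values())
    return hits / len(all_histories)

print(sequence_support(histories, "A", "B"))  # prints 0.5
```

Because the check for `then` comes before `first` is recorded, items bought together in a single basket do not count as a sequence, which is the distinction from association drawn above.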
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you choose. Consequently, you'll need a variety of tools and technologies to find the best possible model. Classification models are among the most common, so the more popular ways of building them are explained here. Classification typically involves at least one of two workhorse statistical techniques - logistic regression (a generalization of linear regression) and discriminant analysis. However, as data mining becomes more common, neural nets and decision trees are also getting more consideration. Although complex in their own way, these methods require less statistical sophistication on the part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model that takes and combines a set of inputs to predict a continuous or categorical variable.
Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"
The value from each hidden node is a function of the weighted sum of the values from all the preceding nodes that feed into it. The process of building a model involves finding the connection weights that produce the most accurate results by "training" the neural net with data. The most common training method is back propagation, in which the output result is compared with known correct values. After each comparison, the weights are adjusted and a new result computed. After enough passes through the training data, the neural net typically becomes a very good predictor.
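The training loop described above can be sketched in a few lines of plain Python. This is a toy illustration under several assumptions that are not in the source: a network with two inputs, two hidden nodes and one output, a sigmoid function at each node, a learning rate of 0.5, and the logical OR function as training data. All names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Each node outputs a function of the weighted sum of its inputs (last weight is a bias)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1]) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, w_hidden, w_out, rate=0.5, passes=2000):
    for _ in range(passes):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Compare the output with the known correct value.
            delta_out = (output - target) * output * (1 - output)
            # Propagate the error back to the hidden nodes (before changing w_out).
            delta_hidden = [delta_out * w_out[j] * h * (1 - h) for j, h in enumerate(hidden)]
            # Adjust the connection weights.
            for j, h in enumerate(hidden):
                w_out[j] -= rate * delta_out * h
            w_out[-1] -= rate * delta_out
            for j, ws in enumerate(w_hidden):
                for i, v in enumerate(x):
                    ws[i] -= rate * delta_hidden[j] * v
                ws[-1] -= rate * delta_hidden[j]

random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # logical OR

before = mean_squared_error(data, w_hidden, w_out)
train(data, w_hidden, w_out)
after = mean_squared_error(data, w_hidden, w_out)
print(f"error before training: {before:.4f}, after: {after:.4f}")
```

The error printed after training should be noticeably smaller than the error before, which is what "after enough passes the net becomes a good predictor" means in practice.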
Decision trees represent a series of rules that lead to a class or value. For example, you may wish to classify loan applicants as good or bad credit risks. The figure below shows a simple decision tree that solves this problem. Armed with this tree and a loan application, a loan officer could determine whether an applicant is a good or bad credit risk. An individual with "Income > $40,000" and "High Debt" would be classified as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years" would be classified as a "Good Risk."
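A decision tree of this kind translates directly into nested if/else rules, as the sketch below shows. Only two of the four leaves are stated in the text; the other two (high income with low debt is a good risk, low income with five years or less in the job is a bad risk) and the treatment of an income of exactly $40,000 are assumptions made here to complete the tree. The function and parameter names are hypothetical.

```python
def classify_applicant(income, high_debt, years_in_job):
    """Walk the loan-risk tree from the root to a leaf."""
    if income > 40_000:
        # Stated in the text: high income with high debt is a bad risk.
        # Assumed: high income with low debt is a good risk.
        return "Bad Risk" if high_debt else "Good Risk"
    # Stated in the text: lower income with more than 5 years in the job is a good risk.
    # Assumed: lower income with 5 years or less in the job is a bad risk.
    return "Good Risk" if years_in_job > 5 else "Bad Risk"

print(classify_applicant(income=55_000, high_debt=True, years_in_job=2))   # prints Bad Risk
print(classify_applicant(income=30_000, high_debt=False, years_in_job=8))  # prints Good Risk
```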
Decision trees have become very popular because they are reasonably accurate and, unlike neural nets, easy to understand. Decision trees also take less time to build than neural nets. Neural nets and decision trees can also be used to perform regressions, and some types of neural nets can even perform clustering.
2.1 Introduction to Data Warehousing:
In the current knowledge economy, information is the key to gaining competitive advantage. Organizations know very well that the information vital for decision making lies in their databases. Mountains of data accumulate in various databases scattered around the enterprise, but the key to gaining competitive advantage lies in deriving insight and intelligence from these data. Data warehousing helps in integrating, categorizing, codifying and arranging the data from all parts of an enterprise.
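According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions. The concept of the data warehouse is depicted in the figure.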
2.1.1 Subject oriented data:
All relevant data about a subject is gathered and stored as a single set in a useful format.
2.1.2 Integrated data:
Data is stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures and physical attributes, even when the underlying operational systems store the data differently.
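A small sketch may make "integrated" concrete. Assume, purely for illustration, two operational systems that record the same facts with different field names, encodings and units; each gets a small conversion function that maps its rows onto one warehouse convention. The systems, field names and values below are all hypothetical.

```python
def from_system_a(row):
    # Hypothetical system A: gender as "m"/"f", height as text in centimetres.
    return {
        "gender": {"m": "male", "f": "female"}[row["sex"]],
        "height_cm": float(row["height_cm"]),
    }

def from_system_b(row):
    # Hypothetical system B: gender as 1/0, height as a number in inches.
    return {
        "gender": {1: "male", 0: "female"}[row["gender_code"]],
        "height_cm": round(row["height_in"] * 2.54, 1),
    }

# After conversion, every row follows the same naming, encoding and units.
warehouse_rows = [
    from_system_a({"sex": "f", "height_cm": "170"}),
    from_system_b({"gender_code": 1, "height_in": 70}),
]
for row in warehouse_rows:
    print(row)
```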
2.1.3 Non-volatile data:
The data warehouse is read-only: data is loaded into the data warehouse and accessed there.
2.1.4 Time-variant data:
The warehouse holds long-term data, typically covering 5 to 10 years, as opposed to the 30 to 60 days typical of operational data.
2.2 Structure of data warehouse:
The design of the data architecture is probably the most critical part of a data warehousing project. The key is to plan for growth and change, as opposed to trying to design the perfect system from the start. The design of the data architecture involves understanding all of the data and how different pieces are related. For example, payroll data might be related to sales data by the ID of the sales person, while the sales data might be related to customers by the customer ID. By connecting these two relationships, payroll data could be related to customers (e.g., which employees have ties to which customers).
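The payroll-to-customer link described above can be sketched with Python's built-in sqlite3 module and two joins. The table names, column names and sample rows are hypothetical; a real warehouse schema would be considerably larger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payroll (salesperson_id INTEGER, name TEXT, salary REAL);
CREATE TABLE sales (sale_id INTEGER, salesperson_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
INSERT INTO payroll VALUES (1, 'Asha', 52000), (2, 'Ben', 48000);
INSERT INTO customers VALUES (10, 'Acme'), (11, 'Globex');
INSERT INTO sales VALUES (100, 1, 10, 900.0), (101, 1, 11, 300.0), (102, 2, 10, 150.0);
""")

# payroll -> sales via the salesperson ID, then sales -> customers via the customer ID.
rows = conn.execute("""
SELECT DISTINCT p.name, c.customer_name
FROM payroll AS p
JOIN sales AS s ON s.salesperson_id = p.salesperson_id
JOIN customers AS c ON c.customer_id = s.customer_id
ORDER BY p.name, c.customer_name
""").fetchall()

for employee, customer in rows:
    print(employee, "has ties to", customer)
```

Neither the payroll table nor the customers table mentions the other; the relationship appears only once both are connected through the sales data.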
Once the data architecture has been designed, you can then consider the kinds of reports that you are interested in. You might want to see a breakdown of employees by region, or a ranked list of customers by revenue. These kinds of reports are fairly simple. The power of a data warehouse becomes more obvious when you want to look at links between data associated with disparate parts of an organization (e.g., HR, accounts payable, and project management).
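One of the simple reports mentioned above, a ranked list of customers by revenue, might look like the following sketch. The sales rows and the function name rank_customers_by_revenue are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (customer, amount).
sales = [("Acme", 900.0), ("Globex", 300.0), ("Acme", 150.0), ("Initech", 500.0)]

def rank_customers_by_revenue(rows):
    """Total the revenue per customer and sort from highest to lowest."""
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_customers_by_revenue(sales)
for position, (customer, revenue) in enumerate(ranking, start=1):
    print(position, customer, revenue)
```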
2.3 Benefits of Data warehousing:
Cost avoidance benefits.
Higher productivity.
Benefits through better analytical capability.
Manage business complexity.
Leverage on their existing investments.
End user spending.
Spending on e-business.
Accessibility and ease of use.
Real time information and analysis.
2.4 Techniques used by different organizations for efficient data warehousing:
Most decisions to build data warehouses are driven by non-HR needs. Over the past decade, back office (supply chain) and front office (sales and marketing) organizations have spearheaded the creation of large corporate data warehouses. Improving the efficiency of the supply chain and competing for customers both rely on the tactical uses that a data warehouse can provide. The key for other organizations, including HR, is to be involved in the creation of the warehouse so that their needs can be met by the resulting system. The decision to build a warehouse usually comes when both the data volume and the complexity of the questions being asked have grown beyond what the current systems can handle. At that point the business becomes limited by the information that users can reasonably extract from the data system.
3. Conclusion:
Data mining offers great promise in helping organizations uncover hidden patterns in their data. However, data mining tools must be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect and prepare the data properly and to check models against the real world. The "best" model is often found after building models of several different types and by trying out various technologies or algorithms.
The data mining area is still relatively young, and tools that support the whole of the data mining process in an easy-to-use fashion are rare. One of the most important issues facing researchers is applying the techniques to very large data sets. Most of the mining techniques are derived from artificial intelligence, where they are generally executed against small sets of data that fit in memory. In data mining applications, however, these techniques must be applied to data held in very large databases. Approaches to this problem include the use of parallelism and the development of new database-oriented techniques. Much work is still required before data mining can be successfully applied to large data sets; only then will its true potential be realized.
Data warehousing is an important concept for software professionals who must manage large and complex data efficiently. A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses provide the user with a single consolidated interface to data, making decision support easier to implement. In a world of highly interconnected networks, the data obtained or used by many companies is very large, and its maintenance becomes difficult and costly. Efficient data warehousing should therefore be implemented to gather data from different branches (all over the world) and maintain it so that information can be provided to all other branches that do not hold the data concerned.