Friday, May 28, 2010

DATA MINING AND DATA WARE HOUSE

DATA MINING AND DATA WARE HOUSE
Many software projects are accumulated by a great deal of data, so we really need information about the effective maintenance and reteving of data from the database. The newest, hottest technology to address these concerns is data mining and data warehousing.
Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behavior that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery, specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
Data warehousing takes a relatively simple idea and incorporates it into the technological underpinnings of a company. The idea is that a unified view of all data that a company collects will help improve operations. If hiring data can be combined with sales data, the idea is that it might be possible to discover and exploit patterns in the combined entity.
This paper will present an overview of the different process and advanced techniques involving in data mining and data warehousing.




1. Introduction to Data Mining:
Data mining can be defined as "a decision support process in which we search for patterns of information in data." This search may be done just by the user, i.e. just by performing queries, in which case it is quite hard and in most of the cases not comprehensive enough to reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases - patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc.
1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity: discovery, Predictive modeling and forensic analysis, as shown in figure below.
Discovery is the process of looking in a database to find hidden patterns without a
Predetermined idea or hypothesis about what the patterns may be. In other words, the
program takes the initiative in finding what the interesting patterns are, without the user
thinking of the relevant questions first.

Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are, without the user thinking of the relevant questions first.
In predictive modeling patterns discovered from the database are used to predict the future. Predictive modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the patterns to guess values for new data items.
Forensic analysis:
This is the process of applying the extracted patterns to find anomalous or
unusual data elements. To discover the unusual, we first find what is the norm, and then
we detect those items that deviate from the usual within a given threshold. Discovery
helps us find "usual knowledge," but forensic analysis looks for unusual and specific
cases.
1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of users - executives, end users and analysts.
Executives need top-level insights and spend far less time with computers than the
other groups.
End users are sales people, market researchers, scientists, engineers, physicians,
etc.
Analysts may be financial analysts, statisticians, consultants, or database designers.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and continuous data mining.
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to understand this data set, or use it for prediction on new marketing campaigns. Analysts usually perform episodic mining.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of specific measures such as profitability.
In continuous mining we try to understand how the world has changed within a given time period and try to gain an understanding of the factors that influence change.

1.3 Data Mining Applications:
Virtually any process can be studied, understood, and improved using data mining. The top three end uses of data mining are, not surprisingly, in the marketing area.
Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogs direct-mail marketers can reduce expenses and increase their sales. Targeting specific promotions to existing and potential customers offers similar benefits.
Market-basket analysis helps retailers understand which products are purchased together or by an individual over time. With data mining, retailers can determine which products to stock in which stores, and even how to place them within a store. Data mining can also help assess the effectiveness of promotions and coupons.
Another common use of data mining in many organizations is to help manage customer relationships. By determining characteristics of customers who are likely to leave for a competitor, a company can take action to retain that customer because doing so is usually far less expensive than acquiring a new customer.
Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. The aggregate total for fraud losses is enormous. But with data mining, these companies can identify potentially fraudulent transactions and contain the damage.
Financial companies use data mining to determine market and industry characteristics as well as predict individual company and stock performance. Another interesting niche application is in the medical field: Data mining can help predict the effectiveness of surgical procedures, diagnostic tests, medications, service management, and process control.
1.4 Data Mining Techniques:
Data Mining has three major components Clustering or Classification, Association Rules and Sequence Analysis.


1.4.1 Classification:
The clustering techniques analyze a set of data and generate a set of grouping rules that can be used to classify future data. The mining tool automatically identifies the
clusters, by studying the pattern in the training data. Once the clusters are generated, classification can be used to identify, to which particular cluster, an input belongs. For example, one may classify diseases and provide the symptoms, which describe each class or subclass.
1.4.2 Association:
An association rule is a rule that implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.
1.4.3 Sequential Analysis:
In sequential Analysis, we seek to discover patterns that occur in sequence. This
deals with data that appear in separate transactions (as opposed to data that appear in the same transaction in the case of association) e.g. if a shopper buys item A in the first week of the month, and then he buys item B in the second week etc.
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you choose. Consequently, you'll need a variety of tools and technologies to find the best possible model. Classification models are among the most common, so the more popular ways for building them have been explained here. Classifications typically involve at least one of two workhorse statistical techniques - logistic regression (a generalization of linear regression) and discriminate analysis. However, as data mining becomes more common, neural nets and decision trees are also getting more consideration. Although complex in their own way, these methods require less statistical sophistication on the part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model that takes and combines a set of inputs to predict a continuous or categorical variable.

Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"

The value from each hidden node is a function of the weighted sum of the values from all the preceding nodes that feed into it. The process of building a model involves finding the connection weights that produce the most accurate results by "training" the neural net with data. The most common training method is back propagation, in which the output result is compared with known correct values. After each comparison, the weights are adjusted and a new result computed. After enough passes through the training data, the neural net typically becomes a very good predictor.
Decision trees represent a series of rules to lead to a class or value. For example, you may wish to classify loan applicants as good or bad credit risks. Figure below shows a simple decision tree that solves this problem. Armed with this tree and a loan application, a loan officer could determine whether an applicant is a good or bad credit risk. An individual with "Income > $40,000" and "High Debt" would be classified as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years" would be classified as a "Good Risk."


Decision trees have become very popular because they are reasonably accurate and, unlike neural nets, easy to understand. Decision trees also take less time to build than neural nets. Neural nets and decision trees can also be used to perform regressions, and some types of neural nets can even perform clustering.
2.1 Introduction to Data warehousing:
In the current knowledge economy, it is now an indisputable fact that information is the key to organizations for gaining competitive advantage. Organizations very well know that the vital information for decision making is lying in its databases. Mountains of data are getting accumulated in various databases scattered around the enterprise. But the key to gaining competitive advantage lies in deriving insight and intelligence out of these data. Data warehousing helps in integrating categorizing, codifying and arranging the data from all parts of an enterprise.
According to Bill Inmon, known as the father of Data warehousing, The concept of data warehouse is depicted as figure




2.1.1 Subject oriented data:
All relevant data about a subject is gathered and stored as a single set in a useful format.
2.1.2 Integrated data:
Data is stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures and physical attributes, even when the underlying operational system store the data differently.


2.1.3 Non-volatile data:
The data warehouse is read-only, data is loaded in to the data warehouse and accesses there.
2.1.4 Time-variant data:
This long term data is from 5 to 10 years as opposed to the 30-60 days of operational data.
2.2 Structure of data warehouse:

The design of the data architecture is probably the most critical part of a data warehousing project. The key is to plan for growth and change, as opposed to trying to design the perfect system from the start. The design of the data architecture involves understanding all of the data and how different pieces are related. For example, payroll data might be related to sales data by the ID of the sales person, while the sales data might be related to customers by the customer ID. By connecting these two relationships, payroll data could be related to customers (e.g., which employees have ties to which customers).
Once the data architecture has been designed, you can then consider the kinds of reports that you are interested in. You might want to see a breakdown of employees by region, or a ranked list of customers by revenue. These kinds of reports are fairly simple. The power of a data warehouse becomes more obvious when you want to look at links between data associated with disparate parts of a organization (e.g., HR, accounts payable, and project management).
2.3 Benefits of Data warehousing:
Cost avoidance benefits.
Higher productivity.
Benefits through better analytical capability.
Manage business complexity.
Leverage on their existing investments.
End user spending.
Spending on e-business.
Accessibility and easy of use.
Real time information and analysis.
2.4 Techniques by different organization on efficient data warehouse:
That being said, most decisions to build data warehouses are driven by non-HR needs. Over the past decade, back office (supply chain) and front office (sales and marketing) organizations have spearheaded the creation of large corporate data warehouses. Improving the efficiency of the supply chain and competition for customers rely on the tactical uses that a data warehouse can provide. The key for other organizations, including HR, is to be involved in the creation of the warehouse so that their meets can be met by any resulting system.  This usually happens because both the data volume and question complexity have grown beyond what the current systems can handle. At that point the business becomes limited by the information that users can reasonably extract from the data system.
3.Conclusion:
Data mining offers great promise in helping organizations uncover hidden patterns in their data. However, data mining tools must be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect and prepare the data properly and to check models against the real world. The "best" model is often found after building models of several different types and by trying out various technologies or algorithms.
The data mining area is still relatively young, and tools that support the whole of the data mining process in an easy to use fashion are rare. However, one of the most important issues facing researchers is the use of techniques against very large data sets. All the mining techniques are based on Artificial Intelligence, where they are generally executed against small sets of data, which can fit in memory. However, in data mining applications these techniques must be applied to data held in very large databases. These include use of parallelism and development of new database oriented techniques. However, much work is required before data mining can be successfully applied to large data sets. Only then will the true potential of data mining be able to be realized.
The data warehousing is the hottest concept for many software professionals to over come the sophisticated data to be managed efficiently. The data warehouse is repository (or archive) of information gathered from multiple sources, stored under a unified scheme, at a single site. Once gathered the data are stored for a long time permitting access to historical data. Thus, data ware houses provide the user a single consolidated interface to data, making decision support actions easier to implement. In the world of highly interconnected networks the data obtained or used by many companies would be very large and the maintenance becomes difficult and costly. So, the efficient data warehousing is to be implemented to obtain data from different branches (all over the world) and maintain it for providing information to all other branches (which does not have the concerned data).

Genetic Algorithms

Abstract
In this paper we explore the use of an adaptive search technique (genetic algorithms) to construct a system GABIL which continually learns and refines concept classification rules from its interaction with the environment. The performance of the system is measured on a set of concept learning problems and compared with the performance of two existing systems: ID5R and C4.5. Preliminary results support that, despite minimal system bias, GABIL is an effective concept learner and is quite competitive with ID5R and C4.5 as the target concept increases in complexity.
1 Introduction
An important requirement for both natural and artificial organisms is the ability to acquire concept classification rules from interactions with their environment. In this paper we explore the use of an adaptive search technique, namely Genetic Algorithms (GAs), as the central mechanism for building a system which continually learns and
refines concept classification rules from its interaction with the environment. We show how concept learning tasks can be represented and solved by GAs, and we provide empirical results, which illustrate the performance of GAs relative to more traditional methods. Finally, we discuss the advantages and disadvantages of this approach and describe future research activities.
2 Concept Learning Problems
Supervised concept learning involves inducing concept descriptions for the concepts to be learned from a set of positive and negative examples of the target concepts. Examples are represented as points in an n-dimensional feature space, which is defined a priori and for which all the legal values of the features are known. Concepts are therefore represented as subsets of points in the given dimensional space.
A concept learning program is presented with both a description of the feature space and a set of correctly classified examples of the concepts, and is expected to generate a reasonably accurate description of the (unknown) concepts. Since concepts can be arbitrarily complex subsets of a feature space, an important issue is the choice of the concept description language. The language must have sufficient expressive power to describe large subsets succinctly and yet be able to capture irregularities. The two language forms generally used are decision trees [Quinlan, 1986] and rules [Michalski, 1983]. Another important issue arises from the situation that there is a large (possibly infinite) set of concept descriptions, which are consistent with any particular finite set of examples. This is generally resolved by introducing either explicitly or implicitly a bias (preference) for certain kinds of descriptions (e.g., shorter or less complex descriptions may be preferred). Finally, there is the difficult issue of evaluating and comparing the performance of concept learning algorithms. The most widely used approach is a batch mode in which the set of examples is divided into training set and a test set. The concept learner is required to produce a concept description from the training examples. The validity of the description produced is then measured by the percentage of correct classifications made by the system on the second (test) set of examples during which no
further learning takes place. The alternative evaluation approach is an incremental mode in which the concept learner is required to produce a concept description from the examples seen so far and to use that description to classify the next incoming example. In this mode learning never stops, and evaluation is in terms of learning curves, which measure the predictive performance of the concept learner over time. This incremental and continuous model of concept learning matches more closely the kind of concept learning that an organism performs as it explores a complex and changing world. Consequently, we use predictive learning curves as our evaluation methodology.
3 Genetic Algorithms and Concept Learning
In order to apply GAs to a particular problem, we need to select an internal representation of the space to be searched and define an external evaluation function, which assigns utility to candidate solutions. Both components are critical to the successful application of the GAs to the problem of interest. 1
3.1 Representing the Search Space
The traditional internal representation used by Gas involves using fixed-length (generally binary) strings to represent points in the space to be searched. However, such representations do not appear well suited for representing the space of concept descriptions, which are generally symbolic in nature, which have both syntactic and semantic constraints, and which can be of widely varying length and complexity.There are two general approaches one might take to resolve this issue. The first involves changing the fundamental GA operators (crossover and mutation) to work effectively with complex non-string objects [Rendell, 1985]. This must be done carefully in order to preserve the properties, which make the GAs effective adaptive search procedures. Alternatively, one can attempt to construct a string representation, which minimizes any changes to the GAs. We are interested in pursuing both approaches. Our ideas on the first approach will be discussed briefly at the end of the paper. In the following sections we will describe our results using the second approach in which we try to apply classical GAs with minimal changes.
3.2 Defining Fixed-length Classifier Rules
Our approach to choosing a representation, which results in minimal changes to the standard GA operators, involves carefully selecting the concept description language. A natural way to express complex concepts is as a disjunctive set of (possibly overlapping) Classification rules. The left-hand side of each rule (distinct) consists of a conjunction of one or more tests involving feature values. The right-hand side of a rule indicates the concept (classification) to be assigned to the examples, which match its left-hand side. Collectively, a set of such rules can be thought of as representing the (unknown) concepts if the rules correctly classify the elements of the feature space. If we allow arbitrarily complex terms in the conjunctive left-hand side of such rules, we will have a very powerful description language, which will be difficult to represent as strings. However, by restricting the complexity of the elements of the conjunctions, we are able to use a string representation and standard GAs, with the only negative side effect that more rules may be required to express the concept. This is achieved by restricting each
element of a conjunction to be a test of the form: return true if the value of feature i of the example is in the given value set; return false otherwise.
For example, a rule might take the following symbolic form: "if (F2 = large) and (F5 = tall or thin) then it’s a widget". Since the left-hand sides are conjunctive forms
with internal disjunction, there is no loss of generality by requiring that there be at most one test for each feature (on the left hand side of a rule). With these restrictions we can now construct a fixed-length internal representation for classifier rules. Each fixed-length rule will have N feature tests, one for each feature. A fixed-length binary string, the length of which will depend of the type of feature (nominal, ordered, etc.) will represent each feature test. For simplicity, the examples used in this paper will involve features with nominal values. In this case we use k bits for the k values of a nominal feature. So, for example, if the legal values for F1 were the days of the week, then the pattern 0111110 would represent the test for F1 being a weekday. As an example, the left-hand side of a rule for a 5-feature problem would be represented internally as:
F1 F2 F3 F4 F5
0110010 1111 01 111100 11111
Notice that a feature test involving all 1’s matches any value of a feature and is equivalent to "dropping" that conjunctive term (i.e., the feature is irrelevant). So, in the
above example only the values of F1, F3, and F4 are relevant. For completeness, we allow patterns of all 0’s which match nothing. This means that any rule containing
such a pattern will not match (cover) any points in the feature space. While rules of this form are of no use in the final concept description, they are quite useful as storage areas for GAs when evolving and testing sets of rules. The right-hand side of a rule is simply the class (concept) to which the example belongs. This means that our "classifier system" is a "stimulus-response" system with no message passing.
3.3 Evolving Sets of Classifier Rules
Since a concept description will consist of one or more classifier rules, we still need to specify how Gas will be used to evolve sets of rules. There are currently two basic strategies: the Michigan approach exemplified by Holland’s classifier system [Holland, 1986], and the Pittsburgh approach exemplified by Smith’s LS-1 system [Smith, 1983]. Systems using the Michigan approach maintain a population of individual rules, which compete with each other for space and priority in the population. In contrast, systems using the Pittsburgh approach maintain a population of variable-length rule sets, which compete with each other with respect to performance on the domain task. Very little is currently known concerning the relative merits of the two approaches. In this paper we report on results obtained from using the Pittsburgh approach.2 that is, each individual in the population is a variable length string representing an unordered set of fixed-length rules (disjuncts). The number of rules in a particular individual is unrestricted and can range from 1 to a very large number depending on evolutionary pressures. Our goal was to achieve a representation that required minimal changes to the fundamental genetic
operators. We feel we have achieved this with our variable-length string representation involving fixedlength rules. Crossover can occur anywhere (i.e., both on rule boundaries and within rules). The only requirement is that the corresponding crossover points on the two parents "match up semantically". That is, if one parent is being cut on a rule boundary, then the other parent must be also cut on a rule boundary. Similarly, if one parent is being cut at point 5 bits to the right of a rule boundary, then the other parent must be cut in a similar spot (i.e., 5 bits to the right of some rule boundary). The mutation operator is unaffected and performs the usual bit-level mutations.
3.4 Choosing a Payoff Function
In addition to selecting a good representation, it is important to define a good payoff function, which rewards the right kinds of individuals. One of the nice features of using GAs for concept learning is that the payoff function is the natural place to centralize and make explicit any biases (preferences) for certain kinds of concept descriptions. It also makes it easy to study the effects of different biases by simply making changes to the payoff function. For the experiments reported in this paper, we wanted to minimize any a priori bias we might have. So we selected a payoff function involving only classification performance (ignoring, for example, length and complexity biases). The payoff (fitness) of each individual rule set is computed by testing the rule set on the current set of examples and letting:
payoff (individual I) = (percent correct) 2
This provides a non-linear bias toward correctly classifying all the examples while providing differential reward for imperfect rule sets.
3.5 The GA Concept Learner
Given the representation and payoff function described above, a standard GA can be used to evolve concept descriptions in several ways. The simplest approach involves using a batch mode in which a fixed set of examples is presented, and the GA must search the space of variable-length strings described above for a set of rules, which achieves a score of 100%. We will call this approach GABL (GA Batch concept Learner). The simplest way to produce an incremental GA concept learner is to use GABL incrementally in the following way. The concept learner initially accepts a single
example from a pool of examples. GABL is used to create a 100% correct rule set for this example. This rule set is used to predict the classification of the next example.
If the prediction is incorrect, GABL is invoked to evolve a new rule set using the two examples. If the prediction is correct, the example is simply stored with the previous example and the rule set remains unchanged. As each new additional instance is accepted, a prediction is made, and the GA is re-run in batch if the prediction is incorrect. We refer to this mode of operation as batch incremental and we refer to the GA batch-incremental concept learner as GABIL.
4 Initial Experiments
The experiments described in this section are designed to demonstrate the predictive performance of GABIL as a function of incremental increases in the size and complexity of the target concept. We invented a 4-feature world in which each feature has 4 possible distinct values (i.e., there are 256 instances in this world). This means that rules map into 16-bit strings and the length of individual rule sets is a multiple of 16. In addition to studying the behavior of GABIL as a function of increasing complexity, we were also interested in comparing its performance with an existing algorithm. ID5R [], which is a well-known incremental concept-learning algorithm, was chosen for comparison. ID5R uses decision trees as the description language and always produces a decision tree consistent with the instances seen.
We constructed a set of 12 concept learning problems, each consisting of a single target concept of increasing complexity. We varied the complexity by increasing
both the number of rules (disjuncts) and the number of relevant features per rule (conjuncts) required to correctly describing the concepts. The number of disjuncts ranged
from 1 to 4, while the number of conjuncts ranged from 1 to 3. Each target concept is labeled as nDmC, where n is the number of disjuncts and m is the number of conjuncts.
Each target concept is associated with one experiment. Within an experiment the number of disjuncts and conjuncts for the target concept remains fixed. The variation in target concept occurs between experiments. For each of the concepts, a set of 256 unique, noise free examples was generated from the feature space and labeled as positive or negative examples of the target concept. For he more complex concepts, this resulted in learning primarily from negative examples. For each concept, the 256 examples were randomly
Shuffled and then presented sequentially as described above. This procedure was repeated 10 times for each concept and for each learning algorithm. The performance
curves presented are the average behavior exhibited over 10 runs. In particular, they will remain at 100% indefinitely only when the algorithms have correctly learned the target concept. The graphs indicate that, on the simpler concepts, the predictive performance of ID5R improves more rapidly than that of GABIL. However, ID5R degrades in performance, as the target concept becomes more complex, with significant deterioration in predictive power.
It is not always possible for ID5R to make a prediction based on the decision tree. If it cannot use the tree to predict, we let ID5R make a random prediction. The performance of GABIL, on the other hand, is relatively insensitive to the increase in concept complexity, resulting in significantly better predictive capability than ID5R already on 4 distinct concepts. The analysis below suggests that this trend will continue with even larger numbers of disjuncts and conjuncts. We were surprised to see ID5R suffer the most on the 4D1C target concept, since syntactically the concept is only moderately complex. The target concept is of the form:
If (F1 = 0001) or (F2 = 0001) or (F3 = 0001) or
(F4 = 0001) then it’s positive
Although it is natural to expect that a simple target concept (from a syntactic viewpoint) would have a small decision tree representation, this is only a rough generalization.
This target concept is represented by ID5R as a decision tree of over 150 nodes. In fact, a unique leaf node in the decision tree represents each negative example. For this reason, ID5R cannot generalize over the negative examples, and has a good chance of predicting any negative example incorrectly. Furthermore, even the positive examples are not generalized well, resulting in prediction errors for positive examples. It is clear that the decision tree representation is poor for representing this particular concept. Target concept 4D1C represents a worst case, which explains why the difference between GABIL and ID5R is greatest for this concept. A similar situation occurs for target concepts 3D1C, 4D2C, and 4D3C, although to a lesser degree. ID5R relies upon Quinlan’s information theoretic entropy measure to build its decision trees.
5 Further Analysis and Comparisons
Having characterized the behavior of GABIL in this controlled concept world, we have begun to extend the analysis to more complex and challenging problems. One of our first steps was to look at the family of multiplexor problems introduced to the machine learning community by Wilson [Wilson, 1987]. Multiplexor problems fall into the general area of trying to induce a description of an input boolean function from input/output examples. Because no single individual input line is useful in distinguishing
class membership, information-theoretic approaches like Quinlan’s ID3 system have a particularly hard time inducing decision trees for multiplexor problems. Wilson’s work indicated that his GA-based classifier system Boole did not have such difficulties. Quinlan addressed some of these issues in the development of his C4 system. Quinlan subsequently reported that C4 outperforms Boole on the multiplexor Since we had access to C4.5 (a successor to C4 [Quinlan, 1989]), we felt that a direct comparison of
GABIL and C4.5 on multiplexor problems would be enlightening. Since C4.5 is a batch-mode system, we have to run it in a batch-incremental mode in the same manner as GABIL in order to provide meaningful comparisons. This can be achieved by running C4.5 in batch mode for every new instance seen, and using the resulting decision tree to predict the class of the next instance. The 6-input multiplexor problem has 6 features in
Which each feature has 2 possible distinct values (i.e., there are 64 instances in this world). This means that rules map into 12-bit strings and the length of individual rule
sets are a multiple of 12. For this concept we randomly generated a set of 200 examples from the feature space, each example labeled positive or negative. Since there are only 64 possible unique examples, the set does not contain unique examples, although they are noise free. This methodology allows for direct comparison with Quinlan’s reported results.
The set of 200 examples was randomly shuffled and then presented sequentially. This procedure was repeated 10 times for both learning algorithms. The performance curves presented in Figure 3 are the average behavior exhibited over 10 runs. GABIL clearly outperforms C4.5 on the 6-input multiplexor problem. As noted above, the weaker performance of C4.5 is not due to the choice of representation (decision tree). In fact, a compact decision tree can be created to describe the concept. The problem lies with the information theoretic bias itself, which makes it hard to find this compact tree. Preliminary results suggest similar performance differentials on larger multiplexor problems.
The concept description language and the search algorithm constitute strong biases for any concept learner. The above experiments indicate that ID-like systems can
suffer both from their decision tree language bias and from their information theoretic search bias. When the biases are appropriate, ID-like systems perform quite well. GABIL, however, due to its minimal system bias, performs uniformly well on target concepts of varying complexity. These initial results support the view that GABIL can be used as an effect concept learner although it may not outperform more strongly biased concept learning algorithms whose bias is appropriate for learning simpler target concepts.
6 Conclusions and Future Research
This paper presents a series of initial results regarding the use of GAs as the key element in the design of a system capable of continuously acquiring and refining concept classification rules from interactions with its environment. It is interesting to note that reasonable performance is achieved with minimal a priori bias. The initial results support the view that GAs can be used as an effective concept learner although they may not outperform algorithms specifically designed for concept learning when simple concepts are involved. This paper also sets the stage for the design of three additional GA-based concept learners. First, we wish to implement a variation of the current system that is truly incremental. Second, we are also very interested in understanding the difference between using the Pittsburgh approach and the Michigan approach in this problem domain. The current fixed-length rule representation can be used directly in Michigan-style classifier systems.
Third, we noted early in the paper that there were two basic strategies for selecting a representation for the concept description language. In this paper we developed a representation, which minimized the changes to standard GA implementations. We also plan to explore the alternative strategy of modifying the basic GA operators to deal effectively with non-string representations. We feel that the development and analysis of such systems is an important direction the research community should follow in order to develop additional results on these and other problems of interest.
Acknowledgements
We would like to thank Diana Gordon for her support and for many discussions on the biases in supervised concept learning systems. Diana was also instrumental in helping
us design our experimental methodology. We would also like to thank John Grefenstette and Alan Schultz for many useful comments about GABIL and crossover, J. R. Quinlan
for C4.5, and Paul Utgoff for ID5R.
.

INTERNET FIREWALLS


INTERNET FIREWALLS

INTRODUCTION :

The Internet has made large amount of information available to the average computer user at home, in business and education. For many people, having access to this information is no longer just an advantage, it is essential.

By connecting a private network to the Internet can expose critical or confidential data to malicious attack from anywhere in the world. The intruders could gain access to your sites private information or interfere with your use of your own systems. Users who connect their computers to the Internet must be aware of these dangers, their implications and how to protect their data and their critical systems.

Therefore, security of network is the main criteria here and firewalls provide this security. The Internet firewalls keep the flames of Internet hell out of your network or, to keep the members of your LAN pure by denying them access the all the evil Internet temptations.


DEFINITION:



A firewall is a hardware device or a software program running on the secure host computer that sits between the two entities and controls access between them.

Here the two entities are nothing but a private network and the public network like Internet.
The firewall can be a software firewall and the hardware firewall.The first computer firewall was a non-routing Unix host with connections to two different networks. One network card connected to the Internet and the other to the private LAN. To reach the Internet from the private network, you had to logon to the firewall (Unix) server. You then used the resources of the system to access the Internet. For example, you could use X-windows to run Netscape's browser on the firewall system and have the display on your workstation. With the browser, running on the firewall it has access to both networks.
This sort of dual homed system (a system with two network connections) is great if you can TRUST ALL of your users. You can simple setup a Linux system and give an account accounts on it to everyone needing Internet access. With this setup, the only computer on your private network that knows anything about the outside world is the firewall. No one can download to his or her personal workstations. They must first download a file to the firewall and then download the file from the firewall to their workstation.
Firewalls are mainly used for two purposes.
To keep people (worms/crackers) out.
To keep people (employees/children) in.

NEED OF Firewalls:
The general reasoning behind firewall usage is that without a firewall, a subnet's systems expose themselves to inherently insecure services such as NFS or NIS and to probes and attacks from hosts elsewhere on the network. In a firewall-less environment, network security relies totally on host security and all hosts must, in a sense, cooperate to achieve a uniformly high level of security. The larger the subnet, the less manageable it is to maintain all hosts at the same level of security. As mistakes and lapses in security become more common, break-ins occur not as the result of complex attacks, but because of simple errors in configuration and inadequate passwords.
A firewall approach provides numerous advantages to sites by helping to increase overall host security. The following sections summarize the primary benefits of using a firewall.
Protection from Vulnerable Services
Controlled Access to Site Systems
Concentrated Security
Enhanced Privacy
Logging and Statistics on Network Use, Misuse
Policy Enforcement
1.Protection from Vulnerable Services:
A firewall can greatly improve network security and reduce risks to hosts on the subnet by filtering inherently insecure services. As a result, the subnet network environment is exposed to fewer risks, since only selected protocols will be able to pass through the firewall.
For example, a firewall could prohibit certain vulnerable services such as NFS from entering or leaving a protected subnet. This provides the benefit of preventing the services from being exploited by outside attackers, but at the same time permits the use of these services with greatly reduced risk to exploitation. Services such as NIS or NFS that are particularly useful on a local area network basis can thus be enjoyed and used to reduce the host management burden.Firewalls can also provide protection from routing-based attacks, such as source routing and attempts to redirect routing paths to compromised sites via ICMP redirects. A firewall could reject all source-routed packets and ICMP redirects and then inform administrators of the incidents
2. Controlled Access to Site Systems :A firewall also provides the ability to control access to site systems. For example, some hosts can be made reachable from outside networks, whereas others can be effectively sealed off from unwanted access. A site could prevent outside access to its hosts except for special cases such as mail servers or information servers. This brings to the fore an access policy that firewalls are particularly adept at enforcing: do not provide access to hosts or services that do not require access. Put differently, why provide access to hosts andservices that could be exploited by attackers when the access is not used or required? If, for example, a user requires little or no network access to her desktop workstation, then a firewall can enforce this policy.
3. Concentrated Security:A firewall can actually be less expensive for an organization in that all or most modified software and additional security software could be located on the firewall systems as opposed to being distributed on many hosts. In particular, one-time password systems and other add-on authentication software could be located at the firewall as opposed to each system that needed to be accessed from the Internet.
Other solutions to network security such as Kerberos [NIST94c] involve modifications at each host system. While Kerberos and other techniques should be considered for their advantages and may be more appropriate than firewalls in certain situations, firewalls tend to be simpler to implement in that only the firewall need run specialized software.
4. Enhanced Privacy :
Privacy is of great concern to certain sites, since what would normally be considered innocuous information might actually contain clues that would be useful to an attacker. Using a firewall, some sites wish to block services such as finger and Domain Name Service. Finger displays information about users such as their last login time, whether they've read mail, and other items. But, finger could leak information to attackers about how often a system is used, whether the system has active users connected, and whether the system could be attacked without drawing attention.
Firewalls can also be used to block DNS information about site systems, thus the names and IP addresses of site systems would not be available to Internet hosts. Some sites feel that by blocking this information, they are hiding information that would otherwise be useful to attackers.
5. Logging and Statistics on Network Use, Misuse:
If all access to and from the Internet passes through a firewall, the firewall can log accesses and provide valuable statistics about network usage. A firewall, with appropriate alarms that sound when suspicious activity occurs can also provide details on whether the firewall and network are being probed or attacked.
It is important to collect network usage statistics and evidence of probing for a number of reasons. Of primary importance is knowing whether the firewall is withstanding probes and attacks, and determining whether the controls on the firewall are adequate. Network usage statistics are also important as input into network requirements studies and risk analysis activities.
6. Policy Enforcement:
Lastly, but perhaps most importantly, a firewall provides the means for implementing and enforcing a network access policy. In effect, a firewall provides access control to users and services. Thus, a network access policy can be enforced by a firewall, whereas without a firewall, such a policy depends entirely on the cooperation of users. A site may be able to depend on its own users for their cooperation, however it cannot nor should not depend on Internet users in general.
Types of firewalls : Firewalls fall into different categories.
They are mainly,
1. packet filtering firewalls
2. circuitlevel gateways
3. application gateways
4. stateful multilayer inspection firewall


1.Packet Filtering Firewalls:

These firewalls work at the network layer of OSI model, or IP layer of TCP/IP. They are usually part of a router. A router is a device that receives packets from one network and forwards them to another network. In a packet filtering firewall, each packet is compared to a set of criteria before it is forwarded. Depending on the packet and the criteria, the firewall can drop the packet, forward it or send a message to the originator. Rules can include source and destination IP addresses, source and destination port number and type of the protocol embedded in that packet. These firewalls often contain an ACL (Access Control List) to restrict who gains access to which computers and networks.

Advantages of packet filtering:

It is cost effective to simply configure routers that are already a part of the network to do additional duty as firewalls.
Network layer firewalls tend to be very fast and tend to be very transparent to users.
3 Cost: Virtually all high-speed Internet connections require a router. Therefore, organizations with high- speed Internet connections already have the capability to perform basic Packet Filtering at the Router level without purchasing additional hardware or software.
Drawbacks of packet filtering:

They don’t provide for password controls.

Users can’t identify themselves
3. The person who configures the firewall protocol for the router needs a thorough knowledge of IP packet structure.
4. There is no user authentication.
5. Remains vulnerable to attacks such as spoofing source address.

2. Circuit-level Gateway:
These firewalls work at the session layer of the OSI model, or TCP/IP layer of the TCP/IP. They monitor TCP handshaking between packets to determine whether a requested session is legitimate. Traffic is filtered based on the specified session rules, such as when a session is initiated by the recognized computer. Information passed to remote computer through a circuit level gateway appears to have originated from the gateway. This is useful for hiding information about protected networks. Circuit level gateways are relatively inexpensive and have the advantage of hiding information about the private network they protect. On the other hand, they do not filter individual packets.Unknown traffic is allowed up to level 4 of network stack. These are hardware firewalls and apply security mechanisms when a TCP or UDP connection is established.
3. Application Gateways:

These are the software firewalls. These are often used by companies specifically to monitor and log employee activity and by private citizens to protect a home computer from hackers, spy ware to set parental controls for children.
Application gateways also called proxies are similar to circuit level gateways expect that they are application specific. They can filter packets at the application layer of OSI or TCP/IP model. Incoming or outgoing packets can’t access services for which there is no proxy. In plain terms, an application level gateway is configured to be a web proxy will not allow all ftp, gopher, telnet or other traffic through. Because they examine packets at the application layer, they can filter application specific commands such as http: post, get etc;
It works like a proxy. A proxy is a process that sits between a client and a server. For a client proxy looks like a server and for a server, the proxy looks like a client.Example Application layer firewall: In Figure 3, an application layer firewall called a ``dual homed gateway'' is represented. A dual homed gateway is a highly secured host that runs proxy software. It has two network interfaces, one on each network, and blocks all traffic passing through it.



 Dual Homed Gateway
Advantages of application gateways:
Since application proxies examine packets at the application program level, a very fine level of security and access control may be achieved.
These reject all inbound packets contain common EXE and COM files.
The greatest advantage is that no direct connections are allowed through the firewall under any circumstances.
Proxies provide a high level of protection against denial of service attacks.
Disadvantages of application gateways:
1.Proxies require large amount of computing resources in the host system, which can load to performance bottlenecks or slow downs the network.
2. Proxies must be written for specific application programs and not all applications have proxies available.
4.Stateful Multilayer Inspection Firewall:

They combine the aspects of other three types of firewalls. This firewall keeps track of all packets associated with a specific communication session. A typical communication session between two computers will consists a several thousand packets, each of which is identified by a unique source and destination address and a sequence number that allows all of the packets to be reassembled into the correct data file at destination computer. Each packet of data is checked to ensure that it belongs to proper session. Any packets that are not part of an existing session are rejected. In addition to checking and validating the communication session ensuring that all packets belong to the proper session, these are further screens the packets at the application layer also.
Filtering at the s/w application port level provides an additional layer of control for the network administrator to ensure that only authorized transactions are allowed through the firewall. These firewalls close off ports until connection to the specified port is requested.
Advantages of stateful inspection:
These will typically offer much higher performance than proxies.
These ensure that all packets must be a port of an authorized communication session. Therefore, a higher level of protection is provided to users communicating with systems external to the trusted network.
Stateful Inspection provides a greater level of security control by enforcing security policies at the "application socket" or port layer as well as the protocol and address level.
Disadvantages of stateful inspection:
1. stateful inspection functionality currently requires the purchase of additional hardware and/or software and is not typically "bundled" with another existing network device.
A simple example of firewall:
CISCO developed 500 series firewall as better because they use a cut-through protocol in packet examination and an ACL that compares connections based on past connections with the same client. In other words, based on the first connection with a client, a kind of fingerprint is created using source and destination addresses, ports, TCP sequence numbers, and other TCP flags. So, instead of examining every client connection packet stream, the packets are first compared to the ACL.matches an authorized fingerprint, then the data stream is allowed without further examination. Both the cut-through protocol and the use of an ACL is said to greatly enhance speed.



1. Firewalls create barriers in order to prevent unauthorized access to a network.
2. They are the security doors through which some people (i.e. data) may pass and others may not.
3. It adds another layer of security to your systems.
4. It protects networked computers from intentional hostile intrusion that could compromise confidentiality or result in data corruption or denial of service.
5.It is is a choke point through which all the traffic flows between two network.
Advantages of firewall:
Concentration of security, all modified software and logging is located on the firewall system as opposed to being distributed on many hosts;
Protocol filtering, where the firewall filters protocols and services that are either not necessary or that cannot be adequately secured from exploitation;
Information hiding, in which a firewall can ``hide'' names of internal systems or electronic mail addresses, thereby revealing less information to outside hosts;
Application gateways, where the firewall requires inside or outside users to connect first to the firewall before connecting further, thereby filtering the protocol;
Extended logging, in which a firewall can concentrate extended logging of network traffic on one system;
Centralized and simplified network services management, in which services such as ftp, electronic mail, gopher, and other similar services are located on the firewall system(s) as opposed to being maintained on many systems.
Disadvantages of firewall :
Given these advantages, there are some disadvantages to using firewalls. 1.The most obvious being that certain types of network access may be hampered or even blocked for some hosts, including telnet, ftp, X Windows, NFS, NIS, etc. However, these disadvantages are not unique to firewalls; network access could be restricted at the host level as well, depending on a site's security policy.
2. A second disadvantage with a firewall system is that it concentrates security in one spot as opposed to distributing it among systems, thus a compromise of the firewall could be disastrous to other less-protected systems on the subnet. This weakness can be countered, however, with the argument that lapses and weaknesse in security are more likely to be found as the number of systems in a subnet increase, thereby multiplying the ways in which subnets can be exploited.
3. Another disadvantage is that relatively few vendors have offered firewall systems until very recently. Most firewalls have been somewhat ``hand-built'' by site administrators, however the time and effort that could go intoconstructing a firewall may outweigh the cost of a vendor solution. There
Is also no firm definition of what constitutes a firewall; the term ``firewall'' can mean many things to many people.
FOR WHICH FIREWALLS CAN’T PROVIDE SECURITY :In addition, Firewalls can’t provide security for the following.
1. A firewall can’t protect against attacks that don’t go through the firewall. Many corporations that connect to Internet are very concerned about confidentially date leaking out of company through route. However, a magnetic tape can just export data.
2. Many organizations that are terrified of Internet connections have no coherent policy about how dial-in access via modems should be protected. There are many organizations out there buying expensive firewalls and neglecting the numerous other back doors into their network.
3. Another thing a firewall can’t really protect you against is traitors or idiots inside the network. An industrial spy might leak information or export it through a telephone, FAX or floppy disk. Firewalls can’t protect you against this stupidity.
4. Firewalls can't protect very well against things like viruses. There are too many ways of encoding binary files for transfer over networks, and too many different architectures and viruses to try to search for them all. In other words, a firewall cannot replace security-consciousness on the part of your users. In general, a firewall cannot protect against a data-driven attack--attacks in which something is mailed or copied to an internal host where it is then executed.
Organizations that are deeply concerned about viruses should implement organization-wide virus control measures. Rather than trying to screen viruses out at the firewall, make sure that every vulnerable desktop has virus-scanning software that is run when the machine is rebooted. Blanketing your network with virus scanning software will protect against viruses that come in via floppy disks, modems, and Internet. Trying to block viruses at the firewall will only protect against viruses from the Internet--and the vast majority of viruses are caught via floppy disks.
Conclusion:
In conclusion, the Internet has become a dangerous place. Thirteen-year-old kids on dial-up accounts can crash a site supported by two T-1 connections by using hundreds of zombies (PCs hacked and uploaded with a Trojan) to flood with UDP and ICMP traffic. This is simply a malicious attack meant to consume all of the bandwidth of a connection to the Internet. Yahoo was recently crashed by what is called a 'smurf' attack. In this attack, ping requests are sent to several Internet broadcast addresses with a spoofed return address aimed at the victim (yahoo in this case). The resulting storm of packets consumes all bandwidth and disconnects or makes the site unusable for normal traffic. Hackers attack networks to destroy and/or steal information. They attack PCs so they can use them in zombie attacks, to hide their identity when trying to gain illegal entry to secured networks, or for nothing more than malicious purposes. While on the internet my firewall typically gets 1 to 3 hits an hour, primarily port scanners looking for a specific Trojan or a vulnerability to exploit. No one should be on the Internet without a firewall. All networks are protected by firewalls. However, it is always a trade-off. The whole point of the Internet is communication and exchange of information. The question is how much do we restrict access without losing all the advantages of speed and openness.