Friday, May 28, 2010

DATA MINING AND DATA WAREHOUSING

Many software projects accumulate a great deal of data, so we need effective ways to maintain that data and retrieve it from the database. The newest, hottest technologies to address these concerns are data mining and data warehousing.
Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behavior that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery, specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
Data warehousing takes a relatively simple idea and incorporates it into the technological underpinnings of a company. The idea is that a unified view of all data that a company collects will help improve operations. If hiring data can be combined with sales data, the idea is that it might be possible to discover and exploit patterns in the combined entity.
This paper presents an overview of the different processes and advanced techniques involved in data mining and data warehousing.




1. Introduction to Data Mining:
Data mining can be defined as "a decision support process in which we search for patterns of information in data." This search may be done just by the user, i.e. just by performing queries, in which case it is quite hard and in most of the cases not comprehensive enough to reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases - patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc.
1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity: discovery, predictive modeling, and forensic analysis, as shown in the figure below.
Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are, without the user thinking of the relevant questions first.
In predictive modeling patterns discovered from the database are used to predict the future. Predictive modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the patterns to guess values for new data items.
Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first find what the norm is, and then we detect those items that deviate from the usual within a given threshold. Discovery helps us find "usual knowledge," but forensic analysis looks for unusual and specific cases.
1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of users - executives, end users and analysts.
Executives need top-level insights and spend far less time with computers than the other groups.
End users are sales people, market researchers, scientists, engineers, physicians, etc.
Analysts may be financial analysts, statisticians, consultants, or database designers.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and continuous data mining.
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to understand this data set, or use it for prediction on new marketing campaigns. Analysts usually perform episodic mining.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of specific measures such as profitability.
In continuous mining we try to understand how the world has changed within a given time period and try to gain an understanding of the factors that influence change.

1.3 Data Mining Applications:
Virtually any process can be studied, understood, and improved using data mining. The top three end uses of data mining are, not surprisingly, in the marketing area.
Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogs, direct-mail marketers can reduce expenses and increase their sales. Targeting specific promotions to existing and potential customers offers similar benefits.
Market-basket analysis helps retailers understand which products are purchased together or by an individual over time. With data mining, retailers can determine which products to stock in which stores, and even how to place them within a store. Data mining can also help assess the effectiveness of promotions and coupons.
Another common use of data mining in many organizations is to help manage customer relationships. By determining characteristics of customers who are likely to leave for a competitor, a company can take action to retain that customer because doing so is usually far less expensive than acquiring a new customer.
Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. The aggregate total for fraud losses is enormous. But with data mining, these companies can identify potentially fraudulent transactions and contain the damage.
Financial companies use data mining to determine market and industry characteristics as well as predict individual company and stock performance. Another interesting niche application is in the medical field: Data mining can help predict the effectiveness of surgical procedures, diagnostic tests, medications, service management, and process control.
1.4 Data Mining Techniques:
Data Mining has three major components: Clustering or Classification, Association Rules, and Sequence Analysis.


1.4.1 Classification:
Clustering techniques analyze a set of data and generate a set of grouping rules that can be used to classify future data. The mining tool automatically identifies the clusters by studying the patterns in the training data. Once the clusters are generated, classification can be used to identify to which particular cluster an input belongs. For example, one may classify diseases and provide the symptoms that describe each class or subclass.
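As a rough illustration of this two-step idea (not taken from any particular tool), the minimal sketch below assumes scikit-learn and NumPy are available: it discovers clusters in an invented toy data set and then assigns a new record to one of the discovered clusters.

# Hypothetical sketch: discover clusters in training data, then classify a new
# record into one of the discovered clusters. Data and cluster count are invented.
import numpy as np
from sklearn.cluster import KMeans

training_data = np.array([
    [1.0, 0.2], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0], [0.5, 0.5], [0.6, 0.4],
])

# Discovery step: let the algorithm find the grouping rules (clusters).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(training_data)

# Classification step: decide which discovered cluster a new record belongs to.
new_record = np.array([[0.15, 0.95]])
print("assigned cluster:", model.predict(new_record)[0])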
1.4.2 Association:
An association rule is a rule that implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.
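A minimal, purely illustrative sketch of the idea behind association rules follows: co-occurrence counts are turned into support and confidence values for "if A then B" rules. The transactions and the thresholds are invented.

# Hypothetical sketch of association-rule mining: count how often items appear
# together and derive "A -> B" rules with support and confidence measures.
from itertools import combinations
from collections import Counter

transactions = [
    {"fever", "cough", "flu"},
    {"fever", "flu"},
    {"cough", "cold"},
    {"fever", "cough", "flu"},
]

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    for item in t:
        item_counts[item] += 1
    for a, b in combinations(sorted(t), 2):
        pair_counts[(a, b)] += 1

n = len(transactions)
for (a, b), count in pair_counts.items():
    support = count / n                    # fraction of transactions with both
    confidence = count / item_counts[a]    # how often A implies B
    if support >= 0.5 and confidence >= 0.6:
        print(f"{a} -> {b}  (support={support:.2f}, confidence={confidence:.2f})")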
1.4.3 Sequential Analysis:
In sequential analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction, as in association); for example, a shopper buys item A in the first week of the month and then buys item B in the second week, and so on.
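The shopper example above can be sketched in a few lines. This is a hypothetical illustration only; the shopper histories and the items are invented.

# Hypothetical sketch of sequential analysis: count how often item B is bought
# in a later transaction than item A by the same shopper.
shopper_histories = {
    "shopper1": [{"A"}, {"B", "C"}],       # week 1, week 2, ...
    "shopper2": [{"A"}, {"D"}, {"B"}],
    "shopper3": [{"C"}, {"B"}],
}

def follows(history, first, later):
    """True if `later` appears in some transaction after one containing `first`."""
    for i, basket in enumerate(history):
        if first in basket:
            return any(later in b for b in history[i + 1:])
    return False

count = sum(follows(h, "A", "B") for h in shopper_histories.values())
print(f"A followed by B for {count} of {len(shopper_histories)} shoppers")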
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you choose. Consequently, you'll need a variety of tools and technologies to find the best possible model. Classification models are among the most common, so the more popular ways of building them are explained here. Classification typically involves at least one of two workhorse statistical techniques - logistic regression (a generalization of linear regression) and discriminant analysis. However, as data mining becomes more common, neural nets and decision trees are also getting more consideration. Although complex in their own way, these methods require less statistical sophistication on the part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model that takes and combines a set of inputs to predict a continuous or categorical variable.

Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"

The value from each hidden node is a function of the weighted sum of the values from all the preceding nodes that feed into it. The process of building a model involves finding the connection weights that produce the most accurate results by "training" the neural net with data. The most common training method is back propagation, in which the output result is compared with known correct values. After each comparison, the weights are adjusted and a new result computed. After enough passes through the training data, the neural net typically becomes a very good predictor.
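The training loop described above (weighted sums through a hidden layer, comparison with known correct values, weight adjustment by back propagation) can be sketched with NumPy alone. The network size, learning rate, and XOR-style toy data below are invented for illustration and are not from the source text.

# Hypothetical sketch of back propagation on a tiny one-hidden-layer network.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # known correct values

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)              # input  -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)              # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):                       # passes through the training data
    hidden = sigmoid(X @ W1 + b1)            # weighted sums feeding hidden nodes
    output = sigmoid(hidden @ W2 + b2)       # weighted sum feeding the output
    error = y - output                       # compare with known correct values
    # Back propagation: push the error backwards and adjust connection weights.
    d_out = error * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 += 0.5 * hidden.T @ d_out;  b2 += 0.5 * d_out.sum(axis=0)
    W1 += 0.5 * X.T @ d_hid;       b1 += 0.5 * d_hid.sum(axis=0)

print(np.round(output, 2))                   # typically approaches [[0],[1],[1],[0]]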
Decision trees represent a series of rules to lead to a class or value. For example, you may wish to classify loan applicants as good or bad credit risks. Figure below shows a simple decision tree that solves this problem. Armed with this tree and a loan application, a loan officer could determine whether an applicant is a good or bad credit risk. An individual with "Income > $40,000" and "High Debt" would be classified as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years" would be classified as a "Good Risk."
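The two branches quoted from the figure can be written directly as rules. In the hypothetical sketch below, the outcomes for the branches the text does not mention (high income with low debt, low income with a short job history) are assumed for completeness.

# Hypothetical encoding of the loan-risk decision tree described above.
def credit_risk(income: float, high_debt: bool, years_on_job: float) -> str:
    if income > 40_000:
        return "Bad Risk" if high_debt else "Good Risk"    # "Good Risk" branch assumed
    else:
        return "Good Risk" if years_on_job > 5 else "Bad Risk"  # "Bad Risk" branch assumed

print(credit_risk(income=55_000, high_debt=True, years_on_job=2))   # Bad Risk
print(credit_risk(income=30_000, high_debt=False, years_on_job=8))  # Good Risk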


Decision trees have become very popular because they are reasonably accurate and, unlike neural nets, easy to understand. Decision trees also take less time to build than neural nets. Neural nets and decision trees can also be used to perform regressions, and some types of neural nets can even perform clustering.
2.1 Introduction to Data warehousing:
In the current knowledge economy, it is an indisputable fact that information is the key for organizations to gain competitive advantage. Organizations know very well that the vital information for decision making lies in their databases. Mountains of data are accumulating in various databases scattered around the enterprise, but the key to gaining competitive advantage lies in deriving insight and intelligence from these data. Data warehousing helps in integrating, categorizing, codifying, and arranging the data from all parts of an enterprise.
According to Bill Inmon, known as the father of data warehousing, a data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data, as depicted in the figure and described below.




2.1.1 Subject oriented data:
All relevant data about a subject is gathered and stored as a single set in a useful format.
2.1.2 Integrated data:
Data is stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures, and physical attributes, even when the underlying operational systems store the data differently.


2.1.3 Non-volatile data:
The data warehouse is read-only: data is loaded into the data warehouse and accessed there.
2.1.4 Time-variant data:
Data is kept for the long term, typically 5 to 10 years, as opposed to the 30-60 days typical of operational data.
2.2 Structure of data warehouse:

The design of the data architecture is probably the most critical part of a data warehousing project. The key is to plan for growth and change, as opposed to trying to design the perfect system from the start. The design of the data architecture involves understanding all of the data and how different pieces are related. For example, payroll data might be related to sales data by the ID of the sales person, while the sales data might be related to customers by the customer ID. By connecting these two relationships, payroll data could be related to customers (e.g., which employees have ties to which customers).
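The payroll-to-sales-to-customer chain described above can be sketched with Python's built-in sqlite3 module. The table and column names below are invented for illustration.

# Hypothetical sketch of the relationships described above. Payroll relates to
# sales via salesperson_id, and sales relates to customers via customer_id, so
# payroll can be related to customers through the two joins.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payroll  (salesperson_id INTEGER, salary INTEGER);
    CREATE TABLE sales    (salesperson_id INTEGER, customer_id INTEGER, amount INTEGER);
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    INSERT INTO payroll  VALUES (1, 50000), (2, 60000);
    INSERT INTO sales    VALUES (1, 10, 900), (2, 11, 1200), (1, 11, 300);
    INSERT INTO customer VALUES (10, 'Acme'), (11, 'Globex');
""")

rows = db.execute("""
    SELECT c.name, p.salesperson_id, p.salary, SUM(s.amount)
    FROM payroll p
    JOIN sales s    ON s.salesperson_id = p.salesperson_id
    JOIN customer c ON c.customer_id    = s.customer_id
    GROUP BY c.name, p.salesperson_id, p.salary
""").fetchall()
print(rows)    # which employees (and their salaries) have ties to which customers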
Once the data architecture has been designed, you can then consider the kinds of reports that you are interested in. You might want to see a breakdown of employees by region, or a ranked list of customers by revenue. These kinds of reports are fairly simple. The power of a data warehouse becomes more obvious when you want to look at links between data associated with disparate parts of an organization (e.g., HR, accounts payable, and project management).
2.3 Benefits of Data warehousing:
Cost avoidance benefits.
Higher productivity.
Benefits through better analytical capability.
Manage business complexity.
Leverage on their existing investments.
End user spending.
Spending on e-business.
Accessibility and ease of use.
Real time information and analysis.
2.4 Techniques used by different organizations for efficient data warehousing:
That being said, most decisions to build data warehouses are driven by non-HR needs. Over the past decade, back office (supply chain) and front office (sales and marketing) organizations have spearheaded the creation of large corporate data warehouses. Improving the efficiency of the supply chain and competition for customers rely on the tactical uses that a data warehouse can provide. The key for other organizations, including HR, is to be involved in the creation of the warehouse so that their needs can be met by any resulting system. The decision to build usually happens because both the data volume and question complexity have grown beyond what the current systems can handle. At that point the business becomes limited by the information that users can reasonably extract from the data system.
3. Conclusion:
Data mining offers great promise in helping organizations uncover hidden patterns in their data. However, data mining tools must be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect and prepare the data properly and to check models against the real world. The "best" model is often found after building models of several different types and by trying out various technologies or algorithms.
The data mining area is still relatively young, and tools that support the whole of the data mining process in an easy-to-use fashion are rare. One of the most important issues facing researchers is the use of these techniques against very large data sets. The mining techniques are rooted in artificial intelligence, where they are generally executed against small sets of data that can fit in memory. In data mining applications, however, these techniques must be applied to data held in very large databases. Promising approaches include the use of parallelism and the development of new database-oriented techniques. Much work is required before data mining can be successfully applied to large data sets; only then will the true potential of data mining be realized.
Data warehousing is the hottest concept for many software professionals seeking to manage sophisticated data efficiently. A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema at a single site. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses provide the user with a single consolidated interface to data, making decision support actions easier to implement. In a world of highly interconnected networks, the data obtained or used by many companies is very large, and maintaining it becomes difficult and costly. Efficient data warehousing therefore has to be implemented to obtain data from different branches (all over the world) and maintain it so that information can be provided to all other branches (which do not have the concerned data).

Genetic Algorithms

Abstract
In this paper we explore the use of an adaptive search technique (genetic algorithms) to construct a system, GABIL, which continually learns and refines concept classification rules from its interaction with the environment. The performance of the system is measured on a set of concept learning problems and compared with the performance of two existing systems: ID5R and C4.5. Preliminary results suggest that, despite minimal system bias, GABIL is an effective concept learner and is quite competitive with ID5R and C4.5 as the target concept increases in complexity.
1 Introduction
An important requirement for both natural and artificial organisms is the ability to acquire concept classification rules from interactions with their environment. In this paper we explore the use of an adaptive search technique, namely Genetic Algorithms (GAs), as the central mechanism for building a system which continually learns and
refines concept classification rules from its interaction with the environment. We show how concept learning tasks can be represented and solved by GAs, and we provide empirical results, which illustrate the performance of GAs relative to more traditional methods. Finally, we discuss the advantages and disadvantages of this approach and describe future research activities.
2 Concept Learning Problems
Supervised concept learning involves inducing concept descriptions for the concepts to be learned from a set of positive and negative examples of the target concepts. Examples are represented as points in an n-dimensional feature space, which is defined a priori and for which all the legal values of the features are known. Concepts are therefore represented as subsets of points in the given dimensional space.
A concept learning program is presented with both a description of the feature space and a set of correctly classified examples of the concepts, and is expected to generate a reasonably accurate description of the (unknown) concepts. Since concepts can be arbitrarily complex subsets of a feature space, an important issue is the choice of the concept description language. The language must have sufficient expressive power to describe large subsets succinctly and yet be able to capture irregularities. The two language forms generally used are decision trees [Quinlan, 1986] and rules [Michalski, 1983]. Another important issue arises from the fact that there is a large (possibly infinite) set of concept descriptions consistent with any particular finite set of examples. This is generally resolved by introducing, either explicitly or implicitly, a bias (preference) for certain kinds of descriptions (e.g., shorter or less complex descriptions may be preferred). Finally, there is the difficult issue of evaluating and comparing the performance of concept learning algorithms. The most widely used approach is a batch mode in which the set of examples is divided into a training set and a test set. The concept learner is required to produce a concept description from the training examples. The validity of the description produced is then measured by the percentage of correct classifications made by the system on the second (test) set of examples during which no
further learning takes place. The alternative evaluation approach is an incremental mode in which the concept learner is required to produce a concept description from the examples seen so far and to use that description to classify the next incoming example. In this mode learning never stops, and evaluation is in terms of learning curves, which measure the predictive performance of the concept learner over time. This incremental and continuous model of concept learning matches more closely the kind of concept learning that an organism performs as it explores a complex and changing world. Consequently, we use predictive learning curves as our evaluation methodology.
3 Genetic Algorithms and Concept Learning
In order to apply GAs to a particular problem, we need to select an internal representation of the space to be searched and define an external evaluation function, which assigns utility to candidate solutions. Both components are critical to the successful application of GAs to the problem of interest.
3.1 Representing the Search Space
The traditional internal representation used by GAs involves using fixed-length (generally binary) strings to represent points in the space to be searched. However, such representations do not appear well suited for representing the space of concept descriptions, which are generally symbolic in nature, which have both syntactic and semantic constraints, and which can be of widely varying length and complexity. There are two general approaches one might take to resolve this issue. The first involves changing the fundamental GA operators (crossover and mutation) to work effectively with complex non-string objects [Rendell, 1985]. This must be done carefully in order to preserve the properties which make GAs effective adaptive search procedures. Alternatively, one can attempt to construct a string representation which minimizes any changes to the GAs. We are interested in pursuing both approaches. Our ideas on the first approach will be discussed briefly at the end of the paper. In the following sections we describe our results using the second approach, in which we try to apply classical GAs with minimal changes.
3.2 Defining Fixed-length Classifier Rules
Our approach to choosing a representation which results in minimal changes to the standard GA operators involves carefully selecting the concept description language. A natural way to express complex concepts is as a disjunctive set of (possibly overlapping) classification rules. The left-hand side of each rule (disjunct) consists of a conjunction of one or more tests involving feature values. The right-hand side of a rule indicates the concept (classification) to be assigned to the examples which match its left-hand side. Collectively, a set of such rules can be thought of as representing the (unknown) concepts if the rules correctly classify the elements of the feature space. If we allow arbitrarily complex terms in the conjunctive left-hand side of such rules, we will have a very powerful description language which will be difficult to represent as strings. However, by restricting the complexity of the elements of the conjunctions, we are able to use a string representation and standard GAs, with the only negative side effect that more rules may be required to express the concept. This is achieved by restricting each element of a conjunction to be a test of the form: return true if the value of feature i of the example is in the given value set; return false otherwise.
For example, a rule might take the following symbolic form: "if (F2 = large) and (F5 = tall or thin) then it’s a widget". Since the left-hand sides are conjunctive forms
with internal disjunction, there is no loss of generality in requiring that there be at most one test for each feature (on the left-hand side of a rule). With these restrictions we can now construct a fixed-length internal representation for classifier rules. Each fixed-length rule will have N feature tests, one for each feature. Each feature test is represented by a fixed-length binary string, the length of which depends on the type of feature (nominal, ordered, etc.). For simplicity, the examples used in this paper will involve features with nominal values. In this case we use k bits for the k values of a nominal feature. So, for example, if the legal values for F1 were the days of the week, then the pattern 0111110 would represent the test for F1 being a weekday. As an example, the left-hand side of a rule for a 5-feature problem would be represented internally as:
F1 F2 F3 F4 F5
0110010 1111 01 111100 11111
Notice that a feature test involving all 1’s matches any value of a feature and is equivalent to "dropping" that conjunctive term (i.e., the feature is irrelevant). So, in the
above example only the values of F1, F3, and F4 are relevant. For completeness, we allow patterns of all 0’s which match nothing. This means that any rule containing
such a pattern will not match (cover) any points in the feature space. While rules of this form are of no use in the final concept description, they are quite useful as storage areas for GAs when evolving and testing sets of rules. The right-hand side of a rule is simply the class (concept) to which the example belongs. This means that our "classifier system" is a "stimulus-response" system with no message passing.
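To make the bit-pattern representation concrete, here is a small illustrative sketch (not the authors' implementation): each feature test is a bit pattern over the feature's legal values, an all-1s pattern drops the feature, and a rule matches an example only if every feature test matches. The feature names and value lists are invented.

# Hypothetical sketch of the fixed-length rule representation described above.
FEATURE_VALUES = {
    "F1": ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    "F2": ["small", "medium", "large", "huge"],
}

def test_matches(pattern: str, values: list, value: str) -> bool:
    """True if the bit for `value` is set in this feature test."""
    return pattern[values.index(value)] == "1"

def rule_matches(rule: dict, example: dict) -> bool:
    return all(
        test_matches(rule[f], FEATURE_VALUES[f], example[f]) for f in rule
    )

# "0111110" over the days of the week is the test for F1 being a weekday;
# "1111" over F2 matches any value, i.e. F2 is irrelevant (dropped).
rule = {"F1": "0111110", "F2": "1111"}
print(rule_matches(rule, {"F1": "Wed", "F2": "large"}))   # True
print(rule_matches(rule, {"F1": "Sun", "F2": "small"}))   # False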
3.3 Evolving Sets of Classifier Rules
Since a concept description will consist of one or more classifier rules, we still need to specify how GAs will be used to evolve sets of rules. There are currently two basic strategies: the Michigan approach exemplified by Holland's classifier system [Holland, 1986], and the Pittsburgh approach exemplified by Smith's LS-1 system [Smith, 1983]. Systems using the Michigan approach maintain a population of individual rules, which compete with each other for space and priority in the population. In contrast, systems using the Pittsburgh approach maintain a population of variable-length rule sets, which compete with each other with respect to performance on the domain task. Very little is currently known concerning the relative merits of the two approaches. In this paper we report on results obtained from using the Pittsburgh approach. That is, each individual in the population is a variable-length string representing an unordered set of fixed-length rules (disjuncts). The number of rules in a particular individual is unrestricted and can range from 1 to a very large number depending on evolutionary pressures. Our goal was to achieve a representation that required minimal changes to the fundamental genetic
operators. We feel we have achieved this with our variable-length string representation involving fixed-length rules. Crossover can occur anywhere (i.e., both on rule boundaries and within rules). The only requirement is that the corresponding crossover points on the two parents "match up semantically". That is, if one parent is being cut on a rule boundary, then the other parent must also be cut on a rule boundary. Similarly, if one parent is being cut at a point 5 bits to the right of a rule boundary, then the other parent must be cut in a similar spot (i.e., 5 bits to the right of some rule boundary). The mutation operator is unaffected and performs the usual bit-level mutations.
3.4 Choosing a Payoff Function
In addition to selecting a good representation, it is important to define a good payoff function, which rewards the right kinds of individuals. One of the nice features of using GAs for concept learning is that the payoff function is the natural place to centralize and make explicit any biases (preferences) for certain kinds of concept descriptions. It also makes it easy to study the effects of different biases by simply making changes to the payoff function. For the experiments reported in this paper, we wanted to minimize any a priori bias we might have. So we selected a payoff function involving only classification performance (ignoring, for example, length and complexity biases). The payoff (fitness) of each individual rule set is computed by testing the rule set on the current set of examples and letting:
payoff(individual i) = (percent correct)^2
This provides a non-linear bias toward correctly classifying all the examples while providing differential reward for imperfect rule sets.
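A minimal sketch of this payoff in code is shown below; the `classify` helper and the example format (a dict with a "class" key) are assumed for illustration.

# Minimal sketch of the payoff described above: squared classification accuracy.
def payoff(rule_set, examples, classify) -> float:
    """classify(rule_set, example) is assumed to return a predicted class."""
    correct = sum(classify(rule_set, e) == e["class"] for e in examples)
    percent_correct = 100.0 * correct / len(examples)
    return percent_correct ** 2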
3.5 The GA Concept Learner
Given the representation and payoff function described above, a standard GA can be used to evolve concept descriptions in several ways. The simplest approach involves using a batch mode in which a fixed set of examples is presented, and the GA must search the space of variable-length strings described above for a set of rules, which achieves a score of 100%. We will call this approach GABL (GA Batch concept Learner). The simplest way to produce an incremental GA concept learner is to use GABL incrementally in the following way. The concept learner initially accepts a single
example from a pool of examples. GABL is used to create a 100% correct rule set for this example. This rule set is used to predict the classification of the next example.
If the prediction is incorrect, GABL is invoked to evolve a new rule set using the two examples. If the prediction is correct, the example is simply stored with the previous example and the rule set remains unchanged. As each new additional instance is accepted, a prediction is made, and the GA is re-run in batch if the prediction is incorrect. We refer to this mode of operation as batch incremental and we refer to the GA batch-incremental concept learner as GABIL.
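The batch-incremental loop just described can be summarized in a hypothetical, pseudocode-style sketch; `run_gabl` and `predict` stand in for the batch GA learner and rule-set classification, and the example format is assumed.

# Hypothetical sketch of the GABIL batch-incremental loop described above.
def gabil(example_stream, run_gabl, predict):
    stream = iter(example_stream)
    seen = [next(stream)]                  # accept a single initial example
    rule_set = run_gabl(seen)              # 100%-correct rule set for that example
    for example in stream:
        correct = predict(rule_set, example) == example["class"]
        seen.append(example)
        if not correct:
            rule_set = run_gabl(seen)      # re-run the batch GA on all examples seen
        yield correct                      # one data point for a learning curve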
4 Initial Experiments
The experiments described in this section are designed to demonstrate the predictive performance of GABIL as a function of incremental increases in the size and complexity of the target concept. We invented a 4-feature world in which each feature has 4 possible distinct values (i.e., there are 256 instances in this world). This means that rules map into 16-bit strings and the length of individual rule sets is a multiple of 16. In addition to studying the behavior of GABIL as a function of increasing complexity, we were also interested in comparing its performance with an existing algorithm. ID5R, a well-known incremental concept learning algorithm, was chosen for comparison. ID5R uses decision trees as the description language and always produces a decision tree consistent with the instances seen.
We constructed a set of 12 concept learning problems, each consisting of a single target concept of increasing complexity. We varied the complexity by increasing
both the number of rules (disjuncts) and the number of relevant features per rule (conjuncts) required to correctly describe the concepts. The number of disjuncts ranged
from 1 to 4, while the number of conjuncts ranged from 1 to 3. Each target concept is labeled as nDmC, where n is the number of disjuncts and m is the number of conjuncts.
Each target concept is associated with one experiment. Within an experiment the number of disjuncts and conjuncts for the target concept remains fixed; the variation in target concept occurs between experiments. For each of the concepts, a set of 256 unique, noise-free examples was generated from the feature space and labeled as positive or negative examples of the target concept. For the more complex concepts, this resulted in learning primarily from negative examples. For each concept, the 256 examples were randomly shuffled and then presented sequentially as described above. This procedure was repeated 10 times for each concept and for each learning algorithm. The performance curves presented are the average behavior exhibited over 10 runs. In particular, they will remain at 100% indefinitely only when the algorithms have correctly learned the target concept. The graphs indicate that, on the simpler concepts, the predictive performance of ID5R improves more rapidly than that of GABIL. However, ID5R degrades in performance as the target concept becomes more complex, with significant deterioration in predictive power.
It is not always possible for ID5R to make a prediction based on the decision tree. If it cannot use the tree to predict, we let ID5R make a random prediction. The performance of GABIL, on the other hand, is relatively insensitive to the increase in concept complexity, resulting in significantly better predictive capability than ID5R already on 4 distinct concepts. The analysis below suggests that this trend will continue with even larger numbers of disjuncts and conjuncts. We were surprised to see ID5R suffer the most on the 4D1C target concept, since syntactically the concept is only moderately complex. The target concept is of the form:
If (F1 = 0001) or (F2 = 0001) or (F3 = 0001) or
(F4 = 0001) then it’s positive
Although it is natural to expect that a simple target concept (from a syntactic viewpoint) would have a small decision tree representation, this is only a rough generalization.
This target concept is represented by ID5R as a decision tree of over 150 nodes. In fact, a unique leaf node in the decision tree represents each negative example. For this reason, ID5R cannot generalize over the negative examples, and has a good chance of predicting any negative example incorrectly. Furthermore, even the positive examples are not generalized well, resulting in prediction errors for positive examples. It is clear that the decision tree representation is poor for representing this particular concept. Target concept 4D1C represents a worst case, which explains why the difference between GABIL and ID5R is greatest for this concept. A similar situation occurs for target concepts 3D1C, 4D2C, and 4D3C, although to a lesser degree. ID5R relies upon Quinlan’s information theoretic entropy measure to build its decision trees.
5 Further Analysis and Comparisons
Having characterized the behavior of GABIL in this controlled concept world, we have begun to extend the analysis to more complex and challenging problems. One of our first steps was to look at the family of multiplexor problems introduced to the machine learning community by Wilson [Wilson, 1987]. Multiplexor problems fall into the general area of trying to induce a description of an input boolean function from input/output examples. Because no single individual input line is useful in distinguishing
class membership, information-theoretic approaches like Quinlan's ID3 system have a particularly hard time inducing decision trees for multiplexor problems. Wilson's work indicated that his GA-based classifier system Boole did not have such difficulties. Quinlan addressed some of these issues in the development of his C4 system and subsequently reported that C4 outperforms Boole on the multiplexor problems. Since we had access to C4.5 (a successor to C4 [Quinlan, 1989]), we felt that a direct comparison of
GABIL and C4.5 on multiplexor problems would be enlightening. Since C4.5 is a batch-mode system, we have to run it in a batch-incremental mode in the same manner as GABIL in order to provide meaningful comparisons. This can be achieved by running C4.5 in batch mode for every new instance seen, and using the resulting decision tree to predict the class of the next instance. The 6-input multiplexor problem has 6 features in
which each feature has 2 possible distinct values (i.e., there are 64 instances in this world). This means that rules map into 12-bit strings and the length of individual rule sets is a multiple of 12. For this concept we randomly generated a set of 200 examples from the feature space, each example labeled positive or negative. Since there are only 64 possible unique examples, the set necessarily contains duplicates, although the examples are noise free. This methodology allows for direct comparison with Quinlan's reported results.
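For readers unfamiliar with the task, the sketch below spells out the 6-input multiplexor function itself (the first two bits address one of the remaining four data bits); it is added here only for illustration and is not part of the original experiments.

# The 6-input multiplexor: 2 address bits select one of 4 data bits as the output.
def multiplexor6(bits):
    address = bits[0] * 2 + bits[1]        # 2 address lines
    return bits[2 + address]               # 4 data lines

print(multiplexor6([0, 0, 1, 0, 0, 0]))    # address 00 selects data bit 0 -> 1
print(multiplexor6([1, 1, 1, 0, 0, 0]))    # address 11 selects data bit 3 -> 0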
The set of 200 examples was randomly shuffled and then presented sequentially. This procedure was repeated 10 times for both learning algorithms. The performance curves presented in Figure 3 are the average behavior exhibited over 10 runs. GABIL clearly outperforms C4.5 on the 6-input multiplexor problem. As noted above, the weaker performance of C4.5 is not due to the choice of representation (decision tree). In fact, a compact decision tree can be created to describe the concept. The problem lies with the information theoretic bias itself, which makes it hard to find this compact tree. Preliminary results suggest similar performance differentials on larger multiplexor problems.
The concept description language and the search algorithm constitute strong biases for any concept learner. The above experiments indicate that ID-like systems can
suffer both from their decision tree language bias and from their information-theoretic search bias. When the biases are appropriate, ID-like systems perform quite well. GABIL, however, due to its minimal system bias, performs uniformly well on target concepts of varying complexity. These initial results support the view that GABIL can be used as an effective concept learner although it may not outperform more strongly biased concept learning algorithms whose bias is appropriate for learning simpler target concepts.
6 Conclusions and Future Research
This paper presents a series of initial results regarding the use of GAs as the key element in the design of a system capable of continuously acquiring and refining concept classification rules from interactions with its environment. It is interesting to note that reasonable performance is achieved with minimal a priori bias. The initial results support the view that GAs can be used as an effective concept learner although they may not outperform algorithms specifically designed for concept learning when simple concepts are involved. This paper also sets the stage for the design of three additional GA-based concept learners. First, we wish to implement a variation of the current system that is truly incremental. Second, we are also very interested in understanding the difference between using the Pittsburgh approach and the Michigan approach in this problem domain. The current fixed-length rule representation can be used directly in Michigan-style classifier systems.
Third, we noted early in the paper that there were two basic strategies for selecting a representation for the concept description language. In this paper we developed a representation, which minimized the changes to standard GA implementations. We also plan to explore the alternative strategy of modifying the basic GA operators to deal effectively with non-string representations. We feel that the development and analysis of such systems is an important direction the research community should follow in order to develop additional results on these and other problems of interest.
Acknowledgements
We would like to thank Diana Gordon for her support and for many discussions on the biases in supervised concept learning systems. Diana was also instrumental in helping
us design our experimental methodology. We would also like to thank John Grefenstette and Alan Schultz for many useful comments about GABIL and crossover, J. R. Quinlan
for C4.5, and Paul Utgoff for ID5R.

INTERNET FIREWALLS



INTRODUCTION:

The Internet has made a large amount of information available to the average computer user at home, in business, and in education. For many people, having access to this information is no longer just an advantage; it is essential.

Connecting a private network to the Internet can expose critical or confidential data to malicious attack from anywhere in the world. Intruders could gain access to your site's private information or interfere with your use of your own systems. Users who connect their computers to the Internet must be aware of these dangers, their implications, and how to protect their data and their critical systems.

Therefore, network security is the main concern here, and firewalls provide this security. Internet firewalls keep the flames of Internet hell out of your network, or keep the members of your LAN pure by denying them access to all the evil Internet temptations.


DEFINITION:



A firewall is a hardware device or a software program, running on a secure host computer, that sits between two entities and controls access between them.

Here the two entities are nothing but a private network and the public network like Internet.
A firewall can be a software firewall or a hardware firewall. The first computer firewall was a non-routing Unix host with connections to two different networks: one network card connected to the Internet and the other to the private LAN. To reach the Internet from the private network, you had to log on to the firewall (Unix) server. You then used the resources of that system to access the Internet. For example, you could use X Windows to run Netscape's browser on the firewall system and have the display on your workstation. With the browser running on the firewall, it has access to both networks.
This sort of dual-homed system (a system with two network connections) is great if you can TRUST ALL of your users. You can simply set up a Linux system and give accounts on it to everyone needing Internet access. With this setup, the only computer on your private network that knows anything about the outside world is the firewall. No one can download directly to his or her personal workstation; they must first download a file to the firewall and then download the file from the firewall to their workstation.
Firewalls are mainly used for two purposes.
To keep people (worms/crackers) out.
To keep people (employees/children) in.

NEED FOR FIREWALLS:
The general reasoning behind firewall usage is that without a firewall, a subnet's systems expose themselves to inherently insecure services such as NFS or NIS and to probes and attacks from hosts elsewhere on the network. In a firewall-less environment, network security relies totally on host security and all hosts must, in a sense, cooperate to achieve a uniformly high level of security. The larger the subnet, the less manageable it is to maintain all hosts at the same level of security. As mistakes and lapses in security become more common, break-ins occur not as the result of complex attacks, but because of simple errors in configuration and inadequate passwords.
A firewall approach provides numerous advantages to sites by helping to increase overall host security. The following sections summarize the primary benefits of using a firewall.
Protection from Vulnerable Services
Controlled Access to Site Systems
Concentrated Security
Enhanced Privacy
Logging and Statistics on Network Use, Misuse
Policy Enforcement
1. Protection from Vulnerable Services:
A firewall can greatly improve network security and reduce risks to hosts on the subnet by filtering inherently insecure services. As a result, the subnet network environment is exposed to fewer risks, since only selected protocols will be able to pass through the firewall.
For example, a firewall could prohibit certain vulnerable services such as NFS from entering or leaving a protected subnet. This provides the benefit of preventing the services from being exploited by outside attackers, while at the same time permitting the use of these services with greatly reduced risk of exploitation. Services such as NIS or NFS that are particularly useful on a local area network basis can thus be enjoyed and used to reduce the host management burden. Firewalls can also provide protection from routing-based attacks, such as source routing and attempts to redirect routing paths to compromised sites via ICMP redirects. A firewall could reject all source-routed packets and ICMP redirects and then inform administrators of the incidents.
2. Controlled Access to Site Systems:
A firewall also provides the ability to control access to site systems. For example, some hosts can be made reachable from outside networks, whereas others can be effectively sealed off from unwanted access. A site could prevent outside access to its hosts except for special cases such as mail servers or information servers. This brings to the fore an access policy that firewalls are particularly adept at enforcing: do not provide access to hosts or services that do not require access. Put differently, why provide access to hosts and services that could be exploited by attackers when that access is not used or required? If, for example, a user requires little or no network access to her desktop workstation, then a firewall can enforce this policy.
3. Concentrated Security:
A firewall can actually be less expensive for an organization in that all or most modified software and additional security software could be located on the firewall systems as opposed to being distributed on many hosts. In particular, one-time password systems and other add-on authentication software could be located at the firewall as opposed to each system that needed to be accessed from the Internet.
Other solutions to network security such as Kerberos [NIST94c] involve modifications at each host system. While Kerberos and other techniques should be considered for their advantages and may be more appropriate than firewalls in certain situations, firewalls tend to be simpler to implement in that only the firewall need run specialized software.
4. Enhanced Privacy :
Privacy is of great concern to certain sites, since what would normally be considered innocuous information might actually contain clues that would be useful to an attacker. Using a firewall, some sites wish to block services such as finger and Domain Name Service. Finger displays information about users such as their last login time, whether they've read mail, and other items. But, finger could leak information to attackers about how often a system is used, whether the system has active users connected, and whether the system could be attacked without drawing attention.
Firewalls can also be used to block DNS information about site systems, thus the names and IP addresses of site systems would not be available to Internet hosts. Some sites feel that by blocking this information, they are hiding information that would otherwise be useful to attackers.
5. Logging and Statistics on Network Use, Misuse:
If all access to and from the Internet passes through a firewall, the firewall can log accesses and provide valuable statistics about network usage. A firewall with appropriate alarms that sound when suspicious activity occurs can also provide details on whether the firewall and network are being probed or attacked.
It is important to collect network usage statistics and evidence of probing for a number of reasons. Of primary importance is knowing whether the firewall is withstanding probes and attacks, and determining whether the controls on the firewall are adequate. Network usage statistics are also important as input into network requirements studies and risk analysis activities.
6. Policy Enforcement:
Lastly, but perhaps most importantly, a firewall provides the means for implementing and enforcing a network access policy. In effect, a firewall provides access control for users and services. Thus, a network access policy can be enforced by a firewall, whereas without a firewall such a policy depends entirely on the cooperation of users. A site may be able to depend on its own users for their cooperation; however, it cannot and should not depend on Internet users in general.
Types of firewalls: Firewalls fall into different categories. They are mainly:
1. Packet filtering firewalls
2. Circuit-level gateways
3. Application gateways
4. Stateful multilayer inspection firewalls


1. Packet Filtering Firewalls:

These firewalls work at the network layer of the OSI model, or the IP layer of TCP/IP. They are usually part of a router. A router is a device that receives packets from one network and forwards them to another network. In a packet filtering firewall, each packet is compared to a set of criteria before it is forwarded. Depending on the packet and the criteria, the firewall can drop the packet, forward it, or send a message to the originator. Rules can include source and destination IP addresses, source and destination port numbers, and the type of protocol embedded in the packet. These firewalls often contain an ACL (Access Control List) to restrict who gains access to which computers and networks.
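The sketch below is a hypothetical illustration of this rule-matching logic in Python, not a description of any particular router; the rules, networks, and default-deny policy are invented.

# Hypothetical sketch of packet filtering: each packet is compared against an
# ordered rule list (a simple ACL) before being forwarded or dropped.
from ipaddress import ip_address, ip_network

RULES = [
    # (action,  source network,               protocol, destination port)
    ("allow", ip_network("192.168.1.0/24"), "tcp", 80),
    ("deny",  ip_network("0.0.0.0/0"),      "tcp", 23),   # block telnet from anywhere
]

def filter_packet(src_ip: str, protocol: str, dst_port: int) -> str:
    for action, net, proto, port in RULES:
        if ip_address(src_ip) in net and protocol == proto and dst_port == port:
            return action
    return "deny"          # default policy: drop anything not explicitly allowed

print(filter_packet("192.168.1.7", "tcp", 80))   # allow
print(filter_packet("203.0.113.5", "tcp", 23))   # deny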

Advantages of packet filtering:

1. It is cost effective to simply configure routers that are already part of the network to do additional duty as firewalls.
2. Network layer firewalls tend to be very fast and very transparent to users.
3. Cost: virtually all high-speed Internet connections require a router, so organizations with high-speed Internet connections already have the capability to perform basic packet filtering at the router level without purchasing additional hardware or software.
Drawbacks of packet filtering:

1. They don't provide for password controls.
2. Users can't identify themselves.
3. The person who configures the firewall protocol for the router needs a thorough knowledge of IP packet structure.
4. There is no user authentication.
5. They remain vulnerable to attacks such as spoofing the source address.

2. Circuit-level Gateway:
These firewalls work at the session layer of the OSI model, or the TCP layer of TCP/IP. They monitor TCP handshaking between packets to determine whether a requested session is legitimate. Traffic is filtered based on specified session rules, such as when a session is initiated by a recognized computer. Information passed to a remote computer through a circuit-level gateway appears to have originated from the gateway. This is useful for hiding information about protected networks. Circuit-level gateways are relatively inexpensive and have the advantage of hiding information about the private network they protect. On the other hand, they do not filter individual packets; unknown traffic is allowed up to level 4 of the network stack. These are hardware firewalls and apply security mechanisms when a TCP or UDP connection is established.
3. Application Gateways:

These are software firewalls. They are often used by companies specifically to monitor and log employee activity, and by private citizens to protect a home computer from hackers and spyware and to set parental controls for children.
Application gateways, also called proxies, are similar to circuit-level gateways except that they are application specific. They can filter packets at the application layer of the OSI or TCP/IP model. Incoming or outgoing packets cannot access services for which there is no proxy. In plain terms, an application-level gateway that is configured to be a web proxy will not allow any FTP, gopher, telnet, or other traffic through. Because they examine packets at the application layer, they can filter application-specific commands such as HTTP GET and POST.
An application gateway works like a proxy. A proxy is a process that sits between a client and a server: to the client, the proxy looks like a server, and to the server, the proxy looks like a client. Example of an application-layer firewall: in the figure below, an application-layer firewall called a ``dual-homed gateway'' is represented. A dual-homed gateway is a highly secured host that runs proxy software. It has two network interfaces, one on each network, and blocks all traffic passing through it.



 Dual Homed Gateway
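As a hypothetical illustration of the application-specific filtering described above (not tied to any particular proxy product), the sketch below inspects HTTP request lines and forwards only an invented set of allowed methods.

# Hypothetical sketch of application-level filtering: the gateway understands
# the application protocol (here, HTTP request lines) and can reject specific
# commands, which a packet filter working only on addresses and ports cannot.
ALLOWED_METHODS = {"GET", "HEAD"}            # e.g. forbid POST uploads (assumed policy)

def inspect_http_request(request_line: str) -> bool:
    """Return True if the proxied request should be forwarded."""
    parts = request_line.split()
    return len(parts) == 3 and parts[0] in ALLOWED_METHODS

print(inspect_http_request("GET /index.html HTTP/1.1"))   # True
print(inspect_http_request("POST /upload HTTP/1.1"))       # False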
Advantages of application gateways:
Since application proxies examine packets at the application program level, a very fine level of security and access control may be achieved.
These reject all inbound packets containing common EXE and COM files.
The greatest advantage is that no direct connections are allowed through the firewall under any circumstances.
Proxies provide a high level of protection against denial of service attacks.
Disadvantages of application gateways:
1. Proxies require a large amount of computing resources in the host system, which can lead to performance bottlenecks or slow down the network.
2. Proxies must be written for specific application programs and not all applications have proxies available.
4. Stateful Multilayer Inspection Firewalls:

They combine aspects of the other three types of firewalls. This firewall keeps track of all packets associated with a specific communication session. A typical communication session between two computers will consist of several thousand packets, each of which is identified by a unique source and destination address and a sequence number that allows all of the packets to be reassembled into the correct data file at the destination computer. Each packet of data is checked to ensure that it belongs to the proper session. Any packets that are not part of an existing session are rejected. In addition to checking and validating the communication session and ensuring that all packets belong to the proper session, these firewalls further screen the packets at the application layer as well.
Filtering at the software application port level provides an additional layer of control for the network administrator to ensure that only authorized transactions are allowed through the firewall. These firewalls close off ports until a connection to the specified port is requested.
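A hypothetical sketch of the session tracking just described: packets are accepted only if they open a new session to a permitted port or belong to a session already in the table. The port list and the tuple format are invented.

# Hypothetical sketch of stateful inspection with a simple session table.
sessions = set()   # (src_ip, src_port, dst_ip, dst_port) tuples seen so far

def handle_packet(src_ip, src_port, dst_ip, dst_port, syn=False) -> bool:
    key = (src_ip, src_port, dst_ip, dst_port)
    if syn and dst_port in (80, 443):       # new outbound session to an open port
        sessions.add(key)
        return True
    return key in sessions                  # otherwise only known sessions pass

print(handle_packet("10.0.0.5", 40000, "198.51.100.2", 80, syn=True))  # True
print(handle_packet("10.0.0.5", 40000, "198.51.100.2", 80))            # True
print(handle_packet("203.0.113.9", 5555, "10.0.0.5", 22))              # False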
Advantages of stateful inspection:
These will typically offer much higher performance than proxies.
These ensure that all packets must be part of an authorized communication session. Therefore, a higher level of protection is provided to users communicating with systems external to the trusted network.
Stateful Inspection provides a greater level of security control by enforcing security policies at the "application socket" or port layer as well as the protocol and address level.
Disadvantages of stateful inspection:
1. Stateful inspection functionality currently requires the purchase of additional hardware and/or software and is not typically "bundled" with an existing network device.
A simple example of a firewall:
Cisco's 500 series firewalls are promoted as better because they use a cut-through protocol for packet examination and an ACL that compares connections against past connections with the same client. In other words, based on the first connection with a client, a kind of fingerprint is created using source and destination addresses, ports, TCP sequence numbers, and other TCP flags. So, instead of examining every packet in a client's connection stream, the packets are first compared to the ACL. If a packet matches an authorized fingerprint, the data stream is allowed without further examination. Both the cut-through protocol and the use of an ACL are said to greatly enhance speed.



1. Firewalls create barriers in order to prevent unauthorized access to a network.
2. They are the security doors through which some people (i.e. data) may pass and others may not.
3. It adds another layer of security to your systems.
4. It protects networked computers from intentional hostile intrusion that could compromise confidentiality or result in data corruption or denial of service.
5. It is a choke point through which all traffic flows between two networks.
Advantages of firewall:
Concentration of security, all modified software and logging is located on the firewall system as opposed to being distributed on many hosts;
Protocol filtering, where the firewall filters protocols and services that are either not necessary or that cannot be adequately secured from exploitation;
Information hiding, in which a firewall can ``hide'' names of internal systems or electronic mail addresses, thereby revealing less information to outside hosts;
Application gateways, where the firewall requires inside or outside users to connect first to the firewall before connecting further, thereby filtering the protocol;
Extended logging, in which a firewall can concentrate extended logging of network traffic on one system;
Centralized and simplified network services management, in which services such as ftp, electronic mail, gopher, and other similar services are located on the firewall system(s) as opposed to being maintained on many systems.
Disadvantages of firewall :
Given these advantages, there are some disadvantages to using firewalls. 1. The most obvious is that certain types of network access may be hampered or even blocked for some hosts, including telnet, ftp, X Windows, NFS, NIS, etc. However, these disadvantages are not unique to firewalls; network access could be restricted at the host level as well, depending on a site's security policy.
2. A second disadvantage of a firewall system is that it concentrates security in one spot as opposed to distributing it among systems; thus a compromise of the firewall could be disastrous to other, less-protected systems on the subnet. This weakness can be countered, however, with the argument that lapses and weaknesses in security are more likely to be found as the number of systems in a subnet increases, thereby multiplying the ways in which subnets can be exploited.
3. Another disadvantage is that relatively few vendors have offered firewall systems until very recently. Most firewalls have been somewhat ``hand-built'' by site administrators; however, the time and effort that can go into constructing a firewall may outweigh the cost of a vendor solution. There is also no firm definition of what constitutes a firewall; the term ``firewall'' can mean many things to many people.
WHAT FIREWALLS CAN'T PROVIDE SECURITY AGAINST: In addition, firewalls can't provide security against the following.
1. A firewall can't protect against attacks that don't go through the firewall. Many corporations that connect to the Internet are very concerned about confidential data leaking out of the company through that route. However, a magnetic tape can export the same data just as easily.
2. Many organizations that are terrified of Internet connections have no coherent policy about how dial-in access via modems should be protected. There are many organizations out there buying expensive firewalls and neglecting the numerous other back doors into their network.
3. Another thing a firewall can’t really protect you against is traitors or idiots inside the network. An industrial spy might leak information or export it through a telephone, FAX or floppy disk. Firewalls can’t protect you against this stupidity.
4. Firewalls can't protect very well against things like viruses. There are too many ways of encoding binary files for transfer over networks, and too many different architectures and viruses to try to search for them all. In other words, a firewall cannot replace security-consciousness on the part of your users. In general, a firewall cannot protect against a data-driven attack--attacks in which something is mailed or copied to an internal host where it is then executed.
Organizations that are deeply concerned about viruses should implement organization-wide virus control measures. Rather than trying to screen viruses out at the firewall, make sure that every vulnerable desktop has virus-scanning software that is run when the machine is rebooted. Blanketing your network with virus scanning software will protect against viruses that come in via floppy disks, modems, and Internet. Trying to block viruses at the firewall will only protect against viruses from the Internet--and the vast majority of viruses are caught via floppy disks.
Conclusion:
In conclusion, the Internet has become a dangerous place. Thirteen-year-old kids on dial-up accounts can crash a site supported by two T-1 connections by using hundreds of zombies (PCs that have been hacked and loaded with a Trojan) to flood it with UDP and ICMP traffic. This is simply a malicious attack meant to consume all of the bandwidth of a connection to the Internet. Yahoo was recently crashed by what is called a 'smurf' attack, in which ping requests are sent to several Internet broadcast addresses with a spoofed return address aimed at the victim (Yahoo in this case). The resulting storm of packets consumes all bandwidth and disconnects the site or makes it unusable for normal traffic. Hackers attack networks to destroy and/or steal information. They attack PCs so they can use them in zombie attacks, to hide their identity when trying to gain illegal entry to secured networks, or for nothing more than malicious purposes. While on the Internet, my firewall typically gets 1 to 3 hits an hour, primarily from port scanners looking for a specific Trojan or a vulnerability to exploit. No one should be on the Internet without a firewall, and all networks should be protected by firewalls. However, it is always a trade-off: the whole point of the Internet is communication and the exchange of information. The question is how much we restrict access without losing the advantages of speed and openness.


NETWORK SECURITY


Abstract:
Network security is a complicated subject, historically tackled only by well-trained and experienced experts. However, as more and more people become "wired," an increasing number of them need to understand the basics of security in a networked world. In this document we discuss some of the network security issues in TCP/IP.
The Transmission Control Protocol/Internet Protocol (TCP/IP) suite is a very widely used technology employed to interconnect computing facilities in modern network environments. TCP/IP is the "language" of the Internet: anything that can learn to "speak TCP/IP" can play on the Internet. However, there exist several security vulnerabilities in the TCP specification and additional weaknesses in a number of widely available implementations of TCP. These vulnerabilities may enable an intruder to "attack" TCP-based systems, allowing him or her to "hijack" a TCP connection or cause denial of service to legitimate users. We discuss some of the flaws present in the TCP implementations of many widely used operating systems and provide recommendations to improve the security of a TCP-based system, e.g., incorporation of a "timer escape route" from every TCP state.
Keywords and phrases:
Network security, TCP, IP, Vulnerability analysis, state transitions
INTRODUCTION:
Internetworking is an approach that allows dissimilar computers on dissimilar networks to communicate with one another in a seamless fashion by hiding the details of the underlying network hardware. The most widely used form of internetworking is provided by the Transmission Control Protocol/Internet Protocol (TCP/IP) suite.
There are some inherent security problems in the TCP/IP suite which make the situation conducive to intruders. TCP sequence number prediction, IP address spoofing, misuse of IP's source routing principle, and use of Internet Control Message Protocol (ICMP) messages for denial of service are some methods used to exploit the network's vulnerabilities. Considering the fact that most important application programs, such as the Simple Mail Transfer Protocol (SMTP), Telnet, the r-commands (rlogin, rsh, etc.), and the File Transfer Protocol (FTP), have TCP as their transport layer, security flaws in TCP can prove to be very hazardous for the network.
The objectives of this paper are to identify and analyze the vulnerabilities of TCP/IP and to develop security enhancements to overcome those flaws. Our work is based on analyzing the state transition diagram of TCP and determining the security relevance of some of the “improperly-defined” transitions between different states in the state transition diagram of many widely used TCP code implementations.
BASICS OF TCP/IP:
NETWORKING WITH TCP/IP:
Network protocols employ a structured and layered approach, with each layer performing a separate function. This approach makes it possible to develop an individual layer without modifying adjacent layers. Networking using the TCP/IP suite can be viewed as a combination of four layers, described below.
The lowest layer, the data link (network interface) layer, connects the system with the physical media.
The next layer is the internet layer or the network layer. It assists with the movement of packets in the network.
User processes interact with the network layer through the transport layer. The transmission control protocol is the most common transport layer used in modern networking environments. TCP provides reliable data transfer between different application processes over the network. TCP provides flow control and congestion control as well.
The Application layer handles the details of a particular application. This layer interacts with the user, gets data from the user, and sends the buffered data to the transport layer.
TRANSPORT LAYER
Among all of the transport layers, TCP is the most popular. Below, we examine the details of the header format of TCP along with the TCP state-transition diagram and TCP timers.
TCP HEADER
The size of the TCP header is 20 bytes, not counting its options, as shown in the figure. Each TCP segment contains the source and destination port numbers to identify the sending and receiving application programs, respectively. The sequence number is essential to keep the bytes of data flowing from the sender to the receiver in proper order. By exchanging the sequence number and the corresponding acknowledgment number, the sender and the receiver can detect lost or retransmitted data in the connection. There are six flag bits in the TCP header, namely URG, ACK, PSH, RST, SYN and FIN. At any given time, one or more of these bits can be set.
TCP provides flow control by advertising the window size. The checksum covers the TCP header and TCP data and assists in detecting any error in their transmission. TCP's urgent mode is a mechanism for the sender to transmit emergency/urgent data; the urgent pointer is valid only if the URG flag is set in the header and helps locate the sequence number of the last byte of urgent data. There is an optional options field as well, carrying vendor-specific information.
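As a concrete illustration of the fixed 20-byte header layout described above, the following Python sketch (a simplified illustration, not production code; real code would also validate the data offset and handle options) unpacks the header fields from raw bytes using the standard struct module:

import struct

def parse_tcp_header(segment: bytes) -> dict:
    # Fixed 20-byte TCP header, network byte order: ports, sequence/ack numbers,
    # data offset + flags, window, checksum, urgent pointer.
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = (offset_flags >> 12) * 4   # header length in bytes
    flags = offset_flags & 0x003F            # low six bits: URG, ACK, PSH, RST, SYN, FIN
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack, "data_offset": data_offset,
        "URG": bool(flags & 0x20), "ACK": bool(flags & 0x10),
        "PSH": bool(flags & 0x08), "RST": bool(flags & 0x04),
        "SYN": bool(flags & 0x02), "FIN": bool(flags & 0x01),
        "window": window, "checksum": checksum, "urgent_ptr": urg_ptr,
    }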
TCP STATE TRANSITION DIAGRAM
Initiation, establishment, and termination of a connection is governed by the TCP state transition diagram, which consists of well-defined states and transition arcs between these states.
TCP TIMERS
The TCP state transition diagram is very closely associated with timers. There are various timers associated with connection establishment or termination, flow control, and retransmission of data.
A connection-establishment timer is set when the SYN packet is sent during the connection-establishment phase. If a response is not received within 75 seconds (in most TCP implementations), the connection establishment is aborted.
A FIN_WAIT_2 timer is set to 10 minutes when a connection moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state. If the connection does not receive a TCP packet with the FIN bit set within the stipulated time, the timer expires and is reset to 75 seconds. If no FIN packet arrives within that time, the connection is dropped.
There is a TIME_WAIT timer, often called a 2 MSL (maximum segment lifetime) timer. It is set when a connection enters the TIME_WAIT state. When the timer expires, the kernel data blocks related to that particular connection are deleted, and the connection is terminated.
A keep-alive timer can be set to periodically check whether the other end of the connection is still active. If the SO_KEEPALIVE socket option is set, and if the TCP state is either ESTABLISHED or CLOSE_WAIT and the connection is idle, then probes are sent to the other end of the connection once every two hours. If the other end does not respond to a fixed number of these probes, the connection is terminated.
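For example, an application can request this keep-alive behaviour through the standard socket API, as in the minimal Python sketch below; the two-hour probe interval itself is an operating-system default and is not set here.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask TCP to send keep-alive probes on this connection when it is idle;
# the probe interval (typically two hours) is governed by the operating system.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# sock.connect(("example.com", 80))   # the option takes effect once the connection is established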
Additional TCP timers, such as the persist timer, the delayed-ACK timer, and the retransmission timer, are not relevant for our purposes here and hence are not discussed.
VULNERABILITIES:
IP SPOOFING:
INSTANCES
The concept of attacks on TCP/IP, such as TCP sequence number guessing, was first brought to light by Morris. The Computer Emergency Response Team (CERT) Coordination Center has received reports of attacks in which intruders created packets with spoofed source IP addresses. These attacks exploit applications that use authentication based on IP addresses. Intruder activity in spoofing IP addresses can lead to unauthorized remote access to systems behind a filtering-router firewall.
On Christmas Day, 1994, Kevin Mitnick broke into the computer of Tsutomu Shimomura, a computational physicist at the San Diego Supercomputer Center. Prior to this attack, Mitnick had found his way into the WELL, a small network used mainly by an eclectic group of about 11,000 computer users in the San Francisco Bay Area. Mitnick had been reading the electronic mail of the WELL's subscribers and using WELL accounts for remote attacks on computers across the Internet. During the attack on Shimomura's machine, two different intrusion mechanisms were employed: IP source address spoofing and TCP sequence number prediction were used to gain initial access to a diskless workstation, being used mostly as an X terminal. After obtaining root access, Mitnick "hijacked" an existing connection to another system by means of a loadable kernel STREAMS module.
METHODOLOGY:
Let us assume that there are three hosts: host A, host B, and the intruder-controlled host X. Let us assume that B grants A some special privileges, and thus A can get some actions performed by B. The goal of X is to get the same actions performed by B for itself. In order to achieve this goal, X has to perform two operations: first, establish a forged connection with B; and second, prevent A from informing B of any malfunction of the network authentication system. Host X has to spoof the IP address of A in order to make B believe that the packets from X are actually being sent by A.
Let us also assume that hosts A and B communicate with one another using the following three-way handshake mechanism of TCP/IP. The handshake is depicted below:
A → B: SYN (seq. no. = M)
B → A: SYN (seq. no. = N), ACK (ack. no. = M+1)
A → B: ACK (ack. no. = N+1)
Host X does the following to perform IP spoofing. First, it sends a SYN packet to host B with some random sequence number, posing as host A. Host B responds by sending a SYN-ACK packet back to host A with an acknowledgment number equal to one plus the original sequence number. At the same time, host B generates its own sequence number and sends it along with the acknowledgment number. In order to complete the three-way handshake, host X should send an ACK packet back to host B with an acknowledgment number equal to one plus the sequence number sent by host B to host A. If we assume that host X is not present in the same subnet as A or B, so that it cannot sniff B's packets, host X has to guess B's sequence number in order to create the TCP connection.
These steps are depicted below:
X → B: SYN (seq. no. = M), SRC = A
B → A: SYN (seq. no. = N), ACK (ack. no. = M+1)
X → B: ACK (ack. no. = N+1), SRC = A
At the same time, host X should take away host A's ability to respond to the packets of host B. To achieve this, X may either wait for host A to go down (for some reason), or block the protocol part of A's operating system so that it does not respond to host B, for example by flooding A with half-open (incomplete) connections.
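For illustration only, the forged packets of this exchange could be crafted with a packet-crafting library such as Scapy (a third-party Python library). The sketch below uses assumed addresses, ports, and sequence numbers, must be run with administrative privileges, and is intended purely for study in an isolated lab network:

# Lab-only sketch: forging the handshake packets described above with Scapy.
from scapy.all import IP, TCP, send

A, B = "10.0.0.1", "10.0.0.2"   # assumed addresses of hosts A and B
M = 1000                        # arbitrary initial sequence number chosen by X
N_GUESS = 20000                 # B's sequence number as predicted by X

# Step 1: SYN sent to B with A's address forged as the source.
send(IP(src=A, dst=B) / TCP(sport=1023, dport=513, flags="S", seq=M))

# Step 3: ACK completing the handshake, acknowledging B's predicted sequence number.
send(IP(src=A, dst=B) / TCP(sport=1023, dport=513, flags="A", seq=M + 1, ack=N_GUESS + 1))

Step 2 (B's SYN-ACK to A) is never seen by X, which is exactly why the attack hinges on predicting N.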
THE ATTACK:
During the Christmas Day, 1994 attack, Shimomura observed a sequence of packets that were generated to perform IP spoofing. Let us continue with the previous example, with X as the intruder-controlled system, and observe the actions performed by the intruder.
X sends a number of probe packets to B and A, trying to determine whether any kind of trust relationship exists between hosts A and B. Commands such as showmount, rpcinfo, and finger were utilized for this purpose.
X sends a number of TCP SYN packets, i.e., packets with the SYN flag set and some arbitrary initial sequence numbers, to host A. However, the source IP addresses of these packets have been forged so that they appear to come from a host that does not exist on the network. Host A responds to these packets by sending corresponding SYN-ACK packets to the non-existent host. As there are no corresponding ACK packets for the packets sent by A, the three-way handshake is never completed. The connection queue for port 513 (the login port) of A fills up with connection setup requests, and the port will therefore not be able to generate RST packets in response to unexpected SYN-ACK packets.
X sends a number of connection request packets (SYN packets) to host B. When host B responds to them by sending corresponding SYN-ACK packets to X, X sends RST packets to B. Thus the three-way handshake is not completed and TCP connections are never established between B and X. The purpose of this step is to determine the behavior of B's TCP sequence number generator. The sequence numbers obtained from B for each new connection are analyzed by X; the periodicity of these numbers is determined, and this data will be used by X in the next step to generate and send a forged packet to B with a forged sequence number.
X creates a forged SYN packet with the same source IP address as host A and sends this packet to B. B sends a corresponding SYN-ACK packet to A. However, A is ignoring all new packets arriving at its login port, so it will not send any RST packet to B in response to the unexpected SYN-ACK packet from B.
X does not receive the SYN-ACK packet sent by B to A (assuming X is present in a different subnet). However, X is in a position to predict the sequence number present in B's SYN-ACK packet. X generates and sends a forged ACK packet to B with the source host address the same as that of A and an acknowledgment number corresponding to the sequence number in B's SYN-ACK packet. B assumes that the three-way handshake has been successfully performed. Hence, there is a one-way TCP connection established from X to B.
Host X is now in a position to send commands to B. B will perform these commands, assuming that they are being sent by the trusted host A.
Problems with TCP State Transitions:
Let us take a closer look at Step 2. The intruder-controlled host X is able to stall the login port of host A by sending a series of SYN packets but not sending ACK packets corresponding to the SYN-ACK packets from A to X. As we have observed before, TCP maintains a connection-establishment timer. If a connection does not get established within a stipulated time (typically 75 seconds), TCP resets the connection. Thus, in our previous example, the server port will not be able to respond for a duration of 75 seconds.
Extraneous State Transitions:
Consider a sequence of packets between hosts X and A. X sends a packet to A with both the SYN and FIN flags set. A responds by sending an ACK packet back to X, as illustrated below.
X → A: SYN, FIN (seq. no. = M)
A → X: ACK (ack. no. = M+1)
Examining the state-transition diagram in the figure, we observe that A is initially in the LISTEN state. When it receives the packet from X, it starts processing it. It processes the SYN flag first and transitions to the SYN_RCVD state. It then processes the FIN flag and performs a transition to the CLOSE_WAIT state. Had the previous state been ESTABLISHED, this transition to the CLOSE_WAIT state would have been a normal one. However, a transition from the SYN_RCVD state to the CLOSE_WAIT state is not defined in the TCP specification. This phenomenon occurs in several TCP implementations, such as those in the operating systems SunOS 4.1.3, SVR4, and ULTRIX 4.3. Thus, contrary to the specification, there exists in several TCP implementations a transition arc from the state SYN_RCVD to the state CLOSE_WAIT, as shown in the figure.
Security Relevance
In our example attack scenario, the TCP connection is not yet fully established since the 3-way handshake is not completed; thus, the corresponding network application never got the connection from the kernel. However, host A’s TCP “machine” is in CLOSE_WAIT state and is expecting the application to send a close signal so that it can send a FIN packet to X and terminate the connection. This half-open connection does not send any message to help TCP perform any state transition. Thus, A’s TCP “machine” gets stuck in the CLOSE_WAIT state. If the keep-alive timer feature is enabled, TCP will be able to reset the connection and perform a transition to the CLOSED state after a period of usually two hours.
The intruder-controlled host X needs to perform the following steps to wedge A's operating system so that it cannot respond to unexpected SYN-ACKs from other hosts for as long as two hours.
• X sends a packet to host A with the SYN and FIN flags set. A responds with an ACK packet and changes its state from CLOSED/LISTEN to SYN_RCVD and then to CLOSE_WAIT.
• X does not send any more packets to A, thus preventing any further TCP state transition in A.
Thus, we observe that extraneous state transitions exist in several implementations of TCP, and these may lead to severe security violations of the system.
Experiments and Results
Assume that there are two hosts, host A and host B, and an intruder-controlled host X. We will see what happens in an IP spoofing attack and in the extraneous state transitions described above.
Stalling a port
• An ftp connection is initiated from the "intruder" machine X to A.
• The TCP layer of X sends a SYN packet to A. A responds by sending a SYN-ACK packet and performs a state transition to the SYN_RCVD state.
• X does not send any other packet to A. A remains in the SYN_RCVD state until the connection-establishment timer expires.
The sequence of packets, as observed in the output of tcpdump [10], is as follows:
23:26:51.475103 X.32781 > A.ftp:
S 4188491776:4188491776(0) win 8760 (DF)
23:26:51.477716 A.ftp > X.32781:
S 1382592000:1382592000(0) ack 4188491777 win 4096
We observe that port 32781 of X sends a SYN packet to the "ftp" port of A with an initial sequence number of 4188491776 and an initial window advertisement of 8760 at time 23:26:51.475103. A, in turn, responds with a SYN-ACK packet, with an initial sequence number of 1382592000 and an acknowledgment number of 4188491777, at time 23:26:51.477716.
However, X did not send any other packet and so A gets stuck in the SYN_RCVD state for around 75 seconds.
Spurious state transition
To generate the spurious state-transition from the SYN_RCVD state to the CLOSE_WAIT state, we employed the following steps:
• We start an ftp connection from X to A.
• In order to start the connection, X sends a SYN-FIN TCP packet to A (a sketch of crafting such a packet appears after this list).
• A responds back with an ACK packet.
• X does not send any other packet to A.
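A sketch of crafting the SYN-FIN packet of the second step with Scapy (assumed addresses and ports; lab use only) might look as follows; note that, matching the experiment, no further packet is sent afterwards:

from scapy.all import IP, TCP, send

# Single packet with both SYN and FIN flags set, aimed at A's ftp port.
send(IP(src="10.0.0.3", dst="10.0.0.1") / TCP(sport=32780, dport=21, flags="SF", seq=1550373888))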
Using tcpdump, we observed the following sequence of packets on the network:
21:41:05.177249 X.32780 > A.ftp:
SF 1550373888:1550373888(0) win 8760
21:41:05.177606 A.ftp > X.32780:
. ack 1550373890 win 4096
Had there been no spurious state transition from SYN_RCVD to the CLOSE_WAIT state in the TCP implementation in the OS of A, the TCP "machine" of A would have waited in the SYN_RCVD state until the connection-establishment timer expired. However, the netstat command on A gave us the following output:
tcp 0 0 A.ftp X.32780 CLOSE_WAIT
This clearly indicates that there exists a TCP connection between the “ftp” port of A and port 32780 of X, and the connection exists in the CLOSE_WAIT state. The connection remains in this state in A long after the peer host closed the connection on its side.
We obtained similar results with the TCP implementation of the ULTRIX 4.3 OS as well.
Recommendations
There is no easy way to prevent IP spoofing. We may perform the following tasks to protect our systems from this sort of attack. First, we may configure the routers and gateways in our networks such that they do not allow connections from outside with a source IP address the same as that of any system within the local subnet. Also, they should not route packets from a host in the local subnet to the outside when the source IP address of the packet is not one present in the local subnet. Second, we can encrypt packets before sending them onto the network. Though this process requires extensive changes in the present networking environment, it will ensure the integrity and authenticity of data.
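As an illustration of the first recommendation, the check a border router or gateway applies is conceptually simple. The Python sketch below (illustrative only, with an assumed local subnet) drops inbound packets that claim a local source address and outbound packets that do not carry one:

import ipaddress

LOCAL_SUBNET = ipaddress.ip_network("192.168.1.0/24")   # assumed local subnet

def allow(src_ip: str, inbound: bool) -> bool:
    src = ipaddress.ip_address(src_ip)
    if inbound:
        # Ingress filter: packets arriving from outside must not claim an inside source address.
        return src not in LOCAL_SUBNET
    # Egress filter: packets leaving the subnet must carry an inside source address.
    return src in LOCAL_SUBNET

print(allow("192.168.1.10", inbound=True))    # False: spoofed, drop it
print(allow("192.168.1.10", inbound=False))   # True: legitimate outbound traffic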
To prevent the spurious state transition from the SYN_RCVD state to the CLOSE_WAIT state, we should request the OS vendors to modify the relevant part of the source code in their TCP implementations. In other words, when the TCP "machine" is in the SYN_RCVD state, it should ignore any FIN packets that it might receive from a peer host.

DATA WAREHOUSING


DATA WAREHOUSING
A data warehouse is a store of information organized in a unified data model.
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In this paper we tried to explain the characteristics, architectures and processes involved in data warehousing.
Characteristics of a data warehouse are
• Subject oriented
• Integrated
• Non volatile
• Time variant
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
• Data warehouse Architecture (Basic)
• Data warehouse Architecture (With a staging area)
• Data warehouse Architecture (With a staging area and Data Marts)
Data warehousing also involves data pre-processing; those methods are explained in this paper. Thus this paper gives an overview of data warehousing concepts.
Introduction
A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise used for the purpose of further understanding the data. It may be contrasted with data that is gathered to meet immediate business objectives, such as order and payment transactions, although this data would also usually become part of a data warehouse.
A data warehouse is, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP). Data warehousing is not meant for current, "live" data.
Characteristics of a data warehouse:
• Subject oriented
• Integrated
• Non volatile
• Time variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
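For instance, a question such as "Who was our best customer for this item last year?" maps directly onto a query over a sales-oriented warehouse table. A minimal sketch using Python's built-in sqlite3 module (the table and column names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Acme", "widget", 2009, 1200.0),
    ("Beta", "widget", 2009, 800.0),
    ("Acme", "gadget", 2009, 300.0),
])
# "Who was our best customer for this item last year?"
row = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM sales
    WHERE item = 'widget' AND year = 2009
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(row)   # ('Acme', 1200.0)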
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Data warehouse architecture:
Data warehouse architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time.
• Data warehouse Architecture (Basic)
• Data warehouse Architecture (With a staging area)
• Data warehouse Architecture (With a staging area and Data Marts)

Data from the operational systems are
• Extracted
• Cleansed
• Transformed
• Loaded & Propagated
Extraction
Data Warehouse Manager supports standard SQL-style inner, outer and exception file joins, as well as joins based on derived fields, one-to-many relationships, file members, unique key fields, and file unions. You may extract data from both local and distributed databases when OS/400 Distributed Data Base Management (DDM) is in use.
Record selection and new field calculation criteria may be defined in the extraction object or left for ad hoc entry at run-time. CL programs may also be used to pass ad hoc criteria into the extraction object. NGS provides sample CL source programs to illustrate this capability. Data Warehouse Manager also supports full Boolean selection logic and a timesaving option to copy the entry of complex selection criteria.
Transformation
The data transformation capabilities of Data Warehouse Manager convert numeric codes for inventory items, employees, departments and other business terms into meaningful, descriptive values. You can also convert dates into consistent formats and generate summary fields such as gross-to-net, gross margin and more. Other transformation functions allow you to create fields representing items such as integer keys, elapsed days and time, absolute values, remainders, etc.
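A minimal sketch of such transformations in Python (the record layout, the accepted date formats, and the department lookup are assumptions for illustration, not the product's actual API):

from datetime import datetime

def transform(record: dict) -> dict:
    # Normalise dates that arrive in inconsistent formats.
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            record["order_date"] = datetime.strptime(record["order_date"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    # Generate summary fields such as gross-to-net.
    record["net"] = record["gross"] - record["discount"]
    # Convert numeric department codes into meaningful, descriptive values.
    departments = {10: "Sales", 20: "Engineering"}
    record["department"] = departments.get(record["dept_code"], "Unknown")
    return record

print(transform({"order_date": "05/28/2010", "gross": 100.0, "discount": 7.5, "dept_code": 10}))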
Cleansing
Data Warehouse Manager can handle many data cleansing requirements and provides Application Program Interfaces (APIs) for custom cleansing programs or third-party tools.
Loading and Propagation
Data Warehouse Manager builds DB2 UDB for iSeries (AS/400) database files, which may be accessed by numerous IBM and third-party software tools.
These files may be generated in summary or detail as required. Key fields may be assigned to support star-schema warehouse architectures. Depending on the object's attributes, the output may create, replace or append DB2 UDB files.
Data pre-processing
Data pre-processing is necessary because:
Real-world data is dirty.
Data is inconsistent.
Data preprocessing tasks:
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Concept hierarchy generation for numeric data
Data cleaning:
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of the data.
Data Quality Problems:
Misspellings during data entry, missing information or other invalid data
Present in single collections such as files and databases
Problems increase when multiple sources need to be integrated
Data cleaning is important because:
• Poor data quality can result in loss of money
$2 billion of U.S. federal loan money has been lost due to poor data quality at a single agency.
• Data warehouses must provide high-level quality of data and service as decision support information systems.
• The probability of dirty data is high.
A significant portion of cleaning and transformation work is done manually or by low-level programs that are difficult to maintain; a minimal sketch of one such low-level cleaning step follows.
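The sketch below shows, in Python, the kind of low-level cleaning step such programs perform; the field names, spelling table, and rules are invented for illustration only:

def clean(record: dict) -> dict:
    # Trim stray whitespace introduced during data entry.
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # Normalise known misspellings and inconsistent codes.
    spelling_fixes = {"calfornia": "California", "ca": "California"}
    state = cleaned.get("state", "")
    cleaned["state"] = spelling_fixes.get(state.lower(), state)
    # Flag missing values instead of silently passing them on.
    if not cleaned.get("customer_id"):
        cleaned["quality_flag"] = "missing customer_id"
    return cleaned

print(clean({"customer_id": "", "state": " calfornia "}))
# {'customer_id': '', 'state': 'California', 'quality_flag': 'missing customer_id'}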
Data Cleaning Approach:
• Detect and remove all major errors and inconsistencies in individual and multiple data sources
• Should be supported by tools to limit manual inspection and programming effort
• Should not be performed in isolation but with schema-related data transformations based on comprehensive metadata
• Mapping functions for data cleaning and other data transformations should be reusable for other data sources
Tools:
ETL (Extraction, Transformation, Loading) Tool:
• Typically support the ETL process for data warehouses in a comprehensive way
• Typically have little built-in data cleaning capabilities but allow the user to specify cleaning functionality via a proprietary API
• There is usually no data analysis support to automatically detect data errors and inconsistencies.
• Provide transformation libraries that cover many data transformation and cleaning needs
• Such as data type conversions (e.g., date reformatting)
• String functions (e.g., split, merge, replace, sub-string search)
• Arithmetic, scientific and statistical functions
• Typically cover if-then and case constructs that help handle exceptions in data values, such as misspellings, abbreviations, missing or cryptic values, and values outside of range
• These problems are also addressed by using a table lookup construct and join functionality
Examples of these tools include:
• Copy manager (Information Builders)
• Data stage (Informix/Ardent)
• Extract (ETI), PowerMart (Informatica)
• Decision base (CA/Platinum)
• Data transformation service (Microsoft)
• Metasuite 11 (Minerva/Carleton)
• Sagent solution platform (Sagent)
• Warehouse administrator (SAS).
Data integration and transformation:
This includes:
Schema integration
Redundancy
Schema integration:
Multiple data sources may provide data on the same entity types.
For example, metadata from two music servers.
Approach to data integration:
• Use mapping rules to handle structural differences
• Ambient contains a mapping-rule manager capable of:
– Finding appropriate sets of mapping rules
– Rewriting queries based on a set of mapping rules
– Handling different versions of mapping rules
Redundancy:
Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in
rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.
Data reduction:
• A data warehouse may store terabytes of data.
• Complex data analysis/mining may take a very long time to run on the complete data set.
• The goal is to obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data cube aggregation:
• The lowest level of a data cube
o The aggregated data for an individual entity of interest
o e.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data space
o Further reduce the size of data
• Reference appropriate levels
Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using the aggregated data when possible; a minimal aggregation sketch follows this list.
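A minimal sketch of such aggregation in Python (the call-record fields are invented for illustration): detailed call records are rolled up to a per-customer, per-month level, giving a much smaller representation from which many queries can still be answered.

from collections import defaultdict

calls = [   # lowest level of the cube: one row per phone call
    {"customer": "A", "month": "2010-01", "minutes": 12},
    {"customer": "A", "month": "2010-01", "minutes": 3},
    {"customer": "B", "month": "2010-02", "minutes": 7},
]

# Roll up to (customer, month): a pre-aggregated, reduced representation.
cube = defaultdict(int)
for call in calls:
    cube[(call["customer"], call["month"])] += call["minutes"]

print(dict(cube))   # {('A', '2010-01'): 15, ('B', '2010-02'): 7}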
Purpose:
The main purpose of a data warehouse is to support decision-making.
• Data is collected from a number of different sources.
• It is used to perform advanced analysis.
The main goal of data warehousing is to generate front-end analytics that will support business executives and operational managers.

WIMAX

WIMAX


IEEE 802.16 is working group number 16 of IEEE 802, specializing in point-to-multipoint broadband wireless access. It also is known as WiMAX, an acronym that stands for Worldwide Interoperability for Microwave Access.
WiMAX can be used for wireless networking like the popular Wi-Fi. WiMAX, a third-generation protocol, allows higher data rates over longer distances and efficient use of bandwidth, and keeps interference to a minimum.
There are at least four 802.16 standards: 802.16, 802.16a, 802.16-2004 (802.16), and 802.16e.
Based on the services provided by WiMAX, there are three usage models: fixed WiMAX, portable WiMAX, and mobile WiMAX. The services provided by WiMAX are line-of-sight (LOS) and non-line-of-sight (NLOS).
An important aspect of IEEE 802.16 is that it defines a MAC layer that supports multiple physical layer (PHY) specifications. It will not entirely replace incumbent broadband technologies; it will serve as a viable complement to them.


Introduction
WiMax (Worldwide Interoperability for Microwave Access) is a wireless broadband technology, which supports point to multi-point (PMP) broadband wireless access.
WiMAX is an Institute of Electrical and Electronics Engineers standard designated 802.16-2004 (fixed wireless applications) and 802.16e-2005 (mobile wire-less).
WiMAX has the potential to replace a number of existing telecommunications infrastructures.
The 802.16a extension provides for non-line-of-sight access in low frequency bands (2-11 GHz). These bands are sometimes unlicensed. The extension also boosts the maximum distance from 31 to 50 miles and supports PMP (point-to-multipoint) and mesh topologies.
WiMAX can be used for wireless networking like the popular Wi-Fi. WiMAX, a third-generation protocol, allows higher data rates over longer distances and efficient use of bandwidth, and keeps interference to a minimum.
Usage Models of WiMAX
The usage models are based on the services WiMAX provides from a base station to a subscriber station, also known as customer premises equipment (CPE):
• Fixed WIMAX
• Portable WiMAX
• Mobile WiMAX
Fixed WiMAX:
Some goals for WiMAX include a radius of service coverage of 6 miles from a WiMAX base station for point-to-multipoint, non-line-of-sight service. This service should deliver approximately 40 megabits per second (Mbps) for fixed and portable access applications.
Mobile WiMAX:
Mobile WiMAX or portable WiMAX takes the fixed wireless application a step further and enables cell phone-like applications on a much larger scale. For example,
mobile WiMAX enables streaming video to be broadcast from a speeding police or other emergency vehicle at over 70 MPH.
Services provided by WiMAX
• Line-of-sight service
o Line-of-sight between transmitter & receiver
o 11 GHz to 66 GHz frequency range
o At higher frequencies, there is less interference and much more bandwidth
• Non-line-of-sight
o Line-of-sight is not required between the small antenna on the CPE and the base station.
o 2 GHz to 11 GHz frequency range.
o Longer-wavelength transmissions are not as easily disrupted by physical obstructions – they are better able to diffract, or bend, around obstacles.
WiMAX Standards
There are four 802.16 standards:
• 802.16
• 802.16a/REVd/2004
• 802.16e

WiMAX Forum
The WiMAX Forum is a non-profit trade organization formed to promote the IEEE 802.16a wireless MAN standard and to certify 802.16a equipment as interoperable. Like "Wi-Fi," the WiMAX Forum seeks to make "WiMAX" a popular brand name for guaranteed-interoperable wireless networking. But while Wi-Fi focuses on wireless LANs, WiMAX will focus on using the 802.16a standard to provide wireless broadband access to businesses and consumers.


The Problems WiMAX Solves
• Cellular backhaul problem: transporting voice and data traffic from outlying cell sites to the operator's core network.
• How to keep wireless notebooks and other mobile devices connected between 802.11 hotspots.
• Hotspot Backhaul Problem: Linking Wi-Fi hot spots to the Internet.
• Last Mile Problem: DSL can only reach about 18,000 feet (3 miles) from the central office switch—many urban, suburban and rural locations may not be served.
Advantages
• A single WiMAX main station can serve hundreds of users.
• Endpoints install within days instead of the weeks required for wired connections.
• Data rates as high as 280Mbps and distances of 30 miles are possible.
• Users can operate mobile within 3-5 miles of a base station at data rates up to 75Mbps.
• No FCC radio licensing is required.
Disadvantages
• Line-of-sight (LOS) is required for long distance (5-30 mile) connections
• Heavy rains can disrupt the service.
• Other wireless electronics in the vicinity can interfere with the WiMAX connection and cause a reduction in data throughput or even a total disconnect.