Sunday, July 25, 2010

THE BASIC LANGUAGE











CHAPTER-1

THE BASIC LANGUAGE
 Object oriented programming.
 Encapsulation.
 Polymorphism.
 C++ preprocessor directives.
 Comments.
 C++ data types.
 New and delete expressions.
 Type conversions.

Object oriented programming: It means building programs around objects. So, first of all, we have to know what an object is and how it is implemented in C++ programming. These details are given below.
CLASS: A class is an expanded concept of a data structure: instead of holding only data, it can hold both data and functions.
OBJECT: An object is an instantiation of a class. In terms of variables, a class would be the type, and an object would be the variable.
By using these objects we can access the class members and member functions.

Classes are generally declared using the keyword class, with the following format:

class class_name {
access_specifier_1:
  member1;
access_specifier_2:
  member2;
  ...
} object_names;

EG:
class student {
  char name;
  int marks;
  float average;
} s;
In the above example, "s" is the object, so we can access the entire data of the student class through it. We discuss the access specifiers in detail in later sections.
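As a small illustration (a minimal sketch, assuming the members are made public so they can be reached from outside the class), the object s can be used like this:

#include <iostream>
using namespace std;

class student {
public:              // made public here so the members can be reached through the object
    char name;       // kept as a single char, as in the example above
    int marks;
    float average;
};

int main()
{
    student s;       // s is an object (an instantiation) of class student
    s.name = 'A';
    s.marks = 450;   // accessing data members through the object
    s.average = 75.0f;
    cout << s.marks << " " << s.average << endl;
    return 0;
}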
Encapsulation: Wrapping up of data into a single logical unit (i.e. a class) is called encapsulation. So writing a class is itself encapsulation.
Polymorphism: Simply put, it is one thing showing different actions. Consider one person (assume that person is the "one thing"): he exhibits different roles depending on the situation; his son calls him daddy, his father calls him son, and his wife calls him husband. How this concept is implemented in C++ is explained in the upcoming chapters (a small preview sketch follows the Inheritance paragraph below).

Inheritance: A key feature of C++ classes is inheritance. Inheritance allows creating classes which are derived from other classes, so that they automatically include some of their "parent's" members, plus their own. We will see the implementation of this concept in detail in the upcoming chapters.
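As a brief preview of both ideas (a minimal, illustrative sketch only; the class names are invented here, and the full treatment of inheritance and polymorphism comes in the upcoming chapters):

#include <iostream>
using namespace std;

class person {                        // the "parent" class
public:
    virtual void describe() { cout << "I am a person" << endl; }
};

class father : public person {        // father inherits the members of person
public:
    void describe() { cout << "My son calls me daddy" << endl; }   // same call, different action
};

int main()
{
    father f;
    person& p = f;                    // one person reference, but the father behaviour is chosen
    p.describe();                     // prints: My son calls me daddy
    return 0;
}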

Comments: comments are parts of the source code disregarded by the compiler. They simply do nothing. Their purpose is only to allow the programmer to insert notes or descriptions embedded within the source code.

C++ supports two ways to insert comments:
//line comment
/*block comment*/
The first of them, known as line comment, discards everything from where the pair of slash signs (//) is found up to the end of that same line. The second one, known as block comment, discards everything between the /* characters and the first appearance of the */ characters, with the possibility of including more than one line.

We are going to add comments to our second program (the full listing appears at the end of this chapter):

If you include comments within the source code of your programs without using the comment character combinations //, /* or */, the compiler will treat them as C++ expressions, most likely causing one or more error messages when you compile.
Preprocessor directives:
Preprocessor directives are lines included in the code of our programs that are not program statements but directives for the preprocessor. These lines are always preceded by a pound sign (#). The preprocessor executes these directives before the actual compilation of the code begins.

These preprocessor directives extend only across a single line of code. As soon as a newline character is found, the preprocessor directive is considered to end. No semicolon (;) is expected at the end of a preprocessor directive. The only way a preprocessor directive can extend through more than one line is by preceding the newline character at the end of the line with a backslash (\).
Macro definitions (#define, #undef)
To define preprocessor macros we can use #define. Its format is:
#define identifier replacement
When the preprocessor encounters this directive, it replaces any occurrence of identifier in the rest of the code with replacement. This replacement can be an expression, a statement, a block or simply anything. The preprocessor does not understand C++; it simply replaces any occurrence of identifier with replacement.
#define TABLE_SIZE 100
int table1[TABLE_SIZE];
int table2[TABLE_SIZE];
After the preprocessor has replaced TABLE_SIZE, the code becomes equivalent to:
int table1[100];
int table2[100];
This use of #define as a constant definer is already known to us from previous tutorials, but #define can also work with parameters to define function macros:
#define getmax(a,b) a>b?a:b
This would replace any occurrence of getmax followed by two arguments with the replacement expression, also replacing each argument by its identifier, exactly as you would expect if it were a function:
// function macro
#include <iostream>
using namespace std;

#define getmax(a,b) ((a)>(b)?(a):(b))

int main()
{
  int x=5,y;
  y=getmax(x,2);
  cout << y << endl;
  cout << getmax(7,x) << endl;
  return 0;
}
5
7

Defined macros are not affected by block structure. A macro lasts until it is undefined with the #undef preprocessor directive:
#define TABLE_SIZE 100
int table1[TABLE_SIZE];
#undef TABLE_SIZE
#define TABLE_SIZE 200
int table2[TABLE_SIZE];

This would generate the same code as:

int table1[100];
int table2[200];

Function macro definitions accept two special operators (# and ##) in the replacement sequence:
If the operator # is used before a parameter in the replacement sequence, that parameter is replaced by a string literal (as if it were enclosed between double quotes):
#define str(x) #x
cout << str(test);

This would be translated into:
cout << "test";

The operator ## concatenates two arguments, leaving no blank spaces between them:

#define glue(a,b) a##b
glue(c,out) << "test";
This would also be translated into:
cout << "test";
Because preprocessor replacements happen before any C++ syntax check, macro definitions can be a tricky feature; be careful: code that relies heavily on complicated macros may appear obscure to other programmers, since the syntax they expect is that of C++.
Conditional inclusions (#ifdef, #ifndef, #if, #endif, #else and #elif)
These directives allow us to include or discard part of the code of a program if a certain condition is met.
#ifdef allows a section of a program to be compiled only if the macro that is specified as the parameter has been defined, no matter what its value is. For example:
#ifdef TABLE_SIZE
int table [TABLE_SIZE];
#endif
In this case, the line of code int table[TABLE_SIZE]; is only compiled if TABLE_SIZE was previously defined with #define, independently of its value. If it was not defined, that line will not be included in the program compilation.
#ifndef serves for the exact opposite: the code between the #ifndef and #endif directives is only compiled if the specified identifier has not been previously defined. For example:
#ifndef TABLE_SIZE
#define TABLE_SIZE 100
#endif
int table[TABLE_SIZE];
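A very common use of #ifndef is the classic "include guard", which stops a header file from being processed twice in the same compilation (a minimal sketch; the header and macro names are only illustrative):

// student.h -- a hypothetical header file
#ifndef STUDENT_H
#define STUDENT_H

class student {
  char name;
  int marks;
  float average;
};

#endif  // STUDENT_H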


/* My second program in C++
   with more comments */
#include <iostream>
using namespace std;
int main ()
{
  cout << "Hello World! ";      // prints Hello World!
  cout << "I'm a C++ program";  // prints I'm a C++ program
  return 0;
}
Hello World! I'm a C++ program

Friday, May 28, 2010

DATA MINING AND DATA WAREHOUSING

Many software projects accumulate a great deal of data, so we really need information about the effective maintenance and retrieval of data from the database. The newest, hottest technologies to address these concerns are data mining and data warehousing.
Data Mining is the process of automated extraction of predictive information from large databases. It predicts future trends and finds behavior that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery, specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
Data warehousing takes a relatively simple idea and incorporates it into the technological underpinnings of a company. The idea is that a unified view of all data that a company collects will help improve operations. If hiring data can be combined with sales data, the idea is that it might be possible to discover and exploit patterns in the combined entity.
This paper will present an overview of the different processes and advanced techniques involved in data mining and data warehousing.




1. Introduction to Data Mining:
Data mining can be defined as "a decision support process in which we search for patterns of information in data." This search may be done just by the user, i.e. just by performing queries, in which case it is quite hard and in most of the cases not comprehensive enough to reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases - patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc.
1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity: discovery, predictive modeling and forensic analysis, as shown in the figure below.

Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are, without the user thinking of the relevant questions first.
In predictive modeling patterns discovered from the database are used to predict the future. Predictive modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the patterns to guess values for new data items.
Forensic analysis:
This is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first find what the norm is, and then we detect those items that deviate from the usual within a given threshold. Discovery helps us find "usual knowledge," but forensic analysis looks for unusual and specific cases.
1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of users - executives, end users and analysts.
Executives need top-level insights and spend far less time with computers than the other groups.
End users are sales people, market researchers, scientists, engineers, physicians, etc.
Analysts may be financial analysts, statisticians, consultants, or database designers.
These users usually perform three types of data mining activity within a corporate environment: episodic, strategic and continuous data mining.
In episodic mining we look at data from one specific episode such as a specific direct marketing campaign. We may try to understand this data set, or use it for prediction on new marketing campaigns. Analysts usually perform episodic mining.
In strategic mining we look at larger sets of corporate data with the intention of gaining an overall understanding of specific measures such as profitability.
In continuous mining we try to understand how the world has changed within a given time period and try to gain an understanding of the factors that influence change.

1.3 Data Mining Applications:
Virtually any process can be studied, understood, and improved using data mining. The top three end uses of data mining are, not surprisingly, in the marketing area.
Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogs, direct-mail marketers can reduce expenses and increase their sales. Targeting specific promotions to existing and potential customers offers similar benefits.
Market-basket analysis helps retailers understand which products are purchased together or by an individual over time. With data mining, retailers can determine which products to stock in which stores, and even how to place them within a store. Data mining can also help assess the effectiveness of promotions and coupons.
Another common use of data mining in many organizations is to help manage customer relationships. By determining characteristics of customers who are likely to leave for a competitor, a company can take action to retain that customer because doing so is usually far less expensive than acquiring a new customer.
Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. The aggregate total for fraud losses is enormous. But with data mining, these companies can identify potentially fraudulent transactions and contain the damage.
Financial companies use data mining to determine market and industry characteristics as well as predict individual company and stock performance. Another interesting niche application is in the medical field: Data mining can help predict the effectiveness of surgical procedures, diagnostic tests, medications, service management, and process control.
1.4 Data Mining Techniques:
Data Mining has three major components: Clustering or Classification, Association Rules, and Sequence Analysis.


1.4.1 Classification:
The clustering techniques analyze a set of data and generate a set of grouping rules that can be used to classify future data. The mining tool automatically identifies the clusters by studying the patterns in the training data. Once the clusters are generated, classification can be used to identify to which particular cluster an input belongs. For example, one may classify diseases and provide the symptoms which describe each class or subclass.
1.4.2 Association:
An association rule is a rule that implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.
1.4.3 Sequential Analysis:
In sequential analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction in the case of association), e.g. if a shopper buys item A in the first week of the month, and then buys item B in the second week, and so on.
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you choose. Consequently, you'll need a variety of tools and technologies to find the best possible model. Classification models are among the most common, so the more popular ways of building them are explained here. Classifications typically involve at least one of two workhorse statistical techniques - logistic regression (a generalization of linear regression) and discriminant analysis. However, as data mining becomes more common, neural nets and decision trees are also getting more consideration. Although complex in their own way, these methods require less statistical sophistication on the part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model that takes and combines a set of inputs to predict a continuous or categorical variable.

Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"

The value from each hidden node is a function of the weighted sum of the values from all the preceding nodes that feed into it. The process of building a model involves finding the connection weights that produce the most accurate results by "training" the neural net with data. The most common training method is back propagation, in which the output result is compared with known correct values. After each comparison, the weights are adjusted and a new result computed. After enough passes through the training data, the neural net typically becomes a very good predictor.
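In the usual notation (a general formula, not taken from the source cited above), the value of a hidden node is an activation function applied to the weighted sum of its inputs:

h_j = f\left(\sum_i w_{ij}\, x_i\right)

where the x_i are the values from the preceding nodes, the w_{ij} are the connection weights adjusted during training, and f is a non-linear activation function.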
Decision trees represent a series of rules to lead to a class or value. For example, you may wish to classify loan applicants as good or bad credit risks. Figure below shows a simple decision tree that solves this problem. Armed with this tree and a loan application, a loan officer could determine whether an applicant is a good or bad credit risk. An individual with "Income > $40,000" and "High Debt" would be classified as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years" would be classified as a "Good Risk."
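As a sketch of how the rules read off such a tree could be expressed in code (illustrative only; the thresholds come from the example above, and the branches the text does not spell out are filled in as labeled assumptions):

#include <iostream>
#include <string>
using namespace std;

// Classify a loan applicant with the simple decision tree described above.
// The two branches not mentioned in the text (Income > $40,000 with low debt,
// and Income < $40,000 with Job <= 5 years) are assumptions made for this sketch.
string classify(double income, bool highDebt, int jobYears)
{
    if (income > 40000.0)
        return highDebt ? "Bad Risk" : "Good Risk";
    else
        return (jobYears > 5) ? "Good Risk" : "Bad Risk";
}

int main()
{
    cout << classify(50000.0, true, 2) << endl;   // Income > $40,000 and High Debt   -> Bad Risk
    cout << classify(30000.0, false, 7) << endl;  // Income < $40,000 and Job > 5 yrs -> Good Risk
    return 0;
}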


Decision trees have become very popular because they are reasonably accurate and, unlike neural nets, easy to understand. Decision trees also take less time to build than neural nets. Neural nets and decision trees can also be used to perform regressions, and some types of neural nets can even perform clustering.
2.1 Introduction to Data warehousing:
In the current knowledge economy, it is now an indisputable fact that information is the key to organizations for gaining competitive advantage. Organizations very well know that the vital information for decision making is lying in their databases. Mountains of data are getting accumulated in various databases scattered around the enterprise. But the key to gaining competitive advantage lies in deriving insight and intelligence out of these data. Data warehousing helps in integrating, categorizing, codifying and arranging the data from all parts of an enterprise.
According to Bill Inmon, known as the father of data warehousing, the concept of a data warehouse is characterized by the four properties described below.




2.1.1 Subject oriented data:
All relevant data about a subject is gathered and stored as a single set in a useful format.
2.1.2 Integrated data:
Data is stored in a globally accepted fashion with consistent naming conventions, measurements, encoding structures and physical attributes, even when the underlying operational systems store the data differently.


2.1.3 Non-volatile data:
The data warehouse is read-only; data is loaded into the data warehouse and accessed there.
2.1.4 Time-variant data:
This long-term data spans 5 to 10 years, as opposed to the 30-60 days of operational data.
2.2 Structure of data warehouse:

The design of the data architecture is probably the most critical part of a data warehousing project. The key is to plan for growth and change, as opposed to trying to design the perfect system from the start. The design of the data architecture involves understanding all of the data and how different pieces are related. For example, payroll data might be related to sales data by the ID of the sales person, while the sales data might be related to customers by the customer ID. By connecting these two relationships, payroll data could be related to customers (e.g., which employees have ties to which customers).
Once the data architecture has been designed, you can then consider the kinds of reports that you are interested in. You might want to see a breakdown of employees by region, or a ranked list of customers by revenue. These kinds of reports are fairly simple. The power of a data warehouse becomes more obvious when you want to look at links between data associated with disparate parts of an organization (e.g., HR, accounts payable, and project management).
2.3 Benefits of Data warehousing:
Cost avoidance benefits.
Higher productivity.
Benefits through better analytical capability.
Manage business complexity.
Leverage on their existing investments.
End user spending.
Spending on e-business.
Accessibility and ease of use.
Real time information and analysis.
2.4 Techniques used by different organizations for an efficient data warehouse:
That being said, most decisions to build data warehouses are driven by non-HR needs. Over the past decade, back office (supply chain) and front office (sales and marketing) organizations have spearheaded the creation of large corporate data warehouses. Improving the efficiency of the supply chain and competition for customers rely on the tactical uses that a data warehouse can provide. The key for other organizations, including HR, is to be involved in the creation of the warehouse so that their needs can be met by any resulting system. This usually happens because both the data volume and question complexity have grown beyond what the current systems can handle. At that point the business becomes limited by the information that users can reasonably extract from the data system.
3. Conclusion:
Data mining offers great promise in helping organizations uncover hidden patterns in their data. However, data mining tools must be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect and prepare the data properly and to check models against the real world. The "best" model is often found after building models of several different types and by trying out various technologies or algorithms.
The data mining area is still relatively young, and tools that support the whole of the data mining process in an easy-to-use fashion are rare. However, one of the most important issues facing researchers is the use of these techniques against very large data sets. All the mining techniques are based on Artificial Intelligence, where they are generally executed against small sets of data, which can fit in memory. However, in data mining applications these techniques must be applied to data held in very large databases. Approaches to this problem include the use of parallelism and the development of new database-oriented techniques. However, much work is required before data mining can be successfully applied to large data sets. Only then will the true potential of data mining be realized.
Data warehousing is the hottest concept for many software professionals for managing sophisticated data efficiently. The data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified scheme, at a single site. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses provide the user a single consolidated interface to data, making decision support actions easier to implement. In the world of highly interconnected networks, the data obtained or used by many companies would be very large, and its maintenance becomes difficult and costly. So, efficient data warehousing is to be implemented to obtain data from different branches (all over the world) and maintain it for providing information to all other branches (which do not have the concerned data).

Genetic Algorithms

Abstract
In this paper we explore the use of an adaptive search technique (genetic algorithms) to construct a system GABIL which continually learns and refines concept classification rules from its interaction with the environment. The performance of the system is measured on a set of concept learning problems and compared with the performance of two existing systems: ID5R and C4.5. Preliminary results support that, despite minimal system bias, GABIL is an effective concept learner and is quite competitive with ID5R and C4.5 as the target concept increases in complexity.
1 Introduction
An important requirement for both natural and artificial organisms is the ability to acquire concept classification rules from interactions with their environment. In this paper we explore the use of an adaptive search technique, namely Genetic Algorithms (GAs), as the central mechanism for building a system which continually learns and
refines concept classification rules from its interaction with the environment. We show how concept learning tasks can be represented and solved by GAs, and we provide empirical results, which illustrate the performance of GAs relative to more traditional methods. Finally, we discuss the advantages and disadvantages of this approach and describe future research activities.
2 Concept Learning Problems
Supervised concept learning involves inducing concept descriptions for the concepts to be learned from a set of positive and negative examples of the target concepts. Examples are represented as points in an n-dimensional feature space, which is defined a priori and for which all the legal values of the features are known. Concepts are therefore represented as subsets of points in the given dimensional space.
A concept learning program is presented with both a description of the feature space and a set of correctly classified examples of the concepts, and is expected to generate a reasonably accurate description of the (unknown) concepts. Since concepts can be arbitrarily complex subsets of a feature space, an important issue is the choice of the concept description language. The language must have sufficient expressive power to describe large subsets succinctly and yet be able to capture irregularities. The two language forms generally used are decision trees [Quinlan, 1986] and rules [Michalski, 1983]. Another important issue arises from the situation that there is a large (possibly infinite) set of concept descriptions, which are consistent with any particular finite set of examples. This is generally resolved by introducing either explicitly or implicitly a bias (preference) for certain kinds of descriptions (e.g., shorter or less complex descriptions may be preferred). Finally, there is the difficult issue of evaluating and comparing the performance of concept learning algorithms. The most widely used approach is a batch mode in which the set of examples is divided into a training set and a test set. The concept learner is required to produce a concept description from the training examples. The validity of the description produced is then measured by the percentage of correct classifications made by the system on the second (test) set of examples during which no
further learning takes place. The alternative evaluation approach is an incremental mode in which the concept learner is required to produce a concept description from the examples seen so far and to use that description to classify the next incoming example. In this mode learning never stops, and evaluation is in terms of learning curves, which measure the predictive performance of the concept learner over time. This incremental and continuous model of concept learning matches more closely the kind of concept learning that an organism performs as it explores a complex and changing world. Consequently, we use predictive learning curves as our evaluation methodology.
3 Genetic Algorithms and Concept Learning
In order to apply GAs to a particular problem, we need to select an internal representation of the space to be searched and define an external evaluation function, which assigns utility to candidate solutions. Both components are critical to the successful application of the GAs to the problem of interest. 1
3.1 Representing the Search Space
The traditional internal representation used by GAs involves using fixed-length (generally binary) strings to represent points in the space to be searched. However, such representations do not appear well suited for representing the space of concept descriptions, which are generally symbolic in nature, which have both syntactic and semantic constraints, and which can be of widely varying length and complexity. There are two general approaches one might take to resolve this issue. The first involves changing the fundamental GA operators (crossover and mutation) to work effectively with complex non-string objects [Rendell, 1985]. This must be done carefully in order to preserve the properties which make GAs effective adaptive search procedures. Alternatively, one can attempt to construct a string representation which minimizes any changes to the GAs. We are interested in pursuing both approaches. Our ideas on the first approach will be discussed briefly at the end of the paper. In the following sections we will describe our results using the second approach, in which we try to apply classical GAs with minimal changes.
3.2 Defining Fixed-length Classifier Rules
Our approach to choosing a representation, which results in minimal changes to the standard GA operators, involves carefully selecting the concept description language. A natural way to express complex concepts is as a disjunctive set of (possibly overlapping) classification rules. The left-hand side of each rule (disjunct) consists of a conjunction of one or more tests involving feature values. The right-hand side of a rule indicates the concept (classification) to be assigned to the examples which match its left-hand side. Collectively, a set of such rules can be thought of as representing the (unknown) concepts if the rules correctly classify the elements of the feature space. If we allow arbitrarily complex terms in the conjunctive left-hand side of such rules, we will have a very powerful description language, which will be difficult to represent as strings. However, by restricting the complexity of the elements of the conjunctions, we are able to use a string representation and standard GAs, with the only negative side effect that more rules may be required to express the concept. This is achieved by restricting each
element of a conjunction to be a test of the form: return true if the value of feature i of the example is in the given value set; return false otherwise.
For example, a rule might take the following symbolic form: "if (F2 = large) and (F5 = tall or thin) then it’s a widget". Since the left-hand sides are conjunctive forms
with internal disjunction, there is no loss of generality by requiring that there be at most one test for each feature (on the left-hand side of a rule). With these restrictions we can now construct a fixed-length internal representation for classifier rules. Each fixed-length rule will have N feature tests, one for each feature. A fixed-length binary string, the length of which will depend on the type of feature (nominal, ordered, etc.), will represent each feature test. For simplicity, the examples used in this paper will involve features with nominal values. In this case we use k bits for the k values of a nominal feature. So, for example, if the legal values for F1 were the days of the week, then the pattern 0111110 would represent the test for F1 being a weekday. As an example, the left-hand side of a rule for a 5-feature problem would be represented internally as:
F1 F2 F3 F4 F5
0110010 1111 01 111100 11111
Notice that a feature test involving all 1’s matches any value of a feature and is equivalent to "dropping" that conjunctive term (i.e., the feature is irrelevant). So, in the
above example only the values of F1, F3, and F4 are relevant. For completeness, we allow patterns of all 0’s which match nothing. This means that any rule containing
such a pattern will not match (cover) any points in the feature space. While rules of this form are of no use in the final concept description, they are quite useful as storage areas for GAs when evolving and testing sets of rules. The right-hand side of a rule is simply the class (concept) to which the example belongs. This means that our "classifier system" is a "stimulus-response" system with no message passing.
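A small sketch of this encoding (illustrative only; the bit patterns are the ones from the 5-feature example above, and the struct and function names are invented for this sketch):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// One fixed-length rule: one bit pattern per feature plus the predicted class.
// A feature test matches a value if the bit at that value's position is 1;
// a pattern of all 1s matches anything (the feature is irrelevant).
struct Rule {
    vector<string> tests;   // one bit pattern per feature
    int classLabel;         // concept assigned when the left-hand side matches
};

bool matches(const Rule& r, const vector<int>& example)
{
    for (size_t i = 0; i < r.tests.size(); ++i)
        if (r.tests[i][example[i]] == '0')   // the example's value is not in the value set
            return false;
    return true;
}

int main()
{
    // Left-hand side of the 5-feature rule shown above.
    Rule r{{"0110010", "1111", "01", "111100", "11111"}, 1};
    vector<int> example = {1, 0, 1, 2, 4};   // one value index per feature
    cout << (matches(r, example) ? "rule matches" : "no match") << endl;
    return 0;
}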
3.3 Evolving Sets of Classifier Rules
Since a concept description will consist of one or more classifier rules, we still need to specify how GAs will be used to evolve sets of rules. There are currently two basic strategies: the Michigan approach exemplified by Holland's classifier system [Holland, 1986], and the Pittsburgh approach exemplified by Smith's LS-1 system [Smith, 1983]. Systems using the Michigan approach maintain a population of individual rules, which compete with each other for space and priority in the population. In contrast, systems using the Pittsburgh approach maintain a population of variable-length rule sets, which compete with each other with respect to performance on the domain task. Very little is currently known concerning the relative merits of the two approaches. In this paper we report on results obtained from using the Pittsburgh approach. That is, each individual in the population is a variable-length string representing an unordered set of fixed-length rules (disjuncts). The number of rules in a particular individual is unrestricted and can range from 1 to a very large number depending on evolutionary pressures. Our goal was to achieve a representation that required minimal changes to the fundamental genetic operators. We feel we have achieved this with our variable-length string representation involving fixed-length rules. Crossover can occur anywhere (i.e., both on rule boundaries and within rules). The only requirement is that the corresponding crossover points on the two parents "match up semantically". That is, if one parent is being cut on a rule boundary, then the other parent must also be cut on a rule boundary. Similarly, if one parent is being cut at a point 5 bits to the right of a rule boundary, then the other parent must be cut in a similar spot (i.e., 5 bits to the right of some rule boundary). The mutation operator is unaffected and performs the usual bit-level mutations.
3.4 Choosing a Payoff Function
In addition to selecting a good representation, it is important to define a good payoff function, which rewards the right kinds of individuals. One of the nice features of using GAs for concept learning is that the payoff function is the natural place to centralize and make explicit any biases (preferences) for certain kinds of concept descriptions. It also makes it easy to study the effects of different biases by simply making changes to the payoff function. For the experiments reported in this paper, we wanted to minimize any a priori bias we might have. So we selected a payoff function involving only classification performance (ignoring, for example, length and complexity biases). The payoff (fitness) of each individual rule set is computed by testing the rule set on the current set of examples and letting:
payoff(individual i) = (percent correct)^2
This provides a non-linear bias toward correctly classifying all the examples while providing differential reward for imperfect rule sets.
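In code form, this payoff is simply the following (a minimal sketch; the argument names are only illustrative):

// Fitness of a rule set: the squared percentage of the training examples it
// classifies correctly -- a non-linear bias toward fully correct rule sets.
double payoff(int numCorrect, int numExamples)
{
    double percentCorrect = 100.0 * numCorrect / numExamples;
    return percentCorrect * percentCorrect;
}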
3.5 The GA Concept Learner
Given the representation and payoff function described above, a standard GA can be used to evolve concept descriptions in several ways. The simplest approach involves using a batch mode in which a fixed set of examples is presented, and the GA must search the space of variable-length strings described above for a set of rules, which achieves a score of 100%. We will call this approach GABL (GA Batch concept Learner). The simplest way to produce an incremental GA concept learner is to use GABL incrementally in the following way. The concept learner initially accepts a single
example from a pool of examples. GABL is used to create a 100% correct rule set for this example. This rule set is used to predict the classification of the next example.
If the prediction is incorrect, GABL is invoked to evolve a new rule set using the two examples. If the prediction is correct, the example is simply stored with the previous example and the rule set remains unchanged. As each new additional instance is accepted, a prediction is made, and the GA is re-run in batch if the prediction is incorrect. We refer to this mode of operation as batch incremental and we refer to the GA batch-incremental concept learner as GABIL.
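A sketch of this batch-incremental control loop (pseudocode-style C++; GABL, predict, and the data types below are stand-ins for the real system's components, which are not described here in enough detail to implement):

#include <vector>
using namespace std;

// Placeholders for GABIL's components; the real implementations are not shown.
struct Example { vector<int> features; int label; };
struct RuleSet { /* a variable-length set of fixed-length rules */ };

RuleSet GABL(const vector<Example>& pool) { return RuleSet{}; }    // batch GA: evolve a 100% correct rule set
int predict(const RuleSet& rules, const Example& e) { return 0; }  // classify one example with the rule set

RuleSet gabil(const vector<Example>& stream)
{
    vector<Example> seen;
    seen.push_back(stream.front());
    RuleSet rules = GABL(seen);                  // learn from the first example
    for (size_t i = 1; i < stream.size(); ++i) {
        const Example& e = stream[i];
        int guess = predict(rules, e);           // predict before the true class is revealed
        seen.push_back(e);                       // the example is stored either way
        if (guess != e.label)
            rules = GABL(seen);                  // re-run the batch GA only after a wrong prediction
    }
    return rules;
}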
4 Initial Experiments
The experiments described in this section are designed to demonstrate the predictive performance of GABIL as a function of incremental increases in the size and complexity of the target concept. We invented a 4-feature world in which each feature has 4 possible distinct values (i.e., there are 256 instances in this world). This means that rules map into 16-bit strings and the length of individual rule sets is a multiple of 16. In addition to studying the behavior of GABIL as a function of increasing complexity, we were also interested in comparing its performance with an existing algorithm. ID5R [], which is a well-known incremental concept-learning algorithm, was chosen for comparison. ID5R uses decision trees as the description language and always produces a decision tree consistent with the instances seen.
We constructed a set of 12 concept learning problems, each consisting of a single target concept of increasing complexity. We varied the complexity by increasing both the number of rules (disjuncts) and the number of relevant features per rule (conjuncts) required to correctly describe the concepts. The number of disjuncts ranged from 1 to 4, while the number of conjuncts ranged from 1 to 3. Each target concept is labeled as nDmC, where n is the number of disjuncts and m is the number of conjuncts.
Each target concept is associated with one experiment. Within an experiment the number of disjuncts and conjuncts for the target concept remains fixed. The variation in target concept occurs between experiments. For each of the concepts, a set of 256 unique, noise-free examples was generated from the feature space and labeled as positive or negative examples of the target concept. For the more complex concepts, this resulted in learning primarily from negative examples. For each concept, the 256 examples were randomly shuffled and then presented sequentially as described above. This procedure was repeated 10 times for each concept and for each learning algorithm. The performance curves presented are the average behavior exhibited over 10 runs. In particular, they will remain at 100% indefinitely only when the algorithms have correctly learned the target concept. The graphs indicate that, on the simpler concepts, the predictive performance of ID5R improves more rapidly than that of GABIL. However, ID5R degrades in performance as the target concept becomes more complex, with significant deterioration in predictive power.
It is not always possible for ID5R to make a prediction based on the decision tree. If it cannot use the tree to predict, we let ID5R make a random prediction. The performance of GABIL, on the other hand, is relatively insensitive to the increase in concept complexity, resulting in significantly better predictive capability than ID5R already on 4 distinct concepts. The analysis below suggests that this trend will continue with even larger numbers of disjuncts and conjuncts. We were surprised to see ID5R suffer the most on the 4D1C target concept, since syntactically the concept is only moderately complex. The target concept is of the form:
If (F1 = 0001) or (F2 = 0001) or (F3 = 0001) or
(F4 = 0001) then it’s positive
Although it is natural to expect that a simple target concept (from a syntactic viewpoint) would have a small decision tree representation, this is only a rough generalization.
This target concept is represented by ID5R as a decision tree of over 150 nodes. In fact, a unique leaf node in the decision tree represents each negative example. For this reason, ID5R cannot generalize over the negative examples, and has a good chance of predicting any negative example incorrectly. Furthermore, even the positive examples are not generalized well, resulting in prediction errors for positive examples. It is clear that the decision tree representation is poor for representing this particular concept. Target concept 4D1C represents a worst case, which explains why the difference between GABIL and ID5R is greatest for this concept. A similar situation occurs for target concepts 3D1C, 4D2C, and 4D3C, although to a lesser degree. ID5R relies upon Quinlan’s information theoretic entropy measure to build its decision trees.
5 Further Analysis and Comparisons
Having characterized the behavior of GABIL in this controlled concept world, we have begun to extend the analysis to more complex and challenging problems. One of our first steps was to look at the family of multiplexor problems introduced to the machine learning community by Wilson [Wilson, 1987]. Multiplexor problems fall into the general area of trying to induce a description of an input boolean function from input/output examples. Because no single individual input line is useful in distinguishing
class membership, information-theoretic approaches like Quinlan's ID3 system have a particularly hard time inducing decision trees for multiplexor problems. Wilson's work indicated that his GA-based classifier system Boole did not have such difficulties. Quinlan addressed some of these issues in the development of his C4 system. Quinlan subsequently reported that C4 outperforms Boole on the multiplexor problems. Since we had access to C4.5 (a successor to C4 [Quinlan, 1989]), we felt that a direct comparison of GABIL and C4.5 on multiplexor problems would be enlightening. Since C4.5 is a batch-mode system, we have to run it in a batch-incremental mode in the same manner as GABIL in order to provide meaningful comparisons. This can be achieved by running C4.5 in batch mode for every new instance seen, and using the resulting decision tree to predict the class of the next instance. The 6-input multiplexor problem has 6 features, in which each feature has 2 possible distinct values (i.e., there are 64 instances in this world). This means that rules map into 12-bit strings and the length of individual rule sets is a multiple of 12. For this concept we randomly generated a set of 200 examples from the feature space, each example labeled positive or negative. Since there are only 64 possible unique examples, the set does not contain unique examples, although they are noise free. This methodology allows for direct comparison with Quinlan's reported results.
The set of 200 examples was randomly shuffled and then presented sequentially. This procedure was repeated 10 times for both learning algorithms. The performance curves presented in Figure 3 are the average behavior exhibited over 10 runs. GABIL clearly outperforms C4.5 on the 6-input multiplexor problem. As noted above, the weaker performance of C4.5 is not due to the choice of representation (decision tree). In fact, a compact decision tree can be created to describe the concept. The problem lies with the information theoretic bias itself, which makes it hard to find this compact tree. Preliminary results suggest similar performance differentials on larger multiplexor problems.
The concept description language and the search algorithm constitute strong biases for any concept learner. The above experiments indicate that ID-like systems can
suffer both from their decision tree language bias and from their information theoretic search bias. When the biases are appropriate, ID-like systems perform quite well. GABIL, however, due to its minimal system bias, performs uniformly well on target concepts of varying complexity. These initial results support the view that GABIL can be used as an effective concept learner although it may not outperform more strongly biased concept learning algorithms whose bias is appropriate for learning simpler target concepts.
6 Conclusions and Future Research
This paper presents a series of initial results regarding the use of GAs as the key element in the design of a system capable of continuously acquiring and refining concept classification rules from interactions with its environment. It is interesting to note that reasonable performance is achieved with minimal a priori bias. The initial results support the view that GAs can be used as an effective concept learner although they may not outperform algorithms specifically designed for concept learning when simple concepts are involved. This paper also sets the stage for the design of three additional GA-based concept learners. First, we wish to implement a variation of the current system that is truly incremental. Second, we are also very interested in understanding the difference between using the Pittsburgh approach and the Michigan approach in this problem domain. The current fixed-length rule representation can be used directly in Michigan-style classifier systems.
Third, we noted early in the paper that there were two basic strategies for selecting a representation for the concept description language. In this paper we developed a representation, which minimized the changes to standard GA implementations. We also plan to explore the alternative strategy of modifying the basic GA operators to deal effectively with non-string representations. We feel that the development and analysis of such systems is an important direction the research community should follow in order to develop additional results on these and other problems of interest.
Acknowledgements
We would like to thank Diana Gordon for her support and for many discussions on the biases in supervised concept learning systems. Diana was also instrumental in helping
us design our experimental methodology. We would also like to thank John Grefenstette and Alan Schultz for many useful comments about GABIL and crossover, J. R. Quinlan
for C4.5, and Paul Utgoff for ID5R.