Genetic programming is a simple and powerful technique which has been applied to a wide range of problems in combinatorial optimisation, automatic programming and model induction (see [Banzhaf et al., 1998] for a general introduction to genetic programming). The direct encoding of solutions as variable-length computer programs allows genetic programming to provide solutions that can be evaluated and also examined to understand their internal workings. In this way, a genetic programming solution can represent a single value after evaluation (as in optimisation), an iteratively built algorithm (as in automatic programming) or a data-driven model useful in data mining and knowledge discovery (as in model induction). While genetic programming is applied successfully in all three cases, each of which merits its own scientific discipline, the focus of this thesis is on the inner workings of the algorithm itself.
The issues of problem solving that motivate the use of genetic programming also represent some of the limits to its applicability. The increase in computer power, the wide range of applications and the ability to collect enormous amounts of scientific data all require a computational framework, such as genetic programming, capable of automatically generating solutions with a representation that is both powerful and understandable. However, the genetic programming algorithm has a high computational cost to run, has difficulty scaling to larger and harder problem instances, and uses a complex representation that can also limit its use.
At the heart of these issues is the use of a population of alternative solutions and the methods used to generate new alternatives. A candidate, or alternative, solution is evaluated by means of a fitness function that allows it to be compared with previously evaluated solutions. A stochastic selection method chooses better solutions from the population that then undergo stochastic variations to produce new alternatives. In this way, the population is responsible for guiding a parallel search through the solution space. However, the evaluation of each population member becomes increasingly computationally expensive and designing an effective process of variation on the direct and complex encoding of solutions is not intuitive. Also, the amount of data produced by a population-based search over an enormous space of possible programs makes the prediction and analysis of algorithm behaviour difficult.