1 Diversity Measures

The standard representation in genetic programming (syntax trees) lends itself to more fine grain structural measures that consider nodes, subtrees, and other graph theoretic properties (rather than just entire trees). Keijzer (1996) measured subtree variety as the ratio of unique subtrees over total subtrees, and program variety as the ratio of the number of unique individuals over the size of the population. Keijzer also used a distance measure between two individuals as the number of distinct subtrees the individuals share. The author noted that his distance measure of distinct subtrees between two individuals could be used to predict when subtree crossover will fail to provide improvements due to loss of distinct subtrees and diversity. Tackett (1994) also measured structural diversity using subtrees and schemata frequencies.

Problem specific measures can allow additional insight into population
diversity, especially on novel and non-traditional problems.
D'haeseleer and Bluming (1994)
defined *behaviour* and *frequency* signatures for each
individual based on fitness and *gene* (i.e. specific nodes) frequencies,
respectively.
The average correlation between the respective signatures
in the population represents the *phenotype* and *genotype*
diversity
of the population. In addition, D'haeseleer and Bluming tagged genetic
code and evaluated the behaviour of individuals with a ``stimulus-response
map'' to gain further knowledge into the structure and behaviour
of populations in their robot tank problem. Using these measures,
the authors witnessed emerging demes (distinct subpopulations) with neighborhood
selection and mating.

The degree of graph isomorphism could be applied to genetic programming trees as a measure of diversity. However, the properties of functions used in genetic programming (associativity, commutativity, etc) would require special, and possibly complex, implementations of isomorphism [Rosca, 1995a]. Also, determining graph isomorphism would be computationally expensive for an entire population. However, a measure of possible isomorphic trees could be found by noting simple properties (terminal, functions, depth, etc.) to determine the individuals which could be isomorphic without actually computing isomorphism.

McPhee and Hopper (1999) investigated diversity at the genetic level by assigning numerical tags to each node in the population. The tags track the survival of nodes from the initial population and the change of context for nodes during recombination. The authors also tracked the genetic lineages from the initial population by noting the individuals selected for recombination, which child they produced and which of the parents provided the root portion of the child's tree. They found that populations in the final generation often descended from one single initial individual, since genetic lineages were often reduced to one surviving lineage early on in the evolutionary process.

Measuring the difference between two individuals based on string edit distances has been used several times in genetic programming. O'Reilly (1997) used an edit distance based on string matching, which uses single node insertions, deletions and substitutions to transform two trees to be equal in structure and content. De Jong et al. (2001) also used a similar edit distance in a multiobjective method. This edit distance, a modification of string edit distance, or Levenshtein distance, overlaps two trees at the root node. Two different nodes, when overlapping, score a distance of 1. Equal nodes score a distance of 0. The edit distance is then the sum of all different nodes, which can be normalised by dividing it by the size of the smaller tree. The measure represents the number of node changes that need to be made to either tree to make them equal in structure and content. Note that this measure does not take into consideration the standard subtree crossover operator, but does resemble a single point mutation operator.

O'Reilly (1997) noted the importance of using structural distance measures on genetic programming populations to understand the underlying dynamics. An edit distance measure was used here to study the effects of crossover and the differences between individuals and better individuals. While no clear results were found, the ability to understand genetic programming populations with edit distance measures was suggested.

Ekárt and Németh (2000)
defined an edit distance specific to genetic programming syntax trees,
adapted from [Nienhuys-Cheng, 1997], which considered the cost
of substituting between different node types (functions vs. terminals
and within these classes).
The difference between two trees is
defined as:

where are trees with roots and possible children ( total) subtrees and . The overall distance between two trees, , is found by recursively finding the distance between their nodes, . The constant is set to but can be adjusted to weight the depth of tree differences differently. Two trees are brought to the same tree structure by adding ``null'' nodes to each tree. Note that the differences near the root have more weight. This is a convenient description for genetic programming as it has been noted that programs converge quickly to a fixed root portion [Igel and Chellapilla, 1999,McPhee and Hopper, 1999]. As before, this measure does not account for the subtree crossover operator, but it does represent its dynamics more closely by giving a higher cost to the upper portions of trees which are more difficult to change when using subtree crossover. Additionally, the cost of changing one node to another can be specified for each pair of nodes or for classes of nodes, as was the case in [Ekárt and Németh, 2000].

Measures based on behaviour compare differences among the populations' fitness values at a given time. Rosca (1995a) used the fitness values in a population to define an entropy and free energy measure. Entropy represents the amount of disorder of the population, where an increase in entropy represents an increase in diversity. Rosca found that populations appeared to be stuck in local optima when entropy did not change or decreased monotonically in successive generations.

Entropy is calculated for a population by first placing fitness values
into classes and noting the size of each class [Rosca, 1995a].
In a discrete fitness
space, a unique fitness value may define a fitness class. In
continuous spaces, discretisation may be necessary.
The level of convergence in the genotype space often significantly
reduces the number of unique fitness values, even in continuous
spaces. Given a fitness value , unique
fitness values in the current population, and fitness
classes
, the fitness
class is the proportion
of the population which has fitness equal to .
Entropy is then defined as,

In physical systems, entropy represents the amount of chaos in the system. In genetic programming, high entropy reveals the presence of many unique fitness values, where the population is evenly distributed over those values. Low entropy describes a population which contains fewer unique fitness values as many individuals have the same fitness, as shown in Figure 4.1. Entropy gives insights into the distribution of fitness values over the population, but does not describe the precise distribution of the fitness values.

The frequency of the terminals and functions in a population could also serve as a measure of population diversity. Similarly, diversity could also refer to the number and type of tree structures that the population represents. A visualisation of a genetic programming tree was described for a circular lattice in [Daida, 2002] and later in [Daida et al., 2003a], where a population visualisation using colors represented the most sampled type of shapes in the lattice. Using the lattice model, a population visualisation method can be developed using a 3-dimensional graph [Gustafson and Krasnogor, 2003], which gives rise to measures that describe the tree shapes sampled by the population. Figure 4.2 shows a full binary tree of depth 10 on the circular lattice in the upper left graph. A 3-dimensional visualisation can be made by finding the cumulative node count of the population for the absolute node positions represented in the lattice. This cumulative count then can denote the height of each node, as shown in the bottom two figures of Figure 4.2, where the upper right graph shows the same visualisation from above.