next up previous contents
Next: 1 Promoting Diversity Up: 4 Analysis of Diversity Previous: 4 Analysis of Diversity   Contents

1 Diversity Measures

Biological diversity refers to the differences between individuals in a population, which in nature imply structure and behaviour difference. Koza (1992) used the term variety to indicate the number of different genotypes in a population. In a standard genetic programming population, this would be the number of structurally unique individuals. While this measure is probably the least informative it is the most common due to its ease of use and understanding. Langdon (1998a) argued that genotypic diversity is a sufficient upper bound of population diversity. Due to the nature of most genetic programming systems and problems, two identical genotypes will produce the same fitness. Thus, a decrease in genotype diversity (unique individuals) will necessarily cause a decrease in phenotype diversity (unique fitness values). In the treatment of the stack problem, Langdon (1998a) investigated the effects of the crossover operator on variety. The author noted that genetic programming loses some ability to improve fitness after 20-30 generations, which is most probably due to selection and crossover causing a loss of variety. Langdon also noted that in the stack problem, in runs with better fitness, crossover appeared to produce a larger number of fitter (and non-duplicate) children than their parents. Also, the author found that the loss of diversity caused a decrease of unique terminals that, due to subtree crossover, led to further diversity loss. Langdon and Poli (2002) later noted that measuring variety with only unique genotypes fails to consider the ancestral history of individuals, the degree of difference between non-unique individuals and their behavioural similarities.

The standard representation in genetic programming (syntax trees) lends itself to more fine grain structural measures that consider nodes, subtrees, and other graph theoretic properties (rather than just entire trees). Keijzer (1996) measured subtree variety as the ratio of unique subtrees over total subtrees, and program variety as the ratio of the number of unique individuals over the size of the population. Keijzer also used a distance measure between two individuals as the number of distinct subtrees the individuals share. The author noted that his distance measure of distinct subtrees between two individuals could be used to predict when subtree crossover will fail to provide improvements due to loss of distinct subtrees and diversity. Tackett (1994) also measured structural diversity using subtrees and schemata frequencies.

Problem specific measures can allow additional insight into population diversity, especially on novel and non-traditional problems. D'haeseleer and Bluming (1994) defined behaviour and frequency signatures for each individual based on fitness and gene (i.e. specific nodes) frequencies, respectively. The average correlation between the respective signatures in the population represents the phenotype and genotype diversity of the population. In addition, D'haeseleer and Bluming tagged genetic code and evaluated the behaviour of individuals with a ``stimulus-response map'' to gain further knowledge into the structure and behaviour of populations in their robot tank problem. Using these measures, the authors witnessed emerging demes (distinct subpopulations) with neighborhood selection and mating.

The degree of graph isomorphism could be applied to genetic programming trees as a measure of diversity. However, the properties of functions used in genetic programming (associativity, commutativity, etc) would require special, and possibly complex, implementations of isomorphism [Rosca, 1995a]. Also, determining graph isomorphism would be computationally expensive for an entire population. However, a measure of possible isomorphic trees could be found by noting simple properties (terminal, functions, depth, etc.) to determine the individuals which could be isomorphic without actually computing isomorphism.

McPhee and Hopper (1999) investigated diversity at the genetic level by assigning numerical tags to each node in the population. The tags track the survival of nodes from the initial population and the change of context for nodes during recombination. The authors also tracked the genetic lineages from the initial population by noting the individuals selected for recombination, which child they produced and which of the parents provided the root portion of the child's tree. They found that populations in the final generation often descended from one single initial individual, since genetic lineages were often reduced to one surviving lineage early on in the evolutionary process.

Measuring the difference between two individuals based on string edit distances has been used several times in genetic programming. O'Reilly (1997) used an edit distance based on string matching, which uses single node insertions, deletions and substitutions to transform two trees to be equal in structure and content. De Jong et al. (2001) also used a similar edit distance in a multiobjective method. This edit distance, a modification of string edit distance, or Levenshtein distance, overlaps two trees at the root node. Two different nodes, when overlapping, score a distance of 1. Equal nodes score a distance of 0. The edit distance is then the sum of all different nodes, which can be normalised by dividing it by the size of the smaller tree. The measure represents the number of node changes that need to be made to either tree to make them equal in structure and content. Note that this measure does not take into consideration the standard subtree crossover operator, but does resemble a single point mutation operator.

O'Reilly (1997) noted the importance of using structural distance measures on genetic programming populations to understand the underlying dynamics. An edit distance measure was used here to study the effects of crossover and the differences between individuals and better individuals. While no clear results were found, the ability to understand genetic programming populations with edit distance measures was suggested.

Ekárt and Németh (2000) defined an edit distance specific to genetic programming syntax trees, adapted from [Nienhuys-Cheng, 1997], which considered the cost of substituting between different node types (functions vs. terminals and within these classes). The difference between two trees is defined as:

dist(T_1,T_2) = \left\{ \begin{array}{l l}
d(p,q) & \textrm{...
...{l=1}^m dist(s_l,t_l) & \textrm{otherwise}
\end{array} \right.

where $T_1,T_2$ are trees with roots $p,q$ and possible children ($m$ total) subtrees $s$ and $t$. The overall distance between two trees, $dist(T_1,T_2)$, is found by recursively finding the distance between their nodes, $d(p,q)$. The constant $\textrm{K}$ is set to $\frac{1}{2}$ but can be adjusted to weight the depth of tree differences differently. Two trees are brought to the same tree structure by adding ``null'' nodes to each tree. Note that the differences near the root have more weight. This is a convenient description for genetic programming as it has been noted that programs converge quickly to a fixed root portion [Igel and Chellapilla, 1999,McPhee and Hopper, 1999]. As before, this measure does not account for the subtree crossover operator, but it does represent its dynamics more closely by giving a higher cost to the upper portions of trees which are more difficult to change when using subtree crossover. Additionally, the cost of changing one node to another can be specified for each pair of nodes or for classes of nodes, as was the case in [Ekárt and Németh, 2000].

Measures based on behaviour compare differences among the populations' fitness values at a given time. Rosca (1995a) used the fitness values in a population to define an entropy and free energy measure. Entropy represents the amount of disorder of the population, where an increase in entropy represents an increase in diversity. Rosca found that populations appeared to be stuck in local optima when entropy did not change or decreased monotonically in successive generations.

Entropy is calculated for a population by first placing fitness values into classes and noting the size of each class [Rosca, 1995a]. In a discrete fitness space, a unique fitness value may define a fitness class. In continuous spaces, discretisation may be necessary. The level of convergence in the genotype space often significantly reduces the number of unique fitness values, even in continuous spaces. Given a fitness value $i$, $k$ unique fitness values in the current population, and fitness classes $p_1 \ldots p_k$, the fitness class $p_i$ is the proportion of the population which has fitness equal to $i$. Entropy is then defined as,

- \sum_k p_k \cdot \textrm{log} p_k .

In physical systems, entropy represents the amount of chaos in the system. In genetic programming, high entropy reveals the presence of many unique fitness values, where the population is evenly distributed over those values. Low entropy describes a population which contains fewer unique fitness values as many individuals have the same fitness, as shown in Figure 4.1. Entropy gives insights into the distribution of fitness values over the population, but does not describe the precise distribution of the fitness values.

Figure 4.1: Examples of entropy distributions from low entropy, where all members of the population have the same value, to high entropy, where each individual is represented by a different value (assuming there are at least as many possible fitness values as individuals).
Additionally, because entropy describes the number of unique fitness values and some information about their distribution, it also describes the same information used by selection. If all the individuals in the population have the same fitness value, selection based on fitness becomes random. The higher the entropy, the more evenly distributed the population becomes over the possible fitness values, the more fine-grain the population will appear to selection. Thus, entropy can describe, as does phenotype diversity in a broader sense, the dynamic change of the applied selection pressure.

The frequency of the terminals and functions in a population could also serve as a measure of population diversity. Similarly, diversity could also refer to the number and type of tree structures that the population represents. A visualisation of a genetic programming tree was described for a circular lattice in [Daida, 2002] and later in [Daida et al., 2003a], where a population visualisation using colors represented the most sampled type of shapes in the lattice. Using the lattice model, a population visualisation method can be developed using a 3-dimensional graph [Gustafson and Krasnogor, 2003], which gives rise to measures that describe the tree shapes sampled by the population. Figure 4.2 shows a full binary tree of depth 10 on the circular lattice in the upper left graph. A 3-dimensional visualisation can be made by finding the cumulative node count of the population for the absolute node positions represented in the lattice. This cumulative count then can denote the height of each node, as shown in the bottom two figures of Figure 4.2, where the upper right graph shows the same visualisation from above.

Figure 4.2: Example of population visualisations on a circular lattice in the upper left. A sample tree is shown in the upper right, which actually represents an overhead view of a population of trees where the node height is the number of times the population samples a node at this structural location. The 3-dimensional views of this population are in the bottom two graphs.

next up previous contents
Next: 1 Promoting Diversity Up: 4 Analysis of Diversity Previous: 4 Analysis of Diversity   Contents
S Gustafson 2004-05-20