Problem

Given a set of points on a graph, can a computer accurately find a practical mathematical function through the use of a genetic algorithm without any prior knowledge of what function that set of points may resemble?

Research

Function approximation programs come in many varieties as there are numerous approaches to approximating a function or fitting a curve. Each of these approaches has its own advantages and disadvantages.

In 2001, Aaron Golden programmed a function approximation program that used a genetic algorithm, but limited the number of inputs to exactly three points. Therefore, complex functions would not be accurately fitted with this approach, (but his goal was just to show genetic algorithms at work).

Jeremy Emch created a similar program to Golden, but did not have a limit to the number of points that can be inputted. From his results, he concluded that “genetic programming is indeed a powerful tool”.

These projects in the past only had abstracts and no code or executable.  Thus, a few questions arose; What is the minimum number of points necessary for good approximations? How powerful is genetic programming? Can a function approximation program output nice/clean looking functions?

Procedure

The entire procedure of the project including programming and testing is divided into five stages; two programming and three testing. First, using C++, the basic program is coded for and the main genetic algorithm is added specifically for function finding. This main genetic algorithm strayed from traditional ones as a part of the algorithm was approached in a different direction. Rather than using binary as building blocks of the strings, we took the word ‘genetic’ literally and used the biological mRNA codon chart as a basis instead. Then, the meta-program (a second genetic algorithm) is coded for and will be used to find the experimental optimal parameters of the main program.

The first test is for the purpose of tweaking the program. This is done by letting the meta-program run with a set of data from one of the five predetermined testing equations (these are chosen to represent the different types of functions able to be produced by the program) and output various combinations of genetic algorithm parameters and their fitness levels. This is repeated for each of the five sets of data. Once all the results are gathered, the combination with the highest fitness value from each of the five runs is then run against all five data sets to determine how well each will do against a variety of data. The combination of parameters with the highest fitness value will be used in the main program.

The parameters of the program are changed to the combination with the highest fitness value. Next, the graphical user interface shell is coded using java, and the function plotting and data exporting functions are coded into the main program.

The second test is to determine how the program will behave when given a different amount of data to work with. Again using the five predetermined equations, the data is altered to only include five, ten, or twenty points. They are then run in the program, each at a time, to measure their fitness at each generation until it finishes. This is repeated using data from each of the five functions.

The final test is to compare this function approximation program against others (though they do not necessarily use genetic algorithms) based on a set of criteria. It is important that the program should be able to approximate a mathematical expression for a set of points without extraneous information about the points, that the program should not require extensive training to use, that it should produce functions with reasonable numbers, and that the program should not take a tremendously long time to run. The other programs were compared on a pass or fail basis.

Variables


Test 1 (Program Parameters)

Below are the parameters that the genetic algorithm operates with. Each parameter controls a different part of the algorithm and can increase or decrease the overall performance (accuracy and speed) of the program. There are countless different combinations of these variables possible. Therefore, a test was devised to determine the combination that would maximize the program’s performance. More on the results of the test in the results section.

Parameter Name

Range

Description

Specialization

0-1

Whether the program will use pre-configured functions for some species

Selection Method

0-1

The method of selection (random or roulette)

% Selection Rate

5-50

The percentage of the individuals that are selected to move to the next generation of the genetic algorithm.

Crossover Method

0-2

The method of crossover (single crossover, double crossover, cut and splice crossover)

Crossover Rate

0.2-0.8

The chance that crossover will be performed between two individuals

Mutation Rate

0.001-0.75

The chance that a mutation will occur in a base of an individual

Max. Generations

10-500

The maximum number of generations the GA will run for (dependent upon the termination method, below)

Max. Mutations

1-50

The maximum number of mutations that may occur in an individual (so as to prevent too much mutation)

# of Individuals

10-300

The number of individuals in a species

# of Species

7-12

The maximum number of species in a generation of the GA

Termination Method

0-2

The method of termination (maximum generations, time limit, reasonable fitness)

Termination Seed

Random

A random seed used by the termination method (dependent on the method)

Reasonable Fitness

0.001-0.3

If the program has run for too long, it is okay to stop if it has met this fitness

Fitness Cut-off

0.001-0.2

In order to stop the program early, it needs to have a best fitness of at most this value.

Add Generations

0-149

The maximum number of generations to increment if the program is unable to meet the reasonable fitness benchmark

Fitness Plateau Len.

0-99

If the best fitness remains the same for this many generations, the fitness has reached a plateau, so quit.



Test 2 (Effect of the Number of Data Points Entered)

Below are the different amounts of points chosen to determine the program’s behaviour when dealing with varying amounts of points. These accurately represent small, average, and large amounts of data points that the program is expected to deal with effectively.

5 data points
10 data points
20 data points


Equations Used During Testing

For testing purposes in the first and second test, sample data sets were created from 5 selectively chosen equations. These equations (as seen below) were chosen as they are representative of the wide range of functions that this program is capable of producing. Therefore, by testing with data from these functions, the capabilities of the program are tested thoroughly and in a scientific manner.




Test 3 (Comparison with other Function Approximation Programs)

Following are the programs were chosen to determine how approximating functions using other methods compare against approximation using a genetic algorithm. Each program has its own method in finding a function that would match the set of points inputted by the user.

Zunzun.com
LAB Fit
FindGraph
TableCurve 2D
NEFPROX
the R-Project
Microsoft Excel

Program Comparison Criteria
When comparing against other programs, the following conditions are considered in the order of importance.
1.    The program should be able to approximate a mathematical expression for a set of points without extraneous information about the data.
2.    The program should be able to allow users to easily enter their data points and do a quick approximation. Therefore, the program’s learning curve should not be steep.
3.    The program should output relatively practical functions. That is, coefficients should be reasonable (i.e. not have a large number of significant figures).
4.    The program should not take a tremendously long time to run as the goal is for a relatively quick approximation of data.

Results


Test 1 Results

Each of the five sample data set used gave three sets of parameters with the highest overall fitness value (time and accuracy combined). Data from equation a, b, c, and d outputted parameter sets with very high fitness values; often close to 99%. However, when data from equation e was used, the sets of parameters that resulted had fitness values in the 70% range. Once tested against each other, the top five sets of parameters (one from each data set) are arranged based on how well each did with various sets of data. Parameter set A (the set that originated from tests done using equation A) has the highest fitness overall (speed and accuracy). Two others (sets B and D) have overall fitness values similar to, but lower than set A. Therefore, set A is the experimental optimum and will be used for the main program. It is possible that a set of parameters has the fitness value that represents the global maximum, but there also exist sets of parameters that have fitness values that are very close to the global maximum. Therefore the program can be optimized, or very close to optimized, with many different configurations of parameters. These results are important as they are used to optimize the program for accuracy and speed.


Test 2 Results

Through testing how the program will behave when given a different amount of data to work with, it was determined that the greater the number of data points, the lower the fitness value for that function will be in the first few generations. With only 5 points, the initial fitness value was normally higher than if 10 or 20 points were used. This is evident on all 5 sets of data used. Using 10 points, the initial fitness value was generally in between the initial values obtained with 5 and 20 data points. However, with 20 data points, the initial fitness value was on average the lowest of these three sets of data. Therefore, it can be concluded that independent of the number of data points inputted, the fitness value generally ended up very close to each other after 20 or more generations. Though it is still better to use more points if available as the chance that there would be any inaccuracies between points would be reduced, it is only most crucial at the beginning of the run. This indicates that the adaptability of the program is very good as it is able to operate in various environments were the data set can be a variety of sizes.


Test 3 Results

In the comparison between programs, all of them were able to accurately fit a curve to a set of data, but each program had its own downfall. Microsoft Excel and FindGraph required the user to already have an idea as to what the function should look like in order to find the best fitting function. Zunzun.com (a web based function approximation tool) takes a brute-force way in order to find a function, causing it to be slower than all the other programs, even when fitting simpler sets of data points. Aside from that, it was able to meet our other proposed criteria. LAB Fit also took a brute-force approach, comparing the data to its library of functions, and similar results were obtained to ZunZun.com, though a bit faster and a having more mathematical features (which our experiment does not deal with). Other programs tested (such as the R-Project, a free alternative to MATLAB) required one to have prior training (such as using scripts based off of the program’s own scripting language); therefore they cannot be used for quick approximations. However, our program only requires the user to input a set of data points. The program does the rest of the work in moderately adequate time and outputs up to 10 of the most accurate functions that it has found and plots their graphs.


Applications

Many times in biological, mathematical, statistical, economical, and other scientific situations, we are asked to plot data from experiments so we can study the relationships between the dependent and independent variables.

As the program does not require additional information about the data in question, it can be used for on-the-spot, quick function approximation.

The variability that is intrinsically characteristic of a genetic algorithm is an advantage in approximating functions as it can find obscure functions that other analytical techniques such as Taylor Expansions are unable to find (i.e. Taylor Expansions cannot find f(x) = e(tan(sin(x)))). This characteristic can also allow the program to find multiple functions that match the defined set of points. Therefore, an interesting mathematical implication of the program is that it can be used to study variations and/or similarities between various generated functions that match the defined set of points over a given domain.


Conclusion

Based on all the data that has been collected and analyzed, it can be concluded that overall, the program managed to fit data reasonably well to a function within a reasonable amount of time, although the results often had complex expressions. Still, the majority of the time, portions of the expressions could be easily simplified by hand. Nevertheless, the program was found to be a solid approach to function finding, particularly when the intent is to find a variety of similar functions that match a defined set of points. Therefore, it is safe to conclude that the program is for the most part accurate and the expressions given by the program are practical to use.

Future Work

Though the experiment is complete, there are a number of modifications that we can make to enhance the program. Since the majority of the output functions can be simplified, the program can be streamlined to effectively use an already existing Computer Algebra System (such as Maple) in order to simplify output functions.

Also, the program can be coded to accept, find the function for, and plot input data that contain more than two dimensions (such as a set of data from a function of two variables).

The program can also be modified to accept and evaluate margins of error associated with each or all of the points inputted by the user, which will increase the program speed.

Additionally, the program could be modified for encryption applications, where the information in files can be converted into numbers and have the program output a unique function that can be used as a key to decrypt the encrypted information.

A fourth test that we could perform on this topic is to determine the effect of different data structures (such as binary, tree, and codon string) on the overall performance of the program.

Many of these are items we currently are either working on or pondering how we can work on them.