Given a set of points on a graph, can a computer accurately find a practical mathematical function through the use of a genetic algorithm without any prior knowledge of what function that set of points may resemble?
ResearchFunction approximation programs come in many varieties as there are numerous approaches to approximating a function or fitting a curve. Each of these approaches has its own advantages and disadvantages.
In 2001, Aaron Golden programmed a function approximation program that used a genetic algorithm, but limited the number of inputs to exactly three points. Therefore, complex functions would not be accurately fitted with this approach, (but his goal was just to show genetic algorithms at work).
Jeremy Emch created a similar program to Golden, but did not have a limit to the number of points that can be inputted. From his results, he concluded that “genetic programming is indeed a powerful tool”.
These projects in the past only had abstracts and no code or executable. Thus, a few questions arose; What is the minimum number of points necessary for good approximations? How powerful is genetic programming? Can a function approximation program output nice/clean looking functions?
ProcedureThe entire procedure of the project including programming and testing is divided into five stages; two programming and three testing. First, using C++, the basic program is coded for and the main genetic algorithm is added specifically for function finding. This main genetic algorithm strayed from traditional ones as a part of the algorithm was approached in a different direction. Rather than using binary as building blocks of the strings, we took the word ‘genetic’ literally and used the biological mRNA codon chart as a basis instead. Then, the meta-program (a second genetic algorithm) is coded for and will be used to find the experimental optimal parameters of the main program.
The first test is for the purpose of tweaking the program. This is done by letting the meta-program run with a set of data from one of the five predetermined testing equations (these are chosen to represent the different types of functions able to be produced by the program) and output various combinations of genetic algorithm parameters and their fitness levels. This is repeated for each of the five sets of data. Once all the results are gathered, the combination with the highest fitness value from each of the five runs is then run against all five data sets to determine how well each will do against a variety of data. The combination of parameters with the highest fitness value will be used in the main program.
The parameters of the program are changed to the combination with the highest fitness value. Next, the graphical user interface shell is coded using java, and the function plotting and data exporting functions are coded into the main program.
The second test is to determine how the program will behave when given a different amount of data to work with. Again using the five predetermined equations, the data is altered to only include five, ten, or twenty points. They are then run in the program, each at a time, to measure their fitness at each generation until it finishes. This is repeated using data from each of the five functions.
The final test is to compare this function approximation program against others (though they do not necessarily use genetic algorithms) based on a set of criteria. It is important that the program should be able to approximate a mathematical expression for a set of points without extraneous information about the points, that the program should not require extensive training to use, that it should produce functions with reasonable numbers, and that the program should not take a tremendously long time to run. The other programs were compared on a pass or fail basis.
Variables|
Parameter Name |
Range |
Description |
|
Specialization |
0-1 |
Whether the program will use pre-configured functions for some species
|
|
Selection Method |
0-1 |
The method of selection (random or roulette) |
|
% Selection Rate |
5-50 |
The percentage of the individuals that are selected to move to the next
generation of the genetic algorithm. |
|
Crossover Method |
0-2 |
The method of crossover (single crossover, double crossover, cut and
splice crossover) |
|
Crossover Rate |
0.2-0.8 |
The chance that crossover will be performed between two individuals |
|
Mutation Rate |
0.001-0.75 |
The chance that a mutation will occur in a base of an individual |
|
Max. Generations |
10-500 |
The maximum number of generations the GA will run for (dependent upon
the termination method, below) |
|
Max. Mutations |
1-50 |
The maximum number of mutations that may occur in an individual (so as
to prevent too much mutation) |
|
# of Individuals |
10-300 |
The number of individuals in a species |
|
# of Species |
7-12 |
The maximum number of species in a generation of the GA |
|
Termination Method |
0-2 |
The method of termination (maximum generations, time limit, reasonable
fitness) |
|
Termination Seed |
Random |
A random seed used by the termination method (dependent on the method) |
|
Reasonable Fitness |
0.001-0.3 |
If the program has run for too long, it is okay to stop if it has met
this fitness |
|
Fitness Cut-off |
0.001-0.2 |
In order to stop the program early, it needs to have a best fitness of
at most this value. |
|
Add Generations |
0-149 |
The maximum number
of generations to increment if the program is unable to meet the
reasonable fitness benchmark |
|
Fitness Plateau
Len. |
0-99 |
If the best fitness
remains the same for this many generations, the fitness has reached a
plateau, so quit. |
Below are the different amounts of points chosen to determine the program’s behaviour when dealing with varying amounts of points. These accurately represent small, average, and large amounts of data points that the program is expected to deal with effectively.
5 data pointsFor testing purposes in the first and second test, sample data sets were created from 5 selectively chosen equations. These equations (as seen below) were chosen as they are representative of the wide range of functions that this program is capable of producing. Therefore, by testing with data from these functions, the capabilities of the program are tested thoroughly and in a scientific manner.
Following are the programs were chosen to determine how approximating functions using other methods compare against approximation using a genetic algorithm. Each program has its own method in finding a function that would match the set of points inputted by the user.
Zunzun.comEach of the five sample data set used gave three sets of parameters with the highest overall fitness value (time and accuracy combined). Data from equation a, b, c, and d outputted parameter sets with very high fitness values; often close to 99%. However, when data from equation e was used, the sets of parameters that resulted had fitness values in the 70% range. Once tested against each other, the top five sets of parameters (one from each data set) are arranged based on how well each did with various sets of data. Parameter set A (the set that originated from tests done using equation A) has the highest fitness overall (speed and accuracy). Two others (sets B and D) have overall fitness values similar to, but lower than set A. Therefore, set A is the experimental optimum and will be used for the main program. It is possible that a set of parameters has the fitness value that represents the global maximum, but there also exist sets of parameters that have fitness values that are very close to the global maximum. Therefore the program can be optimized, or very close to optimized, with many different configurations of parameters. These results are important as they are used to optimize the program for accuracy and speed.
Through testing how the program will behave when given a different amount of data to work with, it was determined that the greater the number of data points, the lower the fitness value for that function will be in the first few generations. With only 5 points, the initial fitness value was normally higher than if 10 or 20 points were used. This is evident on all 5 sets of data used. Using 10 points, the initial fitness value was generally in between the initial values obtained with 5 and 20 data points. However, with 20 data points, the initial fitness value was on average the lowest of these three sets of data. Therefore, it can be concluded that independent of the number of data points inputted, the fitness value generally ended up very close to each other after 20 or more generations. Though it is still better to use more points if available as the chance that there would be any inaccuracies between points would be reduced, it is only most crucial at the beginning of the run. This indicates that the adaptability of the program is very good as it is able to operate in various environments were the data set can be a variety of sizes.
In the comparison between programs, all of them were able to accurately fit a curve to a set of data, but each program had its own downfall. Microsoft Excel and FindGraph required the user to already have an idea as to what the function should look like in order to find the best fitting function. Zunzun.com (a web based function approximation tool) takes a brute-force way in order to find a function, causing it to be slower than all the other programs, even when fitting simpler sets of data points. Aside from that, it was able to meet our other proposed criteria. LAB Fit also took a brute-force approach, comparing the data to its library of functions, and similar results were obtained to ZunZun.com, though a bit faster and a having more mathematical features (which our experiment does not deal with). Other programs tested (such as the R-Project, a free alternative to MATLAB) required one to have prior training (such as using scripts based off of the program’s own scripting language); therefore they cannot be used for quick approximations. However, our program only requires the user to input a set of data points. The program does the rest of the work in moderately adequate time and outputs up to 10 of the most accurate functions that it has found and plots their graphs.
Many times in biological, mathematical, statistical, economical, and other scientific situations, we are asked to plot data from experiments so we can study the relationships between the dependent and independent variables.
As the program does not require additional information about the data in question, it can be used for on-the-spot, quick function approximation.
The variability that is intrinsically characteristic of a genetic algorithm is an advantage in approximating functions as it can find obscure functions that other analytical techniques such as Taylor Expansions are unable to find (i.e. Taylor Expansions cannot find f(x) = e(tan(sin(x)))). This characteristic can also allow the program to find multiple functions that match the defined set of points. Therefore, an interesting mathematical implication of the program is that it can be used to study variations and/or similarities between various generated functions that match the defined set of points over a given domain.