The latest “machine scientist” algorithms can take in data on dark matter, dividing cells, turbulence, and other situations too complicated for humans to understand and provide an equation capturing the essence of what’s going on.
Despite rediscovering Kepler’s third law and other textbook classics, BACON remained something of a curiosity in an era of limited computing power. Researchers still had to analyze most data sets by hand, or eventually with Excel-like software that found the best fit for a simple data set when given a specific class of equation. The notion that an algorithm could find the correct model for describing any data set lay dormant until 2009, when Lipson and Michael Schmidt, roboticists then at Cornell University, developed an algorithm called Eureqa.
Their main goal had been to build a machine that could boil down expansive data sets with column after column of variables to an equation involving the few variables that actually matter. “The equation might end up having four variables, but you don’t know in advance which ones,” Lipson said. “You throw at it everything and the kitchen sink. Maybe the weather is important. Maybe the number of dentists per square mile is important.”
One persistent hurdle to wrangling numerous variables has been finding an efficient way to guess new equations over and over. Researchers say you also need the flexibility to try out (and recover from) potential dead ends. When the algorithm jumps from a line to a parabola, or adds a sinusoidal ripple, its ability to hit as many data points as possible might get worse before it gets better. To overcome this and other challenges, in 1992 the computer scientist John Koza proposed “genetic programming,” a form of genetic algorithm that introduces random “mutations” into equations and tests the mutant equations against the data. Over many trials, initially useless features either evolve potent functionality or wither away.
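The mutate-and-test loop described above can be sketched in a few dozen lines of Python. This is only an illustrative toy, not Koza's system or Eureqa: the operator set `OPS`, the helper names (`random_expr`, `mutate`, `mse`, `evolve`), the population size and the target curve are all assumptions made up for the example.

```python
import math
import random

# Hypothetical building blocks for this sketch: name -> (arity, function).
OPS = {
    "add": (2, lambda a, b: a + b),
    "mul": (2, lambda a, b: a * b),
    "sin": (1, math.sin),
}

def random_expr(rng, depth=2):
    """Build a random expression tree: 'x', a constant, or (op, child, ...)."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 2)
    op = rng.choice(sorted(OPS))
    return (op,) + tuple(random_expr(rng, depth - 1) for _ in range(OPS[op][0]))

def evaluate(expr, x):
    """Evaluate an expression tree at the point x."""
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return float(expr)
    op, *args = expr
    return OPS[op][1](*(evaluate(a, x) for a in args))

def mutate(expr, rng, depth=3):
    """Return a copy of expr with one random subtree replaced (depth stays <= 3)."""
    if depth == 0 or not isinstance(expr, tuple) or rng.random() < 0.3:
        return random_expr(rng, min(depth, 2))
    op, *args = expr
    i = rng.randrange(len(args))
    args[i] = mutate(args[i], rng, depth - 1)
    return (op, *args)

def mse(expr, data):
    """Mean squared error of the expression against (x, y) pairs."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)

def evolve(data, generations=200, pop_size=50, seed=0):
    """Mutate equations, test them against the data, keep the better half."""
    rng = random.Random(seed)
    population = [random_expr(rng) for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        population.sort(key=lambda e: mse(e, data))
        survivors = population[: pop_size // 2]
        history.append(mse(survivors[0], data))
        # Survivors are kept unchanged; one mutant per survivor slot fills the rest.
        mutants = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + mutants
    return min(population, key=lambda e: mse(e, data)), history

if __name__ == "__main__":
    # Made-up target for illustration: y = x^2 + sin(x), sampled on [-3, 3].
    data = [(x / 2, (x / 2) ** 2 + math.sin(x / 2)) for x in range(-6, 7)]
    best, history = evolve(data)
    print(best, history[0], history[-1])
```

Keeping the better half of the population, rather than only the single best equation, is what lets mediocre candidates survive long enough for a later mutation to pay off. Production systems add much more, such as crossover between equations and a penalty on equation complexity, which this sketch leaves out.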