May 21, 2024

They Found a Way to Thematically Sort All of Wikipedia on a Laptop

A Faster Way to Sift Through Massive Datasets

From its debut in 2003, topic modeling had an enormous impact. But as datasets grew larger, the algorithm struggled. Blei and Bach were discussing the problem one night at a bar when they realized that stochastic optimization, an idea introduced decades earlier by the statistician Herbert Robbins, could provide a workaround.

In a groundbreaking paper published in 1951, just before coming to Columbia, Robbins explained how an optimization problem could be solved by estimating the gradient through randomized approximation, a methodology now called stochastic optimization. The technique was later used to approximate gradients efficiently over enormous datasets: rather than calculate the gradient precisely using all of the data, the optimizer repeatedly samples a small subset of the data to get a rough estimate much faster.
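To make that concrete, here is a minimal sketch of the idea in Python, applied to a made-up least-squares problem; the data, batch size, and step-size schedule are illustrative assumptions, not anything from Robbins's paper:

```python
import numpy as np

# Toy least-squares problem: find w minimizing the average of (x_i . w - y_i)^2.
# The exact gradient needs every row of X; the stochastic version estimates it
# from a small random minibatch on each step.
rng = np.random.default_rng(0)
n, d = 100_000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
batch_size = 64
for t in range(1, 2001):
    idx = rng.choice(n, size=batch_size, replace=False)   # sample a subset of the data
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size          # noisy gradient estimate
    w -= (0.05 / np.sqrt(t)) * grad                       # decaying step size, Robbins-Monro style

print(np.linalg.norm(w - w_true))   # should be close to zero
```

Each update touches only 64 of the 100,000 rows, yet the decaying step size lets the noisy estimates average out toward the answer a full-data gradient would give.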

Stochastic optimization today underlies much of modern AI, enabling complex models like deep neural networks to be fit quickly to massive datasets. Back in the early aughts, however, stochastic optimization was still finding its way into big models. At the time, Bach was using stochastic optimization to fill in pixels missing from images in computer vision models.

When Bach, Blei, and Hoffman, who was Blei’s grad student at the time, combined stochastic optimization and topic modeling, they discovered their algorithm could fit a model with millions of data points — be they New York Times stories, emails, or variants in thousands of human genomes. Stochastic optimization has been so pivotal in AI, said Blei, that four of the last five Test of Time papers at NeurIPS have hinged on it. 
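As a rough illustration of how that combination is used today, scikit-learn ships an online (minibatch) variational Bayes implementation of LDA; the toy corpus and parameter choices below are placeholders, not the authors' original setup:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; in practice this would be millions of documents
# streamed from disk rather than held in memory.
docs = [
    "the senate passed the budget bill",
    "the team won the championship game",
    "scientists sequenced the ancient genome",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

# learning_method="online" uses minibatch (stochastic) updates instead of
# a full pass over the corpus for each step.
lda = LatentDirichletAllocation(
    n_components=10,          # number of topics (illustrative)
    learning_method="online",
    batch_size=256,
    random_state=0,
)
lda.fit(counts)               # or call partial_fit repeatedly on streamed minibatches

# Inspect the learned topics: top words per component.
words = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = words[weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {' '.join(top)}")
```

Because each update sees only a minibatch of documents, a topic model fit this way can work through a corpus far larger than memory on a single machine.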

“The real test-of-time winner is Herb Robbins,” he said.

The ability to scale topic models soon led to another innovation: stochastic variational inference. In a highly cited follow-up paper, Hoffman and Blei, with co-authors Chong Wang and John Paisley, now a professor at Columbia Engineering, showed how the stochastic optimization approach they had applied to topic modeling could be generalized to a wide range of models.
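The generalized recipe keeps the same loop across models: sample data, estimate what the global variational parameters would be if that sample stood in for the whole dataset, and blend the noisy estimate in with a decaying step size. Below is a hedged sketch of that global update on a deliberately tiny conjugate model (a Gaussian mean with known noise variance, and no local latent variables), chosen only because the exact posterior is easy to check; the variable names and step-size schedule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_i ~ Normal(mu, sigma^2) with a Normal(0, tau^2) prior on mu.
# The global variational parameters are the natural parameters (eta1, eta2)
# of a Gaussian approximation to the posterior over mu.
N, sigma2, tau2 = 50_000, 1.0, 10.0
x = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=N)

prior = np.array([0.0, -0.5 / tau2])   # natural parameters of the Normal(0, tau^2) prior
eta = prior.copy()                      # global variational parameters, started at the prior

for t in range(1, 5001):
    i = rng.integers(N)                                    # sample one data point
    # Estimate of the optimal global parameters if x[i] stood in for all N points.
    eta_hat = prior + N * np.array([x[i] / sigma2, -0.5 / sigma2])
    rho = (t + 10.0) ** -0.7                               # decaying step size
    eta = (1 - rho) * eta + rho * eta_hat                  # blend the noisy estimate in

post_var = -0.5 / eta[1]
post_mean = eta[0] * post_var
print(post_mean, post_var)   # close to the exact posterior mean (~2.5) and variance (~sigma2/N)
```

In a full stochastic variational inference setup, the same blending step is applied to the natural parameters of each global variable, after an inner step that first fits the local variational parameters for the sampled data.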

Stochastic optimization plus variational inference has been used to scale recommendation systems and the analysis of social networks, genetic data, and other huge datasets. It has also inspired entirely new types of machine learning models, including deep generative models. 

Online LDA was the “precursor not just to the stochastic variational inference paper, but to machine learning models that can learn analogs of ‘topics’ for data like images and robot motions,” said Jacob Andreas, a computer scientist at MIT who was not involved in the research. “What’s also always impressed me about this whole line of work on topic modeling is how many researchers in the humanities and social sciences are using it.”
