Learn to find the right data

Getting good data into AI models isn’t easy.

It is perhaps an indication of the extent to which the pendulum has swung towards machine learning within the artificial intelligence community that the concept of “data-centric AI” seems practically a tautology.

Back in the dark and distant past, otherwise known as the end of the 20th century, much of the AI work then under way focused on building systems from the ground up that could reason about the world on their own. Then came deep learning, and while there are still people working on the reasoning style of AI, most of the attention has gone to the approach of showing computers pictures or descriptions of things and expecting them to learn to identify those things.

In 2012, Geoffrey Hinton, professor of computer science at the University of Toronto, and his colleagues demonstrated a rapid advance in the ability of computers, thanks to the increased computing power of GPUs, to correctly label a thousand kinds of object drawn from the ImageNet dataset. Their deep neural network (DNN) easily outperformed the types of AI algorithms that had dominated ImageNet’s previous two annual challenges. Soon after, a group working at the IDSIA research institute in Switzerland used a DNN to outperform humans at a similar task: recognizing road signs. How? The computerized system was able to use subtle cues of shape and size to identify signs even where most of the original image had been bleached out by the Sun.

Since then, claims of machine outperformance have appeared at regular intervals, interspersed with evidence of how DNNs cheat and are, in turn, easily deceived, often for similar reasons. As with the model trained on traffic signs, neural networks frequently focus on details missed by human observers, not least because the human brain does not perceive images at the same level of detail as a computer reading a PNG file. Subtle textures can be as useful to the machine as anything else, and a number of studies have pointed out that DNNs don’t yet do a great job of extracting the important features from an image and associating them with a particular object. Most of the time, they latch onto things that humans would largely ignore when asked “Is there a computer in the picture?” or “Is the activity shown cooking?”.

Several years ago, a group working at the University of Virginia noticed that DNNs gave, perhaps unsurprisingly, much more weight to things that showed up more often than others in scenes. And those things often correlated with stereotypes, largely because the images used to train the models came from publicly available image databases, often gathered with the help of search engines. So a dataset could contain twice as many images of women cooking as of men, and a model trained on it would use those correlations to decide what it was seeing when shown another image.

The result? The machine would make mistakes about who was cooking, or arrive at the wrong answer about the activity based on the apparent gender of the person in the photo. This kind of “directional bias” is one source of the problems that DNNs have when presented with real-world data, and it also points to a big problem with the current generation of machine learning systems: the data used to train them is not good enough.
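The kind of skew described here can be checked for before any training takes place. The following is a minimal sketch in Python, not the method used by the Virginia group: the toy annotations, the function names `group_shares` and `skewed_activities`, and the 0.6 threshold are all assumptions made for illustration. It counts how often each apparent gender co-occurs with an activity label and flags activities where one group dominates.

```python
from collections import Counter

# Hypothetical toy annotations: one (activity, apparent gender) pair per image.
# The 2:1 split for "cooking" mirrors the example in the text.
annotations = (
    [("cooking", "woman")] * 20
    + [("cooking", "man")] * 10
    + [("driving", "woman")] * 12
    + [("driving", "man")] * 12
)


def group_shares(annotations, activity):
    """Return each group's share of the images labelled with `activity`."""
    counts = Counter(group for act, group in annotations if act == activity)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def skewed_activities(annotations, threshold=0.6):
    """List activities where a single group exceeds `threshold` of the images.

    The threshold is an arbitrary illustrative value, not a published standard.
    """
    activities = sorted({act for act, _ in annotations})
    return [
        act
        for act in activities
        if max(group_shares(annotations, act).values()) > threshold
    ]


cooking = group_shares(annotations, "cooking")
flagged = skewed_activities(annotations)
print(cooking)
print(flagged)
```

A check like this only reveals imbalance in the labels themselves; it says nothing about whether a trained model goes on to exaggerate that imbalance, which would need a comparison against the model's predictions.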

Quite often, to get millions of images or other pieces of content into the system and tagged, researchers have turned to crowdsourcing services such as Mechanical Turk and Upwork. But using relatively cheap labor comes with hidden costs, not least the name-calling and insults that sometimes show up in labels attached by less-than-happy or ill-trained crowdworkers.

Then you have gaps in the data itself. In a talk for Princeton University’s Center for Information Technology Policy last year, Olga Russakovsky, an assistant professor of computer science at the same university, described how the Western orientation of many public datasets leads to mistakes in recognizing things as simple as soap. Bar soaps are relatively rare in the United States compared to liquid soap, so models, Russakovsky said, may not recognize them as soap. “A lot of these issues can be attributed to the fact that we collect all of this data primarily from the web, as this is the cheapest and most readily available large-scale data source,” she added.

At a data-centric AI conference hosted by Stanford University in November last year, Cody Coleman, a PhD student at the university, said: “The unprecedented amount of data available has been essential to many recent deep learning successes. However, big data brings its own problems. It is computationally demanding, resource intensive, and often redundant. But when we think of real-world datasets, they are often limited to a small number of common or popular classes.”

The data-centric AI movement aims to solve this problem by paying much more attention to the data that is used to train the model, trying not only to avoid unnecessary effort but also to avoid skewing the results by presenting too many sources that represent more or less the same thing. One approach is to make machine learning much more iterative, with the data and model repeatedly tuned to try to reduce errors. The question is to what extent this can be automated. An example of a way forward is DCBench, which looks for signs of deficiencies or bias in the trained model and the data used to feed it, and uses that to identify ways to fix the problem.

At the NeurIPS conference in late 2021, a team from Salesforce Research described a semi-automated “human-in-the-loop” approach to eliminate glitches in the training data and come up with additional rules the model could use. They found that the more conventional deep learning approach, such as using conflicting data to try to get the model to learn the right patterns on its own, proved to be more expensive than simply integrating rules directly into the model.

A decade after DNNs seemed to give rule-based AI the kiss of death, it is making a partially hidden return. The term “data-centric AI” may turn out to be a bit of a misnomer in the end, as model designers make more changes to their engines to deal with problems caused by over-reliance on the data itself.
