Researchers Announce Advance in Image-Recognition Software

MOUNTAIN VIEW, Calif. — Two groups of scientists, working independently, have created artificial intelligence software capable of recognizing and describing the content of photographs and videos with far greater accuracy than ever before, sometimes even mimicking human levels of understanding.

Until now, so-called computer vision has largely been limited to recognizing individual objects. The new software, described on Monday by researchers at Google and at Stanford University, teaches itself to identify entire scenes: a group of young men playing Frisbee, for example, or a herd of elephants marching on a grassy plain.

The software then writes a caption in English describing the picture. Compared with human observations, the researchers found, the computer-written descriptions are surprisingly accurate.

The advances may make it possible to better catalog and search for the billions of images and hours of video available online, which are often poorly described and archived. At the moment, search engines like Google rely largely on written language accompanying an image or video to ascertain what it contains.

“I consider the pixel data in images and video to be the dark matter of the Internet,” said Fei-Fei Li, director of the Stanford Artificial Intelligence Laboratory, who led the research with Andrej Karpathy, a graduate student. “We are now starting to illuminate it.”

Dr. Li and Mr. Karpathy published their research as a Stanford University technical report. The Google team published its paper on arXiv.org, an open-access repository hosted by Cornell University.

In the longer term, the new research may lead to technology that helps the blind and robots navigate natural environments. But it also raises chilling possibilities for surveillance.

During the past 15 years, video cameras have been placed in a vast number of public and private spaces. In the future, the software operating the cameras will not only be able to identify particular humans via facial recognition, experts say, but also identify certain types of behavior, perhaps even automatically alerting authorities.

Two years ago Google researchers created image-recognition software and presented it with 10 million images taken from YouTube videos. Without human guidance, the program trained itself to recognize cats — a testament to the number of cat videos on YouTube.

Artificial intelligence programs in new cars can already identify pedestrians and bicyclists from cameras positioned atop the windshield, and can stop the car automatically if the driver does not act to avoid a collision.

But “just single object recognition is not very beneficial,” said Ali Farhadi, a computer scientist at the University of Washington who has published research on software that generates sentences from digital pictures. “We’ve focused on objects, and we’ve ignored verbs,” he said, adding that these programs do not grasp what is going on in an image.

Both the Google and Stanford groups tackled the problem by refining software programs known as neural networks, inspired by our understanding of how the brain works. Neural networks can “train” themselves to discover similarities and patterns in data, even when their human creators do not know the patterns exist.
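To make that idea concrete, here is a toy sketch in Python of a program discovering structure in unlabeled data. It uses k-means clustering, a far simpler technique than the deep networks the Google and Stanford teams built, and the data points and cluster count are invented for illustration; it shows only the flavor of "training" without human-supplied answers.

```python
# Toy illustration of unsupervised pattern discovery: k-means clustering.
# NOTE: this is a stand-in for the far more complex neural networks in the
# article; the points and the choice of two clusters are assumptions.
import random

points = [(1.0, 1.2), (0.8, 1.1), (1.1, 0.9),   # one natural group
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]   # another natural group

centers = random.sample(points, 2)              # start from random guesses
for _ in range(10):
    # Assign each point to its nearest center ...
    groups = [[], []]
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        groups[dists.index(min(dists))].append(p)
    # ... then move each center to the average of its group.
    centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
               for g, c in zip(groups, centers)]

# No one told the program two groups exist; it finds them on its own.
print(centers)
```

The loop simply alternates between assigning points to the nearest center and recomputing the centers until they settle on the natural groups, the same self-correcting spirit, at miniature scale, as the training the researchers describe.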

In living organisms, webs of neurons in the brain vastly outperform even the best computer-based networks in perception and pattern recognition. But by adopting some of the same architecture, computers are catching up, learning to identify patterns in speech and imagery with increasing accuracy.

The advances are apparent to consumers who use Apple’s Siri personal assistant, for example, or Google’s image search.

Both groups of researchers employed similar approaches, weaving together two types of neural networks, one focused on recognizing images and the other on human language. In both cases the researchers trained the software with relatively small sets of digital images that had been annotated with descriptive sentences by humans.
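As a rough illustration of that two-network weave, the sketch below pairs a stand-in image encoder with a word-by-word language model in PyTorch. It is a minimal sketch of the general encoder-decoder idea, not either team's actual model; every layer size, name and piece of data here is an assumption made for the example.

```python
# Minimal sketch of the two-network idea: an image network's features seed a
# language network that predicts a caption one word at a time.
# NOTE: all sizes, names and data below are illustrative assumptions.
import torch
import torch.nn as nn

class CaptionSketch(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.encode_image = nn.Linear(2048, dim)   # stand-in for a deep CNN
        self.embed_word = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, image_features, caption_tokens):
        img = self.encode_image(image_features).unsqueeze(1)  # (batch, 1, dim)
        words = self.embed_word(caption_tokens)               # (batch, T, dim)
        seq = torch.cat([img, words], dim=1)    # image first, then the words
        out, _ = self.lstm(seq)
        return self.to_vocab(out)               # per-step scores over words

# Training pairs each annotated photo with its human-written sentence and
# nudges the model to predict every next word; here we use fake data.
model = CaptionSketch()
fake_image = torch.randn(1, 2048)               # pretend CNN activations
fake_caption = torch.randint(0, 1000, (1, 8))   # pretend 8-word sentence
print(model(fake_image, fake_caption).shape)    # torch.Size([1, 9, 1000])
```

The key design choice both teams share is visible even at this scale: the picture enters the language model as if it were the first "word" of the sentence, so the caption is conditioned on what the vision network saw.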

After the software programs “learned” to see patterns in the pictures and descriptions, the researchers turned them loose on previously unseen images. The programs were able to identify objects and actions with roughly double the accuracy of earlier efforts, although still nowhere near human perception capabilities.

“I was amazed that even with the small amount of training data that we were able to do so well,” said Oriol Vinyals, a Google computer scientist who wrote the paper with Alexander Toshev, Samy Bengio and Dumitru Erhan, members of the Google Brain project. “The field is just starting, and we will see a lot of increases.”

Computer vision specialists said that despite the improvements, these software systems had made only limited progress toward the goal of digitally duplicating human vision and, even more elusive, understanding.

“I don’t know that I would say this is ‘understanding’ in the sense we want,” said John R. Smith, a senior manager at I.B.M.'s T.J. Watson Research Center in Yorktown Heights, N.Y. “I think even the ability to generate language here is very limited.”

But the Google and Stanford teams said that they expect to see significant increases in accuracy as they improve their software and train these programs with larger sets of annotated images. A research group led by Tamara L. Berg, a computer scientist at the University of North Carolina at Chapel Hill, is training a neural network with one million images annotated by humans.

“You’re trying to tell the story behind the image,” she said. “A natural scene will be very complex, and you want to pick out the most important objects in the image.” 

College hoops coaches rely on video, analytics to find edge

The usual suspects are often cited in the search for the cause of college basketball's scoring decline: The game has become too physical, the pace too slow, defenses too packed in, and stars depart too quickly.

Division I men's teams are averaging 67.6 points through Thursday, just off the 2012-13 average of 67.5 points per game, the lowest scoring output since 1952.

But perhaps the salient reason scoring is down has nothing to do with size or talent or scheme. Perhaps it has everything to do with a proliferation of information.

NO SECRETS

Jason Richards enters a corridor of the Petersen Events Center with a T-shirt and mesh shorts draping his broad shoulders and 6-foot-2 frame, the build of a former athlete. Richards once was Stephen Curry's running mate at Davidson. Few would guess he now runs Pitt basketball's nerd cave as its director of video and analytics. Still, he is sure the role is as good a starting point as any toward becoming a head coach in the information age.

After college, he signed as an undrafted free agent with the Miami Heat in 2008 and briefly played in the NBA D-League. A torn ACL ended his career. It was with the Heat that he became fascinated by coach Erik Spoelstra's attention to detail and data, and by Spoelstra's own start in the league: working as a video coordinator.

At Richards' fingertips in the Pitt basketball offices is the Synergy Sports database, which integrates video and analytics on every NBA and Division I men's basketball player and team. He believes this scouting technology is playing a role in the college game's scoring decline.

“The analytics movement that started with baseball has filtered into basketball,” Richards said. “Game planning has gone to a whole new level. It's incredible what the software has done. … Teams are so well prepared for opponents now.”

Teams are so well prepared because of Synergy.

Synergy CEO Garrick Barr, a former video coordinator for the Phoenix Suns, founded the scouting service in 1998, creating reports off VHS-taped games. By 2008, he had streaming video and a small army of analysts breaking down games.

It was then that Synergy offered its sophisticated video-analytics package on a free trial basis to UCLA and Kansas. Kansas won the NCAA Tournament, and UCLA advanced to the Final Four. This season, more than 300 Division I men's programs subscribe to Synergy.

How detailed is Synergy's scouting database? Very. Want to know how often Duke forward Jahlil Okafor operates from his left shoulder on the left block? Synergy's Scott Mossman notes that it takes only seconds to find such data.

“If you look at all of Okafor's left-block, left-shoulder hook shots, let's say he did that 73 times over the course of a season,” Mossman said, “we have all the analytics and video for each time he did it.”
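Here is a hypothetical sketch, in Python, of the kind of lookup Mossman describes: a log of tagged possessions is filtered by player, spot on the floor and move, returning the matching clips and a quick efficiency number. The field names and sample plays are invented for illustration; Synergy's real schema and tagging are proprietary.

```python
# Hypothetical sketch of a Synergy-style query; field names and sample plays
# are invented for illustration, not Synergy's actual (proprietary) schema.
plays = [
    {"player": "Jahlil Okafor", "zone": "left block",
     "move": "left-shoulder hook", "points": 2, "clip": "video_0417"},
    {"player": "Jahlil Okafor", "zone": "right block",
     "move": "drop step", "points": 0, "clip": "video_0562"},
]

def find_clips(log, player, zone, move):
    """Return every tagged possession matching player, spot and move."""
    return [p for p in log
            if p["player"] == player and p["zone"] == zone and p["move"] == move]

hooks = find_clips(plays, "Jahlil Okafor", "left block", "left-shoulder hook")
print(len(hooks), "clips,",
      sum(p["points"] for p in hooks) / max(len(hooks), 1), "points per try")
```

Because every possession carries both the statistics and a link to the video, the same query that produces the number also queues up the film.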

Data often are credited with helping defenses more than offenses, because it is easier to isolate offensive performance and build a strategy to defend it. And because Synergy has proliferated to almost every Division I program, everyone has the same information. Said Virginia coach Tony Bennett earlier this season: “You cannot trick people now, not with all the great coaches in the league, with the Synergy, all the video.”

Said Richards: “There are no secrets.”

VIDEO ON DEMAND

It's not just the information; it's the speed of delivery.

Pitt coach Jamie Dixon is not a numbers guy. He is a video guy. He prefers to watch games from start to finish to see flow and when certain plays are run. He never has had access to so much video so quickly.

Last season, ACC programs agreed to use Synergy's video-exchange service. Teams are required to upload video of their most recent game within two hours of the final buzzer.

Consider Pitt's game Feb. 21 at Syracuse. Pitt's next opponent, Boston College, also played Feb. 21, against Notre Dame. Dixon was able to watch that BC-Notre Dame game the same night.

Immediately after a game, Richards loads the next opponent's seven most recent games onto Dixon's iPad along with the just-completed Pitt game.

“Between waiting on the bus, the bus ride and getting back to Pittsburgh, I was already well into scouting (Boston College),” Dixon said. “You get more done in a shorter amount of time, and you have more games available. … In the past, if you were (scouting) a team and they played a game three days ago, you might not get that game. Now you get it immediately.”

So if everyone has the same science, where is the edge? There remains a human element. Richards mimics how Dixon constantly slides his thumbs and index fingers on the iPad touch screen to stop and rewind video. “He's rewinding, rewinding, stopping, taking notes,” Richards said. “His knowledge of the game is incredible.”

There also is an art in deciding how much to give players so they are not overloaded.

And there is only going to be more information: a handful of programs, such as Duke and Louisville, have installed the player-tracking system SportVU, a staple at every NBA arena.

It means the edge defenses enjoy today might only deepen tomorrow.

Travis Sawchik is a staff writer for Trib Total Media. Reach him at tsawchik@tribweb.com or via Twitter @Sawchik_Trib.