Mizzou Engineering team develops video retrieval system based on captioning

June 29, 2023


It’s not hard to search for a cute cat video on the internet. But if you want to find a video of a cat chasing a dog down a street on a sunny day, it gets trickier.

Now, a Mizzou Engineering team has developed a novel system that relies on image captioning to find video clips of specific objects and scenes. Associate Professor Praveen Rao and his former Ph.D. student, Arun Zachariah, outlined the method at an Association for Computing Machinery (ACM) conference earlier this month.

“People watch a lot of videos,” Rao said. “With such a large database of videos, you need a system that can efficiently retrieve videos of interest to the user. We built a retrieval system where, given a query video, it will find you the top most relevant video clips in the database.”


Current video retrieval methods use deep learning, a type of artificial intelligence, to extract features from the images within a video. Those features capture relationships between scenes and objects.

Because AI can already identify objects and scenes and provide captions for them, Rao and his team opted to base their system on that text rather than images.

The team previously developed a similar system, known as QIK, for image retrieval. That system uses AI to automatically generate captions for photos and then uses those captions to retrieve images.

“We took that previous system and built out a video retrieval system using the same principles,” he said.

Here’s how it works. A given video clip is reduced to a small number of frames that are representative of the entire video. For instance, if you want a video of a man walking to a car, getting in and driving away, the system would break it into three frames that represent the context of the entire video: a man walking, getting in a car and driving away.

When a user poses a query that asks for such a video, videos with all three components would rank highest, while videos of a man simply walking would have a lower ranking, and so on.
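The ranking idea described above can be illustrated with a minimal Python sketch. It is not the team's implementation: the article does not say how captions are compared, so the word-overlap (Jaccard) similarity, the 0.5 threshold, the function names and the sample captions are all assumptions made for illustration. Each video is represented only by the captions of its key frames, and a video scores higher the more query components it matches.

```python
# Illustrative sketch only. The captions, the threshold, and the word-overlap
# scoring are assumptions, not the method published by the Mizzou team.

def tokens(caption):
    """Lowercase a caption and split it into a set of words."""
    return set(caption.lower().split())


def caption_similarity(a, b):
    """Jaccard overlap between the word sets of two captions."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_video(query_captions, video_captions, threshold=0.5):
    """Fraction of query captions matched by at least one key-frame caption."""
    if not query_captions:
        return 0.0
    matched = sum(
        1
        for q in query_captions
        if any(caption_similarity(q, v) >= threshold for v in video_captions)
    )
    return matched / len(query_captions)


def rank_videos(query_captions, database, top_k=3):
    """Return the top_k (video_id, score) pairs, highest score first."""
    scored = [
        (video_id, score_video(query_captions, captions))
        for video_id, captions in database.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical key-frame captions for a query clip and a tiny video database.
query = ["a man walking", "a man getting in a car", "a man driving a car"]
database = {
    "clip_full": ["a man walking", "a man getting in a car", "a man driving a car"],
    "clip_walk": ["a man walking", "a man walking on a street"],
    "clip_cat": ["a cat sitting on a couch"],
}

ranking = rank_videos(query, database)
for video_id, score in ranking:
    print(f"{video_id}: {score:.2f}")
```

In this toy database, the clip containing all three components ranks first, the clip of a man simply walking ranks below it, and the unrelated clip ranks last. A real system would likely compare captions with something more robust than word overlap.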

“It’s a different approach compared to what the community has been doing,” Rao said. “That’s why this work is really exciting.”

While more research is needed to scale the work up for online searches, Rao said the program could be adapted for specific fields such as defense, health care or e-commerce.

“What we’ve built is a local system that could be used by a company with a collection of videos they want to search,” Rao said. “To create a large-scale system would require more research and development, but the concept can be applied to any number of videos.”

Learn more about computer science at Mizzou Engineering.