June 18, 2020
The secret to surviving COVID-19 could be locked in our DNA. Researchers are analyzing genome sequences to find clues about why some people are more susceptible to the virus.
Right now, doing that work comes with a hefty price tag.
But Praveen Rao is developing a way for more scientists to unlock that information for free.
Rao is an associate professor with a joint appointment in Health Management & Informatics and Electrical Engineering & Computer Science. He received RAPID funding from the National Science Foundation to democratize genome sequence analysis by providing tools to analyze and compare genomes.
A human genome is a person’s complete set of DNA. It contains more than 3 billion DNA base pairs. In order to compare genome sequences, researchers use digital replicas. And that requires a massive amount of computer storage and memory.
“We see more and more that human genes may hold the answer to finding a cure or treatment for COVID-19,” Rao said. “But genome sequences are very large, and commercial cloud platforms cost a lot of money to analyze these sequences. This NSF project will provide a way to work with genome sequences at a large scale without having to go through an expensive commercial cloud platform.”
Leveling the Playing Field
Rao will use CloudLab to provide that system, including the software, necessary algorithms and storage space. That means anyone with a CloudLab account will be able to upload, analyze and compare genome sequences.
“What we wish to do is give them a browser interface where they can load their data and get their analysis results,” Rao said. “You don’t have to think about cost or any kind of processing that would otherwise have to be done by someone with a software background. You just load the sequences, and we’ll give you the results.”
Researchers with the resources to analyze genomes have already found links between genetic markers and symptoms of the virus. For instance, geneticists recently discovered a connection between Type A blood and respiratory problems in COVID-19 patients.
Rao’s work will level the playing field, allowing more scientists to analyze genomes for coronavirus clues.
“We want to make this available to everyone, not just for a chosen few with resources,” he said. “Even those who aren’t scientists will be able to use it. This is a simple yet effective way to make progress toward finding a cure for COVID-19.”
His research team includes fellow Health Management & Informatics faculty, as well as faculty from the Department of Pathology and Anatomical Sciences (School of Medicine) and the College of Agriculture, Food and Natural Resources.
Educational Components
The RAPID project has several educational offshoots.
First, it will allow NSF to test and expand CloudLab’s capabilities. CloudLab is an NSF-funded experimental testbed where researchers can explore new cloud technologies. This will be the first time it has been used to analyze genome sequences at scale.
“Part of the study is developing new techniques that will make this process more efficient and scalable,” Rao said. “The networking data will inform us about how we design software systems to support large-scale workloads on CloudLab.”
Rao plans to incorporate his findings into the classroom. He is starting a new course titled “Data Science in Healthcare” this fall. Next spring, he will begin offering a course titled “Principles of Big Data Management” to engineering students.
And he will offer a workshop next summer to introduce high school students to the concepts of data analysis and cloud computing.
“We want to disseminate the knowledge we have in context of our understanding of both disease and computer science technology,” Rao said. “Perhaps a student will look at this work and consider pursuing a STEM degree and eventually graduate school, and maybe even become the next generation of researchers and educators. That’s the hope.”