Identical regions found across plant genomes by MU researchers
The idea of searching for ultraconserved regions across different plant genomes first occurred to Dmitry Korkin, an assistant professor of computer science, long before he came to the University of Missouri.
As a post-doctoral student at the University of California, San Francisco, he attended a 2006 talk by Gill Bejerano, the first author of a groundbreaking study published in Science that found identical regions across the human, rat and mouse genomes.
“I was sitting and listening to this talk and I thought, ‘What about plants?’” Korkin said. “As I started looking into it, I realized it’s a much more difficult question in plants.”
Ultraconserved regions are those with identical base pair sequences. Several hundred such regions have been identified in animal genomes. Bejerano’s research found 481 segments of 200 or more base pairs that were conserved with no deletions, insertions or substitutions across the three genomes he looked at.
In animal genomes, these regions of extreme conservation are mainly syntenic, meaning they are located on similar parts of a chromosome across species. In plant genomes, such similar parts are very short, so the regions of extreme conservation are not necessarily located on the same chromosome.
“You have a lot of short blocks [of the same genetic code], shuffled along the genomes consisting of hundreds of millions of base pairs,” Korkin said. So instead of starting with plant genomes, Korkin first developed fast new algorithms to detect extremely conserved sequences and applied them to protein strings, which are shorter and easier to deal with.
“The problem could be formulated for an arbitrary set of strings and not necessarily genome strings,” Korkin said. “So, we started advancing that approach, not for genomes but for proteins.”
That approach, while among the most efficient algorithms available for proteins, seemed to be a dead end for long genomic sequences. Then Chi-Ren Shyu, a professor of computer science and the director of the MU Informatics Institute, approached Korkin about an algorithm doctoral student Jeff Reneker had crafted.
“I realized that it could very well be applicable to this problem that I have,” Korkin said. “But, first, the algorithm had to be redone to be even more effective. We really wanted to push the limits of the algorithm.”
After some fine-tuning, the researchers tested it on the same genomes selected by Bejerano. “Not only did we find everything Bejerano found, we found more regions of conservation in these animal genomes,” Korkin said.
The newly discovered ultraconserved regions occurred in six animal genomes (dog, chicken, human, mouse, macaque and rat) and were located in different places in each genome. While Bejerano focused on conserved strings located on the same chromosome, the new algorithm could identify identical sequences located anywhere across any number of genomes.
“He could only consider similar regions,” Korkin said. “We could compare anything to anything so we could guarantee that we found any matches. So that makes the plant genome question more feasible.”
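The idea of comparing "anything to anything" can be illustrated with a small sketch. The Python below is not the researchers' algorithm, which the article does not describe; it is a toy approach, assuming short in-memory strings, that finds every identical stretch of at least `min_len` characters shared by all input sequences regardless of where it sits in each one. The function name `shared_identical_regions` and the sample sequences are hypothetical.

```python
def shared_identical_regions(sequences, min_len):
    """Return maximal substrings of length >= min_len found in every sequence.

    Toy sketch only: position-independent, but far too slow for real genomes.
    """
    if not sequences:
        return set()

    def kmers(s, k):
        # All substrings of length k in s (empty set if s is shorter than k).
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    # Seeds: substrings of exactly min_len present in every sequence.
    common = set.intersection(*(kmers(s, min_len) for s in sequences))

    ref, others = sequences[0], sequences[1:]
    regions = set()
    for i in range(len(ref) - min_len + 1):
        if ref[i:i + min_len] not in common:
            continue
        # Extend the seed to the right while every other sequence still contains it.
        end = i + min_len
        while end < len(ref) and all(ref[i:end + 1] in s for s in others):
            end += 1
        regions.add(ref[i:end])

    # Keep only regions that are not contained in a longer region.
    return {r for r in regions if not any(r != q and r in q for q in regions)}


# Hypothetical sequences: the shared stretch sits at a different offset in the third one.
example = ["TTACGTACGGA", "GGACGTACGTT", "CACGTACGCC"]
print(shared_identical_regions(example, min_len=6))
```

With a threshold like the 200 base pairs used in Bejerano's study, the same logic applies in principle, but scanning genomes of hundreds of millions of base pairs this way would be impractical, which is why a specialized algorithm was needed.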
The research, published in the Proceedings of the National Academy of Sciences, compared six plant species (Arabidopsis, soybean, rice, cottonwood, sorghum and grape) and found regions of extreme conservation, now called LIMEs (long identical multispecies elements), located in different places on the plant genomes. The paper, titled “Long Identical Multispecies Elements in Plant and Animal Genomes,” was co-authored by Gavin Conant, an assistant professor of animal sciences, and Chris Pires, an associate professor at the Life Science Center. Researchers from the Universities of Missouri, California-Berkeley and Arizona collaborated on the paper.
The research required analyzing a large amount of data: the program ran for four weeks on 48 computer processors to complete the 32 billion searches necessary. This is the first time such big data capabilities have been turned toward plant genome analysis, but Korkin said it’s just the beginning.
“There are a vast number of genomes that are sequenced,” Korkin said. “Are we going to find these regions in all of them? Probably not between plants compared to animals but there could be other groups where the same mechanisms evolved independently.”
Based on the research, Korkin said these “frozen” regions in plants and animals likely evolved independently and were probably maintained and created through different mechanisms.
Korkin plans to further develop the algorithm based on the study’s findings. He said knowledge of biological processes and evidence from the data — like the fact that these conservation regions are clustered — can inform the algorithm.
“We would like to come back to this algorithm with biological knowledge and redesign it based on the data and biological phenomena,” Korkin said. “We know that our data are not arbitrary. We can use this knowledge to streamline the algorithm.”
To Korkin, his interest in continuing to work with biological data reflects the increasingly interdisciplinary nature of computer science.
“No longer do we have only pure computer science or pure biology – very often we use computer technology to answer biological questions,” Korkin said. “The methods are actually driven by the vast amount of real experimental data. We are in the age of data-driven computational science.”