Symposium sheds light on student’s Big Data discovery
Big Data is a big deal these days, and for his work in making the Big Data used in biology research easier to handle, Andi Dhroso earned a big-time honor.
University of Missouri computer science master’s student Dhroso, who is also pursuing a doctoral degree through MU’s Informatics Institute (MUII), was one of five students awarded a Keystone Symposia Future of Science Fund Scholarship. The scholarship allowed him to present his work at a Keystone Symposium on “Big Data in Biology,” held March 23 and 24 in San Francisco. He works in the lab of Dmitry Korkin, associate professor of computer science, who also is on the MUII faculty.
“In simple terms, we tried to identify pieces of the DNA that are identical across multiple species. And when I say identical, I mean 100 percent identical,” Dhroso said. “Obviously the more diverse the species are, the smaller are the chances of having big pieces of the DNA matching.”
Dhroso’s work streamlines the process of identifying long identical multispecies elements, or LIMEs, in plant and animal genomes, building on earlier work by Korkin.
Korkin’s initial investigation stemmed from the 2004 discovery of 481 segments longer than 200 base pairs of DNA identical in the human, rat and mouse genomes.
Korkin carried out that work in collaboration with Chi-Ren Shyu, a professor of computer science and director of MUII, and an interdisciplinary group of plant scientists and genomics researchers. Published as “Long Identical Multispecies Elements in Plant and Animal Genomes” in the Proceedings of the National Academy of Sciences of the United States of America in 2012, it found even more such segments by comparing the genomes of six animals (dog, chicken, human, mouse, macaque and rat) and six plant species (Arabidopsis, soybean, rice, cottonwood, sorghum and grape).
The identical segments turned up in both syntenic areas, regions of genomes where genes are arranged in the same order, and nonsyntenic areas, where genes are in an arbitrary order. One might expect such results in syntenic areas, whereas finding them in nonsyntenic areas was unexpected.
“If any expected extreme conservation would happen, you would expect it in those regions,” Korkin said. “However, we later devised a computational algorithm that allows us to search for all such conserved elements irrespective if they are in those syntenic regions or not. And we did find those ones that were not in those regions, and that was even more unexpected, and then we looked at other species. People thought that was only for mammals, but we found them in plants.”
The discovery was made possible by an algorithm that compared all the genetic sequences on 48 computer processors, which did 1 million searches per hour — a process that took four days.
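To make the idea of a “100 percent identical” segment concrete, here is a minimal sketch in Python of one common way to find exact shared stretches between two sequences: index every fixed-length window of the first sequence, look up each window of the second, and extend each hit as far as the letters keep matching. This is an illustration only, not the algorithm used by Korkin’s group or by Dhroso; the function name `shared_exact_segments` and the toy sequences are assumptions made for the example, and a real search over whole genomes would also need to handle reverse strands and far larger inputs.

```python
def shared_exact_segments(seq_a, seq_b, min_len):
    """Return the maximal substrings of length >= min_len present in both sequences.

    Illustrative seed-and-extend approach, not the published LIME method.
    """
    # Index every window of length min_len in seq_a by its start position.
    index = {}
    for i in range(len(seq_a) - min_len + 1):
        index.setdefault(seq_a[i:i + min_len], []).append(i)

    found = set()
    for j in range(len(seq_b) - min_len + 1):
        for i in index.get(seq_b[j:j + min_len], []):
            # If the preceding letters also match, this hit is the interior
            # of a longer match that was already reported from its left end.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend to the right while the two sequences stay identical.
            end = min_len
            while (i + end < len(seq_a) and j + end < len(seq_b)
                   and seq_a[i + end] == seq_b[j + end]):
                end += 1
            found.add(seq_a[i:i + end])
    return found
```

With the toy inputs `"TTACGTACGGA"` and `"GGACGTACGTT"` and a minimum length of 6, the function returns the single shared segment `"ACGTACG"`. The 2004 study mentioned above used a threshold of 200 base pairs rather than 6.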
But Korkin wanted to pinpoint when this phenomenon began in evolutionary history. That would involve comparing all known eukaryotic genomes, including those that have been sequenced but not assembled, which made the task more difficult and infeasible with the algorithm he was using.
So Dhroso developed a new algorithm to expedite the process and used it to compare three fully assembled genomes (human, mouse and elephant shark) and three nonassembled genomes: Tetraodon, a member of the pufferfish family; coelacanth, a rare fish that has changed little over millions of years; and lamprey, a jawless fish. Comparing two fully assembled genomes previously took about 72 hours on 48 processors. With Dhroso’s algorithm, it took only two hours on a single processor. Spread evenly across the same 48-processor cluster, a job that used to take three days would now take less than three minutes.
It used to take between six and eight weeks on 48 processors to compare one assembled and one unassembled genome. Now, that process on a single processor takes about six hours.
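The size of those gains is easier to see when both sets of figures are converted to processor-hours. The short Python sketch below does that arithmetic using only the numbers quoted above. It assumes the work divides perfectly evenly across processors, which is the same idealization behind the “less than three minutes” estimate; the function names are made up for this example.

```python
def processor_hours(wall_hours, processors):
    """Total compute consumed: elapsed hours times processors in use."""
    return wall_hours * processors


def speedup(old_hours, old_processors, new_hours, new_processors=1):
    """Ratio of old to new compute cost, measured in processor-hours."""
    return (processor_hours(old_hours, old_processors)
            / processor_hours(new_hours, new_processors))


# Two assembled genomes: 72 hours on 48 processors vs. 2 hours on one.
assembled = speedup(72, 48, 2)                      # 1728.0

# Ideal time if the 2-hour job were split evenly over 48 processors.
cluster_minutes = 2 * 60 / 48                       # 2.5

# Assembled vs. unassembled: six to eight weeks on 48 processors
# vs. about 6 hours on one.
hours_per_week = 7 * 24
unassembled_low = speedup(6 * hours_per_week, 48, 6)    # 8064.0
unassembled_high = speedup(8 * hours_per_week, 48, 6)   # 10752.0
```

Under that assumption, the new approach uses roughly 1,700 times less compute for two assembled genomes and roughly 8,000 to 10,000 times less when one genome is unassembled.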
“It doesn’t matter how much memory you have,” Dhroso said. “The way that the processor works is that the [type of] memory the processor has is much, much smaller than the actual overall computer memory.
“When the CPU processes your data,” explained Dhroso, “it retrieves first from disc to memory and then into cache. The time it takes the CPU to process your data from cache is very, very small. Now, when you’re processing a large amount of data, you have to organize your data in such a fashion that all the data that’s loaded into your cache — you want to process it and never touch it again.”
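The access pattern Dhroso describes, loading a piece of data once, doing all the work on it, and moving on, can be sketched in a few lines of Python. The example below counts every window of length `k` in a sequence by walking through it in fixed-size blocks, with neighboring blocks overlapping by `k - 1` letters so that no window straddling a boundary is missed or counted twice. It is only an illustration of the single-pass idea: Python gives no direct control over the CPU cache, the function name and block size are assumptions, and the sketch says nothing about how Dhroso’s algorithm actually lays out its data.

```python
from collections import Counter


def count_windows_blockwise(data, k, block_size):
    """Count all length-k windows of `data`, reading it one block at a time."""
    if block_size < k:
        raise ValueError("block_size must be at least k")

    counts = Counter()
    # Each block contributes the windows that start in its first `step`
    # positions; the k - 1 trailing letters are shared with the next block.
    step = block_size - (k - 1)
    for start in range(0, len(data) - k + 1, step):
        block = data[start:start + block_size]
        # Finish all work on this block before moving to the next one.
        for i in range(len(block) - k + 1):
            counts[block[i:i + k]] += 1
    return counts
```

The result is identical to counting windows over the whole sequence at once; only the order in which the data is touched changes. In a real implementation the table of counts would itself have to be organized with the cache in mind, which this sketch does not attempt.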
The researchers agree that being able to compare DNA across genomes could open doors for more practical applications in the future.
As Dhroso was finishing his research, Korkin came across the “Big Data in Biology” symposium planned by Keystone Symposia. Dhroso submitted his abstract near the end of 2013 and was accepted, joining fellow recipients from Harvard, Yale, the University of Toronto and Rockefeller University.
“It’s huge,” the graduate student said. “It was just great to be recognized and to get accepted. Not that I thought the work or results were not great. There’s such a huge competition that you don’t know how well you fare against some of the top schools. So just to get accepted on such a prestigious level, it’s great.”
Korkin said he’s pleased to have a student’s hard work and achievement showcased in a proper fashion.
“There are currently a lot of efforts in the Informatics Institute and MU overall in Big Data,” Korkin said. “So being able to become visible through the student’s work is the best possible way to be recognized. These [students] are proposing innovative research and getting acknowledged by very prestigious events.”