Embracing Big Data’s big bang
An article that appeared on the Science Daily website, written by Åse Dragland of SINTEF, a Scandinavian independent research organization, contends that 90 percent of all the world’s data has been generated in the last two years. This big bang of data — from sources such as scientific research, healthcare, the Internet, social media, cell phones, sensors, security cameras and personal purchase choices — has caused the term “Big Data” to enter our common vernacular and has resulted in a burgeoning job market for data scientists.
The Computer Science Department in the University of Missouri’s College of Engineering is one of 1,000 universities around the globe to partner with IBM in an academic initiative to train the necessary data-driven workforce. The computer giant predicts that by 2015, Big Data and related analytics will result in 4.4 million jobs in industries across the board as they mine new and accumulated data for insights and use the findings to gain competitive advantage.
In the Fall 2013 semester, the MU Computer Science Department began offering a Big Data analytics course taught by Adjunct Assistant Professor Susan Zeng, who works as a subject matter expert at Columbia’s IBM site. The company has supplied its InfoSphere BigInsights and InfoSphere Streams software, along with access to the IBM academic cloud, for the endeavor.
“We’re touching on different topics in depth so they have an idea where to start, where to go and how to find what they need in this fast-evolving field,” said Zeng. “I am giving them a key to get in — to enter the door of Big Data.”
Analyzing this explosion of disparate facts, figures and image datasets is not merely a matter of building a more powerful computer system. There is no such electronic beast. Rather, data scientists are using parallel processing — many computers working simultaneously — with new intricate mathematical processes, or algorithms, to combine and partition data in order to arrive at meaningful conclusions.
Historically, data storage also has presented a challenge, but instantaneous access to a broad array of networks through “the cloud” is a game changer in the field of Big Data.
“The speed of a single processor hit its limit five years ago,” said Yi Shang, a University of Missouri professor of computer science. “With cloud computing, there is more storage and there are economies of scale. Probably 70 to 80 percent of all computers are sitting idle at any given time,” he added. “Because they are idle, they can be shared, though in the parallel computing environment, programming is more difficult.”
Shang explained that in the past, programming languages largely have relied on sequential computation, which works like a recipe completed one step at a time.
“Now, the Hadoop software platform is the current de facto solution for Big Data processing on distributed computing systems,” Shang said. “It was developed by two researchers from Yahoo! in 2005. They wrote it in Java to let people use it as an open-source project.”
Apache Hadoop is used to process and query vast amounts of data on large clusters of commodity — affordable and accessible — hardware by partitioning the data.
Shang said that in 2004, Google presented a programming model called MapReduce that works in the parallel processing environment. Implemented on top of Hadoop, it searches through data to make new connections and discoveries, first mapping the input data into small units of key-value pairs — intermediate results that are then reduced to produce the output — all in a massively parallel fashion.
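The map-then-reduce pattern Shang describes can be sketched in a few lines of Python. This is a single-machine illustration of the idea, not Hadoop itself; the word-count task is the customary teaching example:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit an intermediate (key, value) pair for every word seen
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    # Reduce: group the intermediate pairs by key and combine the values
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big insights", "big clusters"]
print(reduce_phase(map_phase(docs)))
# {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}
```

On a real cluster, Hadoop runs many map tasks and many reduce tasks on different machines at once, which is what makes the approach scale to very large datasets.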
“Big Data analysis has resulted in concrete predictions to use in everyday situations. Because it is based on a greater amount of data, results are more accurate, more customized and more robust,” said Shang, who will lead a three-week workshop on Big Data analytics next summer. The workshop is open to MU students and all others interested.
As director of MU’s Informatics Institute (MUII), Chi-Ren Shyu has lived and breathed data for most of his professional career. The Shumaker Endowed Professor of Computer Science said that data’s transition to Big Data can be explained by three “Vs” — volume, variety and velocity — a model first put forward in 2001 by Doug Laney, a research vice president at the IT research and advisory firm Gartner.
The sheer volume of data being generated is staggering. Terabytes became petabytes, petabytes became exabytes, and exabytes became zettabytes, with each unit 1,024 times larger than the one before it.
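That progression can be checked with simple arithmetic, since each step multiplies by 2**10 = 1,024 (a minimal sketch using the 1,024-based convention the article cites):

```python
# Each unit is 1,024 (2**10) times the previous one, starting from
# a terabyte at 2**40 bytes.
units = ["terabyte", "petabyte", "exabyte", "zettabyte"]
bytes_per_tb = 2 ** 40

for i, unit in enumerate(units):
    size = bytes_per_tb * (1024 ** i)
    print(f"1 {unit} = 2**{40 + 10 * i} bytes = {size:,} bytes")
```

A single zettabyte thus works out to 2**70 bytes, which is why the 2.8 zettabytes created in 2012 is such an arresting figure.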
Fully 2.8 zettabytes of data were created in 2012, and it is predicted the volume will double by 2015. To put the number in perspective, Cisco Systems, one of the world’s leading manufacturers of network equipment, asserts the amount of visual information conveyed from the eyes to the brain of the entire human race in a single year is equal to 66 zettabytes.
Variety refers to the many different types of data collected from sources such as those already mentioned. Datasets heretofore not used for relational comparisons are increasingly being integrated, with novel results.
“Data generated by sensors on bridges that measure vibration and displacement can be combined with satellite images and measurements to see if the bridges are still safe,” Shyu said by way of a simple example.
Shyu also said the $13.2 million LIGHT2 project of Jerry Parker, associate dean for research and a professor of physical medicine and rehabilitation at the MU School of Medicine, is a great example of Big Data’s ability to unite divergent data.
“He is looking at 10,000 patients in the MU Health System with follow-up for several years with the goal of cutting costs and increasing patient education for improved outcomes,” said Shyu, who is working with Parker on the project.
Velocity points to the fact that data is not static. If its sheer momentum can be harnessed, especially through the use of distributed networks, Big Data discoveries and decisions can be made with extraordinary speed. Shyu said that real-time surveillance (R-TS) is a great example of what he refers to as velocity. One example would be its use in the case of bioterrorism or a disease outbreak: in real time, first responders, hospitals, physicians and other emergency workers could share and process a great amount of information, giving them a clear picture of the extent of the damage and allowing them to predict and manage the situation.
Crowdsourcing, or distributed problem solving, also has been made possible with Big Data technologies. Crowdsourcing involves gathering a large amount of input on a specific topic by enlisting a great number of “investigators” — often the general public — for rapid and robust results.
Another obvious Big Data application is bioinformatics, and research in this area has benefited greatly from the exploding analytic possibilities. That was witnessed last year in a novel discovery made in the labs of Dmitry Korkin, an associate professor of computer science who also is on the MUII faculty, and a colleague in plant sciences, Professor Melissa Mitchum. Through the use of data mining, the researchers were able to discover molecular mechanisms that make a specific soybean resistant to nematodes.
Shang points to collaboration with Tim Trull, an MU professor of psychological sciences, as a research project that has gained dynamism with new Big Data capabilities. The research is examining alcohol craving and addiction.
“In the past, research subjects were hooked up to sensors in a lab to collect data,” Shang said. “Now they wear sensors that continuously measure responses to stimuli that can be collected with a smart phone app. We can understand much more about patients from this field data.”
Shang said that in many instances, the capacity to combine new data with archived data has led to new insights.
The ability to collect large amounts of data from sensor networks has resulted in broad new applications. They range from the construction of so-called “smart” buildings that can regulate their own environments based on sensor observations, to work such as that being done by Professor Marge Skubic, of the college’s Electrical and Computer Engineering Department, using a variety of sensors to produce early illness alerts in senior housing.
By far, the Big Data applications that have generated the most interest are those that discover relationships for commercial purposes, or profit.
“IBM did a survey and found that two out of three businesses said they make decisions based on things they don’t know,” said Zeng. “By using [customer/consumer] data, they can make information-based decisions.”
Netflix is a good example of a data-driven business. Based on users’ documented viewing preferences, the company suggests additional titles that might match their data profiles.
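The idea behind such preference-based suggestions can be illustrated with a toy sketch. The viewing data and the scoring rule below are invented for illustration and are not Netflix’s actual system:

```python
# Toy preference-based recommender: suggest titles liked by users
# whose viewing history overlaps with yours (illustrative only).
history = {
    "alice": {"Drama A", "Thriller B", "Doc C"},
    "bob":   {"Drama A", "Thriller B", "Comedy D"},
    "carol": {"Doc C", "Comedy D"},
}

def recommend(user):
    seen = history[user]
    scores = {}
    for other, titles in history.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # how similar the two tastes are
        for title in titles - seen:   # only suggest unseen titles
            scores[title] = scores.get(title, 0) + overlap
    # Highest-scoring unseen titles first
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # ['Comedy D']
```

Production systems work on the same principle but at vastly larger scale, which is where the parallel processing techniques described earlier come in.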
“Every hour Wal-Mart has one million customers, generating 2.5 petabytes of data,” said Dong Xu, computer science department chair and James C. Dowell Professor. “The use of that data influences the way things are arranged in the store and which things are grouped together to encourage people to buy more.
“Because there is a lot of data in the business area, there are many good opportunities in marketing and forecasting with a lot of job opportunities in this area in the future,” he added.
In addition to the Big Data analytics course being taught by Zeng, several additional courses through computer science and electrical and computer engineering will make MU students completing them marketable as data scientists. Xu is working to create a dedicated course sequence in the MU Computer Science Department in high performance computing and Big Data analytics.
“When we train our students, we want their education to be dynamic. The reason our students are getting higher salaries is because we have kept this in mind as we have built our curriculum,” said Xu. “And by showing this cutting edge with classes like our Big Data analytics offering, we will attract more students to the program.”