MU researchers improving big data approaches for scientists
In the era of big data, knowing where to collect data and where to send it for processing can be a tricky proposition for scientists working in fields where high-performance computing is not a core strength.
Some cloud data centers process data quickly but don’t have the security policies scientists want. Others have the desired security policies to protect data but deliver less-than-stellar performance. And when a scientist’s massive data set is housed and processed partly at the scientist’s own institution and partly in a cloud data center, how should the institution’s policies be aligned with the data center’s?
Needless to say, it can be hard for a scientist to find the right match in both performance and security policies for his or her big data handling needs. A group of MU Engineering researchers and collaborators from across campus are about to make that process of using local institution and cloud data center resources much easier.
MU Electrical Engineering and Computer Science Professor Prasad Calyam is the principal investigator on the two-year, $500,000 NSF award titled, “CC* Integration: End-to-End Performance and Security Driven Federated Data-Intensive Workflow Management.” The funded project also includes co-PIs Trupti Joshi from the School of Medicine, the College of Education’s Isa Jahnke, Timothy Middelkoop and Dong Xu from Engineering, Biochemistry’s Tommi White and Hunter College’s Saptarshi Debroy. This project builds upon promising.
Scientists collect data from a wide array of what are now internet-connected devices, including microscopes, medical devices, sensors and more. Most non-computing labs don’t have the capability of storing and processing data at a very high scale. So, scientists often not only use their campus supercomputer resources but also rely on cloud computing resources hosted offsite to house their data and derive insights to make scientific discoveries.
Housing data offsite brings up various concerns for scientists, including data integrity, confidentiality, accessibility, performance and more. Basically, scientists want to know that the place housing their data is secure and will let them work as seamlessly as if they could store and process the data on site. In some fields such as health sciences, data handling needs to comply with standards such as HIPAA (Health Insurance Portability and Accountability Act), and many enterprises also seek to use security standards and best practices from NIST (National Institute of Standards and Technology).
Meanwhile, the institutions housing the data have certain policies they expect users to adhere to in order to maintain security and usability for all members of their community. The needs of some scientists may not always mesh with a given entity’s policy.
The goal of the project is to advance the state of the art in usable cyber security through a streamlined process for gathering performance and security requirements and an alignment of local institution and cloud data center policies, and ultimately to minimize the risk of data loss for scientists due to cyber attack incidents.
“In my interactions spanning several years, I found that scientists want to use resources anywhere that’s available, and they want to have some level of control of their data security, also,” Calyam said. “If my data goes somewhere else outside my local control or institution boundary, I need to know that a cloud data center administrator won’t delete it. Nobody should be changing my data without my knowledge, and when I need it, a data center policy shouldn’t block me from using my data or even worse a data center policy shouldn’t lead to inadvertent sharing of my data with anyone I would not like to share.
“The innovative aspect of our project is that we will be cleverly brokering multiple cloud providers’ data services. We want to be able to broker a set of data services anywhere, and we want to tell the user the best place to go for their big data storage and fast processing needs.”
The eventual benchmark for the project will be to create software that allows scientists to enter their needs and find a computing/storage location that meets their usability and security requirements while also adhering to data service provider policies. The project team’s focus in the brokering design is to make it quick, reliable and user friendly so scientists with various levels of tech proficiency can all get the cloud services they require.
Projects such as these are exactly what Calyam hopes to facilitate in his new role as the Robert H. Buescher Faculty Fellow. He currently is working to establish a multi-university, multi-disciplinary “Cyber Research and Education Initiative” at MU in the area of Big Data Cyberinfrastructure to foster Information Technology innovations.
“This is the first grant in my new role that actually helps us advance our college’s research computing priorities and can potentially deliver what our campus’ domain scientists in bioinformatics, neuroscience, geospatial informatics and health informatics are looking for,” Calyam said. “It’s multidisciplinary and multidepartmental. Funding for this collaborative project is going from NSF to Engineering, Education, the School of Medicine and the Campus Research Computing groups.
“This is what true interdisciplinary collaboration looks like in my mind, and MU culture is naturally set up to foster such collaborations across diverse disciplines.”