The University of South Carolina’s Research Computing (RC) team serves as a central resource for all research computing on campus. Paul Sagona, the team’s Interim Executive Director, describes their work as “research facilitation”: they show faculty and students how to use existing tools or develop innovative solutions, enabling them to focus on their research. Sometimes faculty ask for help; other times the team analyzes patterns of technology usage to identify inefficiencies and improve processes to save money, time, or resources.
"Bioinformatics has lots of new tools coming out but it’s a huge gap for scientists to learn them. As we move toward cloud infrastructure there will be a smaller gap in training as scientists will have less need to code. It will become easier to move from on prem to cloud."
Behzad Torkian, Senior Applications Scientist, University of South Carolina
The challenge: how to redesign their workflow to speed data analysis
In March 2018 the team at the Molecular Microbial Biology Lab, led by Dr. Sean Norman, reached out to the RC team with a problem: the lab’s researchers were collecting much more metagenomics data than they could easily process and use, which had created a large backlog. The field of metagenomics studies genetic material from environmental sites to create a better picture of the world’s complex and diverse microbial ecology; USC’s team was gathering environmental samples at a coastal pond in the Bahamas to help understand how climate change impacts ecosystems. After collecting each sample, they had to amplify and sequence the genetic material, then run the analysis. With tens of terabytes of data per sample and hundreds of terabytes to compute, the research was computationally intensive and expensive.
The solution: moving the workflow onto Google Cloud
To solve this challenge, the RC team turned to Google Cloud. For Behzad Torkian, Senior Applications Scientist, the biggest factors were the global reach and lightning-fast speed of its network and the control it gives researchers over their own work: “you need to move the data between the nodes and you want the freedom to use the tools the way you want, when you want,” he said. Another important factor was the opportunity to work with Google’s support team, whom they met during a campus IT visit. “Our collaborative process with Google was fantastic,” Sagona says. Bob Doran, Application Scientist at USC, adds, “they were super at communicating and excited about what we were doing.”
To start, the team came up with a plan to mimic their existing high performance computing (HPC) cluster on Google Cloud Platform (GCP), with flexible storage options and dynamically installed software that would scale with the data. They set up a read-only persistent solid-state disk on GCP to hold the reference database and metagenomic samples, then started with single tests and small runs through a compute stack on Google Compute Engine that was distributed over 32-core instances. The results were written to GCP cloud buckets. Integrating Slurm in GCP enabled them to schedule, scale, and resubmit the jobs automatically. After troubleshooting along the way to verify results, they finally moved the whole job to GCP. “We ran the job on 124,352 cores concurrently. We ran on 3,886 nodes. And we did that in 16.5 hours,” Sagona reports. “The transition to GCP was enormously successful and greatly enhanced this research. It demonstrated that we could process these massive data sets in just a fraction of the time.” The difference was dramatic: a month’s worth of new samples that would have taken seven years to process on a personal computer, or three months on a local cluster, took about sixteen hours on GCP. This means that a year’s worth of data backlog that would take fifty years on a personal computer or two years on a local cluster will take just three days on GCP. “That’s a big deal,” Torkian states, in part because it allows researchers to demonstrate more progress within the short time frames of typical research grants.
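The figures Sagona quotes can be cross-checked with a little arithmetic. The following Python sketch is illustrative only and is not the team's tooling: the function names are hypothetical, and the 30-day month used to convert "three months on a local cluster" into hours is an assumption, so the resulting speedup is a rough estimate rather than a reported result.

```python
# Figures quoted in the article.
CORES = 124_352
NODES = 3_886
RUN_HOURS = 16.5
LOCAL_CLUSTER_MONTHS = 3

# Assumption (not from the article): one month is 30 days of wall-clock time.
HOURS_PER_MONTH = 30 * 24


def cores_per_node(cores: int, nodes: int) -> int:
    """Return the per-node core count, requiring an even split."""
    if cores % nodes != 0:
        raise ValueError("cores do not divide evenly across nodes")
    return cores // nodes


def core_hours(cores: int, hours: float) -> float:
    """Total compute consumed, in core-hours."""
    return cores * hours


def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


if __name__ == "__main__":
    print("cores per node:", cores_per_node(CORES, NODES))
    print("core-hours:", core_hours(CORES, RUN_HOURS))
    baseline = LOCAL_CLUSTER_MONTHS * HOURS_PER_MONTH
    print("speedup vs. local cluster:", round(speedup(baseline, RUN_HOURS), 1))
```

The per-node result matches the 32-core instances described above, and the run works out to roughly two million core-hours. Under the 30-day-month assumption, the GCP run is on the order of 130 times faster than the local cluster estimate.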
"We ran the job on 124,352 cores concurrently. We ran on 3,886 nodes. And we did that in 16.5 hours. The transition to GCP was enormously successful and greatly enhanced our research."
Paul Sagona, Interim Executive Director, Research Computing, University of South Carolina
The benefits beyond the cloud: reproducible results
Moving to the cloud proved to have other benefits as well. Saving time saves money, and running jobs in parallel was more cost-effective. The next run will use containers to take advantage of GCP’s modular organization, so that the workflow will be easily transferable and reproducible for other researchers. Torkian adds that “bioinformatics has lots of new tools coming out, but it’s a huge gap for scientists to learn them. As we move toward cloud infrastructure there will be a smaller gap in training as scientists will have less need to code. It will become easier to move from on prem to cloud.” According to Doran, new methods could also set new algorithmic standards for validating studies, which could change how science is conducted. The USC team also has high hopes for STRIDES, Google’s new partnership with the National Institutes of Health to share public biomedical datasets: better collaboration and better access to data will help their researchers make even more progress. Sagona speaks for the team at USC when he says, “we’re really looking forward to the future.”