Storage and processing of genetic data will exceed the computing needs of Twitter and YouTube in the next decade, experts have warned.
The computing resources necessary to handle genome data will soon exceed those of Twitter and YouTube, according to a team of biologists and computer scientists who are concerned about the ability to store the explosion of data.
The amount of information packed into just a few molecules of DNA is enough to fill a whole computer hard drive. Given the pace at which genetics is progressing, with sequencing costs dropping and ever more genomes being analysed, the amount of available genomic data will reach the exabyte scale – billions of gigabytes – by 2025, scientists predicted. This is largely because the data that must be stored for a single genome is 30 times larger than the genome itself.
In the report, published in the journal PLoS Biology, the team said that this outstrips YouTube’s projected annual storage needs of one to two exabytes of video by 2025 and Twitter’s maximum projection of 17 petabytes per year. It even exceeds the one exabyte per year projected for what will be the world’s largest astronomy project, the Square Kilometre Array in South Africa and Australia.
Professor Gene Robinson, director of the Carl R Woese Institute for Genomic Biology at the University of Illinois, said: “As genome-sequencing technologies improve and costs drop, we are expecting an explosion of genome sequencing that will cause a huge flood of data.
“The only way to handle this data deluge will be to improve the computing infrastructure for genomics.”
He added: “Genomics will soon pose some of the most severe computational challenges that we have ever experienced. If genomics is to realise the promise of having a transformative positive impact on medicine, agriculture, energy production and our understanding of life itself, there must be dramatic innovations in computing. Now is the time to start.”
By 2025, the team expects as many as one billion people to have had their full genomes sequenced. Genomics data is currently doubling roughly every seven months. The computing experts described the challenge as a 'four-headed beast', a reference to the separate problems of data acquisition, storage, distribution and analysis.
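To give a sense of what a seven-month doubling time means, the short Python sketch below works through the arithmetic. It assumes, purely for illustration, that the doubling rate stays constant; that assumption and the function name `growth_factor` are not taken from the study.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Return how many times larger a quantity becomes after `months`,
    assuming it doubles every `doubling_months` (illustrative assumption:
    the doubling rate never changes)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Print the multiplication factor after 1, 5 and 10 years.
    for years in (1, 5, 10):
        factor = growth_factor(12 * years)
        print(f"After {years:>2} years: about {factor:,.1f} times as much data")
```

At that pace the volume of data would grow a little more than threefold each year and more than 100,000-fold over a decade, which is why even small changes in the doubling time make a large difference to long-range projections.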
However, not all experts agree. Narayan Desai, a computer scientist at communications giant Ericsson in San Jose, California, is not impressed by the way the study compares genomics with the demands of other disciplines. “This isn’t a particularly credible analysis,” he told Nature. Desai pointed out that the paper gives short shrift to the way in which other disciplines handle the data they collect. For instance, the paper underestimates the processing and analysis involved in the video and text data collected and distributed by Twitter and YouTube, such as targeting advertisements and serving videos in diverse formats.