It’s all in the genes: how DNA can solve our data storage problem

Data is the DNA of modern economies. Quite literally, if work by one group of scientists takes hold.

The information age has transformed every aspect of our lives.

But each digital innovation brings its problems. A particularly thorny one is how to store the vast amounts of data our devices generate.

By 2017 there was demand for 14,800 billion gigabytes (14,800 exabytes) of data storage, up from 400 billion gigabytes in 2009, according to IDC. And that figure is expected to grow by a third each year.

With existing technology, that will become unsustainable. Although the unit cost of data storage has collapsed over recent years, so much information is being generated it’s getting harder and more expensive to keep using conventional electronic archiving systems, which take up large amounts of space and deteriorate over time.

This is why scientists like Dr Nick Goldman at the European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK, are looking at unconventional solutions. Like DNA.

DNA, Goldman explains, is a potentially brilliant data storage device given its “capacity for high density information encoding, its longevity and [its] proven track record as an information bearer”.

It’s certainly true that evolutionary pressures over hundreds of millions of years have made DNA an extremely efficient means of storing the large amounts of information it takes to generate the proteins and other compounds that make up complex organisms.

All this information is encoded in long, helical sequences of the four molecular units that make up DNA, much in the way the whole of the Internet and everything else digital, from compact discs to photographs taken by smartphones, is represented by 1s and 0s.

The idea of DNA doubling up as a storage device didn’t come to Goldman right away. It was while doing his day job at the EMBL-EBI – cataloguing all the DNA data being generated by genome scientists across the world – that he realised he had more genetic information than he could possibly store.

“I work in an institute where one of our jobs is to store on computers all the information about the DNA in all the genomes that scientists all over the world are creating. And that causes us quite a problem. It’s an enormous amount of information coming in every day,” he says.


After a great deal of hard deliberation, Goldman and his colleagues realised that the solution to their data storage problems was right under their noses: the DNA itself.

They realised that DNA isn’t just useful to store biological coding but could be a repository of any information.

Goldman and his team duly built up data packets of 180 base pairs. (Each unit of DNA is joined to a matching counterpart, and the pairs stack like the rungs of a twisting ladder, hence the double helix.) Of those base pairs, around 100 actually encode the information that’s being stored, 20 are needed to index it and the rest are needed to help run the process.
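The packet layout described above can be sketched in a few lines of Python. This is only an illustration of the 180/100/20 split quoted in the article, not the team's actual format: the base-4 index encoding, the "A" padding and the "N" filler standing in for "the rest" are all assumptions made for the example.

```python
FRAGMENT_LEN = 180  # bases per packet (from the article)
PAYLOAD_LEN = 100   # approximate bases carrying stored data (from the article)
INDEX_LEN = 20      # bases used for indexing (from the article)
OVERHEAD_LEN = FRAGMENT_LEN - PAYLOAD_LEN - INDEX_LEN  # "the rest": 60 bases

BASES = "ACGT"


def index_to_bases(i: int, width: int = INDEX_LEN) -> str:
    """Write i as a fixed-width base-4 number using A, C, G, T.

    Hypothetical index format, chosen only for illustration.
    """
    if not 0 <= i < 4 ** width:
        raise ValueError("index does not fit in the index field")
    digits = []
    for _ in range(width):
        i, remainder = divmod(i, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def split_into_fragments(payload: str) -> list[str]:
    """Cut a long base sequence into indexed 180-base packets."""
    fragments = []
    for n, start in enumerate(range(0, len(payload), PAYLOAD_LEN)):
        # Pad the final chunk with "A" (assumption; a real scheme would
        # also need to record the true payload length).
        chunk = payload[start:start + PAYLOAD_LEN].ljust(PAYLOAD_LEN, "A")
        # "N" is a placeholder for the bases that help run the process.
        filler = "N" * OVERHEAD_LEN
        fragments.append(index_to_bases(n) + chunk + filler)
    return fragments


packets = split_into_fragments("ACGT" * 60)  # 240 bases of example payload
print(len(packets), len(packets[0]))
```

On these figures only about 100 of every 180 bases, or roughly 56 per cent, carry the stored data itself.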

“The hard part is making that first copy,” Goldman explains. “Once you have one, it’s easy to make subsequent copies.”


“Initially, the technology would be used to make backup copies of existing data — things that you want to know are safe,” says Goldman. That’s important, because existing technologies tend to deteriorate over time and quickly become obsolete.

Magnetic storage, for instance, tends to degrade while technological advances mean that sometimes it’s hard to retrieve data written onto old systems. Who has a computer capable of reading floppy disks these days? Or, indeed, a music system able to play eight-track tapes?

DNA, on the other hand, has been around for as long as there’s been life on earth. It is also durable.

“You have to keep it dry – water breaks the strands – and the best way to do that is to keep it at temperatures below zero,” says Goldman. “Shading it from light also helps reduce mutation caused by radiation.”

But subject to those constraints, it will last for centuries.

Some DNA samples from under the Greenland ice sheet have been estimated to be half a million years old — even if dinosaur DNA hasn’t yet been discovered.

What’s more, DNA is so efficient that a single gram can store 215 million gigabytes, which means that every bit of data ever recorded by man could fit into a suburban living room.
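To get a feel for that density, the rough calculation below divides the IDC demand figure quoted earlier by the 215 million gigabytes per gram. It is back-of-the-envelope arithmetic on the article's own numbers and assumes no extra copies or packaging overhead.

```python
GB_PER_GRAM = 215e6        # 215 million gigabytes per gram (from the article)
DEMAND_2017_GB = 14_800e9  # 14,800 billion gigabytes (IDC figure quoted earlier)


def grams_needed(gigabytes: float) -> float:
    """Mass of DNA needed to hold the given amount of data."""
    return gigabytes / GB_PER_GRAM


kilograms = grams_needed(DEMAND_2017_GB) / 1000
print(f"{kilograms:.1f} kg")  # prints "68.8 kg"
```

On these assumptions, the whole of 2017's storage demand would weigh just under 70 kilograms.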

Indeed, DNA can pack more information into a sequence of a given length than conventional binary computing because it has four, rather than just two, different units (for details see the illustration at bottom).
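With four possible units, each position in a DNA sequence can stand for two binary digits. The sketch below shows the simplest such mapping. It is an assumption for illustration only, not the code Goldman's team used: practical schemes add error correction and avoid long runs of the same base.

```python
# Hypothetical mapping: each pair of bits becomes one of the four bases.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a base sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Reverse the mapping: four bases back into each byte."""
    bits = "".join(BITS_FOR_BASE[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(encode(b"Hi"))         # prints "CAGACGGC"
print(decode(encode(b"Hi")))  # prints b'Hi'
```

Two bytes, or 16 binary digits, fit into just eight bases.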

One fascinating possibility, Goldman says, would be to write data into living creatures’ DNA.

It’s already been done in small ways. For instance, the artist Eduardo Kac has translated a sentence from the Book of Genesis into DNA and then implanted it into bacteria. Others have written their initials into genetically altered organisms as a form of trademark.

But the prospect of having a hard drive in a bug (as opposed to one full of bugs) is unlikely. Living creatures probably won’t ever be vehicles to store vast amounts of data. While most organisms have long strings of “junk” DNA — DNA whose purpose scientists haven’t yet figured out or which is redundant — “there’s a limit to how much junk you could introduce to living organisms,” Goldman says.

There are costs to the organism of having too much DNA. Every time a cell divides, all the added DNA is replicated alongside the biologically necessary code, which consumes energy. At the same time, mutations can creep in with each division, which can distort the encoded DNA. After a number of generations, it can end up like a game of Chinese whispers.

So far, DNA data storage is still largely the preserve of the laboratory.

“We have had interest from companies looking to apply the technique, though they’ve been put off by cost,” Goldman explains.

Although the cost of reading and writing onto DNA has collapsed in recent years, it’s still fiercely expensive relative to conventional computer storage.

For example, the bill for sequencing the first complete human genome ran into the billions of dollars at the turn of the millennium. Now it’s around USD1,000. Writing remains far dearer: the most efficient form of encoding data onto DNA runs to around USD3,500 per megabyte, and each read of that data costs another tenth of that. By contrast, a US cent buys around a thousand megabytes of conventional data storage.
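The gap is easier to see when the figures above are put on the same per-megabyte footing. The sketch below uses only the article's numbers; the function name and the example of two reads are illustrative.

```python
DNA_WRITE_USD_PER_MB = 3500.0                    # encoding cost (from the article)
DNA_READ_USD_PER_MB = DNA_WRITE_USD_PER_MB / 10  # each read costs a tenth of that
CONVENTIONAL_USD_PER_MB = 0.01 / 1000            # one US cent buys ~1,000 MB


def dna_cost(megabytes: float, reads: int = 0) -> float:
    """Cost in USD of writing data to DNA once and reading it `reads` times."""
    return megabytes * (DNA_WRITE_USD_PER_MB + reads * DNA_READ_USD_PER_MB)


print(dna_cost(1, reads=2))  # prints 4200.0

ratio = DNA_WRITE_USD_PER_MB / CONVENTIONAL_USD_PER_MB
print(f"DNA is roughly {ratio:,.0f} times dearer per megabyte written")
```

On these figures, writing a megabyte to DNA costs about 350 million times as much as buying a megabyte of conventional storage.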


There are other hurdles too. For instance, researchers from the University of Washington recently encoded chains of malicious software into DNA that was being sequenced by a commercial machine. The software reprogrammed the gene sequencer and then took control of the computer.

But there are a growing number of commercial drivers making DNA storage cheaper and more widespread. Microsoft is understood to be working on launching an operational DNA-based data storage system by the end of the decade. Other companies are involved in DNA production, such as Twist Bioscience, which is collaborating with Microsoft, and DNAScript, Nuclera Nucleics, Evonetix, Helixworks and Genome Foundry among others.

The speed of innovation suggests DNA storage could well become widespread sooner than currently seems possible.

  1. Converting DNA
