Viruses: Back to basics

It is very possible that in the foreseeable future information will be stored in DNA, and the term “virus” will get back to its literal meaning.

Do you remember where the term “virus” came from? Yes, I’m talking about biological viruses, after which IT security specialists named the computer programs that insert their own code into other objects to reproduce and propagate themselves.

It is very likely that soon this information technology term will regain its original meaning — researchers from Microsoft and the University of Washington have marked a new milestone in data storage by writing approximately 200MB of data in the form of a synthetic DNA.

You may ask: What’s the connection with biological viruses? The analogy is pretty direct — viruses insert their genetic code into the DNA of infected organisms, causing the DNA to reproduce the viruses instead of synthesizing the right proteins, which are vital.

The most aggressive viruses disrupt normal physiological processes to such an extreme extent that it leads to the death of the cells and, in the end — of the whole organism. Similarly, the most aggressive malware may render the infected information system absolutely useless, or “dead.”

Therefore, now that humankind has begun writing information in the form of the DNA, it may be worth starting to worry about protecting this data at the “hardware level.” But first, let me give you an overview of how this “hardware” works.

Inside DNA

DNA, which stands for deoxyribonucleic acid, is the largest molecule in our organism, and a carrier of genetic information. The closest IT analogue is the boot image, which enables the computer to start up and load the operating system. In most cases (with some exceptions I won’t touch in this post), after the operating system has been loaded into memory, the computer launches the executable modules required to support itself and perform the work it’s programmed to do. In the same manner, living cells in most cases use DNA to produce the “executable’s” — RNA (ribonucleic acids) sequences, which handle protein synthesis to sustain the organism and perform its functions.

All characteristics of the organism, from eye and hair color to any hereditary disorders, are stored in the DNA. They are encoded in a sequence of nucleotides — molecular blocks containing (for most known organisms) only four varieties of nitrogenous bases: adenine, guanine, thymine, and cytosine. These can be called “biological bits.” As you can see, mother nature has used a quaternary numeral system to encode genetic information, unlike human-made computers, which use binary code.

It’s worth mentioning that DNA has a built-in code correction function — most known DNA has two strands of nucleotides, wrapped one around another like a twisted-pair wire in a double helix.

These two strands are attached one to another with hydrogen bonds that form only between strictly defined pairs of nucleotides — when they complement each other. This ensures that information encoded into a given sequence of nucleotides in one strand corresponds to a similar sequence of complementary nucleotides in the second strand. That’s how this code-correction mechanism works — when decoded or copied, the first DNA strand is used as a source of data and the second is used as a control sequence. This indicates if a sequence of nucleotides, coding some genetic characteristic, has been damaged in one of the strands.

In addition, genetic characteristics are encoded into nucleotide sequences using redundant encoding algorithms. To explain how it works in the simplest case — imagine that every hereditary characteristic, written as a sequence of nucleotides, is accompanied by a checksum.

The sequences of nucleotides coding genetic characteristics, or genes, have been studied extensively in the 50 years since the discovery of DNA. Today you can have your DNA read in many labs or even online — via 23andme or similar services.

How scientists read DNA

Through the past few centuries, scientists have developed methods to determine the structure of minuscule objects, such as X-ray structure analysis, mass spectrometry, and a family of spectroscopy methods. They work quite well for molecules comprising two, three, or four atoms, but understanding the experimental results for larger molecules is much more complicated. The more atoms in the molecule, the harder it is to understand its structure.

Keep in mind that DNA is considered the largest molecule for a good reason: DNA from a haploid human cell contains about 3 billion pairs of bases. The molecular mass of a DNA is a few orders of magnitude higher than the molecular mass of the largest known protein.

In short, it’s a huge heap of atoms, so deciphering experimental data obtained with classical methods, even with today’s supercomputers, can easily take months or even years.

But scientists have come up with a sequencing method that rapidly accelerates the process. The main idea behind it: split the long sequence of bases into many shorter fragments that can be analyzed in parallel.

To do this, biologists use molecular machines: special proteins (enzymes) called polymerases. The core function of these proteins is to copy the DNA by running along the strand and building a replica from bases.

But we don’t need a complete copy of the DNA; instead, we want to split it into fragments, and we do that by adding the so-called primers and markers — compounds that tell the polymerase where to start and where to stop the cloning process, respectively.

Primers contain a given sequence of nucleotides that can attach itself to a DNA strand at a place where it finds a corresponding sequence of complementary bases. Polymerase finds the primer and starts cloning the sequence, taking the building blocks from the solution. Like every living process, all of this happens in a liquid form. Polymerase clones the sequence until it encounters a marker: a modified nucleotide that ends the process of building the strand further.

There is a problem, however. The polymerase, DNA strand, primers, markers, and our building blocks, all are dispersed in the solution. It’s therefore impossible to define the exact location where polymerase will start. We can define only the sequences from which and to which we’re going to copy.

Continuing to the IT analogy, we can illustrate it in the following manner. Imagine that our DNA is a combination of bits: 1101100001010111010010111. If we use 0000 as a primer and 11 as a marker, we will get the following set of fragments, placed in the order of decreasing probability:
0000101011,
00001010111,
0000101011101001011,
00001010111010010111.

Using different primers and markers, we will go through all of the possible shorter sequences, and then infer the longer sequence based on the knowledge of what it is composed of.

That may sound counterintuitive and complicated, but it works. In fact, because we have multiple processes in parallel, this method reaches quite a good speed. That is, a few hours compared with months or years — not very fast from IT perspective, though.

DNA and random access

After learning how to read DNA, scientists learned how to synthesize sequences of nucleotides. The Microsoft researchers were not the first to try writing information in the form of artificial DNA. A few years ago, researchers from EMBL-EBI were able to encode 739 kilobytes.

Two things make Microsoft’s work a breakthrough. First, the researchers have greatly increased the stored data volume, to 200MB. That’s not too far from the 750MB of data that is contained in every strand of human DNA.

However, what is really new here is that they have proposed a way of reading part of the DNA, approximately 100 bases (bio-bits) long, in each sequencing operation.

The researchers were able to achieve that by using pairs of primers and markers that enable them to read a certain set of nucleotides with a defined offset from the beginning of the strand. It’s not exactly the random access to a single bit, but the technology is close — block memory access.

Researchers believe that the main niche for such DNA memory could be high-density long-term memory modules. It definitely makes sense: the best known samples of flash memory provide a density of ~1016 bits per cubic centimeter, whereas the estimated density for DNA memory is three orders of magnitude higher: ~1019 bits per cubic centimeter.

At the same time, DNA is quite a stable molecule. Coupled with built-in redundant coding and error-correction schemes, data on it would remain readable years or even centuries after it being written.

Back to viruses

But what does it all mean from an information security standpoint? It means that the integrity of information stored in such a way may be threatened by organisms that have specialized in data corruption for billions of years — viruses.

We’re unlikely to see a boom of genetically modified viruses created to hunt coded synthetic DNA. It will simply be easier — for a long time — to modify data and insert malicious code when the data is digital, before it’s written to DNA.

But it’s an open question, how to protect such data from corruption by existing viruses. For example, polymerase will gladly replicate any DNA in the solution: for example, the DNA of the common flu virus.

So it may be worth noting if anyone was sneezing or coughing while you were writing an important file…

Tips