
If you were asked to describe what your body is, what would you say? Most probably, you would describe the “stuff” that is inside – the major organs, the tissues, ligaments, and bones. And if you were a detail-oriented person, your description might run into pages.
But our body is not simply a collection of stuff and describing it as such misses a very important aspect of life – information. To put it succinctly, the body is an information machine, storing and processing a huge amount of information, information that miraculously does all the work that keeps us alive.
The information story begins in the cell. Scientists at the Max Planck Institute in Germany have come up with a recent estimate of the number of cells in the adult human body. According to their estimates, the average adult male has around 36 trillion cells (that is 36 followed by 12 zeros), while the average adult female has about 28 trillion. All living cells on Earth, without any known exception, store their hereditary information in the form of DNA: long, unbranched, paired polymer chains, always formed from the same four types of nucleotides, A, T, C, and G, also known as DNA bases. The two paired strands twist around each other to form the molecule’s familiar double helix. The arrangement of these letters into sequences creates a code that tells an organism how to form.

The ‘Ghost’ in the Cellular Machine
Every cell in the human body carries about 3 billion DNA bases, arranged in a particular sequence of the four-letter alphabet. This gives rise to an explosion of possible DNA sequences: the number of possible combinations is 4 raised to the power of 3 billion, which is 1 followed by about 1.8 billion zeros. How large is that? For comparison, the number of atoms in the observable universe is “only” 1 followed by 80 zeros. Hence, there is indeed a humongous amount of information in one strand of DNA, hidden from view. This is the ‘ghost’ in our cellular machines.
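For readers who want to check the arithmetic, here is a small Python sketch (the function name is only illustrative) that counts the decimal digits of 4 raised to a given power without ever constructing the gigantic number itself.

```python
import math

# Decimal digits of 4**n: log10(4**n) = n * log10(4), and the digit
# count is floor(that) + 1. This avoids building the huge number itself.
def digits_in_power_of_four(n: int) -> int:
    return math.floor(n * math.log10(4)) + 1

print(digits_in_power_of_four(3_000_000_000))  # about 1.8 billion digits
# By contrast, the count of atoms in the universe, ~10**80, has only 81 digits.
```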
Perspective from Information Theory
In 1948, a mathematician and engineer at Bell Telephone Laboratories by the name of Claude Shannon (1916-2001) wrote an earth-shaking paper with the title “A Mathematical Theory of Communication.” It is no exaggeration to say that this paper, far ahead of its time, laid the foundations of the internet age.

Shannon’s thesis was that if we define information in a way that can be quantified, then we can build a theory in which all kinds of information – text, images, sound, movies – can be encoded and sent over any communications channel (like the internet) while keeping the garbling of messages by noise to a minimum. That is why his paper is about “a theory of communication.”
We now know that Shannon’s definition of information also applies to the body, which, as I mentioned earlier, can be viewed as an information storage and processing machine, or an immensely sophisticated computer.
So what is information according to Shannon? In brief, Shannon defined information as “surprise”: a measure of how unexpected an event is, or how much uncertainty its occurrence resolves. If you think about it, this is a very natural definition of information. For example, if someone tells you something but you learn nothing new from them (i.e., there is no surprise), then you have gained no information.
Consider an unbiased coin. Each toss has two possible outcomes, which collapse to one outcome (heads or tails) once the coin lands. In the language of Shannon’s theory, we say that one bit of information has been gained. Incidentally, 8 bits make one byte (B), the B used in GB (gigabyte, or 1 billion bytes) in computer lingo. So, 1 bit of information is gained by tossing one unbiased coin. Note also that 1 = log 2, where the logarithm here is to base 2.
What about tossing two coins? There are now 4 possible outcomes, or two bits of information gained when both coins land. Notice that 2 = log 4 where the log is again to base 2. With three coins, there are 8 possible outcomes, or 3 bits of information gained when all three coins land. Once more, 3 = log 8.
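To make the pattern concrete, here is a tiny Python sketch of the rule we have been using (the function name is only illustrative): the bits gained equal the logarithm, to base 2, of the number of equally likely outcomes.

```python
import math

# Information gained, in bits, when one of n equally likely outcomes
# occurs: bits = log2(n).
def bits_from_outcomes(n: int) -> float:
    return math.log2(n)

print(bits_from_outcomes(2))  # one fair coin    -> 1.0 bit
print(bits_from_outcomes(4))  # two fair coins   -> 2.0 bits
print(bits_from_outcomes(8))  # three fair coins -> 3.0 bits
```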
What happens if the coin is biased? Suppose we have a coin that lands on heads twice as often as tails. According to Shannon, the information gained by tossing this coin is the weighted average of the surprise of all the outcomes, where the weights are their probabilities. For our coin, the probability of heads is 2/3 and the probability of tails is 1/3 (probabilities must sum to 1). The surprise of an outcome with probability p is -log p (the less likely an outcome, the more surprising it is), so the weighted average for this coin is -(2/3) log(2/3) - (1/3) log(1/3) ≈ 0.92 bits. Notice that less information is gained than from the toss of an unbiased coin, which makes sense: you already know that heads is twice as likely to come up as tails, compared to the 50-50 chance for the unbiased coin. With less uncertainty, there is also less surprise in learning the outcome, and thus less information gained.
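This weighted average is known as the Shannon entropy, and it is easy to compute. The short Python sketch below (the function name is illustrative) reproduces the 0.92-bit figure.

```python
import math

# Shannon entropy in bits: the probability-weighted average surprise,
# H = -sum(p * log2(p)) over all outcomes with nonzero probability.
def entropy_bits(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([1/2, 1/2]))  # unbiased coin -> 1.0 bit
print(entropy_bits([2/3, 1/3]))  # biased coin   -> about 0.918 bits
```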
What does the above have to do with DNA? The short answer is that with Shannon’s calculus, we have a handy way of quantifying the amount of information encoded in each strand of DNA. Since every cell has about 3 billion DNA bases arranged in a particular sequence of the 4-letter chemical alphabet, the number of possible sequences is 4 raised to the power of 3 billion. Log base 2 of this number is about 6 billion bits of information, or roughly 750 megabytes – hundreds of books’ worth of text in every cell! That’s not all – the information contained in DNA is only a fraction of the total information contained in every cell, and, as mentioned above, there are 36 trillion cells in an adult male human body. Even counting only the information in the DNA, an adult male human carries some 200 trillion billion bits of information, which goes to show how deeply life is invested in information.
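For transparency, here is the back-of-the-envelope arithmetic behind those figures, written as a short Python sketch; the round numbers are the same assumptions used in the text, and the variable names are only illustrative.

```python
# Back-of-the-envelope arithmetic with the round figures from the text.
bits_per_base = 2                    # log2(4) possible letters per base
genome_bases  = 3_000_000_000        # ~3 billion bases per genome
cells_in_body = 36_000_000_000_000   # ~36 trillion cells (adult male)

bits_per_genome = bits_per_base * genome_bases      # ~6e9 bits (~750 MB)
total_bits      = bits_per_genome * cells_in_body   # ~2.2e23 bits

print(f"{bits_per_genome:.1e} bits per genome")
print(f"{total_bits:.1e} bits across all cells")
```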
Although impressive, this is only part of the story of life on Earth. Living organisms are not just bags of information; they are computers. Therefore, a full understanding of life will come only from unravelling its computational mechanisms. That is still work-in-progress.
Meanwhile, computational scientists are wasting no time in harnessing the power of DNA for large-scale data storage. New techniques have been developed that could encode digital data in DNA to create the highest-density large-scale data storage scheme ever invented. Capable of storing 215 petabytes (215 million gigabytes) in a single gram of DNA, the system could, in principle, store every bit of data ever recorded by humans in a container about the size and weight of a couple of pickup trucks. Whether the technology takes off commercially will rest on its cost. As with most things in the world of technology, it’s only a matter of time.
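To give a flavour of how digital data can be written into DNA, here is a toy Python sketch that simply packs 2 bits into each base (the mapping and function names are mine, for illustration). Real storage schemes layer redundancy, error correction, and constraints on the sequences on top of an idea like this; none of that appears here.

```python
# Toy mapping: 2 bits per DNA base. Real schemes add error correction
# and sequence constraints; this sketch ignores all of that.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(dna: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"DNA")          # 3 bytes -> 24 bits -> 12 bases
print(strand)
assert decode(strand) == b"DNA"
```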