From the human genome to the human proteome
As of 2003, according to the National Human Genome Research Institute, “the Human Genome Project produced a genome sequence that accounted for over 90% of the human genome”. At the time, this was as close to complete as the existing DNA-sequencing technology would allow. The final, gapless assembly of “the first truly complete human genome sequence” didn’t occur until March 31, 2022. There’s even an infographic devoted to the question of what made this work so challenging: https://www.genome.gov/sites/default/files/media/files/2021-08/NHGRI_T2T_Infographic.pdf
First and foremost, there’s the sheer volume of data required to solve the problem. “The human genome consists of about 3 billion bases in a precise order, each of which can be represented by a letter (G, A, T or C).” On top of that, the genome contains long stretches of repeated sequence that scientists had to wade through. New technology was developed to read longer stretches of DNA: instead of only 500 letters, it became possible to read more than 100,000 letters at a time. In this way, scientists were able to assemble the full length of the most difficult repeats… but the last 10% or so of the sequence was the most challenging.
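To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (about 3 billion bases, reads of 500 versus 100,000 letters). Everything else is an assumption made for illustration: the 2-bits-per-letter packing, the function names, and the 10,000-letter repeat are all hypothetical, and real sequencing relies on many overlapping reads rather than a single end-to-end tiling.

```python
GENOME_BASES = 3_000_000_000  # "about 3 billion bases", per the quote above
BITS_PER_BASE = 2             # assumption: four letters (G, A, T, C) fit in 2 bits


def packed_size_bytes(n_bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


def min_reads_to_tile(n_bases: int, read_length: int) -> int:
    """Fewest non-overlapping reads that would cover n_bases end to end."""
    return (n_bases + read_length - 1) // read_length  # ceiling division


def spans_repeat(read_length: int, repeat_length: int) -> bool:
    """True if a single read can cover a repeat plus at least one
    unique letter on each side, which is what anchors it in place."""
    return read_length >= repeat_length + 2


if __name__ == "__main__":
    print("Packed genome size (bytes):", packed_size_bytes(GENOME_BASES))
    for read_length in (500, 100_000):
        print(
            f"{read_length}-letter reads:",
            min_reads_to_tile(GENOME_BASES, read_length), "reads minimum;",
            "spans a hypothetical 10,000-letter repeat:",
            spans_repeat(read_length, 10_000),
        )
```

The last function hints at why longer reads mattered: a read shorter than a repeat looks the same wherever it lands inside that repeat, while a read long enough to reach unique sequence on both sides can be placed unambiguously.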
In a March 31, 2022 Scientific American article, author Eric D. Green (director of the National Human Genome Research Institute, NHGRI) says “figuring out how to sequence the missing parts of the human genome required a new generation of DNA-sequencing technologies and a new generation of computational approaches”. Green goes on to describe some of the many benefits this completed project has to offer, as it will “catalyze future advances in genomics, human biology and medicine”. The NHGRI notes, “The project was critical for advancing policies and earning increased support for the open sharing of scientific data.” This vast new source of open data can serve as the basis for solving other challenges, like predicting the shape of proteins.
If that sounds exciting to you, then I recommend this podcast from The Economist (https://www.economist.com/alphafold-pod), where scientists discuss their involvement with AlphaFold. This is the artificial-intelligence system that has released “over 350,000 structures, including the human proteome – all of the ~20,000 known proteins expressed in the human body – along with the proteomes of 20 additional organisms”.
If our only problems were finding solutions for some of the biggest computational challenges we’ve ever had, our future would look bright indeed.