Initial sequencing and analysis of the human genome
Nature volume 409, pages 860–921 (2001)
A Corrigendum to this article was published on 01 August 2001
An Erratum to this article was published on 01 June 2001
Abstract
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Main
The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century1,2,3 sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.
The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.
Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.
The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.
Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.
• The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.
• There appear to be about 30,000–40,000 protein-coding genes in the human genome—only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.
• The full set of proteins (the ‘proteome’) encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
• Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.
• Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.
• The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.
• Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these ‘selfish’ elements may benefit their human hosts.
• The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
• Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with ‘dark G-bands’ in karyotypes.
• Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.
• More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.
In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruitfly Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, fly and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project. A full description of the methods is provided as Supplementary Information on Nature's web site (http://www.nature.com).
We recognize that it is impossible to provide a comprehensive analysis of this vast dataset, and thus our goal is to illustrate the range of insights that can be gleaned from the human genome and thereby to sketch a research agenda for the future.
Background to the Human Genome Project
The Human Genome Project arose from two key insights that emerged in the early 1980s: that the ability to take global views of genomes could greatly accelerate biomedical research, by allowing researchers to attack problems in a comprehensive and unbiased fashion; and that the creation of such global views would require a communal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projects helped to crystallize these insights, including:
(1) The sequencing of the bacterial viruses ΦX1744,5 and lambda6, the animal virus SV407 and the human mitochondrion8 between 1977 and 1982. These projects proved the feasibility of assembling small sequence fragments into complete genomes, and showed the value of complete catalogues of genes and other functional elements.
(2) The programme to create a human genetic map to make it possible to locate disease genes of unknown function based solely on their inheritance patterns, launched by Botstein and colleagues in 1980 (ref. 9).
(3) The programmes to create physical maps of clones covering the yeast10 and worm11 genomes to allow isolation of genes and regions based solely on their chromosomal position, launched by Olson and Sulston in the mid-1980s.
(4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel12 and Schimmel and Sutcliffe13, later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others14,15,16,17,18,19,20.
The idea of sequencing the entire human genome was first proposed in discussions at scientific meetings organized by the US Department of Energy and others from 1984 to 1986 (refs 21, 22). A committee appointed by the US National Research Council endorsed the concept in its 1988 report23, but recommended a broader programme, to include: the creation of genetic, physical and sequence maps of the human genome; parallel efforts in key model organisms such as bacteria, yeast, worms, flies and mice; the development of technology in support of these objectives; and research into the ethical, legal and social issues raised by human genome research. The programme was launched in the US as a joint effort of the Department of Energy and the National Institutes of Health. In other countries, the UK Medical Research Council and the Wellcome Trust supported genomic research in Britain; the Centre d’Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France; government agencies, including the Science and Technology Agency and the Ministry of Education, Science, Sports and Culture supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably the programme to sequence the yeast genome. By late 1990, the Human Genome Project had been launched, with the creation of genome centres in these countries. Additional participants subsequently joined the effort, notably in Germany and China. In addition, the Human Genome Organization (HUGO) was founded to provide a forum for international coordination of genomic research. Several books24,25,26 provide a more comprehensive discussion of the genesis of the Human Genome Project.
Through 1995, work progressed rapidly on two fronts (Fig. 1). The first was construction of genetic and physical maps of the human and mouse genomes27,28,29,30,31, providing key tools for identification of disease genes and anchoring points for genomic sequence. The second was sequencing of the yeast32 and worm33 genomes, as well as targeted regions of mammalian genomes34,35,36,37. These projects showed that large-scale sequencing was feasible and developed the two-phase paradigm for genome sequencing. In the first, ‘shotgun’, phase, the genome is divided into appropriately sized segments and each segment is covered to a high degree of redundancy (typically, eight- to tenfold) through the sequencing of randomly selected subfragments. The second is a ‘finishing’ phase, in which sequence gaps are closed and remaining ambiguities are resolved through directed analysis. The results also showed that complete genomic sequence provided information about genes, regulatory regions and chromosome structure that was not readily obtainable from cDNA studies alone.
Figure 1
Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags.
In 1995, genome scientists considered a proposal38 that would have involved producing a draft genome sequence of the human genome in a first phase and then returning to finish the sequence in a second phase. After vigorous debate, it was decided that such a plan was premature for several reasons. These included the need first to prove that high-quality, long-range finished sequence could be produced from most parts of the complex, repeat-rich human genome; the sense that many aspects of the sequencing process were still rapidly evolving; and the desirability of further decreasing costs.
Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced finished sequence with 99.99% accuracy and no gaps39. They also introduced bacterial artificial chromosomes (BACs)40, a new large-insert cloning system that proved to be more stable than the cosmids and yeast artificial chromosomes (YACs)41 that had been used previously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the human genome sequence. With successful completion of this phase, the human genome sequencing effort moved into full-scale production in March 1999.
The idea of first producing a draft genome sequence was revived at this time, both because the ability to finish such a sequence was no longer in doubt and because there was great hunger in the scientific community for human sequence data. In addition, some scientists favoured prioritizing the production of a draft genome sequence over regional finished sequence because of concerns about commercial plans to generate proprietary databases of human sequence that might be subject to undesirable restrictions on use42,43,44.
The consortium focused on an initial goal of producing, in a first production phase lasting until June 2000, a draft genome sequence covering most of the genome. Such a draft genome sequence, although not completely finished, would rapidly allow investigators to begin to extract most of the information in the human sequence. Experiments showed that sequencing clones covering about 90% of the human genome to a redundancy of about four- to fivefold (‘half-shotgun’ coverage; see Box 1) would accomplish this45,46. The draft genome sequence goal has been achieved, as described below.
The second sequence production phase is now under way. Its aims are to achieve full-shotgun coverage of the existing clones during 2001, to obtain clones to fill the remaining gaps in the physical map, and to produce a finished sequence (apart from regions that cannot be cloned or sequenced with currently available techniques) no later than 2003.
Box 1 Genome glossary
Sequence
Raw sequence Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Paired-end sequence Raw sequence obtained from both ends of a cloned insert in any vector, such as a plasmid or bacterial artificial chromosome.
Finished sequence Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Coverage (or depth) The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a ‘high-quality base’ is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Full shotgun coverage The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centres but is typically 8–10-fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.
Half shotgun coverage Half the amount of full shotgun coverage (typically, 4–5-fold random coverage).
Clones
BAC clone Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Finished clone A large-insert clone that is entirely represented by finished sequence.
Full shotgun clone A large-insert clone for which full shotgun sequence has been produced.
Draft clone A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each centre was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone.
Predraft clone A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.
Contigs and scaffolds
Contig The result of joining an overlapping collection of sequences or clones.
Scaffold The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
Fingerprint clone contigs Contigs produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.
Sequenced-clone layout Assignment of sequenced clones to the physical map of fingerprint clone contigs.
Initial sequence contigs Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly.
Merged sequence contigs Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as ‘sequence contigs’ where no confusion will result.
Sequence-contig scaffolds Scaffolds produced by connecting sequence contigs on the basis of linking information.
Sequenced-clone contigs Contigs produced by merging overlapping sequenced clones.
Sequenced-clone-contig scaffolds Scaffolds produced by joining sequenced-clone contigs on the basis of linking information.
Draft genome sequence The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
N50 length A measure of the contig length (or scaffold length) containing a ‘typical’ nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
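To illustrate the N50 definition above, a minimal sketch in Python follows; the function and the example contig lengths are ours, for illustration only, and are not part of the project's software.

```python
def n50(lengths):
    """Return the N50: the maximum length L such that contigs (or scaffolds)
    of length >= L together contain at least 50% of all nucleotides."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Example: contigs of 80, 70, 50, 30 and 20 kb (250 kb in total).
# Walking down from the largest, 80 + 70 = 150 kb >= 125 kb, so N50 = 70 kb.
print(n50([80, 70, 50, 30, 20]))  # 70
```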
Computer programs and databases
PHRED A widely used computer program that analyses raw sequence to produce a ‘base call’ with an associated ‘quality score’ for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated ‘quality score’, on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
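The score-to-probability conversion used by both programs can be stated directly; the short Python sketch below simply restates the 10^(-X/10) relationship and the worked values quoted above (the function names are illustrative and are not part of the PHRED or PHRAP packages).

```python
def error_probability(quality_score):
    """Convert a PHRED- or PHRAP-style quality score X into an error
    probability of approximately 10^(-X/10)."""
    return 10 ** (-quality_score / 10)

def accuracy(quality_score):
    """Base-call accuracy implied by a quality score."""
    return 1 - error_probability(quality_score)

print(accuracy(20))  # 0.99  -> the 'high-quality base' threshold defined above
print(accuracy(30))  # 0.999 -> the 99.9% accuracy quoted for a score of 30
```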
GigAssembler A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.
Public sequence databases The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.
Map features
STS Sequence tagged site, corresponding to a short (typically less than 500 bp) unique genomic locus for which a polymerase chain reaction assay has been developed.
EST Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.
SSR Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
SNP Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Genetic map A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Radiation hybrid (RH) map A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.
Strategic issues
Hierarchical shotgun sequencing
Soon after the invention of DNA sequencing methods47,48, the shotgun sequencing strategy was introduced49,50,51; it has remained the fundamental method for large-scale genome sequencing52,53,54 for the past 20 years. The approach has been refined and extended to make it more efficient. For example, improved protocols for fragmenting and cloning DNA allowed construction of shotgun libraries with more uniform representation. The practice of sequencing from both ends of double-stranded clones (‘double-barrelled’ shotgun sequencing) was introduced by Ansorge and others37 in 1990, allowing the use of ‘linking information’ between sequence fragments.
The application of shotgun sequencing was also extended by applying it to larger and larger DNA molecules—from plasmids (∼ 4 kilobases (kb)) to cosmid clones37 (40 kb), to artificial chromosomes cloned in bacteria and yeast55 (100–500 kb) and bacterial genomes56 (1–2 megabases (Mb)). In principle, a genome of arbitrary size may be directly sequenced by the shotgun method, provided that it contains no repeated sequence and can be uniformly sampled at random. The genome can then be assembled using the simple computer science technique of ‘hashing’ (in which one detects overlaps by consulting an alphabetized look-up table of all k-letter words in the data). Mathematical analysis of the expected number of gaps as a function of coverage is similarly straightforward57.
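As an illustration of the hashing idea, the Python sketch below indexes all k-letter words in a set of toy reads and reports pairs of reads sharing at least one word, which are candidate overlaps for assembly; it is a conceptual sketch only, not the assembly software used in the project. For the gap analysis, under the classical Lander–Waterman model the expected number of sequence islands from N random reads at coverage c is approximately N·e^(-c), which is the sense in which the calculation is straightforward.

```python
from collections import defaultdict

def candidate_overlaps(reads, k=6):
    """Index every k-letter word of every read in a look-up table, then
    report pairs of reads that share at least one word; such pairs are
    candidate overlaps for assembly."""
    table = defaultdict(set)                 # k-mer -> indices of reads containing it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            table[read[j:j + k]].add(i)
    pairs = set()
    for indices in table.values():
        pairs.update((a, b) for a in indices for b in indices if a < b)
    return pairs

# Toy example: reads 0 and 1 overlap (they share 'GACTTA' and 'ACTTAG').
reads = ["ACGTACGGACTTAG", "GACTTAGCCGTA", "TTTTGGGGCCCC"]
print(candidate_overlaps(reads))             # {(0, 1)}
```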
Practical difficulties arise because of repeated sequences and cloning bias. Small amounts of repeated sequence pose little problem for shotgun sequencing. For example, one can readily assemble typical bacterial genomes (about 1.5% repeat) or the euchromatic portion of the fly genome (about 3% repeat). By contrast, the human genome is filled (> 50%) with repeated sequences, including interspersed repeats derived from transposable elements, and long genomic regions that have been duplicated in tandem, palindromic or dispersed fashion (see below). These include large duplicated segments (50–500 kb) with high sequence identity (98–99.9%), at which mispairing during recombination creates deletions responsible for genetic syndromes. Such features complicate the assembly of a correct and finished genome sequence.
There are two approaches for sequencing large repeat-rich genomes. The first is a whole-genome shotgun sequencing approach, as has been used for the repeat-poor genomes of viruses, bacteria and flies, using linking information and computational analysis to attempt to avoid misassemblies. The second is the ‘hierarchical shotgun sequencing’ approach (Fig. 2), also referred to as ‘map-based’, ‘BAC-based’ or ‘clone-by-clone’. This approach involves generating and organizing a set of large-insert clones (typically 100–200 kb each) covering the genome and separately performing shotgun sequencing on appropriately chosen clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced. One caveat is that some large-insert clones may suffer rearrangement, although this risk can be reduced by appropriate quality-control measures involving clone fingerprints (see below).
Figure 2
Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome.
The two methods are likely to entail similar costs for producing finished sequence of a mammalian genome. The hierarchical approach has a higher initial cost than the whole-genome approach, owing to the need to create a map of clones (about 1% of the total cost of sequencing) and to sequence overlaps between clones. On the other hand, the whole-genome approach is likely to require much greater work and expense in the final stage of producing a finished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting in under-representation of some regions in either large-insert or small-insert clone libraries.
There was lively scientific debate over whether the human genome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers58 stimulated these discussions with a specific proposal for a whole-genome shotgun approach, together with an analysis suggesting that the method could work and be more efficient. Green59 challenged these conclusions and argued that the potential benefits did not outweigh the likely risks.
In the end, we concluded that the human genome sequencing effort should employ the hierarchical approach for several reasons. First, it was prudent to use the approach for the first project to sequence a repeat-rich genome. With the hierarchical approach, the ultimate frequency of misassembly in the finished product would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions in which the assembly was incorrect.
Second, it was prudent to use the approach in dealing with an outbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two different copies of the human genome. Accurate sequence assembly could be complicated by sequence variation between these two copies—both SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale structural heterozygosity (which has been documented in human chromosomes). In the hierarchical shotgun method, each large-insert clone is derived from a single haplotype.
Third, the hierarchical method would be better able to deal with inevitable cloning biases, because it would more readily allow targeting of additional sequencing to under-represented regions. And fourth, it was better suited to a project shared among members of a diverse international consortium, because it allowed work and responsibility to be easily distributed. As the ultimate goal has always been to create a high-quality, finished sequence to serve as a foundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additional cost, if any.
A biotechnology company, Celera Genomics, has chosen to incorporate the whole-genome shotgun approach into its own efforts to sequence the human genome. Their plan60,61 uses a mixed strategy, combining whole-genome shotgun coverage generated by the company with the publicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium. If the raw sequence reads from the whole-genome shotgun component are made available, it may be possible to evaluate the extent to which the sequence of the human genome can be assembled without the need for clone-based information. Such analysis may help to refine sequencing strategies for other large genomes.
Technology for large-scale sequencing
Sequencing the human genome depended on many technological improvements in the production and analysis of sequence data. Key innovations were developed both within and outside the Human Genome Project. Laboratory innovations included four-colour fluorescence-based sequence detection62, improved fluorescent dyes63,64,65,66, dye-labelled terminators67, polymerases specifically designed for sequencing68,69,70, cycle sequencing71 and capillary gel electrophoresis72,73,74. These studies contributed to substantial improvements in the automation, quality and throughput of collecting raw DNA sequence75,76. There were also important advances in the development of software packages for the analysis of sequence data. The PHRED software package77,78 introduced the concept of assigning a ‘base-quality score’ to each base, on the basis of the probability of an erroneous call. These quality scores make it possible to monitor raw data quality and also assist in determining whether two similar sequences truly overlap. The PHRAP computer package (http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then systematically assembles the sequence data using the base-quality scores. The program assigns ‘assembly-quality scores’ to each base in the assembled sequence, providing an objective criterion to guide sequence finishing. The quality scores were based on and validated by extensive experimental data.
Another key innovation for scaling up sequencing was the development by several centres of automated methods for sample preparation. This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems.
Coordination and public data sharing
The Human Genome Project adopted two important principles with regard to human sequencing. The first was that the collaboration would be open to centres from any nation. Although potentially less efficient, in a narrow economic sense, than a centralized approach involving a few large factories, the inclusive approach was strongly favoured because we felt that the human genome sequence is the common heritage of all humanity and the work should transcend national boundaries, and we believed that scientific progress was best assured by a diversity of approaches. The collaboration was coordinated through periodic international meetings (referred to as ‘Bermuda meetings’ after the venue of the first three gatherings) and regular telephone conferences. Work was shared flexibly among the centres, with some groups focusing on particular chromosomes and others contributing in a genome-wide fashion.
The second principle was rapid and unrestricted data release. The centres adopted a policy that all genomic sequence data should be made publicly available without restriction within 24 hours of assembly79,80. Pre-publication data releases had been pioneered in mapping projects in the worm11 and mouse genomes30,81 and were prominently adopted in the sequencing of the worm, providing a direct model for the human sequencing efforts. We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence. The explosion of scientific work based on the publicly available sequence data in both academia and industry has confirmed this judgement.
Generating the draft genome sequence
Generating a draft sequence of the human genome involved three steps: selecting the BAC clones to be sequenced, sequencing them and assembling the individual sequenced clones into an overall draft genome sequence. A glossary of terms related to genome sequencing and assembly is provided in Box 1.
The draft genome sequence is a dynamic product, which is regularly updated as additional data accumulate en route to the ultimate goal of a completely finished sequence. The results below are based on the map and sequence data available on 7 October 2000, except as otherwise noted. At the end of this section, we provide a brief update of key data.
Clone selection
The hierarchical shotgun method involves the sequencing of overlapping large-insert clones spanning the genome. For the Human Genome Project, clones were largely chosen from eight large-insert libraries containing BAC or P1-derived artificial chromosome (PAC) clones (Table 1; refs 82,83,84,85,86,87,88). The libraries were made by partial digestion of genomic DNA with restriction enzymes. Together, they represent around 65-fold coverage (redundant sampling) of the genome. Libraries based on other vectors, such as cosmids, were also used in early stages of the project.
Table 1 Key large-insert genome-wide libraries
The libraries (Table 1) were prepared from DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board. Briefly, the opportunity to donate DNA for this purpose was broadly advertised near the two laboratories engaged in library construction. Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis. Samples were obtained after discussion with a genetic counsellor and written informed consent. The samples were made anonymous as follows: the sampling laboratory stripped all identifiers from the samples, applied random numeric labels, and transferred them to the processing laboratory, which then removed all labels and relabelled the samples. All records of the labelling were destroyed. The processing laboratory chose samples at random from which to prepare DNA and immortalized cell lines. Around 5–10 samples were collected for every one that was eventually used. Because no link was retained between donor and DNA sample, the identity of the donors for the libraries is not known, even by the donors themselves. A more complete description can be found at http://www.genome.gov/10000921.
During the pilot phase, centres showed that sequence-tagged sites (STSs) from previously constructed genetic and physical maps could be used to recover BACs from specific regions. As sequencing expanded, some centres continued this approach, augmented with additional probes from flow sorting of chromosomes to obtain long-range coverage of specific chromosomes or chromosomal regions89,90,91,92,93,94.
For the large-scale sequence production phase, a genome-wide physical map of overlapping clones was also constructed by systematic analysis of BAC clones representing 20-fold coverage of the human genome86. Most clones came from the first three sections of the RPCI-11 library, supplemented with clones from sections of the RPCI-13 and CalTech D libraries (Table 1). DNA from each BAC clone was digested with the restriction enzyme HindIII, and the sizes of the resulting fragments were measured by agarose gel electrophoresis. The pattern of restriction fragments provides a ‘fingerprint’ for each BAC, which allows different BACs to be distinguished and the degree of overlaps to be assessed. We used these restriction-fragment fingerprints to determine clone overlaps, and thereby assembled the BACs into fingerprint clone contigs.
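The essential comparison can be sketched simply: two clones whose HindIII fragment sizes largely coincide, within measurement tolerance, are likely to overlap. The Python fragment below is only an illustration with hypothetical fragment sizes and a naive matching rule; the statistical scoring used to construct the map was more sophisticated.

```python
def shared_fragments(fp_a, fp_b, tolerance=0.5):
    """Count restriction fragments of clone A that match a fragment of
    clone B to within 'tolerance' kb; a high count relative to the number
    of fragments suggests that the two clones overlap."""
    remaining = list(fp_b)
    shared = 0
    for size in fp_a:
        for other in remaining:
            if abs(size - other) <= tolerance:
                shared += 1
                remaining.remove(other)
                break
    return shared

# Two hypothetical HindIII fingerprints (fragment sizes in kb).
clone_a = [23.1, 9.4, 6.6, 4.4, 2.3, 2.0]
clone_b = [23.0, 9.5, 6.7, 3.1, 2.3]
print(shared_fragments(clone_a, clone_b))  # 4 matching fragments
```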
The fingerprint clone contigs were positioned along the chromosomes by anchoring them with STS markers from existing genetic and physical maps. Fingerprint clone contigs were tied to specific STSs initially by probe hybridization and later by direct search of the sequenced clones. To localize fingerprint clone contigs that did not contain known markers, new STSs were generated and placed onto chromosomes95. Representative clones were also positioned by fluorescence in situ hybridization (FISH) (ref. 86 and C. McPherson, unpublished).
We selected clones from the fingerprint clone contigs for sequencing according to various criteria. Fingerprint data were reviewed86,90 to evaluate overlaps and to assess clone fidelity (to bias against rearranged clones83,96). STS content information and BAC end sequence information were also used91,92. Where possible, we tried to select a minimally overlapping set spanning a region. However, because the genome-wide physical map was constructed concurrently with the sequencing, continuity in many regions was low in early stages. These small fingerprint clone contigs were nonetheless useful in identifying validated, nonredundant clones that were used to ‘seed’ the sequencing of new regions. The small fingerprint clone contigs were extended or merged with others as the map matured.
The clones that make up the draft genome sequence therefore do not constitute a minimally overlapping set—there is overlap and redundancy in places. The cost of using suboptimal overlaps was justified by the benefit of earlier availability of the draft genome sequence data. Minimizing the overlap between adjacent clones would have required completing the physical map before undertaking large-scale sequencing. In addition, the overlaps between BAC clones provide a rich collection of SNPs. More than 1.4 million SNPs have already been identified from clone overlaps and other sequence comparisons97.
Because the sequencing project was shared among twenty centres in six countries, it was important to coordinate selection of clones across the centres. Most centres focused on particular chromosomes or, in some cases, larger regions of the genome. We also maintained a clone registry to track selected clones and their progress. In later phases, the global map provided an integrated view of the data from all centres, facilitating the distribution of effort to maximize coverage of the genome. Before performing extensive sequencing on a clone, several centres routinely examined an initial sample of 96 raw sequence reads from each subclone library to evaluate possible overlap with previously sequenced clones.
Sequencing
The selected clones were subjected to shotgun sequencing. Although the basic approach of shotgun sequencing is well established, the details of implementation varied among the centres. For example, there were differences in the average insert size of the shotgun libraries, in the use of single-stranded or double-stranded cloning vectors, and in sequencing from one end or both ends of each insert. Centres differed in the fluorescent labels employed and in the degree to which they used dye-primers or dye-terminators. The sequence detectors included both slab gel- and capillary-based devices. Detailed protocols are available on the web sites of many of the individual centres (URLs can be found at http://www.nhgri.nih.gov/genome_hub.html). The extent of automation also varied greatly among the centres, with the most aggressive automation efforts resulting in factory-style systems able to process more than 100,000 sequencing reactions in 12 hours (Fig. 3). In addition, centres differed in the amount of raw sequence data typically obtained for each clone (so-called half-shotgun, full shotgun and finished sequence). Sequence information from the different centres could be directly integrated despite this diversity, because the data were analysed by a common computational procedure. Raw sequence traces were processed and assembled with the PHRED and PHRAP software packages77,78 (P. Green, unpublished). All assembled contigs of more than 2 kb were deposited in public databases within 24 hours of assembly.
Figure 3
The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.
The overall sequencing output rose sharply during production (Fig. 4). Following installation of new sequence detectors beginning in June 1999, sequencing capacity and output rose approximately eightfold in eight months to nearly 7 million samples processed per month, with little or no drop in success rate (ratio of useable reads to attempted reads). By June 2000, the centres were producing raw sequence at a rate equivalent to onefold coverage of the entire human genome in less than six weeks. This corresponded to a continuous throughput exceeding 1,000 nucleotides per second, 24 hours per day, seven days per week. This scale-up resulted in a concomitant increase in the sequence available in the public databases (Fig. 4).
Figure 4
Total amount of human sequence in the High Throughput Genome Sequence (HTGS) division of GenBank. The total is the sum of finished sequence (red) and unfinished (draft plus predraft) sequence (yellow).
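The throughput figures quoted above can be checked with simple arithmetic; the sketch below assumes a haploid genome size of roughly 3.2 Gb, which is our assumption for the illustration rather than a number taken from this section.

```python
# Rough arithmetic check; the ~3.2 Gb genome size is an assumption here.
genome = 3.2e9                             # nucleotides (approximate)
week = 7 * 24 * 3600                       # seconds in a week
weeks_at_quoted_rate = genome / 1000 / week
print(round(weeks_at_quoted_rate, 1))      # ~5.3 weeks at 1,000 nt/s
# A sustained throughput exceeding 1,000 nucleotides per second therefore
# corresponds to onefold genome coverage in less than six weeks, as stated.
```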
A version of the draft genome sequence was prepared on the basis of the map and sequence data available on 7 October 2000. For this version, the mapping effort had assembled the fingerprinted BACs into 1,246 fingerprint clone contigs. The sequencing effort had sequenced and assembled 29,298 overlapping BACs and other large-insert clones (Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the genome (including both draft and finished sequence). The various contributions to the total amount of sequence deposited in the HTGS division of GenBank are given in Table 3.
Table 2 Total genome sequence from the collection of sequenced clones, by sequence status
Table 3 Total human sequence deposited in the HTGS division of GenBank
By agreement among the centres, the collection of draft clones produced by each centre was required to have fourfold average sequence coverage, with no clone below threefold. (For this purpose, sequence coverage was defined as the average number of times that each base was independently read with a base-quality score corresponding to at least 99% accuracy.) We attained an overall average of 4.5-fold coverage across the genome for draft clones. A few of the sequenced clones fell below the minimum of threefold sequence coverage or have not been formally designated by centres as meeting draft standards; these are referred to as predraft (Table 2). Some of these are clones that span remaining gaps in the draft genome sequence and were in the process of being sequenced on 7 October 2000; a few are old submissions from centres that are no longer active.