Rachel M Sherman1,2, Juliet Forman3,4, Valentin Antonescu3, Daniela Puiu3, Michelle Daya5, Nicholas Rafaels5, Meher Preethi Boorgula5, Sameer Chavan5, Candelaria Vergara6, Victor E Ortega7, Albert M Levin8, Celeste Eng9, Maria Yazdanbakhsh10, James G Wilson11, Javier Marrugo12, Leslie A Lange5, L Keoki Williams13, Harold Watson14, Lorraine B Ware15, Christopher O Olopade16, Olufunmilayo Olopade17, Ricardo R Oliveira18, Carole Ober19, Dan L Nicolae17, Deborah A Meyers20, Alvaro Mayorga21, Jennifer Knight-Madden22, Tina Hartert15, Nadia N Hansel6, Marilyn G Foreman23, Jean G Ford24, Mezbah U Faruque25, Georgia M Dunston26, Luis Caraballo12, Esteban G Burchard27, Eugene R Bleecker20, Maria I Araujo28, Edwin F Herrera-Paz29, Monica Campbell5, Cassandra Foster6, Margaret A Taub30, Terri H Beaty31, Ingo Ruczinski32, Rasika A Mathias6,31, Kathleen C Barnes5, Steven L Salzberg33,34,35,36.
Abstract
We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic.
Year: 2018 PMID: 30455414 PMCID: PMC6309586 DOI: 10.1038/s41588-018-0273-y
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Novel sequences in the African pan-genome.
| | Number of sequence contigs | Total length (bp) | Bases with no alignment to GRCh38 (< 80% identity) | Longest contig (bp) |
|---|---|---|---|---|
| Two ends placed | 302 | 667,668 | 431,656 | 20,732 |
| One end placed | 1,246 | 3,687,028 | 1,866,699 | 79,938 |
| Unplaced | 124,167 | 292,130,588 | 202,629,979 | 152,806 |
| Total | 125,715 | 296,485,284 | 204,928,334 | 152,806 |
| Non-private only | 33,599 | 80,098,092 | 50,044,650 | 152,806 |
Number and length of novel sequences in the African pan-genome. Bases with no alignment to GRCh38 were calculated by subtracting, from each contig's length, the lengths of all subsequences that aligned with at least 80% identity. The remainder represents truly novel sequence. Non-private insertions are insertions shared by at least two individuals in the CAAPA cohort.
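The subtraction described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(start, end, identity)` tuple layout, and the choice to merge overlapping alignments before subtracting (so that no base is subtracted twice) are all assumptions.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in contig coordinates


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching half-open intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double counting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def novel_bases(
    contig_length: int,
    alignments: Iterable[Tuple[int, int, float]],
    min_identity: float = 80.0,
) -> int:
    """Bases left after removing subsequences aligned at >= min_identity.

    `alignments` holds hypothetical (start, end, identity_percent) records
    for one contig aligned against GRCh38.
    """
    kept = [(s, e) for s, e, ident in alignments if ident >= min_identity]
    aligned = sum(e - s for s, e in merge_intervals(kept))
    return contig_length - aligned


# Illustrative values only: two overlapping alignments pass the 80% cutoff,
# the third (70% identity) is ignored.
example = novel_bases(1000, [(0, 300, 95.0), (200, 400, 85.0), (600, 700, 70.0)])
```

In this made-up example, 400 bases are covered by qualifying alignments, so 600 of the 1,000 bases count as having no alignment to GRCh38.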
Figure 1. Overview of methods.
Raw reads are aligned to GRCh38, and unaligned reads are assembled with MaSuRCA. Assembled contigs are then filtered for contaminants with Centrifuge, and contigs shorter than 1 kb are removed (blue box). Assembled contigs are placed based on their mates' alignment locations when possible, by checking whether over 95% of mates align to the same location. If such a placement is found, the exact breakpoint is determined via a nucmer alignment to the region for each end of the contig (yellow box). Contig placement locations are then compared between all individuals, nearby placements are clustered, and a representative is chosen. All contigs are then aligned to the representatives to determine which samples contain a given placed insertion. Contigs in or aligning to placed clusters are removed from the unplaced set, and the remaining unplaced contigs are aligned to one another with nucmer to remove redundancy, yielding a final nonredundant set of unplaced contigs (purple box).
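The mate-based placement check in the yellow box can be sketched as below. The caption only states that over 95% of mates must align to the same location; how "same location" is defined is not given here, so the fixed-size window (`window`), the function name, and the input layout are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def place_by_mates(
    mates: List[Tuple[str, int]],
    window: int = 1000,          # assumed definition of "same location"
    min_fraction: float = 0.95,  # "over 95% of mates" from the caption
) -> Optional[Tuple[str, int]]:
    """Return (chromosome, window_start) if mates agree on one location.

    `mates` holds the (chromosome, position) of each mate read that aligned
    to GRCh38 while its partner fell in the unaligned contig end.
    """
    if not mates:
        return None
    # Bin mate positions into fixed windows and find the most popular bin.
    bins = Counter((chrom, pos // window) for chrom, pos in mates)
    (chrom, bin_index), count = bins.most_common(1)[0]
    if count / len(mates) > min_fraction:
        return chrom, bin_index * window
    return None


# Illustrative input: 97 mates cluster on chr1, 3 stray mates land on chr2.
mates = [("chr1", 5_000_100 + i) for i in range(97)] + [("chr2", 42)] * 3
placement = place_by_mates(mates)
```

Fixed windows are a simplification: a true cluster that straddles a window boundary would be split across two bins, so a real pipeline would more likely cluster positions by distance. The exact breakpoint would still need the nucmer alignment step described above.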
African pan-genome contig presence/absence statistics.
| | Number of contigs | Mean # insertions per individual | Mean # individuals per insertion |
|---|---|---|---|
| Two ends placed | 302 | 120 (39.7%) | 363 (of 910) |
| One end placed | 1,246 | 212 (17.0%) | 155 (of 910) |
| Unplaced | 124,167 | 527 (0.4%) | 4 (of 910) |
| Total | 125,715 | 859 (0.7%) | 6 (of 910) |
| Non-private only | 33,599 | 758 (2.2%) | 21 (of 910) |
Statistics on the presence or absence of the African pan-genome (APG) contigs. Presence/absence was determined by aligning all raw contigs for each individual to the final set of APG contigs. Alignments of one or more contigs yielded a presence call if the alignments covered at least 80% of an APG contig with at least 90% identity. Additional presence calls were made for the placed contigs if the individual had a similar contig placed in the same location, even if the alignment thresholds were not met.
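The alignment-based presence rule can be illustrated with a short sketch. It assumes that identity is checked per alignment and that coverage is the union of the qualifying alignments on the APG contig; the caption does not specify either detail, and the function name and tuple layout are hypothetical. The location-based fallback for placed contigs is not modeled.

```python
from typing import Iterable, Tuple


def is_present(
    contig_length: int,
    alignments: Iterable[Tuple[int, int, float]],
    min_coverage: float = 0.80,
    min_identity: float = 90.0,
) -> bool:
    """Presence call for one APG contig in one individual.

    `alignments` holds (start, end, identity_percent) records in APG contig
    coordinates (half-open), one per raw-contig alignment.
    """
    kept = sorted((s, e) for s, e, ident in alignments if ident >= min_identity)
    covered = 0
    furthest_end = 0
    for start, end in kept:
        start = max(start, furthest_end)  # skip bases already counted
        if end > start:
            covered += end - start
            furthest_end = end
    return covered / contig_length >= min_coverage


# Illustrative values: two raw contigs together cover 850 of 1,000 bases.
call = is_present(1000, [(0, 500, 95.0), (400, 850, 92.0)])
```

With these made-up inputs the union of alignments covers 85% of the contig, so the call is a presence; if the second alignment dropped below 90% identity, coverage would fall to 50% and the call would be an absence.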
Figure 2. African pan-genome contig locations.
Map of the human genome showing the locations of all African pan-genome contigs that could be placed accurately along one of the chromosomes. Yellow lines represent intergenic locations, blue lines represent insertion points with RNA but not exonic annotations, and red lines indicate intersections within exons. All exon-intersecting insertions are labeled with the gene name. mRNA and lncRNA gene names are reported in Supplementary Table 4. In some cases insertions are too close together for lines to be resolved; when this occurs within exons, gene names are listed in order by chromosome position. Line width is not to scale.
Comparison of African pan-genome contigs to the Chinese and Korean genomes.
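| | Best GRCh38 alignment is 80–90% identical with 50–80% coverage: contigs | Best GRCh38 alignment is 80–90% identical with 50–80% coverage: length (bp) | Best GRCh38 alignment is < 80% identical or < 50% coverage: contigs | Best GRCh38 alignment is < 80% identical or < 50% coverage: length (bp) | Total: contigs | Total: length (bp) |
|---|---|---|---|---|---|---|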
| Matches Chinese only | 1,625 | 2,898,106 | 7,607 | 25,475,277 | 9,232 | 28,373,383 |
| Matches Korean only | 2,242 | 3,989,277 | 15,635 | 48,642,664 | 17,877 | 52,631,941 |
| Matches both | 5,385 | 9,720,662 | 9,713 | 29,981,048 | 15,098 | 39,701,710 |
| Total | 9,252 | 16,608,045 | 32,955 | 104,098,989 | 42,207 | 120,707,034 |
Contigs with a better alignment to the Chinese or Korean assemblies than to GRCh38. Alignments to the Chinese and Korean assemblies were required to have ≥ 90% identity and ≥ 80% coverage to be considered. Lengths shown are the sums of the contig lengths, not the alignment lengths.
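As a rough illustration of how a contig could be assigned to a row and column of this table, the sketch below applies the stated ≥ 90% identity and ≥ 80% coverage requirement to the Chinese and Korean alignments and then splits on the quality of the best GRCh38 alignment. The function names, the `(identity, coverage)` tuple layout, and the handling of boundary cases are assumptions; in particular, anything that is not "< 80% identical or < 50% coverage" is lumped into the intermediate column here.

```python
from typing import Optional, Tuple

Aln = Tuple[float, float]  # (identity_percent, coverage_percent)


def is_good_match(aln: Optional[Aln]) -> bool:
    """Threshold stated in the caption: >= 90% identity, >= 80% coverage."""
    return aln is not None and aln[0] >= 90.0 and aln[1] >= 80.0


def classify(
    grch38: Optional[Aln], chinese: Optional[Aln], korean: Optional[Aln]
) -> Optional[Tuple[str, str]]:
    """Return (row, column) labels, or None if the contig is not tabulated."""
    if is_good_match(grch38):
        return None  # GRCh38 already matches as well as the other assemblies
    zh, ko = is_good_match(chinese), is_good_match(korean)
    if not (zh or ko):
        return None
    row = "both" if zh and ko else ("Chinese only" if zh else "Korean only")
    weak = grch38 is None or grch38[0] < 80.0 or grch38[1] < 50.0
    column = "<80% identity or <50% coverage" if weak else "intermediate"
    return row, column


# Illustrative numbers only, not taken from the paper's data.
label = classify(grch38=(87.9, 24.9), chinese=(95.0, 85.0), korean=(93.0, 90.0))
```

Summing contig lengths per `(row, column)` label, rather than alignment lengths, would reproduce the kind of totals the caption describes.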
Figure 3. An example of an alignment that does not meet the 50% coverage, 80% identity threshold for a “reasonably good” alignment to GRCh38. The APG contig (CAAPA_113686) is shown at the top, with the best consistent alignments to GRCh38 in the middle. The three constituent alignments (blue, red, and yellow segments) cover 801 bases, just under 25% of the contig, with a cumulative weighted identity of 87.9%. CAAPA_113686 has a single near-perfect alignment to a Chinese HX1 contig (delineated by dotted lines) covering over 80% of CAAPA_113686 at over 90% identity. The APG contig also aligns very well to the Korean assembly (not shown).
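The coverage and cumulative weighted identity quoted in this caption can be computed from a set of constituent alignments roughly as follows. The caption does not define the weighting, so a length-weighted mean over non-overlapping segments is assumed here; the function names and the segment values are hypothetical and are not the three segments shown in the figure.

```python
from typing import List, Tuple

Segment = Tuple[int, float]  # (aligned_length_bp, identity_percent)


def weighted_identity(segments: List[Segment]) -> float:
    """Length-weighted mean identity (assumed meaning of 'weighted')."""
    total = sum(length for length, _ in segments)
    if total == 0:
        raise ValueError("no aligned bases")
    return sum(length * ident for length, ident in segments) / total


def coverage(segments: List[Segment], contig_length: int) -> float:
    """Fraction of the contig covered, assuming segments do not overlap."""
    return sum(length for length, _ in segments) / contig_length


# Made-up segments on a made-up 1,600 bp contig.
segments = [(100, 90.0), (300, 80.0)]
identity = weighted_identity(segments)
fraction = coverage(segments, 1600)
meets_threshold = fraction >= 0.50 and identity >= 80.0
```

As in the figure's example, a set of alignments can have an identity above 80% and still fail the "reasonably good" test because too little of the contig is covered.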