Background: The human genome contains “dark” gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions.

Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer’s Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in disease cases but not in controls.

Conclusions: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer’s disease due to insufficient sample size, we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.

Researchers have known for years that large, complex genomes, including the human genome, contain “dark” regions: regions where reads from standard high-throughput short-read sequencing technologies cannot be adequately assembled or aligned, which prevents us from identifying mutations within these regions that may be relevant to human health and disease. Some dark regions are what we term “dark by depth” (few or no mappable reads), while others are what we term “dark by mapping quality” (reads aligned to the region, but with a low mapping quality). Regions that are dark by depth may arise because the region is inherently difficult to sequence at the chemistry level (e.g., high GC content), essentially eliminating sequencing reads from that region altogether. Other dark regions arise, not because the sequencing is inherently problematic, but because of bioinformatic challenges. Specifically, many dark regions arise from duplicated genomic regions, where confidently aligning short reads to a unique location is not possible; we term these regions “camouflaged”.

These camouflaged regions are generally either large contiguous tandem repeats (e.g., centromeres, telomeres, and other short tandem repeats), or a larger specific DNA region that has been duplicated (e.g., a gene duplication) either in tandem or in a more distal genome region. In fact, many genes in the human genome were duplicated over evolutionary time and are still transcriptionally and translationally active (e.g., heat-shock proteins), while others have been duplicated but are considered inactive (i.e., pseudogenes). Regardless of whether the duplication is active, however, any genomic region that has been nearly identically duplicated and is large enough to prevent sequencing reads from aligning unambiguously will be “dark”, because the aligner cannot determine which genomic region the read originated from. When confronted with a read that aligns equally well to two or more camouflaged regions (commonly known as a multi-mapping read), standard next-generation sequence aligners, such as the Burrows-Wheeler Aligner (BWA), randomly map the read to one of the regions and assign a low mapping quality.
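To make the distinction between “dark by depth” and “dark by mapping quality” concrete, here is a minimal Python sketch. It assumes the mapping qualities (MAPQ) of the reads covering each position have already been extracted from an alignment. The numeric cutoffs and the function names (`classify_position`, `dark_regions`, `percent_dark`) are hypothetical, chosen only for illustration; the text above does not state thresholds, and this is not the authors' pipeline.

```python
from typing import List, Optional, Tuple

# Hypothetical cutoffs for illustration only; the text gives no numeric thresholds.
MAX_DARK_DEPTH = 5            # depth at or below this counts as "dark by depth"
LOW_MAPQ = 10                 # a read with MAPQ below this counts as low quality
MIN_LOW_MAPQ_FRACTION = 0.9   # share of low-MAPQ reads needed for "dark by mapping quality"

Region = Tuple[int, int, str]  # half-open interval (start, end) plus a label


def classify_position(mapqs: List[int]) -> Optional[str]:
    """Label one position from the MAPQ values of the reads covering it."""
    depth = len(mapqs)
    if depth <= MAX_DARK_DEPTH:
        return "depth"
    low = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low / depth >= MIN_LOW_MAPQ_FRACTION:
        return "mapq"
    return None  # adequately covered by confidently mapped reads


def dark_regions(per_position_mapqs: List[List[int]], start: int = 0) -> List[Region]:
    """Merge adjacent positions that share a label into half-open intervals."""
    regions: List[Region] = []
    run_start, run_label = 0, None
    for offset, mapqs in enumerate(per_position_mapqs):
        label = classify_position(mapqs)
        if label != run_label:
            if run_label is not None:
                regions.append((start + run_start, start + offset, run_label))
            run_start, run_label = offset, label
    if run_label is not None:
        regions.append((start + run_start, start + len(per_position_mapqs), run_label))
    return regions


def percent_dark(regions: List[Region], gene_start: int, gene_end: int) -> float:
    """Percentage of the half-open gene body [gene_start, gene_end) that is dark."""
    dark = sum(
        min(end, gene_end) - max(begin, gene_start)
        for begin, end, _ in regions
        if begin < gene_end and end > gene_start
    )
    return 100.0 * dark / (gene_end - gene_start)


# Toy data: one list of read MAPQs per position, starting at coordinate 100.
positions = [
    [60, 60],         # only 2 reads: dark by depth
    [],               # no reads: dark by depth
    [60] * 20,        # well covered, confidently mapped
    [0] * 19 + [60],  # 95% low-MAPQ reads, as multi-mapping reads might produce
    [0] * 20,         # all low-MAPQ reads
    [60] * 30,
    [60] * 30,
    [60] * 30,
]
regions = dark_regions(positions, start=100)
pct = percent_dark(regions, 100, 108)
print(regions)  # [(100, 102, 'depth'), (103, 105, 'mapq')]
print(pct)      # 50.0
```

The two labels are kept separate because they point to different causes: a depth-dark interval has almost no reads to work with, whereas a MAPQ-dark interval has reads whose placement the aligner could not decide. The same per-gene percentage is the kind of quantity behind statements such as “≥ 5% dark”, although the exact definition used for those figures is not given here.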