Helen Long, Richard Reeves, Michelle M Simon.
Abstract
Mice have emerged as one of the most popular and valuable model organisms in the research of human biology. This is due to their genetic and physiological similarity to humans, short generation times, availability of genetically homologous inbred strains, and relatively easy laboratory maintenance. Therefore, following the release of the initial human reference genome, the generation of the mouse reference genome was prioritised and represented an important scientific resource for the mouse genetics community. In 2002, the Mouse Genome Sequencing Consortium published an initial draft of the mouse reference genome which contained ~96% of the euchromatic genome of female C57BL/6J mice. Almost two decades on from the publication of the initial draft, sequencing efforts have continued to increase the completeness and accuracy of the C57BL/6J reference genome alongside advances in genome annotation. Additionally, new sequencing technologies have provided a wealth of data that has added to the repertoire of annotations associated with traditional genomic annotations, including but not limited to advances in regulatory elements, the 3D genome and individual cellular states. In this review we focus on the reference genome C57BL/6J and summarise the different aspects of genomic and cellular annotations, as well as their relevance to mouse genetic research. We denote a genomic annotation as a functional unit of the genome. Cellular annotations are annotations of cell type or state, defined by the transcriptomic expression profile of a cell. Due to the wide-ranging number and diversity of annotations describing the mouse genome, we focus on gene, repeat and regulatory element annotation as well as two relatively new technologies, 3D genome architecture and single-cell sequencing, outlining their utility in genetic research and their current challenges.
Year: 2022 PMID: 35124726 PMCID: PMC8913471 DOI: 10.1007/s00335-021-09936-7
Source DB: PubMed Journal: Mamm Genome ISSN: 0938-8990 Impact factor: 2.957
RefSeq and GENCODE Established Annotations
| Feature | Function / definition |
|---|---|
| Small cytoplasmic RNA (scRNA) | Small RNAs located in the cytoplasm |
| Ribosomal RNA (rRNA) | Non-coding RNAs that aid translation of messenger RNA to protein |
| Misc RNA | RNAs that cannot be denoted by other RNA classes/biotypes |
| Small nuclear RNA (snRNA) | Small RNA molecules, on average 150 bases long, found in the nucleus |
| Small nucleolar RNA (snoRNA) | Non-coding RNAs located in the nucleolus that modify other RNAs, mainly ribosomal RNAs |
| MicroRNA (miRNA) | Single-stranded non-coding RNA elements that regulate gene expression |
| Long non-coding RNA (lncRNA) | RNAs longer than 200 nucleotides that are not translated into functional proteins |
| All Pseudogenes | Mutated or deactivated sequences that mirror genes but lack introns and other sequences |
| Protein-coding gene | A functional unit of heredity, which contributes to a function or a phenotype |
| Signal recognition particle RNA (srpRNA) | RNAs located in the cytoplasm that aid the signal recognition particle complex by targeting proteins |
| Transfer RNA (tRNA) | Transfer RNAs are highly abundant RNAs ~70–100 bases in length that aid in translation |
| Small nuclear RNA (snRNA) | Small RNA molecules found in splicing speckles and Cajal bodies within the nucleus. They are ~150 nucleotides in length and process pre-messenger RNA |
Fig. 1 Number of annotations in: (a) GENCODE and RefSeq for mm39. Only annotations that could be obviously matched between resources have been included. (b) RepeatMasker for mm39
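Counts of the kind summarised in Fig. 1a can be approximated by tallying gene records per biotype in an annotation file. The following is a minimal sketch, not the procedure used for the figure. It assumes GTF-formatted input in which the biotype is stored under the attribute key `gene_type` (GENCODE style) or `gene_biotype` (RefSeq and Ensembl style). The function name `count_gene_biotypes` and the toy records are invented for illustration.

```python
from collections import Counter
import re


def count_gene_biotypes(gtf_lines):
    """Count 'gene' records per biotype in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Assumed attribute keys: gene_type (GENCODE) or gene_biotype (RefSeq/Ensembl)
        match = re.search(r'gene_(?:type|biotype) "([^"]+)"', fields[8])
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Toy records, invented for illustration (not real annotations)
example = [
    "##description: toy example",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tENSEMBL\tgene\t2000\t2100\t.\t-\t.\tgene_id "G2"; gene_type "miRNA";',
    'chr2\tHAVANA\tgene\t50\t500\t.\t+\t.\tgene_id "G3"; gene_type "protein_coding";',
]
biotype_counts = count_gene_biotypes(example)
print(biotype_counts.most_common())
```

For a real annotation file the same function could be called on an open file handle. Comparing resources would additionally require mapping biotype names between them, which this sketch does not attempt.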
RepeatMasker Established Annotations
| Feature | Function / definition |
|---|---|
| Satellite | Largely repeating short elements of AT-rich non-coding DNA that form centromeres and heterochromatin |
| Low Complexity | Repetitive elements of low complexity |
| LINE | |
| Long Terminal Repeat (LTR) | Paired sequences of DNA hundreds of base pairs long that often occur after a section of protein-coding sequence |
| Simple Repeat | Simple duplicated sets of DNA bases |
| SINE | |
Regulatory annotations and resources
| Feature | Function | Characteristic features | FANTOM5 | VISTA Enhancer Browser | ENCODE | EnhancerAtlas 2.0 | 3D Genome Browser | ChromHMM |
|---|---|---|---|---|---|---|---|---|
| Promoter | Recruits the pre-initiation complex, located at or near the transcription start site of a gene | H3K4me3, H3K27ac | X | X | X | |||
| Enhancer | Stimulates transcription of its target gene or genes; often located near the target gene but can be some distance away | H3K4me1, H3K27ac | X* | X | X | X | X | |
| Boundary | Prevents the spread of euchromatin or heterochromatin in the genome and prevents the formation of regulatory interactions between enhancers and promoters | CTCF and cohesin binding | X | X# | ||||
| Open chromatin | Active transcription, loosely packed nucleosomes | Acetylation, ATAC-seq, DNase hypersensitivity, FAIRE-seq and MNase-seq | X | |||||
| Heterochromatin | Repressed transcription, densely packed nucleosomes. Constitutive heterochromatin is found at repeats, transposons, the centromere and telomeres. Facultative heterochromatin silences genes and is developmentally regulated | H3K9me3, HP1 (constitutive); H3K27me3, Polycomb family proteins (facultative) | X | X | ||||
| Compartment | A and B compartments; chromatin clusters together with other chromatin from the same compartment. Commonly identified using the first two principal components of Hi-C data | ‘A compartment’ enriched for genes, active transcription and open chromatin; ‘B compartment’ enriched for heterochromatin | X | |||||
| Topologically associating domain (TAD) | Self-interacting chromatin domains, thought to colocalise enhancers and their target genes, block the spread of activation/repression in the genome, and insulate genes from aberrant regulatory interactions. Identified algorithmically from Hi-C maps | Boundary elements enriched at boundaries i.e. convergently orientated CTCF binding sites | X | X | ||||
| Chromatin loop | Loop structure created by physical interaction between two DNA elements commonly within the same TAD | Boundary elements enriched at boundaries i.e. convergently orientated CTCF binding sites | X | X | X |
* = transcribed enhancers. # = annotated as insulators. Resources which either provide any annotation of that category or any example of a raw data type required to infer them have been marked with “X”
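The characteristic features listed in the table can be read as a simple lookup from observed marks to a candidate annotation. The sketch below is a deliberately naive, rule-based illustration of that reading. It is not how ChromHMM or any of the listed resources assign states, and the function name `classify_region`, the mark labels and the rule order are assumptions made for this example.

```python
def classify_region(marks):
    """Return a candidate annotation for a region given the marks observed there.

    Rules follow the characteristic features in the table above; the order of
    the checks is an arbitrary choice made for this toy example.
    """
    marks = set(marks)
    if {"H3K4me3", "H3K27ac"} <= marks:
        return "promoter"
    if {"H3K4me1", "H3K27ac"} <= marks:
        return "enhancer"
    if {"CTCF", "cohesin"} <= marks:
        return "boundary"
    if "H3K9me3" in marks or "HP1" in marks:
        return "constitutive heterochromatin"
    if "H3K27me3" in marks:
        return "facultative heterochromatin"
    return "unclassified"


# Toy regions, invented for illustration
for observed in (["H3K4me3", "H3K27ac"], ["H3K4me1", "H3K27ac"], ["H3K27me3"]):
    print(observed, "->", classify_region(observed))
```

Real chromatin-state callers model combinations of marks probabilistically across the genome, so a lookup like this is only useful for building intuition about the table.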
Fig. 2 Annotation of scRNA-Seq data. (A) Single-cell experimental data is taken as input. (B) Input data is analysed using either unsupervised or supervised analysis. (C) Unsupervised analysis is done via clustering, for which there are many algorithms and single-cell tools, such as Seurat, Signac, Monocle and Scanpy. (D) Clustering is done with the guidance of supporting evidence from previous data to identify known clusters and, where necessary, identify novel clusters, leading to a new single-cell cluster annotation. (E) The cluster annotations then form part of the reference datasets which feed into supporting evidence, (F) and also are the basis for supervised classification of single-cell data. (G) Supervised classification of single-cell data relies on reference annotations to label cells. Some tools such as Alona and scMCA enable automated annotation, but other tools such as Garnett and scPred are self-trained. (H) Supervised classification then produces annotated cells based on a reference dataset of choice
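The supervised route in Fig. 2 (panels F to H) amounts to labelling each query cell by comparison with a reference of already annotated cells. A minimal sketch of that idea is a nearest-centroid classifier, shown here with the standard library only. This is a stand-in technique chosen for brevity, not the algorithm used by any of the tools named in the caption. The function names, gene order and expression values are invented for illustration.

```python
import math


def build_centroids(reference):
    """Average the expression vectors of each labelled cell type.

    reference: {cell_type: [expression vector, ...]}, all vectors sharing
    the same gene order.
    """
    centroids = {}
    for label, cells in reference.items():
        n_cells = len(cells)
        # zip(*cells) iterates gene by gene across the cells of this type
        centroids[label] = [sum(values) / n_cells for values in zip(*cells)]
    return centroids


def annotate_cell(cell, centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(cell, centroids[label]))


# Toy reference with three genes per cell (values invented for illustration)
reference = {
    "neuron": [[9.0, 0.5, 0.1], [8.0, 1.0, 0.0]],
    "astrocyte": [[0.5, 7.0, 0.2], [1.0, 8.0, 0.1]],
}
centroids = build_centroids(reference)
query_cells = [[7.5, 1.2, 0.0], [0.2, 6.5, 0.3]]
labels = [annotate_cell(cell, centroids) for cell in query_cells]
print(labels)
```

A classifier of this kind can only return labels present in the reference, which is why the figure keeps the unsupervised route for identifying novel clusters.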
Resources to aid annotating a cell
| Resource | Tissue availability | Metadata | Data used to build resources |
|---|---|---|---|
| Single-cell Mouse Cell Atlas (scMCA) | Whole Mouse Adult | Cell Type, Tissue, Developmental Stage | scRNA-Seq |
| Mouse Organogenesis Cell Atlas | Whole Mouse Adult | Cell Type, Developmental Stage | scRNA-Seq |
| UCSC Cell Browser | Adult Mouse, Embryonic Mouse, Mouse Nervous System | Cell Type, Tissue, Developmental Stage, Experiment Specific Data | scRNA-Seq |
| EBI expression Atlas | Mouse, Brain, Heart, Gonadal | Species, Cell Type, Tissue, Technology | scRNA-Seq, RNA-Seq, snRNA-Seq, Microarray, scATAC-Seq |
| Monocle / Garnett | Mouse Brain and Spinal Cord, Lung | Cell Type, Species, Tissue | scRNA-Seq |
| PanglaoDB | Brain, Intestine, Skin, Thymus, Spleen, Heart, Lung | Cell Type, Tissue, Library Protocol, Number of Cells, Strain and/or Genotype, Number of Expressed Genes, Accession Number | scRNA-Seq |
| Alona | Brain, Bone Marrow, Skin, Epididymis and vas deferens | Cell Type, Accession Number, Species, Tissue | scRNA-Seq |
| Single Cell Portal Broad | Brain, Lung, Aging Mouse Brain | Species, Cell Type, Tissue, Technology, Disease, Sex, Library Protocol, Age | scRNA-Seq, scATAC-Seq |
| Mouse Brain Atlas | Whole Mouse Brain | Cell Type, Developmental Stage | scRNA-Seq |
| LifeMap | Developmental Mouse and Stem Cells | Cell Type, Anatomical Compartment, Developmental Path, Progenitor Status, Developmental Time, Number of Associated Genes, Signals, High Throughput, Matched Cultured Cells, Disease | RNA-Seq, Microarray, In situ hybridisation |
| Allen Brain Map | Whole Mouse Brain | Cell Type | scRNA-Seq |
Table of resources available for single-cell level data, detailing the tissue types available, the metadata stored there and the modality in which the single-cell data has been captured