High-Throughput Sequencing in Respiratory, Critical Care, and Sleep Medicine Research. An Official American Thoracic Society Workshop Report.

Craig P Hersh, Ian M Adcock, Juan C Celedón, Michael H Cho, David C Christiani, Blanca E Himes, Naftali Kaminski, Rasika A Mathias, Deborah A Meyers, John Quackenbush, Susan Redline, Katrina A Steiling, Holly K Tabor, Martin D Tobin, Mark M Wurfel, Ivana V Yang, Gerard H Koppelman.

Abstract

High-throughput, "next-generation" sequencing methods are now being broadly applied across all fields of biomedical research, including respiratory disease, critical care, and sleep medicine. Although there are numerous review articles and best practice guidelines related to sequencing methods and data analysis, there are fewer resources summarizing issues related to study design and interpretation, especially as applied to common, complex, nonmalignant diseases. To address these gaps, a single-day workshop was held at the American Thoracic Society meeting in May 2017, led by the American Thoracic Society Section on Genetics and Genomics. The aim of this workshop was to review the design, analysis, interpretation, and functional follow-up of high-throughput sequencing studies in respiratory, critical care, and sleep medicine research. This workshop brought together experts in multiple fields, including genetic epidemiology, biobanking, bioinformatics, and research ethics, along with physician-scientists with expertise in a range of relevant diseases. The workshop focused on application of DNA and RNA sequencing research in common chronic diseases and did not cover sequencing studies in lung cancer, monogenic diseases (e.g., cystic fibrosis), or microbiome sequencing. Participants reviewed and discussed study design, data analysis and presentation, interpretation, functional follow-up, and reporting of results. This report summarizes the main conclusions of the workshop, specifically addressing the application of these methods in respiratory, critical care, and sleep medicine research. This workshop report may serve as a resource for our research community as well as for journal editors and reviewers of sequencing-based manuscript submissions in our research field.

Keywords: RNA sequencing; bioinformatics; functional genomics; genetic epidemiology; whole-genome sequencing

Year: 2019 PMID: 30592451 PMCID: PMC6812157 DOI: 10.1513/AnnalsATS.201810-716WS

Source DB: PubMed Journal: Ann Am Thorac Soc ISSN: 2325-6621

Overview; Workshop Agenda; Principles of Study Design; Study Designs and Phenotyping for Genetic Epidemiology; Biobanks; Health Equity; Research Ethics; Next-Generation DNA Sequencing; Study Design; Identifying Causal Variants; Next-Generation RNA Sequencing; Study Design; Sequencing Methods; Reporting for RNA-Seq Experiments; Data Analysis; Bioinformatics; Quality Control; Statistical Analysis; Data Sharing; Cell Type Heterogeneity; Single-Cell Sequencing; Multiomics Integration; Functional Validation; Conclusions

Overview

High-throughput, next-generation sequencing (NGS) technologies (see Box 1 for a glossary of terms) are increasingly used in studies of common, complex diseases, including respiratory diseases such as asthma, chronic obstructive pulmonary disease, and idiopathic pulmonary fibrosis; critical illnesses; and sleep disorders (Table 1, Figure 1) (1–15). RNA sequencing is now more cost effective than microarrays, and exome and whole-genome DNA sequencing are rapidly replacing genotyping arrays. Given the widespread application of these techniques in respiratory, critical care, and sleep medicine research, a workshop was organized at the ATS International Conference in Washington, DC, in May 2017. The aim of this workshop was to review the design, analysis, interpretation, and functional follow-up of high-throughput sequencing studies in respiratory, critical care, and sleep medicine research. Although reviews and best-practices guidelines for DNA and RNA sequencing have been published (16, 17), this workshop focused on the application of DNA and RNA sequencing to common, complex diseases in human populations and did not cover epigenome or microbiome studies or cancer genetics.

Table 1.

Examples of human next-generation sequencing studies in respiratory, critical care, and sleep medicine

Technology	Disease/Trait	Study Design/Subjects	Main Findings	Validation	Reference
Whole-exome sequencing	Narcolepsy	18 Families	8 Missense variants in P2Y11	Resequencing in 250 cases, 150 control subjects; in vitro P2Y11 signaling assays	10
Whole-exome sequencing	Bronchopulmonary dysplasia	50 Twin pairs, including 51 BPD cases	258 Genes with rare nonsynonymous mutations	Lung gene expression in published human data and rat BPD model, mouse phenotype database	11
Whole-exome sequencing	Airflow obstruction	100 Heavy smokers with normal lung function	Nonsynonymous SNP in CCDC38	Association testing in two additional studies. Immunohistochemistry in bronchial epithelial cells.	6
Whole-exome sequencing	Idiopathic pulmonary fibrosis	79 Probands with familial pulmonary fibrosis, 2,816 control subjects	Mutations in PARN found in cases, not control subjects. Mutations in RTEL1 more common in cases vs. control subjects	Mutations segregated in families. Shorter leukocyte telomeres in mutation carriers.	8
Whole-genome sequencing	Pulmonary vascular disease	864 PAH, 16 PVOD/ PCH, 7,134 control subjects	EIF2AK4 mutations in 19 patients with PAH	Phenotype association with younger age, reduced KCO, shorter survival	12
Whole-genome sequencing	Asthma	WGS in 8,453 Icelanders, imputation in >150 K	Rare variant in IL33 associated with lower eosinophil count, reduced asthma risk	Genotyping in 6,465 cases, >300 K control subjects; interleukin-33 gene expression; in vitro assay of receptor binding	3
RNA sequencing	Smoking	Blood samples from 229 current, 286 former smokers	171 DE genes, including 7 lncRNAs, 8 genes with differential exon use	Published microarray study	13
RNA sequencing	COPD	Lung tissue from 98 cases, 91 control subjects	2,312 DE genes	qPCR for seven genes	2
Single-cell RNA sequencing	IPF	FACS-sorted lung epithelial cells from 6 IPF, 3 control subjects	4 Cell clusters: AT2, basal, goblet, and indeterminate	Immunofluorescence confocal microscopy for epithelial cell markers	15
miRNA sequencing	Sepsis	Plasma from 29 sepsis, 44 noninfective SIRS, 16 control subjects	6 miRNAs distinguish sepsis from SIRS	qPCR, correlation with inflammatory cytokines	9
miRNA sequencing	Exercise physiology	Plasma before/after treadmill exercise test, n = 26	miR-181b increased with exercise	qPCR in separate cohort (n = 59), Skeletal muscle expression in mouse exercise model	14

Definition of abbreviations: BPD = bronchopulmonary dysplasia; COPD = chronic obstructive pulmonary disease; DE = differentially expressed; FACS = fluorescence-activated cell sorter; IPF = idiopathic pulmonary fibrosis; KCO = carbon monoxide transfer coefficient; lncRNA = long noncoding RNA; PAH = pulmonary arterial hypertension; PCH = pulmonary capillary hemangiomatosis; PMVEC = pulmonary microvascular endothelial cells; PVOD = pulmonary veno-occlusive disease; qPCR = quantitative polymerase chain reaction; SIRS = Systemic Inflammatory Response Syndrome; SNP = single-nucleotide polymorphism; WGS = whole-genome sequencing.

Figure 1.

Next-generation sequencing methodology (Illumina). Genomic DNA is fragmented and sequencing adaptors are attached. The genomic library is then hybridized to complementary oligonucleotide probes in the flow cell chamber. Because there are adaptors on both ends, hybridization results in a bridge. Amplification leads to clusters of fragments with the same sequence. Clusters are denatured; then, sequencing-by-synthesis involves the addition of fluorescently labeled nucleotides, with serial imaging after the incorporation of each nucleotide. Reprinted by permission from Reference 116.

Box 1. Definitions of Commonly Used Terms in Sequencing Studies

Batch effect: In a large study, library construction and sequencing are done in batches (e.g., 96-well plates), which is a source of technical variation that should be addressed in the data analysis.
Complex trait: A disease or phenotype that does not follow Mendelian inheritance. Complex traits are likely influenced by multiple genes and environmental factors. Most common human diseases would be considered complex diseases.
Deconvolution: RNA sequencing is frequently performed in tissues such as blood or lung, which are composed of multiple cell types. Deconvolution methods aim to estimate the cell type proportions and/or identify the cell type(s) responsible for the expression of specific genes.
Exome: The portion of the human genome (approximately 1%) that encodes proteins. Whole-exome sequencing (WES) specifically targets these sequences.
Expression quantitative trait locus (eQTL): A genetic variant, usually an SNP (see below), that affects the expression of a gene. eQTLs can be located near the gene of interest (cis-eQTL) or distant (>1 Mb away) (trans-eQTL).
Genome-wide association study (GWAS): A study that assays hundreds of thousands to millions of SNPs across the genome and tests each variant for association with a disease or trait of interest.
Mendelian disorder: A disease determined by variation in a single gene (e.g., cystic fibrosis or sickle cell disease).
Next-generation sequencing (NGS): Highly automated parallel sequencing technique of small fragments of DNA or RNA. Millions or even billions of nucleotides, up to a whole genome, can be determined in 1 day.
RNA integrity number (RIN): A proprietary algorithm that quantifies RNA degradation on the basis of an electropherogram.
Single-nucleotide polymorphism (SNP): A single base pair change in DNA sequence (e.g., C to T) that is prevalent (>1%) in the general population. SNPs are the most common type of genetic variation.
Sequencing coverage/depth: For DNA sequencing, the number of reads that include a specific nucleotide in the sequencing experiment. This can be averaged across the genome (e.g., 30×). For RNA sequencing, sequencing depth is usually presented as the total number of sequencing reads.
Variant calling: The process of identifying genetic variants (usually SNPs) in an individual exome or genome sequence.
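The definition of sequencing coverage in Box 1 can be made concrete with a short calculation. The sketch below is illustrative only and is not part of the workshop report: it assumes a genome size of roughly 3.1 billion base pairs and 150-bp reads, and it assumes that every read maps uniquely and contributes all of its bases, which real data will not fully satisfy.

```python
import math

# Assumed values for illustration only (not taken from the report).
HUMAN_GENOME_BP = 3_100_000_000  # approximate human genome size
READ_LENGTH_BP = 150             # a common short-read length


def mean_coverage(num_reads: int, read_length_bp: int, target_size_bp: int) -> float:
    """Average depth = total sequenced bases / size of the target region."""
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    return num_reads * read_length_bp / target_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, target_size_bp: int) -> int:
    """Smallest number of reads whose mean coverage reaches the target depth."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * target_size_bp / read_length_bp)


if __name__ == "__main__":
    n = reads_needed(30, READ_LENGTH_BP, HUMAN_GENOME_BP)
    print(f"Reads for 30x mean coverage: {n:,}")
    print(f"Mean coverage check: {mean_coverage(n, READ_LENGTH_BP, HUMAN_GENOME_BP):.1f}x")
```

Mean coverage is an average; individual positions can be covered far less deeply, which is why per-region coverage is usually reviewed as well.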

Workshop Agenda

The workshop participants focused on five topics, each of which concluded with a panel discussion. Areas of emphasis included study design, ethical considerations and health inequalities, applications of DNA and RNA sequencing, cell type heterogeneity, and functional studies. Biomedical literature searches were conducted by the speakers and co-chairs. The co-chairs collected summaries from speakers, and a writing group prepared the document for review by the workshop participants. Recommendations were formulated by discussion and consensus (Box 2, Figure 2).

Figure 2.

Workflow for a next-generation sequencing study in human disease.

Box 2: Recommendations for Design and Analysis of Next-Generation Sequencing Studies

General principles of genetic epidemiology study design are especially important in next-generation sequencing studies.
Disease-specific studies may have more detailed phenotyping, whereas population studies may allow for larger sample sizes.
Investigators must clearly define the phenotypes of both cases and control subjects, including consideration of relevant exposures, such as smoking.
Relying only on datasets largely composed of individuals of European ancestry will limit discoveries, especially in diseases that may be more prevalent in other racial or ethnic groups. Therefore, we suggest expanding sequencing efforts in subjects of different ancestries.
Studies should consider broad consent for secondary data use.
Researchers should use standardized methods of experimental design and data analysis for exome and whole-genome sequencing studies.
Methods for design and analysis in RNA sequencing studies are more variable. Researchers should clearly document their methods, including software versions and input parameters, and consider validating key results with a different analysis method.
Multidisciplinary teams should include bioinformaticians, statisticians, and computational biologists to assist in the management and analysis of large datasets.
Quality control is the responsibility of the investigators. It should not be assumed that the core sequencing facility has performed all the necessary quality control steps.
Data sharing is required by many funders and should be the default.
Because of the extreme cellular heterogeneity of the lung, future studies should address cellular heterogeneity in the design and analysis by any of the proposed methods.
We anticipate an important role of single-cell sequencing in the future. For wider acceptance, single-cell sequencing has to address specific hypotheses and should be held to the same rigorous standards as other study designs, even while the laboratory and statistical methods are under development.
Given the pervasive influence of circadian biology on multiple cellular processes, including gene expression, studies should time stamp sample collection.
Integrating different omics datasets, both at the single-cell and at the tissue level, will likely increase our understanding of the complex diseases in respiratory, critical care, and sleep medicine.
Laboratory validation is an important next step toward the eventual translation of results.

Principles of Study Design

Study Designs and Phenotyping for Genetic Epidemiology

The general principles of epidemiology study design remain true for genetic epidemiology studies, including subject ascertainment, phenotype definition, and sample size considerations, as summarized in the Strengthening the Reporting of Genetic Association Studies (STREGA) Guidelines (18). There are several possible designs for genetics studies of respiratory disease. In the past, most studies enrolled subjects ascertained for a specific condition. These studies usually use careful phenotyping to define the disease of interest using endotypes, such as methacholine challenge testing or polysomnography (19–21). General population (cohort) studies may offer the advantage of large sample sizes and the ability to study multiple outcomes, although the phenotyping may not be as precise. Questionnaires may be the primary source of respiratory disease diagnosis, although several large cohorts have included spirometry (22, 23). For common diseases, the large numbers may offset the potential for phenotypic heterogeneity (e.g., childhood vs. adult-onset asthma) or even misclassification (e.g., chronic obstructive pulmonary disease [COPD] misdiagnosed as asthma, especially in women) (24). Recent studies have linked genetics to the electronic medical record (25). General population cohorts have limited utility in studies of critical illness, where subjects are enrolled in the hospital (26). Although most recent studies have been case–control or cohort studies, family-based studies still play a role, especially in the analysis of rare variants, where transmission can be followed through a pedigree (27). As genomic data sharing has become the norm, secondary analysis for respiratory diseases is now routinely performed in general population studies, such as the Framingham Heart Study (22). 
When secondary data will be used or when multiple studies will be combined in a meta-analysis, investigators must carefully review the phenotyping methods, including the specific questionnaire items, to be sure that similar traits are being compared. In case–control studies, control subjects should have comparable exposures, such as smokers with normal lung function in COPD studies or patients with multitrauma, pneumonia, or sepsis who did not develop acute respiratory distress syndrome (28–30).

Biobanks

Sequencing studies undertaken at the scale required for well-powered association testing require organized efforts, with coordinated biobanking. Rare diseases and phenotypes may be due to genetic variants with high penetrance, which may be detectable in relatively modest numbers of samples. Even in this situation, and in the absence of many different mutations causing similar phenotypes, it is helpful to draw on very large numbers of sequenced individuals. Genomics England (the “100,000 Genomes Project”) (31), the U.S. National Heart, Lung, and Blood Institute Trans-Omics in Precision Medicine (TOPMed) (32), and the Genome Aggregation Database (33) are initiatives that will enhance such comparisons, which are critical to confirm whether variants are causal for the disease in question (Table 2). The Genomics England project is recruiting patients with cancers, infectious diseases such as tuberculosis, and rare diseases, including primary ciliary dyskinesia, spontaneous pneumothorax, familial pulmonary fibrosis, and familial multiple pulmonary arteriovenous malformations.

Table 2.

Biobanks and commonly used databases for next-generation sequencing research

Resource	URL
Biobanks and other large sequencing studies
Centers for Common Disease Genomics	www.genome.gov/27563570
China Kadoorie Biobank	www.ckbiobank.org
Genomics England (“100,000 Genomes Project”)	www.genomicsengland.co.uk
Trans-Omics in Precision Medicine (TOPMed)	www.nhlbiwgs.org
U.K. Biobank	www.ukbiobank.ac.uk
Databases
Database of Genotypes and Phenotypes (dbGaP)	www.ncbi.nlm.nih.gov/gap
Ensembl genome browser	www.ensembl.org
Gene Expression Omnibus (GEO)	www.ncbi.nlm.nih.gov/geo
Genome Aggregation Database (gnomAD)	http://gnomad.broadinstitute.org/
Genotype-Tissue Expression project (GTEx)	www.gtexportal.org
Human Cell Atlas	www.humancellatlas.org
Lung Map	www.lungmap.net
Reference Sequence Database (RefSeq)	www.ncbi.nlm.nih.gov/refseq
Sequence Read Archive (SRA)	www.ncbi.nlm.nih.gov/sra
University of California Santa Cruz (UCSC) Genome Browser	www.genome.ucsc.edu

Complementing efforts that specifically recruit individuals with particular diseases are population biobanks that are agnostic to health status, many of which have extensive longitudinal follow-up. The UK Biobank recruited 500,000 participants aged 40 to 69 years (34). In addition to a baseline assessment, subsequent health status is evaluated via linked electronic healthcare records. Beginning with respiratory studies in 50,000 participants (35), genome-wide genotyping has now been extended to all participants. Whole-exome sequencing (WES) is underway, and whole-genome sequencing (WGS) has recently been announced in collaboration with industry partners. The sequence data will be made available to the research community. Although 95% of the UK Biobank is of European ancestry, similar efforts are in progress in China in the Kadoorie Biobank (36).

Health Equity

Many respiratory, critical care, and sleep disorders have substantial differences in disease susceptibility, prevalence, and burden according to race and ethnicity (37). Genetic and genomic factors, along with their interplay with the environment, contribute to these differences. For example, African ancestry is a strong predictor of lung function (38). Some disease susceptibility or pharmacogenetic variants that have been identified in diseases such as asthma, emphysema, and sleep apnea (39–41) are population specific and may be absent or low frequency in other populations (42, 43). Although most of these studies have been performed using common variants, there is even more ethnic diversity for rare variants (44). Individuals of African ancestry harbor a larger number of rare variants than white individuals, which may have important clinical implications. For example, rare variants identified as causing cardiomyopathy in white individuals were so common in African Americans as to indicate that they were unlikely to be pathogenic (45). To date, there has been a large gap in research studies involving non-white individuals, for reasons including convenience, access, and genetic heterogeneity (46). In addition, there have also been disparities in funding minority investigators or diseases that predominantly affect minorities, such as sickle cell disease (47). By 2060, only 44% of the United States population will be non-Hispanic white (48). There is a strong scientific and moral justification for expanding sequencing studies into other ancestries. Efforts such as TOPMed are performing WGS in a large number of non-European samples. These and other efforts can help ensure that sequencing efforts improve health equity for the benefit of all.

Research Ethics

Several ethical challenges have emerged related to NGS studies. Because of the ability to assay numerous sites and the requirements for data sharing by the U.S. National Institutes of Health (NIH) and other funders (49), genome-wide association studies and sequencing data are often used for secondary studies, which may be unrelated to the initial trait or disease proposed. To allow for these studies, investigators must request and subjects must provide broad consent for secondary data analysis, as opposed to narrow consent for a specific disease. There is no consensus about what is required for broad consent for databases such as the National Center for Biotechnology Information database of Genotypes and Phenotypes (dbGaP) (50); these decisions are frequently left to local institutional review boards. In addition to the genes of interest, WES and WGS studies will identify other genetic variants that may be clinically significant for the subject or their family members. The American College of Medical Genetics has provided recommendations for the reporting of secondary results from clinical sequencing (51), providing a list of 59 actionable genes, which, interestingly, does not include the genes for cystic fibrosis or alpha-1 antitrypsin deficiency. However, it is not clear how these recommendations would apply to research studies, where the sequencing is not performed in a Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory. There is usually no mechanism or resources for confirmatory clinical sequencing or genetic counseling. Sequencing studies of DNA of patients with a critical illness require consent from patient proxies as well as the designation of an individual to receive study results if the patient remains incapacitated or dies (52).

Next-Generation DNA Sequencing

Study Design

Humans carry an extraordinary amount of genetic diversity. Although most variants in a given individual are common, most genetic variants in a population are rare (i.e., present in <1% of the population). Genotyping microarrays, together with methods of genotype imputation, efficiently allow testing of more prevalent genetic variants. However, comprehensive assessment of DNA variation, including discovery of rare or novel variants, requires a test that assays each base pair. The exponential decrease in sequencing costs has led to the ability to perform WGS studies to identify the contributions of rare variants to disease. The design of a DNA sequencing study should consider several factors. Sequencing depth determines the accuracy of variant calling in an individual. For example, 30× coverage leads to high accuracy over most of the genome. However, one can sequence more samples for the same cost using lower coverage (e.g., 3–6×). Although less accurate, lower coverage still provides suitable variant information for genetic association testing and may be superior to genotyping (53). Another consideration is targeted (e.g., exome) versus WGS. The exome harbors most rare, highly deleterious mutations; sequencing only these regions reduces costs, although these savings are offset somewhat by the additional costs and inefficiencies of library preparation. Some coding regions may still be poorly covered by WES (54).
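The trade-off between sequencing depth and sample size described above can be illustrated with simple arithmetic. The following sketch is not from the workshop report; it assumes a fixed budget expressed as total sequenced bases and an approximate genome size of 3.1 billion base pairs, and it deliberately ignores per-sample costs such as library preparation, which the text notes can offset some of the savings.

```python
# Assumed approximate human genome size, for illustration only.
GENOME_SIZE_BP = 3_100_000_000


def samples_within_budget(total_bases: int, coverage: float,
                          genome_size_bp: int = GENOME_SIZE_BP) -> int:
    """Number of whole genomes that fit in a fixed base budget at a given mean coverage."""
    if coverage <= 0 or genome_size_bp <= 0:
        raise ValueError("coverage and genome_size_bp must be positive")
    return int(total_bases // (coverage * genome_size_bp))


if __name__ == "__main__":
    # Hypothetical budget: enough bases for 100 genomes at 30x.
    budget = 100 * 30 * GENOME_SIZE_BP
    for depth in (30, 6, 3):
        print(f"{depth:>2}x coverage -> {samples_within_budget(budget, depth)} samples")
```

Under these assumptions, the same budget that covers 100 genomes at 30× covers several hundred at the 3–6× range mentioned in the text, at the cost of less accurate calls in each individual.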

Identifying Causal Variants

NGS has led to breakthroughs in the discovery of genes for Mendelian diseases and other rare variants of strong effect (55, 56). Several groups have provided guidelines for identifying causal variants for Mendelian disease (57) and for identifying genetic association in WES (58) and WGS (59). However, the interpretation of WES and WGS data has specific challenges. Very small error rates over billions of base pairs have the potential to generate many false positives (60), although advances in technology, approaches, and bioinformatics methods have vastly improved data quality. Similarly, the large number of variants carried by any individual (27) can also lead to false positives, and caution must be used to avoid inflated estimates of pathogenicity (61). In addition, most studies of rare variants are likely underpowered (5, 58, 62). The growing availability of population-specific high-quality reference genomes will aid comparison of diseased study populations to these reference datasets. Coordinated efforts are providing population-based reference data important for filtering causal variants (33) and performing WGS in large numbers of subjects, such as the NIH Centers for Common Disease Genomics and TOPMed. These efforts will lead to new rare variant discoveries as well as improved reference panels for genotyping studies and fine mapping. Table 3 details recommendations for reporting the results of a WES or WGS study.
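Two points in the paragraph above lend themselves to a small worked example: how a very small per-base error rate scales over a whole genome, and how population reference frequencies can be used to filter candidate causal variants. The sketch below is a simplified illustration, not a method from the report. The error rate, the 1% frequency threshold, and the variant identifiers are all hypothetical, and treating variants absent from the reference as rare is an assumption of this sketch.

```python
def expected_false_calls(per_base_error_rate: float, bases_called: int) -> float:
    """Expected number of erroneous base calls = error rate x number of bases."""
    return per_base_error_rate * bases_called


def filter_by_population_frequency(candidates, reference_freqs, max_freq=0.01):
    """Keep candidates whose reference allele frequency is at or below max_freq.

    Variants absent from the reference are treated as frequency 0.0 and kept
    (an assumption of this sketch; absence can also reflect poor coverage).
    """
    return [v for v in candidates if reference_freqs.get(v, 0.0) <= max_freq]


if __name__ == "__main__":
    # Hypothetical error rate of one in a million over ~3.1 billion bases.
    print(expected_false_calls(1e-6, 3_100_000_000))

    # Hypothetical candidates and reference allele frequencies.
    candidates = ["var_a", "var_b", "var_c"]
    reference = {"var_a": 0.0002, "var_b": 0.15}
    print(filter_by_population_frequency(candidates, reference))
```

A variant that is common in any well-sampled reference population, like the hypothetical `var_b`, is unlikely to cause a rare, highly penetrant disease, which is why ancestry-matched reference data matter for this filter.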

Table 3.

Minimal elements required in the reporting of high-throughput sequencing studies.

Analytic Step	Required Elements
Whole-exome and genome sequencing
Preprocessing and preanalysis quality control	Randomization of samples
	Target design, when applicable (e.g., whole-exome sequencing)
	Methods for quality assessment of:
	Raw reads
	Aligned reads and coverage
	Global data quality
	Ancestry of samples (comparison with study and to reference genomes)
Core analytics	Method of read alignment
	Method of variant calling
	Method of association analyses
Advanced analytics	Methods for integration with other data types
RNA sequencing
Preprocessing and preanalysis quality control	Spike-in use
	Randomization of samples
	Number of raw reads
	Methods for quality assessment of:
	Raw reads
	Aligned reads
	Quantification of reads
	Reproducibility of replicates
	Global data quality
Core analytics	Method of transcript/gene identification
	Method of transcript/gene quantification
	Method of normalization
	Method of batch correction
	Method of detection of differential expression
Advanced analytics	Method of transcript/isoform discovery
	Method of indel detection
	Method of gene fusion detection
	Method of variant detection
	Method for single-cell analyses
	Methods for integration with other data types

Required elements should also include the package or software name, version number, and settings used for the analysis.
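One way to satisfy the requirement to report software names, versions, and settings is to record them programmatically as each analytic step runs. The following is a minimal sketch under stated assumptions: the step name, tool name, version string, and parameters are hypothetical placeholders, and the JSON layout is one possible choice rather than a format prescribed by the report.

```python
import json
import platform


def analysis_record(step: str, software: str, version: str, settings: dict) -> dict:
    """Describe one analytic step: what was run, which version, and with what settings."""
    return {
        "step": step,
        "software": software,
        "version": version,
        "settings": dict(settings),
    }


def build_methods_log(records) -> str:
    """Serialize all recorded steps, plus the Python version, as readable JSON."""
    log = {
        "python_version": platform.python_version(),
        "steps": list(records),
    }
    return json.dumps(log, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical tool name, version, and parameters, for illustration only.
    records = [
        analysis_record("read alignment", "example_aligner", "1.2.3",
                        {"threads": 4, "reference": "example_reference_build"}),
    ]
    print(build_methods_log(records))
```

A log of this kind can be deposited alongside the data so that the elements in Table 3 are reported from what was actually run rather than reconstructed later.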

Next-Generation RNA Sequencing

In contrast to genome sequence, gene expression is dependent on cell, tissue, and disease state. Although this context dependence makes sample collection and assays more challenging, gene expression may be more closely reflective of disease pathophysiology. Gene expression microarrays have been largely replaced by RNA-seq, which can assay a broader range of RNA types with increased sensitivity and lower costs. RNA-seq studies can be grouped into two broad categories on the basis of their study designs (63). Annotation studies aim to define the transcriptome of a specific cell type or organism, including novel transcripts. In comparison, quantification or differential expression studies compare transcript levels across experimental conditions or diseases. An introduction to RNA-seq for bench science has been recently published (64). Several factors are important in designing an RNA-seq study in human populations. Because sequencing costs depend on the number of reads, there is an inherent trade-off between sample size and sequencing depth, similar to DNA sequencing. As few as 1 to 10 million reads per sample may be adequate for differential expression (65, 66), whereas up to 200 million reads may be required to define all isoforms (67, 68). For most human disease studies, the number of samples may be more important than the number of reads (69).

Sequencing Methods

Sequencing technology has been reviewed elsewhere (70, 71). Library construction follows one of three general protocols (72). Poly-A capture is best for selecting coding transcripts but requires the highest-quality input RNA. Total RNA libraries include more RNA species; ribosomal RNA depletion is used to improve yield. Methods to capture specific target sequences can be used for lower-quality or fragmented RNA, including formalin-fixed paraffin-embedded tissue. Globin depletion is a common step for whole blood samples. Small RNA sequencing (i.e., microRNA) requires a specific library construction protocol. Investigators must determine the minimum quality for input RNA, on the basis of RNA integrity number. Other variables in sequencing include stranded versus unstranded protocols, single versus paired end reads, and insert length. Single-end short reads (≤75 bp) are acceptable for standard differential expression (73). Paired-end longer reads are preferable for annotating splicing events. In larger studies, samples should be randomized across batches to minimize confounding; analysis should account for batch effects. Gene expression in human tissues can be quite variable, depending on sampling, technical factors, and cellular heterogeneity (74). The latter can be addressed by deconvolution methods, which require additional information, either cell counts or cell-specific reference transcriptomes. A review of the specific steps and analytic tools for RNA-seq data analysis has been recently published (16). However, there remains controversy regarding the optimal methods for normalization of RNA-seq data and for detecting differential expression. A few studies have examined these questions systematically (75–79). Furthermore, selection of an analytic approach is also dependent on the experimental design; for example, not all available tools support inclusion of covariates. 
Caution is advised regarding the interpretation of results where the number of biologic replicates in an experiment is small or where transcripts are expressed at very low levels.
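The recommendation to randomize samples across batches can be sketched as a stratified assignment, in which cases and control subjects are shuffled separately and then dealt in turn to each batch so that no batch is dominated by one group. This is an illustrative approach chosen for this example, not a procedure specified in the report; the sample identifiers, group labels, and batch count are hypothetical.

```python
import random
from collections import defaultdict


def randomize_to_batches(samples, n_batches: int, seed: int = 0) -> dict:
    """Assign (sample_id, group) pairs to batches, balancing groups across batches.

    Each group is shuffled with a seeded generator (so the assignment can be
    reproduced and reported), then dealt round-robin across the batches.
    """
    if n_batches <= 0:
        raise ValueError("n_batches must be positive")
    rng = random.Random(seed)

    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    batches = {b: [] for b in range(n_batches)}
    slot = 0  # continues across groups so batch sizes stay within one of each other
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)
        for sample_id in ids:
            batches[slot % n_batches].append(sample_id)
            slot += 1
    return batches


if __name__ == "__main__":
    # Hypothetical study: 12 cases and 12 control subjects, 4 sequencing batches.
    samples = [(f"case_{i}", "case") for i in range(12)]
    samples += [(f"control_{i}", "control") for i in range(12)]
    for batch, ids in randomize_to_batches(samples, n_batches=4).items():
        print(batch, ids)
```

Randomization of this kind reduces confounding between batch and disease status, but batch should still be recorded and accounted for in the analysis, as the text advises.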

Reporting for RNA-Seq Experiments

We propose minimum elements to be reported for RNA-seq experiments (Table 3), which extend the Minimal Information about Microarray Experiments (MIAME) guidelines (80). These minimal reporting requirements are important because of the rapidly evolving landscape of methods and tools for the analysis of RNA-seq data, the need to ensure RNA-seq data are both interpretable and reproducible, and the need to facilitate access to and integration of RNA-seq experiments across a spectrum of biologic and experimental conditions. Given the lack of consensus on the optimal methods for mapping versus assembling RNA-seq reads, normalizing RNA-seq data, and assessing for differential expression, we encourage investigators to repeat key analyses using more than one approach.

Data Analysis

Bioinformatics

Sequencing technologies have improved to the point where the greatest barrier to obtaining scientific insights is more related to data storage, analysis, and interpretation than its generation (81). The first critical component is an interdisciplinary team with expertise encompassing both the design and the use of specialized methods on sophisticated computational resources (82, 83). Institutional infrastructure or external service providers that offer high-performance computing environments, including cloud computing and core facilities, are important to facilitate the generation and analysis of high-throughput sequence data. One significant advantage to these solutions is that they distribute costs over many users. In addition, these resources can help ensure high-quality data and results, as they are generated by devoted personnel who are more familiar with NGS approaches than occasional users. On the other hand, some analytic steps are best performed with feedback from those familiar with experimental design, rather than by pipelines that may overlook important issues. Ideally, researchers with 1) backgrounds in chemistry and molecular biology involved in generating the data, 2) in-depth familiarity with analysis of sequencing data, and 3) backgrounds in design of a particular experiment will communicate intermediate results and tailor analytical steps as needed.

Quality Control

Quality control (QC) performed in a rigorous and standardized manner is critical. After the raw reads generated by the sequencing instrument are demultiplexed, QC steps that confirm sequencing worked appropriately report the number of reads per sample, the quality score of reads at each base, and overrepresented sequences (e.g., primers). Subject-level QC includes checking for sex inconsistencies, duplicates, and related subjects. After reads are aligned to a reference genome, additional parameters assess mapping quality. For example, software tools (Tables 4 and 5) can be used to summarize the number of mapped reads, including junction-spanning reads for RNA-seq, and can compute the number of bases assigned to various classes of DNA/RNA according to a reference file. For RNA-seq, use of Ambion External RNA Controls Consortium spike-ins offers an additional QC measure. Scripts that provide standardized and reproducible reports on the basis of output from these various programs, such as those that are part of the Genome Analysis Toolkit (GATK) best-practices pipeline for DNA-Seq (84) or taffeta scripts for RNA-seq (85), facilitate the assessment of QC and decisions about which samples should be excluded or what kinds of bias might be present. Mapped read files that are in bam format can be converted to bigwig or other compressed format for display in a browser, such as the University of California Santa Cruz (UCSC) genome browser, to verify that mapping of particular genes looks as expected (e.g., that full lengths of genes are covered vs. highly irregular portions).
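As an illustration of the raw-read QC outputs described above (reads per sample and quality score at each base), the following is a minimal sketch, not a substitute for the tools in Tables 4 and 5. It assumes well-formed four-line FASTQ records and Phred+33 quality encoding; the function name and the toy input are hypothetical.

```python
import io

def fastq_qc(handle, phred_offset=33):
    """Return (read_count, mean_quality_per_position) for a FASTQ stream.

    Assumes four lines per record and Phred+33 encoding (both assumptions).
    """
    n_reads = 0
    sums, counts = [], []
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        handle.readline()  # sequence line (not needed for this summary)
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        n_reads += 1
        for pos, ch in enumerate(qual):
            if pos == len(sums):  # first read reaching this position
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return n_reads, means

# Toy input: two 4-bp reads; quality drops toward the 3' end of the second read.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5!\n")
n_reads, mean_q = fastq_qc(example)
print(n_reads, mean_q)
```

A per-position mean that falls sharply toward the end of reads is one pattern that may prompt trimming or closer inspection of a sample.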

Table 4.

Software for DNA sequencing studies

Task	Tools	URL
Alignment	BWA-MEM (117)	http://bio-bwa.sourceforge.net
Alignment	Bowtie2 (118)	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Quality control (raw reads)	FastQC	http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
	FASTX-Toolkit	http://hannonlab.cshl.edu/fastx_toolkit/
Quality control (mapping)	BAMtools (119)	https://github.com/pezmaster31/bamtools/
	Picard Tools	https://broadinstitute.github.io/picard/
Quality control (variants)	GATK (120)	https://software.broadinstitute.org/gatk/
Variant calling	SAMtools (121)	http://www.htslib.org
Variant calling	GATK unified genotyper, haplotype caller, variant quality score recalibration (122)	https://software.broadinstitute.org/gatk/
Visualization	Integrative Genomics Viewer (IGV) (123)	http://software.broadinstitute.org/software/igv/
Visualization	UCSC Genome Browser (124)	http://www.genome.ucsc.edu/
Association analysis	PLINK 2 (common variants) (125)	https://www.cog-genomics.org/plink2/
	SKAT-O (rare variants) (88)	https://www.hsph.harvard.edu/skat/
	GENESIS (rare variants) (126)	https://bioconductor.org/packages/release/bioc/html/GENESIS.html
	BOLT-LMM (127)	https://data.broadinstitute.org/alkesgroup/BOLT-LMM/

This table provides an overview of commonly used software tools for performing analysis of next-generation sequencing data. Because the field continues to evolve rapidly, additional tools not listed in this table may also be useful to researchers.

Table 5.

Software for RNA sequencing studies

Task	Tools	URL
Alignment	Bowtie (128)	http://bowtie-bio.sourceforge.net/index.shtml
	STAR (129)	https://github.com/alexdobin/STAR/
	TopHat (130)	https://ccb.jhu.edu/software/tophat/index.shtml
Transcript quantification	Cufflinks (130)	http://cole-trapnell-lab.github.io/cufflinks/
	eXpress (131)	https://pachterlab.github.io/eXpress/
	HTSeq-count (132)	http://htseq.readthedocs.io/en/master/count.html
	Kallisto (133)	https://pachterlab.github.io/kallisto/
	RSEM (134)	https://github.com/deweylab/RSEM
Quality control (raw reads)	FastQC	http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
	FASTX-Toolkit	http://hannonlab.cshl.edu/fastx_toolkit/
Quality control (mapping)	BAMtools (119)	https://github.com/pezmaster31/bamtools/
	Picard Tools	https://broadinstitute.github.io/picard/
	RSeQC (135)	http://rseqc.sourceforge.net
Quality control (quantification)	NOISeq (136)	https://bioconductor.org/packages/release/bioc/html/NOISeq.html
Differential expression	DEGseq (137)	https://bioconductor.org/packages/release/bioc/html/DEGseq.html
	DESeq2 (138)	https://bioconductor.org/packages/release/bioc/html/DESeq2.html
	edgeR (139)	http://bioconductor.org/packages/release/bioc/html/edgeR.html
	limma/voom (140)	https://bioconductor.org/packages/release/bioc/html/limma.html
	PoissonSeq (141)	https://cran.r-project.org/web/packages/PoissonSeq/index.html
	NOISeq (136)	https://bioconductor.org/packages/release/bioc/html/NOISeq.html
	Sleuth (142)	https://pachterlab.github.io/sleuth/
Alternative splicing	CuffDiff2 (143)	http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
	DEX-Seq (144)	http://bioconductor.org/packages/release/bioc/html/DEXSeq.html
	DSG-Seq (145)	http://bioinfo.au.tsinghua.edu.cn/software/DSGseq/
	MISO (146)	http://genes.mit.edu/burgelab/miso/
	rSeqDiff (147)	http://www-personal.umich.edu/~jianghui/rseqdiff/
	Leafcutter (148)	https://github.com/davidaknowles/leafcutter
Visualization	CummeRbund (130)	http://compbio.mit.edu/cummeRbund/
	Integrative Genomics Viewer (IGV) (123)	http://software.broadinstitute.org/software/igv/
	RNASeqViewer (149)	http://bioinfo.au.tsinghua.edu.cn/software/RNAseqViewer/
	SplicePlot (150)	http://montgomerylab.stanford.edu/spliceplot/index.html
	SpliceSeq (151)	https://bioinformatics.mdanderson.org/main/SpliceSeq:Overview
	SplicingViewer (152)	http://bioinformatics.zj.cn/splicingviewer/
	UCSC Genome Browser (124)	http://www.genome.ucsc.edu

This table provides an overview of commonly used software tools for performing analysis of RNA sequencing data. Because the field continues to evolve rapidly, additional tools not listed in this table may also be useful to researchers.


Statistical Analysis

After alignment of reads, DNA-Seq data are analyzed to search for variation relative to a reference genome, and Variant Call Format (vcf/bcf) files are obtained. There are a variety of statistical tests for association within the WGS framework. The simplest consideration is based on frequency of the variant and/or a prior probability of being associated with disease (e.g., known pathogenic, deleterious, functionally relevant). Common variants, defined as those with minor allele frequency greater than 0.01 to 0.05 or minor allele count greater than 10 to 20, depending on the sample size and factors such as case–control imbalance, are usually analyzed with a genome-wide association study approach of single-nucleotide polymorphism (SNP) association tests (86). Rare and/or deleterious variants are generally analyzed using a window-based approach, where windows consist of SNPs in or near a gene or are based on sliding genome regions of 5 to 50 kb. Statistics for each window are obtained using different approaches: burden tests collapse many variants into a single risk score but assume all variants are similar in effect (i.e., risk alleles); adaptive burden tests collapse many variants but allow for risk, protective, and neutral variants; and variance component tests apply random effects modeling (87). There is also a class of window-based rare variant tests that combine the variance component and burden test frameworks to take advantage of the strengths of both; sequence kernel association test-optimal (SKAT-O) is the most commonly used (88). It is routine to apply multiple rare-variant tests and use alternative options of windows, resulting in increased multiple testing complexity. In addition to Bonferroni correction, permutation and Bayesian approaches are used to adjust for multiple comparisons. Controversy about the best way to analyze RNA-seq data still exists, and methods development is ongoing (89).
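The simple burden test described above for rare variants can be sketched as follows: rare alleles within one window are collapsed into a per-subject score, and a permutation test compares cases with controls. This is a toy illustration under stated assumptions, not SKAT-O or any tool in Table 4; the function names, the 0.01 frequency threshold, and the genotype data are hypothetical, and real analyses would adjust for covariates and relatedness.

```python
import random

def burden_scores(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants in a window into one score per subject.

    genotypes: per-subject lists of minor-allele counts (0/1/2) per variant.
    mafs: minor allele frequency of each variant in the window.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [sum(row[j] for j in rare) for row in genotypes]

def permutation_pvalue(scores, is_case, n_perm=2000, seed=1):
    """Two-sided permutation test on the case-control difference in mean burden."""
    rng = random.Random(seed)

    def mean_diff(labels):
        case = [s for s, c in zip(scores, labels) if c]
        ctrl = [s for s, c in zip(scores, labels) if not c]
        return sum(case) / len(case) - sum(ctrl) / len(ctrl)

    observed = mean_diff(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype link
        if abs(mean_diff(labels)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy window: three variants, the second one is common and is ignored.
mafs = [0.004, 0.2, 0.008]
genotypes = [
    [1, 1, 0], [0, 2, 1], [1, 0, 1], [0, 1, 0],  # cases
    [0, 1, 0], [0, 0, 0], [0, 2, 0], [0, 1, 0],  # controls
]
is_case = [True] * 4 + [False] * 4
scores = burden_scores(genotypes, mafs)
observed, p_value = permutation_pvalue(scores, is_case)
print(scores, observed, p_value)
```

Because every allele contributes to the score with the same sign, this sketch shares the limitation noted above: it loses power when a window mixes risk and protective variants.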
For samples passing initial QC, the next step involves quantification of levels of genes or transcripts. In most cases, reference gene or transcript files are obtained from Ensembl (90) or RefSeq (91). The output of the quantification process is then used with an appropriate software package to measure differential expression and assess related QC. Regardless of program used, it is important to report false-discovery rate, adjusted P values and fold changes.
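To make the reporting recommendation above concrete, the sketch below computes false-discovery rate adjusted P values with the Benjamini-Hochberg step-up procedure and a log2 fold change. Benjamini-Hochberg is one common choice and is an assumption here, as the report does not name a specific FDR method; the pseudocount of 1 and the input values are likewise illustrative.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def log2_fold_change(mean_case, mean_control, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))

pvals = [0.001, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(pvals)
lfc = log2_fold_change(7.0, 1.0)
print(padj, lfc)
```

Reporting both quantities matters because a small adjusted P value with a negligible fold change, or the reverse, leads to different interpretations.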

Data Sharing

To ensure that high-throughput sequence results are reproducible and that costly data can benefit all stakeholders, data-sharing resources have grown significantly. Both raw data and results generated from projects sponsored by major funders are required to be deposited into publicly available databases. DNA-Seq data, including that of TOPMed, are deposited in dbGaP, along with individual-level phenotype and association results (50). RNA-seq and other sequencing data are available in the Sequence Read Archive and can be discovered via the Gene Expression Omnibus (92).

Cell Type Heterogeneity

An important issue for consideration in many omics studies of the lung is cell heterogeneity. The lung is a complex organ comprising approximately forty resident cell types (93), a growing number of cell subpopulations that are present either transiently during development or in adult lung (94), as well as many types of inflammatory cells that infiltrate the airways and alveoli during periods of injury or disease. Thus, a signal measured by omics technologies in the whole lung can reflect a change in the pattern of expression of the molecules measured within a certain cell type, a change in the cellular composition of the lung, or both. There are three main approaches to deal with cellular heterogeneity. One approach is to perform statistical deconvolution of omic profiles by relying on cell-specific features from reference datasets. This approach has been used widely in peripheral blood profiling studies (95) and more recently on complex tissues (96), but it is highly dependent on known markers and difficult to implement for lung cell populations because of the limitations of appropriate reference datasets. The second approach is to isolate cell types on the basis of cell surface markers using flow cytometry or specific areas of the lung by laser capture microdissection (LCM). Although cell sorting is often used in immunological studies and has facilitated major contributions to the field (97), it is limited by the need for known cell markers and antibodies for cell populations of interest, as well as concerns that stress from cell sorting may affect gene expression patterns. LCM can be technically challenging on human lung tissue but has had some success in conjunction with sequencing technologies (98). However, in most benign tissues, the resolution of LCM allows for enriching for a regional microenvironment but not for dissecting between different cell types.
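The statistical deconvolution approach mentioned above can be illustrated with a minimal sketch that models a bulk expression profile as a non-negative mixture of cell-type reference profiles. It uses plain projected gradient descent for non-negative least squares, which is a stand-in chosen for brevity and not the method of the cited studies; the function name, reference profiles, and mixing proportions are hypothetical.

```python
def deconvolve(mixture, references, n_iter=2000):
    """Estimate cell-type proportions by non-negative least squares.

    mixture: expression value per gene in the bulk sample.
    references: one per-gene expression profile per cell type.
    Solved by projected gradient descent (an illustrative choice).
    """
    k = len(references)
    n_genes = len(mixture)
    # 1 / (sum of squared reference values) is a safe step size because the
    # trace of R^T R bounds its largest eigenvalue.
    step = 1.0 / sum(v * v for ref in references for v in ref)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        fitted = [sum(weights[c] * references[c][g] for c in range(k))
                  for g in range(n_genes)]
        resid = [fitted[g] - mixture[g] for g in range(n_genes)]
        for c in range(k):
            grad = sum(resid[g] * references[c][g] for g in range(n_genes))
            # Gradient step followed by projection onto weights >= 0.
            weights[c] = max(0.0, weights[c] - step * grad)
    total = sum(weights)
    if total == 0:
        raise ValueError("all estimated weights are zero")
    return [w / total for w in weights]  # rescale to proportions

# Toy reference profiles for two cell types across four genes.
type_a = [10.0, 1.0, 5.0, 2.0]
type_b = [1.0, 8.0, 5.0, 3.0]
true_props = [0.7, 0.3]
bulk = [true_props[0] * a + true_props[1] * b for a, b in zip(type_a, type_b)]
props = deconvolve(bulk, [type_a, type_b])
print(props)
```

The estimate is only as good as the reference profiles, which is the limitation noted above for lung cell populations.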

Single-Cell Sequencing

The recently developed single-cell technologies provide the best solution for identifying all relevant cell populations, although technical limitations remain (99–101). The reproducibility and success of such studies depend greatly on availability of high-quality human tissue, on cellular susceptibility to stress, and on the platforms used (102). Despite these limitations, single-cell RNA-seq studies have shed light on lung cell population heterogeneity during lung development (103) and in lung diseases such as idiopathic pulmonary fibrosis (15). The recent development of single-nucleus RNA-seq may address some of the issues with tissue quality (104). To address cellular heterogeneity in human lung disease, the field needs a “lung (disease) atlas,” such as the one proposed by the Human Cell Atlas (105), a large collaborative set of studies that will systematically profile all cells in diseased and healthy human lungs.

Multiomics Integration

DNA-seq offers the opportunity to detect different sources of DNA variation, including common and rare single-nucleotide variants and small and large deletions and insertions, whereas RNA-seq provides full assessment of the cell or tissue transcriptome at a given point in time, with a high dynamic range of these transcripts, different mRNA transcript isoforms, as well as different classes of mRNA and noncoding RNAs (ncRNAs). Integrative genomics methods evaluate the functional significance of DNA variation on gene expression. First, these address the relation of genetic variation with transcript abundance (expression quantitative trait locus, eQTL). Because SNP variants can be called in RNA-seq, this offers a direct assessment of an eQTL effect if an SNP is present in its heterozygous form; comparison of the allelic expression demonstrates the relation of an SNP with transcript abundance (allelic imbalance). Because many disease SNPs exert their function through eQTLs, well-powered lung-relevant eQTL datasets are needed. Until now, many studies have been performed in mixed cell populations of white blood cells (106) or whole lung (107). Because eQTLs may be cell specific, lung cell eQTL maps are needed to increase our understanding of lung-specific disease SNPs. Second, RNA-seq enables assessment of the relation of genetic variation with splicing events leading to alternative isoform expression (splice QTL). Finally, the association of genetic variation with transcripts induced by disease or specific stimuli (context-dependent eQTLs) can be investigated either in paired observational human studies or ex vivo in laboratory settings (Table 6). Because most respiratory diseases develop as a result of environmental exposures and genetic background, the study of inducible eQTLs may offer a good model to understand disease development by integrating genetic variation with induced gene expression.
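The allelic imbalance comparison described above, in which RNA-seq reads carrying each allele at a heterozygous SNP are compared, can be sketched as an exact two-sided binomial test against an expected 50:50 ratio. This is a simplified illustration: the function name and read counts are hypothetical, and the sketch assumes no reference-allele mapping bias or overdispersion, both of which would need attention in a real analysis.

```python
from math import comb

def allelic_imbalance_pvalue(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial test for allele-specific expression.

    ref_reads, alt_reads: RNA-seq read counts for each allele at a
    heterozygous SNP in one individual.
    """
    n = ref_reads + alt_reads
    p = expected_ref_fraction
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probability of every outcome no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)

balanced = allelic_imbalance_pvalue(10, 10)
skewed = allelic_imbalance_pvalue(18, 2)
print(balanced, skewed)
```

A small P value for the skewed counts suggests that one allele is preferentially expressed, which is consistent with a cis-acting eQTL effect at or near the SNP.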

Table 6.

Selected examples of omics integration and using omics for functional validation studies

Technique	Example
Context-dependent eQTLs	Li and colleagues showed that cytokine production by peripheral blood mononuclear cells on stimulation depends on six specific SNPs (153). One inducible cytokine QTL at the NAA35-GOLM1 locus markedly modulated interleukin-6 production in response to multiple pathogens and also showed association with susceptibility to candidemia.
Imputed gene expression (PrediXcan)	Ferreira and colleagues tested for associations between asthma and 17,190 genes found to have cis- and/or trans-eQTLs across 12 cell types relevant to asthma (154). They confirmed 37 genes where the association was driven by eQTLs located in established risk loci for allergic disease and discovered 11 novel genetic associations.
Gene knockdown	Dixit and colleagues investigated the effect of gene knockdown by CRISPR/Cas9 on RNA-seq expression in human LPS stimulated bone marrow dendritic cells, a method they called Perturb-seq (155). By analyzing the transcriptional consequences of perturbations of transcription factors in these cells, they were able to interpret the functional consequences of these transcription factors, as well as their interaction, uncovering their molecular mechanisms.

Definition of abbreviations: eQTL = expression quantitative trait locus; LPS = lipopolysaccharide; QTL = quantitative trait locus; SNP = single-nucleotide polymorphism.

Using study-specific subsets of RNA-seq data or external reference datasets such as the GTEx (Genotype-Tissue Expression project) consortium (108) makes it possible to impute gene expression in large numbers of individuals for whom only genetic variant data are available (109). These integrative approaches can extend the value of smaller samples with transcript data to larger datasets to identify gene expression correlated with phenotype (Table 6). We anticipate that further integration of other omics data in bulk tissue or at the single-cell level, through efforts such as LungMAP (110), will markedly increase our understanding of respiratory disease. The efficiency and speed of these types of analysis may be improved by the implementation of composite measures (111) in future research programs in respiratory medicine, particularly as the approach can be used for all types of omics analysis, including transcriptomics, proteomics, metabolomics, and epigenetics. Omics integration analysis has advanced considerably, and many machine learning methods are now being used, including Bayesian and network-based approaches (112), and, more recently, deep learning and neural networks (113).
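The idea of imputing gene expression from genotype alone can be sketched as a weighted sum of allele dosages, with the weights assumed to come from a model trained in a reference panel that has both genotype and expression data. This shows only the general form and is not the PrediXcan implementation; the SNP labels, weights, and dosages are hypothetical, and treating a missing SNP as dosage 0 is a simplifying assumption.

```python
def impute_expression(dosages, weights):
    """Predict genetically regulated expression of one gene for one person.

    dosages: dict of SNP label -> allele dosage (0 to 2).
    weights: dict of SNP label -> effect weight from a reference eQTL model.
    SNPs missing from `dosages` are treated as dosage 0 (an assumption).
    """
    return sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())

# Hypothetical weights for one gene and genotypes for three individuals.
weights = {"snpA": 0.5, "snpB": -0.25}
cohort = {
    "person1": {"snpA": 2, "snpB": 1},
    "person2": {"snpA": 0, "snpB": 2},
    "person3": {"snpA": 1},
}
predicted = {pid: impute_expression(d, weights) for pid, d in cohort.items()}
print(predicted)
```

The predicted values can then be tested for association with a phenotype in the larger genotype-only dataset, in the same way as measured expression.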

Functional Validation

In functional studies, gene expression may also be used as an outcome, either in vivo when human subjects or patients are exposed to an environmental stimulus or drugs or in human samples or animal models. Comparative analysis of humans and mouse models through RNA-seq may enable swift validation of downstream targets and provide insight into the validity of the animal model (114). Additional sequencing methods such as DNA-Seq, Assay for Transposase-Accessible Chromatin sequencing (ATAC-Seq), and Chromatin Immunoprecipitation sequencing (ChIP-Seq) can identify functional regions affected by genetic variants. Gene editing techniques including CRISPR-Cas9 that enable knockdown of genes or SNP-specific editing may be followed up by a readout of the effects on gene regulatory networks (115) (Table 6).

Conclusions

With large efforts such as TOPMed and the U.K. Biobank, in addition to specific disease studies, there is an ever-increasing amount of sequencing data available for studies of respiratory disease, critical care, and sleep medicine. Because of the complex nature of these studies, it is critical to include researchers with multiple backgrounds at the outset of study design, including clinician-scientists and epidemiologists who can enroll and phenotype subjects; laboratory personnel with skills in biobanking, sample management, and high-throughput sequencing; bioinformaticians, statisticians, and computational biologists who can manage and analyze data; and molecular biologists who can conduct functional validation studies. All of these experts must collaborate to design studies, interpret data, and present results. This should not discourage new investigators from participating in omics studies, as each person can provide complementary expertise. Specific recommendations regarding study design, analysis, and follow-up (Box 1) should serve as guides for starting a new sequencing study or for the critical appraisal of a completed study. Genomics is a rapidly evolving field, and researchers must keep abreast of best practices. However, general principles of study design and data reporting are likely to remain valid in the future.
