Yuanhang Liu1, Aditya Bhagwate1, Stacey J Winham1, Melissa T Stephens2, Brent W Harker2, Samantha J McDonough3, Melody L Stallings-Mann4, Ethan P Heinzen1, Robert A Vierkant1, Tanya L Hoskin1, Marlene H Frost5, Jodi M Carter3, Michael E Pfrender6, Laurie Littlepage7, Derek C Radisky8, Julie M Cunningham3, Amy C Degnim9, Chen Wang10.
Abstract
BACKGROUND: Formalin-fixed, paraffin-embedded (FFPE) tissues have many advantages for identification of risk biomarkers, including wide availability and potential for extended follow-up endpoints. However, RNA derived from archival FFPE samples has limited quality. Here we identified parameters that determine which FFPE samples have the potential for successful RNA extraction, library preparation, and generation of usable RNA-seq data.
Entities:
Keywords: Breast tissue; DV200; DV50; Decision tree; FFPE; Library concentration; Quality control; RNA concentration; RNA-seq
Year: 2022 PMID: 36114500 PMCID: PMC9479231 DOI: 10.1186/s12920-022-01355-0
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.622
Fig. 1 Flow chart of library optimization and bioinformatics evaluation. a In a pilot study, paired FFPE and fresh frozen samples from 7 BBD patients were sequenced to compare two RNA-seq library preparation protocols, Ribo-depletion and RNA exome capture. Several bioinformatics metrics were evaluated for the two protocols, and whole exome sequencing (WES) data were used to estimate the SNP confirmation rate. RNA exome capture showed superior performance in all categories and was selected as the library preparation protocol for all samples. b 130 study samples (ER+, estrogen receptor positive; ER−, estrogen receptor negative; Cont, control), along with 17 technical replicates and 11 study replicates, were submitted for library preparation using the RNA exome capture protocol. 40 samples failed the library preparation step because of insufficient RNA. All remaining samples were sequenced in 10 batches. Rigorous bioinformatics evaluation was performed to identify QC-failed samples based on defined bioinformatics metrics. The final dataset comprised 62 study samples.
Fig. 2 Bioinformatics QC to identify pass versus fail samples. a Heatmap of pairwise sample correlation of gene expression. The row color annotation bar indicates sequencing batch (seqb) 1–7 and 8–10. The lower right panel shows a histogram of the sample-wise median correlation based on gene expression data. b Relationship between sample-wise median correlation of gene expression and false positive rate, using the 11 study replicate samples. Samples with a sample-wise median correlation below 0.75 were classified as QC-failed. Loess is used for curve fitting, and the 95% confidence interval is plotted as a grey band. c Relationship between sample-wise median correlation of gene expression and the number of detectable genes with transcripts per million (TPM) > 4. A cutoff of 11,400 genes was selected to identify QC-failed samples. d Relationship between the number of gene-mapped reads and the total number of detected genes with TPM > 4. A cutoff was selected at 80% of the saturation point (25 million gene-mapped reads, 11,400 detected genes with TPM > 4).
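The sample-wise median correlation described in this caption can be illustrated with a minimal sketch. It assumes Pearson correlation on a small toy expression table; the caption does not state which correlation coefficient was used, and the function names and data layout here are hypothetical. Only the 0.75 cutoff comes from the caption.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.
    ASSUMPTION: the caption does not name the coefficient; Pearson is illustrative."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def samplewise_median_correlation(expr):
    """expr maps sample name -> expression values (same gene order for all samples).
    For each sample, return the median of its correlations with every other sample."""
    result = {}
    for name, values in expr.items():
        others = [pearson(values, v) for other, v in expr.items() if other != name]
        result[name] = median(others)
    return result


def flag_low_correlation(expr, cutoff=0.75):
    """True marks a sample whose median correlation falls below the cutoff (0.75 in the caption)."""
    medians = samplewise_median_correlation(expr)
    return {name: value < cutoff for name, value in medians.items()}


# Toy data (hypothetical): three concordant samples and one discordant sample.
toy_expr = {
    "s1": [1, 2, 3, 4, 5],
    "s2": [2, 4, 6, 8, 10],
    "s3": [1, 2, 3, 4, 6],
    "bad": [5, 1, 4, 2, 3],
}
flags = flag_low_correlation(toy_expr)
```

Using the median rather than the mean keeps one aberrant partner from pulling down an otherwise concordant sample, which is why a sample is flagged only when it disagrees with most of the cohort.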
Fig. 3 Correlation of pre-capture library concentration with bioinformatics metrics for all samples excluding fresh frozen (FFzn) controls. Scatter plots of RNA library concentration (ng/ul) against three bioinformatics metrics. Red and green points indicate QC-failed and QC-passed samples, respectively, according to sample-wise median correlation, number of detected genes with TPM higher than 4, and number of gene-mapped reads. The Spearman correlation and p value are shown in the upper right of each panel. A smooth line was fit to the data using loess, and the 95% confidence interval is shown as a grey shaded area. a Library concentration versus sample-wise median correlation. The dashed line indicates a correlation of 0.75. b Library concentration versus number of genes with TPM higher than 4. The dashed line indicates 11,400 genes. c Library concentration versus total number of gene-mapped reads. The dashed line indicates 25 million gene-mapped reads. d Library concentration versus estimated local/regional failure rate, calculated from the bioinformatics metrics within each window of library concentration.
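As a rough illustration of how the three dashed-line cutoffs and the windowed failure rate in panel d could be combined, here is a short sketch. It assumes that a sample fails QC when any one of the three metrics falls below its cutoff, and that windows are fixed-width bins of library concentration; neither detail is spelled out in the caption, and all names are hypothetical.

```python
from dataclasses import dataclass

# Cutoffs taken from the dashed lines described in the caption.
MIN_MEDIAN_CORRELATION = 0.75
MIN_DETECTED_GENES = 11_400          # genes with TPM > 4
MIN_GENE_MAPPED_READS = 25_000_000


@dataclass
class SampleMetrics:
    median_correlation: float
    detected_genes: int
    gene_mapped_reads: int


def failed_metrics(m):
    """Return the names of the metrics that fall below their cutoff."""
    failures = []
    if m.median_correlation < MIN_MEDIAN_CORRELATION:
        failures.append("median_correlation")
    if m.detected_genes < MIN_DETECTED_GENES:
        failures.append("detected_genes")
    if m.gene_mapped_reads < MIN_GENE_MAPPED_READS:
        failures.append("gene_mapped_reads")
    return failures


def qc_status(m):
    """ASSUMPTION: failing any single metric marks the sample as FAIL."""
    return "FAIL" if failed_metrics(m) else "PASS"


def windowed_failure_rate(concentrations, statuses, width=1.0):
    """Fraction of FAIL samples per concentration window.
    ASSUMPTION: fixed-width bins [k*width, (k+1)*width); the caption does not define the windows."""
    bins = {}
    for conc, status in zip(concentrations, statuses):
        start = int(conc // width) * width
        n_fail, n_total = bins.get(start, (0, 0))
        bins[start] = (n_fail + (status == "FAIL"), n_total + 1)
    return {start: n_fail / n_total for start, (n_fail, n_total) in sorted(bins.items())}


# Hypothetical samples for illustration only.
good = SampleMetrics(0.90, 13_000, 40_000_000)
shallow = SampleMetrics(0.88, 12_000, 18_000_000)
rates = windowed_failure_rate(
    [0.5, 0.8, 1.2, 1.9, 2.5],
    ["FAIL", "FAIL", "FAIL", "PASS", "PASS"],
)
```

Returning the list of failed metrics alongside the overall status makes it easier to see whether a sample was lost to low depth, low gene detection, or poor concordance.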
Fig. 4 A decision tree model to predict QC pass/fail based on pre-sequencing lab metrics. QC pass and fail refer to sample status defined by bioinformatics metrics; QC-failed samples were those excluded from the final dataset. a Parameter tuning based on repeated cross-validation using a grid search over 10 values of the complexity parameter. The complexity parameter with the highest cross-validation accuracy was used to build the final model. b Decision tree diagram, with branches indicating the specific cutoffs on pre-sequencing metrics that were predictive of QC pass/fail status. Samples with RNA Qubit higher than 25 ng/ul and pre-capture library Qubit higher than 1.7 ng/ul show the best RNA-seq data quality. Each box (node) contains three values. The upper value (PASS/FAIL) indicates the QC status predicted from pre-sequencing lab metrics at that branch of the decision tree. The middle value indicates the proportion of QC-pass samples as defined by bioinformatics metrics. The bottom value indicates the percentage of all samples that fall within the box. The lower panel shows a heatmap of the three metrics (number of gene-mapped reads, number of detected genes with TPM higher than 4, sample-wise median correlation) that were used to define QC status. The upper annotation bar of the heatmap indicates the three leaf nodes predicted by the decision tree. c Relative contribution/influence of the pre-sequencing lab metrics in building the final model.
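The middle and bottom values in each tree box can be reproduced for the leaf nodes with a small sketch like the one below. The two cutoffs (25 ng/ul RNA Qubit, 1.7 ng/ul pre-capture library Qubit) come from the caption. The split order (library Qubit first, then RNA Qubit) is an assumption made only to obtain three leaves, and the leaf labels and function names are hypothetical; this is not the fitted model itself.

```python
from collections import defaultdict

RNA_QUBIT_CUTOFF = 25.0      # ng/ul, from the caption
LIBRARY_QUBIT_CUTOFF = 1.7   # ng/ul, from the caption


def leaf_for(rna_qubit, library_qubit):
    """Assign a sample to one of three leaves.
    ASSUMPTION: the tree splits on library Qubit first, then on RNA Qubit;
    the caption gives the cutoffs but not the split order."""
    if library_qubit <= LIBRARY_QUBIT_CUTOFF:
        return "library <= 1.7"
    if rna_qubit <= RNA_QUBIT_CUTOFF:
        return "library > 1.7, RNA <= 25"
    return "library > 1.7, RNA > 25"


def summarize_leaves(samples):
    """samples: iterable of (rna_qubit, library_qubit, passed_qc) tuples, where
    passed_qc is the bioinformatics-defined status. Returns, per leaf, the
    proportion of QC-pass samples and the percentage of all samples in the leaf."""
    counts = defaultdict(lambda: [0, 0])  # leaf -> [n_pass, n_total]
    total = 0
    for rna_qubit, library_qubit, passed_qc in samples:
        leaf = counts[leaf_for(rna_qubit, library_qubit)]
        leaf[0] += int(passed_qc)
        leaf[1] += 1
        total += 1
    return {
        name: {"pass_ratio": n_pass / n, "pct_of_samples": 100.0 * n / total}
        for name, (n_pass, n) in counts.items()
    }


# Hypothetical samples for illustration only.
toy_samples = [
    (30.0, 2.0, True),
    (40.0, 3.0, True),
    (10.0, 2.0, False),
    (10.0, 2.5, True),
    (50.0, 1.0, False),
]
summary = summarize_leaves(toy_samples)
```

In this toy input, the leaf with both values above their cutoffs has the highest pass ratio, which mirrors the pattern the caption describes for that leaf.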