
Discovery of common and rare genetic risk variants for colorectal cancer.

Jeroen R Huyghe¹, Stephanie A Bien¹, Tabitha A Harrison¹, Hyun Min Kang², Sai Chen², Stephanie L Schmit³, David V Conti⁴, Conghui Qu¹, Jihyoun Jeon⁵, Christopher K Edlund⁴, Peyton Greenside⁶, Michael Wainberg⁷, Fredrick R Schumacher⁸, Joshua D Smith⁹, David M Levine¹⁰, Sarah C Nelson¹⁰, Nasa A Sinnott-Armstrong¹¹, Demetrius Albanes¹², M Henar Alonso^13,14,15, Kristin Anderson¹⁶, Coral Arnau-Collell¹⁷, Volker Arndt¹⁸, Christina Bamia^19,20, Barbara L Banbury¹, John A Baron²¹, Sonja I Berndt¹², Stéphane Bézieau²², D Timothy Bishop²³, Juergen Boehm²⁴, Heiner Boeing²⁵, Hermann Brenner^18,26,27, Stefanie Brezina²⁸, Stephan Buch²⁹, Daniel D Buchanan^30,31,32, Andrea Burnett-Hartman³³, Katja Butterbach¹⁸, Bette J Caan³⁴, Peter T Campbell³⁵, Christopher S Carlson^1,36, Sergi Castellví-Bel¹⁷, Andrew T Chan^{37,38,39,40,41,42}, Jenny Chang-Claude^43,44, Stephen J Chanock¹², Maria-Dolores Chirlaque^14,45, Sang Hee Cho⁴⁶, Charles M Connolly¹, Amanda J Cross^47,48, Katarina Cuk¹⁸, Keith R Curtis¹, Albert de la Chapelle⁴⁹, Kimberly F Doheny⁵⁰, David Duggan⁵¹, Douglas F Easton^52,53, Sjoerd G Elias⁵⁴, Faye Elliott²³, Dallas R English^55,56, Edith J M Feskens⁵⁷, Jane C Figueiredo^58,59, Rocky Fischer⁶⁰, Liesel M FitzGerald^56,61, David Forman⁶², Manish Gala^37,39, Steven Gallinger⁶³, W James Gauderman⁴, Graham G Giles^55,56, Elizabeth Gillanders⁶⁴, Jian Gong¹, Phyllis J Goodman⁶⁵, William M Grady⁶⁶, John S Grove⁶⁷, Andrea Gsur²⁸, Marc J Gunter⁶⁸, Robert W Haile⁶⁹, Jochen Hampe²⁹, Heather Hampel⁷⁰, Sophia Harlid⁷¹, Richard B Hayes⁷², Philipp Hofer²⁸, Michael Hoffmeister¹⁸, John L Hopper^55,73, Wan-Ling Hsu¹⁰, Wen-Yi Huang¹², Thomas J Hudson⁷⁴, David J Hunter^41,75, Gemma Ibañez-Sanz^13,76,77, Gregory E Idos⁴, Roxann Ingersoll⁵⁰, Rebecca D Jackson⁷⁸, Eric J Jacobs³⁵, Mark A Jenkins⁵⁵, Amit D Joshi^39,41, Corinne E Joshu⁷⁹, Temitope O Keku⁸⁰, Timothy J Key⁸¹, Hyeong Rok Kim⁸², Emiko Kobayashi¹, Laurence N Kolonel⁸³, Charles Kooperberg¹, Tilman Kühn⁴³, Sébastien Küry²², Sun-Seog 
Kweon^84,85, Susanna C Larsson⁸⁶, Cecelia A Laurie¹⁰, Loic Le Marchand⁶⁷, Suzanne M Leal⁸⁷, Soo Chin Lee^88,89, Flavio Lejbkowicz^90,91,92, Mathieu Lemire⁷⁴, Christopher I Li¹, Li Li⁹³, Wolfgang Lieb⁹⁴, Yi Lin¹, Annika Lindblom^95,96, Noralane M Lindor⁹⁷, Hua Ling⁵⁰, Tin L Louie¹⁰, Satu Männistö⁹⁸, Sanford D Markowitz⁹⁹, Vicente Martín^14,100, Giovanna Masala¹⁰¹, Caroline E McNeil¹⁰², Marilena Melas⁴, Roger L Milne^55,56, Lorena Moreno¹⁷, Neil Murphy⁶⁸, Robin Myte⁷¹, Alessio Naccarati^103,104, Polly A Newcomb^1,36, Kenneth Offit^105,106, Shuji Ogino^{40,41,107,108}, N Charlotte Onland-Moret⁵⁴, Barbara Pardini^104,109, Patrick S Parfrey¹¹⁰, Rachel Pearlman⁷⁰, Vittorio Perduca^111,112, Paul D P Pharoah⁵², Mila Pinchev⁹¹, Elizabeth A Platz⁷⁹, Ross L Prentice¹, Elizabeth Pugh⁵⁰, Leon Raskin¹¹³, Gad Rennert^91,92,114, Hedy S Rennert^91,92,114, Elio Riboli¹¹⁵, Miguel Rodríguez-Barranco^14,116, Jane Romm⁵⁰, Lori C Sakoda^1,117, Clemens Schafmayer¹¹⁸, Robert E Schoen¹¹⁹, Daniela Seminara⁶⁴, Mitul Shah⁵³, Tameka Shelford⁵⁰, Min-Ho Shin⁸⁴, Katerina Shulman¹²⁰, Sabina Sieri¹²¹, Martha L Slattery¹²², Melissa C Southey¹²³, Zsofia K Stadler¹²⁴, Christa Stegmaier¹²⁵, Yu-Ru Su¹, Catherine M Tangen⁶⁵, Stephen N Thibodeau¹²⁶, Duncan C Thomas⁴, Sushma S Thomas¹, Amanda E Toland¹²⁷, Antonia Trichopoulou^19,20, Cornelia M Ulrich²⁴, David J Van Den Berg⁴, Franzel J B van Duijnhoven⁵⁷, Bethany Van Guelpen⁷¹, Henk van Kranen¹²⁸, Joseph Vijai¹²⁴, Kala Visvanathan⁷⁹, Pavel Vodicka^103,129,130, Ludmila Vodickova^103,129,130, Veronika Vymetalkova^103,129,130, Korbinian Weigl^18,27,131, Stephanie J Weinstein¹², Emily White¹, Aung Ko Win^32,55, C Roland Wolf¹³², Alicja Wolk^86,133, Michael O Woods¹³⁴, Anna H Wu⁴, Syed H Zaidi⁷⁴, Brent W Zanke¹³⁵, Qing Zhang¹³⁶, Wei Zheng¹³⁷, Peter C Scacheri¹³⁸, John D Potter¹, Michael C Bassik¹¹, Anshul Kundaje^7,11, Graham Casey¹³⁹, Victor Moreno^13,14,15,77, Goncalo R Abecasis², Deborah A Nickerson⁹, Stephen B Gruber⁴, Li Hsu^1,10, Ulrike Peters^140,141.

Abstract

To further dissect the genetic architecture of colorectal cancer (CRC), we performed whole-genome sequencing of 1,439 cases and 720 controls, imputed discovered sequence variants and Haplotype Reference Consortium panel variants into genome-wide association study data, and tested for association in 34,869 cases and 29,051 controls. Findings were followed up in an additional 23,262 cases and 38,296 controls. We discovered a strongly protective 0.3% frequency variant signal at CHD1. In a combined meta-analysis of 125,478 individuals, we identified 40 new independent signals at P < 5 × 10−8, bringing the number of known independent signals for CRC to ~100. New signals implicate lower-frequency variants, Krüppel-like factors, Hedgehog signaling, Hippo-YAP signaling, long noncoding RNAs and somatic drivers, and support a role for immune function. Heritability analyses suggest that CRC risk is highly polygenic, and larger, more comprehensive studies enabling rare variant analysis will improve understanding of biology underlying this risk and influence personalized screening strategies and drug development.


Year: 2018 PMID: 30510241 PMCID: PMC6358437 DOI: 10.1038/s41588-018-0286-6

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 41.307

Colorectal cancer (CRC) is the fourth leading cancer-related cause of death worldwide[1] and presents a major public health burden. Up to 35% of inter-individual variability in CRC risk has been attributed to genetic factors[2,3]. Family-based studies have identified rare high-penetrance mutations in at least a dozen genes but, collectively, these account for only a small fraction of familial risk[4]. Over the past decade, genome-wide association studies (GWAS) for sporadic CRC, which constitutes the majority of cases, have identified approximately 60 association signals at over 50 loci[5-22]. Yet, most of the genetic factors contributing to CRC risk remain undefined. This severely hampers our understanding of biological processes underlying CRC. It also limits CRC precision prevention, including individualized preventive screening recommendations and development of cancer prevention drugs. The contribution of rare variation to sporadic CRC is particularly poorly understood. To expand the catalog of CRC risk loci and improve our understanding of rare variants, genes, and pathways influencing sporadic CRC risk, and risk prediction, we performed the largest and most comprehensive whole-genome sequencing (WGS) study and GWAS meta-analysis for CRC to date, combining data from three consortia: the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), the Colorectal Cancer Transdisciplinary Study (CORECT), and the Colon Cancer Family Registry (CCFR). Our study almost doubles the number of individuals analyzed, incorporating GWAS results from >125,000 individuals, and substantially expands and strengthens our understanding of biological processes underlying CRC risk.

RESULTS

Study Overview

We performed WGS of 1,439 CRC cases and 720 controls of European ancestry at low coverage (3.8–8.6×). We detected, called, and estimated haplotype phase for 31.8 million genetic variants, including 1.7 million short insertion-deletion variants (indels) (Online Methods). These data include many rare variants not studied by GWAS. Based on other large-scale WGS studies employing a similar design, we expected to have near-complete ascertainment of single nucleotide variants (SNVs) with minor allele count (MAC) greater than five (minor allele frequency (MAF) >0.1%), and high accuracy at heterozygous genotypes[23,24]. We tested 14.4 million variants with MAC ≥5 for CRC association using logistic regression (Online Methods) but did not find any significant associations. To increase power to detect associations with rare and low-frequency variants of modest effect, we imputed variants from the sequencing experiment into 34,869 cases and 29,051 controls of predominantly European (91.7%) and East Asian ancestry (8.3%) from 30 existing GWAS studies (Online Methods and Supplementary Table 1). By design, two thirds of sequenced individuals were CRC cases, thereby enriching the panel for rare or low-frequency alleles that increase CRC risk. We contributed our sequencing data to the Haplotype Reference Consortium (HRC)[25] and imputed the 30 existing GWAS studies to the HRC panel, which comprises haplotypes for 32,488 individuals. Results of these GWAS meta-analyses (referred to as Stage 1 meta-analysis; Online Methods) informed the design of a custom Illumina array comprising the OncoArray, a custom array to identify cancer risk loci[26], and 15,802 additional variants selected based on Stage 1 meta-analysis results. 
We genotyped 12,007 cases and 12,000 controls of European ancestry with this custom array, and combined them with an additional 11,255 cases and 26,296 controls with GWAS data, resulting in a Stage 2 meta-analysis of 23,262 CRC cases and 38,296 controls (Online Methods, Supplementary Fig. 1, and Supplementary Table 1). Next, we performed a combined (Stage 1 + Stage 2) meta-analysis of up to 58,131 cases and 67,347 controls. This meta-analysis was based on the HRC-panel-imputed data because, given its large size, this panel results in superior imputation quality and enables accurate imputation of variants with MAFs as low as 0.1%[25]. Here, we report new association signals discovered through our custom genotyping experiment and replicating in Stage 2 at the Bonferroni significance threshold of P < 7.8×10−6 (Online Methods), as well as distinct association signals passing the genome-wide significance (GWS) threshold of P < 5×10−8 in the combined meta-analysis of up to 125,478 individuals.
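The sample-size bookkeeping and the MAC-to-MAF correspondence described above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the counts quoted in this section; the variable names are illustrative, and the diploid relation MAF = MAC / (2N) is assumed for the sequenced sample.

```python
# Counts quoted in the Study Overview, as (cases, controls).
wgs = (1_439, 720)
stage1 = (34_869, 29_051)
custom_array = (12_007, 12_000)   # genotyped on the custom Illumina array
extra_gwas = (11_255, 26_296)     # additional samples with GWAS data


def add(a, b):
    """Element-wise sum of two (cases, controls) tuples."""
    return (a[0] + b[0], a[1] + b[1])


stage2 = add(custom_array, extra_gwas)
combined = add(stage1, stage2)
total_individuals = sum(combined)

# Assumption: diploid autosomes, so N individuals carry 2N alleles.
n_sequenced = sum(wgs)
maf_at_mac5 = 5 / (2 * n_sequenced)

print("Stage 2:", stage2)
print("Combined:", combined, "total:", total_individuals)
print(f"MAF at MAC = 5 in {n_sequenced} sequenced individuals: {maf_at_mac5:.3%}")
```

Under these assumptions, a minor allele count of five among the sequenced individuals corresponds to a frequency slightly above 0.1%, in line with the threshold stated in the text.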

CRC risk loci

In the combined meta-analysis, we identified 30 new CRC risk loci reaching GWS and >500kb away from previously reported CRC risk variants (Table 1; Supplementary Fig. 2 and 3). Twenty-two of these were represented on our custom genotyping panel, either by the lead variant (15 loci) or by a variant in linkage disequilibrium (LD) (7 loci; r2>0.7). Of these 22 variants, eight attained the Bonferroni significance threshold in the Stage 2 meta-analysis (Table 1).

Table 1

New CRC risk loci reaching genome-wide significance (P < 5×10−8) in the combined (Stage 1 and Stage 2) meta-analysis.

							Stage 1 meta-analysis: up to 34,869 cases and 29,051 controls			Stage 2 meta-analysis: up to 23,262 cases and 38,296 controls			Combined meta-analysis: up to 58,131 cases and 67,347 controls

Locus	Nearby gene(s)	rsID lead variant	Chr.	Position (Build 37)	Alleles (risk/other)	RAF (%)	OR	95% CI	P	OR	95% CI	P	OR	95% CI	P
Rare variants

5q21.1	RGMB; CHD1	rs145364999*	5	98,206,082	T/A	99.69	1.57	1.20–2.05	9.0×10⁻⁴	1.93	1.48–2.52	1.0×10⁻⁶	1.74	1.45–2.10	6.3×10⁻⁹

Low-frequency variants

3q13.2	BOC	rs72942485	3	112,999,560	G/A	98.02	1.16	1.07–1.26	2.5×10⁻⁴	1.23	1.12–1.35	1.5×10⁻⁵	1.19	1.12–1.26	2.1×10⁻⁸

Common variants

1p34.3	FHL3	rs4360494[§]	1	38,455,891	G/C	45.39	1.05	1.03–1.08	2.9×10⁻⁵	1.06	1.03–1.08	3.3×10⁻⁵	1.05	1.04–1.07	3.8×10⁻⁹
1p32.3	TTC22; PCSK9	rs12144319*	1	55,246,035	C/T	25.48	1.07	1.04–1.10	1.4×10⁻⁶	1.07	1.04–1.10	5.5×10⁻⁶	1.07	1.05–1.09	3.3×10⁻¹¹
2q24.2	MARCH7; TANC1	rs448513[§]	2	159,964,552	C/T	32.60	1.06	1.03–1.08	1.9×10⁻⁵	1.05	1.02–1.08	5.8×10⁻⁴	1.05	1.03–1.07	4.4×10⁻⁸
2q33.1	SATB2	rs983402*	2	199,781,586	T/C	33.12	1.05	1.03–1.08	7.2×10⁻⁵	1.08	1.05–1.11	1.0×10⁻⁸	1.07	1.05–1.09	7.7×10⁻¹²
3q22.2	SLCO2A1	rs10049390[§]	3	133,701,119	A/G	73.53	1.06	1.03–1.09	4.9×10⁻⁵	1.07	1.04–1.10	1.8×10⁻⁵	1.06	1.04–1.08	3.8×10⁻⁹
4q24	TET2	rs1391441	4	106,128,760	A/G	67.20	1.05	1.02–1.07	1.5×10⁻⁴	1.06	1.03–1.09	2.3×10⁻⁵	1.05	1.03–1.07	1.6×10⁻⁸
4q31.21	HHIP	rs11727676	4	145,659,064	C/T	9.80	1.08	1.03–1.13	4.5×10⁻⁴	1.10	1.05–1.14	1.5×10⁻⁵	1.09	1.06–1.12	2.9×10⁻⁸
6p21.32	HLA-DRB1; HLA-DQA1	rs9271695*	6	32,593,080	G/A	79.54	1.09	1.06–1.13	1.3×10⁻⁷	1.09	1.05–1.12	1.7×10⁻⁷	1.09	1.07–1.12	1.1×10⁻¹³
7p13	MYO1G; SNHG15; CCM2; TBRG4	rs12672022[§]	7	45,136,423	T/C	83.45	1.07	1.04–1.11	1.6×10⁻⁵	1.06	1.03–1.10	4.4×10⁻⁴	1.07	1.04–1.09	2.8×10⁻⁸
9p21.3	ANRIL; CDKN2A; CDKN2B	rs1537372[§]	9	22,103,183	G/T	56.92	1.05	1.02–1.07	1.4×10⁻⁴	1.06	1.03–1.08	2.4×10⁻⁵	1.05	1.03–1.07	1.4×10⁻⁸
9q22.33	GALNT12; TGFBR1	rs34405347[§]	9	101,679,752	T/G	90.34	1.08	1.04–1.13	5.5×10⁻⁵	1.09	1.04–1.13	1.5×10⁻⁴	1.09	1.05–1.12	3.1×10⁻⁸
9q31.3	LPAR1	rs10980628	9	113,671,403	C/T	21.06	1.05	1.02–1.09	3.1×10⁻⁴	1.08	1.05–1.11	1.3×10⁻⁶	1.07	1.04–1.09	2.8×10⁻⁹
11q22.1	YAP1	rs2186607	11	101,656,397	T/A	51.78	1.05	1.03–1.08	1.1×10⁻⁵	1.05	1.03–1.08	3.3×10⁻⁵	1.05	1.04–1.07	1.5×10⁻⁹
12q12	PRICKLE1; YAF2	rs11610543[§]	12	43,134,191	G/A	50.13	1.05	1.03–1.08	1.1×10⁻⁵	1.06	1.03–1.08	2.8×10⁻⁵	1.05	1.04–1.07	1.3×10⁻⁹
12q13.3	STAT6; LRP1; NAB2	rs4759277	12	57,533,690	A/C	35.46	1.07	1.04–1.09	8.4×10⁻⁷	1.04	1.02–1.07	1.6×10⁻³	1.05	1.04–1.07	9.4×10⁻⁹
13q13.3	SMAD9	rs7333607*	13	37,462,010	G/A	23.50	1.09	1.06–1.12	2.5×10⁻⁸	1.07	1.04–1.10	4.4×10⁻⁶	1.08	1.06–1.10	6.3×10⁻¹³
13q22.1	KLF5	rs78341008[§]	13	73,791,554	C/T	7.19	1.13	1.07–1.18	1.4×10⁻⁶	1.11	1.05–1.16	4.8×10⁻⁵	1.12	1.08–1.16	3.2×10⁻¹⁰
13q34	COL4A2; COL4A1; RAB20	rs8000189	13	111,075,881	T/C	64.01	1.05	1.02–1.07	2.1×10⁻⁴	1.07	1.04–1.10	1.3×10⁻⁶	1.06	1.04–1.08	1.8×10⁻⁹
14q23.1	DACT1	rs17094983[§]	14	59,189,361	G/A	87.73	1.10	1.07–1.15	8.4×10⁻⁸	1.08	1.04–1.12	9.0×10⁻⁵	1.09	1.06–1.12	4.6×10⁻¹¹
15q22.33	SMAD3	rs56324967*	15	67,402,824	C/T	67.57	1.07	1.04–1.10	2.2×10⁻⁷	1.08	1.05–1.11	9.8×10⁻⁸	1.07	1.05–1.09	1.1×10⁻¹³
16q23.2	MAF	rs9930005[§]	16	80,043,258	C/A	43.03	1.05	1.03–1.08	1.3×10⁻⁵	1.05	1.02–1.07	4.0×10⁻⁴	1.05	1.03–1.07	2.1×10⁻⁸
17p12	LINC00675	rs1078643*	17	10,707,241	A/G	76.36	1.07	1.04–1.10	9.2×10⁻⁶	1.09	1.05–1.12	1.1×10⁻⁷	1.08	1.05–1.10	6.6×10⁻¹²
17q24.3	LINC00673	rs983318[§]	17	70,413,253	A/G	25.26	1.07	1.04–1.10	1.2×10⁻⁶	1.05	1.02–1.08	8.0×10⁻⁴	1.06	1.04–1.08	5.6×10⁻⁹
17q25.3	RAB40B; METRLN	rs75954926*	17	81,061,048	G/A	65.68	1.10	1.07–1.13	9.4×10⁻¹¹	1.09	1.06–1.12	4.8×10⁻⁹	1.09	1.07–1.11	3.0×10⁻¹⁸
19p13.11	KLF2	rs34797592[§]	19	16,417,198	T/C	11.82	1.09	1.05–1.13	8.2×10⁻⁶	1.09	1.05–1.13	1.2×10⁻⁵	1.09	1.06–1.12	4.2×10⁻¹⁰
19q13.43	TRIM28	rs73068325	19	59,079,096	T/C	18.26	1.06	1.03–1.09	2.1×10⁻⁴	1.07	1.04–1.11	5.0×10⁻⁵	1.07	1.04–1.09	4.2×10⁻⁸
20q13.12	TOX2; HNF4A	rs6031311[§]	20	42,666,475	T/C	75.91	1.07	1.04–1.10	1.7×10⁻⁶	1.05	1.02–1.08	7.6×10⁻⁴	1.06	1.04–1.08	6.8×10⁻⁹
20q13.33	TNFRSF6B; RTEL1	rs2738783[§,¶]	20	62,308,612	T/G	20.29	1.07	1.04–1.10	2.6×10⁻⁶	1.05	1.02–1.08	3.3×10⁻³	1.06	1.04–1.08	5.3×10⁻⁸

Lead variant is the most associated variant at the locus. rsIDs based on NCBI dbSNP Build 150. Alleles are on the + strand. Chr.: Chromosome. RAF: Risk allele frequency, based on stage 2 data. OR, odds ratio estimate for the risk allele. All P-values reported in this table are based on fixed-effects inverse variance-weighted meta-analysis.

* Indicates that variant or LD proxy (r2>0.7) was selected for our custom genotyping panel and formally replicates in the Stage 2 meta-analysis at a Bonferroni significance threshold of P < 7.8×10−6.

§ Indicates that variant or LD proxy (r2>0.7) was selected for our custom genotyping panel but did not attain Bonferroni significance in the Stage 2 meta-analysis.

¶ This SNP reached genome-wide significance in the combined (Stage 1 + Stage 2) sample-size weighted meta-analysis based on likelihood ratio test results (P = 4.9×10−8).
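As an illustration of the fixed-effects inverse variance-weighted meta-analysis named in the table legend, the sketch below combines the Stage 1 and Stage 2 estimates for rs145364999 using only the rounded ORs and confidence intervals printed in Table 1. It is a simplified two-input approximation, not the study's pipeline (which pooled the individual contributing studies), so the result is expected to agree with the reported combined estimate only up to rounding. Function names are illustrative. The last line shows that an OR for the risk allele converts to an OR for the other allele by taking the reciprocal.

```python
import math

Z_975 = 1.959964  # standard normal quantile used for a 95% CI


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover log(OR) and its standard error from an OR and its 95% CI."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_975)
    return beta, se


def ivw_meta(estimates):
    """Fixed-effects inverse variance-weighted pooling of (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Rounded values for rs145364999 (risk allele T) copied from Table 1.
stage1 = log_or_and_se(1.57, 1.20, 2.05)
stage2 = log_or_and_se(1.93, 1.48, 2.52)

beta, se = ivw_meta([stage1, stage2])
combined_or = math.exp(beta)
combined_ci = (math.exp(beta - Z_975 * se), math.exp(beta + Z_975 * se))
combined_p = two_sided_p(beta / se)

# OR for the rare (non-risk) allele is the reciprocal of the risk-allele OR.
stage2_rare_allele_or = 1.0 / 1.93

print(f"Combined OR {combined_or:.2f}, 95% CI {combined_ci[0]:.2f}-{combined_ci[1]:.2f}")
print(f"Combined P {combined_p:.1e}; Stage 2 rare-allele OR {stage2_rare_allele_or:.2f}")
```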

Among these eight loci is the first rare variant signal identified for sporadic CRC, involving five 0.3% frequency variants at 5q21.1, near genes CHD1 and RGMB. SNP rs145364999, intronic to CHD1, had high quality genotyping (Supplementary Fig. 4). The variant was well imputed in the remaining sample sets (imputation quality r2 ranged from 0.66 to 0.87; Supplementary Table 2) and there was no evidence of heterogeneity of effects (heterogeneity P=0.63; Supplementary Table 2). The rare allele confers a strong protective effect (allelic odds ratio (OR)=0.52 in Stage 2; 95% confidence interval (CI)=0.40–0.68). Chromatin remodeling factor CHD1 provides an especially plausible candidate and has been shown to be a synthetically-essential gene[27] that is occasionally deleted in some cancers, but always retained in PTEN-deficient cancers[28]. The resulting mutually exclusive deletion pattern of CHD1 and PTEN has been observed in prostate, breast, and CRC TCGA data[28]. We hypothesize that the rare allele confers a protective effect through lowering CHD1 expression, which is required for nuclear factor-κB (NF-κB) pathway activation and growth in cancer cells driven by loss of the tumor suppressor PTEN[28]. However, we cannot rule out involvement of nearby candidate gene RGMB that encodes a co-receptor for bone morphogenetic proteins BMP2 and BMP4, both of which are linked to CRC risk through GWAS[9,11]. Additionally, RGMB has been shown to bind to PD-L2[29], a known ligand of PD-1, an immune checkpoint blockade inhibitor targeted by cancer immunotherapy[30].

The vast majority of new association signals involve common variants. We found associations near strong candidate genes for CRC risk in pathways or gene families not previously implicated by GWAS. Locus 13q22.1, represented by lead SNP rs78341008 (MAF 7.2%; P=3.2×10−10), is near KLF5, a known CRC oncogene that can be activated by somatic hotspot mutations or super-enhancer duplications[31,32].
KLF5 encodes transcription factor Krüppel-like factor 5 (KLF5), which promotes cell proliferation and is highly expressed in intestinal crypt stem cells. We also found an association at 19p13.11, near KLF2. KLF2 expression in endothelial cells is critical for normal blood vessel function[33,34]. Down-regulated KLF2 expression in colon tumor tissues contributes to structurally and functionally abnormal tumor blood vessels, resulting in impaired blood flow and hypoxia in tumors[35]. Another locus at 9q31.3 is near LPAR1, which encodes a receptor for lysophosphatidic acid (LPA). LPA-induced expression of hypoxia-inducible factor 1α (HIF-1α), a key regulator of cellular adaptation to hypoxia and tumorigenesis, depends on KLF5[36]. Additionally, LPA activates multiple signaling pathways and stimulates proliferation of colon cancer cells by activation of KLF5[37]. Another locus (7p13) is near SNHG15, encoding a long non-coding RNA (lncRNA) that epigenetically represses KLF2 to promote pancreatic cancer proliferation[38].

We found two loci near members of the Hedgehog (Hh) signaling pathway. Aberrant activation of this pathway, caused by somatic mutations or changes in expression, can drive tumorigenesis in many tumors[39]. Notably, downregulated stromal cell Hh signaling reportedly accelerates colonic tumorigenesis in mice[40]. Locus 3q13.2, represented by low-frequency lead SNP rs72942485 (MAF 2.2%; P=2.1×10−8), overlaps BOC, encoding a Hh coreceptor molecule. In medulloblastoma, upregulated BOC promotes Hh-driven tumor progression through Cyclin D1-induced DNA damage[41]. In pancreatic cancer, a complex role for stromal BOC expression in tumorigenesis and angiogenesis has been reported[42]. Locus 4q31.21 is near HHIP, encoding an inhibitor of Hh signaling. Of note, the Hh signaling pathway was also significantly enriched in our pathway analysis (described below).
Locus 11q22.1 is near YAP1, which encodes a critical downstream regulatory target in the Hippo signaling pathway that is gaining recognition as a pivotal player in organ size control and tumorigenesis[43]. YAP1 is highly expressed in intestinal crypt stem cells, and in transgenic mice, overexpression resulted in severe intestinal dysplasia and loss of differentiated cell types[44], reminiscent of phenotypes observed in mice and humans with deleterious germline APC mutations. Further, hypoxia-inducible factor 2α (HIF-2α) promotes colon cancer growth by up-regulating YAP1 activity[45].

We provide further evidence for a link between immune function and CRC pathogenesis, and implicate the major histocompatibility complex (MHC) in CRC risk. We identified a locus near genes HLA-DRB1/HLA-DQA1, which is associated with immune-mediated diseases[46].

We identified two new loci near known tumor suppressor genes. Locus 4q24 is near TET2, a chromatin-remodeling gene frequently somatically mutated in multiple cancers, including colon cancer[47], and overlapping GWAS signals for multiple other cancers[48-50]. The CDKN2B-CDKN2A-ANRIL locus at 9p21.3 is a well-established hot spot of pleiotropic GWAS associations for many complex diseases including coronary artery disease[51], type 2 diabetes[52], and cancers[50,53,54-56]. Interestingly, lead variant rs1537372 is in high LD (r2=0.82) with variants associated with coronary artery disease[51] and endometriosis[57], but not with the other cancer-associated variants. CDKN2A/B encode cyclin-dependent kinase inhibitors that regulate the cell cycle. CDKN2A is one of the most commonly inactivated genes in cancer, and is a high penetrance gene for melanoma[58,59]. CDKN2B activation is tightly controlled by the cytokine TGF-β, further linking this signaling pathway with CRC tumorigenesis[60].

Our findings implicate genes in pathways with established roles in CRC pathogenesis.
We identified loci at SMAD3 and SMAD9, members of the TGF-β signaling pathway that includes genes linked to familial CRC syndromes (e.g., SMAD4 and BMPR1A) and several GWAS-implicated genes (e.g., SMAD7, BMP2, BMP4)[61]. We identified another locus near TGF-β Receptor 1 (TGFBR1). Nearby gene GALNT12 reportedly harbors inactivating germline and somatic mutations in human colon cancers[62] and, therefore, could also be the regulated effector gene. We identified a locus at 14q23.1 near DACT1, a member of the Wnt-β-catenin pathway with genes previously linked to familial CRC syndromes (APC[63]), and several GWAS-implicated genes (e.g., CTNNB1[18] and TCF7L2[17]).

Genes related to telomere biology were linked by other GWAS: TERC[10] and TERT[22], encoding the RNA and protein subunit of telomerase respectively, and FEN1[17], involved in telomere stability[64]. A new locus at 20q13.33 harbors another gene related to telomere biology, RTEL1. This gene is involved in DNA double-strand break repair, and overlaps GWAS signals for cancers[55,65] and inflammation-related phenotypes, including inflammatory bowel disease[66] and atopic dermatitis[67].

Of 61 signals at 56 loci previously associated with CRC at GWS, 42 showed association evidence at P < 5×10−8 in the combined meta-analysis, and 55 at P < 0.05 in the independent Stage 2 meta-analysis (Supplementary Table 3). Of note, the association of rs755229494 at locus 5q22.2 (P=2.1×10−12) was driven by studies with predominantly Ashkenazi Jewish ancestry and this SNP is in perfect LD with known missense SNP rs1801155 in the APC gene (I1307K), the minor allele of which is enriched in this population (MAF 6%), but rare in other populations[68,69].

Delineating distinct association signals at CRC risk loci

To identify additional independent association signals at known or new CRC risk loci, we conducted conditional analysis using individual-level data of 125,478 participants (Online Methods). At nine loci we observed 10 new independent association signals that attained PJ <5×10−8 in a joint multiple-variant analysis (Table 2; Supplementary Table 4; Supplementary Fig. 5). Because this analysis focused on <5% of the genome, we also report signals at PJ <1×10−5 in Supplementary Table 5. At 22 loci, we observed 25 new suggestive associations with PJ <1×10−5.

Table 2

Additional new conditionally independent association signals at known and newly identified CRC risk loci that reach genome-wide significance (P < 5×10−8) in the combined meta-analysis of up to 125,478 individuals.

										Joint multiple-variant analysis
Locus	Nearby gene(s)	rsID lead variant	Chr.	Position (Build 37)	Alleles (risk/other)	RAF (%)	OR_unconditional	95% CI	P_unconditional	Conditioning variant(s)	OR_conditional	95% CI	P_conditional
Low-frequency variants

11q13.4	POLD3	rs61389091	11	74,427,921	C/T	96.06	1.23	1.18–1.29	1.2×10⁻¹⁸	rs7121958*, rs7946853	1.21	1.16–1.27	3.7×10⁻¹⁶

Common variants

2q33.1	SATB2	rs11884596	2	199,612,407	C/T	38.23	1.06	1.04–1.08	1.1×10⁻⁹	rs983402	1.06	1.04–1.07	3.6×10⁻⁹
5p15.33	TERT; CLPTM1L	rs78368589	5	1,240,204	T/C	5.97	1.14	1.10–1.18	9.4×10⁻¹²	rs2735940*	1.12	1.08–1.16	4.1×10⁻⁹
5p13.1	LINC00603; PTGER4	rs7708610	5	40,102,443	A/G	35.64	1.04	1.02–1.06	1.5×10⁻⁵	rs12514517*	1.06	1.04–1.08	3.8×10⁻⁹
6p21.32	HLA-B; MICA; MICB; NFKBIL1; TNF	rs2516420	6	31,449,620	C/T	92.63	1.10	1.06–1.13	1.3×10⁻⁷	rs9271695, rs116685461, rs116353863	1.12	1.08–1.16	2.0×10⁻¹⁰
8q24.21	MYC	rs4313119	8	128,571,855	G/T	74.86	1.06	1.04–1.08	1.0×10⁻⁹	rs6983267*, rs7013278	1.06	1.04–1.08	2.1×10⁻⁹
12p13.32	CCND2	rs3217874	12	4,400,808	T/C	42.82	1.08	1.06–1.10	1.2×10⁻¹⁷	rs3217810, rs35808169	1.06	1.04–1.08	2.4×10⁻⁹
15q13.3	GREM1	rs17816465	15	33,156,386	A/G	20.55	1.07	1.04–1.09	6.8×10⁻⁹	rs2293581, rs12708491	1.07	1.05–1.10	1.4×10⁻¹⁰
20p12.3	BMP2	rs28488	20	6,762,221	T/C	63.88	1.06	1.04–1.08	2.6×10⁻¹¹	rs189583, rs4813802, rs994308	1.07	1.05–1.09	2.6×10⁻¹⁴
20p12.3	BMP2	rs994308	20	6,603,622	C/T	59.39	1.08	1.06–1.10	4.8×10⁻¹⁸	rs189583, rs4813802, rs28488	1.06	1.05–1.08	8.6×10⁻¹²

Lead variant is the most associated variant at the locus in the conditional analysis. rsIDs based on NCBI dbSNP Build 150. Alleles are on the + strand. Chr.: Chromosome. RAF: Risk allele frequency, based on stage 2 data. OR, odds ratio estimates are for the risk allele. Conditioning variants are the lead variant of other conditionally independent association signals with P < 1×10−5 within 1-Mb of the new association signal. Because of extensive LD we used a 2-Mb distance for the MHC region (6p21.32). All lead variants for the new association signals are in linkage equilibrium with any previously reported CRC risk variants at the locus (r2 <0.10).

* Indicates that the conditioning variant is either the index variant, or a variant in LD with the index variant reported in previous GWAS. Details and full results are provided in Supplementary Table 5.
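Tables 1 and 2 report risk allele frequencies (RAF), whereas the text quotes minor allele frequencies (MAF) and groups variants as rare, low-frequency, or common. The conversion can be sketched as below. The 1% and 5% cut-offs are an assumption chosen to match how the tables group their rows; the study's formal definitions are given in the Online Methods, which are not reproduced here.

```python
def maf_from_raf(raf_percent):
    """Minor allele frequency (%) implied by a risk allele frequency (%)."""
    return min(raf_percent, 100.0 - raf_percent)


def frequency_class(maf_percent, rare_max=1.0, low_freq_max=5.0):
    """Classify a variant by MAF; thresholds are assumed, not taken from the paper."""
    if maf_percent <= rare_max:
        return "rare"
    if maf_percent <= low_freq_max:
        return "low-frequency"
    return "common"


# RAF values copied from Tables 1 and 2.
examples = {
    "rs145364999": 99.69,
    "rs61389091": 96.06,
    "rs78368589": 5.97,
}

for rsid, raf in examples.items():
    maf = maf_from_raf(raf)
    print(f"{rsid}: RAF {raf:.2f}% -> MAF {maf:.2f}% ({frequency_class(maf)})")
```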

At 11q13.4, near POLD3 and CHRDL2, we identified a new low-frequency variant (lead SNP rs61389091, MAF 3.94%) separated by a recombination hotspot from the known common variant signal[12] (LD r2 between lead SNPs <0.01). At 5p15.33, we identified another lower-frequency variant association (lead SNP rs78368589, MAF 5.97%), which was independent from the previously reported common variant signal 56kb away near TERT and CLPTM1L (LD r2 with lead SNP rs2735940 <0.01)[22]. Variants in this region were linked to many cancer types, including lung, prostate, breast, and ovarian cancer[70]. The remaining eight new signals involved common variants. At new locus 2q33.1, near genes PLCL1 and SATB2, two statistically independent associations (LD r2 between two lead SNPs <0.01) are separated by a recombination hotspot (Supplementary Fig. 5). In the MHC region, we identified a conditionally independent signal near genes involved in NF-κB signaling, including the gene encoding tumor necrosis factor-α, genes for the stress-signaling proteins MICA/MICB, and HLA-B. Locus 20p12.3, near BMP2, harbored four distinct association signals (Figure 1), two of which were reported previously[10,11] (Supplementary Table 5). All four SNPs selected in the model were in pairwise linkage equilibrium (maximum LD r2 = 0.039, between rs189583 and rs994308). Our conditional analysis further confirmed that the signal ~1-Mb centromeric of BMP2, near gene HAO1, is independent. At 8q24.21 near MYC, the locus showing the second strongest statistical evidence of association in the combined meta-analysis (lead SNP rs6983267; P = 3.4×10−64), we identified a second independent signal (lead SNP rs4313119, PJ = 2.1×10−9; LD r2 with rs6983267 <0.001).
At the recently reported locus 5p13.1[22], near the non-coding RNA gene LINC00603, we identified an additional signal (lead SNP rs7708610) that was partly masked by the reported signal in the single-variant analysis due to the negative correlation between rs7708610 and rs12514517 (r = −0.18; r2 = 0.03). This caused significance for both SNPs to increase markedly when fitted jointly (rs7708610, unconditional P = 1.5×10−5 and PJ = 3.8×10−9). At 12p13.32 near CCND2, we identified a new signal (lead SNP rs3217874, PJ = 2.4×10−9) and confirmed two previously associated signals[13-15] (Supplementary Text). At the GREM1 locus on 15q13.3, two independent signals were previously described[11]. Our analyses suggest that this locus harbors three signals. A new signal represented by SNP rs17816465 is conditionally independent from the other two signals (PJ = 1.4×10−10, conditioned on rs2293581 and rs12708491; LD with conditioning SNPs r2<0.01; Supplementary Text).
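The masking effect at 5p13.1 can be illustrated with a two-variant summary-statistic approximation. For two standardized variants with genotype correlation r and marginal test statistics z1 and z2, the joint statistic for the first variant is approximately (z1 − r·z2) / sqrt(1 − r²) in large samples. The sketch below is not the individual-level joint model used in the study. The marginal z-score for rs12514517 is a placeholder chosen purely for illustration (its value is not given in this section), and both alleles are assumed to be coded as risk-increasing.

```python
import math
from statistics import NormalDist


def z_from_two_sided_p(p):
    """Absolute z-score corresponding to a two-sided P-value."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


def p_from_z(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def joint_z(z1, z2, r):
    """Approximate joint z-scores for two correlated, standardized variants."""
    denom = math.sqrt(1.0 - r * r)
    return (z1 - r * z2) / denom, (z2 - r * z1) / denom


r = -0.18                                   # correlation reported in the text
z_new = z_from_two_sided_p(1.5e-5)          # rs7708610, unconditional P from Table 2
z_known = 8.0                               # HYPOTHETICAL marginal z for rs12514517

zj_new, zj_known = joint_z(z_new, z_known, r)

print(f"r^2 = {r * r:.2f}")
print(f"rs7708610: marginal z {z_new:.2f} -> joint z {zj_new:.2f} (P {p_from_z(zj_new):.1e})")
print(f"rs12514517 (placeholder): marginal z {z_known:.2f} -> joint z {zj_known:.2f}")
```

With a negative r and both effects in the risk-increasing direction, the term −r·z2 adds to z1, so each variant looks stronger once the other is in the model, which is the pattern described above.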

Figure 1

Conditionally independent association signals at the BMP2 locus.

Regional association plot showing the unconditional −log10(P-value) for the association with CRC risk in the combined meta-analysis of up to 125,478 individuals, as a function of genomic position (Build 37) for each variant in the region. The lead variants are indicated by a diamond symbol and their positions are indicated by dashed vertical lines. The color-labeling and shape of all other variants indicate the lead variant with which they are in strongest LD. The two new genome-wide significant signals are indicated by an asterisk.

Additionally, signals with PJ values approaching GWS were observed at new locus 3q13.2 near BOC (rs13086367, unconditional P = 6.7×10−8, PJ = 6.9×10−8, MAF=47.4%), 96kb from the low-frequency signal represented by rs72942485 (unconditional P = 2.1×10−8, PJ = 1.3×10−8, MAF=2.2%); at known locus 10q22.3 near ZMIZ1 (rs1250567, unconditional P = 3.1×10−8, PJ = 7.2×10−8, MAF=45.1%); and at new locus 13q22.1 near KLF5 (rs45597035, unconditional P = 2.7×10−9, PJ = 8.1×10−8, MAF=34.4%) (Supplementary Table 5). Furthermore, we clarify previously reported independent association signals (Supplementary Text).

Associations of CRC risk variants with other traits

Nineteen of the GWS association signals for CRC were in high LD (r2>0.7) with at least one SNP in the NHGRI-EBI GWAS Catalog[46] that has significant association in GWAS of other traits. Notable overlap included SNPs associated with other cancers, immune-related traits (e.g., tonsillectomy, inflammatory bowel disease, and circulating white blood cell traits), obesity traits, blood pressure, and other cardiometabolic traits (Supplementary Table 6).

Mechanisms underlying CRC association signals

To further localize variants driving the 40 newly identified signals, we used association evidence to define credible sets of variants that are 99% likely to contain the causal variant (Online Methods). The 99% credible set size for new loci ranged from one (17p12) to 93 (2q33.1). For 11 distinct association signals, the set included ten or fewer variants (Supplementary Table 7). At locus 17p12, we narrowed the candidate variant to rs1078643, located in exon 1 of the lncRNA LINC00675 that is primarily expressed in gastrointestinal tissues. Small credible sets were observed for locus 4q31.21 (two variants, indexed by synonymous SNP rs11727676 in HHIP), and signals at known loci near GREM1 (one variant) and CCND2 (two variants). We performed functional annotation of credible set variants to nominate putative causal variants. Eight sets contained coding variants but only the synonymous SNP in HHIP had a high posterior probability of driving the association (Supplementary Table 8). Next, we examined overlap of credible sets with regulatory genomic annotations from 51 existing CRC-relevant datasets to examine non-coding functions (Online Methods). Also, to better refine regulatory elements in active enhancers, we performed ATAC-seq to measure chromatin accessibility in four colonic crypts and used resulting data to annotate GWAS signals. Of the 40 sets, 36 overlapped with active enhancers identified by histone mark H3K27ac measured in normal colonic crypt epithelium, CRC cell lines, or CRC tissue (Supplementary Table 8; Supplementary Fig. 6). Twenty of these 36 overlapped with super-enhancers. Notably, when compared with epigenomics data from normal colonic crypt epithelium, all 36 sets overlapped enhancers with gained or lost activity in one or more CRC specimens. Eleven of these sets overlapped enhancers recurrently gained or lost in ≥20 CRC cell lines. 
The locus at GWAS hot spot 9p21 overlaps a super-enhancer, and the credible set is entirely intronic to ANRIL, alias CDKN2B-AS1. The Genotype-Tissue Expression (GTEx) data show that the antisense lncRNA ANRIL is exclusively expressed in transverse colon and small intestine. Interestingly, ANRIL recruits SUZ12 and EZH2 to epigenetically silence tumor suppressor genes CDKN2A/B[71]. Noncoding somatic driver mutations or focal amplifications have been reported in regions regulating expression of MYC[72], TERT[73], and KLF5[31], all now implicated by GWAS for CRC. We checked whether GWAS-identified association signals co-localize with these regions and found that the KLF5 signal overlaps the somatically amplified super-enhancer flanked by KLF5 and KLF12 (Figure 2). Also, the previously reported signal in the TERT promoter region[22] overlaps with the region that is recurrently somatically mutated in multiple cancers[73].

Figure 2

Functional genomic annotation of new CRC risk locus overlapping KLF5 super-enhancer.

Top: Regional association plot showing the unconditional −log10(P-value) for the association with CRC risk in the combined meta-analysis of up to 125,478 individuals, as a function of genomic position (Build 37) for each variant in the region. The lead variants are indicated by diamond symbols and their positions are indicated by dashed vertical lines. The color-labeling and shape of all other variants indicate the lead variant with which they are in strongest LD. Bottom: UCSC genome browser annotations for the region overlapping the super-enhancer flanked by KLF5 and KLF12, and spanning variants in LD with rs78341008, and with two conditionally independent association signals indexed by rs45597035 and rs1924816. The region is annotated with the following tracks (from top to bottom): UCSC gene annotations; epigenomic profiles showing MACS2 peak calls as transparent overlays for different samples taken from non-diseased colonic crypt cells or colon tissue (purple) and from different primary CRC cell lines or tumor samples (teal); position of the lead variants and variants in LD with the lead; variants in the 99% credible set; the union of super-enhancers called using the ROSE package; gray bars highlight the targeted enhancers (e1, e3, and e4) previously shown by Zhang et al.[31] to have combinatorial effects on KLF5 expression. ATAC-seq data newly generated for this study show high-resolution annotation of putative binding regions within the active super-enhancer, further fine-mapping putative causal variants at each of the three signals.

To test whether CRC associations are non-randomly distributed across genomic features, we used GARFIELD[74]. Focusing on DNase I hypersensitive site (DHS) peaks that identify open chromatin, we observed significant enrichment across many cell types, particularly fetal tissues, with strongest enrichment observed in fetal gastrointestinal tissues, CD20+ primary cells (B cells), and embryonic stem cells (Supplementary Fig. 7; Supplementary Table 9). We used MAGENTA[75] to identify pathways or gene sets enriched for associations with CRC, assessing two gene P-value cutoffs: 95th and 75th percentiles. At the 75th percentile, we observed enrichment of multiple KEGG cancer pathways at a false discovery rate (FDR) of 0.05. This was not observed for the 95th percentile cutoff and suggests that many more loci that are shared with other cancer types remain to be identified in larger studies. Using the 75th (95th) percentile cutoff, at FDR 0.05 and 0.20, we found enrichment of 7 (5) and 53 (24) gene sets, respectively. Established pathways related to TGF-β/SMAD and BMP signaling were among the top enriched pathways. Other notable enriched pathways included Hedgehog signaling, basal cell carcinoma, melanogenesis, cell cycle, S phase, and telomere maintenance (Supplementary Table 10).

Polygenicity of colorectal cancer and contribution of rare variants

To estimate the contribution of rare variants (MAF ≤1%) to CRC heritability, we used the LD- and MAF-stratified component GREML (GREML-LDMS) method implemented in GCTA[76] (Online Methods). Assuming a lifetime risk of 4.3%, we estimated that all imputed autosomal variants explain 21.6% (95% CI=17.5–25.7%) of the variation in liability for CRC, with almost half of this contributed by rare variants (9.7%, 95% CI=6.2–13.3%; likelihood ratio test P=0.003); the estimated liability-scale heritability for variants with MAF >1% is 11.8% (95% CI=8.9–14.7%). Our overall estimate falls within the range of heritability reported by large twin studies[2]. Because heritability estimates for rare variants are sensitive to potential biases due to technical effects or population stratification[77], and because the contribution of rare variants is probably underestimated due to limitations of genotype imputation, results should be interpreted with caution. Overall, findings suggest that missing heritability is not large, but that many rare and common variants have yet to be identified.

Familial relative risk explained by GWAS-identified variants

Adjusting for winner’s curse[78], the familial relative risk (RR) to first-degree relatives (λ0) attributable to GWAS-identified variants rose from 1.072 for the 55 previously described autosomal risk variants that showed evidence for replication at P <0.05, to 1.092 after inclusion of 40 new signals, and increased further to 1.098 when we included 25 suggestive association signals reported in Supplementary Table 5 (Online Methods). Assuming a λ0 of 2.2, the 55 established signals account for 8.8% of familial RR explained (95% CI: 8.1–9.4). Established signals combined with 40 newly discovered signals account for 11.2% (95% CI: 10.5–12.0), and adding 25 suggestive signals increases this to 11.9% (95% CI: 11.1–12.7).
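The percentages above can be checked with a few lines of arithmetic. The sketch below assumes that the proportion of familial RR explained is the ratio of log familial RRs, log(λ_variants)/log(λ0), which is the usual form under a multiplicative model; the function name is illustrative and not from the study.

```python
import math

def proportion_frr_explained(lambda_variants: float, lambda_0: float = 2.2) -> float:
    """Share of the overall familial RR (on the log scale) attributable to a variant set.

    Assumes the log-ratio definition log(lambda_variants) / log(lambda_0).
    """
    return math.log(lambda_variants) / math.log(lambda_0)

# Familial RR values reported in the text for the three nested variant sets.
reported = [
    ("55 established signals", 1.072),
    ("plus 40 new signals", 1.092),
    ("plus 25 suggestive signals", 1.098),
]

for label, lam in reported:
    print(f"{label}: {100 * proportion_frr_explained(lam):.1f}%")
```

With λ0 = 2.2 this reproduces the reported 8.8%, 11.2%, and 11.9%.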

Implications for stratified screening prevention

We demonstrate how using a polygenic risk score (PRS) derived from 95 independent association signals could impact clinical guidelines for preventive screening. The difference in recommended starting age for screening for those in the highest 1% (and 10%) percentiles of risk compared with the lowest percentiles is 18 years (and 10 years) for men, and 24 years (and 12 years) for women (Figure 3; Online Methods). Supplementary Table 11 gives risk allele frequency (RAF) estimates in different populations for variants included in the PRS. As expected, RAFs vary across populations. Furthermore, differences in LD between tagging and true causal variants across populations can reduce the prediction accuracy, and hence the predictive power, of the PRS in non-European populations. Accordingly, it will be important to develop ancestry-specific PRSs that incorporate detailed fine-mapping results for each GWAS signal.

Figure 3

Recommended age to start CRC screening based on a polygenic risk score (PRS).

The PRS was constructed using the 95 known and newly discovered variants. The horizontal lines represent the recommended age for the first endoscopy for an average-risk person in the current screening guideline for CRC. The risk threshold to determine the age for the first screening was set as the average of 10-year CRC risks for a 50-year-old man (1.25%) and woman (0.68%), i.e. (1.25%+0.68%)/2 = 0.97%, who have not previously received an endoscopy. Details are given in the Online Methods.

DISCUSSION

To further define the genetic architecture of sporadic CRC, we performed low-coverage WGS and imputation into a large set of GWAS data. We discovered 40 new CRC signals and replicated 55 previously reported signals. We found the first rare variant signal for sporadic CRC, which represents the strongest protective rare allelic effect identified to date. Our analyses highlight new genes and pathways contributing to underlying CRC risk and suggest roles for Krüppel-like factors, Hedgehog signaling, Hippo-YAP signaling, and immune function. Multiple loci provide new evidence for an important role of lncRNAs in CRC tumorigenesis[79]. Functional genomic annotations support that most sporadic CRC genetic risk lies in non-coding genomic regions. We further show how newly discovered variants can lead to improved risk prediction. This study underscores the critical importance of large-scale GWAS collaboration. While discovery of the rare variant signal was only possible through increased coverage and improved imputation accuracy enabled by imputation panels, sample size was pivotal for discovery of new CRC loci. Results suggest that CRC exhibits a highly polygenic architecture, much of which remains undefined. This also suggests that continued GWAS efforts, together with increasingly comprehensive imputation panels that allow for improved low-frequency and rare genetic variant imputation, will uncover more CRC risk variants. In addition, to investigate sites that are not imputable, large-scale deep sequencing will be needed. Importantly, the prevailing European bias in CRC GWAS limits the generalizability of findings and the application of PRSs in non-European (especially African) populations[80]. Therefore, a broader representation of ancestries in CRC GWAS is necessary. Studies of somatic genomic alterations in cancer have mostly focused on the coding genome and identification of noncoding drivers has proven to be challenging[73]. 
Yet, noncoding somatic driver mutations or focal amplifications in regulatory regions impacting expression have been reported for MYC[72], TERT[73], and KLF5[31]. The observed overlap between GWAS-identified CRC risk loci and somatic driver regions strongly suggests that expanding the search of somatic driver mutations to noncoding regulatory elements will yield additional discoveries and that searches for somatic drivers can be guided by GWAS findings. Additionally, we found loci near proposed drug targets, including CHD1, implicated by the rare variant signal, and KLF5. To date, cancer drug target discovery research has almost exclusively focused on properties of cancer cells, yielding drugs that target proteins either highly expressed or expressed in a mutant form due to frequent recurrent somatic missense mutations (e.g., BRAF V600E) or gene fusion events. In stark contrast with other common complex diseases, cancer GWAS results are not being used extensively to inform drug target selection. It has been estimated that selecting targets supported by GWAS could double the success rate in clinical development[81]. Our discoveries corroborate that not using GWAS results to inform drug discovery is a missed opportunity, not only for treating cancers, but also for chemoprevention in high-risk individuals. In summary, in the largest genome-wide scan for sporadic CRC risk thus far, we identified the first rare variant signal for sporadic CRC, and almost doubled the number of known association signals. Our findings provide a substantial number of new leads that may spur downstream investigation into the biology of CRC risk, and that will impact drug development and clinical guidelines, such as personalized screening decisions.

ONLINE METHODS

Study samples.

After quality control (QC), this study included whole-genome sequencing (WGS) data for 1,439 colorectal cancer (CRC) cases and 720 controls from 5 studies, and GWAS array data for 58,131 CRC or advanced adenoma cases (3,674; 6.3% of cases) and 67,347 controls from 45 studies from GECCO, CORECT, and CCFR. The Stage 1 meta-analysis comprised existing genotyping data from 30 studies that were included in previously published CRC GWAS[13,18,22]. After QC, the Stage 1 meta-analysis included 34,869 cases and 29,051 controls. Study participants were predominantly of European ancestry (31,843 cases and 26,783 controls; 91.7% of participants). Because it was shown previously that the vast majority of known CRC risk variants are shared between Europeans and East Asians[17], we included 3,026 cases and 2,268 controls of East Asian ancestry to increase power for discovery. The Stage 2 meta-analysis comprised newly generated genotype data involving 4 genotyping projects and 22 studies. After QC, the Stage 2 meta-analysis included 23,262 cases and 38,296 controls, all of European ancestry. Studies, sample selection, and matching are described in the Supplementary Text. Supplementary Table 1 provides details on sample numbers, and demographic characteristics of study participants. All participants provided written informed consent, and each study was approved by the relevant research ethics committee or institutional review board. Four normal colon mucosa biopsies for ATAC-seq were obtained from patients with a normal colon at colonoscopy at the Institut d’Investigació Biomèdica de Bellvitge (IDIBELL), Spain. Patients signed informed consent, and the protocol was approved by the Bellvitge Hospital Ethics Committee (Colscreen protocol PR084/16).

Whole-genome sequencing.

We performed low-pass WGS of 2,192 samples from 5 studies at the University of Washington Northwest Genomics Center (Seattle, WA, USA). Cases and controls were processed and sequenced together. Libraries were prepared with ThruPLEX DNA-seq kits (Rubicon Genomics) and paired-end sequencing performed using Illumina HiSeq 2500 sequencers. Reads were mapped to human reference genome (GRCh37 assembly) using Burrows-Wheeler aligner BWA v0.6.2[82]. Fold genomic coverage averaged 5.3× (range: 3.8–8.6×). We used the GotCloud population-based multi-sample variant calling pipeline[83] for post-processing of BAM files with initial alignments, and to detect and call single nucleotide variants (SNVs) and short insertions and deletions (indels). After removing duplicated reads and recalibrating base quality scores, QC checks included sample contamination detection. Variants were jointly called across all samples. To identify high-quality sites, the GotCloud pipeline performs a two-step filtering process. First, lower quality variants are identified by applying individual variant quality statistic filters. Next, variants failing multiple filters are used as negative examples to train a support vector machine (SVM) classifier. Finally, we performed a haplotype-aware genotype refinement step via Beagle[84] and ThunderVCF[85] on the SVM-filtered VCF files. After further sample QC, we excluded samples with estimated DNA contamination >3% (16), duplicated samples (5) or related individuals (1), sex discrepancies (0), and samples with low concordance with GWAS array data (11). We checked for ancestry outliers by performing principal components analysis (PCA) after merging in data for shared, linkage disequilibrium (LD)-pruned SNVs for 1,092 individuals from the 1000 Genomes Project[86]. After QC, sequences were available for 1,439 CRC cases and 720 controls of European ancestry.

GWAS genotype data and quality control.

Details of genotyping and QC for studies included in the Stage 1 meta-analysis are described elsewhere[13,18,22]. Supplementary Table 1 provides details of genotyping platforms used. Before association analysis, we pooled individual-level genotype data of all Stage 1 studies for a subset of SNPs to enable identification of unexpected duplicates and close relatives. We calculated identity by descent (IBD) for each pair of samples using KING-robust[87] and excluded duplicates and individuals that are second-degree or more closely related. As part of Stage 2, 28,805 individuals from 19 studies were newly genotyped on a custom Illumina array based on the Infinium OncoArray-500K[26] and a panel of 15,802 successfully manufactured custom variants (described in Supplementary Text). An additional 8,725 individuals from 5 studies were genotyped on the Illumina HumanOmniExpressExome-8v1–2 array. Genotyping and calling for both projects were performed at the Center for Inherited Disease Research (CIDR) at Johns Hopkins University. Genotypic data that passed initial QC at CIDR subsequently underwent QC at the University of Washington Genetic Analysis Center (UW GAC) using standardized methods detailed in Laurie et al.[88]. The median call rate for the custom Infinium OncoArray-500K data was 99.97%, and error rate estimated from 301 sample duplicate pairs was 9.99e-7. A relatively low number of samples (246) had a missing call rate >2%, with the highest being 3.48%, and were included in analysis. For the HumanOmniExpressExome-8v1–2 data, median call rate was 99.96%, and the error rate estimated from 179 sample duplicate pairs was 2.65e-6. Thirty samples had a missing call rate >2%, with the highest being 3.79%, and were included in analysis. 
We excluded samples with discrepancies between reported and genotypic sex based on X chromosome heterozygosity and the means of sex chromosome probe intensities, unintentional duplicates, and close relatives defined as individuals that are second-degree or more closely related. After further excluding individuals of non-European ancestry as determined by PCA (see below), the custom OncoArray data included in analysis comprised 11,852 CRC cases and 11,895 controls, and the HumanOmniExpressExome-8v1–2 array data included in analysis comprised 4,439 CRC cases and 4,115 controls. Only variants passing QC were used for imputation. We excluded variants failing CIDR technical filters or UW GAC quality filters, which included missing call rate >2%, discordant calls in sample duplicates, and departures from Hardy-Weinberg equilibrium (HWE) (P <1e-4) based on European-ancestry controls. The Stage 2 analysis also included genotype data from the CORSA study (Supplementary Text). In total, 2,354 individuals were genotyped using the Affymetrix Axiom Genome-Wide Human CEU 1 Array. We called genotypes using the AxiomGT1 algorithm. All samples had missing call rate <3%. We excluded samples with discrepancies between reported and genotypic sex (20), close relatives defined as individuals that are second-degree or more closely related (94), as inferred using KING-robust[87], and individuals of non-European ancestry (6) as inferred from PCA. After QC, data included in analysis comprised 1,460 cases and 774 controls. Prior to phasing and imputation, we filtered out SNPs with missing call rate >2%, or HWE P <1e-4. Imputed genotype data were obtained from UK Biobank and QC and imputation are described elsewhere[89]. A nested case-control dataset was constructed as described in the Supplementary Text. 
We excluded individuals of non-European ancestry as inferred from PCA, and randomly dropped one individual from each pair that were more closely related than third-degree relatives as inferred using KING-robust. This resulted in excluding 137 samples. In total, 5,356 CRC (5,004) or advanced adenoma (352) cases and 21,407 matched controls were included in the replication analysis.

Principal components analysis.

After excluding close relatives, we performed PCA using PLINK1.9[90] on LD-pruned sets of autosomal SNPs obtained by removing regions with extensive long-range LD[91,92], SNPs with minor allele frequency (MAF) <5%, or HWE P <1e-4, or any missingness, and carrying out LD pruning using the PLINK option ‘--indep-pairwise 50 5 0.2’. To identify population outliers we merged in 1,092 individuals from 1000 Genomes Project Phase III and performed PCA using the intersection of variants[93].

Genotype imputation.

The 2,159 whole-genome sequences described above were used to create a phased imputation reference panel. After estimating haplotypes for all GWAS array data sets using SHAPEIT2[94], we used minimac3[95] to impute from this reference panel (19.6 million variants with minor allele count (MAC) >1) into the GWAS datasets described above. We also imputed to the Haplotype Reference Consortium (HRC) panel[25] (39.2 million variants) using the University of Michigan Imputation Server[95]. To improve imputation accuracy for Stage 1 data sets, phasing and imputation were performed after pooling studies/genotype projects that used the same, or very similar, genotyping platforms (Supplementary Table 1). For Stage 2, we performed phasing and imputation separately for each genotyping project data set and imputed to the HRC panel.

Statistical analyses.

Association testing of sequence data.

We tested variants with MAC ≥5 for CRC association using Firth’s bias-reduced logistic regression as implemented in EPACTS (genome.sph.umich.edu/wiki/EPACTS) and adjusted for sex, age, study, and 3 principal components (PCs) calculated from an LD-pruned set of genotypes. We performed rare variant aggregate tests at the gene and enhancer level using the Mixed effects Score Test (MiST)[96]. This unified test is a linear combination between unidirectional burden and bidirectional variance component tests that performs best in terms of statistical power across a range of architectures[97].

Association and meta-analysis.

Stage 1 comprised two large mega-analyses of pooled individual-level genotype data sets (Supplementary Table 12). The four Stage 2 genotyping project data sets were analyzed separately. Within each data set, variants with an imputation accuracy r2 ≥0.3 and MAC ≥50 were tested for CRC association using the imputed genotype dosage in a logistic regression model adjusted for age, sex, and study/genotyping project-specific covariates, including PCs to adjust for population structure (Supplementary Table 12). To account for residual confounding within CORSA, we tested association with each variant using a linear mixed model and kinship matrix calculated from the data, as implemented in EMMAX[98]. To enable meta-analysis, we then calculated approximate allelic log odds ratios (OR) and corresponding standard errors as described in Cook et al.[99]. Next, we combined association summary statistics across analyses via fixed-effects inverse variance-weighted meta-analysis. Because Wald tests can be notably anti-conservative for rare variant associations, we also performed likelihood ratio-based tests, followed by sample-size weighted meta-analysis, as implemented in METAL[100]. In total, 16,900,397 variants were analyzed. To examine residual population stratification, we inspected quantile-quantile plots of test statistics (Supplementary Figure 8), and calculated genomic control inflation statistics (λGC). λGC for the combined meta-analysis was 1.105, and for Stage 1 and 2 meta-analyses was 1.071 and 1.075, respectively. Because λGC increases with sample size for polygenic phenotypes, even in the absence of confounding biases[101], we investigated the effect of confounding due to residual population stratification using LD score regression[102]. Because of limitations of LD score regression, this analysis is restricted to common variants (MAF≥1%) for which λGC was 1.188 in the combined meta-analysis. 
The LD score regression intercept was 1.067, which is substantially less than λGC, indicating at most a small contribution of bias and that inflation in χ2 statistics results mostly from polygenicity. We also calculated λ1000, which is the equivalent inflation statistic for a study with 1,000 cases and 1,000 controls[103]. For the combined meta-analysis, λ1000 was 1.004 and for both Stage 1 and 2 meta-analyses this was 1.003.
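The fixed-effects inverse variance-weighted combination used for the meta-analysis can be sketched generically as follows. This is a minimal illustration of the standard estimator, not the study's pipeline; the function name and the input values are hypothetical, and the two-sided P-value is a Wald-type P-value from a normal approximation.

```python
import math

def fixed_effects_meta(betas, standard_errors):
    """Combine per-analysis log ORs by inverse variance weighting.

    Returns (pooled log OR, pooled standard error, z score, two-sided P-value).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value

# Hypothetical log ORs and standard errors for one variant in three analyses.
beta, se, z, p = fixed_effects_meta([0.08, 0.11, 0.09], [0.02, 0.03, 0.025])
print(f"log OR = {beta:.4f}, s.e. = {se:.4f}, z = {z:.2f}, P = {p:.2e}")
```

Analyses with smaller standard errors receive proportionally more weight, so the pooled standard error is never larger than the smallest input standard error.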

Significance threshold for the replication genotyping experiment.

To protect against probe design failure, we built redundancy into the custom genotyping panel by including LD proxies of independently associated variants selected for follow-up. To determine the number of independent tests, we performed LD clumping of the 9,198 analyzed variants that were selected for replication genotyping based on the Stage 1 meta-analysis, and that survived filters described above. Using an r2 threshold of 0.1 this translated to 6,438 independent tests and a Bonferroni significance threshold of 0.05/6,438=7.8×10−6.
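The threshold above is a plain Bonferroni correction and can be reproduced directly. The helper name below is illustrative.

```python
def bonferroni_threshold(alpha: float, n_independent_tests: int) -> float:
    """Per-test significance threshold that controls the family-wise error rate at alpha."""
    return alpha / n_independent_tests

# 6,438 independent tests after LD clumping at r2 = 0.1, as stated in the text.
threshold = bonferroni_threshold(0.05, 6438)
print(f"{threshold:.1e}")  # 7.8e-06
```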

Conditional and joint multiple-variant analysis.

To identify additional distinct association signals at CRC loci, we performed a series of conditional meta-analyses. At each locus attaining P <5×10−8, we included the genotype dosage for the variant showing the strongest statistical evidence for association in the region in the combined meta-analysis, as an additional covariate in the respective logistic regression models. Association summary statistics for each variant in the region were then combined across studies by a fixed-effects meta-analysis. If at least one association signal attained a significance level of P <1×10−5 in this meta-analysis, we performed a second round of conditional meta-analysis, adding the variant showing the strongest statistical evidence for association in the region in the first round of conditional meta-analysis as a covariate to the logistic regression models used in the first round. We repeated this procedure and kept adding variants to the model until no additional variants at the locus attained P <1×10−5. Finally, we performed a joint multiple-variant analysis in which we jointly estimated the effects of variants selected in each step and tested for each variant whether the P-value from the joint multiple-variant analysis (PJ) was <1×10−5. Analyses were performed on 2-Mb windows centered on the most associated variant in the unconditional analysis. If windows overlapped, we performed the analysis on the collapsed genomic region. Because of extensive LD, we used a 4-Mb window for the MHC region.

Definition of known loci.

We compiled a list of 62 previously reported genome-wide significant CRC association signals from the literature (Supplementary Table 3). Because of improved power and coverage of our study, we identified the most associated variant at each signal, and used these lead variants for further analyses, rather than the previously reported index variant.

Refinement of association signals.

To refine new association signals, we constructed credible sets that were 99% likely, based on posterior probability, to contain the causal disease-associated SNP[104]. In brief, for each distinct signal, we retained a candidate set of variants by identifying all analyzed variants with r2 ≥0.1 with the most associated variant within a 2-Mb window centered on the most associated variant. We calculated approximate Bayes’ factors (ABF)[105] for each variant as $\mathrm{ABF} = \sqrt{1-r}\,\exp(rz^2/2)$, where $r = 0.04/(\mathrm{s.e.}^2+0.04)$, $z = \beta/\mathrm{s.e.}$, and β and s.e. are the log OR estimate and its standard error from the combined meta-analysis. For loci with multiple distinct signals, results are based on conditional meta-analysis, adjusting for all other index variants in the region. We then calculated the posterior probability of being causal as ABF/T, where T is the sum of ABF values over all candidate variants. Next, variants were ranked in decreasing order by posterior probabilities and the 99% credible set was obtained by including variants with the highest posterior probabilities until the cumulative posterior probability reached ≥99%.
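The credible set construction can be sketched in a few lines. The sketch assumes Wakefield's form of the approximate Bayes factor, ABF = sqrt(1 − r) exp(r z²/2) with r = 0.04/(s.e.² + 0.04), and works on the log scale to avoid overflow for strongly associated variants. Function names, variant identifiers, and summary statistics are hypothetical.

```python
import math

PRIOR_VARIANCE = 0.04  # W in r = W / (s.e.^2 + W), the value given in the text

def log_abf(beta, se, w=PRIOR_VARIANCE):
    """Log of the approximate Bayes factor (assumed Wakefield form)."""
    r = w / (se ** 2 + w)
    z = beta / se
    return 0.5 * math.log(1.0 - r) + r * z ** 2 / 2.0

def credible_set(variants, coverage=0.99):
    """variants: iterable of (variant_id, beta, se). Returns [(variant_id, posterior)]."""
    logs = [(vid, log_abf(beta, se)) for vid, beta, se in variants]
    max_log = max(value for _, value in logs)  # subtract the maximum before exponentiating
    scaled = [(vid, math.exp(value - max_log)) for vid, value in logs]
    total = sum(weight for _, weight in scaled)
    # Posterior probability = ABF / sum of ABFs; rank from highest to lowest.
    ranked = sorted(((weight / total, vid) for vid, weight in scaled), reverse=True)
    selected, cumulative = [], 0.0
    for posterior, vid in ranked:
        selected.append((vid, posterior))
        cumulative += posterior
        if cumulative >= coverage:
            break
    return selected

# Hypothetical log ORs and standard errors for three candidate variants.
example = [("varA", 0.20, 0.02), ("varB", 0.05, 0.02), ("varC", 0.00, 0.02)]
print(credible_set(example))
```

Dividing each ABF by the sum over candidates implicitly assumes a single causal variant per signal with equal prior probability across candidates.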

Functional genomic annotation.

To nominate variants for future laboratory follow-up, we performed bioinformatic analysis at each new signal using our functional annotation database, and a custom UCSC analysis data hub. Using ANNOVAR[106], we annotated lead variants and variants in LD (r2 ≥0.4) with the lead variant, relative to features pertaining to i) gene-centric function (PolyPhen2[107]), ii) genome-wide functional prediction scores (CADD[108], DANN[109], Eigen-PC[110]), iii) disease relatedness (GWAS catalog[46]), and iv) CRC-relevant regulatory functions (enhancer, repressor, DNA accessible, and transcription factor binding site (TFBS)[111,112]; Supplementary Table 13). Supplementary Table 8 summarizes variant annotations relative to the CCDS Project[113], and reference genome GRCh37. Variants were maintained in Supplementary Table 8 if they met any of the following conditions: DANN score ≥0.9, CADD phred score ≥20, Eigen-PC phred score ≥17, PolyPhen2 “probably damaging”, “stop loss”, “stop gain”, “splicing”, or were positioned in a predicted regulatory element. We visually inspected loci overlapping with CRC-relevant functional genomic annotations. Variants positioned in enhancers with aberrant CRC activity were identified by comparing epigenomes of non-diseased colorectal tissues/colonic crypt cells to epigenomes of primary CRC cell lines (data accessible at NCBI GEO database, accession GSE77737). We prioritized target genes for loci with predicted regulatory function. Evidence suggests that topologically associating domains (TADs) can be used to map physical boundaries on gene promoter interactions with distal regulatory elements[114-116]. As such, we used GM12878 Hi-C Chromosome Conformation Capture data to identify gene promoters that were in the same TADs as risk loci using the WashU Epigenome Browser (https://epigenomegateway.wustl.edu/). 
Genes in this list were further prioritized based on biological relevancy and expression quantitative trait loci (eQTL) data from Genotype-Tissue Expression (GTEx)[117] using HaploReg v4.1[118].

ATAC-seq assay.

We generated high-resolution maps of DNA accessible regions in normal colon mucosa samples using the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq). Using the updated omni-ATAC protocol for archival samples, we performed ATAC-seq in four colon mucosa biopsies from the ICO-biobank taken from participants undergoing screening at IDIBELL, Spain. Biopsies were cryopreserved by slow freezing using a solution of 10% DMSO, 90% media, and Mr. Frosty Cryo 1°C Freezing Containers (Thermo Scientific). ATAC-seq was implemented as prescribed with two exceptions. First, instead of a Dounce homogenizer we used a tissue lyser and stainless bead system, pulverizing at 40Hz for 2 mins and pulsing at 50Hz for 10–20 seconds. Second, Illumina library quantification was performed using picogreen quantitation and TapeStation instead of KAPA qPCR. Libraries were sequenced to an average of 25M paired-end reads using Illumina HiSeq 2500. The ENCODE data processing pipeline was implemented (https://github.com/kundajelab/atac_dnase_pipelines) aligning to hg19[119]. QC results are summarized in Supplementary Table 14.

Regulatory and functional information enrichment analysis.

We used GARFIELD[74] to identify cell types, tissues, and functional genomic features relevant to CRC risk. This method tests for enrichment of association in features primarily extracted from ENCODE and Roadmap Epigenomics Project data, while accounting for sources of confounding, including LD. We applied default settings and used the author-supplied data which is suitable for analysis of GWAS results based on European-ancestry individuals.

Pathway and gene set enrichment analysis.

We used MAGENTA to test predefined gene sets (e.g., KEGG pathways) for enrichment for CRC risk associations[75]. We used combined meta-analysis results as input and applied default settings which included removing genes that fall in the MHC region from analysis. Enrichment was tested at two gene P-value cutoffs: 95th and 75th percentiles of all gene P-values in the genome.

Estimation of contribution of rare variants to heritability.

We used the LD- and MAF-stratified component GREML (GREML-LDMS) method as implemented in GCTA[76] to estimate the proportion of variation in liability for CRC explained by all imputed autosomal variants (i.e., estimate of narrow-sense heritability $h_g^2$), and the proportion contributed by rare variants (MAF ≤1%). Because of computational limitations we analyzed a subset of 11,895 cases and 14,659 controls imputed to our WGS panel. We analyzed individual-level data for 17,649,167 imputed variants with MAC >3 and HWE test P ≥10−6. Following Yang et al.[76], we did not filter on imputation quality. In brief, we stratified variants into groups based on MAF (boundaries at 0.001, 0.01, 0.1, 0.2, 0.3, 0.4) and mean LD score (boundaries at quartiles) calculated as described in Yang et al.[76]. We then calculated genetic relationship matrices (GRMs) for each of these 28 variant partitions (seven MAF bins by four LD score quartiles) and jointly estimated variance components for these partitions, adjusting for age, sex, study, genotyping batch, and three genotype PCs. From the variance component estimates and their variance-covariance matrix we estimated the contribution of rare variants (MAF ≤1%) and common variants (MAF >1%), and calculated standard errors using the delta method. We tested significance of the contribution of rare variants using a likelihood ratio test. To calculate heritability on the underlying liability scale we interpreted K as lifetime risk[120] and used an estimate of 4.3% (Surveillance, Epidemiology, and End Results Program (SEER) Cancer Statistics, 2011–2013).
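For orientation, the conversion from the observed (case-control) scale to the liability scale can be sketched as below. This assumes the commonly used transformation for ascertained case-control samples, h²_liability = h²_observed × [K(1 − K)]² / [z² P(1 − P)], where K is the population risk, P is the proportion of cases in the sample, and z is the standard normal density at the liability threshold. Whether the study applied exactly this form is an assumption, and the observed-scale input in the example is hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed: float, prevalence: float, case_fraction: float) -> float:
    """Rescale an observed-scale heritability estimate to the liability scale.

    prevalence    : K, here interpreted as lifetime risk (0.043 in the text)
    case_fraction : P, the proportion of cases in the analyzed sample
    """
    standard_normal = NormalDist()
    threshold = standard_normal.inv_cdf(1.0 - prevalence)  # liability threshold
    z = standard_normal.pdf(threshold)                     # density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))

# Sample composition from the text: 11,895 cases and 14,659 controls.
case_fraction = 11895 / (11895 + 14659)
# 0.5 is a hypothetical observed-scale estimate, used only to show the call.
print(observed_to_liability(0.5, prevalence=0.043, case_fraction=case_fraction))
```

The rescaling is linear in the observed-scale estimate, so confidence limits can be converted with the same factor.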

Familial relative risk explained by genetic variants.

We assumed a multiplicative model within and between variants and calculated the proportion of familial relative risk (RR) explained by a given set of genetic variants as $\sum_i \log \lambda_i / \log \lambda_0$, where $\lambda_0$ is the overall familial RR to first-degree relatives of cases and $\lambda_i$ is the familial RR due to variant $i$, calculated as $\lambda_i = (p_i r_i^2 + q_i)/(p_i r_i + q_i)^2$, where $p_i$ is the risk allele frequency for variant $i$, $q_i = 1 - p_i$, and $r_i$ is the estimated per-allele OR[9,121]. We adjusted the OR estimates of new association signals for winner’s curse following Zhong and Prentice[78]. We represented previously identified association signals by the variant showing the strongest statistical evidence of association in the combined meta-analysis, and assumed that winner’s curse was negligible for these signals. We assumed $\lambda_0$ to be 2.2[122]. Using the delta method, we computed the variance of the proportion of familial RR explained.
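
A minimal sketch of this calculation follows. The point estimate uses the multiplicative-model formula for the per-variant familial RR. The standard error is our own delta-method derivation, not a formula taken from the paper: it assumes independent log-OR estimates and treats allele frequencies and the overall familial RR as fixed, which gives $\partial \log \lambda_i / \partial \log r_i = 2 p_i q_i r_i (r_i - 1) / \left[(p_i r_i^2 + q_i)(p_i r_i + q_i)\right]$. Function names and the example variants are hypothetical.

```python
import math

def familial_rr(p, r):
    """Familial RR to first-degree relatives due to one variant.

    p: risk allele frequency; r: per-allele OR (multiplicative model).
    """
    q = 1.0 - p
    return (p * r * r + q) / (p * r + q) ** 2

def dlog_lambda_dbeta(p, r):
    """Derivative of log(lambda_i) with respect to beta = log(OR)."""
    q = 1.0 - p
    return 2 * p * q * r * (r - 1) / ((p * r * r + q) * (p * r + q))

def proportion_explained(variants, lambda0=2.2):
    """Return (proportion of familial RR explained, delta-method SE).

    variants: iterable of (risk allele frequency, per-allele OR, SE of log OR).
    """
    log_l0 = math.log(lambda0)
    prop = sum(math.log(familial_rr(p, r)) for p, r, _ in variants) / log_l0
    var = sum((dlog_lambda_dbeta(p, r) * se) ** 2 for p, r, se in variants)
    return prop, math.sqrt(var) / log_l0

# Hypothetical variants, for illustration only.
example = [(0.30, 1.10, 0.01), (0.50, 1.05, 0.01), (0.10, 1.20, 0.02)]
prop, se = proportion_explained(example)
```

Because the per-variant contributions are summed on the log scale, adding a variant with an OR of 1 leaves the proportion unchanged.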

Absolute risk of CRC incidence and starting age of first screening.

We constructed a polygenic risk score (PRS) as a weighted sum of expected risk allele frequency for common genetic variants, using the log of the per-allele OR for each variant as the weight. OR estimates for newly discovered variants were adjusted for winner’s curse to avoid potential inflation[78]. Assuming all genetic variants are independent, let $S$ denote a PRS constructed based on $K$ variants: $S = \sum_{i=1}^{K} \log(\mathrm{OR}_i)\, G_i$, where $\mathrm{OR}_i$ and $G_i$ are the estimated OR and the number of risk alleles for variant $i$. We assumed $S$ follows a normal distribution $N(\mu, \sigma^2)$, where the estimates of the mean and variance are computed as follows: $\mu = \sum_{i=1}^{K} 2 p_i \log(\mathrm{OR}_i)$ and $\sigma^2 = \sum_{i=1}^{K} 2 p_i (1 - p_i) \log(\mathrm{OR}_i)^2$, where $p_i$ is the risk allele frequency for variant $i$. The baseline hazard at each age $t$, $\lambda_0(t)$, was then computed from the incidence rates for non-Hispanic whites who have not previously had an endoscopy, derived from population incidence rates during 1992–2005 from the SEER Registry. Using these baseline hazard rates, we estimated the 10-year absolute risk of developing CRC given age and a PRS as previously described[123]. We set the risk threshold to the average of the 10-year CRC risk for a 50-year-old man (1.25%) and woman (0.68%) who have not previously received an endoscopy[124], i.e., (1.25% + 0.68%)/2 = 0.97%, and estimated the recommended starting age of first screening given the PRS. Variants and OR estimates used in these analyses are given in Supplementary Table 15.
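
The sketch below illustrates how a PRS could translate into a recommended starting age under simplifying assumptions that are ours, not the paper's. It assumes Hardy-Weinberg genotype frequencies for the PRS mean and variance, a hazard proportional to $\exp(S)$ so that the population-average relative hazard is $\exp(\mu + \sigma^2/2)$, and no competing mortality. The incidence curve `toy_incidence` is a hypothetical placeholder, not SEER data, and the threshold is taken as the toy 10-year risk of an average-risk 50-year-old rather than the 0.97% used in the text.

```python
import math

def prs_moments(variants):
    """Mean and variance of the PRS; variants are (risk allele freq, per-allele OR)."""
    mu = sum(2 * p * math.log(o) for p, o in variants)
    var = sum(2 * p * (1 - p) * math.log(o) ** 2 for p, o in variants)
    return mu, var

def toy_incidence(age):
    # HYPOTHETICAL placeholder curve, NOT SEER incidence rates.
    return 2e-5 * math.exp(0.09 * (age - 30))

def ten_year_risk(age, s, mu, var, incidence=toy_incidence):
    """10-year absolute risk at a given age and PRS value s (no competing risks)."""
    rel = math.exp(s - mu - var / 2)      # hazard relative to population average
    cumulative = sum(incidence(a) * rel for a in range(age, age + 10))
    return 1 - math.exp(-cumulative)

def starting_age(s, mu, var, threshold, ages=range(30, 86)):
    """First age at which the 10-year risk reaches the threshold, else None."""
    for age in ages:
        if ten_year_risk(age, s, mu, var) >= threshold:
            return age
    return None

# Hypothetical variants, for illustration only.
variants = [(0.30, 1.10), (0.50, 1.05), (0.10, 1.20)]
mu, var = prs_moments(variants)
s_avg = mu + var / 2                      # PRS with relative hazard 1
threshold = ten_year_risk(50, s_avg, mu, var)
age_high = starting_age(s_avg + math.log(1.5), mu, var, threshold)
age_low = starting_age(s_avg - math.log(1.5), mu, var, threshold)
```

By construction, a person at the population-average hazard reaches the threshold at age 50, a higher PRS moves the starting age earlier, and a lower PRS moves it later.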

109 in total

1. A genome-wide association study shows that common alleles of SMAD7 influence colorectal cancer risk.

Authors: Peter Broderick; Luis Carvajal-Carmona; Alan M Pittman; Emily Webb; Kimberley Howarth; Andrew Rowan; Steven Lubbe; Sarah Spain; Kate Sullivan; Sarah Fielding; Emma Jaeger; Jayaram Vijayakrishnan; Zoe Kemp; Maggie Gorman; Ian Chandler; Elli Papaemmanuil; Steven Penegar; Wendy Wood; Gabrielle Sellick; Mobshra Qureshi; Ana Teixeira; Enric Domingo; Ella Barclay; Lynn Martin; Oliver Sieber; David Kerr; Richard Gray; Julian Peto; Jean-Baptiste Cazier; Ian Tomlinson; Richard S Houlston
Journal: Nat Genet Date: 2007-10-14 Impact factor: 38.330

2. A genome-wide association study identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3.

Authors: Ian P M Tomlinson; Emily Webb; Luis Carvajal-Carmona; Peter Broderick; Kimberley Howarth; Alan M Pittman; Sarah Spain; Steven Lubbe; Axel Walther; Kate Sullivan; Emma Jaeger; Sarah Fielding; Andrew Rowan; Jayaram Vijayakrishnan; Enric Domingo; Ian Chandler; Zoe Kemp; Mobshra Qureshi; Susan M Farrington; Albert Tenesa; James G D Prendergast; Rebecca A Barnetson; Steven Penegar; Ella Barclay; Wendy Wood; Lynn Martin; Maggie Gorman; Huw Thomas; Julian Peto; D Timothy Bishop; Richard Gray; Eamonn R Maher; Anneke Lucassen; David Kerr; D Gareth R Evans; Clemens Schafmayer; Stephan Buch; Henry Völzke; Jochen Hampe; Stefan Schreiber; Ulrich John; Thibaud Koessler; Paul Pharoah; Tom van Wezel; Hans Morreau; Juul T Wijnen; John L Hopper; Melissa C Southey; Graham G Giles; Gianluca Severi; Sergi Castellví-Bel; Clara Ruiz-Ponte; Angel Carracedo; Antoni Castells; Asta Försti; Kari Hemminki; Pavel Vodicka; Alessio Naccarati; Lara Lipton; Judy W C Ho; K K Cheng; Pak C Sham; J Luk; Jose A G Agúndez; Jose M Ladero; Miguel de la Hoya; Trinidad Caldés; Iina Niittymäki; Sari Tuupanen; Auli Karhu; Lauri Aaltonen; Jean-Baptiste Cazier; Harry Campbell; Malcolm G Dunlop; Richard S Houlston
Journal: Nat Genet Date: 2008-03-30 Impact factor: 38.330

3. Genome-wide association scan identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk loci at 8q24 and 18q21.

Authors: Albert Tenesa; Susan M Farrington; James G D Prendergast; Mary E Porteous; Marion Walker; Naila Haq; Rebecca A Barnetson; Evropi Theodoratou; Roseanne Cetnarskyj; Nicola Cartwright; Colin Semple; Andrew J Clark; Fiona J L Reid; Lorna A Smith; Kostas Kavoussanakis; Thibaud Koessler; Paul D P Pharoah; Stephan Buch; Clemens Schafmayer; Jürgen Tepel; Stefan Schreiber; Henry Völzke; Carsten O Schmidt; Jochen Hampe; Jenny Chang-Claude; Michael Hoffmeister; Hermann Brenner; Stefan Wilkening; Federico Canzian; Gabriel Capella; Victor Moreno; Ian J Deary; John M Starr; Ian P M Tomlinson; Zoe Kemp; Kimberley Howarth; Luis Carvajal-Carmona; Emily Webb; Peter Broderick; Jayaram Vijayakrishnan; Richard S Houlston; Gad Rennert; Dennis Ballinger; Laura Rozek; Stephen B Gruber; Koichi Matsuda; Tomohide Kidokoro; Yusuke Nakamura; Brent W Zanke; Celia M T Greenwood; Jagadish Rangrej; Rafal Kustra; Alexandre Montpetit; Thomas J Hudson; Steven Gallinger; Harry Campbell; Malcolm G Dunlop
Journal: Nat Genet Date: 2008-03-30 Impact factor: 38.330

4. Environmental and heritable factors in the causation of cancer--analyses of cohorts of twins from Sweden, Denmark, and Finland.

Authors: P Lichtenstein; N V Holm; P K Verkasalo; A Iliadou; J Kaprio; M Koskenvuo; E Pukkala; A Skytthe; K Hemminki
Journal: N Engl J Med Date: 2000-07-13 Impact factor: 91.245

5. Environmental and heritable causes of cancer among 9.6 million individuals in the Swedish Family-Cancer Database.

Authors: Kamila Czene; Paul Lichtenstein; Kari Hemminki
Journal: Int J Cancer Date: 2002-05-10 Impact factor: 7.396

Review 6. Genome-wide association studies of cancer: current insights and future perspectives.

Authors: Amit Sud; Ben Kinnersley; Richard S Houlston
Journal: Nat Rev Cancer Date: 2017-10-13 Impact factor: 60.716

7. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012.

Authors: Jacques Ferlay; Isabelle Soerjomataram; Rajesh Dikshit; Sultan Eser; Colin Mathers; Marise Rebelo; Donald Maxwell Parkin; David Forman; Freddie Bray
Journal: Int J Cancer Date: 2014-10-09 Impact factor: 7.396

8. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21.

Authors: Ian Tomlinson; Emily Webb; Luis Carvajal-Carmona; Peter Broderick; Zoe Kemp; Sarah Spain; Steven Penegar; Ian Chandler; Maggie Gorman; Wendy Wood; Ella Barclay; Steven Lubbe; Lynn Martin; Gabrielle Sellick; Emma Jaeger; Richard Hubner; Ruth Wild; Andrew Rowan; Sarah Fielding; Kimberley Howarth; Andrew Silver; Wendy Atkin; Kenneth Muir; Richard Logan; David Kerr; Elaine Johnstone; Oliver Sieber; Richard Gray; Huw Thomas; Julian Peto; Jean-Baptiste Cazier; Richard Houlston
Journal: Nat Genet Date: 2007-07-08 Impact factor: 38.330

9. Multiple common susceptibility variants near BMP pathway loci GREM1, BMP4, and BMP2 explain part of the missing heritability of colorectal cancer.

Authors: Ian P M Tomlinson; Luis G Carvajal-Carmona; Sara E Dobbins; Albert Tenesa; Angela M Jones; Kimberley Howarth; Claire Palles; Peter Broderick; Emma E M Jaeger; Susan Farrington; Annabelle Lewis; James G D Prendergast; Alan M Pittman; Evropi Theodoratou; Bianca Olver; Marion Walker; Steven Penegar; Ella Barclay; Nicola Whiffin; Lynn Martin; Stephane Ballereau; Amy Lloyd; Maggie Gorman; Steven Lubbe; Bryan Howie; Jonathan Marchini; Clara Ruiz-Ponte; Ceres Fernandez-Rozadilla; Antoni Castells; Angel Carracedo; Sergi Castellvi-Bel; David Duggan; David Conti; Jean-Baptiste Cazier; Harry Campbell; Oliver Sieber; Lara Lipton; Peter Gibbs; Nicholas G Martin; Grant W Montgomery; Joanne Young; Paul N Baird; Steven Gallinger; Polly Newcomb; John Hopper; Mark A Jenkins; Lauri A Aaltonen; David J Kerr; Jeremy Cheadle; Paul Pharoah; Graham Casey; Richard S Houlston; Malcolm G Dunlop
Journal: PLoS Genet Date: 2011-06-02 Impact factor: 5.917

10. Meta-analysis of three genome-wide association studies identifies susceptibility loci for colorectal cancer at 1q41, 3q26.2, 12q13.13 and 20q13.33.

Authors: Richard S Houlston; Jeremy Cheadle; Sara E Dobbins; Albert Tenesa; Angela M Jones; Kimberley Howarth; Sarah L Spain; Peter Broderick; Enric Domingo; Susan Farrington; James G D Prendergast; Alan M Pittman; Evi Theodoratou; Christopher G Smith; Bianca Olver; Axel Walther; Rebecca A Barnetson; Michael Churchman; Emma E M Jaeger; Steven Penegar; Ella Barclay; Lynn Martin; Maggie Gorman; Rachel Mager; Elaine Johnstone; Rachel Midgley; Iina Niittymäki; Sari Tuupanen; James Colley; Shelley Idziaszczyk; Huw J W Thomas; Anneke M Lucassen; D Gareth R Evans; Eamonn R Maher; Timothy Maughan; Antigone Dimas; Emmanouil Dermitzakis; Jean-Baptiste Cazier; Lauri A Aaltonen; Paul Pharoah; David J Kerr; Luis G Carvajal-Carmona; Harry Campbell; Malcolm G Dunlop; Ian P M Tomlinson
Journal: Nat Genet Date: 2010-10-24 Impact factor: 38.330

146 in total

1. External Validation of Risk Prediction Models Incorporating Common Genetic Variants for Incident Colorectal Cancer Using UK Biobank.

Authors: Catherine L Saunders; Britt Kilian; Deborah J Thompson; Luke J McGeoch; Simon J Griffin; Antonis C Antoniou; Jon D Emery; Fiona M Walter; Joe Dennis; Xin Yang; Juliet A Usher-Smith
Journal: Cancer Prev Res (Phila) Date: 2020-02-18

2. Risk Prediction Models for Colorectal Cancer Incorporating Common Genetic Variants: A Systematic Review.

Authors: Luke McGeoch; Catherine L Saunders; Simon J Griffin; Jon D Emery; Fiona M Walter; Deborah J Thompson; Antonis C Antoniou; Juliet A Usher-Smith
Journal: Cancer Epidemiol Biomarkers Prev Date: 2019-07-10 Impact factor: 4.254

3. Systematic Functional Interrogation of Genes in GWAS Loci Identified ATF1 as a Key Driver in Colorectal Cancer Modulated by a Promoter-Enhancer Interaction.

Authors: Jianbo Tian; Jiang Chang; Jing Gong; Jiao Lou; Mingpeng Fu; Jiaoyuan Li; Juntao Ke; Ying Zhu; Yajie Gong; Yang Yang; Danyi Zou; Xiating Peng; Nan Yang; Shufang Mei; Xiaoyang Wang; Rong Zhong; Kailin Cai; Xiaoping Miao
Journal: Am J Hum Genet Date: 2019-06-13 Impact factor: 11.025

4. Cost-Effectiveness of Personalized Screening for Colorectal Cancer Based on Polygenic Risk and Family History.

Authors: Dayna R Cenin; Steffie K Naber; Anne C de Weerdt; Mark A Jenkins; David B Preen; Hooi C Ee; Peter C O'Leary; Iris Lansdorp-Vogelaar
Journal: Cancer Epidemiol Biomarkers Prev Date: 2019-11-20 Impact factor: 4.254

5. The promise of single-cell mechanophenotyping for clinical applications.

Authors: Molly Kozminsky; Lydia L Sohn
Journal: Biomicrofluidics Date: 2020-06-09 Impact factor: 2.800

6. Identifying Putative Susceptibility Genes and Evaluating Their Associations with Somatic Mutations in Human Cancers.

Authors: Zhishan Chen; Wanqing Wen; Alicia Beeghly-Fadiel; Xiao-Ou Shu; Virginia Díez-Obrero; Jirong Long; Jiandong Bao; Jing Wang; Qi Liu; Qiuyin Cai; Victor Moreno; Wei Zheng; Xingyi Guo
Journal: Am J Hum Genet Date: 2019-08-08 Impact factor: 11.025

Review 7. Epidemiology and Mechanisms of the Increasing Incidence of Colon and Rectal Cancers in Young Adults.

Authors: Elena M Stoffel; Caitlin C Murphy
Journal: Gastroenterology Date: 2019-08-05 Impact factor: 22.682

8. Estimation of Absolute Risk of Colorectal Cancer Based on Healthy Lifestyle, Genetic Risk, and Colonoscopy Status in a Population-Based Study.

Authors: Prudence R Carr; Korbinian Weigl; Dominic Edelmann; Lina Jansen; Jenny Chang-Claude; Hermann Brenner; Michael Hoffmeister
Journal: Gastroenterology Date: 2020-03-14 Impact factor: 22.682

Review 9. The role of genomics in global cancer prevention.

Authors: Ophira Ginsburg; Paul Brennan; Patricia Ashton-Prolla; Anna Cantor; Daniela Mariosa
Journal: Nat Rev Clin Oncol Date: 2020-09-24 Impact factor: 66.675

Review 10. Genomics and metagenomics of colorectal cancer.

Authors: Charmaine Ng; Haojun Li; William K K Wu; Sunny H Wong; Jun Yu
Journal: J Gastrointest Oncol Date: 2019-12