
K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes.

Bruno Contreras-Moreira, Carla V Filippi, Guy Naamati, Carlos García Girón, James E Allen, Paul Flicek.

Abstract

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
© 2021 The Authors. The Plant Genome published by Wiley Periodicals LLC on behalf of Crop Science Society of America.
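
The first step of the protocol described in the abstract, calling repeats by k-mer counting, can be illustrated with a minimal Python sketch. This is not the algorithm used by Red, whose internals are not described here; it only shows the general idea that positions covered by frequently occurring k-mers are soft-masked (lowercased). The function names, the value of `k`, and the `min_count` threshold are assumptions chosen for illustration, and the second step (annotating the masked intervals against a curated library by sequence similarity) is not shown.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask(seq, k, min_count):
    """Lowercase every base covered by a k-mer seen at least min_count times.

    Illustrative threshold rule only; real tools such as Red use more
    elaborate models to decide which regions are repetitive.
    """
    upper = seq.upper()
    counts = count_kmers(upper, k)
    masked = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(upper, masked))


def masked_intervals(masked_seq):
    """Return half-open (start, end) intervals of lowercase (masked) bases."""
    intervals = []
    start = None
    for i, c in enumerate(masked_seq):
        if c.islower():
            if start is None:
                start = i
        elif start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(masked_seq)))
    return intervals


if __name__ == "__main__":
    toy = "ACGACGACGTTGCA"  # made-up sequence, not real genome data
    masked = soft_mask(toy, k=3, min_count=3)
    print(masked)
    print(masked_intervals(masked))
```

In a two-step workflow of the kind the abstract describes, intervals like these would then be extracted and compared against a repeat library for classification; that comparison is typically done with a dedicated alignment tool rather than in pure Python.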

Year: 2021; PMID: 34562304; DOI: 10.1002/tpg2.20143

Source DB: PubMed; Journal: Plant Genome; ISSN: 1940-3372; Impact factor: 4.089


