
Genome puzzle master (GPM): an integrated pipeline for building and editing pseudomolecules from fragmented sequences.

Jianwei Zhang¹, Dave Kudrna², Ting Mu³, Weiming Li³, Dario Copetti⁴, Yeisoo Yu², Jose Luis Goicoechea², Yang Lei³, Rod A Wing⁴.

Abstract

MOTIVATION: Next generation sequencing technologies have revolutionized our ability to rapidly and affordably generate vast quantities of sequence data. Once generated, raw sequences are assembled into contigs or scaffolds. However, these assemblies are mostly fragmented and inaccurate at the whole genome scale, largely due to the inability to integrate additional informative datasets (e.g. physical, optical and genetic maps). To address this problem, we developed a semi-automated software tool, Genome Puzzle Master (GPM), that enables the integration of additional genomic signposts to edit and build 'new-gen-assemblies' that result in high-quality 'annotation-ready' pseudomolecules.
RESULTS: With GPM, loaded datasets can be connected to each other via their logical relationships which accomplishes tasks to 'group,' 'merge,' 'order and orient' sequences in a draft assembly. Manual editing can also be performed with a user-friendly graphical interface. Final pseudomolecules reflect a user's total data package and are available for long-term project management. GPM is a web-based pipeline and an important part of a Laboratory Information Management System (LIMS) which can be easily deployed on local servers for any genome research laboratory.
AVAILABILITY AND IMPLEMENTATION: The GPM (with LIMS) package is available at https://github.com/Jianwei-Zhang/LIMS
CONTACTS: jzhang@mail.hzau.edu.cn or rwing@mail.arizona.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27318200 PMCID: PMC5048067 DOI: 10.1093/bioinformatics/btw370

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Illumina and PacBio SMRT sequencing technologies are the two most widely accepted sequencing platforms currently used for large scale genomics-driven data generation. Illumina, representing the most widely used second-generation sequencing technology, produces short reads (35–150 base read lengths) that are highly accurate, with base call outputs that can yield hundreds of millions of bases from a single lane over several days (depending upon complexity) (Schatz ). PacBio produces hundreds of thousands of long-read error-corrected sequences (up to 20 kb average read lengths) that can be produced in 3–6 h per SMRT cell. Data generated by either platform can be used independently, or in combination, to successfully assemble genomes de novo (Alkan ; Chin ; Kajitani ). Many assemblers have been developed to assemble raw sequence reads into sequence contigs (i.e. minimum sequence units for an assembly), such as SOAPdenovo (Luo ), Allpaths (Butler ; MacCallum ), HGAP (Chin ), or Falcon (https://github.com/PacificBiosciences/FALCON-integrate). A few programs or packages (e.g. Bambus, ABACAS, Mauve Aligner, ALLMAPS, etc.) are also available for scaffolding contigs (Assefa ; Hunt ; Pop ; Rissman ; Tang ). However, software tools that can be used to inspect/edit NGS sequence assemblies, as well as integrate other evidence types (e.g. physical and genetic maps) to produce an assembly that more accurately and completely reflects the native structure of a given genome, are currently lacking. To analyze, manage and incorporate genome datasets for diverse sequencing projects, such as pseudomolecule construction for the maize and several wild Oryza genome sequencing projects (Schnable ; J. Stein et al., submitted for publication; Wei ), we developed a software tool called ‘Genome Puzzle Master’ (GPM). GPM does not require sophisticated bioinformatics skills or support, and the final products are ready to use in the form of annotation-ready pseudomolecules. 
GPM can also facilitate the incorporation of additional datasets as new refinements are generated. Here we make GPM available to researchers who have NGS genome assemblies and other unlinked genomic datasets, and are struggling to generate ‘annotation-ready’ or ‘submission-ready’ pseudomolecules.

2 Methods

GPM is a key part of a web-based Laboratory Information Management System (LIMS) that we developed to manage and analyze genomic data at different levels from both wet and dry lab experiments. The LIMS is set up in a LAMP environment (Linux operating system, Apache HTTP Server, MySQL database software and Perl programming language) and requires additional libraries and software listed in Supplementary Table S1. We used jQuery (plus UI), a fast and concise JavaScript library, to build GPM’s highly interactive web applications. A database schema was designed which contains one main table called ‘matrix’ to record most types of data from ‘wet’ or ‘dry’ lab experiments (Supplementary Table S2), and several extra tables (e.g. ‘link’, ‘alignment’) to store relationship data. Combined, these tables allow us to handle all types of datasets and connect them logically. The schema is extensible: depending on the data type, we can either add a new type of ‘container’ in the ‘matrix’ table or create another table for new data if needed. For example, huge amounts of individual clone information can potentially cause process servers to slow down while performing complicated search operations on the ‘matrix’ table; hence, we specifically created a ‘clone’ table to store this type of data and can query it separately to avoid concurrent queries on the ‘matrix’ table. To include additional information for certain existing data types, we extended the ability to store more informative data by adopting the JavaScript Object Notation (JSON), a lightweight data-interchange format, in the ‘note’ field of the ‘matrix’ table. GPM can currently utilize data categorized into different types, such as sequence contigs/scaffolds, Bacterial Artificial Chromosome (BAC) clone end-sequences (BESs), and reference genome sequences (RefSeq) that are the basic input elements for GPM assemblies. 
Different data types can be connected to one another via their sequence relationships. For example, AGP (a golden path, https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/) information can describe the assembly of a larger sequence object (e.g. a contig, a scaffold, or a chromosome) from smaller objects and list their relationships. Most importantly, these kinds of relationships are used to link multiple datasets and to logically integrate genomic data with accuracy. To demonstrate a GPM assembly result, here we specifically defined two terms, ‘assemblyCtg’ for a GPM assembly contig and ‘assemblySeq’ for a sequence component that belongs to an assemblyCtg. Both data types are stored in the ‘matrix’ table as individual containers. In the database, a record of assemblyCtg describes its constituent members (assemblySeqs) and attributes (e.g. length, chromosome number and position); and an assemblySeq records the original sequence and the actual part and status of the component sequence (e.g. coordinates and orientation) that contributes to its corresponding assemblyCtg. A complete GPM assembly contains a set of assemblyCtgs, which are formed by various numbers of assemblySeqs.
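The relationship between an assemblyCtg and its assemblySeqs can be illustrated with a minimal Python sketch. This is not GPM's actual schema (GPM is implemented in Perl on MySQL); the class and field names below are hypothetical and only mirror the attributes named in the text (original sequence, coordinates, orientation, redundancy status, chromosome number and position), with extra attributes serialized as JSON in the way the ‘note’ field is described.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblySeq:
    """One component sequence of an assemblyCtg (hypothetical field names)."""
    source_id: str            # identifier of the original sequence, e.g. a BAC sequence
    start: int                # 1-based start of the part that contributes to the contig
    end: int                  # 1-based inclusive end of that part
    orientation: str = "+"    # "+" or "-"
    redundant: bool = False   # redundant members are hidden in the final assembly

    def retained_length(self) -> int:
        return self.end - self.start + 1


@dataclass
class AssemblyCtg:
    """A GPM-style assembly contig made of ordered assemblySeqs."""
    name: str
    chromosome: int = 0       # 0 means "not yet assigned" in this sketch
    position: int = 0
    members: List[AssemblySeq] = field(default_factory=list)

    def length(self) -> int:
        # Only non-redundant members contribute to the contig length.
        return sum(m.retained_length() for m in self.members if not m.redundant)

    def to_note(self) -> str:
        # Extra attributes packed as JSON, analogous to a 'note' field.
        return json.dumps(
            {
                "chromosome": self.chromosome,
                "position": self.position,
                "members": [m.source_id for m in self.members if not m.redundant],
            },
            sort_keys=True,
        )


if __name__ == "__main__":
    ctg = AssemblyCtg(
        "Ctg1",
        chromosome=1,
        position=1,
        members=[
            AssemblySeq("BAC_A", 1, 100000),
            AssemblySeq("BAC_B", 1001, 90000, "-"),
            AssemblySeq("BAC_C", 1, 50000, redundant=True),
        ],
    )
    print(ctg.length())
    print(ctg.to_note())
```

Keeping the member list inside the contig record, rather than in a separate join structure, is a simplification made for this sketch; the text describes relationship data as living in separate tables such as ‘link’ and ‘alignment’.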

3 Results

3.1 Information-guided assembly

To build pseudomolecules from sequence contigs with reference information, a GPM ‘assemblyRun’ execution (Fig. 1) can be divided into 11 optional operations:

Fig. 1.

GPM assemblyRun operations

1. Assembly initialization. First, to initialize a new assembly, GPM converts all de novo assembled sequences into assemblySeqs and each assemblySeq is assigned to a single assemblyCtg. (Note: This operation can be skipped for an existing assembly to avoid losing any manual edits.)
2. Seq-to-Seq alignment. Depending on the potential relationship of all sequence elements, GPM can prepare overlapping alignments among all sequences in an assembly by running a ‘Seq-to-Seq’ pre-calculation with ‘blastn’ (NCBI BLAST 2.2.29+, E-value = 1e-200) (Camacho ) as the default alignment engine. Pre-calculation parameters can be customized as needed. Overlap information between sequence contigs can be used in later processing steps, e.g. to determine the orientation of two neighboring sequences.
3. Physical reference (PR)-guided assembly. Based on an existing physical reference, such as a physical map (PM) or an AGP file, sequences will be assigned to assemblyCtgs. If an AGP file is available, object-component information is used to build links between sequences. If a FingerPrinted Contig (FPC) file (Nelson ) is available, we use PM contig-clone information to connect potentially neighboring sequences and merge two assemblySeqs from neighboring BACs on PMs into an assemblyCtg. AssemblyCtgs will only be merged when overlapping evidence is detected. For a non-PR-based whole genome shotgun (WGS) project, the original sequences would be loaded into GPM, and this operation would be skipped. To build assemblyCtgs, the default parameters for merging two sequences are ‘minOverlapSeqToSeq = 1000 bp’ and ‘identitySeqToSeq = 99%,’ plus the overlap should be at both ends of each sequence.
4. Seq-to-Genome alignment. This operation allows the user to run the alignment engine (‘blastn’ as default, E-value = 1e-200) to search against reference genome sequences by using the original sequences of all assemblySeqs. The alignment information can be used to ‘group,’ ‘order,’ and ‘orient’ assemblyCtgs at a chromosomal scale. Depending on the similarity between the RefSeq and the to-be-assembled sequence, alignment parameters can be varied and tested by the user. GPM currently supports blastn, megablast (Camacho ) and BLAT (Kent ) as alignment engines for sequence comparison. Users can choose a proper alignment engine to accelerate GPM’s performance. For example, megablast or BLAT can be chosen for aligning sequences with high similarities since they run faster than blastn (in most situations). Importantly, to minimize the misleading impacts of repetitive sequences in both to-be-assembled and reference datasets during this step, users can (i) activate the ‘Mark Repeat Region’ option to filter out any non-unique alignments, which will not be considered as evidence to guide a GPM assembly; or (ii) select the alternative ‘Soft Masking’ option for the blastn/megablast alignment processing if repetitive regions in reference sequences have already been soft-masked for an assembly.
5. Chromosome number assignment. If a high-quality reference genome is available, we prefer to use it as a guide to automatically assign chromosome numbers to assemblyCtgs. Alternatively, chromosome assignments can also be performed manually in a batch mode.
6. Redundant assemblySeq removal. GPM will mask and remove a buried or redundant sequence according to the alignment information processed in operation 2. Here a redundant sequence is defined as a sequence that can be fully aligned by two neighboring sequences and the two neighboring sequences have end-overlapped alignment (default parameters ‘minOverlapSeqToSeq = 1000 bp’ and ‘identitySeqToSeq = 99%’). Such assemblySeqs would be marked as redundant and are hidden in the final assembly. Masking redundancies can reduce the computational complexity of orienting assemblySeqs and assemblyCtgs.
7. Non-redundant assemblySeq orientation. In a multi-assemblySeq contig, GPM can orient sequences based on overlap data. This operation is performed on the premise that an overlapping region of two neighboring sequences should be at the end of each sequence, specifically at the right end of a pre-assemblySeq, and at the left end of the next-assemblySeq. GPM adjusts the orientation of these assemblySeqs to fit the premise accordingly with the proper extension of both non-overlapping parts that will elongate an expanding contig. GPM determines the orientation of each initial seed assemblySeq and then the subsequent assemblySeq one-by-one.
8. AssemblyCtg orientation. Using a reference genome, GPM can orient assemblyCtgs according to the linear information suggested by the ‘Seq-to-Genome’ alignment data.
9. End-to-End assemblyCtg merge. In some cases, physically overlapping sequences can be separated into two different assemblyCtgs due to the lack of strong evidence derived from the guide data, for example, when there is weak evidence to merge two contigs during a PM construction step. These kinds of situations usually arise from BACs located at the ends of PM contigs, and as a consequence, assemblySeqs of these BACs are also found at assemblyCtg ends. Detection of end-to-end overlaps between neighboring assemblyCtgs facilitates merging into larger assemblyCtgs.
10. Redundancies in overlapping regions removed. Overlapping sequences between two neighboring assemblySeqs are considered redundant, and one redundant portion of either assemblySeq is removed from the final pseudomolecule. Here GPM makes no preferences on which overlapping sequence is retained. However, if evidence is provided that one overlapping sequence is of a higher quality than the other, then the highest quality sequence will be retained.
11. AssemblyCtgs ordered and/or renumbered. GPM can sort assemblyCtgs based on their chromosomal number and position and provide an option to renumber assemblyCtgs in their proper order.
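The default merge criteria quoted in the operations above (‘minOverlapSeqToSeq = 1000 bp’, ‘identitySeqToSeq = 99%’, with the overlap located at sequence ends) can be sketched as a simple predicate over one pairwise alignment. This is an illustration rather than GPM's code: the `Hit` record, the function names and the `slack` tolerance are hypothetical, and the reading of "at both ends" as "the alignment reaches a terminus of each of the two sequences" is an assumption.

```python
from dataclasses import dataclass

MIN_OVERLAP_SEQ_TO_SEQ = 1000   # bp, default quoted in the text
IDENTITY_SEQ_TO_SEQ = 99.0      # percent, default quoted in the text


@dataclass
class Hit:
    """One alignment between sequences A and B (1-based, inclusive coordinates).

    A minus-strand alignment is represented with b_start > b_end,
    as in tabular BLAST output.
    """
    a_start: int
    a_end: int
    a_len: int
    b_start: int
    b_end: int
    b_len: int
    identity: float   # percent identity of the alignment


def touches_end(start: int, end: int, length: int, slack: int = 0) -> bool:
    """True if the aligned interval reaches either terminus of the sequence."""
    lo, hi = min(start, end), max(start, end)
    return lo <= 1 + slack or hi >= length - slack


def is_end_overlap(hit: Hit, slack: int = 0) -> bool:
    """Apply the length, identity and end-location criteria to one alignment."""
    span = abs(hit.a_end - hit.a_start) + 1
    return (
        span >= MIN_OVERLAP_SEQ_TO_SEQ
        and hit.identity >= IDENTITY_SEQ_TO_SEQ
        and touches_end(hit.a_start, hit.a_end, hit.a_len, slack)
        and touches_end(hit.b_start, hit.b_end, hit.b_len, slack)
    )


if __name__ == "__main__":
    # Right end of A (100 kb) overlaps the left end of B (80 kb) over 1 kb.
    good = Hit(99001, 100000, 100000, 1, 1000, 80000, 99.5)
    # Same alignment, but it falls in the middle of B.
    internal = Hit(99001, 100000, 100000, 40001, 41000, 80000, 99.5)
    print(is_end_overlap(good), is_end_overlap(internal))
```

In practice alignments rarely extend to the exact last base, so a non-zero `slack` would likely be needed; the value to use is not stated in the source.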
After running an automated GPM assembly, the user is also able to manually check and edit the assembly (Fig. 2). With the visualization function, GPM can provide a convenient way to manually check and edit an assembly, thereby eliminating the need to rely solely on automated assembly results. Relationships among datasets (including BAC sequences, BESs, physical maps, reference sequences, etc.) can logically guide assembly operations, including but not limited to grouping, merging, ordering and orienting. All manual editing steps are saved automatically, so the entire editing process can be reproduced and quality checked. Upon completion of an editing step, GPM can be used to export contig sequences, chromosome-based pseudomolecules and AGP files in real time.
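As an illustration of the AGP export mentioned above, the following sketch writes AGP version 2 lines for one pseudomolecule built from ordered, oriented contigs. It is not GPM's exporter: the function name and input layout are hypothetical, and placing a 100 bp gap of unknown size (component type `U`, gap type `contig`, no linkage) between neighboring contigs is an assumption chosen because it is a valid combination under the AGP specification.

```python
from typing import List, Tuple


def build_agp(obj: str, contigs: List[Tuple[str, int, str]], gap_len: int = 100) -> List[str]:
    """Return AGP lines for one object (e.g. a chromosome).

    contigs: ordered (component_id, length, orientation) tuples.
    A gap line of unknown size is inserted between consecutive contigs.
    """
    lines = []
    pos = 1    # next free 1-based coordinate on the object
    part = 1   # AGP part_number, counted over components and gaps
    for i, (cid, length, orient) in enumerate(contigs):
        if i > 0:
            gap_end = pos + gap_len - 1
            # Gap line: object, beg, end, part, type, gap_length, gap_type, linkage, evidence
            fields = [obj, pos, gap_end, part, "U", gap_len, "contig", "no", "na"]
            lines.append("\t".join(map(str, fields)))
            pos = gap_end + 1
            part += 1
        end = pos + length - 1
        # Component line: object, beg, end, part, type, id, comp_beg, comp_end, orientation
        fields = [obj, pos, end, part, "W", cid, 1, length, orient]
        lines.append("\t".join(map(str, fields)))
        pos = end + 1
        part += 1
    return lines


if __name__ == "__main__":
    for line in build_agp("Chr01", [("Ctg1", 5000, "+"), ("Ctg2", 3000, "-")]):
        print(line)
```

Each contig is emitted over its full length here; an exporter that trims overlapping portions, as described for operation 10, would instead write the retained component coordinates in columns 7 and 8.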

Fig. 2.

Visualization of typical available data in GPM. (A) GPM assemblyCtg view of a 500-KB region. AssemblySeqs, top and bottom, are shown as overlapping (yellow) and fully redundant assemblySeqs are gray. The retained (green) and removed (gray) portions of assemblySeqs are indicated. (B) Chromosome-scale view of a 500-KB region that compares two genome assemblies to a Reference sequence. The Reference Sequence is shown in the middle (bright green) with alignments (yellow) to each assemblyCtg (violet) at the top and bottom. The assemblyCtg order can be changed by drag-and-drop (Color version of this figure is available at Bioinformatics online.)

3.2 Application of the GPM pipeline to assemble two high-quality reference genome sequences for indica rice: a case study

Recently, our consortium published two high-quality reference genome sequences for the two main varietal groups of indica rice, Zhenshan 97 (ZS97) and Minghui 63 (MH63) (Zhang ). These genomes were primarily sequenced using PacBio long-read sequencing of minimum tiling path BAC pools, combined with Illumina WGS assembled contigs to fill gaps. Once individual BAC sequences were assembled they were loaded into GPM for assembly editing and pseudomolecule construction. Following is a summary of how these data were used to assemble two of the highest quality indica rice genome assemblies produced to date: To assemble the ZS97 and MH63 genomes, the following datasets were loaded into GPM: (i) Whole Genome Profiling (WGP, van Oeveren ) sequence-based PMs for each genome; and (ii) 5363 assembled BAC sequences (including duplicates for the same BAC clones sequenced in multiple jobs or pools) from 188 HGAP jobs for ZS97, and 6801 from 313 jobs for MH63. Here we take ZS97 as an example to demonstrate the assembly procedure once these data were loaded (Supplementary Fig. S1). To start a new assembly, we set the ‘FPC: ZS97 v.1’ PM as the physical reference and the ‘IRGSP-MSU’ (i.e. O. sativa subsp. japonica cv. Nipponbare genome sequence, Kawahara ) as the reference genome. We checked ‘Assign chromosome number for contigs’ and ‘Orient contigs based-on reference genome’ since both ZS97 and Nipponbare belong to the same genus and species (i.e. O. sativa), and minor differences between the to-be-assembled and reference genomes would not mislead the results because they are so closely related. ‘Seq-to-Seq Alignment’ was used to pre-build all possible overlapping relationships among all BAC sequences since it was expected that those should be connected. ‘Seq-to-Genome Alignment’ was also used to map all BAC sequences to the reference genome. We also utilized the ‘End-to-End Merge,’ ‘Auto-Orient Sequences’ and ‘Filter Redundant Sequences and Overlaps’ options. 
(Note: A similar process was used to assemble the MH63 genome, except we used the ‘FPC: MH63 v.1’ PM as the physical reference.) After manual checking, editing and removing redundancies, the final assembly products yielded 318 (ZS97, composed of 3862 assemblySeqs) and 216 (MH63, composed of 3256 assemblySeqs) assemblyCtgs that were ordered, oriented, and assigned to their appropriate chromosomes. The ‘assemblyCtg’ sequences were used as the primary frameworks to build pseudomolecules after gap-filling sequences were integrated during the second round of GPM assembly (Zhang ).

4 Discussion

GPM is an integrated pipeline for generating and editing pseudomolecules from existing next-generation sequence assemblies using evidence-based guides such as reference sequences, physical maps, genetic maps and paired BESs. We demonstrated how GPM can be used to generate high-quality submission-ready pseudomolecules for two indica rice accessions, ZS97 and MH63, by the integration of BAC-based physical maps and reference sequence guides. To perform an assembly with GPM, not all the resources like those used for ZS97 and MH63 are required, but GPM does require at least one guide type (e.g. genetic, physical or genome reference). Since GPM does not reassemble sequence data from the original short/long reads, the kind and quality of the evidence-based guides will greatly affect the amount of time required for editing and will impact the final quality of an assembly. Using a reference genome in a GPM assembly is quite valuable; however, the utilization of more evidence guides can greatly enhance the ability to build high-quality pseudomolecules and to avoid the overuse of the reference sequences in conforming a GPM assembly to mimic a RefSeq. If a reference genome is the only evidence to guide an assembly, then reference sequence overuse mistakes cannot be avoided. However, other available information can reveal overuse errors, which can then be corrected during a manual check step (which is not labor-intensive and strongly recommended) with GPM. Hence, evidence data are key factors used to build high-quality GPM assemblies. In the current version of our LIMS, we can track each processing step from BAC library construction to the final genome assembly seamlessly during the entire phase of a sequencing project. BAC library resources and wet lab sequencing runs are recorded in our system. As a part of the LIMS, GPM supports FASTA-formatted sequences. However, the ability to deal with raw reads has been limited due to their huge data size. 
Some additional features, such as paired end/mate and MTP information, are partially adapted for assembly manipulation in the current version of GPM. Pseudomolecules in FASTA format with an AGP file can be generated as the final output for further downstream analysis and public repository sequence submissions (e.g. GenBank). The LIMS can be expanded to integrate new functions for a specific project as needed. For example, we built a pipeline called ‘postHGAP’ to perform circularization and identification of BAC sequences for the two rice genome sequencing projects described above (Fig. 3). As reported in our data descriptor (Zhang ), we used a map-based BAC-pool sequencing strategy to produce sequence data that were assembled into a set of sequences (termed unitigs by HGAP) for each pool. We were able to process the output of each corresponding HGAP job to circularize and identify BAC sequences properly using related data information in the LIMS as follows: (i) postHGAP first filtered unitigs with lengths shorter than 10 kb, then (ii) trimmed vectors and circularized plasmid or BAC sequences according to pairs of sequences at specific breakpoints and (iii) assigned BAC IDs to each sequence according to WGP tags or BES information. During the postHGAP processing step, two parameters (default: minOverlap = 500 bp, overlapIdentity = 95%) could be set for sequence circularization and four optional parameters (default: minCloneTagNumber = 5, tagMatchIdentity = 100%, tagMatchPercent = 80%, besMatchIdentity = 98% if no WGP tags available) for BAC ID assignment. The program ‘blastn’ (NCBI BLAST 2.2.29+) was used to perform sequence comparison and alignment analysis in postHGAP. As a result, a total of 501 HGAP jobs (multiple runs for the same pool count multiple times) for 375 pools were run through postHGAP in this study and produced 12 164 BAC-ID-ready sequences (including duplicates for the same BAC processed in multiple jobs or pools) for both ZS97 and MH63. 
All BAC-ID-ready sequences were seamlessly converted as input datasets for both GPM assemblies.
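The first two postHGAP steps described above (length filtering and circularization) can be sketched in Python using the quoted defaults (10 kb minimum length, minOverlap = 500 bp, overlapIdentity = 95%). This is not the postHGAP implementation: the source states that blastn was used for sequence comparison, whereas the sketch swaps in a naive ungapped prefix/suffix comparison so that it stays self-contained. All function names are hypothetical, vector trimming and BAC ID assignment are omitted, and the quadratic scan is only suitable for demonstration.

```python
MIN_UNITIG_LEN = 10_000     # unitigs shorter than 10 kb are filtered out
MIN_OVERLAP = 500           # bp, default minOverlap quoted in the text
OVERLAP_IDENTITY = 95.0     # percent, default overlapIdentity quoted in the text


def keep_unitig(seq: str) -> bool:
    """Step (i): keep only unitigs of at least 10 kb."""
    return len(seq) >= MIN_UNITIG_LEN


def terminal_overlap(seq: str, min_overlap: int = MIN_OVERLAP,
                     min_identity: float = OVERLAP_IDENTITY) -> int:
    """Length of the longest ungapped overlap between the start and end of seq.

    Candidate lengths are tried from half the sequence downwards; the first
    one whose prefix and suffix match at >= min_identity percent is returned.
    Returns 0 when no qualifying overlap exists.
    """
    n = len(seq)
    for k in range(n // 2, min_overlap - 1, -1):
        head, tail = seq[:k], seq[n - k:]
        matches = sum(1 for x, y in zip(head, tail) if x == y)
        if 100.0 * matches / k >= min_identity:
            return k
    return 0


def circularize(seq: str, min_overlap: int = MIN_OVERLAP,
                min_identity: float = OVERLAP_IDENTITY) -> str:
    """Step (ii), simplified: drop the duplicated end of a circular sequence."""
    k = terminal_overlap(seq, min_overlap, min_identity)
    return seq[: len(seq) - k] if k else seq


if __name__ == "__main__":
    # Toy example with tiny thresholds: the 8-base unit is repeated at both ends.
    unit, body = "ACGTTGCA", "GGATCCAATT"
    toy = unit + body + unit
    print(terminal_overlap(toy, 8, 100.0))
    print(circularize(toy, 8, 100.0))
```

A gapped aligner such as blastn tolerates insertions and deletions in the terminal repeat, which this ungapped comparison does not, so real long-read unitigs would need the alignment-based approach the text describes.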

Fig. 3.

Flowchart for processing unitigs with postHGAP

Technically, the use of a one-main-table database schema design may simplify query processing. However, one possible disadvantage for this schema is that the response time to query might be longer with increasing amounts of data. Based upon our LIMS structure, we used the new ‘container’ option for less complicated data and created new tables for large datasets (e.g. clones). To date, we are able to operate GPM smoothly on a 4× 12-core cluster with more than 1.7 million data records. These kinds of strategies are recommended for other users. We can provide free host services to academic projects of medium-size genomes (up to 1 Gb) with full technical support. Several WGS scaffolding packages (e.g. Bambus, Mauve Aligner, ABACAS, ALLMAPS, etc.) treat assembled contig sequences as non-overlapping and single-copy sequences. Unfortunately, these packages are not focused toward the handling of inherent sequence redundancies and are not programmed to produce incremental assemblies. Further, these scaffolding tools are all file-based and do not provide a flexible process for manual checking and editing, especially if the input data has been modified or updated. In contrast, GPM is a relationship-based pipeline, which has the flexibility to edit and visualize assembled data, not only by showing the order and orientation of contig sequences and sequence redundancies but also by displaying the necessary guide information for easy user confirmation. One thing GPM cannot do is to assemble a genome de novo from raw reads. This deficit can be compensated for by its ability to import results (sequences and AGPs) from other assembly programs to guide new and improved assemblies. The final quality of a GPM assembly relies heavily on the quality and richness of the guide information used. For example, if a draft genome assembly is used as the guide rather than a map-based BAC-by-BAC assembly, the quality of the GPM assembly will suffer. 
DNA sequencing technologies and assembly programs change rapidly, and the GPM pipeline presented here is no exception. We will continue to improve the interactive functionalities of GPM, as well as integrate additional tools to support more data types. With GPM, data types are seamlessly linked and logically integrated into an encompassing LIMS for all genomic data. GPM is an open source software dynamically developed for the genomics research community and can be extensively adapted/improved by different research groups for their own applications.

21 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Assembly of large genomes using second-generation sequencing.

Authors: Michael C Schatz; Arthur L Delcher; Steven L Salzberg
Journal: Genome Res Date: 2010-05-27 Impact factor: 9.043

3. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.

Authors: Chen-Shan Chin; David H Alexander; Patrick Marks; Aaron A Klammer; James Drake; Cheryl Heiner; Alicia Clum; Alex Copeland; John Huddleston; Evan E Eichler; Stephen W Turner; Jonas Korlach
Journal: Nat Methods Date: 2013-05-05 Impact factor: 28.547

4. Sequence-based physical mapping of complex genomes by whole genome profiling.

Authors: Jan van Oeveren; Marjo de Ruiter; Taco Jesse; Hein van der Poel; Jifeng Tang; Feyruz Yalcin; Antoine Janssen; Hanne Volpin; Keith E Stormo; Robert Bogden; Michiel J T van Eijk; Marcel Prins
Journal: Genome Res Date: 2011-02-01 Impact factor: 9.043

5. ALLPATHS: de novo assembly of whole-genome shotgun microreads.

Authors: Jonathan Butler; Iain MacCallum; Michael Kleber; Ilya A Shlyakhter; Matthew K Belmonte; Eric S Lander; Chad Nusbaum; David B Jaffe
Journal: Genome Res Date: 2008-03-13 Impact factor: 9.043

6. Limitations of next-generation genome sequence assembly.

Authors: Can Alkan; Saba Sajjadian; Evan E Eichler
Journal: Nat Methods Date: 2010-11-21 Impact factor: 28.547

7. ALLMAPS: robust scaffold ordering based on multiple maps.

Authors: Haibao Tang; Xingtan Zhang; Chenyong Miao; Jisen Zhang; Ray Ming; James C Schnable; Patrick S Schnable; Eric Lyons; Jianguo Lu
Journal: Genome Biol Date: 2015-01-13 Impact factor: 13.583

8. The physical and genetic framework of the maize B73 genome.

Authors: Fusheng Wei; Jianwei Zhang; Shiguo Zhou; Ruifeng He; Mary Schaeffer; Kristi Collura; David Kudrna; Ben P Faga; Marina Wissotski; Wolfgang Golser; Susan M Rock; Tina A Graves; Robert S Fulton; Ed Coe; Patrick S Schnable; David C Schwartz; Doreen Ware; Sandra W Clifton; Richard K Wilson; Rod A Wing
Journal: PLoS Genet Date: 2009-11-20 Impact factor: 5.917

9. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

10. A comprehensive evaluation of assembly scaffolding tools.

Authors: Martin Hunt; Chris Newbold; Matthew Berriman; Thomas D Otto
Journal: Genome Biol Date: 2014-03-03 Impact factor: 13.583

9 in total

1. Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63.

Authors: Jianwei Zhang; Ling-Ling Chen; Feng Xing; David A Kudrna; Wen Yao; Dario Copetti; Ting Mu; Weiming Li; Jia-Ming Song; Weibo Xie; Seunghee Lee; Jayson Talag; Lin Shao; Yue An; Chun-Liu Zhang; Yidan Ouyang; Shuai Sun; Wen-Biao Jiao; Fang Lv; Bogu Du; Meizhong Luo; Carlos Ernesto Maldonado; Jose Luis Goicoechea; Lizhong Xiong; Changyin Wu; Yongzhong Xing; Dao-Xiu Zhou; Sibin Yu; Yu Zhao; Gongwei Wang; Yeisoo Yu; Yijie Luo; Zhi-Wei Zhou; Beatriz Elena Padilla Hurtado; Ann Danowitz; Rod A Wing; Qifa Zhang
Journal: Proc Natl Acad Sci U S A Date: 2016-08-17 Impact factor: 11.205

2. A draft genome assembly of the Chinese sillago (Sillago sinica), the first reference genome for Sillaginidae fishes.

Authors: Shengyong Xu; Shijun Xiao; Shilin Zhu; Xiaofei Zeng; Jing Luo; Jiaqi Liu; Tianxiang Gao; Nansheng Chen
Journal: Gigascience Date: 2018-09-01 Impact factor: 6.524

3. Sequencing of Australian wild rice genomes reveals ancestral relationships with domesticated rice.

Authors: Marta Brozynska; Dario Copetti; Agnelo Furtado; Rod A Wing; Darren Crayn; Glen Fox; Ryuji Ishikawa; Robert J Henry
Journal: Plant Biotechnol J Date: 2017-01-23 Impact factor: 9.803

4. Genome assembly of the JD17 soybean provides a new reference genome for comparative genomics.

Authors: Xinxin Yi; Jing Liu; Shengcai Chen; Hao Wu; Min Liu; Qing Xu; Lingshan Lei; Seunghee Lee; Bao Zhang; Dave Kudrna; Wei Fan; Rod A Wing; Xuelu Wang; Mengchen Zhang; Jianwei Zhang; Chunyan Yang; Nansheng Chen
Journal: G3 (Bethesda) Date: 2022-04-04 Impact factor: 3.154

5. Eight soybean reference genome resources from varying latitudes and agronomic traits.

Authors: Jeffrey Shih-Chieh Chu; Bo Peng; Kuanqiang Tang; Xingxing Yi; Huangkai Zhou; Huan Wang; Guang Li; Jiantian Leng; Nansheng Chen; Xianzhong Feng
Journal: Sci Data Date: 2021-07-01 Impact factor: 6.444

6. Building two indica rice reference genomes with PacBio long-read and Illumina paired-end sequencing data.

Authors: Jianwei Zhang; Ling-Ling Chen; Shuai Sun; Dave Kudrna; Dario Copetti; Weiming Li; Ting Mu; Wen-Biao Jiao; Feng Xing; Seunghee Lee; Jayson Talag; Jia-Ming Song; Bogu Du; Weibo Xie; Meizhong Luo; Carlos Ernesto Maldonado; Jose Luis Goicoechea; Lizhong Xiong; Changyin Wu; Yongzhong Xing; Dao-Xiu Zhou; Sibin Yu; Yu Zhao; Gongwei Wang; Yeisoo Yu; Yijie Luo; Beatriz Elena Padilla Hurtado; Ann Danowitz; Rod A Wing; Qifa Zhang
Journal: Sci Data Date: 2016-09-13 Impact factor: 6.444

7. Chromosomal-level assembly of yellow catfish genome using third-generation DNA sequencing and Hi-C analysis.

Authors: Gaorui Gong; Cheng Dan; Shijun Xiao; Wenjie Guo; Peipei Huang; Yang Xiong; Junjie Wu; Yan He; Jicheng Zhang; Xiaohui Li; Nansheng Chen; Jian-Fang Gui; Jie Mei
Journal: Gigascience Date: 2018-11-01 Impact factor: 6.524

8. Genomic variation in 3,010 diverse accessions of Asian cultivated rice.

Authors: Wensheng Wang; Ramil Mauleon; Zhiqiang Hu; Dmytro Chebotarov; Shuaishuai Tai; Zhichao Wu; Min Li; Tianqing Zheng; Roven Rommel Fuentes; Fan Zhang; Locedie Mansueto; Dario Copetti; Millicent Sanciangco; Kevin Christian Palis; Jianlong Xu; Chen Sun; Binying Fu; Hongliang Zhang; Yongming Gao; Xiuqin Zhao; Fei Shen; Xiao Cui; Hong Yu; Zichao Li; Miaolin Chen; Jeffrey Detras; Yongli Zhou; Xinyuan Zhang; Yue Zhao; Dave Kudrna; Chunchao Wang; Rui Li; Ben Jia; Jinyuan Lu; Xianchang He; Zhaotong Dong; Jiabao Xu; Yanhong Li; Miao Wang; Jianxin Shi; Jing Li; Dabing Zhang; Seunghee Lee; Wushu Hu; Alexander Poliakov; Inna Dubchak; Victor Jun Ulat; Frances Nikki Borja; John Robert Mendoza; Jauhar Ali; Jing Li; Qiang Gao; Yongchao Niu; Zhen Yue; Ma Elizabeth B Naredo; Jayson Talag; Xueqiang Wang; Jinjie Li; Xiaodong Fang; Ye Yin; Jean-Christophe Glaszmann; Jianwei Zhang; Jiayang Li; Ruaraidh Sackville Hamilton; Rod A Wing; Jue Ruan; Gengyun Zhang; Chaochun Wei; Nickolai Alexandrov; Kenneth L McNally; Zhikang Li; Hei Leung
Journal: Nature Date: 2018-04-25 Impact factor: 49.962

9. Whole-genome de novo assemblies reveal extensive structural variations and dynamic organelle-to-nucleus DNA transfers in African and Asian rice.

Authors: Xin Ma; Jinjian Fan; Yongzhen Wu; Shuangshuang Zhao; Xu Zheng; Chuanqing Sun; Lubin Tan
Journal: Plant J Date: 2020-08-27 Impact factor: 6.417
