
pISTil: a pipeline for yeast two-hybrid Interaction Sequence Tags identification and analysis.

Johann Pellet¹, Laurène Meyniel, Pierre-Olivier Vidalain, Benoît de Chassey, Lionel Tafforeau, Vincent Lotteau, Chantal Rabourdin-Combe, Vincent Navratil.

Abstract

BACKGROUND: High-throughput screening of protein-protein interactions opens new systems biology perspectives for the comprehensive understanding of cell physiology in normal and pathological conditions. In this context, the yeast two-hybrid system appears as a promising approach to efficiently reconstruct protein interaction networks at the proteome-wide scale. This protein interaction screening method generates a large amount of raw sequence data, i.e. the ISTs (Interaction Sequence Tags), which urgently need appropriate tools for their systematic and standardised analysis.
FINDINGS: We developed pISTil, a bioinformatics pipeline combined with a user-friendly web interface, (i) to establish a standardised system to analyse and annotate ISTs generated by two-hybrid technologies with high performance and flexibility and (ii) to provide high-quality protein-protein interaction datasets for systems-level approaches. This pipeline has been validated on a large dataset comprising more than 11,000 ISTs. As a case study, a detailed analysis of ISTs obtained from yeast two-hybrid screens of Hepatitis C Virus proteins against human cDNA libraries is also provided.
CONCLUSION: We have developed pISTil, an open source pipeline made of a collection of several applications governed by a Perl script. The pISTil pipeline is intended for laboratories with IT expertise in system administration, scripting and database management that wish to automatically process large amounts of IST data for accurate reconstruction of protein interaction networks in a systems biology perspective. pISTil is publicly available for download at http://sourceforge.net/projects/pistil.


Year: 2009 PMID: 19874608 PMCID: PMC2776022 DOI: 10.1186/1756-0500-2-220

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Findings

Systems biology focuses, in part, on the exhaustive and accurate reconstruction of the molecular interaction networks that support cellular machinery, i.e. interactomes, under physiological or pathological conditions. Molecular interaction data related to human and model organisms are currently being integrated in generalist databases, such as INTACT [1], MINT [2] or STRING [3]. Other databases are more specialised, for instance VirHostNet, a knowledgebase devoted to virus-host interactions that allows analysis and visualisation of infection at the systems level [4]. One of the main sources of protein-protein interactions deposited in these public databases is yeast two-hybrid (Y2H) technology. Indeed, Y2H allows high-throughput screening of direct physical protein-protein interactions at a proteome scale, but requires the sequencing of hundreds to thousands of cellular preys per experiment. These prey sequences extracted from yeast positive colonies are referred to as ISTs, i.e. Interaction Sequence Tags [5]. Dedicated tools were developed to deal with high-throughput sequencing of ESTs (Expressed Sequence Tags) in transcriptome-based studies of cell lines, tissues or whole organ libraries in different physiological contexts; they mainly rely on Phred functionalities [6]. However, these tools are not fully adapted to IST analysis. For instance, information related to cDNA library vectors has to be used to unambiguously define the IST reading "frame" and to eliminate cDNA inserts that have been cloned in an abnormal reading frame or that correspond to untranslated mRNA regions (UTRs). The reconstruction of a high-quality protein interactome depends critically on this unambiguous annotation of ISTs. In this paper, we present pISTil, a fully automated pipeline combined with a web-based interface, both specifically devoted to IST identification and analysis.
The pISTil system is highly flexible and allows: (i) systematic and fast assignment of ISTs to a unique protein accession number; (ii) annotation of "in frame" ISTs (i.e., ISTs whose prey cDNA inserts are in frame with the Gal4 transactivation domain) or "not in frame" ISTs (i.e., ISTs whose prey cDNA inserts are not in frame with the Gal4 transactivation domain); (iii) sequence quality filtering, manual checking and visualisation of annotated ISTs through a user-friendly web interface and (iv) export of protein-protein interactions in multiple formats, such as the MIMIx standard format [7]. The pISTil annotation procedure has been tested and validated with more than 11,000 ISTs generated by Y2H screening of human cDNA libraries. This comprehensive analysis led us to define optimal thresholds that reduced the noise-to-signal ratio associated with ISTs. As a case study, the pISTil pipeline and its web interface are illustrated through the analysis of Y2H screens that have been successfully used to reconstruct a relevant HCV-human protein infection network [8].

Implementation

The pISTil pipeline is implemented using a collection of open source programs and bioinformatics tools such as Perl, Bioperl, Staden, PHP, Java, NCBI-Blast Toolkit and the PostgreSQL database system (Figure 1). Information on installing and running pISTil is given in the documentation distributed with pISTil [see Additional file 1].

Figure 1

pISTil pipeline workflow. pISTil is organised around three major components. The pISTil software analyses chromatogram files (traces) that are organised by project. In this figure, three different projects are shown with two users. The pISTil web application provides a web interface for the visualisation of project results. The pISTil database integrates all IST analysis meta-information.

First, IST chromatogram files, in ABI (Applied Biosystems INC) or SCF (Standard Chromatogram Format) format, are filtered by the Phred-pregap4 software [9,10] in order to extract nucleic sequences and their associated quality values. The resulting nucleic sequence of each IST is then translated into three frames and aligned against a protein sequence database (defined by the user in the configuration file) using the BLASTX alignment software [11,12]. Only the alignment information for the best hit is subsequently retained. In addition, identification of the Gal4 transactivation domain (Gal4-AD) on ISTs allows reliable delineation of "in frame" and "not in frame" ISTs [5]; the latter may lead to false positive protein-protein interaction annotations. Even though translational frame-shift is possible in yeast, "not in frame" ISTs may be more prone to errors related to the irrelevant nature of the associated proteins. All information generated by the IST pipeline is stored in the pISTil database, such as the sequence quality of ISTs, the identity percentage of ISTs, the E-value, the alignment position, the reading frame and the protein sequence database source (Ensembl, RefSeq, etc.). Other meta-data supplied by users, such as origin (host organism, tissue origin, cell type), the accession number/name of the bait protein used for the Y2H screen (GenBank accession number) and a description of the construction of the cDNA libraries used for the Y2H screens, are also integrated into the pISTil database.
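The "in frame" annotation described above can be illustrated with a minimal Python sketch. It is not the pISTil implementation (which is driven by a Perl script); it only assumes that a vector sequence ending on the last complete Gal4-AD codon can be located in the read, and that the 1-based query start of the best forward-strand BLASTX hit is known. The function name `classify_frame`, the `junction` parameter and the sequences in the usage note are hypothetical.

```python
def classify_frame(read: str, junction: str, hit_query_start: int) -> str:
    """Classify one IST read as 'in frame', 'not in frame' or 'unknown'.

    read            -- nucleotide sequence extracted from the chromatogram
    junction        -- vector sequence assumed to end on the last complete
                       Gal4-AD codon (hypothetical; depends on the vector)
    hit_query_start -- 1-based start of the best BLASTX hit on the read
                       (forward strand only in this sketch)
    """
    pos = read.upper().find(junction.upper())
    if pos < 0:
        return "unknown"  # Gal4-AD junction not found in the read

    # 0-based index of the first base after the Gal4-AD coding sequence
    insert_start = pos + len(junction)

    # Distance, in bases, between the end of Gal4-AD and the start of the hit
    offset = (hit_query_start - 1) - insert_start
    if offset < 0:
        return "unknown"  # the alignment starts inside the vector sequence

    # In frame when the hit starts a whole number of codons after Gal4-AD
    return "in frame" if offset % 3 == 0 else "not in frame"
```

For example, with the placeholder read `"ACGT" + "GAATTC" + "ATGGCCAAA"` and junction `"GAATTC"`, a hit starting at position 11 is classified as "in frame" and a hit starting at position 12 as "not in frame".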

pISTil pipeline validation

An experimental dataset of 11,658 ISTs obtained from more than 300 Y2H screens was used to validate the pISTil pipeline (unpublished; data not shown). We statistically assessed the stringency of our filter parameters to define optimal thresholds that maximise the true positive rate associated with virus-host protein-protein interactions. One major drawback of high-throughput sequencing of ISTs is the generation of poor-quality sequences [5]. Indeed, PCR-based procedures used to extract prey cDNA directly from yeast colonies are not optimal in terms of yield and specificity, and often generate poor-quality templates for sequencing reactions [13]. Because high-quality sequences retrieved from Y2H screens are often short (i.e. <300 bp), filtering ISTs on sequence length alone appears inadequate in this context. In Figure 2, the correlation between the length of ISTs filtered with a Phred score > 13 (probability of an incorrect base call < 5/100) and the percentage of identity between ISTs and annotated proteins (RefSeq) shows that ISTs with long stretches of high-quality nucleotides are correctly discriminated at the 80% identity threshold. Thus, filtering on identity alone may be appropriate to recover the short, good-quality sequences that would be discarded by applying an additional Phred quality cut-off.
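The relation between the Phred cut-off and the base-calling error quoted above follows from the standard Phred definition, Q = -10 log10(P). The short Python sketch below computes the error probability for a given score and, as one possible reading of "length of ISTs filtered with a Phred score > 13", the longest run of consecutive bases above the cut-off. How pISTil and pregap4 actually derive the filtered length is not specified here, so the run-based measure and both function names are assumptions.

```python
def phred_error_probability(q: float) -> float:
    """Probability of an incorrect base call for Phred score q (P = 10^(-q/10))."""
    return 10 ** (-q / 10)


def longest_high_quality_stretch(qualities, threshold=13):
    """Length of the longest run of consecutive bases with score > threshold."""
    best = current = 0
    for q in qualities:
        # Extend the current run, or reset it on a low-quality base
        current = current + 1 if q > threshold else 0
        best = max(best, current)
    return best
```

A score of 13 corresponds to an error probability of about 0.05, which matches the "5/100" figure in the text, while a score of 20 corresponds to 0.01.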

Figure 2

pISTil pipeline validation. Correlation between the percentage of identity of whole ISTs with RefSeq protein sequences and the length of these ISTs filtered with a Phred score above 13 (n = 11,658 ISTs). The red-yellow colour gradient reflects the density of points (red and yellow correspond to high- and low-density regions, respectively).

In this study, ISTs were thus considered highly significant if they met the following criteria: (i) an identity greater than or equal to 80% with an e-value less than or equal to 1e-10 and (ii) a protein product defined as "in frame".
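The significance criteria stated above can be written as a small filter. The Python sketch below is illustrative only: the `IstHit` record and its field names are hypothetical, and the e-value cut-off is taken as 1e-10, the value quoted for the first filters in the HCV case study.

```python
from dataclasses import dataclass


@dataclass
class IstHit:
    """Best BLASTX hit retained for one IST (hypothetical record layout)."""
    ist_id: str
    identity: float  # percent identity with the matched protein
    evalue: float    # BLASTX e-value of the best hit
    in_frame: bool   # True if the insert is in frame with Gal4-AD


def is_highly_significant(hit: IstHit,
                          min_identity: float = 80.0,
                          max_evalue: float = 1e-10) -> bool:
    """Apply the identity, e-value and "in frame" criteria together."""
    return (hit.identity >= min_identity
            and hit.evalue <= max_evalue
            and hit.in_frame)
```

Keeping the thresholds as parameters mirrors the idea that the cut-offs were tuned on the validation dataset and could be adjusted for other libraries.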

HCV-human protein-protein interactions analysis, a case study

pISTil was previously applied to analyse ISTs generated from Y2H screens [8]. In that study, 27 constructs encoding full-length HCV mature proteins or discrete domains were used as baits to screen human cDNA prey libraries. As a case study, 1,158 chromatogram files related to two of these screens were processed using the "ist_analyse.pl" program. The analysis showed that 50% of the sequences passed the first filters (identity ≥ 80%, e-value ≤ 1e-10). As described above, this low retention rate is commonly observed when ISTs are extracted directly from yeast by PCR. The success rate increased dramatically when DNA templates extracted from yeast were rescued by bacterial transformation, but this latter procedure was much more time-consuming (data not shown). Although 77% of the ISTs passing the first filter are "in frame", the remaining 23% are "not in frame" and might still be considered "true positives" because translation mechanisms in yeast allow stop-codon read-through and frame-shift correction for a significant fraction of the preys [14-16]. Altogether, our stringent IST identification pipeline unambiguously characterised roughly 40% of the sequences (443/1158) and defined, for 10 viral proteins used as baits, 132 distinct protein interactions and 117 unique host protein partners. Based on these criteria, these protein-protein interactions were confirmed by an alternative method, GST pull-down, with a success rate of 80% [8], underlining the efficiency of the pISTil pipeline. Additional thresholds might be used to reduce the false positive rate of IST assignment, for instance the number of independent IST observations for each non-redundant interaction. Indeed, previous studies have shown that protein-protein interactions defined by more than three ISTs exhibit a high rate of confirmation with alternative interaction detection methods such as co-immunoprecipitation or pull-down [5].
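The additional threshold mentioned above, the number of independent ISTs supporting each non-redundant interaction, amounts to counting retained ISTs per bait-prey pair. The following Python sketch shows one way to do this; the function names and the bait and prey labels in the usage note are placeholders, and "more than three ISTs" is interpreted as at least four.

```python
from collections import Counter


def count_ist_support(ist_pairs):
    """Count independent ISTs per non-redundant (bait, prey) interaction.

    ist_pairs -- iterable of (bait, prey) tuples, one per retained IST
    """
    return Counter(ist_pairs)


def well_supported(ist_pairs, min_ists=4):
    """Keep interactions observed in at least min_ists independent ISTs."""
    return {pair: n
            for pair, n in count_ist_support(ist_pairs).items()
            if n >= min_ists}
```

For instance, if the placeholder pair `("baitA", "prey1")` is observed in four ISTs and `("baitA", "prey2")` in one, only the first pair passes the default threshold.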

pISTil Web interface and utility

A web interface was designed to allow manual checking of all information associated with IST annotation. The pISTil web interface was fully implemented in PHP/PostgreSQL [see Additional file 1]. A demonstration of the web site capabilities is available at . Through this web interface, users can easily access yeast two-hybrid meta-data, such as project, bait, plate and IST information (see Figure 3 and documentation). An advanced search interface allows querying and ranking of protein-protein interaction annotations using multiple criteria, such as the quality of ISTs, "in frame" IST annotation, the percentage of identity of ISTs, the e-value and the number of independent IST observations associated with non-redundant protein-protein interactions. The resulting interactions between bait and prey proteins can be displayed as an HTML table. This table can be sorted according to the number of independent ISTs associated with non-redundant protein-protein interactions (Figure 3). Functionality for the design and visualisation of the minimal interacting domain is also provided for further experimental validation. This minimal interacting domain is obtained by extracting the minimal common protein sequence from the multiple alignment of the independent ISTs defining a non-redundant protein-protein interaction (Figure 3).
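The minimal interacting domain described above can be approximated as the region of the prey protein covered by every independent IST. The Python sketch below computes that region as the intersection of the alignment intervals; it assumes each IST is summarised by the 1-based, inclusive start and end of its alignment on the prey protein, which may differ from how the pISTil web interface derives the domain from the multiple alignment. The function name is hypothetical.

```python
def minimal_interacting_domain(alignments):
    """Return the (start, end) region shared by all IST alignments, or None.

    alignments -- list of (start, end) positions on the prey protein,
                  1-based and inclusive, one tuple per independent IST
    """
    if not alignments:
        return None
    # The common region starts at the latest start and ends at the earliest end
    start = max(s for s, _ in alignments)
    end = min(e for _, e in alignments)
    return (start, end) if start <= end else None
```

With alignments covering residues 10-200, 50-180 and 40-250, the shared region is residues 50-180; if two ISTs do not overlap at all, no common domain is reported.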

Figure 3

pISTil web interface screenshots. Complete details on how to use the interface are given in the pISTil documentation [see Additional file 1]. a - Home page: . b - Multicriteria search result page: . c - Search independent IST result page: . d - Interaction domain result page: . e - PPI information result page: .

By using the PSI-MI (Proteomics Standards Initiative-Molecular Interaction) standard as the file format for molecular interaction output and by carefully following the MIMIx guidelines, efforts have also been made to facilitate unified exchange of protein-protein interaction data with the main public interaction data providers, such as those belonging to the IMEx consortium.

Conclusion

We have developed pISTil, a pipeline for large-scale identification and analysis of IST data generated by the yeast two-hybrid approach. This application is dedicated to laboratories wishing to automatically process, easily visualise and efficiently share yeast two-hybrid data. The use of such a standard approach will facilitate comparisons of datasets and will improve the quality of protein-protein interaction network reconstruction in systems biology projects. Finally, next-generation sequence tag projects relying on cDNA libraries may also take advantage of this open source and efficient pipeline. pISTil is available under the GNU General Public License and may be downloaded from its project website.

Availability and requirements

• Project name: pISTil
• Project home page:
• Operating system(s): Mac OS X 10.4x or higher, Linux (Linux 2.6.18-1.2798.fc6) and Unix Solaris systems (SunOS 5.10)
• Programming language: Perl 5.0 or higher, PHP (php4 or php5), PostgreSQL 8.x or higher
• Other requirements: Phred, Apache 2.0, Staden 1.6.0, NCBI BLAST Toolkit
• License: GNU General Public License
• Any restrictions to use by non-academics: License required

Abbreviations

bp: base pairs; EST: Expressed Sequence Tag; IST: Interaction Sequence Tag; nt: nucleotide; PSI-MI: Proteomics Standards Initiative-Molecular Interaction; Y2H: Yeast Two-Hybrid; ABI: Applied Biosystems INC; SCF: Standard Chromatogram Format; HTML: Hypertext Markup Language.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

VN designed the study and drafted the manuscript. JP developed the pISTil system and the pISTil web interface and wrote the pISTil documentation. LM contributed to the development of the pISTil web interface and the PSI-MI XML export, tested the pISTil software, and corrected the pISTil documentation and manuscript. BdC, LT and POV contributed to the design of the pISTil algorithm and corrected the manuscript. VL and CRC provided funding and corrected the manuscript. All authors read and approved the final manuscript.

Additional file 1

The pISTil documentation. This documentation gives detailed information on how to install, run and use the pipeline as well as its associated web interface.

References

1. The Staden package, 1998.

Authors: R Staden; K F Beal; J K Bonfield
Journal: Methods Mol Biol Date: 2000

Review 2. The minimum information required for reporting a molecular interaction experiment (MIMIx).

Authors: Sandra Orchard; Lukasz Salwinski; Samuel Kerrien; Luisa Montecchi-Palazzi; Matthias Oesterheld; Volker Stümpflen; Arnaud Ceol; Andrew Chatr-aryamontri; John Armstrong; Peter Woollard; John J Salama; Susan Moore; Jérôme Wojcik; Gary D Bader; Marc Vidal; Michael E Cusick; Mark Gerstein; Anne-Claude Gavin; Giulio Superti-Furga; Jack Greenblatt; Joel Bader; Peter Uetz; Mike Tyers; Pierre Legrain; Stan Fields; Nicola Mulder; Michael Gilson; Michael Niepmann; Lyle Burgoon; Javier De Las Rivas; Carlos Prieto; Victoria M Perreau; Chris Hogue; Hans-Werner Mewes; Rolf Apweiler; Ioannis Xenarios; David Eisenberg; Gianni Cesareni; Henning Hermjakob
Journal: Nat Biotechnol Date: 2007-08 Impact factor: 54.908

Review 3. The Staden sequence analysis package.

Authors: R Staden
Journal: Mol Biotechnol Date: 1996-06 Impact factor: 2.695

4. A map of the interactome network of the metazoan C. elegans.

Authors: Siming Li; Christopher M Armstrong; Nicolas Bertin; Hui Ge; Stuart Milstein; Mike Boxem; Pierre-Olivier Vidalain; Jing-Dong J Han; Alban Chesneau; Tong Hao; Debra S Goldberg; Ning Li; Monica Martinez; Jean-François Rual; Philippe Lamesch; Lai Xu; Muneesh Tewari; Sharyl L Wong; Lan V Zhang; Gabriel F Berriz; Laurent Jacotot; Philippe Vaglio; Jérôme Reboul; Tomoko Hirozane-Kishikawa; Qianru Li; Harrison W Gabel; Ahmed Elewa; Bridget Baumgartner; Debra J Rose; Haiyuan Yu; Stephanie Bosak; Reynaldo Sequerra; Andrew Fraser; Susan E Mango; William M Saxton; Susan Strome; Sander Van Den Heuvel; Fabio Piano; Jean Vandenhaute; Claude Sardet; Mark Gerstein; Lynn Doucette-Stamm; Kristin C Gunsalus; J Wade Harper; Michael E Cusick; Frederick P Roth; David E Hill; Marc Vidal
Journal: Science Date: 2004-01-02 Impact factor: 47.728

5. STRING 8--a global view on proteins and their functional interactions in 630 organisms.

Authors: Lars J Jensen; Michael Kuhn; Manuel Stark; Samuel Chaffron; Chris Creevey; Jean Muller; Tobias Doerks; Philippe Julien; Alexander Roth; Milan Simonovic; Peer Bork; Christian von Mering
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

6. IntAct--open source resource for molecular interaction data.

Authors: S Kerrien; Y Alam-Faruque; B Aranda; I Bancarz; A Bridge; C Derow; E Dimmer; M Feuermann; A Friedrichsen; R Huntley; C Kohler; J Khadake; C Leroy; A Liban; C Lieftink; L Montecchi-Palazzi; S Orchard; J Risse; K Robbe; B Roechert; D Thorneycroft; Y Zhang; R Apweiler; H Hermjakob
Journal: Nucleic Acids Res Date: 2006-12-01 Impact factor: 16.971

7. MINT: the Molecular INTeraction database.

Authors: Andrew Chatr-aryamontri; Arnaud Ceol; Luisa Montecchi Palazzi; Giuliano Nardelli; Maria Victoria Schneider; Luisa Castagnoli; Gianni Cesareni
Journal: Nucleic Acids Res Date: 2006-11-29 Impact factor: 16.971

8. VirHostNet: a knowledge base for the management and the analysis of proteome-wide virus-host interaction networks.

Authors: Vincent Navratil; Benoît de Chassey; Laurène Meyniel; Stéphane Delmotte; Christian Gautier; Patrice André; Vincent Lotteau; Chantal Rabourdin-Combe
Journal: Nucleic Acids Res Date: 2008-11-04 Impact factor: 16.971

9. Hepatitis C virus infection protein network.

Authors: B de Chassey; V Navratil; L Tafforeau; M S Hiet; A Aublin-Gex; S Agaugué; G Meiffren; F Pradezynski; B F Faria; T Chantier; M Le Breton; J Pellet; N Davoust; P E Mangeot; A Chaboud; F Penin; Y Jacob; P O Vidalain; M Vidal; P André; C Rabourdin-Combe; V Lotteau
Journal: Mol Syst Biol Date: 2008-11-04 Impact factor: 11.429

10. Ensembl 2007.

Authors: T J P Hubbard; B L Aken; K Beal; B Ballester; M Caccamo; Y Chen; L Clarke; G Coates; F Cunningham; T Cutts; T Down; S C Dyer; S Fitzgerald; J Fernandez-Banet; S Graf; S Haider; M Hammond; J Herrero; R Holland; K Howe; K Howe; N Johnson; A Kahari; D Keefe; F Kokocinski; E Kulesha; D Lawson; I Longden; C Melsopp; K Megy; P Meidl; B Ouverdin; A Parker; A Prlic; S Rice; D Rios; M Schuster; I Sealy; J Severin; G Slater; D Smedley; G Spudich; S Trevanion; A Vilella; J Vogel; S White; M Wood; T Cox; V Curwen; R Durbin; X M Fernandez-Suarez; P Flicek; A Kasprzyk; G Proctor; S Searle; J Smith; A Ureta-Vidal; E Birney
Journal: Nucleic Acids Res Date: 2006-12-05 Impact factor: 16.971
