
Targeted exploration and analysis of large cross-platform human transcriptomic compendia.

Qian Zhu¹, Aaron K Wong¹, Arjun Krishnan², Miriam R Aure³, Alicja Tadych², Ran Zhang⁴, David C Corney⁴, Casey S Greene⁵, Lars A Bongo⁶, Vessela N Kristensen⁷, Moses Charikar⁸, Kai Li⁸, Olga G Troyanskaya⁹.

Abstract

We present SEEK (search-based exploration of expression compendia; http://seek.princeton.edu/), a query-based search engine for very large transcriptomic data collections, including thousands of human data sets from many different microarray and high-throughput sequencing platforms. SEEK uses a query-level cross-validation-based algorithm to automatically prioritize data sets relevant to the query and a robust search approach to identify genes, pathways and processes co-regulated with the query. SEEK provides multigene query searching with iterative metadata-based search refinement and extensive visualization-based analysis options.

Substances:
Hedgehog Proteins
RNA

Year: 2015 PMID: 25581801 PMCID: PMC4768301 DOI: 10.1038/nmeth.3249

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

The accumulation of human gene expression data in public repositories, such as The Cancer Genome Atlas[1] and Gene Expression Omnibus[2], offers unprecedented opportunities for data-driven characterization of biological pathways that underlie human diseases. Unsupervised, exploratory approaches are particularly suitable for data-driven discovery and in settings with insufficient or biased training data. However, traditional unsupervised methods, such as clustering and bi-clustering[3,4], do not readily extend to compendia containing thousands of data sets from different expression technologies and platforms. Query-based search can enable biomedical researchers to effectively explore and analyze the large collection of expression data sets to identify co-expressed genes in order to explore functional relationships, and make inferences about pathway function with regard to query genes of interest. However, existing search approaches are limited to smaller compendia in model organisms[5,6] or, in human, to identifying similar arrays[7] or performing gene-level search on a single microarray platform[8]. We present SEEK (Search-based Exploration of Expression Kompendia), a robust, cross-platform search system capable of handling very large compendia of human expression data across multiple expression platforms, including microarray and next-generation sequencing (NGS) technologies, and automatically prioritizing data sets relevant to the user’s single or multi-gene query to identify genes co-regulated with the query in informative data sets. SEEK provides biomedical researchers with a systems-level, unbiased exploration of diverse human pathways, tissues, and diseases represented in the entire heterogeneous human compendium. The system integrates thousands of data sets on-the-fly using a novel cross-validation-based data set-weighting algorithm, which robustly identifies relevant data sets and leverages them to retrieve genes co-regulated with the query. 
It supports sophisticated biological search contexts defined by multi-gene queries and enables cross-platform analysis, with the current compendium including 155,025 experiments spanning 5,210 data sets from 41 different microarray and RNASeq platforms (Fig. 1a and Supplementary Data 1). It has been implemented in a user-friendly, interactive web-interface (http://seek.princeton.edu), which includes expression visualization and interpretation modules (Fig. 1a). This interface facilitates hypothesis generation by providing 1) intuitive expression visualizations of the retrieved co-expressed genes, 2) explorations of individual data sets to establish associations between co-expressed genes and biological variables, and 3) further refinement of the search results such as limiting data sets to a specific tissue (e.g. brain or kidney) or disease (e.g. primary tumor or non-cancerous disease).

Figure 1

The SEEK system overview and systematic functional evaluation

(a) The system overview. Users begin by defining a query gene set of interest. SEEK can easily accommodate gene sets as small as 1–2 genes and as large as 100 genes (step 1). The SEEK search engine searches the entire compendium, and returns genes that are co-expressed with the query and the top relevant data sets (steps 2, 3). The web user-interface provides visualizations of gene co-expressions across data sets (step 4), and enables users to iteratively refine their search (Fig. 2) and further analyze the results through condition-specific view (step 5). The latter allows users to check possible associations with the measured outcomes in order to interpret the co-expressed genes (Supplementary Note 3). (b) Gene retrieval evaluations across 995 diverse GO biological process terms, for each of SEEK, MEM, Gene recommender, and meta-data set correlation algorithms (Supplementary Note 1). Queries of diverse sizes (2–20 genes) were selected randomly among each term’s genes to evaluate the precision of retrieving the remaining genes in each term. Individual term performances (Supplementary Data 2) and additional detailed comparative evaluations (Supplementary Figs. 1, 2) are provided.

The search algorithm (Methods) allows multi-gene queries and includes a gene hubbiness[9,10] correction procedure, a novel cross-validation data set weighting method, and finally a summarization procedure to calculate the final score for each gene. Prior to applying the search algorithm, the data compendium is pre-processed to make correlation distributions comparable across data sets, and then a hubbiness-correction procedure is applied to remove biases caused by generically well-coexpressed genes not specific to the user’s area of interest that is defined by the query. The data set weighting algorithm then prioritizes relevant data sets based on the query. The intuition of this approach is to up-weight data sets where a subset of the query genes can retrieve the remaining query genes well based on normalized, hubbiness-corrected co-expression in that data set (cross-validation-based weighting). This approach is effective even when not all query genes are co-expressed. Finally, the integrated gene scores are calculated based on the data set weights and genes’ co-expression patterns in each data set to provide a final gene ranking. SEEK is accurate and robust in a large-scale gene-retrieval assessment across a diverse array of biological contexts. Specifically, we constructed over 129,000 queries spanning 995 human GO biological process gene-sets (by choosing subsets of genes from each process) and evaluated the ability of the algorithm to retrieve the remaining genes in the process (Methods). This set-up was designed to simulate realistic situations where the query genes are biologically coherent, but are not necessarily well co-expressed, and users are interested in identifying genes functionally related to the query (in this case, members of the same biological process). 
SEEK’s performance is robust across a wide range of pathways (Supplementary Data 2), and it consistently outperforms previous search approaches, including the only query-based human search system MEM[8], Gene recommender[6] (not available for human as a resource), and meta-data set correlations (Fig. 1b and Supplementary Note 1). Furthermore, our evaluation demonstrates that SEEK’s support for multi-gene queries enhances the algorithm’s ability to effectively weight relevant data sets from the compendium (Supplementary Fig. 1a), and that the algorithm is robust to query noise (Supplementary Fig. 2). Importantly, our evaluation demonstrates the benefits of robust search of a compendium with thousands of expression data sets, as SEEK’s performance improves with the inclusion of more microarray and RNASeq data sets in the compendium, assessed by sub-sampling our large compendium to create smaller subsets (Supplementary Fig. 3 and Supplementary Data 3). Furthermore, being able to integrate the full scale of the existing human gene expression data allows the approach to support focused queries covering diverse areas of biology (Supplementary Fig. 4), providing strong performance across diverse processes including erythrocyte differentiation (44-fold increase of precision over random (FIOR) at 10% recall) and glutamate signaling (104-fold) (Supplementary Fig. 4). In contrast, using the most relevant single data set for the same query yields weak performance of just 3 and 6 FIOR for the two processes respectively, demonstrating the value of using the entire compendium. We illustrate the power of SEEK and multi-gene queries by using SEEK to identify genes dysregulated in the Hedgehog (Hh) pathway and the corresponding tissues and disease states where the Hh pathway is hyper-activated. We use Hh genes GLI1, GLI2, and PTCH1 as the query, where transcription factors GLI1 and GLI2 have been suggested as pathway markers of Hh signaling[11]. 
By examining this query in the context of a large compendium of expression data sets (Fig. 2a and Supplementary Fig. 5), we observe a wide prevalence of aberrant Hh signaling across many diseased tissues (Supplementary Fig. 5). The top-ranked data sets have substantially higher weights, indicating the presence of a strong query-related signal in these data (Supplementary Fig. 5), and appear to be more specific to the Hh query than to random queries (Supplementary Fig. 6a). These highly weighted data sets include studies of tumors with previously documented connections to aberrant Hh signaling, such as medulloblastoma, where over-activation of Hh has been documented[12,13], human germ cell tumors, where Hh pathway mutations have been linked to aberrant Hh activation in human germ cells[14], and malignant rhabdoid tumors[15,16], where mutations have been found to lead to Hh signaling activation[16]. Thus, SEEK correctly identifies data sets relevant to Hh signaling and helps explore the important role of the Hh pathway in a wide array of cancer types. The data set weighting leads to accurate retrieval of other genes in the Hh pathway, including Hh pathway signaling receptors and their associated genes SMO, PTCH2, HHIP, BOC[17], and the Cos2 homolog KIF7[18] (Fig. 2a and Supplementary Fig. 6b), as well as additional genes associated with Hh dysregulation in cancer (Supplementary Note 2).

Figure 2

Search results for the Hedgehog (Hh) query (GLI1, GLI2, PTCH1) and search refinement

(a) Data sets prioritized and genes retrieved for the query in the main result page, expression view. The result is retrieved from the Hh query after a global compendium search. The top ranked data sets (1) and the co-expressed gene list (2) are indicated. Conditions in each data set are hierarchically clustered in real-time according to the expression values of the top genes retrieved from the search (3). The expression heat-map of the genes in one of the data sets is shown in (4). (b) Illustration of the search refinement function. Refine Search enables users to narrow the scope of their search based on a powerful and broad set of selection criteria including tissue, cell-type, or disease categories, platforms, or rank of data sets from initial search (Supplementary Note 3). (c) The final results after limiting the search scope to brain data sets. Brain-specific co-expressions are noted in this case with higher co-expression scores to the query and better groupings of conditions than the initial search. SEEK also has alternative view modes such as co-expression view, and condition-specific view (Supplementary Note 3).

The SEEK interface can visualize the abovementioned results including the top-ranked data sets, genes, and their co-expression profiles, using flexible and interactive visualizations (an example of the Hh query is provided in Fig. 2a). The main search result page provides users with the ability to perform extensive follow-up analyses, including functional analysis of results with a co-expression view that summarizes the query and retrieved genes’ co-expression across 50 data sets at a time (Supplementary Note 3). The users can also examine the behavior of any gene in a given data set in detail through a condition-specific view (Fig. 1a), where they can examine associations between co-expressed genes and treatments or outcomes based on data set metadata. An additional post-search analysis, the search refinement function, allows users to iteratively refine their search by limiting the scope of the query search to data sets of a specific disease or tissue of interest, e.g. brain or non-cancer data, if the user wishes to further reveal disease-state or tissue-specific co-expression patterns (Fig. 2b, c). Refine Search currently provides customized search over not only the 2,685 cancer data sets of various tissue origins, but also almost 2,000 non-cancer data sets, including nearly 280 stem cell, over 100 neurodegenerative disease, and 1,400 various immune and other cell type related data sets. Thus, the proposed search approach is effective in enabling robust and accurate search over very large and diverse human expression compendia by defining specific biological questions based on multi-gene queries. Such compendium-wide search is powerful as it identifies and combines relevant information across many data sets, each representing a mixture of signals from diverse pathways affected by disease, environmental factors, and clinical or experimental treatments. 
SEEK is based on measuring co-expressions, which minimizes biases toward prior knowledge, and accurately extracts functional information without need to explicitly model outcome variables such as treatment and control experiments (Fig. 1b and prior works[5,6,8,19]). The use of co-expression thus enables the robust integration of a large number of data sets from diverse tissues, cell lines, and disease origins, generated from diverse platforms, and it can be extended to make functional comparisons across organisms. A key challenge here is that the search results can be polluted by batch effects[20], poor quality data sets, or even good quality data sets irrelevant to the user’s query context. Yet the detailed, targeted correction of these issues in each data set or modeling of each outcome variable is impossible in the context of a very large, multi-platform compendium. SEEK’s data set weighting algorithm addresses this challenge, by enabling multi-gene query support for constructing expressive search contexts, and using a discriminative algorithm for identifying which data sets are relevant and accurate in representing query-related biological processes. This algorithm thus automatically down-weights low-quality data sets (e.g. with severe batch effects) (Supplementary Fig. 7 and Supplementary Note 4) and provides accurate retrieval of functionally related genes and data sets (Fig. 1b and Supplementary Figs. 1, 2). In summary, SEEK enables biomedical researchers to tap the enormous potential of existing expression data in a flexible, integrative, and interactive way. SEEK allows users to prioritize existing data sets. To our knowledge, SEEK is the first search system that addresses the challenges of cross-platform and cross-data set integrated search in human, integrating the large diversity of microarray and RNASeq platforms, thus fully utilizing the expression compendium, and is the largest scale integration of human transcriptomic data to date. 
We demonstrated that the ability to effectively search such a large and diverse compendium is important, and that search performance benefits from leveraging more diverse data. As such, we plan to regularly update SEEK’s compendium as new microarray and RNASeq data sets become publicly available.

Online methods

Data preparation and correlation normalization

SEEK assembles its human gene expression compendium by obtaining data sets from NCBI’s Gene Expression Omnibus (GEO) database[2] and the Cancer Genome Atlas (TCGA)[1]. The compendium consists of data sets from 41 platforms, including 32 microarray platforms from Affymetrix, Agilent, and Illumina, and 9 RNA sequencing platforms (Supplementary Data 1). These platforms were chosen based on the number of available data sets and the availability of raw data to perform consistent processing for each platform. The data sets were processed consistently by applying platform-specific procedures on their raw measurements (Supplementary Note 5 and Supplementary Data 4) to remove the systematic differences among data sets[21]. The normalized data sets containing gene-level expression values can be accessed through the SEEK website. The next step of data processing is calculating the Pearson correlations $r_d(x, y)$ between all pairs of genes x and y in each data set d. As correlation values arising from different genome-wide distributions are not directly comparable across data sets, a Fisher transform procedure[22] is applied to convert the distribution of correlations to a normal-like distribution: $f_d(x, y) = \frac{1}{2}\ln\frac{1 + r_d(x, y)}{1 - r_d(x, y)}$, where $f_d(x, y)$ is the Fisher-transformed score. Then, the data are translated to z-scores for standardization: $z_d(x, y) = \frac{f_d(x, y) - \mathrm{avg}(f_d)}{\mathrm{std}(f_d)}$, where $\mathrm{avg}(f_d)$ is the average of $f_d$ over all (x, y) pairs, and $\mathrm{std}(f_d)$ is the standard deviation of $f_d$. The normalization procedure has been used in previous studies[5,23], and has been found successful in transforming most correlation distributions that are generated from different platforms and technologies into a comparable normal distribution with mean 0 and variance 1 (Supplementary Fig. 8). Note that SEEK also works well with other correlation measures, such as Spearman and bicor[24] (Supplementary Fig. 9). 
We found that the normalized Pearson correlation provides performance better or comparable to that of Spearman and bicor in the search setting, likely because the normalization procedure and the SEEK algorithm itself reduce the effects of outliers in search performance (Supplementary Fig. 9).
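The normalization described above can be illustrated with a minimal Python sketch, assuming a single data set held in memory as a mapping from gene name to expression values. The function names and the clipping constant `eps` are assumptions made for this illustration and are not taken from the SEEK implementation.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def normalized_correlations(expr, eps=1e-6):
    """Return z-scored, Fisher-transformed correlations for all gene pairs.

    expr maps gene name -> list of expression values in one data set.
    eps clips |r| away from 1 so the Fisher transform stays finite
    (the clipping value is an assumption of this sketch).
    """
    fisher = {}
    for x, y in combinations(sorted(expr), 2):
        r = pearson(expr[x], expr[y])
        r = max(-1 + eps, min(1 - eps, r))
        # Fisher transform: 0.5 * ln((1 + r) / (1 - r))
        fisher[(x, y)] = 0.5 * math.log((1 + r) / (1 - r))
    avg = mean(fisher.values())
    std = pstdev(fisher.values())
    # Standardize so the data set's distribution has mean 0 and variance 1
    return {pair: (f - avg) / std for pair, f in fisher.items()}
```

Because both transforms are monotonic within a data set, the ordering of gene pairs is preserved; only the scale changes, which is what makes scores comparable across data sets.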

Search algorithm

The search algorithm takes two inputs: 1) a set of query genes $Q = \{q_1, \ldots, q_{|Q|}\}$, and 2) the set of correlation z-scores containing the query, $z_d(g, q)$, for each data set d in the data compendium D, for all genes q in Q and for all genes g in the genome G. The outputs of the algorithm are a prioritized list of data sets and a ranked list of co-expressed genes relevant to Q. The search algorithm consists of four steps. The first step is to load the pre-computed z-scores of Pearson correlations (from the normalization step above) containing the query across D. The second step is to perform hubbiness correction on each data set d. The correction procedure is motivated by the observation that hubby[9,10], or well-connected, genes in the co-expression network represent global, well-co-expressed processes[25]. Because gene connectivity is unbalanced in a scale-free co-expression network[9,10,26-28], such genes can contaminate the search results regardless of query composition and can lead to non-specific results in search or clustering approaches. To avoid the bias created by hubby genes that are not related to the user’s query or pathway of interest, our method corrects each gene g’s correlation to q in each data set d: $\tilde{z}_d(g, q) = z_d(g, q) - \frac{1}{|G|}\sum_{x \in G} z_d(g, x)$, where $\tilde{z}_d(g, q)$ is the hubbiness-corrected z-score. By subtracting g’s average correlation from the correlation of (g, q), we expect the resulting score to emphasize g’s co-expression specifically with the query rather than its general connectivity. The control of co-expression hubbiness enables the detection of specific biological signals in the data that would otherwise be swamped by broad co-expression patterns of the most well-connected genes. The third step performs cross-validation-based data set weighting. The goal is to rank data sets based on each data set’s relevance to the query[5]. The result will be the first output of the search system and will also be used to compute the final gene-score vector for the last step. 
The main idea is to upweight data sets where a subset of the query genes can retrieve the remaining query genes well based on normalized, hubbiness-corrected co-expression in that data set. Thus, it is analogous in spirit to the cross-validation procedures commonly used in machine learning, where a subset of the standard (in this case, the query) is hidden from the system to assess how well the method can predict these hidden genes. To describe the weighting method, we first introduce some notation. The data set d is implicit in each formula below and omitted for brevity; thus $\tilde{z}(g, q)$ is the corrected z-score for g to a query gene q in Q in data set d. Let R = (g(1), g(2), g(3), …, g(r)) be the sequence of genes at rank 1, 2, 3, …, r obtained from ordering genes by decreasing $\tilde{z}(g, q)$. That is, R satisfies: $\tilde{z}(g_{(1)}, q) \geq \tilde{z}(g_{(2)}, q) \geq \tilde{z}(g_{(3)}, q) \geq \ldots$ Let r(t, R) be the rank of gene t in the ranking R minus 1 (for example, r(g(1), R) = 0), and let p < 1 be a rate parameter, which we set at 0.99 based on empirical analysis (Supplementary Fig. 10). Then the weight of the data set is, up to a normalizing constant, $w = \sum_{q \in Q}\sum_{t \in Q - q} (1 - p)\, p^{r(t, R)}$ (Eq. 2), where R is the ranking induced by q. The weighting formula performs cross-validations on q in the set Q. The goal is to detect which query genes q can best retrieve the remainder of the query, Q – q; such q’s have a high contribution to w. We shorten r(t, R) in Eq. 2 as r(t). The exact form of this expression for the weight (i.e., a sum of $p^{r(t)}$ terms) is inspired by rank-biased precision[29], and is adapted to our setting to robustly measure the effectiveness of the data set in retrieving Q – q. Here, p < 1 is the rate parameter in rank-biased precision, and is the parameter of a geometric distribution, since r(t) assumes discrete values. The term $p^{r(t)}$ upweights genes t in the set Q – q that are high in the rank list (i.e., r(t) is small), which intuitively emphasizes those genes in the query that are highly co-expressed with each other. 
The measure has the desired property of upweighting pairs of query genes that are well correlated while not allowing the correlations between the uninformative, non-coherent part of the query to affect the weight of the data set, because the query genes may only be partially co-expressed in a given data set. Compared to previous methods[5], our method gains robustness to heterogeneous query signals, because the reward on the highly coherent query genes far outweighs the damaging effect of a few non-coherent query genes, which are poorly ranked relative to other query genes, have high r(t), and have scores $p^{r(t)}$ tending to zero. The last step of the algorithm calculates the final integrated gene scores to generate a master ranking of co-expressed genes that is the second output of the system (in addition to data set relevance weighting). We obtain the gene-to-query score matrix M, where the entry $M_{g,d}$ is the average corrected z-score of gene g to the query in data set d: $M_{g,d} = \frac{1}{|Q|}\sum_{q \in Q}\tilde{z}_d(g, q)$. With the data set weight vector from the previous step, $w = [w_1, w_2, \ldots]$, a simple formulation of the final gene-score vector F is given by the weighted average $F = \frac{\sum_{d \in D} w_d\, M_{\cdot,d}}{\sum_{d \in D} w_d}$, where $M_{\cdot,d}$ is the column of M for data set d. Although previous research had some success with this formulation[5], our findings show that it works well only in the presence of complete gene information with no missing genes in M. When there are heterogeneous sources of data in the compendium (e.g. different microarray and RNASeq platforms), the confounding factor of missing genes and partial gene rankings must be accounted for. Our approach is to modify the procedure above by employing threshold parameters to exclude a data set from weighting if it does not contain enough query genes, and to exclude a gene from the final ranking if it is not assayed by a sufficient number of data sets in the compendium (Supplementary Note 6). The pseudocode for the entire SEEK search algorithm can be found in Supplementary Note 6. The algorithm is robust to query composition (Supplementary Figs. 
1, 2) and data set quality, including automatically down weighting data sets with substantial batch effects (Supplementary Note 4 and Supplementary Fig. 7). For single-gene queries, the search algorithm performs the same steps above except that in the data set weighting step the algorithm assigns equal weight to all data sets. Thus, for single-gene queries, the search system will treat each data set equally and retrieve genes that are generally correlated with the query in the hubbiness-corrected space. If users wish to perform their single-gene searches in a tissue-specific or disease-specific manner, they can manually define a category of data sets using the extensive “Refine Search” interface on the SEEK website, which will restrict D in the search system input.
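The steps of the search algorithm can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the Sleipnir C++ implementation: each data set is assumed to be a nested dictionary `z[g][x]` of normalized z-scores with no self-entries, the weight is left unnormalized, missing genes are handled by averaging only over the data sets that assay a gene, and the threshold parameters of Supplementary Note 6 are not modeled. All function names are hypothetical.

```python
def hubbiness_correct(z, query):
    """Subtract each gene's average z-score from its z-score to every query gene.

    z maps gene -> {other gene -> z-score} for one data set (no self-entries).
    Returns zt with zt[g][q] = z[g][q] - mean over x of z[g][x].
    """
    corrected = {}
    for g, row in z.items():
        avg = sum(row.values()) / len(row)
        corrected[g] = {q: row[q] - avg for q in query if q in row}
    return corrected


def dataset_weight(zt, query, p=0.99):
    """Cross-validation weight of one data set.

    Each query gene q ranks all other genes by corrected z-score to q; every
    held-out query gene t then contributes (1 - p) * p ** r(t), where r(t) is
    the rank of t minus 1. Normalization by query size is omitted (assumption).
    """
    w = 0.0
    for q in query:
        ranked = sorted((g for g in zt if g != q and q in zt[g]),
                        key=lambda g: zt[g][q], reverse=True)
        rank = {g: i for i, g in enumerate(ranked)}  # top gene has rank 0
        w += sum((1 - p) * p ** rank[t] for t in query if t != q and t in rank)
    return w


def gene_scores(compendium, query, p=0.99):
    """Return (data set weights, integrated gene scores) for a query.

    compendium maps data set name -> z (as in hubbiness_correct).
    Single-gene queries receive equal data set weights, as described in the text.
    """
    weights, num, den = {}, {}, {}
    for name, z in compendium.items():
        zt = hubbiness_correct(z, query)
        w = dataset_weight(zt, query, p) if len(query) > 1 else 1.0
        weights[name] = w
        for g, row in zt.items():
            if g in query or not row:
                continue
            m = sum(row.values()) / len(row)  # M[g, d]: mean corrected z to query
            num[g] = num.get(g, 0.0) + w * m
            den[g] = den.get(g, 0.0) + w
    scores = {g: num[g] / den[g] for g in num if den[g] > 0}
    return weights, scores
```

Sorting `scores` by decreasing value gives the gene ranking, and sorting `weights` gives the data set ranking. With p = 0.99 the contribution of a held-out query gene decays slowly with rank, so a data set is rewarded whenever query genes rank each other near the top of the genome-wide list.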

Estimating the significance of gene scores

We estimate a P value for each retrieved gene by comparing the integrated score of each gene with scores from a pool of 10,000 randomly generated queries with diverse query sizes varying from 1 to 100 genes. The random pool allows SEEK to estimate the significance of the gene score as well as evaluate the specificity of that gene to the query genes (as opposed to random queries). For a given gene g and its final co-expression score $S_Q(g)$ generated from the user’s query Q, the P value of g is estimated as the number of random queries R in which $S_R(g) > S_Q(g)$, divided by the random pool size.
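A minimal Python sketch of this empirical P value follows, assuming the scores of gene g under each random query have already been computed by the search algorithm. The function names, the seed, and the uniform choice of query size are assumptions made for illustration only.

```python
import random


def random_query_pool(genes, n_queries=10000, min_size=1, max_size=100, seed=0):
    """Draw random queries with sizes between min_size and max_size genes.

    Sizes are drawn uniformly here, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    genes = list(genes)
    top = min(max_size, len(genes))
    return [rng.sample(genes, rng.randint(min_size, top)) for _ in range(n_queries)]


def empirical_p_value(observed, random_scores):
    """Fraction of random queries whose score for the gene exceeds the observed score."""
    if not random_scores:
        raise ValueError("random pool is empty")
    exceed = sum(1 for s in random_scores if s > observed)
    return exceed / len(random_scores)
```

With a pool of 10,000 random queries, the smallest nonzero P value that this estimator can report is 1/10,000.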

Algorithm and interface implementations

The SEEK algorithm is implemented in C++ and has been integrated into the open-source C++ Sleipnir library, enabling other computational users to use and expand SEEK without website tie-in[30]. The backend employs the efficient data structures from the Sleipnir library to facilitate the process of handling large query sets of over 100 genes without memory overflow. SEEK’s jobs are parallelized to make full use of the multiprocessor resources and their processing power. The SEEK web server is constructed with some of the latest web technologies including JQuery and Qtip2 libraries. Dynamic pages are generated with Java servlets running behind the Apache Tomcat server on a Red Hat CentOS Linux operating system. In addition, Ajax technology is deployed to send and retrieve data from the server asynchronously such that users can receive instant feedback on their gene enrichment analysis, expression zoom-in function, and data set selection module without having to leave or refresh the page.

Metadata processing

SEEK categorizes data sets into tissue and disease groups by mining the description, title, and sample level characteristic fields in data sets’ metadata. The text mining procedure utilizes the UMLS MetaThesaurus[31] and BRENDA[32] controlled vocabularies to extract predefined concept names that are present in the individual fields. To ensure that tissue groups are accurate, we manually reviewed annotations to the frequently appearing terms generated by text mining. Similarly, we formed additional ‘meta’ data set groups, such as cancer and non-cancer groups and the multi-tissue profiling group (Supplementary Data 5), to provide users with the ability to limit their search to such groups under the “Refine Search” feature of the website.
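The kind of controlled-vocabulary matching described above can be illustrated with a short Python sketch. The metadata field names and the two-entry vocabulary in the usage note are hypothetical stand-ins; the actual procedure draws its concept names from the UMLS MetaThesaurus and BRENDA and includes manual review.

```python
import re


def annotate(metadata, vocabulary, fields=("title", "description", "characteristics")):
    """Return the sorted concept names whose terms occur in any metadata field.

    metadata maps field name -> free text (field names are hypothetical).
    vocabulary maps concept name -> list of terms or synonyms for that concept.
    """
    text = " ".join(metadata.get(f, "") for f in fields).lower()
    found = set()
    for concept, terms in vocabulary.items():
        for term in terms:
            # Whole-word match, so "renal" does not match inside "adrenal"
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", text):
                found.add(concept)
                break
    return sorted(found)
```

For example, with the hypothetical vocabulary `{"brain": ["brain", "cerebellum"], "kidney": ["kidney", "renal"]}`, a data set whose description mentions "human cerebellum" would be assigned to the brain group only.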

Large-scale functional evaluation setup

We conducted a comprehensive evaluation of SEEK in comparison with existing algorithms Gene recommender, MEM, and meta-data set correlation search (Supplementary Note 1). We tested each system’s ability to retrieve genes from the same biological process given some chosen genes from the process as queries. For the evaluation, we partitioned the genes in each of the 995 GO biological process terms (Supplementary Data 2) into a query building set and a testing set. The query building set consists of a random sample of 25 genes from each term if the term has more than 40 genes, or else it is made of half of the number of genes in the term. Queries were formed by repeatedly sampling genes from the set, so that each query size has 10 different queries of that size represented, and we iteratively generated queries for sizes 2, 3, 4, … up to Q genes, where Q=0.8|query building set|. The testing set consists of the remaining genes in the term (after subtracting the query building set), and is used for evaluating the queries’ retrieval results. A precision recall (PR) curve is computed on a per-query basis, averaged over all queries of a term, and finally averaged over all evaluated terms to derive an overall system performance plot for each method. Fold precision over random is calculated at an indicated recall (10%), and uses a random ranking of genes where genes’ rank positions are shuffled. By selecting genes randomly from each process in building the queries, we mimic the situation in which the query genes are functionally related but not well co-expressed. By keeping the two sets (query building and testing) separate in the evaluation, we can reduce the performance variation between the queries of the same size within a process. For building gold standard GO gene-set for evaluation, we used gene annotations with experimental evidence codes (IMP, IGI, IPI, IDA, IEP, EXP) as well as TAS (traceable author statements) and NAS (non-traceable author statement). 
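The query construction described above can be sketched in Python as follows. The function names are hypothetical, and rounding 0.8 times the size of the query building set down to an integer is an assumption, since the text does not state the rounding rule.

```python
import random


def split_term(genes, rng):
    """Partition a term's genes into a query building set and a testing set.

    The building set has 25 genes if the term has more than 40 genes,
    otherwise half of the term's genes.
    """
    genes = list(genes)
    n_build = 25 if len(genes) > 40 else len(genes) // 2
    build = rng.sample(genes, n_build)
    build_set = set(build)
    test = [g for g in genes if g not in build_set]
    return build, test


def sample_queries(build, rng, per_size=10):
    """Sample per_size queries for each size 2, 3, ... up to 0.8 * |build|."""
    max_size = (8 * len(build)) // 10  # integer arithmetic avoids float rounding
    return [rng.sample(build, size)
            for size in range(2, max_size + 1)
            for _ in range(per_size)]
```

For a term with more than 40 genes, this yields queries of 2 to 20 genes drawn from 25 building genes, that is, 190 queries per term, with all remaining genes of the term reserved for testing.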
To select the GO slim set (Supplementary Data 3) used for studying the effect of compendium size, we carefully examined the title and description of the GO terms in the context of the GO hierarchy and arrived at a non-redundant subset of GO terms that are both specific enough to be informative and diverse enough to represent the hierarchy, similar to our previous approach[33]. To evaluate SEEK’s performance as a function of the query size, we pooled together previously built biological process queries from 995 processes, then binned them by query size (2–20 genes). We examined 3 categories of biological processes based on process size: 20–40 genes, 40–100 genes, and 100–300 genes. Performance refers to the fold precision over random at 10% recall (fold PR@10%) in using each query to retrieve the remaining genes from its corresponding process. To evaluate the search system’s robustness to noisy query genes, we selected over 1,800 5-gene and 10-gene queries from 90 KEGG pathways with 50–100 genes per pathway. Each pathway had 10 queries selected of each query size. We established a “no noise” case, where each query was made purely of genes belonging to the same KEGG pathway, and a noisy case, where 1, 2, or 4 random genes were added to each query. The fraction (fold PR@10% of each noisy query) / (fold PR@10% of the corresponding “no noise” query) was calculated, where fold PR@10% refers to the performance of retrieving KEGG pathway genes using the queries.
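The fold precision over random at 10% recall can be sketched in Python as follows. Two choices here are assumptions of the sketch rather than details from the paper: precision is read at the first rank where recall reaches the target, and the random baseline is taken as the expected precision of a shuffled ranking (the fraction of positives among ranked genes) instead of an explicit shuffle.

```python
import math


def precision_at_recall(ranking, positives, recall=0.10):
    """Precision at the first rank where the target recall is reached.

    ranking is a list of genes ordered from best to worst score.
    positives are the held-out genes (testing set) to be retrieved.
    """
    pos = set(positives) & set(ranking)
    if not pos:
        raise ValueError("no positive genes are present in the ranking")
    # Small tolerance guards against float error such as 0.1 * 30 > 3
    needed = max(1, math.ceil(recall * len(pos) - 1e-9))
    hits = 0
    for i, gene in enumerate(ranking, start=1):
        if gene in pos:
            hits += 1
            if hits >= needed:
                return hits / i
    raise AssertionError("unreachable: all positives are in the ranking")


def fold_over_random(ranking, positives, recall=0.10):
    """Precision at the given recall divided by the expected random precision."""
    pos = set(positives) & set(ranking)
    baseline = len(pos) / len(ranking)
    return precision_at_recall(ranking, positives, recall) / baseline
```

The robustness measure described above is then the ratio of `fold_over_random` for a noisy query's ranking to that of the corresponding "no noise" query's ranking.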
