Ying Chen1, Kunie Sakurai1, Sumihiro Maeda2, Tohru Masui3, Hideyuki Okano2, Johannes Dewender4, Stefanie Seltmann4, Andreas Kurtz5, Hiroshi Masuya6, Yukio Nakamura7, Michael Sheldon8, Juliane Schneider9, Glyn N Stacey10, Yulia Panina1, Wataru Fujibuchi11. 1. Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 Kawahara-cho, Sho-goin, Sakyo-ku, Kyoto 606-8507, Japan. 2. Department of Physiology, Keio University School of Medicine, Tokyo 160-8582, Japan. 3. National Center for Medical Genetics, Keio University School of Medicine, Tokyo 160-8582, Japan. 4. Fraunhofer Institute for Biomedical Engineering, Biomedical Data and Bioethics, Anna-Louisa-Karsch-Strasse 2, 10178 Berlin, Germany. 5. Fraunhofer Institute for Biomedical Engineering, Biomedical Data and Bioethics, Anna-Louisa-Karsch-Strasse 2, 10178 Berlin, Germany; BIH Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. 6. Integrated Bioresource Information Division, RIKEN BioResource Research Center, Tsukuba, Ibaraki 305-0074, Japan. 7. Cell Engineering Division, RIKEN BioResource Research Center, Tsukuba, Ibaraki 305-0074, Japan. 8. Department of Genetics and Human Genetics Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA. 9. Harvard Catalyst|The Harvard Clinical and Translational Science Center, Boston, MA 02215, USA. 10. International Stem Cell Banking Initiative, 2 High Street, Barley, Hertfordshire SG88HZ, UK; National Stem Cell Resource Center, Institute of Zoology, Chinese Academy of Sciences, Beijing 100190, China; Innovation Academy for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China. 11. Center for iPS Cell Research and Application (CiRA), Kyoto University, 53 Kawahara-cho, Sho-goin, Sakyo-ku, Kyoto 606-8507, Japan. Electronic address: fujibuchi-g@cira.kyoto-u.ac.jp.
Abstract
The past decade has witnessed an extremely rapid increase in the number of newly established stem cell lines. However, due to the lack of a standardized format, data exchange among stem cell line resources has been challenging, and no system can search all stem cell lines across resources worldwide. To solve this problem, we have developed the Integrated Collection of Stem Cell Bank data (ICSCB) (http://icscb.stemcellinformatics.org/), the largest database search portal for stem cell line information, based on the standardized data items and terms of the MIACARM framework. Currently, ICSCB can retrieve >16,000 cell lines from four major data resources in Europe, Japan, and the United States. ICSCB is automatically updated to provide the latest cell line information, and its integrative search helps users collect cell line information for over 1,000 diseases, including many rare diseases worldwide, which has been a formidable task, thereby distinguishing itself from other database search portals.
Keywords:
ICSCB; MIACARM; cell line statistics; data integration; data portal; disease classification; diseased iPS cell line; international stem cell bank initiative; standardized search; stem cell line information
Since the first report of human induced pluripotent stem cells (iPSCs) (Takahashi et al., 2007), there has been a rapid increase in the number of iPSC lines and related information worldwide (Table 1). This remarkable growth has not only accelerated studies of regenerative medicine but also provided opportunities to understand such pragmatic issues as the quality of pluripotent stem cells (Nishizawa et al., 2016) and disease mechanisms (Sasaki et al., 2016). Stem cell banks and registries are expected to provide the necessary data on individual stem cell lines. However, exchanging data among institutions is not trivial, and the scientific reproducibility of stem cells, particularly iPSCs generated by different methods, depends on the information available, which is problematic for both basic studies and clinical applications (Isasi and Knoppers, 2011; Thirumala et al., 2009; Yaffe et al., 2016). Moreover, as technologies for the characterization of cell lines continue to advance, the addition of new quality standards as necessary data items has complicated and diversified data formats among stem cell banks and registries (Hug, 2009; Knoppers and Isasi, 2010). As an attempt to solve these problems, we previously reported the MIACARM (Minimum Information About a Cellular Assay for Regenerative Medicine) guidelines in 2016 (Sakurai et al., 2016), which proposed the utilization of standardized data items and formats for all stem cell lines in regenerative medicine. At present, MIACARM contains 260 items covering such areas as stem cell production and materials (e.g., donor information, source cell information, and cell culture medium and substrate information), cell banking processes, cell characterization, sterility testing, and even ethical concerns. Later, in 2018, a standardized nomenclature for pluripotent stem cells was introduced with the goals of unifying cell line codification and minimizing information loss and confusion regarding cell lines (Kurtz et al., 2018). Nevertheless, with the growing number of registered cell lines, existing data deposition formats have made it increasingly hard for both data depositors and users to seek and obtain cell lines collected under different projects, disease states, and privacy issues (Godard et al., 2003; Winickoff et al., 2009).
Table 1
Stem Cell Banks and Registries Worldwide (as of December 6, 2020)
| B/R (a) | Stem Cell Bank or Registry | Country | Website | Number of Cell Lines | Data Included in ICSCB's Registries (SKIP, eagle-i, hPSCreg; ○, Included; △, Planned) |
|---|---|---|---|---|---|
| B | BLCB | Spain | https://p-cmrc.cat/ | 176 | |
| R | hPSCreg | Germany | https://hpscreg.eu/ | 3,360 | ○ ○ |
| B/R | HipSci | United Kingdom | http://www.hipsci.org/ | 3,720 | ○ |
| B | EBiSC | Germany | https://ebisc.org/ | 897 | ○ |
| B/R | CIRM/FujiFilm | United States | https://fujifilmcdi.com/the-cirm-ipsc-bank/ | 1,554 | △ |
| B | Harvard Stem Cell Institute | United States | http://stemcelldistribution.harvard.edu/ | 41 | |
| B | NYSCF | United States | https://nyscf.org/ | 111 | ○ |
| B | NINDS Human Cell and Data Repository | United States | https://bioq.nindsgenetics.org/ | 162 | |
| B | WiCell Research Institute | United States | https://www.wicell.org/ | 1,519 | ○ △ |
| B/R | eagle-i | United States | https://www.eagle-i.net/ | 2,415 | ○ ○ |
| B/R | RIKEN BRC | Japan | https://www.brc.riken.jp/en/ | 4,102 | ○ |
| R | SKIP | Japan | https://skip.stemcellinformatics.org/ | 5,770 | ○ |
| B | JCRB | Japan | https://cellbank.nibiohn.go.jp/ | 31 | ○ |
| B | Taiwan Human Disease iPSC Service Consortium | Taiwan | https://catalog.bcrc.firdi.org.tw/ | 102 | ○ |
| B | National Stem Cell Bank of Korea | Korea | http://kscr.nih.go.kr/ | 147 | |
(a) B, bank; R, registry.
In this paper, as our next step toward the unification and utilization of stem cell line data worldwide, we report our new database portal, the Integrated Collection of Stem Cell Bank data (ICSCB), which was designed using MIACARM guideline items and formats to serve as an entrance “port” to individual data repositories. The main objectives of ICSCB are (1) to establish an integrated stem cell database portal that can cover the majority of stem cell resources in the world and (2) to offer users minimum but efficient access to information on stem cell lines based on MIACARM guidelines. Currently, ICSCB provides data on more than 15,000 stem cell lines registered in four major stem cell line databases: hPSCreg (Seltmann et al., 2015), SKIP (Kim et al., 2017), RIKEN BRC (Kobayashi et al., 2016), and eagle-i (Vasilevsky et al., 2012). ICSCB has a user-friendly search engine for stem cell lines and can be accessed directly at http://icscb.stemcellinformatics.org/ or, as a slim version that removes redundant cell line entries as far as possible, through the SHOGoiN (Human Omics Database for the Generation of iPS and Normal Cells) homepage at http://shogoin.stemcellinformatics.org/.
Results
Web Interface
ICSCB was designed for researchers searching for available cell lines to conduct various studies, such as regenerative medicine and disease analysis. Covering as many diverse cell lines as possible was the first priority when deciding which resources to include in ICSCB. Sharing cell line information between different stem cell banks and registries has been problematic due to different cell naming methods, different policies on cell assessment in different registries, unclear data sources, and so on. ICSCB is a collection of cell lines from four major and reliable cell line data resources based in Europe, Japan, and the United States. ICSCB updating is regularly performed for new SKIP and eagle-i stem cell line data as well as automatically performed for hPSCreg and RIKEN BRC data in a synchronized manner. Users can retrieve all related stem cell line information by using a free text search. Detailed information on a specific cell line can be accessed by clicking on the stem cell ID, which is linked to the information page in the original resources (Figure 1). The results can be further filtered according to users' requests. There may be several records for the same cell line if the cell line is included in multiple data resources. To provide users as much information as possible, the results page is designed to show cell lines with matching cell names as well as similar descriptions.
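Because the same cell line may appear under slightly different names in several resources, it can help to group records by a normalized name. The following Python sketch illustrates one simple way to do so; it is an assumption for illustration only, not the matching method used by ICSCB, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase a cell line name and keep only letters and digits,
    so that variants such as 'Line-A1' and 'line a1' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def group_by_name(records):
    """Map each normalized cell line name to the resources that list it."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_name(rec["name"])].append(rec["resource"])
    return dict(groups)


# Hypothetical records; the names are made up for illustration.
sample_records = [
    {"resource": "SKIP", "name": "Line-A1"},
    {"resource": "RIKEN BRC", "name": "line a1"},
    {"resource": "hPSCreg", "name": "Line-B2"},
]
groups = group_by_name(sample_records)
```

Here `groups` lists two resources under the key for the first line and one under the second. Name normalization alone cannot detect replicates that were registered under entirely different names.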
Figure 1
Web Interface of ICSCB
(A) The ICSCB search page. Any keyword related to cell lines (including cell line name, disease name, gender, and so on) can be used to perform an instant search.
(B) The ICSCB results page. Matched or partially matched cell lines are listed according to MIACARM terms. To check the details of the cell lines, the user can click on the stem cell ID, which is linked to the original source of cell line information.
Data Coverage
ICSCB covers more data than any other stem cell line repository available. The integration of all major data resources allows us to check the current state of stem cell research in the world (Figure 2). Although we recognize redundancies in the data, according to our statistics, iPSC lines constitute more than 80% of all cell lines, and the ratio of healthy to diseased donors is approximately 3 to 2 (Figures 2A and 2B). The total number of countries from which cell lines can be retrieved is 39 (as of December 6, 2020), of which the top 9 countries identified in SKIP and hPSCreg are (in descending order) the United Kingdom, the United States, Japan, Germany, China, Spain, Sweden, Denmark, and Taiwan (Figure 2C). In addition, as the number of iPSC lines generated from patient donors has been growing recently, ICSCB supports disease-oriented searches to help users find all disease-related stem cell lines by using disease names. The distribution of disease and disorder types is shown in Figure 2D.
Figure 2
Details of Cell Lines Collected by ICSCB
Cell line information is categorized as (A) stem cell type, (B) health/disease status of donor, (C) country that established the cell lines, and (D) disease category. ES cells, embryonic stem cells.
Easy Search Interface on SHOGoiN Homepage
ICSCB also has a quick and easy search module on the SHOGoiN homepage (https://stemcellinformatics.org/). SHOGoiN is a repository for accumulating and integrating diverse human cell information to support a wide range of research using cell-related data. The database consists of several modules that store cell lineage maps, transcriptomes, methylomes, cell conversions, cell type markers, and cell images with morphology data curated from public as well as contracted resources based on sophisticated cell taxonomy. Collaboration between ICSCB and SHOGoiN makes it possible for users to directly use free text searches for stem cell line data on the SHOGoiN homepage. The ICSCB easy search module in SHOGoiN supports a simplified ICSCB search with keywords, and the advanced search is designed to redirect users to the ICSCB homepage with full functions. Results from the SHOGoiN homepage have the same structure as in the ICSCB homepage.
Discussion
Concluding Remarks and Future Plan
So far, the registration and submission of newly established cell lines have been complicated by the lack of standardized data formats. Most data registries are currently limited by respective domestic policies and have adopted their information structures and validation processes independently (Andrews et al., 2015; Zarzeczny et al., 2009). The lack of standardized data formats has caused problems for researchers, who must usually search several websites to find the stem cell lines they are looking for (Wells et al., 2013). In the present work, we developed ICSCB, an integrated data distribution system that provides stem cell line information from major stem cell banks and registries all over the world. ICSCB adopts a standardized information format based on the “Source Cell” module of MIACARM to integrate different data resources while keeping important information.
ICSCB has several limitations; for instance, there exist cell lines having limited donor information and/or incomplete information, as well as replicates of the same line with different names. In addition, ICSCB contains only a few clinical-grade cells due to strict requirements for exhaustive as well as expensive quality checks and haplotype compatibility for clinical-grade lines. Currently, when searching for stem cell lines for a specific disease, users have limited access to a distinct aspect of research for the diseases registered in individual repositories. Indeed, our next step would be to establish an integrated and refined collection of research on stem cell lines in order to understand the possible causes and mechanisms of complex diseases on the basis of genetic background and environmental effects in terms of molecular pathways during the developmental process. We may need a new project based on new funding to establish such a global collaboration.
ICSCB has several issues that merit improvement. First, we plan to assign unique accession codes to all cell line entries by utilizing standardized nomenclature for pluripotent stem cells (Kurtz et al., 2018) in order to remove redundant cell line data. Second, as the cost of experimental technologies, such as genome/RNA sequencing or teratoma assay, to characterize stem cell lines decreases and it becomes easy to obtain various types of profiles on them, we may be able to define a standard profile set for a complete data format to render comparisons of cell lines more efficient. Third, once the RNA-sequencing or genome mutation data are collected, it will be possible to perform statistical analysis such as principal component analysis or other refined bioinformatics methods to mathematically map individual cell lines to a global stem cell feature space, such as differentiation propensity, carcinogenic potential, immune response, and so on. We expect ICSCB to further evolve, thereby providing users better accessibility to relevant stem cell lines.
In the future, to respond to the rapid growth in the number of stem cell lines, we will include more data resources in ICSCB, including the Taiwan Human Disease iPSC Service Consortium and other recently developed stem cell banks, to make ICSCB more resource abundant and usable. We also plan to add a detailed quality check to help users find stem cell lines of high quality. As the largest stem cell line information resource, we will support stem cell communities by improving the quality and increasing the scale of our database.
Experimental procedures
Data Resources
ICSCB resources were selected from existing major stem cell registries that collect cell line information in Europe, Japan, and the United States and stem cell banks that provide cell lines with information on the attributes. We checked the number of registered cell lines and the criteria for registration in these registries and banks to decide to what extent their cell line data can fulfill MIACARM guidelines for inclusion in ICSCB. Considering the size, accessibility, and diversity of the different databases, three stem cell registries and one stem cell bank were included: (1) SKIP (5,615 cell lines), (2) hPSCreg (3,360 cell lines), (3) RIKEN BRC (3,548 cell lines), and (4) eagle-i (3,548 cell lines) (as of December 6, 2020). These data resources were selected because they had the highest number of registered cell lines and a large diversity, which would provide a good regional balance of cell sources to reduce redundancies in cell line entries. RIKEN BRC basically collected cell lines from Japanese institutions, SKIP contained data mostly from other Japanese and Asian institutions, hPSCreg collected data mainly from European institutions, and eagle-i collected data mostly from the United States. Details of the data sources are listed in Figure 3.
Figure 3
Overview of ICSCB
ICSCB includes data from three stem cell registries and one cell bank in order to maximize data coverage worldwide.
Data Integration
Since our previous research on the listed stem cell banks 4 years ago (Sakurai et al., 2016), the number of registered cell lines has skyrocketed, from 1,483 to approximately 8,000. As a result, stem cell registries are tasked with collecting information on the rapidly increasing number of new cell lines and registering the cell lines into their databases as quickly as possible. However, because the stem cell banks and registries are using their own formats for data entry, the integration of the data into a centralized collection system is an extraordinary challenge. To solve this problem, we used a decentralized or distributed database system (Fujibuchi et al., 1998) by adopting items of different database formats into 16 attributes, or terms, from three MIACARM modules: stem cell general identification, donor identification, and source cell identification (Table 2). To practically integrate the data from the four data resources (SKIP, hPSCreg, RIKEN BRC, and eagle-i), we adopted a mechanism of cross-reference tables that allow users to conduct a search using MIACARM terms that are translated into the corresponding terms in the individual data resources to implement the search. For example, the term “Stem cell ID” in MIACARM was translated into the terms “stem cell id” (hPSCreg), “stem cell id” (SKIP), “CellID” (RIKEN BRC), and “cell line label” (eagle-i) for the search implementation. Thus, ICSCB submits search requests to each data resource with its own (translated) terms and integrates all retrieved results by common MIACARM terms, thereby achieving a standardized data format at the level of display (Figure 4).
Table 2
Cross-Reference Table for Integration of Four Databases according to MIACARM Module (as of December 6, 2020)
| MIACARM Module | ICSCB Term | hPSCreg | SKIP | RIKEN BRC | eagle-i |
|---|---|---|---|---|---|
| Stem cell general identification | Stem cell ID | stem cell id | stem cell id | CellID | cell line label |
| | Stem cell name | stem cell name | cell line name | CellName | cell line label |
| | Stem cell type | NA | cell type | cell grouping | cell line type |
| | Cell grade | NA | research grade | NA | NA |
| | Produced by | produced by | establisher name | originator | NA |
| | Provider/distributor | distributor | establisher organization | depositor | cell line provider |
| | Reference publication | publication | pubmed ID | reference | NA |
| Donor identification | Gender of donor | gender | donor sex | gender | sex |
| | Ethnicity of donor | race | donor race | race | ethnicity |
| | Health status | health status | disease name | disease | diagnosed disease |
| Source cell identification | Source cell type | source cell type | NA | NA | NA |
| | Organ/tissue of origin of source cell | origin of source cell | organ/tissue of origin of source cell | NA | NA |
| Ethical operation | Informed consent from donor | NA | NA | description | NA |
| | MTA | NA | NA | description | NA |
| Data | Data ID | NA | NA | NA | NA |
| | Data type | NA | NA | NA | NA |
MTA, material transfer agreement; NA, not available.
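The cross-reference mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ICSCB's actual implementation: the field names are copied from Table 2 (only five of the 16 terms are shown), whereas the function names `translate_query` and `standardize` and the sample record are hypothetical.

```python
# ICSCB (MIACARM) term -> resource-specific field name, copied from Table 2.
# None stands for "NA" (the resource has no corresponding field).
CROSS_REFERENCE = {
    "Stem cell ID": {"hPSCreg": "stem cell id", "SKIP": "stem cell id",
                     "RIKEN BRC": "CellID", "eagle-i": "cell line label"},
    "Stem cell name": {"hPSCreg": "stem cell name", "SKIP": "cell line name",
                       "RIKEN BRC": "CellName", "eagle-i": "cell line label"},
    "Stem cell type": {"hPSCreg": None, "SKIP": "cell type",
                       "RIKEN BRC": "cell grouping", "eagle-i": "cell line type"},
    "Gender of donor": {"hPSCreg": "gender", "SKIP": "donor sex",
                        "RIKEN BRC": "gender", "eagle-i": "sex"},
    "Health status": {"hPSCreg": "health status", "SKIP": "disease name",
                      "RIKEN BRC": "disease", "eagle-i": "diagnosed disease"},
}


def translate_query(icscb_term, value):
    """Translate one ICSCB term into (resource, field, value) requests,
    skipping resources that have no corresponding field."""
    mapping = CROSS_REFERENCE[icscb_term]
    return [(resource, field, value)
            for resource, field in mapping.items() if field is not None]


def standardize(resource, record):
    """Rename the fields of a resource-specific record to ICSCB terms."""
    result = {"Resource": resource}
    for term, mapping in CROSS_REFERENCE.items():
        field = mapping.get(resource)
        # Terms without a field, or with no value in the record, become "NA".
        result[term] = record.get(field, "NA") if field else "NA"
    return result


# Hypothetical usage: one search term fanned out, one raw record standardized.
requests = translate_query("Stem cell type", "iPSC")
row = standardize("RIKEN BRC", {"CellID": "X1", "CellName": "line A", "gender": "female"})
```

In this sketch, a search on “Stem cell type” produces requests for SKIP, RIKEN BRC, and eagle-i only, because Table 2 lists no corresponding hPSCreg field; the standardized row can then be displayed under the common ICSCB terms regardless of its source.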
Figure 4
Workflow of ICSCB Data Integration
The SKIP and eagle-i databases were fully replicated from websites and imported to MySQL, whereas hPSCreg and RIKEN BRC used a web API and SPARQL for data collection, respectively. A cross-reference table (Table 2) was used when ICSCB integrated and standardized cell line data.
ICSCB Workflow and Search Engine Updating
To provide fast and easy access to the latest and accurate cell line information, we built an automatic updating system that adds newly released cell lines to ICSCB as soon as they become available in any of the four data resources. Data from eagle-i and SKIP are directly collected and stored in the MySQL database with the terms required for the MIACARM modules. Data from hPSCreg and RIKEN BRC are collected on the fly per request using a web application programming interface (API) provided by the respective sites. RIKEN BRC also uses SPARQL language for data retrieval requests (Kim et al., 2017; Kobayashi et al., 2016).
To simplify the search process, ICSCB provides an easy-to-use and mobile-friendly web application. The goal of the application is to help users find the desired stem cell lines as quickly as possible. The interface of the search engine is designed with the 16 MIACARM terms (Table 2), except the term “Stem cell ID.” Users receive result pages containing all the matching results listed in a table that includes all the basic attributes under the structure of MIACARM. To ensure a more specific search with a wide variety of attributes, ICSCB is designed to accommodate searches not only by standardized terms from MIACARM but also by terms specific to each of the four data resources, such as “age” or “country” (Figure 5A). When user queries are submitted, ICSCB simultaneously retrieves MIACARM standardized data and resource-specific data so as not to miss any relevant entries. If a keyword entered by a user in a general keyword search does not exist in MIACARM terms, but is included in data specific to any of the four data resources, the user will get detailed descriptions of the matching data in the results page. For example, even if the standardized MIACARM terms do not contain “transgene,” it is still possible to enter a gene name into the keyword field (e.g., SOX2) such that the results page will display relevant entries by showing the indicated keyword in the extra field below (Figure 5B). Furthermore, the user can filter the results by data resource and detailed keywords from the “Searching options” box on the results page to narrow down the results list. In addition, all results can be easily downloaded as a table directly from the results page.
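The extended keyword search described above can be illustrated with a short Python sketch. It is a simplified, hypothetical model rather than the ICSCB search engine itself: each record is assumed to hold its standardized MIACARM fields under `standard` and its resource-specific fields under `extra`, and the sample records, the field placement, and the function name `search` are made up for illustration.

```python
def search(records, keyword, resource=None):
    """Case-insensitive keyword search over standardized and
    resource-specific fields, optionally restricted to one resource."""
    needle = keyword.lower()
    hits = []
    for rec in records:
        if resource is not None and rec["resource"] != resource:
            continue
        # Does the keyword occur in any standardized (MIACARM) field?
        std_match = any(needle in str(v).lower() for v in rec["standard"].values())
        # Collect resource-specific fields that contain the keyword.
        extra_matches = {k: v for k, v in rec["extra"].items()
                         if needle in str(v).lower()}
        if std_match or extra_matches:
            hits.append({"resource": rec["resource"],
                         "standard": rec["standard"],
                         "extra_matches": extra_matches})
    return hits


# Hypothetical records for illustration only.
sample_records = [
    {"resource": "SKIP",
     "standard": {"Stem cell name": "line-A", "Health status": "healthy"},
     "extra": {"transgene": "OCT3/4, SOX2, KLF4", "age": "36"}},
    {"resource": "hPSCreg",
     "standard": {"Stem cell name": "line-B", "Health status": "Parkinson disease"},
     "extra": {"country": "Germany"}},
]

sox2_hits = search(sample_records, "SOX2")
disease_hits = search(sample_records, "parkinson")
filtered_hits = search(sample_records, "line", resource="hPSCreg")
```

In this model, the SOX2 query matches only through the resource-specific `transgene` field, which is returned separately in `extra_matches`, mirroring the extra row shown below the standardized fields on the results page; the `resource` argument plays the role of the data resource filter.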
Figure 5
Keyword Search Is Automatically Extended to All Terms Provided by the Four Data Resources, Even if a Keyword Is Not Included in Standardized MIACARM Terms
(A) Terms specific to each of the four data resources.
(B) Even if the standardized MIACARM terms do not contain, for example, “transgene,” it is still possible to enter a gene name into the keyword field (e.g., SOX2), which will lead users to results from the four data resources with relevant information. The results of the match will be shown in another row below the standardized fields.
ICSCB also provides a quality control panel based on MIACARM, thereby supporting customized searches according to quality control results. At present, assays for teratoma formation, differentiation ability in vitro, morphology data, marker gene expression/surface antigen expression data, karyotyping assay results, copy number variation, residual exogene detection results, genome profiling, transcriptome profiling, and epigenome profiling data are accessible from ICSCB.
Data and Code Availability
The original data and R source code for creating figures and supplementary figures and tables are available at https://github.com/YingChen-bio/ICSCB.
Author contributions
Y.C. and Y.P. drafted the manuscript. S.M., T.M., H.O., J.D., S.S., A.K., H.M., Y.N., M.S., J.S., and W.F. provided and facilitated the stem cell data. W.F. and K.S. conceptualized the research. A.K., G.S., and W.F. led the project.
Conflicts of interest
H.O. is a founding scientist of SanBio Co., Ltd., and K Pharma, Inc.
Authors: Christine A Wells; Rowland Mosbergen; Othmar Korn; Jarny Choi; Nick Seidenman; Nicholas A Matigian; Alejandra M Vitale; Jill Shepherd Journal: Stem Cell Res Date: 2012-12-20 Impact factor: 2.020
Authors: P W Andrews; D Baker; N Benvinisty; B Miranda; K Bruce; O Brüstle; M Choi; Y-M Choi; J M Crook; P A de Sousa; P Dvorak; C Freund; M Firpo; M K Furue; P Gokhale; H-Y Ha; E Han; S Haupt; L Healy; D J Hei; O Hovatta; C Hunt; S-M Hwang; M S Inamdar; R M Isasi; M Jaconi; V Jekerle; P Kamthorn; M C Kibbey; I Knezevic; B B Knowles; S-K Koo; Y Laabi; L Leopoldo; P Liu; G P Lomax; J F Loring; T E Ludwig; K Montgomery; C Mummery; A Nagy; Y Nakamura; N Nakatsuji; S Oh; S-K Oh; T Otonkoski; M Pera; M Peschanski; P Pranke; K M Rajala; M Rao; R Ruttachuk; B Reubinoff; L Ricco; H Rooke; D Sipp; G N Stacey; H Suemori; T A Takahashi; K Takada; S Talib; S Tannenbaum; B-Z Yuan; F Zeng; Q Zhou Journal: Regen Med Date: 2015 Impact factor: 3.806