Ignacio Arganda-Carreras, Srinivas C Turaga, Daniel R Berger, Dan Cireşan, Alessandro Giusti, Luca M Gambardella, Jürgen Schmidhuber, Dmitry Laptev, Sarvesh Dwivedi, Joachim M Buhmann, Ting Liu, Mojtaba Seyedhosseini, Tolga Tasdizen, Lee Kamentsky, Radim Burget, Vaclav Uher, Xiao Tan, Changming Sun, Tuan D Pham, Erhan Bas, Mustafa G Uzunbas, Albert Cardona, Johannes Schindelin, H Sebastian Seung.
Abstract
To stimulate progress in automating the reconstruction of neural circuits, we organized the first international challenge on 2D segmentation of electron microscopic (EM) images of the brain. Participants submitted boundary maps predicted for a test set of images, and were scored based on their agreement with a consensus of human expert annotations. The winning team had no prior experience with EM images, and employed a convolutional network. This "deep learning" approach has since become accepted as a standard for segmentation of EM images. The challenge has continued to accept submissions, and the best so far has resulted from cooperation between two teams. The challenge has probably saturated, as algorithms cannot progress beyond limits set by ambiguities inherent in 2D scoring and the size of the test dataset. Retrospective evaluation of the challenge scoring system reveals that it was not sufficiently robust to variations in the widths of neurite borders. We propose a solution to this problem, which should be useful for a future 3D segmentation challenge.
Keywords: connectomics; electron microscopy; image segmentation; machine learning; reconstruction
Year: 2015 PMID: 26594156 PMCID: PMC4633678 DOI: 10.3389/fnana.2015.00142
Source DB: PubMed Journal: Front Neuroanat ISSN: 1662-5129 Impact factor: 3.856
Figure 1. Challenge datasets. (A) EM image of the ventral nerve cord of a larval Drosophila. (B) Boundary map annotated by human experts. (C) Segmentation into neurite cross-sections. (D) The annotated dataset was split into training and test sets and distributed publicly. Ground truth labels for the test set were withheld and used to evaluate the predictive performance of candidate algorithms.
Figure 2. Top entries from the competition (Section 3.2) and cooperation (Section 3.3) phases of the challenge. (A) Electron micrograph with overlaid segmentation and corresponding boundary map. (B) Boundary maps of the top 3 submissions at the time of ISBI'12. (C) Boundary maps of the top 3 submissions in the cooperation phase. Segmentation errors are marked by arrows colored by type of mistake: split (green), merge (red), omission (magenta), and addition (blue). Scale bar = 100 nm.
Best Rand and information theoretic scores of all teams and the human experts using the undisclosed test set at ISBI.
| Group | Rand score (VRand) | Information theoretic score (VInfo) |
|---|---|---|
| Human 1 vs. consensus | 0.997 ± 0.001 | 0.997 ± 0.001 |
| Human 2 vs. consensus | 0.971 ± 0.003 | 0.941 ± 0.002 |
| IDSIA | 0.944 ± 0.011 | 0.968 ± 0.002 |
| BlackEagles | 0.929 ± 0.008 | 0.916 ± 0.003 |
| MLL-ETH | 0.927 ± 0.008 | 0.923 ± 0.004 |
| SCI | 0.915 ± 0.016 | 0.967 ± 0.003 |
| CellProfiler | 0.904 ± 0.015 | 0.937 ± 0.006 |
| Harvard | 0.892 ± 0.017 | 0.947 ± 0.004 |
| CoMPLEX | 0.877 ± 0.019 | 0.903 ± 0.008 |
| UCL | 0.860 ± 0.020 | 0.939 ± 0.005 |
| TSC+PP | 0.843 ± 0.012 | 0.838 ± 0.006 |
| IMMI | 0.826 ± 0.022 | 0.862 ± 0.008 |
| CLP | 0.809 ± 0.018 | 0.846 ± 0.005 |
| Freiburg | 0.800 ± 0.026 | 0.825 ± 0.005 |
| NIST | 0.730 ± 0.021 | 0.757 ± 0.007 |
Mean and standard error are computed over 20 test images not used for the public leaderboard.
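The "mean ± standard error" entries can be illustrated with a minimal sketch. It assumes the standard error is the sample standard deviation (with Bessel's correction) divided by the square root of the number of images; the source does not state which variance estimator was used. The function name and the example scores are hypothetical and are not challenge data.

```python
import math


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for a list of per-image scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    # Assumption: sample variance with Bessel's correction (n - 1 denominator).
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


# Hypothetical per-image scores, for illustration only (not challenge data).
example_scores = [0.95, 0.93, 0.96, 0.94]
mean, sem = mean_and_standard_error(example_scores)
print(f"{mean:.3f} ± {sem:.3f}")
```

In the tables, the same computation would run over the 20 test images held out from the public leaderboard.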
Figure 3. Merge vs. split scores for submissions prior to the competition deadline. The upper right-hand corner corresponds to perfect performance. (A) Rand scores of Equations (1, 2), (B) information theoretic scores of Equations (4, 5).
Figure 4. Evolution of Rand score over time. No overfitting. (A) Competition phase prior to ISBI'12 workshop. (B) Cooperation phase. Individual submissions are colored by team. The dotted blue line shows the best Rand score achieved by that date. (C,D) Score differences between private and public test datasets.
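Equations (1, 2) and (4, 5) are not reproduced in this excerpt. The sketch below shows one common contingency-table formulation of split and merge scores and should be read as an assumption, not as the challenge's scoring code. Here p_ij is the fraction of pixels lying in proposed segment i and ground-truth segment j, with marginals s_i and t_j. The Rand split and merge scores are taken as Σp_ij² / Σt_j² and Σp_ij² / Σs_i², and the information theoretic ones as I(S;T)/H(S) and I(S;T)/H(T). The function name is hypothetical, and the handling of border pixels is ignored.

```python
import math
from collections import Counter


def split_merge_scores(proposed, truth):
    """Split/merge scores for two flat, equal-length lists of segment labels."""
    if len(proposed) != len(truth) or not proposed:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(proposed)
    joint = Counter(zip(proposed, truth))  # pixel counts for each (i, j) pair
    s = Counter(proposed)                  # pixel counts per proposed segment
    t = Counter(truth)                     # pixel counts per ground-truth segment

    sum_p2 = sum((c / n) ** 2 for c in joint.values())
    sum_s2 = sum((c / n) ** 2 for c in s.values())
    sum_t2 = sum((c / n) ** 2 for c in t.values())

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_s, h_t, h_joint = entropy(s), entropy(t), entropy(joint)
    mutual_info = h_s + h_t - h_joint

    return {
        "rand_split": sum_p2 / sum_t2,
        "rand_merge": sum_p2 / sum_s2,
        # Convention chosen here: a zero-entropy labeling scores 1.0,
        # because a single segment cannot contain that kind of error.
        "info_split": mutual_info / h_s if h_s > 0 else 1.0,
        "info_merge": mutual_info / h_t if h_t > 0 else 1.0,
    }


# Toy example: the proposal merges two true segments into one.
truth = [1, 1, 2, 2]
merged = [1, 1, 1, 1]
print(split_merge_scores(merged, truth))
```

Under this formulation a pure merge error lowers only the merge scores and a pure split error lowers only the split scores, which is why a perfect submission sits in the upper right-hand corner of a merge vs. split plot.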
Best Rand and information theoretic scores (before and after border thinning) of all teams and the human experts using the undisclosed test set as of November 4, 2013.
| Group | VRand | VInfo | VRand (thinned) | VInfo (thinned) |
|---|---|---|---|---|
| Human 1 vs. consensus | 0.997 ± 0.001 | 0.997 ± 0.001 | 0.998 ± 0.001 | 0.999 ± 0.001 |
| Human 2 vs. consensus | 0.971 ± 0.003 | 0.941 ± 0.002 | 0.990 ± 0.002 | 0.989 ± 0.001 |
| IDSIA-SCI | 0.979 ± 0.005 | 0.988 ± 0.002 | 0.979 ± 0.005 | 0.988 ± 0.002 |
| IDSIA-optree | 0.969 ± 0.006 | 0.977 ± 0.003 | 0.972 ± 0.006 | 0.984 ± 0.002 |
| SCI | 0.966 ± 0.006 | 0.984 ± 0.002 | 0.968 ± 0.006 | 0.984 ± 0.002 |
| IDSIA | 0.944 ± 0.011 | 0.969 ± 0.002 | 0.978 ± 0.004 | 0.988 ± 0.001 |
| BlackEagles | 0.930 ± 0.009 | 0.941 ± 0.003 | 0.973 ± 0.005 | 0.983 ± 0.002 |
| MLL-ETH | 0.927 ± 0.008 | 0.926 ± 0.003 | 0.968 ± 0.006 | 0.981 ± 0.002 |
| SDU | 0.909 ± 0.011 | 0.926 ± 0.004 | 0.942 ± 0.008 | 0.974 ± 0.003 |
| CellProfiler | 0.904 ± 0.015 | 0.937 ± 0.006 | 0.915 ± 0.015 | 0.958 ± 0.005 |
| Coxlab | 0.901 ± 0.012 | 0.936 ± 0.006 | 0.939 ± 0.012 | 0.976 ± 0.003 |
| Harvard | 0.892 ± 0.017 | 0.944 ± 0.006 | 0.907 ± 0.016 | 0.957 ± 0.003 |
| CoMPLEX | 0.877 ± 0.019 | 0.903 ± 0.008 | 0.890 ± 0.018 | 0.947 ± 0.005 |
| MLA | 0.875 ± 0.016 | 0.885 ± 0.004 | 0.916 ± 0.016 | 0.964 ± 0.004 |
| ML | 0.867 ± 0.016 | 0.879 ± 0.006 | 0.911 ± 0.016 | 0.958 ± 0.003 |
| UCL | 0.860 ± 0.020 | 0.939 ± 0.005 | 0.863 ± 0.020 | 0.948 ± 0.005 |
| TSC+PP | 0.843 ± 0.012 | 0.839 ± 0.006 | 0.922 ± 0.013 | 0.961 ± 0.005 |
| CLP | 0.839 ± 0.024 | 0.885 ± 0.008 | 0.869 ± 0.024 | 0.940 ± 0.006 |
| IMMI | 0.826 ± 0.022 | 0.862 ± 0.008 | 0.876 ± 0.020 | 0.948 ± 0.005 |
| ICOS | 0.809 ± 0.018 | 0.838 ± 0.011 | 0.883 ± 0.015 | 0.936 ± 0.004 |
| Freiburg | 0.800 ± 0.026 | 0.839 ± 0.007 | 0.835 ± 0.027 | 0.928 ± 0.006 |
| NIST | 0.730 ± 0.021 | 0.757 ± 0.007 | 0.796 ± 0.020 | 0.851 ± 0.006 |
| Computer Vision Jena | 0.709 ± 0.024 | 0.768 ± 0.012 | 0.832 ± 0.022 | 0.904 ± 0.007 |
| Bar-Ilan | 0.701 ± 0.034 | 0.792 ± 0.011 | 0.773 ± 0.032 | 0.872 ± 0.012 |
For each team, the submission with the highest score is chosen for each column. The values were computed as the mean and standard error over the n = 20 test images that were not used in the public leaderboard.
Figure 5. Metric robustness to thinning. (A) Rand (VRand) and information theoretic (VInfo) scoring measures produce similar rankings (Spearman correlation ρ = 0.80). (B) This correlation is greatly increased by post-processing boundaries by thinning (Spearman correlation ρ = 0.94). (C,D) Thinning of boundaries almost universally improves Rand and information theoretic scoring measures. VRand rankings are more robust to thinning (C, Spearman correlation ρ = 0.89) than VInfo rankings (D, Spearman correlation ρ = 0.59).
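The rank correlations quoted above can be illustrated with a short sketch of Spearman's ρ, computed as the Pearson correlation of rank vectors with ties given their average rank. The function names are hypothetical. The example uses the unthinned Rand and information theoretic means of only five teams from the second table, so it is not expected to reproduce the figure's values, which presumably cover more submissions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least two")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Mean scores of IDSIA-SCI, IDSIA-optree, SCI, IDSIA and BlackEagles
# (before thinning), copied from the table; a subset for illustration only.
v_rand = [0.979, 0.969, 0.966, 0.944, 0.930]
v_info = [0.988, 0.977, 0.984, 0.969, 0.941]
print(spearman_rho(v_rand, v_info))
```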
Figure 6. Minor ambiguities in 3D can become significant in 2D. Three rows correspond to three successive slices in the image stack and each path shows a possible segmentation of a neuron based on a different interpretation. In the top panel (A) the neurite borders are clear and therefore its interpretation is unambiguous. However, in the middle row, a membrane is parallel to the sectioning plane (darkened area in red box), leading to ambiguity. It is unclear whether the darkened area should be a boundary between neurons (B,D), or assigned to a neuron (C). This ambiguity has no topological consequences in 3D, unlike in 2D, where the neuron can be assigned to just one segment (C), or two (D). Finally, in the 3D interpretation the two cross sections in (E) have the same color because they are connected with each other through previous slices, while in the 2D interpretation, the two cross sections in (F) have different colors because they are not connected to each other in this slice.
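The difference between the 2D and 3D interpretations in (E) and (F) can be illustrated with a toy connected-component count. This is a minimal sketch on a made-up binary stack, not the challenge data or its evaluation code. It assumes 4-connectivity within a slice and direct neighbors across slices; `count_components` and `toy_stack` are hypothetical names.

```python
from collections import deque


def count_components(volume, connect_across_slices):
    """Count connected foreground (value 1) components in a [z][y][x] stack."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    # 4-connectivity within a slice; optionally add the two z-neighbors.
    offsets = [(0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    if connect_across_slices:
        offsets += [(1, 0, 0), (-1, 0, 0)]
    seen = set()
    components = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if volume[z][y][x] != 1 or (z, y, x) in seen:
                    continue
                # New component found: flood-fill it with breadth-first search.
                components += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in offsets:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < depth and 0 <= ny < height
                                and 0 <= nx < width
                                and volume[nz][ny][nx] == 1
                                and (nz, ny, nx) not in seen):
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
    return components


# Toy stack (hypothetical): one neurite whose cross-section splits in slice 1.
toy_stack = [
    [[1, 1, 1, 1, 1]],  # slice 0: a single cross-section
    [[1, 1, 0, 1, 1]],  # slice 1: two cross-sections separated by a border pixel
]
print(count_components(toy_stack, connect_across_slices=True))
print(count_components([toy_stack[1]], connect_across_slices=False))
```

Viewed in 3D, the two cross-sections in slice 1 belong to one object because they join through slice 0. Viewed as an isolated 2D slice, they are two separate segments, which is the kind of discrepancy a 2D score cannot resolve.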