Semantic Boundaries Dataset and Benchmark


We created the Semantic Boundaries Dataset (henceforth abbreviated as SBD) and the associated benchmark to evaluate the task of predicting semantic contours, as opposed to semantic segmentations. While semantic segmentation aims to predict the pixels that lie inside the object, we are interested in predicting the pixels that lie on the boundary of the object. This task is arguably harder or, viewed another way, the error metric is arguably more stringent.
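To make the distinction between the two tasks concrete, the following minimal sketch derives boundary pixels from a binary object mask. It is only an illustration and is not taken from the benchmark code: the function name `boundary_pixels`, the list-of-lists mask representation, and the 4-connected definition of a boundary pixel are all assumptions made for this example.

```python
def boundary_pixels(mask):
    """Return the set of (row, col) object pixels that lie on the object boundary.

    Assumption for this sketch: an object pixel counts as a boundary pixel if
    at least one of its 4-connected neighbours is background or falls outside
    the image.
    """
    rows, cols = len(mask), len(mask[0])
    boundary = set()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # background pixels are never boundary pixels
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not mask[nr][nc]:
                    boundary.add((r, c))
                    break
    return boundary


# A 5x5 image containing a 3x3 square object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
inside = {(r, c) for r in range(5) for c in range(5) if mask[r][c]}
edge = boundary_pixels(mask)
print(len(inside), len(edge))  # 9 8
```

In this toy case the segmentation target has 9 pixels while the boundary target has 8, the object pixels minus the single interior one. For larger objects the boundary is a much smaller and thinner set than the interior, which is one way to see why small localization errors are penalized more heavily by a boundary metric.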

The dataset and benchmark can be downloaded as a single tarball here.

The following sections provide an overview of the dataset and benchmark. For details about how to use the benchmarking code, please look at the README inside the download. If you use this dataset and benchmark, please cite:

author = "Bharath Hariharan and Pablo Arbelaez and Lubomir Bourdev and Subhransu Maji and Jitendra Malik",
title = "Semantic Contours from Inverse Detectors",
booktitle = "International Conference on Computer Vision (ICCV)",
year = "2011",


The SBD currently contains annotations from 11355 images taken from the PASCAL VOC 2011 dataset. These images were annotated on Amazon Mechanical Turk, and the conflicts between the segmentations were resolved manually. For each image, we provide both category-level and instance-level segmentations and boundaries. The segmentations and boundaries provided are for the 20 object categories in the VOC 2011 challenge.


We focus on the evaluation of category-specific boundaries. The experimental framework we propose is based heavily on the BSDS benchmark. Machine pixels (the boundary pixels predicted by an algorithm) are matched to pixels on the ground truth boundaries. Pixels that are farther from the ground truth than a distance threshold are not matched. Machine pixels that are matched form the true positives, while the remaining machine pixels are false positives. One can then compute precision-recall curves. The numbers we report in the paper are the AP (average precision) and MF (maximal F-measure).
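The evaluation described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the benchmarking code shipped in the download: the function names are hypothetical, a greedy closest-pair matching stands in for whatever matching procedure the benchmark actually uses, distances are Euclidean in pixel units, and only the maximal F-measure is computed (AP is omitted).

```python
import math


def count_matches(pred, gt, max_dist):
    """Greedily match predicted pixels to ground-truth pixels, one-to-one.

    Assumption for this sketch: pairs are considered closest-first, and pairs
    farther apart than `max_dist` are never matched.
    """
    candidates = []
    for i, (pr, pc) in enumerate(pred):
        for j, (gr, gc) in enumerate(gt):
            d = math.hypot(pr - gr, pc - gc)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()
    used_pred, used_gt = set(), set()
    for _, i, j in candidates:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
    return len(used_pred)


def precision_recall_f(pred, gt, max_dist):
    """Matched machine pixels are true positives; unmatched ones are false positives."""
    tp = count_matches(pred, gt, max_dist)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f


def maximal_f(scored_pred, gt, max_dist):
    """Sweep a threshold over the prediction scores and keep the best F-measure."""
    best = 0.0
    for t in sorted({score for _, score in scored_pred}):
        kept = [pixel for pixel, score in scored_pred if score >= t]
        best = max(best, precision_recall_f(kept, gt, max_dist)[2])
    return best


# Toy example: a horizontal ground-truth boundary and four scored predictions.
gt = [(0, 0), (0, 1), (0, 2), (0, 3)]
scored_pred = [((1, 0), 0.9), ((1, 1), 0.8), ((5, 5), 0.3), ((1, 3), 0.6)]
print(round(maximal_f(scored_pred, gt, max_dist=1.5), 3))  # 0.857
```

In the toy example, keeping predictions with score at least 0.6 gives three matched pixels out of three predicted (precision 1.0) against four ground-truth pixels (recall 0.75), which yields the maximal F-measure of 6/7. Each threshold in the sweep corresponds to one point on the precision-recall curve.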


For the purpose of comparison, we also provide our own best results ("1-stage (allclasses)" from Table 1 in [1]) here. These results are slightly different from the numbers in [1], since the dataset has been cleaned up. Please use these newer results for your comparisons.