Semi-automatic Annotation on Image Segmentation Hierarchies

Zankl, Georg

by Georg Zankl

Abstract:

In the field of object recognition in natural images, a variety of established tasks exist that are the focus of attention when comparing different methods, for example image segmentation, semantic image segmentation, and object detection. Image segmentation is the task of grouping pixels in an image that belong to the same region or object. Semantic image segmentation is the task of assigning a semantic label to each pixel of the image. The semantic labels can be objects (for example car, person, building) or classes of areas in an image (sky, floor, vertical surface). Object detection is the task of predicting the occurrence and position of an object in an image, for example by determining its bounding box. Traditional object recognition challenges have limitations, such as ambiguity in more general contexts. For example, for a single natural image there are often multiple image segmentations a human would consider correct, depending on the object that person is particularly interested in. We raise the question: 'Is there a different task that overcomes these limitations?' As an example, we propose the task of interactively assigning a semantic label to each segment of a segmentation hierarchy. The result can be represented as a stack of semantic segmentations, with an inclusion relationship between segments of adjacent segmentations. The focus of this work is to provide a solution to this task and to discuss the advantages and problems that arise. The main disadvantage is that it is harder to obtain suitable ground truth that consists of annotated segmentation hierarchies. Also, the quality of the underlying segmentation methods is, in general, sub-optimal for natural images. The main advantage is that the structure implied by the occurrence of labels in the ground truth can be used to aid the user in labeling the segments of the hierarchy.
We propose a framework that consists of a feedback loop, in which the framework provides a label prediction and a human user may select one or more misclassified segments and assign the correct label. This process can be repeated until the user is satisfied. The prediction is done using a Conditional Random Field (CRF) that is modified so that the model can be conditioned on the segmentation hierarchy as well as on the user input. The framework is evaluated on two distinct datasets by comparing its quality to a straightforward baseline. The baseline consists of a single prediction step of the proposed framework, followed by fully manual correction of the segments without new predictions. The results show a significant difference in quality after several user interactions. For example, after 20 user interactions the baseline adjusts 20 misclassified segments, while the CRF-based framework adjusts about 130 misclassified segments for the two datasets. This experiment illustrates the potential of structured prediction for the proposed task.
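
The feedback loop and the baseline described in the abstract can be illustrated with a small toy simulation. This is only a sketch under strong assumptions and not the method of the report: the hierarchy, the labels, and the names `PARENT`, `UNARY`, `TRUTH`, `predict`, and `annotate` are all hypothetical, and the CRF inference is replaced by a trivial stand-in rule in which a segment inherits the label of its nearest user-corrected ancestor.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy: segment -> parent segment (the root has no parent).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "a": "root", "b": "root",
    "a1": "a", "a2": "a",
    "b1": "b", "b2": "b",
}

# Hypothetical initial guesses and ground-truth labels for each segment.
UNARY = {seg: "scene" for seg in PARENT}
TRUTH = {"root": "scene", "a": "car", "a1": "car", "a2": "car",
         "b": "sky", "b1": "sky", "b2": "sky"}


def predict(unary: Dict[str, str], clamped: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for CRF inference (an assumption, not the report's model).

    A segment takes the label of its nearest user-corrected ancestor
    (including itself); if there is none, it keeps its initial guess.
    """
    labels = {}
    for seg in PARENT:
        node: Optional[str] = seg
        while node is not None and node not in clamped:
            node = PARENT[node]
        labels[seg] = clamped[node] if node is not None else unary[seg]
    return labels


def annotate(unary: Dict[str, str], truth: Dict[str, str], repredict: bool) -> int:
    """Simulate a user correcting labels; return the number of interactions.

    repredict=True  -> feedback loop: predict again after every correction.
    repredict=False -> baseline: one prediction, then purely manual fixes.
    """
    clamped: Dict[str, str] = {}
    labels = predict(unary, clamped)
    interactions = 0
    while True:
        wrong = [s for s in PARENT if labels[s] != truth[s]]
        if not wrong:
            return interactions
        seg = wrong[0]                 # simulated user picks one wrong segment
        clamped[seg] = truth[seg]      # and assigns the correct label
        interactions += 1
        if repredict:
            labels = predict(unary, clamped)
        else:
            labels[seg] = truth[seg]


if __name__ == "__main__":
    print("feedback loop:", annotate(UNARY, TRUTH, repredict=True))   # 2
    print("baseline:     ", annotate(UNARY, TRUTH, repredict=False))  # 6
```

In this toy setting the baseline needs one interaction per misclassified segment, while the loop with re-prediction fixes several segments per interaction, which mirrors the qualitative effect reported in the abstract. The numbers themselves carry no meaning beyond this example.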

Reference:

Semi-automatic Annotation on Image Segmentation Hierarchies (Georg Zankl), Technical report, PRIP, TU Wien, 2012.

Bibtex Entry:

@TechReport{TR127,
author = "Georg Zankl",
title = "Semi-automatic Annotation on Image Segmentation Hierarchies",
institution = "PRIP, TU Wien",
number = "PRIP-TR-127",
year = "2012",
url = "https://www.prip.tuwien.ac.at/pripfiles/trs/tr127.pdf",
abstract = "In the field of object recognition in natural images, a variety of established
tasks exist, which are the focus of attention when it comes to comparing different
methods, for example image segmentation, semantic image segmentation or
object detection. Image segmentation is the task of grouping pixels in an
image that belong to the same region or object. Semantic image segmentation
is the task of assigning a semantic label to each pixel of the image. The
semantic labels can be objects: for example car, person, building; or classes
of areas in an image: sky, floor, vertical surface. Object detection is the task
of predicting occurrence and position in an image, for example by determining
a bounding box of the object. Traditional object recognition challenges
have limitations such as ambiguity in more general contexts. For example
for a single natural image, there are often multiple image segmentations a
human would consider to be correct, depending on the object that person is
particularly interested in. We raise the question: 'Is there a different task,
that overcomes these limitations?' As an example we propose the task of
interactively assigning a semantic label to each segment of a segmentation
hierarchy. The result can be represented as a stack of semantic segmentations,
with an inclusion-relationship between segments of adjacent segmentations.
The focus of this work is to provide a solution to this task and discuss advantages
and problems that arise. The main disadvantage is that it is harder to
obtain suitable ground-truth that consists of annotated segmentation hierarchies.
Also the quality of underlying segmentation methods is, in general,
sub-optimal for natural images. The main advantage is that the structure
implied by the occurrence of labels in the ground-truth can be used to aid
the user in labeling the segments of the hierarchy. We propose a framework
that consists of a feedback loop, where a label prediction is provided by the
framework and a human user may select one or more misclassified segments
and assign the correct label. This process can be repeated until the user is
satisfied. The prediction is done using a Conditional Random Field (CRF)
that is modified so that we are able to condition the model on the segmentation
hierarchy as well as the user input. The framework is evaluated on two
distinct datasets by comparing its quality to a straight-forward baseline. The
baseline consists of a single prediction step of the proposed framework followed
by fully manual correction of the segments without new predictions. The
results show a significant difference in quality, after several user interactions.
For example after 20 user interactions the baseline adjusts 20 misclassified segments,
while the CRF-based framework adjusts about 130 misclassified
segments for the two datasets. This experiment illustrates the potential of
structured prediction for the proposed task.",
}