Download LabelMe3D MATLAB Toolbox

The LabelMe3D database contains labeled images and their absolute real-world 3D coordinates. The database spans many different everyday scene and object categories.


If you use the database or any source code, we would appreciate it if you cite:

B. C. Russell and A. Torralba. Building a Database of 3D Scenes from User Annotations. In CVPR, 2009. (PDF)

Documentation for scene geometry, camera model, and structure of the XML annotation files (PDF)

Download and interact with the database

We provide a MATLAB toolbox for downloading the database and interacting with it. There are two ways to get the toolbox:

1. GitHub repository

We maintain the latest version of the toolbox on GitHub. To get it, make sure that "git" is installed on your machine and then run "git clone" with the repository address on the command line. To update your copy later, run "git pull" from inside the project directory.

If you have an idea for a new feature and want to implement it, let us know! With GitHub, you can fork the code and send us a pull request. If we like your feature and implementation, we will incorporate it into the main code.

2. Zip file

The zip file is a snapshot of the latest source code on GitHub.

A quick look at the toolbox

The toolbox allows you to grab an annotation from LabelMe directly and compute its 3D information:

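% HOMEANNOTATIONS and HOMEIMAGES must already be set to the annotation and image locations.
DB = LMdatabase(HOMEANNOTATIONS,HOMEIMAGES,{'05june05_static_street_boston'},{'p1010736.jpg'});
img = LMimread(DB,1,HOMEIMAGES); % read the image for the first (only) database entry
annotation3D = Recover3DSceneComponents(DB.annotation); % recover the 3D scene information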

This example reads the LabelMe annotation and the corresponding image, then computes the 3D information for the scene.

Database and MATLAB toolbox documentation

The MATLAB toolbox contains functions for downloading, interacting with, and displaying the LabelMe3D database. The main functions of the toolbox are outlined in demo.m.

The scene geometry, camera model, and structure of the XML annotation files are documented here (PDF).

How to get good 3D models?

When you label objects and their locations in an image, the tool uses the labels to build a 3D model of the scene. You do not need any knowledge of geometry, because all of the 3D information is inferred automatically from the annotations. For instance, the tool will know that a 'road' is a horizontal surface and that a 'car' is supported by the road. The tool learns to go from 2D to 3D using all the other labels already present in the database. The more images that are labeled, the better the models the tool will learn.

To get good 3D and pop-up models of your pictures, label as accurately as you can. For each object that you label, the tool will ask you to enter its name. The system uses this name to decide which 3D model to use.

Start by labeling the ground: Ground objects (such as the "road", "sidewalk", "floor", "sea", etc.) are used to define the basic structure of the scene. If you use the correct names, the system will recognize them and automatically place them in the correct location in the 3D scene.

Complete objects behind the occlusions: When labeling objects, try to complete the objects behind the occlusions. This is important so that the tool can reconstruct the 3D contact points. In the example on the right, the sidewalk is delineated as if the people were not there.

Follow the outline of each object: The more accurate the boundaries and the object names are, the better the 3D model will look. In addition, these annotations will be used to build a large database of annotated images to train computer vision algorithms to recognize everyday objects.
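To see why ground contact points matter for the 3D reconstruction, here is a minimal sketch under a simplified pinhole camera model: the camera looks horizontally, the ground is a flat plane, and the horizon line is known. This is not the toolbox's implementation (the actual camera model is described in the documentation PDF). All variable names and numeric values below are assumptions chosen for illustration.

```matlab
% Assumed camera parameters (hypothetical values, not read from any annotation).
focalLength  = 800;   % focal length in pixels
cameraHeight = 1.7;   % camera height above the ground plane, in meters
horizonRow   = 240;   % image row of the horizon line
centerCol    = 320;   % image column of the principal point

% Labeled object: ground contact point and top of the polygon, in pixels.
% Image rows increase downward, so the contact point has the larger row.
contactRow = 400;
contactCol = 480;
topRow     = 230;

% A ground point must project below the horizon for this model to apply.
assert(contactRow > horizonRow, 'Contact point must lie below the horizon.');

% Distance of the contact point along the optical axis, in meters.
depth = focalLength * cameraHeight / (contactRow - horizonRow);

% Sideways offset of the contact point from the optical axis, in meters.
lateral = (contactCol - centerCol) * depth / focalLength;

% Height of a vertical object standing on that contact point, in meters.
objectHeight = cameraHeight * (contactRow - topRow) / (contactRow - horizonRow);

fprintf('depth = %.2f m, lateral = %.2f m, height = %.2f m\n', ...
        depth, lateral, objectHeight);
```

With these made-up numbers the contact point lies about 8.5 m in front of the camera and 1.7 m to the side. If the contact point were hidden behind an occluder and the polygon stopped higher in the image, the same formula would place the object farther away, which is why completing objects behind occlusions helps.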

The 3D models can be downloaded and viewed outside of this tool using any VRML viewer.

How does it work?

The tool learns two kinds of scene representations from all the annotated images: a qualitative model of the relationships ("part-of", "supports") holding between scene components ("sidewalk", "person", "car"), and a rough 3D shape model for some of these components, obtained from multiple segmented images of their instances. These models are combined with geometric cues (depth ordering, horizon line) extracted from the photograph being analyzed to construct the final scene description.

To illustrate the above, consider what happens when we (as humans) see a person against a wall. We know that the person is not physically attached to the wall because people are not parts of walls. We have learned this from many images in which people and walls co-occur. We also know that windows are parts of walls. Therefore, a window overlapping a wall is not resting against the wall; it is attached to it. Again, we know this because we exploit information from many images about how walls and windows relate to each other. These relationships influence (and are influenced by) our interpretation of geometric image cues.

In addition, statistical evidence may guide the interpretation of edge fragments as occlusion boundaries, contacts between objects, or attachment points. For instance, a chimney is part of a house. However, only the lower part of its boundary is attached to the house; the rest is an occlusion boundary. A window, on the other hand, is part of a house with all of its edges attached to the building and no occlusion boundaries. As a final example, a person standing on a road is in contact with it. However, a person is not part of the road, so the points of contact are not points of attachment.
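The difference between "part-of" and "supports" can be illustrated with a toy heuristic on a single image. This is a sketch only, not the toolbox's method: the tool gathers such evidence across many annotated images, whereas the polygons, the containment measure, and the threshold below are all hypothetical. It uses only the core MATLAB function inpolygon.

```matlab
% Toy polygons in image coordinates (x to the right, y downward); values are made up.
wall   = struct('x', [0 100 100 0], 'y', [0 0 60 60]);
window = struct('x', [20 40 40 20], 'y', [10 10 30 30]);
person = struct('x', [60 70 70 60], 'y', [40 40 90 90]);

% Fraction of polygon a's vertices that fall inside (or on the edge of) polygon b.
insideFraction = @(a, b) mean(inpolygon(a.x, a.y, b.x, b.y));

windowInWall = insideFraction(window, wall);  % every window vertex lies inside the wall
personInWall = insideFraction(person, wall);  % only the two upper person vertices do

% Hypothetical threshold: near-complete containment counts as evidence for "part-of".
partOfThreshold = 0.9;
windowIsPart = windowInWall >= partOfThreshold;
personIsPart = personInWall >= partOfThreshold;

% Candidate contact points: the lowest vertices of the polygon (largest y).
contactMask   = person.y == max(person.y);
contactPoints = [person.x(contactMask); person.y(contactMask)];

fprintf('window part of wall: %d, person part of wall: %d\n', windowIsPart, personIsPart);
disp(contactPoints);
```

Here the window is fully contained in the wall and is flagged as a part, while the person only overlaps the wall and is not. The person's lowest vertices serve as candidate contact points with whatever surface supports them.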

List of objects supported by the road. This list is automatically inferred from the annotations available in the LabelMe dataset.