
David Crandall

  • djcran@indiana.edu
  • 611 N Park Ave
  • (812) 856-1115
  • Home Website
  • Professor
    Luddy School of Informatics, Computing, and Engineering
  • Director
    Luddy Center for Artificial Intelligence

Field of study

  • Computer vision, data mining, machine learning, artificial intelligence.

Education

  • Ph.D., Cornell University, Ithaca, NY, 2008

Research interests

  • My main research interest is computer vision, the area of computer science that tries to design algorithms that can "see". I am particularly interested in visual object recognition and scene understanding. I am also interested in other problems that involve analyzing and modeling large amounts of uncertain data, like mining data from the web and from online social networking sites.

Representative publications

Mapping the world's photos (2009)
David J Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg
ACM. 761-770

We investigate how to organize a large collection of geotagged photos, working with a dataset of about 35 million images collected from Flickr. Our approach combines content analysis based on text tags and image data with structural analysis based on geospatial data. We use the spatial distribution of where people take photos to define a relational structure between the photos that are taken at popular places. We then study the interplay between this structure and the content, using classification methods for predicting such locations from visual, textual and temporal features of the photos. We find that visual and temporal features improve the ability to estimate the location of a photo, compared to using just textual features. We illustrate the use of these techniques to organize a large photo collection, while also revealing various interesting properties about popular cities and landmarks at a global scale.
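The "popular places" in this abstract are peaks in the spatial distribution of geotags. A minimal sketch of that idea, using a simple grid-based density instead of the mean-shift clustering actually used on the Flickr data (the cell size and count threshold are illustrative, not the paper's parameters):

```python
from collections import Counter

def geotag_peaks(photos, cell=0.1, min_count=3):
    """Find 'popular places' as local maxima of a gridded geotag density,
    a crude stand-in for mean-shift clustering of geotags.
    photos: iterable of (lat, lon); cell size in degrees (illustrative)."""
    grid = Counter((round(lat / cell), round(lon / cell)) for lat, lon in photos)
    peaks = []
    for (i, j), c in grid.items():
        if c < min_count:
            continue
        # a cell is a peak if no 8-neighbor cell has a higher count
        neighbors = [grid.get((i + di, j + dj), 0)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0)]
        if c >= max(neighbors, default=0):
            peaks.append(((i * cell, j * cell), c))
    return peaks
```

On real data each peak would then be associated with its photos' text tags and visual features for the classification experiments the abstract describes.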

Feedback effects between similarity and social influence in online communities (2008)
David Crandall, Dan Cosley, Daniel Huttenlocher, Jon Kleinberg and Siddharth Suri
ACM. 160-168

A fundamental open question in the analysis of social networks is to understand the interplay between similarity and social ties. People are similar to their neighbors in a social network for two distinct reasons: first, they grow to resemble their current friends due to social influence; and second, they tend to form new links to others who are already like them, a process often termed selection by sociologists. While both factors are present in everyday social processes, they are in tension: social influence can push systems toward uniformity of behavior, while selection can lead to fragmentation. As such, it is important to understand the relative effects of these forces, and this has been a challenge due to the difficulty of isolating and quantifying them in real settings. We develop techniques for identifying and modeling the interactions between social influence and selection, using data from online communities where both …

Inferring social ties from geographic coincidences (2010)
David J Crandall, Lars Backstrom, Dan Cosley, Siddharth Suri, Daniel Huttenlocher and Jon Kleinberg
Proceedings of the National Academy of Sciences, 107 (52), 22436-22441

We investigate the extent to which social ties between people can be inferred from co-occurrence in time and space: Given that two people have been in approximately the same geographic locale at approximately the same time, on multiple occasions, how likely are they to know each other? Furthermore, how does this likelihood depend on the spatial and temporal proximity of the co-occurrences? Such issues arise in data originating in both online and offline domains as well as settings that capture interfaces between online and offline behavior. Here we develop a framework for quantifying the answers to such questions, and we apply this framework to publicly available data from a social media site, finding that even a very small number of co-occurrences can result in a high empirical likelihood of a social tie. We then present probabilistic models showing how such large probabilities can arise from a natural …
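The central empirical quantity here, the likelihood of a social tie given k spatio-temporal co-occurrences, can be sketched as a toy computation. The spatial cell size and time window below are illustrative discretization choices, not the parameters used in the paper:

```python
from collections import defaultdict

def cooccurrences(visits_a, visits_b, cell_size=1.0, window=1):
    """Count distinct (spatial cell, time bin) co-occurrences of two users.
    visits_* are lists of (lat, lon, day) tuples; cell_size (degrees) and
    window (days) are hypothetical discretization choices."""
    def keys(visits):
        return {(round(lat / cell_size), round(lon / cell_size), day // window)
                for lat, lon, day in visits}
    return len(keys(visits_a) & keys(visits_b))

def tie_likelihood(pairs, friends):
    """Empirical P(friendship | k co-occurrences) over observed user pairs.
    pairs maps (user_a, user_b) -> co-occurrence count; friends is the set
    of pairs known to be socially tied."""
    counts, hits = defaultdict(int), defaultdict(int)
    for pair, k in pairs.items():
        counts[k] += 1
        hits[k] += pair in friends
    return {k: hits[k] / counts[k] for k in counts}
```

The paper's finding is that this empirical curve rises sharply: even a handful of co-occurrences implies a surprisingly high probability of a tie.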

Spatial priors for part-based recognition using statistical models (2005)
David Crandall, Pedro Felzenszwalb and Daniel Huttenlocher

We present a class of statistical models for part-based object recognition that are explicitly parameterized according to the degree of spatial structure they can represent. These models provide a way of relating different spatial priors that have been used for recognizing generic classes of objects, including joint Gaussian models and tree-structured models. By providing explicit control over the degree of spatial structure, our models make it possible to study the extent to which additional spatial constraints among parts are actually helpful in detection and localization, and to consider the tradeoff in representational power and computational cost. We consider these questions for object classes that have substantial geometric structure, such as airplanes, faces and motorbikes, using datasets employed by other researchers to facilitate evaluation. We find that for these classes of objects, a relatively small amount of spatial structure in the model can provide statistically indistinguishable recognition performance from more powerful models, and at a substantially lower computational cost.
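The simplest member of this family of spatial priors is a star-shaped model in which each part's position is Gaussian-distributed relative to a root part. A minimal sketch of scoring a part configuration under such a prior (the part names, isotropic variance, and offsets are all illustrative):

```python
import math

def star_prior_logprob(parts, root, means, var):
    """Log-probability of part locations under a star-shaped spatial prior:
    each non-root part's offset from the root is an isotropic 2-D Gaussian.
    parts: {name: (x, y)}; means: {name: expected (dx, dy) from root};
    var: shared isotropic variance. All names/parameters are illustrative."""
    rx, ry = parts[root]
    logp = 0.0
    for name, (x, y) in parts.items():
        if name == root:
            continue
        mx, my = means[name]
        dx, dy = x - rx - mx, y - ry - my
        # isotropic 2-D Gaussian log-density: -||d||^2/(2*var) - log(2*pi*var)
        logp += -(dx * dx + dy * dy) / (2 * var) - math.log(2 * math.pi * var)
    return logp
```

Richer models in the paper's family add direct dependencies among the parts themselves, trading representational power against computational cost.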

Landmark classification in large-scale image collections (2009)
Yunpeng Li, David J Crandall and Daniel P Huttenlocher
IEEE. 1957-1964

With the rise of photo-sharing websites such as Facebook and Flickr has come dramatic growth in the number of photographs online. Recent research in object recognition has used such sites as a source of image data, but the test images have been selected and labeled by hand, yielding relatively small validation sets. In this paper we study image classification on a much larger dataset of 30 million images, nearly 2 million of which have been labeled into one of 500 categories. The dataset and categories are formed automatically from geotagged photos from Flickr, by looking for peaks in the spatial geotag distribution corresponding to frequently-photographed landmarks. We learn models for these landmarks with a multiclass support vector machine, using vector-quantized interest point descriptors as features. We also explore the non-visual information available on modern photo-sharing sites, showing …
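The "vector-quantized interest point descriptors" in this abstract are bag-of-visual-words features: each local descriptor is assigned to its nearest entry in a learned visual vocabulary, and the image is represented by the resulting histogram. A minimal, pure-Python sketch of the quantization step (the toy 2-D descriptors stand in for real interest-point descriptors such as SIFT):

```python
def quantize(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word (squared
    Euclidean distance) and return a normalized bag-of-words histogram,
    a simplified stand-in for vector-quantized interest-point features."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        best = min(range(len(vocabulary)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(d, vocabulary[i])))
        hist[best] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

These histograms then serve as fixed-length feature vectors for the multiclass SVM.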

Discrete-continuous optimization for large-scale structure from motion (2011)
David Crandall, Andrew Owens, Noah Snavely and Dan Huttenlocher
IEEE. 3001-3008

Recent work in structure from motion (SfM) has successfully built 3D models from large unstructured collections of images downloaded from the Internet. Most approaches use incremental algorithms that solve progressively larger bundle adjustment problems. These incremental techniques scale poorly as the number of images grows, and can drift or fall into bad local minima. We present an alternative formulation for SfM based on finding a coarse initial solution using a hybrid discrete-continuous optimization, and then improving that solution using bundle adjustment. The initial optimization step uses a discrete Markov random field (MRF) formulation, coupled with a continuous Levenberg-Marquardt refinement. The formulation naturally incorporates various sources of information about both the cameras and the points, including noisy geotags and vanishing point estimates. We test our method on several large …

Discovering Localized Attributes for Fine-grained Recognition (2012)
Kun Duan, Devi Parikh, David Crandall and Kristen Grauman

Attributes are visual concepts that can be detected by machines, understood by humans, and shared across categories. They are particularly useful for fine-grained domains where categories are closely related to one another (e.g. bird species recognition). In such scenarios, relevant attributes are often local (e.g. “white belly”), but the question of how to choose these local attributes remains largely unexplored. In this paper, we propose an interactive approach that discovers local attributes that are both discriminative and semantically meaningful from image datasets annotated only with fine-grained category labels and object bounding boxes. Our approach uses a latent conditional random field model to discover candidate attributes that are detectable and discriminative, and then employs a recommender system that selects attributes likely to be semantically meaningful. Human interaction is used to provide semantic …

Weakly supervised learning of part-based spatial models for visual object recognition (2006)
David J Crandall and Daniel P Huttenlocher
Springer, Berlin, Heidelberg. 16-29

In this paper we investigate a new method of learning part-based models for visual object recognition, from training data that only provides information about class membership (and not object location or configuration). This method learns both a model of local part appearance and a model of the spatial relations between those parts. In contrast, other work using such a weakly supervised learning paradigm has not considered the problem of simultaneously learning appearance and spatial models. Some of these methods use a “bag” model where only part appearance is considered whereas other methods learn spatial models but only given the output of a particular feature detector. Previous techniques for learning both part appearance and spatial relations have instead used a highly supervised learning process that provides substantial information about object part location. We show that our weakly …

Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions (2015)
Sven Bambach, Stefan Lee, David J Crandall and Chen Yu
1949-1957

Hands appear very often in egocentric video, and their appearance and pose give important cues about what people are doing and what they are paying attention to. But existing work in hand detection has made strong assumptions that work well in only simple scenarios, such as with limited interaction with other people or in lab settings. We develop methods to locate and distinguish between hands in egocentric video using strong appearance models with Convolutional Neural Networks, and introduce a simple candidate region generation approach that outperforms existing techniques at a fraction of the computational cost. We show how these high-quality bounding boxes can be used to create accurate pixelwise hand regions, and as an application, we investigate the extent to which hand segmentation alone can distinguish between different activities. We evaluate these techniques on a new dataset of 48 first-person videos (along with pixel-level ground truth for over 15,000 hand instances) of people interacting in realistic environments.

Privacy behaviors of lifeloggers using wearable cameras (2014)
Roberto Hoyle, Robert Templeman, Steven Armes, Denise Anthony, David Crandall and Apu Kapadia
ACM. 571-582

A number of wearable 'lifelogging' camera devices have been released recently, allowing consumers to capture images and other sensor data continuously from a first-person perspective. Unlike traditional cameras that are used deliberately and sporadically, lifelogging devices are always 'on' and automatically capturing images. Such features may challenge users' (and bystanders') expectations about privacy and control of image gathering and dissemination. While lifelogging cameras are growing in popularity, little is known about privacy perceptions of these devices or what kinds of privacy challenges they are likely to create. To explore how people manage privacy in the context of lifelogging cameras, as well as which kinds of first-person images people consider 'sensitive,' we conducted an in situ user study (N = 36) in which participants wore a lifelogging device for a week, answered questionnaires about the …

SfM with MRFs: Discrete-continuous optimization for large-scale structure from motion (2012)
David J Crandall, Andrew Owens, Noah Snavely and Daniel P Huttenlocher
IEEE transactions on pattern analysis and machine intelligence, 35 (12), 2841-2853

Recent work in structure from motion (SfM) has built 3D models from large collections of images downloaded from the Internet. Many approaches to this problem use incremental algorithms that solve progressively larger bundle adjustment problems. These incremental techniques scale poorly as the image collection grows, and can suffer from drift or local minima. We present an alternative framework for SfM based on finding a coarse initial solution using hybrid discrete-continuous optimization and then improving that solution using bundle adjustment. The initial optimization step uses a discrete Markov random field (MRF) formulation, coupled with a continuous Levenberg-Marquardt refinement. The formulation naturally incorporates various sources of information about both the cameras and points, including noisy geotags and vanishing point (VP) estimates. We test our method on several large-scale photo …
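The two-stage strategy described above, a coarse discrete solution followed by continuous refinement, can be illustrated on a one-dimensional toy objective. Plain gradient descent with a numerical gradient stands in for the far more sophisticated MRF inference and Levenberg-Marquardt steps of the actual method:

```python
def discrete_continuous_minimize(f, grid, steps=500, lr=0.01):
    """Toy analogue of discrete-continuous optimization: pick the best
    point from a coarse discrete set, then refine it locally. Here the
    'discrete stage' is exhaustive search over the grid and the
    'continuous stage' is gradient descent with a central-difference
    gradient; both are illustrative simplifications."""
    x = min(grid, key=f)            # discrete stage: coarse solution
    h = 1e-6
    for _ in range(steps):          # continuous stage: local refinement
        g = (f(x + h) - f(x - h)) / (2 * h)
        x -= lr * g
    return x
```

The point of the decomposition is the same as in the paper: the discrete stage avoids bad local minima by searching globally at coarse resolution, and the continuous stage recovers the precision the coarse search gave up.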

PlaceAvoider: Steering First-Person Cameras away from Sensitive Spaces (2014)
Robert Templeman, Mohammed Korayem, David J Crandall and Apu Kapadia
23-26

Cameras are now commonplace in our social and computing landscapes and embedded into consumer devices like smartphones and tablets. A new generation of wearable devices (such as Google Glass) will soon make ‘first-person’ cameras nearly ubiquitous, capturing vast amounts of imagery without deliberate human action. ‘Lifelogging’ devices and applications will record and share images from people’s daily lives with their social networks. These devices that automatically capture images in the background raise serious privacy concerns, since they are likely to capture deeply private information. Users of these devices need ways to identify and prevent the sharing of sensitive images. As a first step, we introduce PlaceAvoider, a technique for owners of first-person cameras to ‘blacklist’ sensitive spaces (like bathrooms and bedrooms). PlaceAvoider recognizes images captured in these spaces and flags them for review before the images are made available to applications. PlaceAvoider performs novel image analysis using both fine-grained image features (like specific objects) and coarse-grained, scene-level features (like colors and textures) to classify where a photo was taken. PlaceAvoider combines these features in a probabilistic framework that jointly labels streams of images in order to improve accuracy. We test the technique on five realistic first-person image datasets and show it is robust to blurriness, motion, and occlusion.
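The probabilistic combination of fine-grained and coarse scene-level features can be sketched as naive-Bayes-style fusion of per-class log-likelihoods. This is an illustrative simplification: PlaceAvoider's actual model additionally labels whole image streams jointly, and the class names and scores below are made up:

```python
import math

def fuse_scores(fine_logp, coarse_logp, prior_logp):
    """Combine per-class log-likelihoods from (assumed independent)
    fine-grained and coarse scene-level classifiers with a class prior,
    returning normalized posterior probabilities per class."""
    joint = {c: fine_logp[c] + coarse_logp[c] + prior_logp[c]
             for c in prior_logp}
    # log-sum-exp normalization for numerical stability
    m = max(joint.values())
    z = sum(math.exp(v - m) for v in joint.values())
    return {c: math.exp(v - m) / z for c, v in joint.items()}
```

An image whose posterior mass falls on a blacklisted class (e.g. a bathroom) would then be flagged for review before being released to applications.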

Extraction of special effects caption text events from digital video (2003)
David Crandall, Sameer Antani and Rangachar Kasturi
International journal on document analysis and recognition, 5 (2-3), 138-157

The popularity of digital video is increasing rapidly. To help users navigate libraries of video, algorithms that automatically index video based on content are needed. One approach is to extract text appearing in video, which often reflects a scene's semantic content. This is a difficult problem due to the unconstrained nature of general-purpose video. Text can have arbitrary color, size, and orientation. Backgrounds may be complex and changing. Most work so far has made restrictive assumptions about the nature of text occurring in video. Such work is therefore not directly applicable to unconstrained, general-purpose video. In addition, most work so far has focused only on detecting the spatial extent of text in individual video frames. However, text occurring in video usually persists for several seconds. This constitutes a text event that should be entered only once in the video index. Therefore …

Robust extraction of text in video (2000)
Sameer Antani, David Crandall and Rangachar Kasturi
IEEE. 831-834, vol. 1

Despite advances in the archiving of digital video, we are still unable to efficiently search and retrieve the portions that interest us. Video indexing by shot segmentation has been a proposed solution and several research efforts are seen in the literature. Shot segmentation alone cannot solve the problem of content based access to video. Recognition of text in video has been proposed as an additional feature. Several research efforts are found in the literature for text extraction from complex images and video with applications for video indexing. We present an update of our system for detection and extraction of an unconstrained variety of text from general purpose video. The text detection results from a variety of methods are fused and each single text instance is segmented to enable it for OCR. Problems in segmenting text from video are similar to those faced in detection and localization phases. Video has low …

Method for detecting objects in digital images (2006)

A method for detecting objects in a digital image includes the steps of generating a first segmentation map of the digital image according to a non-object specific criterion, generating a second segmentation map of the digital image according to an object specific criterion, and detecting objects in the digital image using both the first and second segmentation maps. In a preferred embodiment of the invention, the non-object specific criterion is a color homogeneity criterion and the object specific criterion is an object specific color similarity, wherein the object specific color is skin color and the method further comprises the step of detecting red-eye in the detected skin color regions.
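One schematic reading of the method: the object-specific map gates the generic segmentation, so that regions from the color-homogeneity segmentation survive only where the object-specific (e.g. skin-color) criterion fires. A toy sketch with 2-D lists standing in for the two segmentation maps (the representation and gating rule are assumptions for illustration, not the patent's exact procedure):

```python
def detect_object_regions(homogeneity_map, skin_map):
    """Intersect a generic segmentation map with an object-specific one:
    keep region labels from the first map only at pixels where the second
    marks candidate object pixels. Both maps are equal-sized 2-D lists of
    ints; 0 means background / no candidate."""
    h = len(homogeneity_map)
    w = len(homogeneity_map[0])
    return [[homogeneity_map[y][x] if skin_map[y][x] else 0
             for x in range(w)] for y in range(h)]
```

In the patent's preferred embodiment, the surviving skin-color regions would then be searched further, for example for red-eye.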
