Employment

  • 10/2015-Present

    Visiting Researcher

    UC Berkeley, Computer Science Department

  • 09/2015-Present

    CEO

    Felix Vision

  • 03/2015-08/2015

    Engineering Manager

    Facebook, Applied Machine Learning

  • 03/2012-02/2015

    Research Scientist

    Facebook, Facebook AI Research

  • 06/1998-02/2012

    Sr. Research Scientist

    Adobe, Adobe Research

Education

  • Ph.D. 2007-2011

    Ph.D. in Computer Science

    University of California at Berkeley

  • M.A. 1997-1998

    Master of Arts in Computer Science

    Brown University

  • B.Sc. 1994-1998

    Bachelor of Science in Computer Science

    Brown University

Awards, Affiliations, Professional Activities

  • 2015
    Area Chair for CVPR 2016
    CVPR is one of the largest conferences in computer vision. Area chairs are responsible for assigning papers to reviewers, coordinating the review process, writing meta-reviews and providing final recommendations on the acceptance status of paper submissions.
  • 2013
    Organizer of the Bay Area Vision Meeting (BAVM2013)
    The Bay Area Vision Meeting is an annual one-day workshop organized by a company or university. The goal is to bring together researchers, professors and graduate students in the field to discuss the latest advances. Prior organizers were UC Berkeley, Stanford and Google.
  • 2013
    A tutorial on the history of part-based models and their applications to a variety of computer vision problems. Co-organized with Ross Girshick from UC Berkeley.
  • 2012-present
    PC member of various workshops
    Program Committee member for:
    • The Action Recognition and Pose Estimation workshop at ECCV 2012 (APSI2012)
    • The Scene Understanding workshop at CVPR 2012 (SUNW2012)
    • The Big Vision workshop at CVPR 2015 (BigVision2015)
  • 2007
    Adobe's University Sabbatical program
    One of two Adobe employees accepted into the University Sabbatical program, which allowed me to pursue a PhD while employed (completed in four years).
  • 1998
    Brown University Combined Program
    The only student from the Computer Science department in 1998 accepted to the Brown University Combined Program, which allowed me to complete both Bachelor’s and Master’s degrees in a total of four years.
  • 1995
    Brown University Teaching and Research Assistantship
    As an undergraduate I was a teaching assistant for three computer science classes, including serving as Head TA for the Algorithms and Data Structures class. I conducted labs and occasionally gave lectures to a class of 100+ students.
  • 1992-1994
    National Competition in Computational Linguistics in Bulgaria
    I was awarded first place in 1992 and second place in 1994 in the National Competition in Computational Linguistics in Bulgaria.
  • 2006
    Member of MENSA
  • Facebook's Image Classification Engine

    Object recognition applied on every Facebook photo

    When I joined Facebook in February 2012, I was the first person at the company hired to do computer vision. My first task was evaluating the Face.com technology, providing technical advice on the acquisition, and integrating the technology into Facebook. After that I focused on object and scene recognition. Together with my then-intern Manohar Paluri, I developed the computer vision engine that Facebook uses to analyze photos and videos. It was incredibly exciting to deploy an engine and have it run on three hundred million photos per day! The original engine was based on traditional computer vision features and could detect basic properties, such as whether a photo is a closeup, indoors or outdoors, or in nature. Over the next two years we released nine versions of the engine, significantly improving it in each release. The latest version recognizes more than a thousand types of objects, scenes, activities and places of interest using convolutional neural networks with multiple loss functions. At peak time it handles more than 10,000 calls per second and runs on every photo and every second of every video on Facebook and Instagram. It has already been called more than half a trillion times! Our engine is key for spam detection, pornographic content filtering, visual search, feed ranking, ad targeting, and many other areas. I was first the project lead and then became the manager of the group responsible for everything from research to development and deployment of the engine. I developed a large part of the training code as well as a highly optimized feedforward path used in production.

  • Person Detection and Recognition at Adobe

    The face detector and person recognizer in Photoshop Elements

    This was my first research project. Since Computer Vision was a new area for me, I started by reading papers and textbooks. My first experiment was a human ear detector using a neural network on the Haar wavelets of an image. It worked fairly well, but evaluating the neural network at every location and scale was too slow. I then spent some time thinking about evaluating the neural network incrementally, simultaneously at every place, and focusing the computations on the most promising areas. Although the total detection time deteriorated, this approach resulted in discovering most ears almost instantaneously and allowed for a nice tradeoff between detection rate and speed. I then generalized my idea to incrementally evaluate any learning machine (which I called the Soft Cascade).

    A colleague of mine, Jonathan Brandt, was investigating the (at the time) state-of-the-art Viola-Jones face detector, which had very good speed and accuracy. I decided to apply the Soft Cascade to the VJ detector and, to my delight, the resulting system was both faster and more accurate. It also has numerous advantages: it considers some information that the "hard cascade" throws away; the detector is less "brittle" and generalizes better; the speed/accuracy tradeoff is not hard-coded during training but can be specified afterwards; and the new framework allows for augmenting the operational domain of an existing detector. For example, we could improve an existing detector to handle, say, wider out-of-plane rotation. My colleague observed that the ability of the Soft Cascade to be quickly calibrated for a specific point in the speed/accuracy space allows us to explore the operational domain of the detector not just along the detection rate and false positive rate, but also along the speed dimension. As far as I know, our CVPR paper was the first to describe the ROC surface of an object detector.

    My face detector was first deployed in the face tagging feature of Photoshop Elements 4, and has received positive reviews. It wouldn't have happened without the help of Claire Schendel, a Photoshop engineer who integrated the feature into the product. As far as I know, Photoshop Elements 4, in 2005, was the first application to use face detection in a consumer product. Face detection started appearing in cameras shortly after that.

    In 2007 I started developing a system that uses face detection combined with face recognition, leveraging context, such as the fact that the same people on the same day tend to wear the same clothes. This was a research project for my class at UC Berkeley which formed the basis of the People Recognition feature in Photoshop Elements 8. My Adobe colleague Alex Parenteau and I developed the core engine, using an external face recognizer and we collaborated with the Elements engineering team. The big engineering challenge we addressed was scalability - the ability to extend the technology to very large albums with limited memory.
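
    The incremental evaluation idea behind the Soft Cascade described above can be sketched in a few lines. This is only an illustration, not the production detector: the weak classifiers (`make_stump`), the feature windows and the rejection thresholds are all hypothetical values chosen for the example. The running score is compared against a per-stage rejection threshold, so unpromising windows are discarded after very few evaluations.

```python
from typing import Callable, List, Sequence, Tuple

WeakClassifier = Callable[[Sequence[float]], float]


def make_stump(index: int, cutoff: float) -> WeakClassifier:
    """Hypothetical weak classifier: votes +1 or -1 based on one feature."""
    return lambda window: 1.0 if window[index] > cutoff else -1.0


def soft_cascade_score(
    window: Sequence[float],
    weak_classifiers: Sequence[WeakClassifier],
    rejection_thresholds: Sequence[float],
) -> Tuple[bool, float, int]:
    """Return (accepted, running score, number of stages evaluated)."""
    score = 0.0
    stages = zip(weak_classifiers, rejection_thresholds)
    for stage, (classifier, threshold) in enumerate(stages, start=1):
        score += classifier(window)   # accumulate evidence incrementally
        if score < threshold:         # too weak so far: stop early
            return False, score, stage
    return True, score, len(weak_classifiers)


# Illustrative setup: three stumps and made-up rejection thresholds.
stumps: List[WeakClassifier] = [make_stump(i, 0.5) for i in range(3)]
thresholds = [-0.5, -0.5, 0.5]

print(soft_cascade_score([0.9, 0.8, 0.7], stumps, thresholds))  # survives all stages
print(soft_cascade_score([0.1, 0.2, 0.9], stumps, thresholds))  # rejected at stage 1
```

    Because the thresholds are separate from the weak classifiers, they can be changed after training to pick a different point on the speed/accuracy tradeoff.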

  • Boost GIL

    Generic Image Library as part of the Boost C++ libraries

    While at Adobe I was fortunate to have Alex Stepanov, the main guy behind STL, as my colleague. He led a class on Generic Programming, which was an inspiration to all of us. Generic Programming is exciting because it allows for abstraction with no loss in performance. I have been collaborating with Prof. Jarvi from Texas A&M on a method for applying generic programming to create C++ code that is generic, efficient and run-time flexible, without incurring unnecessary code bloat. Here is our LCSD paper and my presentation slides. Our approach achieves the specified goals, but has other disadvantages, namely type safety.

    One excellent application for generic programming is to abstract away the image representation and allow us to write generic image processing algorithms that work efficiently with images of any color space, channel ordering, channel depth, and pixel representation. This is the goal of my Generic Image Library - a C++ library I have created together with my former colleague Hailin Jin. GIL is an open-source library now part of the popular Boost libraries and it is used by dozens of institutions. Here is a video tutorial I prepared to give an overview of GIL.

    It was a great honor for me to receive an invitation by Prof. Bjarne Stroustrup, the creator of C++, to give a talk about GIL at his institution.

  • Auto-Fill in Adobe Acrobat

    The engine to auto-fill forms in Acrobat based on history and form structure

    Have you ever applied for a mortgage? After going through the experience of filling out a billion forms, with the same information over and over again, I decided I had had enough and started thinking about ways of simplifying the form-filling experience. I created a probabilistic framework that can suggest suitable defaults for form entries. It observes your entry patterns, learns from experience and is able to extrapolate the results to previously unseen forms. When it is fairly confident in the result, it can populate the field once you tab into it. It is now used by Adobe Acrobat to streamline the form-filling process. I think it is also in the free Acrobat Reader. (You need to enable it from the preferences menu.) Thanks to Alex Mohr, an Acrobat engineer, for integrating my engine into the product.
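
    As a rough illustration of "suggest a default only when fairly confident", here is a minimal sketch. It is not the Acrobat engine: it swaps in a simple frequency estimate over past entries, and the class name `FieldSuggester`, the label normalization and the `min_confidence` value are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class FieldSuggester:
    """Hypothetical frequency-based suggester for form field defaults."""

    def __init__(self, min_confidence: float = 0.6) -> None:
        self.min_confidence = min_confidence
        # normalized field label -> counts of values the user typed
        self.history: Dict[str, Counter] = defaultdict(Counter)

    @staticmethod
    def _normalize(label: str) -> str:
        # Treat "First Name:" and "first name" as the same field so that
        # entries carry over to forms that were never seen before.
        return " ".join(label.lower().replace(":", " ").split())

    def observe(self, label: str, value: str) -> None:
        self.history[self._normalize(label)][value] += 1

    def suggest(self, label: str) -> Optional[str]:
        counts = self.history.get(self._normalize(label))
        if not counts:
            return None
        value, hits = counts.most_common(1)[0]
        confidence = hits / sum(counts.values())
        # Only pre-fill when the most frequent value clearly dominates.
        return value if confidence >= self.min_confidence else None


suggester = FieldSuggester()
suggester.observe("First Name:", "Ada")
suggester.observe("first name", "Ada")
suggester.observe("First name", "A.")
suggester.observe("City", "Paris")
suggester.observe("City", "Lyon")

print(suggester.suggest("FIRST NAME"))  # dominant value is returned
print(suggester.suggest("City"))        # ambiguous history, no suggestion
```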

  • Symbolism Tools in Adobe Illustrator

    Tools to allow for artistic effects, such as drawing organic shapes, pen-and-ink illustrations

    Vector graphics applications like Adobe Illustrator have been used to create some amazing art. But we have only scratched the surface of what computers can do. By building some intelligence into the tools, we could enable a new generation of art that would be too time consuming to generate and edit by hand. This idea inspired me to create the Symbolism tools - a suite of tools in Illustrator that allow for scattering, moving, "combing", coloring and applying styles to a collection of graphical symbols. These tools could be used for a variety of objects, like hair, organic shapes, pen-and-ink style of shading. I am using a particle system to guide the behavior of the tools. My manager Martin Newell gave me some insightful ideas for the underlying technology. I designed, prototyped, performance-optimized and integrated the feature into Illustrator. Here is some sample art created by these tools. The Symbolism tools have received outstanding reviews.

  • Adobe's Transparency Flattener

    The engine that Adobe uses to print vector documents containing transparency

    When I joined Adobe in 1998, the big company initiative was introducing transparency in the vector graphics products. Transparency can be used to represent a dazzling range of effects: see-through objects, lens effects, soft clips, drop shadows... However, the biggest technical challenge was the ability to print vector graphics with transparent elements. Adobe PostScript, the universal language of printers, does not support transparency. There were two options for printing: rasterizing into an image and printing the image, or making an opaque illustration that looks just like a transparent one by subdividing the illustration into pieces (planar mapping) and drawing them with the appropriate color, as the illustration shows. Planar mapping results in higher quality printing as it remains resolution independent. However, it is easy to create vector art for which planar mapping results in many thousands of small pieces, some smaller than a pixel. Planar mapping in those cases would be unacceptably slow, and rasterization would be the only option. But how do we know if certain parts of the document are going to result in unacceptably many pieces, without actually computing the planar map? It is a chicken-and-egg problem. I invented an algorithm that quickly estimates which areas of the document need to be rasterized and which can be planar mapped. Also, planar mapping is a complex operation, but we can often get by without it in places of the document that are not involved in transparency. But how do you know if an object is involved in transparency without checking to see if it intersects with a transparent object, i.e. without computing the planar map? Another chicken-and-egg problem. I created an algorithm to analyze the document and determine which objects need to be included in the planar map, and then interleave the results of planar mapping to generate the final document. These are just a few examples of the problems I needed to resolve in the flattener.

    Other problems I had to resolve include how to preserve native type through planar mapping, how to preserve native gradients and gradient meshes, how to support spot color planes, how to avoid stitching problems when dealing with strokes, how to preserve patterns, how to deal with overprint, how to schedule the color computations to avoid doing them repeatedly, how to design the system so that it runs in a single pass (the output may be too big to keep in memory), and how to make it fail-safe, i.e. if it runs out of memory, it should fall back, break the problem into smaller pieces and try again.

    I single-handedly designed and implemented the entire flattener module - the system that takes a vector graphics document containing transparency and outputs one that is visually equivalent but contains no transparency. (That does not include the planar mapping code, implemented by my colleague Steve Schiller). The flattener is now used by many of Adobe's vector products, including Illustrator, Acrobat and InDesign. It is used when printing and exporting to various formats, and Adobe also licenses it to other companies such as Kodak. The flattener is also ported into high-end PDF printers (RIPs). This was one of the largest and most complex projects I have ever done. It is also very widely used - not just for printing labels and posters, but also big titles like the cover of Glamour magazine.

Improving Image Classification with Location Context

Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
Computer Vision | International Conference on Computer Vision (ICCV 2015)

With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Computer Vision | International Conference on Computer Vision (ICCV 2015)

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

Beyond Frontal Faces: Improving Person Recognition Using Multiple Cues

Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, Lubomir Bourdev
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)

We explore the task of recognizing people's identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60000 instances of over 2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

Web-Scale Photo Hash Clustering on a Single Machine

Yunchao Gong, Marcin Pawlowski, Fei Yang, Louis Brandy, Lubomir Bourdev and Rob Fergus
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)

This paper addresses the problem of clustering a very large number of photos (i.e. hundreds of millions a day) in a stream into millions of clusters. This is particularly important given the popularity of photo sharing websites, such as Facebook, Google, and Instagram. Given the large number of photos available online, how to efficiently organize them is an open problem. To address this problem, we propose to cluster the binary hash codes of a large number of photos into binary cluster centers. We present a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which we can build hash indexes to speed up computation. The proposed method is capable of clustering millions of photos on a single machine in a few minutes. We show that this approach is usually several orders of magnitude faster than standard k-means and produces comparable clustering accuracy. In addition, we propose an online clustering method based on binary k-means that is capable of clustering large photo streams on a single machine, and show applications to spam detection and trending photo discovery.
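
To make the idea of k-means over binary hash codes concrete, the following is a minimal sketch and not the paper's implementation. It assumes hash codes are stored as Python integers of `n_bits` bits, assigns each code to the nearest center by brute-force Hamming distance (the paper instead builds hash indexes over the centers for speed), and updates each center by a per-bit majority vote, which keeps the centers binary. The sample codes and initial centers are made up.

```python
from typing import List, Sequence, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def majority_center(codes: Sequence[int], n_bits: int) -> int:
    """Binary center: each bit is set if more than half the members set it."""
    center = 0
    for bit in range(n_bits):
        ones = sum((code >> bit) & 1 for code in codes)
        if 2 * ones > len(codes):
            center |= 1 << bit
    return center


def binary_kmeans(
    codes: Sequence[int],
    initial_centers: Sequence[int],
    n_bits: int,
    max_iters: int = 20,
) -> Tuple[List[int], List[int]]:
    """Return (centers, assignment) after Lloyd-style iterations."""
    centers = list(initial_centers)
    assignment: List[int] = []
    for _ in range(max_iters):
        new_assignment = [
            min(range(len(centers)), key=lambda j: hamming(code, centers[j]))
            for code in codes
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for j in range(len(centers)):
            members = [c for c, a in zip(codes, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = majority_center(members, n_bits)
    return centers, assignment


# Made-up 8-bit codes forming two obvious groups.
sample_codes = [0b00000000, 0b00000001, 0b00000010,
                0b11111111, 0b11111110, 0b11111101]
centers, assignment = binary_kmeans(sample_codes, [sample_codes[0], sample_codes[3]], n_bits=8)
print(centers, assignment)
```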

Training Convolutional Networks with Noisy Labels

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, Rob Fergus
Computer Vision | International Conference on Learning Representations, Workshop Paper (ICLR 2015)

The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.
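
The noise layer described above can be pictured as a linear map applied to the softmax output. The sketch below is an illustration under stated assumptions, not the paper's code: `Q[i][j]` is taken to be the probability that true class `i` is observed as noisy label `j`, the matrix is fixed to made-up values (in the paper its parameters are estimated during training), and plain Python lists stand in for tensors.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noise_layer(clean_probs: Sequence[float], Q: Sequence[Sequence[float]]) -> List[float]:
    """Map clean class probabilities to the noisy label distribution.

    noisy[j] = sum_i clean[i] * Q[i][j], where each row of Q sums to 1.
    """
    n_noisy = len(Q[0])
    return [
        sum(clean_probs[i] * Q[i][j] for i in range(len(clean_probs)))
        for j in range(n_noisy)
    ]


def noisy_label_loss(logits: Sequence[float], noisy_label: int,
                     Q: Sequence[Sequence[float]]) -> float:
    """Cross-entropy of the observed (noisy) label under the adapted output."""
    return -math.log(noise_layer(softmax(logits), Q)[noisy_label])


# Made-up two-class example: class 0 is mislabeled 20% of the time,
# class 1 is mislabeled 30% of the time.
Q = [[0.8, 0.2],
     [0.3, 0.7]]
noisy = noise_layer(softmax([2.0, 0.0]), Q)
print(noisy, noisy_label_loss([2.0, 0.0], 1, Q))
```

With an identity matrix for `Q` the layer leaves the network output unchanged, which corresponds to training on clean labels.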

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnik and Piotr Dollar
Computer Vision | arXiv 2015

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

PANDA: Pose Aligned Networks for Deep Attribute Modeling

Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, Lubomir Bourdev
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014)

We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by flat low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.

Hierarchical Cascade of Classifiers for Efficient Poselet Evaluation

David Bo Chen, Pietro Perona, and Lubomir Bourdev
Computer Vision | British Machine Vision Conference (BMVC 2014)

Poselets have been used in a variety of computer vision tasks, such as detection, segmentation, action classification, pose estimation and action recognition, often achieving state-of-the-art performance. Poselet evaluation, however, is computationally intensive as it involves running thousands of scanning window classifiers. We present an algorithm for training a hierarchical cascade of part-based detectors and apply it to speed up poselet evaluation. Our cascade hierarchy leverages common components shared across poselets. We generate a family of cascade hierarchies, including trees that grow logarithmically on the number of poselet classifiers. Our algorithm, under some reasonable assumptions, finds the optimal tree structure that maximizes speed for a given target detection rate. We test our system on the PASCAL dataset and show an order of magnitude speedup at less than 1% loss in AP.

Deep Poselets for Human Detection

Lubomir Bourdev, Fei Yang, Rob Fergus
Computer Vision | arXiv 2014

We address the problem of detecting people in natural scenes using a part approach based on poselets. We propose a bootstrapping method that allows us to collect millions of weakly labeled examples for each poselet type. We use these examples to train a Convolutional Neural Net to discriminate different poselet types and separate them from the background class. We then use the trained CNN as a way to represent poselet patches with a Pose Discriminative Feature (PDF) vector -- a compact 256-dimensional feature vector that is effective at discriminating pose from appearance. We train the poselet model on top of PDF features and combine them with object-level CNNs for detection and bounding box prediction. The resulting model leads to state-of-the-art performance for human detection on the PASCAL datasets.

Articulated Pose Estimation using Discriminative Armlet Classifiers

Georgia Gkioxari, Pablo Arbelaez, Lubomir Bourdev and Jitendra Malik
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013)

We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan, the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.

Interactive Facial Feature Localization

Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas Huang
Computer Vision | European Conference on Computer Vision (ECCV 2012)

We address the problem of interactive facial feature localization from a single image. Our goal is to obtain an accurate segmentation of facial features on high-resolution images under a variety of pose, expression, and lighting conditions. Although there has been significant work in facial feature localization, we are addressing a new application area, namely to facilitate intelligent high-quality editing of portraits, that brings requirements not met by existing methods. We propose an improvement to the Active Shape Model that allows for greater independence among the facial components and improves on the appearance fitting step by introducing a Viterbi optimization process that operates along the facial contours. Despite the improvements, we do not expect perfect results in all cases. We therefore introduce an interaction model whereby a user can efficiently guide the algorithm towards a precise solution. We introduce the Helen Facial Feature Dataset consisting of annotated portrait images gathered from Flickr that are more diverse and challenging than currently existing datasets. We present experiments that compare our automatic method to published results, and also a quantitative evaluation of the effectiveness of our interactive method.

Semantic Segmentation using Regions and Parts

Pablo Arbeláez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012)

We address the problem of segmenting and recognizing objects in real world images, focusing on challenging articulated categories such as humans and other animals. For this purpose, we propose a novel design for region-based object detectors that integrates efficiently top-down information from scanning-windows part models and global appearance cues. Our detectors produce class-specific scores for bottom-up regions, and then aggregate the votes of multiple overlapping candidates through pixel classification. We evaluate our approach on the PASCAL segmentation challenge, and report competitive performance with respect to current leading techniques. On VOC2010, our method obtains the best results in 6/20 categories and the highest performance on articulated objects.

Facial Expression Editing in Video Using a Temporally-Smooth Factorization

Fei Yang, Lubomir Bourdev, Eli Shechtman, Jue Wang and Dimitri Metaxas
Computer Vision | IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012)

We address the problem of editing facial expression in video, such as exaggerating, attenuating or replacing the expression with a different one in some parts of the video. To achieve this we develop a tensor-based 3D face geometry reconstruction method, which fits a 3D model for each video frame, with the constraint that all models have the same identity and requiring temporal continuity of pose and expression. With the identity constraint, the differences between the underlying 3D shapes capture only changes in expression and pose. We show that various expression editing tasks in video can be achieved by combining face reordering with face warping, where the warp is induced by projecting differences in 3D face shapes into the image plane. Analogously, we show how the identity can be manipulated while fixing expression and pose. Experimental results show that our method can effectively edit expressions and identity in video in a temporally-coherent way with high fidelity.

Urban Tribes: Analyzing Group Photos from a Social Perspective

Ana Murillo, Iljung Kwak, Lubomir Bourdev, David Kriegman and Serge Belongie
Computer Vision | CVPR 2012 Workshop on Socially Intelligent Surveillance and Monitoring

The explosive growth in image sharing via social networks has produced exciting opportunities for the computer vision community in areas including face, text, product and scene recognition. In this work we turn our attention to group photos of people and ask the question: what can we determine about the social subculture or urban tribe to which these people belong? To this end, we propose a framework employing low- and mid-level features to capture the visual attributes distinctive to a variety of urban tribes. We proceed in a semi-supervised manner, employing a metric that allows us to extrapolate from a small number of pairwise image similarities to induce a set of groups that visually correspond to familiar urban tribes such as biker, hipster or goth. Automatic recognition of such information in group photos offers the potential to improve recommendation services, context sensitive advertising and other social analysis applications. We present promising preliminary experimental results that demonstrate our ability to categorize group photos in a socially meaningful manner.

Face Morphing using 3D-Aware Appearance Optimization

Fei Yang, Eli Shechtman, Jue Wang, Lubomir Bourdev, Dimitris Metaxas
Computer Graphics Graphics Interface (GI 2012)

Describing People: A Poselet-Based Approach to Attribute Classification

Lubomir Bourdev, Subhransu Maji, Jitendra Malik
Computer Vision International Conference on Computer Vision (ICCV 2011)

We propose a method for recognizing attributes, such as the gender, hair style and types of clothes of people under large variation in viewpoint, pose, articulation and occlusion typical of personal photo album images. Robust attribute classifiers under such conditions must be invariant to pose, but inferring the pose in itself is a challenging problem. We use a part-based approach based on poselets. Our parts implicitly decompose the aspect (the pose and viewpoint). We train attribute classifiers for each such aspect and we combine them together in a discriminative model. We propose a new dataset of 8000 people with annotated attributes. Our method performs very well on this dataset, significantly outperforming a baseline built on the spatial pyramid match kernel method. On gender recognition we outperform a commercial face recognition system.

Semantic Contours from Inverse Detectors

Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji and Jitendra Malik
Computer Vision International Conference on Computer Vision (ICCV 2011)

We study the challenging problem of localizing and classifying category-specific object contours in real-world images. For this purpose, we present a simple yet effective method for combining generic object detectors with bottom-up contours to identify object contours. We also provide a principled way of combining information from different part detectors and across categories. In order to study the problem and quantitatively evaluate our approach, we present a dataset of semantic exterior boundaries on more than 20,000 object instances belonging to 20 categories, using the images from the VOC2011 PASCAL challenge.

Pause-and-play: Automatically Linking Screencast Video Tutorials with Applications

Suporn Pongnumkul, Mira Dontcheva, Wil Li, Lubomir Bourdev, Shai Avidan, Jue Wang and Michael Cohen
Computer Graphics ACM Symposium on User Interface Software and Technology (UIST 2011)

Video tutorials provide a convenient means for novices to learn new software applications. Unfortunately, staying in sync with a video while trying to use the target application at the same time requires users to repeatedly switch from the application to the video to pause or scrub backwards to replay missed steps. We present Pause-and-Play, a system that helps users work along with existing video tutorials. Pause-and-Play detects important events in the video and links them with corresponding events in the target application as the user tries to replicate the depicted procedure. This linking allows our system to automatically pause and play the video to stay in sync with the user. Pause-and-Play also supports convenient video navigation controls that are accessible from within the target application and allow the user to easily replay portions of the video without switching focus out of the application. Finally, since our system uses computer vision to detect events in existing videos and leverages application scripting APIs to obtain real-time usage traces, our approach is largely independent of the specific target application and does not require access or modifications to application source code. We have implemented Pause-and-Play for two target applications, Google SketchUp and Adobe Photoshop, and we report on a user study that shows our system improves the user experience of working with video tutorials.

Expression Flow for 3D-Aware Face Component Transfer

Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev and Dimitris Metaxas
Computer Graphics ACM Transactions on Graphics (SIGGRAPH 2011)

We address the problem of correcting an undesirable expression on a face photo by transferring local facial components, such as a smiling mouth, from another face photo of the same person which has the desired expression. Direct copying and blending using existing compositing tools results in semantically unnatural composites, since expression is a global effect and the local component in one expression is often incompatible with the shape and other components of the face in another expression. To solve this problem we present Expression Flow, a 2D flow field which can warp the target face globally in a natural way, so that the warped face is compatible with the new facial component to be copied over. To do this, starting with the two input face photos, we jointly construct a pair of 3D face shapes with the same identity but different expressions. The expression flow is computed by projecting the difference between the two 3D shapes back to 2D. It describes how to warp the target face photo to match the expression of the reference photo. User studies suggest that our system is able to generate face composites with much higher fidelity than existing methods.
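As a rough illustration of the projection step described in this abstract, the sketch below computes a sparse 2D flow from two 3D face shapes that share vertex ordering. It is not the paper's implementation: the scaled orthographic camera, the per-vertex (rather than dense) flow, and the names `project` and `expression_flow` are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project(vertex: Vec3, scale: float = 1.0) -> Vec2:
    """Scaled orthographic projection (an assumed camera model): drop z, scale x and y."""
    x, y, _z = vertex
    return (scale * x, scale * y)


def expression_flow(
    target_shape: Sequence[Vec3],
    reference_shape: Sequence[Vec3],
    scale: float = 1.0,
) -> List[Vec2]:
    """Per-vertex 2D displacement from the target expression to the reference expression.

    Both shapes are assumed to have the same identity and the same vertex order,
    so the displacement reflects only the change in expression.
    """
    if len(target_shape) != len(reference_shape):
        raise ValueError("shapes must have the same number of vertices")
    flow = []
    for v_target, v_reference in zip(target_shape, reference_shape):
        tx, ty = project(v_target, scale)
        rx, ry = project(v_reference, scale)
        flow.append((rx - tx, ry - ty))
    return flow


# Hypothetical two-vertex "faces": only the y coordinates differ after projection.
target = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
reference = [(0.0, 0.5, 1.0), (1.0, 0.25, 2.0)]
flow = expression_flow(target, reference)
```

A full system would interpolate such sparse displacements into a dense flow field before warping the target photo.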

Action Recognition from a Distributed Representation of Pose and Appearance

Subhransu Maji, Lubomir Bourdev, Jitendra Malik
Computer Vision IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011)

We present a distributed representation of pose and appearance of people called the “poselet activation vector”. First we show that this representation can be used to estimate the pose of people defined by the 3D orientations of the head and torso in the challenging PASCAL VOC 2010 person detection dataset. Our method is robust to clutter, aspect and viewpoint variation and works even when body parts like faces and limbs are occluded or hard to localize. We combine this representation with other sources of information like interaction with objects and other people in the image and use it for action recognition. We report competitive results on the PASCAL VOC 2010 static image action classification challenge.

Object Segmentation by Alignment of Poselet Activations to Image Contours

Thomas Brox, Lubomir Bourdev, Subhransu Maji and Jitendra Malik
Computer Vision IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011)

In this paper, we propose techniques to make use of two complementary bottom-up features, image edges and texture patches, to guide top-down object segmentation towards higher precision. We build upon the part-based poselet detector, which can predict masks for numerous parts of an object. For this purpose we extend poselets to 19 other categories apart from person. We non-rigidly align these part detections to potential object contours in the image, both to increase the precision of the predicted object mask and to sort out false positives. We spatially aggregate object information via a variational smoothing technique while ensuring that object regions do not overlap. Finally, we propose to refine the segmentation based on self-similarity defined on small image patches. We obtain competitive results on the challenging Pascal VOC benchmark. On four classes we achieve the best numbers to date.

Poselets and Their Applications in High-Level Computer Vision

Lubomir Bourdev
Computer Vision PhD Thesis, University of California at Berkeley, 2011

We address the classic problems of detection and segmentation using a part based detector that operates on a novel part, which we refer to as a poselet. Poselets are tightly clustered in both appearance space (and thus are easy to detect) as well as in configuration space (and thus are helpful for localization and segmentation). We demonstrate poselets are effective for detection, pose extraction, segmentation, action/pose estimation and attribute classification. Poselet construction requires extra annotations beyond the object bounds. To train poselets we have created H3D (Humans in 3D) - a dataset of 1200+ person annotations. The annotations include the joints, the extracted 3D pose, keypoint visibility and region labels. We have also annotated the people in the training and validation sets of PASCAL VOC 2009. Our poselet classifier achieves state-of-the-art results for the person category on PASCAL VOC 2007, 2008, 2009 and 2010 as well as on our dataset, H3D.

Detecting People Using Mutually Consistent Poselet Activations

Lubomir Bourdev, Subhransu Maji, Thomas Brox, Jitendra Malik
Computer Vision European Conference on Computer Vision (ECCV 2010)

Bourdev and Malik (ICCV 09) introduced a new notion of parts, poselets, constructed to be tightly clustered both in the configuration space of keypoints, as well as in the appearance space of image patches. In this paper we develop a new algorithm for detecting people using poselets. Unlike that work which used 3D annotations of keypoints, we use only 2D annotations which are much easier for naive human annotators. The main algorithmic contribution is in how we use the pattern of poselet activations. Individual poselet activations are noisy, but considering the spatial context of each can provide vital disambiguating information, just as object detection can be improved by considering the detection scores of nearby objects in the scene. This can be done by training a two-layer feed-forward network with weights set using a max margin technique. The refined poselet activations are then clustered into mutually consistent hypotheses where consistency is based on empirically determined spatial keypoint distributions. Finally, bounding boxes are predicted for each person hypothesis and shape masks are aligned to edges in the image to provide a segmentation. To the best of our knowledge, the resulting system is the current best performer on the task of people detection and segmentation with an average precision of 47.8% and 40.5% respectively on PASCAL VOC 2009.

Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations

Lubomir Bourdev and Jitendra Malik
Computer Vision International Conference on Computer Vision (ICCV 2009)

We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria: (1) it should be easy to find a poselet given an input image; (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state-of-the-art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.

Generic Image Library

Lubomir Bourdev
Software Engineering Software Developer's Journal 2007

The Generic Image Library (GIL) is a C++ image library sponsored by Adobe Systems, Inc. and developed by Lubomir Bourdev and Hailin Jin. It is an open-source library, planned for inclusion in Boost 1.35.0. GIL is also a part of the Adobe Source Libraries. It is used in several Adobe projects, including some new features in Photoshop CS4.

Efficient run-time dispatching in generic programming with minimal code bloat

Lubomir Bourdev and Jaakko Järvi
Software Engineering Science of Computer Programming, 2010

Generic programming with C++ templates results in efficient but inflexible code: efficient, because the exact types of inputs to generic functions are known at compile time; inflexible because they must be known at compile time. We show how to achieve run-time polymorphism without compromising performance by instantiating the generic algorithm with a comprehensive set of possible parameter types, and choosing the appropriate instantiation at run time. Applying this approach naïvely can result in excessive template bloat: a large number of template instantiations, many of which are identical at the assembly level. We show practical examples of this approach quickly approaching the limits of the compiler. Consequently, we combine this method of run-time polymorphism for generic programming with a strategy for reducing the number of necessary template instantiations. We report on using our approach in GIL, Adobe’s open source Generic Image Library. We observed a notable reduction, up to 70% at times, in executable sizes of our test programs. This was the case even with compilers that perform aggressive template hoisting at the compiler level, due to significantly smaller dispatching code. The framework draws from both the generic and generative programming paradigms, using static metaprogramming to fine-tune the compilation of a generic library. Our test bed, GIL, is deployed in a real-world industrial setting, where code size is often an important factor.

Robust Object Detection Via Soft Cascade

Lubomir Bourdev and Jonathan Brandt
Computer Vision IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005)

We describe a method for training object detectors using a generalization of the cascade architecture, which results in a detection rate and speed comparable to that of the best published detectors while allowing for easier training and a detector with fewer features. In addition, the method allows for quickly calibrating the detector for a target detection rate, false positive rate or speed. One important advantage of our method is that it enables systematic exploration of the ROC Surface, which characterizes the trade-off between accuracy and speed for a given classifier.
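The early-rejection idea behind a soft cascade can be sketched as follows: stage outputs are accumulated into a running score, and a sample is rejected as soon as that score drops below the rejection threshold for the current stage. This is a minimal sketch, not the paper's code; the function name `soft_cascade_evaluate`, the toy stage functions and the threshold values are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Stage = Callable[[Sequence[float]], float]


def soft_cascade_evaluate(
    sample: Sequence[float],
    stages: Sequence[Stage],
    rejection_thresholds: Sequence[float],
) -> Tuple[bool, float, int]:
    """Return (accepted, cumulative score, number of stages evaluated).

    Unlike a hard cascade, no stage makes an independent pass/fail decision:
    each stage only adds to a running sum, which is compared against a
    per-stage rejection threshold.
    """
    if len(stages) != len(rejection_thresholds):
        raise ValueError("need exactly one rejection threshold per stage")
    score = 0.0
    for evaluated, (stage, threshold) in enumerate(
        zip(stages, rejection_thresholds), start=1
    ):
        score += stage(sample)
        if score < threshold:
            return False, score, evaluated  # early exit saves the remaining stages
    return True, score, len(stages)


# Toy stages (assumed for illustration): each one reads a single feature value.
stages = [lambda s: s[0], lambda s: s[1], lambda s: s[2]]
thresholds = [-1.0, 0.0, 0.5]

accepted = soft_cascade_evaluate((0.5, 0.25, 0.125), stages, thresholds)
rejected = soft_cascade_evaluate((-2.0, 5.0, 5.0), stages, thresholds)
```

Raising or lowering the rejection thresholds trades detection rate against the average number of stages evaluated, which is the kind of accuracy/speed trade-off the abstract refers to.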

Art-Based Rendering of Fur, Grass, and Trees

Michael Kowalski, Lee Markosian, J.D. Northrup, Lubomir Bourdev, Ronen Barzel, Loring Holden and John Hughes
Computer Graphics ACM Transactions on Graphics (SIGGRAPH 1999)

Artists and illustrators can evoke the complexity of fur or vegetation with relatively few well-placed strokes. We present an algorithm that uses strokes to render 3D computer graphics scenes in a stylized manner suggesting the complexity of the scene without representing it explicitly. The basic algorithm is customizable to produce a range of effects including fur, grass and trees, as we demonstrate in this paper and accompanying video. The algorithm is implemented within a broader framework that supports procedural stroke-based textures on polyhedral models. It renders moderately complex scenes at multiple frames per second on current graphics workstations, and provides some interframe coherence.

Rendering Nonphotorealistic Strokes with Temporal and Arc-Length Coherence

Lubomir Bourdev
Computer Graphics Master's Thesis, Brown University, 1998

We describe a method for rendering a silhouette of an object in a frame-to-frame coherent way. The input to the system each frame is a set of silhouette pixels in a rendering of the object and their corresponding silhouette edges in a polygonal model (mesh) of the object. The output is a set of silhouette strokes.

Real-Time Nonphotorealistic Rendering

Lee Markosian, Michael Kowalski, Sam Trychin, Lubomir Bourdev, Daniel Goldstein and John Hughes
Computer Graphics ACM Transactions on Graphics (SIGGRAPH 1997)

Nonphotorealistic rendering (NPR) can help make comprehensible but simple pictures of complicated objects by employing an economy of line. But current nonphotorealistic rendering is primarily a batch process. This paper presents a real-time nonphotorealistic renderer that deliberately trades accuracy and detail for speed. Our renderer uses a method for determining visible lines and surfaces which is a modification of Appel’s hidden-line algorithm, with improvements which are based on the topology of singular maps of a surface into the plane. The method we describe for determining visibility has the potential to be used in any NPR system that requires a description of visible lines or surfaces in the scene. The major contribution of this paper is thus to describe a tool which can significantly improve the performance of these systems. We demonstrate the system with several nonphotorealistic rendering styles, all of which operate on complex models at interactive frame rates.

  • J. Brandt, Z. Lin, L. Bourdev, Vuong Le, Fitting Contours to Features, U.S. Patent 9158963

  • L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 9092173

  • L. Bourdev, E. Shechtman, J. Wang, and F. Yang, Methods and Apparatus for Face Fitting and Editing Applications, U.S. Patent 8923392

  • L. Dontcheva, S. Pongnumkul, W. Li, S. Avidan and L. Bourdev, Methods and Apparatus for Tutorial Video Enhancement, U.S. Patent 8909024

  • A. Lerios, D. Stoop, R. Mack, L. Bourdev, M. Paluri, Methods and Systems for Differentiating Synthetic and Non-Synthetic Images, U.S. Patent 8903186

  • J. Wang, E. Shechtman, L. Bourdev, F. Yang, Methods and Apparatus for Facial Feature Replacement, U.S. Patent 8818131

  • K. Dale, L. Bourdev, S. Avidan, A. Parenteau, System and Method for Labeling a Collection of Images, U.S. Patent 8724908

  • A. Casillas, L. Bourdev, Indicating a Correspondence Between an Image and an Object, U.S. Patent 8548211

  • L. Bourdev, Generation and Usage of Attractiveness Scores, U.S. Patent 8532347

  • L. Bourdev, J. Xu, System and Method for using Contextual Features to Improve Face Recognition in Digital Images, U.S. Patent 8503739

  • J. Wang, E. Shechtman, L. Bourdev, F. Yang, Methods and Apparatus for Facial Feature Replacement, U.S. Patent 8457442

  • L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 8418051

  • L. Bourdev, A. Parenteau, Efficient and Scalable Face Recognition in Photo Albums, U.S. Patent 8379939

  • L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 8296647

  • C. Schendel, L. Bourdev, Designating a Tag Icon, U.S. Patent 8259995

  • L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 8244069

  • L. Bourdev, Autocompleting Form Fields Based on Previously Entered Values, U.S. Patent 8234561

  • L. Bourdev, Detecting Objects within an Image by Incrementally Evaluating Subwindows of the Image in Parallel, U.S. Patent 8077920

  • L. Bourdev, Generation and Usage of Attractiveness Scores, U.S. Patent 8041076

  • A. Casillas, L. Bourdev, Indicating a Correspondence Between an Image and an Object, U.S. Patent 7978936

  • L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 7966566

  • L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 7889946

  • L. Bourdev, Previewing the Effects of Flattening Transparency, U.S. Patent 7827485

  • L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 7825941

  • L. Bourdev, Method and System to Monitor Installation of a Software Program, U.S. Patent 7818741

  • L. Bourdev, Method for Displaying Extracted Faces from Images in Normalized Form, U.S. Patent 7813526

  • L. Bourdev, Tagging Detected Objects, U.S. Patent 7813557

  • L. Bourdev, J. Brandt, Image Splitting to Use Multiple Execution Channels of a Graphics Processor to Perform an Operation on Single-Channel Input, U.S. Patent 7768516

  • L. Bourdev, Detecting Objects within an Image by Incrementally Evaluating Subwindows of the Image in Parallel, U.S. Patent 7738680

  • L. Bourdev, Incremental Batch-Mode Editing of Digital Media Objects, U.S. Patent 7730043

  • L. Bourdev, C. Schendel, J. Heileson, Searching Images with Extracted Objects, U.S. Patent 7716157

  • A. Casillas, L. Bourdev, Exporting Extracted Faces, U.S. Patent 7706577

  • L. Bourdev, Indicating a Tag with Visual Data, U.S. Patent 7694885

  • A. Parenteau, L. Bourdev, Selectively Transforming Overlapping Illustration Artwork, U.S. Patent 7692652

  • L. Bourdev, Displaying Detected Objects to Indicate Grouping, U.S. Patent 7636450

  • L. Bourdev, J. Brandt, Detecting Objects in an Image Using a Soft Cascade, U.S. Patent 7634142

  • L. Bourdev, Method and Apparatus for Calibrating Sampling Operations for an Object Detection Process, U.S. Patent 7616780

  • L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 7587101

  • L. Bourdev, G. Wilensky, Detection of Objects in an Image using Color Analysis, U.S. Patent 7580563

  • P. Asente, T. Pettit, L. Bourdev, M. Schuster, Assigning Region Attributes in a Drawing, U.S. Patent 7502028

  • L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 7495675

  • L. Bourdev, Method and Apparatus for Calibrating Sampling Operations for an Object Detection Process, U.S. Patent 7440587

  • L. Bourdev, Autocompleting Form Fields Based on Previously Entered Values, U.S. Patent 7343551

  • L. Bourdev, M. Newell, Creating and Manipulating Related Vector Objects in an Image, U.S. Patent 7339597

  • A. Parenteau, L. Bourdev, Selectively Transforming Overlapping Illustration Artwork, U.S. Patent 7262782

  • L. Bourdev, S. Schiller, Processing Complex Regions of Illustration Artwork, U.S. Patent 7256798

  • L. Bourdev, Previewing the Effects of Flattening Transparency, U.S. Patent 7181687

  • L. Bourdev, M. Newell, Operations on Related Set of Vector Objects, U.S. Patent 7123269

  • P. Louveaux, L. Bourdev, Hierarchical 2D Compositing with Blending Mode and Opacity Controls at All Levels, U.S. Patent 7102651

  • L. Bourdev, S. Schiller, Processing Complex Regions of Illustration Artwork, U.S. Patent 6894704

  • L. Bourdev, S. Schiller, Flattening Images with Abstracted Objects, U.S. Patent 6859553

  • P. Louveaux, L. Bourdev, Hierarchical 2D Compositing with Blending Mode and Opacity Controls at All Levels, U.S. Patent 6847380

  • L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 6720977

  • L. Bourdev, Processing Opaque Pieces of Illustration Artwork, U.S. Patent 6515675

My former intern Ning Zhang and I, together with colleagues from Facebook AI Research, published a CVPR poster on PIPER, our method for recognizing people even if their face is not visible. It went largely unnoticed in the vision community until it was suddenly picked up by the press, with dozens of articles about it: by Wired, Time, Wall Street Journal, The Hacker News, Fortune, another one by WSJ, New Scientist, ZDNet, Business Insider, Huffington Post, Yahoo, Daily Mail, USA Today and many others. They even made a Jimmy Kimmel skit! While I can't help feeling flattered by the press attention and think of PIPER as a neat project and the first of its kind to recognize people from any viewpoint without the presence of a face, I feel this work is being overhyped and is certainly not worth being called a "technological breakthrough". Many of the articles expressed privacy concerns (which are unwarranted: this is a research-only project with no plans to deploy to production).

Another project popular with the press was my work on the Symbolism tools at Adobe. It got rave reviews by many influential sources. My work on face tagging in Photoshop Elements also got noticed.

Here is a Wall Street Journal article written in 2008 about young researchers and the age span of inventors across different companies. It mentions me in the section about Adobe. (At that time I was considered "young"!)