Employment

  • 12/2015-Present

    CEO

    WaveOne

  • 10/2015-04/2016

    Visiting Researcher

    UC Berkeley, Computer Science Department

  • 03/2015-08/2015

    Engineering Manager

    Facebook, Applied Machine Learning

  • 03/2012-02/2015

    Research Scientist

    Facebook, Facebook AI Research

  • 06/1998-02/2012

    Sr. Research Scientist

    Adobe, Adobe Research

Education

  • Ph.D. 2007-2011

    Ph.D. in Computer Science

    University of California at Berkeley

  • M.A. 1997-1998

    Master of Arts in Computer Science

    Brown University

  • B.Sc. 1994-1998

    Bachelor of Science in Computer Science

    Brown University

Awards, Affiliations, Professional Activities

  • 2015-present
    Area Chair for CVPR and ECCV
    CVPR and ECCV are two of the top three computer vision conferences. Area chairs are responsible for assigning papers to reviewers, coordinating the review process, writing meta-reviews and providing final recommendations on the acceptance of paper submissions. I have been an area chair for CVPR 2016, CVPR 2019 and ECCV 2020.
  • 2013
    Organizer of the Bay Area Vision Meeting (BAVM2013)
    The Bay Area Vision Meeting is an annual one-day workshop organized by a company or university. Its goal is to bring together researchers, professors and graduate students in the field to discuss the latest advances. Prior organizers were UC Berkeley, Stanford and Google.
  • 2013
    A tutorial on the history of part-based models and their applications to a variety of computer vision tasks. Co-organized with Ross Girshick from UC Berkeley.
  • 2012-present
    PC member of various workshops
    Program Committee member for:
    • The Action Recognition and Pose Estimation workshop at ECCV 2012 (APSI2012)
    • The Scene Understanding workshop at CVPR 2012 (SUNW2012)
    • The Big Vision workshop at CVPR 2015 (BigVision2015)
  • 2007
    Adobe's University Sabbatical program
    One of two Adobe employees accepted into the University Sabbatical program, which allowed me to pursue a Ph.D. while employed (completed in four years).
  • 2006
    Member of MENSA
  • 1998
    Brown University Combined Program
    The only student from the Computer Science department in 1998 accepted to the Brown University Combined Program, which allowed me to complete both Bachelor’s and Master’s degrees in a total of four years.
  • 1995
    Brown University Teaching and Research Assistantship
    As an undergraduate I was a teaching assistant for three computer science classes, including Head TA for the Algorithms and Data Structures class. I conducted labs and occasionally gave lectures to a class of 100+ students.
  • 1992-1994
    National Competition in Computational Linguistics in Bulgaria
    I was awarded first place in 1992 and second place in 1994 in the National Competition in Computational Linguistics in Bulgaria.

Projects

  • Facebook's Image Classification Engine

    Object recognition applied on every Facebook photo

    When I joined Facebook in February 2012 I was the first person at the company hired to do computer vision. My first task was evaluating the Face.com technology and providing technical advice on the acquisition, as well as integrating the technology into Facebook. After that I focused on object and scene recognition. Together with my then-intern Manohar Paluri, we developed the computer vision engine that Facebook uses to analyze photos and videos. It was incredibly exciting to deploy an engine and have it run on three hundred million photos per day!

    The original engine was based on traditional computer vision features and could tell basic properties, such as whether the photo is a closeup, indoors or outdoors, or in nature. Over the next two years we released nine versions of the engine, significantly improving it in each release. The latest version recognizes more than a thousand types of objects, scenes, activities and places of interest using convolutional neural networks with multiple loss functions. At peak time it handles more than 10,000 calls per second and is run on every photo and every second of every video on Facebook and Instagram. It has already been called more than half a trillion times!

    Our engine is key for spam detection, pornographic content filtering, visual search, feed ranking, ad targeting, and many other areas. I was initially the project lead and later became the manager of the group responsible for everything from research to the development and deployment of the engine. I developed a large part of the training code as well as a highly optimized feedforward path used in production.

  • Person Detection and Recognition at Adobe

    The face detector and person recognizer in Photoshop Elements

    This was my first research project. Since Computer Vision was a new area for me, I started by reading papers and textbooks. My first experiment was a human ear detector using a neural network on the Haar wavelets of an image. It worked fairly well, but evaluating the neural network at every location and scale was too slow. I then spent some time thinking about evaluating the neural network incrementally, simultaneously at every place, and focusing the computations on the most promising areas. Although the total detection time deteriorated, this approach resulted in discovering most ears almost instantaneously and allowed for a nice tradeoff between detection rate and speed. I then generalized my idea to incrementally evaluate any learning machine (which I called the Soft Cascade).
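
    To make the Soft Cascade concrete, here is a minimal sketch of the early-exit evaluation (illustrative C++ with hypothetical names, not the production detector): each stage adds its partial score, and a window is rejected as soon as the running total falls below that stage's calibrated threshold.

```cpp
#include <vector>

// One stage of the cascade: a weak classifier plus a calibrated rejection
// threshold on the *cumulative* score seen so far.
struct Stage {
    float (*score)(const float* window);  // e.g. a single weak classifier
    float rejectBelow;                    // reject if the running total drops below this
};

// Evaluate the cascade on one scanning window. Unlike a hard cascade, every
// stage tests the accumulated score, so no earlier evidence is discarded, and
// most negative windows exit within the first few stages.
bool softCascadeDetect(const std::vector<Stage>& stages, const float* window) {
    float total = 0.0f;
    for (const Stage& s : stages) {
        total += s.score(window);
        if (total < s.rejectBelow)  // early exit on unpromising windows
            return false;
    }
    return true;  // survived all stages: report a detection
}
```

    Recalibrating only the per-stage thresholds moves the detector to a different point on the speed/accuracy surface without retraining the stages.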

    A colleague of mine, Jonathan Brandt, was investigating the (at the time) state-of-the-art Viola-Jones face detector, which had very good performance and accuracy. I decided to apply the Soft Cascade to the VJ detector and, to my delight, the resulting system was both faster and more accurate. It also has numerous advantages: it retains information that the "hard cascade" throws away, so the detector is less "brittle" and generalizes better; the speed/accuracy tradeoff is not hard-coded during training but can be specified afterwards; and the new framework allows for augmenting the operational domain of an existing detector. For example, we could improve an existing detector to handle, say, wider out-of-plane rotation. My colleague observed that the ability of the Soft Cascade to be quickly calibrated for a specific point in the speed/accuracy space allows us to explore the operational domain of the detector not just along the detection rate and false positive rate, but also along the speed dimension. As far as I know, our CVPR paper was the first to describe the ROC surface of an object detector.

    My face detector was first deployed in the face tagging feature of Photoshop Elements 4, where it received positive reviews. It wouldn't have happened without the help of Claire Schendel, a Photoshop engineer who integrated the feature into the product. As far as I know, Photoshop Elements 4, in 2005, was the first consumer application to use face detection. Face detection started appearing in cameras shortly after that.

    In 2007 I started developing a system that uses face detection combined with face recognition, leveraging context, such as the fact that the same people on the same day tend to wear the same clothes. This was a research project for my class at UC Berkeley, which formed the basis of the People Recognition feature in Photoshop Elements 8. My Adobe colleague Alex Parenteau and I developed the core engine, using an external face recognizer, and we collaborated with the Elements engineering team. The big engineering challenge we addressed was scalability - the ability to extend the technology to very large albums with limited memory.

  • Boost GIL

    Generic Image Library as part of the Boost C++ libraries

    While at Adobe I was fortunate to have Alex Stepanov, the main guy behind the STL, as my colleague. He led a class on Generic Programming, which was an inspiration to all of us. Generic Programming is exciting because it allows for abstraction with no loss in performance. I have been collaborating with Prof. Järvi from Texas A&M on a method for applying generic programming to create C++ code that is generic, efficient and run-time flexible, without incurring unnecessary code bloat. Here is our LCSD paper and my presentation slides. Our approach achieves the specified goals, but it comes with trade-offs, notably weaker type safety.
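
    The flavor of the dispatching approach can be sketched in modern C++ (a sketch using std::variant for brevity; the mechanism we actually used predates it and is more elaborate, and all names here are hypothetical): the generic algorithm is instantiated for a bounded set of types, and a run-time tag selects the right instantiation.

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// A generic algorithm, instantiated at compile time for each supported type...
template <typename Pixel>
std::size_t countNonZero(const std::vector<Pixel>& plane) {
    std::size_t n = 0;
    for (Pixel p : plane)
        if (p != Pixel{}) ++n;
    return n;
}

// ...and a run-time tag (here a variant) that selects among the pre-built
// instantiations. The inner loops stay fully typed and fast; only the outer
// dispatch is resolved at run time. Keeping the instantiation set bounded is
// what keeps code bloat under control.
using AnyPlane = std::variant<std::vector<std::uint8_t>,
                              std::vector<std::uint16_t>,
                              std::vector<float>>;

std::size_t countNonZeroAny(const AnyPlane& plane) {
    return std::visit([](const auto& v) { return countNonZero(v); }, plane);
}
```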

    One excellent application of generic programming is to abstract away the image representation, allowing us to write generic image processing algorithms that work efficiently with images of any color space, channel ordering, channel depth, and pixel representation. This is the goal of my Generic Image Library (GIL), a C++ library I created together with my former colleague Hailin Jin. GIL is an open-source library, now part of the popular Boost libraries, and it is used by dozens of institutions. Here is a video tutorial I prepared to give an overview of GIL.
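
    As a rough illustration of what writing against GIL's view interface looks like (a sketch assuming a modern Boost.GIL, not code taken from the library), the same algorithm below runs unchanged on an 8-bit RGB image and a 32-bit float grayscale image:

```cpp
#include <boost/gil.hpp>  // assumes modern Boost.GIL
namespace gil = boost::gil;

// Written once against GIL's View concept, this works for any color space,
// channel depth, channel ordering or pixel representation, with no virtual
// dispatch inside the loops.
template <typename View>
void halveBrightness(const View& v) {
    gil::for_each_pixel(v, [](auto& px) {
        gil::static_for_each(px, [](auto& channel) { channel = channel / 2; });
    });
}

int main() {
    gil::rgb8_image_t rgb(64, 64);      // 8-bit interleaved RGB
    gil::gray32f_image_t gray(64, 64);  // 32-bit float grayscale
    halveBrightness(gil::view(rgb));    // the same algorithm serves
    halveBrightness(gil::view(gray));   // two very different image types
}
```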

    It was a great honor for me to receive an invitation from Prof. Bjarne Stroustrup, the creator of C++, to give a talk about GIL at his institution.
  • Auto-Fill in Adobe Acrobat

    The engine to auto-fill forms in Acrobat based on history and form structure

    Have you ever applied for a mortgage? After going through the experience of filling out a billion forms, with the same information over and over again, I decided I had had enough and started thinking about ways of simplifying the form filling experience. I created a probabilistic framework that can suggest suitable defaults for form entries. It observes your entry patterns, learns from experience and is able to extrapolate the results to previously unseen forms. When it is fairly confident in the result, it can populate the field once you tab into it. It is now used by Adobe Acrobat to streamline the form filling process. I think it is also in the free Acrobat Reader (you need to enable it from the preferences menu). Thanks to Alex Mohr, an Acrobat engineer, for integrating my engine into the product.
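
    A toy sketch of the core idea (hypothetical names; the real engine is probabilistic and also generalizes to unseen forms via field similarity, which this sketch omits): suggest a default only when one historical value for a field clearly dominates.

```cpp
#include <map>
#include <optional>
#include <string>

// History of what the user typed into fields with a given (normalized) name:
// field name -> (value -> how many times it was entered).
std::map<std::string, std::map<std::string, int>> history;

void recordEntry(const std::string& field, const std::string& value) {
    ++history[field][value];
}

// Suggest a default only when one value dominates the field's history,
// i.e. when its estimated probability clears a confidence threshold.
std::optional<std::string> suggest(const std::string& field,
                                   double minConfidence = 0.8) {
    auto it = history.find(field);
    if (it == history.end()) return std::nullopt;
    int total = 0, best = 0;
    std::string bestValue;
    for (const auto& [value, count] : it->second) {
        total += count;
        if (count > best) { best = count; bestValue = value; }
    }
    if (total > 0 && static_cast<double>(best) / total >= minConfidence)
        return bestValue;
    return std::nullopt;  // not confident enough to pre-fill
}
```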

  • Symbolism Tools in Adobe Illustrator

    Tools that allow for artistic effects, such as drawing organic shapes and pen-and-ink illustrations

    Vector graphics applications like Adobe Illustrator have been used to create some amazing art, but we have only scratched the surface of what computers can do. By building some intelligence into the tools, we could enable a new generation of art that would be too time consuming to generate and edit by hand. This idea inspired me to create the Symbolism tools, a suite of tools in Illustrator that allow for scattering, moving, "combing", coloring and applying styles to a collection of graphical symbols. These tools can be used for a variety of content, such as hair, organic shapes and pen-and-ink shading. I use a particle system to guide the behavior of the tools. My manager Martin Newell gave me some insightful ideas for the underlying technology. I designed, prototyped, performance-optimized and integrated the feature into Illustrator. Here is some sample art created by these tools. The Symbolism tools have received outstanding reviews.
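
    A minimal sketch of how a particle system can drive such a tool (illustrative only; all names are hypothetical): the "combing" brush eases nearby symbol instances toward the stroke direction, with falloff at the brush edge.

```cpp
#include <cmath>
#include <vector>

// Each placed symbol instance is a particle with position, rotation and scale.
struct SymbolParticle { float x, y, angle, scale; };

// A "combing" brush: particles within the brush radius are rotated toward the
// stroke direction, with linear falloff so the effect fades at the brush edge.
// (Angle wrap-around is ignored for brevity.)
void combSymbols(std::vector<SymbolParticle>& particles,
                 float brushX, float brushY, float radius, float strokeAngle) {
    for (auto& p : particles) {
        float dx = p.x - brushX, dy = p.y - brushY;
        float d = std::sqrt(dx * dx + dy * dy);
        if (d >= radius) continue;                 // outside the brush
        float w = 1.0f - d / radius;               // falloff weight in [0, 1]
        p.angle += w * (strokeAngle - p.angle);    // ease toward stroke direction
    }
}
```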

  • Adobe's Transparency Flattener

    The engine that Adobe uses to print vector documents containing transparency

    When I joined Adobe in 1998, the big company initiative was introducing transparency in the vector graphics products. Transparency can be used to represent a dazzling range of effects: see-through objects, lens effects, soft clips, drop shadows... However, the biggest technical challenge was the ability to print vector graphics with transparent elements. Adobe PostScript, the universal language of printers, does not support transparency. There were two options for printing: rasterizing into an image and printing the image, or making an opaque illustration that looks just like a transparent one by subdividing the illustration into pieces (planar mapping) and drawing them with the appropriate color, as the illustration shows. Planar mapping results in higher quality printing as it remains resolution independent. However, it is easy to create vector art for which planar mapping results in many thousands of small pieces, some smaller than a pixel. Planar mapping in those cases would be unacceptably slow, and rasterization would be the only option.

    But how do we know if certain parts of the document are going to result in unacceptably many pieces, without actually computing the planar map? It is a chicken-and-egg problem. I invented an algorithm that quickly estimates which areas of the document need to be rasterized and which can be planar mapped. Also, planar mapping is a complex operation, but we can often get by without it in the places of the document that are not involved in transparency. But how do you know if an object is involved in transparency without checking whether it intersects a transparent object, i.e. without computing the planar map? Another chicken-and-egg problem. I created an algorithm to analyze the document and determine which objects need to be included in the planar map, and then interleave the results of planar mapping to generate the final document.

    These are just a few examples of the problems I needed to resolve in the flattener. Others include how to preserve native type through planar mapping, how to preserve native gradients and gradient meshes, how to support spot color planes, how to avoid stitching problems when dealing with strokes, how to preserve patterns, how to deal with overprint, how to schedule the color computations to avoid doing them repeatedly, how to design the system so that it performs in a single pass (the output may be too big to keep in memory), and how to make sure it is fail-safe - i.e. if it runs out of memory, it should fall back, break the problem into smaller pieces and attempt it again...
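
    A deliberately simplified sketch of the second idea (hypothetical names and conservative bounding-box logic only; the real analysis is much finer-grained): start from the transparent objects and propagate "involved in transparency" to anything whose bounds overlap an involved object.

```cpp
#include <vector>

struct Box { float x0, y0, x1, y1; };
struct Object { Box bounds; bool transparent; bool needsPlanarMap = false; };

bool overlaps(const Box& a, const Box& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Conservative pre-pass: transparent art is always involved; opaque art whose
// bounds overlap involved art becomes involved too. Everything left unmarked
// can bypass planar mapping entirely, avoiding the chicken-and-egg cost of
// computing the full planar map just to decide who needs it.
void markInvolvedInTransparency(std::vector<Object>& objs) {
    for (auto& o : objs)
        o.needsPlanarMap = o.transparent;
    bool changed = true;
    while (changed) {                   // propagate to a fixed point
        changed = false;
        for (auto& a : objs) {
            if (a.needsPlanarMap) continue;
            for (const auto& b : objs) {
                if (&a != &b && b.needsPlanarMap && overlaps(a.bounds, b.bounds)) {
                    a.needsPlanarMap = true;
                    changed = true;
                    break;
                }
            }
        }
    }
}
```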

    I single-handedly designed and implemented the entire flattener module - the system that takes a vector graphics document containing transparency and outputs one that is visually equivalent but contains no transparency. (That does not include the planar mapping code, implemented by my colleague Steve Schiller.) The flattener is now used by many of Adobe's vector products, including Illustrator, Acrobat and InDesign. It is used when printing and exporting to various formats, and Adobe also licenses it to other companies such as Kodak. The flattener has also been ported into high-end PDF printers (RIPs). This was one of the largest and most complex projects I have ever done. It is also very widely used - not just for printing labels and posters, but also for big titles like the cover of Glamour magazine.

Publications

PIM: Video Coding using Perceptual Importance Maps

Evgenya Pergament, Pulkit Tandon, Oren Rippel, Lubomir Bourdev, Alexander G. Anderson, Bruno Olshausen, Tsachy Weissman, Sachin Katti, Kedar Tatwawadi
arXiv 2022

Human perception is at the core of lossy video compression, with numerous approaches developed for perceptual quality assessment and improvement over the past two decades. In the determination of perceptual quality, different spatio-temporal regions of the video differ in their relative importance to the human viewer. However, since it is challenging to infer or even collect such fine-grained information, it is often not used during compression beyond low-level heuristics. We present a framework which facilitates research into fine-grained subjective importance in compressed videos, which we then utilize to improve the rate-distortion performance of an existing video codec (x264). The contributions of this work are threefold: (1) we introduce a web-tool which allows scalable collection of fine-grained perceptual importance, by having users interactively paint spatio-temporal maps over encoded videos; (2) we use this tool to collect a dataset with 178 videos with a total of 14443 frames of human annotated spatio-temporal importance maps over the videos; and (3) we use our curated dataset to train a lightweight machine learning model which can predict these spatio-temporal importance regions. We demonstrate via a subjective study that encoding the videos in our dataset while taking into account the importance maps leads to higher perceptual quality at the same bitrate, with the videos encoded with importance maps preferred 2.1× over the baseline videos. Similarly, we show that for the 18 videos in the test set, the importance maps predicted by our model lead to higher perceptual quality videos, 2× preferred over the baseline at the same bitrate.

An Interactive Annotation Tool for Perceptual Video Compression

Evgenya Pergament, Pulkit Tandon, Kedar Tatwawadi, Oren Rippel, Lubomir Bourdev, Bruno Olshausen, Tsachy Weissman, Sachin Katti, Alexander G Anderson
International Conference on Quality of Multimedia Experience (QoMEX 2022)

Human perception is at the core of lossy video compression and yet, it is challenging to collect data that is sufficiently dense to drive compression. In perceptual quality assessment, human feedback is typically collected as a single scalar quality score indicating preference of one distorted video over another. In reality, some videos may be better in some parts but not in others. We propose an approach to collecting finer-grained feedback by asking users to use an interactive tool to directly optimize for perceptual quality given a fixed bitrate. To this end, we built a novel web-tool which allows users to paint these spatio-temporal importance maps over videos. The tool allows for interactive successive refinement: we iteratively re-encode the original video according to the painted importance maps, while maintaining the same bitrate, thus allowing the user to visually see the trade-off of assigning higher importance to one spatio-temporal part of the video at the cost of others. We use this tool to collect data in-the-wild (10 videos, 17 users) and utilize the obtained importance maps in the context of x264 coding to demonstrate through a subjective study that the tool can indeed be used to generate videos which, at the same bitrate, look perceptually better and are 1.9 times more likely to be preferred by viewers. The code for the tool and dataset can be found at https://github.com/jenyap/video-annotation-tool.git

ELF-VC: Efficient Learned Flexible-Rate Video Coding

Oren Rippel, Alexander G. Anderson, Kedar Tatwawadi, Sanjay Nair, Craig Lytle, Lubomir Bourdev
International Conference on Computer Vision (ICCV 2021)

While learned video codecs have demonstrated great promise, they have yet to achieve sufficient efficiency for practical deployment. In this work, we propose several novel ideas for learned video compression which allow for improved performance for the low-latency mode (I- and P-frames only) along with a considerable increase in computational efficiency. In this setting, for natural videos our approach compares favorably across the entire R-D curve under metrics PSNR, MS-SSIM and VMAF against all mainstream video standards (H.264, H.265, AV1) and all ML codecs. At the same time, our approach runs at least 5x faster and has fewer parameters than all ML codecs which report these figures. Our contributions include a flexible-rate framework allowing a single model to cover a large and dense range of bitrates, at a negligible increase in computation and parameter count; an efficient backbone optimized for ML-based codecs; and a novel in-loop flow prediction scheme which leverages prior information towards more efficient compression. [Figure 1: BD-rate for ML-based codecs relative to AV1 as a function of encode/decode time on HD 1080 videos (UVG dataset, PSNR metric). Our approach reduces the BD-rate by 54% relative to the current fastest ML codec which reports speed, while running 5x faster.] We benchmark our method, which we call ELF-VC (Efficient, Learned and Flexible Video Coding), on popular video test sets UVG and MCL-JCV under metrics PSNR, MS-SSIM and VMAF. For example, on UVG under PSNR, it reduces the BD-rate by 44% against H.264, 26% against H.265, 15% against AV1, and 35% against the current best ML codec. At the same time, on an NVIDIA Titan V GPU our approach encodes/decodes VGA at 49/91 FPS, HD 720 at 19/35 FPS, and HD 1080 at 10/18 FPS.

Learned Video Compression

Oren Rippel, Sanjay Nair, Carissa Lew, Steve Branson, Alexander G. Anderson, Lubomir Bourdev
International Conference on Computer Vision (ICCV 2019)

We present a new algorithm for video coding, learned end-to-end for the low-latency mode. In this setting, our approach outperforms all existing video codecs across nearly the entire bitrate range. To our knowledge, this is the first ML-based method to do so. We evaluate our approach on standard video compression test sets of varying resolutions, and benchmark against all mainstream commercial codecs, in the low-latency mode. On standard-definition videos, relative to our algorithm, HEVC/H.265, AVC/H.264 and VP9 typically produce codes up to 60% larger. On high-definition 1080p videos, H.265 and VP9 typically produce codes up to 20% larger, and H.264 up to 35% larger. Furthermore, our approach does not suffer from blocking artifacts and pixelation, and thus produces videos that are more visually pleasing. We propose two main contributions. The first is a novel architecture for video compression, which (1) generalizes motion estimation to perform any learned compensation beyond simple translations, (2) rather than strictly relying on previously transmitted reference frames, maintains a state of arbitrary information learned by the model, and (3) enables jointly compressing all transmitted signals (such as optical flow and residual). Secondly, we present a framework for ML-based spatial rate control — a mechanism for assigning variable bitrates across space for each frame. This is a critical component for video coding, which to our knowledge had not been developed within a machine learning setting.

Real-Time Adaptive Image Compression

Oren Rippel and Lubomir Bourdev
International Conference on Machine Learning (ICML 2017)

We present a machine learning-based approach to lossy image compression which outperforms all existing codecs, while running in real-time. Our algorithm typically produces files 2.5 times smaller than JPEG and JPEG 2000, 2 times smaller than WebP, and 1.7 times smaller than BPG on datasets of generic images across all quality levels. At the same time, our codec is designed to be lightweight and deployable: for example, it can encode or decode the Kodak dataset in around 10ms per image on GPU. Our architecture is an autoencoder featuring pyramidal analysis, an adaptive coding module, and regularization of the expected codelength. We also supplement our approach with adversarial training specialized towards use in a compression setting: this enables us to produce visually pleasing reconstructions for very low bitrates.

ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks

Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, Lubomir Bourdev
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016)

This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations. It is challenging as objects in the wild could appear at arbitrary locations and in different scales. We propose a novel classification architecture ProNet based on convolutional neural networks. It uses computationally efficient neural networks to propose image regions that are likely to contain objects, and applies more powerful but slower networks on the proposed regions. The basic building block is a multi-scale fully-convolutional network which assigns object confidence scores to boxes at different locations and scales. We show that such networks can be trained effectively using image-level annotations, and can be connected into cascades or trees for efficient object classification. ProNet outperforms previous state-of-the-art on PASCAL VOC 2012 and MS COCO datasets for object classification and point-based localization.

Deep End2End Voxel2Voxel Prediction

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Manohar Paluri
The 3rd Workshop on Deep Learning in Computer Vision (in CVPR 2016)

Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis. However, so far their most successful applications have been in the area of video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while deep networks are commonly recognized as the best models to use in these domains, there is a widespread perception that in order to yield successful results they often require time-consuming architecture search, manual tweaking of parameters and computationally intensive preprocessing or post-processing methods. In this paper we challenge these views by presenting a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, i.e., to output a variable at every voxel of the video. Most importantly, we show that the same exact architecture can be used to achieve competitive results on three widely different voxel-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring. The three networks learned on these problems are trained from raw video without any form of preprocessing and their outputs do not require post-processing to achieve outstanding performance. Thus, they offer an efficient alternative to traditional and much more computationally expensive methods in these video domains.

Metric Learning with Adaptive Density Discrimination

Oren Rippel, Manohar Paluri, Piotr Dollar, Lubomir Bourdev
International Conference on Learning Representations (ICLR 2016)

Distance metric learning (DML) approaches learn a transformation to a representation space where distance is in correspondence with a predefined notion of similarity. While such models offer a number of compelling benefits, it has been difficult for these to compete with modern classification algorithms in performance and even in feature extraction. In this work, we propose a novel approach explicitly designed to address a number of subtle yet important issues which have stymied earlier DML algorithms. It maintains an explicit model of the distributions of the different classes in representation space. It then employs this knowledge to adaptively assess similarity, and achieve local discrimination by penalizing class distribution overlap. We demonstrate the effectiveness of this idea on several tasks. Our approach achieves state-of-the-art classification results on a number of fine-grained visual recognition datasets, surpassing the standard softmax classifier and outperforming triplet loss by a relative margin of 30-40%. In terms of computational performance, it alleviates training inefficiencies in the traditional triplet loss, reaching the same error in 5-30 times fewer iterations. Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.

Improving Image Classification with Location Context

Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
International Conference on Computer Vision (ICCV 2015)

With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
International Conference on Computer Vision (ICCV 2015)

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

Beyond Frontal Faces: Improving Person Recognition Using Multiple Cues

Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, Lubomir Bourdev
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)

We explore the task of recognizing peoples' identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60000 instances of over 2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

Web-Scale Photo Hash Clustering on a Single Machine

Yunchao Gong, Marcin Pawlowski, Fei Yang, Louis Brandy, Lubomir Bourdev and Rob Fergus
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)

This paper addresses the problem of clustering a very large number of photos (i.e. hundreds of millions a day) in a stream into millions of clusters. This is particularly important given the popularity of photo sharing websites such as Facebook, Google, and Instagram. Given the large number of photos available online, how to efficiently organize them is an open problem. To address this problem, we propose to cluster the binary hash codes of a large number of photos into binary cluster centers. We present a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which we can build hash indexes to speed up computation. The proposed method is capable of clustering millions of photos on a single machine in a few minutes. We show that this approach is usually several orders of magnitude faster than standard k-means and produces comparable clustering accuracy. In addition, we propose an online clustering method based on binary k-means that is capable of clustering large photo streams on a single machine, and show applications to spam detection and trending photo discovery.

Training Convolutional Networks with Noisy Labels

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, Rob Fergus
International Conference on Learning Representations, Workshop Paper (ICLR 2015)

The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick and Piotr Dollar
arXiv 2015

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

PANDA: Pose Aligned Networks for Deep Attribute Modeling

Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, Lubomir Bourdev
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014)

We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by flat low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.

Hierarchical Cascade of Classifiers for Efficient Poselet Evaluation

David Bo Chen, Pietro Perona, and Lubomir Bourdev
British Machine Vision Conference (BMVC 2014)

Poselets have been used in a variety of computer vision tasks, such as detection, segmentation, action classification, pose estimation and action recognition, often achieving state-of-the-art performance. Poselet evaluation, however, is computationally intensive as it involves running thousands of scanning window classifiers. We present an algorithm for training a hierarchical cascade of part-based detectors and apply it to speed up poselet evaluation. Our cascade hierarchy leverages common components shared across poselets. We generate a family of cascade hierarchies, including trees that grow logarithmically on the number of poselet classifiers. Our algorithm, under some reasonable assumptions, finds the optimal tree structure that maximizes speed for a given target detection rate. We test our system on the PASCAL dataset and show an order of magnitude speedup at less than 1% loss in AP.

Deep Poselets for Human Detection

Lubomir Bourdev, Fei Yang, Rob Fergus
arXiv 2014

We address the problem of detecting people in natural scenes using a part approach based on poselets. We propose a bootstrapping method that allows us to collect millions of weakly labeled examples for each poselet type. We use these examples to train a Convolutional Neural Net to discriminate different poselet types and separate them from the background class. We then use the trained CNN as a way to represent poselet patches with a Pose Discriminative Feature (PDF) vector -- a compact 256-dimensional feature vector that is effective at discriminating pose from appearance. We train the poselet model on top of PDF features and combine them with object-level CNNs for detection and bounding box prediction. The resulting model leads to state-of-the-art performance for human detection on the PASCAL datasets.

Articulated Pose Estimation using Discriminative Armlet Classifiers

Georgia Gkioxari, Pablo Arbelaez, Lubomir Bourdev and Jitendra Malik
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013)

We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan, the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.

Interactive Facial Feature Localization

Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas Huang
European Conference on Computer Vision (ECCV 2012)

We address the problem of interactive facial feature localization from a single image. Our goal is to obtain an accurate segmentation of facial features on high-resolution images under a variety of pose, expression, and lighting conditions. Although there has been significant work in facial feature localization, we are addressing a new application area, namely to facilitate intelligent high-quality editing of portraits, that brings requirements not met by existing methods. We propose an improvement to the Active Shape Model that allows for greater independence among the facial components and improves on the appearance fitting step by introducing a Viterbi optimization process that operates along the facial contours. Despite the improvements, we do not expect perfect results in all cases. We therefore introduce an interaction model whereby a user can efficiently guide the algorithm towards a precise solution. We introduce the Helen Facial Feature Dataset consisting of annotated portrait images gathered from Flickr that are more diverse and challenging than currently existing datasets. We present experiments that compare our automatic method to published results, and also a quantitative evaluation of the effectiveness of our interactive method.

Semantic Segmentation using Regions and Parts

Pablo Arbeláez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012)

We address the problem of segmenting and recognizing objects in real world images, focusing on challenging articulated categories such as humans and other animals. For this purpose, we propose a novel design for region-based object detectors that integrates efficiently top-down information from scanning-windows part models and global appearance cues. Our detectors produce class-specific scores for bottom-up regions, and then aggregate the votes of multiple overlapping candidates through pixel classification. We evaluate our approach on the PASCAL segmentation challenge, and report competitive performance with respect to current leading techniques. On VOC2010, our method obtains the best results in 6/20 categories and the highest performance on articulated objects.

Facial Expression Editing in Video Using a Temporally-Smooth Factorization

Fei Yang, Lubomir Bourdev, Eli Shechtman, Jue Wang and Dimitri Metaxas
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012)

We address the problem of editing facial expression in video, such as exaggerating, attenuating or replacing the expression with a different one in some parts of the video. To achieve this we develop a tensor-based 3D face geometry reconstruction method, which fits a 3D model for each video frame, with the constraint that all models have the same identity and requiring temporal continuity of pose and expression. With the identity constraint, the differences between the underlying 3D shapes capture only changes in expression and pose. We show that various expression editing tasks in video can be achieved by combining face reordering with face warping, where the warp is induced by projecting differences in 3D face shapes into the image plane. Analogously, we show how the identity can be manipulated while fixing expression and pose. Experimental results show that our method can effectively edit expressions and identity in video in a temporally-coherent way with high fidelity.

Urban Tribes: Analyzing Group Photos from a Social Perspective

Ana Murillo, Iljung Kwak, Lubomir Bourdev, David Kriegman and Serge Belongie
CVPR 2012 Workshop on Socially Intelligent Surveillance and Monitoring

The explosive growth in image sharing via social networks has produced exciting opportunities for the computer vision community in areas including face, text, product and scene recognition. In this work we turn our attention to group photos of people and ask the question: what can we determine about the social subculture or urban tribe to which these people belong? To this end, we propose a framework employing low- and mid-level features to capture the visual attributes distinctive to a variety of urban tribes. We proceed in a semi-supervised manner, employing a metric that allows us to extrapolate from a small number of pairwise image similarities to induce a set of groups that visually correspond to familiar urban tribes such as biker, hipster or goth. Automatic recognition of such information in group photos offers the potential to improve recommendation services, context sensitive advertising and other social analysis applications. We present promising preliminary experimental results that demonstrate our ability to categorize group photos in a socially meaningful manner.

Face Morphing using 3D-Aware Appearance Optimization

Fei Yang, Eli Shechtman, Jue Wang, Lubomir Bourdev, Dimitris Metaxas
Graphics Interface (GI 2012)


Describing People: A Poselet-Based Approach to Attribute Classification

Lubomir Bourdev, Subhransu Maji, Jitendra Malik
International Conference on Computer Vision (ICCV 2011)

We propose a method for recognizing attributes, such as the gender, hair style and types of clothes of people under large variation in viewpoint, pose, articulation and occlusion typical of personal photo album images. Robust attribute classifiers under such conditions must be invariant to pose, but inferring the pose in itself is a challenging problem. We use a part-based approach based on poselets. Our parts implicitly decompose the aspect (the pose and viewpoint). We train attribute classifiers for each such aspect and we combine them together in a discriminative model. We propose a new dataset of 8000 people with annotated attributes. Our method performs very well on this dataset, significantly outperforming a baseline built on the spatial pyramid match kernel method. On gender recognition we outperform a commercial face recognition system.

Semantic Contours from Inverse Detectors

Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji and Jitendra Malik
International Conference on Computer Vision (ICCV 2011)

We study the challenging problem of localizing and classifying category-specific object contours in real world images. For this purpose, we present a simple yet effective method for combining generic object detectors with bottom-up contours to identify object contours. We also provide a principled way of combining information from different part detectors and across categories. In order to study the problem and evaluate quantitatively our approach, we present a dataset of semantic exterior boundaries on more than 20,000 object instances belonging to 20 categories, using the images from the VOC2011 PASCAL challenge.

Pause-and-play: Automatically Linking Screencast Video Tutorials with Applications

Suporn Pongnumkul, Mira Doncheva, Wil Li, Lubomir Bourdev, Shai Avidan, Jue Wang and Michael Cohen
ACM Symposium on User Interface Software and Technology (UIST 2011)

Video tutorials provide a convenient means for novices to learn new software applications. Unfortunately, staying in sync with a video while trying to use the target application at the same time requires users to repeatedly switch from the application to the video to pause or scrub backwards to replay missed steps. We present Pause-and-Play, a system that helps users work along with existing video tutorials. Pause-and-Play detects important events in the video and links them with corresponding events in the target application as the user tries to replicate the depicted procedure. This linking allows our system to automatically pause and play the video to stay in sync with the user. Pause-and-Play also supports convenient video navigation controls that are accessible from within the target application and allow the user to easily replay portions of the video without switching focus out of the application. Finally, since our system uses computer vision to detect events in existing videos and leverages application scripting APIs to obtain real time usage traces, our approach is largely independent of the specific target application and does not require access or modifications to application source code. We have implemented Pause-and-Play for two target applications, Google SketchUp and Adobe Photoshop, and we report on a user study that shows our system improves the user experience of working with video tutorials.

Expression Flow for 3D-Aware Face Component Transfer

Fei Yang, Jue Wang, Eli Shechtman, Lubomir Bourdev and Dimitris Metaxas
ACM Transactions on Graphics (SIGGRAPH 2011)

We address the problem of correcting an undesirable expression on a face photo by transferring local facial components, such as a smiling mouth, from another face photo of the same person which has the desired expression. Direct copying and blending using existing compositing tools results in semantically unnatural composites, since expression is a global effect and the local component in one expression is often incompatible with the shape and other components of the face in another expression. To solve this problem we present Expression Flow, a 2D flow field which can warp the target face globally in a natural way, so that the warped face is compatible with the new facial component to be copied over. To do this, starting with the two input face photos, we jointly construct a pair of 3D face shapes with the same identity but different expressions. The expression flow is computed by projecting the difference between the two 3D shapes back to 2D. It describes how to warp the target face photo to match the expression of the reference photo. User studies suggest that our system is able to generate face composites with much higher fidelity than existing methods.

Action Recognition from a Distributed Representation of Pose and Appearance

Subhransu Maji, Lubomir Bourdev, Jitendra Malik
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011)

We present a distributed representation of pose and appearance of people called the “poselet activation vector”. First we show that this representation can be used to estimate the pose of people defined by the 3D orientations of the head and torso in the challenging PASCAL VOC 2010 person detection dataset. Our method is robust to clutter, aspect and viewpoint variation and works even when body parts like faces and limbs are occluded or hard to localize. We combine this representation with other sources of information like interaction with objects and other people in the image and use it for action recognition. We report competitive results on the PASCAL VOC 2010 static image action classification challenge.

Object Segmentation by Alignment of Poselet Activations to Image Contours

Thomas Brox, Lubomir Bourdev, Subhransu Maji and Jitendra Malik
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011)

In this paper, we propose techniques to make use of two complementary bottom-up features, image edges and texture patches, to guide top-down object segmentation towards higher precision. We build upon the part-based poselet detector, which can predict masks for numerous parts of an object. For this purpose we extend poselets to 19 other categories apart from person. We non-rigidly align these part detections to potential object contours in the image, both to increase the precision of the predicted object mask and to sort out false positives. We spatially aggregate object information via a variational smoothing technique while ensuring that object regions do not overlap. Finally, we propose to refine the segmentation based on self-similarity defined on small image patches. We obtain competitive results on the challenging Pascal VOC benchmark. On four classes we achieve the best numbers to date.

Poselets and Their Applications in High-Level Computer Vision

Lubomir Bourdev
Ph.D. Thesis, University of California at Berkeley, 2011

We address the classic problems of detection and segmentation using a part based detector that operates on a novel part, which we refer to as a poselet. Poselets are tightly clustered in both appearance space (and thus are easy to detect) as well as in configuration space (and thus are helpful for localization and segmentation). We demonstrate poselets are effective for detection, pose extraction, segmentation, action/pose estimation and attribute classification. Poselet construction requires extra annotations beyond the object bounds. To train poselets we have created H3D (Humans in 3D) - a dataset of 1200+ person annotations. The annotations include the joints, the extracted 3D pose, keypoint visibility and region labels. We have also annotated the people in the training and validation sets of PASCAL VOC 2009. Our poselet classifier achieves state-of-the-art results for the person category on PASCAL VOC 2007, 2008, 2009 and 2010 as well as on our dataset, H3D.

Detecting People Using Mutually Consistent Poselet Activations

Lubomir Bourdev, Subhransu Maji, Thomas Brox, Jitendra Malik
European Conference on Computer Vision (ECCV 2010)

Bourdev and Malik (ICCV 09) introduced a new notion of parts, poselets, constructed to be tightly clustered both in the configuration space of keypoints, as well as in the appearance space of image patches. In this paper we develop a new algorithm for detecting people using poselets. Unlike that work which used 3D annotations of keypoints, we use only 2D annotations which are much easier for naive human annotators. The main algorithmic contribution is in how we use the pattern of poselet activations. Individual poselet activations are noisy, but considering the spatial context of each can provide vital disambiguating information, just as object detection can be improved by considering the detection scores of nearby objects in the scene. This can be done by training a two-layer feed-forward network with weights set using a max margin technique. The refined poselet activations are then clustered into mutually consistent hypotheses where consistency is based on empirically determined spatial keypoint distributions. Finally, bounding boxes are predicted for each person hypothesis and shape masks are aligned to edges in the image to provide a segmentation. To the best of our knowledge, the resulting system is the current best performer on the task of people detection and segmentation with an average precision of 47.8% and 40.5% respectively on PASCAL VOC 2009.

Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations

Lubomir Bourdev and Jitendra Malik
International Conference on Computer Vision (ICCV 2009)

We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria (1) It should be easy to find a poselet given an input image (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state of the art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.

Generic Image Library

Lubomir Bourdev
Software Developer's Journal, 2007

The Generic Image Library (GIL) is a C++ image library sponsored by Adobe Systems, Inc. and developed by Lubomir Bourdev and Hailin Jin. It is an open-source library, planned for inclusion in Boost 1.35.0. GIL is also part of the Adobe Source Libraries and is used in several Adobe projects, including some new features in Photoshop CS4.
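
For flavor, here is a small, hedged example of the library's central idea: algorithms are written once against image views and work across pixel formats. It assumes a modern Boost where GIL lives under <boost/gil.hpp>; older releases used <boost/gil/gil_all.hpp>.

```cpp
// Minimal Boost.GIL example: an algorithm runs on a lazily color-converted
// grayscale *view* of RGB data, without allocating a second image.
#include <boost/gil.hpp>
#include <cstdio>
namespace gil = boost::gil;

int main() {
    gil::rgb8_image_t img(64, 48);  // 8-bit interleaved RGB image
    gil::fill_pixels(gil::view(img), gil::rgb8_pixel_t(200, 120, 40));

    // A grayscale view of the same pixels; conversion happens on access.
    auto gray =
        gil::color_converted_view<gil::gray8_pixel_t>(gil::const_view(img));

    long sum = 0;
    gil::for_each_pixel(gray, [&sum](gil::gray8_pixel_t p) { sum += p[0]; });
    std::printf("mean luminance: %ld\n", sum / (64 * 48));
}
```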

Efficient run-time dispatching in generic programming with minimal code bloat

Lubomir Bourdev and Jaakko Järvi
Science of Computer Programming, 2010

Generic programming with C++ templates results in efficient but inflexible code: efficient, because the exact types of inputs to generic functions are known at compile time; inflexible, because they must be known at compile time. We show how to achieve run-time polymorphism without compromising performance by instantiating the generic algorithm with a comprehensive set of possible parameter types and choosing the appropriate instantiation at run time. Applied naively, this approach can result in excessive template bloat: a large number of template instantiations, many of which are identical at the assembly level. We show practical examples of this approach quickly reaching the limits of the compiler. Consequently, we combine this method of run-time polymorphism for generic programming with a strategy for reducing the number of necessary template instantiations. We report on using our approach in GIL, Adobe's open-source Generic Image Library, where we observed a notable reduction, at times up to 70%, in the executable sizes of our test programs. This was the case even with compilers that perform aggressive template hoisting, due to the significantly smaller dispatching code. The framework draws from both the generic and generative programming paradigms, using static metaprogramming to fine-tune the compilation of a generic library. Our test bed, GIL, is deployed in a real-world industrial setting, where code size is often an important factor.
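
The core pattern can be sketched in a few lines of modern C++. This is not the paper's machinery (the paper predates C++17 and builds its own dispatcher; GIL exposes the idea as any_image and apply_operation), but std::variant exhibits the same structure: a generic algorithm is instantiated over a closed set of types, and the right instantiation is selected at run time. The types and functions below are invented for illustration.

```cpp
// Run-time polymorphism over compile-time generic code via std::variant.
#include <cstdio>
#include <variant>
#include <vector>

struct Gray8 { unsigned char v; };
struct RGB8  { unsigned char r, g, b; };

inline unsigned luminance(Gray8 p) { return p.v; }
inline unsigned luminance(RGB8 p)  { return (p.r + p.g + p.b) / 3; }

// The generic algorithm, written once against any pixel type.
template <class Pixel>
double mean_luminance(const std::vector<Pixel>& img) {
    unsigned long long sum = 0;
    for (auto p : img) sum += luminance(p);
    return img.empty() ? 0.0 : double(sum) / img.size();
}

// A run-time polymorphic image: one of a closed set of static types.
using AnyImage = std::variant<std::vector<Gray8>, std::vector<RGB8>>;

int main() {
    AnyImage img = std::vector<RGB8>{{10, 20, 30}, {90, 90, 90}};
    // std::visit instantiates mean_luminance once per alternative and
    // dispatches at run time: exactly the pattern whose code-size cost
    // the paper analyzes and reduces.
    double m = std::visit(
        [](const auto& v) { return mean_luminance(v); }, img);
    std::printf("%f\n", m);
}
```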

Robust Object Detection Via Soft Cascade

Lubomir Bourdev and Jonathan Brandt
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005)

We describe a method for training object detectors using a generalization of the cascade architecture, which results in a detection rate and speed comparable to those of the best published detectors, while allowing for easier training and a detector with fewer features. In addition, the method allows for quickly calibrating the detector for a target detection rate, false positive rate or speed. One important advantage of our method is that it enables systematic exploration of the ROC surface, which characterizes the trade-off between accuracy and speed for a given classifier.
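
Here is a minimal sketch of the evaluation loop, with invented data structures. The key difference from a classic hard cascade of independent stages is a single cumulative score checked against a per-stage rejection threshold after every weak classifier; calibrating those thresholds is what trades detection rate against false positives and speed.

```cpp
// Soft-cascade evaluation over one detection window (illustrative).
#include <cstddef>
#include <vector>

struct WeakClassifier {
    int   feature_index;  // which feature of the window to examine
    float threshold;      // feature threshold
    float alpha;          // vote added when the feature fires
    float evaluate(const std::vector<float>& window) const {
        return window[feature_index] > threshold ? alpha : -alpha;
    }
};

// Returns true if the window survives all stages (i.e., is a detection).
bool soft_cascade(const std::vector<float>& window,
                  const std::vector<WeakClassifier>& stages,
                  const std::vector<float>& rejection_thresholds) {
    float score = 0.f;
    for (std::size_t t = 0; t < stages.size(); ++t) {
        score += stages[t].evaluate(window);
        if (score < rejection_thresholds[t])
            return false;  // early exit: most negative windows stop here
    }
    return true;  // the cumulative score stayed above every threshold
}
```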

Art-Based Rendering of Fur, Grass, and Trees

Michael Kowalski, Lee Markosian, J.D. Northrup, Lubomir Bourdev, Ronen Barzel, Loring Holden and John Hughes
ACM Transactions on Graphics (SIGGRAPH 1999)

Artists and illustrators can evoke the complexity of fur or vegetation with relatively few well-placed strokes. We present an algorithm that uses strokes to render 3D computer graphics scenes in a stylized manner, suggesting the complexity of the scene without representing it explicitly. The basic algorithm is customizable to produce a range of effects, including fur, grass and trees, as we demonstrate in this paper and the accompanying video. The algorithm is implemented within a broader framework that supports procedural stroke-based textures on polyhedral models. It renders moderately complex scenes at multiple frames per second on current graphics workstations, and provides some interframe coherence.
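
A toy sketch of the underlying trick (all structures invented, no rendering API): rather than modeling fur geometrically, scatter a few seed points per polygon and grow short, jittered strokes from them along the surface normal. Stroke density, length and jitter become the style parameters.

```cpp
// Scatter stroke seeds on a triangle and grow fur-like strokes from them.
#include <cstdlib>
#include <vector>

struct Vec3 { float x, y, z; };
struct Stroke { Vec3 from, to; };

static float frand() { return std::rand() / float(RAND_MAX); }

// Emit `count` strokes over triangle (a, b, c) with unit normal n.
void fur_strokes(Vec3 a, Vec3 b, Vec3 c, Vec3 n, int count, float len,
                 std::vector<Stroke>& out) {
    for (int i = 0; i < count; ++i) {
        // Uniform barycentric sample of the triangle.
        float u = frand(), v = frand();
        if (u + v > 1.f) { u = 1.f - u; v = 1.f - v; }
        Vec3 p{a.x + u * (b.x - a.x) + v * (c.x - a.x),
               a.y + u * (b.y - a.y) + v * (c.y - a.y),
               a.z + u * (b.z - a.z) + v * (c.z - a.z)};
        // A stroke grows out of the surface along the jittered normal.
        float l = len * (0.7f + 0.6f * frand());
        out.push_back({p, {p.x + n.x * l, p.y + n.y * l, p.z + n.z * l}});
    }
}
```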

Rendering Nonphotorealistic Strokes with Temporal and Arc-Length Coherence

Lubomir Bourdev
Master's Thesis, Brown University, 1998

We describe a method for rendering the silhouette of an object in a frame-to-frame coherent way. The input to the system at each frame is a set of silhouette pixels in a rendering of the object and their corresponding silhouette edges in a polygonal model (mesh) of the object. The output is a set of silhouette strokes.

Real-Time Nonphotorealistic Rendering

Lee Markosian, Michael Kowalski, Sam Trychin, Lubomir Bourdev, Daniel Goldstein and John Hughes
ACM Transactions on Graphics (SIGGRAPH 1997)

Nonphotorealistic rendering (NPR) can help make comprehensible but simple pictures of complicated objects by employing an economy of line. But current nonphotorealistic rendering is primarily a batch process. This paper presents a real-time nonphotorealistic renderer that deliberately trades accuracy and detail for speed. Our renderer uses a method for determining visible lines and surfaces that is a modification of Appel's hidden-line algorithm, with improvements based on the topology of singular maps of a surface into the plane. The method we describe for determining visibility can be used in any NPR system that requires a description of the visible lines or surfaces in the scene. The major contribution of this paper is thus a tool that can significantly improve the performance of such systems. We demonstrate the system with several nonphotorealistic rendering styles, all of which operate on complex models at interactive frame rates.

  1. O. Rippel, A. Anderson, K. Tatwawadi, S. Nair, C. Lytle, H. Guihot, B. Sprague, L. Bourdev, Machine-learned In-Loop Predictor for Video Compression, U.S. Patent 11570465

  2. O. Rippel, L. Bourdev, Deep learning based adaptive arithmetic coding and codelength regularization, U.S. Patent 11423310

  3. L. Bourdev, Pose-aligned Networks for Deep Attribute Modeling, U.S. Patent 11380119

  4. O. Rippel, L. Bourdev, C. Lew, S. Nair, Using Generative Adversarial Networks in compression, U.S. Patent 11315011

  5. L. Bourdev, C. Lew, S. Nair, O. Rippel, Data compression for machine learning tasks, U.S. Patent 11256984

  6. O. Rippel, L. Bourdev, Deep learning based adaptive arithmetic coding and codelength regularization, U.S. Patent 11100394

  7. O. Rippel, L. Bourdev, Deep learning based adaptive arithmetic coding and codelength regularization, U.S. Patent 11062211

  8. Y. Gong, M. Pawlowski, Y. Fei, L. Bourdev, L. Brandy, R. Fergus, Systems and methods for online clustering of content items, U.S. Patent 11003692

  9. O. Rippel, L. Bourdev, Enhanced coding efficiency with progressive representation, U.S. Patent 10977553

  10. O. Rippel, S. Nair, C. Lew, S. Branson, A. Anderson, L. Bourdev, Machine-learning based video compression, U.S. Patent 10860929

  11. O. Rippel, L. Bourdev, Deep learning based adaptive arithmetic coding and codelength regularization, U.S. Patent 10748062

  12. O. Rippel, S. Nair, C. Lew, S. Branson, A. Anderson, L. Bourdev, Machine-learning based video compression, U.S. Patent 10685282

  13. C. Lew, S. Branson, O. Rippel, S. Nair, A. Anderson, L. Bourdev, Adaptive quantization, U.S. Patent 10594338

  14. K. Tang, L. Bourdev, M. Paluri, R. Fergus, Systems and methods for image object recognition based on location information and object categories, U.S. Patent 10572771

  15. L. Bourdev, C. Lew, S. Nair, O. Rippel, Autoencoding image residuals for improving upsampled images, U.S. Patent 10565499

  16. O. Rippel, L. Bourdev, Adaptive compression based on content, U.S. Patent 10402722

  17. L. Bourdev, Pose-aligned networks for deep attribute modeling, U.S. Patent 10402632

  18. R. Fergus, L. Bourdev, B. Paluri, S. Sukhbaatar, Unsupervised training sets for content classification, U.S. Patent 10360498

  19. A. Hassan, L. Bourdev, Systems and methods to determine location of media items, U.S. Patent 10360255

  20. O. Rippel, L. Bourdev, Enhanced coding efficiency with progressive representation, U.S. Patent 10332001

  21. D. Tran, B. Paluri, L. Bourdev, R. Fergus, S. Chopra, Systems and methods for determining video feature descriptors based on convolutional neural networks, U.S. Patent 10198637

  22. L. Bourdev, B. Paluri, Systems and methods for image recognition normalization and calibration, U.S. Patent 10169686

  23. A. Lerios, D. Stoop, M. Ryan, L. Bourdev, B. Paluri, Methods and systems for differentiating synthetic and non-synthetic images, U.S. Patent 10140545

  24. N. Johri, B. Paluri, L. Bourdev, Systems and methods for image recognition normalization and calibration, U.S. Patent 9946926

  25. D. Tran, B. Paluri, L. Bourdev, R. Fergus, Systems and methods for determining video feature descriptors based on convolutional neural networks, U.S. Patent 9858484

  26. N. Johri, B. Paluri, L. Bourdev, Systems and methods for image recognition normalization and calibration, U.S. Patent 9767357

  27. B. Paluri, D. Tran, L. Bourdev, R. Fergus, Systems and methods for processing content using convolutional neural networks, U.S. Patent 9754351

  28. K. Tang, L. Bourdev, B. Paluri, R. Fergus, Systems and methods for image object recognition based on location information and object categories, U.S. Patent 9727803

  29. L. Bourdev, N. Zhang, Y. Taigman, R. Fergus, Systems and methods for identifying users in media content based on poselets and neural networks, U.S. Patent 9704029

  30. A. Lerios, D. Stoop, R. Mack, L. Bourdev, B. Paluri, Methods and Systems for Differentiating Synthetic and Non-synthetic Images, U.S. Patent 9558422

  31. L. Bourdev, N. Zhang, B. Paluri, Y. Taigman, R. Fergus, Systems and Methods for Identifying Users in Media Content Based on Poselets and Neural Networks, U.S. Patent 9514390

  32. K. Tang, L. Bourdev, B. Paluri, R. Fergus, Systems and Methods for Image Object Recognition Based on Location Information and Object Categories, U.S. Patent 9495619

  33. L. Bourdev, Pose-aligned Networks for Deep Attribute Modeling, U.S. Patent 9400925

  34. A. Lerios, D. Stoop, R. Mack, L. Bourdev, B. Paluri, Methods and Systems for Differentiating Synthetic and Non-synthetic Images, U.S. Patent 9280723

  35. J. Brandt, Z. Lin, V. Le, L. Bourdev, Adjusting a Contour by a Shape Model, U.S. Patent 9202138

  36. J. Brandt, Z. Lin, L. Bourdev, V. Le, Fitting Contours to Features, U.S. Patent 9158963

  37. L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 9092173

  38. L. Bourdev, E. Shechtman, J. Wang, and F. Yang, Methods and Apparatus for Face Fitting and Editing Applications, U.S. Patent 8923392

  39. L. Dontcheva, S. Pongnumkul, W. Li, S. Avidan and L. Bourdev, Methods and Apparatus for Tutorial Video Enhancement, U.S. Patent 8909024

  40. A. Lerios, D. Stoop, R. Mack, L. Bourdev, M. Paluri, Methods and Systems for Differentiating Synthetic and Non-Synthetic Images, U.S. Patent 8903186

  41. J. Wang, E. Shechtman, L. Bourdev, F. Yang, Methods and Apparatus for Facial Feature Replacement, U.S. Patent 8818131

  42. K. Dale, L. Bourdev, S. Avidan, A. Parenteau, System and Method for Labeling a Collection of Images, U.S. Patent 8724908

  43. A. Casillas, L. Bourdev, Indicating a Correspondence Between an Image and an Object, U.S. Patent 8548211

  44. L. Bourdev, Generation and Usage of Attractiveness Scores, U.S. Patent 8532347

  45. L. Bourdev, J. Xu, System and Method for using Contextual Features to Improve Face Recognition in Digital Images, U.S. Patent 8503739

  46. J. Wang, E. Shechtman, L. Bourdev, F. Yang, Methods and Apparatus for Facial Feature Replacement, U.S. Patent 8457442

  47. L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 8418051

  48. L. Bourdev, A. Parenteau, Efficient and Scalable Face Recognition in Photo Albums, U.S. Patent 8379939

  49. L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 8296647

  50. C. Schendel, L. Bourdev, Designating a Tag Icon, U.S. Patent 8259995

  51. L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 8244069

  52. L. Bourdev, Autocompleting Form Fields Based on Previously Entered Values, U.S. Patent 8234561

  53. L. Bourdev, Detecting Objects within an Image by Incrementally Evaluating Subwindows of the Image in Parallel, U.S. Patent 8077920

  54. L. Bourdev, Generation and Usage of Attractiveness Scores, U.S. Patent 8041076

  55. A. Casillas, L. Bourdev, Indicating a Correspondence Between an Image and an Object, U.S. Patent 7978936

  56. L. Bourdev, Reviewing and Editing Word Processing Documents, U.S. Patent 7966566

  57. L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 7889946

  58. L. Bourdev, Previewing the Effects of Flattening Transparency, U.S. Patent 7827485

  59. L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 7825941

  60. L. Bourdev, Method and System to Monitor Installation of a Software Program, U.S. Patent 7818741

  61. L. Bourdev, Method for Displaying Extracted Faces from Images in Normalized Form, U.S. Patent 7813526

  62. L. Bourdev, Tagging Detected Objects, U.S. Patent 7813557

  63. L. Bourdev, J. Brandt, Image Splitting to Use Multiple Execution Channels of a Graphics Processor to Perform an Operation on Single-Channel Input, U.S. Patent 7768516

  64. L. Bourdev, Detecting Objects within an Image by Incrementally Evaluating Subwindows of the Image in Parallel, U.S. Patent 7738680

  65. L. Bourdev, Incremental Batch-Mode Editing of Digital Media Objects, U.S. Patent 7730043

  66. L. Bourdev, C. Schendel, J. Heileson, Searching Images with Extracted Objects, U.S. Patent 7716157

  67. A. Casillas, L. Bourdev, Exporting Extracted Faces, U.S. Patent 7706577

  68. L. Bourdev, Indicating a Tag with Visual Data, U.S. Patent 7694885

  69. A. Parenteau, L. Bourdev, Selectively Transforming Overlapping Illustration Artwork, U.S. Patent 7692652

  70. L. Bourdev, Displaying Detected Objects to Indicate Grouping, U.S. Patent 7636450

  71. L. Bourdev, J. Brandt, Detecting Objects in an Image Using a Soft Cascade, U.S. Patent 7634142

  72. L. Bourdev, Method and Apparatus for Calibrating Sampling Operations for an Object Detection Process, U.S. Patent 7616780

  73. L. Bourdev, Facilitating Computer-Assisted Tagging of Object Instances in Digital Images, U.S. Patent 7587101

  74. L. Bourdev, G. Wilensky, Detection of Objects in an Image using Color Analysis, U.S. Patent 7580563

  75. P. Asente, T. Pettit, L. Bourdev, M. Schuster, Assigning Region Attributes in a Drawing, U.S. Patent 7502028

  76. L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 7495675

  77. L. Bourdev, Method and Apparatus for Calibrating Sampling Operations for an Object Detection Process, U.S. Patent 7440587

  78. L. Bourdev, Autocompleting Form Fields Based on Previously Entered Values, U.S. Patent 7343551

  79. L. Bourdev, M. Newell, Creating and Manipulating Related Vector Objects in an Image, U.S. Patent 7339597

  80. A. Parenteau, L. Bourdev, Selectively Transforming Overlapping Illustration Artwork, U.S. Patent 7262782

  81. L. Bourdev, S. Schiller, Processing Complex Regions of Illustration Artwork, U.S. Patent 7256798

  82. L. Bourdev, Previewing the Effects of Flattening Transparency, U.S. Patent 7181687

  83. L. Bourdev, M. Newell, Operations on Related Set of Vector Objects, U.S. Patent 7123269

  84. P. Louveaux, L. Bourdev, Hierarchical 2D Compositing with Blending Mode and Opacity Controls at All Levels, U.S. Patent 7102651

  85. L. Bourdev, S. Schiller, Processing Complex Regions of Illustration Artwork, U.S. Patent 6894704

  86. L. Bourdev, S. Schiller, Flattening Images with Abstracted Objects, U.S. Patent 6859553

  87. P. Louveaux, L. Bourdev, Hierarchical 2D Compositing with Blending Mode and Opacity Controls at All Levels, U.S. Patent 6847380

  88. L. Bourdev, S. Schiller, M. Newell, Processing Illustration Artwork, U.S. Patent 6720977

  89. L. Bourdev, Processing Opaque Pieces of Illustration Artwork, U.S. Patent 6515675

  • image

    WaveOne: The Real Pied Piper

    An article about us on the front page of the Wall Street Journal

    I am humbled to see the company I co-founded featured on the front page of the Wall Street Journal. According to Rolfe Winkler, the journalist who wrote the piece, we were the first pre-Series A startup ever to make the front page. While we are proud of our technology, which was the first to outperform the video standards in low-latency mode, what helped with this article is our uncanny resemblance to Pied Piper, the video compression startup in the HBO series Silicon Valley. Funny story: the article featured a photo of us in front of our whiteboard, which at the time contained confidential formulas. We only realized our mistake once we saw the photo on the WSJ website!

  • image

    Genius Makers

    A book on the history of AI labs at Google and Facebook

    I was featured in this book by Cade Metz, a New York Times technology correspondent, for my role in the early days of what became Facebook AI Research. On pages 124-127 the book describes how I used out-of-the-box tactics and, with the help of Mark Zuckerberg (CEO) and Mike Schroepfer (CTO), was able to hire a key Google researcher, Marc'Aurelio Ranzato. Once he joined Facebook, he was instrumental in attracting his former advisor, Prof. Yann LeCun, to join and lead FAIR, as well as Prof. Rob Fergus. On page 323 the book identifies the six of us as the key people behind the formation of Facebook AI Research.

  • image

    WaveOne in TechCrunch

    Article about our vision for next-generation video compression

    Here is a TechCrunch article about WaveOne, the company I co-founded, published in connection with our fundraising round.

  • image

    Beyond Frontal Faces

    Research on person recognition at Facebook

    My former intern Ning Zhang and I, together with colleagues from Facebook AI Research, published a CVPR poster on recognizing people even when their face is not visible. It went largely unnoticed in the vision community until it was suddenly picked up by the press, with dozens of articles: by Wired, Time, the Wall Street Journal (twice), The Hacker News, Fortune, New Scientist, ZDNet, Business Insider, Huffington Post, Yahoo, Daily Mail, USA Today and many others. There was even a Jimmy Kimmel skit! While I can't help feeling flattered by the attention, and I do think of PIPER as a neat project and the first of its kind to recognize people from any viewpoint without the presence of a face, I feel the work is being overhyped; it is certainly not worth being called a "technological breakthrough". Many of the articles expressed privacy concerns, which are unwarranted: this is a research-only project with no plans to deploy it to production.

  • image

    Article in WSJ on young researchers

    Here is a Wall Street Journal article written in 2008 about young researchers and the age span of inventors across different companies. It mentions me in the section about Adobe. (At that time I was considered "young"!)