Fourth Workshop on Computer Vision for AR/VR

#CV4ARVR

June 15, 2020

Organized in conjunction with CVPR 2020



Q&A

Do you have a question for the authors?


Join the CV4ARVR Discord server and leave a question for the authors in their respective channels, or participate in pre-arranged live Q&A sessions. The main virtual poster session took place on June 19, 2020, during CVPR. For the best experience, we recommend using the Discord desktop app.


The authors of “Multi-user, Scalable 3D Object Detection in AR Cloud” present their work during the virtual poster session on June 19, 2020.


Extended Abstracts

Play all videos here (YouTube playlist)


Attention Mesh: High-fidelity Face Mesh Prediction in Real-time

Ivan Grischenko, Artsiom Ablavatski, Yury Kartynnik, Karthik Raveendran, Matthias Grundmann

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We present Attention Mesh, a lightweight architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Our neural network is designed for real-time on-device inference and runs at over 50 FPS on a Pixel 2 phone. Our solution enables applications like AR makeup, eye tracking, and AR puppeteering that rely on highly accurate landmarks for the eye and lip regions. Our main contribution is a unified network architecture that achieves the same accuracy on facial landmarks as a multi-stage cascaded approach, while being 30 percent faster.
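
The abstract does not detail the network, but a minimal sketch of one crop-and-refine reading of region attention is given below: landmark-defined crops around the eyes and lips are pulled from a shared feature map and refined by a small sub-network. Layer sizes, crop extents, and the landmark count are illustrative assumptions, not the authors' design.

```python
# Sketch only (not the authors' network): region attention via feature crops
# around semantically meaningful regions, followed by landmark refinement.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RegionRefiner(nn.Module):
    def __init__(self, in_channels=64, num_landmarks=71):  # counts are assumptions
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_landmarks * 2),   # refined (x, y) per region landmark
        )

    def forward(self, region_feats):
        return self.net(region_feats)

features = torch.randn(1, 64, 48, 48)            # shared backbone feature map
# Region boxes (batch_idx, x1, y1, x2, y2) in feature-map coordinates, e.g.
# derived from coarse landmark predictions for the left eye and the lips.
boxes = torch.tensor([[0, 8.0, 12.0, 20.0, 20.0],
                      [0, 14.0, 28.0, 34.0, 40.0]])
crops = roi_align(features, boxes, output_size=(16, 16))
refiner = RegionRefiner()
print(refiner(crops).shape)                      # (2, 142)
```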

PDF


BlazePose: On-device Real-time Body Pose Tracking

Valentin Bazarevsky, Ivan Grischenko, Karthik Raveendran, Matthias Grundmann, Fan Zhang, Tyler (Lixuan) Zhu

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.
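
A minimal sketch of the heatmap-plus-regression idea mentioned above: one branch produces per-keypoint heatmaps while another regresses keypoint coordinates directly from the same features. This is not the authors' architecture; all layer sizes are illustrative assumptions.

```python
# Sketch of a keypoint head that combines heatmaps with coordinate regression.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 33  # BlazePose predicts 33 body keypoints

class HeatmapAndRegressionHead(nn.Module):
    def __init__(self, in_channels=128):
        super().__init__()
        # Heatmap branch: one channel per keypoint, supervised with 2D Gaussians.
        self.heatmap = nn.Conv2d(in_channels, NUM_KEYPOINTS, kernel_size=1)
        # Regression branch: pools features, predicts (x, y, visibility) per keypoint.
        self.regress = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, NUM_KEYPOINTS * 3),
        )

    def forward(self, features):
        return self.heatmap(features), self.regress(features)

head = HeatmapAndRegressionHead()
feats = torch.randn(1, 128, 64, 64)        # backbone features (assumed shape)
heatmaps, coords = head(feats)
print(heatmaps.shape, coords.shape)        # (1, 33, 64, 64) (1, 99)
```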

PDF


Boosting Perceptual Resolution of VR Displays

Hung-Chung Lu, Wen-Tsung Hsieh, Shao-Yi Chien

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Due to the insufficient pixel density of current virtual reality (VR) headset displays, the fully immersive experiences that VR applications aim to provide are hard to achieve. In this work, we propose a framework that can boost the perceptual resolution of VR displays at reasonable computational cost. The proposed perceptual frame synthesis network generates high-resolution information in the temporal domain, and the high-resolution perception can then be restored by the integration process in the retina. Furthermore, we propose a method to blend regions of mixed frame rates within the same frame, allowing us to improve the perceptual experience only in the focused region. Subjective experiments are conducted to verify the effectiveness of the proposed framework.

PDF


DARNavi: An Indoor-Outdoor Immersive Navigation System with Augmented Reality

Xiaoqiang Teng, Pengfei Xu, Bin Xu, Jun Zhang, Ronghao Li, Gengxin Gu, Liang Wang, Guanghui Zhao, Song Zhang, Runbo Hu, Hua Chai

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Online car-hailing services have gained great popularity all over the world. It is important for these services to guide passengers to their pickup points effectively, which saves path-finding time and improves the overall user experience. In this paper, we present an immersive navigation system that guides passengers from any indoor position to the pickup point. During navigation, the system renders guidance elements realistically into the physical world with Augmented Reality. The system has been deployed in a commercial application on common mobile phones and has served thousands of passengers.

PDF


Decoupled Localization and Sensing with HMD-based AR for Interactive Scene Acquisition

Søren Skovsen, Harald Haraldsson, Abe Davis, Henrik Karstoft, Serge Belongie

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Real-time tracking and visual feedback make interactive AR-assisted capture systems a convenient and low-cost alternative to specialized sensor rigs and robotic gantries. We present a simple strategy for decoupling localization and visual feedback in these applications from the primary sensor being used to capture the scene. Our strategy is to use an AR HMD and a 6-DOF controller for tracking and feedback, synchronized with a separate primary sensor that captures the scene. In this extended abstract, we present a prototype implementation of this strategy and investigate the accuracy of decoupled tracking by comparing runtime pose estimates to the results of high-resolution offline SfM.

PDF


Epipolar Transformer for Multi-view Pose Estimation

Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

A common way to localize 3D human joints in a synchronized and calibrated multi-view setup is a two-step process: (1) apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on the 2D detections from each view to acquire 3D joint locations. However, in step 1, the 2D detector must resolve challenging cases such as occlusions and oblique viewing angles purely in 2D, without leveraging any 3D information, even though they could be better handled in 3D. Therefore, we propose the differentiable “epipolar transformer”, which empowers the 2D detector to leverage 3D-aware intermediate features to improve 2D pose estimation. The intuition is: given a 2D location p in the reference view, we would like to first find its corresponding point p′ in the source view, then combine the features at p′ with the features at p, leading to a more 3D-aware intermediate feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p′. The key advantages of the epipolar transformer are: (1) it has minimal learnable parameters, (2) it can be easily plugged into existing networks, and (3) it is easily interpretable, i.e., we can analyze the location p′ to understand whether matching over the epipolar line was successful. Experiments on InterHand and Human3.6M show that our approach yields consistent improvements over the baselines. Specifically, in the setting where no external data is used, our Human3.6M model trained with a ResNet-50 backbone and image size 256×256 outperforms the state of the art by a large margin, achieving an MPJPE of 26.9 mm.
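
A minimal numpy sketch of the fusion step described above: features sampled along the epipolar line in the source view are weighted by their similarity to the reference feature at p and blended back in. The epipolar-line sampling itself (which comes from the camera geometry) is assumed given, and the blending weight alpha is an illustrative assumption.

```python
# Sketch: approximate the feature at p' by attention over epipolar-line samples.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def epipolar_fuse(ref_feat, line_feats, alpha=0.5):
    """ref_feat: (C,) feature at p in the reference view;
    line_feats: (N, C) features sampled along the epipolar line in the source view."""
    scores = line_feats @ ref_feat        # dot-product similarity (feature matching)
    weights = softmax(scores)             # attention over the epipolar line
    approx_p_prime = weights @ line_feats # approximate feature at p'
    return (1 - alpha) * ref_feat + alpha * approx_p_prime

rng = np.random.default_rng(0)
fused = epipolar_fuse(rng.standard_normal(64), rng.standard_normal((32, 64)))
print(fused.shape)  # (64,)
```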

PDF


Fakeye: Sky Augmentation with Real-time Sky Segmentation and Texture Blending

Anh-Thu Thi Tran, Yen Ngoc Le

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Augmented Reality (AR) has been used intensively to enhance the human experience by providing artificial content on top of real surroundings. While many AR applications focus on indoor tasks, higher and vaster spaces such as the sky should also receive attention in order to create a fuller virtual environment. Since distant objects behave differently from near objects in terms of rendering, in this paper we devise an approach to augment the sky with virtual objects, addressing challenges such as real-time sky segmentation for creating the illusion of occlusion, blending of real and virtual scenes, and camera alignment. Our mobile implementation of the approach, called “Fakeye”, produces promising results and brings about an exciting experience.
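
A minimal sketch of the compositing step once a sky mask is available: virtual sky content is alpha-blended only where sky is visible, which is what creates the occlusion illusion mentioned above. The segmentation model itself is not shown; the soft mask in [0, 1] is an assumed input.

```python
# Sketch: blend virtual sky content into the camera frame using a soft sky mask.
import numpy as np

def composite_sky(camera_frame, virtual_sky, sky_mask):
    """camera_frame, virtual_sky: (H, W, 3) float images in [0, 1];
    sky_mask: (H, W) soft sky probability from a segmentation model."""
    alpha = sky_mask[..., None]                       # broadcast over channels
    return alpha * virtual_sky + (1.0 - alpha) * camera_frame

frame = np.random.rand(240, 320, 3)   # stand-ins for real inputs
sky = np.random.rand(240, 320, 3)
mask = np.random.rand(240, 320)
out = composite_sky(frame, sky, mask)
print(out.shape)                       # (240, 320, 3)
```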

PDF


FMKit: An In-Air-Handwriting Analysis Library and Data Repository

Duo Lu, Linzhen Luo, Dijiang Huang, Yezhou Yang

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Hand-gesture and in-air-handwriting provide ways for users to input information in Augmented Reality (AR) and Virtual Reality (VR) applications where a physical keyboard or a touch screen is unavailable. However, understanding the movement of hands and fingers is challenging, which requires a large amount of data and data-driven models. In this paper, we propose an open research infrastructure named FMKit for in-air-handwriting analysis, which contains a set of Python libraries and a data repository collected from over 180 users with two different types of motion capture sensors. We also present three research tasks enabled by FMKit, including in-air-handwriting based user authentication, user identification, and word recognition, and preliminary baseline performance.

PDF | Website


Head-mounted Augmented Reality for Guided Surface Reflectance Capture

Harald Haraldsson, Søren Skovsen, Ser-Nam Lim, Steve Marschner, Serge Belongie, Abe Davis

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

In this project, we explore the design of a system for helping users capture surface reflectance functions with a head-mounted augmented reality (AR) device and a hand-held controller. We supplement a standard 6-DOF controller with a mountable light source that we track during capture. Users begin by using the controller to select a surface region for capture. We then record images through the head-mounted camera while guiding the user's control of the hand-held light source with real-time feedback through the AR display. Our system provides a simple and efficient way to capture information about surface reflectance in the wild.

PDF


Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset

Yiwen Hua, Puneet Kohli, Pritish M Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, Yaguang Li

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

With the mass-market adoption of dual-camera mobile phones, leveraging stereo information in computer vision has become increasingly important in the AR/VR industry. Current state-of-the-art methods use learning-based algorithms, where the amount and quality of training samples heavily influence results. Existing stereo image datasets are limited either in size or in subject variety. Hence, algorithms trained on such datasets do not generalize well to scenarios encountered in mobile photography. We present Holopix50k, a novel in-the-wild stereo image dataset comprising 49,368 image pairs contributed by users of the Holopix™ mobile social platform. In this work, we describe our data collection process and statistically compare our dataset to other popular stereo datasets. We experimentally show that using our dataset significantly improves results for tasks such as stereo super-resolution. Finally, we showcase practical applications of our dataset by training neural networks to predict disparity maps from stereo and monocular images. High-quality disparity maps are critical for improving projection and 3D reconstruction in AR/VR applications on mobile phones.

PDF


Instant 3D Object Tracking with Applications in Augmented Reality

Adel Ahmadyan, Tingbo Hou, Jianing Wei, Liangkai Zhang, Matthias Grundmann, Artsiom Ablavatski

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Tracking object poses in 3D is a crucial building block for Augmented Reality applications. We propose an instant motion tracking system that tracks an object’s pose in space (represented by its 3D bounding box) in real time on mobile devices. Our system does not require any prior sensory calibration or initialization to function. We employ a deep neural network to detect objects and estimate their initial 3D pose. The estimated pose is then tracked using a robust planar tracker. Our tracker is capable of performing relative-scale 9-DoF tracking in real time on mobile devices. By combining CPU and GPU use efficiently, we achieve over 26 FPS on mobile devices.

PDF | Website


MediaPipe Hands: On-device Real-time Hand Tracking

Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, George Sung, Chuo-Ling Chang, Matthias Grundmann, Andrei Tkachenka

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We present a real-time on-device hand tracking pipeline that predicts a hand skeleton from a single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector and 2) a hand landmark model. It is implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs with high prediction quality. MediaPipe Hands is open source at https://mediapipe.dev.
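
Since the pipeline is open source, a short usage sketch with the MediaPipe Python package is given below; it runs the palm detector and hand landmark model together through the high-level Hands solution. Option names follow the legacy mp.solutions API and may differ between releases, and the input file name is an assumption.

```python
# Sketch: run MediaPipe Hands on a single image and read the 21 hand landmarks.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(
    static_image_mode=True,        # single images; set False for video streams
    max_num_hands=2,
    min_detection_confidence=0.5,
)

image = cv2.imread("hand.jpg")                          # BGR image from OpenCV
results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:
        # 21 landmarks per hand, normalized to [0, 1] image coordinates.
        for lm in hand.landmark:
            print(lm.x, lm.y, lm.z)

hands.close()
```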

PDF


Multi-user, Scalable 3D Object Detection in AR Cloud

Siddarth Choudhary, Nitesh Sekhar, Siddharth Mahendran, Prateek Singhal

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

As the AR Cloud gains importance, one key challenge is large-scale, multi-user 3D object detection. Current approaches typically focus on single-room, single-user scenarios. In this work, we present an approach for multi-user, scalable 3D object detection based on distributed data association and fusion. We use an off-the-shelf detector to detect object instances in 2D and then combine them in 3D on a per-object basis, while allowing asynchronous updates to the map. Distributed data association and fusion allow us to scale detection to a large number of concurrent users while maintaining a low memory footprint without loss in accuracy. We show empirical results where the distributed and centralized approaches achieve comparable accuracy on the ScanNet dataset, while reducing memory consumption by a factor of 15.
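
A minimal sketch of per-object association and fusion in the spirit described above: 3D detections arriving asynchronously from different users are matched to existing object instances by centroid distance and fused with a running average. The 2D detection and lifting to 3D are assumed given, and the association threshold is an illustrative assumption rather than the authors' choice.

```python
# Sketch: associate incoming 3D detections to object instances and fuse them.
import numpy as np

ASSOC_THRESHOLD = 0.5   # metres; illustrative assumption

class ObjectMap:
    def __init__(self):
        self.centroids = []   # fused 3D centroids, one per object instance
        self.counts = []

    def update(self, detection_3d):
        """detection_3d: (3,) centroid of a detected object from one client."""
        if self.centroids:
            dists = [np.linalg.norm(c - detection_3d) for c in self.centroids]
            i = int(np.argmin(dists))
            if dists[i] < ASSOC_THRESHOLD:
                # Fuse with the matched instance (incremental mean).
                self.counts[i] += 1
                self.centroids[i] += (detection_3d - self.centroids[i]) / self.counts[i]
                return i
        self.centroids.append(detection_3d.astype(float))
        self.counts.append(1)
        return len(self.centroids) - 1

amap = ObjectMap()
for det in np.array([[0.1, 0.0, 1.0], [0.12, 0.02, 1.01], [2.0, 0.0, 0.5]]):
    amap.update(det)
print(len(amap.centroids))   # 2 fused object instances
```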

PDF | Website


Panoramic convolutions for 360º single-image saliency prediction

Daniel Martín, Ana Serrano, Belen Masia

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We present a convolutional neural network based on panoramic convolutions for saliency prediction in 360º equirectangular panoramas. Our network architecture is designed leveraging recently presented 360º-aware convolutions that represent kernels as patches tangent to the sphere where the panorama is projected, and a spherical loss function that penalizes prediction errors for each pixel depending on its coordinates in a gnomonic projection. Our model is able to successfully predict saliency in 360º scenes from a single image, outperforming other state-of-the-art approaches for panoramic content, and yielding more precise results that may help in the understanding of users’ behavior when viewing 360º VR content.

PDF | Website


Realistic Training in VR using Physical Manipulation

Alvaro Villegas, Pablo Perez, Redouane Kachach, Francisco Pereira, Ester Gonzalez-Sosa

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Previously published at IEEE VR workshops.

We propose an interaction method for mixed reality with a direct application to training. It is based on the combination of two key ideas: providing real user embodiment within virtual spaces by segmenting the user's real hands, and allowing the manipulation of physical objects, which may preserve or change their real appearance, in the immersive scenario. We deployed a functional prototype of this concept using Unity for the base scenario, a color-based algorithm for object segmentation, and the ArUco library for object tracking. Then, to take advantage of gamification, we built a five-minute escape-room game designed to validate all components: real embodiment, real objects, augmented real objects, and control devices. We carried out a thorough evaluation of our system with 53 users and a fair comparison scheme: each user played the escape room twice, once with real objects and hands, and once with purely virtual objects and avatars. After each run, users answered a survey derived from standard questionnaires measuring presence, embodiment, and quality of experience. Some quantitative performance data were also extracted from the game runs. The results validate our hypothesis that this approach significantly improves key factors such as presence and embodiment with respect to the counterpart virtual solution. We foresee our solution providing a significant advance in the implementation of virtual training applications.
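
The abstract mentions ArUco-based object tracking; a minimal sketch using OpenCV's aruco module (OpenCV 4.7+ API) is shown below. The dictionary choice, marker size, and camera parameters are illustrative assumptions, not the values used by the authors.

```python
# Sketch: detect ArUco markers in a frame and estimate each marker's 6-DoF pose.
import cv2
import numpy as np

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

frame = cv2.imread("scene.jpg")                  # camera frame (assumed file name)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

corners, ids, _ = detector.detectMarkers(gray)
if ids is not None:
    # Placeholder intrinsics; a real system would use calibrated values.
    camera_matrix = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
    dist_coeffs = np.zeros(5)
    # 3D corners of a 5 cm marker in its own coordinate frame.
    object_points = 0.05 * np.array(
        [[-0.5, 0.5, 0], [0.5, 0.5, 0], [0.5, -0.5, 0], [-0.5, -0.5, 0]],
        dtype=np.float32,
    )
    for marker_id, marker_corners in zip(ids.ravel(), corners):
        ok, rvec, tvec = cv2.solvePnP(
            object_points, marker_corners.reshape(4, 2), camera_matrix, dist_coeffs
        )
        print(marker_id, rvec.ravel(), tvec.ravel())
```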

PDF


Real-time Pupil Tracking from Monocular Video for Digital Puppetry

Artsiom Ablavatski, Andrey Vakunov, Ivan Grischenko, Karthik Raveendran, Matsvei Zhdanovich, Matthias Grundmann

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We present a simple, real-time approach for pupil tracking from live video on mobile devices. Our method extends a state-of-the-art face mesh detector with two new components: a tiny neural network that predicts positions of the pupils in 2D, and a displacement-based estimation of the pupil blend shape coefficients. Our technique can be used to accurately control the pupil movements of a virtual puppet, and lends liveliness and energy to it. The proposed approach runs at over 50 FPS on modern phones, and enables its usage in any real-time puppeteering pipeline.
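
The displacement-based blend shape estimation is not spelled out in the abstract; below is a heavily simplified sketch of one plausible reading, in which the pupil's offset from the eye centre, normalized by the eye width, is mapped to look-direction coefficients. The mapping and its gain are assumptions for illustration, not the authors' formulation.

```python
# Sketch: map a normalized pupil displacement to look-direction blend shapes.
import numpy as np

def pupil_blend_shapes(pupil, eye_left_corner, eye_right_corner, gain=2.0):
    """All inputs are 2D points (x, y) in image coordinates (y grows downward)."""
    eye_center = 0.5 * (eye_left_corner + eye_right_corner)
    eye_width = np.linalg.norm(eye_right_corner - eye_left_corner)
    dx, dy = (pupil - eye_center) / eye_width     # normalized displacement
    return {
        "look_right": max(0.0, dx) * gain,
        "look_left": max(0.0, -dx) * gain,
        "look_down": max(0.0, dy) * gain,
        "look_up": max(0.0, -dy) * gain,
    }

coeffs = pupil_blend_shapes(np.array([102.0, 50.0]),
                            np.array([90.0, 50.0]), np.array([110.0, 50.0]))
print(coeffs)   # small look_right coefficient, everything else zero
```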

PDF


Real-time Retinal Localization for Eye-tracking in Head-mounted Displays

Chen Gong, Laura Trutiou, Brian Schowengerdt, Steven L Brunton, Eric J Seibel

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

Accurate and robust eye tracking is highly desirable in head-mounted displays. A method of using retinal movement videos for estimating eye gaze is investigated in this work. We localize each frame of the retinal movement video on a mosaicked, large-field-of-view search image. The localization is based on a Kalman filter, which embeds deep learning in the estimation process and uses image registration as the measurement. The algorithm is demonstrated in experiments where the retinal movement videos are captured from a dynamic real phantom. The average localization accuracy of our algorithm is 0.68°, excluding the annotation error. The classic pupil-glint eye-tracking method has an average error of 0.5°-1°, while using retina videos yields a tracking resolution of 0.05° per pixel, nearly 20 times higher than that of pupil-glint methods. The accuracy of our inherently robust method is expected to improve with further development.
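
A minimal sketch of the estimation loop described above: a constant-velocity Kalman filter over the retinal position, with the image-registration result against the mosaicked search image used as the measurement. The registration and learned components are stubbed out, and all noise magnitudes and the frame rate are illustrative assumptions.

```python
# Sketch: Kalman filter with image registration as the measurement.
import numpy as np

dt = 1.0 / 30.0                                   # assumed frame period
F = np.array([[1, 0, dt, 0],                      # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                       # registration measures (x, y)
              [0, 1, 0, 0]], dtype=float)
Q = 1e-3 * np.eye(4)                              # process noise (assumed)
R = 1e-1 * np.eye(2)                              # measurement noise (assumed)

x = np.zeros(4)
P = np.eye(4)

def register_frame_to_mosaic(frame):
    """Placeholder for registering the frame against the mosaicked search image."""
    return np.array([10.0, 5.0])

for frame in range(5):                            # toy loop over video frames
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the registration measurement
    z = register_frame_to_mosaic(frame)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P

print(x[:2])   # current retinal location estimate on the search image
```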

PDF


Slow Glass: Visualizing History in 3D

Xuan Luo, Yanmeng Kong, Jason Lawrence, Ricardo Martin-Brualla, Steve Seitz

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We introduce new techniques for reconstructing and viewing antique stereographs in 3D. Leveraging the Keystone-Mast image collection from the California Museum of Photography, we apply multiple processing steps to produce clean stereo pairs, complete with calibration data, rectification transforms, and disparity maps. We describe an approach for synthesizing novel views from these scenes that runs at real-time rates on a mobile device, simulating the experience of looking through an open window into these historical scenes.

PDF | Website


Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild

Dominik Kulon, Alp Guler, Iasonas Kokkinos, Michael Bronstein, Stefanos Zafeiriou

CVPR Workshop on Computer Vision for Augmented and Virtual Reality, 2020

We introduce a simple and effective network architecture for monocular 3D hand pose estimation, consisting of an image encoder followed by a mesh-convolutional decoder that is trained through a direct 3D hand mesh reconstruction loss. We train our network by gathering a large-scale dataset of hand actions in YouTube videos and using it as a source of weak supervision. Our weakly-supervised, mesh-convolution-based system largely outperforms state-of-the-art methods, even halving the errors on the in-the-wild benchmark. The resulting networks will be demonstrated through real-time multi-person hand pose estimation demos running on a mobile phone.

PDF | Website