Generating 3D spaces from a single picture
Looking at a picture of a room, you as a human can probably imagine what the rest of the space looks like out of frame. We have both tons of experience looking at rooms and very active imaginations to help us fill in the gaps. We know what happens when walls meet ceilings, we know to expect windows and certain types of furniture, and context clues from the picture tell us what the rest of the décor probably looks like.
This skill is much harder to replicate in a machine. Like so many other seemingly effortless human processes, imagining the space beyond an image is endlessly complex to formalize. A machine has to learn from the limited features available in an image and pattern-match against images it has seen before, essentially “creating” new features beyond the border of the picture.
In the world of computer vision, this challenge is more broadly known as novel view synthesis: given one or more images, create images from other points of view that are consistent with the inputs. It’s a challenging problem because the features of real environments can change rapidly and unexpectedly. Extrapolating more wall from a given wall is one thing, but knowing when a desk or couch might appear is a different story.
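To make the task concrete, consider the simplest geometric approach: back-project each pixel into 3D using an estimated depth, move the camera, and project the points back into a new image. The NumPy sketch below is illustrative only; the function name `reproject`, the dummy camera values, and the omission of z-buffering are assumptions rather than anything from the paper. It shows why geometry alone is not enough: pixels the original photo never saw come back as empty holes that a model has to imagine.

```python
# Illustrative sketch only (NumPy): naive depth-based reprojection.
# Nothing here is taken from the PixelSynth codebase.
import numpy as np

def reproject(image, depth, K, R, t):
    """Warp `image` into a new camera view, given per-pixel `depth`,
    intrinsics `K`, and a relative rotation `R` / translation `t`.
    No z-buffering, for simplicity."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)

    # Back-project each pixel into 3D using its depth, then move the camera.
    rays = (np.linalg.inv(K) @ pixels.T) * depth.reshape(1, -1)
    points = R @ rays + t.reshape(3, 1)

    # Project the 3D points into the new view.
    proj = K @ points
    z = proj[2]
    uv = np.round(proj[:2] / np.maximum(z, 1e-6)).astype(int)

    # Splat source colors; anything the source image never saw stays empty.
    out = np.zeros_like(image)
    keep = (z > 0) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    out[uv[1, keep], uv[0, keep]] = image.reshape(-1, image.shape[-1])[keep]
    return out  # the empty regions are what a learned model must "imagine"

# Example: rotate a dummy camera 10 degrees and holes appear at the edges.
img = np.random.rand(120, 160, 3)
flat_depth = np.full((120, 160), 2.0)
K = np.array([[100.0, 0.0, 80.0], [0.0, 100.0, 60.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(10)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
warped = reproject(img, flat_depth, K, R, t=np.zeros(3))
```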
A new publication by CSE PhD student Chris Rockwell and Profs. David Fouhey and Justin Johnson pushes this line of work into new territory. While other work in the area has focused on filling in the gaps between multiple images or frames of a video, their technique, called PixelSynth and presented at the 2021 International Conference on Computer Vision, can create interactive 3D spaces from just a single picture.
“Historically this has been done by taking many, many photos or using expensive cameras to do this,” says Rockwell. “We want to do this on a few images, with a focus on a single image.”
In a series of demos available on the project’s webpage, viewers can click and drag the starting image and watch as a field of view opens up around the edges. Unlike the faux-3D photo effects available in some apps, the PixelSynth model actually generates new space several degrees in every direction around the test images.
PixelSynth draws on several recent advances in novel view synthesis and image extrapolation, fusing 3D reasoning with autoregressive modeling. The model requires only two images for each entry in the training dataset: one as input, and one taken nearby as the target output. It also uses the difference in camera angle between the two images.
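As a rough sketch of that training setup, each dataset entry pairs an input view with a nearby target view plus the camera change between them. The class and field names below are hypothetical, and the single L1 reconstruction loss is a stand-in rather than the paper’s actual objectives:

```python
# Hypothetical sketch of a paired-view training sample; names are assumptions,
# not taken from the PixelSynth code, and the L1 loss is a placeholder.
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewPair:
    input_image: np.ndarray      # H x W x 3: the view the model is shown
    target_image: np.ndarray     # H x W x 3: the nearby view it must predict
    relative_camera: np.ndarray  # 4 x 4 transform from input camera to target camera

def reconstruction_loss(model, sample: ViewPair) -> float:
    """Predict the target view from the input view and the camera change,
    then measure how far the prediction is from the real target image."""
    prediction = model(sample.input_image, sample.relative_camera)
    return float(np.abs(prediction - sample.target_image).mean())
```

Roughly speaking, the geometric reprojection sketched earlier covers the “3D reasoning” half of the approach, while the autoregressive modeling is what lets the system invent plausible content for regions the input image never saw.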
Earlier models for single-image view synthesis (themselves quite new) had limited fields of view, and simply training them to produce wider angles returned unusable results. The model powering PixelSynth produces views up to six times larger than its predecessors, with cleaner and more accurate synthesized spaces.
The technology shows promise in areas like virtual and augmented reality, automatically producing new environments from less hard-coded raw material. It could also help applications that currently rely on huge numbers of raw images to build usable views, such as Google Street View, film, and animation, generate more material from fewer frames or less hand-produced imagery.
While the model isn’t yet equipped to produce things like fully navigable spaces or a 360° field of view, Rockwell is optimistic about the technology’s rapid growth. “This area of research has progressed so fast that we may have some new cool VR experiences in the next few years.”