PatchMatch Multi-View Stereo
Understand how the MVS tool provided by COLMAP works
Introduction
SfM/MVS
Multi-view Stereo (MVS) refers to an ensemble of techniques used in Computer Vision to reconstruct a dense 3D scene from multiple 2D images taken from different viewpoints. It allows you to turn smartphones and their high-resolution cameras into robust and cost-effective sensors to capture high-quality 3D data.
You may also have heard of Structure-from-Motion (SfM), which is another Computer Vision technique also used to retrieve 3D geometry from a set of 2D images. While both techniques aim to estimate the 3D structure of the scene, they differ in their approaches and goals.
- Structure-from-Motion (SfM)
SfM is used to simultaneously estimate the camera poses (position and orientation) of each input photograph and a sparse 3D structure of the scene. Each 2D observation of a key point tracked over several images refers to a single 3D point in space. Thus, we can retrieve the most plausible camera poses and 3D points by minimizing the 2D reprojection error of these key points. While SfM is typically limited to offline processing of static scenes, Visual Simultaneous Localization and Mapping (VSLAM) can be used for real-time localization in dynamic scenes.
- Multi-View Stereo (MVS)
MVS algorithms, on the other hand, typically assume that the camera poses and the sparse 3D structure are known (from SfM or other sources). They use stereo matching and triangulation techniques to estimate the 3D coordinates of points visible in multiple images. While SfM provides only sparse information about the scene, MVS yields dense 3D data: depth maps, point clouds, a signed distance function (SDF), or a mesh.
MVS isn’t just a mere generalization of stereo with more than two cameras. Significantly more complex problems emerge along the way, such as dealing with strong occlusions, which required inventing new algorithms.
COLMAP
COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface. (Website, GitHub)
COLMAP’s combination of accuracy, speed, ease of use, flexibility, and open-source nature has made it a popular choice among researchers and engineers who are interested in 3D reconstruction from images.
In this article, we’ll focus on the major steps that led to the MVS method provided by the famous COLMAP software. It belongs to the family of MVS algorithms based on PatchMatch that provide depth maps as output.
ETH3D benchmark
The ETH3D benchmark, introduced by the Computer Vision Lab at ETH Zürich, is widely recognized in 3D reconstruction. It provides a standardized set of datasets and evaluation metrics for 3D reconstruction algorithms. It consists of several sub-benchmarks, each focusing on a different aspect of 3D reconstruction.
The category we’re interested in for MVS is called “High-resolution multi-view.” The leading algorithm (SteuartSystems-ACMMP) in this ranking for the last year or so was based on PatchMatch, as presented in this article, but another model has just recently overtaken it.
2. PatchMatch [1]
In 2009, Barnes et al. introduced the PatchMatch algorithm, which efficiently finds the nearest matching patches between two images. It does this by randomly initializing the nearest neighbor field and then iteratively improving it using a patch-based search strategy. It is commonly used for real-time image and video editing, object tracking, and stereo reconstruction.
Nearest-Neighbor Field
Given two images, A and B, we’d like to find for every patch in A its nearest neighbor in B, given a patch distance metric, e.g. the sum of squared differences (SSD). As illustrated in the figure below, it’s very unlikely that image B is a simple translation of image A.
Given two images of M pixels and patches of m pixels, the naive brute-force approach has a complexity of O(m*M²), since it must compute the patch distance over m pixels for each patch position in A and each patch position in B.
There have been some improvements, like using a tree structure to get down to O(m*M*log(M)) and dimensionality-reduction techniques like PCA. But it was still far too slow to be used in real time for image editing.
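To make the cost of the naive approach concrete, here is a minimal brute-force nearest-neighbor-field sketch using SSD as the patch distance (function and variable names are my own for illustration; real implementations are far more optimized):

```python
import numpy as np

def brute_force_nnf(A, B, patch=7):
    """Exhaustive nearest-neighbor field: for every patch in A, scan every
    patch position in B and keep the offset with the lowest SSD.
    Complexity is O(m * M^2) for images of M pixels and patches of m pixels."""
    r = patch // 2
    Ha, Wa = A.shape[:2]
    Hb, Wb = B.shape[:2]
    nnf = np.zeros((Ha, Wa, 2), dtype=np.int32)  # per-pixel (dy, dx) offsets
    for ya in range(r, Ha - r):
        for xa in range(r, Wa - r):
            pa = A[ya-r:ya+r+1, xa-r:xa+r+1].astype(np.float64)
            best, best_off = np.inf, (0, 0)
            for yb in range(r, Hb - r):
                for xb in range(r, Wb - r):
                    pb = B[yb-r:yb+r+1, xb-r:xb+r+1].astype(np.float64)
                    ssd = np.sum((pa - pb) ** 2)  # patch distance over m pixels
                    if ssd < best:
                        best, best_off = ssd, (yb - ya, xb - xa)
            nnf[ya, xa] = best_off
    return nnf
```

Even on tiny images, the quadratic dependence on image size makes this approach impractical, which is exactly what motivated the randomized search below.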
Key ideas of PatchMatch
Here are the three main ideas introduced by PatchMatch to get much faster than KD-tree with PCA:
- Dimensionality of offset space
Instead of clustering the O(M) patches in patch space (of dimension m), we can search inside the 2D offset space.
- Natural structure of images
Adjacent pixels are likely to have adjacent nearest neighbors. Thus, we can perform the search in an interdependent manner.
- The law of large numbers
Even a random offset is likely to be a good guess for many patches over a large image (large M).
As illustrated in the figure below, we now consider the Nearest-Neighbor Field as a 2D array of offsets in pixels with two channels, X and Y. This array will be iteratively refined by the PatchMatch algorithm.
Steps
When I first read this paper, I didn’t really understand why the third point was so important. At first glance, drawing random guesses doesn’t seem like an efficient way of solving an optimization problem. However, when you combine it with the second point, everything starts to make sense. Each pixel searches randomly but also communicates with its neighbors to see if they have found a more relevant offset. Thus, if a random draw results in a good offset, it can be propagated over a whole region of the image.
The figure below comes from the original PatchMatch paper and describes the initialization and the two steps performed at each iteration:
- Propagation
Use coherence to propagate good solutions to adjacent pixels.
- Search
Test random offsets to escape local minima.
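The initialization, propagation, and random-search steps just described can be sketched as follows. This is a simplified single-scale, grayscale version with my own helper names, not the authors' implementation:

```python
import numpy as np

def ssd(A, B, ya, xa, yb, xb, r):
    # Sum of squared differences between two (2r+1)x(2r+1) patches.
    return np.sum((A[ya-r:ya+r+1, xa-r:xa+r+1]
                   - B[yb-r:yb+r+1, xb-r:xb+r+1]) ** 2)

def patchmatch(A, B, patch=7, iters=4, seed=0):
    """Minimal PatchMatch sketch on grayscale images: random init, then
    alternating forward/backward passes of propagation + random search."""
    rng = np.random.default_rng(seed)
    r = patch // 2
    H, W = A.shape
    Hb, Wb = B.shape
    # nnf[y, x] = absolute (yb, xb) of the current best match in B.
    nnf = np.stack([rng.integers(r, Hb - r, (H, W)),
                    rng.integers(r, Wb - r, (H, W))], axis=-1)
    cost = np.full((H, W), np.inf)
    for y in range(r, H - r):
        for x in range(r, W - r):
            cost[y, x] = ssd(A, B, y, x, nnf[y, x, 0], nnf[y, x, 1], r)
    for it in range(iters):
        fwd = (it % 2 == 0)          # reverse the scan order every other pass
        ys = range(r, H - r) if fwd else range(H - r - 1, r - 1, -1)
        xs = range(r, W - r) if fwd else range(W - r - 1, r - 1, -1)
        step = -1 if fwd else 1      # offset to the already-processed neighbor
        for y in ys:
            for x in xs:
                # Propagation: try the (shifted) match of the previous neighbors.
                for dy, dx in ((step, 0), (0, step)):
                    yb = nnf[y + dy, x + dx, 0] - dy
                    xb = nnf[y + dy, x + dx, 1] - dx
                    if r <= yb < Hb - r and r <= xb < Wb - r:
                        c = ssd(A, B, y, x, yb, xb, r)
                        if c < cost[y, x]:
                            cost[y, x], nnf[y, x] = c, (yb, xb)
                # Random search around the current best, halving the radius.
                rad = max(Hb, Wb)
                while rad >= 1:
                    yb = nnf[y, x, 0] + rng.integers(-rad, rad + 1)
                    xb = nnf[y, x, 1] + rng.integers(-rad, rad + 1)
                    if r <= yb < Hb - r and r <= xb < Wb - r:
                        c = ssd(A, B, y, x, yb, xb, r)
                        if c < cost[y, x]:
                            cost[y, x], nnf[y, x] = c, (yb, xb)
                    rad //= 2
    return nnf, cost
```

Note that the cost can only decrease: every candidate, whether propagated or random, is accepted only if it improves on the current match.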
Conclusion
Its complexity is still O(m*M*log(M)), but the leading constants are substantially smaller.
- PatchMatch CPU: about 20–100x faster than KD-tree with PCA
- PatchMatch GPU: about 7x faster than PatchMatch CPU
Image inpainting and texture synthesis can now be performed in real time in image-editing software.
PatchMatch has unlocked a new level of performance and efficiency in patch-based image processing, making it an invaluable tool in the field of Computer Vision.
3. PatchMatch Stereo [2]
In 2011, Bleyer et al. introduced PatchMatch Stereo, an extension of the PatchMatch algorithm that uses the same randomized search strategy to find correspondences between patches in left and right stereo images and reconstruct the 3D scene.
However, the implementation and the optimization problem being solved in PatchMatch Stereo are completely different from the original PatchMatch algorithm. It doesn’t simply use the PatchMatch algorithm as a dependency; rather, it’s inspired by its propagation scheme to solve the optimization problem of stereo matching.
There’s no fancy depth-map inpainting using PatchMatch to be found!
Stereo disparity
In a pair of stereo images, a point in one image corresponds to a line in the other image, and vice versa. The Wikipedia figure below illustrates this very well:
Stereo image rectification is the process that involves transforming a pair of stereo images using homographies so that the corresponding epipolar lines become aligned and parallel. Once the images have been rectified, stereo matching can be performed more easily and accurately since corresponding points lie on the same horizontal line.
The disparity corresponds to this horizontal shift in pixels between two matching patches. It is inversely proportional to the depth of the object. Specifically, the larger the disparity between the two images, the closer the object is to the cameras.
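For a rectified pinhole stereo pair with focal length f (in pixels) and baseline B (in meters), the standard relation is Z = f*B/d. A tiny helper (names are illustrative) makes the inverse proportionality explicit:

```python
def disparity_to_depth(d_px, focal_px, baseline_m):
    """Rectified pinhole stereo: depth Z = f * B / d.
    A larger disparity d means the object is closer to the cameras."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / d_px

# A 64-pixel disparity with f = 1000 px and a 10 cm baseline
# corresponds to a depth of about 1.56 m.
z = disparity_to_depth(64, 1000, 0.1)
```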
Matching is done by searching a window of pixels around each feature in one image and looking for a similar feature in the other image within a corresponding disparity range. However, there are two main issues with common local stereo methods:
- Integer-valued disparities
From a computational point of view, it is very efficient to test only a small set of integer disparities. But true disparities are sub-pixel in nature, and the discretization of the search space can lead to inaccurate depth estimates and a “staircase effect” in the depth map.
- Bias toward fronto-parallel surfaces
Similarly, looking only for a patch translation imposes a strong assumption, namely that the disparity is locally constant, i.e., a fronto-parallel plane. Slanted or curved surfaces do indeed cause changes in the shape and size of the projection of an object on the image plane. In summary, we can’t just roll a rectangular patch over the image. We need to warp it according to the normal of the local plane.
The figure below is quite simplistic but illustrates well how a patch in the left image (black) should be projected onto the right image, given different depth/normal hypotheses. The blue hypothesis is a fronto-parallel plane and results in a rectangular patch. However, the red and green hypotheses illustrate how the patch should be warped for slanted surfaces.
Key idea of PatchMatch Stereo
Here comes the twist. The brilliant idea of the authors of PatchMatch Stereo was to mimic the propagation scheme of PatchMatch, but on an array of plane hypotheses instead of mere 2D offsets.
As for PatchMatch, we can leverage the fact that neighboring pixels are likely to belong to the same local plane.
To avoid unnecessary conversions between meters and pixels, a plane is represented directly by the coefficients a, b, and c of a 2D linear model that maps pixel coordinates (u,v) to a disparity d: d = a*u + b*v + c. Each pixel has its own local plane estimate, hence the index p on the coefficients. Since it’s only a local model, (u,v) should stay relatively close to p.
Unlike when dealing with integer disparities, we have infinitely many possible plane candidates. This remains computationally feasible, however, since the PatchMatch scheme only provides a small set of plane candidates to compare.
Steps
As for the initialization, plain random values for the a, b, and c coefficients won’t correctly sample the plane space. The idea is to sample both a random disparity in the allowed range and a random unit vector for the plane normal. They can then be merged into a plane equation.
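Here is a small sketch of that initialization, converting a random disparity and a random unit normal into the (a, b, c) coefficients. The clamping of the normal's z-component is my own guard against near-degenerate planes, not something prescribed by the paper:

```python
import numpy as np

def random_plane(u0, v0, d_min, d_max, rng):
    """Sample a random slanted-plane hypothesis for pixel (u0, v0):
    pick a random disparity in [d_min, d_max] and a random unit normal,
    then convert them to the coefficients of d(u, v) = a*u + b*v + c."""
    d0 = rng.uniform(d_min, d_max)   # random disparity at this pixel
    n = rng.normal(size=3)           # random direction ...
    n /= np.linalg.norm(n)           # ... normalized to a unit vector
    if n[2] < 0:
        n = -n                       # keep the normal facing the camera
    n[2] = max(n[2], 1e-3)           # guard against near-vertical planes
    nx, ny, nz = n
    a = -nx / nz
    b = -ny / nz
    c = (nx * u0 + ny * v0 + nz * d0) / nz
    return a, b, c, d0
```

By construction, the sampled plane passes exactly through the chosen disparity at the pixel itself: a*u0 + b*v0 + c = d0.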
Three different types of propagation are involved:
- Spatial
Test estimates of neighboring pixels.
- Stereo
Use the predicted disparity to test the plane hypothesis from the other image.
- Temporal
Test the plane estimate at the same pixel in the previous/next frame. (Works only on stereo video sequences.)
Like in PatchMatch, we explore new plane candidates by randomly perturbing the disparity and the normal.
Occlusions can be handled via left/right consistency checking.
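A minimal left/right consistency check might look like this (the tolerance and explicit loops are kept naive for clarity; real implementations vectorize this):

```python
import numpy as np

def lr_consistency_mask(disp_left, disp_right, tol=1.0):
    """Flag pixels whose left and right disparities disagree, which
    typically indicates an occlusion. A left pixel (y, x) with disparity d
    maps to (y, x - d) in the right image; a consistent match stores
    (approximately) the same disparity there."""
    H, W = disp_left.shape
    valid = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            xr = int(round(x - disp_left[y, x]))
            if 0 <= xr < W:
                valid[y, x] = abs(disp_left[y, x] - disp_right[y, xr]) <= tol
    return valid
```

Pixels flagged as invalid are usually discarded or filled in from their valid neighbors.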
Conclusion
PatchMatch Stereo was the first paper to mimic the PatchMatch scheme to estimate depth and normal maps. It has paved the way for many subsequent algorithms.
4. PatchMatch-based Joint View-Selection and Depth Estimation [3]
Moving from simple Stereo to Multi-View Stereo can be challenging due to the complexity of handling occlusions, as well as the increased computational complexity and the need for accurate camera calibration.
In 2014, Zheng et al. introduced “PatchMatch Based Joint View Selection and Depthmap Estimation.” At the heart of this technique is the desire to build a robust Probabilistic Graphical Model that can accurately model the underlying structure of the scene and handle complex scenarios such as occlusions, repetitive patterns, and illumination changes. This involves analyzing each pixel in a reference image and determining which source images provide the most useful information for estimating its depth.
Propagation scheme
For the sake of readability, I have omitted a detail about the propagation mechanism in the PatchMatch section. Pixels are processed sequentially in scan order instead of updating the entire array simultaneously using the four direct neighbors. The propagation starts from the top-left corner of the image and proceeds in a row-wise manner from left to right, then top to bottom. Each pixel uses estimates only from its two neighbors that have already been processed, i.e., top and left. Moreover, in every other iteration, the direction of travel is reversed, and we start from the bottom-right corner.
As its name suggests, “PatchMatch Based Joint View Selection and Depthmap Estimation” uses a PatchMatch propagation scheme. By propagating along only rows or columns, the algorithm can be better parallelized, allowing for faster and more efficient computation. We therefore alternate between four propagation modes throughout the iterations: downward, rightward, upward, and leftward.
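One such row-parallel sweep can be sketched as follows. This is a depth-only toy version, with a caller-supplied `row_cost` function standing in for the actual photoconsistency measure (both names are my own):

```python
import numpy as np

def downward_sweep(depth, row_cost):
    """One downward pass of row-parallel PatchMatch propagation.
    `row_cost(y, d_row)` returns the per-pixel matching cost of assigning
    depths `d_row` to row y. Every pixel of a row tests the hypothesis of
    the pixel directly above it, and all pixels of the row are updated at
    once (vectorized), which is what makes the scheme GPU-friendly."""
    H = depth.shape[0]
    for y in range(1, H):
        cur = row_cost(y, depth[y])
        cand = depth[y - 1]              # hypotheses propagated from above
        new = row_cost(y, cand)
        better = new < cur               # vectorized accept/reject per pixel
        depth[y] = np.where(better, cand, depth[y])
    return depth
```

The upward, leftward, and rightward sweeps are identical up to the traversal direction and the neighbor used.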
Variables
The most important contribution of this paper lies in the formulation of a robust Probabilistic Graphical Model that takes view selection into account. Thus, the authors have taken the liberty of reducing the dimension of the search space by restricting themselves to fronto-parallel planes, with the idea of adding the normals in a later paper. Depth only for the moment!
In turn, each image is labelled as a reference and compared against all the other images. The reference image is the image for which the depth map is to be computed, while the source images, i.e., the others, provide additional information to build or refine the estimated depths.
Since the propagation scheme runs in parallel on rows or columns, we can continue the explanations by considering a single row of the reference image.
For each pixel p of the row in the reference image, we define:
- the patch around p in the reference image
- the depth θp at p in the reference image
- the boolean list Zp indicating whether or not each source image correctly sees what’s in the reference patch centered around p, which will allow us to avoid outliers during matching and triangulation
It should be noted that neither the depths nor the occlusion indicators are known in advance, so both must be optimized simultaneously, hence the paper's name.
Probabilistic graphical model
Let’s define the underlying probabilistic model describing the conditional dependencies between our variables.
Given a pixel p, its patch Xref_p and its depth θp in the reference image, we can compute through homography warping the appearance Xm of the patch as seen from the point of view of the source image m. If Zm_p=1, i.e., if there’s no occlusion, it’s very likely to look similar to the reference patch. On the contrary, if Zm_p=0, the object is occluded and what is observed is not related to the reference patch, which means we have a uniform distribution over all possible colored patches.
Neighboring pixels are likely to have the same view-selection. Thus, we can form a Markov Chain with the following 2x2 transition probability matrix for the two possible states of Z (0 or 1), with γ close to 1.
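As a small numerical illustration of why γ close to 1 produces smooth view selection (the γ value below is chosen arbitrarily, not taken from the paper), the transition matrix and the persistence of a state over a run of pixels can be written as:

```python
import numpy as np

# Two-state transition matrix for Z (0 = occluded, 1 = visible).
# gamma close to 1 makes the view-selection state "sticky" along a row.
gamma = 0.999  # illustrative value
P = np.array([[gamma, 1 - gamma],
              [1 - gamma, gamma]])

# Probability of ending in the same visible state after k pixel transitions:
k = 100
stay = np.linalg.matrix_power(P, k)[1, 1]
```

Even after 100 pixels, the chain keeps its state with probability above 0.9, so an occlusion decision tends to hold over whole image regions rather than flickering pixel to pixel.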
Adding such a smoothness constraint on the depths of nearby pixels would be counterproductive as this tends to generate a blurred depth map. Thus, we keep the depths independent from each other.
Generalized Expectation-Maximization (GEM)
To maintain simplicity and avoid overwhelming the reader with technical information, I have chosen to omit certain details. However, it is worth mentioning that the solution’s computational feasibility is achieved using variational inference techniques. It is also important to note that this has transformed the binary occlusion indicators into continuous distributions.
Instead of optimizing the depths and occlusion indicators simultaneously, they are solved using the Expectation-Maximization paradigm, where the following two steps are interleaved:
- Expectation (image selection): infer Z using the forward-backward algorithm.
Considering that θ is known, all the patches can be pre-computed through homography warping. Thus, we end up with a Hidden Markov Model (HMM) with the occlusion indicators Z as hidden variables and the patches as observed variables. The transition and emission probabilities have been defined in the previous section. Finally, we can use the forward-backward algorithm (the smoothing routine at the heart of Baum-Welch) to retrieve the probability distributions of the occlusion indicators Z given the depths θ.
- Maximization (depth estimation): infer θ using the PatchMatch scheme.
Considering that Z is known, we can run a PatchMatch propagation scheme, where each pixel compares a small number of depth estimates θ to find the one that maximizes the patch similarity with the source images. Since the occlusion indicators have been transformed into floating-point weights, we can optimize the weighted sum of these similarity scores.
However, evaluating a given depth can be very expensive if we need to extract the patches for all non-zero weights, i.e., for almost all the source images, since floating-point weights can be very small without being exactly 0. A smart alternative is to approximate the weighted sum using Monte Carlo sampling: draw N source images according to the distribution of weights and compute their average similarity score. According to the authors, N≈15 is a good trade-off.
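The Monte Carlo approximation of the weighted score can be sketched as follows, where `score_fn` is a stand-in for the expensive patch-comparison step (both the function name and interface are my own):

```python
import numpy as np

def mc_weighted_score(weights, score_fn, N=15, rng=None):
    """Approximate the weighted average sum_m w_m * s_m / sum_m w_m by
    drawing N source-image indices from the weight distribution and
    averaging their similarity scores. Only N calls to the expensive
    score_fn are made, instead of one per source image."""
    rng = rng or np.random.default_rng()
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    idx = rng.choice(len(p), size=N, p=p)  # sample views by their weight
    return np.mean([score_fn(m) for m in idx])
```

Views with negligible weight are almost never sampled, so they cost essentially nothing, which is exactly the point of the approximation.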
Conclusion
“PatchMatch Based Joint View Selection and Depthmap Estimation” is a promising approach for MVS reconstruction that employs a clean Probabilistic Graphical Model to jointly estimate depth maps and select optimal views. The algorithm leverages the strengths of PatchMatch propagation and variational inference to efficiently estimate high-quality depth maps for a given scene. However, the method could be further improved by incorporating normal estimation techniques to better handle slanted and curved surfaces.
5. COLMAP [4, 5]
In 2016, Schönberger et al. introduced “Structure-from-Motion Revisited,” which proposes a new SfM technique that improves upon existing state-of-the-art methods.
The same year, the same lead author published “Pixelwise View Selection for Unstructured Multi-View Stereo,” which expands on the method proposed by Zheng.
The full reconstruction process, named COLMAP, which is available as an open-source implementation, comprises these two algorithms as its fundamental components.
Let’s explain the innovative ideas brought by the second paper, i.e., the Multi-View Stereo part.
Normal estimation
Zheng, the lead author of the paper described in the previous section, is also a co-author of this one. The authors enhanced the previous model by incorporating per-pixel normal estimation alongside depth estimation. This enables the use of slanted support windows, which avoids the assumption of fronto-parallelism.
Unlike PatchMatch Stereo, the plane assumptions are expressed in meters and not in the disparity space because disparity only makes sense in the simple case of two left/right stereo images.
Geometric priors
To enhance the robustness of the pixel-wise view selection, geometric priors were integrated into the method on a per-pixel basis in addition to the photometric occlusion indicators.
- Triangulation Prior
It encourages selecting views with a sufficient baseline, i.e., distance between the camera centers. A larger baseline results in a greater amount of parallax between views, which makes the triangulation more accurate.
- Resolution Prior
It encourages the selection of views with a resolution similar to the reference view, because images with significantly different resolutions can lead to inconsistencies in the depth estimation process. Accounting for the wide range of camera resolutions found in unstructured datasets is particularly crucial.
- Incident Prior
It encourages the selection of views with non-oblique viewing directions. Oblique views, where the viewing direction differs significantly from the surface normal, can result in large errors in depth estimation.
View selection smoothness
The alternating propagation directions (down, right, up, left) used in interleaved inference may cause oscillations in Z. To address this issue, temporal smoothness can be added to the existing spatial smoothness.
Please take note that the term “time” mentioned here pertains to the iterative steps involved in the optimization process and is not related to the temporal aspect of a captured video.
Geometric consistency
Geometric consistency checks are usually enforced during post-processing, like left-right checks in conventional stereo.
The authors innovate here by integrating the geometric consistency directly into the probabilistic graphical model on the same level as the photometric consistency.
The forward-backward reprojection error against the source views can be used to quantify the degree of geometric consistency.
- Forward
Convert the pixel p of the reference view to a 3D point in the scene using its depth θp, and project it onto the source view m.
- Backward
Estimate its depth by interpolating the depth estimates of the source view m, and convert it to a 3D point in the scene. Finally, project it back to the reference view and compare the distance error with respect to the original pixel p.
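Under a simplified pinhole model where both views share the same intrinsics K and (R, t) maps reference-camera coordinates to source-camera coordinates (simplifications I make for brevity; COLMAP handles per-view intrinsics and interpolation), the forward-backward reprojection error can be sketched as:

```python
import numpy as np

def backproject(K_inv, depth, p):
    """Lift pixel p = (u, v) with depth `depth` to a 3D camera-space point."""
    u, v = p
    return depth * (K_inv @ np.array([u, v, 1.0]))

def project(K, X):
    """Project a 3D camera-space point to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def reprojection_error(p, depth_ref, K, R, t, depth_src):
    """Forward-backward reprojection error of pixel p in the reference view.
    `depth_src(q)` returns the (interpolated) source-view depth at pixel q."""
    K_inv = np.linalg.inv(K)
    # Forward: reference pixel -> 3D point -> source pixel.
    X_ref = backproject(K_inv, depth_ref, p)
    X_src = R @ X_ref + t
    q = project(K, X_src)
    # Backward: source pixel -> 3D point (via the source depth map)
    # -> back to the reference view.
    Y_src = backproject(K_inv, depth_src(q), q)
    Y_ref = R.T @ (Y_src - t)
    p_back = project(K, Y_ref)
    return np.linalg.norm(p_back - np.asarray(p, dtype=float))
```

A small error means the two depth maps agree on the underlying 3D point; a large error indicates an occlusion or an inconsistent estimate.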
Conclusion
Overall, this work represents an important step forward in Multi-View Stereo reconstruction.
As an open-source tool, COLMAP encourages collaboration and innovation within the community, making it an important contributor to the advancement of the field.
I hope you enjoyed reading this article and that it gave you more insight into PatchMatch Multi-View Stereo!
Bibliography
[1] C. Barnes et al., “PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing,” SIGGRAPH 2009.
[2] M. Bleyer et al., “PatchMatch Stereo: Stereo Matching with Slanted Support Windows,” BMVC 2011.
[3] E. Zheng et al., “PatchMatch Based Joint View Selection and Depthmap Estimation,” CVPR 2014.
[4] J. L. Schönberger and J.-M. Frahm, “Structure-from-Motion Revisited,” CVPR 2016.
[5] J. L. Schönberger et al., “Pixelwise View Selection for Unstructured Multi-View Stereo,” ECCV 2016.