PatchMatch Multi-View Stereo

Understand how the MVS tool provided by COLMAP works

Thomas Rouch
Better Programming



1. Introduction

SfM/MVS

Multi-view Stereo (MVS) refers to a family of techniques used in Computer Vision to reconstruct a dense 3D scene from multiple 2D images taken from different viewpoints. It turns smartphones and their high-resolution cameras into robust, cost-effective sensors for capturing high-quality 3D data.

You may also have heard of Structure-from-Motion (SfM), another Computer Vision technique that retrieves 3D geometry from a set of 2D images. While both techniques aim to estimate the 3D structure of the scene, they differ in their approaches and goals.

  • Structure-from-Motion (SfM)
    SfM simultaneously estimates the camera poses (position and orientation) of each input photograph and a sparse 3D structure of the scene. Each 2D observation of a key point tracked over several images corresponds to a single 3D point in space. Thus, we can retrieve the most plausible camera poses and 3D points by minimizing the 2D reprojection error of these key points. While SfM is typically limited to offline processing of static scenes, Visual Simultaneous Localization and Mapping (VSLAM) can be used for real-time localization in dynamic scenes.
  • Multi-View Stereo (MVS)
    MVS algorithms, on the other hand, typically assume that the camera poses and the sparse 3D structure are known (from SfM or other sources). They use stereo matching and triangulation techniques to estimate the 3D coordinates of points visible in multiple images. While SfM provides only sparse information about the scene, MVS yields dense 3D data: depthmaps, pointclouds, signed distance functions (SDF), or meshes.
Structure-from-Motion and Multi-View Stereo — Figure by the author

MVS isn’t just a mere generalization of stereo to more than two cameras. Significantly more complex problems emerge along the way, such as dealing with strong occlusions, which call for entirely new algorithms.

COLMAP

COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface. (Website, GitHub)

COLMAP’s combination of accuracy, speed, ease of use, flexibility, and open-source nature has made it a popular choice among researchers and engineers who are interested in 3D reconstruction from images.

In this article, we’ll focus on the major steps that led to the MVS method provided by the famous COLMAP software. It belongs to the family of MVS algorithms based on PatchMatch that provide depthmaps as output.

ETH3D benchmark

The ETH3D benchmark, introduced by the Computer Vision Lab at ETH Zürich, is a widely recognized reference for 3D reconstruction in Computer Vision. It provides a standardized set of datasets and evaluation metrics for 3D reconstruction algorithms, organized into several sub-benchmarks, each focusing on a different aspect of the problem.

The category we’re interested in for MVS is called “High-resolution multi-view.” The algorithm that led this ranking for the last year or so (SteuartSystems-ACMMP) was based on PatchMatch, as presented in this article, but another model has recently overtaken it.


2. PatchMatch [1]

In 2009, Barnes et al. introduced the PatchMatch algorithm, which efficiently finds the nearest matching patches between two images. It does this by randomly initializing the nearest neighbor field and then iteratively improving it using a patch-based search strategy. It is commonly used for real-time image and video editing, object tracking, and stereo reconstruction.

Nearest-Neighbor Field

Given two images, A and B, we’d like to find for every patch in A its nearest neighbor in B, given a patch distance metric, e.g. the sum of squared differences (SSD). As illustrated in the figure below, it’s very unlikely that image B is a simple translation of image A.

Image-patches in A and their nearest neighbors in B — Figure by the author

Given two images of M pixels and patches of m pixels, the naive brute-force approach has a complexity of O(m*M²) since it must compute the patch distance over m pixels for each patch position in A and each patch position in B.
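To make that cost concrete, here is a deliberately naive NumPy sketch of the brute-force search (a toy illustration for grayscale images, not production code):

```python
import numpy as np

def nnf_brute_force(A, B, patch=7):
    """O(m*M²) baseline: compare every patch of A against every patch of B (SSD)."""
    r = patch // 2
    h, w = A.shape[:2]
    hb, wb = B.shape[:2]
    nnf = np.zeros((h, w, 2), dtype=int)   # offset field (dy, dx)
    for y in range(r, h - r):
        for x in range(r, w - r):          # every patch position in A ...
            pa = A[y-r:y+r+1, x-r:x+r+1].astype(np.float64)
            best, best_q = np.inf, (y, x)
            for by in range(r, hb - r):    # ... against every position in B
                for bx in range(r, wb - r):
                    pb = B[by-r:by+r+1, bx-r:bx+r+1].astype(np.float64)
                    d = np.sum((pa - pb) ** 2)   # m-pixel patch distance
                    if d < best:
                        best, best_q = d, (by, bx)
            nnf[y, x] = (best_q[0] - y, best_q[1] - x)  # store the offset
    return nnf
```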

There have been improvements, such as using a tree structure to get down to O(m*M*log(M)), and dimensionality-reduction techniques like PCA. But this was still far too slow to be used in real time for image editing.

Key ideas of PatchMatch

Here are the three main ideas introduced by PatchMatch to run much faster than a KD-tree with PCA:

  1. Dimensionality of offset space
    Instead of clustering the O(M) patches in patch space (dim m), we can search inside the 2D offset space.
  2. Natural structure of images
    Adjacent pixels are likely to have adjacent nearest-neighbors. Thus, we can perform the search in an interdependent manner.
  3. The law of large numbers
    Even a random offset is likely to be a good guess for many patches over a large image (M).

As illustrated in the figure below, we now consider the Nearest-Neighbor Field as a 2D array of offsets in pixels with two channels X and Y.

This array will be iteratively refined by the PatchMatch algorithm.

Nearest-Neighbor Field represented by an array of offsets to apply on coordinates in image A to find the nearest neighbor patch in image B — Figure by the author

Steps

When I first read this paper, I didn’t really understand why the third point was so important. At first glance, drawing random guesses doesn’t seem like an efficient way of solving an optimization problem. However, when you combine it with the second point, everything starts to make sense. Each pixel searches randomly but also communicates with its neighbors to see if they have found a more relevant offset. Thus, if a random draw results in a good offset, it can be propagated over a whole region of the image.

The figure below comes from the original PatchMatch paper and describes the initialization and the two steps performed at each iteration:

  1. Propagation
    Use coherence to propagate good solutions to adjacent pixels
  2. Search
    Test random offsets to explore outside local minima
Phases of the randomized nearest neighbor algorithm — Figure from the original paper “PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing”
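Putting the pieces together, below is a minimal single-scale NumPy sketch of PatchMatch: random initialization, then alternating scan-order sweeps of propagation and exponentially shrinking random search. It is a didactic toy under simplifying assumptions (grayscale images, patches kept fully inside both images), not the authors’ optimized implementation:

```python
import numpy as np

def patchmatch(A, B, patch=7, iters=5, seed=0):
    """Minimal PatchMatch sketch: estimates an offset field nnf such that
    A's patch at (y, x) matches B's patch at (y, x) + nnf[y, x] (SSD metric)."""
    rng = np.random.default_rng(seed)
    r = patch // 2
    h, w = A.shape[:2]
    hb, wb = B.shape[:2]

    def ssd(y, x, dy, dx):
        by, bx = y + dy, x + dx
        pa = A[y-r:y+r+1, x-r:x+r+1].astype(np.float64)
        pb = B[by-r:by+r+1, bx-r:bx+r+1].astype(np.float64)
        return np.sum((pa - pb) ** 2)

    def clamp(y, x, dy, dx):
        # Keep the target patch fully inside B.
        return (np.clip(y + dy, r, hb-1-r) - y, np.clip(x + dx, r, wb-1-r) - x)

    # 1. Initialization: an independent random offset for every pixel.
    nnf = np.zeros((h, w, 2), dtype=int)
    cost = np.full((h, w), np.inf)
    for y in range(r, h - r):
        for x in range(r, w - r):
            nnf[y, x] = clamp(y, x, rng.integers(-hb, hb), rng.integers(-wb, wb))
            cost[y, x] = ssd(y, x, *nnf[y, x])

    for it in range(iters):
        # Scan order alternates: even iterations sweep from the top-left,
        # odd iterations from the bottom-right.
        step = 1 if it % 2 == 0 else -1
        ys = range(r, h - r) if step == 1 else range(h-1-r, r-1, -1)
        xs = range(r, w - r) if step == 1 else range(w-1-r, r-1, -1)
        for y in ys:
            for x in xs:
                # 2. Propagation: adopt an already-visited neighbor's offset
                # if it gives a lower patch distance (image coherence).
                for ny, nx in ((y - step, x), (y, x - step)):
                    if r <= ny < h - r and r <= nx < w - r:
                        dy, dx = clamp(y, x, *nnf[ny, nx])
                        c = ssd(y, x, dy, dx)
                        if c < cost[y, x]:
                            nnf[y, x], cost[y, x] = (dy, dx), c
                # 3. Random search: perturb the current best offset with an
                # exponentially shrinking radius to escape local minima.
                radius = max(hb, wb)
                while radius >= 1:
                    dy = nnf[y, x, 0] + rng.integers(-radius, radius + 1)
                    dx = nnf[y, x, 1] + rng.integers(-radius, radius + 1)
                    dy, dx = clamp(y, x, dy, dx)
                    c = ssd(y, x, dy, dx)
                    if c < cost[y, x]:
                        nnf[y, x], cost[y, x] = (dy, dx), c
                    radius //= 2
    return nnf
```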

Conclusion

Its complexity is still O(m*M*log(M)), but the leading constants are substantially smaller.

  • PatchMatch CPU: about 20–100x faster than KD-tree with PCA
  • PatchMatch GPU: about 7x faster than PatchMatch CPU

Image Inpainting and Texture Synthesis can now be performed in real time in image editing software.

PatchMatch has unlocked a new level of performance and efficiency in patch-based image processing, making it an invaluable tool in the field of Computer Vision.


3. PatchMatch Stereo [2]

In 2011, Bleyer et al. introduced PatchMatch Stereo, which is an extension of the PatchMatch algorithm that uses the same randomized search strategy to find correspondences between patches in left and right stereo images and reconstruct the 3D scene.

However, the implementation and the optimization problem being solved in PatchMatch Stereo are completely different from the original PatchMatch algorithm. It doesn’t use the PatchMatch algorithm as a dependency; rather, it borrows its propagation scheme to solve the optimization problem of stereo matching.

There’s no fancy depthmap inpainting using PatchMatch to be found!

Stereo disparity

In a pair of stereo images, a point in one image corresponds to a line in the other image, and vice versa. The Wikipedia figure below illustrates this very well:

Epipolar Geometry — Figure from the “Epipolar geometry” Wikipedia page

Stereo image rectification transforms a pair of stereo images using homographies so that corresponding epipolar lines become parallel and aligned with the image rows. Once the images have been rectified, stereo matching can be performed more easily and accurately, since corresponding points lie on the same horizontal line.

The disparity corresponds to this horizontal shift in pixels between two matching patches. It is inversely proportional to the depth of the object. Specifically, the larger the disparity between the two images, the closer the object is to the cameras.
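For reference, with rectified cameras of focal length f (in pixels) and baseline b (the distance between the two camera centers), this relation is simply:

d = f·b / Z

where Z is the depth of the point in the scene.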

Matching is done by taking a window of pixels around each pixel in one image and searching for the most similar window in the other image within a given disparity range. However, there are two main issues with common local stereo methods:

  • Integer-valued disparities
    From a computational point of view, it is very efficient to test only a small set of integer disparities. But true disparities are sub-pixel in nature, and discretizing the search space can lead to inaccurate depth estimates and a “staircase effect” in the depth map.
  • Bias toward fronto-parallel surfaces
    Similarly, searching only for a patch translation imposes a strong assumption, namely that the disparity is locally constant, i.e., that the surface is a fronto-parallel plane. Slanted or curved surfaces do change the shape and size of an object’s projection on the image plane. In short, we can’t just slide a rectangular patch over the image; we need to warp it according to the normal of the local plane.

The figure below is quite simplistic but illustrates well how a patch in the left image (black) should be projected onto the right image, given different depth/normal hypotheses. The blue hypothesis is a fronto-parallel plane and results in a rectangular patch. However, the red and green hypotheses illustrate how the patch should be warped for slanted surfaces.

Left/Right images and patch projections for different depth/normal hypotheses — Figure by the author

Key idea of PatchMatch Stereo

Here comes the twist. The brilliant idea of the authors of PatchMatch Stereo was to mimic the propagation scheme of PatchMatch, but on an array of plane hypotheses instead of mere 2D offsets.

As for PatchMatch, we can leverage the fact that neighboring pixels are likely to belong to the same local plane.

To avoid unnecessary conversions between meters and pixels, a plane is represented directly by the coefficients a, b, and c of an affine function that maps pixel coordinates (u, v) to a disparity d. Each pixel has its own local plane estimate, hence the index p. Since it’s only a local model, (u, v) should stay relatively close to p.

d_p(u, v) = (a_p, b_p, c_p) · (u, v, 1) = a_p·u + b_p·v + c_p

Unlike when dealing with integer disparities, we have infinite possible plane candidates. This will, however, remain computationally feasible since the PatchMatch scheme will only provide a small set of plane candidates to compare.
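To see what evaluating one such candidate looks like, here is a minimal sketch of a slanted-window matching cost. The absolute-difference cost and nearest-pixel lookup are simplifications; the paper uses sub-pixel interpolation and an adaptive-weight dissimilarity, and the window is assumed to lie inside the left image:

```python
def plane_cost(left, right, p, plane, half=17):
    """Matching cost of a slanted-plane hypothesis at pixel p = (u0, v0):
    each pixel (u, v) of the window gets its own disparity from the plane,
    instead of one constant (fronto-parallel) disparity for the whole patch."""
    a, b, c = plane
    u0, v0 = p
    cost = 0.0
    for v in range(v0 - half, v0 + half + 1):
        for u in range(u0 - half, u0 + half + 1):
            d = a * u + b * v + c       # per-pixel disparity from the plane
            ur = int(round(u - d))      # (real code interpolates sub-pixel)
            if 0 <= ur < right.shape[1]:
                cost += abs(float(left[v, u]) - float(right[v, ur]))
            else:
                cost += 255.0           # penalty for leaving the image
    return cost
```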

Steps

As for the initialization, plain random values for the a, b, and c coefficients won’t sample the plane space correctly. The idea is instead to sample a random disparity in the allowed range and a random unit vector for the plane normal, and then merge them into a plane equation, as sketched below.
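Here is a minimal sketch of this initialization, converting a sampled (disparity, normal) pair into the (a, b, c) coefficients defined above; the function name and the degeneracy guard are my own additions:

```python
import numpy as np

def random_plane(u, v, d_min, d_max, rng=None):
    """Sample a random disparity and a random unit normal at pixel (u, v),
    then convert them to plane coefficients (a, b, c) such that
    d(u, v) = a*u + b*v + c reproduces the sampled disparity at (u, v)."""
    rng = rng or np.random.default_rng()
    d0 = rng.uniform(d_min, d_max)     # random disparity in the allowed range
    n = rng.normal(size=3)
    n /= np.linalg.norm(n)             # random point on the unit sphere
    if abs(n[2]) < 1e-3:               # guard against near-degenerate planes
        n[2] = 1e-3
    nx, ny, nz = n
    # Plane through (u, v, d0) with normal n, solved for the disparity:
    a, b = -nx / nz, -ny / nz
    c = (nx * u + ny * v + nz * d0) / nz
    return a, b, c
```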

Three different types of propagation are involved:

  • Spatial
    Test estimates of neighboring pixels.
  • Stereo
    Use the predicted disparity to test the plane hypothesis from the other image.
  • Temporal
    Test the plane estimate at the same pixel in the previous/next frame. (Works only on stereo video sequences)

Like in PatchMatch, we explore new plane candidates by randomly perturbing the disparity and the normal.

Occlusions can be handled via left/right consistency checking.
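A minimal sketch of such a check, which flags pixels whose left-image disparity disagrees with the right-image disparity at the matched location (the one-pixel tolerance is an assumption):

```python
import numpy as np

def left_right_consistent(disp_left, disp_right, tol=1.0):
    """Boolean mask of pixels passing the left/right consistency check:
    following the disparity to the right image and back should land (almost)
    on the starting pixel; occluded pixels typically fail this test."""
    h, w = disp_left.shape
    ok = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disp_left[y, x]))  # matching column on the right
            if 0 <= xr < w:
                ok[y, x] = abs(disp_left[y, x] - disp_right[y, xr]) <= tol
    return ok
```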

Conclusion

PatchMatch Stereo was the first paper to mimic the PatchMatch scheme to estimate the depth and normal maps. It has paved the way for many future algorithms.


4. PatchMatch-based Joint View-Selection and Depth Estimation [3]

Moving from simple two-view stereo to Multi-View Stereo is challenging: occlusions become harder to handle, the computational cost increases, and accurate camera calibration becomes essential.

In 2014, Zheng et al. introduced “PatchMatch Based Joint View Selection and Depthmap Estimation.” At the heart of this technique is the desire to build a robust Probabilistic Graphical Model that can accurately model the underlying structure of the scene and handle complex scenarios such as occlusions, repetitive patterns, and illumination changes. This involves analyzing each pixel in a reference image and determining which source images provide the most useful information for estimating its depth.

Propagation scheme

For the sake of readability, I omitted a detail about the propagation mechanism in the PatchMatch section: pixels are processed sequentially in scan order instead of updating the entire array simultaneously using the four direct neighbors. The propagation starts from the top-left corner of the image and proceeds row-wise, from left to right and top to bottom, so each pixel uses estimates only from the two neighbors that have already been processed, i.e., top and left. Moreover, in every other iteration, the direction of travel is reversed, starting from the bottom-right corner.

As its name suggests, “PatchMatch Based Joint View Selection and Depthmap Estimation” uses a PatchMatch propagation scheme. By propagating along only rows or only columns, the algorithm can be better parallelized, allowing for faster computation. We therefore alternate between four propagation modes throughout the iterations: downward, rightward, upward, and leftward.

Variables

The most important contribution of this paper lies in the formulation of a robust Probabilistic Graphical Model that takes view selection into account. To keep the search space small, the authors restrict the model to fronto-parallel planes, with the idea of adding normals in a later paper. Depth only for the moment!

Each image is considered in turn as the reference and compared against all the others. The reference image is the one for which the depth map is computed, while the source images, i.e., the others, provide additional information to build or refine the estimated depths.

Since the propagation scheme runs in parallel on rows or columns, we can continue the explanations by considering a single row of the reference image.

For each pixel p of the row in the reference image, we define:

  • the patch around p in the reference image
  • the depth θp at p in the reference image
  • the boolean list Zp indicating, for each source image, whether it correctly sees what’s in the reference patch centered around p, which will allow us to discard outliers during matching and triangulation

It should be noted that neither the depths nor the occlusion indicators are known in advance, so both must be optimized simultaneously, hence the paper's name.

Reference patch around pixel p + depth and occlusion indicators for the source images — Figure by the author

Probabilistic graphical model

Let’s define the underlying probabilistic model describing the conditional dependencies between our variables.

Given a pixel p, its patch Xref_p, and its depth θp in the reference image, we can compute through homography warping the appearance Xm_p of the patch as seen from the point of view of source image m. If Zm_p = 1, i.e., if there’s no occlusion, this warped patch is very likely to look similar to the reference patch. On the contrary, if Zm_p = 0, the object is occluded, and what is observed is unrelated to the reference patch, which we model as a uniform distribution over all possible colored patches.

Neighboring pixels are likely to share the same view selection. Thus, for each source image, the indicators Z along a row form a Markov chain over the two possible states of Z (0 or 1): two consecutive pixels keep the same state with probability γ and switch with probability 1 − γ, where γ is close to 1.

Adding such a smoothness constraint on the depths of nearby pixels would be counterproductive as this tends to generate a blurred depth map. Thus, we keep the depths independent from each other.

Generalized Expectation-Maximization (GEM)

To maintain simplicity and avoid overwhelming the reader with technical information, I have chosen to omit certain details. However, it is worth mentioning that the solution’s computational feasibility is achieved using variational inference techniques. It is also important to note that this has transformed the binary occlusion indicators into continuous distributions.

Instead of optimizing the depths and occlusion indicators simultaneously, they are solved using the Expectation-Maximization paradigm, where the following two steps are interleaved:

  • Expectation (Image selection): Infer Z using the forward-backward algorithm.
    Considering that θ is known, all the patches can be pre-computed through homography warping. Thus, we end up with a Hidden Markov Model (HMM) with the occlusion indicators Z as hidden variables and the patches as observed variables. The transition and emission probabilities were defined in the previous section. Finally, we can use the forward-backward algorithm to retrieve the probability distributions of the occlusion indicators Z given the depths θ.
  • Maximization (Depth estimation): Infer θ using the PatchMatch scheme.
    Considering that Z is known, we can run a PatchMatch propagation scheme, where each pixel compares a small number of depth estimates θ to find the one that maximizes the patch similarity with the source images. Since the occlusion indicators have been transformed into floating-point weights, we can optimize the weighted sum of these similarity scores.
    However, evaluating a given depth can be very expensive if we need to extract the patches for all non-zero weights, i.e., for almost all the source images, since floating-point weights can be very small without being exactly 0. A smart alternative is to approximate the weighted sum with Monte-Carlo sampling: draw N source images according to the distribution of weights and compute their average similarity score. According to the authors, N ≈ 15 is a good trade-off. Both ingredients are sketched below.
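For intuition, here is a minimal NumPy sketch of these two ingredients, for one source image and one row of pixels. The function names, array shapes, and the γ value are illustrative assumptions rather than the paper’s actual implementation:

```python
import numpy as np

def forward_backward(emission, gamma=0.999):
    """E-step ingredient: posterior q(Z_n) for a binary chain of occlusion
    indicators along a row, given emission[n, z] = p(observed patch | Z_n = z).
    gamma (assumed value) is the probability of keeping the same state."""
    N = emission.shape[0]
    T = np.array([[gamma, 1 - gamma],
                  [1 - gamma, gamma]])      # smoothness of the view selection
    alpha = np.zeros((N, 2))
    beta = np.ones((N, 2))
    alpha[0] = 0.5 * emission[0]            # uniform prior on the first state
    alpha[0] /= alpha[0].sum()
    for n in range(1, N):                   # forward pass
        alpha[n] = emission[n] * (alpha[n - 1] @ T)
        alpha[n] /= alpha[n].sum()          # normalize for numerical stability
    for n in range(N - 2, -1, -1):          # backward pass
        beta[n] = T @ (emission[n + 1] * beta[n + 1])
        beta[n] /= beta[n].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

def mc_score(weights, similarity, n_samples=15, rng=None):
    """M-step ingredient: Monte-Carlo estimate of the weighted similarity of
    a depth hypothesis; draws ~15 source images according to their selection
    probabilities instead of evaluating every view."""
    rng = rng or np.random.default_rng()
    p = np.asarray(weights, dtype=float)
    p /= p.sum()                            # normalize into a distribution
    picks = rng.choice(len(p), size=n_samples, p=p)
    return float(np.mean([similarity(m) for m in picks]))
```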

Conclusion

“PatchMatch Based Joint View Selection and Depthmap Estimation” is a promising approach for MVS reconstruction that employs a clean Probabilistic Graphical Model to jointly estimate depthmaps and select optimal views. The algorithm leverages the strengths of PatchMatch propagation and variational inference to efficiently estimate high-quality depthmaps for a given scene. However, the method could be further improved by incorporating normal estimation techniques to handle slanted and curved surfaces better.

5. COLMAP [4, 5]

In 2016, Schönberger et al. introduced “Structure-from-Motion Revisited,” which proposes a new SfM technique that improves upon the existing state of the art.

The same year, the same lead author published “Pixelwise View Selection for Unstructured Multi-View Stereo,” which expands on the method proposed by Zheng.

The full reconstruction pipeline, named COLMAP and available as an open-source implementation, comprises these two algorithms as its fundamental components.

Let’s explain the innovative ideas brought by the second paper, i.e., the Multi-View Stereo part.

Normal estimation

Zheng, lead author of the paper described in the previous section, is also a co-author of this one. The authors enhanced the previous model by incorporating per-pixel normal estimation alongside depth estimation. This enables the use of slanted support windows, which removes the fronto-parallel assumption.

Unlike in PatchMatch Stereo, the plane hypotheses are expressed in meters rather than in disparity space, because disparity only makes sense in the simple case of two rectified left/right stereo images.

Geometric priors

To enhance the robustness of the pixel-wise view selection, geometric priors were integrated into the method on a per-pixel basis in addition to the photometric occlusion indicators.

  • Triangulation Prior
    It encourages selecting views with a sufficient baseline, i.e., distance between the camera centers. A larger baseline results in a greater amount of parallax between views, which makes the triangulation more accurate (see the sketch after this list).
  • Resolution Prior
    It encourages the selection of views with a similar resolution to the reference view. This is because images with significantly different resolutions can lead to inconsistencies in the depth estimation process. Ensuring that the reconstruction algorithm accounts for the wide range of camera resolutions in unstructured datasets is particularly crucial.
  • Incident Prior
    It encourages the selection of views with non-oblique viewing directions. Oblique views, where the viewing direction deviates strongly from the surface normal, can result in large depth errors due to the foreshortening of the observed patch.
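As an illustration of the geometry behind the triangulation prior, here is a minimal sketch computing the angle between the two viewing rays; the actual per-pixel weighting function of the paper is omitted:

```python
import numpy as np

def triangulation_angle(c_ref, c_src, X):
    """Angle (radians) between the two viewing rays to 3D point X. A wider
    angle (larger baseline) makes triangulation better conditioned, which is
    what the triangulation prior rewards."""
    r1, r2 = X - c_ref, X - c_src
    cos_a = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    return float(np.arccos(np.clip(cos_a, -1.0, 1.0)))
```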

View selection smoothness

The alternating propagation directions (down, right, up, left) used in interleaved inference may cause oscillations in Z. To address this issue, temporal smoothness can be added to the existing spatial smoothness.

Note that the “time” mentioned here refers to the iterative steps of the optimization process; it is not related to the temporal aspect of a captured video.

Geometric consistency

A geometric consistency check is usually enforced during post-processing, like the left/right check in conventional stereo.

The authors innovate here by integrating the geometric consistency directly into the probabilistic graphical model on the same level as the photometric consistency.

The forward-backward reprojection error against the source views can be used to quantify the degree of geometric consistency, as sketched after the list below.

  • Forward
    Convert the pixel p of the reference view to a 3D point in the scene using its depth θp and project it to the source view m.
  • Backward
    Estimate the depth at the projected pixel by interpolating the depthmap of the source view m, and lift it back to a 3D point in the scene. Finally, project this point back to the reference view and measure the distance to the original pixel p.
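Here is a minimal sketch of this error for a single pixel. The pinhole conventions, the nearest-neighbor depth lookup (the method interpolates), and the absence of bounds checks are simplifying assumptions:

```python
import numpy as np

def fwd_bwd_reprojection_error(u, v, depth_ref, depth_src, K_ref, K_src, R, t):
    """Forward-backward reprojection error (in pixels) for reference pixel
    (u, v). R, t map reference-camera coordinates to source-camera ones;
    depth_ref and depth_src are the per-view depthmaps."""
    # Forward: lift the pixel to 3D using its depth, project into the source.
    X = depth_ref[v, u] * (np.linalg.inv(K_ref) @ np.array([u, v, 1.0]))
    x_src = K_src @ (R @ X + t)
    u_s, v_s = x_src[:2] / x_src[2]
    # Backward: lift the source pixel with the *source* depthmap ...
    us_i, vs_i = int(round(u_s)), int(round(v_s))
    X_back = depth_src[vs_i, us_i] * (np.linalg.inv(K_src) @ np.array([u_s, v_s, 1.0]))
    # ... and project it back into the reference view.
    x_ref = K_ref @ (R.T @ (X_back - t))
    u_b, v_b = x_ref[:2] / x_ref[2]
    return float(np.hypot(u_b - u, v_b - v))  # distance to the original pixel
```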

Conclusion

Overall, this work represents an important step forward in Multi-View Stereo reconstruction.

As an open-source tool, COLMAP encourages collaboration and innovation within the community, making it an important contributor to the advancement of the field.

Bibliography

[1] Barnes, C.; Shechtman, E.; Finkelstein, A. & Goldman, D. B. (2009), “PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing,” ACM Trans. Graph. 28(3), 24.

[2] Bleyer, M.; Rhemann, C. & Rother, C. (2011), “PatchMatch Stereo — Stereo Matching with Slanted Support Windows,” in BMVC, BMVA Press, pp. 1–11.

[3] Zheng, E.; Dunn, E.; Jojic, V. & Frahm, J.-M. (2014), “PatchMatch Based Joint View Selection and Depthmap Estimation,” in CVPR, IEEE Computer Society, pp. 1510–1517.

[4] Schönberger, J. L. & Frahm, J.-M. (2016), “Structure-from-Motion Revisited,” in CVPR, IEEE Computer Society, pp. 4104–4113.

[5] Schönberger, J. L.; Zheng, E.; Frahm, J.-M. & Pollefeys, M. (2016), “Pixelwise View Selection for Unstructured Multi-View Stereo,” in ECCV (3), Springer, pp. 501–518.

I hope you enjoyed reading this article and that it gives you more insights on PatchMatch Multi-View Stereo!

My GitHub page
