3D Scene Flow Estimation with a Piecewise Rigid Scene Model

3D scene flow estimation aims to jointly recover dense geometry and 3D motion from stereoscopic image sequences, thus generalizes classical disparity and 2D optical flow estimation. To realize its conceptual benefits and overcome limitations of many existing methods, we propose to represent the dynamic scene as a collection of rigidly moving planes, into which the input images are segmented. Geometry and 3D motion are then jointly recovered alongside an over-segmentation of the scene. This piecewise rigid scene model is significantly more parsimonious than conventional pixel-based representations, yet retains the ability to represent real-world scenes with independent object motion. It, furthermore, enables us to define suitable scene priors, perform occlusion reasoning, and leverage discrete optimization schemes toward stable and accurate results. Assuming the rigid motion to persist approximately over time additionally enables us to incorporate multiple frames into the inference. To that end, each view holds its own representation, which is encouraged to be consistent across all other viewpoints and frames in a temporal window. We show that such a view-consistent multi-frame scheme significantly improves accuracy, especially in the presence of occlusions, and increases robustness against adverse imaging conditions. Our method currently achieves leading performance on the KITTI benchmark, for both flow and stereo.


Introduction
The scene flow of a dynamic scene is defined as a dense representation of the 3D shape and its 3D motion field. Scene flow estimation aims to extract this information from images captured by two (or more) cameras at two (or more) different time instants. Applications that benefit from knowing the scene flow include 3D video generation for 3D-TV (Hung et al. 2013), motion capture (Courchay et al. 2009;Park et al. 2012;Vedula et al. 1999), and driver assistance (e.g., Müller et al. 2011;Rabe et al. 2010;Wedel et al. 2008). The 3D scene flow can be seen as a combination of two classical computer vision problems-it generalizes optical flow to 3D, or alternatively, dense stereo to dynamic scenes.
While progress in dense binocular stereo (Bleyer et al. 2011b;Hirschmüller 2008;Yamaguchi et al. 2012, etc.) and optical flow (Brox et al. 2004;Sun et al. 2010;Unger et al. 2012, among others) has been both steady and significant over the years, the performance of 3D scene flow algorithms (e.g., Basha et al. 2010;Huguet and Devernay 2007;Wedel et al. 2008) had been lacking in comparison. Only recently, methods emerged (Vogel et al. 2013b(Vogel et al. , 2014Yamaguchi et al. 2014) that could leverage the additional information present in stereo video streams and outperform their dedicated twodimensional counterparts at their respective tasks.
This may seem surprising, because 3D scene flow has a lot of commonalities with stereo and optical flow. This includes some of the principal difficulties, for example  Vaudrey et al. (2008): jointly estimated 3D geometry, 3D motion vectors, and superpixel boundaries, rendered from a different viewpoint matching ambiguities due to insufficient evidence from the local appearance, or the aperture problem (more precisely a 3D version of it). Therefore, 3D scene flow estimation similarly requires prior assumptions about geometry and motion. A recent trend in both stereo and optical flow is to move away from simple pixelwise smoothness priors, as they have been found limiting. More expressive priors have been introduced, for example, by virtue of an over-parameterization (Nir et al. 2008), layered (Sun et al. 2010) or piecewise planar scene models (Bleyer et al. 2011b). In contrast, there has been relatively little work on using advanced priors in scene flow estimation. One exception is a regularizer that promotes local rigidity (Vogel et al. 2011), a common property of realistic scenes, by penalizing deviations from it.

Piecewise Rigid Scene Model
Our first contribution is to go one step further and represent dynamic scenes as a collection of planar regions, each undergoing a rigid motion. Following previous work in stereo (Bleyer et al. 2011b), we argue that most scenes of interest consist of regions with a consistent motion pattern, into which they can be segmented. Consequently, we aim to jointly recover an implicit (over-)segmentation of the scene into planar, rigidly moving regions, as well as the shape and motion parameters of those regions (see Fig. 1). As we will show, such a parsimonious model is well-suited for many scenes of interest: The approximation holds well enough to capture the shape and motion of many real-world scenarios accurately, including scenes with independent object motion, while the stronger regularization affords stability. At the same time, reasoning in terms of rigid planar regions rather than pixels drastically reduces the number of unknowns to be recovered. Thereby, we additionally address the challenge of optimization or inference, one of the other principal difficulties that 3D scene flow shares with stereo and optical flow.
We (implicitly) represent 3D scene flow by assigning each pixel to a rigidly moving 3D plane, which has 9 continuous degrees of freedom (3 plane parameters, 6 motion parameters). To bootstrap their estimation, we start not from individual pixels, but from an initial superpixel segmentation of the scene. Based on the superpixels we compute a large, but finite set of candidate (moving) planes, and cast scene flow estimation as a labeling problem. The inference thus assigns each pixel to one of the segments (superpixels), and each segment to one of the candidate moving planes. We split the optimization into two steps. First, we find the best moving plane for each segment; reasoning on this coarser level captures long-range interactions and significantly simplifies and stabilizes the inference. Second, we go back to the pixel level and reassign pixels to segments; this step cleans up inaccuracies of the segmentation, whose initial boundaries were generated without taking the previously unknown surface or motion discontinuities into account.

View-Consistent Multi-frame Scene Flow
Our second contribution is to exploit this piecewise rigid scene model to overcome two limitations of existing scene flow techniques. We begin by observing that (i) there is no conceptual reason for a privileged reference view (e.g., Basha et al. 2010;Rabe et al. 2010;Valgaerts et al. 2010;Wedel et al. 2008), as systematic challenges in imaging (specular reflections, occlusions, noise, lack of contrast, etc.) affect all frames, but not necessarily equally. Thus parameterizing the model w.r.t. a single viewpoint may in fact ignore important evidence present in other views (c.f. Fig. 2); (ii) data usually comes in the form of a stereo video sequence, and it appears wasteful not to exploit longer time intervals, especially in light of the first observation.
We go on to show that our piecewise planar and rigid scene model can be extended to simultaneously estimate geometry and 3D motion over longer time intervals, and to ensure that the estimate is consistent across all views within the considered time window. To that end we simultaneously parameterize the scene flow w.r.t. all views. While it may not be surprising that considering longer sequences may help motion estimation, at least in classical 2D optical flow estimation multi-frame extensions have largely not had the desired effect; two-frame methods are still the state of the art (see Baker et al. 2011;Geiger et al. 2012). We argue that longterm constraints may be more helpful in scene flow, since the representation resides in 3D space, rather than in a 2D projection. Constraints caused by physical properties, such as inertia, remain valid in the long term, and can be exploited more directly.
To make the estimate consistent across all views from a longer sequence, we constrain the segmentation to remain stable over time, enforce coherence of the representation between different viewpoints, and integrate a dynamic model that favors constant velocity of the individual planes. We empirically found this assumption to be valid as long as segments and temporal windows do not get too large.

Contributions
The main features of our proposed approach are: (i) A novel scene flow model that represents the scene with piecewise planar, rigidly moving regions in 3D space, featuring regularization between these regions and explicit occlusion reasoning; (ii) a view-consistent model extension that leads to improved results in challenging scenarios, by simultaneously representing 3D shape and motion w.r.t. every image in a time interval, while demanding consistency of the representations; (iii) a multi-frame extension that yields a temporally consistent piecewise-planar segmentation of the scene and favors constant 3D velocity over time; and (iv) a clean energybased formulation capturing all these aspects, as well as a suitable discrete inference scheme. The formulation can-at least conceptually-handle any number of viewpoints and time steps.
We demonstrate the advantages of our model using a range of qualitative and quantitative experiments. On particularly hard qualitative examples, our model turns out to be remarkably resistant to missing evidence, outliers, and occlusions. As a quantitative testbed we evaluate our method on the challenging KITTI dataset of real street scenes, using both stereo and flow benchmarks. In both benchmarks we achieve leading performance, even beating methods that are designed for the specific situation in the benchmark. At the time of writing (August 2014) our full (view-consistent multi-frame) model is the top performing method for both optical flow and stereo, when evaluated on full images including occlusion areas.
The present paper is based on two conference publications (Vogel et al. 2013b(Vogel et al. , 2014. We here describe the approach in greater detail, including the model itself, the inference scheme, proposal generation, and technical issues of occlusion reasoning. Moreover, we present a deeper analysis and more detailed comparison between the conventional parameterization and the view-consistent model, an experimental investigation of different optimization strategies, and study the influence of parameters on the quantitative results. Vedula et al. (1999) first defined scene flow as the collective estimation of dense 3D geometry and 3D motion from image data. Their approach operates in two steps. After computing independent 2D optical flow fields for all views of the scene, the final 3D flow field is fit to the 2D flows, thus neglecting the image data in this step. Similarly, Wedel et al. (2008) and Rabe et al. (2010) proceed sequentially on the data of a calibrated stereo camera system. Starting from a precomputed disparity map, optical flow for a reference frame and disparity difference for the other view are estimated. Possibly the first to calculate geometry and flow jointly in a two-view setup were Huguet and Devernay (2007), addressing the problem in a variational formulation. The problem was generalized by Valgaerts et al. (2010) to work with an unknown relative pose between the cameras, solely assuming knowledge of the camera intrinsics. They alternate scene flow calculation with estimating the relative camera pose. Operating entirely with 2D entities, these approaches partially neglect the 3D origin of the data. In particular, the proposed 2D regularizer encourages smooth projections, but not necessarily smooth 3D scene flow.

Related Work
In contrast, Basha et al. (2010) choose a 3D parameterization by depth and a 3D motion vector w.r.t. a reference view and estimate all parameters jointly, extending the popular optical flow method of Brox et al. (2004) to scene flow. Arguing that a total variation prior on the 3D motion field is biased for realistic baselines, Vogel et al. (2011) propose a regularizer that encourages locally rigid motion. Our model also employs a local rigidity assumption, but here we explicitly identify regions with a consistent motion pattern, into which the image is segmented.
The history of local rigidity priors dates back at least to Adiv (1985), who employed this assumption for sparse motion estimation. The idea was later extended to sparse scene flow by Carceroni and Kutulakos (2002). In a similar manner, Devernay et al. (2006) extend the Lucas-Kanade technique (1981) to multi-camera scene flow and track planar, rigidly moving regions in 3D over several frames. While the scene representation of Carceroni and Kutulakos (2002), Devernay et al. (2006) is similar to ours, there the regions move independently without interaction imposed by a global objective. Furukawa and Ponce (2008) go one step further and use the locally tracked rigid patch motion as input for a global optimization step, where the connectivity is defined by an explicit surface model, thus limiting admissible scenes to a fixed topology. 3D rigid body motions are further exploited in the context of scene flow estimation from RGB-D data by Hornacek et al. (2014). They do not need to assume local surface planarity, but exploit the additional information from the depth sensor and use a local rigidity prior to overcome large displacements. For computing optical flow, Nir et al. (2008) over-parameterize the 2D flow field and explicitly search for rigid motion parameters, while encouraging their smoothness.
Most previous dense 3D scene flow methods have in common that they penalize deviations from spatial smoothness in a robust manner. Explicit modeling of discontinuities by means of segmentation or layer-based formulations has a long history in the context of stereo (Tao and Sawhney 2000) and optical flow (Wang and Adelson 1994). These ideas recently gained renewed attention, however modern methods do not hold the segmentation fixed, but rather infer or refine it together with the scene parameters. Bleyer et al. (2010Bleyer et al. ( , 2011b segment the scene into planar superpixels and estimate disparity by parameterizing their geometry. Additionally penalizing deviations from an initial solution, segmentbased stereo is also promoted by Yamaguchi et al. (2012). More recently, this method was extended to epipolar flow (Yamaguchi et al. 2013) and epipolar scene flow (Yamaguchi et al. 2014), both assuming that the flow fulfills epipolar geometry constraints, i.e. is the result of pure camera egomotion. General 2D optical flow is computed by Unger et al. (2012), who parameterize the motion of each segment with 2D affine transformations, and also allow for occlusion handling. Aside from estimating 2D and not 3D motion, the method differs in the sense that no inter-patch regularization is performed, such that motion fields of adjacent segments are estimated completely independently of one another. Murray and Buxton (1987) were among the first to perform motion estimation over multiple frames. The admissible 2D optical flow fields are, however, limited to only small displacements. Black and Anandan (1991) instead encourage the similarity between the current and the past flow estimates, extrapolating motion fields from previous frames. While allowing for larger displacements, information is only processed in a feed-forward fashion, in particular the present cannot influence the past. Much later, assuming a constant 2D motion field, Werlberger et al. (2009) jointly reason over three consecutive frames. By considering constant 3D scene flow over time, we are able to address more general scenes. This constant velocity constraint is relaxed by Volz et al. (2011), who encourage first and second order smoothness of the motion field as soft constraints. The motion is parameterized w.r.t. a single reference frame, thus reasoning about occlusion regions or outliers appears hard to achieve. Irani (2002) operates on much longer time intervals and enforces the estimated 2D motion trajectories to lie in a (rigid) subspace. Similarly, Garg et al (2013) require the 2D motions to lie in a low-rank trajectory space, but instead can use the prior as a soft constraint. Sun et al. (2010Sun et al. ( , 2013 argue that the scene structure is more likely to persist over time than any motion pattern, hence avoid temporal smoothing at all, and instead jointly estimate the flow together with a segmentation into a small number of layers while requiring the pixel-to-layer membership to be constant. With the primary goal of high-level motion segmentation, Schoenemann and Cremers (2008) operate in a similar way: A video is segmented into several motion layers with long-term temporal consistency. Optionally, a 2D parametric motion for each layer is estimated as well. Our view-consistent formulation makes a related assumption, since we group pixels into planar and rigidly moving segments, while enforcing consistency of the segmentation over multiple frames. In contrast to motion layers, this much more fine-grained representation with hundreds of small segments enables us to address a wider range of scenes.
An explicit representation of 3D motion and shape allows scene flow methods to exploit temporal consistency over longer time intervals in a more straightforward manner, since smoothness constraints are better supported in the 3D scene than in its 2D projection. Rabe et al. (2010) take advantage of this fact and propagate geometry and 3D motion across frames with the help of a Kalman filter. At each pixel the measurement vector for the filter is composed of scene flow vectors from the current and the previous frame, which are estimated with the method of Wedel et al. (2008). Compared to its input, the filtered output contains significantly fewer outliers. Hung et al. (2013) concatenate frame-to-frame stereo and flow to longer motion trajectories, which are, after passing several plausibility tests, included into the final optimization as soft constraints, similar to including feature matches in two-frame optical flow (Brox and Malik 2011). The method advocates to propagate information through the whole sequence and, therefore, cannot output the scene flow without significant temporal delay, as is needed for several application scenarios. In their multi-camera setup Park et al. (2012) also operate sequentially. Scene flow is first estimated frame-by-frame and then smoothed over time by tensor voting. Courchay et al. (2009) go further and represent the scene with an explicit deformable 3D mesh template, which is fitted to the video data from multiple cameras over 10-60 frames. The method is theoretically elegant, but computationally expensive. Both approaches target motion capture in controlled settings. Schematic sketch of our scene representation: the scene is modeled as a collection of rigidly moving planar segments, here three different segments cover the side of a car Techniques that avoid an arbitrary reference frame and instead treat all views equally are predominantly used in stereo. The simplest form is the widespread left-right consistency check (e.g., Hirschmüller 2008) during postprocessing. More recently, consistency tests were directly incorporated in the objective (Bleyer et al. 2011b). In our view-consistent formulation, we extend the latter strategy to scene flow, considering consistency across all images within a temporal window.
Introduced by Lempitsky et al. (2008) for the case of 2D optical flow, fusion of different proposal sets has become a standard optimization technique. Here we employ such a scheme for the estimation of 3D scene flow.

Piecewise Rigid Model for 3D Scene Flow
To estimate 3D scene flow, we describe the dynamic scene as a collection of piecewise planar regions moving rigidly over time (Fig. 3). The motion and geometry of each region is governed by nine degrees of freedom, which we determine by minimizing a single objective function. During optimization, pixels are grouped into superpixels, and a suitable 3D plane and rigid motion is selected for each of these segments. Note that the implicitly obtained spatial segmentation does not aim to decompose the scene into semantic objects. Rather, an over-segmentation is desired to capture geometry and motion discontinuities, and to allow for the accurate recovery of non-planar and articulated objects. We begin our detailed description with the basic parameterization of the scene w.r.t. a single reference view and consider two time steps (Sect. 4). Later, we show how to achieve view-consistent scene flow over multiple frames (Sect. 5).

Preliminaries and Notation
We formalize our model for the classical case of images obtained by a calibrated stereo rig at two subsequent time steps. However, we note that an extension to a larger number of simultaneous views is straightforward. To distinguish between the different views, we use subscripts l,r to identify the left and right camera 1 , and superscripts t ∈ T = {−1, 0, 1, . . .} to indicate the acquisition time. We let the left camera at time t = 0 define a common coordinate system and refer to it as the canonical view; this simplifies the notation. This canonical view, on one hand, serves as an evaluation basis, and on the other hand, coincides with the sole reference view, in case view consistency is not employed. These choices lead to the projection matrices (K|0) for the left and (M|m) for the right camera. For simplicity, we assume w.o.l.g. the calibration matrix K to be identical for both cameras.
In our model a 3D moving plane π ≡ π(R, t, n) is governed by nine parameters, composed of a rotation matrix R, a translation vector t, and a scaled normal n, each with three degrees of freedom. Note that we do not explicitly distinguish between camera ego-motion and independent object motion, but describe the full motion in one forward time step. Later, when we extend our model to reason over multiple frames, we show how to cope with high frequent ego-motion of the camera (Sect. 5.3). In case of a single reference view, we assume all planes to be visible in the canonical view. Thus, as the canonical camera center and coordinate origin coincide, no visible plane can pass the origin. We can then define the scaled normal n ≡ n 0 l via the plane equation x T n = 1, which holds for all 3D points x on the plane. Throughout the paper it is convenient to transfer the moving plane also into other views and their respective camera coordinate systems. The plane equation still has to be valid after any rigid transformation, hence the scaled normal transforms in correspondence with 3D points x on the plane n 0 l . For example, for the left camera at time step t = 1 the normal n 1 l in the respective coordinate system is found as: We can, furthermore, determine the depth d observed at a pixel p of the image I t v , acquired at time t w.r.t. the center of camera v through the inverse scalar product: This information is later needed to test for occlusions (Sect. 4.7), as well as to check the geometric consistency (Sect. 5.2) of the representation. Utilizing a planar scene representation allows to map pixel locations conveniently to their corresponding positions from one view to another. In particular, a moving plane π induces homographies from the canonical view I 0 l to the other views given by: Single reference-view model. Data terms (black arrows) and homographies (green). (center top) Pixels without correspondences when using a reference view (blue). Areas that are hard to match may be without correspondence in other views; view-consistency avoids this.
(center bottom) Enlarged areas containing pixels without correspon-dence in the right camera. (right) Data terms in the three-frame viewconsistent model: Consistency is encouraged for spatial and direct temporal neighbors (black arrows). All pixels of all views are considered in the energy (Color figure online) Concatenating the transformations above, mappings between arbitrary view pairs can be obtained. This is achieved by first transforming back to the canonical view and then into the desired frame, e.g. 1 l H 1 r (π ) = 0 l H 1 r (π ) · 0 l H 1 l (π ) −1 . For notational convenience we define 0 l H 0 l (π ) to be the identity, which maps the canonical frame onto itself.

Single Reference View
For now our aim is to determine depth and 3D motion for every pixel of the designated reference view I 0 l . To that end, we formally define an energy function E(P, S ) over two mappings: a mapping S : I 0 l → S that assigns each pixel of the reference view p ∈ I 0 l to a segment s ∈ S; and a mapping P : S → Π to select a 3D moving plane π ∈ Π from a predefined set of proposals Π for each of the segments s ∈ S. To find these mappings, we aim to minimize a single energy consisting of four terms: The data term E D measures photo-consistency across the four views of our basic model. The regularization term E R encourages (piecewise) smoothness of geometry and motion at segment boundaries. The boundary term E S evaluates the quality of the spatial segmentation, encouraging a compact and edge-preserving over-segmentation of the reference image. The visibility term E V deals with missing correspondences from areas that move out of the viewing frustum (out of bounds). The energy is then minimized in two steps: Starting with a fixed initial over-segmentation S , we establish the link between segments and 3D moving planes, labeling each segment s ∈ S to belong to one of the moving planes π ∈ Π .
Subsequently, we operate with a fixed mapping P and reassign each pixel p ∈ I 0 l to one of the segments and, thereby, associated 3D moving planes. Note that the basic form of the energy remains, even when considering view consistency in Sect. 5.

Data Term
In its traditional role, the data term embodies the assumption that corresponding points in different views have similar appearance. Here, we achieve this through four constraints per pixel, two for the stereo pairs at time steps 0 and 1, and two optical flow constraints, one for each camera (see Fig. 4, left). Denoting the 3D moving plane at a pixel p as π p = P(S (p)) and utilizing the homographies defined in Eq.
(3), we can define stereo data terms between the cameras as and optical flow data terms across time as The corresponding pixel location in a different view is usually a sub-pixel coordinate, hence image intensities are obtained via bilinear interpolation. For increased robustness in general conditions (e.g., outdoors), we utilize the census transform ρ = ρ C (Zabih and Woodfill 1994) over a 7×7 neighborhood to assess photo-consistency. We scale the Hamming distances by 1/30. Although we are not limited to this specific choice, all examples and results are generated with the census data cost, unless explicitly stated otherwise. The complete data term is given as the sum of the four terms in Eqs. (5) and (6):

Spatial Regularization of Geometry and Motion
In our scene representation, geometry and motion parameters are shared among all pixels within a segment, hence explicit regularization within a segment is not needed. We can thus focus on the segment boundaries. One important benefit over pixelwise regularizers (Basha et al. 2010;Vogel et al. 2011) is that our boundary regularizer does not have to be overly strong to significantly stabilize scene flow estimation. Moreover, it rather naturally deals with discontinuities, a key problem area of previous scene flow techniques (e.g., Vogel et al. 2011). Since boundaries regularly occur within a single object due to the over-segmentation, our regularization term assumes piecewise smooth 3D geometry and motion. We model shape and motion priors independently (given a segmentation), and define our regularizer E R (P, S ) as the sum of a geometric term E G R (P, S ) and a term E M R (P, S ) to measure the regularity of the motion field.
For now assume that two adjacent pixels p and q are assigned to the moving planes π p = P(S (p)) and π q = P(S (q)). We treat pixels as square patches, residing in the image plane in which they share a boundary. To measure the contribution to the regularization term along their common edge, we consider the (2D) endpoints of the edge between the pixels, c 1 and c 2 . We begin with the geometry term. By projecting the endpoints onto each of the two 3D planes, we obtain the 3D endpoints c 1 p , c 1 q , c 2 p and c 2 q (see Fig. 5). In case p and q lie on different planes, the pixel boundaries will, in general, not coincide in 3D space. We thus compute distance vectors between the 3D endpoints: d 1 = c 1 p −c 1 q and d 2 = c 2 p −c 2 q . Our goal is to penalize the distances along the shared edge. One could compute 3D distances for any point on the boundary in a similar fashion. However, since we are using planes as primitives, the 3D distance along the shared boundary in the image plane is simply a convex combination of the endpoint distances To consider surface curvature we exploit this observation further and shift the 3D endpoints along their respective plane normals n p and n q before measuring distances. We denote the difference of the normals as d n = n p − n q , and define a distance function (see Fig. 5) The weight γ balances boundary distance vs. curvature. The geometry regularizer is then found by integration. Adding a factor 3/2 for mathematical convenience, we integrate the squared distance function ( f γ ) 2 along the boundary (w.r.t. α) and along the normal direction (w.r.t. β) in closed form: The summation considers pixels to be adjacent in an (8-) neighborhood N , where the length of the common edge is taken into account through the weight w p,q , which can optionally also incorporate edge information (Eq. 13) of the image data. ψ(·) denotes a (robust) penalty function. The intuition behind this form of regularization is shown in Fig.  6. Setting γ := 1 our energy favors planar configurations over bending. By integrating squared distances of 3D vectors, the induced penalty increases smoothly as the situation degenerates. This soft transition helps in the realistic case of a limited proposal set of 3D moving planes Π . The motion regularizer is obtained by first applying the rigid transformation to the moving planes. We then similarly integrate the endpoint distances , as well as the differences between the (rotated) normals d M n = (R p n p −n p )−(R q n q −n q ), leading to In both cases, robustness to discontinuities is achieved by employing truncated penalties ψ(y) = min( √ y, η) (with The proposed regularization scheme is not limited to 3D. For instance, the endpoint distances can be replaced by 2D entities such as the disparity difference, the difference between optical flow vectors, and the change of disparity over time. This is a popular choice for scene flow (Huguet and Devernay 2007;Valgaerts et al. 2010) and (optionally) used here. Note, however, that falling back to 2D regularization can only yield a (close) approximation of the true 3D penalties, as projective foreshortening is not considered. When reasoning at the segment level, we can approximate the regularizers by computing the penalties directly from the endpoints of the segments. By precomputing the length of the boundary (summing the edge weights along the shared border), the evaluation of the regularizer becomes much more efficient. Because superpixels in our framework are near-convex, the overall accuracy of the algorithm is barely affected (Fig. 12, bottom).

Spatial Regularization of the Segmentation
Data term and spatial regularization operate not only on the segment-to-plane mapping P, but also depend on the assignment of pixels to segments S , which in our experience can lead to rather fragmented over-segmentations. To counteract this behavior and to incorporate prior knowledge that segments should be spatially coherent (but not necessarily connected) and preserve image edges, we add an additional regularization term, assessing the quality of the underlying segmentation: The first term resembles a contrast sensitive pairwise Potts model, again evaluated over the (8-)neighborhood N of a pixel. Here, the weight u p,q allows to take into account the image structure and the length of the edge between the pixels.
To define these weights we follow Werlberger et al. (2009) and apply the anisotropic diffusion tensor: The image gradient direction g = ∇ I /|∇ I | is determined via bicubic interpolation in the middle between p and q. Assuming I ∈ [0, 1], we set α = 5 and define the weight The second term links a segment to its seed point e ∈ E (s i ) in order to limit its maximum extent to a size smaller than (2N S −1) × (2N S −1) pixels. This strategy prevents the scene flow from becoming overly simplified, but more importantly also restricts the number of candidate segments for a pixel, thus reducing the time needed for optimizing the energy w.r.t. S . We found that a good strategy to define the seed points is to reuse the center of the original superpixels. Here we set N S = 25, but values between 10 and 30 pixels perform alike (see Sect. 6.1). Note that a similar strategy was proposed by Veksler et al. (2010) to compute an over-segmentation of a single image.

Visibility Term
So far we have not considered the problem of visibility, thus areas that fall out of bounds, i.e. are not visible in some of the images. Especially when dealing with large motions, these regions can cover a significant portion of the image. Configurations with no valid correspondence are not considered by the data term Eq. (7) and contribute 0 cost to the energy. Allowing for arbitrary moving planes in our model could, therefore, easily lead to a solution, where a significant portion of pixels is erroneously assigned a motion that moves them out of bounds. On the other hand, penalizing these kinds of configurations strongly could harm the results. Consider, for instance, a saturated region that actually moves out of bounds. A solution in which this region is mapped to a similarly saturated, but unrelated area in the other images lowers the data cost and would therefore be preferred. Since this regularly happens in challenging scenes, we address the problem as follows: Let us assume that we have access to an "oracle" V , which can predict whether a pixel will stay in the image or move out of bounds. Further, let V 1 l , V 0 r and V 1 r be the predicted binary visibility masks for all but the reference image (out-of-bounds: 0, pixel visible: 1), and let Γ j i [·] be a binary function that determines whether its argument lies within the boundaries of image I j i . We encourage the scene flow estimate to stay near that prediction, by defining a visibility term that forms part of the energy in Eq. (4): with θ oob := 0.5 max(ρ C ) set to half the maximal data cost.
In practice, we found that common stereo and variational flow methods can predict pixels moving out-of-bounds sufficiently reliably, and consequently reuse the output of the 2D stereo and optical flow algorithms from the proposal generation step (Sect. 4.6). An alternative visibility predictor could be the ego-motion of the stereo camera system.

Approximate Inference
Inference in our piecewise rigid model entails estimating the continuous 9-dimensional variables describing geometry and motion of each rigidly moving plane, and the discrete assignments of pixels to segments. By restricting the optimization to a finite set of proposal moving planes, the whole problem is transferred into a labeling problem in a discrete CRF. The benefit is two-fold: First, we can leverage robust discrete optimization techniques that cope well with complex energies, particularly here the fusion move framework of Lempitsky et al. (2008Lempitsky et al. ( , 2010. Second, occlusions are discrete events and can thus naturally be integrated in the objective (Sect. 4.7).
To bootstrap the process, we start with a fixed segmentation S and optimize the energy w.r.t. P, selecting a suitable moving plane for each segment from the proposal set. To obtain the initial superpixel segmentation, we simply minimize the segmentation energy E S alone, and subsequently split strongly non-convex segments. We alternatively tested a segmentation into regular grid cells. Interestingly, this simplistic initialization works almost as well (see Sec. 6.1). In either case, the seed points E are selected as the central pixels of the initial segments. When solving for P we need to consider the data, visibility, and regularization terms only. After we found a solution for P, the mapping is kept fixed and the energy is optimized w.r.t. S , reassigning the pixels to segments and, thereby, implicitly to moving planes (c.f. Fig.  7). Because the segment size is restricted to a maximal side length of N S through Eq. (11), the pseudo-Boolean function (Lempitsky et al. 2008) representing the local energy has at most (2N S − 1) 2 variables, which makes the optimization efficient. Distant segments can even be expanded in parallel. We use a similar strategy when optimizing for P: We locally restrict the validity of each moving plane proposal to cover only a certain expansion region in the scene. In practice, we found that a proposal should at least cover 100 of its closest neighboring segments and set the region size accordingly. This allows to test several proposals in parallel. Note that we can iterate the alternating optimization further, but observe no practical benefit.
General pseudo-Boolean energies are usually optimized with QPBO (Rother et al. 2007), which can also handle non-submodular energies, but does not guarantee a complete labeling when supermodular edges are present. One disad- vantage compared to standard graph cuts, however, is that the instantiated graph has twice the number of nodes than the (pseudo-Boolean) energy has variables. For our (nonsubmodular) energy we can alternatively use the local submodular approximation proposed by Gorelick et al. (2014). This has the advantage that conventional graph cuts can be used, which is usually faster than QPBO. We particularly use LSA-AUX, which for each α-expansion replaces pairwise supermodular potentials by a local plane approximation that bounds the true energy from above. This idea is very simple to implement and delivers a significantly better approximation than a simple truncation of non-submodular terms. We experimentally compare both approaches in Sect. 6.

Proposal Generation
To perform inference over the 3D geometry and motion of the segments, we require an (initial) set of proposal planes together with their rigid motion. We can create these from either the output of other scene flow algorithms, or from a combination of stereo and optical flow methods. To convert the pixelwise correspondence information to our representation, we separately fit the parameters of a 3D plane and its rigid motion to each superpixel of the initial segmentation. Fitting is complicated by inaccuracies or noise in the stereo and flow estimates, and by superpixels that are not wellaligned with depth and motion discontinuities. We thus opt for a robust procedure and minimize the transfer error integrated into a robust cost function, particularly the Lorentzian where the dependence of the homographies on the parameters (the normal n and rigid motion (R, t)) is made explicit, and P denotes the conventional projection operator. Each pixel p of segment s ∈ S is matched to its 2D correspondence p , determined by the proposal algorithm. We parameterize the rotation in Eq. (15b) by its exponential map to define the derivatives, and use the previously determined scaled normal to derive the homography (c.f. Eq. 3). After bootstrapping this non-convex optimization problem with the solution of an efficient algebraic minimization, two iterations of the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (LM-BFGS) suffice for our purposes. The quality of the fit is analyzed in Sect. 6.1. Note that since we are treating the estimation of 3D planes and rigid motions independently, the problem of fitting a rigid motion is similar to the computation of the ego-motion of a stereo camera system, such that algorithms for this problem could also be applied (e.g., Badino and Kanade 2011). Here, however, we only consider the motion of an individual segment and not of the complete stereo rig.

Additional Proposals
The strategy of selecting parts of the solution from a set of proposals allows to include additional information in an unbiased way, without the need for altering the energy formulation. We exploit this property by including the estimated ego-motion of the stereo system as an additional proposal. The ego-motion is found by reusing our fitting procedure from above (Eq.15b) on the segment centers and their correspondences, given by the output of our per-segment solution (obtained after optimizing w.r.t. the mapping P). We then can fuse the current solution with the estimated ego-motion. Additionally, we use a local replacement strategy, motivated by proposal instances for which depth and motion errors are not correlated. We posit that these largely result from the 2D proposal algorithms, which estimate motion and depth independently. We address this with additional proposals: We randomly select proposals and propagate a part of their state to other segments in a 2-neighborhood. This can either be the geometry or the rigid motion, which simply replaces the corresponding state of the neighbors. This procedure is iterated several (≈ 4000) times, leading to a combination of geometry and motion of neighboring segments. The strategy has similarities to the PatchMatch idea (Barnes et al. 2009;Bleyer et al. 2011a), as information is shared and distributed among neighboring segments.

Occlusion Handling
The data term as defined in Eq. (7) assumes that every pixel is visible; no occlusion reasoning takes place. Given our 3D scene representation, we can explicitly reason about occlusion, however. Compared to stereo, the handling of occlusions for scene flow has the advantage of having two (or more, c.f. Sect. 5.3) additional views of the scene. Accordingly, pixels that are occluded in a subset of views may still be visible in one of the view pairs.
To leverage this, occlusion handling is applied to all pairs of views for which a data term is formulated. We formalize this only for a single view pair, because the mathematical formulation is equivalent for each summand of the data term. We make use of the well-known principle (dating back at least to Kolmogorov and Zabih 2001) of applying a constant penalty θ occ , if a pixel is occluded in at least one of the two views of the pair. The penalty is chosen as θ occ := θ oob (Eq. 14). Although occlusions and out-of-bound areas have different causes, the impact on the correspondence is the same: The pixel correspondence cannot be judged by the appearance, and hence the data costs of Eqs. (5) or (6) are invalid. Note that pixels that are assigned to the same moving plane in our scene representation naturally cannot occlude each other.
To simplify the exposition, we will not present our occlusion model in its most general form, but rather one instantiation within a single fusion/expansion move of the approximate inference procedure from Sect. 4.5. Hence, we are dealing with a binary optimization problem. Assuming a fixed segment-to-plane mapping P, we will first investigate the update of the per-pixel segmentation S . Differences in the update procedure when solving for P will be discussed later. W.l.o.g. let the binary state x p = 0 denote that the pixel p retains its current segment assignment and, accordingly, x p = 1 indicate a switch to the trial segment α. We begin by expressing the data term from Sect. 4.1 in the form of a pseudo-Boolean function: where the vector x denotes all binary pixel assignments. The data penalty equals u 0 p if p remains in its current segment, and u 1 p if p is assigned to segment α. Whether a pixel p is occluded or not depends both on its binary segment assignment x p , and on whether there is any other pixel q (or possibly multiple pixels) that occludes p. Determining whether q triggers an occlusion in turn depends on its segment assignment x q . With O i p we identify the set of all pixel-assignment pairs (q, j), for which pixel q occludes pixel p if x p = i and x q = j. Now we can replace Eq. (16) with our occlusion-aware data term Here, we denote the difference of the (unoccluded) data penalty and the occlusion cost θ occ byû i p = u i p − θ occ , and with [·] the Iverson bracket. To facilitate a better understanding of the equation above, let us focus on a single pixel p. The respective summand becomesû 0 p , if both x p = 0 and the product equals to 1. The latter happens if no occlusion occurs, that is either all possibly occluding pixels q are assigned to a segment x q in which they do not lead to an occlusion, or the set O 0 p is empty, meaning that no pixel exists that could possibly occlude p. The data cost overall thus equals θ occ in case of an occlusion, and the standard data penalty u 0 p or u 1 p , otherwise.
Recall that we establish the segment-to-plane mapping P by reasoning over entire segments (see Sect. 4.5). Therefore, we directly extend the occlusion model to the segment level. The potentials of the respective pseudo-Boolean energies in Eqs. (16) and (17) look the same, but with variables representing segments instead of pixels. We consider a segment to be (significantly) occluded if its central pixel is occluded. Because our segments are nearly convex and similarly sized, at least one quarter of a segment has to be occluded by a different region to render the central pixel occluded. To check for occlusions we employ conventional z-buffering, utilizing Eq.
(2) to compute the depth at each pixel.
Depending on the number of possibly occluding pixels, the (per-pixel) penalty may be a higher-order pseudo-Boolean function (|O i p | > 1). Optimization techniques based on graph cuts, including QPBO, can only be applied to quadratic polynomials, which is why all higher-order terms have to be reduced to pairwise ones. Over the years several reduction techniques have been proposed (e.g., Ali et al. 2008;Ishikawa 2009;Rother et al. 2009). Each applies a certain transformation that approaches the reduction independently for each higher order summand of the energy. We refrain from presenting these exhaustive details at this point and instead refer to the Appendix.

View-Consistent Model
Equipped with our basic representation and model from Sect. 4, we now generalize it to estimate scene flow for all views and time instants simultaneously. A major benefit compared to using a single reference view is that the entire image evidence of all views has to be explained. This results in a more robust estimate, which is less prone to common imaging artifacts. Occlusion handling can be improved as well.
Another benefit is that significantly fewer non-submodular edges occur in the pseudo-Boolean function constructed during the optimization process. We defer details to the experimental evaluation. To enable a view-consistent model, we first need to extend the notion of the segmentation to all views, with the challenge of generating a consistent segmentation of the scene across views and time. An obvious downside of a view-consistent approach is a significantly enlarged set of unknowns, since the assignments from segments to moving planes and pixels to segments have to be computed for each involved view.
After establishing the concept of view-consistency, we aim to estimate scene flow for more than two time steps. We thus extend the idea of rigidity by assuming constant translational and rotational velocity of the 3D moving planes. Note that due to the short time intervals considered, this assumption is valid for many application scenarios. In the following, we start our description for only two time steps, and later explain how to extend our model to multiple frames in time.

Model Overview
As before we strive to determine depth and a 3D motion vector for every pixel, but this time for all the views examined. We thus keep track of a superpixel segmentation in every view, denoted as S t v , the set of segments in the image I t v in view v at time step t. The energy definition (Eq. 4) is extended to be a function of two sets of mappings. The first set of mappings S = {S t v : t, v} with S t v : I t v → S t v assigns each pixel of frame I t v to a segment of S t v . With the second set P = {P t v : t, v}, a rigidly moving plane is selected for each segment in each view: P t v : S t v → Π . Recall that Π denotes a candidate set of possible 3D moving planes. The formal definition of the energy takes the same basic form as Eq. (4): However, in our view-consistent setting the definition of the data term E VC D is significantly different, as not only photoconsistency w.r.t. a reference view is considered, but also the consistency of the underlying geometric configuration and segmentation of the scene. The regularization term E VC R and the segmentation term E VC S are straightforward extensions of their single view counterpart from Sect. 4. In our experience, by explaining the available evidence from all images, this view-consistent formulation does not require an explicit visibility term (Sect. 4.4).
The spatial smoothness assumption is extended to all views, simply summing the contributions of motion (Eq. 9) and geometry (Eq. 10) terms per frame: (a) Implausible case (b) Occlusion (c) Normal case Fig. 8 Illustration of the per-pixel view-consistent data term (see text for more details) In a similar fashion we extend the regularization of the segmentation (Eq. 11) to all considered views: where N is again defined as the 8-neighborhood. Note that the second term is only applied to the canonical view, such that the maximal size of a segment is only restricted in the canonical frame. Also note that we treat the segmentations of the different frames independently. However we encourage the segmentation to be consistent across views (c.f. Fig. 11) such that the restriction on the maximal segment size is also propagated to all other images, which is further exploited in the inference procedure. Consistency between the superpixel segmentations is encouraged in the data term, described in the following.

View-Consistent Data Term
In our view-consistent model we explicitly store a description of the scene in terms of moving planes as observed in each of the views. To exploit the redundancy in this representation, we check the consistency of the scene flow estimate in each view with its direct neighbors in time, as well as with the other views at the same time instant (Fig. 4, right). We here slightly abuse the term consistency: In its classical sense we check for photo-consistency of the images at corresponding pixel locations, determined through their assigned moving planes π ≡ π(R, t, n). However, in our novel scene representation we can also check the geometric configuration for plausibility, test for occlusions, and verify the consistency of the segmentation. This is done by comparing depth values induced by the respective moving plane (Eq. 2), based on the underlying image segmentation (see Fig. 8). Now let us assume we want to check the consistency between a pixel location p ≡ p t v in view v at time t and its corresponding pixel locationptv in viewv at timet. We denote the 3D moving plane of the pixel p by π p = P t v S t v (p) . The related homography allows to determine the corresponding pixel location in the other view,ptv = t v Htv(π p )p, and the depth function d(p, n t v (π )) from Eq. (2) enables evaluating the geometric configuration at that pixel. The data term for a single pixel p in view v at time-step t assigned to the moving plane π p with the adjacent viewv at time-stept is then given by The first two cases are depicted in Fig. 8a and b. Here the relative difference in depth is used to distinguish between implausible and occlusion cases. This distinction is similar to comparing disparity values for the stereo case (Bleyer et al. 2011b). The first case (Fig. 8a) describes a geometrically implausible situation, in which the depth of the moving plane π p , observed from the 2nd camera in pixelptv, is smaller than the depth assigned to the pixel in that 2nd view. In this situation the 3D point on the plane πptˆv would be occluded by the moving plane π p and not be visible by the 2nd camera. We apply a fixed penalty θ imp in this case. In the second case (see Fig. 8b), the depth of the moving plane π p is greater than that of the corresponding plane πptˆv and, therefore, the pixel p is occluded in the second view. Again, a fixed penalty θ occ is applied. This concept of occlusion reasoning via cross checking the current solution among views is only possible by simultaneously estimating a solution for all views and rather different from the occlusion detection technique presented in Sect. 4.7 for a single reference view.
An additional benefit is that the resulting energy function induces only pairwise edges. In Eq.(17), in contrast, multiple possible labels for the corresponding location in the other view may exist, which in turn leads to higher-order terms in the respective pseudo-Boolean energy. In our experience the view-consistent formulation leads to fewer supermodular edges in the optimization (see Sect. 6.2), resulting in a simpler optimization problem.
Since the set of proposal planes is limited due to practical considerations, we cannot assume that our representation always assigns a fully accurate depth for every pixel. Instead of strictly comparing relative depth values we, therefore, opt for a relaxed test by including the parameter, empirically set to := 0.015. This additionally alleviates aliasing artifacts introduced by the finite resolution of the pixel grid.
The third case penalizes pixels moving out of the viewing frustum (out of bounds) with a fixed penalty θ oob . By employing view consistency, the solution has to respect the information from all views of the scene. Hence the treatment of this event can be a lot simpler than in the case of a single reference frame, where an additional visibility term (Sect. 4.4) was included.
When pixels are in geometric correspondence we apply the usual census data penalty ρ = ρ C to measure photoconsistency (c.f. Sect. 4.1). In (Vogel et al. 2014) we originally proposed to additionally truncate the data term at half the maximal possible cost at a pixel (0.5 max(ρ C )). An investigation of this particular choice shows that the number of resulting non-submodular terms in the optimization is reduced (Sect. 5.4), however some of the information is lost, which can lead to a decreased accuracy. Consequently, we avoid the truncation here.
If the pixels are in geometric correspondence, but belong to different moving planes, we assert a moving plane violation and impose an additional penalty θ mvp . This leads to the desired view-consistent segmentation, as pixels are encouraged to pick the same 3D moving plane in neighboring views.
In practice, it appears prudent to penalize pixels without correspondence equally, thus we set both penalties for occlusions and pixels moving out of bounds to θ oob = θ occ = 0.5 max(ρ C ). Aliasing again prevents us from penalizing implausible configurations with an infinite penalty; instead we set θ imp := max(ρ C ), which also prevents deadlocks in the optimization. While this can lead to a few implausible assignments in the final estimate, the overall error is reduced. For the same reasons we allow for deviations from our consistency assumption for the segmentation and empirically set θ mvp := 5/16 θ oob .
All views are treated equally in our model, thus the perpixel contribution from Eq. (21) is summed over all pixels of all frames. Our data term consists of the summed data costs for all stereo pairs and frames that are direct neighbors in time (Fig. 4, right): In contrast to the reference-view formulation (c.f. Fig. 4, left), each view pair is considered twice by the data term, because every view holds its own scene flow representation. Figure 9 illustrates the view-consistent data term. The internal states assigned by the data term (cases of Eq. 21) to each view pair are shown for each individual pixel.

View-Consistent Multi-frame Extension
We now discuss the details of extending our viewconsistent model to more than just two frames. As mentioned, geometry, motion and segmentation regularizers can be extended to a larger number of frames in a rather straightforward fashion (Eqs. 19 and 20). The data term however needs special consideration, as we need to define homographies between the additional views and also transform the normals into the specific view coordinate system. Recall that we restrict ourselves to reason only over shorter time intervals and thus can assume the motion of a moving plane to be of constant velocity in both its rotational and translational component. Under this condition suitable homographies can be found by a concatenation of the homographies defined in Eq. (3). Similarly, view-normals for the different time steps are generated by a repeated application of Eq. (1), thus again assuming constant velocity. Note here, that the normals in the proposal set Π are always stored in the canonical coordinate system. (right) Camera images with induced 2D flow (black arrows). We compensate camera pitch by removing the ego-motion of the camera Such a model can tolerate small deviations from this constant 3D velocity assumption in the scene, but this is put to a test if the camera system itself is violating this assumption. Especially abrupt rotational changes in the viewing direction affect the whole image of the scene. The automotive application in our experiments is a good example for this. Scene flow estimation is challenged by a common high-frequent pitching motion of the stereo rig, often caused by an unsteady road surface and amplified by the suspension of the vehicle. In our model the motion is encoded relative to the respective camera coordinate system, such that even slight changes in the relative camera position can induce significant changes in the relative geometry and motion (Fig. 10). To address this problem, we introduce the following extension, in which we include ego-motion estimates for the different time steps.
First, we compute the relative ego-motion E t = [Q t |s t ] between all consecutive time steps t and t + 1. The computation of homographies between successive frames then proceeds by first applying the motion induced by the moving plane representation with the ego-motion part removed, and then the relative ego-motion from time step t to t + 1. Recall that the rotation R and the translation t of a moving plane are stored in the coordinate system of the canonical view, thus unaware of any ego-motion. Then we can remove the relative ego-motion of the canonical view E 0 by applying As an example, the homography between the frames t and t + 1 in the left view becomes Further note the use of the corrected view normal in Eq. (23), for which we can find a similar expression: Other homographies and view-normals can be corrected for ego-motion accordingly. The estimation of camera egomotion of a stereo camera system is a well-studied problem (e.g., Badino and Kanade 2011). Here we use the method proposed in Sect. 4.6.

Approximate Inference for View-Consistency
Our inference procedure closely follows the approach for a single reference view in Sect. 4.5. Again, we perform inference in a discrete CRF and optimize the energy in two steps, first solving for the mappings P, while keeping the segmentation fixed. Then we proceed the other way around, fixing the mappings from segments to moving planes and optimizing w.r.t. to the segmentation mappings S . The alternation can be iterated further, but again without practical benefits. Instead of an initial superpixel segmentation, we prefer to start from a regular checkerboard grid with an edge length of 16 pixels. Seed points e ∈ E (see Eq. 20) are simply the grid centers. This trivial "segmentation" is more efficient and also reduces aliasing artifacts, caused by a possibly uneven size of segments across views. The per-pixel refinement step (Fig. 11) will eventually deliver a consistent oversegmentation across views, adhering to depth and motion boundaries.
Because of the grid structure, segments can be treated as large pixels when solving for P. However, the use of an initially not view-consistent segmentation will lead to aliasing effects. We thus relax the consistency constraints and set := 0.1 and θ mvp := 3/16 θ oob in the first optimization round, to ensure that proposals are not discarded at an early stage. We generate the proposal set in the same manner as described in Sect. 4.6. We discovered that by first running a single segment-to-plane step of our reference-view above, and removing unused proposals, above, the proposal set is filtered without loosing important information, leading to a significantly reduced computation time. When optimizing over more than two frames, proposals are generated for all consecutive frame pairs. I.e., when using 3 frames we generate proposals for time steps t = −1 and 0, and additionally for t = 1 when using 4 frames. The additional proposals are discarded when they are found to be similar to already existing ones nearby. We consider proposals to be valid in a certain expansion region, centered at the seed point in the canonical frame. Empirically, we found that an expansion region size of 13 × 9 cells (208 × 144 pixels) yields a good compromise between accuracy and speed. During a fusion move, we thus only have to instantiate a local graph, which is determined by a projection of the expansion region into all other views.
The inference for the pixel-to-segment mappings S follows similar principles. Unused moving plane proposals are discarded. The size of the instantiated graph is restricted by the region constraint (Eq. 20), using an expansion region of 39 × 39 pixels (N S = 20), and determined by projection into the other views. We penalize inconsistencies more strictly here, since the decisions are made on a per-pixel basis, and use the default parameters for and θ mvp from Sect. 5.2. Figure 11 illustrates the computed mappings over the course of the optimization for one of the cameras. Consistent moving plane assignments at segment level are shown on the left, illustrating the distribution of P. The final, consistent superpixel segmentation S is depicted on the right.

Hierarchical Refinement
The grid-based segment structure, furthermore, allows for a very simple refinement procedure, which we found to work well in practice. Instead of directly redistributing pixels to segments by solving for S after all segments have been assigned a moving plane, we optionally refine the segmentation and solve for P again. In practice we halve the grid resolution in each image and start the inference from the previous assignment. We prune the initial proposal set by retaining only those moving planes that are in use. In our experience, this hierarchical approach allows to reduce aliasing problems due to the smaller segment size, but still considers a more global context during the optimisation stage. Because we again set the expansion region to 13 × 9 cells and the set of moving plane proposals is already reduced significantly, this step is very efficient. Note that after the refinement, we also reduce the expansion region (i.e. N S = 10) accordingly when re-assigning pixels to segments.

Experiments
We begin the experimental evaluation with our basic model based on a single reference view and later examine the viewconsistent approach. Quantitative experiments rely on the KITTI dataset (Geiger et al. 2012), which has emerged as a standard benchmark for optical flow and stereo algorithms, with over 50 submissions in both categories. Its images are acquired by a calibrated stereo rig, mounted on top of a car together with a laser scanner, which delivers the semi-dense ground truth. Targeting automotive applications, the scenes are challenging for mainly two reasons. First, the strong forward motion of the car leads to very large displacements in stereo (>150 pixels) and flow (>250 pixels). Consequently, there are also many pixels without direct correspondence in the other image. Second, the images are acquired outdoors under realistic lighting conditions and exhibit oversaturation, shadows and lens flare, but also translucent and specular glass and metal surfaces. The KITTI benchmark is the first large scale dataset that allows evaluating scene flow methods along with their 2D counterparts, stereo and optical flow. However, it often lacks ground truth for independently moving objects, which leads to a bias toward methods that focus on the dominant background. Nonetheless, we strongly believe that this dataset is better suited for the evaluation of scene flow methods than other existing, synthetic datasets used previously (e.g., Huguet and Devernay 2007;Vogel et al. 2011).
Our quantitative experiments mainly employ the KITTI training dataset, which is ideal for a detailed performance and parameter study due to its size of 194 images (1240 × 376 pixels) with public ground truth. For a comparison to the state of the art, we also submitted our results on the 195 images of the test portion of the KITTI dataset to the official KITTI benchmark (Sect. 6.5); there the ground truth is withheld. Because of inaccuracies in the laser measurements from the moving platform, the standard KITTI metric is to compute the number of outlier pixels that deviate more than a certain threshold from the ground truth. We report results for error thresholds of 2, 3, 4, and 5 pixels for the entire image (All), or only for unoccluded areas (Noc). We additionally report the endpoint error (EPE) for optical flow and stereo. We occasionally use the abbreviations SN for stereo without occluded areas, and SA when including these regions. Similarly, we shorten the respective identifiers for optical flow as FN and FA.

Evaluation of the Single Reference View Model
All experiments use fixed parameters, except where stated. We set the smoothness weight to λ = 1/16, and the weight of the segmentation term relative to λ as μ = 1/10λ. If not mentioned otherwise, we regularize in 2D space and fix η G = η M = 20.
We generate the proposal set from the output of 2D optical flow and stereo algorithms. For computing optical flow we employ the algorithm of Vogel et al. (2013a), which uses a census data term and a total generalized variation regularizer, a popular and effective combination for the KITTI scenes. To obtain an estimate in a reasonable time, we only apply 3 warps and 10 iterations per warp with an up-scaling factor of 0.9 in the image pyramid. The disparity map is obtained using semi-global matching (Hirschmüller 2008).
First, we evaluate the proposal fitting procedure from Sect. 4.6. Figure 12 (top) shows the KITTI metric at the default threshold (3 pixels), as well as the endpoint error of the plain 2D proposal algorithms (Init), and after the per-segment fitting took place (Fit). We observe only small changes in error, thus can conclude that planar rigid segment fitting does not significantly affect the accuracy. We attribute slight deviations in error to non-planar or non-rigid segments, e.g. due to misalignment with depth and motion boundaries.
Next, we investigate the simplification of the smoothness term when reasoning over segments, and how it affects the results. Recall that for computational efficiency we evaluate the spatial regularizer directly on the endpoint distances of the shared edge, instead of accumulating the contribution of all boundary pixels (Sect. 4.2). As we can see in Fig. 12 (bottom), the approximation (App) is quite accurate given our compact superpixels and on par with the exhaustive computation (Full), but in our experience ∼30× faster. Note that we here report results directly after the segment-level optimization, since both approaches employ the same per-pixel refinement step.
We now demonstrate that our representation and optimization approach are quite robust, in the sense that the results do not strongly depend on the initialization, parameter choice, etc. The importance of the initial segmentation is evaluated in Fig. 13 (left). Starting from a trivial "grid" segmentation (edge length 16 pixels) leads to a slight decrease in perfor- mance before the per-pixel refinement takes place. This gap is closed after the refinement step. Only a small difference in accuracy remains compared to starting from a proper superpixel segmentation. Note that this also helps understanding why, as mentioned, iterating the alternating inference further has little practical benefit; energy and error are not significantly reduced further.
The effect of starting with a different number of superpixels is depicted on the left of Fig. 14. After using more than ∼1000 initial segments, the accuracy of the final result becomes stable, as the per-pixel refinement can compensate for eventual inaccuracies in the coarser initial segmentation. But even starting with fewer segments does not harm the performance dramatically.
Similarly, varying the weight for the regularization term λ (Fig. 14, center) and the maximum superpixel size N S in the per-pixel refinement (Fig. 14, right) shows that the method is not sensitive to changes in these parameters. In the latter case higher values lead to better results, but also longer computation times. Next, we investigate the behavior when switching from 2D to 3D regularization. For 3D regularization we set η G = η M = 5 and λ = 0.25, thus increase the robustness in the smoothing process. We can observe from Fig. 13 (center left) that regularization w.r.t. 2D entities is slightly superior in the evaluated measures. This can possibly be explained by the fact that the error measures do not evaluate the 3D quality of the scene flow, but only its reprojection, i.e. disparity and 2D optical flow. Figure 13 (right) depicts the effect of replacing the visibility prediction (Sect. 4.4) by a trivial predictor, which assumes pixels to always stay in bounds. As we can see, predicting visibility by the initial 2D algorithms has a strong effect on the flow endpoint error in occluded regions. Other measures, however, are nearly unaffected.
The biggest impact on the quality of the estimated scene flow is given by the different proposal algorithms we utilize. In Fig. 15 we extend our standard 2D proposal set by adding proposals from 3D scene flow methods (3D-Props), namely L 1 -regularized 3D scene flow (Basha et al. 2010) and locally rigid 3D scene flow (Vogel et al. 2011). Furthermore, we evaluate our local replacement strategy (R), the ego-motion proposals (E, Sect. 4.6.1), and combine both proposal meth-ods (R+E). Additionally, we evaluate a variant in which we replace the rigid motion component of our proposals with the estimated camera motion (E-Hard), thus simulating a motion stereo algorithm, which enforces a rigid scene with only ego-motion, similar to Yamaguchi et al. (2013Yamaguchi et al. ( , 2014. We can observe that adding more proposals improves results; especially the endpoint error of optical flow is reduced. A larger gain is achieved by local replacement and, furthermore, by adding additional ego-motion proposals. Both approaches are complementary to some extent, as a combination slightly improves the results further. Finally, the best results can be achieved by enforcing ego-motion as a hard constraint, underlining the bias in the KITTI benchmark.

Evaluation of the Occlusion Model
We begin the evaluation of our occlusion model of Sect. 4.7 with a qualitative example of a street scene 2 from Vaudrey  (2008). The scene is recorded from a vehicle approaching a roundabout. Several independently moving traffic participants and a rather complex occlusion pattern pose a challenging scenario for our method. Figure 16 displays the results after the different processing steps of our approach. The estimate appears acceptable without occlusion handling, except for regions that are not visible in the reference image, e.g. at the left of the pedestrian. Adding the occlusion handling from Sect. 4.7 allows to detect occluded regions and to extrapolate the lateral motion in a plausible way. The per-pixel refinement (Per-Pixel and Occlusion) enhances the object contours and improves the occlusion boundaries even more.
We now quantitatively compare our basic model with and without additional occlusion handling. Figure 13 (center right) shows a small, but consistent advantage of explicit occlusion handling. The gap is largest for optical flow evalu-Footnote 2 continued truncation parameters (η G = η M = 1). Other deviating parameters were set to λ = 0.1, μ= 0.1, θ occ = 0.03. ated over the whole image. Note, however, that with additional proposals the advantage diminishes and the difference between both models becomes smaller. Recall that in order to perform optimization with graph-cut based techniques, like QPBO, the higher-order potentials, which can occur in case of multiple occlusions, have to be reduced to pairwise ones (Sect. 4.7). The resulting optimization problem possesses supermodular edges, such that nodes can remain unlabeled after running QPBO. To approximately minimize this NP-hard problem, Rother et al. (2007) proposed the QPBO-I method, which we also apply here. Table 1 summarizes our experience when applying the method on the KITTI training dataset. While the number of supermodular edges and unlabeled nodes appears to be small, employing QPBO-I instead of QPBO has a notable impact on the resulting energies. At the pixel level, the number of nodes that cannot be labeled by QPBO alone appears rather high at 7.7%. Optimization with QPBO-I, however, takes an order of magnitude more time. Another challenge is that this form of occlusion reasoning is sensitive to out- Numbers are averaged over the KITTI training set liers in the data term, such as specular highlights on the window of the car in Fig. 16. Note that without occlusion handling unlabeled nodes occur only very rarely (<1 per image).

Evaluation of the View-Consistent Model
As before, we keep all parameters fixed, unless otherwise mentioned. The only parameter that deviates strongly from the reference-view model is the smoothness weight. We set λ = 1/60, and regularize using the intensity-weighted edge length (Eq. 13), which is now based on multiple images. We set N S = 20 to speed up the per-pixel refinement, and start from an initial grid segmentation. We begin with several quantitative analyses to illustrate the different aspects of the proposed approach. First, we investigate whether our model can benefit from the hierarchical refinement of the grid described in Sect. 5.4.1. Figure 17 (left) compares the performance after a single and two refinement levels to the result without hierarchical refinement. The gain in performance is not large, but consistent throughout the evaluation; we use a single refinement step in the remaining experiments.
As our model is capable of jointly reasoning over multiple frames by assuming constant velocity for the rigidly moving segments, we investigate the performance when considering 2, 3, or 4 consecutive frames in Fig. 18. We further distinguish the addition of proposals from time steps other than the current one ("+"), meaning that we derive proposals from the disparity and 2D flow computed from the other adjacent frame pairs in the time window. Moreover, we include a variant that reasons about only two frames, but is provided with proposals extracted from three frames (VC-2F+). For comparison, we also add the single reference-view version PWRS+R (with local replacement), which is used to reduce the initial proposal set of the current frame pair. Note again that this reference-view method is only applied at the segment level.
Analyzing Fig. 18 one can observe that moving away from the single reference view (PWRS+R vs.VC-2F) already yields a significant improvement, most notably in the optical flow error w.r.t. all pixels. View-consistency benefits by considering the data of all views jointly. Parts that are occluded in the canonical view used for evaluation (and as a reference view PWRS+R) can still be visible in two other views. Furthermore, a strong drop in the endpoint error hints at a reduction of gross outliers. Including proposals from the previous time step (VC-2F+), and considering the image data of the previous frames (VC-3F) improve the results further. But only a combination (VC-3F+) of both leads to a larger performance gain in all measures, again affecting occluded regions most strongly. This suggests that a larger set of proposals from multiple frames alone is not sufficient, but that the image evidence from the longer sequence is important. Finally, including a fourth frame into the model yields diminishing returns, with only marginal improvements over the three frame case.
In another experiment we analyze the effect of the proposal set. Recall that we use the reference-view version of our method in order to prune the proposal set in the beginning, with the advantage of a reduced computation time for the view-consistent model. Figure 17 (center), however, also shows an effect on the accuracy of the algorithm, here evaluated for the three frame case without considering additional proposals from the previous time step. Interestingly, despite the fact that the application of PWRS-Seg yields only a subset of the original proposals (2D-Proposals), the results improve. An analysis shows that both variants deliver almost the same final energy, such that the cause is not well-captured by our energy formulation. We posit that this may be due to the proposal set not being sufficiently varied in crucial parts of the solution space, which is supported by the fact that the observed accuracy difference diminishes when we use proposals from the previous time step as well (VC-3F+). As we would expect given previous results Fig. 15, we observe a strong accuracy gain from the local replacement strategy (PWRS-Seg+R) and ego-motion proposals (PWRS-Seg+R+E); in these cases the additional proposals also noticeably lower the final energy. Because our method requires proposals for computing scene flow, we investigated how much a poor proposal set affects the performance. To that end we change the parameters of the initial 2D stereo and flow algorithms. For instance, in the optical flow case we use only a single warp per image pyramid and change the pyramid scale to 0.5. We then apply our two-frame view-consistent method (VC-2F) with PWRS-Seg+R to reduce the proposal set. The result is depicted in Fig. 17 (right). The notably high error of the 2D algorithms is reduced by a factor of 6 on average, showing that our scene flow approach can also cope with unfavorable proposal sets. This somewhat surprising result, achieved without consid-ering ego-motion information, can partially be explained by the particularities of the dataset and the algorithms used to compute the proposals. The flow algorithm should deliver reasonable results in areas with only small 2D motion vectors. Given the largely planar nature of the street scenes in the dataset, these parts can then be propagated into other image areas, which have the same 3D motion and geometry, but strongly differing 2D motion. This in turn suggests that 3D scene flow may be well-suited to cope with large motions due to its internal 3D representation.
Recall that the formulation of the data term, although directly leading to only pairwise edge potentials, introduces supermodular edges into the energy. In Table 2 we investigate the situation in a similar manner as for the occlusion handling strategy with a single reference view, again collecting data over the whole KITTI training set. We apply the QPBO-I algorithm to the optimization problem given by our three-frame version (VC-3F) and count the number of unlabeled nodes and supermodular edges over the course of the optimization. As we can see, the number of non-submodular edges is not much lower than in the reference-view case, but unlabeled nodes occur significantly less often. This motivates considering to solve the problem using graph cuts by applying the LSA-AUX algorithm (Gorelick et al. 2014) to find a submodular approximation of the problem at each expansion step. Conveniently, the local approximation bounds the true energy from above, such that the overall energy cannot increase, which is not the case if supermodular terms are just truncated. The final solutions show a comparable energy to results obtained with QPBO-I, while being an order of magnitude faster. A similar performance can be obtained by using QPBO instead of LSA-AUX and graph cuts. We begin with an illustration of several difficult examples from the KITTI benchmark (Fig. 19) recovered by our threeframe method (VC-3F+). The most interesting example is shown at the top (#74). In the presence of severe lens flares in both cameras, many optical and scene flow methods fail hopelessly to recover the motion in this scene. While the appearance of these artifacts is rather consistent in consecutive views, their location is not. This allows our approach to recover the scene flow reasonably well. Notably only 8.1% of the flow vectors of all pixels and 5.7% in the visible areas are outside the standard 3-pixel error threshold of KITTI. It is important to note that the robust handling of these artifacts is achieved only through view-and multi-frame consistency. Also depicted is a scene (#11) with low image contrast in shadow regions. Scene #123 is interesting because of similar problems with lens flare as for #74, here however challenging the reconstruction of the geometry as their location is consistent across frames. Finally #116 has fine structures in the image (e.g., the traffic sign), several areas with occlusions, and a car moving independently, albeit without ground truth. Figure 20 illustrates results for different outdoor scenes from Meister et al. (2012). We display the input images on the left. Our scene flow estimates (VC-3F+) are shown as disparities (center) and reprojected optical flow in the usual color coding. These examples show that our model is capable of handling independent object motion under unfavorable conditions. Even though the motion displacements in the image plane are rather small, the scenes contain many scenarios that are hard for conventional flow and stereo algorithms. The scenes 'car truck' and 'crossing' have saturated highlights and reflections, as well as a rather complex occlusion pattern. The scene 'car truck' also exhibits cast shadows dancing on the truck and the street. More challenging is 'sun flares', where the sun causes lens flares and 'flying snow', which as the name suggests contains heavy snow fall and a wet and reflecting street. The scene from Fig. 2 shows the wiper occluding the view and is, therefore, very difficult to recover for conventional approaches that parameterize the scene in a single camera only. The most complex scene is 'night snow', in which the aperture of the cameras is wide open and the image has only a shallow depth of field. Moreover, the windshield is wet, causing the headlights of approaching cars to flare. We can only give a qualitative evaluation here, as no ground truth for these scenes is available. Apart from the last scene, which has an incorrect depth in the sky region, our estimates appear quite appropriate.

Typical Failure Cases
Figure 21 displays some typical failure cases of our method. For example, it is challenged by over-saturated areas, especially if these are located close to the boundary of the images or in occlusion regions. Recall that we replace the data term with a fixed penalty (θ occ or θ oob ), if a pixel lacks a correspondence in other images. Now assume that a proposal exists that maps this over-saturated image region to a similarly over-saturated, but incorrect one in the other images. The data penalty in this case is close to zero, which compared to the energy of the true solution in our model (θ oob ) is decidedly lower. By demanding view-consistency, this incorrect solution will still incur penalties for the incorrect regions, since the geometry and/or motion is not consistent. However, as the penalties are accumulated per pixel, whether the correct correspondence can be recovered depends on the size of the respective regions in the images.
As already mentioned, a second challenge are imaging artifacts, e.g. sun flares (Fig. 21, bottom), that appear consistently in all the views. In the example the sun flare even leads to over-saturation, such that the low data energy may overrule the consistency penalty.

Quantitative Summary and Timings
A direct comparison between the view-consistent and single reference-view models is given in Table 3. Note that these differ from the published results in (Vogel et al. 2013b(Vogel et al. , 2014 due to a change in the KITTI ground truth, slightly different parameter sets, and extensions such as the local replacement strategy. The first row gives results for the 2D algorithms used to derive the proposals (2D Algorithms). Otherwise, we use the usual notation: PWRS for our basic reference-view model, PWRS+R for a version with local replacements, and PWRS+R+E to denote the usage of additional ego-motion proposals. For the view-consistent version (VC) we use PWRS+R+E to prune the proposals and distinguish between the two, three and four-frame versions, with (+) and without proposals from all frames. In general the numbers improve from top to bottom. Already our basic version achieves a significant reduction in all error measures compared to the state-of-the-art 2D proposals. Both strategies to generate additional proposals show their benefit, especially for flow. The view-consistent model leads to a visible reduction of the error in all measures already for the twoframe case. Moving to three frames further improves the results, especially for occluded areas, but considering four frames only yields marginal improvements. Notably, all but two numbers are at least halved when comparing our best result with the initial 2D solution. Table 4 illustrates the time spent on the different parts of the algorithm. We distinguish between running the 2D flow and stereo algorithms (Init), the proposal fitting procedure (Fit), and further time the inference at the segment level (Seg) and at the pixel level (Pix). We also show the time needed for generating additional proposals (R and E), and one hierarchical refinement step (Ref ). We compare numbers when starting with 1,850 and 1,150 segments. In both cases, our model with a single reference view (PWRS) needs less time for the optimization and both additional proposal generation strategies than for computing the initial optical flow and disparity maps. For the view-consistent case, we exploit the reduction in the number of proposals by first running PWRS+R+E at the segment level. With a low number of segments, our basic version (PWRS) needs only 20s to deliver a result. However, running the 2D proposal algorithms already takes significantly more time. Our most advanced three-frame method needs ∼3 min including proposals.
6.5 Comparison with the State of the Art Table 5 shows a comparison of our piecewise rigid scene model with the state of the art on the KITTI test set. At the time of writing, (August 2014) the benchmark has over 50 submissions in both categories. Our scene flow methods rank among the top performers, with the view-consistent model coming out first overall for both stereo and flow, when considering full images with occluded areas. Note that several top-ranked methods assume epipolar motion as a hard constraint (Setting ms). In contrast, our method can handle scenes with independently moving objects (c.f. Fig. 20), which are uncommon in the benchmark. Considering only methods applicable to general scenes, i.e. with independent object motion, the distance to the next best competitor is rather large, which demonstrates that scene flow from our piecewise rigid scene model, has a clear advantage over single camera methods for motion estimation under challenging conditions. Table 3 Results on the KITTI training set: average KITTI metric (% of flow vectors/disparities above 2, 3, 4, 5 pixels of endpoint error) and average endpoint error [px] with (All) and without (Noc) counting occluded regions

Conclusion
In this paper we introduced a scene flow approach that models dynamic scenes as a collection of piecewise planar, local regions, moving rigidly over time. It allows to densely recover geometry, 3D motion, and an over-segmentation of the scene into moving planes, leading to accurate geometry and motion boundaries. Employing unified reasoning over geometry, motion, segmentation and occlusions within the observed scene, our method achieves leading performance in a popular benchmark, surpassing dedicated state-of-the-art stereo and optical flow techniques at their respective task. We extend our basic reference-view technique to leverage information from multiple consecutive frames of a stereo video.
Our view-consistent approach exploits consistency over time and viewpoints, thereby significantly improving 3D scene flow estimation.
In particular, our model shows remarkable robustness to missing evidence, outliers, and occlusions, and can recover motion and geometry even under unfavorable imaging conditions, where many methods fail. In future work we plan to incorporate long-term temporal consistency into our framework, and to relax the constant velocity assumption to a more flexible formulation. Moreover, we aim to explicitly model small deviations from the local planarity and rigidity assumptions. Another promising route may be to include object-level semantic image understanding into the segmentation scheme, with associated class-specific motion and geometry models.