Contents
Tracking Methods
The essence of this tracker is region matching. The match criterion is similarity based on a color histogram. Between frames, the region being tracked can change location, appearance (somewhat), and size. To examine possibilities quickly, the Mean Shift algorithm is used as a way to converge from an initial guess for location and scale to the best match based on the color-histogram similarity. The color histogram is computed over an ellipsoid window, and the contributions at each point are gaussian weighted based on their distance from the window's center. An initial, hand-positioned oval defines the target distribution. Location is updated in each frame using the Mean Shift algorithm across pixel location and scale to find the local maximum for similarity to the target distribution. Similarity is measured as Bhattacharyya distance.
Pro:
A weighted histogram is a good paradigm for soft appearance matching. Mean
Shift as the region-search method should then be an efficient way to locate
the best match.
Tracks the motion of individual features from frame to frame. The features are local, textured patches with gradients in more than one direction. Features that change appearance too rapidly during tracking are discarded.
Pro:
Explicitly detects unstable features that are likely to be boundary
features arising from accidental intersections.
In this class of methods, the appearance, scale, and/or location, of the tracked region is predicted based on prior frames and on some underlying model for state transitions. The objective is to prevent drift by conditioning each possible new state by a probability envelope that's based on meaningful predictions about the tracked object's image motion.
The object is tracked by its contour. Probabilities for the contour's current shape and location are estimated for each frame by sampling state transitions from the previous frames. The underlying model is Hidden Markov: the probability of a state depends only on the state preceding it. Pro: Data driven; makes few assumptions about the object's motion.Con: Needs to be initialized somehow, so prediction can begin. The contour needs to be represented analytically.
Sethian's conceptual overview of level-set methods Level-set methods model the evolution of a curve over time with partial differential equations (PDEs). The curve is represented as the intersection of a moving surface with a plane (the image plane, e.g.). Because propagation speed is curvature dependent, the curvature of the front acts as a smoothing function. While level-set methods are elegant and have desirable properties, such as automatically handling changes in topology, solving the PDEs is computationally demanding. In Shi and Karl's method, the boundary of the tracked object is represented with two lists. One list contains the pixels on the interior boundary of the curve. The other list contains the pixels on its outer boundary. Curve evolution is achieved by simply switching pixels between these two lists. This process is very fast. The PDEs are never solved explicitly. Instead, the behavior of these equations is modeled algorithmically. At each new frame, the algorithm iterates in two steps: The smoothing step consists of applying a gaussian filter to a sign function of the boundary pixels. The signs of interior and exterior pixels are opposed, and boundary pixels have a lower magnitude than pixels further from the boundary.
Pro:
The level-set approach explicitly embodies mathematically proven constraints
on how curve fronts evolve. These constraints appear to offer a principled
method for maintaining tracking quality. With an appropriate descriptor
choice, this method can do both segmentation and tracking.
In this paper, the authors explicitly modeled the effects of motion blur as local convolution with a gaussian. They use this model to extend Shi and Tomasi's tracking model. Tracking is then represented as an optimization problem -- minimizing the error in the extended model.
This tracker represents batches of tracked and scale-rectified object windows from k frames by their average pixel intensity. These averaged windows are then included in a rolling set of size N. As each batch of k frames is processed, its average window becomes the latest in the rolling set, and the oldest batch average is dropped. As each batch is added, the rolling set of averages is converted to an orthonormal basis using the Gramm-Schmidt process. This orthonormal basis then serves as the appearance model for the next batch of frames. In each frame, the location, orientation, and scale of the tracked object is selected by similarity to the current appearance model.
Pro:
Realtime tracking. Adaptive appearance model with very few assumptions.
Segmentation Methods
Two-stage process: 1. Discover motions from a sparse representation (interest points).
B.Use RANSAC to detect motion fields. This step uses a planar homography.
Pro:
Handles large interframe disparities.
The Mean Shift algorithm can be used for segmentation as well as for tracking. The first step is to filter the image by representing each pixel with a vector that combines both image location and pixel values. (Pixel values might, for example, be color or grayscale intensity.) This filtering step assigns each pixel the values of the nearest maximum density location for the combined vector. These maximum density points are then clustered.
Pro:
An ellipse is simple to implement and conforms well to face
shape from several angles. It leverages available information --
user input for the location of eye centers and physical knowledge
about faces.
Supporting Methods and Concepts
The Mean Shift algorithm is an iterative, gradient ascent method for finding the local maximum of a target kernel-density distribution. Providing that the kernel function, k, is differentiable, convergence is guaranteed. Steps:
The mean shift vector, M, is a local approximation of
the normalized density gradient obtained with kernel k. The
recursion formula for this algorithm is
![]() The inner square-root term is the dot product between the histograms, with the histograms represented as vectors. This term is proportional to the cosine of the angle between these vectors.
Excerpt: "Given a model that requires a minimum of n data points to instantiate its free parameters, and a set of data points P such that the number of points in P is greater than n [#(P)>=n], randomly select a subset S1 of n data points from P and instantiate the model. Use the instantiated model M1 to determine the subset SI* of points in P that are within some error tolerance of Ml. The set SI* is called the consensus set of S1. "If #(SI*) is greater than some threshold t, which is a function of the estimate of the number of gross errors in P, use SI* to compute (possibly using least squares) a new model MI*. "If #(SI*) is less than t, randomly select a new subset S2 and repeat the above process. If, after some predetermined number of trials, no consensus set with t or more members has been found, either solve the model with the largest consensus set found, or terminate in failure. "There are two obvious improvements to the above algorithm: First, if there is a problem related rationale for selecting points to form the S's, use a deterministic selection process instead of a random one; second, once a suitable consensus set S* has been found and a model M* instantiated, add any new points from P that are consistent with M* to S* and compute a new model on the basis of this larger set."
If point x in image 1 and point x' in image 2 lie on the same plane, the relationship Hx = x' holds. H is a 3x3 transformation matrix. Four point pairs are required to solve for H. The 4 points must not be collinear (to avoid degeneracy). This planar-homography equation serves as the model for finding planes using RANSAC. To instantiate the model, use four random point pairs and find the matrix H that gives the least squared error.
|