Seminars in 2016
Feature trajectories have been shown to be efficient for representing
videos. Typically, they are extracted using the
KLT tracker or matching SIFT descriptors between frames.
However, the quality as well as quantity of these trajectories
is often not sufficient. Inspired by the recent success
of dense sampling in image classification, we propose an
approach to describe videos by dense trajectories. We sample
dense points from each frame and track them based on
displacement information from a dense optical flow field.
Given a state-of-the-art optical flow algorithm, our trajectories
are robust to fast irregular motions as well as shot
boundaries. Additionally, dense trajectories cover the motion
information in videos well.
We also investigate how to design descriptors to encode
the trajectory information. We introduce a novel descriptor
based on motion boundary histograms, which is robust to
camera motion. This descriptor consistently outperforms
other state-of-the-art descriptors, in particular in uncontrolled
realistic videos. We evaluate our video description
in the context of action classification with a bag-of-features
approach. Experimental results show a significant improvement
over the state of the art on four datasets of varying
difficulty, i.e. KTH, YouTube, Hollywood2 and UCF sports.
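The core tracking step, sampling dense points and carrying them forward with a dense optical flow field, can be sketched in a few lines. This is a minimal numpy illustration under assumed inputs (precomputed per-pixel flow fields and a nearest-pixel lookup), not the authors' implementation:

```python
import numpy as np

def track_dense_points(flows, step=5):
    """Track a dense grid of points through a sequence of flow fields.

    flows: list of (H, W, 2) arrays giving per-pixel (dx, dy) displacement.
    Returns an array of trajectories, one per sampled grid point.
    """
    h, w = flows[0].shape[:2]
    # Densely sample starting points on a regular grid.
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    trajs = [pts.copy()]
    for flow in flows:
        # Look up the displacement at each point's nearest pixel and move it.
        ix = np.clip(pts[:, 0].round().astype(int), 0, w - 1)
        iy = np.clip(pts[:, 1].round().astype(int), 0, h - 1)
        pts = pts + flow[iy, ix]
        trajs.append(pts.copy())
    return np.stack(trajs)  # shape (T+1, N, 2)

# Synthetic constant rightward flow of 1 px/frame over 3 frames.
flow = np.zeros((20, 20, 2))
flow[..., 0] = 1.0
trajs = track_dense_points([flow] * 3)
```

In practice the flow fields would come from a dense optical flow algorithm; the grid spacing and the nearest-pixel lookup (instead of bilinear interpolation) are simplifications.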
We investigate the problem of estimating the 3D shape of an object, given a set of 2D landmarks in a single image. To alleviate the reconstruction ambiguity, a widely-used approach is to confine the unknown 3D shape within a shape space built upon existing shapes. While this approach has proven to be successful in various applications, a challenging issue remains, i.e., the joint estimation of shape parameters and camera-pose parameters requires solving a non-convex optimization problem. The existing methods often adopt an alternating minimization scheme to locally update the parameters, and consequently the solution is sensitive to initialization. In this paper, we propose a convex formulation to address this problem and develop an efficient algorithm to solve the proposed convex program. We demonstrate the exact recovery property of the proposed method, its merits compared to alternative methods, and the applicability in human pose and car shape estimation.
Background subtraction is a widely used technique for detecting moving objects in image sequences. Very often, background subtraction approaches assume the availability of one or more clear (i.e., without foreground objects) frames at the beginning of the input sequence. However, this assumption is not always true, especially when dealing with dynamic backgrounds or crowded scenes. In this paper, we present the results of a multi-modal background modeling method that is able to generate a reliable initial background model even if no clear frames are available. The proposed algorithm runs in real time on HD images. Quantitative experiments have been conducted taking into account six different quality metrics on a set of 14 publicly available image sequences. The obtained results demonstrate high accuracy in generating the background model in comparison with several other methods.
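A common baseline for initializing a background model without any clean frame is the per-pixel temporal median, which recovers the background wherever each pixel is unoccluded in more than half of the frames. This is not the paper's multi-modal method, only a minimal numpy sketch of the underlying idea:

```python
import numpy as np

def median_background(frames):
    """Estimate a background image as the per-pixel temporal median.

    Works even without any foreground-free frame, as long as each pixel
    shows the background in more than half of the frames.
    """
    return np.median(np.stack(frames, axis=0), axis=0)

# Static background of value 10 with a transient 'object' (value 200).
frames = [np.full((4, 4), 10.0) for _ in range(5)]
frames[2][1, 1] = 200.0  # object covers pixel (1, 1) in one frame only
bg = median_background(frames)
```

The median suppresses the transient object; a multi-modal model additionally handles backgrounds that themselves alternate between several appearances.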
Real-time automated vehicle make and model recognition (VMMR) based on a bag of speeded-up robust features (BoSURF).
Use SURF features of front- or rear-facing images and retain the dominant characteristic features (codewords) in a dictionary.
Single dictionary vs. modular dictionary; SURF features are mapped to BoSURF histograms.
Single multiclass SVM and an ensemble of multiclass SVMs based on attribute bagging.
Today, there are two major paradigms for vision-based
autonomous driving systems: mediated perception approaches
that parse an entire scene to make a driving decision,
and behavior reflex approaches that directly map an
input image to a driving action by a regressor. In this paper,
we propose a third paradigm: a direct perception approach
to estimate the affordance for driving. We propose to map
an input image to a small number of key perception indicators
that directly relate to the affordance of a road/traffic
state for driving. Our representation provides a set of compact
yet complete descriptions of the scene to enable a simple
controller to drive autonomously. Falling in between the
two extremes of mediated perception and behavior reflex,
we argue that our direct perception representation provides
the right level of abstraction. To demonstrate this, we train
a deep Convolutional Neural Network using recordings from
12 hours of human driving in a video game and show that
our model can work well to drive a car in a very diverse
set of virtual environments. We also train a model for car
distance estimation on the KITTI dataset. Results show that
our direct perception approach can generalize well to real
driving images. Source code and data are available on our website.
Person re-identification is the problem of recognizing people across images or videos from non-overlapping views. Although there has been much progress in person re-identification over the last decade, it still remains a challenging task because of severe appearance changes of a person due to diverse camera viewpoints and person poses. In this paper, we propose a novel framework for person re-identification by analyzing camera viewpoints and person poses, so-called Pose-aware Multi-shot Matching (PaMM), which robustly estimates target poses and efficiently conducts multi-shot matching based on the target pose information. Experimental results using public person re-identification datasets show that the proposed methods are promising for person re-identification under diverse viewpoints and pose variances.
We propose an integer programming method for estimating the instantaneous count of pedestrians crossing a line of interest (LOI) in a video sequence. Through a line sampling process, the video is first converted into a temporal slice image. Next, the number of people is estimated in a set of overlapping sliding windows on the temporal slice image using a regression function that maps from local features to a count. Given that the count in a sliding window is the sum of the instantaneous counts in the corresponding time interval, an integer programming method is proposed to recover the number of pedestrians crossing the LOI in each frame. Integrating over a specific time interval yields the cumulative count of pedestrians crossing the line. Compared with current methods for line counting, our proposed approach achieves state-of-the-art performance on several challenging crowd video data sets.
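The recovery step, per-frame counts from overlapping window sums, can be illustrated with a small linear system. The paper formulates an integer program; the sketch below is a relaxed least-squares version with rounding, and the truncated-window convention (window j sums frames max(0, j-w+1)..j) is our assumption for making the system well determined:

```python
import numpy as np

def recover_instant_counts(window_sums, w, n_frames):
    """Recover per-frame counts from overlapping window sums.

    Window j sums the counts of frames max(0, j-w+1) .. j, so windows at
    the start of the sequence are truncated. We solve the relaxed
    least-squares problem and round to the nearest non-negative integer.
    """
    A = np.zeros((len(window_sums), n_frames))
    for j in range(len(window_sums)):
        A[j, max(0, j - w + 1):j + 1] = 1.0
    c, *_ = np.linalg.lstsq(A, np.asarray(window_sums, dtype=float), rcond=None)
    return np.maximum(np.rint(c), 0).astype(int)

true_c = np.array([0, 1, 2, 0, 1, 0])  # pedestrians crossing per frame
w = 3
sums = [true_c[max(0, j - w + 1):j + 1].sum() for j in range(len(true_c))]
est = recover_instant_counts(sums, w, len(true_c))
```

Summing the recovered counts over any time interval gives the cumulative line count for that interval.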
This paper deals with Korean-English bilingual
videotext recognition for news headline generation. Because
videotext contains semantic content information, it can be
effectively used for understanding videos. Despite its usefulness,
it is a challenging task to apply text recognition
technologies to practical video applications because of the
computational complexity and recognition accuracy. In this
paper, we propose a novel Korean-English bilingual videotext
recognition method to overcome the computational
complexity as well as achieve comparable recognition
accuracy. To recognize both Korean and English characters
effectively, the proposed method employs an elaborate split-merge
strategy in which the split segments are merged into
characters using the recognition scores. Moreover, it avoids
unnecessary computation using geometric features such as
squareness and internal gap, and thus its computational
overhead is remarkably reduced. Therefore, the proposed
method is successfully employed in generating news headlines.
The effectiveness and efficiency of the proposed
method are verified by extensive experiments on a challenging
database containing 51,290 text images (176,884 characters).
We present a new, massively parallel method for high-quality multiview matching. Our work builds on the Patchmatch idea: starting from randomly generated 3D planes in scene space, the best-fitting planes are iteratively propagated and refined to obtain a 3D depth and normal field per
view, such that a robust photo-consistency measure over all images is maximized. Our main novelties are, on the one hand, a formulation of Patchmatch in scene space, which makes it possible to aggregate image similarity across multiple views and obtain more accurate depth maps; and, on the
other hand, a modified, diffusion-like propagation scheme that can be massively parallelized and delivers dense multiview correspondence over ten 1.9-megapixel images in 3 seconds, on a consumer-grade GPU. Our method uses a slanted support window and thus has no fronto-parallel
bias; it is completely local and parallel, such that computation time scales linearly with image size, and inversely proportional to the number of parallel threads. Furthermore, it has low memory footprint (four values per pixel, independent of the depth range). It therefore scales exceptionally well and can handle multiple large images at high depth resolution. Experiments on the DTU and Middlebury multiview datasets as well as oblique aerial images show that our method achieves very competitive results with high accuracy and completeness, across a range of different scenarios.
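The photo-consistency measure at the heart of such matching schemes is often zero-mean normalized cross-correlation (NCC) between patches, which is invariant to affine brightness changes. A minimal numpy sketch (the function name and toy patches are ours, not the authors' code):

```python
import numpy as np

def ncc(a, b, eps=1e-9):
    """Zero-mean normalized cross-correlation between two image patches.

    Close to 1.0 means identical up to a brightness/contrast change; this
    is the kind of similarity score maximized across views.
    """
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

p = np.array([[1.0, 2.0], [3.0, 4.0]])
q = 2.0 * p + 5.0   # the same patch under a linear brightness change
r = p[::-1].copy()  # a structurally different patch
```

A robust variant would truncate or reweight the score to limit the influence of occluded views.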
Although there has long been interest in foreground-background segmentation based on change detection for video surveillance applications, the issue of inconsistent performance across different scenarios remains a serious concern. To address this, we propose a new type of word-based approach that regulates its own internal parameters using feedback mechanisms to withstand difficult conditions while keeping sensitivity intact in regular situations. Coined "PAWCS", this method's key advantages lie in its highly persistent and robust dictionary model based on color and local binary features as well as its ability to automatically adjust pixel-level segmentation behavior. Experiments using the 2012 ChangeDetection.net dataset show that it outranks numerous recently proposed solutions in terms of overall performance as well as in each category. A complete C++ implementation based on OpenCV is available online.
3D reconstruction is one of the most popular research areas in computer vision and computer graphics, and it is widely used in many fields, such as video games and animation. It recovers a 3D model from 2D images. Using this technology, we can re-create a scene, observe the model stereoscopically from any viewpoint, and perceive the world better. In this paper, we use techniques such as point cloud building and surface reconstruction to obtain the visual hull. To make the visual hull look more vivid and natural, adding texture is necessary. This research shows that the proposed solution has advantages such as feasibility and ease of reconstruction.
This paper addresses the problem of detecting
people in two dimensional range scans. Previous approaches
have mostly used pre-defined features for the detection and
tracking of people. We propose an approach that utilizes a supervised
learning technique to create a classifier that facilitates
the detection of people. In particular, our approach applies
AdaBoost to train a strong classifier from simple features of
groups of neighboring beams corresponding to legs in range
data. Experimental results carried out with laser range data
illustrate the robustness of our approach even in cluttered office environments.
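The first step, grouping neighboring beams of a range scan into candidate segments and computing simple per-segment features for the boosted classifier, can be sketched as follows. The jump threshold and the particular features (beam count, spread, mean range) are illustrative assumptions, not the paper's exact feature set:

```python
import numpy as np

def segment_scan(ranges, jump=0.3):
    """Split a 2D laser scan (one range per beam) into groups of neighboring
    beams, cutting wherever consecutive ranges jump by more than `jump` m."""
    r = np.asarray(ranges, dtype=float)
    cuts = np.where(np.abs(np.diff(r)) > jump)[0] + 1
    return np.split(r, cuts)

def segment_features(seg):
    """Simple per-segment features of the kind a boosted classifier
    (e.g. AdaBoost over decision stumps) could be trained on."""
    return {"n_beams": int(len(seg)),
            "spread": float(seg.max() - seg.min()),
            "mean_range": float(seg.mean())}

scan = [3.0, 3.0, 3.1, 1.0, 1.05, 1.0, 3.2, 3.2]  # a leg-like blob at ~1 m
segments = segment_scan(scan)
```

AdaBoost would then combine many weak threshold tests on such features into a strong leg/non-leg classifier.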
Background modeling and subtraction is a classical topic in
computer vision. Gaussian mixture modeling (GMM) is a popular
choice for its capability of adaptation to background variations.
Many improvements have been made to enhance the
robustness by considering spatial consistency and temporal
correlation. In this paper, we propose a sharable GMM based
background subtraction approach. Firstly, a sharable mechanism
is presented to model the many-to-one relationship between
pixels and models. Each pixel dynamically searches
the best matched model in the neighborhood. This space-sharing scheme is robust to camera jitter, dynamic backgrounds,
etc. Secondly, the sharable models are built for both
background and foreground. The noise caused by small local movements can be effectively eliminated through the
background sharable models, while the integrity of moving
objects is enhanced by the foreground sharable models, especially
for small objects. Finally, each sharable model is updated by randomly selecting a pixel that matches it, and a flexible mechanism is added for switching between
background and foreground models. Experiments on
ChangeDetection benchmark dataset demonstrate the effectiveness
of our approach.
This paper aims to deal with real-time traffic sign recognition, i.e., localizing what type of traffic sign appears in which area of an input image at a fast processing speed. To achieve this goal, we first propose an extremely fast detection module, which is 20 times faster than the existing best detection module. Our detection module is based on traffic sign proposal extraction and classification built upon a color probability model and a color HOG. Then, we employ a convolutional neural network to further classify the detected signs into their subclasses within each superclass.
Hierarchical neural networks have been shown to be effective in learning representative image features and recognizing object classes. However, most existing networks combine the low/middle level cues for classification without accounting for any spatial structures. For applications such as understanding a scene, how the visual cues are spatially distributed in an image becomes essential for successful analysis. This paper extends the framework of deep neural networks by accounting for the structural cues in the visual signals. In particular, two kinds of neural networks have been proposed. First, we develop a multitask deep convolutional network, which simultaneously detects the presence of the target and the geometric attributes (location and orientation) of the target with respect to the region of interest. Second, a recurrent neuron layer is adopted for structured visual detection. The recurrent neurons can deal with the spatial distribution of visible cues belonging to an object whose shape or structure is difficult to explicitly define. Both networks are demonstrated on the practical task of detecting lane
boundaries in traffic scenes. The multitask convolutional neural network provides auxiliary geometric information to help the subsequent modeling of the given lane structures. The recurrent
neural network automatically detects lane boundaries, including those areas containing no marks, without any explicit prior knowledge or secondary modeling.
Observing that text in virtually any script is formed of strokes, we propose a novel easy-to-implement stroke detector based on an efficient pixel intensity comparison to surrounding pixels. Stroke-specific keypoints are efficiently detected and text fragments are subsequently extracted by local thresholding guided by keypoint properties. Classification based on effectively calculated features eliminates non-text regions. The stroke-specific keypoints produce 2 times fewer region segmentations while detecting 25% more characters than the commonly exploited MSER detector, and the process is 4 times faster. After a novel efficient classification step, the number of regions is reduced to 7 times fewer than the standard method, and the pipeline is still almost 3 times faster. All stages of the proposed pipeline are scale- and rotation-invariant and support a wide variety of scripts (Latin, Hebrew, Chinese, etc.) and fonts. When the proposed detector is plugged into a scene text localization and recognition pipeline, state-of-the-art text localization accuracy is maintained while the processing time is significantly reduced.
In this paper we propose a novel recurrent neural network architecture for video-based person re-identification. Given the video sequence of a person, features are extracted from each frame using a convolutional neural network that incorporates a recurrent final layer, which allows information to flow between time-steps. The features from all time-steps are then combined using temporal pooling to give an overall appearance feature for the complete sequence. The convolutional network, recurrent layer, and temporal pooling layer are jointly trained to act as a feature extractor for video-based re-identification using a Siamese network architecture. Our approach makes use of colour and optical flow information in order to capture appearance and motion information which is useful for video re-identification. Experiments are conducted on the iLIDS-VID and PRID-2011 datasets to show that this approach outperforms existing methods of video-based re-identification.
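The temporal pooling step, collapsing per-frame feature vectors into one sequence-level descriptor, is simple to sketch. Mean pooling over time is shown here as an assumption (the abstract only states that temporal pooling is used), with max pooling for comparison:

```python
import numpy as np

def temporal_pool(frame_features, mode="mean"):
    """Collapse per-frame feature vectors (T, D) into a single sequence-level
    appearance descriptor (D,) by pooling over the time axis."""
    f = np.asarray(frame_features, dtype=float)
    return f.mean(axis=0) if mode == "mean" else f.max(axis=0)

# Toy per-frame features: 3 frames, 2-D feature vectors.
feats = np.array([[1.0, 0.0],
                  [3.0, 2.0],
                  [2.0, 4.0]])
pooled = temporal_pool(feats)
```

In the full system the (T, D) matrix would come from the CNN-plus-recurrent-layer feature extractor, and the pooled vectors of two sequences would be compared under the Siamese training objective.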
Maintaining a normal burning temperature is essential to ensuring the quality of nonferrous metals and cement clinker in a rotary kiln. Recognition of the temperature
condition is an important component of a temperature control system. Because of the interference of smoke and dust in the kiln, the temperature of the burning zone is
difficult to measure accurately using traditional methods. Focusing on blurry images from which only the flame region can be segmented, an image recognition system for
the detection of the temperature condition in a rotary kiln is presented. First, the flame region is segmented employing a region-growing method with a dynamic seed point.
Seven features, comprising three luminous features and four dynamic features, are then extracted from the flame region. Dynamic features constructed from luminous feature
sequences are proposed to overcome the problem of mis-recognition when the temperature of the flame region changes rapidly. Finally, classifiers are trained to recognize the temperature state of the burning zone using its features. Experimental results using real datasets demonstrate that the proposed image-based systems for recognizing the temperature condition are effective and robust.
We present a set of experiments with a video OCR system (VOCR) tailored for video information retrieval
and establish its importance in multimedia search in general and for some specific queries in particular. The
system, inspired by an existing work on text detection and recognition in images, has been developed using
techniques involving detailed analysis of video frames producing candidate text regions. The text regions are
then binarized and sent to a commercial OCR resulting in ASCII text, that is finally used to create search
indexes. The system is evaluated using the TRECVID data. We compare the system's performance from an
information retrieval perspective with another VOCR developed using multi-frame integration and empirically
demonstrate that deep analysis of individual video frames results in better video retrieval. We also evaluate
the effect of various textual sources on multimedia retrieval by combining the VOCR outputs with automatic
speech recognition (ASR) transcripts. For general search queries, the VOCR system coupled with ASR sources
outperforms the other system by a large margin. For search queries that involve named entities, especially
people names, the VOCR system even outperforms speech transcripts, demonstrating that source selection for
particular query types is essential.
This paper approximates the 3D geometry of a scene by a small number of 3D planes. The method is especially suited to man-made scenes, and only requires two calibrated wide-baseline views as inputs. It relies on the computation of a dense but noisy 3D point cloud, as for example obtained by matching DAISY descriptors between the views. It then segments one of the two reference images, and adopts a multi-model fitting process to assign a 3D plane to each region, when the region is not detected as occluded. A pool of 3D plane hypotheses is first derived from the 3D point cloud, to include planes that reasonably approximate the part of the 3D point cloud observed from each reference view between randomly selected triplets of 3D points. The hypothesis-to-region assignment problem is then formulated as an energy-minimization problem, which simultaneously optimizes an original data-fidelity term, the assignment smoothness over neighboring regions, and the number of assigned planar proxies. The synthesis of intermediate viewpoints demonstrates the effectiveness of our 3D reconstruction, and thereby the relevance of our proposed data-fidelity metric.
In this paper, we present an in-vehicle computing system capable of localizing lane markings and communicating them to drivers. To the best of our knowledge, this is the first system that combines the Maximally Stable Extremal Region (MSER) technique with the Hough transform to detect and recognize lane markings
(i.e., lines and pictograms). Our system begins by localizing the region of interest using the MSER technique. A three-stage refinement computing algorithm is then introduced to enhance the results of MSER and to filter out undesirable information such as trees and vehicles. To achieve the requirements of real-time systems, the Progressive Probabilistic Hough Transform (PPHT) is used in the detection stage to detect line markings. Next, the recognition of the color and the form of line markings is performed; this is based on the results of the application of the MSER to left and right line markings. The recognition of High-Occupancy Vehicle pictograms is performed using a new algorithm based on the results of MSER regions. In the tracking stage, a Kalman filter is used to track both ends of each detected line marking. Several experiments are conducted to show the efficiency of our system.
Vehicle lane-level localization is a fundamental
technology in autonomous driving. To achieve accurate and
consistent performance, a common approach is to use the
LIDAR technology. However, it is expensive and computationally demanding, and thus not a practical solution in many situations.
This paper proposes a stereovision system, which is of low
cost, yet also able to achieve high accuracy and consistency.
It integrates a new lane line detection algorithm with other lane
marking detectors to effectively identify the correct lane line
markings. It also fits multiple road models to improve accuracy.
An effective stereo 3D reconstruction method is proposed to
estimate vehicle localization. The estimation consistency is further
guaranteed by a new particle filter framework, which takes
vehicle dynamics into account. Experimental results based on
image sequences taken under different visual conditions showed
that the proposed system can identify the lane line markings with
98.6% accuracy. The maximum estimation error of the vehicle
distance to lane lines is 16 cm in daytime and 26 cm at night,
and the maximum estimation error of its moving direction with
respect to the road tangent is 0.06 rad in daytime and 0.12 rad
at night. Due to its high accuracy and consistency, the proposed
system can be implemented in autonomous driving vehicles as a
practical solution to vehicle lane-level localization.
This paper presents a stereo matching approach for
a novel multi-perspective panoramic stereo vision system,
making use of asynchronous and non-simultaneous stereo
imaging towards real-time 3D 360° vision. The method is
designed for events representing the scene's visual contrast
as a sparse visual code allowing the stereo reconstruction of
high resolution panoramic views. We propose a novel cost
measure for the stereo matching, which makes use of a similarity
measure based on event distributions. This increases the robustness to variations in event occurrences.
An evaluation of the proposed stereo method is presented
using distance estimation of panoramic stereo views and
ground truth data. Furthermore, our approach is compared
to standard stereo methods applied on event-data. Results
show that we obtain 3D reconstructions of 1024 × 3600
round views and outperform depth reconstruction accuracy
of state-of-the-art methods on event data.
Recurrence of small image patches across different scales of a natural
image has been previously used for solving ill-posed problems (e.g., superresolution
from a single image). In this paper we show how this multi-scale property
can also be used for "blind deblurring", namely, removal of an unknown blur
from a blurry image. While patches repeat "as is" across scales in a sharp natural
image, this cross-scale recurrence significantly diminishes in blurry images.
We exploit these deviations from ideal patch recurrence as a cue for recovering
the underlying (unknown) blur kernel. More specifically, we look for the blur
kernel k, such that if its effect is "undone" (if the blurry image is deconvolved
with k), the patch similarity across scales of the image will be maximized. We
report extensive experimental evaluations, which indicate that our approach compares
favorably to state-of-the-art blind deblurring methods and, in particular, is more robust.
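The cross-scale recurrence cue can be made concrete with a small numpy sketch: measure how well patches of a downscaled image are explained by patches of the full-resolution image. The patch size, the brute-force nearest-patch search, and the test images are our assumptions for illustration, not the paper's kernel-estimation procedure:

```python
import numpy as np

def patches(img, p):
    """All p x p patches of a 2D image, flattened to rows."""
    H, W = img.shape
    return np.array([img[i:i + p, j:j + p].ravel()
                     for i in range(H - p + 1) for j in range(W - p + 1)])

def cross_scale_dissimilarity(img, p=3):
    """Mean distance from each patch of the 2x-downscaled image to its
    nearest patch in the full-resolution image. Low values indicate strong
    cross-scale patch recurrence, the sharpness cue described above."""
    small = img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))
    P_big, P_small = patches(img, p), patches(small, p)
    d2 = ((P_small[:, None, :] - P_big[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean())

sharp = np.zeros((8, 8))
sharp[:, 4:] = 1.0  # a crisp step edge repeats exactly across scales
blurry = np.zeros((8, 8))  # the same edge, smeared into a ramp
blurry[:, 3], blurry[:, 4], blurry[:, 5], blurry[:, 6:] = 0.25, 0.5, 0.75, 1.0
d_sharp = cross_scale_dissimilarity(sharp)
d_blurry = cross_scale_dissimilarity(blurry)
```

Kernel estimation then searches for the blur kernel whose deconvolution drives this dissimilarity back down.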
In this paper, we propose a Multiple Background
Model based Background Subtraction (MB2S) algorithm that is
robust against sudden illumination changes in indoor environments. It uses multiple background models of expected
illumination changes followed by both pixel and frame based
background subtraction on both RGB and YCbCr color spaces.
The masks generated after processing these input images are
then combined in a framework to classify background and
foreground pixels. Evaluation of the proposed approach on publicly available test sequences shows higher precision and recall than
other state-of-the-art algorithms.
The novelties of this paper are threefold: 1) We use joint activities of four Gabor filters and a confidence measure to speed up texture orientation estimation. 2) Misidentification chances and the computational complexity of the algorithm are reduced by using a particle filter, which limits the vanishing point search range and reduces the number of pixels to be voted. The algorithm combines the peakedness measure of the vote accumulator space with the displacements of the moving average of observations to regulate the distribution of vanishing point candidates. 3) Attributed to the design of a noise-insensitive observation model,
A new abandoned object detection system for road traffic surveillance, based on three-dimensional image information, is proposed in this paper to help prevent traffic accidents. A novel Binocular Information Reconstruction and Recognition (BIRR) algorithm is presented to implement this idea. As initial detection, suspected abandoned objects are detected by the proposed static foreground region segmentation algorithm based on surveillance video from a monocular camera. After detection of suspected abandoned objects, three-dimensional (3D) information of the suspected abandoned object is reconstructed by the proposed theory of 3D object information reconstruction with images from a binocular camera. To determine whether the detected object is hazardous to normal road traffic, the road plane equation and the height of the suspected abandoned object are calculated based on the three-dimensional information. Experimental results show that the system detects abandoned objects quickly and can be used for road traffic monitoring and public area surveillance.
This paper revisits the classical multiple hypotheses
tracking (MHT) algorithm in a tracking-by-detection framework.
The success of MHT largely depends on the ability
to maintain a small list of potential hypotheses, which
can be facilitated with the accurate object detectors that are
currently available. We demonstrate that a classical MHT
implementation from the 90s can come surprisingly close
to the performance of state-of-the-art methods on standard
benchmark datasets. In order to further utilize the strength
of MHT in exploiting higher-order information, we introduce
a method for training online appearance models for
each track hypothesis. We show that appearance models
can be learned efficiently via a regularized least squares
framework, requiring only a few extra operations for each
hypothesis branch. We obtain state-of-the-art results on
popular tracking-by-detection datasets such as PETS and
the recent MOT challenge.
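The regularized least squares learning mentioned above has a cheap closed form, which is what makes per-hypothesis appearance models affordable. A minimal numpy sketch; the toy features and the plus/minus-one labelling of target versus distractor detections are our assumptions:

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Closed-form regularized least squares: w = (X'X + lam*I)^(-1) X'y.
    Cheap enough to refit for every track-hypothesis branch."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def score(X, w):
    """Appearance score of candidate detections under the model."""
    return X @ w

# Toy appearance features: target detections labelled +1, distractors -1.
X = np.array([[1.0, 0.1],
              [0.9, 0.0],
              [0.1, 1.0],
              [0.0, 0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = ridge_fit(X, y)
```

Because only X'X and X'y change when a detection is appended to a hypothesis, the model can be updated incrementally with a few extra operations per branch.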
Automatic fire detection has become more and more appealing because of the increasing use of video capabilities in surveillance systems used for early detection of fire. However, its high computational complexity limits its use in real-time applications. To meet the real-time processing demands of today's fire detection techniques, this study proposes a single instruction, multiple data many-core model. To design an efficient many-core model for image processing applications such as fire detection, a key design parameter is the image data-per-processing-element (IDPE) variation of the many-core system, which is the amount of image data directly mapped to each processing element (PE). This study quantitatively evaluates the impact of the IDPE variation on system performance and energy efficiency for the multi-stage fire detection approach that consists of movement-containing region detection, color segmentation, fire feature extraction, and deciding whether there is a fire in a processed video frame. In this study, we use six IDPE ratios to determine an optimal many-core model that provides the most efficient operation for fire detection using architectural and workload simulation. Experimental results indicate that the most efficient many-core model is achieved at the 64 IDPE value in terms of worst-case execution time and energy efficiency. In addition, this study compares the performance of the most efficient many-core configuration with that of a commercial graphics processing unit (Nvidia GeForce GTX 480) to show the improved performance of the proposed many-core model for the fire detection algorithm. This many-core configuration outperforms the commercial graphics processing unit in worst-case execution time and energy efficiency.
In this paper we explore interactions between the appearance of an outdoor scene and the ambient temperature. By studying statistical correlations between image sequences from outdoor cameras and temperature measurements we identify two interesting interactions. First, semantically meaningful regions such as foliage and reflective oriented surfaces are often highly indicative of the temperature. Second, small camera motions are correlated with the temperature in some scenes. We propose simple scene specific temperature prediction algorithms which can be used to turn a camera into a crude temperature sensor. We find that for this task, simple features such as local pixel intensities outperform sophisticated, global features such as from a semantically-trained convolutional neural network.
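The scene-specific prediction described above amounts to a linear regression from pixel-intensity features to measured temperature. A minimal numpy sketch on synthetic data; the feature construction (one temperature-tracking region, one constant region) is our assumption:

```python
import numpy as np

def fit_temperature_model(intensities, temps):
    """Least-squares linear map from pixel-intensity features to temperature.

    intensities: (N, D) features from N timestamped frames; temps: (N,).
    """
    X = np.hstack([intensities, np.ones((len(intensities), 1))])  # add bias
    coef, *_ = np.linalg.lstsq(X, temps, rcond=None)
    return coef

def predict_temperature(intensities, coef):
    X = np.hstack([intensities, np.ones((len(intensities), 1))])
    return X @ coef

# Synthetic scene: one region's brightness tracks temperature linearly,
# another region stays constant.
temps = np.array([0.0, 10.0, 20.0, 30.0])
feats = np.stack([0.5 * temps + 3.0, np.full(4, 7.0)], axis=1)  # (N, 2)
coef = fit_temperature_model(feats, temps)
pred = predict_temperature(feats, coef)
```

In a real deployment the features would be intensities of automatically selected image regions, and the model would be fit per camera.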
Multi-View-Stereo (MVS) methods aim for the highest detail possible; however, such detail is often not required. In this work, we propose a novel surface reconstruction method based on image edges, superpixels, and second-order smoothness constraints, producing meshes comparable in quality to classic MVS surfaces but orders of magnitude faster. Our method performs per-view dense depth optimization directly over sparse 3D Ground Control Points (GCPs), hence removing the need for view pairing, image rectification, and stereo depth estimation, and allowing for full per-image parallelization. We use Structure-from-Motion (SfM) points as GCPs, but the method is not specific to these; e.g., LiDAR or RGB-D points can also be used. The resulting meshes are compact and inherently edge-aligned with image gradients, enabling good-quality lightweight per-face flat renderings. Our experiments on a variety of 3D datasets demonstrate superior speed and competitive surface quality.
Localization of the vehicle with respect to road lanes plays a critical role in making vehicles fully autonomous. Vision-based road lane line detection provides a feasible and low-cost solution, as the vehicle pose can be derived from the detection. While good progress has been made, road lane line detection remains an open problem, given challenging road appearances with shadows, varying lighting conditions, worn-out lane lines, etc. In this paper, we propose an approach that is more robust to these challenges. The approach incorporates four key steps. Lane line pixels are first pooled with a ridge detector. An effective noise filtering mechanism then removes noise pixels to a large extent. A modified version of sequential RANdom SAmple Consensus (RANSAC) is then adopted in a model fitting procedure to ensure that each lane line in the image is captured correctly. Finally, if lane lines exist on both sides of the road, a parallelism reinforcement technique is imposed to improve the model accuracy. The results obtained show that the proposed approach detects lane lines accurately and at a higher success rate than current approaches. The model derived from the lane line detection is capable of generating precise and consistent vehicle localization information with respect to road lane lines, including road geometry, vehicle position, and orientation.
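The sequential-RANSAC step can be sketched as: fit one line to the candidate lane pixels, remove its inliers, and repeat on the remainder. The line model, thresholds, and iteration count below are illustrative, not the paper's modified variant:

```python
import random

# Hedged sketch of sequential RANSAC line fitting on 2D lane-pixel
# candidates. Thresholds and iteration counts are illustrative.

def fit_line(p, q):
    """Line through p, q as (a, b, c) with a*x + b*y + c = 0, |(a,b)| = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, n_iter=200, thresh=1.0, seed=0):
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        p, q = rng.sample(points, 2)
        if p == q:
            continue
        a, b, c = fit_line(p, q)
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Candidate pixels on the line y = 2x, plus two noise points.
pts = [(x, 2 * x) for x in range(10)] + [(3, 40), (7, -5)]
inliers = ransac_line(pts)
remaining = [p for p in pts if p not in inliers]  # fed to the next round
```

In the sequential scheme, `remaining` would be passed back into `ransac_line` until no sufficiently supported line is found.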
While numerous algorithms have been proposed for object tracking with demonstrated success, it remains a challenging problem for a tracker to handle large changes in scale, motion, and shape deformation with occlusion. One of the main reasons is the lack of an effective image representation to account for appearance variation. Most trackers use high-level appearance structure or low-level cues for representing and matching target objects. In this paper, we propose a tracking method from the perspective of mid-level vision, with structural information captured in superpixels. We present a discriminative appearance model based on superpixels, thereby facilitating a tracker to distinguish the target from the background with mid-level cues. The tracking task is then formulated by computing a target-background confidence map and obtaining the best candidate by maximum a posteriori estimation. Experimental results demonstrate that our tracker is able to handle heavy occlusion and recover from drifts. In conjunction with online update, the proposed algorithm is shown to perform favorably against existing methods for object tracking.
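The confidence-map idea can be illustrated in a few lines: every pixel inherits a target-vs-background score from its superpixel, and the best candidate window maximizes the summed confidence. The labels and scores below are toy values, not the paper's learned model:

```python
import numpy as np

# Toy target-background confidence map built from superpixel scores,
# followed by exhaustive candidate-window scoring. All values are
# illustrative placeholders.

labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
])
# Confidence per superpixel: >0 target-like, <0 background-like.
sp_conf = {0: -0.8, 1: 0.9, 2: -0.5, 3: 0.7}

conf_map = np.vectorize(sp_conf.get)(labels)

def best_window(conf, h, w):
    """(row, col) of the h-by-w window with maximal summed confidence."""
    best, best_pos = -np.inf, None
    for r in range(conf.shape[0] - h + 1):
        for c in range(conf.shape[1] - w + 1):
            s = conf[r:r + h, c:c + w].sum()
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos

target_pos = best_window(conf_map, 2, 2)
```

The maximum a posteriori step in the paper plays the role of `best_window` here, scoring sampled candidates against the confidence map.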
Methods for super-resolution can be broadly classified into two families: (i) classical multi-image super-resolution (combining images obtained at subpixel misalignments), and (ii) example-based super-resolution (learning correspondences between low- and high-resolution image patches from a database). In this paper we propose a unified framework for combining these two families of methods. We further show how this combined approach can be applied to obtain super-resolution from as little as a single image (with no database or prior examples). Our approach is based on the observation that patches in a natural image tend to redundantly recur many times inside the image, both within the same scale and across different scales. Recurrence of patches within the same image scale (at subpixel misalignments) gives rise to classical super-resolution, whereas recurrence of patches across different scales of the same image gives rise to example-based super-resolution. Our approach attempts to recover at each pixel its best possible resolution increase based on its patch redundancy within and across scales.
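The machinery behind cross-scale recurrence is a nearest-neighbor patch search against a downscaled copy of the same image. The sketch below shows that search on a random image; patch size, scale factor, and the exhaustive search are illustrative simplifications:

```python
import numpy as np

# Toy sketch of within-image patch search: build a coarser scale of
# the image and look up a patch in it by exhaustive nearest-neighbor
# matching. All sizes and the image itself are illustrative.

def downscale2x(img):
    """Naive 2x downscale by 2x2 block averaging."""
    h, w = img.shape
    return img[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def best_match(patch, img):
    """Exhaustive patch search; returns (row, col, ssd) of best match."""
    ph, pw = patch.shape
    best = (0, 0, np.inf)
    for r in range(img.shape[0] - ph + 1):
        for c in range(img.shape[1] - pw + 1):
            ssd = float(((img[r:r + ph, c:c + pw] - patch) ** 2).sum())
            if ssd < best[2]:
                best = (r, c, ssd)
    return best

rng = np.random.default_rng(1)
image = rng.uniform(0, 1, size=(32, 32))
coarse = downscale2x(image)

patch = coarse[5:8, 5:8].copy()        # take a 3x3 patch from the coarse scale
r, c, ssd = best_match(patch, coarse)  # and recover its location exhaustively
```

In the actual method, matches found in coarser scales point back to higher-resolution "parent" patches in the input, which supply the example pairs that a database would otherwise provide.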
Most of the recently published background subtraction methods can still be classified as pixel-based, as most of their analysis is done using pixel-by-pixel comparisons. A few others might be regarded as spatial-based (or even spatiotemporal-based) methods, as they take into account the neighborhood of each analyzed pixel. Although the latter types can be viewed as improvements in many cases, most of the methods proposed so far suffer in complexity, processing speed, and/or versatility when compared to their simpler pixel-based counterparts. In this paper, we present an adaptive background subtraction method, derived from the low-cost and highly efficient ViBe method, which uses a spatiotemporal binary similarity descriptor as its core component instead of simply relying on pixel intensities. We then test this method on multiple video sequences and show that, by only replacing the core component of a pixel-based method, it is possible to dramatically improve its overall performance while keeping memory usage, complexity, and speed at acceptable levels.
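A local binary similarity descriptor of the kind swapped in above can be sketched in a few lines: each neighbor of a pixel is compared to a reference intensity and contributes one bit, and descriptors are compared by Hamming distance instead of raw intensity difference. The 3x3 pattern and threshold here are illustrative, not the paper's exact layout:

```python
# Illustrative local binary similarity descriptor: bit i is set when
# neighbor i is similar to the reference intensity. Pattern size and
# threshold are assumptions for the sketch.

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def binary_descriptor(img, r, c, ref, thresh=8):
    """8-bit code: bit i is 1 iff |neighbor_i - ref| <= thresh."""
    bits = 0
    for i, (dr, dc) in enumerate(OFFSETS):
        if abs(img[r + dr][c + dc] - ref) <= thresh:
            bits |= 1 << i
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

# The background model stores a descriptor per pixel; new frames are
# matched against it by Hamming distance.
bg = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
fg = [[100, 100, 100], [100, 30, 100], [100, 100, 255]]

d_bg = binary_descriptor(bg, 1, 1, bg[1][1])
d_fg = binary_descriptor(fg, 1, 1, bg[1][1])  # new frame vs background reference
change = hamming(d_bg, d_fg)
```

A pixel is flagged as foreground when `change` (plus, in practice, an intensity check) exceeds a distance threshold against enough background samples.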
Advanced driver assistance systems require the accurate detection and classification of moving objects. We define a composite object representation that includes class information in the core object's description. We propose a complete perception fusion architecture based on the evidential framework to solve the detection and tracking of moving objects problem by integrating the composite representation and uncertainty management. We integrate our fusion approach in a real-time application inside a vehicle demonstrator from the interactIVe IP European project.
This paper addresses the problem of single-target tracker performance evaluation. We consider the performance measures, the dataset, and the evaluation system to be the most important components of tracker evaluation and propose requirements for each of them. The requirements are the basis of a new evaluation methodology that aims at a simple and easily interpretable tracker comparison. The ranking-based methodology addresses tracker equivalence in terms of statistical significance and practical differences. A fully annotated dataset with per-frame annotations of several visual attributes is introduced. The diversity of its visual properties is maximized in a novel way by clustering a large number of videos according to their visual attributes, making it the most sophisticatedly constructed and annotated dataset to date. A multi-platform evaluation system allowing easy integration of third-party trackers is presented as well. The proposed evaluation methodology was tested in the VOT2014 challenge on the new dataset with 38 trackers, making it the largest benchmark to date. Most of the tested trackers are indeed state-of-the-art, since they outperform the standard baselines, resulting in a highly challenging benchmark. An exhaustive analysis of the dataset from the perspective of tracking difficulty is carried out. To facilitate tracker comparison, a new performance visualization technique is proposed.
In this paper, we propose a method that is able to detect fires by analyzing videos acquired by surveillance cameras. Two main novelties have been introduced. First, complementary information, based on color, shape variation, and motion analysis, is combined by a multi-expert system. The main advantage of this approach is that the overall performance of the system increases significantly with relatively little effort by the designer. Second, a novel descriptor based on a bag-of-words approach is proposed for representing motion. The proposed method has been tested on a very large dataset of fire videos acquired both in real environments and from the web. The obtained results confirm a consistent reduction in the number of false positives, without sacrificing accuracy or the possibility of running the system on embedded platforms.
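The multi-expert combination can be sketched as a weighted vote: each expert (color, shape variation, motion) emits a fire score, and a fused score is thresholded. The weights, scores, and threshold below are illustrative placeholders, not the paper's combination rule:

```python
# Hedged sketch of multi-expert fusion: three independent experts each
# output a fire score in [0, 1]; a weighted average yields the final
# decision. All numbers are illustrative.

def combine_experts(scores, weights, threshold=0.5):
    """Weighted-average fusion of per-expert fire scores."""
    total = sum(w * s for w, s in zip(weights, scores))
    fused = total / sum(weights)
    return fused, fused >= threshold

# Color expert is confident; shape and motion less so:
scores = [0.9, 0.4, 0.6]
weights = [1.0, 1.0, 1.0]
fused, is_fire = combine_experts(scores, weights)
```

The appeal noted in the abstract is that each expert stays simple and the designer only tunes the combination, not a monolithic detector.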
Detection of preceding vehicles in nighttime traffic scenes is an important part of the advanced driver assistance system (ADAS). This paper proposes a region tracking-based vehicle detection algorithm using image processing techniques. First, the brightness of the taillights at night is used as the typical feature, and we use an existing global detection algorithm to detect and pair the taillights. Once a vehicle is detected, a time series analysis model is introduced to predict vehicle positions and the possible region (PR) of the vehicle in the next frame. The vehicle is then detected only within the PR. This reduces the detection time and avoids false pairings between bright spots inside the PR and bright spots outside it. Additionally, we present a threshold-updating method to make the thresholds adaptive. Finally, experimental studies are provided to demonstrate the application and substantiate the superiority of the proposed algorithm. The results show that the proposed algorithm can simultaneously reduce both the false negative detection rate and the false positive detection rate.
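The PR-prediction step can be illustrated with the simplest possible motion model: extrapolate the vehicle's next position from its recent centers and pad by a margin. The constant-velocity assumption and the margin are stand-ins for the paper's time series analysis model:

```python
# Sketch of possible-region (PR) prediction: extrapolate the next
# vehicle center from recent positions under a constant-velocity
# assumption, then pad by a margin. Model and margin are illustrative,
# not the paper's time-series model.

def predict_pr(positions, margin=10):
    """positions: list of (x, y) centers from recent frames.
    Returns the PR as (x_min, y_min, x_max, y_max)."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    vx, vy = x1 - x0, y1 - y0          # velocity from the last two frames
    px, py = x1 + vx, y1 + vy          # predicted next center
    return (px - margin, py - margin, px + margin, py + margin)

history = [(100, 50), (104, 50), (108, 50)]  # moving right at 4 px/frame
pr = predict_pr(history)
```

Restricting taillight pairing to this box is what cuts both detection time and cross-region false pairings.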
Character recognition in video is a challenging task because the low resolution and complex background of video cause disconnections, loss of information, loss of character shapes, etc. In this paper, we introduce a novel ring radius transform (RRT) and the concept of medial pixels on characters with broken contours in the edge domain for reconstruction. For each pixel, the RRT assigns a value which is the distance to the nearest edge pixel. The medial pixels are those which have the maximum radius values in their neighborhood. We demonstrate the application of these concepts to the problem of character reconstruction, improving the character recognition rate in video images. With the ring radius transform and medial pixels, our approach exploits the symmetry between the inner and outer contours of a broken character to reconstruct the gaps. Experimental results and comparison with two existing methods show that the proposed method outperforms the existing methods in terms of measures such as relative error and character recognition rate.
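The two definitions above (radius = distance to nearest edge pixel; medial pixels = local maxima of the radius map) translate directly into code. The toy edge map, the L1 distance, and the brute-force search below are illustrative; a real implementation would use a fast distance transform:

```python
import numpy as np

# Toy sketch of the ring radius transform and medial pixels on a tiny
# synthetic edge map (two parallel contours, as on a character stroke).
# L1 distance and brute-force search are illustrative simplifications.

edges = np.zeros((7, 7), dtype=bool)
edges[1, 1:6] = True   # upper contour
edges[5, 1:6] = True   # lower contour

ys, xs = np.nonzero(edges)
edge_pts = np.stack([ys, xs], axis=1)

def rrt(shape, pts):
    """Radius map: per-pixel L1 distance to the nearest edge pixel."""
    radius = np.zeros(shape)
    for r in range(shape[0]):
        for c in range(shape[1]):
            d = np.abs(pts - np.array([r, c])).sum(axis=1)
            radius[r, c] = d.min()
    return radius

radius = rrt(edges.shape, edge_pts)

# Medial pixels: radius maximal within the 3x3 neighborhood (interior only).
medial = [(r, c) for r in range(1, 6) for c in range(1, 6)
          if radius[r, c] >= radius[r - 1:r + 2, c - 1:c + 2].max()]
```

The medial pixels land midway between the two contours, which is the symmetry the reconstruction step exploits to bridge gaps in a broken character.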
This paper presents a method to predict social saliency, the likelihood of joint attention, given an input image or video, by leveraging social interaction data captured by first-person cameras. Inspired by electric dipole moments, we introduce a social formation feature that encodes the geometric relationship between joint attention and its social formation. We learn this feature from first-person social interaction data in which we can precisely measure the locations of joint attention and its associated members in 3D. An ensemble classifier is trained to learn the geometric relationship. Using the trained classifier, we predict social saliency in real-world scenes with multiple social groups, including scenes from team sports captured in a third-person view. Our representation does not require directional measurements such as gaze directions. A geometric analysis of social interactions in terms of the F-formation theory is also presented.
This paper describes a multiple-view-based approach for building modeling via a novel multi-box grammar, which represents an occlusion relationship among the projections of a set of buildings sharing a common Manhattan World coordinate system. We formulate the building modeling problem as an energy minimization that combines the constraints from the multi-box grammar with (1) the semantic labeling information from appearance models, (2) the directional information w.r.t. the vanishing points in each single view, and (3) the planar homography correspondence among multiple views. We further propose a two-step coarse-to-fine approach to achieve the optimal solution. First, we employ superpixels and a simplified version of the grammar to reduce the search space and obtain an initial layout that accelerates convergence. In the second stage, the scene model is refined to pixel-level accuracy by minimizing the energy using Random Walk. Experiments on street view images demonstrate the capability of our method in reconstructing multiple buildings at different distances, as well as its robustness in handling occlusion.
Updating road markings is one of the routine tasks of transportation agencies. Compared with traditional road inventory mapping techniques, vehicle-borne mobile light detection and ranging (LiDAR) systems can undertake the job safely and efficiently. However, current hurdles include software and computing challenges when handling huge volumes of highly dense and irregularly distributed 3-D mobile LiDAR point clouds. This paper presents the development and implementation aspects of an automated object extraction strategy for rapid and accurate road marking inventory. The proposed road marking extraction method is based on 2-D georeferenced feature (GRF) images, which are interpolated from 3-D road surface points through a modified inverse distance weighted (IDW) interpolation. Weighted neighboring difference histogram (WNDH)-based dynamic thresholding and multiscale tensor voting (MSTV) are proposed to segment and extract road markings from the noisy corrupted GRF images. The results obtained using 3-D point clouds acquired by a RIEGL VMX-450 mobile LiDAR system in a subtropical urban environment are encouraging.
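The GRF-image step rests on inverse distance weighted interpolation of scattered road-surface points onto a raster. A minimal sketch of plain IDW follows; the points, power parameter, and grid are illustrative, and the paper uses a modified IDW variant rather than this textbook form:

```python
import numpy as np

# Minimal sketch of inverse distance weighted (IDW) interpolation of
# scattered 3-D-derived intensity samples onto a raster cell, the step
# used to build georeferenced feature (GRF) images. All values are
# illustrative; the paper's modified IDW is not reproduced here.

def idw(query, pts, vals, power=2.0, eps=1e-12):
    """Interpolate the value at `query` (x, y) from scattered samples."""
    pts, vals = np.asarray(pts, float), np.asarray(vals, float)
    d = np.linalg.norm(pts - np.asarray(query, float), axis=1)
    if d.min() < eps:                      # query coincides with a sample
        return float(vals[d.argmin()])
    w = 1.0 / d ** power
    return float((w * vals).sum() / w.sum())

# Four road-surface points, each with an intensity value:
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
values = [10.0, 20.0, 10.0, 20.0]

center = idw((1, 1), points, values)   # equidistant -> plain average
on_pt = idw((0, 0), points, values)    # exactly on a sample point
```

Rasterizing every grid cell this way turns the irregular point cloud into a regular image on which the WNDH thresholding and tensor voting can operate.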