MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
LE VAN HUNG
3D OBJECT DETECTIONS AND RECOGNITIONS:
ASSISTING VISUALLY IMPAIRED PEOPLE IN
DAILY ACTIVITIES
Major: Computer Science
Code: 9480101
ABSTRACT OF DOCTORAL DISSERTATION
COMPUTER SCIENCE
Hanoi - 2018
The dissertation is completed at:
Hanoi University of Science and Technology
Supervisors:
1. Dr. Vu Hai
2. Assoc. Prof. Nguyen Thi Thuy
Reviewer 1: Assoc. Prof. Luong Chi Mai
Reviewer 2: Assoc. Prof. Le Thanh Ha
Reviewer 3: Assoc. Prof. Nguyen Quang Hoan
The dissertation will be defended before approval committee
at Hanoi University of Science and Technology:
the object like the front of a cup, box or fruit, while the information that the VIPs need
is the position, size and direction of the object for safe grasping. For
this reason, we use the "3-D objects estimation method" to estimate this information about the
objects.
By knowing that the queried object is a coffee cup, which usually has a cylindrical shape
and lies on a flat surface (table plane), the aided system could resolve the query by
fitting a primitive shape to the point cloud collected from the object. Objects in
the kitchen or tea room, such as cups, bowls, jars, fruit and funnels, are usually placed
on tables. Therefore, these objects can be simplified by primitive shapes. The
problem of detecting and recognizing complex objects in the scene is not considered
in the dissertation. Prior knowledge observed from the current scene (for example, a
cup normally stands on the table), contextual constraints (for example, walls in the scene are
perpendicular to the table plane) and the limited size/height of the queried object would
be valuable cues to improve the system performance.
Figure 1 Illustration of a real scenario: a VIP comes to the kitchen and gives a
query: "Where is a coffee cup?" on the table. Left panel shows a Kinect mounted on
the human's chest. Right panel: the developed system is built on a laptop PC.
Generally, we observe that the queried objects could be identified through simplified geometric shapes: planar segments (boxes), cylinders (coffee mugs, soda cans),
spheres (balls) and cones, without utilizing conventional 3-D features. Following these
ideas, a pipeline of the work "3-D Object Detection and Recognition for Assisting Visually Impaired People" is proposed. It consists of several tasks, including: (1) separating
the queried objects from the table plane detection result by using a coordinate system transformation technique; (2) detecting candidates for the interested objects
using appearance features; and (3) estimating a model of the queried object from a
3-D point cloud. The last task plays an important role. Instead of matching
the queried objects to 3-D models as conventional learning-based approaches do, this
research work focuses on constructing a simplified geometrical model of the queried
objects from an unstructured set of point clouds collected by an RGB and range sensor.
proposed system operates with an MS Kinect sensor version 1. The Kinect sensor is
mounted on the chest of the VIP and the laptop is placed in a backpack, as shown
in Fig. 1-bottom. To deploy a real application, we impose the following constraints on the
scenario:
• The MS Kinect sensor:
– An MS Kinect sensor is mounted on the VIP's chest and he/she moves slowly
around the table to collect data of the environment.
– The MS Kinect sensor captures RGB and depth images at a normal frame rate
(from 10 to 30 fps) with an image resolution of 640×480 pixels for both
image types. With each frame obtained from the Kinect, an acceleration vector
is also obtained. Because the MS Kinect collects images at
10 to 30 fps, it fits well with the slow movements of the VIPs (∼ 1 m/s).
Although image data collected via a wearable sensor can be affected by the
subject's movement (e.g., image blur and vibrations in practical situations),
there are no specific requirements for collecting the image data. For
instance, VIPs are not required to stand still before collecting the image
data.
– Every queried object needs to be placed in the visible area of the MS Kinect
sensor, which is at a distance of 0.8 to 4 meters and within an angle of 30° around
the center axis of the MS Kinect sensor. Therefore, the distance
from the VIPs to the table is also constrained to about 0.8 to 4 m.
• Interested (or queried) objects are assumed to have simple geometrical structures.
For instance, coffee mugs, bowls, jars, bottles, etc. have a cylindrical shape, whereas
balls have a spherical shape and boxes have a cube shape. They are idealized
and labeled. The modular interaction between a VIP and the system has not been
• Computational time: A point cloud of a scene that is generated from an image
with a size of 640 × 480 pixels consists of hundreds of thousands of points. Therefore, computations in the 3-D environment often require higher computational
costs than a task in the 2-D environment.
Contributions
Throughout the dissertation, the main objectives are addressed by a unified
solution. We achieve the following contributions:
• Contribution 1: We proposed a new robust estimator called GCSAC (Geometrical
Constraints SAmple Consensus) for the estimation of primitive shapes from the
point cloud of the objects. Different from conventional RANSAC algorithms
(RANdom SAmple Consensus), GCSAC selects the uncontaminated (so-called
qualified or good) samples from a set of data points using geometrical
constraints. Moreover, GCSAC is extended by utilizing contextual constraints
to validate the results of the model estimation.
• Contribution 2: We carried out a comparative study of three different approaches
for recognizing 3-D objects in a complex scene. The best one
is a combination of a deep-learning based technique and the proposed robust estimator (GCSAC). This method takes advantage of recent advances in object detection using
a neural network on the RGB image and utilizes the proposed GCSAC to estimate
the full 3-D models of the queried objects.
• Contribution 3: We successfully deployed a system using the proposed methods for
detecting 3-D primitive shape objects in a lab-based environment. The system
combines the table plane detection technique and the proposed method of 3-D
object detection and estimation. It achieves fast computation for both tasks of
locating and describing the objects. As a result, it fully supports the VIPs in
grasping the queried objects.
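The idea in Contribution 1 of qualifying samples before estimating a model can be illustrated with a minimal sketch. It is not the GCSAC algorithm of the dissertation: it fits a plane rather than a cylinder, scores models by plain inlier counting instead of a likelihood, and the constraint `triangle_is_wide` together with all thresholds are hypothetical choices made only for illustration.

```python
import numpy as np

def fit_plane(sample):
    """Plane (unit normal n, offset d) through three points, with n.x + d = 0."""
    a, b, c = sample
    n = np.cross(b - a, c - a)
    norm = np.linalg.norm(n)
    if norm < 1e-12:          # collinear points define no plane
        return None
    n = n / norm
    return n, -float(np.dot(n, a))

def triangle_is_wide(sample, min_area=0.01):
    """Hypothetical geometrical constraint: reject nearly degenerate triangles."""
    a, b, c = sample
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a)) >= min_area

def constrained_ransac(points, is_good_sample, threshold=0.01,
                       iterations=200, max_tries=20, seed=0):
    """RANSAC-style loop that only estimates models from qualified samples."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, -1
    for _ in range(iterations):
        sample = None
        for _ in range(max_tries):            # search for a good sample
            idx = rng.choice(len(points), size=3, replace=False)
            if is_good_sample(points[idx]):
                sample = points[idx]
                break
        if sample is None:
            continue
        model = fit_plane(sample)
        if model is None:
            continue
        n, d = model
        inliers = int(np.sum(np.abs(points @ n + d) < threshold))
        if inliers > best_inliers:            # keep the best model so far
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

A call such as `constrained_ransac(points, triangle_is_wide)` returns the best plane and its inlier count; swapping in another predicate changes which samples are allowed to generate hypotheses.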
Figure 3 A general framework of detecting the 3-D queried objects on the table of the
VIPs.
RGB and Depth images and table plane detection in order to separate the interested
objects from the current scene. The second phase aims to label the object candidates
on the RGB images. The third phase is to estimate a full model from the point cloud
specified by the first and the second phases. In the last phase, the 3-D objects are
estimated by utilizing a new robust estimator, GCSAC, for the full geometrical models.
Utilizing this framework, we deploy a real application. The application is evaluated
in different scenarios, including datasets collected in lab environments and public
datasets. The dissertation is composed of six
chapters as follows:
• Introduction: This chapter describes the main motivations and objectives of the
study. We also present the research context, constraints and
challenges that we meet and address in the dissertation. Additionally, the general
framework and main contributions of the dissertation are presented.
• Chapter 1: A Literature Review: This chapter mainly surveys existing aided
CHAPTER 1
LITERATURE REVIEW
In this chapter, we survey the related works on aid systems
for the VIPs and on object detection methods in indoor environments. Firstly, relevant
aiding applications for VIPs are presented in Sec. 1.1. Then, the robust estimators and
their applications in robotics and computer vision are presented in Sec. 1.3. Finally,
we introduce and analyze the state-of-the-art works on 3-D object detection and
recognition in Sec. 1.2.
1.1 Aided systems supporting for visually impaired people
1.1.1 Aided systems for navigation service
1.1.2 Aided systems for obstacle detection
1.1.3 Aided systems for locating the interested objects in scenes
Linear fitting algorithms
1.3.2 Robust estimation algorithms
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations
1.3.4 Discussions
CHAPTER 2
POINT CLOUD REPRESENTATION AND THE
PROPOSED METHOD FOR TABLE PLANE
DETECTION
A common situation in the activities of daily living of visually impaired people (VIPs)
is to query an object (a coffee cup, water bottle, and so on) on a flat surface. We assume
that such a flat surface could be a table plane in a shared room or in a kitchen. To
build a complete aided system supporting VIPs, the queried objects
should be separated from the table plane in the current scene. In a general framework that
consists of other steps such as detection and full-model estimation of the queried objects,
the table plane detection could be considered as a pre-processing step. Therefore, this
chapter is organized as follows: Firstly, we introduce a representation of the point
clouds that combines the data collected by the Kinect sensor in Section 2.1. We then
present the proposed method for the table plane detection in Section 2.2.
2.2 The proposed method for table plane detection
2.2.1 Introduction
Plane detection in 3-D point clouds is a critical task for many robotics and computer vision applications. In order to help visually impaired/blind people find and
grasp interesting objects (e.g., coffee cup, bottle, bowl) on the table, one has to find
the table planes in the captured scenes. This work is motivated by such an adaptation,
in which acceleration data provided by the MS Kinect sensor is used to prune the extraction results. The proposed algorithms achieve real-time performance as well as a high
detection rate of the table planes.
2.2.2 Related Work
2.2.3 The proposed method
2.2.3.1 The proposed framework
Our research context aims to develop object finding and grasping-aided services
for VIPs. The proposed framework, as shown in Fig. 2.6, consists of four steps: down-sampling, organized point cloud representation, plane segmentation and table plane
classification. Because our work utilizes only the depth feature, a simple and effective
method for down-sampling and smoothing the depth data is described below.
$$D(x_c, y_c) = \frac{1}{N} \sum_{i=1}^{N} D(x_i, y_i) \tag{2.2}$$
where D(xi , yi ) is the depth value of the ith neighboring pixel of the center pixel (xc , yc ); N is
the number of pixels in the neighborhood n × n (N = (n × n) - 1).
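A minimal sketch of the neighborhood averaging described by Eq. 2.2 is given below, assuming that the value at each center pixel is replaced by the mean of its N = n × n - 1 neighbors. The function name `neighbor_mean_depth`, the edge-replication padding at the image border and the default n = 3 are assumptions made for illustration; the down-sampling step itself is not covered.

```python
import numpy as np

def neighbor_mean_depth(depth, n=3):
    """Mean depth over the n x n neighborhood of each pixel, center excluded."""
    if n % 2 == 0 or n < 3:
        raise ValueError("n must be an odd integer >= 3")
    depth = np.asarray(depth, dtype=float)
    r = n // 2
    # Assumption: borders are handled by replicating the edge values.
    padded = np.pad(depth, r, mode="edge")
    h, w = depth.shape
    total = np.zeros_like(depth)
    for dy in range(n):
        for dx in range(n):
            total += padded[dy:dy + h, dx:dx + w]
    total -= depth                 # exclude the center pixel itself
    return total / (n * n - 1)     # N = n*n - 1 neighbors
```

For a 640×480 depth image the loop runs only n × n times over whole-image slices, so the cost stays small compared with a per-pixel Python loop.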
2.2.3.2 Plane segmentation
The detailed process of the plane segmentation is given in (Holz et al. RoboCup,
2011).
2.2.3.3 Table plane detection/extraction
The results of the first step are planes that are perpendicular to the acceleration
vector. After rotating the coordinate system such that the y axis is parallel to the acceleration vector,
the table plane is the highest plane in the scene, that is, the
one with the minimum y-value.
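The selection rule above can be sketched as follows. This is a minimal illustration, assuming each segmented plane is given as a (normal, centroid) pair, that the acceleration vector points toward the floor, and that a hypothetical tolerance `max_angle_deg` decides whether a plane counts as perpendicular to it. Projecting a centroid onto the unit acceleration direction gives its y-value in the rotated coordinate system.

```python
import numpy as np

def select_table_plane(planes, acceleration, max_angle_deg=10.0):
    """Index of the highest plane perpendicular to the acceleration vector.

    planes: list of (normal, centroid) pairs, each a 3-vector.
    Returns None when no plane satisfies the orientation constraint.
    """
    g = np.asarray(acceleration, dtype=float)
    g = g / np.linalg.norm(g)
    best_index, best_y = None, np.inf
    for i, (normal, centroid) in enumerate(planes):
        n = np.asarray(normal, dtype=float)
        n = n / np.linalg.norm(n)
        # A plane is perpendicular to g when its normal is parallel to g.
        cos_angle = np.clip(abs(np.dot(n, g)), 0.0, 1.0)
        if np.degrees(np.arccos(cos_angle)) > max_angle_deg:
            continue
        # y-value of the centroid once the y axis is aligned with g.
        y = float(np.dot(np.asarray(centroid, dtype=float), g))
        if y < best_y:
            best_index, best_y = i, y
    return best_index
```

Walls are discarded by the orientation test before heights are compared, so a wall segment with a small y-value cannot be mistaken for the table.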
2.2.4 Experimental results
2.2.4.1 Experimental setup and dataset collection
The first dataset is called 'MICA3D': a Microsoft Kinect version 1 is mounted on
the person's chest, and the person then moves around one table in the room. The distance
between the Kinect and the center of the table is about 1.5 m. The height of the
0.63
0.83
Proposed Method   96.65  96.78  97.73  97.0     0.81          5
Table 2.3: The average result of detected table plane on the dataset [3] (%).
                  Evaluation Measurement
Approach          EM1    EM2    EM3    Average  Missing rate  Frames per second
First Method      87.39  68.47  98.19  84.68    0.0           1.19
Second Method     87.39  68.47  95.49  83.78    0.0           0.98
Proposed Method   87.39  68.47  99.09  84.99    0.0           5.43
normal vector extracted from the detected table plane and the normal vector extracted
from ground-truth data.
Evaluation measure 2 (EM2): By using EM1, only one point was used (center
2.3 Separating the interested objects on the table plane
2.3.1 Coordinate system transformation
2.3.2 Separating table plane and interested objects
2.3.3 Discussions
CHAPTER 3
PRIMITIVE SHAPES ESTIMATION BY A NEW
ROBUST ESTIMATOR USING GEOMETRICAL
CONSTRAINTS
3.1 Fitting primitive shapes by GCSAC
3.1.1 Introduction
The geometrical model of an interested object can be estimated using from two to
seven geometrical parameters, as in (Schnabel et al. 2007). Random Sample Consensus (RANSAC) and its variants attempt to extract shape parameters that are as good as possible, subject to either heavy noise in the data or processing time constraints. In
particular, at each hypothesis in the framework of a RANSAC-based algorithm, a search process is implemented that aims at finding good samples based on the constraints of an estimated model.
To perform the search for good samples, we define two criteria: (1) The
[Figure: flowcharts of a RANSAC/MLESAC iteration and of GCSAC. Block labels: a point cloud; randomly sampling a minimal subset; search good sampling based on geometrical constraint (GS); geometrical parameters estimation M; estimation model and compute the inlier ratio w (k=0: MLESAC; k=1: test w ≥ wt); model evaluation M via negative log-likelihood and update the best model; update the number of iterations K adaptively (Eq. 3.2); terminate?; estimated model.]
$$K = \frac{\log(1 - p)}{\log(1 - w^{s})} \tag{3.2}$$
where p is the probability of finding a model describing the data, s is the minimal number
of samples needed to estimate a model, and w is the percentage of inliers in the point cloud.
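As a worked illustration of how these three quantities determine the number of iterations, the sketch below assumes that Eq. 3.2 is the standard RANSAC bound K = log(1 - p) / log(1 - w^s), with w expressed as a ratio in [0, 1]. The function name and the handling of the limiting cases are assumptions.

```python
import math

def adaptive_iterations(p, w, s):
    """Number of iterations K needed to draw an all-inlier sample with probability p.

    p: desired probability of finding a good model (0 < p < 1)
    w: inlier ratio in the point cloud (0 <= w <= 1)
    s: minimal number of samples needed to estimate a model
    """
    if w <= 0.0:
        return math.inf        # no inliers: a good sample is never drawn
    if w >= 1.0:
        return 1               # every sample is a good sample
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))
```

For example, with p = 0.99, w = 0.5 and s = 2 (two oriented points for a cylinder), log(0.01) / log(0.75) is about 16.01, so 17 iterations are required. A higher estimated inlier ratio shrinks K, which is why K is updated whenever a better model is found.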
Figure 3.3: Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the
geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the
geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from
a point cloud. Blue points are outliers, red points are inliers.
3.1.3.2 Geometrical analyses and constraints for qualifying good samples
In the following sections, the principles of the 3-D primitive shapes are explained.
Based on the geometrical analysis, related constraints are given to select good samples.
The normal vector of any point is computed following the approach in (Holz et al.
2011). At each point pi , the k-nearest neighbors kn of pi are determined within a radius r.
The estimation of the normal vector of pi is therefore reduced to an analysis of the eigenvectors and eigenvalues
of the covariance matrix C, as presented in Sec. 2.2.3.2.
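The eigen-analysis mentioned above can be sketched as follows, assuming the common convention that the normal is the eigenvector of the neighborhood covariance matrix C with the smallest eigenvalue. The brute-force neighbor search, the function name and the default values of k and the radius are illustrative assumptions; the approach of (Holz et al. 2011) on organized point clouds differs in how neighbors are gathered.

```python
import numpy as np

def estimate_normal(points, index, k=10, radius=0.05):
    """Unit normal at points[index] from the covariance of its neighbors.

    Returns None when fewer than three neighbors lie within the radius.
    The sign of the returned normal is arbitrary.
    """
    p = points[index]
    dists = np.linalg.norm(points - p, axis=1)
    nearest = np.argsort(dists)[:k + 1]          # the point itself plus k neighbors
    neighbors = points[nearest][dists[nearest] <= radius]
    if len(neighbors) < 3:
        return None
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)  # 3 x 3 covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    return eigvecs[:, 0]                          # direction of least variance
```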
a. Geometrical analysis for cylindrical objects
The geometrical relationships of the above parameters are shown in Fig. 3.3 (a). A cylinder
can be estimated from two points (p1 , p2 ) (two blue-squared points) and their corresponding normal vectors (n1 , n2 ) (marked by the green and yellow lines). Let γc be the
main axis of the cylinder (red line), which is estimated by:
$$\gamma_c = n_1 \times n_2 \tag{3.3}$$
To specify a centroid point I, we project the two parametric lines L1 = p1 + tn1 and
L2 = p2 + tn2 onto a plane specified by PlaneY (see Figure 3.3 (b)). The normal
vector of this plane is estimated by the cross product of the γc and n1 vectors (γc × n1 ). The
centroid point I is the intersection of L1 and L2 (see Figure 3.3 (c)). The radius Ra
is set to the distance between I and p1 in PlaneY . A result of the estimated cylinder
from a point cloud is illustrated in Figure 3.3 (f). The height of the estimated cylinder
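The two-point cylinder estimation can be sketched as below. The axis follows Eq. 3.3. For the centroid, this sketch projects both lines onto the plane orthogonal to the estimated axis and intersects them there, which is the construction used in (Schnabel et al. 2007); this projection plane is an assumption and may differ from the PlaneY of the dissertation. The function name and return convention are also illustrative.

```python
import numpy as np

def cylinder_from_two_points(p1, n1, p2, n2):
    """Axis direction, a point on the axis and radius from two oriented points.

    Returns None when the normals are parallel (the axis is then undefined).
    """
    p1, n1, p2, n2 = (np.asarray(v, dtype=float) for v in (p1, n1, p2, n2))
    axis = np.cross(n1, n2)                      # Eq. 3.3
    length = np.linalg.norm(axis)
    if length < 1e-9:
        return None
    axis = axis / length
    # Orthonormal basis (u, v) of the plane orthogonal to the axis.
    u = n1 / np.linalg.norm(n1)                  # n1 is orthogonal to the axis
    v = np.cross(axis, u)
    to_2d = lambda x: np.array([np.dot(x, u), np.dot(x, v)])
    q1, d1, q2, d2 = to_2d(p1), to_2d(n1), to_2d(p2), to_2d(n2)
    # Intersect the projected lines: q1 + t1*d1 = q2 + t2*d2.
    t = np.linalg.solve(np.column_stack([d1, -d2]), q2 - q1)
    c2d = q1 + t[0] * d1
    radius = float(np.linalg.norm(c2d - q1))
    # Lift the 2-D intersection back to 3-D at the axial position of p1.
    center = c2d[0] * u + c2d[1] * v + np.dot(p1, axis) * axis
    return axis, center, radius
```

With noisy normals the two projected lines still intersect in the plane, so a hypothesis is always produced unless the normals are parallel; its quality is then judged by the model evaluation step.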
The first one is a set of synthesized datasets. These datasets consist of cylinders, spheres
and cones. In addition, we evaluate the proposed method on real datasets. For the
cylindrical objects, the dataset is collected from a public dataset [1], which contains
300 objects belonging to 51 categories. It is named 'second cylinder'. For the spherical
objects, the dataset consists of two balls collected from four real scenes. Finally, point
cloud data of the cone objects, named 'second cone', is collected from the dataset given in
[4].
3.1.4.2 Evaluation measurements of robust estimator
To evaluate the performance of the proposed method, we use the following measurements:
- Ew denotes the relative error of the estimated inlier ratio, where wgt is the defined
inlier ratio of the ground truth and w is the inlier ratio of the estimated model.
The smaller Ew is, the better the algorithm is.
- The total distance error Sd is calculated by summing the distances from every
point pj to the estimated model Me .
Table 3.2: The average evaluation results of synthesized datasets.
The synthesized datasets were repeated 50 times for statistically representative results.
Dataset/
Measure RANSAC PROSAC
Method
Ew
23.59
28.62
(%)
Ew (%)
24.89
37.86
Sd
2361.79 2523.68
tp (ms)
495.26
242.26
’first
cone’
EA (deg.)
6.48
15.64
E r(%)
20.47
17.65
MLESAC MSAC
LOSAC
NAPSAC
GCSAC
43.13
10.92
9.95
5.15
40.74
2388.64
227.57
15.64
17.31
1536.47
536.84
0.05
2.84
2.40
23.63
3558.06
31.57
0.21
17.52
30.11
2298.03
1258.07
6.79
20.22
3168.17
52.03
0.93
7.02
112.06
57.76
3904.22
Dataset                          Method   w (%)   Sd
'second cylinder' (coffee mug)   MLESAC   9.94
                                 GCSAC    13.83
'second cylinder' (food can)     MLESAC   19.05
                                 GCSAC    21.41
'second cylinder' (food cup)     MLESAC   15.04
                                 GCSAC    18.8
'second cylinder' (soda can)     MLESAC   13.54
                                 GCSAC    20.6
- The processing time tp is measured in milliseconds (ms). The smaller tp is, the
faster the algorithm is.
- The relative error of the estimated center (only for synthesized datasets) Ed is the
Euclidean distance between the estimated center Ee and the true one Et .
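The measurements above can be computed as in the following sketch for a spherical model. The formula used for Ew, |w - wgt| / wgt × 100, is an assumption, since only its symbols are defined in this abstract; Sd is taken here over all given points, and the function names are illustrative.

```python
import numpy as np

def relative_inlier_error(w, w_gt):
    """Ew in percent, assuming Ew = |w - w_gt| / w_gt * 100."""
    return abs(w - w_gt) / w_gt * 100.0

def total_distance_error_sphere(points, center, radius):
    """Sd: sum of distances from each point to the surface of the sphere."""
    d = np.linalg.norm(np.asarray(points, dtype=float) - center, axis=1)
    return float(np.sum(np.abs(d - radius)))

def center_error(center_est, center_true):
    """Ed: Euclidean distance between estimated and true centers."""
    diff = np.asarray(center_est, dtype=float) - np.asarray(center_true, dtype=float)
    return float(np.linalg.norm(diff))
```

For a cylinder or a cone only the point-to-surface distance inside the Sd function changes; Ew and Ed are independent of the shape.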
3.1.4.3 Evaluation results of new robust estimator
The performances of each method on the synthesized datasets are reported in
Tab. 3.2. For evaluating the real datasets, the experimental results are reported in
Tab. 3.3 for the cylindrical objects. Table 3.4 reports fitting results for spherical and
cone datasets.
Table 3.4: The average evaluation results on the ’second sphere’, ’second cone’ datasets.
The real datasets were repeated 20 times for statistically representative results.
Dataset/
Method
’second
sphere’
’second
cone’
3.1.5
80.21
96.37
96.37
29.42
71.66
99.98
26.62
3.43
26.55
71.89
156.40
7.42
40.35
77.09
99.83
29.38
4.17
30.36
75.45
147.00
13.05
35.62
74.84
99.80
29.37
2.97
30.38
roughly inlier ratio evaluation and geometrical constraints of the interested shapes.
This strategy aimed to select good samples for the model estimation. The proposed
method was examined with primitive shapes such as a cylinder, sphere and cone. The
experimental datasets consisted of synthesized and real datasets. The results of the GCSAC algorithm were compared to various RANSAC-based algorithms, and they confirm
that GCSAC worked well even on point clouds with a low inlier ratio. In the future,
we will continue to validate GCSAC on other geometrical structures and evaluate the
proposed method in real scenarios for detecting multiple objects.
3.2 Fitting objects using the context and geometrical constraints
3.2.1 Finding objects using the context and geometrical constraints
Let us consider a real scenario in the common daily activities of visually impaired
people. They come to a cafeteria and then give a query "Where is a coffee cup?", as shown
in Fig. 1.
3.2.2 The proposed method of finding objects using the context and geometrical constraints
In the context of developing object-finding-aided systems for the VIPs (as shown
in Fig. 1).
3.2.2.1
3.2.3
3.2.3.1
81.01
13.51
dataset
Second dataset   MLESAC   47.56  50.78  25.89
                 GCSAC    40.68  38.29  18.38
Third dataset    MLESAC   45.32  48.48  22.75
                 GCSAC    43.06  46.9   17.14
3.2.4 Discussions
CHAPTER 4
Dataset
CVFGS 56.24
50.38
DLGS
88.24
78.52
spherical objects on two stages
Second stage
Recall
(%)
60.56
48.27
76.52
Average
Processing
time
Precision
tp (s)/scene
(%)
46.68
1.05
42.34
1.2
72.29
0.5
4.1.2
4.1.4.1 Data collection
4.1.4.2 Object detection evaluation
4.1.4.3 Evaluation parameters
4.1.4.4 Results
The average result of detecting spherical objects at the first stage of evaluation is
presented in Tab. 4.1.
4.1.5 Discussions
4.2 Deploying an aided system for visually impaired people
From the evaluations above, we can see that the DLGS method has the best
results for detecting 3-D primitive objects based on the queries of the VIPs.
Therefore, the complete system is developed according to the framework shown in
Fig. 4.20. To detect objects based on the query of a VIP on the table in
Figure 4.20: The framework for deploying the complete system to detect 3-D primitive
objects according to the queries of the VIPs.
1. Generating an RGB point cloud from the RGB image and the depth image (presented in
Sec. 2.1), using the calibration matrix and down-sampling.
2. Using the acceleration vector and constraints to detect the table plane (presented in
Sec. 2.2).
3. Separating the table plane and the objects (presented in Sec. 2.3).
4. Detecting objects on the RGB image (YOLO).
5. Locating the 3-D objects on the table plane.
6. Fitting models by GCSAC (presented in Sec. 3.1) for grasping and describing objects.
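Steps 1 and 5 can be illustrated with a minimal sketch, assuming a pinhole camera model with intrinsics fx, fy, cx, cy taken from the calibration matrix and a depth image already registered to the RGB image. The function names, the box format (x0, y0, x1, y1) in pixels and the use of NaN for missing depth are assumptions; color and down-sampling are omitted.

```python
import numpy as np

def depth_to_organized_cloud(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (meters) to an H x W x 3 organized point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_m.astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    cloud = np.dstack([x, y, z])
    cloud[z <= 0] = np.nan                           # pixels without a depth reading
    return cloud

def points_in_box(cloud, box):
    """3-D points whose pixels fall inside a 2-D detection box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    roi = cloud[y0:y1, x0:x1].reshape(-1, 3)
    return roi[~np.isnan(roi).any(axis=1)]           # drop invalid points
```

Because the cloud keeps the image layout, a bounding box returned by the detector in step 4 indexes the 3-D points directly, and the resulting subset is what a shape estimator such as GCSAC would receive in step 6.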
                  Recall (%)  Precision (%)  Recall (%)  Precision (%)  time (frame/s)
Average Results   100         99.27          97.80       90.45          0.86
4.2.3 Experimental results
The experimental setup of the system is described in Sec. 4.2.1 and Sec. 4.2.2.
It includes 8 scenes with different types of table; each scene has about 400 frames, and the
frame rate of the MS Kinect is about 10 frames per second.
4.2.3.1 Evaluation of finding 3-D objects
To evaluate the detection of the 3-D objects queried by the VIPs, we have prepared
the ground truth data according to the two phases. The first phase is to evaluate
the table plane detection; we prepared the data as in Sec. 2.2.4.2 and used the 'EM1' measurement
for evaluating the table plane detection. To evaluate the object detection, we also
prepared the ground truth data and computed T1 for evaluating 3-D cylindrical objects
(Geometrical Constraint SAmple Consensus) for estimating primitive shapes (e.g.,
cylinder, sphere, cone) from point cloud data that may contain contaminated data.
This algorithm is a RANSAC variation with improvements to the sampling step. Unlike RANSAC and MLESAC, where the samples are drawn randomly, GCSAC intentionally selects
good samples based on the proposed geometrical constraints. GCSAC
was evaluated and compared to RANSAC variations for the estimation of the primitive shapes on the synthesized and real datasets. The experimental results confirmed
that GCSAC is better than the RANSAC variations in both the quality of the estimated models and the computational time requirements. We also proposed to use the
contextual constraints, which are derived from the specific context of the environment,
to significantly improve the estimation results.
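A contextual check of this kind can be sketched as follows for a cylinder, using the cues named in the introduction: a cup normally stands on the table, and the size of the queried object is limited. The function name and both thresholds (`max_tilt_deg`, `max_radius` in meters) are hypothetical values chosen for illustration, not the ones used in the dissertation.

```python
import numpy as np

def satisfies_context(axis, radius, table_normal,
                      max_tilt_deg=10.0, max_radius=0.15):
    """True when an estimated cylinder is plausible for an object on the table.

    The cylinder axis must be nearly parallel to the table normal (the object
    stands upright) and the radius must be positive and bounded.
    """
    axis = np.asarray(axis, dtype=float)
    table_normal = np.asarray(table_normal, dtype=float)
    cos_tilt = abs(np.dot(axis, table_normal)) / (
        np.linalg.norm(axis) * np.linalg.norm(table_normal))
    tilt_deg = np.degrees(np.arccos(np.clip(cos_tilt, 0.0, 1.0)))
    return bool(tilt_deg <= max_tilt_deg and 0.0 < radius <= max_radius)
```

A model rejected by such a check can simply be discarded and the estimation repeated, so that an implausible fit is not reported to the user.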
In this dissertation, we also described a complete aided system for detecting 3-D
primitive objects based on the VIP's query. This system was demonstrated and evaluated in real environments. The application was developed utilizing the following
proposed techniques:
• The real-time table plane detection, which achieved both high accuracy and low
computational time. It is a combination of down-sampling, a region growing algorithm and the contextual constraints. A real dataset of table planes collected in
various real scenes is made publicly available.
• A combination of deep learning (YOLO network) for object detection on the RGB
image and the proposed robust estimator (GCSAC) for generating the full
object models to provide object descriptions for the VIPs. The evaluations
confirmed that YOLO achieved an acceptable accuracy and that its computational
time is the fastest, while GCSAC could estimate full models with contaminated
or occluded data. These results ensure the feasibility of the developed application.
During the experiments, we also found limitations of the proposed methods,
which are listed below:
• In the table plane detection step, some context constraints are assumed. For instance,
the table plane is flat, lying on the floor, and its height is lower than the MS Kinect's
• Short term:
– For an improvement of GCSAC to estimate primitive shapes: We need to
propose geometrical constraints for estimating many other geometrical structures. For
complex shapes, the combination of the proposed algorithm and the constraints can follow the work of (Schnabel et al. 2007)
or compose a graph of the primitive shapes as proposed by (Nieuwenhuisen
et al. 2012).
– Evaluating the developed system needs to be deployed on many VIPs with