Literature DB >> 26999131

Vision-Based SLAM System for Unmanned Aerial Vehicles.

Rodrigo Munguía^1,2, Sarquis Urzua³, Yolanda Bolea⁴, Antoni Grau⁵.

Abstract

The present paper describes a vision-based simultaneous localization and mapping system to be applied to Unmanned Aerial Vehicles (UAVs). The main contribution of this work is to propose a novel estimator relying on an Extended Kalman Filter. The estimator is designed in order to fuse the measurements obtained from: (i) an orientation sensor (AHRS); (ii) a position sensor (GPS); and (iii) a monocular camera. The estimated state consists of the full state of the vehicle: position and orientation and their first derivatives, as well as the location of the landmarks observed by the camera. The position sensor will be used only during the initialization period in order to recover the metric scale of the world. Afterwards, the estimated map of landmarks will be used to perform a fully vision-based navigation when the position sensor is not available. Experimental results obtained with simulations and real data show the benefits of the inclusion of camera measurements into the system. In this sense the estimation of the trajectory of the vehicle is considerably improved, compared with the estimates obtained using only the measurements from the position sensor, which are commonly low-rated and highly noisy.

Entities: Chemical Disease Gene Species

Keywords: localization; mapping; monocular vision; state estimation; unmanned aerial vehicle

Year: 2016 PMID： 26999131 PMCID： PMC4813947 DOI： 10.3390/s16030372

Source DB: PubMed Journal: Sensors (Basel) ISSN： 1424-8220 Impact factor: 3.576

1. Introduction

In recent years, many researchers have addressed the issue of making Unmanned Aerial Vehicles (UAVs) more autonomous. In this context, the state estimation of the six degrees of freedom (6-DoF) of a vehicle (i.e., its attitude and position) is a fundamental necessity for any application involving autonomy. Outdoors, this problem is seemingly solved with on-board Global Positioning System (GPS) and Inertial Measurements Units (IMU) with their integrated version, the Inertial Navigation Systems (INS). In fact, unknown, cluttered, and GPS-denied environments still pose a considerable challenge. While attitude estimation is well-handled with available systems [1], GPS-based position estimation has some drawbacks. Specifically GPS is not a reliable service as its availability can be limited in urban canyons and is completely unavailable in indoor environments. Moreover, even when GPS signal is available, the problem of position estimation could not be solved in different scenarios. For instance, aerial inspection of industrial plants is an application that requires performing precision manoeuvres in a complex environment. In this case, and due to the several sources of error, the position obtained with a GPS can vary with an error of several meters in just a few seconds for a static location [2]. In such a scenario, the use of GPS readings, smoothed or not, as the feedback signal of a control system can be unreliable because the control system cannot distinguish between sensor noise and actual small movements of the vehicle. Therefore, some additional sensory information (e.g., visual information) should be integrated into the system in order to improve accuracy. The aforementioned issues have motivated the move of recent works towards the use of cameras to perform visual-based navigation in periods or circumstances when the position sensor is not available, when it is partially available, or when a local navigation application requires high precision. Cameras are well adapted for embedded systems because they are cheap, lightweight, and power-saving. In this way, a combination of vision and inertial measurements is often chosen as means to estimate the vehicle attitude and position. This combination can be performed with different approaches, as in [3], where the vision measurement is provided by an external trajectometry system, directly yielding the position and orientation of the robot. In [4], an external CCD camera provides the measurements. Other on-board techniques were proposed by [5,6], where an embedded camera uses different markers to provide a good estimation of position and orientation as well. This estimation was obtained using the specific geometry of different markers and assuming that the marker’s position was known. The same idea was exploited by [7], implemented with the low-cost Wii remote visual sensor. Finally, visual sensing is sometimes provided by optical flow sensors to estimate the attitude, the position, and the velocity, as in [8]. In these different approaches, position estimation is obtained by computer vision and the attitude is either obtained by vision (see [3,6]) or by IMU sensors. In [9], even a single angular measurement could significantly improve attitude and position estimation. Another family of approaches (for instance [10,11]) relies on visual SLAM (Simultaneous Localization and Mapping) methods. In this case, the mobile robot operates in a priori unknown environment using only on-board sensors to simultaneously build a map of its surroundings and locate itself inside this map. Robot sensors have a large impact on the algorithm used in SLAM. Early SLAM approaches focused on the use of range sensors, such as sonar rings and lasers, see [12,13,14,15]. Nevertheless, some disadvantages appear when using range sensors in SLAM: correspondence or data association becomes difficult, the sensors are expensive and have a limited working range, and some of them are limited to 2D maps. For small unmanned aerial vehicles, there exist several limitations regarding the design of the platform, mobility, and payload capacity that impose considerable restrictions. Once again, cameras appear as a good option to be used in SLAM systems applied to UAVs. In this work, the authors propose the use of a monocular camera looking downwards, integrated into the aerial vehicle, in order to provide visual information of the ground. With such information, the proposed visual-based SLAM system will be using visual information, attitude, and position measurements in order to accurately estimate the full state of the vehicle as well as the position of the landmarks observed by the camera. Compared with another kind of visual configurations (e.g., stereo vision), the use of monocular vision has some advantages in terms of weight, space, power consumption, and scalability. For example, in stereo rigs, the fixed base-line between cameras can limit the operation range. On the other hand, the use of monocular vision introduces some technical challenges. First, depth information cannot be retrieved in a single frame, and hence, robust techniques to recover features depth are required. In this work, a novel method is developed following the research initiated in [16]. The proposed approach is based on a stochastic technique of triangulation to estimate features depth. In this novel research, a new difficulty appears: the metric scale of the world cannot be retrieved if monocular vision is used as the unique sensory input to the system. For example, in the experiments presented in [17], the first ten measurements are aligned with the ground-truth in order to obtain the scale of the environment. In [18], the monocular scale factor is retrieved from a feature pattern with known dimensions. On the other hand, in many real scenarios GPS signal is available, at least for some periods. For this reason, in this work it is assumed that the GPS signal is known during a short period (for some seconds) at the beginning of the trajectory. Those GPS readings will be integrated into the system in order to recover the metric scale of the world. This period of time is what authors consider the initialization period. After this period, the system can rely only on visual information to estimate the position of the aerial vehicle. The integration of GPS readings with visual information is not new in the literature, see [19]. In this sense, one of the contributions of this work is to demonstrate that the integration of very noisy GPS measurements into the system for an initial short period is enough to recover the metric scale of the world. Furthermore, the experiments demonstrate that for flight trajectories performed near the origin of the navigation reference frame, it is better to avoid the integration of such GPS measurements after the initialization period. This paper is organized as follows: Section 2 states the problem description and assumptions. Section 3 describes the proposed method in detail. Section 4 shows the experimental results, and finally in Section 5, the conclusions of this work are presented.

2. System Specification

2.1. Assumptions

The platform that the authors consider in this work is a quadrotor freely moving in any direction in , as shown in Figure 1. The quadrotor is equipped with a monocular camera, an attitude and heading reference system (AHRS) and a position sensor (GPS). It is important to remark that the proposed visual-based SLAM approach can be applied to another kind of platforms.

Figure 1

Coordinate systems: the local tangent frame is used as the navigation reference frame . AHRS: Attitude and Heading Reference System.

The proposed system is mainly intended for local autonomous vehicle navigation. Hence, the local tangent frame is used as the navigation reference frame. The initial position of the vehicle defines the origin of the navigation coordinates frame. The navigation system follows the NED (North, East, Down) convention. In this work, the magnitudes expressed in the navigation, vehicle (robot), and camera frame are denoted respectively by the superscripts , , and . All the coordinate systems are right-handed defined. In this research, the sensors that have been taken into account are described and modelled in the following subsections.

2.2. Monocular Camera

As a vision system, a standard monocular camera has been considered. In this case, a central-projection camera model is assumed. The image plane is located in front of the camera’s origin where a non-inverted image is formed. The camera frame is right-handed with the z-axis pointing to the field of view. The projection of a 3D point located at to the image plane is defined by: where u and v are the coordinates of the image point p expressed in pixel units, and: being the same 3D point , but expressed in the camera frame by . In this case, it is assumed that the intrinsic parameters of the camera are already known: (i) focal length f; (ii) principal point ; and (iii) radial lens distortion . Let be the rotation matrix that transforms the navigation frame to the camera frame . Let be a known value, and is computed from the current robot quaternion . Let be the position of the camera’s optical center position expressed in the navigation frame. Inversely, a directional vector can be computed from the image point coordinates u and v. The vector points from the camera optical center position to the 3D point location. can be expressed in the navigation frame by , where is the camera-to-navigation rotation matrix. Note that for the mapping case, defined in Equation (3), depth information is lost. The distortion caused by the camera lens is considered through the model described in [20]. Using this model (and its inverse form), undistorted pixel coordinates can be obtained from the distorted pixel , and conversely.

2.3. Attitude and Heading Reference System

An attitude and heading reference system (AHRS) is a combination of instruments capable of maintaining an accurate estimation of the vehicle attitude while it is manoeuvring. Recent manufacturing of solid-state or MEMS gyroscopes, accelerometers, magnetometers, and powerful microcontrollers as well, have made possible the development of small, low-cost, and reliable AHRS devices (e.g., [1,21,22]). For these reasons, in this work a loosely-coupled approach is considered. In this case, the information of orientation provided by the AHRS is explicitly fused into the system. Hence, the availability of high-rated (typically 50–100 Hz) attitude measurements provided by a decoupled AHRS device are assumed. Attitude measurements are modelled by: where , being , , and Euler angles denoting respectively the roll, pitch, and yaw of the vehicle. Let be a Gaussian white noise with power spectral density (PSD) .

2.4. GPS

The Global Positioning System (GPS) is a satellite-based navigation system that provides 3D position information for objects on or near the Earth’s surface. The GPS system and global navigation satellite systems have been described in detail in numerous studies (e.g., [2,23]). Several sources of error affect the accuracy of GPS position measurements. The cumulative effect of each of these error sources is called the user-equivalent range error (UERE). In [2], these errors are characterized as a combination of slowly varying biases and random noise. In [24] it is stated that the total UERE is approximately 4.0 m (σ), from which 0.4 m (σ) correspond to random noise. In Figure 2, a comparison between the trajectory obtained with GPS and the actual one, flying in a small area, is shown.

Figure 2

Example of GPS position measurements obtained for a flight performed by the aerial vehicle. Top view (left plot) and lateral view (right plot) are shown for clarity. Flight trajectory has been computed using the perspective on 4-point (P4P) method described in Section 4. Error drift in GPS readings is noticeable. NED: North, East, Down.

In this work, it is assumed that position measurements can be obtained from the GPS unit, at least at the beginning of the trajectory, and they are modelled by: where is a Gaussian white noise with PSD , and is the position of the vehicle. Commonly, position measurements are obtained from GPS devices in geodetic coordinates (latitude, longitude, and height). Therefore, in Equation (5) it is assumed that GPS position measurements have been previously transformed to their corresponding local tangent frame coordinates. It is also assumed that the offset between the GPS antenna and the vehicle frame has been taken into account in the previous transformation.

2.5. Sensor Fusion Approach

The estimator proposed in this work is designed in order to estimate the full state of the vehicle, which will contain the position and orientation of the vehicle and their first derivatives, as well as the location of the landmarks observed by the camera. Attitude estimation can be well-handled by the available systems in the vehicle, as has been mentioned in the above subsections. Typically, the output of the AHRS is directly used as a feedback to the control system for stabilizing the flying vehicle. On the other hand, the proposed method requires the camera–vehicle to know its orientation in order to estimate its position, as will be discussed later in the paper. In order to account for the uncertainties associated with the estimation provided by the AHRS, the orientation is included into the state vector (see Section 3.1) and is explicitly fused into the system (see Section 3.4). Regarding the problem of position estimation, it cannot be solved for applications that require performing precise manoeuvres, even when GPS signal is available, as it can be inferred from the example presented in Section 2.4. Therefore, some additional sensory information (e.g., monocular vision) should be integrated into the system in order to improve its accuracy. On the other hand, one of the most challenging aspects of working with monocular sensors has to do with the impossibility of directly recovering the metric scale of the world. If no additional information is used, and a single camera is used as the sole source of data to the system, the map and trajectory can only be recovered without metric information [25]. Monocular vision and GPS are not suitable to be used separately for navigation purposes in some scenarios. For this reason, the noisy data obtained from the GPS is added during the initialization period in order to incorporate metric information into the system. Hence, after an initial period of convergence, where the system is considered to be in the initialization mode, the system can operate relying only on visual information to estimate the vehicle position.

3. Method Description

3.1. Problem Description

The main goal of the proposed method is to estimate the following system state x: where represents the state of the unmanned aerial vehicle, and represents the location of the i- feature point in the environment. At the same time, is composed of: where represents the orientation of the vehicle respect to the world (navigation) frame by a unit quaternion. Let be the angular velocity of the robot expressed in the same frame of reference. Let represent the position of the vehicle (robot) expressed in the navigation frame. Let denote the linear velocity of the robot expressed in the navigation frame. The location of a feature is parametrized in its euclidean form: In the remainder of the paper, the superscript will be dropped from to avoid confusion. The architecture of the system is defined by the a classical loop of prediction-update steps in the Extended Kalman Filter (EKF) in its direct configuration. In this case, the vehicle state as well as the feature estimates are propagated by the filter, see Figure 3.

Figure 3

Block diagram showing the architecture of the system. EKF-SLAM: Extended Kalman Filter Simultaneous Localization and Mapping.

3.2. Prediction

At the same frequency of the AHRS operation, the vehicle system state takes a step forward through the following unconstrained constant-acceleration (discrete) model: In the model represented by Equation (9), a closed form solution of is used to integrate the current velocity rotation over the quaternion . In this case and: At every step, it is assumed that there is an unknown linear and angular velocity with acceleration zero-mean and known-covariance Gaussian processes and , producing an impulse of linear and angular velocity: and . It is assumed that the map features remain static (rigid scene assumption) so . The state covariance matrix takes a step forward by: where Q and the Jacobians , are defined as: Let be the derivatives of the equations of the nonlinear prediction model (Equation (9)) with respect to the robot state . Let be the derivatives of the nonlinear prediction model with respect to the unknown linear and angular velocity. The Jacobian calculation is a complicated but tractable matter of differentiation, hence, no results are presented here. Uncertainties are incorporated into the system by means of the process noise covariance matrix , through parameters and .

3.3. Visual Aid

Depth information cannot be obtained in a single measurement when bearing sensors (e.g., a projective camera) are used. To infer the depth of a feature, the sensor must observe this feature repeatedly as the sensor freely moves through its environment, estimating the angle from the feature to the sensor center. The difference between those angle measurements is the parallax angle. Actually, parallax is the key that allows the estimation of features depth. In the case of indoor sequences, a displacement of centimeters could be enough to produce parallax; on the other hand, the more distant the feature, the more the sensor has to travel to produce parallax. In monocular-based systems, the treatment of the features in the stochastic map (initialization, measurement, etc.) is an important problem to address with direct implications in the robustness of the system. In this work, a novel method is proposed in order to incorporate new features into the system. In this approach, a single hypothesis is computed for the initial depth of features by means of a stochastic technique of triangulation. The method is based on previous authors’ work [16], and new contributions have been introduced in this research.

3.3.1. Detection of Candidate Points

The proposed method states that a minimum number of features is considered to be predicted appearing in the image, otherwise new features should be added to the map. In this latter case, new points are detected in the image through a random search. For this purpose, Shi-Tomasi corner detector [26] is applied, but other detectors could be also used. These points in the image, which are not yet added to the map, are called candidate points, see Figure 4. Only image areas free of both candidate points and mapped features are considered to detect new points with the saliency operator.

Figure 4

Candidate points are detected randomly in image regions without map features or candidate points. In this frame, the black rectangle indicates the current search region. Three new candidate points have been detected (green cross-marks). Candidate points being tracked are indicated by blue cross-marks. Visual features already mapped are indicated by circles. Red marks indicate unsuccessful matches.

At the k frame, when a visual feature is detected for the first time, the following entry is stored in a table: where is the location in the image of the candidate point. Let be a variable that models a 3D semi-line, defined on one side by the vertex , corresponding to the current optical center coordinates of the camera expressed in the navigation frame, and pointing to infinity on the other side, with azimuth and elevation and , respectively, and: where is computed as indicated in Section 2.2. is a covariance matrix which models the uncertainty of in the form of , where P is the system covariance matrix and is the Jacobian matrix formed by the partial derivatives of the function with respect to . Also, a pixel window, centered in is extracted and related to the corresponding candidate point.

3.3.2. Tracking of Candidate Points

To infer the depth of a feature, the sensor must observe this feature repeatedly until a minimum parallax is reached. For this reason, it is necessary to have a method to track the location in the image of candidate points whose initial depth must be computed. For feature points that have already been included into the system state, there is enough information (e.g., depth) to define probability regions where these points must lie based on the statistical information available in the system state (see [27]). On the other hand, for candidate points, there is not yet information about depth nor statistical correlations with other elements of the system. In this sense, one alternative is to use a general-purpose decoupled tracking method that works on the images and does not need assumptions about the system dynamics (e.g., [26]). Due to the lack of information about system dynamics, these kinds of methods usually define regions of search with symmetric geometry and fixed size. This factor can add some extra computational cost. In this work, a novel technique to track candidate points is proposed. The idea is to take advantage of the knowledge about the direction of the movement of the camera in order to define regions of search defined by very thin ellipses. The ellipses are aligned with the epipolar lines where the candidate points must lie. At every subsequent frame ,..., the location of candidate points is tracked. In this case, a candidate point is predicted to be appearing inside an elliptical region centered in the point , taken from , see Figure 5. In order to optimize the speed of the search, the major axis of the ellipse is aligned with the epipolar line defined by image points and .

Figure 5

The established search region to match candidate points is constrained to ellipses aligned with the epipolar line.

The epipole is computed by projecting , which is stored in , to the current image plane by Equations (1) and (2). Because there is not depth information of the candidate point, an hypothetical depth equal to one () is chosen in order to determine a virtual 3D point which lies in the semi-ray defined by . The epipole is then computed by projecting this virtual 3D point through Equations (1) and (2). In this case, will model a 3D point located at: where is a directional unitary vector defined by: . The orientation of the ellipse is determined by , where represents the y and x coordinates, respectively, of e, and . The size of the ellipse is determined by its major and minor axis, respectively a and b. The ellipse is represented in its matrix form by: The ellipse represents a probability region where the candidate point must lie in the current frame. The proposed tracking method is intended to be used during an initial short period of time. During this period, some information will be gathered in order to compute a depth hypothesis for each candidate point, prior to its initialization as a new map feature. For this reason, there is no extra effort to obtain more robust variations in scale or a rotations descriptor. In this case, direct patch cross-correlation is applied over all the image locations . If the score of a location , determined by the best cross-correlation between the candidate patch and the n patches defined by the region of search, is higher than a specific threshold, then this pixel location is considered as the current candidate point location. Thus, is updated with . Unfortunately, because there is not yet reliable information about the depth of candidate points, it is difficult to determine an optimal and adaptive size of the ellipse. In this case, a is left as a free parameter to be chosen empirically as a function of the particularities of the application (e.g., maximum velocity of the vehicle, video frame rate). For the application presented in this work, good results were found with a value of pixels. On the other hand, it is possible to investigate the effects obtained by the variation of the relation of () which determines the proportion of the ellipse. In Figure 6, it can be noted that the time required to track a candidate point increases considerably as the ellipse tends to be a circle (left plot). On the other hand, the number of candidate points being tracked is lower when the ellipse tends to be a circle (middle plot). This is because some candidate points are lost when the ellipse is too thin, and new candidate points must be detected. Even so, the total time required for the whole tracking process of candidate points is much lower when the parameter b is chosen in order to define a very thin ellipse (right plot). For the foregoing reason the value of parameter b is recommended to be ten times lower than a.

Figure 6

Results obtained by means of the variation of the relation between ellipse axes (). (left plot): average tracking time for a candidate point; (middle plot): average number of candidate points being tracked at each frame; (right plot): average total time per frame. These results were obtained using the same methodology described in Section 4.2.

3.3.3. Estimating Candidate Points Depth

Every time that a new image location is obtained for a candidate point , an hypothesis of depth is computed by: Let be the parallax. Let indicate the displacement of the camera from the first observation position to its current position, with: Let β be the angle defined by and . Let be the normalized directional vector computed taking , from , and where γ is the angle defined by and . Let be the directional vector pointing from the current camera optical center to the feature location, computed as indicated in Section 2.2 from the current measurement , see Figure 7.

Figure 7

An hypothesis for the depth of a candidate point is computed by triangulating between the first location when the point was detected and the current location of the vehicle.

At each step, there may be a considerable variation in depth computed by triangulation, specially for low parallax. In previous authors’ work [28], it is shown that estimates are greatly improved by filtering the hypotheses of depth with a simple low-pass filter. Moreover, in this work it is demonstrated that only a few degrees of parallax is enough to reduce the uncertainty in the depth estimation. When parallax is greater than a specific threshold () a new feature is added to the system state vector x: where The system state covariance matrix P is updated by: where is the covariance matrix which models the uncertainty of the new feature , and: In Equation (22), is taken from (Equation (13)). Let be a parameter modelling the uncertainty of process of depth estimation. Let be the Jacobian matrix formed by the partial derivatives of the function with respect to .

3.3.4. Visual Updates and Map Management

The process of tracking visual features is conducted by means of an active search technique [27]. In this case, and in a different way from the tracking method described in Section 3.3.2, the search region is defined by the innovation covariance matrix , where . Assuming that for the current frame, n visual measurements are available for features , then the filter is updated with the Kalman update equations as follows: where is the current measurement vector. Let be the current prediction measurement vector. The measurement prediction model has been defined in Section 2.2. Let K be the Kalman gain. Let S be the innovation covariance matrix. Let be the Jacobian formed by the partial derivatives of the measurement prediction model with respect to the state x. Let be the partial derivatives of the equations of the measurement prediction model with respect to the robot state . Let be the partial derivatives of with respect to feature . Note that has only a nonzero value at the location (indexes) of the observed feature . Let be the measurement noise covariance matrix. Let be the variance modelling the uncertainty in visual measurements. A SLAM framework that works reliably in a local way can easily be applied to large-scale problems using different methods, such as sub-mapping, graph-based global optimization [29], or global mapping [30]. Therefore, in this work, large-scale SLAM and loop-closing are not considered. However, these problems have been intensively studied in the past. Candidate points whose tracking process is failing are pruned from the system. Furthermore, visual features with high percentage of mismatching are removed from the system state and covariance matrix. The removal process is carried out using the approach described in [31].

3.4. Attitude and Position Updates

When an attitude measurement is available, the system state is updated. Since most low-cost AHRS devices provide their output in Euler angles format, the following measurement prediction model is used: During the initialization period, position measurements are incorporated into the system using the simple measurement model : The regular Kalman update equations (Equation (23)) are used to update attitude and position whenever is required, but using the corresponding Jacobian and measurement noise covariance matrix R. The metric scale of the world cannot be retrieved using only monocular vision, as mentioned previously, and thus additional information must be added to the system. For instance, the metric scale can be retrieved if the position of some landmarks are known a priori with low uncertainty [32]. In this work, it is assumed that the GPS signal is available for an initial period at least. This period is considered as an initialization period that must allow the convergence of depth for at least some features close to their actual values. These first features added to the map during the initialization period set a metric scale in estimations. Afterwards, the system can operate relying only on visual information to estimate the location of the vehicle. For the proposed method, the initialization period will end when at least n features show a certain degree of convergence. It has been theoretically demonstrated (e.g., [33]) that knowledge about the position of three landmarks can be enough to make the metric scale observable. However, in practice, there is always the possibility that the tracking process of some features fails at any time. For this reason, in this work the initialization period will be ending when features have converged. In experiments, good results have been found with . In [34], the convergence of features is tested using the Kullback distance. However, the complexity of the sampling method proposed to evaluate this distance is quite high. In the present work, good results have been found with the following criteria: where is the sub-matrix extracted from the covariance matrix P corresponding to the feature. In this case, if the greater eigenvalue of is smaller than one percent of the distance between the camera and the feature, then it is considered that the uncertainty in this feature has been minimized enough to take it as an initial reference of metrics. It is important to note that the origin of the local reference system of navigation is established at the end of the initialization period. The reason is because at the beginning of the movement the GPS errors can wrongly dominate the estimations. Since the proposed method is not deterministic, the duration of the initialization period varies even for the same input dataset (see Figure 8). For this reason, in order to simplify the experimental methodology, a fixed initialization period was used for computing the results of comparative studies presented in Section 4. In this manner, it was easier to align (in time) the estimated trajectories in order to perform a Monte Carlo validation. The fixed initial period was empirically determined to allow a high percentage of initial convergence. In a real scenario, the duration of the initialization period should be determined by an adaptive criteria, as authors have proposed in this section.

Figure 8

Histogram of the duration of the initialization period obtained after 20 runs of the proposed method. This particular case corresponds to the flight trajectory presented in Section 4.2.

4. Experimental Results

In this section, the results obtained using synthetic data from simulations are presented as well as the results obtained from experiments with real data. The experiments were performed in order to validate the performance of the proposed method. A MATLAB® implementation was used for this purpose.

4.1. Experiments with Simulations

In simulations, the model used to implement the vehicle dynamics was taken from [35]. To model the transient behaviour of the GPS error, the approach of [36] was followed. The monocular camera was simulated using the same parameter values of the camera used in the experiments with real data. The parameter values used to emulate the AHRS were taken from [1]. Figure 9 illustrates two cases of simulation: (a) The quadrotor was commanded to take off from the ground and then to follow a circular trajectory with constant altitude. The environment is composed by 3D points, uniformly distributed over the ground, which emulate visual landmarks; (b) The quadrotor was commanded to take off from the ground and then to follow a figure-eight-like trajectory with constant altitude. The environment is composed by 3D points, randomly distributed over the ground.

Figure 9

Comparison of the estimated trajectories obtained by filtering GPS data (left plots), and the estimated maps and trajectories obtained through visual-based navigation (right plots). Two different kind of trajectories and distributions of landmarks are simulated: (upper plots) a circular trajectory, (lower plots) a figure-eight-like trajectory. The GPS signal was used only during the initialization period. The actual trajectory is shown in black. The estimates are shown in blue.

In simulations, it is assumed that the camera can detect and track visual features, avoiding the data association problem. Furthermore, the problem of the influence of the estimates on the control system was not considered. In other words, an almost perfect control over the vehicle is assumed. Figure 10 shows the average mean absolute error (MAE) in position, obtained after 20 Monte Carlo runs of simulation. The MAE was computed for three scenarios: (i) using only GPS to estimate position; (ii) using GPS together with camera along all of the trajectory in order to estimate position and map; (iii) using GPS only during the initialization period, and then performing visual-based navigation and mapping.

Figure 10

Mean absolute error (MAE) in position computed from two simulations (a and b) out of 20 Monte Carlo runs: (upper plot) simulation (a) results; (lower plot) simulation (b) results.

Figure 9 and Figure 10 clearly show the benefits of incorporating visual information into the system. It is important to note that the trajectory obtained relying only on the GPS was computed by incorporating GPS readings into the filter, and do not denote raw measurements. In Figure 10 it is interesting to note that the computed MAE values for the trajectories obtained through visual-based navigation exhibit the classical SLAM behaviour when the quadrotor returns near to its initial position. In this case, the error is minimized close to zero. On the other hand, when the GPS is used all the time, the MAE remains more constant. In this case, it is seen that even when the vehicle is close to its trusted position, there is some influence of the GPS errors that affect the estimation. This behaviour suggests that for trajectories performed near to a local frame of reference, and even when the GPS signal is available, it is better to navigate having more confidence in visual information than in GPS data. On the other hand, in the case of trajectories moving far away from its initial frame of reference, the use of absolute referenced data obtained from the GPS imposes an upper bound on the ever growing error, contrary to what is expected with a pure vision-based SLAM approach. In these experiments, it is important to note that the most relevant source of error comes from the slow-time varying bias part of the GPS. In this case, some of the effects of this bias can be tackled by the model in Equation (5) by means of increasing the measurement noise covariance matrix. On the other hand, it was found that increasing this measurement matrix too much can affect the convergence of initial features depth. A future work could be, for instance, to develop an adaptive criteria to fuse GPS data, or also to extend the method in order to explicitly estimate the slow-varying bias of the GPS.

4.2. Experiments with Real Data

A custom-built quadrotor is used to perform experiments with real data. The vehicle is equipped with an Ardupilot unit as flight controller [37], a NEO-M8N GPS unit, a radio telemetry unit 3 DR 915 Mhz, a DX201 DPS camera with wide angle lens, and a 5.8 GHz video transmitter. In experiments, the quadrotor has been manually radio-controlled (see Figure 11).

Figure 11

A park was used as flight field. Data obtained from the sensors of a radio-controlled quadrotor has been used to test the proposed method. The eight year-old first author’s son was in charge of piloting the flying vehicle.

A custom-built C++ application running on a laptop has been used to capture data from the vehicle, which were received via MAVLINK protocol [38], as well as capturing the digitalized video signal transmitted from the vehicle. The data captured from the GPS, AHRS, and frames from the camera were synchronized and stored in a dataset. The frames with a resolution of 320 × 240 pixels, in gray scale, were captured at 26 . The flights of the quadrotor were conducted in a open area of a park surrounded by trees, see Figure 11. The surface of the field is mainly flat and composed by grass and dirt, but the experimental environment also included some small structures and plants. An average of 8–9 GPS satellites were visible at the same time. In experiments, in order to have an external reference of the flight trajectory to evaluate the performance of the proposed method, four marks were placed in the floor, forming a square of known dimensions (see Figure 4). Then, a perspective on 4-point (P4P) technique [39] was applied to each frame in order to compute the relative position of the camera with respect to this known reference. It is important to note that the trajectory obtained by the above technique should not be considered as a perfect reference of ground-truth. However, this approach was very helpful to have a fully independent reference of flight for evaluation purposes. Finally, the MATLAB implementation of the proposed method has been executed offline for all the dataset in order to estimate the flight trajectory and the map of the environment. An initial period of flight was considered for initialization purposes, as explained in Section 3.4. Figure 12 shows two different instances of a flight trajectory. For this test, the GPS readings were fused into the system only at the initialization period; after that, the position of the vehicle and the map of the environment were recovered using visual information. Since the beginning of the flight (left plots), it can be clearly appreciated how the GPS readings diverge from the actual trajectory. Several features have been included into the map just after a few seconds of flight (right plots).

Figure 12

Estimated trajectory and map corresponding to two different instants of time during periods of visual-based navigation: (upper plots) real images at 8.56 s and 15.22 s of flight; (middle plots) zenital view of maps and estimated trajectories at 8.56 s and 15.22 s of flight; (lower plots) sectional view of maps and estimated trajectories at 8.56 s and 15.22 s of flight. The estimated trajectory is indicated in blue. The P4P visual reference is indicated in yellow. GPS position measurements are indicated in green. Comparing visual features with the estimated map, it can be appreciated that the physical structure of the environment is partially recovered.

Figure 13 shows a 3D perspective of the estimated map and trajectory after 30 s of flight. In this test, a good concordance between the estimated trajectory and the P4P visual reference were obtained, especially if it is compared with the GPS trajectory.

Figure 13

3D plot of the estimated map and trajectory obtained in visual-based navigation mode. Considering the trajectory obtained by the P4P visual technique as a reference, it can be clearly appreciated that GPS is unreliable to estimate position when fine manoeuvres are performed.

In order to gain more insight about the performance of the proposed method, the same three experimental variants used in simulations were computed, but in this case with real data: (i) GPS; (ii) GPS + camera; (iii) camera (GPS only at the initialization). In this comparison, all the results were obtained averaging ten executions of each method. Is important to note that those averages are computed because the method is not deterministic since the search and detection of new candidate points is conducted in a random manner over the images (Section 3.3.1). The P4P visual reference was used as ground-truth. The number of visual features being tracked at each frame can affect the performance of monocular SLAM methods. For this reason, the methods were tested by setting two different values of minimum distance (M.D.) between the visual features being tracked. In this case, the bigger the value, the lesser the number of visual features that can be tracked. Figure 14 shows the progression over time for each case. A separate plot for each coordinate (north, east, and down) is presented. Table 1 gives a numerical summary of the results obtained in this experimental comparison with real data. These results confirm the results obtained through simulations. For trajectories estimated using only GPS data, the high average MAE in position makes this approach not suitable for its use as feedback to control fine manoeuvres. In this particular case, it is easy to see that the major source of error comes from the altitude computed by the GPS (see Figure 14, lower plots). Additional sensors (e.g., a barometer) can be used to mitigate this particular error. However, the error in the horizontal plane (north–east) can be still critical for certain applications. In this sense, the benefits obtained by including visual information into the system are evident.

Figure 14

Estimated average of position expressed in coordinates for a minimum distance of 15 pixels: north (left upper plot), east (left middle plot), and down (left lower plot), and for a minimum distance of 25 pixels: north (right upper plot), east (right middle plot), and down (right lower plot). A period of 5 s of initialization was considered where the GPS was available.

Table 1

Numerical results in real data experiments; (i) M.D. stands for minimum distance between features (in pixels units); (ii) N.O.F. stands for average number of features maintained into the system state; (iii) aMAE stands for average mean absolute error (in meters).

	M.D. (15p)	M.D. (15p)	M.D. (25p)	M.D. (25p)
Method	N.O.F.	aMAE (m)	N.O.F. (s)	aMAE (m)
GPS	-	1.70 ± 0.77σ	-	1.70 ± 0.77σ
Camera + GPS	56.4 ± 10.2σ	0.21 ± 0.11σ	30.9 ± 4.9σ	0.22 ± 0.10σ
Camera	57.9 ± 9.3σ	0.20 ± 0.09σ	30.9 ± 5.6σ	0.20 ± 0.08σ

As it could be expected, the number of map features increases considerably as the minimum distance between visual points is decremented. However, it is interesting to remark that, at least for these experiments, there was no important improvement in error reduction. Regarding the use of the GPS altogether with monocular vision, a slightly better concordance was obtained between the P4P reference and trajectory estimated avoiding the GPS data (after the initialization). These results still suggest that, at least for small environments, it could be better to rely more on visual information than on GPS data after the initialization period. The feasibility to implement monocular SLAM methods in real-time has been widely studied in the past. In particular, since the work of Davison in 2003 [32], the feasibility for EKF-based methods was shown for maps composed of up to 100 features using standard hardware. Later, in [29], it was shown that filter-based methods might be beneficial if limited processing power is available. Even real-time performance has been demonstrated for relatively high computation demanding techniques as the optimization-based method proposed in [40]. In the application proposed in this work, it can be seen (Table 1) that the number of features that are maintained into the system state (even for the low M.D.) are considerably below an upper bound that should allow a real-time performance, for instance by implementing the algorithm in C or C++ languages.

5. Conclusions

In this work, a vision-based navigation and mapping system with application to unmanned aerial vehicles has been presented. The visual information is obtained with a camera integrated in the flying vehicle pointing to the ground. The proposed scheme is closely related to monocular SLAM systems where a unique camera is used to concurrently estimate a map of visual features as well as the trajectory of the camera. As a difference from the purely monocular SLAM approaches, in this work a multi-sensor scheme is followed in order to take advantage of the set of sensors commonly available in UAVs in order to overcome some technical difficulties associated with monocular SLAM systems. When a monocular camera is used, depth information cannot be retrieved in a single frame. In this work, a novel method is developed with this purpose. The proposed approach is based on a stochastic technique of triangulation to estimate features depth. Another important challenge that arises with the use of monocular vision comes with the fact that the metric scale of the environment can be only retrieved with a known factor if no additional information is incorporated into the system. In this work, the GPS readings are used during an initial short period of time in order to set the metric scale of estimation. After this period, the system operates relying uniquely on visual information to estimate the location of the vehicle. Due to the highly noisy nature of the GPS measurements, it is unreliable to work only with filtered GPS data in order to obtain an accurate estimation of position to perform fine manoeuvres. In this case, visual information is incorporated into the system in order to refine such estimations. The experimental results obtained through simulations as well as with real data suggest the following and relevant conclusions: (i) the integration into the system of very noisy GPS measurements during an initial short period is enough to recover the metric scale of the world; (ii) for flight trajectories performed near to the origin of the navigation frame of reference it is better to avoid integration of GPS measurements after the initialization period.

2 in total

1. MonoSLAM: real-time single camera SLAM.

Authors: Andrew J Davison; Ian D Reid; Nicholas D Molton; Olivier Stasse
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2007-06 Impact factor: 6.226

2. Concurrent initialization for Bearing-Only SLAM.

Authors: Rodrigo Munguía; Antoni Grau
Journal: Sensors (Basel) Date: 2010-03-01 Impact factor: 3.576

2 in total

4 in total

1. Visual Information Fusion through Bayesian Inference for Adaptive Probability-Oriented Feature Matching.

Authors: David Valiente; Luis Payá; Luis M Jiménez; Jose M Sebastián; Óscar Reinoso
Journal: Sensors (Basel) Date: 2018-06-26 Impact factor: 3.576

2. Aircraft Pose Estimation Based on Geometry Structure Features and Line Correspondences.

Authors: Xichao Teng; Qifeng Yu; Jing Luo; Gang Wang; Xiaohu Zhang
Journal: Sensors (Basel) Date: 2019-05-09 Impact factor: 3.576

3. Visual-Based SLAM Configurations for Cooperative Multi-UAV Systems with a Lead Agent: An Observability-Based Approach.

Authors: Juan-Carlos Trujillo; Rodrigo Munguia; Edmundo Guerra; Antoni Grau
Journal: Sensors (Basel) Date: 2018-12-03 Impact factor: 3.576

4. Application of Crack Identification Techniques for an Aging Concrete Bridge Inspection Using an Unmanned Aerial Vehicle.

Authors: In-Ho Kim; Haemin Jeon; Seung-Chan Baek; Won-Hwa Hong; Hyung-Jo Jung
Journal: Sensors (Basel) Date: 2018-06-08 Impact factor: 3.576

4 in total