Abstract
Since moving objects can significantly degrade the stability of visual-inertial simultaneous localization and mapping (VI-SLAM) systems, this paper proposes a method based on deep learning and spatial constraints to distinguish dynamic from static semantic objects. Specifically, a pre-trained object detector predicts bounding boxes and class probabilities on selected keyframes. A proposed random-sampling clustering algorithm, R-DBSCAN, then filters out the outliers lying within the prefiltered bounding boxes. After the centroid of the remaining feature points is computed, each semantic object's attribute is judged by a proposed strategy: if the object's dynamic probability exceeds its static probability, the object is classified as dynamic; otherwise, it is static. In addition, the drift errors of the VI-SLAM system are constrained by the pedestrian dead reckoning (PDR) velocity. A series of experiments on a self-collected dataset evaluates the accuracy of the proposed method in distinguishing the attributes of semantic objects in application scenes. The results demonstrate that the proposed method achieves the highest precision, recall, and F1 score compared with other state-of-the-art methods. Moreover, the centroid of a dynamic semantic object shifts across consecutive keyframes, reflecting its motion trajectory relative to the pose graph.