Real-time endoscope localization is important for the navigation and automation of endoscopic diagnosis and minimally invasive surgery. However, traditional localization based on optical or magnetic tracking is easily disturbed by occlusion or by electromagnetic medical instruments, and it is complicated to implement. Meanwhile, existing visual localization methods for endoscopy still ignore the transformation and correlation information contained in image pairs. In this work, a novel relative pose regression framework is proposed to perform relative pose estimation and absolute pose tracking of an endoscope from endoscopic videos. First, scene features and transformation features are extracted from the endoscopic observations and from the corresponding optical flow, respectively, by the proposed feature encoder based on gated convolution, which helps prevent vanishing gradients when the encoder is trained from scratch on endoscopic data. Furthermore, a novel correlation module based on cross attention is proposed to extract correlation features from the two input images, capturing more key features, from local to global, in endoscopic frames with a limited field of view. Moreover, a novel pose decoder with upsampling and downsampling along the channel dimension is used to extract a richer representation from the concatenated feature map for predicting the relative transformation vector. The proposed method outperforms state-of-the-art methods on datasets from nasal endoscopy and colonoscopy, with an average localization error of less than 3%. Further experiments also demonstrate the efficiency of the proposed method.
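
The abstract does not spell out how relative pose estimates become an absolute track, so the following is only a minimal sketch of the standard idea: chaining frame-to-frame transformations onto a known starting pose. It assumes each relative prediction has already been converted to a 4x4 homogeneous matrix that expresses the pose of frame k+1 in the coordinates of frame k; the function names `make_transform`, `rot_z`, and `accumulate_poses` are hypothetical and are not taken from the paper.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def accumulate_poses(initial_pose, relative_transforms):
    """Chain relative transforms into absolute poses.

    Assumed convention (not from the paper): T_rel gives the pose of
    frame k+1 in the coordinates of frame k, so
    pose_{k+1} = pose_k @ T_rel.
    """
    poses = [initial_pose]
    for T_rel in relative_transforms:
        poses.append(poses[-1] @ T_rel)
    return poses


# Toy example: move 1 unit along the local x axis, then turn 90 degrees
# about z, repeated four times. The camera traces a square.
step = make_transform(rot_z(np.pi / 2), [1.0, 0.0, 0.0])
poses = accumulate_poses(np.eye(4), [step] * 4)
positions = [p[:3, 3] for p in poses]
```

Because every absolute pose is a product of all earlier relative estimates, small per-step errors accumulate over a sequence, which is why the accuracy of the relative pose regression matters for tracking.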