EndoLoc: Relative Pose Regression Framework with Transformation and Correlation Features for In-vivo Visual Localization of Endoscope

Liangjing Shao1,2,3,*, Benshuang Chen1,2, Shuting Zhao1,2, Fuming Yang4, Xinrong Chen1,2,#

1. College of Biomedical Engineering, Fudan University 2. Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention 3. Department of Electronic Engineering, The Chinese University of Hong Kong 4. Department of Neurosurgery, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine
Accepted by IEEE TCSVT

#Corresponding author

*This work was done when Liangjing was a master's student at Fudan University

Demo Videos of Real-time Visual Localization in Nasal Endoscopy

Demo Videos of Real-time Visual Localization in Colonoscopy

Abstract

Real-time localization of the endoscope is important for navigation and automation in endoscopic diagnosis and minimally invasive surgery. However, traditional localization based on optical or magnetic tracking is easily disturbed by occlusion or electromagnetic medical instruments and is complicated to implement. Meanwhile, existing visual localization methods for endoscopy still ignore the transformation and correlation information in image pairs. In this work, a novel relative pose regression framework is proposed to perform relative pose estimation and absolute pose tracking for the endoscope based on endoscopic videos. First, scene features and transformation features are extracted from the endoscopic observations and the corresponding optical flow, respectively, by the proposed feature encoder based on gated convolution, which can prevent gradient vanishing when training the encoder from scratch on endoscopic data. Furthermore, a novel correlation module based on cross attention is proposed to extract correlation features from the two input images, which can capture more key features, from local to global, in endoscopic frames with a more limited field of view. Moreover, a novel pose decoder with upsampling and downsampling on the channel dimension is utilized to extract a richer representation from the concatenated feature map for relative transformation vector prediction. The proposed method outperforms state-of-the-art methods on datasets from nasal endoscopy and colonoscopy, with less than 3% localization error on average. Further experiments also demonstrate the efficiency of the proposed method.
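
The step from relative pose estimation to absolute pose tracking can be illustrated with a minimal sketch. It assumes each predicted relative pose has already been converted to a 4x4 homogeneous transform that expresses frame k+1 in the coordinates of frame k, so that the absolute pose is obtained by right-multiplication. The function names, the yaw-only rotation, and this convention are illustrative assumptions, not the authors' implementation.

```python
import math

# 4x4 identity: the (assumed) absolute pose of the first frame.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_transform(yaw_rad, translation):
    """Hypothetical helper: homogeneous transform from a yaw angle and a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def track_absolute_poses(initial_pose, relative_transforms):
    """Chain relative transforms into absolute poses.

    Assumed convention: T_world_{k+1} = T_world_k @ T_k_{k+1}.
    """
    poses = [initial_pose]
    for t_rel in relative_transforms:
        poses.append(mat_mul(poses[-1], t_rel))
    return poses


# Usage: four identical steps (move 1 unit along x, then turn 90 degrees)
# trace a square and bring the camera back to its starting pose.
step = make_transform(math.pi / 2, (1.0, 0.0, 0.0))
poses = track_absolute_poses(IDENTITY, [step] * 4)
```

Because every absolute pose is a product of all earlier relative estimates, errors in individual predictions accumulate along the trajectory, which is why the accuracy of the relative pose regressor matters for tracking.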
