
WO2025067655A1 - Method for obtaining camera pose by utilising relative pose constraints from adjacent and distant cameras covering the spatio-temporal space of a scene - Google Patents


Info

Publication number
WO2025067655A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose, image, sequence, estimated, scene
Legal status
Pending
Application number
PCT/EP2023/076871
Other languages
French (fr)
Inventor
Mohammad ALTILLAWI
Ziyuan Liu
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to PCT/EP2023/076871
Publication of WO2025067655A1


Classifications

    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/77 Determining position or orientation of objects or cameras using statistical methods
    • G06T2207/10024 Color image
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30244 Camera pose


Abstract

An image processing apparatus for global camera relocalisation, the image processing apparatus comprising one or more processors configured to: receive an input image representing at least a portion of a scene; input the input image to a trained weighted processing network trained using two sequences of input images and a determined relative pose for each input image, the trained weighted processing network comprising one or more parameters; compute, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, and one or more weights associated with the coordinates within the scene; and estimate a pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates.

Description

METHOD FOR OBTAINING CAMERA POSE BY UTILISING RELATIVE POSE CONSTRAINTS FROM ADJACENT AND DISTANT CAMERAS COVERING THE SPATIOTEMPORAL SPACE OF A SCENE
FIELD OF THE INVENTION
This disclosure relates to an image processing apparatus and method for determining global camera relocalisation. This disclosure further relates to an apparatus and method for training a weighted processing network to determine global camera relocalisation.
BACKGROUND
Estimating the position and orientation of a camera from a single image has been a central research topic in computer vision, as it plays a crucial role in many domains such as robotics and augmented/virtual reality. With the advent of deep learning in computer vision, recent approaches have started to leverage neural networks for data-driven camera pose estimation, referred to from here on as pose estimation.
The problem at hand is that global camera pose estimation from a single image is difficult. For this problem, the training data is composed of a set of images and their corresponding poses in a given reference frame.
Previous works proposed to solve the global localization problem by framing the solution as a regression problem whereby input 2D images are mapped to a 6 degree of freedom (DoF) pose. In this context, a Convolutional Neural Network (CNN) learns a function that encodes the input into a latent feature. One or more regression layers may then be used to map the latent image representation to the pose (position and orientation). Such methods utilize a direct supervision approach where each image is supervised by its corresponding pose.
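The regression framing just described can be sketched as follows. This is an illustration only: the stand-in encoder, its random weights, the latent dimension, and the dummy input are placeholder assumptions, not the architecture of any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, latent_dim=64):
    """Stand-in for a CNN encoder: flattens the image and applies a random
    projection to a latent vector. A real system would use a learned
    convolutional backbone; the weights here are placeholder assumptions."""
    flat = image.reshape(-1)
    W_enc = rng.standard_normal((latent_dim, flat.size)) / np.sqrt(flat.size)
    return W_enc @ flat

def regress_pose(latent, W_reg, b_reg):
    """Regression head mapping the latent feature to a 6-DoF pose vector:
    3 translation components followed by 3 axis-angle rotation components."""
    return W_reg @ latent + b_reg

image = rng.standard_normal((32, 32, 3))      # dummy RGB input
latent = encode(image)
W_reg = rng.standard_normal((6, 64)) * 0.01   # untrained head weights
b_reg = np.zeros(6)
pose = regress_pose(latent, W_reg, b_reg)     # 6-DoF pose estimate
```

Each image-pose pair in the training set would then supervise the head directly, which is the direct-supervision scheme the passage above describes.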
Aiming to improve the pose accuracy, previous works have implemented different constraints on the learning process to better encode the latent vector and thus to obtain more accurate poses. One such approach, seen in PoseNet, addressed the issue of loss imbalance between the orientation and translation losses by learning the weighting factors during training. Other works added constraints created from successive frames to improve localization. A further approach (AtLoc) pursued a method based on attention, in an attempt to force the network to focus on robust objects and features, and further utilized a sequence of images to learn temporally consistent and informative features. Finally, a separate approach (GeoPoseNet) further utilized a reprojection loss based on a 3D model of the scene.
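The learned loss weighting mentioned for PoseNet can be sketched with the homoscedastic-uncertainty formulation reported in the PoseNet literature, where two learned log-variance parameters balance the translation and orientation terms. The function below is a minimal illustration of that idea, not the patent's method:

```python
import numpy as np

def weighted_pose_loss(t_err, q_err, s_t, s_q):
    """Homoscedastic loss weighting in the style of later PoseNet variants:
    s_t and s_q are learned log-variance parameters that automatically
    balance the translation error t_err and orientation error q_err."""
    return t_err * np.exp(-s_t) + s_t + q_err * np.exp(-s_q) + s_q

# with both log-variances at zero the two terms are weighted equally
loss = weighted_pose_loss(1.0, 1.0, 0.0, 0.0)
```

During training, gradients flow into s_t and s_q as well, so the balance between the two terms is learned rather than hand-tuned.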
The aforementioned methods work to solve the global localization problem through regression with direct supervision from the pose labels. While some of these methods utilise sequences of images, they rely on successive and spatially nearby images and still estimate the pose through regression. In this direction, the method takes the image as input and maps it to a 6 DoF pose (3 for position and 3 for orientation). However, the accuracy of these methods is on the order of meters because they do not utilize geometric constraints in a manner that is aligned with the geometric problem of pose estimation. In summary, the localization errors of the above-described methods are relatively high.
SUMMARY
An image processing apparatus for global camera relocalisation, the image processing apparatus comprising one or more processors configured to: receive an input image representing at least a portion of a scene; input the input image to a trained weighted processing network trained using two sequences of input images and a determined relative pose for each input image, the trained weighted processing network comprising one or more parameters; compute, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, and one or more weights associated with the coordinates within the scene; and estimate a pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates. This provides the advantage that global camera relocalisation can be performed based on a single input image and without the need for additional labels due to the added geometric constraints.
An image processing apparatus as described above, wherein the determined relative pose is based on comparing one or more first pose associated with a first image in a first sequence of image-pose pairings to a further pose associated to a further image in the first sequence and a pose associated to an image in a second sequence of image-pose pairings. This provides the advantage that spatially and temporally distanced images are used, providing more accurate pose relocalisation.
An image processing apparatus as described above, wherein the one or more processors are configured, in estimating the pose of the input image, to employ a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene. This provides the advantage that the coordinates of both the camera and the global frames can be blended appropriately.
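A weighted alignment of the kind referred to above can be sketched with a weighted Kabsch/Umeyama solve that aligns predicted camera-frame coordinates to predicted global-frame coordinates under per-point weights. The synthetic data and uniform weights below are illustrative assumptions; this is a generic sketch, not the claimed implementation:

```python
import numpy as np

def weighted_rigid_alignment(cam_pts, glob_pts, weights):
    """Estimate rotation R and translation t with glob ≈ R @ cam + t,
    weighting each 3D-3D correspondence (weighted Kabsch/Umeyama).
    cam_pts, glob_pts: (N, 3) arrays; weights: (N,) non-negative."""
    w = weights / weights.sum()
    mu_c = (w[:, None] * cam_pts).sum(axis=0)      # weighted centroids
    mu_g = (w[:, None] * glob_pts).sum(axis=0)
    Xc = cam_pts - mu_c
    Xg = glob_pts - mu_g
    H = (w[:, None] * Xc).T @ Xg                   # weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_c
    return R, t

# synthetic check: recover a known rotation and translation exactly
rng = np.random.default_rng(1)
cam = rng.standard_normal((50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
glob = cam @ R_true.T + t_true                     # glob_i = R cam_i + t
R_est, t_est = weighted_rigid_alignment(cam, glob, np.ones(50))
```

The weights allow the network's per-coordinate confidence to down-weight unreliable predictions, which is the blending the advantage statement above refers to.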
An image processing apparatus as described above, wherein the one or more processors are configured, in estimating the pose of the input image, to employ a non-weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the estimated image coordinates in a camera frame of the scene. This provides the advantage that the coordinates of both the camera and the global frames can be combined approximately, with less computation.
An image processing apparatus as described above, wherein the one or more processors are further configured, in estimating the pose of the input image, to employ a perspective point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image. This provides the advantage that only the coordinates of the global frame need be computed and then the pose can be estimated using the 2D pixel information (image coordinates) of the input image, thus reducing processing time.
An image processing apparatus as described above, wherein the one or more processors are further configured to, in estimating the pose of the input image, use a random sample consensus technique. This provides a means to combine the global frame coordinates with the 2D image coordinates from the input image.
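A random sample consensus scheme of the kind referred to above can be sketched generically over 3D-3D correspondences: minimal samples propose a rigid model, inliers are counted against a threshold, and the best model is refit on its inlier set. The minimal solver, threshold, iteration count, and synthetic data below are illustrative assumptions rather than the claimed procedure:

```python
import numpy as np

def rigid_fit(src, dst):
    """Kabsch: least-squares rotation R and translation t with dst ≈ src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def ransac_rigid(src, dst, iters=200, thresh=0.05, seed=3):
    """RANSAC over 3-point minimal samples; refits on the largest inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = rigid_fit(src[idx], dst[idx])
        resid = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = resid < thresh
        if inliers.sum() > best.sum():
            best = inliers
    R, t = rigid_fit(src[best], dst[best])
    return R, t, best

# synthetic data: 40 exact correspondences plus 10 gross outliers
rng = np.random.default_rng(4)
src = rng.standard_normal((50, 3))
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, 1.0, -0.3])
dst = src @ R_true.T + t_true
dst[40:] += rng.uniform(5.0, 10.0, size=(10, 3))   # corrupt the last 10
R_est, t_est, inliers = ransac_rigid(src, dst)
```

The same consensus loop applies when the minimal solver is a perspective point solver over 3D-2D correspondences instead of a rigid 3D-3D solver.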
An image processing apparatus as described above, wherein the one or more processors are configured, in estimating the pose of the input image, to employ: a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene, and a perspective point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image. This provides the advantage that multiple estimated poses can be produced from the same input image and can, if desired, be combined to produce a more accurate estimated pose.
An image processing apparatus as described above, wherein the trained weighted processing network is configured to be trained by: receiving two or more sequences comprising a first sequence of one or more image-pose pairings, the two or more sequences further comprising a second sequence of one or more image-pose pairings; determining, for one or more of the image-pose pairings, one or more relative pose based on the one or more image-pose pairings of the first sequence and the one or more image-pose pairings of the second sequence; iteratively computing, using a weighted processing network, one or more estimated pose based on the determined one or more relative pose; determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation.
An image processing apparatus as described above, wherein the one or more processors are configured to train the parameters of the trained weighted processing network by: receiving a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses; receiving a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, one or more relative pose based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation based on spatially and temporally distanced images.
An image processing apparatus as described above, wherein the parameters of the weighted processing network are fine-tuned by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error. This provides the advantage of allowing the weighted processing network to be modified in a bespoke manner based on limited data, in this case location data.
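The fine-tuning loop described above, minimising a location error against known image-location pairings, can be illustrated with a toy example in which the only tuned parameter is a constant 3D correction; a real system would instead update the network weights, and the learning rate, step count, and synthetic pairings below are assumptions:

```python
import numpy as np

def fine_tune_offset(pred_locs, true_locs, lr=0.5, steps=100):
    """Toy fine-tuning: learn a constant 3-D correction added to the
    network's predicted locations, minimising the mean squared location
    error against known image-location pairings by gradient descent."""
    offset = np.zeros(3)
    for _ in range(steps):
        # gradient of mean ||pred + offset - true||^2 with respect to offset
        grad = 2.0 * (pred_locs + offset - true_locs).mean(axis=0)
        offset = offset - lr * grad
    return offset

# synthetic pairings: predictions equal the true locations minus a fixed bias
true_locs = np.array([[0.0, 0.0, 0.0],
                      [1.0, 1.0, 1.0],
                      [2.0, 0.0, 1.0]])
pred_locs = true_locs - np.array([1.0, 2.0, 3.0])
offset = fine_tune_offset(pred_locs, true_locs)
```

The fine-tuning error here is the mean squared distance between corrected predictions and the known locations, and minimising it recovers the bias exactly.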
A method of image processing for global camera relocalisation, the method comprising: receiving an input image representing at least a portion of a scene; inputting the input image to a trained weighted processing network trained using two sequences of input images and a determined relative pose for each input image, the trained weighted processing network comprising one or more parameters; computing, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, and one or more weights associated with the coordinates within the scene; and estimating a pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates. This provides the advantage that global camera relocalisation can be performed based on a single input image and without the need for additional labels due to the added geometric constraints.
A method of image processing as described above, wherein the determined relative pose is based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence. This provides the advantage that spatially and temporally distanced images are used, providing more accurate pose relocalisation.
A method of image processing as described above, wherein estimating the pose of the input image comprises: employing a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene. This provides the advantage that the coordinates of both the camera and the global frames can be blended appropriately.
A method of image processing as described above, wherein estimating the pose of the input image comprises: employing a non-weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the estimated image coordinates in a camera frame of the scene. This provides the advantage that the coordinates of both the camera and the global frames can be combined approximately, with less computation.
A method of image processing as described above, wherein estimating the pose of the input image comprises: employing a perspective point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image. This provides the advantage that only the coordinates of the global frame need be computed and then the pose can be estimated using the 2D pixel information (image coordinates) of the input image, thus reducing processing time.
A method of image processing as described above, wherein estimating the pose of the input image comprises: employing a random sample consensus (RANSAC) technique. This provides a means to combine the global frame coordinates with the 2D image coordinates from the input image.
A method of image processing as described above, wherein estimating the pose of the input image comprises employing: a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene, and a perspective point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image. This provides the advantage that multiple estimated poses can be produced from the same input image and can, if desired, be combined to produce a more accurate estimated pose.
A method of image processing as described above, wherein the trained weighted processing network is trained by: receiving two or more sequences comprising a first sequence of one or more image-pose pairings, the two or more sequences further comprising a second sequence of one or more image-pose pairings; determining, for one or more of the image-pose pairings, one or more relative pose based on the one or more image-pose pairings of the first sequence and the one or more image-pose pairings of the second sequence; iteratively computing, using a weighted processing network, one or more estimated pose based on the determined one or more relative pose; determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation.
A method of image processing as described above, wherein the trained weighted processing network is trained by: receiving the first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses; receiving the second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, the one or more relative pose based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, the one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, the error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation based on spatially and temporally distanced images.
A method of image processing as described above, the method further comprising fine-tuning the trained weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error. This provides the advantage of allowing the weighted processing network to be modified in a bespoke manner based on limited data, in this case location data.
A training apparatus for training a weighted processing network to process images to determine camera localisation, the apparatus comprising one or more processors configured to: receive two or more sequences comprising a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses, the two or more sequences further comprising a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determine, for one or more image-pose pairing in the first sequence, one or more relative pose based on one or more first pose associated with a first image in the first sequence and a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refine one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation based on spatially and temporally distanced images.
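The relative-pose supervision signal described above can be sketched with 4x4 homogeneous transforms: the relative pose between two absolute poses is T_rel = T_a^-1 T_b, and an error between estimated and ground-truth relative poses supplies a training constraint. The example below also illustrates that this signal is unchanged by a common drift applied to both estimates; the synthetic poses and the drift transform are assumptions for illustration:

```python
import numpy as np

def to_T(R, t):
    """Pack a rotation matrix and translation vector into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_a, T_b):
    """Relative transform between two absolute poses: T_rel = T_a^{-1} @ T_b."""
    return np.linalg.inv(T_a) @ T_b

def relative_pose_error(T_est_a, T_est_b, T_gt_a, T_gt_b):
    """Rotation angle (radians) and translation distance between the
    estimated and ground-truth relative poses."""
    delta = np.linalg.inv(relative_pose(T_est_a, T_est_b)) @ relative_pose(T_gt_a, T_gt_b)
    angle = np.arccos(np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
    return angle, np.linalg.norm(delta[:3, 3])

# the relative constraint is invariant to a common offset of both estimates
theta = 0.2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
T_gt_a = to_T(np.eye(3), np.array([0.0, 0.0, 0.0]))
T_gt_b = to_T(Rz, np.array([1.0, 2.0, 3.0]))
G = to_T(Rz, np.array([-4.0, 0.5, 2.0]))      # arbitrary common drift
angle, dist = relative_pose_error(G @ T_gt_a, G @ T_gt_b, T_gt_a, T_gt_b)
```

During training, such rotation and translation discrepancies between estimated and labelled relative poses would be summed into the loss that updates the network parameters.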
A training apparatus as described above, wherein the determined error comprises one or more of: a minimised error based on the difference between the two or more image-pose pairings in the first sequence, or a minimised error based on the difference between one or more image-pose pairings, and their associated absolute and estimated values, in the first sequence and one or more image-pose pairings in the second sequence. This provides the advantage that the errors between input images can be reduced, leading to a more accurate pose estimation.
A training apparatus as described above, wherein the determined error is further determined by: computing one or more error between two or more image-pose pairing in the first sequence and/or second sequence based on comparing ground truth values for each image-pose pairing to estimated relative poses for each image pose-pairing; and summing and minimising the one or more computed errors. This provides the advantage that the error can be reduced based on a known ground truth value therefore reducing the inaccuracy of the parameters of the weighted processing network.
A training apparatus as described above, wherein the one or more image-pose pairings of the first sequence are spatially and temporally distanced from the one or more image-pose pairings of the second sequence. This provides the advantage that spatially and temporally distanced images are used, providing more accurate pose relocalisation.
A training apparatus as described above, wherein the one or more processors are configured to determine the one or more relative pose of an image-pose pairing by comparing the pose of that image-pose pairing to the further pose of one or more further image-pose pairing in the first sequence and/or the second sequence. This provides the advantage of creating a geometric constraint based on the input image sequence data.
A training apparatus as described above, wherein the one or more processors are further configured to fine-tune the parameters of the weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error. This provides the advantage of allowing the weighted processing network to be modified in a bespoke manner based on limited data, in this case location data.
A method of training a weighted processing network to process images to determine camera localisation, the method comprising: receiving two or more sequences comprising a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses, the two or more sequences further comprising a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, one or more relative pose based on one or more first pose associated with a first image in the first sequence and a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error. This provides the advantage that the least available labels (constraints) are used for creating additional signals for training, such that only weak supervision is needed to train a weighted processing network for accurate pose estimation based on spatially and temporally distanced images.
A method of training a weighted processing network as described above, wherein the determined error is comprised of one or more of: a minimised error based on the difference between the two or more image-pose pairings in the first sequence, or a minimised error based on the difference between one or more image-pose pairings, and their associated absolute and estimated values, in the first sequence and one or more image-pose pairings in the second sequence. This provides the advantage that the errors between input images can be reduced leading to a more accurate pose estimation.
A method of training a weighted processing network as described above, wherein the determined error is further determined by: computing one or more error between two or more image-pose pairings in the first sequence and/or second sequence based on comparing ground truth values for each image-pose pairing to estimated relative poses for each image-pose pairing; and summing and minimising the one or more computed errors. This provides the advantage that the error can be reduced based on a known ground truth value, therefore reducing the inaccuracy of the parameters of the weighted processing network.
A method of training a weighted processing network as described above, wherein the one or more image-pose pairings of the first sequence are spatially and temporally distant from the one or more image-pose pairings of the second sequence. This provides the advantage that spatially and temporally distanced images are used, providing more accurate pose relocalisation.
A method of training a weighted processing network as described above, wherein the one or more processors are configured to determine the one or more relative pose of an image-pose pairing by comparing the pose of that image-pose pairing to the further pose of one or more further image-pose pairing in the first sequence and/or the second sequence. This provides the advantage of creating a geometric constraint based on the input image sequence data.
A method of training a weighted processing network as described above, wherein the method further comprises fine-tuning the parameters of the weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error. This provides the advantage of allowing the weighted processing network to be modified in a bespoke manner based on limited data, in this case location data. BRIEF DESCRIPTION OF THE FIGURES
The embodiments of the present disclosure will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 illustrates an example of a training apparatus of this disclosure;
Figure 2 illustrates a general example of the fine-tuning process of this disclosure;
Figure 3 illustrates an example of the image processing apparatus of this disclosure;
Figure 4 illustrates an example of pose estimation by weighted rigid alignment and perspective- n-point method in a RANSAC scheme;
Figure 5 illustrates an example of pose estimation by weighted rigid alignment;
Figure 6 illustrates an example of pose estimation using a non-weighted rigid alignment module; and
Figure 7 illustrates an example of pose estimation using only a perspective-n-point module in a RANSAC framework.
DETAILED DESCRIPTION OF THE INVENTION
In this disclosure, a new apparatus and method for global camera relocalisation from a single image based on deep learning is presented. The proposed method utilizes the fewest available pose labels for the problem at hand to create additional signals for training. The presented apparatus runs in real time, stores only the weights of the network, and does not rely on a reference 3D model to localize. That is, the present disclosure provides a solution to the question: what is the rotation and position of the camera relative to a global frame of the scene, given an input image?
As an overview, the present disclosure employs a network of relative geometric constraints that are obtained from the available images and their corresponding poses for training. These geometric constraints are computed as relative poses between the available absolute poses of the training set of images. These constraints may be obtained from images that are close in space and time as well as images that are far from each other. These constraints may be created randomly and applied simultaneously at each training iteration. These constraints may further be created from the available set of images and pose labels and may not require any extra effort to collect additional data.
This set of relative geometric constraints may be utilised to train a deep neural network to learn the geometry of the scene. In one case, this disclosure describes a network that is trained to take a single input, which may be a single input image, and output one or more quantities, preferably three quantities. These three quantities are a set of weighting factors and two sets of 3D point clouds: one in the image coordinate frame and the other in the global reference frame. A weighted rigid alignment module may take these three output quantities as inputs and utilise them to estimate a pose. During training, this pose is adjusted to match the ground-truth pose. This is conducted by supervising the deep neural network with the pose label and the created relative geometric constraints. These constraints are created from the existing labels without extra effort.
These constraints support the training by assisting the neural network to encode these geometric constraints in its weights. As a consequence, at inference, the deep neural network estimates a pose that is subject to these constraints. This setup improves the pose accuracy. Furthermore, it precludes the need to include explicit 3D coordinates (for example from LIDAR or depth measurement units). Experiments further show that this set of constraints can assist the network to improve both position and orientation localization when fine-tuned with position labels only.
At inference, the network may utilise its saved weights to estimate the pose from the single input image. The deep neural network estimates the set of weighting factors and two sets of 3D coordinates, one in a global frame and the other in the camera frame. Consequently, the method disclosed herein may compute the pose by using either a weighted rigid alignment module, a non-weighted rigid alignment module, or a perspective-n-point module with a Random Sample Consensus (RANSAC) scheme.
The apparatus and method used for inference described herein utilise a single image to localize, represent the scene as a set of deep network parameters, save only the weights of the network (as opposed to requiring 3D SFM models), and run in real time.
The apparatus and methods of this disclosure will now be described in relation to the Figures. The training apparatus 100 and method of this disclosure will first be described.
The general approach described in this disclosure is that of a localisation pipeline 202 (which may comprise a weighted processing network 104 described herein) that takes one or more images as an input and outputs a position and orientation (pose) label after computation by the pipeline. For training purposes, such output positions and orientations may be compared to known pose information for an input image, and the error between the two may be minimised in order to adapt the parameters of the pipeline such that the localisation pipeline 202 outputs pose information with greater accuracy. To achieve this, the method of this disclosure utilises a novel weak supervision (i.e., not user supervision) for training the parameters of a localisation pipeline 202 formed as a trained weighted processing network 104. The weak supervision centres on relative geometric constraints, e.g., relative pose labels between images.
The proposed method may therefore build a set of relative geometric constraints that are obtained from adjacent images as well as distant images, distant images being images that are spatially and/or temporally distanced from one another in comparison to the adjacent images in an image sequence. An example of a distant image is an image taken from a camera at another location in the scene, for example from a different corner of a room, potentially at a different time of day. These constraints are used to assist the training of a deep neural network 104 to better learn the geometry of the scene for the benefit of global pose estimation from a given image. We define a relative geometric constraint as a relative pose between two poses 103.
One aspect of this disclosure is the provision of a training apparatus 100 for training a weighted processing network 104 to process images to determine camera localisation. The apparatus comprising one or more processors configured to perform the following functions.
The apparatus and thus the one or more processors are configured to receive two or more sequences comprised of sequences of image-pose pairings 201. As an example, two sequences of image-pose pairings 201 are described herein, however it should be understood that additional sequences of image-pose pairings 201 may be received and processed by the method described below. As such, the one or more processors may be configured to receive a first sequence 101 , 201 of image-pose pairings 201 comprised of a first set of images and their associated first set of absolute poses 103. The first sequence 101 , 201 may be comprised of one or more image-pose pairings 201 including but not limited to one or more images and an associated set of absolute poses 103 (position and orientation of the camera/ virtual camera within the scene) that produced the image. Each image-pose pairing 201 of the first sequence 101 , 201 may include a single image and single associated pose. The absolute poses 103 may refer to poses that are known for the input image.
The one or more processors may further be configured to receive a second sequence 101a, 201 of image-pose pairings comprised of a second set of images and their associated second set of absolute poses 103. The second sequence 101a, 201 of image-pose pairings 201 may have the same format and composition as the first sequence 101, 201 of image-pose pairings; for example, each pairing may be formed of an image and the absolute pose associated with that image.
Described another way, for a given scene with M collected images and M corresponding absolute poses 103 (relative to a global frame of the scene), the apparatus and method may be configured to receive/select K image sequences, each with N images and their associated image poses. For each input image i at training time, a first set of N images may be selected that are close in space and time to the input image. The selection process may be based on the timestamp of the collection of the images. The selection process may not be subject to any order. In the same manner, K-1 other sets (second or further sets) of N close images (each with the same number of images as the first set) may also be randomly selected. These K sequences of images do not share similar images. Each of these K sets may span a random spot in the scene. Images within the same sequence may or may not share the same view. Images in the same sequence may not follow a chronological (temporal) order. As such, the two or more sequences of image-pose pairings 201 may be received or selected as above.
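The random selection of K disjoint sequences of N temporally close images described above can be sketched as follows. This is a minimal illustration assuming timestamped images; the function and parameter names (sample_sequences, K, N) are illustrative rather than taken from the disclosure.

```python
import random

def sample_sequences(timestamps, K=4, N=8, seed=None):
    """Sample K disjoint sequences of N temporally close frame indices.

    `timestamps` holds the capture time of each of the M collected images.
    Each sequence is anchored at a randomly chosen frame (a random spot in
    the scene) and takes the N unused frames closest in time to it, so the
    K sequences do not share images.
    """
    rng = random.Random(seed)
    M = len(timestamps)
    used = set()
    sequences = []
    while len(sequences) < K:
        anchor = rng.randrange(M)
        # indices sorted by temporal distance to the anchor frame
        nearest = sorted(range(M), key=lambda i: abs(timestamps[i] - timestamps[anchor]))
        seq = [i for i in nearest if i not in used][:N]
        if len(seq) == N:
            used.update(seq)
            sequences.append(seq)
    return sequences
```

Note that the images within each returned sequence are ordered by distance to the anchor, not chronologically, consistent with the disclosure's remark that sequences need not follow a temporal order.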
Once at least two sequences have been received, the one or more processors of the apparatus of this disclosure may be configured to iteratively determine, for one or more image-pose pairings 201 in the first sequence 101, 201, one or more relative poses based on one or more first poses associated with a first image in the first sequence 101, 201, a further pose associated with a further image in the first sequence 101, 201 and a pose associated with an image in the second sequence 101a, 201. In other words, the apparatus and method of this disclosure are configured to calculate a relative pose between two absolute poses 103 associated with two input images that are each part of an image-pose pairing. This can be seen in Figure 1, wherein each absolute camera pose 101 from the first sequence 101, 201 of image-pose pairings may be iteratively compared to one or more other absolute camera poses from an image-pose pairing within the first sequence 101, 201 and/or one or more other absolute camera poses from the one or more image-pose pairings from the second sequence 101a, 201 of image-pose pairings. In this way, each absolute image pose from the first sequence 101, 201 may be compared to other absolute poses 103 in the same sequence, which may be spatially and temporally close in position, or may be compared to the other absolute poses 103 from the second sequence 101a, 201, which may be spatially and temporally distant from the absolute pose of that iteration. Iteratively, the one or more processors may therefore compute a relative pose between two poses in the same sequence with indices i and j as:
$$T_{i,j} = T_i^{-1}\, T_j$$
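A minimal sketch of this relative-pose computation, assuming poses are given as 4x4 homogeneous transforms built from a 3x3 rotation R and a 3x1 translation t (the helper names are illustrative):

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 homogeneous pose T from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t, dtype=float).ravel()
    return T

def relative_pose(T_i, T_j):
    """Relative pose between two absolute poses: T_ij = inv(T_i) @ T_j."""
    return np.linalg.inv(T_i) @ T_j
```

By construction, composing the first absolute pose with the relative pose recovers the second absolute pose: T_i @ T_ij = T_j.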
A pose T may be composed of a [3x3] rotation matrix R and a [3x1] translation vector t. For the definitions below, we refer to an estimated quantity with a hat on top of it (e.g., $\hat{t}$), while the same character without a hat is the ground-truth (true) value.
The error on the relative pose, considering both the rotation matrix and the translation vector, may be defined as follows:

$$e\left(\hat{T}_{i,j},\, T_{i,j}\right) = e_t\left(\hat{t}_{i,j},\, t_{i,j}\right) + e_r\left(\hat{r}_{i,j},\, r_{i,j}\right)$$
Given the above, one definition of a translation error can be defined as:

$$e_t = \left\|\hat{t}_{i,j} - t_{i,j}\right\|_2$$
Where $\left\|\cdot\right\|_2$ denotes the $\ell_2$ norm and $r$ is the rotation. A rotation error can be defined as:

$$e_r = \left\|\hat{r}_{i,j} - r_{i,j}\right\|_2$$
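The translation and rotation error terms above can be sketched as follows, assuming the rotation is expressed in some vector parameterisation (e.g., a quaternion or rotation vector; the disclosure leaves the exact parameterisation open):

```python
import numpy as np

def translation_error(t_est, t_gt):
    """e_t = || t_hat - t ||_2 between estimated and ground-truth translation."""
    return float(np.linalg.norm(np.asarray(t_est, dtype=float) - np.asarray(t_gt, dtype=float)))

def rotation_error(r_est, r_gt):
    """e_r = || r_hat - r ||_2 between estimated and ground-truth rotation,
    where r is a vector rotation parameterisation (assumed here)."""
    return float(np.linalg.norm(np.asarray(r_est, dtype=float) - np.asarray(r_gt, dtype=float)))

def pose_error(t_est, r_est, t_gt, r_gt):
    """Combined relative-pose error e = e_t + e_r."""
    return translation_error(t_est, t_gt) + rotation_error(r_est, r_gt)
```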
In iteratively determining the one or more relative poses described above, the one or more processors and method of this disclosure may also create relative constraints (poses) between each image and the next image in the sequence (the image at the next index, i+1), so for a sequence of N images, N-1 relative constraints (poses) are created. In addition, iteratively determining the one or more relative poses may also involve applying the same process for the remaining K-1 sequences, e.g., the second or further sequences of image-pose pairings. In mathematical notation, we define the error signal between the relative ground-truth constraints and the estimated relative poses as:
$$\mathcal{L}_{seq} = \sum_{k=1}^{K} \sum_{i=1}^{N-1} \left( \left\| \hat{t}_{i,i+1}^{\,k} - t_{i,i+1}^{k} \right\|_2 + \left\| \hat{r}_{i,i+1}^{\,k} - r_{i,i+1}^{k} \right\|_2 \right)$$

Where the term $t_{i,i+1}^{k}$ means the translation vector of the relative GT pose between consecutive cameras i and i+1 of sequence k. The same notation applies to the other terms in the equation. The terms with hats are estimated quantities as discussed above.
Furthermore, as part of the iterative process for determining the one or more relative poses discussed above, the one or more processors may be configured to compute, for an image at index i in any sequence, e.g., the first, second and/or further sequence, a relative pose to the images at index i in the other sequences. This may be achieved as all of the sequences may have the same number of images as part of the image-pose pairings. In this way, as shown in Figure 1, images from the first sequence 101, 201 can be compared to images from the second sequence 101a, 201 in order to compare absolute (ground-truth) poses and thus obtain one or more relative poses as described above. When the image-pose pairings are compared between the K sequences, the errors across the sequences may be defined using the following expression:
$$\mathcal{L}_{cross} = \sum_{k=1}^{K-1} \sum_{i=1}^{N} \left( \left\| \hat{t}_{i}^{\,k,k+1} - t_{i}^{k,k+1} \right\|_2 + \left\| \hat{r}_{i}^{\,k,k+1} - r_{i}^{k,k+1} \right\|_2 \right)$$

Where the term $t_{i}^{k,k+1}$ denotes the translation vector of the relative GT pose between two distant cameras with index i, one in sequence k, and the other in sequence k + 1. The same notation applies to the other terms in the equation.
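Taken together, the within-sequence and cross-sequence constraints can be sketched as a single loss over K sequences of N poses. Poses are simplified here to (translation, rotation-vector) tuples and relative quantities to element-wise differences; a full implementation would compose SE(3) transforms as in the relative-pose definition above.

```python
import numpy as np

def relative_constraint_loss(est, gt):
    """Total relative-pose loss over K sequences of N (t, r) pose tuples.

    Combines (i) consecutive-frame constraints within each sequence and
    (ii) same-index constraints across adjacent sequences. `est` and `gt`
    are nested lists of (translation, rotation) pairs; names and the
    simplified relative-pose computation are illustrative.
    """
    def rel(a, b):
        # simplified relative pose: difference of translations and rotations
        return (np.asarray(b[0], float) - np.asarray(a[0], float),
                np.asarray(b[1], float) - np.asarray(a[1], float))

    def err(pair_est, pair_gt):
        (dt_h, dr_h), (dt, dr) = pair_est, pair_gt
        return np.linalg.norm(dt_h - dt) + np.linalg.norm(dr_h - dr)

    K, N = len(gt), len(gt[0])
    loss = 0.0
    for k in range(K):                 # within-sequence: frames i and i+1
        for i in range(N - 1):
            loss += err(rel(est[k][i], est[k][i + 1]),
                        rel(gt[k][i], gt[k][i + 1]))
    for k in range(K - 1):             # across sequences: index i in k and k+1
        for i in range(N):
            loss += err(rel(est[k][i], est[k + 1][i]),
                        rel(gt[k][i], gt[k + 1][i]))
    return loss
```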
To summarise the above, at each iteration a random image I from the set of M images is chosen. Consequently, the sequences of images may be created/received by the one or more processors. Using this set of sequences, in the above example comprising a first and a second sequence 101a, 201, we apply our method to create the set of relative poses from the available absolute (ground-truth) poses of these images.
Once the two or more sequences have been received and the one or more relative pose has been determined by the one or more processors of the training apparatus 100 of this disclosure, the one or more processors may then be configured to pass these to a weighted processing network 104. In doing so the one or more processors may be configured to refine one or more parameters of a weighted processing network 104. This may be achieved by the one or more processors by iteratively: computing, using the weighted processing network 104, one or more estimated pose based on one or more of the image-pose pairings from the first sequence 101 , 201 , one or more image-pose pairings from the second sequence 101a, 201 and the determined one or more relative pose.
In other words, the apparatus may be configured to use the created/received sequences of images and the determined relative poses by passing them to a weighted processing network 104 to estimate, for each image (having an associated absolute pose), an estimated pose. From these estimated poses, the one or more processors may be configured to create a set of estimated relative poses. This may be achieved by the one or more processors computing the estimated relative poses in the same way as the relative poses discussed above, but taking the estimated pose information as input in place of the absolute poses 103 for each image. This process may be performed iteratively for each of the one or more images (that are part of the image-pose pairings) in each of the two or more sequences.
Once the estimated relative poses have been computed by the weighted processing network 104, based on the above factors, implemented on the one or more processors of the apparatus of this disclosure, said one or more processors may be configured to determine, for each computed estimated pose, an error between the one or more estimated poses and the one or more relative poses; and update the one or more parameters of the weighted processing network 104 based on the determined error. The determining of an error between the one or more estimated poses and the determined relative poses (geometric constraints) for a randomly selected image from one of the two or more sequences may be repeated in an iterative manner such that the parameters of the weighted processing network 104 are updated iteratively and refined. The error between poses may in some cases be based on the following exemplary expression:

$$e\left(\hat{T},\, T\right) = \left\|\hat{t} - t\right\|_2 + \left\|\hat{r} - r\right\|_2$$

The determined error signal between the one or more estimated poses and the relative poses may be computed by the one or more processors. The determined error may be comprised of one or more of: a minimised error based on the difference between the two or more image-pose pairings in the first sequence 101, 201, and/or a minimised error based on the difference between one or more image-pose pairings, and their associated absolute and estimated values, in the first sequence 101, 201 and one or more image-pose pairings in the second sequence 101a, 201. Furthermore, a further error signal may be computed by the one or more processors as part of the iterative refinement of the parameters of the weighted processing network 104. This additional error signal may be the error between the estimated absolute pose of the input image I and its absolute ground-truth pose. This additional error may also be used in updating the parameters of the weighted processing network 104. These error signals are minimised in each iteration. In other words, the determined error may further be determined by computing one or more errors between two or more image-pose pairings in the first sequence 101, 201 and/or second sequence 101a, 201 based on comparing ground-truth values for each image-pose pairing to estimated relative poses for each image-pose pairing, and summing and minimising the one or more computed errors. By the end of the training (once the loss is minimised), the network 104 will have learned the geometric constraints and encoded them in its parameters.
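The iterative refinement of the network parameters against the determined error can be sketched as plain gradient descent. The callback loss_and_grad stands in for a forward pass plus backpropagation through the weighted processing network; all names and defaults are illustrative.

```python
def refine_parameters(params, loss_and_grad, lr=0.1, iters=200):
    """Iteratively update parameters to minimise the determined error.

    `loss_and_grad(params)` returns (loss, gradient); each iteration takes
    one gradient-descent step, mirroring the disclosure's loop of computing
    an error and updating the network parameters based on it.
    """
    for _ in range(iters):
        _, grad = loss_and_grad(params)
        params = params - lr * grad
    return params
```

In practice this update would be performed by a deep-learning framework's optimiser; the sketch only shows the shape of the loop.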
Figure 2 illustrates a general example process that may be implemented by the one or more processors of the apparatus of this disclosure, in which the one or more processors are configured to fine-tune the trained weighted processing network 104 that forms part of the localisation pipeline 302. In particular, the one or more processors may be further configured to fine-tune the parameters of the weighted processing network 104. This may be achieved by inputting to the training apparatus 100, in particular the trained weighted processing network 104, one or more image-location pairings. Each such image-location pairing 301 may be comprised of an image and associated location information, for example GPS data recorded when the image was taken, and may be referred to hereafter in relation to Figure 2 as an image 301. Such data may be utilised by the one or more processors in computing, using the weighted processing network 104, an estimated pose comprising location information. The location information may be included as part of the image-location pairings. A fine-tuning error between the estimated pose 307 and known image-location pairings, in particular the known location information, may then be determined by the one or more processors. At the end of each iteration, the one or more parameters of the weighted processing network 104 may be updated based on minimisation of the fine-tuning error that has been determined. These steps may be performed iteratively in a similar manner to the above-described training, but may be performed at a lower learning rate, for example, making smaller adjustments to the parameters of the weighted processing network 104 than are made during the training stage implemented by the one or more processors. The fine-tuning that is implemented may be performed over a small number of epochs, for example 10 to 15.
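A sketch of the position-only fine-tuning error, assuming the known location information is a set of 3D positions (e.g., from GPS) and orientation is left unsupervised, as in the disclosure's fine-tuning with location labels only:

```python
import numpy as np

def position_only_loss(est_positions, known_positions):
    """Fine-tuning error using location labels only: the mean l2 distance
    between estimated and known 3D positions. The choice of mean (rather
    than sum) is an assumption for this sketch."""
    est = np.asarray(est_positions, dtype=float)
    gt = np.asarray(known_positions, dtype=float)
    return float(np.mean(np.linalg.norm(est - gt, axis=1)))
```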
In addition, the localisation pipeline that may comprise the weighted processing network 104 trained above may further produce, as outputs from the weighted processing network 104, estimated 3D coordinates in a global frame, estimated 3D coordinates in a camera frame and/or additional estimated weights associated with the 3D coordinate matrices of the coordinates in the global frame and the camera frame. These will be described below in relation to the inference device (image processing apparatus 400); however, it should be understood that these outputs from the weighted processing network 104 may form part of the training process, in that they may be used to produce a final predicted (estimated) pose 107, 207 that is compared to the absolute pose of an input image and thus used to train the weighted processing network to achieve a close relationship between these two quantities.
The above training method, implemented on the apparatus and more particularly the one or more processors of the present disclosure, provides the ability to train a weighted processing network 104 in a weakly supervised manner by using relative poses between images as geometric constraints and no further additional labels that would require user input. In this way, the weighted processing network 104 of the present disclosure may utilize relative poses between adjacent and distant cameras to optimize a localization pipeline. These inputs for training the network 104 may be obtained from only the minimum labels available to a localization pipeline (e.g., absolute poses 103). There is thus no need for 3D ground-truth data to train the network 104 as described above. The above-described steps may be implemented by the one or more processors of the training apparatus 100 of this disclosure or may be performed as a separate method.
There is also described herein an image processing apparatus 400 for global camera relocalisation. An example of such image processing apparatus 400 can be shown in Figure 3. As illustrated by Figure 3 this disclosure includes an image processing method and apparatus, wherein the image processing apparatus 400 is comprised of one or more processors that are configured to perform the following functions as will now be described in relation to Figures 3 to 7.
Firstly, the one or more processors may be configured to receive an input image 401 representing at least a portion of a scene. This image may be an image that has been received from a camera or, alternatively, from another source in electronic form. The one or more processors are then configured to input the input image 401 into a trained weighted processing network 402. The trained weighted processing network 402 may be a weighted processing network 402 having the same properties discussed above in relation to the training apparatus 100 and may have been trained in the same manner. In particular, the trained weighted processing network may be trained using two sequences of input images and a determined relative pose for each input image, as discussed above in relation to the training apparatus 100. The trained weighted processing network may also comprise one or more parameters. The determined relative pose may be based on comparing one or more first poses associated with a first image in a first sequence 101, 201 of input image-pose pairings, as described in relation to the training apparatus 100, to a further pose associated with a further image in the first sequence 101, 201 and a pose associated with an image in a second sequence 101a, 201 of input image-pose pairings, as described in relation to the training apparatus 100.
Although the image processing apparatus 400 described herein may also be configured to perform the function of the training apparatus 100 described above this is not always the case and indeed is not necessary as in some cases the image processing apparatus 400 may be configured to utilise an already trained weighted processing network trained using the process described above. In other words, such training may be performed by a separate apparatus to the image processing apparatus 400 such as the training apparatus 100 described above.
Returning to Figure 3, the one or more processors may be configured to compute, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, and/or one or more weights associated with the coordinates within the scene. This may be thought of as an inference step of the trained weighted processing network implemented on the image processing apparatus 400, and the trained weighted processing network may be thought of as a deep neural network that is configured to estimate the three outputs, e.g., the estimated 3D coordinates in a global frame, the estimated 3D coordinates in a camera frame and/or the additional estimated weights associated with the 3D coordinate matrices of the coordinates in the global frame and the camera frame.
Once these outputs are produced by the trained weighted processing network implemented on the image processing apparatus 400 by the one or more processors, they may be used to estimate a pose for the input image, for which there was previously no known pose information. The one or more processors are configured to estimate the pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates. There are a number of ways in which the image processing apparatus 400 of this disclosure may be configured to estimate a pose for the input image using these variables and these will now be described in relation to Figures 4 to 7. Figure 4 demonstrates all potential methods for estimating a pose and it is possible to estimate a pose of the input image using multiple of these methods and compare the estimated poses in order to form a composite estimated pose.
Briefly, the series of processes by which the pose of the input image may be estimated is as follows. One technique may be pose estimation by rigid alignment. In this case, a non-learned and parameter-free rigid alignment module may be used to estimate the pose in two ways. Firstly, by weighted rigid alignment, as shown in Figure 4, wherein a module (implemented on the one or more processors) takes the 3D coordinates in both the camera frame and the global frame in addition to their weighting factors (the weights) and estimates a pose. Secondly, using non-weighted rigid alignment, as shown in Figure 6, in which a module (implemented on the one or more processors) takes only the 3D coordinates in both frames and estimates a pose.
An alternative approach that may be implemented by the one or more processors of the image processing apparatus 400 may be pose estimation by Perspective-n-point (PnP) as shown in Figure 7 in which a module (implemented on the one or more processors) takes only the 3D coordinates in the global frame and the 2D pixels (readily available from the input image) and estimates a pose through a perspective-n-point module in a random sample consensus (RANSAC) scheme.
The data quantities present in the below pose estimation may have the following characteristics. The input image may be either a 1-channel (grayscale) or 3-channel (RGB) input. The input image may have the form 1xHxW (for grayscale) or 3xHxW (for RGB), where H and W define the height and width of the image respectively. The 3D coordinates in the global frame may be a 3-channel output of size 3xHxW, and the 3D coordinates in the camera frame may be a 3-channel output of size 3xHxW. The weights may be a 1-channel output of size 1xHxW, and the pose may be a six-degrees-of-freedom (6 DoF) quantity. It can be represented in a vector of size 6x1 (3 for position and 3 for orientation) or a vector of size 7x1 (3 for position and 4 for orientation).
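These shapes can be illustrated as follows (the image size H x W is an arbitrary example, not specified by the disclosure):

```python
import numpy as np

H, W = 240, 320  # example image height and width (assumed)

image_rgb = np.zeros((3, H, W))      # 3xHxW RGB input (1xHxW for grayscale)
coords_global = np.zeros((3, H, W))  # 3D coordinates in the global frame
coords_camera = np.zeros((3, H, W))  # 3D coordinates in the camera frame
weights = np.zeros((1, H, W))        # per-pixel weighting factors
pose_6dof = np.zeros((6, 1))         # 3 position + 3 orientation
pose_7dof = np.zeros((7, 1))         # 3 position + 4 orientation
```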
In Figure 4, the one or more processors are configured to implement the trained weighted processing network (deep neural network) 402 to obtain the three quantities discussed above. The weighted rigid alignment module 403 takes the 3D coordinates in both frames (camera and global) in addition to their weighting factors and estimates a pose 407. In other words, the one or more processors may be configured, in estimating the pose 407 of the input image, to employ weighted rigid alignment 403 by estimating a pose 407 of the input image 401 using one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene.
Similarly, and in some cases in combination, the perspective-n-point module 404, 704 in a RANSAC scheme may take, as shown in Figures 4 and 7, the 3D coordinates in the global frame and the 2D pixels 405, 705 (readily available from the input image) and estimate a pose 407, 707. In other words, the one or more processors are further configured, in estimating the pose 407, 707 of the input image 401, 701, to employ perspective-n-point alignment by estimating a pose 407, 707 of the input image using one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates 405, 705 of the input image 401, 701. In addition, in some cases, as shown in Figures 4 and 7, the one or more processors are further configured to, in estimating the pose 407, 707 of the input image 401, 701, use a random sample consensus technique 404, 704.
In Figure 4, the weighted rigid alignment pose estimation 403 may be combined with the perspective-n-point technique 404 such that the one or more processors may be configured to generate poses 407 using each technique and combine them to form a final pose. Figure 7 employs a solely perspective-n-point module 704 in a RANSAC scheme approach to determining a pose 707 based on the 3D global coordinates output from the trained weighted processing network 702 on computing the input image 701. In this case, the weights and the 3D coordinates in the camera frame are not used. The 3D coordinates in the global frame and the corresponding 2D pixels that are available from the input image are used to estimate the pose 707.
In Figure 5, the one or more processors may be configured to estimate the pose 507 only by the weighted rigid alignment module 503. In other words, the one or more processors may be configured, in estimating the pose 507 of the input image 501, to employ a weighted rigid alignment pose estimation 503 by estimating a pose of the input image using one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene. In this way, the weights provide a per-pixel measure of how strongly the 3D coordinates in the global and camera frames should be applied. Each pixel therefore has an associated weighting applied to its global-frame and camera-frame coordinate information, and this may determine how strongly that pixel contributes to the generated pose 507.
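The weighted rigid alignment module can be sketched as a closed-form weighted Kabsch/Umeyama alignment via SVD. This is one standard way to realise such a module, not necessarily the disclosure's exact implementation; with uniform weights it reduces to the non-weighted rigid alignment of Figure 6.

```python
import numpy as np

def weighted_rigid_alignment(P_cam, P_glob, w):
    """Estimate the pose (R, t) that maps camera-frame points onto
    global-frame points, minimising sum_i w_i * ||R p_i + t - q_i||^2.

    P_cam, P_glob: Nx3 arrays of corresponding 3D points; w: length-N
    non-negative per-point weights (the network's weighting factors).
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_p = (w[:, None] * P_cam).sum(axis=0)   # weighted centroids
    mu_q = (w[:, None] * P_glob).sum(axis=0)
    X = P_cam - mu_p
    Y = P_glob - mu_q
    H = (w[:, None] * X).T @ Y                # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # correct for a possible reflection so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_q - R @ mu_p
    return R, t
```

Passing uniform weights (e.g., np.ones(N)) yields the non-weighted variant; in the pipeline, the weights would down-weight pixels whose predicted 3D coordinates are unreliable.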
Figure 6 illustrates an example in which the one or more processors are configured, in estimating the pose of the input image, to employ a non-weighted alignment pose estimation 610 by estimating a pose of the input image 601 using the one or more of the estimated image coordinates in a global frame of the scene and the estimated image coordinates in a camera frame of the scene. This example does not use the weights that may be output from the trained weighted processing network 602; it takes the 3D coordinates from the camera and global frames and combines them to form the pose 607.
It should be understood that the input image, trained weighted processing network, 3D global and camera coordinates and pose shown in Figures 4 to 7 may be the same, and the differing reference numerals used across these figures should not be taken to mean that these common aspects differ. Reference numerals ending in the same digit in Figures 4 to 7 are therefore intended to reference the same stage of the process implemented by the one or more processors of the image processing apparatus 400 of the present disclosure. The difference between these figures lies in the technique used to produce each respective pose.
The training apparatus 100, the corresponding method, the image processing apparatus 400 and the corresponding method described herein provide a number of advantages over known approaches. They use pose labels to create an additional set of supervision signals (geometric constraints) without extra effort by the user. These labels are the relative geometric constraints from adjacent and distant camera frames/images. The constraints described herein can be used to train the network to learn to estimate the 3D coordinates in a global and a camera frame, as well as a set of weighting factors, so that the pose can be estimated using geometric information rather than just regression. This results in poses of increased accuracy.
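The "free" supervision signal above can be made concrete with a small sketch. Assuming poses are represented as 4x4 homogeneous matrices (an assumption for illustration; the disclosure does not fix a representation), the relative pose between any two labelled frames is derived from their absolute labels, and a training loss can penalise disagreement between the estimated and labelled relative poses:

```python
import numpy as np

def make_pose(R, t):
    # Pack a rotation and translation into a 4x4 homogeneous pose.
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    # Relative transform between two absolute poses: a supervision label
    # obtained for free from two existing pose labels.
    return np.linalg.inv(T_i) @ T_j

def relative_constraint_loss(T_i_est, T_j_est, T_i_gt, T_j_gt):
    # Penalise disagreement between estimated and labelled relative poses.
    return float(np.linalg.norm(relative_pose(T_i_est, T_j_est)
                                - relative_pose(T_i_gt, T_j_gt)))

T_a = make_pose(np.eye(3), np.array([0.0, 0.0, 0.0]))
T_b = make_pose(np.eye(3), np.array([1.0, 0.0, 0.0]))
loss_exact = relative_constraint_loss(T_a, T_b, T_a, T_b)   # perfect estimates
T_b_bad = make_pose(np.eye(3), np.array([1.5, 0.0, 0.0]))
loss_bad = relative_constraint_loss(T_a, T_b_bad, T_a, T_b)  # drifted estimate
```

In the described training scheme such constraints would be formed both between adjacent frames of one sequence and between frames of spatially and temporally distant sequences.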
In addition, the present disclosure takes advantage of the ready availability of data: poses are easily obtained from cheap sensors (GPS, IMU, Wi-Fi signal) without the need to create 3D SFM models or to capture the 3D scene using depth sensors or laser scanners such as LIDAR. This results in lower costs and more applications for the apparatus, since only a single image is needed to localize the camera pose.
Furthermore, the present disclosure provides a means for fine-tuning the pre-trained network with position labels only. In some scenarios, only position information is available; it can still be used to train a network to learn position and orientation localization, which can save costs.
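As an illustration of position-only fine-tuning (a sketch under the same 4x4-pose assumption as above, not the disclosure's actual loss), the fine-tuning error can compare only the translation part of each estimated pose against the labelled position, leaving orientation to be shaped indirectly through the shared network parameters:

```python
import numpy as np

def position_only_loss(est_poses, gt_positions):
    # Supervise only the translation column of each estimated 4x4 pose;
    # orientation receives no direct label in this fine-tuning regime.
    est_t = np.stack([T[:3, 3] for T in est_poses])
    return float(np.mean(np.linalg.norm(est_t - np.asarray(gt_positions),
                                        axis=1)))

# A pose displaced by (3, 4, 0) from its GPS label has error 5.
T = np.eye(4)
T[:3, 3] = [3.0, 4.0, 0.0]
loss = position_only_loss([T], [[0.0, 0.0, 0.0]])
```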
Finally, this disclosure provides techniques for pose estimation by rigid alignment and perspective-n-point algorithms. Using the same set of output data, the method of this disclosure can therefore estimate a pose with two algorithms in three different ways, which brings flexibility and increased accuracy to pose estimation. It is also possible to represent the 3D scene as the weights/parameters of a weighted processing network (a deep neural network), which provides a more compact scene representation than 3D SFM models of a scene and thus reduces memory requirements.
In this disclosure, only poses (positions and orientations) are needed to train the model. Although the method described above may refer to receiving image-pose pairings, this should be understood as including images and associated poses, or only poses once they are used for the training of the weighted processing network. The poses may be obtained from GPS (position) and IMU (orientation). For indoor scenes, positions may also be obtained relative to Wi-Fi routers.
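A pose label of this kind can be assembled from the two sensor readings, for example by combining a GPS position with an IMU orientation quaternion into a 4x4 pose matrix. The sketch below is illustrative only (the disclosure does not prescribe a quaternion convention; here `(w, x, y, z)` is assumed):

```python
import numpy as np

def quat_to_rot(q):
    # Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
    w, x, y, z = np.asarray(q, float) / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def pose_from_gps_imu(position, orientation_quat):
    # Assemble a 4x4 pose label from a GPS position and an IMU quaternion.
    T = np.eye(4)
    T[:3, :3] = quat_to_rot(orientation_quat)
    T[:3, 3] = position
    return T

# 90 degrees about z: the local x axis maps to the world y axis.
s = np.sin(np.pi / 4)
T_label = pose_from_gps_imu([10.0, 20.0, 1.5], [np.cos(np.pi / 4), 0.0, 0.0, s])
```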
The present apparatus and method of this disclosure have a number of use cases, some of which are as follows:
• Mobile Phone/Navigation/Virtual reality,
• Crowd sourcing of data from mobile phones (camera, GPS, IMU) for a certain place/historical area/city (or from a collection of public images and the metadata),
• Training a model on the cloud from the collected data, or fine-tuning further on new incoming data (which can be position labels only),
• Localize/navigate from a given single image (human/phone),
• Navigate to a certain place (e.g. restaurant, store, museum) in a city,
• Navigate inside a mall to a certain shop, and
• From a given image of a certain place, listing relevant information about the scene (Augmented Reality).
A further use case is in robotics/navigation, where the image processing method may be used as a global coarse localizer. For example, if a mobile robot is lost (has diverged from its path), the proposed method can be used to quickly relocalise it from a single image. Alternatively, it may be used in continual learning: when a robot is exploring new areas, data collection can be initiated and the data sent to the cloud to train/fine-tune a localization pipeline.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. An image processing apparatus for global camera relocalisation, the image processing apparatus comprising one or more processors configured to: receive an input image representing at least a portion of a scene; input the input image to a trained weighted processing network trained using two sequences of input images and a determined relative pose for each input image, the trained weighted processing network comprising one or more parameters; compute, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, one or more weights associated with the coordinates within the scene; and estimate a pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates.
2. An image processing apparatus according to claim 1, wherein the determined relative pose is based on comparing one or more first pose associated with a first image in a first sequence of image-pose pairings to a further pose associated to a further image in the first sequence and a pose associated to an image in a second sequence of image-pose pairings.
3. An image processing apparatus according to claim 1 or 2, wherein the one or more processors are configured, in estimating the pose of the input image, to employ a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene.
4. An image processing apparatus according to any one of claims 1 to 3, wherein the one or more processors are configured, in estimating the pose of the input image, to employ a non-weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the estimated image coordinates in a camera frame of the scene.
5. An image processing apparatus according to any one of claims 1 to 4, wherein the one or more processors are further configured, in estimating the pose of the input image, to employ a perceptive point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image.
6. An image processing apparatus according to claim 5, wherein the one or more processors are further configured to, in estimating the pose of the input image, use a random sample consensus technique.
7. An image processing apparatus according to any one of claims 1 to 6, wherein the one or more processors are configured, in estimating the pose of the input image, to employ: a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene, and a perceptive point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image.
8. An image processing apparatus according to any one of claims 1 to 7, wherein the trained weighted processing network is configured to be trained by: receiving a first sequence of one or more image-pose pairing, and a second sequence of one or more image-pose pairing; determining, for one or more of the image-pose pairing, one or more relative pose based on the one or more image pose pairing of the first sequence and the one or more image pose pairing of the second sequence; iteratively computing, using a weighted processing network, one or more estimated pose based on the determined one or more relative pose; determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
9. An image processing apparatus according to claim 8, wherein the one or more processors are configured to train the parameters of the trained weighted processing network by: receiving two or more sequences comprising a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses, the two or more sequences further comprising a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, one or more relative pose based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
10. An image processing apparatus according to any one of claims 1 to 9, wherein the parameters of the weighted processing network are fine-tuned by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error.
11. A method of image processing for global camera relocalisation, the method comprising: receiving an input image representing at least a portion of a scene; inputting the input image to a trained weighted processing network trained using two sequences of input images and a determined relative pose for each input image, the trained weighted processing network comprising one or more parameters; computing, using the trained weighted processing network and based on the input image, one or more of: estimated image coordinates in a global frame of the scene, estimated image coordinates in a camera frame of the scene, one or more weights associated with the coordinates within the scene; and estimating a pose for the input image based on one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene, the one or more weights associated with the coordinates within the scene and/or absolute image coordinates.
12. A method of processing data according to claim 11, wherein the determined relative pose is based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence.
13. A method of processing data according to claim 11 or 12, wherein estimating the pose of the input image comprises: employing a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene.
14. A method of processing data according to any one of claims 11 to 13, wherein estimating the pose of the input image comprises: employing a non-weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the estimated image coordinates in a camera frame of the scene.
15. A method of processing data according to any one of claims 11 to 14, wherein estimating the pose of the input image comprises: employing a perceptive point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image.
16. A method of processing data according to claim 15, wherein estimating the pose of the input image comprises: employing a random sample consensus technique.
17. A method of processing data according to any one of claims 11 to 16, wherein estimating the pose of the input image comprises employing: a weighted alignment pose estimation by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene, the estimated image coordinates in a camera frame of the scene and the one or more weights associated with the coordinates within the scene, and a perceptive point alignment by estimating a pose of the input image using the one or more of the estimated image coordinates in a global frame of the scene and the 2D image coordinates of the input image.
18. A method of processing data according to any one of claims 11 to 17, wherein the trained weighted processing network is trained by: receiving two or more sequences comprising a first sequence of one or more image-pose pairing, the two or more sequences further comprising a second sequence of one or more image-pose pairing; determining, for one or more of the image-pose pairing, one or more relative pose based on the one or more image pose pairing of the first sequence and the one or more image pose pairing of the second sequence; iteratively computing, using a weighted processing network, one or more estimated pose based on the determined one or more relative pose; determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
19. A method of processing data according to claim 18, wherein the trained weighted processing network is trained by: receiving the first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses; receiving the second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, the one or more relative pose based on comparing one or more first pose associated with a first image in the first sequence to a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, the one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, the error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
20. A method of processing data according to any one of claims 11 to 19, the method further comprising fine-tuning the trained weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error.
21. A training apparatus for training a weighted processing network to process images to determine camera localisation, the apparatus comprising one or more processors configured to: receive two or more sequences comprising a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses, the two or more sequences further comprising a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determine, for one or more image-pose pairing in the first sequence, one or more relative pose based on one or more first pose associated with a first image in the first sequence and a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refine one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
22. The training apparatus according to claim 21, wherein the determined error is comprised of one or more of: a minimised error based on the difference between the two or more image-pose pairings in the first sequence, or a minimised error based on the difference between one or more image-pose pairings, and their associated absolute and estimated values, in the first sequence and one or more image-pose pairings in the second sequence.
23. The training apparatus according to claim 22, wherein the determined error is further determined by: computing one or more error between two or more image-pose pairing in the first sequence and/or second sequence based on comparing ground truth values for each image-pose pairing to estimated relative poses for each image pose-pairing; and summing and minimising the one or more computed errors.
24. The training apparatus according to any one of claims 21 to 23, wherein the one or more image-pose pairings of the first sequence are spatially and temporally distanced from the one or more image-pose pairings of the second sequence.
25. The training apparatus according to any one of claims 21 to 24, wherein the one or more processors are configured to determine the one or more relative pose of an image-pose pairing by comparing the pose of that image-pose pairing to the further pose of one or more further image-pose pairing in the first sequence and/or the second sequence.
26. The training apparatus according to any one of claims 21 to 25, wherein the one or more processors are further configured to fine-tune the parameters of the weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error.
27. A method of training a weighted processing network to process images to determine camera localisation, the method comprising: receiving two or more sequences comprising a first sequence of image-pose pairings comprised of a first set of images and their associated first set of absolute poses, the two or more sequences further comprising a second sequence of image-pose pairings comprised of a second set of images and their associated second set of absolute poses; iteratively determining, for one or more image-pose pairing in the first sequence, one or more relative pose based on one or more first pose associated with a first image in the first sequence and a further pose associated to a further image in the first sequence and a pose associated to an image in the second sequence; refining one or more parameters of a weighted processing network by iteratively: computing, using the weighted processing network, one or more estimated pose based on one or more of the image-pose pairings from the first sequence, one or more image-pose pairings from the second sequence and the determined one or more relative pose, determining, for each computed estimated pose, an error between the one or more estimated pose and the one or more relative pose; and updating the one or more parameters of the weighted processing network based on the determined error.
28. The method according to claim 27, wherein the determined error is comprised of one or more of: a minimised error based on the difference between the two or more image-pose pairings in the first sequence, or a minimised error based on the difference between one or more image-pose pairings, and their associated absolute and estimated values, in the first sequence and one or more image-pose pairings in the second sequence.
29. The method according to claim 28, wherein the determined error is further determined by: computing one or more error between two or more image-pose pairing in the first sequence and/or second sequence based on comparing ground truth values for each image-pose pairing to estimated relative poses for each image pose-pairing; and summing and minimising the one or more computed errors.
30. The method according to any one of claims 27 to 29, wherein the one or more image-pose pairings of the first sequence are spatially and temporally distant from the one or more image-pose pairings of the second sequence.
31. The method according to any one of claims 27 to 30, wherein the one or more relative pose of an image-pose pairing is determined by comparing the pose of that image-pose pairing to the further pose of one or more further image-pose pairing in the first sequence and/or the second sequence.
32. The method according to any one of claims 27 to 31, wherein the method further comprises fine-tuning the parameters of the weighted processing network by iteratively: inputting to the image processing apparatus one or more image-location pairings; computing an estimated pose comprising location information; determining a fine-tuning error between the estimated pose and known image-location pairings; updating the one or more parameters of the weighted processing network based on minimisation of the fine-tuning error.
PCT/EP2023/076871 2023-09-28 2023-09-28 Method for obtaining camera pose by utilising relative pose constraints from adjacent and distant cameras covering the spatio-temporal space of a scene Pending WO2025067655A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/076871 WO2025067655A1 (en) 2023-09-28 2023-09-28 Method for obtaining camera pose by utilising relative pose constraints from adjacent and distant cameras covering the spatio-temporal space of a scene

Publications (1)

Publication Number Publication Date
WO2025067655A1 true WO2025067655A1 (en) 2025-04-03

Family

ID=88241164

Country Status (1)

Country Link
WO (1) WO2025067655A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220172386A1 (en) * 2020-11-27 2022-06-02 Samsung Electronics Co., Ltd. Method and device for simultaneous localization and mapping (slam)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BRACHMANN ERIC ET AL: "Visual Camera Re-Localization From RGB and RGB-D Images Using DSAC", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE COMPUTER SOCIETY, USA, vol. 44, no. 9, 2 April 2021 (2021-04-02), pages 5847 - 5865, XP011916194, ISSN: 0162-8828, [retrieved on 20210402], DOI: 10.1109/TPAMI.2021.3070754 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23783325

Country of ref document: EP

Kind code of ref document: A1