
GB2631387A - A computer-implemented method for training a pose prediction model for synthesis and display of perspective video - Google Patents

A computer-implemented method for training a pose prediction model for synthesis and display of perspective video

Info

Publication number
GB2631387A
Authority
GB
United Kingdom
Prior art keywords
pose
driver
optimal
prediction model
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2309612.6A
Other versions
GB202309612D0 (en)
Inventor
Colin Hoy Michael
Shan Mo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aumovio Germany GmbH
Original Assignee
Continental Automotive Technologies GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Continental Automotive Technologies GmbH
Priority to GB2309612.6A (GB2631387A)
Publication of GB202309612D0
Priority to CN202480042138.XA (CN121420332A)
Priority to PCT/EP2024/063908 (WO2025002671A1)
Publication of GB2631387A


Classifications

    • B60K35/00 Instruments specially adapted for vehicles; arrangement of instruments in or on vehicles
    • B60K37/00 Dashboards
    • B60R1/00 Optical viewing arrangements; real-time viewing arrangements for drivers or passengers using optical image capturing systems, e.g. cameras or video systems specially adapted for use in or on vehicles
    • B60R1/20 Real-time viewing arrangements for drivers or passengers using optical image capturing systems, e.g. cameras or video systems specially adapted for use in or on vehicles
    • G06Q10/00 Administration; Management
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T3/4046 Scaling of whole images or parts thereof using neural networks
    • G06T19/00 Manipulating 3D models or images for computer graphics

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • Multimedia (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Business, Economics & Management (AREA)
  • Transportation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Computer Graphics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A computer-implemented method for training an untrained pose prediction model using a training dataset of optimal poses collected from a plurality of drivers in order to obtain a pose prediction model 100 for the synthesis and display of a perspective video for a driver. The method includes initialising a driver representation 108, receiving contextual information 124 and a current pose 116, determining features 156 and predicting an optimal pose, and adjusting the pose prediction model based on a loss function that enforces consistency between the predicted optimal pose and the optimal poses of the training dataset. The above steps are repeated until the method is terminated, wherein the current pose is the predicted optimal pose of a preceding iteration, and wherein determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations. Preferably, the pose prediction model comprises a convolutional neural network. A pose prediction model, a method for synthesizing and displaying a perspective video, a computing system, and a computer program are also provided.

Description

A COMPUTER-IMPLEMENTED METHOD FOR TRAINING A POSE PREDICTION MODEL FOR SYNTHESIS AND DISPLAY OF PERSPECTIVE VIDEO
TECHNICAL FIELD
[0001] The invention relates to pose prediction for display of video to a driver. Specifically, the invention relates to a computer-implemented method for training an untrained pose prediction model in order to obtain a trained pose prediction model to predict an optimal pose for the synthesis and display of a perspective video for a driver.
BACKGROUND
[0002] An emerging technology in driver assistance systems is the ability to render virtual perspective videos from third person poses (or viewpoints) based on surround view camera data (also known as "pose synthesis"). Users usually interact with the visualized virtual perspective video from a third person pose through a touchscreen interface. For example, a dragging gesture would cause the pose to change according to the start and end coordinates of such dragging gestures, therefore causing a change in the visualised virtual perspective video.
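The drag-to-pose interaction described above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the function name, the orbit-camera pose parameterisation (azimuth, elevation, radius), and the `gain` factor are all assumptions.

```python
import math

def drag_to_pose(pose, start, end, gain=0.005):
    """Map a touchscreen drag to a new virtual camera pose.

    pose: (azimuth, elevation, radius) of an orbital camera, angles in radians.
    start, end: pixel coordinates of the dragging gesture.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    # horizontal drag orbits the camera around the vehicle
    azimuth = pose[0] + gain * dx
    # vertical drag tilts the camera; clamp so it never flips past the zenith
    elevation = max(0.0, min(math.pi / 2, pose[1] + gain * dy))
    return (azimuth, elevation, pose[2])
```

A drag from (100, 100) to (300, 100) would thus rotate the camera by `gain * 200` radians in azimuth while leaving elevation and radius unchanged.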
[0003] The preferred or desired pose from which such a virtual perspective video is generated may differ depending on the driving manoeuvre performed. For example, a driver will most likely desire a close-up view showing the distance between the wheels and the curb when parallel parking. The preferred pose for each driver is subjective and may vary from driver to driver. For example, drivers could prefer any of the following views with equal validity: an overhead view; a lateral close-up of the nearest point of the car to the curb; or a longitudinal close-up of the nearest point of the car to the curb.
[0004] In order to adjust the displayed pose to their preferred or desired pose, the driver would have to shift their attention away from the road and remove their hands from the steering wheel to interact with the touchscreen interface, potentially reducing road safety. There is therefore a need for a system that predicts an optimal pose and synthesises and displays to the driver a perspective video from the predicted optimal pose, reducing or minimizing driver interaction with the touchscreen interface and therefore potentially increasing driver safety and road safety. There is much work in the field of recommender systems, but existing recommender systems may not be directly applicable to continuous prediction problems that include a direct feedback component from a human operator.
SUMMARY
[0005] It is the object of the invention to train a pose prediction model that is able to predict an optimal pose and subsequently synthesise and display a perspective video from the predicted optimal pose to a driver to reduce or minimize driver interaction with an interface displaying a perspective video from a third person pose to the driver.
[0006] The object is achieved by the subject matter of the independent claims. Preferred embodiments are subject-matter of the dependent claims.
[0007] It shall be noted that all embodiments of the present invention concerning a method or a series of performed steps may be carried out with the order of the steps as described; nevertheless, this need not be the only or essential order of the steps of the method. The herein presented methods or series of performed steps can be carried out with another order of the disclosed steps without departing from the respective method embodiment, unless explicitly stated to the contrary hereinafter.
[0008] To solve the above technical problems, the present invention provides a computer-implemented method for training an untrained pose prediction model using a training dataset of optimal poses collected from a plurality of drivers in order to obtain a pose prediction model for the synthesis and display of a perspective video for a driver, the method comprising: a) initialising a driver representation; b) receiving contextual information about a scene of a vehicle and a current pose; c) with the pose prediction model, determining features and predicting an optimal pose based on the contextual information, the current pose, and the driver representation; d) adjusting the pose prediction model based on a loss function that enforces consistency between the predicted optimal pose and the optimal poses of the training dataset, and e) repeating steps b) to d) until the method is terminated, wherein in step b) the current pose is the predicted optimal pose of a preceding iteration, and wherein step c) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
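The training loop of steps a) to e) can be sketched as follows. This is a hedged, minimal illustration: every name here (`PosePredictionModel`, `pose_loss`, `train`, the toy gradient step, the (x, y, z, yaw) pose tuple) is an assumption for exposition, not the patent's actual implementation.

```python
def pose_loss(predicted, target):
    """Step d): squared-error consistency loss against the recorded optimal pose."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

class PosePredictionModel:
    """Trivial stand-in model: predicted pose = current pose + learnable bias."""
    def __init__(self):
        self.bias = [0.0, 0.0, 0.0, 0.0]

    def predict(self, context, current_pose, driver_repr, prev_features):
        # Step c): features combine the context with features determined in
        # preceding iterations; a real model would also condition on driver_repr.
        features = [c + f for c, f in zip(context, prev_features)]
        pose = tuple(p + b for p, b in zip(current_pose, self.bias))
        return pose, features

def train(model, dataset, driver_repr, steps=10, lr=0.05):
    features = [0.0] * 4                   # step a) would also initialise driver_repr
    current_pose = dataset[0]["current_pose"]
    for step in range(steps):
        sample = dataset[step % len(dataset)]
        # step b): receive contextual information and the current pose
        pred, features = model.predict(sample["context"], current_pose,
                                       driver_repr, features)
        # step d): gradient step on the squared-error loss w.r.t. the bias
        grads = [2 * (p - t) for p, t in zip(pred, sample["optimal_pose"])]
        model.bias = [b - lr * g for b, g in zip(model.bias, grads)]
        # step e): the predicted pose becomes the next iteration's current pose
        current_pose = pred
    return model

DATASET = [{"context": [0.2, 0.1, 0.0, 0.0],
            "current_pose": (0.0, 0.0, 5.0, 0.0),
            "optimal_pose": (1.0, 2.0, 4.0, 0.0)}]
```

Note how the recurrence of steps b) to d) is explicit: the predicted pose and the determined features of one iteration feed the next, mirroring the claim language.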
[0009] The computer-implemented method of the present invention is advantageous over known methods as the resulting pose prediction model is able to continuously predict an optimal pose in real time based on contextual information specific to the scene, the current pose, as well as the driver representation based on each driver, such that the pose predicted may be optimal for the specific driving situation, driving context, and driver. The optimal pose predicted by the pose prediction model may then be used to synthesise and display a potentially better, safer, more suitable and/or more informative perspective video from the optimal pose to a driver, therefore potentially reducing or minimizing driver interaction with the displayed perspective video and increasing driver safety.
[0010] The present invention may have the following advantages over heuristic methods for selecting a pose, such as classifying the situation and then using the previously selected pose from the last time the same situation was present. Firstly, the present invention may be advantageous as it may more easily be scaled by scaling the training data, as compared to scaling the complexity of a heuristic method (e.g., a nearest neighbour lookup scheme on deployed driver experiences). Furthermore, the method is able to train a model to predict an optimised pose without primarily relying on manual viewpoint selections from the driver at deployment time, which would require the collection of sufficient or larger amounts of experience from a driver over a wide range of scenarios. As the present invention learns to classify drivers based on previously collected training data, far less experience may need to be collected when adding a new driver. The present method may thus avoid a long learning phase which may annoy the driver and cause the driver to stop using the viewpoint selection system, leading the driver to revert to manual viewpoint selection.
[0011] A preferred computer-implemented method of the present invention is a computer-implemented method as described above, wherein the training dataset is generated by having a plurality of drivers performing one or more driving manoeuvres on a vehicle simulator, wherein the vehicle simulator displays to the driver a perspective video from a virtual camera pose, and wherein the plurality of drivers are preferably from different geographical regions and drive different vehicle models, and more preferably have safe driving records.
[0012] The above-described aspect of the present invention has the advantage that using a training dataset containing data collected from a wide variety of drivers ensures that the resulting pose prediction model is able to predict optimal poses in a variety of situations and for a variety of drivers. Ensuring that the drivers have safe driving records is also advantageous as the information and optimal poses collected from such a plurality of drivers with safe driving records may be more objectively informative for safely completing each driving manoeuvre, therefore resulting in the prediction of a potentially better optimal pose for synthesis of a potentially better, safer, more suitable and/or more informative perspective video for the driver.
[0013] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the training dataset comprises an optimal pose for each driving manoeuvre, wherein the optimal pose is based on the displayed perspective video or, if the perspective video is adjusted by the driver, an implied optimal pose determined based on the perspective video adjusted by the driver.
[0014] The above-described aspect of the present invention has the advantage that the optimal pose is specific to each driving manoeuvre, such that the resulting pose prediction model is able to predict an optimal pose for the synthesis and display of perspective video for each driving manoeuvre. The optimal pose could be the pose of the perspective video displayed to the driver while performing the driving manoeuvre (i.e., the driver is satisfied with the pose displayed and does not need to adjust the pose while driving), or an implied optimal pose based on adjustments made by the driver to the displayed perspective video while performing the driving manoeuvre.
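One possible way to derive the implied optimal pose from driver adjustments is sketched below. This is an assumption for illustration only: the patent does not specify the computation, and the function name, the (x, y, z, yaw) pose tuple, and the additive-delta interaction model are all hypothetical.

```python
def implied_optimal_pose(displayed_pose, adjustments):
    """Apply the driver's accumulated adjustments to the displayed pose and
    take the final result as the implied optimal pose for the manoeuvre.

    displayed_pose: (x, y, z, yaw) of the pose shown to the driver.
    adjustments: list of delta tuples captured from driver interactions.
    """
    pose = list(displayed_pose)
    for delta in adjustments:
        pose = [p + d for p, d in zip(pose, delta)]
    return tuple(pose)
```

If the driver makes no adjustments, the displayed pose itself is returned, matching the case in [0014] where the driver is satisfied with the displayed pose.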
[0015] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the optimal pose is refined by having each driver perform offline adjustments to the pose after the driving manoeuvre is completed.
[0016] The above-described aspect of the present invention has the advantage that the optimal poses adjusted based on offline adjustments may be more precise and may more accurately reflect the optimal pose preferred by the driver as compared to poses adjusted during driving as the driver would only be able to make limited adjustments to the video while driving. The optimal pose prediction model trained on training data with such more precise or accurate optimal poses may thus be able to better predict a more optimal pose for synthesis of a potentially better, safer, more suitable and/or more informative perspective video for each driving manoeuvre.
[0017] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the contextual information about a scene of the vehicle comprises one or more of an intermediate environment representation, a sliding window of a history of vehicle parameters, accelerometer data, location information of the vehicle, and information derived from an in-cabin camera.
[0018] The above-described aspect of the present invention has the advantage that provision of information regarding vehicle conditions, the environment surrounding the vehicle, and the temporal history of the vehicle may be able to give a more complete picture of the current scene in real time such that the pose prediction model is able to predict a more optimal pose for synthesis of a potentially better, safer, more suitable and/or more informative perspective video for the driver.
[0019] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the contextual information about a scene of the vehicle comprises a bird's-eye-view (BEV) grid map comprising information on one or more terrain classes comprising: a road, a sidewalk, an obstacle, an ego-vehicle, and other vehicles.
[0020] The above-described aspect of the present invention has the advantage that provision of information about the environment surrounding the vehicle and any other traffic participants on the road may be able to give a more complete picture of the current context or scene in real time such that the pose prediction model is able to predict a more optimal pose for the synthesis and display of a potentially better, safer, more suitable and/or more informative perspective video for the driver.
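A BEV grid map of the kind described in [0019] can be illustrated minimally as follows. The class identifiers, grid dimensions, and helper names are assumptions for exposition; a deployed system would typically derive such a grid from surround-view camera and sensor data.

```python
# Terrain classes from [0019]; the integer ids are illustrative assumptions.
TERRAIN = {"road": 0, "sidewalk": 1, "obstacle": 2,
           "ego_vehicle": 3, "other_vehicle": 4}

def make_bev_grid(width, height, fill="road"):
    """Create a 2-D grid of terrain-class ids, initially all one class."""
    return [[TERRAIN[fill]] * width for _ in range(height)]

def paint(grid, cls, x0, y0, x1, y1):
    """Mark the axis-aligned cell rectangle [x0, x1) x [y0, y1) with a class."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            grid[y][x] = TERRAIN[cls]
    return grid
```

For example, painting a 2x2 block of `ego_vehicle` cells into an 8x8 road grid yields the kind of compact scene summary a CNN could consume as contextual input.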
[0021] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, further comprising step f) of refining the pose prediction model by carrying out the following steps with a driver on a vehicle or a vehicle simulator: i) initialising a driver representation based on the driver; ii) receiving contextual information about the scene of the vehicle and a current pose displayed to the driver; iii) with the pose prediction model, determining features and predicting an optimal pose based on the contextual information, the current pose, and the driver representation; iv) synthesising and displaying a perspective video based on the predicted optimal pose; v) performing steps ii) to iv) until one or more interactions are received from the driver adjusting the displayed video or until the method is terminated, wherein in step ii) the current pose is the predicted optimal pose of a preceding iteration, and wherein step iii) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
[0022] The above-described aspect of the present invention has the advantage that the pose prediction model may be validated by deploying the model on a simulator or vehicle with a human driver or operator, therefore ensuring that the model works and may be deployed on the roads to increase road and/or driver safety.
[0023] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein step f) further comprises: vi) receiving one or more interactions from the driver adjusting the displayed perspective video; vii) determining an implied optimal pose based on the adjusted perspective video; viii) updating the pose prediction model and driver representation based on a loss function that enforces consistency between the predicted optimised pose and the implied optimal pose, and an embedding loss; and ix) repeating steps ii) to v) or steps ii) to viii) until the method is terminated, wherein in step ii) the current pose is the predicted optimal pose of a preceding iteration or, if one or more interactions are received from the driver, the current pose is the implied optimal pose of the preceding iteration.
[0024] The above-described aspect of the present invention has the advantage that the pose prediction model is additionally trained or refined based on feedback from drivers in the form of input and/or interaction adjusting the displayed perspective video such that the resulting pose prediction model is able to better predict a more optimal pose for the synthesis and display of a potentially better, safer, more suitable and/or more informative perspective video for a driver. The optimal pose predicted by the pose prediction model may also be personalised for each driver based on updates to the driver representation and/or model from feedback from drivers in the form of input and/or interaction adjusting the displayed perspective video, therefore potentially reducing the number of interactions received from the driver adjusting the displayed perspective video in subsequent iterations and/or deployment.
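The combined objective of step viii) can be sketched as a pose-consistency term plus an embedding loss. The specific form shown here (squared distances, a fixed weighting, and a reference embedding for the driver representation) is an assumption; the patent names the two loss components but does not fix their formulas.

```python
def pose_consistency_loss(predicted, implied):
    """Consistency between the predicted optimal pose and the implied optimal pose."""
    return sum((p - i) ** 2 for p, i in zip(predicted, implied))

def embedding_loss(driver_repr, reference_repr):
    """Illustrative embedding loss: squared distance to a reference embedding."""
    return sum((d - r) ** 2 for d, r in zip(driver_repr, reference_repr))

def combined_loss(predicted, implied, driver_repr, reference_repr, weight=0.1):
    """Objective used to update both the model and the driver representation."""
    return (pose_consistency_loss(predicted, implied)
            + weight * embedding_loss(driver_repr, reference_repr))
```

Minimising the first term personalises predictions towards the driver's adjustments, while the second term regularises the driver representation, consistent with the personalisation described in [0024].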
[0025] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein smoothing is used during one or more iterations of steps ii) to v).
[0026] The above-described aspect of the present invention has the advantage that the transition of the displayed perspective video from the current pose to the predicted optimal pose may be smoother and less abrupt, and therefore may increase the ease of viewing by the driver and improve the viewing experience of the driver, thereby potentially increasing the likelihood of the driver using the present invention and potentially increasing driver and road safety.
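One concrete smoothing scheme consistent with [0025] is exponential smoothing of the displayed pose towards each newly predicted optimal pose. The patent does not specify a scheme, so the blending function and the `alpha` factor below are assumptions.

```python
def smooth_pose(current, target, alpha=0.2):
    """Move the displayed pose a fraction alpha of the way towards the target
    pose each frame, so transitions are gradual rather than abrupt jumps."""
    return tuple(c + alpha * (t - c) for c, t in zip(current, target))
```

Applied once per display frame, the pose converges geometrically to the predicted optimal pose; a production system might instead interpolate rotations on the sphere (e.g. slerp) for camera orientation.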
[0027] A preferred computer-implemented method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the pose prediction model comprises: a convolutional neural network (CNN) configured to receive as input the contextual information; a first multi-layer perceptron (MLP) configured to receive as input the current pose; a concatenation module configured to receive as input the driver representation, an output from the convolutional neural network, and an output from the first multi-layer perceptron (MLP); a second multi-layer perceptron (MLP) configured to receive as input an output from the concatenation module; a sequence network suitable for sequential data configured to receive as input an output from the second multi-layer perceptron (MLP) and configured to output features which are fed into the sequence network in subsequent iterations, wherein the sequence network is preferably a long short-term memory (LSTM) network; a third multi-layer perceptron (MLP) configured to receive as input an output from the sequence network, and configured to output an updated driver representation which is used in subsequent iterations; a fourth multi-layer perceptron (MLP) configured to receive as input the output from the sequence network, and configured to output the predicted optimal pose; and optionally, a fifth multi-layer perceptron (MLP) configured to receive as input one or more interactions from the driver, and the concatenation module is further configured to receive as input an output from the fifth multi-layer perceptron (MLP).
[0028] The above-described aspect of the present invention has the advantage that a convolutional neural network (CNN) has high accuracy in image recognition and can automatically filter images and extract or detect features in images (or higher dimension sensor data) without any human supervision. The incorporation of multi-layer perceptrons (MLPs) is advantageous as MLPs allow the application to complex non-linear problems, work well with large input data and may provide quick predictions after training. The incorporation of a sequence network suitable for sequential data, such as an LSTM, is advantageous as sequence networks such as LSTMs are able to learn long-term dependencies and capture complex patterns in sequential data. The various networks thus work together to predict a more optimal pose for the synthesis and display of a potentially better, safer, more suitable and/or more informative perspective video for a driver.
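The wiring of the architecture in [0027] can be made concrete with the following schematic, in which each network is replaced by a trivial stub so the data flow can be shown without a deep learning framework. In practice, real CNN/MLP/LSTM modules (e.g. in PyTorch) would replace these stubs; all names and dimensions are illustrative assumptions.

```python
def cnn(context):
    """Stub CNN over contextual information (e.g. a BEV grid): one pooled feature."""
    flat = [c for row in context for c in row]
    return [sum(flat) / len(flat)]

def mlp(x):
    """Stand-in for each multi-layer perceptron (identity for illustration)."""
    return list(x)

def sequence_net(x, prev_features):
    """Stand-in for the LSTM: mixes the input with features fed back from the
    preceding iteration (zero-padded on the first iteration)."""
    padded = prev_features + [0.0] * (len(x) - len(prev_features))
    return [a + b for a, b in zip(x, padded)]

def predict_step(context, current_pose, driver_repr, prev_features):
    # concatenation module: driver representation + CNN output + first-MLP output
    concat = driver_repr + cnn(context) + mlp(current_pose)
    hidden = mlp(concat)                           # second MLP
    features = sequence_net(hidden, prev_features) # fed back in subsequent iterations
    new_driver_repr = mlp(features)[:len(driver_repr)]   # third MLP
    optimal_pose = mlp(features)[:len(current_pose)]     # fourth MLP
    return optimal_pose, new_driver_repr, features
```

The optional fifth MLP over driver interactions would simply contribute one more input to the concatenation module.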
[0029] The above-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of a below-described pose prediction model of the invention. All below-described advantageous aspects of a pose prediction model of the invention also hold for all aspects of an above-described computer-implemented method of the invention.
[0030] The invention also relates to a pose prediction model obtainable by performing a computer-implemented method of the present invention.
[0031] The pose prediction model of the present invention is advantageous over known models as the pose prediction model of the present invention is potentially able to predict an optimal pose for the synthesis and display of a perspective video for a driver based on the driver representation, the context surrounding the vehicle that the driver is operating, and the current pose, thereby potentially reducing the number of interactions the driver has with the pose display to increase driver and/or road safety when the driver is operating the vehicle.
In some embodiments, the driver representation may be customised or personalised to the driver which may further reduce the need for interactions from the driver adjusting the displayed perspective video to further increase driver and/or road safety.
[0032] The above-described advantageous aspects of a computer-implemented method and/or a pose prediction model of the invention also hold for all aspects of a below-described method of the invention. All below-described advantageous aspects of a method of the invention also hold for all aspects of an above-described computer-implemented method and/or a pose prediction model of the invention.
[0033] The invention also relates to a method for synthesizing and displaying a perspective video from an optimal pose on a display to a driver of a vehicle, the method comprising: 1) detecting a driver; 2) initialising a driver representation based on the driver; 3) receiving contextual information about a scene of the vehicle and a current pose displayed to the driver; 4) with the pose prediction model of the present invention, determining features and predicting an optimal pose based on the contextual information, current pose, and the driver representation; 5) synthesizing and displaying a perspective video on the display based on the predicted optimal pose; 6) performing steps 3) to 5) until one or more interactions are received from the driver adjusting the displayed perspective video or until the method is terminated, wherein in step 3) the current pose is the predicted optimal pose of a preceding iteration, and wherein step 4) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
[0034] The method of the present invention is advantageous over known methods as the method is able to synthesise and display a potentially better, safer, more suitable and/or more informative perspective video from a predicted optimized pose for the driver based on the driver representation, the context surrounding the vehicle that the driver is operating, and the current pose, thereby potentially reducing the need for interactions from the driver adjusting the displayed perspective video to further increase driver and/or road safety.
[0035] A preferred method of the present invention is a method as described above, further comprising: 7) receiving one or more interactions from the driver adjusting the displayed perspective video; 8) determining an implied optimal pose based on the adjusted perspective video; 9) updating the driver representation based on the implied optimal pose to generate an updated driver representation; and 10) performing steps 3) to 6) or 3) to 9) until the method is terminated, wherein the driver representation used is the updated driver representation.
[0036] The above-described aspect of the present invention has the advantage that the driver representation may be updated based on interactions or touch gestures received from the driver adjusting the displayed pose, therefore the driver representation may be customised or personalised to each specific driver which may further reduce the need for interactions from the driver adjusting the displayed perspective video to further increase driver and/or road safety.
[0037] The above-described advantageous aspects of a computer-implemented method, pose prediction model and/or method of the invention also hold for all aspects of a below-described computing system. All below-described advantageous aspects of a computing system of the invention also hold for all aspects of an above-described computer-implemented method, pose prediction model and/or method of the invention.
[0038] The invention also relates to a computing system for synthesising and displaying a perspective video from a predicted optimal pose for a driver, the computing system comprising one or more displays, one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for carrying out a method according to the present invention.
[0039] The above-described advantageous aspects of a computer-implemented method, pose prediction model, method, and/or computing system of the invention also hold for all aspects of a below-described computer program, machine-readable storage medium, or data carrier signal of the invention. All below-described advantageous aspects of a computer program, machine-readable storage medium, or data carrier signal of the invention also hold for all aspects of an above-described computer-implemented method, pose prediction model, method, and/or computing system of the invention.
[0040] The invention also relates to a computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method of the invention. The machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). The machine-readable medium may be any medium, such as for example, read-only memory (ROM); random access memory (RAM); a universal serial bus (USB) stick; a compact disc (CD); a digital video disc (DVD); a data storage device; a hard disk; electrical, acoustical, optical, or other forms of propagated signals (e.g., digital signals, data carrier signal, carrier waves); or any other medium on which a program element as described above can be transmitted and/or stored.
[0041] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "sensor" includes any sensor that detects or responds to some type of input and may be used to determine the condition and/or environment of a vehicle and/or human. Examples of sensors include accelerometers, tactile sensors, rotary encoders, pressure sensors, imaging sensors, depth sensors, light sensors, sound sensors, temperature sensors, or navigation sensors.
[0042] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "pose" may be used interchangeably with "viewpoint".
[0043] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "sensor data" means the output or data of any sensor.
BRIEF DESCRIPTION OF THE DRAWINGS
[0044] These and other features, aspects, and advantages will become better understood with regard to the following description, appended claims, and accompanying drawings where:
[0045] Fig. 1 is a schematic illustration of the inputs and outputs of a pose prediction model, in accordance with embodiments of the present disclosure;
[0046] Fig. 2 is a schematic illustration of an example of the architecture of a pose prediction model, in accordance with embodiments of the present disclosure;
[0047] Fig. 3 is a schematic illustration of a method for generating a trained pose prediction model, in accordance with embodiments of the present disclosure;
[0048] Fig. 4 is a schematic illustration of a method for refining an initial pose prediction model, in accordance with embodiments of the present disclosure; and
[0049] Fig. 5 is a schematic illustration of a method of running a pose prediction model online, in accordance with embodiments of the present disclosure.
[0050] In the drawings, like parts are denoted by like reference numerals.
[0051] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0052] In the summary above, in this description, in the claims below, and in the accompanying drawings, reference is made to particular features (including method steps) of the invention. It is to be understood that the disclosure of the invention in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment of the invention, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments of the invention, and in the inventions generally.
[0053] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[0054] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the forms disclosed; on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0055] Fig. 1 is a schematic illustration of the inputs and outputs of a pose prediction model 100, in accordance with embodiments of the present disclosure. In some embodiments, the pose prediction model 100 may receive a driver representation 108, a current pose 116 and contextual information 124 as inputs to predict an optimal pose 140 for the synthesis and display of a perspective video for a driver, or operator, of a vehicle. In some embodiments, the driver or operator may be sitting in and operating the vehicle. In some embodiments, the driver or operator may be seated separately from the vehicle and controlling the vehicle through teleoperation. In some embodiments, the driver representation 108 may be a representation of a driver comprising one or more features of or representative of a driver or operator. In some embodiments, the driver representation 108 may be a driver embedding 108 with a size of any integer. For example, the driver embedding 108 may have a size of 32. As an example, the driver embedding 108 may be expressed in pseudocode as follows:

// Default 32
const DRIVER_EMBEDDING_SIZE: Int

[0056] In some embodiments, the current pose 116 may comprise information on the orientation and/or position of the vehicle. In some embodiments, the current pose 116 may comprise information on the orientation and/or position of the vehicle at one or more time points, and/or a specific time point. In some embodiments, the current pose 116 may comprise information relating to one or more imaging sensors on the vehicle, including a zoom or magnification. In some embodiments, the current pose 116 may be randomly initialised. In some embodiments, the current pose 116 may be the predicted optimal pose 140 predicted by the pose prediction model 100 in a preceding iteration. In some embodiments, the current pose 116 may be an implied optimal pose based on adjustments by the driver.
As an example, the current pose 116 may be expressed in pseudocode as follows:

// Number of parameters required to describe the pose
//
// If we always assume the camera is pointed at a fixed point
// on the vehicle, we need:
//
// * 3 parameters for the rotation
// * 1 parameter for the zoom,
//
// which is 4 in total
const POSE_CHANNELS: Int

type PoseSpecification = Array[POSE_CHANNELS]
[0057] In some embodiments, contextual information 124 may comprise contextual information about a scene of a vehicle. In some embodiments, the scene of a vehicle may be the physical surroundings of the vehicle, whether in real life or in a simulator. The contextual information 124 may be obtained from one or more sensors mounted on, around, and/or within the vehicle. In some embodiments, the contextual information 124 may be obtained from a vehicle simulator. An example of a vehicle simulator is CARLA, available at https://carla.org/. In some embodiments, the default settings of the vehicle simulator may be used. In some embodiments, the settings and/or parameters of the vehicle simulator may be adjusted. In some embodiments, the contextual information 124 may comprise one or more of an intermediate environment representation, a current pose, a sliding window of a history of vehicle parameters, accelerometer data, location information of the vehicle, and information derived from an in-cabin camera. The intermediate environment representation may include an "environment embedding" outputted by an auxiliary neural network (i.e., a neural network that directly consumes the RGB camera data to produce the embedding), a list of "affordances" (i.e., specific measurements such as "distance to obstacle in the forward direction"), and/or a list of regions described by a particular shape, for example a list of ellipsoid regions around the vehicle. In some embodiments, the sliding window of the history of vehicle parameters may include parameters such as speed, steering angle, accelerator/brake pedal position, and current gear (in particular, whether it is a forward or reverse gear). In some embodiments, accelerometer data may include information such as translational acceleration, rotational velocity, and compass direction. In some embodiments, information about the location of the vehicle may include information on the road and/or city type and may be derived from GPS and a map.
In some embodiments, information derived from an in-cabin camera may include information such as gaze direction detection and sentiment detection.
[0058] In some embodiments, the contextual information 124 may comprise a bird's-eye-view (BEV) grid map comprising information on one or more terrain classes comprising: a road, a sidewalk, an obstacle, an ego-vehicle, and other vehicles. In some embodiments, contextual information 124 may be obtained from a surround view camera system of a vehicle by applying one or more machine learning models to the camera data. For example, semantic segmentation may be applied to distinguish the classes "road", "sidewalk", "car" and "other obstacle". For example, BEVFusion, a model available at https://github.com/mit-han-lab/bevfusion, may be employed. In some embodiments, the contextual information 124 may be aligned with the current pose 116 of the vehicle. As an example, the contextual information 124 may be expressed in pseudocode as follows:

// Length of the context grid, default 30
const L: Int

// Width of the context grid, default 30
const W: Int

// Channels in the context grid, default 5
//
// This might be a one-hot encoding of these terrain classes:
//
// * Road
// * Sidewalk
// * Obstacle
// * Ego-vehicle
// * Other vehicle
const C: Int

// A grid aligned with the current vehicle's pose
type ContextGrid = Array[L, W, C]

[0059] In some embodiments, the pose prediction model 100 may optionally further receive as input one or more interactions 132 from the driver. The interactions 132 may be one or more gestures or touch gestures received from the driver adjusting a perspective video currently displayed on a display. The perspective video may be a video from an imaging sensor, or a synthesised perspective video from a virtual third person or camera pose. A perspective video is a video from a third person (or outsider) pose or perspective. In some embodiments, the interactions 132 may be one or more inputs received from a driver adjusting the video displayed, through any input device, including a touch screen, a keyboard, or a mouse.
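As a non-limiting illustration, the one-hot terrain encoding of the context grid described above may be sketched in Python as follows. The class ordering and the toy grid size are assumptions chosen for the example, not mandated by the pseudocode.

```python
import numpy as np

# Hypothetical terrain-class indices; the description lists the classes
# but does not fix their order.
ROAD, SIDEWALK, OBSTACLE, EGO_VEHICLE, OTHER_VEHICLE = range(5)

def one_hot_context_grid(class_map: np.ndarray, num_classes: int = 5) -> np.ndarray:
    """Convert an (L, W) map of terrain-class indices into an
    (L, W, C) one-hot context grid."""
    length, width = class_map.shape
    grid = np.zeros((length, width, num_classes), dtype=np.float32)
    # Set the channel matching each cell's class to 1.
    grid[np.arange(length)[:, None], np.arange(width)[None, :], class_map] = 1.0
    return grid

# A toy 2x2 scene: road on the left, sidewalk on the right.
scene = np.array([[ROAD, SIDEWALK],
                  [ROAD, SIDEWALK]])
grid = one_hot_context_grid(scene)
```

Each cell of the resulting grid has exactly one active channel, matching the "one hot encoding of these terrain classes" described in the pseudocode comment.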
[0060] In some embodiments, the pose prediction model 100 may generate as output a predicted optimal pose 140. In some embodiments, the predicted optimal pose 140 may be determined by the pose prediction model 100 based on the driver representation 108, the current pose 116, and the contextual information 124. In some embodiments, the predicted optimal pose 140 may be determined by the pose prediction model 100 based on the one or more interactions 132 received from the driver. In some embodiments, the pose prediction model 100 may further generate as output an updated driver representation 148 based on the received driver representation 108 and one or more interactions 132 received from the driver, and optionally, based on the contextual information 124 and current pose 116, wherein the updated driver representation 148 is used as input in subsequent iterations or steps.
[0061] In some embodiments, the pose prediction model 100 may generate as output one or more features 156, which may be used as input to the pose prediction model 100 for each subsequent iteration or step. The features may comprise one or more pieces of information based on, representative of, or related to the inputs into the pose prediction model 100. In some embodiments, the one or more features 156 may be recurrent features. In some embodiments, the one or more features 156 may also be used as input to the pose prediction model 100 to determine the predicted optimal pose 140 and/or the updated driver representation 148.
[0062] Fig. 2 is a schematic illustration of an example of the architecture of a pose prediction model, in accordance with embodiments of the present disclosure. It is emphasised that the architecture illustrated in Fig. 2 is only an example of the architecture of the pose prediction model and other suitable architectures could be employed.
[0063] In some embodiments, pose prediction model 100 may comprise a convolutional neural network (CNN) 208 configured to receive as input the contextual information 124. In some embodiments, CNN 208 may be a 2D convolutional network. An example of a 2D convolutional network is a residual network. An example of a suitable network is ResNet, which is disclosed in He et al., "Deep Residual Learning for Image Recognition", available at https://doi.org/10.48550/arXiv.1512.03385.
[0064] In some embodiments, pose prediction model 100 may comprise a first multi-layer perceptron (MLP) 216 (also termed a "current pose mlp"). The first MLP 216 may include feed forward layers and may have 2 to 5 hidden layers with a feature width of 32 to 256 in intermediate layers. The first MLP 216 may receive the current pose 116 as input and output an embedding thereof.
[0065] In some embodiments, the driver representation 108 (in the form of an embedding) and the outputs of the CNN 208 and first MLP 216 are concatenated by a concatenation module 224. The concatenation module 224 outputs concatenated data, which is then fed into a second MLP 232 (also termed a "lstm input mlp"). The second MLP 232 may include feed forward layers and may have 2 to 5 hidden layers with a feature width of 32 to 256 in intermediate layers.
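As a non-limiting illustration of the feed forward MLPs described above, the following Python sketch runs a small MLP with ReLU hidden layers. The hidden width of 64 is an example value within the stated 32 to 256 range, and the input width of 4 matches the POSE_CHANNELS example; the random weights are placeholders.

```python
import numpy as np

def relu(x):
    # Elementwise rectified linear unit.
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases):
    """Feed forward MLP: ReLU on hidden layers, linear output layer."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return x @ weights[-1] + biases[-1]

# Illustrative "current pose mlp": 4 pose channels -> 64 -> 64 -> 32-dim embedding.
rng = np.random.default_rng(0)
dims = [4, 64, 64, 32]
ws = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
embedding = mlp_forward(rng.normal(size=4), ws, bs)
```

The same pattern covers the other MLPs in Fig. 2, differing only in their input and output widths.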
[0066] In some embodiments, pose prediction model 100 may comprise a sequence network 240 suitable for sequential data. Examples of sequence networks include recurrent neural networks (RNN) such as long short-term memory (LSTM) networks, gated recurrent units (GRU), 1D CNNs, and graph neural networks (GNN) (which include transformers as a special case where the graph is fully connected in a particular context window). In some embodiments, the sequence network 240 may combine RNNs, 1D CNNs and GNNs arbitrarily at different abstraction levels (e.g., a 1D CNN for raw data, an RNN for the mid-level representation, and transformers for the high-level representation). Preferably, sequence network 240 is a long short-term memory (LSTM) network. The invention is discussed in relation to an LSTM network, although it is emphasised that any of the above-listed types of sequence networks may be used. LSTM is an extension of the recurrent neural network (RNN), and in particular has a long-term memory cell which is used to store important information in each unit of the LSTM. An example of LSTM may be found in "Understanding LSTM - a tutorial into Long Short-Term Memory Recurrent Neural Networks" by Staudemeyer and Morris, wherein an example of the architecture of LSTM may be found at least in Section 8, and an example of the training of LSTM may be found at least in Section 9. It is contemplated that aside from LSTM, other time-series based neural networks may be used, such as a recurrent neural network (RNN) or gated recurrent unit (GRU). The sequence network 240 outputs features 156 which may then be fed into the sequence network 240 as input in subsequent iterations. For example, a sequence network 240 with LSTM architecture may output recurrent features as features 156, the recurrent features being fed into the LSTM as input in subsequent iterations.
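As a non-limiting illustration of a single LSTM unit with the long-term memory cell described above, the following Python sketch computes one step. The gate ordering (input, forget, output, candidate) and the sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W maps the input x, U maps the previous hidden
    state h, and b is the bias; c is the long-term memory cell."""
    z = W @ x + U @ h + b                      # (4*H,) pre-activations
    H = h.shape[0]
    # Input, forget and output gates, then the candidate cell value.
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g                          # update the memory cell
    h = o * np.tanh(c)                         # recurrent output
    return h, c

H, X = 4, 3                                    # hidden and input sizes
rng = np.random.default_rng(1)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c,
                 rng.normal(size=(4 * H, X)), rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
```

The returned (h, c) pair corresponds to the recurrent features 156 that are fed back into the sequence network on the next iteration.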
[0067] In some embodiments, pose prediction model 100 may comprise a third MLP 248. The third MLP 248 may include feed forward layers and may have 2 to 5 hidden layers with a feature width of 32 to 256 in intermediate layers. In some embodiments, the third MLP 248 may receive an output from the LSTM 240 and output an updated driver representation 148 (or embedding), which is used in subsequent iterations. In some embodiments, the pose prediction model may directly output a new value of the driver representation/embedding (with a potential penalisation for changing the driver representation/embedding too quickly). In some embodiments, additive or multiplicative updates may be carried out to update the driver representation/embedding. In some embodiments, the pose prediction model may output an additive delta that may be added to the current value of the driver representation/embedding. In some embodiments, the driver representation/embedding may be updated using techniques such as model agnostic meta learning (MAML), in which the machine learning model may be trained to apply gradient-based updates to the driver embedding.
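As a non-limiting illustration of the additive-delta update described above, the following Python sketch applies a delta to the driver embedding with a simple magnitude clip standing in for the penalisation against changing the embedding too quickly. The clip threshold is an assumed example value, not part of the original description.

```python
import numpy as np

def update_driver_embedding(embedding, delta, max_step=0.1):
    """Apply an additive delta to the driver embedding, clipping its
    magnitude so the embedding cannot change too quickly."""
    norm = np.linalg.norm(delta)
    if norm > max_step:
        delta = delta * (max_step / norm)  # rescale to the allowed step size
    return embedding + delta

emb = np.zeros(32)
big_delta = np.ones(32)          # norm > max_step, so it gets rescaled
emb = update_driver_embedding(emb, big_delta)
```

A trained penalty term (or a MAML-style learned update) would replace the hard clip in practice; the clip merely illustrates the bounded-change behaviour.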
[0068] In some embodiments, the pose prediction model 100 may comprise a fourth MLP 256 (also termed a "pose selection mlp"). The fourth MLP 256 may include feed forward layers and may have 2 to 5 hidden layers with a feature width of 32 to 256 in intermediate layers. In some embodiments, the fourth MLP 256 may receive an output from the LSTM 240 and output the predicted optimal pose 140.
[0069] In some embodiments, the pose prediction model 100 may comprise a fifth MLP 264. The fifth MLP 264 may include feed forward layers and may have 2 to 5 hidden layers with a feature width of 32 to 256 in intermediate layers. In some embodiments, the fifth MLP 264 may receive one or more interactions received from the driver as input and output an embedding thereof. The output of the fifth MLP 264 may also be fed into the concatenation module 224 and concatenated with the driver representation 108, the output from CNN 208 and the output from the first MLP 216.
[0070] As an example, the architecture or structure of the pose prediction model 100 (also termed a "pose selection network") may be expressed in pseudocode as follows:

struct PoseSelectionNetwork:
    default_driver_embedding: Array[DRIVER_EMBEDDING_SIZE]
    initial_lstm_state: (Array[64], Array[64])
    context_cnn: [
        Conv2D(
            in_channels=C,
            out_channels=32,
            kernel_size=[3, 3],
            padding="SAME"
        ),
        ReLU(),
        Conv2D(
            in_channels=32,
            out_channels=32,
            kernel_size=[3, 3],
            padding="SAME"
        ),
        ReLU(),
        Flatten(),
        MLP(
            in_channels=(L * W * 32),
            out_channels=32
        )
    ]
    pose_selection_mlp: [
        MLP(
            in_channels=POSE_CHANNELS,
            out_channels=32
        ),
        ReLU()
    ]
    current_pose_mlp: [
        MLP(
            in_channels=POSE_CHANNELS,
            out_channels=32
        ),
        ReLU()
    ]
    lstm_input_mlp: MLP(
        in_channels=(32 + 32 + 32 + DRIVER_EMBEDDING_SIZE),
        out_channels=64
    )
    lstm: LSTM(channels=64)
    output_pose: MLP(
        in_channels=64,
        out_channels=POSE_CHANNELS
    )
    output_driver_embedding: MLP(
        in_channels=64,
        out_channels=DRIVER_EMBEDDING_SIZE
    )

[0071] As an example, a forward propagation of the pose prediction model 100 may be expressed in pseudocode as follows:

func pose_selection_network_forward(
    network: PoseSelectionNetwork,
    driver_embedding: Array[DRIVER_EMBEDDING_SIZE],
    lstm_state: (Array[64], Array[64]),
    context: Array[L, W, C],
    current_pose: PoseSpecification,
    selected_pose: Option[PoseSpecification]
) -> (
    PoseSpecification,
    Array[DRIVER_EMBEDDING_SIZE],
    (Array[64], Array[64])
):
    x1 = network.context_cnn(context)
    if selected_pose is not None:
        x2 = network.pose_selection_mlp(selected_pose)
    else:
        x2 = [0.0; 32]
    x3 = network.current_pose_mlp(current_pose)
    x = network.lstm_input_mlp(concat(x1, x2, x3, driver_embedding))
    lstm_state = network.lstm(x, lstm_state)
    return (
        network.output_pose(lstm_state[0]),
        network.output_driver_embedding(lstm_state[0]),
        lstm_state
    )

[0072] Fig. 3 is a schematic illustration of a method for generating a trained pose prediction model, in accordance with embodiments of the present disclosure. In some embodiments, a plurality of drivers may be recruited to perform one or more driving manoeuvres on one or more vehicle simulators 308, wherein the vehicle simulators 308 display to each driver a video generated from a virtual camera pose. Preferably, the plurality of drivers are from different geographical regions and drive different vehicle models. In general, the wider the variety of drivers, the more robust the pose prediction model may be. Preferably, the plurality of drivers have safe driving records, as the information and optimal poses collected from such drivers may potentially be more objectively informative for safely completing each driving manoeuvre. As an example, 50 to 200 drivers may be recruited from 5 to 10 different geographical regions, the drivers driving 10 to 20 different vehicle models.
[0073] In some embodiments, in step 316, perspective videos may be collected of the drivers over a plurality of episodes. An episode corresponds to the duration for which a third person pose display system (or perspective video display system) is engaged each time it is used.
An example of an episode is the performance of a single driving manoeuvre (e.g., performing a parallel parking manoeuvre) on the vehicle simulator. The perspective videos collected may comprise information on the pose or viewpoint displayed to the drivers at each time point, as well as information on any interactions the driver may have made to adjust the pose or viewpoint of the perspective video displayed to them.
[0074] In some embodiments, in step 324, the optimal pose may be annotated for the frames of the perspective videos collected in step 316. In some embodiments, the optimal pose may be the pose of the perspective video displayed to the driver at each time point. In some embodiments, the optimal pose may be an implied optimal pose based on the interactions from the driver adjusting the pose or viewpoint of the perspective video displayed to them. In some embodiments, the optimal pose may be a refined pose obtained by having each driver perform offline adjustments to the pose after the driving manoeuvre is completed.
Once the frames have been annotated with the optimal pose, the annotated frames may be accumulated to generate a training dataset of optimal poses collected from a plurality of drivers. As an example, the training dataset may comprise the following frame and episode types:

struct FrameWithLabel:
    context: ContextGrid
    driver_id: UUID
    labelled_pose: PoseSpecification
    time: Timestamp

struct FrameWithInteraction:
    context: ContextGrid
    driver_id: UUID
    impressed_pose: PoseSpecification
    selected_pose: Option[PoseSpecification]
    time: Timestamp

struct Episode[FrameType]:
    frames: List[FrameType]
    driver_id: UUID

[0075] For example, the training dataset may comprise between 10,000 and 100,000 frames with labels and 100,000 to 1,000,000 frames with interactions, across 5,000 to 50,000 episodes.
[0076] In some embodiments, in step 332, supervised training of an untrained pose prediction model may be carried out using the training dataset generated in steps 316 and 324. Supervised training may be carried out with the following steps: i) Initializing a driver representation. The driver representation may be a driver embedding. In some embodiments, the driver representation may be a default driver representation. In some embodiments, the driver representation may be personalized or customized based on a specific driver. In some embodiments, the driver representation may be an updated driver representation generated from a preceding iteration.
ii) Receiving contextual information about a scene of a vehicle and a current pose. The contextual information may be received from a vehicle simulator on which the supervised training is carried out. In a first iteration, the current pose may be randomly initialized. In subsequent iterations, the current pose may be the optimal pose predicted in the preceding iteration.
iii) With the training pose prediction model, determining features and predicting an optimal pose based on the contextual information, the current pose, and the driver representation. The features determined may then be fed back into the pose prediction model for subsequent iterations. This allows the historical information to be accounted for when determining the optimal pose.
iv) Adjusting the pose prediction model based on a loss function that enforces consistency between the predicted optimal pose and the optimal poses of the training dataset. An example of a loss function is the L2 loss or squared error loss, which is the squared difference between a prediction and the actual value, calculated for each example in a dataset.
v) Repeating the above steps until a predetermined condition is met (e.g. number of iterations or epochs completed).
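As a non-limiting illustration of the L2 (squared error) loss mentioned in step iv), computed over the pose channels:

```python
def l2_loss(predicted, target):
    """Squared-error (L2) loss between a predicted pose and an
    annotated optimal pose, summed over the pose channels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# 4-channel poses (3 rotation parameters + 1 zoom), matching the
# POSE_CHANNELS example earlier in the description.
predicted_pose = [0.1, 0.2, 0.0, 1.0]
optimal_pose = [0.0, 0.2, 0.0, 1.5]
loss = l2_loss(predicted_pose, optimal_pose)  # 0.1**2 + 0.5**2 = 0.26
```

In training this per-frame loss is accumulated over an episode and backpropagated to adjust the pose prediction model.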
[0077] In some embodiments, before training, all the weights and biases may be initialized using Xavier or He initialisation, as described in https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_uniform_. In some embodiments, supervised training may be carried out using the well-known Adam optimizer with default parameters (learning rate = 0.001, beta 1 = 0.9, beta 2 = 0.999, eps = 1e-08), as described in https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam. In some embodiments, supervised training may be carried out over 100 epochs, wherein one epoch comprises a single pass through the entire training data. As an example, supervised training may be expressed in pseudocode as follows:

func execute_episode_supervised(
    network: PoseSelectionNetwork,
    driver_embedding: Option[Array[DRIVER_EMBEDDING_SIZE]],
    episode: Episode[FrameWithLabel]
) -> Scalar:
    lstm_state = network.initial_lstm_state
    if driver_embedding is None:
        driver_embedding = network.default_driver_embedding
    loss = 0.0
    for frame in episode.frames:
        // For the first stage of training, current_pose should be
        // ignored.
        current_pose = "random pose"
        output_pose, output_driver_embedding, lstm_state = (
            pose_selection_network_forward(
                network,
                driver_embedding,
                lstm_state,
                frame.context,
                current_pose,
                selected_pose=None
            )
        )
        // Since selected_pose is None, the network should
        // have a preference to not change the driver embedding
        loss += l2_norm(driver_embedding - output_driver_embedding)
        loss += l2_norm(frame.labelled_pose - output_pose)
    return loss

[0078] In some embodiments, after supervised training is completed in step 332, a first pose prediction model 340 may be generated. This first pose prediction model 340 may be deployed to predict an optimal pose based on the contextual information, the current pose, and the driver representation/embedding. In some embodiments, the first pose prediction model 340 generated may be refined or improved upon through feedback training.
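As a non-limiting illustration, a single Adam update with the default parameters quoted above (learning rate 0.001, beta 1 = 0.9, beta 2 = 0.999, eps = 1e-08) may be sketched in Python as follows; the parameter and gradient values are arbitrary examples.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially decayed first/second moment
    estimates with bias correction, then a scaled gradient step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

w = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
# On the very first step, Adam moves the parameter by almost exactly lr.
w, m, v = adam_step(w, np.array([2.0]), m, v, t=1)
```

This is the update torch.optim.Adam applies per parameter; in practice the optimizer library is used rather than a hand-rolled step.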
[0079] Fig. 4 is a schematic illustration of a method 400 for refining an initial pose prediction model, in accordance with embodiments of the present disclosure. In some embodiments, method 400 may be used to refine first pose prediction model 340 generated in method 300. In some embodiments, method 400 may be carried out with a driver on a vehicle or a vehicle simulator 308.
[0080] In some embodiments, method 400 may comprise step 416 wherein the model is run online.
[0081] Fig. 5 is a schematic illustration of a method 500 of running a pose prediction model online, in accordance with embodiments of the present disclosure. In some embodiments, the pose prediction model may be run online on a vehicle simulator 308. An example of a vehicle simulator that may be used is CARLA, which is available at https://carla.org/.
[0082] According to some embodiments, method 500 may comprise step 504 in which an image frame is initiated. In some embodiments, frame initiation may comprise a timer running at a fixed frequency such that the model is re-executed based on the current data each time the timer triggers.
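As a non-limiting illustration of the fixed-frequency frame timer described above, the following Python sketch re-executes a step function each time the timer period elapses; the frequency and the step function are assumed example values.

```python
import time

def run_at_fixed_frequency(step_fn, hz, num_frames):
    """Call step_fn once per timer period, sleeping until the next
    deadline so the loop runs at (approximately) the given frequency."""
    period = 1.0 / hz
    results = []
    next_deadline = time.monotonic()
    for _ in range(num_frames):
        results.append(step_fn())       # re-execute the model on current data
        next_deadline += period
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return results

# Toy step function standing in for one model execution per frame.
frames = run_at_fixed_frequency(lambda: "frame", hz=100, num_frames=3)
```

In the actual system, step_fn would gather the current contextual information and pose and run the pose prediction model (steps 508 and 516).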
[0083] According to some embodiments, method 500 may comprise step 508 of receiving contextual information and a current pose based on the frame initiated in step 504. The contextual information may be continuously obtained from the vehicle simulator. At each time point, the pose prediction model may be executed in step 516, based on the contextual information and current pose received in step 508, as well as an initialized driver representation 108. In some embodiments, the driver representation 108 may be a default driver representation if this is the first time the model is run for a specific driver. In some embodiments, the driver representation 108 may be a driver representation personalized or customized for the specific driver. In some embodiments, the driver representation 108 may be an updated driver representation 108 generated in a preceding iteration.
[0084] In some embodiments, method 500 may comprise step 524, wherein pose smoothing is carried out. In some embodiments, pose smoothing may combine a user or driver interaction with the network prediction. For example, the well-known Butterworth filter may be used for pose smoothing.
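As a non-limiting illustration, a simple first-order low-pass filter (a simpler stand-in for the Butterworth filter mentioned above) may be sketched in Python as follows; the smoothing factor is an assumed example value.

```python
def smooth_pose(prev_smoothed, new_pose, alpha=0.2):
    """First-order low-pass: blend the newly predicted pose with the
    previously displayed pose so the virtual camera moves gradually."""
    return [(1 - alpha) * p + alpha * n for p, n in zip(prev_smoothed, new_pose)]

# 4-channel poses (3 rotation parameters + 1 zoom).
displayed = [0.0, 0.0, 0.0, 1.0]
target = [1.0, 0.0, 0.0, 2.0]
for _ in range(3):
    displayed = smooth_pose(displayed, target)
```

After three iterations the displayed pose has moved roughly halfway toward the target, illustrating how smoothing prevents abrupt jumps of the rendered viewpoint.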
[0085] In some embodiments, method 500 may comprise step 532 wherein one or more perspective images are rendered based on the output pose generated (with or without smoothing). The perspective images may be rendered based on any known methods used for rendering third person viewpoints to generate a perspective video. For example, reprojection-based solutions may be used, wherein there is an assumption that there is an estimate of the distance to the surroundings. Given a pose of an actual camera, a pose of the desired viewpoint, a depth representation (e.g., a 3D mesh) of the surroundings and intrinsic calibration of the actual camera, reprojection involves projecting the camera red-green-blue (RGB) pixel data onto a mesh, then projecting the augmented mesh onto the virtual camera.
In some embodiments, simple heuristics and/or monocular depth estimation may be used to determine the 3D mesh. In another example, neural network-based viewpoint rendering may be employed. Examples of methods for rendering third person viewpoints may be found in UK Patent Application No. GB 2212212.1 entitled "Method and Device for Generating An Outsider Perspective Image and Method of Training a Neural Network" filed 23 August 2022 and European patent application No. EP 22188656 entitled "System and Apparatus Suitable for Use With a Hypernetwork in Association With Neural Radiance Fields (Nerf) Related Processing, and a Processing Method in Association Thereto" filed 4 August 2022, the disclosures of which are incorporated herein by reference in their entirety. In some embodiments, the images rendered in step 532 may then be used in the frame initiation in step 504 for the next iteration.
[0086] In some embodiments, method 500 may comprise step 536 in which one or more interactions are received from the driver. In some embodiments, once the perspective image frame (or video) is initiated in step 504 and displayed to the driver, the driver may provide one or more interactions to adjust the video displayed to the driver such that the video displayed is from their preferred optimal pose. In some embodiments, if one or more interactions are received from the driver in step 536, the received one or more interactions may be fed as input into the pose prediction model when it is executed in step 516. In some embodiments, if one or more interactions are received from the driver in step 536, the received one or more interactions may be fed as input when pose smoothing is carried out in step 524. For example, if the driver specifies a pose (through provision of interactions), such pose would be treated as the correct pose. Otherwise, a linear combination of the current pose and a predicted pose output from the pose prediction model may be used. For example, if the pose prediction model predicts a first pose A and the driver selects a second pose B (through providing interactions), the current pose displayed may be second pose B and may slowly drift back to first pose A over time.
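As a non-limiting illustration of the drift behaviour described above, where a driver-selected pose B slowly returns to the predicted pose A, a linear blend may be sketched in Python as follows; the drift rate is an assumed example value.

```python
def blend_toward_prediction(driver_pose, predicted_pose, t, drift_rate=0.1):
    """Linear blend that starts at the driver-selected pose and drifts
    toward the model's predicted pose as time t grows."""
    w = min(1.0, drift_rate * t)  # blend weight for the prediction
    return [(1 - w) * b + w * a for b, a in zip(driver_pose, predicted_pose)]

pose_a = [0.0, 0.0, 0.0, 1.0]   # predicted pose A
pose_b = [1.0, 0.0, 0.0, 2.0]   # driver-selected pose B
at_start = blend_toward_prediction(pose_b, pose_a, t=0)
much_later = blend_toward_prediction(pose_b, pose_a, t=100)
```

At t = 0 the displayed pose equals the driver's selection; as t grows the blend weight saturates and the display returns fully to the predicted pose.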
[0087] Referring to Fig. 4, once the model is run online in step 416, feedback training may be carried out in step 424 to generate a subsequent or more refined pose prediction model 432. Pose prediction model 432 may be deployed to predict an optimal pose based on the contextual information, the current pose, the driver representation/embedding, and optionally, one or more interactions received from the driver.
[0088] For example, method 400 of refining an initial pose prediction model may comprise the following steps: i) Initializing a driver representation. The driver representation may be a driver embedding. In some embodiments, the driver representation may be a default driver representation. In some embodiments, the driver representation may be personalized or customized based on a specific driver. In some embodiments, the driver representation may be an updated driver representation generated in a preceding iteration.
ii) Receiving contextual information about a scene of a vehicle and a current pose. The contextual information may be received from a vehicle simulator or a vehicle on which the feedback training is carried out. For a vehicle simulator, in a first iteration, the current pose may be randomly initialized. In subsequent iterations, the current pose may be the optimal pose (with or without pose smoothing) predicted in the preceding iteration.
iii) With the training pose prediction model, determining features and predicting an optimal pose based on the contextual information, the current pose, and the driver representation. The features determined may then be fed back into the pose prediction model for subsequent iterations.
iv) Synthesising and displaying a perspective video based on the predicted optimal pose (with or without pose smoothing). Any method of pose or viewpoint synthesis may be employed.
v) If no interactions are received, repeat the above steps until the method is terminated (e.g., number of iterations or epochs completed).
vi) If one or more interactions are received from the driver adjusting the synthesised and displayed perspective video, determining an implied optimal pose based on the adjusted perspective video and updating the pose prediction model and/or driver representation based on a loss function that enforces consistency between the predicted optimal pose and the implied optimal pose (also known as feedback loss), and an embedding loss. The implied optimal pose refers to the camera pose displayed on the synthesised and displayed perspective video after the receipt of one or more interactions from the driver adjusting the synthesised and displayed perspective video. In some embodiments, prediction of an optimal pose based on the pose prediction model may be paused and/or ignored for the duration of the driver providing one or more interactions adjusting the synthesised and displayed perspective video. In some embodiments, when the driver is adjusting the synthesised and displayed perspective video, the adjustments of the driver may determine the camera pose used to render the perspective image/video on the display. In some embodiments, when the driver has stopped adjusting the synthesised and displayed perspective video, the adjusted displayed camera pose of the perspective video may be used as the input into the pose prediction model as the current camera pose for subsequent iterations.
vii) Repeat the above steps until the method is terminated (e.g., number of iterations or epochs completed).
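The loss used in steps v) and vi) above (an embedding-consistency term when no interaction is received, and a feedback term enforcing pose consistency when one is) can be sketched in Python. The frame layout and field names below are assumptions for illustration, not names from the specification:

```python
def l2_norm_sq(a, b):
    # squared L2 distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def episode_feedback_loss(frames):
    """Accumulate the episode loss: an embedding loss when no interaction was
    received, a feedback loss when the driver adjusted the displayed pose."""
    loss = 0.0
    for f in frames:
        if f["implied_optimal_pose"] is None:
            # no interaction: penalise drift of the driver embedding
            loss += l2_norm_sq(f["embedding"], f["updated_embedding"])
        else:
            # interaction received: enforce consistency between the
            # predicted optimal pose and the implied optimal pose
            loss += l2_norm_sq(f["predicted_pose"], f["implied_optimal_pose"])
    return loss

frames = [
    {"implied_optimal_pose": None, "embedding": [1.0, 2.0],
     "updated_embedding": [1.0, 2.5], "predicted_pose": [0.0, 0.0]},
    {"implied_optimal_pose": [1.0, 0.0], "embedding": [1.0, 2.0],
     "updated_embedding": [1.0, 2.0], "predicted_pose": [0.0, 0.0]},
]
total = episode_feedback_loss(frames)  # 0.25 (embedding) + 1.0 (feedback)
```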
[0089] In some embodiments, feedback training may be carried out using the well-known Adam optimizer with default parameters (learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, eps = 1e-08), as described at https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam. In some embodiments, feedback training may be carried out over 100 epochs, wherein one epoch comprises a single pass through the entire training data. As an example, feedback training may be expressed in pseudocode as follows:

func execute_episode_with_feedback(
    network: PoseSelectionNetwork,
    driver_embedding: Option[Array[DRIVER_EMBEDDING_SIZE]],
    episode: Episode[FrameWithFeedback]
) -> (scalar, Array[DRIVER_EMBEDDING_SIZE]):
    lstm_state = network.initial_lstm_state
    if driver_embedding is None:
        driver_embedding = network.default_driver_embedding
    loss = 0.0
    for frame in episode.frames:
        output_pose, output_driver_embedding, lstm_state = (
            pose_selection_network_forward(
                network, driver_embedding, lstm_state,
                frame.context, current_pose=frame.impressed_pose,
                frame.selected_pose
            )
        )
        if frame.selected_pose is None:
            loss += l2_norm(driver_embedding - output_driver_embedding)
        else:
            driver_embedding = output_driver_embedding
            loss += l2_norm(frame.selected_pose - output_pose)
    return (loss, driver_embedding)

[0090] As an example, the overall process of training the pose prediction model may be expressed in pseudocode as follows:

func train_network[FrameType](
    data: List[Episode[FrameType]],
    network: PoseSelectionNetwork
):
    // n_train_iter may be set to 10^6
    for i_train_iter = 0..n_train_iter:
        episode_1 = data[random_int(0, length(data))]
        episode_2 = (
            "choose another episode with the same driver_id as episode_1"
        )
        if FrameType == FrameWithLabel:
            (loss_1, driver_embedding) = execute_episode_supervised(
                network, driver_embedding=None, episode_1
            )
            (loss_2, _) = execute_episode_supervised(
                network, driver_embedding, episode_2
            )
            loss = loss_1 + loss_2
        else:  // FrameWithFeedback
            (loss_1, driver_embedding) = execute_episode_with_feedback(
                network, driver_embedding=None, episode_1
            )
            (loss_2, _) = execute_episode_with_feedback(
                network, driver_embedding, episode_2
            )
            loss = loss_1 + loss_2
        // ADAM learning rate may be set to 10^-4,
        // other ADAM parameters may be set to default values
        adam_optim(network, loss)

[0091] As an example, the overall process may be expressed in pseudocode as follows:

func overall_process():
    first_dataset: List[Episode[FrameWithLabel]] = (
        "collect labelled data from driver cohort"
    )
    // use standard Xavier initialization
    network = PoseSelectionNetwork()
    train_network[FrameWithLabel](first_dataset, network)
    interaction_dataset: List[Episode[FrameWithInteraction]] = []
    for i_stage = 0..n_stage:
        if i_stage == 0:
            // In the first iteration, the network is not yet trained
            // to make use of feedback data.
            // We use smoothing to make the experience less frustrating
            // for the driver.
            use_smoothing = True
        else:
            use_smoothing = False
        for driver_id in driver_ids:
            (driver_embedding, data) = deploy_network(
                network, driver_id, driver_embedding=None, use_smoothing
            )
            interaction_dataset.append(data)
            (_, data) = deploy_network(
                network, driver_id, driver_embedding, use_smoothing
            )
            interaction_dataset.append(data)
        train_network[FrameWithInteraction](
            interaction_dataset, network
        )

[0092] In some embodiments, first pose prediction model 340 and/or subsequent pose prediction model 432 may be deployed within a vehicle. The deployment of the pose prediction model may be carried out with the following steps: i) Detecting a driver. In some embodiments, an in-cabin camera may be used to detect a driver within a vehicle. In some embodiments, face recognition and/or determination may be carried out to determine an identity of the driver. For example, the model available at https://github.com/ageitgey/face_recognition may be used. In some embodiments, sonar or low power radar, or even lidar/time-of-flight (TOF) camera/stereovision data may be used to detect and recognize or identify a driver within a vehicle. In some embodiments, the driver for teleoperation may be detected. In some embodiments, the driver may provide and/or input an identification reference.
ii) Initialising a driver representation based on the driver. In some embodiments, if the driver is a new driver that is driving the vehicle for a first time, the driver representation may be a default driver representation. In some embodiments, if the driver is an existing driver (i.e., a driver who has previously driven the vehicle and/or is associated with a previously generated driver representation), the driver representation may be retrieved from a database based on the identity of the driver.
iii) Receiving contextual information about a scene of the vehicle and a current pose displayed to the driver. The contextual information may be obtained from surround view camera data.
iv) With first pose prediction model 340 or subsequent pose prediction model 432, determining features and predicting an optimal pose based on the contextual information, current pose, and the driver representation; v) Synthesizing and displaying a perspective video on the display based on the predicted optimal pose; vi) Performing steps iii) to v) until one or more interactions are received from the driver adjusting the displayed perspective video or until the method is terminated, wherein in step iii) the current pose is the predicted optimal pose of a preceding iteration, and wherein step iv) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations; vii) receiving one or more interactions from the driver adjusting the displayed perspective video; viii) determining an implied optimal pose based on the adjusted perspective video; ix) updating the driver representation based on the implied optimal pose to generate an updated driver representation; and x) performing steps iii) to vi) or iii) to ix) until the method is terminated, wherein the driver representation used is the updated driver representation.
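Steps i) and ii) of the deployment (identifying the driver and initialising a representation) can be sketched as follows; the database layout, embedding size, and function name are illustrative assumptions, not part of the specification:

```python
DEFAULT_EMBEDDING = [0.0] * 8  # assumed driver-embedding size

def initialise_driver_representation(driver_id, embedding_db):
    """Return the stored embedding for an existing driver, or the
    default driver representation for a driver seen for the first time."""
    if driver_id in embedding_db:
        # existing driver: retrieve the previously generated representation
        return list(embedding_db[driver_id])
    # new driver: fall back to the default representation
    return list(DEFAULT_EMBEDDING)

db = {"driver-42": [0.3] * 8}  # hypothetical per-driver embedding database
existing = initialise_driver_representation("driver-42", db)
new = initialise_driver_representation("driver-99", db)
```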
[0093] As an example, the deployment of the pose prediction model 100 may be expressed in pseudocode as follows:

func deploy_network(
    network: PoseSelectionNetwork,
    driver_id: UUID,
    driver_embedding: Option[Array[DRIVER_EMBEDDING_SIZE]],
    use_smoothing: bool
) -> Episode[FrameWithInteraction]:
    lstm_state = network.initial_lstm_state
    if driver_embedding is None:
        driver_embedding = network.default_driver_embedding
    frames: List[FrameType] = []
    current_pose = "default pose"
    while "driver has not completed task":
        "set the simulator to use current_pose"
        context = "get context from simulator"
        selected_pose = (
            "get pose requested from user, if a gesture was provided, else None"
        )
        output_pose, output_driver_embedding, lstm_state = (
            pose_selection_network_forward(
                network, driver_embedding, lstm_state,
                context, current_pose, selected_pose
            )
        )
        if selected_pose is not None:
            driver_embedding = output_driver_embedding
            current_pose = selected_pose
        else if use_smoothing:
            current_pose = 0.9 * current_pose + 0.1 * output_pose
        else:
            current_pose = output_pose
        frame = FrameWithInteraction(
            context, driver_id, impressed_pose=current_pose,
            selected_pose, time="get current time"
        )
        frames.append(frame)
    episode = Episode[FrameWithInteraction](
        frames, driver_id
    )
    return episode

[0094] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Claims (15)

  1. 1. A computer-implemented method for training an untrained pose prediction model using a training dataset of optimal poses collected from a plurality of drivers in order to obtain a pose prediction model for the synthesis and display of a perspective video for a driver, the method comprising: a) initialising a driver representation (108); b) receiving contextual information (124) about a scene of a vehicle and a current pose (124); c) with the pose prediction model (100), determining features (156) and predicting an optimal pose (140) based on the contextual information, the current pose, and the driver representation; d) adjusting the pose prediction model based on a loss function that enforces consistency between the predicted optimal pose and the optimal poses of the training dataset; and e) repeating steps b) to d) until the method is terminated, wherein in step b) the current pose is the predicted optimal pose of a preceding iteration, and wherein step c) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
  2. 2. The computer-implemented method of claim 1, wherein the training dataset is generated by having a plurality of drivers perform one or more driving manoeuvres on a vehicle simulator, wherein the vehicle simulator displays to the driver a perspective video from a virtual camera pose, and wherein the plurality of drivers are preferably from different geographical regions and drive different vehicle models, and more preferably have safe driving records.
  3. 3. The computer-implemented method of claim 2, wherein the training dataset comprises an optimal pose for each driving manoeuvre, wherein the optimal pose is based on the displayed perspective video or, if the perspective video is adjusted by the driver, an implied optimal pose determined based on the perspective video adjusted by the driver.
  4. 4. The computer-implemented method of claim 2 or 3, wherein the optimal pose is refined by having each driver perform offline adjustments to the pose after the driving manoeuvre is completed.
  5. 5. The computer-implemented method of any of the preceding claims, wherein the contextual information about a scene of the vehicle comprises one or more of: an intermediate environment representation, a sliding window of a history of vehicle parameters, accelerometer data, location information of the vehicle, and information derived from an in-cabin camera.
  6. 6. The computer-implemented method of any of the preceding claims, wherein the contextual information about a scene of the vehicle comprises a bird's-eye-view (BEV) grid map comprising information on one or more terrain classes comprising: a road, a sidewalk, an obstacle, an ego-vehicle, and other vehicles.
  7. 7. The computer-implemented method of any of the preceding claims, further comprising step f) of refining the pose prediction model by carrying out the following steps with a driver on a vehicle or a vehicle simulator: i) initialising a driver representation based on the driver; ii) receiving contextual information about the scene of the vehicle and a current pose displayed to the driver; iii) with the pose prediction model, determining features and predicting an optimal pose based on the contextual information, the current pose, and the driver representation; iv) synthesising and displaying a perspective video based on the predicted optimal pose; v) performing steps ii) to iv) until one or more interactions are received from the driver adjusting the displayed video or until the method is terminated, wherein in step ii) the current pose is the predicted optimal pose of a preceding iteration, and wherein step iii) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
  8. 8. The computer-implemented method of claim 7, wherein step f) further comprises: vi) receiving one or more interactions from the driver adjusting the displayed perspective video; vii) determining an implied optimal pose based on the adjusted perspective video; viii) updating the pose prediction model and driver representation based on a loss function that enforces consistency between the predicted optimal pose and the implied optimal pose, and an embedding loss; and ix) repeating steps ii) to v) or steps ii) to viii) until the method is terminated, wherein in step ii) the current pose is the predicted optimal pose of a preceding iteration or, if one or more interactions are received from the driver, the current pose is the implied optimal pose of the preceding iteration.
  9. 9. The computer-implemented method of any of claims 7 or 8, wherein smoothing is used during one or more iterations of steps ii) to v).
  10. 10. The computer-implemented method of any of the preceding claims, wherein the pose prediction model comprises: a convolutional neural network (CNN) (208) configured to receive as input the contextual information; a first multi-layer perceptron (MLP) (216) configured to receive as input the current pose; a concatenation module (224) configured to receive as input the driver representation, an output from the convolutional neural network, and an output from the first multi-layer perceptron (MLP); a second multi-layer perceptron (MLP) (232) configured to receive as input an output from the concatenation module; a sequence network (240) suitable for sequential data configured to receive as input an output from the second multi-layer perceptron (MLP) and configured to output features which are fed into the sequence network in subsequent iterations, wherein the sequence network is preferably a long short-term memory (LSTM) network; a third multi-layer perceptron (MLP) (248) configured to receive as input an output from the sequence network, and configured to output an updated driver representation which is used in subsequent iterations; a fourth multi-layer perceptron (MLP) (256) configured to receive as input the output from the sequence network, and configured to output the predicted optimal pose; and optionally, a fifth multi-layer perceptron (MLP) (264) configured to receive as input one or more interactions from the driver, and the concatenation module is further configured to receive as input an output from the fifth multi-layer perceptron (MLP).
  11. 11. A pose prediction model obtainable by performing a method according to any of the preceding claims.
  12. 12. A method for synthesizing and displaying a perspective video from an optimal pose on a display to a driver of a vehicle, the method comprising: 1) detecting a driver; 2) initialising a driver representation based on the driver; 3) receiving contextual information about a scene of the vehicle and a current pose displayed to the driver; 4) with the pose prediction model of claim 11, determining features and predicting an optimal pose based on the contextual information, current pose, and the driver representation; 5) synthesizing and displaying a perspective video on the display based on the predicted optimal pose; 6) performing steps 3) to 5) until one or more interactions are received from the driver adjusting the displayed perspective video or until the method is terminated, wherein in step 3) the current pose is the predicted optimal pose of a preceding iteration, and wherein step 4) of determining features and predicting an optimal pose is further based on features determined in one or more preceding iterations.
  13. 13. The method of claim 12, further comprising: 7) receiving one or more interactions from the driver adjusting the displayed perspective video; 8) determining an implied optimal pose based on the adjusted perspective video; 9) updating the driver representation based on the implied optimal pose to generate an updated driver representation; and 10) performing steps 3) to 6) or 3) to 9) until the method is terminated, wherein the driver representation used is the updated driver representation.
  14. 14. A computing system for synthesising and displaying a perspective video from a predicted optimal pose for a driver, the computing system comprising one or more displays, one or more processors and memory storing one or more programs for execution by the one or more processors, the one or more programs including instructions for carrying out a method according to claim 12 or 13.
  15. 15. A computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of any of claims 1 to 10 or 12 to 13.
GB2309612.6A 2023-06-26 2023-06-26 A computer-implemented method for training a pose prediction model for synthesis and display of perspective video Pending GB2631387A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2309612.6A GB2631387A (en) 2023-06-26 2023-06-26 A computer-implemented method for training a pose prediction model for synthesis and display of perspective video
CN202480042138.XA CN121420332A (en) 2023-06-26 2024-05-21 A computer-based method for training a pose prediction model for synthesizing and displaying perspective videos.
PCT/EP2024/063908 WO2025002671A1 (en) 2023-06-26 2024-05-21 A computer-implemented method for training a pose prediction model for synthesis and display of perspective video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2309612.6A GB2631387A (en) 2023-06-26 2023-06-26 A computer-implemented method for training a pose prediction model for synthesis and display of perspective video

Publications (2)

Publication Number Publication Date
GB202309612D0 GB202309612D0 (en) 2023-08-09
GB2631387A true GB2631387A (en) 2025-01-08

Family

ID=87517643

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2309612.6A Pending GB2631387A (en) 2023-06-26 2023-06-26 A computer-implemented method for training a pose prediction model for synthesis and display of perspective video

Country Status (3)

Country Link
CN (1) CN121420332A (en)
GB (1) GB2631387A (en)
WO (1) WO2025002671A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277651A1 (en) * 2015-03-19 2016-09-22 Gentex Corporation Image processing for camera based display system
EP3163533A1 (en) * 2015-10-28 2017-05-03 Continental Automotive GmbH Adaptive view projection for a vehicle camera system
US20180348749A1 (en) * 2017-06-06 2018-12-06 Ford Global Technologies, Llc Determination of vehicle view based on relative location
US20220080890A1 (en) * 2019-05-29 2022-03-17 Ningbo Geely Automobile Research & Development Co., Ltd. System and method for providing a desired view for a vehicle occupant

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160277651A1 (en) * 2015-03-19 2016-09-22 Gentex Corporation Image processing for camera based display system
EP3163533A1 (en) * 2015-10-28 2017-05-03 Continental Automotive GmbH Adaptive view projection for a vehicle camera system
US20180348749A1 (en) * 2017-06-06 2018-12-06 Ford Global Technologies, Llc Determination of vehicle view based on relative location
US20220080890A1 (en) * 2019-05-29 2022-03-17 Ningbo Geely Automobile Research & Development Co., Ltd. System and method for providing a desired view for a vehicle occupant

Also Published As

Publication number Publication date
GB202309612D0 (en) 2023-08-09
WO2025002671A1 (en) 2025-01-02
CN121420332A (en) 2026-01-27

Similar Documents

Publication Publication Date Title
JP7032387B2 (en) Vehicle behavior estimation system and method based on monocular video data
US10810754B2 (en) Simultaneous localization and mapping constraints in generative adversarial networks for monocular depth estimation
US20190096032A1 (en) Deep Neural Network For Image Enhancement
US11062167B2 (en) Object detection using recurrent neural network and concatenated feature map
CN110796692A (en) An end-to-end deep generative model for simultaneous localization and mapping
EP3278317B1 (en) Method and electronic device
CN117173394B (en) Weak supervision salient object detection method and system for unmanned aerial vehicle video data
JP2022513866A (en) Object classification using out-of-domain context
WO2020226696A1 (en) System and method of generating a video dataset with varying fatigue levels by transfer learning
US11776215B1 (en) Pre-labeling data with cuboid annotations
Bhalla et al. Simulation of self-driving car using deep learning
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
EP3663965A1 (en) Method for predicting multiple futures
Lim et al. Gaussian process auto regression for vehicle center coordinates trajectory prediction
EP3992909A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
Lee et al. Latency-free driving scene prediction for on-road teledriving with future-image-generation
GB2631387A (en) A computer-implemented method for training a pose prediction model for synthesis and display of perspective video
US20250108833A1 (en) Autonomous driving using predictions of trust
WO2025054080A1 (en) Warping-free motion guided feature referencing for consistent video segmentation
JP2024013226A (en) Improved virtual fields related to driving
GB2623551A (en) Systems and methods for learning neural networks for embedded applications
CN117416364A (en) Driving-related augmented virtual fields
CN117593725A (en) Vehicle collision detection method, electronic device, and computer-readable storage medium
US20250094796A1 (en) Direct depth prediction
EP4485380B1 (en) Method for determining spatial-temporal patterns related to the environment of a vehicle