
WO2025242305A1 - Creating an interactive building interior model - Google Patents

Creating an interactive building interior model

Info

Publication number
WO2025242305A1
WO2025242305A1 (PCT/EP2024/064143)
Authority
WO
WIPO (PCT)
Prior art keywords
frames
model
frame
user
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/064143
Other languages
French (fr)
Inventor
Michael Gadermayr
Christoph Hofer
Luca Debiasi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synthetic Dimension GmbH
Original Assignee
Synthetic Dimension GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Synthetic Dimension GmbH filed Critical Synthetic Dimension GmbH
Priority to PCT/EP2024/064143
Publication of WO2025242305A1
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G06V 20/176 - Urban or other man-made structures
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 - Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04815 - Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F 3/04845 - Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations for image manipulation, e.g. dragging, rotation, expansion or change of colour
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 - Computer-aided design [CAD]
    • G06F 30/10 - Geometric CAD
    • G06F 30/13 - Architectural design, e.g. computer-aided architectural design [CAAD] related to design of buildings, bridges, landscapes, production plants or roads
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 15/10 - Geometric effects
    • G06T 15/20 - Perspective computation
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2210/00 - Indexing scheme for image generation or computer graphics
    • G06T 2210/04 - Architectural design, interior design

Definitions

  • the present disclosure generally relates to the field of computer-implemented modeling of an interior room, for example for generating or composing a virtual reality representation or scene of the interior room.
  • the present disclosure relates to a computer-implemented method of creating a building interior model and providing user interaction with the building interior model.
  • the present disclosure relates to one or more devices, apparatuses, systems, computer programs and/or corresponding computer-readable media for carrying out the aforementioned method.
  • Typical examples are virtual reality (VR) or augmented reality (AR) technologies, where for example sensor data of one or more sensors, such as image sensors capturing an environment of a user, can be used to generate, supplement or enrich virtual scenes simulating or modeling the environment based on one or more computer-generated models.
  • Other examples include so-called mixed reality (MR) or extended reality (XR) technologies, where elements of real environments and virtual environments are typically merged to generate scenes or corresponding models representing new environments.
  • Non-limiting examples and exemplary use cases comprise medical applications, such as augmented reality operating theatres or assistive technology in the medical field, educational or training applications, such as flight simulators or the like, gaming industry, virtual meeting or conference rooms, and autonomous driving.
  • different hard- and/or software components may be involved, such as one or more sensors capturing sensor data, one or more computing devices for processing the sensor data or other data, one or more devices for displaying a virtual scene, and/or one or more devices allowing a user to interact with the generated virtual scene.
  • applications may require dedicated hardware, such as VR or AR glasses or goggles, for displaying or visualizing a virtual scene and allowing a user to interact with the scene
  • other applications can be run on regular computing devices or even portable user devices, such as smartphones, tablets or notebooks.
  • non-learning-based approaches of computing two- or three-dimensional models of the scene or certain aspects thereof have been proposed and used.
  • non-learning-based approaches are increasingly supplemented and/or replaced by artificial intelligence (AI) based and/or learning-based methods, such as machine learning and deep learning methods, where an artificial intelligence algorithm, engine, module or circuitry is trained with corresponding training data to model one or more particular aspects of an environment.
  • the quality, robustness and user experience may depend on the data available and used to generate or compute the models of the scene or environment, such as for example on the sensor data, on the training data used to train the AI-based module, and on other software or hardware components potentially involved in computing the models, visualizing them and allowing a user to interact with them.
  • certain environments may be particularly challenging in terms of generating realistic and high-quality virtual scenes or corresponding models at a high level of user-experience.
  • An example of such environments are interior rooms, such as rooms in a house or building.
  • Modeling an interior room may be particularly challenging in terms of accurately reconstructing the room, and optionally one or more objects arranged therein, based on generating or computing a corresponding model of the room, in terms of visualizing the reconstructed room based on or using the model, and in terms of enabling a user to interact with such virtual representation, for example to alter or modify the virtual representation of a room displayed on a user device.
  • 3D models of the interior of buildings are needed for many use-cases, such as light planning, Wi-Fi/5G/6G planning, pre-demolition planning, kitchen planning, and computer-aided facility management.
  • 3D plans (such as building information models) are often not available.
  • the digitization of buildings is typically time-consuming and expensive, requiring dedicated hardware such as laser scanners as well as post-processing based on experts.
  • a computer-implemented method of creating a building interior model and providing user interaction with the building interior model comprises: a) receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior; b) creating a 3D model of the building interior using a first subset of the plurality of 2D frames; c) determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames; d) representing the position of the capturing device for a third subset of the 2D frames in the 3D model of the building interior; e) receiving a user input indicating one of the positions of the third subset of the 2D frames; and f) displaying the 2D frame corresponding to the indicated position to the user.
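  • purely as an illustration, steps a) to f) may be sketched in Python as follows; every name below (Frame, select_first, build_model, estimate_pose, on_user_pick, markers) is a hypothetical placeholder and not part of the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    frame_id: int
    image: object        # e.g. an HxWx3 RGB array captured by the device
    timestamp: float     # capture time in seconds

def interactive_interior_model(frames, select_first, select_second, select_third,
                               build_model, estimate_pose, on_user_pick):
    """Sketch of steps a)-f); all callables are injected placeholders."""
    # a) `frames` is the plurality of 2D frames received from the capturing device
    # b) create the 3D model from a first subset of the frames
    first = select_first(frames)
    model = build_model(first)
    # c) determine the capturing-device pose in model coordinates for a second subset
    second = select_second(frames)
    poses = {f.frame_id: estimate_pose(f, model) for f in second}
    # d) represent the capture positions of a third subset as markers in the model
    third = select_third(second)
    model.markers = {f.frame_id: poses[f.frame_id] for f in third}
    # e) the user indicates one of the represented positions ...
    picked_id = on_user_pick(model.markers)
    # f) ... and the corresponding 2D frame is displayed
    return next(f for f in third if f.frame_id == picked_id)
```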
  • the disclosed method provides for time- and cost efficiently scanning, reconstructing, segmenting, visualizing, interacting with, and exporting the interior of buildings based on mobile devices.
  • a user may capture one or more interior scenes by moving a user device, such as a smartphone or tablet, in a building to acquire sensor data being representative for the building's interior.
  • a reconstruction method may start running on the mobile device, receiving the sensor data and generating a parametric 3D geometry.
  • the 3D geometry may be quickly provided to the user as feedback by showing preliminary reconstructions of the scene.
  • the user may then efficiently interact with the visualized 3D building model and, for example, enrich the model by defining object models which are associated with objects in the room.
  • the 3D building model may be enriched. Segmentation may be performed based on user prompts, for example in the form of mouse clicks.
  • the capturing device captures or provides sensor data or image data for further processing.
  • the plurality of 2D frames may refer to the totality of all the data or 2D frames captured by the capturing device when the building interior is scanned by a user.
  • the capturing device may be, for example, a camera able to capture 2D image data, either in the form of photographs or still images, or video.
  • the sensor data provided by the capturing device includes one or more of RGB sensor data (Red-Green-Blue, RGB) and multispectral sensor data.
  • the capturing device includes at least one RGB sensor and/or at least one multispectral image sensor.
  • any number of image sensors of one or more types of image sensors can be used alone or in combination to generate the data of the capturing device.
  • the terms image sensor or capturing device may be used herein to broadly refer to any type of sensor configured for acquiring electromagnetic radiation at one or more spectral wavelengths of the respective radiation.
  • the sensor data of the capturing device includes depth sensor data.
  • the at least one capturing device includes one or more depth sensors, such as for example a LiDAR sensor and/or a stereo camera.
  • depth sensor data may allow precise determination of geometrical parameters of the interior room and/or one or more objects arranged or located therein, such as dimensions of the interior room or one or more boundaries thereof.
  • using depth sensor data can further increase a quality and precision of the process of reconstructing and/or modeling the building interior.
  • the sensor data of the capturing device includes further sensor data of one or more of a gyroscope, an accelerometer, a radiation sensor, a thermal sensor, a laser sensor, an acoustic sensor, a pressure sensor, a nearfield sensor, and a capacitive sensor.
  • further sensor data of any one or more further sensors may be combined and/or merged with sensor data of the capturing device to generate the sensor data of the capturing device.
  • the data captured or provided by the capturing device is then used to create a 3D model of the building interior.
  • Any suitable known method for creating such models may be used.
  • any of the methods described in any one of WO 2023/174556 A1, WO 2023/174555 A1, WO 2023/174559 A1, WO 2023/174562 A1, or WO 2023/174561 A1 may be used.
  • point clouds may be extracted from the 2D frames and merged to create a global point cloud, which may then be segmented and annotated to identify objects in the building interior; these objects may then be represented by parametric models in the 3D model.
  • the 3D model is created from a first subset of the plurality of 2D frames.
  • the first subset may comprise all 2D frames of the plurality of 2D frames, or it may comprise fewer 2D frames than the plurality of 2D frames.
  • the first subset of 2D frames may be filtered from the plurality of 2D frames according to their suitability for creating the 3D model. Possible criteria for selecting the first subset of 2D frames from the plurality of 2D frames may be viewing angle, visibility of certain objects in the building interior or parts of the building interior, sharpness, distance of the capturing device from certain objects in the building interior or parts of the building interior, coverage of the building interior, for example by overlap, and overall image quality.
  • 2D frames may be captured a predefined number of milliseconds apart, or after the capturing device has moved a certain distance or rotated by a certain angle.
  • the first subset may be similarly chosen, for example by using a higher number of milliseconds or a greater movement or rotation between 2D frames of the first subset than in the plurality of 2D frames.
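  • a minimal sketch of such a motion- and time-based selection, reusing the hypothetical Frame structure from the sketch above and assuming per-frame poses are available; all thresholds are illustrative, not prescribed by the disclosure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Pose:
    R: np.ndarray   # 3x3 rotation matrix mapping camera axes to model axes
    t: np.ndarray   # 3-vector: capturing-device position in model coordinates

def rotation_angle_deg(R_a, R_b):
    """Angle of the relative rotation between two rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def select_first_subset(frames, poses, min_dt=0.5, min_move=0.25, min_rot=10.0):
    """Keep a frame only if enough time has passed OR the device moved/rotated
    enough since the last kept frame (seconds, metres, degrees)."""
    kept = [frames[0]]
    for f in frames[1:]:
        prev = kept[-1]
        moved = np.linalg.norm(poses[f.frame_id].t - poses[prev.frame_id].t)
        rotated = rotation_angle_deg(poses[prev.frame_id].R, poses[f.frame_id].R)
        if (f.timestamp - prev.timestamp >= min_dt
                or moved >= min_move or rotated >= min_rot):
            kept.append(f)
    return kept
```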
  • a spatial position of the capturing device in the 3D model of the building interior is determined for a second subset of the plurality of 2D frames.
  • the spatial position of the capturing device refers to the position the capturing device was in when capturing the respective 2D frame, transferred into the 3D model. In other words, the position the capturing device would need to be in to capture a 2D frame of the 3D model representing the same part of the building interior as the 2D frame is determined.
  • the spatial position therefore refers to a point or a volume in the coordinate system of the 3D model.
  • the spatial position of the capturing device may be determined from the 2D frames.
  • the position may be determined relative to the reconstruction of the building interior in the 3D model based on the perspective of the 2D frames and their registration/overlap with each other.
  • determining the position may also comprise further sensor data, for example from a gyroscope, a compass, and/or a GNSS receiver, for example a GPS receiver.
  • the second subset of 2D frames may comprise all 2D frames of the plurality of 2D frames, or the second subset may comprise fewer 2D frames than the plurality of 2D frames.
  • the second subset may comprise all of the 2D frames of the first subset, or the second subset may comprise fewer 2D frames than the first subset.
  • Possible criteria for selecting the second subset of 2D frames from the plurality of 2D frames or the first subset may again be viewing angle, visibility of certain objects in the building interior or parts of the building interior, sharpness, distance of the capturing device from certain objects in the building interior or parts of the building interior, coverage of the building interior, for example by overlap, and overall image quality.
  • the second subset may be selected with a view to presenting the user with the best overview of the building interior. Therefore, the second subset contains 2D frames providing a representative view of objects and/or parts of the building interior, particularly for a human beholder.
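  • for the image-quality criterion, a cheap sharpness proxy such as the variance of the Laplacian may be used; a sketch (the keep ratio is an illustrative assumption):

```python
import cv2

def sharpness(image_bgr):
    """Variance of the Laplacian: higher values indicate sharper frames."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_by_sharpness(frames, keep_ratio=0.5):
    """Rank candidate frames by sharpness and keep the best fraction."""
    ranked = sorted(frames, key=lambda f: sharpness(f.image), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```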
  • a third subset is chosen.
  • the third subset may comprise all 2D frames of the second subset, or the third subset may comprise fewer 2D frames than the second subset.
  • representative 2D frames may be chosen from the second subset and may form the third subset.
  • the previously determined spatial position or position of the capturing device is then represented in the 3D model of the building interior for the third subset of the 2D frames.
  • the representation may comprise a small 3D or 2D object or icon placed at the position of the capturing device in the 3D model.
  • a user immediately recognizes where the capturing device captured a 2D frame of the building interior and therefore immediately knows whether or not the respective 2D frame may be relevant, for example for an inspection of the 3D model or the underlying building interior or for a modification of the 3D model.
  • the number of 2D frames may be reduced in the third subset in comparison to the second subset, for example to avoid cluttering the 3D model with the representations of the positions of the capturing device in the 3D model.
  • the user may then indicate one of the positions of the capturing device represented in the 3D model by user input.
  • the user may click on one of the representations of the positions of the capturing device in the 3D model, for example by using a mouse or tapping a touchscreen.
  • the 2D frame corresponding to the position selected and indicated by the user is then displayed, for example together with or next to the 3D model.
  • the displayed 2D frame may be displayed in full resolution or scaled to a resolution so as to be easily fully viewable but not obstructing other parts of the screen, for example the 3D model.
  • the displayed 2D frame may be changed whenever the user so desires by selecting or indicating another representation of the position of the third subset of the 2D frames, for example by clicking on another representation.
  • Such a user input again leads to displaying the selected 2D frame, for example by replacing the previously displayed 2D frame.
  • the user can easily jump through different 2D frames representing different views, for example of objects in the building interior or different parts of the building interior.
  • the 3D model may also be displayed together with the 2D frame corresponding to the indicated position. Therefore, the user has an immediate and intuitive overview of the representation of the actual building interior in the 3D model. This facilitates interaction with the 3D model, which encourages and enables the user to edit or modify the 3D model more closely or in a more detailed way, ultimately leading to a high-quality 3D model that would otherwise have taken considerably more work and time to achieve.
  • the capturing device used for capturing the sensor data may capture the sensor data either as still images or still photographs and/or as a video.
  • the plurality of 2D frames may be captured as still images or still photographs and/or as a video.
  • the video may be interpreted as a succession of single 2D frames, which may easily be extracted from the video. Therefore, a video may also be representative of the plurality of 2D frames and/or may serve as raw data from which the plurality of 2D frames may be extracted.
  • a building interior may be scanned by a user using the capturing device and capturing the plurality of 2D frames.
  • the frames being part of the second and third subset may be chosen from this plurality of 2D frames later, after the scan is complete.
  • these frames may be automatically selected during the step of creating the 3D model based on heuristics of the 3D model, for example heuristics based on the selection criteria as explained above.
  • the user may, already during the scan, mark or indicate special 2D frames of which the user wants to have a representation of the position of the capturing device in the 3D model independently from any other considerations.
  • the user may want to capture a special 2D frame representing a particularly useful overview of at least a part of the building interior or an object in the building interior.
  • a special 2D frame may represent a photograph of a border of a room, for example a wall, and/or of objects on that wall, such as windows, doors, etc., taken at an angle which may be particularly representative of the captured scene.
  • the user may indicate the special 2D frame by a user input. For example, the user may select a button on the smartphone or tablet computer indicating that the next photograph about to be taken, or the last photograph that has been taken, is a special 2D frame.
  • the method may comprise: in step a), receiving a user input indicating at least one special 2D frame and/or, in step b), determining at least one special 2D frame based on heuristics of the 3D model; in step c), determining a spatial position of the capturing device in the 3D model of the building interior for the at least one special 2D frame; and, in step d), representing the position of the capturing device of the at least one special 2D frame in the 3D model of the building interior, wherein the representation of the at least one special 2D frame may also be used in steps e) and f).
  • the special 2D frame is always represented in the 3D model in terms of the position of the capturing device when capturing the special 2D frame.
  • the representations of the positions of the capturing device may be indicated by the user through user input for displaying the special 2D frame.
  • the special 2D frames may or may not be used when creating the 3D model. They may be part of the pool of the plurality of 2D frames from which the 2D frames for creation of the 3D model are chosen, but independently of whether or not they are chosen for the creation of the 3D model, the position of the capturing device when capturing the special 2D frames is always represented in the 3D model for further use by the user.
  • the at least one special 2D frame may be captured as a still image, for example a still photograph, or as a video, similarly to the plurality of 2D frames.
  • capturing the at least one special 2D frame as a still image, for example a still photograph, may be preferred, as this typically enables a higher resolution and/or sharpness for the special 2D frame. This, in turn, increases the benefit of having the special 2D frames for the user.
  • data-driven image enhancement methods, for example convolutional neural networks or vision transformers, may be used, particularly to increase the quality of frames captured while the device is moving.
  • each representation in step d) may comprise both the spatial position and the orientation of the capturing device when capturing the corresponding frame.
  • the orientation of the capturing device describes the direction in which the capturing device was pointing when capturing the respective 2D frame.
  • the orientation of the capturing device describes the viewing direction of the capturing device, which is then also represented in the respective 2D frame.
  • the representation of the position of the capturing device in the 3D model may comprise a 3D or 2D object or icon which also indicates the orientation of the capturing device.
  • the representation may comprise a stylized camera indicating both the position and orientation of the capturing device for each 2D frame.
  • Another possibility comprises an arrow which may be added to the representation of the position to indicate the orientation of the capturing device. In this way, the user knows which perspective to expect from each position of the capturing device represented in the 3D model and can therefore select the 2D frames to be displayed very intuitively, without having to search for the desired perspective or view by clicking on several different positions until the correct one has been found.
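  • assuming the Pose structure sketched above, with the common convention that the optical axis is the camera's +z axis, such an arrow may be derived directly from the extrinsics; a sketch (the arrow length in model units is an illustrative choice):

```python
import numpy as np

def orientation_marker(pose, arrow_length=0.3):
    """Position icon plus an arrow along the viewing direction of the
    capturing device when the corresponding 2D frame was captured."""
    forward = pose.R @ np.array([0.0, 0.0, 1.0])   # optical axis in model space
    return {"position": pose.t, "arrow_tip": pose.t + arrow_length * forward}
```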
  • the method may also comprise representing the trajectory of the capturing device through the building interior between the positions of the capturing device in the 3D model, i.e. the positions represented in step d) in the 3D model.
  • the representations of the positions of the capturing device when capturing the respective 2D frames may be connected to each other through lines. Specifically, the lines may follow and/or indicate the timeline in which the 2D frames of which the position of the capturing device is represented in the 3D model were captured during scanning of the building interior.
  • the representations of the position of the capturing device in the 3D model may therefore be chained to one another in a curve or line representing or approximating the movement of the capturing device through the building interior during the scan.
  • the first represented position in the 3D model may be connected to the subsequent representation.
  • Each following representation may be connected to both the previous and the subsequent representation up until the last representation in the 3D model, which is only connected to the previous representation. All mentions of previous and subsequent representations may denote the immediately previous or subsequent representation, i.e. neighbouring representations.
  • Such a representation of the trajectory may increase the overview of the user over the 3D model and the underlying 2D frames even more.
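  • in code, chaining the represented positions into such a trajectory reduces to ordering the markers by capture time; a minimal sketch reusing the hypothetical structures above:

```python
def trajectory_polyline(frames, poses):
    """Consecutive capture positions become neighbouring polyline vertices,
    approximating the device's movement through the building interior."""
    ordered = sorted(frames, key=lambda f: f.timestamp)
    return [poses[f.frame_id].t for f in ordered]
```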
  • the method may comprise determining 2D frames showing a similar view as the displayed 2D frame in step f) from any one of the plurality of 2D frames, the first subset, the second subset or the third subset, and displaying at least a part of the determined 2D frames as preview images.
  • the method may comprise, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user.
  • additional 2D frames for display as preview images may be selected from all available 2D frames. They may be selected automatically through a similarity comparison or similarity metric. They may also be selected by object recognition. For example, 2D frames showing a similar perspective of a certain area or part of the building interior may be selected.
  • 2D frames showing an object identified in the displayed 2D frame may also be selected.
  • 2D frames showing a similar view as the displayed 2D frame may therefore mean that the 2D frame shows the same object or the same area of the building interior as the displayed 2D frame, albeit from a different distance, angle and/or perspective.
  • These selected 2D frames, or at least one of them, may then be displayed as preview images, for example thumbnails, which may be selected by the user through a user input, for example by clicking on them.
  • the 2D frame corresponding to the preview image is then fully displayed as in step f) of the method according to the present disclosure.
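  • one possible similarity metric is cosine similarity in a feature space; a sketch where `embed` stands for any image-embedding function (e.g. a pretrained CNN) and is therefore an assumption:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def preview_candidates(displayed, frames, embed, k=4):
    """Rank the other frames by feature-space similarity to the displayed
    frame and return the top k as preview-image candidates."""
    query = embed(displayed.image)
    scored = [(cosine_similarity(query, embed(f.image)), f)
              for f in frames if f.frame_id != displayed.frame_id]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [f for _, f in scored[:k]]
```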
  • the method may comprise displaying the 3D model to the user in a first-person perspective, receiving a user input indicating a desired view in the 3D model and displaying the desired view.
  • the user may move through the 3D model in the first-person perspective.
  • the method may further comprise determining 2D frames showing a view of the building interior corresponding to the displayed view of the 3D model, and displaying at least a part of the determined 2D frames as preview images.
  • the method may comprise, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user.
  • This may mean that while the user moves through the 3D model in the first-person perspective, 2D frames corresponding to or showing a similar view as the displayed first-person perspective view of the 3D model are determined.
  • 2D frames may be selected which show a view of the building interior which corresponds to the view of the 3D model displayed to the user in the first-person perspective.
  • the determined frames or at least one or some of the determined frames, preferably the ones with the highest degree of similarity, are shown to the user as preview images and may be selected by the user by a user input, for example a mouse click.
  • the method may comprise receiving a user input, for example a mouse click, indicating a point of interest and/or a region of interest in the 3D model, determining at least one 2D frame showing a part of the building interior corresponding to the point of interest and/or the region of interest in the 3D model from the plurality of 2D frames, and displaying the at least one determined 2D frame to the user.
  • of the determined 2D frames, the one with the highest image quality or with the best view of the desired object and/or part of the building interior according to the user input may be fully displayed as in step f) of the method according to the present disclosure. Further determined 2D frames may be displayed as preview images which, again, may also be selected by the user for full display as explained herein.
  • the best view of the desired object and/or part of the building interior as mentioned above may be determined, for example, based on the relative position of the capturing device to the object or part of the building corresponding to the point of interest and/or the region of interest of the 3D model. Also, the image quality, specifically the sharpness of the image of the 2D frame, may be considered.
  • determining the at least one 2D frame is based on at least one or a combination of the following parameters: distance of the position of the capturing device to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model, sharpness of the 2D frame, and viewing angle of the 2D frame in relation to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model.
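  • a sketch combining these three parameters into a single score, reusing the hypothetical Pose structure and the sharpness() helper above; the weighting and normalisation are illustrative choices, not prescribed by the disclosure:

```python
import numpy as np

def best_frame(candidates, poses, target_point, weights=(1.0, 1.0, 1.0)):
    """Pick the frame with the best view of a point of interest, trading off
    capture distance, sharpness and viewing angle."""
    w_dist, w_sharp, w_angle = weights
    sharp = np.array([sharpness(f.image) for f in candidates])
    sharp = sharp / (sharp.max() + 1e-9)            # normalise to [0, 1]
    best, best_score = None, -np.inf
    for f, s in zip(candidates, sharp):
        pose = poses[f.frame_id]
        to_target = target_point - pose.t
        dist = np.linalg.norm(to_target)
        view = pose.R @ np.array([0.0, 0.0, 1.0])   # optical axis in model space
        cos_angle = float(np.dot(view, to_target)) / (dist + 1e-9)
        score = w_angle * cos_angle + w_sharp * s - w_dist * dist
        if score > best_score:
            best, best_score = f, score
    return best
```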
  • creating the 3D model may comprise annotating and/or segmenting the 2D frames.
  • Any suitable automatic, half-automatic or manual process may be applicable. Any of these processes may be prone to errors. Additionally, even without an error, a user may want to change the annotation and/or the segmentation to thereby also change the 3D model.
  • the method may therefore comprise receiving a user input indicating a modification of an annotation and/or a segmentation of the displayed 2D frame, wherein the user input preferably is in the form of one or more mouse clicks or wherein the user input is in the form of one or more example images or one or more patch prompts, and modifying the 3D model accordingly.
  • the user may modify the segmentation of a 2D frame or part of the point cloud underlying the 3D model by defining positive and negative points on the 2D frame.
  • positive points may be part of the foreground, whereas negative points may be part of the background and therefore not part of an object to be segmented.
  • the user may segment the 2D frame.
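  • a sketch of such click-based prompting using the publicly available segment-anything package mentioned in this disclosure; the checkpoint path and model type are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_with_clicks(image_rgb, positive_xy, negative_xy):
    """Binary foreground/background segmentation from user clicks:
    label 1 marks positive (object) points, label 0 negative (background)."""
    predictor.set_image(image_rgb)
    points = np.array(positive_xy + negative_xy, dtype=np.float32)
    labels = np.array([1] * len(positive_xy) + [0] * len(negative_xy))
    masks, _, _ = predictor.predict(point_coords=points, point_labels=labels,
                                    multimask_output=False)
    return masks[0]   # HxW boolean mask of the segmented object
```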
  • An important part is then that the new segmentation or the modification of the segmentation and/or annotation is used to update the 3D model.
  • the 3D model is modified to consider the new annotation and/or segmentation.
  • the resulting 3D model is then similar to the one that would have been created originally in step b), had the original segmentation and/or annotation been the one provided by the user.
  • Prompting based on individual clicks may need to be performed by the user individually for each object.
  • prompting may also be enabled based on example images or patch prompts, as mentioned above and further explained below.
  • this system may generate an object type definition which may then be used to generate prompts for the segmentation method on yet unseen frames.
  • this may enable an automatic segmentation of object instances corresponding to the given object type on all available 2D frames based on one or more previously identified samples.
  • this may be referred to as a special type of one-shot learning.
  • a patch prompt may therefore contain an image and preferably also an annotation, for example a mask or bounding box.
  • a method allowing prompting based on one or more visual examples may be obtained by combining a method based on point prompts (e.g. universal promptable segmentation methods such as segment anything model) with a method that generates the point prompts automatically based on image examples.
  • a feature extractor, e.g. a convolutional neural network or a vision transformer, may be applied to the input data (prompt images and target frames).
  • similarities between the prompt and the frames to be segmented may be computed for a predefined grid in the frames. Areas showing a high similarity with areas inside the object in the prompt images may be candidates for positive point prompts, while areas showing a large feature distance from the object in the prompt images may be candidates for negative points.
  • Thresholds may be used to identify positive and negative points.
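  • a sketch of this grid-based prompt generation; `embed_patch` stands for the feature extractor mentioned above, and the grid step, patch size and thresholds are illustrative assumptions:

```python
import numpy as np

def point_prompts_from_patches(embed_patch, prompt_patches, target_image,
                               grid_step=32, patch=64, t_pos=0.8, t_neg=0.3):
    """Compare grid patches of the target frame with the averaged prompt
    embedding; high similarity yields positive, low similarity negative points."""
    prompt_vec = np.mean([embed_patch(p) for p in prompt_patches], axis=0)
    h, w = target_image.shape[:2]
    positives, negatives = [], []
    for y in range(0, h - patch, grid_step):
        for x in range(0, w - patch, grid_step):
            vec = embed_patch(target_image[y:y + patch, x:x + patch])
            sim = float(np.dot(prompt_vec, vec) /
                        (np.linalg.norm(prompt_vec) * np.linalg.norm(vec) + 1e-9))
            center = (x + patch // 2, y + patch // 2)
            if sim >= t_pos:
                positives.append(center)
            elif sim <= t_neg:
                negatives.append(center)
    return positives, negatives   # usable as click prompts, e.g. in segment_with_clicks
```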
  • Another example may be an end-to-end neural network taking the frames including masks as well as the target image as input and generating prompts for the interactive segmentation approach as output. This may be achieved by first generating a context embedding from the image patches using a set neural network (e.g. Set Transformer, PointNet) and second leveraging this generated context, in combination with the image to be segmented, to generate a sequence of segmentation prompt-tokens by applying a contextualized image-to-sequence model.
  • a computing system comprising a smartphone or tablet computer and at least one computing device, wherein the computing system is configured to perform the steps of the method according to the present disclosure. All of the features, functions and advantages of the method according to the present disclosure are also applicable to the computing system and vice versa.
  • the computing device comprised in the computing system may, for example, be a personal computer.
  • the at least one computing device may comprise a keyboard and/or a mouse configured to receive a user input. It may also comprise a display device, for example a display screen for displaying information to a user, for example the 3D model and the 2D frames as explained herein.
  • Another aspect of the present disclosure relates to a computer program, which, when executed by one or more processors of a computing system and/or computing device, instructs the computing system and/or computing device, for example one or more of a handheld device and/or a personal computer, to perform the method according to the present disclosure. All of the features, functions and advantages of the method and/or the computing system according to the present disclosure are also applicable to the computer program and vice versa.
  • Another aspect of the present disclosure relates to a computer-readable medium having stored thereon the computer program as disclosed herein. All of the features, functions and advantages of the method and/or the computing system and/or the computer program according to the present disclosure are also applicable to the computer-readable medium and vice versa.
  • Figure 1 shows interactive segmentation using a surface mode optimal for planar objects
  • Figure 2 shows interactive segmentation using a box mode optimal for 3D objects
  • Figure 3 shows a trajectory represented in the 3D model corresponding to the walk through the room over time during scanning the room with a mobile device
  • Figure 4 shows a 3D building interior model and corresponding 2D frames during segmentation
  • Figure 5 shows a flowchart of the method.
  • the method 20 may start in step 21 by receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior. Subsequently, in step 22, a 3D model of the building interior may be created using a first subset of the plurality of 2D frames. Step 23 may comprise determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames. The position of the capturing device may be represented for a third subset of the 2D frames in the 3D model of the building interior in step 24. A user input received in step 25 and indicating one of the positions or representations of the third subset of the 2D frames leads to step 26, comprising displaying the 2D frame corresponding to the indicated position or representation to the user.
  • a user may scan the interior of a building, such as a room, a floor, a stairwell, a lobby, a hall, or even the interior of a whole building, with a mobile device by capturing the interior from different poses while moving the device. Since no stationary hardware is needed (compared to known solutions based on laser scanners), a room may be scanned within a few seconds and scanning does not need any expertise. In the following, the term “room” is used to refer to any of the mentioned entities.
  • the obtained data may consist of images and potentially also corresponding depth data, where the latter may be captured with a physical depth sensor or may be derived from the image data.
  • image data (frames) may be either captured regularly or based on certain heuristics (e.g. if the viewing angle and/or position change surpassed a certain predefined threshold).
  • a 3D point cloud may be constructed. This 3D point cloud may then be used as a starting point for generating a parametric 3D interior geometry.
  • the parametric 3D geometry may contain the interior geometry of all individual rooms and their relative position. Thereby digital twins of complete floors or even interiors of buildings may be represented in one single model, as in the case of building information models.
  • the parametric model may be obtained by approximating the point cloud based on basic elements, which are geometrical 2D or 3D structures, such as planes, cuboids, cylinders, discs or spheres. Basic objects, such as windows and doors may also be integrated into the geometrical model. This geometry may be augmented with additional arbitrary objects of interest.
  • objects may be segmented and classified manually or semi-automatically, or even a fully automated segmentation may be performed. While the segmentation is performed on one or more 2D frames, the obtained segmentation masks may be translated into the 3D space and may be integrated into the 3D interior geometry. Additionally, a replacement of detected and segmented objects with corresponding 3D object models may be possible to enrich the semantic representation, interoperability and geometric accuracy of the model. For example, a detected power plug may be replaced by the manufacturer's 3D model if available or by a standard model. Analogously, a lamp may be replaced by a standard lamp model or an automatically retrieved model similar to the real lamp. The augmented 3D geometry may be referred to as semantic 3D building model. Finally, this model may be exported in various formats to seamlessly integrate into a user's downstream workflow.
  • the detection or segmentation of objects may be needed for various tasks. For example, for light planning, existing lamps and power connections may need to be identified. For radio planning, existing power connections and network devices may need to be detected. For facility management, fire extinguishers, signs, lamps, and different appliances may be needed. Since the visual appearance of objects may show a high variability and sufficient training data is scarce, fully automated segmentation approaches based on deep neural networks often do not yield satisfactory results. Thus, the present disclosure offers a semi-automated approach, bridging this gap until sufficiently large training data sets become available.
  • the captured data is denoted as frames or 2D frames, which may be RGB frames, but which may also be multispectral or monochrome frames.
  • the implementation of the described system may be performed as a hybrid cloud-edge architecture as follows:
  • the scan may be performed with a hand-held mobile device (smartphone or tablet).
  • the point cloud and at least a preliminary version of the 3D interior geometries may be computed and iteratively updated on the mobile device (edge device) during the scan.
  • the reconstruction may be visualized in real time (with a short latency) on the device.
  • data may be transmitted to a cloud.
  • the final 3D model may be accessed via a web browser while the data may be physically located in the cloud.
  • Computationally intensive computing steps may be performed in a cloud infrastructure. Such steps may include but are not limited to segmentation, retrieval, color or material estimation.
  • a micro service architecture may be useful to efficiently configure different pipelines to meet different workflows.
  • the interaction (various types of visualization, measurement, manual segmentation, semi-automated segmentation) may be performed via a web browser.
  • Generated semantic 3D building models may be exported in various formats to be integrated in various workflows.
  • as the user device, the present disclosure refers either to a smartphone or a tablet.
  • the user device necessarily has at least one capturing device, which may be or comprise at least one of the following: a visual sensor, i.e. a camera with an RGB sensor, a monochrome sensor or a multispectral sensor; a depth sensor (e.g. a LiDAR sensor); and a gyroscope.
  • edges and corners of the rooms may be of high relevance.
  • during the scan process, it may be possible to acquire specially marked 2D frames which can later be easily identified and accessed in the visualized semantic 3D building model and used for further visualization and interaction, including the segmentation of objects of interest. Since the geometry may be reconstructed in real time on the device, the user may obtain guiding feedback (with a low latency) on the user device to react in the case of problems (e.g. by rescanning certain areas) and to actively optimize the final output.
  • the user may mark special frames which can later be easily identified based on representations such as points in the visualization of the semantic 3D model.
  • the trajectory of the scan may be recorded to be visualized in the reconstructed 3D building model.
  • Each frame (marked as point in the trajectory) may then be selected by the user.
  • the user can interact with the 3D building model.
  • Individual frames can be selected based on different strategies.
  • FIG. 3 shows a 3D model 1 of an interior room comprising a door 3, chairs 4, a window 5, a sideboard 6, a table 7, and walls 10. Also shown are representations 8 of positions of the capturing device used to scan the room and a trajectory 9 of the capturing device connecting them. Each representation 8 or point contains a direction (3D vector) and corresponds to a captured 2D frame 2. The 2D frames 2 can be accessed/displayed (lower portion of Figure 3) by clicking on the representations 8 or points. The lower portion of Figure 3 exemplarily shows one of the 2D frames 2 accessible through the representations 8 in the 3D model 1.
  • icons representing the frames which were marked during the scan are displayed.
  • the user may click on one of these icons and the corresponding frame may then be visualized, i.e. displayed, either fully or in a preview mode as preview images.
  • the trajectory of the scan process as well as the captured frames may be visualized in the 3D building model.
  • the user may click on one of the visualized icons and the corresponding frame is visualized as mentioned before.
  • a frame may be selected which shows similar content (from a similar perspective) as the current perspective. The selected frame may then be visualized as mentioned before.
  • Another option may be to compute candidate frames (and visualize them in preview mode) based on heuristics in combination with knowledge of the intended use-case (e.g. for light planning, frames pointing at the ceiling may be relevant).
  • Another option may be to compute and visualize (in preview mode) frames showing a certain room element (e.g. a wall, a door, a window) after the user clicks on the element or its representation in the model.
  • one or more further frames may be automatically suggested in preview mode (i.e. shown at a small size), showing similar content (based on a similarity metric in a feature space) or neighbouring content (similar perspective, i.e. similar extrinsic camera parameters).
  • Selected and visualized 2D frames may be used to augment the 3D building model with object models.
  • using an interactive segmentation method, such as, for example, the Segment Anything Model, functionality may be provided to generate segmentations of objects based on user prompts.
  • the interactive segmentation method may perform a binary segmentation into foreground (object) and background.
  • Objects which are shown on more than one frame may be segmented sequentially by processing several frames. There may be two different modes:
  • the planar or surface mode may allow capturing planar objects. All segmented points may be mapped from the 2D frames into the 3D point cloud and may be projected to a plane in 3D space.
  • The top of Figure 1 shows a 3D model 1 comprising walls 10, a door 3, a table 7, chairs 4, and a sideboard 6.
  • the user has made a user input indicating a view of the ceiling 11. Therefore, as in the lower part of Figure 1, a 2D frame 2 showing the ceiling 11 is displayed. To the right of the displayed 2D frame 2, several smaller preview images of 2D frames 2 showing a similar view to the displayed 2D frame 2 of the ceiling 11 are displayed to the user.
  • the user may choose any of these preview images to have the respective 2D frame 2 fully displayed like the one on the left.
  • the user may select flat objects on the ceiling 11 , like lights or fixtures, for example by drawing a polygon on the 2D frame, which is then used for segmentation.
  • the 2D frame 2 is segmented based on one or more mouse clicks on the object or the background.
  • the planar segmentation is directly shown in the 3D model 1.
  • the box mode may allow capturing non-planar objects. All segmented points may be mapped from the 2D frames into the 3D point cloud and, based on all positive (foreground) points, a 3D point cloud (which may be a subset of the complete point cloud of the scan) and a corresponding 3D bounding box may be created. This may be repeated for all objects to be segmented in the scan data, both planar and non-planar objects.
  • This is exemplarily shown in Figure 2.
  • the top of Figure 2 shows a 3D model 1 comprising walls 10, a door 3, chairs 4 and a 3D bounding box 13.
  • one or more 2D frames 2 are segmented based on one or more mouse clicks on the object or the background.
  • the object in question in Figure 2 is a column 14.
  • the generated 3D bounding box 13 covering all segmented points is directly shown in the 3D model 1.
  • preview images 12 of more 2D frames 2 with similar views as the one displayed to the left are shown and may be selected for full display.
  • a type may be selected to create an object model.
  • a predefined or user defined type may be selected. Particularly predefined categories may enable interoperability within groups of users. Properties may be manually assigned (e.g., based on drop down menus, free text, radio boxes, check boxes) to the object models.
  • the object type specifies predefined and user-defined properties.
  • the type of objects may even define a hierarchy using inheritance. For example, an object of type office chair may have the same parameters as a (general) chair with the additional property “number of wheels”. Categories may be defined individually for each user or for a whole organization (for better interoperability). Default values for properties (which may be overwritten by the user) may be specified based on heuristics or can be estimated based on the underlying data (e.g., the size may be estimated based on the size of the segmentation).
  • frames may be captured in certain intervals.
  • the period between capturing two frames may be fixed (i.e. in fixed time intervals) or may vary based on the movement of the device.
  • Room geometry may consist of walls, openings, doors, and windows.
  • based on spatial quantization and a projection, a rasterized representation showing the point cloud from a bird's eye view may be computed, containing two spatial dimensions and a third dimension containing point characteristics (e.g. colors).
  • this representation is referred to as 2D representation.
  • a rasterized 3D representation containing the three spatial dimensions may be computed based on spatial quantization. The resolution of these representations may be adjusted to fit the hardware’s performance (e.g. memory).
  • a neural network, e.g. a fully convolutional neural network, may be used to generate a map representing borders and a map representing edges.
  • based on these data, line-sampler modules may generate line proposals.
  • a line-verification network may be applied to detect parametric representations.
  • first, projections onto the individual wall elements may be generated.
  • the projections may contain colour information and other characteristics (e.g. point distances).
  • a 2D network may be used to generate segmentation output. Due to the rectangular shape of doors and windows, a bounding box object detector can be sufficient here. To also fit irregular shapes, additional segmentation may be performed (e.g. making use of a method such as Mask-RCNN).
  • State-of-the-art supervised learning-based segmentation approaches such as Mask-RCNN, U-Net and Detectron2 may require training data sets, which contain corresponding label data (segmentation masks providing a class label per pixel or polygons defining the objects of interest) for the objects or structures which are intended to be segmented.
  • the methods may also require that a sufficient number of objects of interest are prevalent in the image data. While for many generic objects (e.g. chairs, tables, ...) such data sets exist (e.g. COCO), for domain specific objects (e.g. wifi access points, fire extinguishers) appropriate data sets may often be unavailable.
  • each frame may correspond to a partial 3D point cloud, which may be a part of the complete point cloud.
  • Segmenting an object in 3D may be equivalent to segmenting the object in each frame it is present, translating the labels to the corresponding partial point cloud and merging the corresponding partial segmentations to a complete point cloud using the 3D information attached to the segmented pixels. This may allow a bidirectional connection between 2D frames and the 3D model.
  • a pixel annotated in a 2D frame may be translated into the complete 3D point cloud.
  • an annotated point in the point cloud may be translated into one or more frames.
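  • with known camera intrinsics and the hypothetical Pose structure sketched above, this bidirectional 2D-3D translation is a standard pinhole-model computation; a sketch:

```python
import numpy as np

def pixel_to_model_point(u, v, depth, K, pose):
    """Lift an annotated pixel (u, v) with known depth into model coordinates;
    K is the 3x3 intrinsic matrix of the capturing device."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point_cam = depth * ray_cam                  # point in camera coordinates
    return pose.R @ point_cam + pose.t           # point in model coordinates

def model_point_to_pixel(point, K, pose):
    """Project a labelled 3D point back into a frame (the inverse direction)."""
    point_cam = pose.R.T @ (point - pose.t)
    uvw = K @ point_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]      # pixel coordinates (u, v)
```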
  • each frame showing the object may need to be segmented.
  • a segmentation of a subset of these frames may be sufficient either in the case of simple shapes (e.g. coplanar shapes) or in the case that a very high segmentation accuracy may not be needed.
  • Interactive segmentation models like the so-called “Segment Anything Model” may facilitate semi-automated segmentation of 2D images based on the manual selection of a small number of positive (objects) and negative points (background). The output of such models may then be parametrized by estimating its polygonal boundary. This polygonal outline may give an operator a straightforward way to interact with the given segmentation proposal by dragging its vertices.
  • An object model may refer to an object type, may contain a geometry (which may be of different types, such as a parametric description, a mesh, a polygon, or a bounding box), and properties (which may be manually assigned or automatically based on the image data).
  • Object categories may be, e.g. power plugs, windows, doors, ...
  • the object type may define the required and the optional properties.
  • the type may also define how the geometry is modelled.
  • a manual, semi-automated or automated method may segment or detect and classify an object in a 3D point cloud and may thereby provide a certain category as the object’s category.
  • the geometric data (e.g. point cloud) representing the object may be used as geometry, however, it may also be replaced by another representation (e.g. mesh, bounding box) or model in a data set which corresponds to the classified category.
  • Parametric representation may even allow the estimation of certain parameters (e.g. size, aspect ratio). For simple objects, one object model per category may be sufficient (e.g. power plugs). For more complex objects, it may be beneficial to replace an object in the point cloud with a 3D model (of the same category) showing a high degree of similarity.
  • This may be achieved by means of image retrieval based on 2D pixel data, 3D voxel data or 3D point cloud data.
  • a feature representation e.g. obtained by means of pretrained convolutional neural network (CNN) or self-supervised learning in combination with CNNs
  • CNN convolutional neural network
  • given a data set containing one or more models per category and 2D image or point cloud representations of the models, the best fitting (most similar) model may be searched for an individual object in the point cloud. Due to the given geometry of these models, inaccurate segmentations may be compensated.
  • the retrieved model may be assumed to be more accurate than the acquired point cloud data.
  • the predefined size may be obtained from the user. Since objects within one building may often be of the same type, it may be efficient if the user measures the dimension of single objects and assigns the dimension to several objects. For example, the window height and the door height may often be similar. Parametric models may be assigned with these priorly given parameters. The models which may be used to replace the underlying data may contain visual characteristics similar to the real objects. However, this may not be necessary since a high visual quality is often not necessary for workflows. Therefore, the geometry may even be skipped as long as the position of the object in the semantic 3D building model is known. Visualization of objects may also be performed based on simple bounding boxes or point markers. However, often additional parameters and properties may be needed to specify the objects.
  • Detected objects may be replaced with object models. These models may contain properties which may be derived automatically from the image data (or from the 3D geometry). For example, size, aspect ratio, colour, and texture of objects may be determined from the image data. Other properties may not be determinable from the image data but may be added manually by means of user interaction. Several properties may be added based on a visual mask and different input modalities (e.g. radio box, check box, text box, drop down field). For example, manufacturer, type, year of construction, and charge number may be manually added. Also, properties which are determined automatically may be overwritten manually in the application. Properties which are not predefined may be manually added. This process may be referred to as augmentation.
  • The accuracy of the segmentation output may be optimized by estimating the dimensions and/or the shape, with the possibility of feedback by the user.
  • For example, the user may be asked whether the object of interest is symmetric, or has a rectangular, square, circular, etc. shape.
  • Thereby, areas occluded during the scan may be circumvented.
  • Also, the need for annotating more than a single frame may be diminished.
  • A trained neural network may be used to segment a multitude of 2D image frames.
  • Here, segmentation refers to “instance segmentation”, which may be described as object detection and classification followed by pixelwise annotation.
  • For each frame, a corresponding partial point cloud may exist.
  • This partial point cloud may be obtained based on depth sensors (e.g. LiDAR) and/or based on merging 2D frames from multiple views or multi-camera settings.
  • The partial point clouds (each corresponding to an individual image frame), including the segmentation output, may then be merged (registered). Thereby, a global 3D point cloud with labelled points may be generated, for example as sketched below.
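The merging step could be sketched as follows, assuming that a 4x4 camera-to-world pose per frame is already known from registration; all names are illustrative.

```python
import numpy as np

def merge_labelled_clouds(partial_clouds, partial_labels, poses):
    """Merge per-frame partial point clouds (each with per-point class
    labels) into one global labelled cloud, given a 4x4 pose per frame."""
    all_points, all_labels = [], []
    for points, labels, pose in zip(partial_clouds, partial_labels, poses):
        # Transform Nx3 points into the global frame via homogeneous coordinates.
        homogeneous = np.hstack([points, np.ones((len(points), 1))])
        all_points.append((homogeneous @ pose.T)[:, :3])
        all_labels.append(labels)
    return np.vstack(all_points), np.concatenate(all_labels)
```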
  • The segmentation output may then be post-processed by applying heuristics, to increase robustness despite single misclassified points.
  • Filtering may be performed to eliminate noisy segmentations.
  • The points may be projected from the 3D point cloud to the parametric 3D geometry; e.g., a point close to a wall segment may be mapped to the closest point on the wall segment (e.g. by orthogonal projection, i.e. the distance to the wall plane becomes zero). Since the parametric base elements may often be coplanar, 2D approaches may be used after the projection for further post-processing to optimize the quality of the final segmentations (e.g. to achieve simple parametric representations).
  • Segmentation may be performed in 2D and the resulting points may be translated into 3D.
  • The result may be a 3D point cloud, potentially with a vast number of points.
  • Simple parametric representations which may be efficiently visualized and modified are advantageous.
  • A polygonal shape lying on a plane, for example, may be represented based on a few points on the plane (one point for each of the polygon’s vertices). As soon as the plane is known, a point may consist of only 2 degrees of freedom.
  • For this, dedicated methods exist (e.g. polygon fitting approaches for 2D point clouds); the underlying projection is sketched below.
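A minimal sketch of the projection, assuming the plane is already known (given by a point on the plane and a normal); after projection, each point is expressed in a 2D in-plane basis and thus keeps only two degrees of freedom.

```python
import numpy as np

def project_to_plane_2d(points, plane_point, plane_normal):
    """Orthogonally project Nx3 points onto a plane and return Nx2
    coordinates in an orthonormal basis spanning the plane."""
    n = plane_normal / np.linalg.norm(plane_normal)
    # Remove the component along the normal (orthogonal projection).
    offsets = points - plane_point
    projected = points - np.outer(offsets @ n, n)
    # Build an orthonormal in-plane basis (u, v).
    helper = np.array([1.0, 0, 0]) if abs(n[0]) < 0.9 else np.array([0, 1.0, 0])
    u = np.cross(n, helper)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)
    relative = projected - plane_point
    return np.stack([relative @ u, relative @ v], axis=1)
```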
  • A user may be presented with a selected 2D frame.
  • The frame may be selected from a set of automatically or manually predefined frames.
  • The user may trigger the acquisition of one or more marked frames while scanning a room. Thereby, the quality of the frame may also be increased (since the frame may be captured without moving the device, and potentially also at an increased resolution).
  • The selection may be performed while interacting with the model, e.g. in a first-person view. While the model is shown from a selected perspective, the frame or frames with the closest distance or distances to the current view may be visualized.
  • A comparison of a frame and a position in the first-person view may be performed, e.g., based on the extrinsic camera parameters, i.e. the position and orientation (rotation) of the capturing device (3+3 degrees of freedom).
  • The individual parameter values may be compared: a difference may be computed for each parameter, and the differences may be summed up by weighted addition. Squared (or even fourth-power) distances may be useful to ensure that each parameter stays within a certain range and that no outliers exist; see the sketch below.
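A sketch of such a comparison, assuming poses are given as six extrinsic parameters (translation plus rotation angles); the weights and the squared distance are one possible choice.

```python
import numpy as np

def pose_distance(pose_a, pose_b, weights):
    """Weighted sum of squared per-parameter differences between two
    camera poses given as (x, y, z, roll, pitch, yaw)."""
    diff = np.asarray(pose_a, float) - np.asarray(pose_b, float)
    return float(np.sum(np.asarray(weights) * diff ** 2))

def closest_frame(view_pose, frame_poses, weights):
    # Index of the captured frame whose pose best matches the current view.
    scores = [pose_distance(view_pose, p, weights) for p in frame_poses]
    return int(np.argmin(scores))
```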
  • The user may then select the frame which shows the objects of interest from the best perspective. Another option may be that the user defines a point in the semantic 3D building model (e.g. in the first-person view or in a top-down view) which represents the center of an intended (captured) frame (see Figure 4).
  • The user defines a region in the 3D interior model to annotate. E.g., a whole wall or a part of a wall may be marked based on a polygon selection tool. Then, based on the selected region, one or more frames may be computed and visualized which show the selected area appropriately, i.e. at an appropriate distance and an appropriate angle (in the best case orthogonal to the plane) and in high quality (no-reference image quality measures may be used to assess motion blur and noise).
  • These aspects may be weighted and optimized (e.g. the minimum of a weighted sum score may be computed) when searching for the one or more best suited frames. Particularly if large areas are marked, several frames may be needed to cover the whole area. Finally, it may even be possible that the user did not capture any frames showing the selected areas appropriately. A possible scoring function is sketched below.
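Purely as an illustration, such a weighted score could combine viewing distance, viewing angle and image quality as follows; the target distance, the weights and the quality measure (assumed to lie in [0, 1], higher is better) are hypothetical choices.

```python
import numpy as np

def frame_score(distance_m, angle_deg, quality,
                weights=(1.0, 1.0, 1.0), target_distance_m=2.0):
    """Lower is better: penalize deviation from a comfortable viewing
    distance, deviation from an orthogonal view (angle 0 degrees), and
    low no-reference image quality. The frame(s) minimizing this score
    may be proposed to the user."""
    w_dist, w_angle, w_quality = weights
    return (w_dist * (distance_m - target_distance_m) ** 2
            + w_angle * np.radians(angle_deg) ** 2
            + w_quality * (1.0 - quality) ** 2)
```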
  • The user may then manually segment one or more objects of interest, for example by drawing a polygon based on consecutively clicking on points in the frame.
  • A class label may be defined.
  • A predefined class may be selected or a new class may be defined.
  • For this, a drop-down menu or radio buttons may be used.
  • Colours may be assigned to the classes, whereby the annotated regions may be highlighted with the respective colour.
  • This procedure may be performed in one or more frames.
  • The generated segmentations may finally be translated into the parametric 3D model and merged with the model. Depending on the characteristics of an object, it may be sufficient to segment a single frame. This may be particularly true for planar (or almost planar) objects.
  • If an object is marked with a certain class label in a single frame, the object may be assumed to be an instance of this class, even though, to save manual effort, the user may not mark this object in each frame showing the same object.
  • This efficient behaviour may be achieved since the annotated regions are transferred from the 2D frame (via the 2D-3D mapping) into the global 3D point cloud. A planar region in the 3D model may then be automatically mapped to a class as soon as a single label is available.
  • Coplanar objects may typically be segmented easily based on a single frame.
  • However, the segmentation in a single frame may leave parts of the object occluded in the 3D model.
  • In this case, the annotation may need to be performed from different perspectives.
  • Two options for selecting several appropriate frames are suggested herein. The first may be based on the first-person view, where the user may manually select additional appropriate views by moving inside the scene. Another option may be that the user selects a first frame, and based on this frame, the user may be presented with similar (neighbouring) views showing the object of interest from different perspectives.
  • For semi-automated segmentation, a frame may be selected as in the case of manual segmentation.
  • The user may then click on one object of interest.
  • As the semi-automated segmentation algorithm, any interactive segmentation method based on determining positive and potentially also negative seed points may be used, e.g. the segment anything model.
  • The semi-automated segmentation algorithm may propose a segmentation into foreground and background.
  • The user may iteratively click on further points inside the one or more objects of interest (defined as positive samples, e.g. left mouse button) and potentially also on points in the background (defined as negative samples, e.g. right mouse button).
  • The segmentation may be refined and visualized until the user is satisfied with the output.
  • A class label may be determined as in the manual segmentation approach. One possible realization of this click-based loop is sketched below.
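One possible realization uses the publicly available segment anything model; the checkpoint path, the frame path and the click coordinates are placeholders, and in practice the points would come from the user's mouse clicks.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed path
predictor = SamPredictor(sam)

frame_rgb = np.array(Image.open("frame_0042.png"))  # placeholder frame path
predictor.set_image(frame_rgb)  # HxWx3 uint8 RGB image of the selected frame

# Seed points from user clicks: label 1 = positive (left click, inside the
# object), label 0 = negative (right click, background).
points = np.array([[410, 220], [395, 260], [120, 300]])
labels = np.array([1, 1, 0])

masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=labels, multimask_output=False)
mask = masks[0]  # boolean HxW mask, refined as the user adds further clicks
```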
  • The 2D segmentations may then be translated into the 3D coordinate system.
  • Each point in the partial point cloud (which may be related to a segmented frame) may be assigned to a segmentation category, based on its position in the segmented 2D frame.
  • Each partial point cloud (based on a single frame) may be registered (aligned) with all other partial point clouds by means of point cloud matching algorithms, resulting in a single global point cloud.
  • The annotations may be transferred to the corresponding partial point clouds and also to the global point cloud containing all (aligned, registered) partial point clouds.
  • A parametric representation of wall elements and other objects may be obtained by fitting base elements, such as planes, cylinders, and surfaces, to the point cloud, which may finally provide a simplified and adjustable representation; a least-squares plane fit is sketched below.
  • A segmentation of objects may then be mapped from the point cloud to the parametric representation by means of an orthogonal projection to the closest base element, achieving a translation of segmentations from the global 3D point cloud to the parametric 3D model.
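A least-squares plane fit, as one example of fitting a base element, can be written compactly with an SVD; this is a generic sketch, not tied to any particular library of the described system.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through an Nx3 point cloud: returns a point on
    the plane (the centroid) and the unit normal, i.e. the direction of
    least variance (the last right singular vector)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]
```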
  • In many cases, the segmentation in one frame may be sufficient.
  • An automated approach may assess whether additional segmented frames are needed. Also, the user may decide whether or not to annotate further frames.
  • The thereby computed newly labelled points in the new frames may then partially be used as new positive and negative seed points.
  • Additionally, points outside the annotated region may be used, e.g. as negative seeds.
  • Particularly, (positive and negative) points may be selected which are not too close to object borders (e.g. by means of morphological erosion/dilation of the segmentation masks in the originally segmented frames, followed by a random selection of points); see the sketch below.
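A sketch of this seed selection using OpenCV morphology; the margin, the number of points and the mask encoding (non-zero = object) are assumptions.

```python
import cv2
import numpy as np

def sample_seed_points(mask, n_points=5, margin=5, rng=None):
    """Sample positive seeds well inside the mask (after erosion) and
    negative seeds well outside it (after dilation), so that no seed
    lies too close to the object border."""
    rng = rng or np.random.default_rng()
    kernel = np.ones((2 * margin + 1, 2 * margin + 1), np.uint8)
    inner = cv2.erode(mask.astype(np.uint8), kernel)    # safe interior
    outer = cv2.dilate(mask.astype(np.uint8), kernel)   # enlarged object
    background = (outer == 0).astype(np.uint8)          # safe exterior

    def pick(region):
        ys, xs = np.nonzero(region)
        chosen = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
        return np.stack([xs[chosen], ys[chosen]], axis=1)  # (x, y) pairs

    return pick(inner), pick(background)  # positive seeds, negative seeds
```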
  • The semi-automated 2D segmentation proposal algorithm may then be applied to the new frames with the translated data. This procedure may be repeated.
  • The method allows fully automated processing of the additional frames; however, thanks to the real-time performance, the user may also interact (set additional positive and negative points, or use the polygon segmentation tool if needed) to ensure that the objects are correctly annotated.
  • The semi-automatedly generated segmentation masks may be adjusted (overruled) in each individual frame by means of manual annotation. After a correction, the semi-automated procedure may be rerun to translate the correction to the other frames in the 3D model.
  • This is exemplarily shown in Figure 4.
  • The top part of Figure 4 shows a 3D model 1 of a building interior including objects in the building interior such as walls 10, chairs 4 and a television screen 15, as well as a door 3 and a window 5.
  • Also shown are representations 8 of several positions of the capturing device corresponding to respective 2D frames 2.
  • The representations 8 may pertain to special 2D frames, for example frames showing plain views of the four borders of the room, which may have been marked by the user during the scan.
  • Further, a 2D frame 2 is displayed which shows radiators 16 below the window 5.
  • The user may segment these by hand, for example by drawing a polygon 17 around a radiator 16.
  • The segmentation mask may also be visualized in the 3D interior model 1; see the polygon below window 5 in the 3D model 1 shown in the top part of Figure 4.
  • The semi-automated (interactive) segmentation approach may enable another category of parameterization based on image data.
  • A parameterization based on providing one or more exemplar image patches to a deep learning algorithm (in addition to the image to be segmented) may be used. This may be referred to as patch-prompted segmentation.
  • An image patch here may refer to a part of a full image (which could be a frame from a video) representing a bounding box with exactly the object of interest inside the bounding box.
  • Optionally, a segmentation mask representing the outline of the object of interest may be provided / available.
  • The image patches may be obtained from operative data (i.e. data captured during the scan). Another option for obtaining patches may be based on external data (e.g. product images, or synthetic images generated from 3D models). Based on the provided image patch or patches, the interactive segmentation approach may be initialized, and objects similar to the objects in the exemplary patches may be detected and segmented.
  • One approach may be based on performing local feature extraction for the patch (prompt) images and for the image to be segmented, resulting in feature maps. Based on a distance measure (e.g. Euclidean distance, cosine distance) and pairwise comparisons (between local features in the patch(es) and the image to be segmented), the similarity between regions in the image to be segmented and the object of interest may be estimated. Regions with a high estimated similarity (to one or more regions) may be used as candidates for positive prompts and regions with a low similarity may be used as candidates for negative prompts for initializing the interactive segmentation approach (e.g. the segment anything model); see the sketch below.
  • The actual prompt points may need to be sampled from these candidate regions. Bounding boxes or mask prompts may also be possible.
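The candidate selection could be sketched as follows, assuming L2-normalized local feature maps are available for the exemplar patch and for the image to be segmented; the thresholds are arbitrary example values.

```python
import numpy as np

def prompt_candidates(patch_features, image_features, t_pos=0.8, t_neg=0.3):
    """patch_features: PxD normalized local features of the exemplar patch;
    image_features: HxWxD normalized feature map of the target image.
    Returns boolean masks of candidate positions for positive and negative
    point prompts, based on the best cosine similarity of each image
    position to any patch feature."""
    h, w, d = image_features.shape
    similarities = image_features.reshape(-1, d) @ patch_features.T  # (H*W)xP
    best = similarities.max(axis=1).reshape(h, w)
    return best >= t_pos, best <= t_neg
```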
  • Feature extraction may be performed based on previously trained convolutional neural networks or transformer networks in a supervised or self-supervised setting. Particularly the self-supervised setting may be advantageous, since it allows adding the constraint that similar image regions are mapped to features showing a small distance.
  • Another patch-prompt segmentation approach may be an end-to-end neural network taking the image patches (optionally including masks) as well as the image to be segmented as input and generating prompts for the interactive segmentation approach as output. This may be achieved by first generating a context embedding from the image patches using a set neural network (e.g. Set Transformer, PointNet) and second leveraging this generated context in combination with the image to be segmented to generate a sequence of segmentation prompt tokens by applying a contextualized image-to-sequence model.
  • Indexing may be used for fast matching of point clouds and for translating segmentations between frames. Indexing may be useful to find the nearest neighbours of points (originating from different frames) without the need to search through the whole point cloud for each individual point. Indexing may be performed, e.g., by spatially partitioning the global point cloud into a rectangular grid. Other options may comprise octrees, k-d trees, and R-trees; a k-d tree lookup is sketched below.
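For example, nearest-neighbour lookups against the global cloud may use a k-d tree (here via SciPy); the data arrays are placeholders and the 2 cm tolerance is an assumed threshold.

```python
import numpy as np
from scipy.spatial import cKDTree

global_points = np.random.rand(1000, 3)    # global cloud (placeholder data)
global_labels = np.zeros(1000, dtype=int)  # one class label per global point
frame_points = np.random.rand(50, 3)       # points of a newly segmented frame
frame_labels = np.ones(50, dtype=int)      # labels from the 2D segmentation

tree = cKDTree(global_points)              # build the spatial index once
distances, indices = tree.query(frame_points, k=1)

# Transfer labels only where the nearest neighbour is close enough.
close = distances < 0.02                   # 2 cm tolerance (assumption)
global_labels[indices[close]] = frame_labels[close]
```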
  • Planar objects (e.g. planar lightings) may be segmented in one run on a single frame (without the need to perform additional segmentations on other frames corresponding to different perspectives).
  • Whether an object is (almost) planar may be assessed, e.g., based on statistics of the point cloud (e.g. PCA).
  • A more accurate segmentation may be achieved by using the image frame and directly mapping the contour (e.g. parameterized by a polygon) into the global point cloud and into the parametric 3D geometry.
  • For the projection, an option may be to project the points to a plane which approximates the object plane in an optimal way (e.g. based on least-squares optimization). Since objects may be associated with geometric base elements, another option may be to map the points to the corresponding base elements. Also for objects which may not be completely coplanar (e.g. a power plug), it may be beneficial to use this mode.
  • Non-coplanar objects of interest may not be projectable to coplanar elements without a major loss of information.
  • An option to segment such objects in three dimensions may be to perform 2D segmentation in one or more frames, map the segmentations to the partial point clouds corresponding to the individual frames, align the one or more partial point clouds, and extract the segmented points.
  • The annotated 3D points may then be translated to a simplified representation, such as a polygonal mesh or a parametric representation, to facilitate handling and visualization. The latter may be effective in the case of simple shapes, e.g. cuboids.
  • While the 3D room geometry may be an essential part of the model described herein, other data may not necessarily be needed for each use case. Some use cases may require, e.g., textures, colors or even material properties, which may correspond to significant memory usage, while other use cases may not consider them at all. If all data were stored in a monolithic file, the resulting file covering all data (for each individual use case) could be huge. To allow a high degree of flexibility, the data may be stored in a hierarchical file format. The geometry may represent the essential part and thereby the backbone of this data structure. All other data may be stored in separate files, based on references. The used file format may allow grouping of objects.
  • For example, a room may consist of several walls, a floor may consist of several rooms, and a building may consist of several floors.
  • This may even allow composing several buildings in one or more hierarchies of superordinate structures, such as cities, regions, or countries.
  • The hierarchical file format may allow dynamically loading partial data from a cloud instance to the smartphone/web application, leading to clearly lower latencies compared to (potentially large) monolithic files. A possible structure is sketched below.
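As a purely illustrative sketch, such a hierarchical structure with references might look as follows, with the geometry as the always-loaded backbone and heavy optional data referenced by placeholder paths that can be fetched lazily per use case.

```python
building = {
    "type": "building",
    "floors": [{
        "type": "floor",
        "rooms": [{
            "type": "room",
            "geometry": "rooms/room_01/geometry.json",  # essential backbone
            "textures": "rooms/room_01/textures.bin",   # optional, large
            "objects": "rooms/room_01/objects.json",    # optional annotations
        }],
    }],
}
```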
  • An export may be possible to diverse 3D data formats, such as ReluxDesktop (.rdf), Industry Foundation Classes v4 (.ifc), SketchUp (.skp), Wavefront (.obj), GL Transmission Format (.gltf and .glb), Autodesk Drawing Interchange File (.dxf), Autodesk Filmbox (.fbx), Alembic (.abc), COLLADA (.dae), Radio Planning / WiFi Network Planning (.zip), Stereo Lithography (.stl), Video (.mp4), and Universal Scene Description (.usd).
  • Annotations in 2D may be used to train supervised machine learning models to perform segmentations.
  • Methods for 2D segmentation may comprise convolutional neural network-based approaches such as U-Net, Mask R-CNN, detectron2, and transformer-based approaches such as SegFormer.
  • Also, 3D point cloud segmentation approaches (PointNet, PointNet++, Point Transformer, Point Cloud Transformer, etc.) and 3D volumetric methods (3D U-Net, sparse 3D U-Net, etc.) may be applicable. It may be particularly useful to decide, based on the characteristics of the objects of interest, whether to perform 2D or 3D segmentation. While 2D segmentation approaches may be effective, particularly in the case of small planar structures, 3D approaches may be helpful in the case of larger and non-planar objects.

Abstract

A computer-implemented method of creating a building interior model and providing user interaction with the building interior model, comprising: receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior; creating a 3D model of the building interior using a first subset of the plurality of 2D frames; determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames; representing the position of the capturing device for a third subset of the 2D frames in the 3D model of the building interior; receiving a user input indicating one of the positions of the third subset of the 2D frames; and displaying the 2D frame corresponding to the indicated position to the user.

Description

CREATING AN INTERACTIVE BUILDING INTERIOR MODEL
TECHNICAL FIELD
[0001] The present disclosure generally relates to the field of computer-implemented modeling of an interior room, for example for generating or composing a virtual reality representation or scene of the interior room. In particular, the present disclosure relates to a computer-implemented method of creating a building interior model and providing user interaction with the building interior model. Moreover, the present disclosure relates to one or more devices, apparatuses, systems, computer programs and/or corresponding computer-readable media for carrying out the aforementioned method.
BACKGROUND OF THE INVENTION
[0002] Computer-implemented or computer-driven technologies for modeling environments have been increasingly developed and successfully applied in many different fields of technology and industry. Typical examples are virtual reality (VR) or augmented reality (AR) technologies, where for example sensor data of one or more sensors, such as image sensors capturing an environment of a user, can be used to generate, supplement or enrich virtual scenes simulating or modeling the environment based on one or more computer-generated models. Other examples include so-called mixed reality (MR) or extended reality (XR) technologies, where elements of real environments and virtual environments are typically merged to generate scenes or corresponding models representing new environments.
[0003] Nowadays, such techniques and technologies or certain aspects thereof are applied in many different fields of technology, already on a routine basis. Non-limiting examples and exemplary use cases comprise medical applications, such as augmented reality operating theatres or assistive technology in the medical field, educational or training applications, such as flight simulators or the like, the gaming industry, virtual meeting or conference rooms, and autonomous driving.
[0004] Depending on the actual use case or application, different hardware and/or software components may be involved, such as one or more sensors capturing sensor data, one or more computing devices for processing the sensor data or other data, one or more devices for displaying a virtual scene, and/or one or more devices allowing a user to interact with the generated virtual scene. While some applications may require dedicated hardware, such as VR or AR glasses or goggles, for displaying or visualizing a virtual scene and allowing a user to interact with the scene, other applications can be run on regular computing devices or even portable user devices, such as smartphones, tablets or notebooks.
[0005] Further, for actually computing or generating models representing one or more scenes of a particular environment, non-learning based approaches of computing two- or three-dimensional models of the scene or certain aspects thereof have been proposed and used. Increasingly, however, such non-learning based approaches are supplemented and/or replaced by artificial intelligence (AI) based and/or learning-based methods, such as machine learning and deep learning methods, where an artificial intelligence algorithm, engine, module or circuitry is trained with corresponding training data to model one or more particular aspects of an environment. Therein, a quality, robustness and user experience may depend on the data available and used to generate or compute the models of the scene or environment, such as for example on the sensor data, on the training data used to train the AI-based module, and on other software or hardware components potentially involved in computing the models, visualizing them and allowing a user to interact with them. Also, certain environments may be particularly challenging in terms of generating realistic and high-quality virtual scenes or corresponding models at a high level of user-experience. An example of such environments are interior rooms, such as rooms in a house or building. Modeling an interior room may be particularly challenging in terms of accurately reconstructing the room, and optionally one or more objects arranged therein, based on generating or computing a corresponding model of the room, in terms of visualizing the reconstructed room based on or using the model, and in terms of enabling a user to interact with such virtual representation, for example to alter or modify the virtual representation of a room displayed on a user device.
[0006] 3D models of the interior of buildings are needed for many use-cases, such as light planning, wifi/5G/6G planning, pre-demolition planning, kitchen planning, and computer aided facility management. In the case of existing buildings, 3D plans (such as building information models) are often not available. The digitization of buildings is typically time-consuming and expensive, requiring dedicated hardware such as laser scanners as well as post-processing by experts.
[0007] Until high-quality models of interior rooms may be autonomously obtained, creating 3D building interior models relies on user input and user interaction. Creating 3D models of a building interior is laborious and time consuming, even with semi-automated approaches, even more so when the 3D models are to be very specific. As later learning-based approaches rely on the quality of the dataset, it is desirable to create the best 3D models possible, despite the high and tiresome workload for human modelers.
SUMMARY OF THE INVENTION
[0008] It may, therefore, be desirable to provide for an improved computer-implemented method, corresponding devices, systems and techniques for creating a building interior model. Specifically, it may be desirable to provide a method facilitating the creation of high-quality models by providing easy, accessible and intuitive human interaction with the model.
[0009] This is achieved by the subject matter of the independent claims, wherein further embodiments are incorporated in the dependent claims and the following description.
[0010] In an aspect of the present disclosure, there is provided a computer-implemented method of creating a building interior model and providing user interaction with the building interior model. The method comprises: a) receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior; b) creating a 3D model of the building interior using a first subset of the plurality of 2D frames; c) determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames; d) representing the position of the capturing device for a third subset of the 2D frames in the 3D model of the building interior; e) receiving a user input indicating one of the positions of the third subset of the 2D frames; and f) displaying the 2D frame corresponding to the indicated position to the user.
[0011] The disclosed method provides for time- and cost-efficiently scanning, reconstructing, segmenting, visualizing, interacting with, and exporting the interior of buildings based on mobile devices. First, a user may capture one or more interior scenes by moving a user device, such as a smartphone or tablet, in a building to acquire sensor data being representative for the building's interior. As soon as the user starts scanning, a reconstruction method may start running on the mobile device, receiving the sensor data and generating a parametric 3D geometry. The 3D geometry may be quickly provided to the user as feedback by showing preliminary reconstructions of the scene. After finishing the scan, the user may then efficiently interact with the visualized 3D building model and, for example, enrich the model by defining object models which are associated with objects in the room. Based on performing segmentation of objects of interest in 2D frames, the 3D building model may be enriched. Segmentation may be performed based on user prompts, for example in the form of mouse clicks.
[0012] The capturing device captures or provides sensor data or image data for further processing. The plurality of 2D frames may refer to the totality of all the data or 2D frames captured by the capturing device when the building interior is scanned by a user. The capturing device may be, for example, a camera able to capture 2D image data, either in the form of photographs or still images, or video. According to an embodiment, the sensor data provided by the capturing device includes one or more of RGB sensor data (RedGreenBlue, RGB) and multispectral sensor data. Alternatively or additionally, the capturing device includes at least one RGB sensor and/or at least one multispectral image sensor. Generally, any number of image sensors of one or more types of image sensors can be used alone or in combination to generate the data of the capturing device. This can include monochrome image sensors, color image sensors, RGB image sensors, multispectral image sensors, infrared image sensors, ultraviolet image sensors or any other type of image sensor. Accordingly, the term image sensor or capturing device may be used herein to broadly refer to any type of sensor configured for acquiring electromagnetic radiation at one or more wavelengths of the respective radiation.
[0013] According to an embodiment, the sensor data of the capturing device includes depth sensor data. Alternatively or additionally, the at least one capturing device includes one or more depth sensors, such as for example a LIDAR sensor and/or a stereo camera. Using depth sensor data may particularly allow precise determination of geometrical parameters of the interior room and/or one or more objects arranged or located therein, such as dimensions of the interior room or one or more boundaries thereof. Hence, using depth sensor data can further increase the quality and precision of the process of reconstructing and/or modeling the building interior.
[0014] According to an embodiment, the sensor data of the capturing device includes further sensor data of one or more of a gyroscope, an accelerometer, a radiation sensor, a thermal sensor, a laser sensor, an acoustic sensor, a pressure sensor, a nearfield sensor, and a capacitive sensor. Generally, further sensor data of any one or more further sensors may be combined and/or merged with sensor data of the capturing device to generate the sensor data of the capturing device.
[0015] The data captured or provided by the capturing device is then used to create a 3D model of the building interior. Any suitable known method for creating such models may be used. For example, any of the methods described in any one of WO 2023/174556 A1, WO 2023/174555 A1, WO 2023/174559 A1, WO 2023/174562 A1, or WO 2023/174561 A1 may be used. In short, point clouds may be extracted from the 2D frames and merged together to create a global point cloud, which may then be segmented and annotated to identify objects in the building interior, which may then be represented by parametric models in the 3D model.
[0016] The 3D model is created from a first subset of the plurality of 2D frames. The first subset may comprise all 2D frames of the plurality of 2D frames or the first subset may comprise fewer 2D frames than the plurality of 2D frames. For example, the first subset of 2D frames may be filtered out from the plurality of 2D frames according to their suitability for creating the 3D model. Possible criteria for selecting the first subset of 2D frames from the plurality of 2D frames may be viewing angle, visibility of certain objects in the building interior or parts of the building interior, sharpness, distance of the capturing device from certain objects in the building interior or parts of the building interior, coverage of the building interior, for example by overlap, and overall image quality. For example, during scanning of the building interior, 2D frames may be taken a predefined number of milliseconds apart or after the capturing device was moved a certain amount or rotated a certain degree. The first subset may be similarly chosen, for example by using a higher number of milliseconds or a greater movement or rotation between 2D frames of the first subset than in the plurality of 2D frames.
[0017] A spatial position of the capturing device in the 3D model of the building interior is determined for a second subset of the plurality of 2D frames. The spatial position of the capturing device refers to the position the capturing device was in when capturing the respective 2D frame, transferred into the 3D model. In other words, the position the capturing device would need to be in to capture a 2D frame of the 3D model representing the same part of the building interior as the 2D frame is determined. The spatial position therefore refers to a point or a volume in the coordinate system of the 3D model. The spatial position of the capturing device may be determined from the 2D frames. For example, the position may be determined relative to the reconstruction of the building interior in the 3D model based on the perspective of the 2D frames and their registration / overlap with each other. Alternatively or additionally, determining the position may also comprise further sensor data, for example from a gyroscope, a compass, and/or a GNSS receiver, for example a GPS receiver.
[0018] The second subset of 2D frames may comprise all 2D frames of the plurality of 2D frames or the second subset may comprise fewer 2D frames than the plurality of 2D frames. The second subset may comprise all of the 2D frames of the first subset or the second subset may comprise fewer 2D frames than the first subset. There may also be 2D frames in the second subset which are not part of the first subset. Possible criteria for selecting the second subset of 2D frames from the plurality of 2D frames or the first subset may again be viewing angle, visibility of certain objects in the building interior or parts of the building interior, sharpness, distance of the capturing device from certain objects in the building interior or parts of the building interior, coverage of the building interior, for example by overlap, and overall image quality. However, while the first subset is selected in view of the merit of each of the selected 2D frames for building the 3D model, the second subset may be selected in view of presenting the user with the best overview of the building interior. Therefore, the second subset contains 2D frames providing a representative view of objects and/or parts of the building interior, particularly for a human beholder.
[0019] From the 2D frames of the second subset, a third subset is chosen. The third subset may comprise all 2D frames of the second subset or the third subset may comprise fewer 2D frames than the second subset. For example, representative 2D frames may be chosen from the second subset and may form the third subset. The previously determined spatial position or position of the capturing device is then represented in the 3D model of the building interior for the third subset of the 2D frames. The representation may comprise a small 3D or 2D object or icon placed at the position of the capturing device in the 3D model. In this way, a user immediately recognizes where the capturing device captured a 2D frame of the building interior and therefore immediately knows whether or not the respective 2D frame may be relevant, for example for an inspection of the 3D model or the underlying building interior or for a modification of the 3D model. The number of 2D frames may be reduced in the third subset in comparison to the second subset, for example to avoid cluttering the 3D model with the representations of the positions of the capturing device in the 3D model.
[0020] The user may then indicate one of the positions of the capturing device represented in the 3D model by user input. For example, the user may click on one of the representations of the positions of the capturing device in the 3D model, for example by using a mouse or tapping a touchscreen. The 2D frame corresponding to the position selected and indicated by the user is then displayed, for example together with or next to the 3D model. The displayed 2D frame may be displayed in full resolution or scaled to a resolution so as to be easily fully viewable but not obstructing other parts of the screen, for example the 3D model. The displayed 2D frame may be changed whenever the user so desires by selecting or indicating another representation of the position of the third subset of the 2D frames, for example by clicking on another representation. Such a user input again leads to displaying the selected 2D frame, for example by replacing the previously displayed 2D frame.
[0021] In this way, the user can easily jump through different 2D frames representing different views, for example of objects in the building interior or different parts of the building interior. Simultaneously, the 3D model may also be displayed together with the 2D frame corresponding to the indicated position. Therefore, the user has an immediate and intuitive overview of the representation of the actual building interior in the 3D model. This facilitates the interaction with the 3D model, which encourages and enables the user to edit or modify the 3D model more closely or in a more detailed way, ultimately leading to a high-quality 3D model which might have otherwise taken a lot more work and a lot more time to achieve.
[0022] The capturing device used for capturing the sensor data may capture the sensor data either as still images or still photographs and/or as a video. In other words, the plurality of 2D frames may be captured as still images or still photographs and/or as a video. In the case that the 2D frames are captured as a video, the video may be interpreted as a succession of single 2D frames, which may easily be extracted from the video. Therefore, a video may also be representative of the plurality of 2D frames and/or may serve as raw data from which the plurality of 2D frames may be extracted.
[0023] As explained, a building interior may be scanned by a user using the capturing device and capturing the plurality of 2D frames. The frames being part of the second and third subset may be chosen from this plurality of 2D frames later, after the scan is complete. For example, these frames may be automatically selected during the step of creating the 3D model based on heuristics of the 3D model, for example heuristics based on the selection criteria as explained above. However, it may also be provided that the user may, already during the scan, mark or indicate special 2D frames of which the user wants to have a representation of the position of the capturing device in the 3D model independently from any other considerations. For example, the user may want to capture a special 2D frame representing a particularly useful overview of at least a part of the building interior or an object in the building interior. For example, a special 2D frame may represent a photograph of a border of a room, for example a wall, and/or of objects on that wall, such as windows, doors, etc., taken at an angle which may be particularly representative of the captured scene. To make sure that this special 2D frame is part of the second and third subset of 2D frames, the user may indicate the special 2D frame by a user input. For example, the user may select a button on the smart phone or tablet computer indicating that the next photograph that is about to be taken or that the last photograph that has been taken by the user is a special 2D frame. In summary, therefore, the method may comprise in step a), receiving a user input indicating at least one special 2D frame, and/or, in step b), determining at least one special 2D frame based on heuristics of the 3D model, in step c), determining a spatial position of the capturing device in the 3D model of the building interior for the at least one special 2D frame, and in step d), representing the position of the capturing device of the at least one special 2D frame in the 3D model of the building interior, wherein the representation of the at least one special 2D frame may also be used in steps e) and f). In this way, the special 2D frame is always represented in the 3D model in terms of the position of the capturing device when capturing the special 2D frame. Of course, as explained above for the representations of the positions of the capturing device, also these representations may be indicated by the user through user input for displaying the special 2D frame. The special 2D frames may or may not be used when creating the 3D model. They may be part of the pool of the plurality of 2D frames from which the 2D frames for creation of the 3D model is chosen, but independently from whether or not they are chosen for the creation of the 3D model, the position of the capturing device when capturing the special 2D frames is always represented in the 3D model for further use by the user.
[0024] In principle, the at least one special 2D frame may be captured as a still image, for example a still photograph, or as a video, similarly to the plurality of 2D frames. However, capturing the at least one special 2D frame as a still image, for example a still photograph, may be preferred, as this typically enables a higher resolution and/or sharpness for the special 2D frame. This, in turn, increases the benefit of having the special 2D frames for the user. To further increase image quality, data-driven image enhancement methods, for example convolutional neural networks or vision transformers, may be used, particularly to increase the quality of moving frames.
[0025] To further increase the intuitive overview provided by the present disclosure, the representation of the position of the capturing device in the 3D model, in other words each representation in step d), may comprise both the spatial position and the orientation of the capturing device when capturing the corresponding frame. The orientation of the capturing device describes the direction in which the capturing device was pointing when capturing the respective 2D frame. In other words, the orientation of the capturing device describes the viewing direction of the capturing device, which is then also represented in the respective 2D frame. For example, the representation of the position of the capturing device in the 3D model may comprise a 3D or 2D object or icon which also indicates the orientation of the capturing device. For example, the representation may comprise a stylized camera indicating both the position and orientation of the capturing device for each 2D frame. Another possibility comprises an arrow which may be added to the representation of the position to indicate the orientation of the capturing device. In this way, the user knows which perspective to expect from each position of the capturing device represented in the 3D model and can therefore select the 2D frames to be displayed very intuitively and without having to search for the desired perspective or view by clicking on several different positions until the correct one has been found.
[0026] The method may also comprise representing the trajectory of the capturing device through the building interior between the positions of the capturing device in the 3D model, i.e. the positions represented in step d) in the 3D model. For example, the representations of the positions of the capturing device when capturing the respective 2D frames may be connected to each other through lines. Specifically, the lines may follow and/or indicate the timeline in which the 2D frames of which the position of the capturing device is represented in the 3D model were captured during scanning of the building interior. The representations of the position of the capturing device in the 3D model may therefore be chained to one another in a curve or line representing or approximating the movement of the capturing device through the building interior during the scan. For example, the first represented position in the 3D model may be connected to the subsequent representation. Each following representation may be connected to both the previous and the subsequent representation up until the last representation in the 3D model, which is only connected to the previous representation. All mentions of previous and subsequent representations may denote the immediately previous or subsequent representation, i.e. neighbouring representations. Such a representation of the trajectory may increase the overview of the user over the 3D model and the underlying 2D frames even more.
[0027] At times, the user may look for 2D frames showing a specific object in the building interior or a specific view of a part or an area of the building interior. The user may select one of the representations of a position of the capturing device in the 3D model to display the respective 2D frame, as explained above. However, it may be that another 2D frame may show the object or area in question even better than the one corresponding to the representation of the position of the capturing device in the 3D model. Or, in another case, simply having different but similar views of the same object or area may also be helpful. Therefore, the method may comprise determining 2D frames showing a similar view as the displayed 2D frame in step f) from any one of the plurality of 2D frames, the first subset, the second subset or the third subset, and displaying at least a part of the determined 2D frames as preview images. Preferably, the method may comprise, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user. In other words, additional 2D frames for display as preview images may be selected from all available 2D frames. They may be selected automatically through a similarity comparison or similarity metric. They may also be selected by object recognition. For example, 2D frames showing a similar perspective of a certain area or part of the building interior may be selected. 2D frames showing an object identified in the displayed 2D frame may also be selected. 2D frames showing a similar view as the displayed 2D frame may therefore mean that the 2D frame shows the same object or the same area of the building interior as the displayed 2D frame, albeit from a different distance, angle and/or perspective. These selected 2D frames, or at least one of them, may then be displayed as preview images, for example thumbnails, which may be indicated by the user through a user input, for example by clicking on them. When the user indicates one of the preview images of the selected 2D frames, the 2D frame corresponding to the preview image is then fully displayed as in step f) of the method according to the present disclosure.
[0028] One of the advantages of having a 3D model of a building interior is that it is very intuitive for a user to orient themselves in the model. This may also be used by the user to select specific views. For example, the method may comprise displaying the 3D model to the user in a first-person perspective, receiving a user input indicating a desired view in the 3D model and displaying the desired view. In other words, the user may move through the 3D model in the first-person perspective. The method may further comprise determining 2D frames showing a view of the building interior corresponding to the displayed view of the 3D model, and displaying at least a part of the determined 2D frames as preview images. Preferably, the method may comprise, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user. This may mean that while the user moves through the 3D model in the first-person perspective, 2D frames corresponding to or showing a similar view as the displayed first-person perspective view of the 3D model are determined. In other words, 2D frames may be selected which show a view of the building interior which corresponds to the view of the 3D model displayed to the user in the first-person perspective. The determined frames or at least one or some of the determined frames, preferably the ones with the highest degree of similarity, are shown to the user as preview images and may be selected by the user by a user input, for example a mouse click. When the user selects one of the 2D frames through its preview image, that 2D frame is then displayed to the user as in step f) of the method according to the present disclosure.
[0029] Of course, it is not necessary for the user to look at the 3D model in the first-person view to be able to select an object represented in the 3D model or a part of the building interior represented in the 3D model to find similar views or perspectives in the plurality of 2D frames. Therefore, the method may comprise receiving a user input, for example a mouse click, indicating a point of interest and/or a region of interest in the 3D model, determining at least one 2D frame showing a part of the building interior corresponding to the point of interest and/or the region of interest in the 3D model from the plurality of 2D frames, and displaying the at least one determined 2D frame to the user. If more than one 2D frame is determined in this way, the one with the highest image quality or with the best view of the desired object and/or part of the building interior according to the user input of the user may be fully displayed as in step f) of the method according to the present disclosure. Further determined 2D frames may be displayed as preview images which, again, may also be selected by the user for full display as explained herein.
[0030] The best view of the desired object and/or part of the building interior as mentioned above may be determined, for example, based on the relative position of the capturing device to the object or part of the building corresponding to the point of interest and/or the region of interest of the 3D model. Also, the image quality, specifically the sharpness of the image of the 2D frame may be considered. It may therefore be provided that determining the at least one 2D frame is based on at least one or a combination of the following parameters: distance of the position of the capturing device to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model, sharpness of the 2D frame, and viewing angle of the 2D frame in relation to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model.
[0031] As mentioned briefly before, creating the 3D model, i.e. step b) of the method, may comprise annotating and/or segmenting the 2D frames. Any suitable automatic, semi-automatic or manual process may be applicable. Any of these processes may be prone to errors. Additionally, even without an error, a user may want to change the annotation and/or the segmentation to thereby also change the 3D model. The method may therefore comprise receiving a user input indicating a modification of an annotation and/or a segmentation of the displayed 2D frame, wherein the user input preferably is in the form of one or more mouse clicks or wherein the user input is in the form of one or more example images or one or more patch prompts, and modifying the 3D model accordingly. For example, the user may modify the segmentation of a 2D frame or part of the point cloud underlying the 3D model by defining positive and negative points on the 2D frame. For example, positive points may be part of the foreground, whereas negative points may be part of the background and therefore not part of an object to be segmented. In this way, the user may segment the 2D frame. An important aspect is then that the new segmentation or the modification of the segmentation and/or annotation is used to update the 3D model. After the user input for the modification is received, the 3D model is modified to consider the new annotation and/or segmentation. The resulting 3D model is then similar to the one that would have been created originally in step b), had the original segmentation and/or annotation been the one provided by the user.
[0032] Prompting based on individual clicks may need to be performed by the user individually for each object. To improve efficiency, prompting may also be enabled based on example images or patch prompts, as mentioned above and further explained below. For a given object type, by annotating one or more object instances in one or more frames, this system may generate an object type definition which may then be used to generate prompts for the segmentation method on yet unseen frames. In short, this may enable an automatic segmentation of object instances corresponding to the given object type on all available 2D frames based on one or more previously identified samples. In the context of machine learning, this may be referred to as a special type of one-shot learning. A patch prompt may therefore contain an image and preferably also an annotation, for example a mask or bounding box.
[0033] A method allowing prompting based on one or more visual examples may be obtained by combining a method based on point prompts (e.g. universal promptable segmentation methods such as the segment anything model) with a method that generates the point prompts automatically based on image examples. As an example, using a feature extractor (e.g. a convolutional neural network or a vision transformer), input data (prompt images and target frames) may be mapped into a feature space. In the feature space, similarities between the prompt and the frames to be segmented may be computed for a predefined grid in the frames. Areas showing a high similarity with areas inside the object in the prompt images may be candidates for positive point prompts, while areas with a large distance to the object in the prompt images may be candidates for negative points. Thresholds may be used to identify positive and negative points. Another example may be an end-to-end neural network taking the frames including masks as well as the target image as input and generating prompts for the interactive segmentation approach as output. This may be achieved by first generating a context embedding from the image patches using a set neural network (e.g. Set Transformer, PointNet) and second leveraging this generated context in combination with the image to be segmented to generate a sequence of segmentation prompt tokens by applying a contextualized image-to-sequence model.
[0034] Another aspect of the present disclosure relates to a computing system, comprising a smartphone or tablet computer and at least one computing device, wherein the computing system is configured to perform the steps of the method according to the present disclosure. All of the features, functions and advantages of the method according to the present disclosure are also applicable to the computing system and vice versa. The computing device comprised in the computing system may, for example, be a personal computer. The at least one computing device may comprise a keyboard and/or a mouse configured to receive a user input. It may also comprise a display device, for example a display screen for displaying information to a user, for example the 3D model and the 2D frames as explained herein.
[0035] Another aspect of the present disclosure relates to a computer program, which, when executed by one or more processors of a computing system and/or computing device, instructs the computing system and/or computing device, for example one or more of a handheld device and/or a personal computer, to perform the method according to the present disclosure. All of the features, functions and advantages of the method and/or the computing system according to the present disclosure are also applicable to the computer program and vice versa.
[0036] Another aspect of the present disclosure relates to a computer-readable medium having stored thereon the computer program as disclosed herein. All of the features, functions and advantages of the method and/or the computing system and/or the computer program according to the present disclosure are also applicable to the computer-readable medium and vice versa.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] Figure 1 shows interactive segmentation using a surface mode optimal for planar objects;
[0038] Figure 2 shows interactive segmentation using a box mode optimal for 3D objects;
[0039] Figure 3 shows a trajectory represented in the 3D model corresponding to the walk through the room over time during scanning the room with a mobile device;
[0040] Figure 4 shows a 3D building interior model and corresponding 2D frames during segmentation; and
[0041] Figure 5 shows a flowchart of the method.
[0042] The drawings are schematic only and not to scale.
DETAILED DESCRIPTION OF THE INVENTION
[0043] In Figure 5, a flowchart of the method 20 according to the present disclosure is shown. The method 20 may start in step 21 by receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior. Subsequently, in step 22, a 3D model of the building interior may be created using a first subset of the plurality of 2D frames. Step 23 may comprise determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames. The position of the capturing device may be represented for a third subset of the 2D frames in the 3D model of the building interior in step 24. A user input received in step 25, indicating one of the positions or representations of the third subset of the 2D frames, leads to step 26, which comprises displaying the 2D frame corresponding to the indicated position or representation to the user. These and further steps are described in more detail herein.
[0044] First, a user may scan the interior of a building, such as a room, a floor, a stairwell, a lobby, a hall, or even the interior of a whole building, with a mobile device by capturing the interior from different poses while moving the device. Since no stationary hardware is needed (in contrast to known solutions based on laser scanners), a room may be scanned within a few seconds and scanning does not require any expertise. In the following, the term "room" is used to refer to any of the mentioned entities. The obtained data may consist of images and potentially also corresponding depth data, where the latter may be captured with a physical depth sensor or may be derived from the image data. Regarding the time dimension, image data (frames) may be captured either regularly or based on certain heuristics (e.g. if the viewing angle and/or position change surpasses a certain predefined threshold, see the sketch below).
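A minimal sketch of such a movement-based capture heuristic, assuming device pose estimates (position and viewing direction) are available from the device's sensors; the threshold values are illustrative only:

```python
import numpy as np

def should_capture(last_pos, last_dir, pos, view_dir,
                   dist_thr=0.25, angle_thr=np.deg2rad(10.0)):
    """Capture a new frame once the device has moved or rotated
    beyond predefined thresholds (values are illustrative)."""
    moved = np.linalg.norm(pos - last_pos) > dist_thr
    # Angle between the previous and current viewing directions.
    cos_angle = np.clip(
        np.dot(last_dir, view_dir)
        / (np.linalg.norm(last_dir) * np.linalg.norm(view_dir)),
        -1.0, 1.0)
    turned = np.arccos(cos_angle) > angle_thr
    return moved or turned
```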
[0045] Based on the multitude of individual frames, in combination with a 3D registration, a 3D point cloud may be constructed. This 3D point cloud may then be used as a starting point for generating a parametric 3D interior geometry. The parametric 3D geometry may contain the interior geometry of all individual rooms and their relative positions. Thereby, digital twins of complete floors or even interiors of buildings may be represented in one single model, as in the case of building information models. The parametric model may be obtained by approximating the point cloud based on basic elements, which are geometrical 2D or 3D structures, such as planes, cuboids, cylinders, discs or spheres. Basic objects, such as windows and doors, may also be integrated into the geometrical model. This geometry may be augmented with additional arbitrary objects of interest. Based on segmentation of 2D frames, objects may be segmented and classified manually or semi-automatically by a user, or a fully automated segmentation may be performed. While the segmentation is performed on one or more 2D frames, the obtained segmentation masks may be translated into the 3D space and may be integrated into the 3D interior geometry. Additionally, a replacement of detected and segmented objects with corresponding 3D object models may be possible to enrich the semantic representation, interoperability and geometric accuracy of the model. For example, a detected power plug may be replaced by the manufacturer's 3D model if available or by a standard model. Analogously, a lamp may be replaced by a standard lamp model or an automatically retrieved model similar to the real lamp. The augmented 3D geometry may be referred to as a semantic 3D building model. Finally, this model may be exported in various formats, to seamlessly integrate into a user's downstream workflow.
[0046] The detection or segmentation of objects may be needed for various tasks. For example, for light planning, existing lamps and power connections may need to be identified. For radio planning, existing power connections and network devices may need to be detected. For facility management, fire extinguishers, signs, lamps, and different appliances may need to be identified. Since the visual appearance of objects may show a high variability and sufficient training data is scarce, fully automated segmentation approaches based on deep neural networks often do not yield satisfactory results. Thus, the present disclosure offers a semi-automated approach, bridging this gap until sufficiently large training data sets become available.
[0047] Since the method of the present disclosure may be agnostic to the used image sensors or capture devices, the captured data is denoted as frames or 2D frames, which may be RGB frames, but which may also be multispectral or monochrome frames.
[0048] The described system may be implemented as a hybrid cloud-edge architecture as follows: The scan may be performed with a hand-held mobile device (smartphone or tablet). The point cloud and at least a preliminary version of the 3D interior geometries may be computed and iteratively updated on the mobile device (edge device) during the scan. The reconstruction may be visualized in real time (with a short latency) on the device. After the scan, data may be transmitted to a cloud. The final 3D model may be accessed via a web browser while the data may be physically located in the cloud. Computationally intensive computing steps may be performed in a cloud infrastructure. Such steps may include, but are not limited to, segmentation, retrieval, and color or material estimation. Based on the requirements of a particular workflow, those augmentations may be invoked on demand. A microservice architecture may be useful to efficiently configure different pipelines to meet different workflows. The interaction (various types of visualization, measurement, manual segmentation, semi-automated segmentation) may be performed via a web browser. Generated semantic 3D building models may be exported in various formats to be integrated into various workflows.
[0049] By the term user device, the present disclosure refers either to a smartphone or a tablet. The user device necessarily has at least one capturing device, which may be or comprise at least one of the following: a visual sensor, i.e. a camera with an RGB sensor, a monochrome sensor or a multispectral sensor; a depth sensor (e.g. a LiDAR sensor); and a gyroscope.

[0050] Holding the user device in one or two hands, a user may capture data (with the device's sensors, i.e. the capturing device) which represents an interior room. Since one goal may be the reconstruction of the room's geometry, it is advantageous to capture at least parts of the main building blocks of the room. For that purpose, particularly edges and corners of the rooms may be of high relevance. During the scan process, it may be possible to acquire specially marked 2D frames which can later be easily identified and accessed in the visualized semantic 3D building model and used for further visualization and interaction, including the segmentation of objects of interest. Since the geometry may be reconstructed in real time on the device, the user may obtain guiding feedback (with a low latency) on the user device to react in the case of problems (e.g. by rescanning certain areas) and to actively optimize the final output. During the scan, the user may mark special frames which can later be easily identified based on representations such as points in the visualization of the semantic 3D model. As another possibility to access individual frames later, the trajectory of the scan may be recorded and visualized in the reconstructed 3D building model. Each frame (marked as a point in the trajectory) may then be selected by the user. After the reconstruction, the user can interact with the 3D building model. Individual frames can be selected based on different strategies.
[0051] This is exemplarily shown in Figure 3. The top of Figure 3 shows a 3D model 1 of an interior room comprising a door 3, chairs 4, a window 5, a sideboard 6, a table 7, and walls 10. Also shown are representations 8 of positions of the capturing device used to scan the room and a trajectory 9 of the capturing device connecting them. Each representation 8 or point contains a direction (3D vector) and corresponds to a captured 2D frame 2. The 2D frames 2 can be accessed / displayed (lower portion of Figure 3) by clicking on the representations 8 or points. The lower portion of Figure 3 exemplarily shows one of the 2D frames 2 accessible through the representations 8 in the 3D model 1.
[0052] For one, in the 3D model or 3D building model, icons representing the frames which were marked during the scan are displayed. The user may click on one of these icons and the corresponding frame may then be visualized, i.e. displayed, either fully or in a preview mode as a preview image. Optionally, the trajectory of the scan process as well as the captured frames may be visualized in the 3D building model. The user may click on one of the visualized icons and the corresponding frame is visualized as mentioned before. Also optionally, based on interacting with the 3D model in the first-person view, a frame may be selected which shows similar content (from a similar perspective) as the current perspective. The selected frame may then be visualized as mentioned before. Another option may be to compute candidate frames (and visualize them in preview mode) based on heuristics in combination with knowledge of the intended use-case (e.g. for light planning, frames pointing at the ceiling may be relevant). Another option may be to compute and visualize (in preview mode) frames showing a certain room element (e.g. a wall, a door, a window) after the user clicks on the element or its representation in the model.
[0053] When a frame has been selected by a user, one or more further frames may be automatically suggested in preview mode (i.e. shown at a small size), showing similar content (based on a similarity metric in a feature space) from neighbouring perspectives (i.e. similar extrinsic camera parameters). A possible ranking is sketched below.
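The ranking is sketched under the assumption that a per-frame feature vector (e.g. from a pretrained network) and a vector of extrinsic pose parameters are available for each frame; the combination and weighting of the two distance terms are illustrative:

```python
import numpy as np

def suggest_previews(selected, frame_ids, feats, poses,
                     k=4, w_feat=1.0, w_pose=0.5):
    """Rank other frames by a weighted sum of feature-space distance
    (similar content) and pose distance (neighbouring perspective).

    feats: dict frame_id -> 1D feature vector.
    poses: dict frame_id -> 1D vector of extrinsic parameters.
    Returns the k closest frame ids to show as preview images.
    """
    scores = []
    for fid in frame_ids:
        if fid == selected:
            continue
        d_feat = np.linalg.norm(feats[fid] - feats[selected])
        d_pose = np.linalg.norm(poses[fid] - poses[selected])
        scores.append((w_feat * d_feat + w_pose * d_pose, fid))
    scores.sort(key=lambda t: t[0])
    return [fid for _, fid in scores[:k]]
```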
[0054] Selected and visualized 2D frames may be used to augment the 3D building model with object models. Based on an interactive segmentation method (such as, for example, the segment anything model), functionality may be provided to generate segmentations of objects, based on user prompts. As the user defines positive and negative points for a single object in a frame, the interactive segmentation method may perform a binary segmentation into foreground (object) and background. Objects which are shown on more than one frame may be segmented sequentially by processing several frames. There may be two different modes:
[0055] The planar or surface mode may allow capturing planar objects. All segmented points may be mapped from the 2D frames into the 3D point cloud and may be projected to a plane in 3D space. This is exemplarily shown in Figure 1. The top of Figure 1 shows a 3D model 1 comprising walls 10, a door 3, a table 7, chairs 4, and a sideboard 6. In the 3D model 1, the user has made a user input indicating a view of the ceiling 11. Therefore, as shown in the lower part of Figure 1, a 2D frame 2 showing the ceiling 11 is displayed. To the right of the displayed 2D frame 2, several smaller preview images of 2D frames 2 showing a similar view to the displayed 2D frame 2 of the ceiling 11 are displayed to the user. The user may choose any of these preview images to have the respective 2D frame 2 fully displayed like the one on the left. The user may select flat objects on the ceiling 11, like lights or fixtures, for example by drawing a polygon on the 2D frame, which is then used for segmentation. The 2D frame 2 is segmented based on one or more mouse clicks on the object or the background. The planar segmentation is directly shown in the 3D model 1.
[0056] The box mode may allow capturing non-planar objects. All segmented points may be mapped from the 2D frames into the 3D point cloud and, based on all positive (foreground) points, a 3D point cloud (which may be a subset of the complete point cloud of the scan) and a corresponding 3D bounding box may be created. This may be repeated for all objects to be segmented in the scan data, both planar and non-planar objects. This is exemplarily shown in Figure 2. The top of Figure 2 shows a 3D model 1 comprising walls 10, a door 3, chairs 4 and a 3D bounding box 13. On the lower side of Figure 2, one or more 2D frames 2 are segmented based on one or more mouse clicks on the object or the background. The object in question in Figure 2 is a column 14. The generated 3D bounding box 13 covering all segmented points is directly shown in the 3D model 1. To the lower right of Figure 2, preview images 12 of more 2D frames 2 with similar views as the one displayed to the left are shown and may be selected for full display.
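A minimal sketch of the final step of the box mode: deriving a 3D bounding box from the positive (foreground) points mapped into the point cloud. An axis-aligned box is assumed here for simplicity; the disclosure does not prescribe a particular box type:

```python
import numpy as np

def bounding_box_from_points(points):
    """Axis-aligned 3D bounding box covering all positive (foreground)
    points mapped from the segmented 2D frames; points is an (N, 3) array.
    Returns the two opposite corners of the box shown in the 3D model."""
    return points.min(axis=0), points.max(axis=0)
```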
[0057] Based on each thereby captured object, a type (category) may be selected to create an object model. A predefined or user defined type may be selected. Particularly predefined categories may enable interoperability within groups of users. Properties may be manually assigned (e.g., based on drop down menus, free text, radio boxes, check boxes) to the object models. The object type specifies predefined and user-defined properties. The type of objects may even define a hierarchy using inheritance. For example, an object of type office chair may have the same parameters as a (general) chair with the additional property “number of wheels”. Categories may be defined individually for each user or for a whole organization (for better interoperability). Default values for properties (which may be overwritten by the user) may be specified based on heuristics or can be estimated based on the underlying data (e.g., the size may be estimated based on the size of the segmentation).
[0058] Based on the available device's sensors, frames may be captured in certain intervals. The period between capturing two frames may be fixed (i.e. in fixed time intervals) or may vary based on the movement of the device. To avoid capturing an unnecessarily large number of frames (in the case that the user does not move the device or moves the device very slowly), it may be beneficial to make use of information derived from the sensor data.

[0059] Room geometry may consist of walls, openings, doors, and windows. Based on the 3D point cloud data, spatial quantization and a projection, a rasterized representation showing the point cloud from a bird's eye view, containing two spatial dimensions and a third dimension containing point characteristics (e.g. colors), may be computed (see the sketch below). Since the third dimension does not refer to a spatial dimension, this representation is referred to as a 2D representation. In addition, a rasterized 3D representation containing the three spatial dimensions may be computed based on spatial quantization. The resolution of these representations may be adjusted to fit the hardware's performance (e.g. memory). Based on these two representations (e.g. a concatenation) and a neural network (e.g. a fully convolutional neural network), a map representing borders and a map representing edges may be generated. Based on these data, line-sampler modules may generate line proposals. Finally, a line-verification network may be applied to detect parametric representations.
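The bird's eye 2D representation described above might be computed as in the following sketch, which spatially quantizes the point cloud into a top-down grid and stores a point characteristic (here: the mean colour) per cell; the cell size is illustrative:

```python
import numpy as np

def birds_eye_raster(points, colors, cell=0.05):
    """Quantize an (N, 3) point cloud into a top-down 2D grid whose third
    dimension stores point characteristics (here: mean colour per cell).

    colors: (N, 3) array of per-point colour values."""
    xy = np.floor(points[:, :2] / cell).astype(int)
    xy -= xy.min(axis=0)                      # shift indices to start at 0
    h, w = xy.max(axis=0) + 1
    acc = np.zeros((h, w, 3))
    cnt = np.zeros((h, w, 1))
    for (i, j), c in zip(xy, colors):         # accumulate colours per cell
        acc[i, j] += c
        cnt[i, j] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)
```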
[0060] For detecting doors and windows, first, projections on the individual wall elements may be generated. The projections may contain colour information and other characteristics (e.g. point distances). Based on these projections, a 2D network may be used to generate segmentation output. Due to the rectangular shape of doors and windows, a bounding box object detector can be sufficient here. To also fit irregular shapes, additional segmentation may be performed (e.g. making use of a method such as Mask-RCNN).
[0061] State-of-the-art supervised learning-based segmentation approaches, such as Mask-RCNN, U-Net and Detectron2, may require training data sets which contain corresponding label data (segmentation masks providing a class label per pixel or polygons defining the objects of interest) for the objects or structures which are intended to be segmented. The methods may also require that a sufficient number of objects of interest are prevalent in the image data. While for many generic objects (e.g. chairs, tables, ...) such data sets exist (e.g. COCO), for domain-specific objects (e.g. wifi access points, fire extinguishers) appropriate data sets may often be unavailable. Generating manual annotations, e.g. based on polygons or based on pixel masks, may be time-consuming and expensive. The annotation effort may become significantly higher when operating on 3D data such as point clouds. The main reason for this drastic increase in annotation effort may be the requirement that the operator be trained on a 3D user interface. Herein, a strategy is proposed which reduces the complexity of segmenting 3D objects and which can be used for manually, semi-automatically, and automatically segmenting 3D objects. The manual and semi-automated segmentation of 3D objects may be used to generate training data for fully automated learning-based approaches.
[0062] During point cloud generation, a multitude of frames may be aggregated into a 3D point cloud based on a registration approach. After this generation, each frame may correspond to a partial 3D point cloud, which may be a part of the complete point cloud. Segmenting an object in 3D may be equivalent to segmenting the object in each frame in which it is present, translating the labels to the corresponding partial point cloud and merging the corresponding partial segmentations into a complete point cloud using the 3D information attached to the segmented pixels. This may allow a bidirectional connection between 2D frames and the 3D model: a pixel annotated in a 2D frame may be translated into the complete 3D point cloud, and inversely, an annotated point in the point cloud may be translated into one or more frames (a minimal sketch is given after the next paragraph). For segmenting 3D objects, not every frame showing the object necessarily needs to be segmented. A segmentation of a subset of these frames may be sufficient, either in the case of simple shapes (e.g. coplanar shapes) or in the case that a very high segmentation accuracy is not needed.

[0063] Interactive segmentation models, like the so-called "Segment Anything Model", may facilitate semi-automated segmentation of 2D images based on the manual selection of a small number of positive (object) and negative (background) points. The output of such models may then be parametrized by estimating its polygonal boundary. This polygonal outline may give an operator a straightforward way to interact with the given segmentation proposal by dragging its vertices.
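Returning to the 2D-to-3D translation described in paragraph [0062], a minimal sketch is given below. It assumes that, as a by-product of registration, each pixel of a frame can be mapped to the index of its 3D point in the frame's partial point cloud; the `pixel_to_point` array is such a hypothetical mapping:

```python
import numpy as np

def label_partial_cloud(mask, pixel_to_point):
    """Translate a 2D segmentation mask into labels on the frame's
    partial point cloud.

    mask: (H, W) boolean array from the 2D segmentation.
    pixel_to_point: (H, W) integer array mapping each pixel to the index
                    of its 3D point in the partial cloud (-1 if none).
    Returns the indices of 3D points belonging to the object.
    """
    idx = pixel_to_point[mask]
    return np.unique(idx[idx >= 0])
```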
[0064] Segmented or detected and classified objects may be replaced with 3D object models. An object model may refer to an object type, may contain a geometry (which may be of different types, such as a parametric description, a mesh, a polygon, or a bounding box), and properties (which may be manually assigned or automatically based on the image data). Object categories may be, e.g. power plugs, windows, doors, ... The object type may define the required and the optional properties. The type may also define how the geometry is modelled.
[0065] For example, a manual, semi-automated or automated method may segment or detect and classify an object in a 3D point cloud and may thereby provide a certain category as the object's category. The geometric data (e.g. point cloud) representing the object may be used as geometry; however, it may also be replaced by another representation (e.g. mesh, bounding box) or by a model from a data set which corresponds to the classified category. A parametric representation may even allow the estimation of certain parameters (e.g. size, aspect ratio). For simple objects, one object model per category may be sufficient (e.g. power plugs). For more complex objects, it may be beneficial to replace an object in the point cloud with a 3D model (of the same category) showing a high degree of similarity. This may be achieved by means of image retrieval based on 2D pixel data, 3D voxel data or 3D point cloud data. Based on a distance measure, a feature representation (e.g. obtained by means of a pretrained convolutional neural network (CNN) or self-supervised learning in combination with CNNs), and a data set containing one or more models per category together with 2D image or point cloud representations of the models, the best-fitting (most similar) model may be retrieved for an individual object in the point cloud (see the sketch below). Due to the given geometry of these models, inaccurate segmentations may be compensated. For example, in the case of standardized objects, such as power plugs, the retrieved model may be assumed to be more accurate than the acquired point cloud data. In the case that object dimensions vary (e.g. windows, doors), the predefined size may be obtained from the user. Since objects within one building may often be of the same type, it may be efficient if the user measures the dimensions of single objects and assigns the dimensions to several objects. For example, the window height and the door height may often be similar. Parametric models may be assigned these priorly given parameters. The models which may be used to replace the underlying data may contain visual characteristics similar to the real objects. However, this may not be necessary since a high visual quality is often not required for workflows. Therefore, the geometry may even be skipped as long as the position of the object in the semantic 3D building model is known. Visualization of objects may also be performed based on simple bounding boxes or point markers. However, often additional parameters and properties may be needed to specify the objects.
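A minimal sketch of the retrieval step, assuming feature vectors have been computed beforehand (e.g. with a pretrained CNN) for the segmented object and for each catalogue model of the classified category; the Euclidean distance stands in for any suitable distance measure:

```python
import numpy as np

def retrieve_best_model(object_feat, catalog):
    """Return the id of the most similar 3D object model.

    object_feat: 1D feature vector of the segmented object.
    catalog: list of (model_id, feature_vector) pairs for the category.
    """
    return min(catalog,
               key=lambda entry: np.linalg.norm(object_feat - entry[1]))[0]
```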
[0066] Detected objects may be replaced with object models. These models may contain properties which may be derived automatically from the image data (or from the 3D geometry). For example, size, aspect ratio, colour, and texture of objects may be determined from the image data. Other properties may not be determinable from the image data but may be added manually by means of user interaction. Several properties may be added based on a visual mask and different input modalities (e.g. radio box, check box, text box, drop down field). For example, manufacturer, type, year of construction, and charge number may be manually added. Also, properties which are determined automatically may be overwritten manually in the application. Properties which are not predefined may be manually added. This process may be referred to as augmentation.
[0067] Since many objects may show standard dimensions and/or regular shapes (circle, square, ball), the accuracy of the segmentation output may be optimized by estimating the dimensions and/or shape with the possibility of feedback by the user. E.g., the user may be asked whether the object of interest is symmetric, or has a rectangular, square, circular, ... shape. Thereby, problems with areas occluded during the scan may also be circumvented. And the need for annotating more than a single frame may be diminished.
[0068] A trained neural network (e.g., a convolutional neural network or a transformer network) may be used to segment a multitude of 2D image frames. Herein, the term segmentation refers to "instance segmentation", which may be described as object detection and classification followed by pixelwise annotation. For each image frame, a corresponding partial point cloud may exist. This partial point cloud may be obtained based on depth sensors (e.g. LiDAR) and/or based on merging 2D frames from multiple views or multi-camera settings. The partial point clouds (each corresponding to an individual image frame) including the segmentation output may then be merged (registered). Thereby, a global 3D point cloud may be generated with labelled points. The segmentation output may then be post-processed to increase robustness (despite single misclassified points) by applying heuristics. E.g., filtering may be performed to eliminate noisy segmentations. Finally, the points may be projected from the 3D point cloud to the parametric 3D geometry, e.g. a point close to a wall segment may be mapped to the closest point (e.g. orthogonal projection) on the wall segment (i.e. the distance to the wall plane is zero; see the sketch below). Since the parametric base elements may often be coplanar, after the projection, 2D approaches may be used for further post-processing to optimize the quality of the final segmentations (e.g. to achieve simple parametric representations).
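The orthogonal projection onto a wall segment mentioned above might look as follows; the wall plane is assumed to be given by a point on it and its normal vector:

```python
import numpy as np

def project_to_plane(points, plane_point, plane_normal):
    """Orthogonally project labeled 3D points onto a wall plane so that
    their distance to the plane becomes zero; points is an (N, 3) array."""
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (points - plane_point) @ n      # signed distance of each point
    return points - np.outer(d, n)      # move each point along the normal
```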
[0069] Segmentation may be performed in 2D and the resulting points may be translated into 3D. In the case that each point from 2D is simply mapped to 3D, the result may be a 3D point cloud, potentially with a vast number of points. Simple parametric representations which may be efficiently visualized and modified are advantageous. A polygonal shape lying on a plane, e.g., may be represented based on a few points on the plane (one point for each of the polygon's vertices). As soon as the plane is known, a point may consist of only two degrees of freedom. For the conversion from a 2D point cloud to a parametric representation, dedicated methods exist (e.g. polygon fitting approaches for 2D point clouds).

[0070] To manually identify relevant objects of interest, a user may be presented with a selected 2D frame. The frame may be selected based on a set of automatically or manually predefined frames. For manual selection, the user may trigger the acquisition of one or more marked frames during scanning of a room. Thereby, also the quality of the frame may be increased (since the frame may be captured without moving the device and potentially also the resolution can be increased). The selection may be performed while interacting with the model, e.g., in a first-person view. While the model is shown from a selected perspective, the frame or frames with the closest distance or distances to the current view may be visualized. A comparison of a frame and a position in the first-person view may be performed, e.g., based on the extrinsic camera parameters, i.e. position and orientation or rotation of the capturing device (3+3 degrees of freedom). The individual values of these parameters may be compared, a difference of each parameter may be computed, and the differences may be summed up by weighted addition. Squared (or even fourth-power) distances may be useful to ensure that each parameter is in a certain range and that no outliers exist (see the sketch below). The user may then select the frame which shows the objects of interest from the best perspective. Another option may be that the user selects a point by defining a point in the semantic 3D building model (e.g. based on a first-person view or top-down view) which represents the center of an intended (captured) frame (see Figure 4). Another possibility may be that the user defines a region in the 3D interior model to annotate. E.g., a whole wall or a part of a wall may be marked based on a polygon selection tool. Then, based on the selected region, one or more frames may be computed and visualized which show the selected area appropriately, i.e. at an appropriate distance and an appropriate angle (in the best case orthogonal to the plane) in a high quality (non-reference image quality measures can be used to assess motion blur and noise). These aspects (distance, angle, quality) may be weighted and optimized (e.g. the minimum of a weighted sum score may be computed) when searching for the one or more best-suited frames. Particularly if large areas are marked, several frames may be needed to capture the whole area. Finally, it may even be possible that the user did not capture frames showing the selected areas appropriately.
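A minimal sketch of the described pose comparison, assuming the extrinsic parameters of the current view and of a candidate frame are available as six values each (position plus orientation angles); the weights are illustrative and angle wrap-around handling is omitted for brevity:

```python
import numpy as np

def pose_distance(view, frame, weights=(1.0, 1.0, 1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of squared differences between two camera poses,
    each given as (x, y, z, roll, pitch, yaw). Squaring keeps every
    parameter in range and penalizes outliers, as described above."""
    diff = np.asarray(view, dtype=float) - np.asarray(frame, dtype=float)
    return float(np.sum(np.asarray(weights) * diff ** 2))
```

The frame with the smallest pose distance to the current first-person view would then be offered to the user.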
[0071] The user may then manually segment one or more objects of interest, for example by drawing a polygon based on consecutively clicking on points in the frame. After the annotation, for each annotated region, also a class label may be defined. A predefined class may be selected or a new class may be defined. For this purpose, a drop-down menu or radio buttons may be used. To aid the user, colours may be assigned to the classes, whereby the annotated regions may be highlighted with the respective colour. This procedure may be performed in one or more frames. The generated segmentations may finally be translated into the parametric 3D model and may be merged with the model. Depending on the characteristics of an object, it may be sufficient to segment a single frame. This may be particularly true for planar (or almost planar) objects. If an object is marked with a certain class label in a single frame, the object may be assumed to be an instance of this class, even though the user may not mark this object in each frame showing the same object, to save manual effort. This efficient behaviour may be achieved since the annotated regions are transferred from the 2D frame (using the 2D-3D mapping) into the global 3D point cloud, and a planar region in the 3D model may be automatically mapped to a class as soon as a single label is available.
[0072] Coplanar objects may typically be segmented easily based on a single frame. In the case of non-coplanar objects, the segmentation in a single frame, however, may lead to occluded parts of the object in the 3D model. To circumvent occluded parts, the annotation may need to be performed from different perspectives. Two options for selecting several appropriate frames are suggested herein. The first may be based on the first-person view, where the user may manually select additional appropriate views by moving inside the scene. Another option may be that the user selects a first frame, and based on this frame, the user may be presented with similar (neighbouring) views showing the object of interest from different perspectives.
[0073] First, a frame may be selected as in the case of manual segmentation. The user may then click on one object of interest. As soon as the first click has been performed, the semi-automated segmentation algorithm (any interactive segmentation method based on determining positive and potentially also negative seed points may be used, e.g. the segment anything model) may propose a segmentation into foreground and background. To refine, the user may iteratively click on further points inside the one or more objects of interest (defined as positive samples, e.g. left mouse button) and potentially also on points in the background (defined as negative samples, e.g. right mouse button). Thereby, the segmentation may be refined and visualized until the user is confident with the output. Also, a class label may be determined as in the manual segmentation approach.
[0074] The 2D segmentations (per frame) may then be translated into the 3D coordinate system. Each point in the partial point cloud (which may be related to a segmented frame) may be assigned to a segmentation category, based on the position in the segmented 2D frame. Each partial point cloud (based on a single frame) may be registered (aligned) by means of point cloud matching algorithms with all other partial point clouds resulting in a single global point cloud. Based on the segmentation of one or more frames, the annotations may be transferred to the corresponding partial point clouds and also to the global point cloud containing all (aligned, registered) partial point clouds.
[0075] Based on the point cloud, a parametric representation of wall elements and other objects, for example, may be obtained by fitting base elements, such as planes, cylinders, and surfaces to the point cloud which may finally provide a simplified and adjustable representation. A segmentation of objects may then be mapped from a point cloud to a parametric representation by means of an orthogonal projection to the closest base element, achieving a translation of segmentations from the global 3D point cloud to the parametric 3D model.
[0076] In the case of a planar object, the segmentation in one frame may be sufficient. In the case of non-planar objects, it may be necessary to annotate more than a single frame. Based on assessing the properties (topology) of the point cloud using heuristics (e.g. whether the marked points lie on a single plane), an automated approach may assess whether additional segmented frames are needed. Also, the user may decide whether or not to annotate further frames.
[0077] In the case that more than a single frame needs to be annotated, the following procedures may be applied. Based on the segmented points, unlabeled neighbors may be searched in the global point cloud. Segmentation labels of neighboring points (neighbors being determined based on a distance measure (e.g. L2 norm, L1 norm) and a chosen threshold) may then be transferred to the new points in different partial point clouds, and thereby also to different frames.
[0078] The thereby computed newly labeled points in new frames may then partially be used as new positive and negative seed points. For the negative class, points outside the annotated region may be used. Particularly, (positive and negative) points may be selected which are not too close to object borders (e.g. by means of morphological erosion/dilation of segmentation masks in the originally segmented frames followed by a random selection of points; see the sketch below). The semi-automated 2D segmentation proposal algorithm may then be applied to the new frames with the translated data. This procedure may be repeated. The method allows fully automated processing of the additional frames; however, due to the real-time performance, the user may also interact (set additional positive and negative points or use the polygon segmentation tool if needed) to ensure that the objects are correctly annotated.
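The erosion/dilation-based selection of seed points away from object borders might be sketched as follows, assuming a boolean 2D segmentation mask translated into the new frame; the margin and sample counts are illustrative:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def sample_seed_points(mask, n_pos=3, n_neg=3, margin=5, rng=None):
    """Pick positive/negative seed points away from object borders by
    eroding the mask (positives) and dilating it (negatives)."""
    if rng is None:
        rng = np.random.default_rng()
    core = binary_erosion(mask, iterations=margin)       # safely inside
    outside = ~binary_dilation(mask, iterations=margin)  # safely outside
    pos_idx = np.argwhere(core)
    neg_idx = np.argwhere(outside)
    pos = pos_idx[rng.permutation(len(pos_idx))[:n_pos]]
    neg = neg_idx[rng.permutation(len(neg_idx))[:n_neg]]
    return pos, neg
```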
[0079] To correct wrongly labelled points, the semi-automatedly generated segmentation masks may be adjusted (overruled) in each individual frame by means of manual annotation. After a correction, the semiautomated procedure may be rerun to translate the correction to the other frames in the 3D model.
[0080] This is exemplarily shown in Figure 4. The top shows a 3D model 1 of a building interior including objects in the building interior like walls 10, chairs 4 and a television screen 15, as well as a door 3 and a window 5. Also shown are representations 8 of several positions of the capturing device corresponding to respective 2D frames 2. The representations 8 may pertain to special 2D frames, for example frames showing plain views of the four borders of the room, which may have been marked by the user during the scan. On the lower side of Figure 4, on the left, a 2D frame 2 is displayed which shows radiators 16 below the window 5. As the radiators 16 are not yet represented in the 3D model 1, the user may segment these by hand, for example by drawing a polygon 17 around a radiator 16. The segmentation mask may also be visualized in the 3D interior model 1, see the polygon below the window 5 in the 3D model 1 shown in the top part of Figure 4.
[0081] Beyond parameterization based on point prompts, the semi-automated (interactive) segmentation approach may enable another category of parameterization based on image data. For example, a parameterization based on providing one or more exemplar image patches to a deep learning algorithm (on top of the image to be segmented) may be used. This may be referred to as patch-prompted segmentation. An image patch here may refer to a part of a full image (which could be a frame from a video) representing the bounding box with exactly the object of interest inside the bounding box. Optionally, also a segmentation mask representing the outline of the object of interest may be provided / available. The image patches may be obtained from operative data (i.e. the captured frames) in combination with manual or semi-automated segmentation of one or more objects of interest. Another option for obtaining patches may be based on external data (e.g. product images, or synthetic images generated from 3D models). Based on the provided image patch or image patches, the interactive segmentation approach may be initialized and objects similar to the objects in the exemplary patches may be detected and segmented.
[0082] There are different technical solutions for patch-prompted segmentation. One approach may be based on performing local feature extraction for the patch (prompt) images and the image to be segmented, resulting in feature maps. Based on a distance measure (e.g. Euclidean distance, cosine distance) and pairwise comparisons (between local features in the patch(es) and the image to be segmented), the similarity between regions in the image to be segmented and the object of interest may be estimated. Regions with a high estimated similarity (to one or more regions) may be used as candidates for positive prompts and regions with a low similarity may be used as candidates for negative prompts for initializing the interactive segmentation approach (e.g. the segment anything model). The actual prompt points may need to be sampled from the regions. Also, bounding boxes or mask prompts may be possible. Feature extraction may be performed based on previously trained convolutional neural networks or transformer networks in a supervised or self-supervised setting. Particularly the self-supervised setting may be advantageous since it allows adding the constraint that similar image regions are mapped to features showing a small distance. Another patch-prompt segmentation approach may be an end-to-end neural network taking the image patches (optionally including masks) as well as the image to be segmented as input and generating prompts for the interactive segmentation approach as output. This may be achieved by first generating a context embedding from the image patches using a set neural network (e.g. Set Transformer, PointNet) and second leveraging this generated context in combination with the image to be segmented to generate a sequence of segmentation prompt-tokens by applying a contextualized image-to-sequence model.
[0083] For fast matching of point clouds and for translating segmentations between frames, indexing may be used. Indexing may be useful to find nearest neighbours of points (originating from different frames) without the need to search through the whole point cloud for each individual point. Indexing may be performed, e.g., by spatially partitioning the global point cloud into a rectangular grid. Other options may comprise octrees, k-d trees, or R-trees; a sketch using a k-d tree is given below.
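A minimal sketch of index-based label transfer using a k-d tree (here via SciPy), assuming labeled points from already segmented frames and query points from a new partial point cloud; the search radius is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_labels(labeled_pts, labels, query_pts, radius=0.02):
    """Transfer segmentation labels to nearby unlabeled points using a
    k-d tree index instead of a brute-force search over the whole cloud.

    labeled_pts: (N, 3) array, labels: (N,) integer array,
    query_pts: (M, 3) array. Returns (M,) labels, -1 where no labeled
    neighbour lies within the radius.
    """
    tree = cKDTree(labeled_pts)
    dist, idx = tree.query(query_pts, distance_upper_bound=radius)
    hit = np.isfinite(dist)            # inf where no neighbour in range
    out = np.full(len(query_pts), -1, dtype=int)
    out[hit] = labels[idx[hit]]
    return out
```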
[0084] Planar objects (e.g. planar lightings) may be segmented in one run on a single frame (without the need to perform additional segmentations on other frames corresponding to different perspectives). Based on statistics of the point cloud (e.g. PCA), it may be estimated whether a partial point cloud is (roughly) planar or not (see the sketch below). In the case of planar objects, there may be no need to perform segmentation from different perspectives. Instead, a more accurate segmentation may be achieved by using the image frame and directly mapping the contour (e.g. the contour can be parameterized by a polygon) into the global point cloud and into the parametric 3D geometry. Since the points may not necessarily lie exactly on a plane, an option may be to project the points to a plane which approximates the object plane in an optimal way (e.g. based on least-squares optimization). Since objects may be associated with geometric basic elements, another option may be to map the points to the corresponding basic elements. Also, for objects which may not be completely coplanar (e.g. power plugs), it may be beneficial to use this mode.
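A minimal sketch of the PCA-based planarity estimate, together with the least-squares plane it implies; the threshold to apply to the returned ratio would be chosen heuristically:

```python
import numpy as np

def fit_plane_and_planarity(points):
    """Least-squares plane through an (N, 3) point set plus a PCA-based
    planarity score: the smallest eigenvalue of the covariance matrix is
    close to zero for (roughly) planar partial point clouds."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    normal = evecs[:, 0]                 # direction of least variance
    planarity = evals[0] / max(evals.sum(), 1e-12)
    return centroid, normal, planarity   # small ratio -> roughly planar
```

Points of a planar object may then be projected onto the fitted plane with the projection helper sketched earlier.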
[0085] Non-coplanar objects of interest may not be projectable to coplanar elements without a major loss of information. An option to segment such objects in three dimensions may be given by performing 2D segmentation in one or more frames, mapping the segmentations to the partial point clouds corresponding to the individual frames, aligning the one or more partial point clouds and extracting the segmented points. The annotated 3D points may then be translated into a simplified representation, such as a polygonal mesh or a parametric representation, to facilitate handling and visualization. The latter can be effective in the case of simple shapes, such as cuboids.
[0086] Since objects may be segmented from different views (corresponding to different frames), single mis-segmentations may lead to artifacts if all segmentation masks from all views are mapped to the global point cloud. Methods of resolution may, e.g., comprise point cloud filtering methods. In this way, artifacts or false positives may be removed.
[0087] While the 3D room geometry may be an essential part of the model described herein, other data may not necessarily be needed for each use case. While in some use cases, e.g., textures, colors or even material properties may be needed, which may correspond to significant memory usage, in other cases they may not even be considered. In the case that all data is stored in a monolithic file, the resulting files covering all data (to cover each individual use case) may be huge. To allow a high degree of flexibility, the data may be stored in a hierarchical file format. The geometry may represent the essential part and thereby the backbone of this data structure. All other data may be stored in separate files, based on references. The used file format may allow grouping of objects. For example, a room may consist of several walls, a floor may consist of several rooms and a building may consist of several floors. This may even allow composing several buildings in one or more hierarchies of superordinated structures, such as cities, regions, or countries. The hierarchical file format may allow dynamically loading partial data from a cloud instance to the smartphone/web application, leading to clearly lower latencies compared to (potentially large) monolithic files.
[0088] Allowing for an integration into various workflows, an export may be possible to diverse 3D data formats, such as ReluxDesktop (.rdf), Industry Foundation Classes v4 (.ifc), SketchUp (.skp), Wavefront (.obj), GL Transmission Format (.gltf and .glb), Autodesk Drawing Interchange File (.dxf), Autodesk Filmbox (.fbx), Alembic (.abc), COLLADA (.dae), Radio Planning / WiFi Network Planning (.zip), Stereo Lithography (.stl), Video (.mp4), and Universal Scene Description (.usd).

[0089] The manual as well as the semi-automated method may both be utilized to generate annotated training data during operations without requiring any additional effort. Since the user may be interested in annotation, tools for efficient manual or semi-automated annotation are provided. In the long run, annotations in 2D may be used to train supervised machine learning models to perform segmentations. Methods for 2D segmentation may comprise convolutional neural network-based approaches such as U-Net, Mask RCNN, Detectron2, and transformer-based approaches such as SegFormer. Since the annotations may be mapped into the 3D space, also arbitrary 3D point cloud segmentation approaches (PointNet, PointNet++, Point Transformer, Point Cloud Transformer, ...) and 3D volumetric methods (3D U-Net, Sparse 3D U-Net, ...) may be applicable. It may be particularly useful to decide, based on the characteristics of the objects of interest, whether to perform 2D or 3D segmentation. While 2D segmentation approaches may be effective, particularly in the case of small planar structures, 3D approaches may be helpful in the case of larger and non-planar objects.

Claims

1. A computer-implemented method of creating a building interior model and providing user interaction with the building interior model, comprising:
a) receiving a plurality of 2D frames captured with a capturing device of a smartphone or tablet computer, each frame representing at least part of a building interior;
b) creating a 3D model of the building interior using a first subset of the plurality of 2D frames;
c) determining a spatial position of the capturing device in the 3D model of the building interior for a second subset of the plurality of 2D frames;
d) representing the position of the capturing device for a third subset of the 2D frames in the 3D model of the building interior;
e) receiving a user input indicating one of the positions of the third subset of the 2D frames; and
f) displaying the 2D frame corresponding to the indicated position to the user.
2. The method according to claim 1, wherein the first subset comprises all 2D frames of the plurality of 2D frames or wherein the first subset comprises fewer 2D frames than the plurality of 2D frames.
3. The method according to any one of the previous claims, wherein the second subset comprises all 2D frames of the plurality of 2D frames or wherein the second subset comprises fewer 2D frames than the plurality of 2D frames.
4. The method according to any one of the previous claims, wherein the third subset comprises all 2D frames of the second subset or wherein the third subset comprises fewer 2D frames than the second subset.
5. The method according to any one of the previous claims, wherein the plurality of 2D frames is captured as still images and/or as a video.
6. The method according to any one of the previous claims, further comprising:
in step a), receiving a user input indicating at least one special 2D frame, and/or, in step b), determining at least one special 2D frame based on heuristics of the 3D model,
in step c), determining a spatial position of the capturing device in the 3D model of the building interior for the at least one special 2D frame, and
in step d), representing the position of the capturing device of the at least one special 2D frame in the 3D model of the building interior,
wherein the representation of the at least one special 2D frame may also be used in steps e) and f).
7. The method according to the previous claim, wherein the at least one special 2D frame is captured as a still image, for example a still photograph.
8. The method according to any one of the previous claims, wherein each representation in step d) comprises both the spatial position and the orientation of the capturing device when capturing the corresponding frame.
9. The method according to any one of the previous claims, further comprising representing the trajectory of the capturing device through the building interior between the positions represented in step d) in the 3D model.
10. The method according to any one of the previous claims, further comprising determining 2D frames showing a similar view as the displayed 2D frame in step f) from any one of the plurality of 2D frames, the first subset, the second subset or the third subset, and displaying at least a part of the determined 2D frames as preview images, preferably comprising, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user.
11. The method according to any one of the previous claims, further comprising: displaying the 3D model to the user in a first-person perspective, receiving a user input indicating a desired view in the 3D model and displaying the desired view, determining 2D frames showing a view of the building interior corresponding to the displayed view of the 3D model, and displaying at least a part of the determined 2D frames as preview images, preferably comprising, when the user selects one of the displayed preview images, displaying the corresponding 2D frame to the user.
12. The method according to any one of the previous claims, further comprising: receiving a user input indicating a point of interest and/or a region of interest in the 3D model, determining at least one 2D frame showing a part of the building interior corresponding to the point of interest and/or the region of interest in the 3D model from the plurality of 2D frames, and displaying the at least one determined 2D frame to the user.
13. The method according to the previous claim, wherein determining the at least one 2D frame is based on at least one or a combination of the following parameters:
- distance of the position of the capturing device to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model,
- sharpness of the 2D frame, and
- viewing angle of the 2D frame in relation to the part of the building interior corresponding to the point of interest and/or the region of interest of the 3D model.
14. The method according to any one of the previous claims, wherein step b) comprises annotating and/or segmenting the 2D frames and wherein the method further comprises: receiving a user input indicating a modification of an annotation and/or a segmentation of the displayed 2D frame, wherein the user input preferably is in the form of one or more mouse clicks or wherein the user input is in the form of one or more example images or one or more patch prompts, and modifying the 3D model accordingly.
15. A computing system, comprising a smartphone or tablet computer and at least one computing device, wherein the computing system is configured to perform the steps of the method according to any one of the preceding claims.
16. The computing system according to the previous claim, wherein the at least one computing device comprises a keyboard and/or a mouse configured to receive a user input.
17. A computer program, which, when executed on a computing device or system, instructs the computing device or system to carry out the steps of the method according to any one of claims 1 to 14.
18. A computer-readable medium having stored thereon a computer program according to the previous claim.