
CN119166004A - Information interaction method, device, electronic device, storage medium and program product - Google Patents


Info

Publication number
CN119166004A
CN119166004A
Authority
CN
China
Prior art keywords
information
interaction
display window
display
user
Prior art date
Legal status
Pending
Application number
CN202411302934.0A
Other languages
Chinese (zh)
Inventor
何震
郭飞
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202411302934.0A
Publication of CN119166004A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 - Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04815 - Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3438 - Recording or statistical evaluation of computer activity; Recording or statistical evaluation of user activity, e.g. usability assessment - monitoring of user actions
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 - Eye tracking input arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an information interaction method, an information interaction apparatus, an electronic device, a storage medium and a program product, belonging to the technical field of extended reality. The method includes: obtaining interaction behavior information generated on an interactive interface of an extended reality device, where the interaction behavior information includes a gaze duration for each of at least one display object in a display window of the interactive interface that is gazed at; when the gaze duration characterized by the interaction behavior information reaches a preset gaze duration, obtaining spatial attribute information of the display window on the interactive interface, semantic information of the display window, and the user's usage preference information for the extended reality device, where the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window; determining target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information and the usage preference information; and displaying the target interaction information on the interactive interface.

Description

Information interaction method, information interaction device, electronic equipment, storage medium and program product
Technical Field
The application belongs to the technical field of augmented reality, and particularly relates to an information interaction method, an information interaction device, electronic equipment, a storage medium and a program product.
Background
With the development of science and technology, users increasingly use Extended Reality (XR) devices for online interaction. The interactive interfaces of existing XR devices largely follow the design approach of mobile platforms such as mobile phones and tend to fold operation entries into interface elements to achieve an overall minimalist design style. While this design improves the visual experience, it increases operational complexity to some extent.
Currently, a user may operate an XR device via a handle, gestures, eye movements, and the like. Because operation entries are usually folded into interface elements, the user's operations become cumbersome, which reduces interaction efficiency and accelerates the user's hand and eye fatigue.
Disclosure of Invention
Embodiments of the present application aim to provide an information interaction method, an information interaction apparatus, an electronic device, a storage medium and a program product, which can solve the problems that user operations are cumbersome, interaction efficiency is reduced, and the user's hand and eye fatigue is accelerated.
In a first aspect, an embodiment of the present application provides an information interaction method, where the method includes:
obtaining interaction behavior information generated on an interactive interface of an extended reality device, where the interactive interface includes at least one display window, the interaction behavior information is interaction process information for the at least one display window, and the interaction behavior information includes a gaze duration for each of at least one display object in the display window that is gazed at;
when the interaction behavior information characterizes that the gaze duration reaches a preset gaze duration, obtaining spatial attribute information of the display window on the interactive interface, semantic information of the display window, and the user's usage preference information for the extended reality device, where the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window;
determining target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information and the usage preference information, where the target interaction information is associated with an operation intention of the user;
and displaying the target interaction information on the interactive interface.
In a second aspect, an embodiment of the present application provides an information interaction apparatus, including:
a first acquisition module, configured to acquire interaction behavior information generated on an interactive interface of the extended reality device, where the interactive interface includes at least one display window, the interaction behavior information is interaction process information for the at least one display window, and the interaction behavior information includes a gaze duration for each of at least one display object in the display window that is gazed at;
a second acquisition module, configured to acquire, when the interaction behavior information characterizes that the gaze duration reaches a preset gaze duration, spatial attribute information of the display window on the interactive interface, semantic information of the display window, and the user's usage preference information for the extended reality device, where the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window;
a determining module, configured to determine target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information, where the target interaction information is associated with an operation intention of the user;
and a display module, configured to display the target interaction information on the interactive interface.
In a third aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiments of the present application, the target interaction information is associated with the user's operation intention, so the interaction result can be displayed directly. Specifically, interaction behavior information generated on the interactive interface of the extended reality device is obtained; when the interaction behavior information characterizes that the user's gaze duration on a certain display object in the display window reaches the preset gaze duration, the target interaction information is determined according to the interaction behavior information, the spatial attribute information of the display window on the interactive interface, the semantic information of the display window, and the user's usage preference information for the device, and is then displayed on the interactive interface. In other words, the user's subsequent operation intention is predicted during interaction with the extended reality device, and target interaction information associated with that intention is displayed, so that the user can interact with the device based on the target interaction information. This simplifies the user's interaction operations, improves interaction efficiency, and relieves the user's hand and eye fatigue.
Drawings
Fig. 1 is a schematic flow chart of an information interaction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of determining activation weights according to an embodiment of the present application;
FIG. 3 is a schematic diagram of determining an association relationship between two modality information based on a cross-attention mechanism according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of determining an association between two modality information based on a cross-attention mechanism;
FIG. 5 is a schematic diagram of determining target interaction information using a multi-modal large model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an interactive interface provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an information interaction device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.
The terms "first", "second" and the like in the description and claims are used to distinguish between similar objects, and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally denotes an "or" relationship between the associated objects.
For existing XR interaction designs, most manipulation modes can be summarized as two operation primitives, "navigation" and "selection". Common "navigation" operation implementations include ray navigation based on handle tracking, action navigation based on gesture tracking, action navigation based on eye movement tracking, etc., and common "select" operation implementations include button selection based on handle, action selection based on gesture recognition, etc.
Whether relying on navigation operations based on handle tracking, gesture tracking, and eye-movement tracking, or on selection operations based on handle key presses and gesture recognition, the underlying logic depends on frequent movements of the hands and eyes. Consequently, operating an XR device through hand-eye interaction for a long time causes significant eye strain and arm fatigue for the user.
In addition, as described in the background section, the interactive interfaces of existing XR devices largely follow the design approach of mobile platforms such as mobile phones and tend to fold operation entries into interface elements to achieve an overall minimalist design style. While this design trend improves the visual experience, it increases operational complexity to some extent. For example, to "copy text from a reader into a browser for searching", the user goes through a series of actions: a) selecting the text by operating the cursor with the handle, b) triggering a menu with a long key press and selecting the copy command, c) selecting the browser address bar, d) triggering the menu again with a long key press and selecting the paste command, and e) clicking the search button. Even if eye-tracking capability is integrated, it mainly simplifies the "handle movement" operation; the interaction efficiency of the whole chain is still not optimized, which can accelerate the user's eye and hand fatigue.
Based on the above, in order to solve the problems in the prior art, embodiments of the present application provide an information interaction method, an apparatus, an electronic device, a readable storage medium, and a computer program product. The information interaction method can be applied to an extended reality scene, and can also be applied to the fields of digital intelligent popularization (such as advertisements, services and the like), natural intuitional interaction in the field of robots and the like, and is not limited herein.
The following describes in detail the information interaction method provided by the embodiments of the present application through specific embodiments and their application scenarios, with reference to the accompanying drawings.
Fig. 1 shows a flow chart of an information interaction method according to an embodiment of the present application. As shown in fig. 1, the information interaction method provided in the embodiment of the present application includes the following steps S110 to S140, and each step is explained in detail below.
S110, acquiring interaction behavior information of a user on an interactive interface of the extended reality device, where the interactive interface includes at least one display window, the interaction behavior information is interaction process information for the at least one display window, and the interaction behavior information includes a gaze duration for each of at least one display object in the display window that is gazed at.
Here, a plurality of applications may be installed on an Extended Reality (XR) device. A display window corresponding to an application may be displayed in the interactive interface. The display window may display text labels, text boxes, images, controls, and the like. A text box may be described using text near the user's viewpoint. An image may be described using a region of interest (Region of Interest, ROI) near the user's viewpoint. Controls may correspond to functions such as "confirm", "cancel", and "share".
From the time dimension, the interaction behavior information may include the interactions that occur between the user starting the XR device and the interaction behavior information being obtained. From the interaction type dimension, the interaction behavior information may include viewpoint information and target operation behaviors. The viewpoint information may include the user's gaze sequence and gaze durations on the display objects in the display window, and may belong to the "navigation operation" information mentioned above. A target operation behavior may include the user's selection operation on a control, and may belong to the "selection operation" mentioned above. From the information type dimension, the interaction behavior information may include a display object sequence, dwell duration information, and operation fragments.
Based on this, in order to improve accuracy of the following prediction of the interaction behavior based on the interaction behavior information, in some embodiments, S110 may specifically include:
Receiving a first input of a user to a display object in at least one display window;
Responding to the first input, acquiring a display object sequence corresponding to a plurality of display objects, a gazing duration for gazing each display object and a target operation behavior for operating at least one display window;
Determining dwell duration information of the user on each display object according to the gaze duration corresponding to each display object within a first preset duration;
Splitting a display object sequence into a plurality of operation fragments according to the target operation behavior;
and determining the interaction behavior information according to the display object sequence, the dwell duration information, and the operation fragments.
Here, the first input includes, but is not limited to, gaze input through the user's eyes, touch input on a display object through a touch device such as a finger or stylus, a voice instruction, a specific gesture, or other feasible inputs. The first input may be determined according to actual use requirements, which is not limited by the embodiments of the present application. The specific gesture may be any one of a single-tap gesture, a slide gesture, a drag gesture, a pressure-sensitive gesture, a long-press gesture, an area-change gesture, a double-press gesture, or a double-tap gesture; a click input may be a single click, a double click, or a click of any number of times, and may also be a long-press or short-press input. The first input may be, for example, the user's browsing input on a display object in the at least one display window.
The sequence of display objects may characterize an interaction sequence generated in accordance with an input order when a user makes a first input to the plurality of display objects. In addition, the target operation behavior of the user for operating the display window may specifically be a selection operation of a control in the display window by the user. The selection operation may include, for example, a user clicking on a "share" control, clicking on an "edit" control, clicking on a "confirm" control, and so forth.
As an example, a display object sequence may be generated by monitoring the user's interaction behavior. The sequence may be recorded, for example, as: M1-viewpoint leaves text label | M2-viewpoint enters image display area | M3-viewpoint hovers near the image (0.2, 0.4) | M4-key press triggers menu | M5-press the "share" button | M6-viewpoint leaves image area | M7-viewpoint enters browser search box | M8-key input of text | M9-key press triggers search. Here, M1-M4 and M6-M8 may belong to interface-layer "move" and "navigate" operations that do not involve background functions; that is, M1-M4 and M6-M8 may belong to the "navigation operations" mentioned above. In addition, M5 may be the operation "trigger the sharing function", and M9 may be the operation "trigger the search function"; M5 and M9 may belong to "selection operations" that involve background functions. That is, in the above example, the target operation behaviors may include M5 and M9.
As an example, if the display object sequence is long, it may be split into multiple segments so that the user's next interaction behavior, i.e. the user's operation intention, can be better predicted from the sequence later. Specifically, since a "navigation operation" is the precondition of a "selection operation" and the "selection operation" is the final purpose of the "navigation operation", the target operation behaviors can be used as separators to split the display object sequence into a plurality of operation fragments. An operation fragment may be the minimal operation sequence that moves and navigates to a "selection operation". The length of an operation fragment may reflect the complexity of the operation: the longer the fragment, the higher its complexity and the higher its importance for the subsequent prediction of interaction behavior; conversely, the shorter the fragment, the lower its complexity and importance. Based on the above example, the operation fragments may include M1-M5 and M6-M9.
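The splitting described above can be sketched in a few lines of Python. The helper name, the event labels, and the list-of-lists return format are illustrative assumptions rather than details from this application.

```python
# Sketch (assumed details): split a monitored display-object sequence into
# operation fragments, using the target operation behaviors ("selection
# operations" such as M5 and M9) as separators, as described above.

def split_into_fragments(sequence, target_operations):
    """Each fragment is the minimal run of navigation events that ends in
    one selection operation; fragment length reflects operation complexity."""
    fragments, current = [], []
    for event in sequence:
        current.append(event)
        if event in target_operations:  # a background-function trigger closes a fragment
            fragments.append(current)
            current = []
    if current:  # trailing navigation not yet followed by a selection
        fragments.append(current)
    return fragments

sequence = ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"]
print(split_into_fragments(sequence, {"M5", "M9"}))
# [['M1', 'M2', 'M3', 'M4', 'M5'], ['M6', 'M7', 'M8', 'M9']]
```

A longer fragment, such as M1-M5 above, would then be weighted more heavily in the subsequent prediction than a shorter one.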
In addition, the first preset duration may be a preset time window, for example 10 minutes. The dwell duration information may represent the proportion of the first preset duration occupied by the gaze duration of each display object, or the proportional relationship among the gaze durations of multiple display objects. The dwell duration information may be represented as a sliding histogram, and may reflect the user's attention to the display window.
As an example, by monitoring the user's interaction behavior, the user's gaze duration on each display object may be obtained. Statistical analysis of these gaze durations within the first preset duration yields the user's dwell duration information for each display object.
As a more specific example, the gaze point is sampled at a preset frequency (e.g., 10 Hz) over the first preset duration (e.g., 10 minutes), and the proportion of samples in which the gaze point falls on each display window is counted, thereby obtaining the sliding histogram.
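As a rough sketch of this statistic (the window identifiers and the sample list are invented for illustration; real gaze samples would come from the device's eye tracker):

```python
# Count, per display window, the fraction of gaze-point samples (taken at a
# fixed frequency, e.g. 10 Hz) that fell on it within the time window; these
# per-window duty ratios form the bins of the sliding histogram described above.
from collections import Counter

def gaze_duty_ratios(samples):
    counts = Counter(samples)
    total = len(samples)
    return {window: n / total for window, n in counts.items()}

# 10 samples ~ 1 s at 10 Hz: the "reader" window held 60% of the user's gaze
samples = ["reader"] * 6 + ["browser"] * 4
print(gaze_duty_ratios(samples))  # {'reader': 0.6, 'browser': 0.4}
```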
In this way, the interaction behavior information is determined from the longer display object sequence, the shorter operation fragments, and the dwell duration information obtained by statistically analyzing each display object, making the interaction behavior information richer and improving the accuracy of the subsequent interaction behavior prediction based on it.
S120, when the interaction behavior information characterizes that the gaze duration reaches the preset gaze duration, acquiring the spatial attribute information of the display window on the interactive interface, the semantic information of the display window, and the user's usage preference information for the extended reality device, where the semantic information is determined according to the type information and state information of the display window, and the spatial attribute information is determined according to the size information and position information of the display window.
Here, if the user's gaze duration on a certain display object reaches the preset gaze duration, the user's next interaction behavior can be predicted based on the interaction behavior information generated by the interaction between the user and the extended reality device.
In order to improve the accuracy of predicting the user's next interaction behavior, before the prediction, the spatial attribute information of the display window on the interactive interface, the semantic information of the display window, and the user's usage preference information for the XR device can be obtained, and the interaction behavior can be predicted from this multi-modal information: the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information. In the time dimension, the spatial attribute information and the semantic information can be regarded as instantaneous information, the interaction behavior information as short-period information, and the usage preference information as long-period information.
The spatial attribute information of the display window on the interactive interface may include size information and position information of the display window on the interactive interface.
Based on this, in order to improve the prediction efficiency of predicting the interaction behavior, in some embodiments, the obtaining the spatial attribute information of the display window on the interaction interface may specifically include:
Acquiring first size information and anchor point information of a display window on an interactive interface;
mapping the first size information into a target coordinate system to obtain second size information;
projecting the anchor point corresponding to the anchor point information onto a unit sphere to obtain anchor point pose information;
and determining the spatial attribute information according to the second size information and the anchor point pose information.
Here, the first size information may be actual size information of the display window at the interactive interface. The second size information may be size information obtained after mapping the first size information to the target coordinate system. The second size information and the first size information may be the same or different. The target coordinate system may be, for example, a normalized device coordinate system (Normalized Device Coordinates, NDC). The first size information of the different display windows at the interactive interface may be in different data dimensions. Therefore, in order to improve the standardization of the subsequent data processing procedure, the first size information may be mapped into the target coordinate system, that is, the first size information may be subjected to the standardization process to obtain the second size information. In addition, the larger the size of the display window, the higher the user's attention to the display window may be. The smaller the size of the display window, the lower the user's attention to the display window may be.
In addition, the anchor information may be position information of an anchor of the display window. The unit sphere may be a sphere with a radius of 1. Anchor point pose information may be expressed in terms of quaternions. The anchor point position information is obtained by uniformly projecting the anchor points to the unit sphere, so that the position information of the anchor points can be subjected to standardized processing, and the standardization of the subsequent data processing process is improved. If the anchor point pose information indicates that the anchor point is positioned at the front main view angle position of the unit sphere, the higher the importance degree of the display content in the display window corresponding to the anchor point can be. If the anchor point pose information indicates that the anchor point is located at the edge position of the unit sphere, the importance level of the display content in the display window corresponding to the anchor point can be lower.
In this way, by determining the spatial attribute information according to the second size information and the anchor point pose information after the normalization processing, the normalization of the spatial attribute information can be improved, and further the prediction efficiency of the subsequent prediction of the interaction behavior can be improved.
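A minimal sketch of the two normalization steps follows, assuming a simple linear size mapping and a sphere centered at the origin; the viewport dimensions and coordinate conventions are assumptions not specified at this level of the description, and the full pose would additionally carry the quaternion orientation mentioned above, omitted here.

```python
import math

def size_to_ndc(width_px, height_px, viewport_w, viewport_h):
    # First size information -> second size information: scale the window's
    # actual size into a normalized [0, 1] range, NDC-style.
    return (width_px / viewport_w, height_px / viewport_h)

def project_to_unit_sphere(x, y, z):
    # Project the anchor point onto the sphere of radius 1 around the origin,
    # standardizing anchor positions across display windows.
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        raise ValueError("anchor at the origin has no direction")
    return (x / norm, y / norm, z / norm)

print(size_to_ndc(960, 540, 1920, 1080))      # (0.5, 0.5)
print(project_to_unit_sphere(3.0, 0.0, 4.0))  # (0.6, 0.0, 0.8)
```

An anchor that projects near the forward viewing direction of the sphere would then be treated as more important than one near the edge, as described above.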
In addition, semantic information of the display window may be determined according to type information and state information of the display window. Wherein the type information may be determined according to the type of the display content in the display window. The type information may be expressed in terms of phrases. The type information may include dynamic types such as games, videos, etc., and static types such as pages, text, etc. In performing the prediction of the interaction behavior, the importance of the dynamic type may be higher than the importance of the static type. In addition, the status information may include whether the display window is in an active state or the display window is in an inactive state. That is, the display windows may include a first display window whose state information is in an active state and at least one second display window whose state information is in an inactive state.
Based on this, in order to improve the accuracy of the subsequent interaction behavior prediction, in some embodiments, determining the semantic information of the display window may specifically include:
Acquiring type information and state information of a display window, wherein the state information is used for indicating whether the display window is activated or not;
performing numerical processing on the type information under the condition that the type information is text information, and obtaining a numerical result corresponding to the type information;
Acquiring an activation weight corresponding to the state information;
And determining semantic information according to the numerical result and the activation weight.
Here, the numerical processing of the type information may map text information to numerical information to obtain the numerical result. For example, "game" may be mapped to "10" and "page" may be mapped to "5", where the magnitude of the value may represent the importance of the type information: the larger the value, the higher the importance of the type information in the subsequent interaction behavior prediction.
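A hypothetical sketch of this numerical processing, combined with the activation weight obtained in the later steps; the mapping values beyond "game" → 10 and "page" → 5 are assumptions for illustration only:

```python
# Illustrative mapping; dynamic types (game, video) rank above static ones.
TYPE_IMPORTANCE = {"game": 10, "video": 8, "page": 5, "text": 4}

def semantic_score(type_info, activation_weight):
    """Numerical result for the window type, scaled by the activation weight."""
    return TYPE_IMPORTANCE.get(type_info, 0) * activation_weight
```

Under this sketch, an active "game" window (weight 1) scores 10, while a "page" window with activation weight 0.375 scores 1.875.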
In addition, if the display window is in an activated state, the activation weight of the display window may be 1, and if the display window is in a deactivated state, the activation weight of the display window may be 0. In practice, however, even if the display window is in an inactive state, its data still contributes to the prediction of interaction behavior.
Therefore, in order to improve accuracy of the following prediction of the interaction behavior, in some embodiments, the acquiring the activation weight corresponding to the state information may specifically include:
Acquiring a window distance between the first display window and each second display window to obtain at least one window distance;
and determining the corresponding activation weight of each display window according to at least one window distance.
Typically, the first display window is a user-selected display window. The second display window may be a display window that is not selected by the user. The display window not selected by the user may be a "foreground" window or a "background" window. Wherein the "foreground" window may be closer to the first display window. The "background" window may be farther from the first display window.
As an example, if the interactive interface includes a first display window and two second display windows a and B, and the window distance between the first display window and the second display window a is 3, and the window distance between the first display window and the second display window B is 5, the activation weight corresponding to the second display window a may be 3/8, and the activation weight corresponding to the second display window B may be 5/8. In addition, the activation weight corresponding to the first display window may be 1. Wherein, the higher the activation weight may indicate the higher the user's attention to the display window.
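Read literally, the example computes each second display window's weight as its window distance divided by the sum of the window distances. A sketch of that reading (the general formula with the hyperparameter λ mentioned later in the text may differ):

```python
def example_activation_weights(distances):
    """Activation weight of each second display window as in the example:
    its window distance divided by the sum of all window distances."""
    total = sum(distances)
    return [d / total for d in distances]

# Distances 3 and 5 give weights 3/8 and 5/8, matching the example.
```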
According to the embodiment of the application, the contribution degree of the second display window to the interactive behavior prediction can be reserved by determining the activation weight according to the window distance, so that the accuracy of the subsequent interactive behavior prediction can be improved.
Based on this, in order to improve the accuracy of the window distance and further improve the accuracy of the subsequent interaction behavior prediction, in some embodiments, obtaining the window distance between the first display window and each second display window may specifically include:
acquiring a first projection of an anchor point of the first display window on a unit sphere and a second projection of an anchor point of each second display window on the unit sphere;
The spherical distance between the first projection and each of the second projections is determined as the window distance.
As shown in fig. 2, taking the example that the interactive interface includes four display windows, 200 may represent the sphere of the unit sphere, 201 may represent the first projection, 202 may represent a second projection, and 203 may represent the window distance d_i, where i may take values from 0 to N, and N may represent the number of second display windows. d_0 may be 0, representing the spherical distance between the first projection and itself, and d_1, d_2, d_3 may represent the spherical distances between the first projection and the three second projections, respectively.
By determining the window distance according to the spherical distance between the projections of the anchor points of the display window on the unit sphere, the accuracy of the window distance can be improved, and the accuracy of the subsequent interactive behavior prediction is further improved.
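The window distance computation above can be sketched as follows, assuming anchor positions are given as 3-D points; on a radius-1 sphere the great-circle distance reduces to the angle between the projected points:

```python
import math

def project_to_unit_sphere(point):
    """First/second projection: scale an anchor position onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in point))
    return tuple(c / norm for c in point)

def window_distance(anchor_a, anchor_b):
    """Spherical (great-circle) distance between the projections of two
    display-window anchors on the unit sphere."""
    u, v = project_to_unit_sphere(anchor_a), project_to_unit_sphere(anchor_b)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.acos(dot)

# d_0, the distance from the first projection to itself, is 0.
```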
In addition, after determining the at least one window distance, the activation weight may be calculated as described in the above example, or by a formula over the window distances in which w_i is the activation weight and λ is a settable hyperparameter, reserved for developers to make customized adjustments so as to control the overall weighting. Determining the activation weight through such a formula can improve the scientific soundness of the activation weight, and further improve the accuracy of the subsequent interaction behavior prediction.
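Since the formula itself is not reproduced here, the following is only one plausible form consistent with the surrounding description — a normalized weighting in which λ controls how sharply weight decays with distance, so nearer windows receive larger weights. This is an assumption, not the patent's actual formula:

```python
import math

def activation_weights(distances, lam=1.0):
    """Softmax over -lam * d_i: weights sum to 1, nearer windows weigh more,
    and lam controls how sharply weight decays with distance."""
    exps = [math.exp(-lam * d) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]
```

With λ = 0 every window receives equal weight; larger λ concentrates weight on the first display window (for which d_0 = 0).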
In conclusion, semantic information is determined according to the numerical result and the activation weight, and further interaction behavior prediction is performed based on the semantic information, so that the accuracy of interaction behavior prediction can be improved.
Further, the user's usage preference information for the XR device may include user's usage preference information for a plurality of applications in the XR device. The usage preference information may indicate that the user prefers to use a certain application in the XR device, and a period of time that the user prefers to use a certain application.
Based on this, in order to further improve accuracy of the interaction behavior prediction, in some embodiments, the obtaining the usage preference information of the user for the augmented reality device may specifically include:
Acquiring the use proportion information of a user to a plurality of applications in the augmented reality equipment and the use frequency information of the user to a target application in a plurality of time periods within a second preset time period, wherein the second preset time period comprises the time periods, and the target application is any one of the applications;
the usage preference information is determined based on the usage proportion information and the usage frequency information.
Here, the usage proportion information may indicate a proportion of the usage period of the application by the user to the second preset period. The usage frequency information may represent how frequently the user uses the target application within the target time period. The usage proportion information and the usage frequency information may be regarded as a priori knowledge under long period (i.e. second preset duration) statistics. The second preset time period may be, for example, two weeks to one month. Events such as starting, switching and stopping running of the application can be recorded through an operating system of the augmented reality device, and then the using time information of the application in a second preset duration is tracked. The usage proportion information of the plurality of applications can be determined by performing statistical analysis on the usage time information of the plurality of applications within the second preset time period. The larger the use proportion of the user to the application in the second preset time period, the larger the use requirement of the user to the application can be indicated. The smaller the usage proportion of the user to the application within the second preset time period, the smaller the usage requirement of the user to the application can be indicated.
In addition, considering that the daily activities of most users have a certain regularity, the second preset duration may be divided into a plurality of time periods. The user's application usage habits can then be obtained by counting the frequency with which the user uses an application in each time period. The higher the frequency of use of an application within a certain time period, the more important that application is within that time period.
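A sketch of deriving both statistics from the application start/stop events recorded by the operating system; the event format `(app, start_hour, end_hour)` and the hour-of-day period granularity are illustrative assumptions:

```python
from collections import defaultdict

def usage_statistics(events):
    """From (app, start_hour, end_hour) records over the second preset
    duration, compute each app's usage proportion and per-period launch
    frequency (one period = one hour of the day here)."""
    total = 0.0
    time_per_app = defaultdict(float)
    launches = defaultdict(int)  # (app, period) -> launch count
    for app, start, end in events:
        span = end - start
        total += span
        time_per_app[app] += span
        launches[(app, int(start) % 24)] += 1
    proportion = {app: t / total for app, t in time_per_app.items()}
    return proportion, dict(launches)
```

The proportion reflects the usage requirement, and the per-period counts reflect the usage habit described above.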
Therefore, the use preference information is determined according to the use proportion information and the use frequency information, so that the use requirement and the use habit of the user on the XR equipment can be fully considered in the interactive behavior prediction, and the accuracy of the interactive behavior prediction is further improved.
S130, determining target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information and the use preference information, wherein the target interaction information is associated with the operation intention of the user.
The target interaction information may characterize the user's next interaction behavior. The target interaction information may include shortcut interaction controls and/or interaction result information for performing a next operation. The shortcut interaction control and the interaction result information can be at least one. The shortcut interaction control may be, for example, a "share" control, a "screen capture" control, etc. The interaction result information may be, for example, explanatory information, price list information of the commodity, or the like.
Based on this, in order to improve accuracy of interaction behavior prediction, in some embodiments, determining the target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information may specifically include:
The interactive behavior information, the spatial attribute information, the semantic information and the use preference information are input into a multi-modal large model, and the multi-modal large model is utilized to determine target interactive information.
The multimodal big model may be an artificial intelligence big model capable of handling multimodal information. The multimodal information may include interaction behavior information, spatial attribute information, semantic information, and usage preference information. The multimodal big model may be, for example, a multimodal Transformer model.
In some embodiments, the above-mentioned interactive behavior information, spatial attribute information, semantic information and usage preference information are input into a multi-modal large model, and the target interactive information is determined by using the multi-modal large model, which may specifically include:
Inputting interaction behavior information, spatial attribute information, semantic information and usage preference information into a multi-modal large model, and determining an association relationship between first modal information and second modal information for each piece of first modal information based on a cross-attention mechanism, wherein the first modal information is any one of the interaction behavior information, the spatial attribute information, the semantic information and the usage preference information, and the second modal information is any one of the interaction behavior information, the spatial attribute information, the semantic information and the usage preference information except the first modal information;
Based on a self-attention mechanism, determining an internal dependency relationship of the first modality information based on the association relationship for each first modality information;
And determining target interaction information according to the association relation and the internal dependency relation corresponding to each piece of first modality information.
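The cross-attention and self-attention steps above can be illustrated with a toy scaled dot-product attention over plain lists: with queries taken from one modality and keys/values from another it acts as cross-attention, and with all three from the same modality as self-attention. This is a didactic sketch, not the patent's model:

```python
import math

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over toy feature vectors (lists of floats)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = _softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

When every key scores equally, each output is simply the average of the values — i.e., no position in the other modality is singled out.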
For example, in the case where the first modality information is spatial attribute information and the second modality information is semantic information, a schematic diagram for determining an association relationship between the first modality information and the second modality information based on a cross-attention mechanism using a multi-modality large model may be shown in fig. 3.
For example, in the case that the first modality information is spatial attribute information and the second modality information is interactive behavior information, a schematic diagram for determining an association relationship between the first modality information and the second modality information based on a cross-attention mechanism using a multi-modality large model may be shown in fig. 4.
For example, if the spatial attribute information is denoted as C1, the semantic information is denoted as C2, the interaction behavior information is denoted as C3, and the usage preference information is denoted as C4, a schematic diagram for determining the target interaction information by using the multi-modal large model according to the embodiment of the present application may be shown in fig. 5.
According to the method and the device for predicting the interactive behavior, the accuracy of the interactive behavior prediction can be improved by determining the target interactive information through the multi-mode large model.
And S140, displaying the target interaction information on the interaction interface.
The target interaction information may be displayed in a floating window on the interaction interface. The target interaction information displayed in the floating window may be dynamically adjusted as the user interacts with the XR device.
As an example, when the user reads text, an explanation corresponding to the phrase the user is focusing on can be displayed; when the user browses an e-commerce catalogue, a commodity price list can be displayed; and when the user's viewpoint first moves from the text display area to the video playing area and then focuses on a certain animal in the video content for a period of time, a screen capture control and a text introduction of the animal can be displayed simultaneously.
If the target interaction information is the shortcut interaction control, the user can continue to interact by directly clicking the shortcut interaction control on the interaction interface without finding the interaction control for interaction through layer-by-layer operation, so that the user operation can be simplified. If the target interaction information is interaction result information, the user can directly obtain the interaction result, and the user operation is further simplified.
In the embodiment of the application, because the target interaction information is associated with the operation intention of the user, the interaction result can be displayed directly. Specifically, the interaction behavior information generated in the interaction interface of the augmented reality device is acquired; in the case that the interaction behavior information characterizes that the user's gazing duration on a certain display object in the display window reaches the preset gazing duration, the target interaction information is determined according to the interaction behavior information, the spatial attribute information of the display window in the interaction interface, the semantic information, and the user's usage preference information for the augmented reality device; and the target interaction information is displayed on the interaction interface. That is, the user's next operation intention is predicted in the process of interacting with the augmented reality device, and the target interaction information associated with that operation intention is displayed, or the user can interact with the augmented reality device based on the target interaction information, thereby simplifying the user's interaction operations, improving interaction efficiency, and relieving the user's hand and eye fatigue.
In order to better describe the whole solution, some specific examples are given based on the above embodiments.
First, the multimodal large model may be deployed in an XR device or in the cloud, without limitation. Generally, since the multi-mode large model is large in scale, in order to ensure normal operation of the XR device, the multi-mode large model can be deployed at the cloud, and the scheme in the embodiment of the application is executed by adopting a system architecture of 'end+cloud'. In the case of a multimodal large model deployed in the cloud, a schematic diagram of a system architecture in an embodiment of the present application may be shown in fig. 6.
In addition, as shown in FIG. 7, 700 may represent a user and 701 may represent an interactive interface of the XR device. 702 may represent the movement trajectory of the user's viewpoint (in this example, the viewpoint first moves from the text display area to the video playing area, then gazes at a pet in the video content for a period of time, and finally moves to the browser search box). 703 may represent the prediction of the multimodal big model, which guesses that the user's first intent may be to learn about the breed, habits, etc. of the pet, and thus displays some introductory content within the floating window. 704 may represent the model inferring that the user's second intent is a screenshot, and thus additionally displaying a "screenshot" control. 705 may represent the user's current viewpoint staying within the search box, so the search keyword is predicted, that is, recommended phrases are displayed. 706 and 707 may represent the shortcut interaction controls ranked 2 and 3, corresponding to a "share" control and a "search" control, respectively.
It should be noted that, when predicting in this example, the interaction behavior information may include first sub-interaction behavior information from the current power-on of the XR device until the viewpoint moves into the text display area, and second sub-interaction behavior information from when the viewpoint moves from the text display area to the video playing area, then gazes at the pet in the video content for a period of time, and finally moves to the browser search box.
In this way, the interaction behavior information, semantic information, spatial attribute information, usage preference information, and other information are perceived through the XR device, input into the multimodal big model, and used to determine the target interaction information, which is then dynamically displayed on the interaction interface of the XR device. This helps enrich the information expression of the XR interface in combination with the user's potential real-time demands, improve interaction efficiency, and relieve the user's eye and hand fatigue.
According to the information interaction method provided by the embodiment of the application, the execution subject can be an information interaction device. In the embodiment of the application, a method for executing information interaction by an information interaction device is taken as an example, and the information interaction device provided by the embodiment of the application is described.
Fig. 8 shows a schematic structural diagram of an information interaction device according to an embodiment of the present application. As shown in fig. 8, an information interaction device 800 provided in an embodiment of the present application may include:
The first obtaining module 801 is configured to obtain interaction behavior information generated in an interaction interface of the augmented reality device, where the interaction interface includes at least one display window, the interaction behavior information is interaction process information of the at least one display window, and the interaction behavior information includes gazing duration of each of at least one display object in the display window;
A second obtaining module 802, configured to obtain, when the gaze duration represented by the interaction behavior information reaches a preset gaze duration, spatial attribute information of the display window on the interaction interface, semantic information of the display window, and preference information of the user for use of the augmented reality device, where the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window;
A determining module 803, configured to determine target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information, where the target interaction information is associated with an operation intention of the user;
and the display module 804 is configured to display the target interaction information on the interaction interface.
The information interaction device 800 is described in detail below, and is specifically as follows:
In some embodiments, the first obtaining module 801 may specifically include:
A receiving unit for receiving a first input of a user to a display object in at least one display window;
a first obtaining unit configured to obtain, in response to a first input, a display object sequence corresponding to a plurality of display objects, a gazing duration for gazing at each display object, and a target operation behavior for operating at least one display window;
The first determining unit is used for determining stay time information of a user on each display object according to the gazing time length corresponding to each display object in the first preset time length;
The splitting unit is used for splitting the display object sequence into a plurality of operation fragments according to the target operation behavior;
and the second determining unit is used for determining interaction behavior information according to the display object sequence, the stay time information and the operation fragment.
In some embodiments, the second obtaining module 802 may specifically further include:
the second acquisition unit is used for acquiring type information and state information of the display window, wherein the state information is used for indicating whether the display window is activated or not;
the processing unit is used for carrying out numerical processing on the type information under the condition that the type information is text information to obtain a numerical result corresponding to the type information;
a third acquisition unit, configured to acquire an activation weight corresponding to the state information;
and the third determining unit is used for determining semantic information according to the numerical result and the activation weight.
In some of these embodiments, the display windows include a first display window with state information in an active state and at least one second display window with state information in an inactive state. Based on this, the second acquisition module 802 may specifically further include:
a fourth obtaining unit, configured to obtain a window distance between the first display window and each second display window, to obtain at least one window distance;
And the fourth determining unit is used for determining the corresponding activation weight of each display window according to at least one window distance.
In some embodiments, the fourth obtaining unit may specifically include:
The acquisition subunit is used for acquiring a first projection of an anchor point of the first display window on the unit sphere and a second projection of the second display window on the unit sphere;
A first determination subunit for determining a spherical distance between the first projection and each of the second projections as a window distance.
In some embodiments, the second obtaining module 802 may specifically further include:
a fifth obtaining unit, configured to obtain usage proportion information of a user on a plurality of applications in the augmented reality device in a second preset duration and usage frequency information of the user on a target application in a plurality of time periods, where the second preset duration includes a plurality of time periods, and the target application is any one of the plurality of applications;
and a fifth determining unit for determining the usage preference information based on the usage proportion information and the usage frequency information.
In the embodiment of the application, because the target interaction information is associated with the operation intention of the user, the interaction result can be displayed directly. Specifically, the interaction behavior information generated in the interaction interface of the augmented reality device is acquired; in the case that the interaction behavior information characterizes that the user's gazing duration on a certain display object in the display window reaches the preset gazing duration, the target interaction information is determined according to the interaction behavior information, the spatial attribute information of the display window in the interaction interface, the semantic information, and the user's usage preference information for the augmented reality device; and the target interaction information is displayed on the interaction interface. That is, the user's next operation intention is predicted in the process of interacting with the augmented reality device, and the target interaction information associated with that operation intention is displayed, or the user can interact with the augmented reality device based on the target interaction information, thereby simplifying the user's interaction operations, improving interaction efficiency, and relieving the user's hand and eye fatigue.
The information interaction device in the embodiment of the application may be an electronic device or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), and may also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, etc., which is not specifically limited in the embodiments of the present application.
The information interaction device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an ios operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The information interaction device provided by the embodiment of the present application can implement each process implemented by the embodiments of the methods of fig. 1 to 7, and in order to avoid repetition, a detailed description is omitted here.
Optionally, as shown in fig. 9, the embodiment of the present application further provides an electronic device 900, which includes a processor 901 and a memory 902, where a program or an instruction capable of running on the processor 901 is stored in the memory 902, and the program or the instruction implements each step of the above-mentioned embodiment of the information interaction method when being executed by the processor 901, and the steps can achieve the same technical effect, so that repetition is avoided, and no further description is given here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to, a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1010 by a power management system to perform functions such as managing charge, discharge, and power consumption by the power management system. The electronic device structure shown in fig. 10 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
The processor 1010 is configured to: acquire interaction behavior information generated in an interaction interface of the augmented reality device, where the interaction interface includes at least one display window, the interaction behavior information is interaction process information of the at least one display window, and the interaction behavior information includes a gazing duration for each of at least one display object in the display window; and, in the case that the gazing duration represented by the interaction behavior information reaches a preset gazing duration, acquire spatial attribute information of the display window on the interaction interface, semantic information of the display window, and the user's usage preference information for the augmented reality device, where the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window;
and a display unit 1006, configured to display the target interaction information on the interaction interface.
According to the embodiment of the application, the next operation intention of the user is predicted in the process of interaction between the user and the augmented reality equipment, and the target interaction information related to the operation intention is displayed, so that the interaction result can be directly displayed, or the user can interact with the augmented reality equipment based on the target interaction information, the interaction operation of the user is simplified, the interaction efficiency is improved, and the hand and eye fatigue of the user is relieved.
Optionally, the user input unit 1007 is configured to receive a first input from the user on a display object in the at least one display window;
The processor 1010 is further configured to:
in response to the first input, acquire a display object sequence corresponding to a plurality of display objects, a gaze duration of gazing at each display object, and a target operation behavior of operating the at least one display window;
determine stay duration information of the user on each display object according to the gaze duration corresponding to each display object within a first preset duration, split the display object sequence into a plurality of operation segments according to the target operation behavior, and determine the interaction behavior information according to the display object sequence, the stay duration information, and the operation segments.
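The construction of the interaction behavior information from a gaze log can be sketched as follows. This is a minimal illustration: the record format, the object names, and the rule "a window-level operation closes the current segment" are assumptions of this example, not details fixed by the embodiment.

```python
# Sketch: derive interaction-behavior information (object sequence, stay
# durations, operation segments) from a gaze log. All data shapes here are
# illustrative assumptions.

def build_interaction_info(gaze_log, window_ops):
    """gaze_log: list of (object_id, gaze_seconds) in viewing order.
    window_ops: indices in gaze_log at which a target operation was
    performed on a display window."""
    sequence = [obj for obj, _ in gaze_log]          # display-object sequence
    dwell = {}                                       # per-object stay duration
    for obj, t in gaze_log:
        dwell[obj] = dwell.get(obj, 0.0) + t
    segments, current = [], []
    for i, obj in enumerate(sequence):               # split at each operation
        current.append(obj)
        if i in window_ops:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return {"sequence": sequence, "dwell": dwell, "segments": segments}

info = build_interaction_info(
    [("icon_a", 0.8), ("doc_b", 2.1), ("icon_a", 0.5), ("menu_c", 1.2)],
    window_ops={1},
)
```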
Optionally, the processor 1010 is further configured to:
acquiring type information and state information of the display window, where the state information indicates whether the display window is activated;
performing numerical processing on the type information under the condition that the type information is text information, and obtaining a numerical result corresponding to the type information;
acquiring an activation weight corresponding to the state information; and determining the semantic information according to the numerical result and the activation weight.
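One way to picture the numericization step is below. The window-type vocabulary, the one-hot encoding, and the example weight are all assumptions of this sketch; the embodiment only states that textual type information is numericized and combined with an activation weight.

```python
# Sketch: numericize a window's textual type information and scale it by
# the window's activation weight. Vocabulary and encoding are illustrative
# assumptions.

WINDOW_TYPES = ["browser", "video", "notes", "settings"]  # hypothetical

def semantic_vector(type_text, activation_weight):
    # One-hot encode the type text, then weight by activation state.
    one_hot = [1.0 if t == type_text else 0.0 for t in WINDOW_TYPES]
    return [activation_weight * v for v in one_hot]

vec = semantic_vector("video", activation_weight=0.9)
```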
Optionally, the display windows include a first display window whose state information indicates an active state and at least one second display window whose state information indicates an inactive state. Based on this, the processor 1010 is further configured to:
Acquiring a window distance between the first display window and each second display window to obtain at least one window distance;
and determining the activation weight corresponding to each display window according to the at least one window distance.
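A concrete mapping from window distances to activation weights might look like the following. The softmax-over-negative-distance rule (nearer inactive windows receive larger weights) is an illustrative assumption; the embodiment only states that the weights are determined from the window distances.

```python
import math

# Sketch: turn window distances into normalized activation weights.
# The exp(-distance) scoring is an illustrative choice.

def activation_weights(distances):
    scores = [math.exp(-d) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]

w = activation_weights([0.2, 1.5, 3.0])  # nearer windows weigh more
```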
Optionally, the processor 1010 is further configured to:
acquiring a first projection of an anchor point of the first display window on a unit sphere and a second projection of each second display window on the unit sphere;
The spherical distance between the first projection and each of the second projections is determined as the window distance.
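The projection and spherical-distance computation can be made concrete as follows. The choice of the sphere's center (here the origin, e.g., the user's viewpoint) is an assumption of this sketch; on a unit sphere, the great-circle distance between two projections equals the angle between the corresponding unit vectors.

```python
import math

# Sketch: project window anchor points onto a unit sphere and compute the
# spherical (great-circle) distance between the projections. The sphere
# center is an illustrative assumption.

def project_to_unit_sphere(p, center=(0.0, 0.0, 0.0)):
    v = [p[i] - center[i] for i in range(3)]
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def spherical_distance(u, v):
    # Angle between unit vectors = arc length on the unit sphere.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.acos(dot)

p1 = project_to_unit_sphere((2.0, 0.0, 0.0))
p2 = project_to_unit_sphere((0.0, 3.0, 0.0))
d = spherical_distance(p1, p2)  # orthogonal anchors -> pi / 2
```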
Optionally, the processor 1010 is further configured to:
acquiring usage proportion information of the user for a plurality of applications in the extended reality device within a second preset duration, and usage frequency information of the user for a target application within a plurality of time periods, where the second preset duration includes the plurality of time periods and the target application is any one of the plurality of applications;
the usage preference information is determined based on the usage proportion information and the usage frequency information.
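The two inputs can be combined into a simple preference record as sketched below. The data shapes, the normalization of use times into ratios, and the "peak time slot" summary are assumptions of this example; the embodiment does not prescribe a particular combination rule.

```python
# Sketch: build usage-preference information from per-app usage proportions
# and per-time-slot usage frequency of a target app. Shapes are illustrative.

def usage_preference(app_seconds, target_counts_per_slot):
    """app_seconds: {app: total use time within the second preset duration}.
    target_counts_per_slot: launches of the target app in each time period."""
    total = sum(app_seconds.values())
    ratios = {app: t / total for app, t in app_seconds.items()}
    peak_slot = max(range(len(target_counts_per_slot)),
                    key=lambda i: target_counts_per_slot[i])
    return {"ratios": ratios, "peak_slot": peak_slot}

pref = usage_preference({"browser": 600, "video": 1400},
                        target_counts_per_slot=[1, 5, 2])
```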
By combining the user's potential needs in real time, this embodiment of the application helps enrich the information expression of the XR interface, improve interaction efficiency, and relieve the user's eye and hand fatigue.
It should be appreciated that in embodiments of the present application, the input unit 1004 may include a graphics processor (Graphics Processing Unit, GPU) 10041 and a microphone 10042, where the graphics processor 10041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes at least one of a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 can include two portions, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
The memory 1009 may be used to store software programs as well as various data. The memory 1009 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playing function or an image playing function). Further, the memory 1009 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synch-Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 1009 in the embodiments of the application includes, but is not limited to, these and any other suitable types of memory.
The processor 1010 may include one or more processing units. Optionally, the processor 1010 integrates an application processor and a modem processor, where the application processor mainly handles operations involving the operating system, the user interface, application programs, and the like, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1010.
An embodiment of the application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the processes of the above information interaction method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the application further provides a chip, including a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the processes of the above information interaction method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a chip system, or a system-on-chip.
An embodiment of the present application provides a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the processes of the above information interaction method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in a reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on such an understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (e.g., ROM/RAM, a magnetic disk, or an optical disc), including instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art may make many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (11)

1. An information interaction method, comprising: acquiring interaction behavior information generated in an interaction interface of an extended reality device, wherein the interaction interface comprises at least one display window, the interaction behavior information is interaction process information of the at least one display window, and the interaction behavior information comprises a gaze duration for which each of at least one display object in the display window is gazed at; in a case where the interaction behavior information represents that the gaze duration reaches a preset gaze duration, acquiring spatial attribute information of the display window on the interaction interface, semantic information of the display window, and usage preference information of a user for the extended reality device, wherein the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window; determining target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information, wherein the target interaction information is associated with an operation intention of the user; and displaying the target interaction information on the interaction interface.

2. The method according to claim 1, wherein the acquiring interaction behavior information generated in an interaction interface of an extended reality device comprises: receiving a first input of the user on the display object in the at least one display window; in response to the first input, acquiring a display object sequence corresponding to a plurality of the display objects, a gaze duration of gazing at each of the display objects, and a target operation behavior of operating the at least one display window; determining stay duration information of the user on each of the display objects according to the gaze duration corresponding to each of the display objects within a first preset duration; splitting the display object sequence into a plurality of operation segments according to the target operation behavior; and determining the interaction behavior information according to the display object sequence, the stay duration information, and the operation segments.

3. The method according to claim 1, wherein acquiring the semantic information of the display window comprises: acquiring type information and state information of the display window, the state information being used to indicate whether the display window is activated; in a case where the type information is text information, performing numerical processing on the type information to obtain a numerical result corresponding to the type information; acquiring an activation weight corresponding to the state information; and determining the semantic information according to the numerical result and the activation weight.

4. The method according to claim 3, wherein the display windows comprise a first display window whose state information indicates an active state and at least one second display window whose state information indicates an inactive state, and the acquiring an activation weight corresponding to the state information comprises: acquiring a window distance between the first display window and each of the second display windows, to obtain at least one window distance; and determining the activation weight corresponding to each of the display windows according to the at least one window distance.

5. The method according to claim 4, wherein the acquiring a window distance between the first display window and each of the second display windows comprises: acquiring a first projection of an anchor point of the first display window on a unit sphere and a second projection of each of the second display windows on the unit sphere; and determining a spherical distance between the first projection and each of the second projections as the window distance.

6. The method according to claim 1, wherein acquiring the usage preference information of the user for the extended reality device comprises: acquiring usage proportion information of the user for a plurality of applications in the extended reality device within a second preset duration, and usage frequency information of the user for a target application within a plurality of time periods, wherein the second preset duration comprises the plurality of time periods, and the target application is any one of the plurality of applications; and determining the usage preference information according to the usage proportion information and the usage frequency information.

7. An information interaction apparatus, comprising: a first acquisition module, configured to acquire interaction behavior information generated in an interaction interface of an extended reality device, wherein the interaction interface comprises at least one display window, the interaction behavior information is interaction process information of the at least one display window, and the interaction behavior information comprises a gaze duration for which each of at least one display object in the display window is gazed at; a second acquisition module, configured to: in a case where the interaction behavior information represents that the gaze duration reaches a preset gaze duration, acquire spatial attribute information of the display window on the interaction interface, semantic information of the display window, and usage preference information of a user for the extended reality device, wherein the semantic information is determined according to type information and state information of the display window, and the spatial attribute information is determined according to size information and position information of the display window; a determination module, configured to determine target interaction information according to the interaction behavior information, the spatial attribute information, the semantic information, and the usage preference information, wherein the target interaction information is associated with an operation intention of the user; and a display module, configured to display the target interaction information on the interaction interface.

8. The apparatus according to claim 7, wherein the first acquisition module comprises: a receiving unit, configured to receive a first input of the user on the display object in the at least one display window; a first acquisition unit, configured to, in response to the first input, acquire a display object sequence corresponding to a plurality of the display objects, a gaze duration of gazing at each of the display objects, and a target operation behavior of operating the at least one display window; a first determining unit, configured to determine stay duration information of the user on each of the display objects according to the gaze duration corresponding to each of the display objects within a first preset duration; a splitting unit, configured to split the display object sequence into a plurality of operation segments according to the target operation behavior; and a second determining unit, configured to determine the interaction behavior information according to the display object sequence, the stay duration information, and the operation segments.

9. An electronic device, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the information interaction method according to any one of claims 1 to 6.

10. A readable storage medium, wherein the readable storage medium stores a program or instructions, and the program or instructions, when executed by a processor, implement the steps of the information interaction method according to any one of claims 1 to 6.

11. A computer program product, wherein the computer program product is stored in a storage medium, and the computer program product is executed by at least one processor to implement the steps of the information interaction method according to any one of claims 1 to 6.
CN202411302934.0A 2024-09-18 2024-09-18 Information interaction method, device, electronic device, storage medium and program product Pending CN119166004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411302934.0A CN119166004A (en) 2024-09-18 2024-09-18 Information interaction method, device, electronic device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN119166004A true CN119166004A (en) 2024-12-20

Family

ID=93892279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411302934.0A Pending CN119166004A (en) 2024-09-18 2024-09-18 Information interaction method, device, electronic device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN119166004A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119806334A (en) * 2025-03-12 2025-04-11 湖南华焜数创科技有限公司 Laser projection interactive display method, device and system for improving timeliness


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination