CN119096239A - A visual search method and device - Google Patents
- Publication number
- CN119096239A (application CN202280095518.0A)
- Authority
- CN
- China
- Prior art keywords
- search
- objects
- level
- search results
- round
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A visual search method is provided, comprising: obtaining an image to be searched (601); obtaining a first round of search results based on features of the image to be searched and features of first-level objects in a query recommendation library, where the first round of search results comprises a plurality of standard-reaching first-level objects, the query recommendation library comprises N levels of objects, each object at level N-1 corresponds to a plurality of objects at level N, N is an integer greater than 1, and an object comprises text content and/or image content and/or video content and/or audio content (602); performing late-interaction fusion on the features of the image to be searched and the features of a first target object to obtain a first accumulated search intention feature, where the first target object is the object selected by the user from the plurality of standard-reaching first-level objects (603); and obtaining a second round of search results based on the first accumulated search intention feature, where the second round of search results comprises a plurality of standard-reaching second-level objects corresponding to the first target object.
Description
The present application relates to the technical field of search, and in particular to a visual search method and device.
Visual search is one of the key technologies in the internet field; typical applications include "search by image" and "recognize text in an image". It is a subdivision of search engines, such as Microsoft's Bing search engine, that helps users complete specific search tasks more conveniently through pictures. In the current digital age, in which consumers' attention spans and time keep shrinking, effectively capturing users' practical demands through visual search and improving their consumer experience has become a development consensus among all large e-commerce platforms. On the other hand, Data Bridge surveys show that the market estimate for visual search will grow from $60 to $300, and this rapidly growing market continues to drive the iterative development of visual search technologies.
However, visual search in the prior art has poor flexibility: it can only passively respond to the user's query, cannot help the user identify or refine a search intention that is not yet clear, produces search results of low accuracy, and delivers a poor user experience.
Disclosure of Invention
Embodiments of the present application provide a visual search method and device that, through multiple rounds of interaction, help users describe their search intention efficiently, clearly and completely, guide and refine that intention, actively mine the user's potential points of interest, and improve the effectiveness and flexibility of search.
In a first aspect, the visual search method comprises: obtaining an image to be searched; obtaining a first round of search results based on features of the image to be searched and features of first-level objects in a query recommendation library, where the first round of search results comprises a plurality of standard-reaching first-level objects, the query recommendation library comprises N levels of objects, each object at level N-1 corresponds to a plurality of objects at level N, N is an integer greater than 1, and an object comprises text content and/or image content and/or video content and/or audio content; performing late-interaction fusion on the features of the image to be searched and the features of a first target object to obtain a first accumulated search intention feature, where the first target object is the object selected by the user from the plurality of standard-reaching first-level objects; and obtaining a second round of search results based on the first accumulated search intention feature, where the second round of search results comprises a plurality of standard-reaching second-level objects corresponding to the first target object.
In this possible implementation, multiple rounds of interaction help the user describe the search intention efficiently, clearly and completely, guide and refine the user's search intention, actively mine the user's potential points of interest, and improve the effectiveness and flexibility of search.
In one possible implementation, a first-level object whose similarity with the image to be searched is greater than a preset threshold is determined to be a standard-reaching first-level object.
In another possible implementation, the plurality of standard-reaching first-level objects in the first round of search results are ranked by similarity with the image to be searched, from high to low.
In another possible implementation, a second-level object whose similarity with the first accumulated search intention feature is greater than the preset threshold is determined to be a standard-reaching second-level object.
In another possible implementation, the plurality of standard-reaching second-level objects in the second round of search results are ranked by similarity with the first accumulated search intention feature, from high to low.
In another possible implementation, late-interaction fusion is performed on the M-th accumulated intention feature and the features of an L-th target object to obtain a final search intention, where the L-th target object is the object selected by the user from the plurality of standard-reaching L-th-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; a final search result is obtained based on the final search intention, and the final search result comprises the standard-reaching (L+1)-th-level objects corresponding to the L-th target object.
In one example, the final search intent is also associated with a first text feature, which is a feature of the query text entered by the user.
In another possible implementation, the final search results include card search results and/or expanded search results.
In another possible implementation, the query recommendation library includes information of multiple modalities organized as a tree structure, where nodes of the tree represent the objects and nodes at different levels of the tree represent objects at different levels.
In a second aspect, the present application provides a visual search device comprising:
The acquisition module is used for acquiring the image to be searched;
the accumulated search intention determining module is used for obtaining a first round of search results based on the characteristics of the image to be searched and the characteristics of the object of the first level in the query recommendation library, wherein the first round of search results comprise a plurality of objects of the first level reaching standards;
The query recommendation library comprises N levels of objects, each N-1 level of object corresponds to a plurality of N levels of objects, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content;
The accumulated search intention determining module is further used for performing late-interaction fusion on the features of the image to be searched and the features of a first target object to obtain a first accumulated search intention feature, where the first target object is the object selected by the user from the plurality of standard-reaching first-level objects;
and the search result determining module is used for obtaining a second round of search results based on the first accumulated search intention feature, where the second round of search results comprises a plurality of standard-reaching second-level objects corresponding to the first target object.
In one possible implementation, the object of the first level, which has the similarity with the image to be searched greater than the preset threshold, is determined as the object of the first level reaching the standard.
In another possible implementation, the objects of the plurality of standard-reaching first levels in the first round of search results are ranked according to the similarity with the image to be searched from high to low.
In another possible implementation, the object of the second level having a similarity to the first accumulated search intention feature greater than a preset threshold is determined to be an object of the second level that meets the criterion.
In another possible implementation, the plurality of objects of the second level that qualify in the second round of search results are ranked from high to low in similarity to the first accumulated search intention feature.
In another possible implementation, the search result determining module is further configured to perform late-interaction fusion on the M-th accumulated intention feature and the features of the L-th target object to obtain a final search intention, where the L-th target object is the object selected by the user from the plurality of standard-reaching L-th-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
and to obtain, based on the final search intention, a final search result comprising the standard-reaching (L+1)-th-level objects corresponding to the L-th target object.
In another possible implementation, the final search intent is also associated with a first text feature, which is a feature of the query text entered by the user.
In another possible implementation, the final search results include card search results and/or expanded search results.
In another possible implementation, the query recommendation library includes information of multiple modes, the information of multiple modes is a tree structure, nodes of the tree structure represent the objects, and nodes of different levels of the tree structure represent objects of different levels.
In a third aspect, the present application provides a computing device comprising a memory having executable code stored therein and a processor executing the executable code to implement the method of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method according to the first aspect of the present application.
In a fifth aspect, the present application provides a computer program or computer program product comprising instructions which, when executed, implement the method of the first aspect of the application.
FIG. 1 shows a schematic flow diagram of a visual search;
FIG. 2 is a block diagram of a visual search system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a construction process of a query recommendation library;
FIG. 4 is a schematic diagram of query recommendations for a query recommendation library during a search process;
FIG. 5a is a schematic diagram of a card search result;
FIG. 5b is a schematic diagram of an expanded search result;
FIG. 6 is a flowchart of a visual search method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a visual search device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
In order to better understand the technical solution provided by the embodiments of the present application, some terms thereof are briefly described below.
Query: the content input by the user in the search box.
Semantic space: the space of linguistic meanings. Broadly, every symbol system is a language that conveys meaning, and the meanings it expresses constitute a particular semantic space.
Semantic features: the basic concepts and meaning of content, represented as feature vectors.
Modality: each source or form of information may be referred to as a modality.
Cross-modal retrieval: an information retrieval requirement often involves more than single-modality data about an event; data of other modalities is needed to enrich knowledge of the same thing or event, so retrieval across data of different modalities is required.
Multi-source fusion: integrating various kinds of data, absorbing the characteristics of the different data sources, and extracting from them unified information that is better and richer than any single source.
Vector retrieval: retrieving, from a given vector dataset, the K vectors closest to a query vector under some metric.
Knowledge graph: a structure that represents things, objects and entities and the interconnections between them.
Fig. 1 shows a schematic flow chart of a visual search. As shown in fig. 1, in order to accomplish the technical implementation of the visual search, the visual search includes the following steps:
1) Constructing a database offline: key information is acquired and filtered from structured (such as a commodity library) or unstructured (such as web page) data sources through data mining, thereby building the database offline;
2) Performing semantic feature calculation online on the user's actual query content: semantic features of text are computed with BERT, semantic features of pictures are computed with Swin Transformer or the like, and the resulting features are projected into the same semantic space as the offline database-building corpus;
3) Based on the features of the actual query content, performing matching retrieval and rough recall from the offline base library by feature similarity;
4) Performing fine ranking on the roughly recalled recommendation results and returning the final recommendations.
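The four steps above can be sketched end to end. This is a minimal illustration, not the patented method: `embed` is a hypothetical stand-in for the real encoders (BERT for text, Swin Transformer for pictures), producing deterministic pseudo-features in one shared unit-norm semantic space.

```python
import zlib
import numpy as np

def embed(item, dim=8):
    # Stand-in for the real encoders: a deterministic pseudo-embedding,
    # L2-normalised so all content lives in one shared semantic space.
    rng = np.random.default_rng(zlib.crc32(item.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def visual_search(query_image, database, recall_k=4, top_k=2):
    q = embed(query_image)                        # step 2: online feature calculation
    feats = np.stack([embed(d) for d in database])
    sims = feats @ q                              # unit vectors: dot = cosine
    recalled = np.argsort(-sims)[:recall_k]       # step 3: rough recall by similarity
    ranked = recalled[np.argsort(-sims[recalled])][:top_k]  # step 4: fine ranking
    return [database[i] for i in ranked]
```

In a real deployment the rough-recall stage would use an approximate nearest-neighbor index over the offline base library rather than a brute-force dot product.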
However, in view of multimodal information and the validity of recommendations in real scenarios, the recommendation capability of this visual search method is increasingly stretched thin.
In order to solve the above problems, embodiments of the present application propose several visual search schemes.
The first visual search scheme comprises: constructing a content base, that is, cleaning commodity images, web page images and the like to build a picture base, and extracting labels from structured (such as a commodity library) or unstructured (such as web page) data sources through data mining, then de-duplicating and cleaning them to build a text label base;
calculating content semantic features, that is, computing a content feature vector for each label and each picture in the base through the text tower and the picture tower of a multimodal semantic model;
calculating query semantic features, that is, computing the semantic feature vector of the user's query picture through the picture tower of the multimodal semantic model;
and retrieving content (search image by image, search text by image), that is, calculating the similarity between the query features and the features of the texts and pictures in the content base and returning the retrieved content with higher similarity to the user, finally achieving the effect of visual search.
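The dual-tower matching in the first scheme can be sketched as follows; `text_tower` and `picture_tower` are hypothetical stand-ins for the two branches of the multimodal semantic model, and the pseudo-features are assumptions for illustration only.

```python
import zlib
import numpy as np

def _pseudo_feat(seed, dim=8):
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def text_tower(label):
    # Assumed text branch of the multimodal semantic model.
    return _pseudo_feat(zlib.crc32(b"text:" + label.encode()))

def picture_tower(image_id):
    # Assumed picture branch; both towers project into the same
    # unit-norm semantic space so cross-modal similarity is a dot product.
    return _pseudo_feat(zlib.crc32(b"img:" + image_id.encode()))

def search_content(query_image, labels, pictures, top_k=3):
    q = picture_tower(query_image)
    pool = [(l, float(text_tower(l) @ q)) for l in labels] + \
           [(p, float(picture_tower(p) @ q)) for p in pictures]
    pool.sort(key=lambda item: -item[1])          # higher similarity first
    return pool[:top_k]
```

Because both towers share one space, a single picture query scores against text labels (search text by image) and base pictures (search image by image) in one pass.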
In the first visual search scheme, based on multimodal semantic matching, computing image and text modality features makes it possible to judge the general correlation between the query picture and the base content fairly accurately, and thus to achieve the visual search goal. However, disadvantages such as poor search flexibility and passive response remain.
Poor search flexibility: only a single round of picture queries is supported, so the user cannot fully describe a complex search intent, e.g. "how is this (in the picture) repaired?". The input intention information is limited, and by computing general semantic relevance the system can only complete simple search requests such as finding similar items or recognizing objects.
Passive response: the scheme can only passively respond to the user's query; it cannot help the user identify and refine a search intention that is not yet clear, nor actively stimulate the user's search interest, which restricts search traffic and search duration.
The second visual search scheme fills in and expands the user's search content and offers items the search engine considers related to the topic, helping the user obtain recommendation information faster and better, and is intended for handling complex multimodal searches. Analyzed from a technical point of view, its innovation in the visual search experience is to let the user additionally input a text query on top of the basic visual search and initiate an image-text fusion search, which significantly improves search flexibility and can support search intentions that some traditional search technologies cannot complete.
The second visual search scheme has the following problems:
Passive response: it can only passively respond to the user's query and cannot help the user identify a search intention that is not yet clear; likewise, the scheme cannot actively stimulate the user's search interest, which restricts search traffic and search duration.
Complex operation: on the one hand, the user must perform query input twice (uploading a picture and entering text) to complete a complex intention search, which contradicts the visual search user's expectation of getting results in one shot; on the other hand, whenever the user is dissatisfied with the search results, the query must be entered again each time.
The third visual search scheme includes:
(1) Calculating a feature vector representation of the user input content using the representation model;
(2) Performing similarity retrieval in an offline-constructed base according to the feature vector of the input content, and returning ranked results;
(3) Combining the user input content and the returned results, giving the user further refined search recommendation items;
(4) After the user interacts with the refined search recommendation items, computing the feature vector of the content again, taking it as an anchor, and performing a second retrieval within the recommendation results returned in the previous round;
(5) Repeating this dialogue interaction flow until the user is guided to an effective recommendation for the search intention.
In this technical scheme, the user continuously inputs search content (text and pictures), the input content is represented in a vector space, and the search scope is continuously refined on the basis of image-text fusion, in the hope of realizing interactive complex queries. The scheme improves the ability to process complex queries through dialogue progression to some extent.
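Steps (1) to (5) of the third scheme can be sketched as an iterative loop; the feature function and parameter names are hypothetical. Each refinement round re-anchors on the user's new input and retrieves only within the previous round's returned results.

```python
import zlib
import numpy as np

def feat(x, dim=8):
    # Hypothetical representation model: deterministic unit-norm pseudo-features.
    v = np.random.default_rng(zlib.crc32(str(x).encode())).normal(size=dim)
    return v / np.linalg.norm(v)

def refine_search(initial_query, corpus, user_inputs, shortlist_k=4):
    # (1) represent the query as a vector; (2) retrieve a ranked shortlist;
    # (3)-(4) on each user refinement, recompute the anchor and re-retrieve
    # within the previous round's returns; (5) repeat until done.
    anchor = feat(initial_query)
    candidates = list(corpus)
    for refinement in user_inputs:
        ranked = sorted(candidates, key=lambda c: -float(feat(c) @ anchor))
        candidates = ranked[:shortlist_k]        # previous round's returns
        anchor = feat(refinement)                # new anchor from user input
    return sorted(candidates, key=lambda c: -float(feat(c) @ anchor))
```

The shrinking candidate pool is what the scheme calls "continuously refining the search range"; it is also why the user must keep supplying input, the operational burden criticized below.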
The third visual search scheme adds multi-round interaction capability based on multimodal dialogue search on top of image-text fusion and improves the efficiency of complex queries, but it still has the following shortcomings:
(1) Passive response: it only passively responds to the user's query and cannot help the user identify a search intention that is not yet clear;
(2) Complex operation: the user is required to continuously refine the query requirement, which depends heavily on the user's input content and degrades the user experience;
(3) Great technical difficulty: the related supporting technologies exist mainly in academia, remain far from industrial deployment, and are difficult to commercialize.
To solve the problems of the above schemes and the prior art, an embodiment of the present application provides a visual search method based on a multi-level query recommendation library: it fully mines structured and unstructured multi-source data, automatically constructs a multi-level tree-shaped query recommendation library, and continuously updates query recommendations along the two dimensions of breadth and depth in combination with user behavior (input pictures, clicks and the like), thereby helping users describe their search intention efficiently, clearly and completely, guiding and refining that intention, actively mining the user's potential points of interest, and improving the effectiveness and flexibility of search.
Fig. 2 is a schematic diagram of a visual search system according to an embodiment of the present application. The system mainly comprises an offline module assembly and an online module assembly.
The offline module component comprises a query recommendation library construction module, a multi-mode content library, a multi-level query recommendation library and a content base construction module.
The online module assembly comprises a man-machine interaction module, a multi-mode information understanding module, a multi-element information fusion module and a semantic vector retrieval module.
The query recommendation library construction module in the offline module component is used for constructing tree stubs from structured data such as knowledge graphs and multi-level labels, and for mining high-frequency words from web pages, logs and the like to expand the set of root nodes. It expands relation nodes in depth and breadth, de-duplicates nodes based on tools such as a synonym dictionary and a language model, and merges the sub-trees or leaf nodes mounted under duplicate nodes.
The multimodal information understanding module in the online module component is used for computing similarity between the query content features and the query recommendations to produce recommendations, and for integrating and refining the user's query intention in combination with the user's online behavior, adjusting the recommendations accordingly.
The multi-element information fusion module in the online module component is used for late-interaction fusion modeling based on the accumulated intention features and the query recommendation text features of the next-layer nodes, retrieving each modality of information in the content base and returning the Top-1 result as the query recommendation detail content of the node; for further expanded search, it performs expanded retrieval based on the accumulated intention features (optionally fused with text features additionally input by the user) and returns more content information.
The visual search system is a visual search system for multi-level conversational query recommendation. Its main function is to accumulate user behavior through interaction, based on the query picture input by the user and click/swipe interactions with the recommended results, so as to guide and refine the user's search intention, actively mine the user's potential points of interest, improve the effectiveness and flexibility of search, and return effective search results to the user. The main task of the offline module component is to extract structural information, such as the upper and lower levels of query recommendations, from multi-source data and construct a multi-level (tree-shaped) query recommendation library. The main task of the online module component is to guide and refine the user's search intention through conversational cross-modal query recommendation and multi-element content fusion, interacting continuously until the search is completed.
The implementation steps are as follows:
S1: the system is deployed on specific hardware servers; the offline module component and the online module component may be deployed on the same hardware server or deployed separately.
S2: in the offline stage, structural information such as the upper and lower levels of query recommendations is extracted from multi-source data to build a multi-level (tree-shaped) query recommendation library, completing its offline construction. Logical tree stubs are computed from structures such as the data's multi-level category labels, and semantically similar nodes are expanded in depth (child nodes expanded from inclusion relations, with synonym judgment) and in breadth (sibling nodes expanded from crossing relations, with synonym judgment, and from co-occurrence in the same picture, added after checking the relation with the parent node). Meanwhile, nodes are de-duplicated based on tools such as a synonym dictionary and a language model, and all sub-trees or leaf nodes mounted under duplicate nodes are merged, as shown in fig. 3.
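The duplicate-node merging in the offline stage can be sketched on a toy tree; the `Node` class and the synonym table mapping aliases to canonical names are illustrative assumptions, not the patent's data structures.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = {}          # child name -> Node

def add_path(root, path):
    # Insert one multi-level label path mined from a structured source.
    node = root
    for name in path:
        node = node.children.setdefault(name, Node(name))
    return node

def merge_duplicates(root, canon):
    # canon maps an alias to its canonical name (a stand-in for the
    # synonym dictionary / language-model judgment). Duplicate siblings
    # are merged and the sub-trees mounted under them are combined.
    merged = {}
    for name, child in root.children.items():
        key = canon.get(name, name)
        child.name = key
        if key in merged:
            merged[key].children.update(child.children)   # remount sub-trees
        else:
            merged[key] = child
    root.children = merged
    for child in merged.values():
        merge_duplicates(child, canon)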
S3: in the online recommendation stage, user interaction is guided based on the query picture and the multi-level query recommendation library, and the accumulated intention features are continuously refined until the search is completed (rollback is supported during the process). First, for the first-layer nodes (i.e., the branch nodes closest to the root), first-layer recommendations are selected from all nodes according to the similarity between the query picture features and the query recommendation features; for subsequent layers, the child nodes of the node selected by the user are returned as the next layer's candidates. Late-interaction fusion modeling is then performed on the query picture features and the text features of the query recommendation selected by the user to refine the accumulated intention features. Finally, the next layer of nodes is pruned and reordered online based on the similarity between the current accumulated intention features and the child nodes' query recommendation text features (see fig. 4).
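The online pruning and reordering at the end of S3 can be sketched as follows; the text-feature function is a hypothetical pseudo-embedding and the threshold is an assumed parameter.

```python
import zlib
import numpy as np

def text_feat(label, dim=8):
    # Hypothetical text encoder: deterministic unit-norm pseudo-features.
    v = np.random.default_rng(zlib.crc32(label.encode())).normal(size=dim)
    return v / np.linalg.norm(v)

def prune_and_reorder(accumulated_intent, child_labels, threshold=0.0):
    # Keep only the child query recommendations whose text-feature similarity
    # to the current accumulated intent clears the threshold, ranked high to low.
    scored = [(l, float(text_feat(l) @ accumulated_intent)) for l in child_labels]
    kept = [(l, s) for l, s in scored if s > threshold]
    return sorted(kept, key=lambda p: -p[1])
```

Pruning against the accumulated intent (rather than the raw query picture) is what lets each round's recommendations track everything the user has expressed so far.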
S4: multi-element fusion is performed based on the accumulated intention features, the next-layer query recommendation features and any text features additionally input by the user to retrieve each modality of information in the content base, finally providing card search results (see fig. 5a) and/or expanded search results (see fig. 5b).
Fig. 6 is a flowchart of a visual search method according to an embodiment of the present application. The visual search method can be implemented by the visual search system shown in fig. 2, and as shown in fig. 6, the visual search method provided by the embodiment of the application includes steps S601 to S604.
In step S601, an image to be searched is acquired.
The image to be searched may be received by a search terminal as user input and uploaded to a server. The search terminal (such as a smartphone) may capture the image to be searched through a camera device (such as a phone camera), or may retrieve an image directly from local storage as the image to be searched.
In step S602, a first round of search results is obtained based on the features of the image to be searched and the features of the first level object in the query recommendation library, where the first round of search results includes a plurality of objects of the first level that reach standards.
Semantic features of the image to be searched are extracted, for example through a Swin Transformer model, and mapped into the same semantic space as the query recommendation library to obtain the semantic feature vector of the image to be searched. The similarity between this semantic feature vector and the features of each first-level object in the query recommendation library is calculated; first-level objects whose similarity is greater than a preset threshold (for example, 0.8) are determined to be standard-reaching objects, and a plurality of these standard-reaching first-level objects are taken as the first round of search results.
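The threshold filter and ranking of step S602 can be sketched with hand-made 2-D features; the 0.8 threshold follows the text, while the object names and vectors are illustrative.

```python
import numpy as np

def first_round(query_feat, level1_objects, threshold=0.8):
    # Keep the first-level objects whose cosine similarity with the image
    # to be searched exceeds the preset threshold, ranked from high to low.
    q = np.asarray(query_feat, dtype=float)
    q = q / np.linalg.norm(q)
    hits = []
    for name, feat in level1_objects.items():
        f = np.asarray(feat, dtype=float)
        sim = float(q @ (f / np.linalg.norm(f)))
        if sim > threshold:
            hits.append((name, sim))
    return sorted(hits, key=lambda p: -p[1])   # high -> low similarity
```

With a query feature of `[1, 0]`, an object near the query direction passes the 0.8 cut while an orthogonal-leaning one is dropped.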
The query recommendation library comprises N levels of objects, each object at level N-1 corresponds to a plurality of objects at level N, N is an integer greater than 1, and an object comprises text content and/or image content and/or video content and/or audio content; that is, the information in the query recommendation library is multimodal information comprising text content information, image content information, audio content information and the like.
Optionally, the multimodal information in the query recommendation library is organized as a tree structure, where nodes of the tree represent objects and nodes at different levels of the tree represent objects at different levels. For the construction of the query recommendation library and the specific form of the tree structure, refer to the foregoing description of the query recommendation library; for brevity, details are not repeated here.
By providing the query recommendation library, structured and unstructured multi-source data are fully mined, a multi-level tree-shaped query recommendation library is constructed automatically, and the potential paths of user information retrieval and exploration are structurally organized, providing effective, low-cost and extensible data support for implementing the visual search method of this embodiment.
In one example, the plurality of standard-reaching first-level objects in the first round of search results are ranked by similarity with the image to be searched, from high to low. For example, as shown in fig. 4, by calculating the similarity between the image to be searched and each first-level object, the standard-reaching first-level objects are ranked from high to low similarity as bicycle accessories, screws, ..., steel wires; this ranking is the first round of search results. Search results are displayed to the user ordered by similarity with the picture to be searched: the higher the similarity, the closer the result is to the user's initial search intention and the more likely it is the content the user wants, so it is ranked nearer the front, allowing the user to find the desired content faster.
In step S603, late interaction fusion is performed on the features of the image to be searched and the features of the first target object to obtain a first accumulated search intention feature, where the first target object is an object selected by the user from the plurality of first-level objects that meet the standard.
After the feature vector of the image to be searched and the feature vector of the first target object are obtained, the two feature vectors are fused by weighting to obtain the first accumulated search intention feature. The weights of the feature vector of the image to be searched and of the feature vector of the first target object may be determined in various ways, for example, by a system default or by a user setting.
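A minimal sketch of this weighted fusion (the equal default weights and the re-normalization of the fused vector are illustrative assumptions, not specified by the embodiment):

```python
import numpy as np

def fuse_intention(image_feat, target_feat, w_image=0.5, w_target=0.5):
    """Weighted fusion of the image feature vector and the selected target
    object's feature vector into an accumulated search intention feature.
    The weights may come from a system default or a user setting."""
    fused = (w_image * np.asarray(image_feat, dtype=float)
             + w_target * np.asarray(target_feat, dtype=float))
    return fused / np.linalg.norm(fused)   # unit-normalise (assumption)

intent = fuse_intention([1.0, 0.0], [0.0, 1.0])
```

Normalising the fused vector keeps later similarity comparisons on the same scale from round to round.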
The first target object is a first-level object selected by the user, for example by clicking on a first-level object in the first round of search results on the screen, such as the bicycle accessories in FIG. 4.
The features of the image to be searched capture the user's initial — and possibly incomplete — search intention, while the features of the first target object selected by the user (which may be text features) reflect the user's further search intention. Combining the features of the image to be searched (the initial search intention) with the features of the first target object (the further search intention) yields a more complete accumulated search intention.
In step S604, a second round of search results is obtained based on the first accumulated search intention feature, where the second round of search results includes a plurality of second-level objects that meet the standard corresponding to the first target object.
The similarity between the first accumulated intention feature and the feature of each second-level object corresponding to the first target object is calculated; a second-level object whose similarity is greater than a preset threshold (for example, 0.8) is determined to be a second-level object that meets the standard, and these objects form the second round of search results.
Optionally, the plurality of second-level objects that meet the standard in the second round of search results are ranked from high to low according to their similarity to the first accumulated intention feature. For example, as shown in FIG. 4, after the similarity between the first accumulated intention feature and each second-level object is calculated, the second-level objects whose similarity meets the standard are ranked from high to low — transmission commodity, transmission installation course, transmission repair course — as the second round of search results, which are displayed to the user in that order. The higher the similarity, the closer the object is to the user's first accumulated intention and the more likely it is the content the user wants to search, so it is ranked earlier, making it easier for the user to find the desired content.
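The second-round filtering and ranking can be sketched as follows. The cosine measure is an assumption; the 0.8 threshold and candidate names mirror the example above:

```python
import numpy as np

def second_round(intent_feat, candidates, threshold=0.8):
    """Filter the second-level objects under the selected first-level object
    by similarity to the accumulated intention feature (> threshold), then
    rank the survivors high to low. `candidates` maps name -> feature."""
    q = np.asarray(intent_feat, float)
    q = q / np.linalg.norm(q)
    scored = []
    for name, feat in candidates.items():
        f = np.asarray(feat, float)
        sim = float(f @ q / np.linalg.norm(f))
        if sim > threshold:                 # "meets the standard"
            scored.append((name, sim))
    scored.sort(key=lambda x: -x[1])        # high-to-low for display
    return scored

results = second_round(
    [1.0, 1.0],
    {"transmission commodity": [1.0, 0.9],
     "transmission installation course": [1.0, 0.7],
     "transmission repair course": [0.0, 1.0]})
```

In this toy run the third candidate falls below the bar and is pruned, while the other two are returned in display order.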
If the second round of search results contains the content the user wants, the user can open it, for example by double-clicking; the desired content has then been obtained and the search ends. If the user is still unsatisfied with the second round of search results (no content matches the user's search intention), the interaction continues with the next round of search, and the user's search intention is refined continuously until content matching the search intention is found.
The method further comprises: performing late interaction fusion on the M-th accumulated intention feature and the feature of the L-th target object to obtain a final search intention, where the L-th target object is an object selected by the user from the plurality of L-th-level objects that meet the standard, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; and obtaining a final search result based on the final search intention, where the final search result comprises the (L+1)-th-level objects that meet the standard corresponding to the L-th target object.
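The round-by-round refinement can be sketched as a loop. This is a toy model: `select` stands in for the user's click, and the features, fusion rule, and 0.5 threshold are illustrative assumptions:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    feature: np.ndarray
    children: list = field(default_factory=list)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def interactive_search(query_feat, root, select, threshold=0.5):
    """Each round offers the children of the previously chosen node,
    pruned and ranked by similarity to the accumulated intention; the
    user's pick is fused back into the intention until a leaf is reached."""
    intent = np.asarray(query_feat, float)
    node = root
    while node.children:
        cands = [c for c in node.children if cos(intent, c.feature) > threshold]
        cands.sort(key=lambda c: cos(intent, c.feature), reverse=True)
        node = select(cands)                         # the user's click
        fused = 0.5 * intent + 0.5 * node.feature    # late interaction fusion
        intent = fused / np.linalg.norm(fused)
    return node

leaf1 = Node("transmission commodity", np.array([1.0, 0.2]))
leaf2 = Node("transmission repair course", np.array([0.2, 1.0]))
mid = Node("bicycle accessories", np.array([1.0, 0.4]), [leaf1, leaf2])
root = Node("root", np.array([0.0, 0.0]), [mid])
found = interactive_search(np.array([1.0, 0.3]), root, lambda c: c[0])
```

Because the intention feature is re-fused after every click, each round's pruning reflects everything the user has expressed so far, not just the latest selection.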
Optionally, rollback may be supported in each search round; for example, after receiving a rollback command from the user, the method returns to the search page of the previous round so that the user can reselect the target object and reshape the search intention.
Optionally, the final search results include card search results (as shown in FIG. 5a) and/or expanded search results (as shown in FIG. 5b).
In the visual search method provided by the embodiments of this application, continuous "click" interaction with visual information replaces the user's original text description. On one hand this reduces the complexity of user operation; on the other hand it guides the user to refine the search intention with visual information that carries more "information". Meanwhile, during the multi-modal information interaction and accumulation process, the retrieval recommendation results are continuously adjusted and optimized, improving retrieval effectiveness.
In another implementation, the final search intention is also associated with a first text feature, which is a feature of the query text entered by the user. That is, in each search round, the user may also input query text to further express the search intention, shortening the number of search rounds and retrieving the corresponding content faster or more accurately.
For example, in the search results of the N-th search round, the user inputs query text; the features of the query text are extracted, and late interaction fusion is performed on the accumulated search intention feature of the round, the feature of the target object, and the feature of the query text. A search is then performed in the content base or the query recommendation library according to the fused feature (by calculating the similarity with each object) to obtain the final search result, and the object with the highest similarity (Top-1) in the final search result is recommended.
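This three-way fusion followed by Top-1 retrieval can be sketched as follows (the weights and the cosine measure are illustrative assumptions):

```python
import numpy as np

def fuse_and_retrieve(intent_feat, target_feat, text_feat, content_feats,
                      weights=(0.4, 0.3, 0.3)):
    """Fuse the round's accumulated intention feature, the selected object's
    feature, and the query-text feature, then retrieve the Top-1 object from
    the content base by similarity to the fused feature."""
    w1, w2, w3 = weights
    fused = (w1 * np.asarray(intent_feat, float)
             + w2 * np.asarray(target_feat, float)
             + w3 * np.asarray(text_feat, float))
    fused = fused / np.linalg.norm(fused)
    base = np.asarray(content_feats, float)
    base = base / np.linalg.norm(base, axis=1, keepdims=True)
    sims = base @ fused
    return int(np.argmax(sims))            # index of the Top-1 object

top1 = fuse_and_retrieve([1.0, 0.0], [0.8, 0.2], [0.9, 0.1],
                         [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])
```

The text feature only sharpens the fused query; when the user types nothing, the same retrieval runs on the two-way fusion of intention and target features.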
The visual search method provided by the embodiments of this application guides user interaction based on the query picture and the multi-level query recommendation library, and continuously refines the accumulated intention feature until the search is completed (rollback is supported throughout). First, the first-layer recommendations are selected from all nodes based on the similarity between the query picture features and the query recommendation features; in subsequent layers, the child nodes of the node selected by the user are returned as the next layer's candidates. Late interaction fusion modeling is performed on the query picture features and the query recommendation text features selected by the user to refine the accumulated intention feature. The next layer of nodes is pruned and reordered online based on the similarity between the current accumulated intention feature and the child nodes' query recommendation text features. Multi-element fusion is performed on the accumulated intention feature, the next-layer query recommendation feature, and additional text features input by the user to retrieve modal information from the content base. Late interaction fusion modeling is performed on the accumulated intention feature and the next-layer node's query recommendation text features to retrieve each modality of the content base, and the Top-1 result is returned as the query recommendation detail content of that node. Expanded search is performed based on the accumulated intention feature (optionally further fused with additional text features input by the user) to return more content information.
The visual search method provided by the embodiments of this application can be applied to cloud applications as well as to terminal-side services, such as search recommendation in a mobile phone photo album or gallery; it can further connect terminal-side multi-modal information such as videos, pictures, and short messages, realizing joint interactive search recommendation over terminal-side modal information.
Based on the same concept as the foregoing embodiments of the visual search method, an embodiment of the present application further provides a visual search apparatus 700, where the visual search apparatus 700 includes units or modules for implementing the steps of the visual search method shown in FIGS. 1-6.
Fig. 7 is a schematic structural diagram of a visual search apparatus according to an embodiment of the present application. The apparatus is applied to a computing device. As shown in FIG. 7, the visual search apparatus 700 includes at least:
An acquisition module 701, configured to acquire an image to be searched;
The accumulated search intention determining module 702 is configured to obtain a first round of search results based on the features of the image to be searched and the features of the object at the first level in the query recommendation library, where the first round of search results includes a plurality of objects at the first level that reach the standard;
The query recommendation library comprises N levels of objects, each N-1 level of object corresponds to a plurality of N levels of objects, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content;
and further configured to perform late interaction fusion on the features of the image to be searched and the features of a first target object to obtain a first accumulated search intention feature, where the first target object is an object selected by the user from the plurality of first-level objects that meet the standard;
The search result determining module 703 is configured to obtain a second round of search results based on the first accumulated search intention feature, where the second round of search results includes a plurality of second-level objects that meet the standard corresponding to the first target object.
In one possible implementation, a first-level object whose similarity with the image to be searched is greater than a preset threshold is determined to be a first-level object that meets the standard.
In another possible implementation, the plurality of first-level objects that meet the standard in the first round of search results are ranked according to their similarity with the image to be searched, from high to low.
In another possible implementation, a second-level object whose similarity with the first accumulated search intention feature is greater than a preset threshold is determined to be a second-level object that meets the standard.
In another possible implementation, the plurality of second-level objects that meet the standard in the second round of search results are ranked according to their similarity with the first accumulated search intention feature, from high to low.
In another possible implementation, the search result determining module 703 is further configured to perform late interaction fusion on the M-th accumulated intention feature and the feature of the L-th target object to obtain a final search intention, where the L-th target object is an object selected by the user from the plurality of L-th-level objects that meet the standard, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; and to obtain a final search result based on the final search intention, where the final search result comprises the (L+1)-th-level objects that meet the standard corresponding to the L-th target object.
In another possible implementation, the final search intent is also associated with a first text feature, which is a feature of the query text entered by the user.
In another possible implementation, the final search results include card search results and/or expanded search results.
In another possible implementation, the query recommendation library includes information of multiple modes, the information of multiple modes is a tree structure, nodes of the tree structure represent the objects, and nodes of different levels of the tree structure represent objects of different levels.
The visual search device 700 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each module in one visual search device 700 are respectively for implementing the corresponding flow of each method in fig. 1 to 6, and are not described herein for brevity.
Embodiments of the present application further provide a computing device comprising at least one processor, a memory, and a communication interface, where the processor is configured to perform the methods described in FIGS. 1-6.
Fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
As shown in fig. 8, the computing device 800 includes at least one processor 801, memory 802, and a communication interface 803. The processor 801, the memory 802, and the communication interface 803 are communicatively connected to each other by a wired (e.g., bus) system or by a wireless system. The communication interface 803 is configured to receive data sent by other devices, and the memory 802 stores computer instructions that are executed by the processor 801 to perform the visual search method of the foregoing method embodiment.
It should be appreciated that in embodiments of the present application, the processor 801 may be a central processing unit (CPU); the processor 801 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 802 may include read only memory and random access memory, and provides instructions and data to the processor 801. Memory 802 may also include nonvolatile random access memory.
The memory 802 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
It should be understood that the computing device 800 according to the embodiment of the present application may perform the method shown in fig. 1 to 6 in implementing the embodiment of the present application, and the detailed description of the implementation of the method is referred to above, which is not repeated herein for brevity.
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the above-mentioned visual search method to be implemented.
Embodiments of the present application provide a chip comprising at least one processor and an interface; the at least one processor obtains program instructions or data through the interface, and the at least one processor is configured to execute the program instructions to implement the visual search method mentioned above.
Embodiments of the present application provide a computer program or computer program product comprising instructions which, when executed, cause a computer to perform the above-mentioned visual search method.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Those of ordinary skill in the art may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing descriptions are merely specific embodiments of the present application and are not intended to limit its protection scope; any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.
Claims (20)
1. A visual search method, characterized by comprising: acquiring an image to be searched; obtaining a first round of search results based on features of the image to be searched and features of first-level objects in a query recommendation library, wherein the first round of search results comprises a plurality of first-level objects that meet the standard; the query recommendation library comprises N levels of objects, each object at level N-1 corresponds to a plurality of objects at level N, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content; performing late interaction fusion on the features of the image to be searched and features of a first target object to obtain a first accumulated search intention feature, wherein the first target object is an object selected by a user from the plurality of first-level objects that meet the standard; and obtaining a second round of search results based on the first accumulated search intention feature, wherein the second round of search results comprises a plurality of second-level objects that meet the standard corresponding to the first target object.
2. The method according to claim 1, wherein a first-level object whose similarity with the image to be searched is greater than a preset threshold is determined to be a first-level object that meets the standard.
3. The method according to claim 1 or 2, wherein the plurality of first-level objects that meet the standard in the first round of search results are ranked according to their similarity with the image to be searched, from high to low.
4. The method according to any one of claims 1-3, wherein a second-level object whose similarity with the first accumulated search intention feature is greater than a preset threshold is determined to be a second-level object that meets the standard.
5. The method according to any one of claims 1-4, wherein the plurality of second-level objects that meet the standard in the second round of search results are ranked according to their similarity with the first accumulated search intention feature, from high to low.
6. The method according to any one of claims 1-5, characterized by further comprising: performing late interaction fusion on the M-th accumulated intention feature and features of the L-th target object to obtain a final search intention, wherein the L-th target object is an object selected by the user from the plurality of L-th-level objects that meet the standard, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; and obtaining a final search result based on the final search intention, wherein the final search result comprises (L+1)-th-level objects that meet the standard corresponding to the L-th target object.
7. The method according to claim 6, wherein the final search intention is further associated with a first text feature, the first text feature being a feature of query text entered by the user.
8. The method according to claim 6 or 7, wherein the final search results comprise card search results and/or expanded search results.
9. The method according to any one of claims 1-7, wherein the query recommendation library comprises information of a plurality of modalities, the information of the plurality of modalities is organized as a tree structure, nodes of the tree structure represent the objects, and nodes at different levels of the tree structure represent objects of different levels.
10. A visual search apparatus, characterized by comprising: an acquisition module, configured to acquire an image to be searched; an accumulated search intention determining module, configured to obtain a first round of search results based on features of the image to be searched and features of first-level objects in a query recommendation library, wherein the first round of search results comprises a plurality of first-level objects that meet the standard, the query recommendation library comprises N levels of objects, each object at level N-1 corresponds to a plurality of objects at level N, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content, and further configured to perform late interaction fusion on the features of the image to be searched and features of a first target object to obtain a first accumulated search intention feature, wherein the first target object is an object selected by a user from the plurality of first-level objects that meet the standard; and a search result determining module, configured to obtain a second round of search results based on the first accumulated search intention feature, wherein the second round of search results comprises a plurality of second-level objects that meet the standard corresponding to the first target object.
11. The apparatus according to claim 10, wherein a first-level object whose similarity with the image to be searched is greater than a preset threshold is determined to be a first-level object that meets the standard.
12. The apparatus according to claim 10 or 11, wherein the plurality of first-level objects that meet the standard in the first round of search results are ranked according to their similarity with the image to be searched, from high to low.
13. The apparatus according to any one of claims 10-12, wherein a second-level object whose similarity with the first accumulated search intention feature is greater than a preset threshold is determined to be a second-level object that meets the standard.
14. The apparatus according to any one of claims 10-13, wherein the plurality of second-level objects that meet the standard in the second round of search results are ranked according to their similarity with the first accumulated search intention feature, from high to low.
15. The apparatus according to any one of claims 10-14, wherein the search result determining module is further configured to: perform late interaction fusion on the M-th accumulated intention feature and the feature of the L-th target object to obtain a final search intention, wherein the L-th target object is an object selected by the user from the plurality of L-th-level objects that meet the standard, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; and obtain a final search result based on the final search intention, wherein the final search result comprises (L+1)-th-level objects that meet the standard corresponding to the L-th target object.
16. The apparatus according to claim 15, wherein the final search intention is further associated with a first text feature, the first text feature being a feature of query text entered by the user.
17. The apparatus according to claim 15 or 16, wherein the final search results comprise card search results and/or expanded search results.
18. The apparatus according to any one of claims 10-17, wherein the query recommendation library comprises information of a plurality of modalities, the information of the plurality of modalities is organized as a tree structure, nodes of the tree structure represent the objects, and nodes at different levels of the tree structure represent objects of different levels.
19. A computing device, comprising a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method of any one of claims 1-9.
20. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-9.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/095061 (WO2023225919A1) | 2022-05-25 | 2022-05-25 | Visual search method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119096239A (en) | 2024-12-06 |
Also Published As
| Publication Number | Publication Date |
|---|---|
| WO2023225919A1 (en) | 2023-11-30 |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |