CN121301436A

CN121301436A - Information processing methods, devices, equipment and media

Info

Publication number: CN121301436A
Application number: CN202511484666.3A
Authority: CN
Inventors: 邱童
Original assignee: Baidu China Co Ltd
Current assignee: Baidu China Co Ltd
Priority date: 2025-10-16
Filing date: 2025-10-16
Publication date: 2026-01-09

Abstract

This disclosure provides an information processing method, apparatus, device, and medium, relating to the field of data processing technology, particularly to retrieval systems and other related technologies, and capable of being used in application scenarios such as generative retrieval, intelligent document editing, intelligent assistants, virtual assistants, and intelligent e-commerce. The method includes: extracting fields from documents to be imported based on multiple preset attribute categories to obtain a first attribute value for a first attribute category; updating the index data corresponding to the first attribute category to record the mapping relationship between the first attribute value and the documents to be imported; determining the temperature attributes of multiple index data based on their respective update frequencies, the temperature attributes including hot data and cold data; and storing the index data with the temperature attribute of hot data in a first storage structure and storing the index data with the temperature attribute of cold data in a second storage structure, wherein the read speed of the first storage structure is greater than the read speed of the second storage structure.

Description

Information processing method, device, equipment and medium

Technical Field

The present disclosure relates to the technical field of data processing, in particular to retrieval systems and related technologies, and can be used in application scenarios such as generative retrieval, intelligent document editing, intelligent assistants, virtual assistants, and intelligent e-commerce. More specifically, it relates to an information processing method, an information processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

A retrieval system processes, organizes, and indexes unstructured or structured data so that relevant information can be quickly located and returned in response to a user's query request. Existing retrieval systems continue to evolve in aspects such as data processing efficiency and the relevance of retrieval results.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides an information processing method, an information processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

According to an aspect of the present disclosure, there is provided an information processing method for a retrieval system including a plurality of index data corresponding to a plurality of preset attribute categories. The method comprises: extracting fields from a document to be put in storage based on the plurality of attribute categories to obtain a first attribute value of a first attribute category; updating the index data corresponding to the first attribute category to record a mapping relationship between the first attribute value and the document to be put in storage; determining temperature attributes of the plurality of index data based on respective update frequencies of the plurality of index data, wherein the temperature attributes comprise hot data and cold data; and storing the index data whose temperature attribute is hot data in a first storage structure and storing the index data whose temperature attribute is cold data in a second storage structure, wherein the reading speed of the first storage structure is greater than that of the second storage structure.

According to another aspect of the present disclosure, there is provided an information processing apparatus for a retrieval system including a plurality of index data corresponding to a plurality of preset attribute categories. The apparatus comprises a field extraction unit, an updating unit, a determining unit, and a data storage unit. The field extraction unit is configured to extract fields from a document to be put in storage based on the plurality of attribute categories to obtain a first attribute value of a first attribute category. The updating unit is configured to update the index data corresponding to the first attribute category to record a mapping relationship between the first attribute value and the document to be put in storage. The determining unit is configured to determine temperature attributes of the plurality of index data based on respective update frequencies of the plurality of index data, the temperature attributes comprising hot data and cold data. The data storage unit is configured to store the index data whose temperature attribute is hot data in a first storage structure and to store the index data whose temperature attribute is cold data in a second storage structure, wherein the reading speed of the first storage structure is greater than that of the second storage structure.

According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;

FIG. 2 shows a flow chart of an information processing method according to an embodiment of the present disclosure;

FIG. 3 illustrates a flow chart for determining temperature properties of a plurality of index data based on respective update frequencies of the plurality of index data, according to an embodiment of the present disclosure;

FIG. 4 illustrates a flow chart for dynamically determining an update frequency threshold for the index data based on a plurality of sampling points, according to an embodiment of the present disclosure;

FIG. 5 shows a flow chart of an information processing method according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of an information processing apparatus according to an embodiment of the present disclosure; and

FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

In the related art, the retrieval system has the problem of low resource utilization rate.

To solve the above problem, the present disclosure dynamically divides index data into hot data and cold data based on the update frequency of the index data, and stores the two kinds of data in storage structures having different read-write performance. In this way, frequently accessed hot data is kept in high-performance storage for faster retrieval responses, while massive cold data is kept in lower-cost storage, so that storage cost is reduced while core query performance is maintained, and the overall resource utilization of the retrieval system is improved.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the methods of the present disclosure.

In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

The user may use client devices 101, 102, 103, 104, 105, and/or 106 for human-machine interaction. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (e.g., GOOGLE Chrome OS), or include various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a Token Ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of difficult management and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

According to one aspect of the present disclosure, an information processing method is provided. The retrieval system may include a plurality of index data corresponding to a plurality of preset attribute categories. As shown in FIG. 2, the method 200 includes a step S201 of extracting fields of a document to be put in storage based on a plurality of attribute categories to obtain a first attribute value of the first attribute category, a step S202 of updating index data corresponding to the first attribute category to record a mapping relationship between the first attribute value and the document to be put in storage, a step S203 of determining temperature attributes of the plurality of index data based on respective update frequencies of the plurality of index data, the temperature attributes including hot data and cold data, and a step S204 of storing the index data with the temperature attributes of hot data in a first storage structure and storing the index data with the temperature attributes of cold data in a second storage structure, wherein a reading speed of the first storage structure is greater than a reading speed of the second storage structure.

In this way, frequently accessed hot data is guaranteed faster retrieval responses while the storage cost of massive cold data is reduced, improving the overall resource utilization of the retrieval system.

In some embodiments, the retrieval system of the present disclosure may be a search engine that provides general retrieval capabilities for a variety of services. The preset attribute categories may be understood as the names of a plurality of structured fields describing a document, such as the price of a product, the release date of an article, or the air temperature of a place. The same document to be put in storage may include attribute values corresponding to a plurality of different attribute categories. For example, a document describing a hotel room on a particular date may include the attribute value "500" for the "price" attribute category, the attribute value "4.5 points" for the "score" attribute category, and the attribute value "20XX-01-01" for the "date" attribute category. The index data is an auxiliary data structure provided to accelerate retrieval, which records the correspondence between an attribute value and the documents containing that attribute value. In one exemplary embodiment, index data corresponding to the attribute category "air temperature" may record document indexes (e.g., document IDs or other document identifications) corresponding to different air temperature values, so that documents corresponding to a particular temperature range can be found quickly.

According to some embodiments, the data structure of the index data may be a skip list, the plurality of storage nodes of the skip list each correspond to a tuple, each tuple includes an attribute value and at least one document index corresponding to the attribute value, and the plurality of storage nodes may be arranged in order of their corresponding attribute values.

A skip list is a probabilistic data structure that supports fast lookup, insertion, and deletion, with query efficiency close to that of a balanced tree. Because attribute categories are often used to sort or range-filter retrieval results, adopting the ordered skip list structure can significantly improve processing efficiency.

In one exemplary embodiment, the retrieval system may maintain a corresponding skip list for the attribute category "air temperature". When a plurality of documents have an attribute value of 30 degrees, the skip list contains one storage node with the value 30 degrees, and that node records the document index list of all documents whose air temperature is 30 degrees. Because all nodes are ordered by temperature value, when processing a range query request such as "query documents with temperature greater than 30 degrees", the system can use the ordered property of the skip list to quickly locate the node with the value "30" and efficiently traverse all subsequent nodes, so that all documents meeting the condition are quickly recalled.

Therefore, adopting the skip list as the specific implementation of the index data not only meets the storage requirement for attribute values, but also provides a high-performance basis for subsequent retrieval operations such as sorting and range queries.
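The ordered index and range query described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it uses a sorted Python list with the standard `bisect` module as a stand-in for a skip list (both keep attribute values ordered and support logarithmic-time location of a value), and the class name `OrderedAttributeIndex`, its methods, and the sample document indexes are hypothetical.

```python
import bisect
from typing import Dict, List


class OrderedAttributeIndex:
    """Ordered index for one attribute category (sorted-list stand-in for a skip list)."""

    def __init__(self) -> None:
        self._values: List[float] = []           # distinct attribute values, kept sorted
        self._docs: Dict[float, List[int]] = {}  # attribute value -> document indexes

    def add(self, value: float, doc_id: int) -> None:
        """Record that document `doc_id` carries attribute value `value`."""
        if value not in self._docs:
            bisect.insort(self._values, value)   # keep the value list ordered
            self._docs[value] = []
        self._docs[value].append(doc_id)

    def greater_than(self, value: float) -> List[int]:
        """Return document indexes whose attribute value is strictly greater than `value`."""
        start = bisect.bisect_right(self._values, value)  # first position after `value`
        result: List[int] = []
        for v in self._values[start:]:           # traverse all subsequent entries in order
            result.extend(self._docs[v])
        return result


# Hypothetical "air temperature" index with four documents.
index = OrderedAttributeIndex()
index.add(30, 305)
index.add(30, 306)
index.add(35, 410)
index.add(22, 120)
hot_days = index.greater_than(30)
```

Here `hot_days` contains only document 410, since documents 305 and 306 share the single entry for the value 30 and are excluded by the strict comparison.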

In some embodiments, prior to step S201, a document to be put in storage may be acquired. The document to be put in storage is an original information unit, such as a news item, information about a product, or a log record, that needs to be processed by the retrieval system and made queryable by users. During the storage process, the system allocates a unique document index to each document to be put in storage, where the document index is the unique identifier of the document in the system and is used to establish mapping relationships in the index data.

In step S201, the content of the document to be put in storage may be parsed to extract the structured information contained therein. In an exemplary embodiment, assuming that the document to be put in storage is a weather-related article whose content is "the highest air temperature in City A today is 30 degrees", the system may identify the first attribute category "air temperature" from the plurality of attribute categories and extract the corresponding first attribute value "30 degrees".

In step S202, the system uses the information extracted in step S201 to update the index data. In one exemplary embodiment, the system finds the index data corresponding to the attribute category "air temperature" and records the mapping relationship between the first attribute value "30 degrees" and the document index (e.g., "305") of the document to be put in storage.
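Steps S201 and S202 can be illustrated with the following sketch. The disclosure does not specify how fields are extracted, so the regular-expression rules in `EXTRACTORS`, the function names, and the plain dictionary used for the index data are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical extraction rules: one regular expression per preset attribute category.
EXTRACTORS = {
    "air temperature": re.compile(r"(\d+(?:\.\d+)?)\s*degrees"),
    "price": re.compile(r"price\s+(\d+(?:\.\d+)?)"),
}

# Index data per attribute category: attribute value -> list of document indexes.
index_data: Dict[str, Dict[str, List[int]]] = {
    category: defaultdict(list) for category in EXTRACTORS
}


def extract_fields(text: str) -> Dict[str, str]:
    """Step S201 (sketch): return the attribute values found in the document text."""
    found = {}
    for category, pattern in EXTRACTORS.items():
        match = pattern.search(text)
        if match:
            found[category] = match.group(1)
    return found


def put_in_storage(doc_id: int, text: str) -> Dict[str, str]:
    """Step S202 (sketch): record attribute value -> document index mappings."""
    fields = extract_fields(text)
    for category, value in fields.items():
        index_data[category][value].append(doc_id)
    return fields


fields = put_in_storage(305, "The highest air temperature in City A today is 30 degrees")
```

After the call, the "air temperature" index data maps the value "30" to document index 305, while the "price" index data is left unchanged because the document contains no price field.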

In step S203, the system determines how hot each index data is based on its update history. In one exemplary embodiment, the system monitors the index data of the "air temperature" attribute category, which is updated frequently as new documents arrive each day, and therefore determines its temperature attribute as hot data. The index data of another attribute category, such as "city altitude", has relatively fixed content and is updated very rarely, so its temperature attribute is determined as cold data. Updating the index data may include creating, deleting, and/or other operations, which are not limited herein.

In some embodiments, the update frequency of the index data may be compared to a predetermined or dynamically determined frequency threshold, the temperature attribute of index data having an update frequency greater than the frequency threshold may be determined as hot data, and the temperature attribute of index data having an update frequency less than the frequency threshold may be determined as cold data.
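A minimal sketch of this threshold comparison follows. The function name and the sample frequencies are hypothetical, and treating a frequency exactly equal to the threshold as hot is an assumption, since the paragraph above leaves that case open.

```python
from typing import Dict


def classify_by_threshold(update_frequency: Dict[str, float], threshold: float) -> Dict[str, str]:
    """Map each attribute category to "hot" or "cold" by its update frequency."""
    return {
        # Assumption: a frequency equal to the threshold counts as hot.
        category: "hot" if frequency >= threshold else "cold"
        for category, frequency in update_frequency.items()
    }


# Hypothetical update frequencies, in updates per day.
frequencies = {"air temperature": 1200.0, "city altitude": 0.2}
temperatures = classify_by_threshold(frequencies, threshold=50.0)
```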

In step S204, the index data whose temperature attribute is hot data is stored in the first storage structure, and the index data whose temperature attribute is cold data is stored in the second storage structure. In some embodiments, the first storage structure may be a memory, and the second storage structure may be a hard disk, such as a Solid State Disk (SSD). This approach greatly reduces the cost of storing massive, low frequency access cold data and reserves valuable memory resources for hot data that requires fast response.

According to some embodiments, a key-value database based on a log-structured merge tree (LSM tree) is deployed on the second storage structure for storing the index data whose temperature attribute is cold data. An LSM-tree-based key-value database is a storage engine optimized for write-intensive scenarios, and its sequential write mechanism avoids costly random write operations, thereby efficiently handling the sporadic updates that may occur in cold data.

In one exemplary embodiment, the LSM-tree-based key-value database may be a LevelDB database.
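The placement of hot and cold index data in step S204 might look like the sketch below. To stay within the Python standard library, it uses an in-memory dictionary as the first storage structure and an SQLite table as the on-disk key-value store; SQLite is not LSM-tree based and is only a stand-in for a database such as LevelDB. The class `TieredIndexStore`, its methods, and the file name are hypothetical.

```python
import json
import os
import sqlite3
import tempfile
from typing import Dict, List


class TieredIndexStore:
    """Hot index data in memory, cold index data in an on-disk key-value table."""

    def __init__(self, db_path: str) -> None:
        self._hot: Dict[str, Dict[str, List[int]]] = {}
        self._cold = sqlite3.connect(db_path)  # stand-in for an LSM-tree key-value store
        self._cold.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def store(self, category: str, index: Dict[str, List[int]], temperature: str) -> None:
        """Place one attribute category's index data according to its temperature."""
        if temperature == "hot":
            self._hot[category] = index
            self._cold.execute("DELETE FROM kv WHERE key = ?", (category,))
        else:
            self._hot.pop(category, None)
            self._cold.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (category, json.dumps(index)),
            )
        self._cold.commit()

    def load(self, category: str) -> Dict[str, List[int]]:
        """Read index data, checking the fast tier first."""
        if category in self._hot:
            return self._hot[category]
        row = self._cold.execute(
            "SELECT value FROM kv WHERE key = ?", (category,)
        ).fetchone()
        return json.loads(row[0]) if row else {}

    def is_in_memory(self, category: str) -> bool:
        """Report whether the category currently lives in the fast tier."""
        return category in self._hot


store = TieredIndexStore(os.path.join(tempfile.mkdtemp(), "cold_index.db"))
store.store("air temperature", {"30": [305, 306]}, "hot")
store.store("city altitude", {"1200": [17]}, "cold")
```

Calling `store.store` again with a different temperature moves the index data between tiers, which is how a periodic re-evaluation of the temperature attribute could be applied.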

According to some embodiments, as shown in FIG. 3, the step S203 of determining the temperature attributes of the plurality of index data based on the respective update frequencies of the plurality of index data may include, for each index data of the plurality of index data: step S301 of obtaining a plurality of sampling points by acquiring the numbers of updates of the index data counted at a preset sampling interval within a historical statistics period; step S302 of dynamically determining an update frequency threshold of the index data based on the plurality of sampling points; and step S303 of determining the temperature attribute of the index data based on the update frequency of the index data in a current statistical period and the update frequency threshold.

This approach therefore abandons a fixed hot/cold threshold in favor of a dynamic, adaptive strategy. The strategy determines an update frequency threshold suited to each index data by analyzing that index data's own historical update behavior. It can thus distinguish the hot and cold attributes of different index data more reasonably and accurately, further improving the utilization efficiency of storage resources.

In step S301, the retrieval system may establish a time series of update behavior for each index data. In one exemplary embodiment, the preset sampling interval may be one day and the historical statistics period may be one week. The system continuously monitors the index data corresponding to an attribute category, such as "commodity price", and counts the total number of updates (including additions, modifications, or deletions) that occur during each day. The total number of updates for each day is one sampling point. After a historical statistics period (one week) has ended, the system obtains a sequence of update counts consisting of 7 sampling points for subsequent analysis.
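The sampling in step S301 can be sketched as follows, assuming the system has a timestamp for every update to a given index data. The function name, the event timestamps, and the period start date are illustrative values, not taken from the disclosure.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Iterable, List


def daily_sampling_points(
    update_times: Iterable[datetime], period_start: date, days: int = 7
) -> List[int]:
    """Count updates per day over `days` days starting at `period_start`."""
    counts = Counter(ts.date() for ts in update_times)
    # One sampling point per day; days without updates contribute 0.
    return [counts.get(period_start + timedelta(days=i), 0) for i in range(days)]


start = date(2025, 1, 6)
events = [
    datetime(2025, 1, 6, 9, 0),
    datetime(2025, 1, 6, 15, 30),
    datetime(2025, 1, 8, 11, 45),
    datetime(2025, 1, 12, 23, 59),
    datetime(2025, 1, 13, 0, 1),  # falls outside the one-week period and is ignored
]
points = daily_sampling_points(events, start)
```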

In step S302, the retrieval system may dynamically determine an update frequency threshold for the index data based on the sampling points. More specifically, the update frequency threshold F is not a globally fixed value, but a function of the attribute category and time, which may be expressed as F = f(attribute category name, time t). This means that the retrieval system independently and periodically calculates a separate update frequency threshold for each attribute category (e.g., "commodity price" and "user score"). Because the update behavior of each attribute category follows its own periodic pattern, determining the threshold with a function of the attribute category and time makes the hot/cold determination more targeted and accurate.

In one exemplary embodiment, the retrieval system may dynamically determine the update frequency threshold using a plurality of sampling points for the last three cycles (e.g., the last three weeks) of the historical statistics cycles. The determination may include calculating a maximum value, an average value, or other processing of the sampling points.

In step S303, the retrieval system uses the update frequency threshold obtained in the previous step to determine the data heat of the current period. The update frequency of the current statistical period may be, for example, an average value of a plurality of sampling points within the current statistical period. In one exemplary embodiment, the system compares an update frequency threshold calculated based on historical data of certain index data with the update frequency of the current statistical period. If the current update frequency is lower than the threshold, the system determines the temperature attribute of the index data as cold data, and otherwise, determines the index data as hot data.
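Steps S302 and S303 can be sketched together as below. The disclosure leaves the exact threshold computation open (a maximum, an average, or other processing), so taking the mean of all sampling points in the last three historical statistics periods is an assumption, as are the function names and sample data.

```python
from statistics import mean
from typing import Sequence


def update_frequency_threshold(history: Sequence[Sequence[int]], cycles: int = 3) -> float:
    """Step S302 (sketch): mean of all sampling points in the most recent `cycles` periods."""
    recent = history[-cycles:]
    return mean(point for period in recent for point in period)


def temperature_attribute(current_period: Sequence[int], threshold: float) -> str:
    """Step S303 (sketch): cold if the current average update count is below the threshold."""
    current_frequency = mean(current_period)
    return "cold" if current_frequency < threshold else "hot"


# Hypothetical daily update counts for one index data over three past weeks.
history = [
    [10, 12, 11, 9, 10, 30, 28],
    [11, 13, 10, 10, 12, 29, 27],
    [9, 11, 12, 10, 11, 31, 30],
]
threshold = update_frequency_threshold(history)
quiet_week = temperature_attribute([2, 3, 1, 2, 2, 5, 6], threshold)
busy_week = temperature_attribute([20] * 7, threshold)
```

With this sample data the threshold works out to 16 updates per day, so the quiet week (average 3) is classified as cold and the busy week (average 20) as hot. Because the threshold is computed per index data, the same function can be applied independently to each attribute category.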

According to some embodiments, the historical statistics period may include a previous statistical period and a latest statistical period following the previous statistical period. As shown in FIG. 4, the step S302 of dynamically determining the update frequency threshold of the index data based on the plurality of sampling points may include: step S401 of fitting the sampling points in the latest statistical period to obtain a fitted curve; step S402 of, in response to determining that the shape difference between the fitted curve and a preset standard curve satisfies a first preset condition, calculating a variation trend of the sampling points in the latest statistical period relative to the sampling points at corresponding sampling times in the previous statistical period; and step S403 of, in response to determining that the variation trend satisfies a second preset condition, determining the update frequency threshold based on an average value of the sampling points in the latest statistical period.

Thus, the update behavior of the index data can be analyzed more comprehensively by the multi-stage judgment method. This approach focuses not only on the updated data distribution pattern in the most recent statistical period, but also on its trend of change relative to the historical period. When the data behavior meets the two conditions of mode stability and trend stability, the threshold value is determined by adopting the data in the latest statistical period, so that the robustness of threshold value determination is increased, and inaccurate cold and hot judgment due to data fluctuation is avoided.

In step S401, in one exemplary embodiment, the retrieval system may apply a curve fitting algorithm to the 7 daily update-count sampling points of the "commodity price" index data in the latest statistical period (e.g., the last week) to generate a fitted curve.

In step S402, the system may compare the resulting fitted curve with a preset standard curve (e.g., a cosine curve) representing a general business pattern. If the shape difference between the two curves is small, the first preset condition is considered satisfied, that is, the recent update pattern is normal, and the variation trend is then calculated.

According to some embodiments, the first preset condition indicates that, after the sampling points on the fitted curve and on the preset standard curve are each sorted by value, the number of point pairs at the same rank whose squared difference is smaller than a first preset threshold is not smaller than a second preset threshold.

Thus, the method provides a quantifiable and deterministic way of judging the shape difference between the two curves. Rather than directly comparing the values of the two curves at the same time point, the method sorts all sampling points within their respective periods and compares the overall value distributions of the two curves. An abstract shape difference is thereby converted into a concrete numerical comparison, which ensures the objectivity and consistency of the matching judgment.

In an exemplary embodiment, assume that N sampling points are collected in the most recent statistical period and that N sampling points are likewise taken from the preset standard curve. The retrieval system first sorts each of the two groups of N sampling points in ascending order of value. It then calculates, one by one, the squared difference between the two sampling points at the same sorted position (for example, the smallest point of the fitted curve is compared with the smallest point of the standard curve, and so on). The system counts the number of point pairs whose squared difference is less than the first preset threshold (e.g., n). If that number is greater than or equal to the second preset threshold (e.g., m), the first preset condition is determined to be satisfied.
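The sorted pairwise comparison described above may be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that both curves are sampled at the same number of points and expressed on a comparable scale, which the disclosure does not state explicitly.

```python
def shape_matches(fitted, standard, sq_diff_threshold, min_matching_pairs):
    """Check the first preset condition on two equally sized groups of points.

    fitted: sampling points on the fitted curve.
    standard: sampling points on the preset standard curve.
    sq_diff_threshold: the first preset threshold (on the squared difference).
    min_matching_pairs: the second preset threshold (on the number of pairs).
    """
    if len(fitted) != len(standard):
        raise ValueError("both curves must provide the same number of points")
    # Sort each group by value, then pair points that share the same rank.
    ranked_pairs = zip(sorted(fitted), sorted(standard))
    # Count the pairs whose squared difference is below the first threshold.
    matching = sum(1 for f, s in ranked_pairs if (f - s) ** 2 < sq_diff_threshold)
    return matching >= min_matching_pairs
```

Because only sorted values are compared, a fitted curve that is shifted in time relative to the standard curve can still match, which is consistent with comparing value distributions rather than point-in-time values.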

In some embodiments, the variation trend may be calculated by comparing the sampling value for a given day of the most recent week (e.g., Monday) with the sampling value for the same day (e.g., Monday) in each of the two earlier statistical periods. If the difference is small, the second preset condition is considered satisfied, that is, the variation trend is stable.

According to some embodiments, the second preset condition indicates that the difference between the value of a sampling point in the most recent statistical period and the value of the sampling point at the corresponding sampling time in the previous statistical period is smaller than a preset ratio.

Therefore, the method provides a quantifiable way of judging whether the variation trend satisfies the condition, so that the retrieval system can detect abrupt changes in the update frequency.

In one exemplary embodiment, the preset ratio may be 50%. When calculating the variation trend, the system compares a sampling point in the most recent statistical period (e.g., 900 updates) with the sampling point at the corresponding sampling time in the previous statistical period (e.g., 800 updates). The calculated relative difference is (900-800)/800 = 12.5%. Since 12.5% is less than 50%, the second preset condition is determined to be satisfied.
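The relative-difference check in this example may be sketched as follows. The names are hypothetical. Taking the absolute value of the difference, requiring every corresponding pair of sampling points to pass, and the handling of a zero baseline are all assumptions of this sketch rather than details stated in the disclosure.

```python
def relative_change(recent, previous):
    """Relative difference of a recent sampling point against its counterpart."""
    if previous == 0:
        # Assumed convention: no change if both are zero, otherwise unbounded.
        return 0.0 if recent == 0 else float("inf")
    return abs(recent - previous) / previous


def trend_is_stable(recent_samples, previous_samples, preset_ratio=0.5):
    """Check the second preset condition over corresponding sampling points."""
    if len(recent_samples) != len(previous_samples):
        raise ValueError("periods must contain the same number of sampling points")
    return all(
        relative_change(r, p) < preset_ratio
        for r, p in zip(recent_samples, previous_samples)
    )
```

With the figures used above, `relative_change(900, 800)` evaluates to 0.125, which is below the preset ratio of 0.5.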

In step S403, in an exemplary embodiment, if both the curve shape difference and the variation trend of the sampling points satisfy their respective preset conditions, the retrieval system may determine the average value of the seven sampling points of the most recent week as the update frequency threshold for the current statistical period.

According to some embodiments, step S302 further includes determining an update frequency threshold based on a maximum of the plurality of sampling points within the historical statistics period in response to determining that the shape difference of the fitted curve and the preset standard curve does not satisfy the first preset condition.

Thus, the above manner provides a reliable fallback strategy for dynamic threshold determination. When the recent update pattern of the index data exhibits abnormal fluctuations, the retrieval system no longer uses the recent average value but instead adopts the more conservative historical maximum value as the threshold. This effectively prevents anomalous data (such as a sudden drop in the number of updates on a particular day) from causing active index data to be wrongly judged as cold data, thereby preserving the retrieval performance of hot data.

In an exemplary embodiment, if the system determines that the shape difference between the fitted curve of the most recent week of the "commodity price" index data and the preset standard curve does not satisfy the first preset condition, the largest of the 21 sampling points in the historical statistical period (the past three weeks) may be selected as the update frequency threshold.

According to some embodiments, step S302 further includes determining an update frequency threshold based on an average of a plurality of sampling points within the historical statistics period in response to determining that the trend of variation does not satisfy the second preset condition.

Thus, this approach provides another fallback strategy for dynamic threshold determination. When the update pattern of the index data is normal but its values fluctuate drastically compared with the history (i.e., the trend is unstable), the retrieval system no longer relies solely on the average value of the most recent period. By calculating the average of all sampling points over a longer historical period (e.g., the past three weeks), the abnormal fluctuation of a single recent week is smoothed out, yielding a more stable and representative frequency reference and making the determination of the update frequency threshold more reliable.

In one exemplary embodiment, if the system determines that the update trend of the "commodity price" index data over the most recent week does not satisfy the second preset condition, the average of all 21 sampling points over the past three weeks is calculated and taken as the update frequency threshold.
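Taken together, step S403 and the two fallback branches described above may be sketched as a single selection function. This is a minimal sketch under stated assumptions: the function name is hypothetical, the outcomes of the first and second preset conditions are passed in as booleans rather than computed here, and the historical statistical period is represented as a list of per-period sample lists ordered from oldest to most recent.

```python
from statistics import mean


def update_frequency_threshold(periods, shape_ok, trend_ok):
    """Select the update frequency threshold for one index data.

    periods: per-period lists of sampling points, oldest first; the last
             entry is the most recent statistical period.
    shape_ok: whether the first preset condition (curve shape) is satisfied.
    trend_ok: whether the second preset condition (variation trend) is satisfied.
    """
    all_samples = [s for period in periods for s in period]
    if not shape_ok:
        # Abnormal recent pattern: fall back to the historical maximum.
        return max(all_samples)
    if not trend_ok:
        # Normal pattern but unstable trend: fall back to the historical mean.
        return mean(all_samples)
    # Both conditions satisfied: use the mean of the most recent period.
    return mean(periods[-1])
```

In the three-week example, `periods` would hold three lists of seven daily sampling points, so the fallbacks operate on all 21 points while the normal branch uses only the last seven.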

According to some embodiments, as shown in FIG. 5, the information processing method 500 may further include: at step S505, receiving a query request to the retrieval system and determining, among the plurality of attribute categories, a second attribute category corresponding to the query request; and at step S506, reading the index data from the corresponding storage structure based on the temperature attribute of the index data corresponding to the second attribute category. For steps S501 to S504 in FIG. 5, reference may be made to the above description of steps S201 to S204, which is not repeated here.

Therefore, in this manner, the retrieval system can determine the temperature attribute of the corresponding index data according to the second attribute category targeted by the query request, and then select the corresponding data reading path. This enables the retrieval system to optimize the query flow in a targeted way, thereby improving overall retrieval efficiency.

In one exemplary embodiment, the user initiates a query request "query for items with a price greater than 100 yuan". In step S505, the retrieval system receives the request and determines therefrom that the second attribute category for which the query is directed is "price". In step S506, the retrieval system first determines the temperature attribute of the "price" index data. If it is hot data, it is read from a first storage structure (e.g. memory) and if it is cold data, it is read from a second storage structure (e.g. hard disk).
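The routing in step S506 may be sketched as follows. The class `TieredIndexStore` and its methods are hypothetical; two in-memory dictionaries stand in for the first storage structure (e.g., memory) and the second storage structure (e.g., hard disk), so the sketch illustrates only the selection of the read path, not the difference in read speed.

```python
class TieredIndexStore:
    """Keeps each attribute category's index data in exactly one tier."""

    def __init__(self):
        self._hot = {}   # stand-in for the first (faster) storage structure
        self._cold = {}  # stand-in for the second (slower) storage structure

    def store(self, category, index_data, temperature):
        """Place index data in the tier that matches its temperature attribute."""
        if temperature not in ("hot", "cold"):
            raise ValueError("temperature must be 'hot' or 'cold'")
        # Remove any earlier copy so the category lives in a single tier.
        self._hot.pop(category, None)
        self._cold.pop(category, None)
        target = self._hot if temperature == "hot" else self._cold
        target[category] = index_data

    def read(self, category):
        """Return (storage structure name, index data) for a query category."""
        if category in self._hot:
            return "first", self._hot[category]
        if category in self._cold:
            return "second", self._cold[category]
        raise KeyError(category)
```

For the query above, the system would call `read("price")` and follow whichever path the returned storage structure name indicates.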

According to some embodiments, step S506, reading index data from the respective storage structure based on the temperature attribute of the index data corresponding to the second attribute category may include, in response to determining that the temperature attribute of the index data corresponding to the second attribute category is cold data, recording a time consumption of reading the data from the second storage structure, and comparing the time consumption with at least one preset time consumption threshold to selectively process the index data read from the second storage structure.

Therefore, by actively monitoring the time consumption of queries and comparing it with preset thresholds, the retrieval system can make a dynamic trade-off between query performance and the completeness of data recall, thereby effectively preventing individual slow queries from affecting the stability and response speed of the whole system.

In one exemplary embodiment, when the retrieval system needs to read the "city altitude" index data, which is cold data, from the second storage structure, it starts a timer at the same time as it initiates the read operation. After the data read is complete, the system records the time taken for the entire process, e.g., 35 milliseconds. The system then compares the actual time consumption of 35 milliseconds with the preset time consumption thresholds to determine the subsequent processing of the returned results.
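Recording the time consumption of a cold read may be sketched as follows. The helper `timed_read` is hypothetical, and `read_fn` stands for whatever callable actually reads from the second storage structure; the use of a monotonic clock is an implementation choice of this sketch.

```python
import time


def timed_read(read_fn):
    """Run a read operation and return (data, elapsed time in milliseconds)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    data = read_fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return data, elapsed_ms
```

The measured `elapsed_ms` is the value that would subsequently be compared with the preset time consumption thresholds.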

According to some embodiments, the at least one preset time consumption threshold may include a first time consumption threshold and a second time consumption threshold, and the first time consumption threshold may be less than the second time consumption threshold. Comparing the time consumption with at least one preset time consumption threshold to selectively process the index data read from the second storage structure may include truncating the read index data to preserve a preset proportion of the index data as a return result in response to determining that the time consumption is between the first time consumption threshold and the second time consumption threshold.

Thus, this approach provides a progressive service degradation strategy. When the performance of a cold data query falls within an acceptable critical range, the system chooses to return a portion of the most relevant results rather than all of them. This avoids both long waits and outright failure, provides the user with valuable partial information while ensuring a faster response, and improves the user experience.

In one exemplary embodiment, assume that the first time consumption threshold is 20 milliseconds and the second time consumption threshold is 50 milliseconds. If the recorded query takes 35 milliseconds, the system performs a truncation operation, since the value is between 20 and 50 milliseconds. For example, if the originally recalled index data includes 1000 document indexes and the preset proportion is 50%, the system retains and returns only 500 of them.

According to some embodiments, comparing the time consumption to at least one preset time consumption threshold to selectively process index data read from the second storage structure may include returning an empty list as a return result in response to determining that the time consumption is greater than the second time consumption threshold.

Thus, this approach provides a circuit-breaker protection mechanism for the retrieval system. When the time consumption of a cold data query exceeds the higher threshold, the system actively abandons the query. This effectively prevents extremely slow queries, caused by excessive hard disk I/O pressure or other reasons, from occupying system resources for a long time, thereby safeguarding the overall availability and stability of the retrieval service.

In one exemplary embodiment, if the second time consumption threshold is 50 milliseconds and the actual time consumption of a cold data query reaches 60 milliseconds, the system determines that the query has timed out and returns an empty list directly to the upper-level application, terminating the subsequent processing flow.

According to some embodiments, comparing the time consumption with at least one preset time consumption threshold to selectively process the index data read from the second storage structure includes, in response to determining that the time consumption is less than the first time consumption threshold, retaining all of the index data as a returned result.

Thus, the method ensures that the user can obtain the complete query result when cold data query performance is good. This ensures that the hierarchical storage policy does not affect the comprehensiveness of information recall when the retrieval system is under light load or data locality is good.

In one exemplary embodiment, if the first time consumption threshold is 20 milliseconds and the actual time consumption of a cold data query is 15 milliseconds, the retrieval system may use all of the index data read from the second storage structure for subsequent retrieval procedures.
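The three outcomes described above (full result, truncated result, and empty list) may be sketched in one function. The function name and default values mirror the examples given but are otherwise hypothetical. Two points are assumptions of this sketch: a time consumption exactly equal to either threshold is treated as falling between them, and the recalled list is assumed to be already ordered so that keeping its head retains the most relevant entries.

```python
def process_cold_result(doc_indexes, elapsed_ms, first_ms=20.0,
                        second_ms=50.0, keep_ratio=0.5):
    """Selectively process index data read from the second storage structure.

    doc_indexes: recalled document indexes, assumed ordered by relevance.
    elapsed_ms: recorded time consumption of the cold read, in milliseconds.
    """
    if elapsed_ms > second_ms:
        # Circuit-breaker branch: give up and return an empty list.
        return []
    if elapsed_ms < first_ms:
        # Fast enough: keep all of the index data.
        return list(doc_indexes)
    # Between the thresholds: truncate to the preset proportion.
    keep = int(len(doc_indexes) * keep_ratio)
    return list(doc_indexes[:keep])
```

With 1000 recalled document indexes, a 35 millisecond read would yield 500 entries, a 60 millisecond read would yield none, and a 15 millisecond read would yield all 1000.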

In some embodiments, after the index data is read from the first storage structure or the second storage structure to obtain the returned document indexes, the retrieval system may perform further processing. The subsequent processing may include, for example, merging the plurality of document index lists recalled under different query conditions according to the logic of the query request, invoking a ranking policy to rank the merged document indexes, and obtaining the corresponding forward information of the documents according to the ranked document indexes and assembling it into a final result for provision to the user. The forward information may be a store of original attributes and content keyed by document index (e.g., document ID), used for result reconstruction and presentation.
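The merge, rank, and assemble sequence may be sketched as follows. Everything here is a hypothetical illustration: the disclosure does not define the ranking policy, so a precomputed score per document index is assumed, and the query logic is reduced to a simple "and" (intersection) or "or" (union) of the recalled lists.

```python
def assemble_results(recalled_lists, scores, forward_store, logic="and"):
    """Merge recalled document index lists, rank them, and attach forward info.

    recalled_lists: one list of document indexes per query condition.
    scores: assumed ranking scores keyed by document index (higher is better).
    forward_store: forward information keyed by document index.
    """
    sets = [set(lst) for lst in recalled_lists]
    if not sets:
        return []
    # Merge according to the logic of the query request.
    merged = set.intersection(*sets) if logic == "and" else set.union(*sets)
    # Rank by descending score; ties are broken by document index.
    ranked = sorted(merged, key=lambda d: (-scores.get(d, 0.0), d))
    # Assemble the final result from the forward information.
    return [{"doc": d, **forward_store[d]} for d in ranked if d in forward_store]
```

Keeping the forward information in a separate store keyed by document index means the ranked list can be turned into presentable results without touching the index data again.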

According to some embodiments, the information processing method further includes: before updating the index data corresponding to the first attribute category, storing a triplet comprising the first attribute category, the document index of the document to be imported, and the first attribute value into a preset hash table structure; and, after reading the index data of the second attribute category, using the document index recorded in the read index data to query the hash table structure and obtain the attribute value corresponding to the document index, for use in subsequent processing of the read index data.

Thus, in the above manner, the retrieval system establishes a mechanism for storing and querying attribute values that is independent of the index data. The hash table structure is populated in advance during the database construction stage, so that, once the retrieval system has obtained a batch of document indexes later in the retrieval flow, the corresponding original attribute values can be looked up efficiently by document index. This separate storage design avoids unnecessary traversal of the complex main index structure when only the attribute value is needed, optimizes the data acquisition path, and improves the overall performance of the retrieval system.

In one exemplary embodiment, the hash table structure may be a two-dimensional hash table. In the database construction stage, when a document to be imported is processed whose index is "305", whose attribute category is "air temperature", and whose attribute value is "30 ℃", the system stores the triplet <air temperature, 305, 30 ℃> into the hash table structure before updating the index data (e.g., a skip list) corresponding to "air temperature". In the retrieval stage, assume that the system recalls, via the index data, a set of documents that includes document index "305". To obtain the specific air temperature value of that document for ranking or result presentation, the system can use "air temperature" and "305" as keys to perform a fast lookup in the hash table structure, thereby directly obtaining the attribute value "30 ℃".
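A two-dimensional hash table of this kind may be sketched as a dictionary of dictionaries, keyed first by attribute category and then by document index. The class name `AttributeValueTable` and its methods are hypothetical and serve only to illustrate the triplet storage and lookup.

```python
from collections import defaultdict


class AttributeValueTable:
    """Stores <attribute category, document index, attribute value> triplets."""

    def __init__(self):
        # Outer key: attribute category; inner key: document index.
        self._table = defaultdict(dict)

    def put(self, category, doc_index, value):
        """Record a triplet, to be called before the index data is updated."""
        self._table[category][doc_index] = value

    def get(self, category, doc_index, default=None):
        """Look up the attribute value for a recalled document index."""
        # .get on the outer level avoids creating empty inner dictionaries.
        return self._table.get(category, {}).get(doc_index, default)


table = AttributeValueTable()
table.put("air temperature", "305", "30 ℃")
```

Under these assumptions, `table.get("air temperature", "305")` returns the stored value in expected constant time, without any traversal of the skip list that holds the index data.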

According to another aspect of the present disclosure, there is provided an information processing apparatus for a retrieval system. The retrieval system comprises a plurality of index data corresponding to a plurality of preset attribute categories. As shown in FIG. 6, the apparatus 600 includes: a field extraction unit 610 configured to perform field extraction on a document to be imported based on the plurality of attribute categories to obtain a first attribute value of a first attribute category; an updating unit 620 configured to update the index data corresponding to the first attribute category to record a mapping relationship between the first attribute value and the document to be imported; a determination unit 630 configured to determine temperature attributes of the plurality of index data based on respective update frequencies of the plurality of index data, the temperature attributes including hot data and cold data; and a data storage unit 640 configured to store the index data whose temperature attribute is hot data in a first storage structure and store the index data whose temperature attribute is cold data in a second storage structure, the read speed of the first storage structure being greater than the read speed of the second storage structure.

It will be appreciated that, for the operation and effect of units 610 to 640 in apparatus 600, reference may be made to the above description of steps S201 to S204.
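The cooperation of units 610 to 640 may be sketched as a single class whose methods correspond to the four units. This is a simplified, hypothetical illustration: documents are assumed to be dictionaries whose keys are attribute categories, a fixed threshold on the total update count replaces the dynamic threshold determination described earlier, and two dictionaries stand in for the first and second storage structures.

```python
class InformationProcessingApparatus:
    def __init__(self, categories):
        self.categories = list(categories)
        # Per category: attribute value -> set of document indexes.
        self.index = {c: {} for c in self.categories}
        self.update_counts = {c: 0 for c in self.categories}
        self.first_storage = {}   # stand-in for the faster storage structure
        self.second_storage = {}  # stand-in for the slower storage structure

    def extract_fields(self, document):
        """Unit 610: extract attribute values for the preset categories."""
        return {c: document[c] for c in self.categories if c in document}

    def update_index(self, doc_index, fields):
        """Unit 620: record the mapping from attribute value to document."""
        for category, value in fields.items():
            self.index[category].setdefault(value, set()).add(doc_index)
            self.update_counts[category] += 1

    def determine_temperatures(self, threshold):
        """Unit 630: classify each index data by its update count."""
        return {c: ("cold" if n < threshold else "hot")
                for c, n in self.update_counts.items()}

    def store(self, temperatures):
        """Unit 640: place each index data in the matching storage structure."""
        self.first_storage = {c: self.index[c]
                              for c, t in temperatures.items() if t == "hot"}
        self.second_storage = {c: self.index[c]
                               for c, t in temperatures.items() if t == "cold"}


apparatus = InformationProcessingApparatus(["price", "altitude"])
for doc_index, doc in [("d1", {"price": 100}), ("d2", {"price": 120}),
                       ("d3", {"price": 90, "altitude": 50})]:
    apparatus.update_index(doc_index, apparatus.extract_fields(doc))
apparatus.store(apparatus.determine_temperatures(threshold=2))
```

In this usage example the "price" index data is updated three times and lands in the first storage structure, while the "altitude" index data is updated once and lands in the second.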

In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with relevant laws and regulations and do not violate public order and good morals.

According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.

Referring to fig. 7, a block diagram of an electronic device 700 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the electronic device 700 are connected to the I/O interface 705, including an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700. The input unit 706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through computer networks, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods, procedures, and/or processes described above. For example, in some embodiments, the methods, procedures, and/or processes may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the methods, procedures, and/or processes described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the methods, procedures, and/or processes in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, and a blockchain network.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed technical solutions are achieved, which is not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims as granted and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims

1. An information processing method for a retrieval system, the retrieval system comprising multiple index data corresponding to multiple preset attribute categories, the method comprising:

Based on the multiple attribute categories, fields are extracted from the document to be imported to obtain the first attribute value of the first attribute category;

The index data corresponding to the first attribute category is updated to record the mapping relationship between the first attribute value and the document to be imported;

Based on the update frequency of each of the multiple index data, the temperature attribute of the multiple index data is determined, wherein the temperature attribute includes hot data and cold data; and

Index data with a temperature attribute of hot data is stored in the first storage structure, and index data with a temperature attribute of cold data is stored in the second storage structure, wherein the read speed of the first storage structure is greater than that of the second storage structure.

2. The method according to claim 1, wherein determining the temperature attribute of the plurality of index data based on their respective update frequencies comprises:

For each of the multiple index data, the number of times the index data is updated within the historical statistical period according to a preset sampling interval is obtained, thus obtaining multiple sampling points;

Based on the multiple sampling points, the update frequency threshold of the index data is dynamically determined; and

The temperature attribute of the index data is determined based on the update frequency of the index data in the current statistical period and the update frequency threshold.

3. The method according to claim 2, wherein the historical statistical period includes a previous statistical period and the most recent statistical period following the previous statistical period, and dynamically determining the update frequency threshold of the index data based on the plurality of sampling points includes:

The sampling points within the most recent statistical period are fitted to obtain a fitted curve;

In response to determining that the shape difference between the fitted curve and the preset standard curve satisfies a first preset condition, the trend of change of the sampling points in the most recent statistical period relative to the sampling points corresponding to the sampling time in the earlier statistical period is calculated; and

In response to determining that the change trend meets the second preset condition, the update frequency threshold is determined based on the average value of the sampling points in the most recent statistical period.

4. The method according to claim 3, wherein dynamically determining the update frequency threshold of the index data based on the plurality of sampling points includes:

In response to the determination that the shape difference between the fitted curve and the preset standard curve does not meet the first preset condition, the update frequency threshold is determined based on the maximum value of multiple sampling points within the historical statistical period.

5. The method according to claim 3, wherein the first preset condition indicates that after the sampling points on the fitted curve and the preset standard curve are sorted by numerical value, the number of pairs of points with the same index whose square difference is less than the first preset threshold is not less than the second preset threshold.

6. The method according to claim 3, wherein dynamically determining the update frequency threshold of the index data based on the plurality of sampling points includes:

In response to determining that the trend of change does not meet the second preset condition, the update frequency threshold is determined based on the average value of multiple sampling points within the historical statistical period.

7. The method according to claim 3, wherein the second preset condition indicates that the numerical difference between the sampling points in the most recent statistical period and the sampling points at the corresponding sampling times in the earlier statistical period is less than a preset proportion.
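The branching described in claims 3, 4, 6, and 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the name `update_frequency_threshold` is hypothetical, the result of the first-condition check is passed in as `shape_ok`, the "preset proportion" is read as a relative difference per sampling time, and the threshold is taken to be the average or maximum itself rather than a value derived from it.

```python
from statistics import mean


def update_frequency_threshold(earlier, recent, shape_ok, max_ratio):
    """Pick an update frequency threshold from two statistical periods.

    earlier, recent: sampling points aligned by sampling time.
    shape_ok: whether the first preset condition was satisfied.
    max_ratio: assumed reading of the "preset proportion".
    """
    history = list(earlier) + list(recent)
    if not shape_ok:
        # Claim 4: fall back to the maximum over the whole history.
        return max(history)
    # Claim 7 (assumed reading): every recent point stays within
    # max_ratio of its counterpart in the earlier period.
    stable = all(
        abs(r - e) < max_ratio * abs(e) for e, r in zip(earlier, recent)
    )
    if stable:
        # Claim 3: use the average of the most recent period.
        return mean(recent)
    # Claim 6: use the average over the whole history.
    return mean(history)
```

Under these assumptions, a stable and well-shaped recent period yields a threshold that tracks current load, while an irregular curve falls back to the more conservative historical maximum.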

8. The method according to any one of claims 1-7, further comprising:

A query request to the retrieval system is received, and a second attribute category corresponding to the query request is determined from among the plurality of attribute categories; and

Based on the temperature attribute of the index data corresponding to the second attribute category, the index data is read from the corresponding storage structure.

9. The method according to claim 8, wherein reading index data from the corresponding storage structure based on the temperature attribute of the index data corresponding to the second attribute category includes:

In response to determining that the temperature attribute of the index data corresponding to the second attribute category is cold data, the time consumption of reading data from the second storage structure is recorded; and

The time consumption is compared with at least one preset time consumption threshold to selectively process the index data read from the second storage structure.

10. The method according to claim 9, wherein the at least one preset time consumption threshold includes a first time consumption threshold and a second time consumption threshold, the first time consumption threshold being less than the second time consumption threshold, and comparing the time consumption with at least one preset time consumption threshold to selectively process index data read from the second storage structure includes:

In response to determining that the time consumption is between the first time consumption threshold and the second time consumption threshold, the read index data is truncated to retain a preset proportion of index data as the return result.

11. The method according to claim 10, wherein comparing the time consumption with at least one preset time consumption threshold to selectively process index data read from the second storage structure includes:

In response to determining that the time consumption is greater than the second time consumption threshold, an empty list is returned as the return result.

12. The method according to claim 10, wherein comparing the time consumption with at least one preset time consumption threshold to selectively process index data read from the second storage structure includes:

In response to determining that the time consumption is less than the first time consumption threshold, all of the read index data is retained as the return result.
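The tiered handling of cold reads in claims 9 to 12 can be sketched as below. The function names, the `read_fn` callback, and the treatment of values exactly equal to a threshold (handled here as the truncation case) are assumptions made for illustration; the claims do not specify the boundary behavior or which part of the data is kept on truncation.

```python
import time


def degrade(entries, elapsed, first_threshold, second_threshold, keep_ratio):
    """Select the return result from the measured read time."""
    if elapsed > second_threshold:
        # Claim 11: the read was too slow, return an empty list.
        return []
    if elapsed >= first_threshold:
        # Claim 10: keep only a preset proportion of the entries.
        # Keeping the leading entries is an assumption of this sketch.
        keep = int(len(entries) * keep_ratio)
        return list(entries[:keep])
    # Claim 12: the read was fast, keep everything.
    return list(entries)


def read_cold_index(read_fn, first_threshold, second_threshold, keep_ratio):
    """Time a cold-storage read (claim 9) and post-process its result."""
    start = time.perf_counter()
    entries = read_fn()
    elapsed = time.perf_counter() - start
    return degrade(entries, elapsed, first_threshold, second_threshold, keep_ratio)
```

Separating `degrade` from the timed read keeps the decision logic testable with fixed elapsed times.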

13. The method according to claim 8, further comprising:

Before updating the index data corresponding to the first attribute category, a triple including the first attribute category, the document index of the document to be imported, and the first attribute value is stored in a preset hash table structure; and

After reading the index data of the second attribute category, the document index recorded in the read index data is used to query and obtain the attribute value corresponding to the document index from the hash table structure, so as to perform subsequent processing on the read index data.
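A minimal sketch of the hash table in claim 13 follows, using a Python dict. Keying the table by the pair (attribute category, document index) is an assumption of this example, as are the function names; the claim only states that the triple is stored and later queried by document index.

```python
def record_triple(table, category, doc_index, value):
    """Store a (category, document index, attribute value) triple."""
    table[(category, doc_index)] = value


def resolve_values(table, category, doc_indexes):
    """Look up the attribute value for each document index read from an index.

    Returns (document index, attribute value) pairs; the value is None
    when no triple was recorded for that document.
    """
    return [(doc, table.get((category, doc))) for doc in doc_indexes]
```

For example, after `record_triple(table, "price", "doc-7", 199)`, a later read that returns `"doc-7"` for the `"price"` category can recover `199` without touching the document itself.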

14. The method according to any one of claims 1-7, wherein the data structure of the index data is a skip list, each of the multiple storage nodes of the skip list corresponds to a tuple, each tuple includes an attribute value and at least one document index corresponding to the attribute value, and the multiple storage nodes are arranged in order according to their corresponding attribute values.
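The ordered tuple layout of claim 14 can be illustrated with a simplified stand-in. The sketch below does not implement a skip list; it uses a sorted Python list with the standard `bisect` module to show the same logical arrangement, in which each node holds an attribute value and its document indexes, and nodes are ordered by attribute value. The class and method names are hypothetical.

```python
import bisect


class OrderedAttributeIndex:
    """Sorted-list stand-in for the skip list described in claim 14."""

    def __init__(self):
        self._values = []  # attribute values, kept in ascending order
        self._docs = {}    # attribute value -> list of document indexes

    def add(self, value, doc_index):
        """Record that doc_index carries the given attribute value."""
        if value not in self._docs:
            bisect.insort(self._values, value)
            self._docs[value] = []
        self._docs[value].append(doc_index)

    def range_query(self, low, high):
        """Return (value, doc indexes) tuples with low <= value <= high."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return [(v, self._docs[v]) for v in self._values[start:end]]
```

Keeping nodes ordered by attribute value is what makes range lookups cheap; a skip list provides the same ordering with logarithmic expected insertion cost, which the sorted list here does not.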

15. The method according to claim 14, wherein a key-value database based on a log-structured merge-tree is deployed on the second storage structure for storing index data whose temperature attribute is cold data.

16. An information processing apparatus for a retrieval system, the retrieval system comprising multiple index data corresponding to multiple preset attribute categories, the apparatus comprising:

The field extraction unit is configured to extract fields from the document to be imported based on the multiple attribute categories, so as to obtain the first attribute value of the first attribute category;

The update unit is configured to update the index data corresponding to the first attribute category to record the mapping relationship between the first attribute value and the document to be imported;

The determining unit is configured to determine the temperature attributes of the plurality of index data based on the respective update frequencies of the index data, the temperature attributes including hot data and cold data; and

The data storage unit is configured to store index data with a temperature attribute of hot data into a first storage structure and index data with a temperature attribute of cold data into a second storage structure, wherein the read speed of the first storage structure is greater than the read speed of the second storage structure.
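The cooperation of the determining unit and the data storage unit can be sketched as follows. The rule that index data whose update frequency reaches a threshold counts as hot data is an assumption of this example, as are the function name and the use of two dicts to stand in for the first and second storage structures.

```python
def assign_storage(indexes, update_frequency, threshold):
    """Split index data into (hot_store, cold_store) by update frequency.

    indexes: attribute category -> index data.
    update_frequency: attribute category -> observed update frequency.
    threshold: assumed cut-off between hot data and cold data.
    """
    hot_store, cold_store = {}, {}
    for category, index_data in indexes.items():
        # Assumption: frequently updated index data is treated as hot.
        if update_frequency.get(category, 0) >= threshold:
            hot_store[category] = index_data
        else:
            cold_store[category] = index_data
    return hot_store, cold_store
```

In a deployment matching the claims, the hot store would be the faster first storage structure and the cold store the slower second storage structure.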

17. An electronic device, characterized in that the electronic device comprises:

At least one processor; and

A memory communicatively connected to the at least one processor; wherein

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.

18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-15.

19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-15.