Disclosure of Invention
The invention aims to provide a data storage method supporting retrieval, an indexing method, and a storage system, which store data in a way that supports later retrieval of the data and effectively avoid invalid retrievals.
To this end, in a first aspect, the present invention provides a data storage method supporting retrieval, including:
comparing the content of a single storage body with a stop word list to determine a plurality of feature words in the single storage body;
determining a characterization capability parameter of each feature word for the storage body;
generating an index information table from the feature words of the storage body and their characterization capability parameters, and storing the index information table together with the storage body;
wherein the feature words are the words that, with the words contained in the stop word list excluded, rank within a preset number of top positions by number of occurrences in the storage body;
the characterization capability parameter is determined from the continuous feature value and the number of occurrences of the feature word, and is positively correlated with each of them;
for a storage body of text type, the continuous feature value is the ratio of the maximum number of consecutive fields in which the feature word appears, among a plurality of consecutive fields of equal interval, to the total number of fields contained in the storage body;
for a storage body of video or audio type, the continuous feature value is the ratio of the maximum number of consecutive periods in which the feature word appears, among a plurality of consecutive periods of equal interval, to the total number of periods contained in the storage body.
As a preferred technical solution of the data storage method supporting retrieval, in the step of determining the characterization capability parameter of each feature word for the storage body, the product of the continuous feature value of the feature word and the number of occurrences of the feature word is determined as the characterization capability parameter of the feature word for the storage body.
As a preferred technical solution of the data storage method supporting retrieval, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being text:
counting the words that occur more than once in the storage body;
comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
As a preferred technical solution of the data storage method supporting retrieval, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being audio:
performing audio text recognition on the storage body;
counting the words that occur more than once in the storage body;
comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
As a preferred technical solution of the data storage method supporting retrieval, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being video:
performing audio text recognition and image text recognition on the storage body, respectively;
counting the words that occur more than once in the storage body;
comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
As a preferred technical solution of the data storage method supporting retrieval, after the characterization capability parameters of the feature words for the storage body are determined, the method further includes:
performing semantic analysis on the feature words of the storage body, and setting the characterization capability parameters of a plurality of feature words having the same semantics to the highest value among the characterization capability parameters of those feature words.
In a second aspect, the present invention provides an indexing method for searching a database obtained by the data storage method supporting retrieval described above, including:
determining the storage bodies whose feature words match the retrieval content;
reading the corresponding index information tables, and determining the display order of the index information for the storage bodies according to the characterization capability parameters of the feature words for the storage bodies.
As a preferred technical solution of the indexing method, determining the display order of the index information for the storage bodies according to the characterization capability parameters of the feature words for the storage bodies includes:
determining, for each storage body, the feature words that match the retrieval content;
taking the descending order of the sum of the characterization capability parameters of the feature words of each storage body that match the retrieval content as the display order of the index information for the storage bodies.
In a third aspect, the present invention provides a data storage system that stores data by applying the data storage method supporting retrieval described above, including:
a data storage module for storing the storage body;
an extraction module connected to the data storage module and used for extracting the feature words of the storage body and counting the number of occurrences and the occurrence nodes of the feature words;
a calculation module connected to the extraction module and used for calculating the continuous feature values and characterization capability parameters of the feature words;
an index support module connected to the extraction module and the calculation module respectively and used for generating and storing an index data table that includes the feature words of the storage body and their characterization capability parameters.
As a preferred technical solution of the data storage system, the calculation module is provided with a semantic analysis unit used for determining feature words having the same semantics and refreshing the characterization capability parameters of those feature words.
The beneficial effects of the invention are as follows:
In the data storage method supporting retrieval of the present invention, determining the characterization capability parameter takes into account the correlation between how continuously a word appears in the storage body and how well that word characterizes the storage body. In practical applications, a feature word may appear only in a concentrated manner within a certain period. In that case the feature word characterizes only the content of that period, so its number of occurrences alone is a weak indicator, whereas a feature word that appears continuously across many adjacent periods is more representative. A characterization capability parameter determined from both the number of occurrences and the continuous feature value indicates the extent to which the feature word runs through the storage body, and therefore better reflects the correlation between the feature word and the storage body.
Furthermore, the continuous feature values of storage bodies of different types are determined using the ratio of fields to total fields and the ratio of periods to total periods, respectively. As a result, storage bodies of audio, video, and text types respond to retrieval uniformly, the characterization capability parameters are comparable across storage bodies of different types, the index information displayed when retrieving multiple file types is accurate and reliable, and effective indexing of the retrieved content in the database is further ensured.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used merely for convenience and simplicity of description, do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the terms "first location" and "second location" denote two distinct locations. A first feature being "above", "over", or "on top of" a second feature includes the first feature being directly above or obliquely above the second feature, or simply indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or simply indicates that the first feature is at a lower level than the second feature.
In the description of the present invention, unless explicitly stated or limited otherwise, the terms "mounted", "coupled", and "connected" are to be construed broadly. They may denote, for example, a fixed, detachable, or integral connection; a mechanical or electrical connection; a direct connection or an indirect connection through an intermediate medium; or communication between two elements. The specific meaning of these terms in the present invention will be understood by those of ordinary skill in the art according to the specific case.
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
For ease of understanding, the following term used in the present application is explained:
Stop words: in information retrieval, certain characters or words are automatically filtered out before or after data processing in order to save storage space and improve search efficiency; these characters or words are called stop words.
Referring to fig. 1, the present embodiment provides a data storage method supporting retrieval, including:
Step S1, comparing the content of a single storage body with the stop word list to determine a plurality of feature words in the single storage body, where the storage body is understood to be data to be stored in the database;
Step S2, determining the characterization capability parameter of each feature word for the storage body;
Step S3, generating an index information table from the feature words of the storage body and their characterization capability parameters, and storing the index information table together with the storage body.
The feature words are the words that, with the words contained in the stop word list excluded, rank within a preset number of top positions by number of occurrences in the storage body. In detail, the stop word list needs to be determined according to the actual scenario, such as language and field; building a stop word list is prior art and is not described further here.
The characterization capability parameter is determined from the continuous feature value and the number of occurrences of the feature word, and is positively correlated with each of them.
For a storage body of text type, the continuous feature value is the ratio of the maximum number of consecutive fields in which the feature word appears, among a plurality of consecutive fields of equal interval, to the total number of fields contained in the storage body.
For a storage body of video or audio type, the continuous feature value is the ratio of the maximum number of consecutive periods in which the feature word appears, among a plurality of consecutive periods of equal interval, to the total number of periods contained in the storage body.
In the above embodiment, determining the characterization capability parameter takes into account the correlation between how continuously a word appears in the storage body and how well that word characterizes the storage body. In practical applications, a feature word may appear only in a concentrated manner within a certain period. In that case the feature word characterizes only the content of that period, so its number of occurrences alone is a weak indicator, whereas a feature word that appears continuously across adjacent periods is more representative. A characterization capability parameter determined from both the number of occurrences and the continuous feature value indicates the extent to which the feature word runs through the storage body, and therefore better reflects the correlation between the feature word and the storage body.
Furthermore, the continuous feature values of storage bodies of different types are determined using the ratio of fields to total fields and the ratio of periods to total periods, respectively. With this arrangement, storage bodies of audio, video, and text types respond to retrieval uniformly, the characterization capability parameters are comparable across storage bodies of different types, the index information displayed when retrieving multiple file types is accurate and reliable, and effective indexing of the retrieved content in the database is further ensured.
In detail, before the continuous feature value is determined, the storage body is divided into a plurality of fields or periods of equal interval. To ensure comparability among fields or periods of different lengths, the number of fields or periods must be directly proportional to the length of the fields or periods, and the number of fields or periods must be greater than 5.
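The equal-interval division described above can be sketched as follows for a text storage body. This is a minimal sketch under stated assumptions: the words are already tokenized, the caller chooses the number of fields, and the function names `split_into_fields` and `fields_containing` are hypothetical and do not come from the embodiment.

```python
from typing import List


def split_into_fields(words: List[str], num_fields: int) -> List[List[str]]:
    """Divide a word sequence into num_fields fields of (nearly) equal length."""
    if num_fields <= 5:
        # The embodiment requires more than 5 fields or periods.
        raise ValueError("num_fields must be greater than 5")
    if len(words) < num_fields:
        raise ValueError("storage body is too short for the requested division")
    size, extra = divmod(len(words), num_fields)
    fields: List[List[str]] = []
    start = 0
    for i in range(num_fields):
        # Assumption: any remainder is spread over the first `extra` fields.
        end = start + size + (1 if i < extra else 0)
        fields.append(words[start:end])
        start = end
    return fields


def fields_containing(fields: List[List[str]], word: str) -> List[bool]:
    """Flag, for each field, whether the single-token word occurs in it."""
    return [word in field for field in fields]
```

For an audio or video storage body, the same idea would apply to time periods of equal duration instead of word fields.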
Specifically, when the characterization capability parameter of each feature word for the storage body is determined, the product of the continuous feature value of the feature word and the number of occurrences of the feature word is determined as the characterization capability parameter of the feature word for the storage body.
For example, a storage body is divided into 100 fields of 200 words each. The feature word "data set" appears in the 10th through 19th fields and in the 15th through 18th fields, so the maximum number of consecutive fields in which the feature word appears is 10 and the continuous feature value is 10/100 = 0.1. The feature word "data set" occurs 98 times in the entire storage body, so its characterization capability parameter for the storage body is 0.1 × 98 = 9.8.
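The calculation in this example can be reproduced with a short sketch. It assumes that the presence of the feature word in each field (or period) is already available as a list of booleans; the function names are hypothetical.

```python
from typing import Sequence


def max_consecutive_run(flags: Sequence[bool]) -> int:
    """Length of the longest run of consecutive fields containing the word."""
    best = current = 0
    for present in flags:
        current = current + 1 if present else 0
        best = max(best, current)
    return best


def continuous_feature_value(flags: Sequence[bool]) -> float:
    """Ratio of the longest consecutive run to the total number of fields."""
    if not flags:
        raise ValueError("the storage body must contain at least one field")
    return max_consecutive_run(flags) / len(flags)


def characterization_parameter(flags: Sequence[bool], occurrences: int) -> float:
    """Product of the continuous feature value and the occurrence count."""
    return continuous_feature_value(flags) * occurrences


# Worked example from the text: 100 fields, word present in fields 10 to 19
# (1-based), 98 occurrences in the whole storage body.
flags = [10 <= i <= 19 for i in range(1, 101)]
parameter = characterization_parameter(flags, occurrences=98)
print(round(parameter, 2))
```

The product is one way to make the parameter positively correlated with both inputs, as the embodiment specifies; the same functions apply unchanged to periods of an audio or video storage body.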
Specifically, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being text:
Step S101, counting the words that occur more than once in the storage body;
Step S102, comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
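Steps S101 and S102 can be sketched as below. This is a minimal sketch, assuming the text has already been split into words; the function name and the alphabetical tie-break between words with equal counts are assumptions, since the embodiment does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def extract_feature_words(
    words: Iterable[str], stop_words: Set[str], preset_number: int
) -> List[Tuple[str, int]]:
    """Return the top preset_number repeated words, stop words excluded."""
    # Count every word that is not in the stop word list.
    counts = Counter(w for w in words if w not in stop_words)
    # Step S101: keep only the words that occur more than once.
    repeated = [(w, n) for w, n in counts.items() if n > 1]
    # Step S102: rank by occurrence count, highest first
    # (ties broken alphabetically, which is an assumption).
    repeated.sort(key=lambda item: (-item[1], item[0]))
    return repeated[:preset_number]


text = "the data set stores the data and the index of the data set"
features = extract_feature_words(text.split(), {"the", "and", "of"}, 2)
print(features)
```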
Specifically, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being audio:
Step S111, performing audio text recognition on the storage body;
Step S112, counting the words that occur more than once in the storage body;
Step S113, comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
Specifically, with the words contained in the stop word list excluded, the following steps are performed to determine the feature words in the storage body in response to the storage body category being video:
Step S121, performing audio text recognition and image text recognition on the storage body, respectively;
Step S122, counting the words that occur more than once in the storage body;
Step S123, comparing the counted numbers of occurrences of the words, and determining the words ranked within the preset number of top positions as the feature words of the storage body.
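The three category-specific flows differ only in how the words are obtained before counting. The sketch below illustrates that dispatch. The recognizers are passed in as plain callables and are hypothetical placeholders: no particular speech recognition or image text recognition library is assumed, and the embodiment does not name one.

```python
from typing import Callable, Dict, List

# A recognizer turns raw bytes into a list of recognized words (assumption).
Recognizer = Callable[[bytes], List[str]]


def words_for_storage_body(
    category: str, payload: bytes, recognizers: Dict[str, Recognizer]
) -> List[str]:
    """Obtain the word sequence of a storage body according to its category."""
    if category == "text":
        # Text needs no recognition step (UTF-8 encoding is an assumption).
        return payload.decode("utf-8").split()
    if category == "audio":
        # Step S111: audio text recognition.
        return recognizers["audio"](payload)
    if category == "video":
        # Step S121: audio text recognition plus image text recognition.
        return recognizers["audio"](payload) + recognizers["image"](payload)
    raise ValueError(f"unsupported storage body category: {category}")
```

The returned words can then be counted and ranked in the same way for all three categories.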
Of course, in implementation, the preset number needs to be set adaptively according to the amount of data contained in the storage body. The preset number must not be too small, so that the preset number of feature words can effectively represent the content of the storage body; nor may it be too large, so that few words irrelevant to the content of the storage body appear among the feature words and the waste of storage space and computing power is reduced.
In the above embodiment, it is considered that, in practical applications, the original textual description of an audio, video, or text storage body is not necessarily accurate or effectively usable, and the invalid or poorly performing retrievals caused by this problem need to be eliminated. By recognizing the text contained in the storage body during the storage process, the influence of this phenomenon is avoided and the retrievability of the resulting database is improved.
Specifically, after the characterization capability parameters of the feature words for the storage body are determined, the method further includes:
Performing semantic analysis on the feature words of the storage body, and setting the characterization capability parameters of a plurality of feature words having the same semantics to the highest value among the characterization capability parameters of those feature words. Optionally, natural language processing (NLP) techniques are used for the semantic analysis. Performing semantic analysis and redetermining the characterization capability parameters further improves the retrieval effect.
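The redetermination step can be sketched as follows. The sketch assumes that the semantic analysis has already produced groups of feature words with the same meaning; how those groups are obtained is outside its scope, and the function name and the sample values are hypothetical.

```python
from typing import Dict, Iterable


def unify_synonym_parameters(
    params: Dict[str, float], synonym_groups: Iterable[Iterable[str]]
) -> Dict[str, float]:
    """Give every feature word in a synonym group the group's highest value."""
    updated = dict(params)  # leave the caller's table untouched
    for group in synonym_groups:
        # Only words that are actually feature words of this storage body count.
        members = [w for w in group if w in params]
        if len(members) < 2:
            continue
        highest = max(params[w] for w in members)
        for w in members:
            updated[w] = highest
    return updated


# Illustrative values only.
params = {"internet": 0.18, "web": 0.30, "data": 0.22}
refreshed = unify_synonym_parameters(params, [["internet", "web", "net"]])
print(refreshed)
```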
Referring to fig. 2, the present embodiment further provides an indexing method for searching a database obtained by the data storage method supporting retrieval described above, including:
Step S01, determining the storage bodies whose feature words match the retrieval content;
Step S02, reading the corresponding index information tables, and determining the display order of the index information for the storage bodies according to the characterization capability parameters of the feature words for the storage bodies.
Specifically, determining the display order of the index information for the storage bodies according to the characterization capability parameters of the feature words for the storage bodies includes:
determining, for each storage body, the feature words that match the retrieval content;
taking the descending order of the sum of the characterization capability parameters of the feature words of each storage body that match the retrieval content as the display order of the index information for the storage bodies.
Illustratively, the retrieval content is "internet data".
Storage body A contains the feature word "internet", with a characterization capability parameter of 0.18.
Storage body B contains the feature word "data", with a characterization capability parameter of 0.22.
Storage body C contains the feature word "internet", with a characterization capability parameter of 0.18, and the feature word "data", with a characterization capability parameter of 0.22.
Storage body A: sum of characterization capability parameters = 0.18.
Storage body B: sum of characterization capability parameters = 0.22.
Storage body C: sum of characterization capability parameters = 0.18 + 0.22 = 0.40.
The index information is displayed as:
1. Storage body C (0.40);
2. Storage body B (0.22);
3. Storage body A (0.18).
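The ordering in this example can be reproduced with a short sketch. It assumes that the retrieval content has already been split into terms and that each index information table is available as a mapping from feature word to characterization capability parameter; both the splitting and the function name are assumptions.

```python
from typing import Dict, List, Tuple


def rank_storage_bodies(
    query_terms: List[str], index_tables: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Order matching storage bodies by the sum of matched parameters."""
    scored = []
    for body_id, table in index_tables.items():
        matched = [table[term] for term in query_terms if term in table]
        if matched:  # step S01: keep only storage bodies with a match
            scored.append((body_id, sum(matched)))
    # Step S02: display order is from the largest sum to the smallest.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


index_tables = {
    "A": {"internet": 0.18},
    "B": {"data": 0.22},
    "C": {"internet": 0.18, "data": 0.22},
}
ranking = rank_storage_bodies(["internet", "data"], index_tables)
for position, (body_id, total) in enumerate(ranking, start=1):
    print(f"{position}. storage body {body_id} ({total:.2f})")
```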
Referring to fig. 3, the present embodiment further provides a data storage system that stores data by applying the data storage method supporting retrieval described above, including:
a data storage module for storing the storage body;
an extraction module connected to the data storage module and used for extracting the feature words of the storage body and counting the number of occurrences and the occurrence nodes of the feature words;
a calculation module connected to the extraction module and used for calculating the continuous feature values and characterization capability parameters of the feature words;
an index support module connected to the extraction module and the calculation module respectively and used for generating and storing an index data table that includes the feature words of the storage body and their characterization capability parameters.
As a preferred technical solution of the data storage system, the calculation module is provided with a semantic analysis unit used for determining feature words having the same semantics and refreshing the characterization capability parameters of those feature words, that is, setting the characterization capability parameters of a plurality of feature words having the same semantics to the highest value among the characterization capability parameters of those feature words.
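One possible shape for the table produced by the index support module is sketched below. The record layout, the field names, the identifier, and the path in the usage lines are all hypothetical; the embodiment only requires that the table hold the feature words and their characterization capability parameters and be stored together with the storage body.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class StorageRecord:
    """A storage body reference kept together with its index table (assumed layout)."""
    body_id: str
    category: str      # "text", "audio" or "video"
    content_ref: str   # hypothetical location of the storage body itself
    index_table: Dict[str, float] = field(default_factory=dict)


def build_record(
    body_id: str, category: str, content_ref: str, feature_params: Dict[str, float]
) -> StorageRecord:
    """Bundle the feature words and their parameters with the storage body."""
    return StorageRecord(body_id, category, content_ref, dict(feature_params))


# Hypothetical identifier and path, for illustration only.
record = build_record("doc-001", "text", "bodies/doc-001.txt", {"data set": 9.8})
serialized = json.dumps(asdict(record), ensure_ascii=False)
print(serialized)
```

Serializing to JSON is only one option; any storage format that keeps the table and the storage body associated would serve the same purpose.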
The modules involved in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, a processor may be described as including a data storage module, an extraction module, a calculation module, and an index support module. The names of these modules do not constitute a limitation on the module itself in some cases.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is to be understood that the above embodiments of the present invention are merely examples given for clarity of illustration and do not limit the implementations of the present invention. Other variations or modifications will be apparent to those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.