
CN114818727B - Keyword extraction method and device - Google Patents

Info

Publication number
CN114818727B
Authority
CN
China
Prior art keywords
key sentence
target
key
sentence set
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210412327.4A
Other languages
Chinese (zh)
Other versions
CN114818727A (en
Inventor
王得贤
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202210412327.4A
Publication of CN114818727A
Application granted
Publication of CN114818727B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a key sentence extraction method and device. The method comprises: obtaining a target document, and extracting keywords and a first key sentence set based on the text content of the target document; extracting first semantic features of the keywords and second semantic features of each text sentence in the target document, and determining a second key sentence set according to the first semantic features and the second semantic features; and determining a target key sentence set according to the first key sentence set and the second key sentence set. The method can effectively improve the accuracy and efficiency of key sentence extraction.

Description

Key sentence extraction method and device
Technical Field
The application relates to the technical field of natural language processing, and in particular to a key sentence extraction method. The application also relates to a key sentence extraction device, a computing device, and a computer-readable storage medium.
Background
With the development of artificial intelligence in the field of computer technology, the field of natural language processing has also developed rapidly, and text-based information retrieval is an important branch of that field. Artificial intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key technologies in the artificial intelligence field include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, and virtual reality/augmented reality. Natural language processing (NLP) is an important research direction in computer science; it studies the theories and methods that enable effective communication between humans and computers in natural language. Concrete applications of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, and speech recognition. As natural language processing technology develops and the pace of life accelerates, the useful information delivered to users needs to become shorter and shorter; key sentence extraction can be used to extract key sentences and thereby condense that information.
In a long document, the number of sentences grows with the length of the document, which makes key sentences hard to find and the key sentences that are determined inaccurate. To ensure accuracy, key sentences in articles currently have to be found manually. However, this approach consumes a great deal of manpower and material resources and is extremely inefficient. An effective solution to these problems is therefore needed.
Disclosure of Invention
In view of this, the embodiments of the present application provide a key sentence extraction method to overcome the technical defects of the prior art. The embodiments of the application also provide a key sentence extraction device, a computing device, and a computer-readable storage medium.
According to a first aspect of an embodiment of the present application, there is provided a key sentence extraction method, including:
Acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document;
Extracting first semantic features of the keywords and second semantic features of each text sentence in the target document, and determining a second key sentence set according to the first semantic features and each second semantic feature;
And determining a target key sentence set according to the first key sentence set and the second key sentence set.
According to a second aspect of an embodiment of the present application, there is provided a key sentence extraction device, including:
A first acquisition module, configured to acquire a target document and extract keywords and a first key sentence set based on the text content of the target document;
A first determining module, configured to extract first semantic features of the keywords and second semantic features of each text sentence in the target document, and determine a second key sentence set according to the first semantic features and each second semantic feature;
And a second determining module, configured to determine a target key sentence set according to the first key sentence set and the second key sentence set.
According to a third aspect of embodiments of the present application, there is provided a computing device comprising:
A memory and a processor;
The memory is used for storing computer-executable instructions, and the processor executes the computer-executable instructions to implement the steps of the key sentence extraction method.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the key sentence extraction method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the key sentence extraction method.
The key sentence extraction method provided by the application comprises: obtaining a target document, and extracting keywords and a first key sentence set based on the text content of the target document; extracting first semantic features of the keywords and second semantic features of each text sentence in the target document, and determining a second key sentence set according to the first semantic features and each second semantic feature; and determining a target key sentence set according to the first key sentence set and the second key sentence set. Because the first key sentence set is determined from the text content of the target document, its key sentences carry text-level information. Because the second key sentence set is determined from the first semantic features of the keywords and the second semantic features of the text sentences, key sentences can be determined more accurately at the semantic level, so the key sentences in the second key sentence set carry semantic-level information. The target key sentence set is then determined from both sets, so its key sentences contain both text-level and semantic-level information, which improves the accuracy of the determined key sentences. In addition, the method extracts key sentences automatically: it avoids manual extraction while preserving accuracy, which improves extraction efficiency and reduces extraction cost.
Drawings
FIG. 1 is a schematic diagram of a key sentence extracting method according to an embodiment of the present application;
FIG. 2 is a flowchart of a key sentence extraction method according to an embodiment of the present application;
FIG. 3A is a schematic diagram of a similarity analysis model in a key sentence extraction method according to an embodiment of the present application;
FIG. 3B is a flowchart of determining text similarity in a key sentence extraction method according to an embodiment of the present application;
FIG. 4 is a process flow diagram of a key sentence extraction method applied to document recall according to one embodiment of the present application;
FIG. 5 is a schematic diagram of a key sentence extracting device according to an embodiment of the present application;
FIG. 6 is a block diagram of a computing device according to one embodiment of the application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
The terminology used in the one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the application. As used in one or more embodiments of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the application.
First, terms related to one or more embodiments of the present invention will be explained.
The TextRank algorithm is a graph-based ranking algorithm that is widely used for keyword extraction in natural language processing; it can extract keywords, phrases, and key sentences and can automatically generate text summaries.
The TF-IDF (term frequency-inverse document frequency) algorithm is a common weighting technique for information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency.
The LDA (Latent Dirichlet Allocation) algorithm is a method for computing topic models and has no direct relation to word vectors.
The latent semantic analysis (LSA) algorithm is mainly used for extracting topics from texts, mining the meaning behind texts, reducing data dimensionality, and the like.
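As a small illustration of the TF-IDF weighting defined above, the following sketch computes one common, unsmoothed variant (tf = term count / document length, idf = log(N / df)). The function name `tf_idf` and the toy corpus are assumptions made for illustration; the application does not prescribe a particular variant.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical, already-tokenized corpus.
corpus = [["key", "sentence", "key"], ["sentence", "graph"]]
scores = tf_idf(corpus)
```

A term that occurs in every document ("sentence" here) receives weight 0 under this variant, while a term concentrated in one document ("key") receives a positive weight.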
The present application provides a key sentence extraction method. It also relates to a key sentence extraction device, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments.
The key sentence extraction method provided by the embodiments of the application may be executed by a server or a terminal, which is not limited in the embodiments of the application. The terminal may be any electronic product capable of human-computer interaction with a user, such as a PC (Personal Computer), a mobile phone, a Pocket PC (PPC), or a tablet computer. The server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center, which is likewise not limited in the embodiments of the application.
Referring to the schematic diagram of the key sentence extraction method shown in FIG. 1: a target document is first acquired; keywords and a first key sentence set of the target document are then extracted based on its text content; next, first semantic features of the keywords and second semantic features of each text sentence in the target document are determined, and a second key sentence set is determined according to the first semantic features and the second semantic features; finally, the target key sentence set of the target document is determined according to the first key sentence set and the second key sentence set.
In this method, the first key sentence set is determined from the text content of the target document, so its key sentences carry text-level information. The second key sentence set is determined from the first semantic features of the keywords and the second semantic features of the text sentences, so key sentences can be determined more accurately at the semantic level and the key sentences in the second key sentence set carry semantic-level information. The target key sentence set is determined from both sets, so its key sentences contain both text-level and semantic-level information, which improves the accuracy of the determined key sentences. In addition, key sentences are extracted automatically: manual extraction is avoided while accuracy is preserved, which improves extraction efficiency and reduces extraction cost.
FIG. 2 is a flowchart of a key sentence extraction method according to an embodiment of the present application, which specifically includes the following steps:
Step 202: acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document.
Key sentences (a key sentence set) are to be extracted from the target document. The extraction process is basically the same for texts of different fields or categories, such as medical texts, astronomy texts, long texts, and short texts; the process is described in detail below.
Specifically, text refers to a written form of language, usually a sentence or a combination of sentences with a complete, systematic meaning; a text may be a sentence, a paragraph, or a chapter. A document is a file containing text. The target document is the document from which key sentences are to be extracted. Text content is the written content of the document. Keywords are words that express the subject matter of the document, and key sentences are sentences that express the subject matter of the document. A key sentence set is a collection of one or more key sentences, and the first key sentence set is the key sentence set obtained at the text (literal) level.
In practical applications, the target document can be obtained in various ways. For example, an operator may send the execution body an instruction to extract key sentences, or an instruction to obtain the target document, and the execution body begins obtaining the target document after receiving the instruction. Alternatively, the execution body may obtain the target document automatically at preset intervals; for example, after each preset time period, a server or terminal with the key sentence extraction function automatically obtains the target document. The present specification does not limit the manner in which the target document is acquired.
In addition, the target document may be in any format, such as DOC (Document) format, TXT format, an image format, or PDF (Portable Document Format), which is not limited in this specification.
After the target document is obtained, its text content can be extracted as follows: a text box extraction tool corresponding to the format of the target document is selected, and the tool extracts text boxes from the target document, where the text boxes contain the characters or text that make up the text content. Selecting the extraction tool that matches the format of the target document improves the accuracy and speed of text content extraction.
For example, if the acquired target document is in PDF format, the pdfminer tool corresponding to the PDF format is selected and applied to the target document, so that at least one text box containing text content is extracted and the text content of the target document is obtained. As another example, if the acquired target document is in an image format, an optical character recognition (OCR) tool corresponding to the image format is selected and applied to the target document, so that at least one text box containing text content is extracted and the text content of the target document is obtained.
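The format-dependent tool selection described above can be sketched as a lookup from file extension to extraction function. This is only an illustration under stated assumptions: the names `EXTRACTORS`, `extract_txt`, `extract_pdf`, and `extract_text_content` are hypothetical, the PDF branch assumes the `pdfminer.six` package is installed and uses its `extract_text` helper (which returns plain text rather than individual text boxes), and no OCR branch is shown.

```python
from pathlib import Path
from typing import Callable, Dict

Extractor = Callable[[Path], str]


def extract_txt(path: Path) -> str:
    # Plain text needs no special tool.
    return path.read_text(encoding="utf-8")


def extract_pdf(path: Path) -> str:
    # Assumption: pdfminer.six is installed; imported lazily so that the
    # rest of the sketch works without it.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


# Hypothetical registry mapping a document format to its extraction tool.
EXTRACTORS: Dict[str, Extractor] = {
    ".txt": extract_txt,
    ".pdf": extract_pdf,
}


def extract_text_content(path: Path) -> str:
    """Select the extractor that matches the document format and apply it."""
    suffix = path.suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extraction tool registered for {suffix!r}")
    return EXTRACTORS[suffix](path)
```

An image format could be supported in the same way by registering an OCR-based function under the corresponding extensions.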
In one or more optional embodiments of the present disclosure, after the text content of the target document is obtained, keywords and at least one key sentence may be extracted directly from the text content by a preset key sentence extraction tool; the extracted keywords are determined as the keywords of the target document, and the extracted key sentence or sentences are determined as the first key sentence set of the target document. This improves the efficiency of determining the keywords and the first key sentence set.
Step 204: extracting first semantic features of the keywords and second semantic features of each text sentence in the target document, and determining a second key sentence set according to the first semantic features and each second semantic feature.
After the keywords and the first key sentence set have been obtained from the text content of the target document, the second key sentence set of the target document is determined according to the first semantic features of the keywords and the second semantic features of each text sentence.
Specifically, semantic features are features corresponding to the linguistic meaning of a plurality of word units; the first semantic features are the semantic features of the keywords; the second semantic features are the semantic features of a text sentence; and the second key sentence set is the key sentence set obtained at the semantic level.
In one or more optional embodiments of the present disclosure, after the keywords are obtained, the first semantic features of the keywords and the second semantic features of each text sentence in the target document may each be extracted by a preset semantic feature extraction tool, which improves the accuracy of the extracted first and second semantic features.
Further, the first semantic features of the keywords are compared with the second semantic features of each text sentence, and the second key sentence set of the target document is determined from the text sentences.
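One way to picture the comparison step is with feature vectors and a similarity threshold. The sketch below is an assumption for illustration only: it treats the semantic features as plain numeric vectors, uses cosine similarity as the comparison, and keeps sentences whose similarity reaches a hypothetical `threshold`. The passage above does not fix the comparison measure or the selection rule.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_by_similarity(
    keyword_vec: Sequence[float],
    sentence_vecs: List[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[int]:
    """Return indices of sentences whose features are close to the keyword's."""
    return [
        i
        for i, vec in enumerate(sentence_vecs)
        if cosine(keyword_vec, vec) >= threshold
    ]


# Toy 2-dimensional features standing in for real semantic features.
selected = select_by_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With these toy vectors the first and third sentences are selected, because their similarity to the keyword vector (1.0 and about 0.71) is at least 0.5.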
Step 206: determining a target key sentence set according to the first key sentence set and the second key sentence set.
After the first key sentence set and the second key sentence set have been determined, the target key sentence set of the target document is determined from them.
Specifically, the target key sentence set is the finally determined key sentence set of the target document, that is, the set containing the target key sentences.
In one possible implementation of the embodiments of the present disclosure, after the first key sentence set and the second key sentence set are obtained, the union of the two sets may be taken as the key sentence set of the target document; that is, the key sentences contained in the first key sentence set and the second key sentence set are combined into one set to obtain the target key sentence set of the target document. This ensures the completeness of the target key sentence set, that is, it improves the completeness and accuracy of the extracted key sentences.
For example, if the first key sentence set includes key sentence 1, key sentence 2, key sentence 3, and key sentence 4, and the second key sentence set includes key sentence 2, key sentence 4, key sentence 5, and key sentence 6, then the target key sentence set includes key sentence 1, key sentence 2, key sentence 3, key sentence 4, key sentence 5, and key sentence 6.
In another possible implementation of the embodiments of the present disclosure, after the first key sentence set and the second key sentence set are obtained, the intersection of the two sets may be taken as the key sentence set of the target document; that is, the key sentences contained in both the first key sentence set and the second key sentence set form the target key sentence set of the target document. This ensures the accuracy of the target key sentence set, that is, every extracted key sentence carries key information at both the text content level and the semantic level.
Continuing the above example, if the first key sentence set includes key sentence 1, key sentence 2, key sentence 3, and key sentence 4, and the second key sentence set includes key sentence 2, key sentence 4, key sentence 5, and key sentence 6, then the target key sentence set includes key sentence 2 and key sentence 4.
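The two combination strategies can be reproduced in a few lines. This is a minimal sketch; the function name `merge_key_sentences` and the `mode` parameter are illustrative and do not come from the application. Lists are used instead of sets so that the original sentence order is preserved.

```python
from typing import List


def merge_key_sentences(first: List[str], second: List[str], mode: str = "union") -> List[str]:
    """Combine two key sentence sets by union or intersection, keeping order."""
    if mode == "union":
        # dict.fromkeys removes duplicates while preserving first occurrence.
        return list(dict.fromkeys(first + second))
    if mode == "intersection":
        in_second = set(second)
        return [s for s in dict.fromkeys(first) if s in in_second]
    raise ValueError(f"unknown mode: {mode!r}")


first_set = ["key sentence 1", "key sentence 2", "key sentence 3", "key sentence 4"]
second_set = ["key sentence 2", "key sentence 4", "key sentence 5", "key sentence 6"]

union = merge_key_sentences(first_set, second_set, "union")
common = merge_key_sentences(first_set, second_set, "intersection")
```

Here `union` holds key sentences 1 through 6 and `common` holds key sentences 2 and 4, matching the two examples in the text.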
In one or more alternative embodiments of the present disclosure, the keywords and the first key sentence set of the target document may be extracted directly by a preset key sentence extraction tool. Alternatively, the keywords and a third key sentence set of the target document may be extracted directly by the preset key sentence extraction tool, and a fourth key sentence set of the target document may then be determined according to the keywords and the target document; the third key sentence set and the fourth key sentence set are together referred to as the first key sentence set. That is, when the first key sentence set includes the third key sentence set and the fourth key sentence set, extracting the keywords and the first key sentence set based on the text content of the target document may be implemented as follows:
Extracting the keywords and the third key sentence set of the target document from the text content of the target document by using a text-content-based extraction algorithm;
Identifying, according to the keywords, the target text sentences in the target document that contain the keywords, and constructing the fourth key sentence set of the target document from the target text sentences.
Specifically, the third key sentence set is the key sentence set obtained directly from the text content by the text-content-based extraction algorithm, and the fourth key sentence set is the key sentence set composed of the sentences that contain keywords.
In practical application, after the text content of the target document is obtained, keywords and at least one key sentence can be extracted directly from the text content by the preset key sentence extraction tool; the extracted keywords are determined as the keywords of the target document, and the extracted key sentence or sentences are determined as the third key sentence set of the target document. Then, for each text sentence in the target document, it is checked whether the sentence contains a keyword: if so, the sentence is a target text sentence; if not, it is a non-target text sentence. After every text sentence in the target document has been traversed, all target text sentences form the fourth key sentence set. This improves the completeness of the first key sentence set and, in turn, the efficiency of determining the target key sentence set.
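The traversal that builds the fourth key sentence set can be sketched as a simple filter. The function name `build_fourth_set` is hypothetical, and plain substring matching is an assumption made for brevity; a real system might match on segmented words instead.

```python
from typing import Iterable, List


def build_fourth_set(sentences: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Keep every text sentence that contains at least one keyword."""
    keyword_list = list(keywords)
    return [
        sentence
        for sentence in sentences
        if any(keyword in sentence for keyword in keyword_list)
    ]


# Hypothetical document sentences and keywords.
doc_sentences = [
    "TextRank ranks words on a graph.",
    "The weather was pleasant.",
    "Semantic features are compared per sentence.",
]
fourth_set = build_fourth_set(doc_sentences, ["graph", "semantic", "Semantic"])
```

In this toy run the first and third sentences are target text sentences, and the second is a non-target text sentence.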
In one or more alternative embodiments of the present disclosure, keywords of the target document may be extracted by a term frequency-inverse text frequency index (TF-IDF, term frequency-inverse document frequency) extraction algorithm. The keywords of the target document can also be extracted by a graph-based sorting algorithm, namely, according to the text content of the target document, the keywords of the target document are extracted by using the text content-based extraction algorithm, and the specific implementation process can be as follows:
word segmentation and word removal stopping processing are carried out on the text content of the target document, so that a plurality of candidate words are obtained;
According to a preset sliding window, constructing a word graph by taking each candidate word as a node and the co-occurrence relation among each candidate word as an edge;
According to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached, and obtaining a first target weight corresponding to each candidate word;
Keywords of the target document are determined from the candidate words based on the first target weight.
The word segmentation method is characterized by comprising the steps of word segmentation, word candidate, word graph, a first initial weight and a first target weight, wherein the word segmentation is used for carrying out word segmentation and word removal on the text content to obtain all words, the co-occurrence relationship is a co-occurrence relationship, the word graph is a graph formed by words, namely the candidate words are used as nodes, the first initial weight is a weight of the candidate words determined based on the word graph, and the first target weight is a first initial weight tending to be stable or convergent.
In practical application, the word segmentation processing can be directly carried out on the text content of the target document to obtain a plurality of words, or the text content can be segmented according to the whole sentence to obtain a plurality of sentences, and then the word segmentation is carried out on each sentence to obtain a plurality of words.
Further, stop-word removal is performed on the obtained words, that is, the stop words among them are removed to obtain a plurality of candidate words. For example, part-of-speech tagging is performed on each word to determine its part of speech, and the words that are function words, namely the stop words, are then deleted according to their part of speech, so that a plurality of candidate words are obtained.
A word graph is then constructed from the candidate words: each candidate word is taken as a node, and the edges between nodes are built from the co-occurrence relationship. An edge exists between two nodes only when the corresponding candidate words co-occur within a preset sliding window of length K, where K represents the size of the preset sliding window, that is, at most K candidate words co-occur, and K is a positive integer, such as K=2.
Then, according to the connection relationship between the nodes in the word graph and a preset weight calculation formula (formula 1), the first initial weight corresponding to each candidate word is iteratively calculated until a first preset convergence condition is reached, for example until the first initial weights tend to be stable. The stabilized first initial weight of a candidate word is determined as the first target weight corresponding to that candidate word.
In formula 1, V_i represents the i-th candidate word, V_j represents the j-th candidate word, S(V_i) represents the first initial weight of the i-th candidate word, S(V_j) represents the first initial weight of the j-th candidate word, d represents a damping coefficient such as 0.85, In(V_i) represents the set of candidate words pointing to the i-th candidate word, Out(V_j) represents the set of candidate words pointed to by the j-th candidate word, and |Out(V_j)| represents the number of elements in Out(V_j).
On the basis of determining each first target weight, the candidate words whose first target weights are larger than a first weight threshold may be determined as the keywords of the target document. Alternatively, the candidate words may be arranged in descending order of first target weight, and the top N candidate words may be determined as the keywords of the target document, where N is a preset positive integer. In this way, the efficiency and accuracy of keyword extraction can be improved.
For example, text content can be segmented according to a whole sentence through a TextRank algorithm to obtain a plurality of sentences, then each sentence is segmented to obtain a plurality of words, then each word is labeled with a part of speech to determine the part of speech of each word, and then the stop word is deleted according to the part of speech of each word to obtain a plurality of candidate words. And then, constructing a word graph by taking each candidate word as a node and the co-occurrence relation among the candidate words as an edge based on the preset sliding window size of 2. And then calculating a first initial weight of each candidate word through a formula 1 until a first preset convergence condition is reached, and obtaining a first target weight corresponding to each candidate word. And finally, determining 5 candidate words with the largest first target weight as keywords of the target document.
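The example above can be sketched in a few lines of Python. Formula 1 itself is not reproduced in this text, so the sketch assumes the standard TextRank update, S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|, which is consistent with the symbol definitions given for formula 1. It also assumes an undirected co-occurrence graph (so In and Out are both the neighbor set), an initial weight of 1.0 and a maximum-change convergence test. The function names and the tolerance are hypothetical.

```python
from collections import defaultdict


def build_word_graph(candidates, window=2):
    """Connect candidate words that co-occur within a sliding window of size `window`."""
    graph = defaultdict(set)
    for i, word in enumerate(candidates):
        # With window=2 each word is linked only to the next candidate word.
        for other in candidates[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the assumed formula-1 update until the largest change is below `tol`."""
    scores = {w: 1.0 for w in graph}
    for _ in range(max_iter):
        new_scores = {
            w: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
        delta = max(abs(new_scores[w] - scores[w]) for w in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


def top_keywords(candidates, n=5, window=2):
    """Return the n candidate words with the largest converged weight."""
    scores = textrank(build_word_graph(candidates, window))
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Hypothetical candidate words, already segmented and stop-word filtered.
candidates = ["graph", "ranking", "graph", "keyword", "graph", "sentence"]
best = top_keywords(candidates, n=1)
```

In this toy input "graph" co-occurs with every other candidate word, so it ends up with the largest weight. The threshold variant described above would instead keep every word whose weight exceeds the first weight threshold.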
In one or more optional embodiments of the present disclosure, the key sentences of the target document may be obtained according to the semantic relevance between each sentence and the title of the target document. The key sentences of the target document may also be extracted by a graph-based ranking algorithm to obtain the third key sentence set. That is, extracting the third key sentence set from the text content of the target document by using a text-content-based extraction algorithm may be implemented as follows:
Sentence dividing processing is carried out on the text content of the target document, so that a plurality of candidate sentences are obtained;
Constructing a sentence graph by taking each candidate sentence as a node and the sentence similarity among the candidate sentences as an edge;
According to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, and obtaining a second target weight corresponding to each candidate sentence;
Determining a third key sentence set of the target document from the candidate sentences based on the second target weights.
Specifically, sentence division refers to splitting the text content into sentences; the candidate sentences are all the sentences obtained after sentence division is performed on the text content; the sentence similarity refers to the semantic similarity between sentences; the sentence graph is a graph whose nodes are sentences, namely the candidate sentences; the second initial weight is the weight of a candidate sentence determined based on the sentence graph; and the second target weight is the second initial weight once it has stabilized or converged.
In practical application, the text content is first segmented by whole sentences, that is, sentence division is performed, so that a plurality of candidate sentences are obtained. A sentence graph is then constructed from the candidate sentences: each candidate sentence is taken as a node, and the edges between nodes are built from the sentence similarity between candidate sentences. Then, according to the connection relationship between the nodes in the sentence graph and a preset weight calculation formula (formula 2), the second initial weight corresponding to each candidate sentence is iteratively calculated until a second preset convergence condition is reached, for example until the second initial weights tend to be stable. The stabilized second initial weight of a candidate sentence is determined as the second target weight corresponding to that candidate sentence. Further, the candidate sentences whose second target weights are greater than a second weight threshold may be determined as key sentences of the target document. Alternatively, the candidate sentences may be arranged in descending order of second target weight, and the top M candidate sentences may be determined as key sentences of the target document, where M is a preset positive integer. All key sentences obtained form the third key sentence set of the target document.
Thus, the efficiency and accuracy of extracting the third key sentence set can be improved.
In formula 2, V_i represents the i-th candidate sentence, V_j represents the j-th candidate sentence, WS(V_i) represents the second initial weight of the i-th candidate sentence, WS(V_j) represents the second initial weight of the j-th candidate sentence from the previous iteration, d represents a damping coefficient such as 0.85, In(V_i) represents the set of candidate sentences pointing to the i-th candidate sentence, Out(V_j) represents the set of candidate sentences pointed to by the j-th candidate sentence, W_ji represents the sentence similarity between the i-th candidate sentence and the j-th candidate sentence, and W_jk represents the sentence similarity between the k-th candidate sentence and the j-th candidate sentence.
For example, the text content may first be divided by whole sentences through a TextRank algorithm to obtain a plurality of candidate sentences. Then, a sentence graph is constructed by taking each candidate sentence as a node and the sentence similarity between candidate sentences as an edge. The second initial weight of each candidate sentence is then calculated through formula 2 until the second preset convergence condition is reached, so as to obtain the second target weight corresponding to each candidate sentence. Finally, the 3 candidate sentences with the largest second target weights form the third key sentence set of the target document.
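A compact sketch of this sentence-level ranking is given below. Formula 2 is not reproduced in this text, so the sketch assumes the standard weighted TextRank update, WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [W_ji / Σ_{V_k ∈ Out(V_j)} W_jk] × WS(V_j), which matches the symbol definitions given for formula 2. The specification does not fix how sentence similarity is computed; the word-overlap measure used here is only a stand-in, and all function names are hypothetical.

```python
import math


def overlap_similarity(a, b):
    """Stand-in sentence similarity: shared words normalized by log sentence lengths."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the assumed formula-2 update and return one weight per sentence."""
    n = len(sentences)
    w = [
        [0.0 if i == j else overlap_similarity(sentences[i], sentences[j]) for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in w]  # sum of W_jk over Out(V_j)
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = [
            (1 - d)
            + d * sum(w[j][i] / out_sum[j] * ws[j] for j in range(n) if w[j][i] > 0)
            for i in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new_ws, ws))
        ws = new_ws
        if delta < tol:
            break
    return ws


def third_key_sentence_set(sentences, m=3):
    """Keep the m highest-weighted sentences, returned in document order."""
    ws = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: -ws[i])[:m]
    return [sentences[i] for i in sorted(top)]


# Hypothetical candidate sentences for illustration.
sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
    "the cat sat near the fish",
]
key_sentences = third_key_sentence_set(sentences, m=3)
```

The third sentence shares no words with the others, so it has no edges, keeps only the base weight 1 - d, and is the one left out. Returning the selected sentences in document order is a design choice of this sketch, not a requirement of the method.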
In one or more optional embodiments of the present disclosure, the first key sentence set includes a third key sentence set and a fourth key sentence set. When the target key sentence set is determined according to the first key sentence set and the second key sentence set, the union of the second key sentence set, the third key sentence set and the fourth key sentence set may be taken to obtain the target key sentence set, that is, all key sentences contained in the second key sentence set, the third key sentence set and the fourth key sentence set are combined into one set to obtain the target key sentence set of the target document. Alternatively, the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set may be taken to obtain the target key sentence set.
When the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set is taken to obtain the target key sentence set, the key sentences contained in all of the second key sentence set, the third key sentence set and the fourth key sentence set may form the target key sentence set of the target document. Alternatively, the target confidence of each initial key sentence may be obtained from its initial confidences relative to the second key sentence set, the third key sentence set and the fourth key sentence set, and the target key sentence set may then be determined according to the target confidence. That is, taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set may be implemented as follows:
determining a first initial confidence of an initial key sentence relative to the second key sentence set, a second initial confidence relative to the third key sentence set and a third initial confidence relative to the fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
Determining target confidence degrees of the initial key sentences according to the first initial confidence degrees, the second initial confidence degrees and the third initial confidence degrees;
Determining target key sentences from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence level;
and constructing a target key sentence set based on the target key sentences.
Specifically, the confidence is also called the reliability, the confidence level or the confidence coefficient. The first initial confidence refers to the confidence of an initial key sentence relative to the second key sentence set, the second initial confidence refers to the confidence of an initial key sentence relative to the third key sentence set, and the third initial confidence refers to the confidence of an initial key sentence relative to the fourth key sentence set. The target confidence refers to the integrated confidence obtained by processing the initial confidences. A target key sentence is the minimum text unit forming the target key sentence set, and an initial key sentence is the minimum text unit forming the second key sentence set, the third key sentence set and the fourth key sentence set.
In practical applications, each initial key sentence has a corresponding initial confidence for each of the second key sentence set, the third key sentence set and the fourth key sentence set. When one of these key sentence sets does not contain a given initial key sentence, the initial confidence of the initial key sentence relative to that key sentence set is a first preset value; for example, if the second key sentence set does not contain initial key sentence A, the first initial confidence of initial key sentence A is 0. When one of these key sentence sets contains a given initial key sentence, the initial confidence of the initial key sentence relative to that key sentence set is a second preset value; for example, if the third key sentence set contains initial key sentence A, the second initial confidence of initial key sentence A is 1.
In this way, the first initial confidence, the second initial confidence and the third initial confidence of the initial key sentence relative to the second key sentence set, the third key sentence set and the fourth key sentence set, respectively, are obtained. The first initial confidence, the second initial confidence and the third initial confidence are then input into a preset calculation formula (formula 3) to obtain the target confidence of the initial key sentence. The key sentences in the second key sentence set, the third key sentence set and the fourth key sentence set are arranged in descending order of target confidence, and the top L key sentences are determined as target key sentences of the target document, where L is a preset positive integer. All target key sentences obtained form the target key sentence set of the target document. In this way, the accuracy and efficiency of determining the target key sentences can be improved.
y = a1×x1 + a2×x2 + a3×x3 (formula 3)
In the formula 3, y is a target confidence coefficient, x1, x2 and x3 are respectively a first initial confidence coefficient, a second initial confidence coefficient and a third initial confidence coefficient, and a1, a2 and a3 are respectively weights corresponding to the first initial confidence coefficient, the second initial confidence coefficient and the third initial confidence coefficient.
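Formula 3 with the 0/1 preset values described above can be sketched as follows. The weights a1, a2 and a3 and the sample sets are hypothetical, and the function names are introduced only for illustration.

```python
def target_confidences(second, third, fourth, weights=(1.0, 1.0, 1.0)):
    """Apply y = a1*x1 + a2*x2 + a3*x3 to every initial key sentence.

    x1, x2, x3 are 1 when the sentence is in the second, third or fourth
    key sentence set respectively, and 0 otherwise.
    """
    a1, a2, a3 = weights
    # dict.fromkeys de-duplicates while keeping first-seen order.
    initial_sentences = list(dict.fromkeys([*second, *third, *fourth]))
    return {
        s: a1 * (s in second) + a2 * (s in third) + a3 * (s in fourth)
        for s in initial_sentences
    }


def select_target_key_sentences(second, third, fourth, top_l, weights=(1.0, 1.0, 1.0)):
    """Return the top-L initial key sentences by target confidence."""
    conf = target_confidences(second, third, fourth, weights)
    ranked = sorted(conf, key=lambda s: -conf[s])  # stable: ties keep first-seen order
    return ranked[:top_l]


# Hypothetical key sentence sets, with sentences abbreviated to single letters.
second_set = ["A", "B"]
third_set = ["B", "C"]
fourth_set = ["B", "A", "D"]
weights = (0.5, 0.3, 0.2)
conf = target_confidences(second_set, third_set, fourth_set, weights)
target_set = select_target_key_sentences(second_set, third_set, fourth_set, 2, weights)
```

Sentence B appears in all three sets and therefore receives the highest target confidence; with L = 2 the target key sentence set is B followed by A.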
In addition, when one of the second key sentence set, the third key sentence set and the fourth key sentence set contains a given initial key sentence, the initial confidence of the initial key sentence relative to that key sentence set may instead be the weight assigned to the initial key sentence when that key sentence set was determined. For example, when the third key sentence set is obtained through the TextRank algorithm, the weight of each key sentence in the third key sentence set is obtained and used as the second initial confidence of that key sentence. As another example, when the keywords are obtained through the TextRank algorithm, the weight of each keyword is obtained; the fourth key sentence set is then constructed from the target text sentences containing keywords, and the weight of each target text sentence, namely each key sentence, in the fourth key sentence set is the sum of the weights of the keywords it contains. As a further example, the weight of a key sentence in the second key sentence set, which is determined according to the semantic relevance between the first semantic features and the second semantic features, is the semantic relevance corresponding to that key sentence.
In one or more alternative embodiments of the present disclosure, the keyword extraction method may be used for document recall, that is, the target document may be a query document, a candidate document, or both a query document and a candidate document. In the case that the target document includes a query document and a plurality of candidate documents, after determining the target key sentence set from the first key sentence set and the second key sentence set, further comprising:
and determining the text similarity of the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents.
Specifically, the query document refers to a document input by a user for retrieval, the candidate document refers to a document stored in a database, and the text similarity refers to the similarity of the text content of the query document and the candidate document.
In practical application, based on a key sentence extraction method, a target key sentence set of a query document and a target key sentence set of each candidate document are obtained, and text similarity between the target key sentence set of the query document and the target key sentence set of each candidate document is calculated, namely, the text similarity between the query document and each candidate document is determined. Thus, the accuracy and reliability of the obtained text similarity can be improved by calculating the text similarity between the documents based on the target key sentence sets.
The target key sentence sets of the query document and of the plurality of candidate documents can be converted into feature vectors according to a preset vector conversion algorithm. The similarity between the feature vector corresponding to the query document and the feature vector corresponding to each candidate document, namely the text similarity between the query document and the candidate document, is then calculated according to a preset similarity algorithm, such as the Euclidean distance algorithm, the Manhattan distance algorithm or the Minkowski distance algorithm.
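The three distance measures named above are members of one family, which the following sketch makes explicit: the Minkowski distance with p = 2 is the Euclidean distance and with p = 1 is the Manhattan distance. The mapping from distance to similarity, 1 / (1 + distance), is an assumption of this sketch; the specification does not state how a distance is turned into a similarity value.

```python
def minkowski_distance(u, v, p=2):
    """Minkowski distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def distance_to_similarity(distance):
    """Assumed mapping: identical vectors give 1.0, distant vectors approach 0."""
    return 1.0 / (1.0 + distance)


# Hypothetical feature vectors for a query document and a candidate document.
query_vec = [0.0, 0.0]
candidate_vec = [3.0, 4.0]
euclidean = minkowski_distance(query_vec, candidate_vec, p=2)
manhattan = minkowski_distance(query_vec, candidate_vec, p=1)
similarity = distance_to_similarity(euclidean)
```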
In one or more alternative embodiments of the present disclosure, after the text similarity between the query document and each candidate document is determined according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents, the similar documents of the query document may be recalled from the plurality of candidate documents according to the text similarities. For example, the candidate documents whose text similarity is greater than a text similarity threshold may be determined as similar documents of the query document. Alternatively, the candidate documents may be arranged in descending order of text similarity, and the top Q candidate documents may be determined as similar documents of the query document, where Q is a preset positive integer. The similar documents are then fed back. Determining and recalling similar documents based on the text similarity in this way can improve the efficiency and accuracy of recalling similar documents.
It should be noted that document recall returns, based on the user's query document, similar documents that have a high similarity to the query document. Because documents are relatively long, research on similarity calculation for long text is of great significance. Current long-text similarity calculation methods mainly include the following. Character-based similarity determination methods directly calculate the similarity of text characters, for example by using the edit distance, the Hamming distance or the Jaccard distance; these methods are coarse and simple, calculate similarity only at the character level, and ignore the semantic level. Traditional machine-learning similarity determination methods manually construct text feature vectors through the TF-IDF algorithm, the LSA algorithm, the LDA algorithm and the like, and then obtain the text similarity by calculating the cosine similarity, the Euclidean distance and the like; these methods require manually constructed features and do not make full use of the contextual semantic information of the text. Text-truncation-based deep learning similarity methods generally take the front part or the middle part of a document as the text and calculate the text similarity through a long short-term memory (LSTM) network model, a convolutional neural network (CNN), a BERT (Bidirectional Encoder Representations from Transformers) model or the like.
Compared with the character-based methods, the method provided in this specification takes semantics into account. Compared with the traditional machine-learning similarity determination methods, it does not require manually constructed features: the second key sentence set is determined through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, which makes full use of the contextual semantic information of the text. Compared with the text-truncation-based deep learning similarity methods, it extracts the key sentences of the document instead of truncating the text, so that key information of the text is not lost, which further improves the accuracy of recalling similar documents. Complex feature extraction, such as TF-IDF, LDA or LSA calculation, is avoided, vector features are constructed conveniently and quickly, and the accuracy of similar document recall is effectively improved.
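The two recall strategies described above, a similarity threshold and a top-Q cut, can be sketched together. The function name, the document identifiers and the similarity values are hypothetical.

```python
def recall_similar_documents(similarities, threshold=None, top_q=None):
    """Recall candidate documents by text similarity.

    `similarities` maps a candidate document id to its text similarity with
    the query document. Either filter may be used alone, or both together.
    """
    ranked = sorted(similarities.items(), key=lambda item: -item[1])
    if threshold is not None:
        ranked = [(doc, sim) for doc, sim in ranked if sim > threshold]
    if top_q is not None:
        ranked = ranked[:top_q]
    return [doc for doc, _ in ranked]


# Hypothetical similarities between one query document and four candidates.
similarities = {"doc_a": 0.91, "doc_b": 0.42, "doc_c": 0.77, "doc_d": 0.10}
by_threshold = recall_similar_documents(similarities, threshold=0.5)
by_top_q = recall_similar_documents(similarities, top_q=3)
```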
In one or more optional embodiments of the present disclosure, the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents may also be input into a pre-trained similarity analysis model to obtain the text similarity between the query document and each candidate document. That is, before determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents, the method further includes:
Obtaining a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample sentence set carrying a similarity label;
determining the text similarity of the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, including:
And inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into a similarity analysis model to obtain the text similarity of the query document and each candidate document.
Specifically, the similarity analysis model refers to a pre-trained neural network model, such as a neural network model or a probabilistic neural network model, and may also be a BERT model, a Transformer model, a Sentence-BERT model or the like. A sample sentence is a sentence used in training to obtain the similarity analysis model, a sample sentence pair is a set containing two sample sentences, a sample sentence set is a set containing a plurality of sample sentence pairs, and the similarity label is the real text similarity of the two sample sentences in a sample sentence pair.
In practical application, a similarity analysis model obtained by training based on a sample sentence set carrying similarity labels can be acquired. Then, on the basis of acquiring the target key sentence set of the query document and the target key sentence sets of the candidate documents, these sets are input into the similarity analysis model, which performs similarity calculation on the target key sentence set of the query document and the target key sentence set of each candidate document and outputs the text similarity between the query document and each candidate document. Calculating the text similarity between the query document and each candidate document through a pre-trained similarity analysis model, based on the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents, can improve the speed and accuracy of determining the text similarity.
In one or more optional embodiments of the present disclosure, the similarity analysis model includes a feature extraction layer and a pooling layer. In this case, inputting the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document may be implemented as follows:
for any candidate document, respectively inputting a target key sentence set of the query document and the target key sentence set of the candidate document into a feature extraction layer for feature extraction processing to obtain a query feature vector and a candidate feature vector;
Respectively inputting the query feature vector and the candidate feature vector into a pooling layer for pooling treatment to obtain a query embedded vector and a candidate embedded vector;
And determining the text similarity between the query document and the candidate document according to the query embedded vector and the candidate embedded vector.
Specifically, the feature extraction layer may be a neural network model, such as a BERT (Bidirectional Encoder Representations from Transformers) model. The pooling layer, also called a downsampling layer, compresses the input feature vectors, which on the one hand reduces the complexity of the subsequent similarity calculation and on the other hand preserves certain invariances of the feature vectors. The query feature vector refers to the hidden layer representation obtained by inputting the target key sentence set of the query document into the feature extraction layer, and the candidate feature vector refers to the hidden layer representation obtained by inputting the target key sentence set of the candidate document into the feature extraction layer. A hidden layer representation maps the features of the input target key sentence set into another dimensional space, which exposes more abstract features of the target key sentence set and makes them easier to separate linearly. Pooling is processing that removes noise information and retains key information. The query embedded vector refers to the vector representation obtained by pooling the query feature vector, and the candidate embedded vector refers to the vector representation obtained by pooling the candidate feature vector.
In practical application, referring to fig. 3A, fig. 3A shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method according to an embodiment of the present application, where the similarity analysis model includes a feature extraction layer and a pooling layer. On the basis of acquiring the target key sentence sets of the query document and the target key sentence sets of the candidate documents, the target key sentence sets of the query document and the target key sentence sets of any candidate document can be respectively input into the feature extraction layer, and the feature extraction layer respectively performs feature extraction processing on the target key sentence sets of the query document and the target key sentence sets of the candidate documents and then outputs query feature vectors and candidate feature vectors. And then, in order to reduce the data processing amount, the query feature vector and the candidate feature vector are respectively input into a pooling layer for pooling processing, and after the pooling is finished, the pooling layer outputs the query embedded vector and the candidate embedded vector. Further, the query embedded vector and the candidate embedded vector are compared, and the similarity of the query embedded vector and the candidate embedded vector, namely the text similarity between the query document and the candidate document, is calculated. In this way, the efficiency and accuracy of determining text similarity can be improved.
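The data flow through the two layers can be illustrated with a deliberately simplified sketch. The feature extraction layer in the specification may be a BERT model; here it is replaced by a hypothetical stand-in, `toy_token_features`, that derives a small deterministic vector per token from a hash, purely so that the example runs without any model. Mean pooling and cosine similarity are also assumptions of this sketch, since the specification does not fix the pooling operation or the comparison function. Inputs are assumed to be non-empty.

```python
import hashlib
import math


def toy_token_features(sentences, dim=8):
    """Stand-in for the feature extraction layer: one vector per token.

    NOT a real encoder; it only mimics the shape of a hidden layer output.
    """
    features = []
    for sentence in sentences:
        for token in sentence.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            features.append([byte / 255.0 - 0.5 for byte in digest[:dim]])
    return features


def mean_pool(token_vectors):
    """Pooling layer (assumed mean pooling): compress token vectors into one vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedded vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(query_key_sentences, candidate_key_sentences):
    """Feature extraction, then pooling, then comparison of the embedded vectors."""
    query_embedding = mean_pool(toy_token_features(query_key_sentences))
    candidate_embedding = mean_pool(toy_token_features(candidate_key_sentences))
    return cosine_similarity(query_embedding, candidate_embedding)


# Hypothetical target key sentence sets.
query_set = ["graph ranking keyword"]
candidate_set = ["sentence similarity model", "graph ranking"]
self_similarity = text_similarity(query_set, query_set)
cross_similarity = text_similarity(query_set, candidate_set)
```

Because both inputs pass through the same extraction and pooling functions, the sketch also mirrors the case where the two sub-layers share structure and parameters; the numeric values produced by the hash-based stand-in carry no semantic meaning.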
In one or more optional embodiments of the present disclosure, in order to improve the efficiency and precision of feature extraction by the feature extraction layer, two sub-feature extraction layers with the same structure, parameter types and number of parameters may be provided in the feature extraction layer. That is, the feature extraction layer includes a first sub-feature extraction layer and a second sub-feature extraction layer with the same structure, parameter types and number of parameters, so that one sub-feature extraction layer performs feature extraction on the target key sentence set of the query document while the other performs feature extraction on the target key sentence set of the candidate document. In this case, inputting the target key sentence set of the query document and the target key sentence set of the candidate document into the feature extraction layer for feature extraction processing to obtain the query feature vector and the candidate feature vector may be implemented as follows:
Inputting the target key sentence set of the query document into a first sub-feature extraction layer for feature extraction processing to obtain a query feature vector;
and inputting the target key sentence set of the candidate document into a second sub-feature extraction layer for feature extraction processing to obtain a candidate feature vector.
In practical application, referring to fig. 3A, fig. 3A shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method according to an embodiment of the present application, where a feature extraction layer includes two sub-feature extraction layers, namely a first sub-feature extraction layer and a second sub-feature extraction layer. When the target key sentence set of the query document and the target key sentence set of the candidate document are subjected to feature extraction, the target key sentence set of the query document is required to be input into a first sub-feature extraction layer, the first sub-feature extraction layer outputs a query feature vector corresponding to the target key sentence set of the query document after the target key sentence set of the query document is subjected to feature extraction, the target key sentence set of the candidate document is input into a second sub-feature extraction layer, and the second sub-feature extraction layer outputs a candidate feature vector corresponding to the target key sentence set of the candidate document after the target key sentence set of the candidate document is subjected to feature extraction.
In order to improve the efficiency and precision of the pooling performed by the pooling layer, and thereby the efficiency with which the similarity analysis model determines the text similarity, two sub-pooling layers with the same structure, parameter types and number of parameters may be provided in the pooling layer. That is, the pooling layer includes a first sub-pooling layer and a second sub-pooling layer with the same structure, parameter types and number of parameters, so that one sub-pooling layer performs pooling on the query feature vector while the other performs pooling on the candidate feature vector. In this case, inputting the query feature vector and the candidate feature vector into the pooling layer for pooling processing to obtain the query embedded vector and the candidate embedded vector may be implemented as follows:
Inputting the query feature vector into a first sub-pooling layer for pooling treatment to obtain a query embedded vector;
And inputting the candidate feature vectors into a second sub-pooling layer for pooling treatment to obtain candidate embedded vectors.
In practical application, referring to fig. 3A, which shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method according to an embodiment of the present application, the pooling layer includes two sub-pooling layers, namely a first sub-pooling layer and a second sub-pooling layer. When the query feature vector and the candidate feature vector are pooled, the query feature vector is input into the first sub-pooling layer, which pools it and outputs the corresponding query embedded vector; the candidate feature vector is input into the second sub-pooling layer, which pools it and outputs the corresponding candidate embedded vector.
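The two-branch pooling described above can be illustrated with a minimal sketch. The description does not say which pooling operation the sub-pooling layers use, so mean pooling is an assumption here, and the names `mean_pool` and `pool_pair` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(token_features: List[Vector]) -> Vector:
    """Average token-level feature vectors into one fixed-size embedded vector.

    Mean pooling is an assumed choice; the source only says "pooling".
    """
    if not token_features:
        raise ValueError("token_features must not be empty")
    dim = len(token_features[0])
    count = len(token_features)
    return [sum(tok[i] for tok in token_features) / count for i in range(dim)]


def pool_pair(query_features: List[Vector],
              candidate_features: List[Vector]) -> Tuple[Vector, Vector]:
    """Apply two identically structured sub-pooling layers, one per branch."""
    query_embedding = mean_pool(query_features)          # first sub-pooling layer
    candidate_embedding = mean_pool(candidate_features)  # second sub-pooling layer
    return query_embedding, candidate_embedding


# Toy feature vectors standing in for the feature extraction layer's output.
query_features = [[1.0, 2.0], [3.0, 4.0]]
candidate_features = [[0.0, 2.0], [2.0, 2.0], [4.0, 2.0]]
query_embedding, candidate_embedding = pool_pair(query_features, candidate_features)
```

Because both branches use the same pooling function, inputs of different lengths (two and three token vectors in the toy data) are reduced to embedded vectors of the same dimension, which is what makes them comparable downstream.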
Before the pre-trained similarity analysis model is obtained, the language characterization model needs to be trained to obtain the similarity analysis model. That is, before obtaining the pre-trained similarity analysis model, the method further comprises:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample sentence set pairs carrying similarity labels, and the sample sentence set pairs comprise a first sample sentence set and a second sample sentence set;
Extracting any sample sentence set pair from the sample set, and inputting a first sample sentence set and a second sample sentence set in the sample sentence set pair into a language characterization model to obtain the prediction similarity of the first sample sentence set and the second sample sentence set;
determining a loss value according to the predicted similarity and the similarity label carried by the sample sentence set pair;
And according to the loss value, adjusting model parameters of the language characterization model, continuously executing the step of extracting any sample sentence set pair from the sample set, and determining the trained language characterization model as a similarity analysis model under the condition that a first preset training stop condition is reached.
Specifically, the language characterization model refers to a pre-designated, pre-trained neural network model, such as a RoBERTa model; the first sample sentence set and the second sample sentence set are the two sample sentence sets contained in a sample sentence set pair; the predicted similarity refers to the similarity between the first sample sentence set and the second sample sentence set as determined by the language characterization model; and the first preset training stop condition may be that the loss value is less than or equal to a preset threshold, that the number of training iterations reaches a preset iteration value, or that the loss value converges, i.e. no longer decreases as training continues.
In practical applications, there are various ways to obtain the language characterization model and the sample set, for example, the operator may send a training instruction of the language characterization model to the execution body, or send an obtaining instruction of the language characterization model and the sample set, and correspondingly, the execution body starts to obtain the language characterization model and the sample set after receiving the instruction, or the server may automatically obtain the language characterization model and the sample set every preset time period, for example, after the preset time period, the server with the model training function automatically obtains the language characterization model and the sample set in the designated access area, or after the preset time period, the terminal with the model training function automatically obtains the language characterization model and the sample set stored locally. The present description does not set any limitation on the manner in which the language characterization model and the sample set are obtained.
After the language characterization model and the sample set are obtained, a sample sentence set pair is extracted from the sample set, and the first sample sentence set and the second sample sentence set contained in the pair are input into the language characterization model, which performs similarity calculation on them and outputs their predicted similarity. A loss value is then determined from the predicted similarity and the similarity label carried by the sample sentence set pair according to a preset first loss function. If the first preset training stop condition has not been reached, the model parameters of the language characterization model are adjusted according to the loss value and another sample sentence set pair is extracted from the sample set for the next round of training; if the first preset training stop condition has been reached, the trained language characterization model is determined as the similarity analysis model. Training the language characterization model on a plurality of sample sentence set pairs in this way can improve the accuracy and speed with which the similarity analysis model determines text similarity, and improves the robustness of the similarity analysis model.
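The training loop and its three stop conditions can be sketched as follows. This is only an illustration under strong assumptions: a single trainable weight applied to a precomputed overlap feature stands in for the language characterization model, squared error stands in for the first loss function, and samples are visited cyclically. None of these choices, nor any of the names, come from the source.

```python
from itertools import cycle
from typing import List, Tuple

# (overlap_feature, similarity_label): the feature is a hypothetical stand-in
# for what a real language characterization model would compute from a pair.
Sample = Tuple[float, float]


def train_similarity_model(
    samples: List[Sample],
    learning_rate: float = 0.2,
    loss_threshold: float = 1e-6,
    max_iterations: int = 10_000,
    convergence_tol: float = 1e-12,
    patience: int = 5,
) -> Tuple[float, int, str]:
    """Train a one-weight toy model; return (weight, iterations, stop reason)."""
    if not samples:
        raise ValueError("samples must not be empty")
    weight = 0.0
    previous_loss = None
    stale_steps = 0
    stream = cycle(samples)
    for iteration in range(1, max_iterations + 1):
        feature, label = next(stream)        # extract a sample pair
        predicted = weight * feature         # predicted similarity
        loss = (predicted - label) ** 2      # assumed loss: squared error
        # Stop condition 1: loss value at or below the preset threshold.
        if loss <= loss_threshold:
            return weight, iteration, "loss_below_threshold"
        # Stop condition 3: loss value no longer changes between steps.
        if previous_loss is not None and abs(previous_loss - loss) < convergence_tol:
            stale_steps += 1
            if stale_steps >= patience:
                return weight, iteration, "loss_converged"
        else:
            stale_steps = 0
        previous_loss = loss
        # Adjust the model parameter along the negative gradient of the loss.
        weight -= learning_rate * 2.0 * (predicted - label) * feature
    # Stop condition 2: the preset number of iterations has been reached.
    return weight, max_iterations, "iteration_limit"


toy_samples = [(0.5, 0.4), (0.9, 0.72), (0.2, 0.16)]  # labels equal 0.8 * feature
weight, iterations, reason = train_similarity_model(toy_samples)
```

In the toy data every label is 0.8 times its feature, so the weight approaches 0.8 and the loop ends on the loss threshold well before the iteration limit.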
In one or more optional embodiments of the present disclosure, after extracting the first semantic feature of the keyword and the second semantic feature of each text sentence in the target document, the second set of key sentences may be determined according to the semantic association degree of the first semantic feature and each second semantic feature. That is, according to the first semantic features and each second semantic feature, the second key sentence set is determined, and the specific implementation process may be as follows:
Determining the semantic association degree of the first semantic features and each second semantic feature;
and determining a second key sentence set from each text sentence according to the semantic association degree.
Specifically, the semantic association degree refers to the similarity between the first semantic feature and the second semantic feature.
In practical applications, the similarity between the first semantic feature and each second semantic feature, that is, the semantic association degree, may be calculated according to a preset similarity algorithm, such as a Euclidean distance algorithm, a Manhattan distance algorithm, or a Minkowski distance algorithm. The text sentences are then sorted in descending order of semantic association degree, and the top P text sentences are added to the second key sentence set, where P is a preset positive integer. In this way, the completeness of the second key sentence set can be improved, which in turn improves the efficiency of determining the target key sentence set.
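As a sketch of this step, the three distance measures can be written as one Minkowski function, followed by top-P selection. The source does not say how a distance is turned into an association degree; the mapping `1 / (1 + distance)` used below is an assumption, as are all function names.

```python
from typing import List, Sequence


def minkowski_distance(a: Sequence[float], b: Sequence[float], order: float) -> float:
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(x - y) ** order for x, y in zip(a, b)) ** (1.0 / order)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return minkowski_distance(a, b, 2.0)


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return minkowski_distance(a, b, 1.0)


def association_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed mapping from distance to similarity: closer vectors score higher."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def top_p_sentences(keyword_feature: Sequence[float],
                    sentence_features: List[Sequence[float]],
                    sentences: List[str],
                    top_count: int) -> List[str]:
    """Sort sentences by descending association degree and keep the first P."""
    scored = sorted(
        zip(sentences, sentence_features),
        key=lambda pair: association_degree(keyword_feature, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:top_count]]


# Toy semantic features; real ones would come from a feature extractor.
keyword_feature = [1.0, 0.0]
sentences = ["A", "B", "C"]
sentence_features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
second_key_sentence_set = top_p_sentences(keyword_feature, sentence_features, sentences, 2)
```

Sentence "B" lies farthest from the keyword feature, so with P = 2 the sketch keeps "A" and "C".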
In one or more optional embodiments of the present disclosure, the keyword and each text sentence in the target document may also be input to a pre-trained association analysis model, to obtain a semantic association between the first semantic feature and each second semantic feature. That is, before extracting the first semantic feature of the keyword and the second semantic feature of each text sentence in the target document, the method further includes:
obtaining a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction sub-model and a relevance calculation sub-model;
extracting a first semantic feature of a keyword and a second semantic feature of each text sentence in a target document, wherein the extracting comprises the following steps:
inputting the keywords and each text sentence in the target document into a feature extraction sub-model to obtain a first semantic feature of the keywords and a second semantic feature of each text sentence;
determining the semantic association degree of the first semantic features and the second semantic features comprises the following steps:
Inputting the first semantic features and the second semantic features into a relevancy calculation sub-model to obtain the semantic relevancy of the first semantic features and the second semantic features.
Specifically, the relevance analysis model refers to a pre-trained neural network model, for example a probabilistic neural network model, a BERT model, a Transformer model, a Sentence-BERT model, or the like; the feature extraction sub-model refers to the part of the relevance analysis model that extracts features of keywords or text sentences; and the relevance calculation sub-model refers to the part of the relevance analysis model that calculates semantic relevance.
In practical application, a relevance analysis model including a feature extraction sub-model and a relevance calculation sub-model may be obtained first. Then, on the basis of obtaining the keywords and each text sentence in the target document, further inputting the keywords and each text sentence into a feature extraction sub-model, extracting features of the keywords and each text sentence by the feature extraction sub-model to obtain first semantic features of the keywords and second semantic features of each text sentence, inputting the first semantic features and each second semantic feature into a relevance computation sub-model, and performing relevance computation on the first semantic features and each second semantic feature by the relevance computation sub-model to output the semantic relevance of the first semantic features and each second semantic feature, namely the semantic relevance of the keywords and each text sentence. Through a pre-trained association analysis model, the semantic association degree of the keywords and each text sentence is obtained based on the keywords and each text sentence in the target document, and the speed and accuracy of determining the semantic association degree can be improved.
Before the pre-trained association analysis model is obtained, the neural network model needs to be trained to obtain the association analysis model. That is, before the pre-trained association analysis model is obtained, the method further comprises:
Acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction sub-model and a relevance calculation sub-model, the training set comprises a plurality of sample pairs carrying relevance labels, and the sample pairs comprise sample words and sample sentences;
Any sample pair is extracted from the training set, and sample words and sample sentences in the sample pair are input into a feature extraction sub-model to obtain first predicted features of the sample words and second predicted features of the sample sentences;
inputting the first predicted feature and the second predicted feature into a relevance computation sub-model to obtain the predicted relevance of the first predicted feature and the second predicted feature;
determining a difference value according to the predicted association degree and the association degree label carried by the sample pair;
And according to the difference value, adjusting model parameters of the feature extraction sub-model and the association degree calculation sub-model, continuously executing the step of extracting any sample pair from the training set, and determining the trained neural network model as an association degree analysis model under the condition that a second preset training stop condition is reached.
Specifically, the neural network model refers to a mathematical model composed of neurons, such as a BERT model; a sample sentence refers to a sentence used in training to obtain the relevance analysis model; a sample word refers to a word used in training to obtain the relevance analysis model; a sample pair refers to a set comprising one sample sentence and one sample word; the training set refers to a set comprising a plurality of sample pairs; the relevance label refers to the true relevance between the sample word and the sample sentence in a sample pair; the first predicted feature refers to the semantic feature of the sample word determined by the feature extraction sub-model; the second predicted feature refers to the semantic feature of the sample sentence determined by the feature extraction sub-model; the predicted relevance refers to the relevance between the first predicted feature and the second predicted feature determined by the relevance calculation sub-model; and the second preset training stop condition may be that the loss value (that is, the difference value) is less than or equal to a preset threshold, that the number of training iterations reaches a preset iteration value, or that the loss value converges, i.e. no longer decreases as training continues.
In practical applications, there are various ways to obtain the neural network model and the training set, for example, an operator may send a training instruction of the neural network model to an execution subject, or send an obtaining instruction of the neural network model and the training set, and correspondingly, the execution subject starts to obtain the neural network model and the training set after receiving the instruction, or a server may automatically obtain the neural network model and the training set every preset time period, for example, after the preset time period, a server with a model training function automatically obtains the neural network model and the training set in a designated access area, or after the preset time period, a terminal with a model training function automatically obtains the neural network model and the training set stored locally. The present description does not set any limitation on the manner in which the neural network model and training set are obtained.
After the neural network model and the training set are obtained, a sample pair is extracted from the training set, and the sample word and the sample sentence contained in the sample pair are input into the feature extraction sub-model, which performs feature extraction on them to determine the first predicted feature of the sample word and the second predicted feature of the sample sentence. The first predicted feature and the second predicted feature are then input into the relevance calculation sub-model, which performs relevance calculation on them and outputs their predicted relevance. A difference value is then determined from the predicted relevance and the relevance label carried by the sample pair according to a preset second loss function. If the second preset training stop condition has not been reached, the model parameters of the neural network model are adjusted according to the difference value and another sample pair is extracted from the training set for the next round of training; if the second preset training stop condition has been reached, the trained neural network model is determined as the relevance analysis model. Training the neural network model on a plurality of sample pairs in this way can improve the accuracy and speed with which the relevance analysis model determines semantic relevance, and improves the robustness of the relevance analysis model.
Referring to fig. 3B, which shows a process flow diagram for determining text similarity in a keyword extraction method according to an embodiment of the present application, the process is described below taking a query document and a candidate document as an example:
S1, obtaining keywords and a first key sentence set for each of the query document and the candidate document, wherein the first key sentence set comprises a third key sentence set and a fourth key sentence set.
S1-1, obtaining keywords and a third key sentence set for the query document and the candidate document, namely extracting keywords and key sentences from the query document and from the candidate document with the TextRank method to generate the keywords and the third key sentence set of each document;
S1-2, obtaining a fourth key sentence set for the query document and the candidate document, namely, according to the keywords generated in step S1-1, searching the query document and the candidate document respectively for target text sentences containing the corresponding keywords, and determining the fourth key sentence set of the query document and the fourth key sentence set of the candidate document.
S2, inputting the keywords of the query document and each text sentence of the query document into a relevance analysis model, and taking a first preset number of text sentences having the highest semantic relevance to the keywords of the query document as the second key sentence set of the query document; the second key sentence set of the candidate document is obtained in the same way.
S3, generating target key sentence sets, namely taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set of the query document to obtain the target key sentence set of the query document; the target key sentence set of the candidate document is obtained in the same way.
S4, determining the text similarity, namely inputting the target key sentence set of the query document and the target key sentence set of the candidate document into a pre-trained similarity analysis model to obtain the text similarity of the query document and the candidate document.
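Steps S1-2 and S3 are simple set operations and can be sketched directly. In the sketch below, the second and third key sentence sets are given as toy inputs (in the flow above they would come from the relevance analysis model and from TextRank), keyword matching is done by plain substring search, and all function names are hypothetical.

```python
from typing import List


def sentences_containing_keywords(sentences: List[str], keywords: List[str]) -> List[str]:
    """S1-2: keep every text sentence that contains at least one keyword."""
    return [s for s in sentences if any(k in s for k in keywords)]


def intersect_key_sentence_sets(second: List[str], third: List[str],
                                fourth: List[str], sentence_order: List[str]) -> List[str]:
    """S3: intersect the three key sentence sets, preserving document order."""
    common = set(second) & set(third) & set(fourth)
    return [s for s in sentence_order if s in common]


# Toy document for illustration only.
sentences = [
    "graph ranking finds key sentences",
    "the weather is nice",
    "semantic features refine ranking",
]
keywords = ["ranking", "semantic"]

fourth_set = sentences_containing_keywords(sentences, keywords)
third_set = [sentences[0], sentences[1]]   # assumed TextRank output
second_set = [sentences[0], sentences[2]]  # assumed relevance-model output
target_set = intersect_key_sentence_sets(second_set, third_set, fourth_set, sentences)
```

Only the first sentence appears in all three sets, so it alone forms the target key sentence set in this example.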
The keyword extraction method provided by the application comprises: obtaining a target document, and extracting keywords and a first key sentence set based on the text content of the target document; extracting a first semantic feature of the keywords and a second semantic feature of each text sentence in the target document, and determining a second key sentence set according to the first semantic feature and each second semantic feature; and determining a target key sentence set according to the first key sentence set and the second key sentence set. Because the first key sentence set is determined from the text content of the target document, the key sentences in the first key sentence set carry text-level information. Because the second key sentence set is determined from the first semantic feature of the keywords and the second semantic features of the text sentences in the target document, key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set carry semantic-level information. Determining the target key sentence set from the first key sentence set and the second key sentence set therefore yields key sentences that contain both text-level and semantic-level information, which improves the accuracy of key sentence determination. In addition, the method realizes automatic key sentence extraction, which improves key sentence extraction efficiency and reduces key sentence extraction cost while ensuring the accuracy of the key sentences.
The following is a description of the method for extracting a key sentence, taking the document recall application as an example, with reference to fig. 4. Fig. 4 shows a process flow chart of a keyword extraction method applied to document recall according to an embodiment of the present application, which specifically includes the following steps:
Step 402, obtaining a query document and a plurality of candidate documents.
Step 404, for any document among the query document and the candidate documents, extracting keywords and a third key sentence set of the document from the text content of the document by using a text-content-based extraction algorithm.
Optionally, extracting the keywords of the document according to the text content of the document by using a text content-based extraction algorithm, including:
Performing word segmentation and stop-word removal on the text content of the document to obtain a plurality of candidate words;
According to a preset sliding window, constructing a word graph by taking each candidate word as a node and the co-occurrence relation among each candidate word as an edge;
According to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached, and obtaining a first target weight corresponding to each candidate word;
Keywords for the document are determined from the candidate words based on the first target weights.
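The four keyword steps above follow the usual TextRank pattern and can be sketched as below. Several details are assumptions rather than the source's choices: whitespace splitting stands in for word segmentation (Chinese text would need a real segmenter), the stop-word list is a tiny hypothetical one, the damping factor is the conventional 0.85, and convergence is tested on the largest weight change.

```python
from typing import Dict, List, Set

STOP_WORDS = {"a", "an", "and", "the", "of", "uses"}  # hypothetical stop-word list


def candidate_words(text: str) -> List[str]:
    """Segment on whitespace (an assumption) and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_word_graph(words: List[str], window: int = 2) -> Dict[str, Set[str]]:
    """Nodes are candidate words; edges link words co-occurring in one window."""
    graph: Dict[str, Set[str]] = {}
    for i, word in enumerate(words):
        graph.setdefault(word, set())
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph.setdefault(words[j], set()).add(word)
    return graph


def rank_words(graph: Dict[str, Set[str]], damping: float = 0.85,
               tol: float = 1e-6, max_iters: int = 200) -> Dict[str, float]:
    """Iterate from initial weights of 1.0 until the weights stop changing."""
    weights = {w: 1.0 for w in graph}
    for _ in range(max_iters):
        updated = {
            w: (1 - damping)
            + damping * sum(weights[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
        delta = max((abs(updated[w] - weights[w]) for w in graph), default=0.0)
        weights = updated
        if delta < tol:  # assumed form of the first preset convergence condition
            break
    return weights


def extract_keywords(text: str, top_k: int = 3, window: int = 2) -> List[str]:
    weights = rank_words(build_word_graph(candidate_words(text), window))
    return sorted(weights, key=lambda w: (-weights[w], w))[:top_k]


keywords = extract_keywords("graph ranking uses graph nodes and graph edges", top_k=1)
```

In the toy sentence, "graph" co-occurs with every other candidate word, so it receives the largest target weight.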
Optionally, extracting the third key sentence set of the document according to the text content of the document by using a text content-based extraction algorithm, including:
Sentence dividing processing is carried out on the text content of the document to obtain a plurality of candidate sentences;
Constructing a sentence graph by taking each candidate sentence as a node and the sentence similarity among the candidate sentences as an edge;
According to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, and obtaining a second target weight corresponding to each candidate sentence;
a third set of key sentences of the document is determined from the candidate sentences based on the second target weights.
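The sentence-level steps can be sketched in the same way, with edges weighted by sentence similarity. The source does not fix the similarity measure, so Jaccard overlap of lowercase word sets is assumed here, along with punctuation-based sentence splitting and a damping factor of 0.85; the function names are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split on common Latin and CJK sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def jaccard(a: str, b: str) -> float:
    """Assumed sentence similarity: overlap of lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_sentences(sentences: List[str], damping: float = 0.85,
                   tol: float = 1e-6, max_iters: int = 100,
                   top_k: int = 2) -> List[str]:
    """Weighted TextRank over a sentence graph; returns the top_k sentences."""
    n = len(sentences)
    if n == 0:
        return []
    sim = [[0.0 if i == j else jaccard(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    edge_sum = [sum(row) for row in sim]  # total edge weight leaving each node
    scores = [1.0] * n
    for _ in range(max_iters):
        updated = [
            (1 - damping) + damping * sum(
                sim[j][i] / edge_sum[j] * scores[j]
                for j in range(n) if sim[j][i] > 0
            )
            for i in range(n)
        ]
        delta = max(abs(u - s) for u, s in zip(updated, scores))
        scores = updated
        if delta < tol:  # assumed form of the second preset convergence condition
            break
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return [sentences[i] for i in ranked[:top_k]]


text = "Cats chase mice. Dogs chase cats. The sky is blue."
third_key_sentence_set = rank_sentences(split_sentences(text), top_k=2)
```

The third toy sentence shares no words with the others, so it keeps only the base weight and is left out of the resulting set.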
Step 406, identifying target text sentences containing the keywords in the document according to the keywords, and constructing a fourth key sentence set of the document based on the target text sentences.
Step 408, obtaining a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction sub-model and a relevance calculation sub-model.
Optionally, before obtaining the pre-trained association analysis model, the method further includes:
Acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction sub-model and a relevance calculation sub-model, the training set comprises a plurality of sample pairs carrying relevance labels, and the sample pairs comprise sample words and sample sentences;
Any sample pair is extracted from the training set, and sample words and sample sentences in the sample pair are input into a feature extraction sub-model to obtain first predicted features of the sample words and second predicted features of the sample sentences;
inputting the first predicted feature and the second predicted feature into a relevance computation sub-model to obtain the predicted relevance of the first predicted feature and the second predicted feature;
determining a difference value according to the predicted association degree and the association degree label carried by the sample pair;
And according to the difference value, adjusting model parameters of the feature extraction sub-model and the association degree calculation sub-model, continuously executing the step of extracting any sample pair from the training set, and determining the trained neural network model as an association degree analysis model under the condition that a second preset training stop condition is reached.
Step 410, inputting the keywords and each text sentence in the document into the feature extraction sub-model to obtain a first semantic feature of the keywords and a second semantic feature of each text sentence.
Step 412, inputting the first semantic features and the second semantic features into the relevancy computation sub-model to obtain the semantic relevancy of the first semantic features and the second semantic features.
Step 414, determining a second key sentence set from the text sentences according to the semantic association degree.
Step 416, taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set of the document.
Optionally, taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set of the document comprises:
determining a first initial confidence degree of an initial key sentence relative to the second key sentence set, a second initial confidence degree relative to the third key sentence set, and a third initial confidence degree relative to the fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
Determining target confidence degrees of the initial key sentences according to the first initial confidence degrees, the second initial confidence degrees and the third initial confidence degrees;
Determining target key sentences from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence level;
And constructing a target key sentence set of the document based on the target key sentence.
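The source does not specify how the initial confidence degrees are computed or combined, so the following is only one possible reading. It assumes an initial confidence degree of 1.0 when a sentence belongs to a set and 0.0 otherwise, a weighted average as the target confidence degree, and a threshold for selection. The weights, the threshold and the function names are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def target_confidences(second: Set[str], third: Set[str], fourth: Set[str],
                       weights: Sequence[float] = (1.0, 1.0, 1.0)) -> Dict[str, float]:
    """Combine per-set membership confidences into one target confidence."""
    total_weight = sum(weights)
    result: Dict[str, float] = {}
    for sentence in second | third | fourth:  # every initial key sentence
        initial = (
            1.0 if sentence in second else 0.0,  # first initial confidence degree
            1.0 if sentence in third else 0.0,   # second initial confidence degree
            1.0 if sentence in fourth else 0.0,  # third initial confidence degree
        )
        result[sentence] = sum(w * c for w, c in zip(weights, initial)) / total_weight
    return result


def select_target_key_sentences(confidences: Dict[str, float],
                                threshold: float) -> List[str]:
    """Keep sentences whose target confidence reaches the threshold."""
    kept = [s for s, c in confidences.items() if c >= threshold]
    return sorted(kept, key=lambda s: (-confidences[s], s))


second_set = {"a", "b"}
third_set = {"a", "c"}
fourth_set = {"a", "b", "d"}
confidences = target_confidences(second_set, third_set, fourth_set)
strict_targets = select_target_key_sentences(confidences, threshold=1.0)
relaxed_targets = select_target_key_sentences(confidences, threshold=0.6)
```

Under these assumptions a threshold of 1.0 with equal weights reproduces the strict intersection, while a lower threshold also admits sentences found in only two of the three sets.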
Step 418, obtaining a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample sentence set carrying similarity labels.
Optionally, before obtaining the pre-trained similarity analysis model, the method further comprises:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample sentence set pairs carrying similarity labels, and the sample sentence set pairs comprise a first sample sentence set and a second sample sentence set;
Extracting any sample sentence set pair from the sample set, and inputting a first sample sentence set and a second sample sentence set in the sample sentence set pair into a language characterization model to obtain the prediction similarity of the first sample sentence set and the second sample sentence set;
determining a loss value according to the predicted similarity and the similarity label carried by the sample sentence set pair;
And according to the loss value, adjusting model parameters of the language characterization model, continuously executing the step of extracting any sample sentence set pair from the sample set, and determining the trained language characterization model as a similarity analysis model under the condition that a first preset training stop condition is reached.
Step 420, inputting the target key sentence set of the query document and the target key sentence set of each candidate document into the similarity analysis model to obtain the text similarity of the query document and each candidate document.
Step 422, recalling similar documents of the query document from the plurality of candidate documents according to the respective text similarities.
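Step 422 leaves the recall rule open. A minimal sketch, assuming the text similarities from step 420 are already available as a mapping from candidate identifier to score, could rank the candidates and keep the best ones above a floor. The `top_k` and `min_similarity` parameters and the document identifiers are hypothetical.

```python
from typing import Dict, List


def recall_similar_documents(similarities: Dict[str, float],
                             top_k: int = 2,
                             min_similarity: float = 0.5) -> List[str]:
    """Rank candidates by text similarity and recall the strongest matches."""
    ranked = sorted(similarities.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, score in ranked[:top_k] if score >= min_similarity]


# Toy similarity scores standing in for the similarity analysis model's output.
text_similarities = {"doc_a": 0.91, "doc_b": 0.42, "doc_c": 0.77}
recalled = recall_similar_documents(text_similarities)
```

Combining a rank cutoff with a similarity floor is a design choice: the cutoff bounds the result size, and the floor prevents weak matches from being recalled when few candidates are relevant.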
The key sentence extraction method provided by the application determines the first key sentence set from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information. It determines the second key sentence set from the first semantic feature of the keywords and the second semantic features of the text sentences in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set carry semantic-level information. The target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of key sentence determination. In addition, the method realizes automatic key sentence extraction, which improves key sentence extraction efficiency and reduces key sentence extraction cost while ensuring the accuracy of the key sentences.
In addition, compared with similarity determination methods based on character text, the method provided by the application determines the second key sentence set from the first semantic feature of the keywords and the second semantic features of the text sentences in the target document, so that semantic information is effectively used and the accuracy of text similarity determination is greatly improved. Compared with similarity determination methods based on traditional machine learning, the method does not require manually constructed features, and because the second key sentence set is determined from these semantic features, the contextual semantic information of the text is fully used. Compared with deep learning text similarity methods based on text truncation, the method extracts the key sentences of the document rather than truncating it, which reduces the loss of text information and improves the accuracy of recalling similar documents.
Corresponding to the method embodiment, the application also provides an embodiment of the keyword extraction device, and fig. 5 shows a schematic structural diagram of the keyword extraction device according to an embodiment of the application. As shown in fig. 5, the apparatus includes:
A first obtaining module 502 configured to obtain a target document, and extract keywords and a first key sentence set based on the text content of the target document;
A first determining module 504 configured to extract a first semantic feature of the keywords and a second semantic feature of each text sentence in the target document, and determine a second key sentence set according to the first semantic feature and each second semantic feature;
The second determining module 506 is configured to determine the target set of key sentences according to the first set of key sentences and the second set of key sentences.
Optionally, the first key sentence set includes a third key sentence set and a fourth key sentence set;
the first acquisition module 502 is further configured to:
extracting keywords and a third key sentence set of the target document from the text content of the target document by using a text-content-based extraction algorithm;
And identifying target text sentences containing the keywords in the target document according to the keywords, and constructing a fourth key sentence set of the target document based on the target text sentences.
Optionally, the first acquisition module 502 is further configured to:
performing word segmentation and stop-word removal on the text content of the target document to obtain a plurality of candidate words;
According to a preset sliding window, constructing a word graph by taking each candidate word as a node and the co-occurrence relation among each candidate word as an edge;
According to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached, and obtaining a first target weight corresponding to each candidate word;
Keywords of the target document are determined from the candidate words based on the first target weight.
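The word-graph steps above resemble a TextRank-style ranking, and can be sketched as follows. This is a sketch under assumptions: the input is already segmented and stop-word filtered, the update rule is the common damped PageRank form, and the damping factor, tolerance, window size and `top_k` are illustrative values not taken from the application.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, tol=1e-6, max_iter=200, top_k=3):
    """Rank candidate words on a co-occurrence graph and return the top ones."""
    # Words that co-occur inside the sliding window are joined by an undirected edge.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return [], {}

    weights = {word: 1.0 for word in neighbors}  # first initial weights
    for _ in range(max_iter):
        updated = {
            word: (1 - damping)
            + damping * sum(weights[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
        delta = max(abs(updated[w] - weights[w]) for w in neighbors)
        weights = updated
        if delta < tol:  # assumed form of the first preset convergence condition
            break

    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_k], weights  # keywords and first target weights
```

With `window=2` only adjacent candidate words are linked; a larger window links words that are further apart and produces a denser graph.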
Optionally, the first acquisition module 502 is further configured to:
Sentence dividing processing is carried out on the text content of the target document, so that a plurality of candidate sentences are obtained;
Constructing a sentence graph by taking each candidate sentence as a node and the sentence similarity among the candidate sentences as an edge;
According to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, and obtaining a second target weight corresponding to each candidate sentence;
a third set of key sentences of the target document is determined from the candidate sentences based on the second target weights.
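The sentence-graph steps above can be sketched in the same way, with edges weighted by sentence similarity. The Jaccard similarity over whitespace tokens, the damped update rule and all parameter values are assumptions for illustration; the application does not fix a particular similarity measure.

```python
def jaccard(tokens_a, tokens_b):
    """Token-set overlap, used here as a stand-in sentence similarity."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def textrank_sentences(sentences, damping=0.85, tol=1e-6, max_iter=200, top_k=2):
    """Rank candidate sentences on a similarity graph and return the top ones."""
    tokens = [s.lower().split() for s in sentences]
    n = len(sentences)
    # Edge weight between two different sentences is their similarity.
    sim = [
        [jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in sim]

    weights = [1.0] * n  # second initial weights
    for _ in range(max_iter):
        updated = [
            (1 - damping)
            + damping
            * sum(sim[j][i] / out_sum[j] * weights[j] for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
        delta = max((abs(a - b) for a, b in zip(updated, weights)), default=0.0)
        weights = updated
        if delta < tol:  # assumed form of the second preset convergence condition
            break

    order = sorted(range(n), key=lambda i: -weights[i])
    return [sentences[i] for i in order[:top_k]], weights
```

A sentence that shares no tokens with any other sentence receives no incoming weight and settles at the base value `1 - damping`, so it ranks last.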
Optionally, the second determining module 506 is further configured to:
taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set.
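As a minimal sketch of this intersection variant, the function below keeps only sentences present in all three sets. Returning them in document order and the name `intersect_key_sentence_sets` are assumptions for illustration.

```python
def intersect_key_sentence_sets(document_sentences, second_set, third_set, fourth_set):
    """Return the sentences found in all three key sentence sets, in document order."""
    common = set(second_set) & set(third_set) & set(fourth_set)
    return [sentence for sentence in document_sentences if sentence in common]
```

A strict intersection favors precision: a sentence must be selected at both the text level and the semantic level to survive.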
Optionally, the second determining module 506 is further configured to:
determining a first initial confidence of an initial key sentence relative to the second key sentence set, a second initial confidence relative to the third key sentence set and a third initial confidence relative to the fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence of the initial key sentence according to the first initial confidence, the second initial confidence and the third initial confidence;
determining target key sentences from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence;
and constructing a target key sentence set based on the target key sentences.
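The confidence-based variant can be sketched as below. The application does not state how the three initial confidences are combined, so the weighted average, the default weights and the threshold are all assumptions, as is treating a sentence absent from a set as having confidence 0 for that set.

```python
def fuse_by_confidence(sets, set_weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Select target key sentences by a combined confidence.

    `sets` holds three dicts (second, third and fourth key sentence sets),
    each mapping a sentence to its initial confidence in [0, 1].
    """
    # Collect every initial key sentence once, keeping first-seen order.
    candidates = []
    for key_set in sets:
        for sentence in key_set:
            if sentence not in candidates:
                candidates.append(sentence)

    total = sum(set_weights)
    target = {
        sentence: sum(w * s.get(sentence, 0.0) for w, s in zip(set_weights, sets)) / total
        for sentence in candidates
    }
    selected = [s for s in candidates if target[s] >= threshold]
    return selected, target
```

Compared with a strict intersection, a thresholded confidence lets a sentence that scores very highly in two sets be kept even if the third set missed it, depending on the chosen weights.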
Optionally, the target document comprises a query document and a plurality of candidate documents;
optionally, the apparatus further comprises a third determining module configured to:
and determining the text similarity of the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents.
Optionally, the apparatus further comprises a recall module configured to:
And recalling similar documents of the query document from the plurality of candidate documents according to the text similarity.
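The similarity and recall steps can be sketched as follows. The token-overlap score is only a stand-in for whatever similarity measure is used (the application's optional embodiment uses a trained similarity analysis model), and the threshold, `top_k` and document identifiers are hypothetical.

```python
def token_jaccard(key_set_a, key_set_b):
    """Stand-in similarity between two key sentence sets (token overlap)."""
    a = {t for sentence in key_set_a for t in sentence.lower().split()}
    b = {t for sentence in key_set_b for t in sentence.lower().split()}
    return len(a & b) / len(a | b) if a | b else 0.0


def recall_similar(query_keys, candidate_keys, similarity=token_jaccard,
                   threshold=0.3, top_k=5):
    """Score every candidate document against the query and recall the best ones.

    `candidate_keys` maps a document id to that document's target key sentence set.
    """
    scored = [(doc_id, similarity(query_keys, keys)) for doc_id, keys in candidate_keys.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    scored.sort(key=lambda item: -item[1])
    return scored[:top_k]
```

Passing a different callable as `similarity` is how a trained model's scoring function would be plugged in.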
Optionally, the apparatus further comprises a second acquisition module configured to:
obtaining a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training on sample sentence set pairs carrying similarity labels;
the third determination module is further configured to:
And inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into a similarity analysis model to obtain the text similarity of the query document and each candidate document.
Optionally, the apparatus further comprises a first training module configured to:
acquiring a preset language representation model and a sample set, wherein the sample set comprises a plurality of sample sentence set pairs carrying similarity labels, and each sample sentence set pair comprises a first sample sentence set and a second sample sentence set;
extracting any sample sentence set pair from the sample set, and inputting the first sample sentence set and the second sample sentence set of that pair into the language representation model to obtain the predicted similarity of the first sample sentence set and the second sample sentence set;
determining a loss value according to the predicted similarity and the similarity label carried by the sample sentence set pair;
adjusting model parameters of the language representation model according to the loss value, continuing to execute the step of extracting any sample sentence set pair from the sample set, and, when a first preset training stop condition is reached, determining the trained language representation model as the similarity analysis model.
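The training loop above can be illustrated with a deliberately tiny stand-in model. Everything concrete here is an assumption: a two-parameter logistic model over a token-overlap feature replaces the language representation model, the loss is binary cross-entropy, and the stop condition is a mean-loss target or a step limit. Only the loop structure (sample a pair, predict, compute loss, adjust parameters, check the stop condition) follows the text.

```python
import math
import random


def overlap_feature(first_set, second_set):
    """Token overlap between two sentence sets (stand-in input feature)."""
    a = {t for s in first_set for t in s.lower().split()}
    b = {t for s in second_set for t in s.lower().split()}
    return len(a & b) / len(a | b) if a | b else 0.0


def predict(params, first_set, second_set):
    """Predicted similarity in (0, 1) from the toy logistic model."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * overlap_feature(first_set, second_set) + b)))


def bce(pred, label, eps=1e-12):
    """Binary cross-entropy between a prediction and a 0/1 similarity label."""
    return -(label * math.log(pred + eps) + (1 - label) * math.log(1 - pred + eps))


def train_similarity_model(samples, lr=0.5, max_steps=2000, target_loss=0.05, seed=0):
    """Train on (first_set, second_set, label) triples and return the parameters."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(max_steps):
        first_set, second_set, label = rng.choice(samples)  # extract any sample pair
        pred = predict((w, b), first_set, second_set)
        loss = bce(pred, label)  # loss value for this pair (shown for clarity)
        grad = pred - label      # derivative of that loss with respect to the logit
        w -= lr * grad * overlap_feature(first_set, second_set)
        b -= lr * grad
        mean_loss = sum(bce(predict((w, b), f, s), y) for f, s, y in samples) / len(samples)
        if mean_loss < target_loss:  # assumed first preset training stop condition
            break
    return (w, b)
```

In practice the model would be a pre-trained neural encoder fine-tuned with an optimizer from a deep learning framework; the sketch keeps to the standard library so the loop itself is easy to follow.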
Optionally, the first determining module 504 is further configured to:
Determining the semantic association degree of the first semantic features and each second semantic feature;
and determining a second key sentence set from each text sentence according to the semantic association degree.
Optionally, the apparatus further comprises a third acquisition module configured to:
obtaining a pre-trained association degree analysis model, wherein the association degree analysis model comprises a feature extraction sub-model and an association degree calculation sub-model;
the first determining module 504 is further configured to:
inputting the keywords and each text sentence in the target document into the feature extraction sub-model to obtain the first semantic feature of the keywords and the second semantic feature of each text sentence;
inputting the first semantic feature and each second semantic feature into the association degree calculation sub-model to obtain the semantic association degree between the first semantic feature and each second semantic feature.
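The two-stage flow (feature extraction, then association degree calculation, then selection) can be sketched as below. Both stages are stand-ins chosen for illustration: bag-of-words counts replace the trained feature extraction sub-model and cosine similarity replaces the trained association degree calculation sub-model. Treating the keywords jointly as one feature and the threshold value are likewise assumptions.

```python
import math
from collections import Counter


def extract_features(text, vocabulary):
    """Stand-in feature extractor: token counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[token]) for token in vocabulary]


def association_degree(feature_a, feature_b):
    """Stand-in association calculator: cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(feature_a, feature_b))
    norm = math.sqrt(sum(a * a for a in feature_a)) * math.sqrt(sum(b * b for b in feature_b))
    return dot / norm if norm else 0.0


def second_key_sentence_set(keywords, sentences, threshold=0.3):
    """Select sentences whose association degree with the keywords meets the threshold."""
    vocabulary = sorted({t for text in list(sentences) + list(keywords) for t in text.lower().split()})
    keyword_feature = extract_features(" ".join(keywords), vocabulary)  # first semantic feature
    selected = []
    for sentence in sentences:
        sentence_feature = extract_features(sentence, vocabulary)  # second semantic feature
        degree = association_degree(keyword_feature, sentence_feature)
        if degree >= threshold:
            selected.append((sentence, degree))
    return selected
```

A real semantic feature would come from a learned encoder, so that a sentence can be associated with a keyword even when it shares no surface tokens with it; the count-based stand-in cannot capture that.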
Optionally, the apparatus further comprises a second training module configured to:
acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction sub-model and an association degree calculation sub-model, the training set comprises a plurality of sample pairs carrying association degree labels, and each sample pair comprises a sample word and a sample sentence;
extracting any sample pair from the training set, and inputting the sample word and the sample sentence of that pair into the feature extraction sub-model to obtain a first predicted feature of the sample word and a second predicted feature of the sample sentence;
inputting the first predicted feature and the second predicted feature into the association degree calculation sub-model to obtain the predicted association degree of the first predicted feature and the second predicted feature;
determining a difference value according to the predicted association degree and the association degree label carried by the sample pair;
adjusting model parameters of the feature extraction sub-model and the association degree calculation sub-model according to the difference value, continuing to execute the step of extracting any sample pair from the training set, and, when a second preset training stop condition is reached, determining the trained neural network model as the association degree analysis model.
The key sentence extraction device provided by the application determines the first key sentence set from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information. It determines the second key sentence set from the first semantic feature of the keywords and the second semantic feature of each text sentence in the target document, so key sentences can be determined more accurately at the semantic level, which ensures that the key sentences in the second key sentence set carry semantic-level information. It then determines the target key sentence set according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of key sentence determination. In addition, the key sentence extraction method provided by the application extracts key sentences automatically: it avoids manual extraction while ensuring the accuracy of the key sentences, which improves extraction efficiency and reduces extraction cost.
The above is a schematic scheme of the key sentence extraction device of this embodiment. It should be noted that the technical solution of the key sentence extraction device and the technical solution of the key sentence extraction method belong to the same concept; for details of the device that are not described here, refer to the description of the key sentence extraction method. Furthermore, the components in the device embodiments should be understood as the functional modules that must be established to implement the steps of the program flow or of the method, and they do not limit how functions are actually divided or separated. Device claims defined by such a set of functional modules should be understood as a functional module architecture that implements the solution mainly by means of the computer program described in the specification, and not as a physical device that implements the solution mainly by means of hardware.
Fig. 6 illustrates a block diagram of a computing device 600 provided in accordance with an embodiment of the present application. The components of computing device 600 include, but are not limited to, memory 610 and processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to hold data.
Computing device 600 also includes access device 640, access device 640 enabling computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 640 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, for example by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute computer-executable instructions that implement the key sentence extraction method.
The foregoing is a schematic illustration of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above key sentence extraction method belong to the same concept; for details of the computing device that are not described here, refer to the description of the key sentence extraction method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the key sentence extraction method.
The above is a schematic scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above key sentence extraction method belong to the same concept; for details of the storage medium that are not described here, refer to the description of the key sentence extraction method.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
An embodiment of the present application further provides a chip storing a computer program which, when executed by the chip, implements the steps of the key sentence extraction method.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present application.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are intended only to assist in the explanation of the application. Alternative embodiments are not intended to be exhaustive or to limit the application to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and the full scope and equivalents thereof.

Claims (15)

1. A key sentence extraction method, characterized by comprising:
acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document, wherein the first key sentence set refers to a key sentence set acquired at the character or text level;
extracting a first semantic feature of the keywords and a second semantic feature of each text sentence in the target document, and determining a second key sentence set according to the first semantic feature and each second semantic feature, wherein the second key sentence set refers to a key sentence set acquired at the semantic level, and determining the second key sentence set according to the first semantic feature and each second semantic feature comprises: determining the second key sentence set according to the semantic association degree between the first semantic feature and each second semantic feature;
determining a target key sentence set according to the first key sentence set and the second key sentence set, wherein determining the target key sentence set according to the first key sentence set and the second key sentence set comprises: taking the intersection of the first key sentence set and the second key sentence set to determine the target key sentence set.

2. The method according to claim 1, characterized in that the first key sentence set comprises a third key sentence set and a fourth key sentence set;
extracting keywords and a first key sentence set based on the text content of the target document comprises:
extracting the keywords and the third key sentence set of the target document according to the text content of the target document, using a text-content-based extraction algorithm;
identifying, according to the keywords, target text sentences in the target document that contain the keywords, and constructing the fourth key sentence set of the target document based on the target text sentences.

3. The method according to claim 2, characterized in that extracting the keywords of the target document according to the text content of the target document, using a text-content-based extraction algorithm, comprises:
performing word segmentation and stop-word removal on the text content of the target document to obtain a plurality of candidate words;
constructing a word graph according to a preset sliding window, with each candidate word as a node and the co-occurrence relations between candidate words as edges;
iteratively calculating, according to the word graph, a first initial weight corresponding to each candidate word until a first preset convergence condition is reached, to obtain a first target weight corresponding to each candidate word;
determining the keywords of the target document from the candidate words based on the first target weights.

4. The method according to claim 2, characterized in that extracting the third key sentence set according to the text content of the target document, using a text-content-based extraction algorithm, comprises:
performing sentence segmentation on the text content of the target document to obtain a plurality of candidate sentences;
constructing a sentence graph with each candidate sentence as a node and the sentence similarity between candidate sentences as edges;
iteratively calculating, according to the sentence graph, a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, to obtain a second target weight corresponding to each candidate sentence;
determining the third key sentence set of the target document from the candidate sentences based on the second target weights.

5. The method according to claim 2, characterized in that determining the target key sentence set according to the first key sentence set and the second key sentence set comprises:
taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set.

6. The method according to claim 5, characterized in that taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain the target key sentence set comprises:
determining a first initial confidence of an initial key sentence relative to the second key sentence set, a second initial confidence relative to the third key sentence set and a third initial confidence relative to the fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence of the initial key sentence according to the first initial confidence, the second initial confidence and the third initial confidence;
determining target key sentences from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence;
constructing the target key sentence set based on the target key sentences.

7. The method according to any one of claims 1 to 6, characterized in that the target document comprises a query document and a plurality of candidate documents;
after determining the target key sentence set according to the first key sentence set and the second key sentence set, the method further comprises:
determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents.

8. The method according to claim 7, characterized in that, after determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents, the method further comprises:
recalling similar documents of the query document from the plurality of candidate documents according to each text similarity.

9. The method according to claim 7, characterized in that, before determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents, the method further comprises:
obtaining a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training on sample sentence set pairs carrying similarity labels;
determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents comprises:
inputting the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document.

10. The method according to claim 9, characterized in that, before obtaining the pre-trained similarity analysis model, the method further comprises:
obtaining a preset language representation model and a sample set, wherein the sample set contains a plurality of sample sentence set pairs carrying similarity labels, and each sample sentence set pair comprises a first sample sentence set and a second sample sentence set;
extracting any sample sentence set pair from the sample set, and inputting the first sample sentence set and the second sample sentence set of that pair into the language representation model to obtain the predicted similarity of the first sample sentence set and the second sample sentence set;
determining a loss value according to the predicted similarity and the similarity label carried by that sample sentence set pair;
adjusting model parameters of the language representation model according to the loss value, continuing to execute the step of extracting any sample sentence set pair from the sample set, and, when a first preset training stop condition is reached, determining the trained language representation model as the similarity analysis model.

11. The method according to claim 1, characterized in that, before extracting the first semantic feature of the keywords and the second semantic feature of each text sentence in the target document, the method further comprises:
obtaining a pre-trained association degree analysis model, wherein the association degree analysis model comprises a feature extraction sub-model and an association degree calculation sub-model;
extracting the first semantic feature of the keywords and the second semantic feature of each text sentence in the target document comprises:
inputting the keywords and each text sentence in the target document into the feature extraction sub-model to obtain the first semantic feature of the keywords and the second semantic feature of each text sentence;
determining the semantic association degree between the first semantic feature and each second semantic feature comprises:
inputting the first semantic feature and each second semantic feature into the association degree calculation sub-model to obtain the semantic association degree between the first semantic feature and each second semantic feature.

12. The method according to claim 11, characterized in that, before obtaining the pre-trained association degree analysis model, the method further comprises:
obtaining a preset neural network model and a training set, wherein the neural network model comprises a feature extraction sub-model and an association degree calculation sub-model, the training set contains a plurality of sample pairs carrying association degree labels, and each sample pair comprises a sample word and a sample sentence;
extracting any sample pair from the training set, and inputting the sample word and the sample sentence of that pair into the feature extraction sub-model to obtain a first predicted feature of the sample word and a second predicted feature of the sample sentence;
inputting the first predicted feature and the second predicted feature into the association degree calculation sub-model to obtain the predicted association degree of the first predicted feature and the second predicted feature;
determining a difference value according to the predicted association degree and the association degree label carried by that sample pair;
adjusting model parameters of the feature extraction sub-model and the association degree calculation sub-model according to the difference value, continuing to execute the step of extracting any sample pair from the training set, and, when a second preset training stop condition is reached, determining the trained neural network model as the association degree analysis model.

13. A key sentence extraction device, characterized by comprising:
a first acquisition module configured to acquire a target document, and extract keywords and a first key sentence set based on the text content of the target document, wherein the first key sentence set refers to a key sentence set acquired at the character or text level;
a first determination module configured to extract a first semantic feature of the keywords and a second semantic feature of each text sentence in the target document, and determine a second key sentence set according to the first semantic feature and each second semantic feature, wherein the second key sentence set refers to a key sentence set acquired at the semantic level, and the first determination module is further configured to: determine the second key sentence set according to the semantic association degree between the first semantic feature and each second semantic feature;
a second determination module configured to determine a target key sentence set according to the first key sentence set and the second key sentence set, wherein the second determination module is further configured to: take the intersection of the first key sentence set and the second key sentence set to determine the target key sentence set.

14. A computing device, characterized by comprising:
a memory and a processor;
the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the steps of the key sentence extraction method according to any one of claims 1 to 12.

15. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the key sentence extraction method according to any one of claims 1 to 12.
CN202210412327.4A 2022-04-19 2022-04-19 Keyword extraction method and device Active CN114818727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210412327.4A CN114818727B (en) 2022-04-19 2022-04-19 Keyword extraction method and device

Publications (2)

Publication Number Publication Date
CN114818727A CN114818727A (en) 2022-07-29
CN114818727B true CN114818727B (en) 2025-01-17

Family

ID=82506319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210412327.4A Active CN114818727B (en) 2022-04-19 2022-04-19 Keyword extraction method and device

Country Status (1)

Country Link
CN (1) CN114818727B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455950B (en) * 2022-09-27 2023-06-16 中科雨辰科技有限公司 Acquiring text data processing system
CN116541522A (en) * 2023-04-20 2023-08-04 上海东普信息科技有限公司 Voice quality inspection method, device, equipment and storage medium
CN119169537B (en) * 2024-11-20 2025-03-07 浙江大华技术股份有限公司 Method and device for determining alarm message

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680505A (en) * 2020-04-21 2020-09-18 华东师范大学 A Markdown Feature Aware Unsupervised Keyword Extraction Method
CN112164391A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Statement processing method and device, electronic equipment and storage medium
CN113590768A (en) * 2020-04-30 2021-11-02 北京金山数字娱乐科技有限公司 Training method and device of text relevance model and question-answering method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779119B (en) * 2012-06-21 2015-08-26 盘古文化传播有限公司 A kind of method of extracting keywords and device
CN109582949B (en) * 2018-09-14 2022-11-22 创新先进技术有限公司 Event element extraction method and device, computing equipment and storage medium
CN111460131A (en) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, apparatus, device and computer-readable storage medium for extracting official document abstract
CN111460099B (en) * 2020-03-30 2023-04-07 招商局金融科技有限公司 Keyword extraction method, device and storage medium
CN112597776A (en) * 2021-03-08 2021-04-02 中译语通科技股份有限公司 Keyword extraction method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680505A (en) * 2020-04-21 2020-09-18 华东师范大学 A Markdown Feature Aware Unsupervised Keyword Extraction Method
CN113590768A (en) * 2020-04-30 2021-11-02 北京金山数字娱乐科技有限公司 Training method and device of text relevance model and question-answering method and device
CN112164391A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Statement processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114818727A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN110826328B (en) Keyword extraction method, device, storage medium and computer equipment
CN104376406B (en) A kind of enterprise innovation resource management and analysis method based on big data
Yu et al. Deep multimodal distance metric learning using click constraints for image ranking
CN111753167B (en) Search for processing methods, apparatus, computer equipment and media
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN114818727B (en) Keyword extraction method and device
CN107491547A (en) Searching method and device based on artificial intelligence
CN112507091A (en) Method, device, equipment and storage medium for retrieving information
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN114943236B (en) Keyword extraction method and device
KR102590388B1 (en) Apparatus and method for video content recommendation
CN112800226A (en) Method for obtaining text classification model, method, apparatus and device for text classification
CN114003706B (en) Keyword combination generation model training method and device
CN120234386A (en) A retrieval joint optimization method for retrieval enhancement generation system
Mishra et al. Attention free BIGBIRD transformer for long document text summarization
Hassan et al. Evaluating of efficacy semantic similarity methods for comparison of academic thesis and dissertation texts
Tian et al. Automatic image annotation with real-world community contributed data set
Hariprasath et al. A study on word embeddings in local LLM-based chatbot applications
Ihou et al. A smoothed latent generalized dirichlet allocation model in the collapsed space
CN112749251B (en) Text processing method, device, computer equipment and storage medium
Zamani Neural models for information retrieval without labeled data
Kapoor Classification & Clustering of Text Based on Doc2Vec & K-means Clustering based Similarity Measurements
CN114036946B (en) A system and method for text feature extraction and assisted retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant