CN114328822B

CN114328822B - A contract text intelligent analysis method based on deep data mining

Info

Publication number: CN114328822B
Application number: CN202111485260.9A
Authority: CN
Inventors: 焦洪林; 陆向东; 朱坚; 王雷
Original assignee: Fujia Newland Software Engineering Co ltd
Current assignee: Fujia Newland Software Engineering Co ltd
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2025-04-04
Anticipated expiration: 2041-12-07
Also published as: CN114328822A

Abstract

The invention provides an intelligent contract text analysis method based on deep data mining, belonging to the technical field of text processing. The method comprises: step S10, obtaining a contract text to be analyzed and historical contract texts to form a contract text set; step S20, preprocessing the contract text set; step S30, extracting keywords from the contract text to be analyzed and from the historical contract texts, respectively, based on a multi-feature word weight formula, to obtain first keywords and second keywords; step S40, searching for similar historical contract texts based on the first keywords and the second keywords; step S50, searching for confusion words in the contract text to be analyzed based on language model perplexity and an N-Gram language model, and matching the correct words corresponding to the confusion words; and step S60, displaying the first keywords, the similar historical contract texts, the confusion words and the correct words, thereby completing the intelligent analysis of the contract text to be analyzed. The method has the advantage of greatly improving the quality and efficiency of contract text analysis.

Description

Contract text intelligent analysis method based on deep data mining

Technical Field

The invention relates to the technical field of text processing, in particular to an intelligent contract text analysis method based on deep data mining.

Background

In recent years, with the development of internet technology, corporate legal staff have faced the need to analyze, manage and draft large numbers of contracts in electronic form within a short time. How to quickly and accurately obtain summary information from various contracts, and to manage and edit those contracts, is the main problem to be solved at present. Compared with other documents, contract texts have the following characteristics:

1. Clear topic type: contract texts are usually edited and managed by department, affiliated institution and business type, and each document is essentially classified by its department or business type.

2. Specialized vocabulary: contract texts generally use specialized terms from the relevant field according to department and topic, rather than the everyday and internet vocabulary found in documents such as novels, forum posts and microblogs.

3. Standardized content: contract texts consist mostly of declarative sentences without excessive modification or descriptive content, so errors made when writing them are mostly wrong words and wrong technical terms, and rarely involve grammar or semantics.

4. No abstract: contract texts generally have no abstract and sometimes take a meeting or report name as the document title, which does not provide enough summary information.

Based on the characteristics of the contract text, no corresponding method in the prior art can accurately extract keywords from the contract text, match similar contracts and correct content, so that errors are easy to occur and the efficiency is low when the contract text is analyzed. Therefore, how to provide an intelligent analysis method for contract text based on deep data mining, so as to improve the quality and efficiency of the analysis of the contract text, becomes a technical problem to be solved urgently.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an intelligent contract text analysis method based on deep data mining that improves the quality and efficiency of contract text analysis.

The invention provides an intelligent contract text analysis method based on deep data mining, which comprises the following steps:

Step S10, acquiring a contract text to be analyzed and a large number of historical contract texts to form a contract text set;

Step S20, preprocessing the contract text set;

Step S30, based on a multi-feature word weight formula, extracting keywords from the preprocessed contract text to be analyzed and the preprocessed historical contract texts respectively to obtain a plurality of first keywords and a plurality of second keywords;

Step S40, searching for similar historical contract texts based on the first keywords and the second keywords;

Step S50, searching for confusion words in the contract text to be analyzed based on the language model perplexity and the N-Gram language model, and matching the correct words corresponding to the confusion words;

Step S60, displaying the first keywords of the contract text to be analyzed, the similar historical contract texts, the confusion words and the correct words, thereby completing the intelligent analysis of the contract text to be analyzed.

Further, in the step S10, the obtaining a large amount of historical contract text specifically includes:

setting a time span, and acquiring historical contract texts of a large number of different departments in different areas based on the time span.

Further, the step S20 specifically includes:

Step S21, searching for duplicate contract texts in the contract text set based on the contract titles, and merging the duplicate contract texts;

Step S22, removing noise data from each contract text in the contract text set;

Step S23, creating a sensitive word library, and filtering the sensitive words of each contract text in the contract text set based on the sensitive word library.

Further, in the step S22, the noise data includes at least URL addresses, special symbols, emoticons, pictures and zero-width characters.
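Steps S21 and S23 can be illustrated with a minimal Python sketch. The source does not say how duplicates are merged or how sensitive words are handled once found, so keeping the first text seen for each title, masking matches with asterisks, and the function names `merge_duplicates` and `filter_sensitive` are all assumptions made for illustration.

```python
def merge_duplicates(contracts):
    """Step S21 (sketch): keep one contract text per title.

    `contracts` is a list of (title, body) pairs. Titles are compared after
    stripping surrounding whitespace, and the first body seen for a title is
    kept. This merge rule is an assumption, not something the source specifies.
    """
    merged = {}
    for title, body in contracts:
        key = title.strip()
        if key not in merged:
            merged[key] = body
    return list(merged.items())


def filter_sensitive(text, sensitive_words):
    """Step S23 (sketch): mask every occurrence of a sensitive word."""
    # Replace longer words first so a short word cannot split a longer one.
    for word in sorted(sensitive_words, key=len, reverse=True):
        text = text.replace(word, "*" * len(word))
    return text


# Hypothetical sample data.
contracts = [
    ("Purchase Contract A", "Party A buys 10 servers."),
    ("Purchase Contract A ", "Party A buys 10 servers."),
    ("Lease Contract B", "Party B leases an office."),
]
unique = merge_duplicates(contracts)
masked = filter_sensitive("Pay the secret bonus in cash.", {"secret bonus", "cash"})
```

In practice the sensitive word library would be maintained separately and loaded at run time; a set literal is used here only to keep the sketch self-contained.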

Further, in the step S30, the multi-feature term weight formula specifically includes:

W_NEW-TF-IDF = W_TF-IDF × W_word;

W_TF-IDF = TF(i) × IDF(i);

W_word = αW_l + βW_c + γW_len;

wherein W_NEW-TF-IDF represents the multi-feature word weight; W_TF-IDF represents the traditional TF-IDF feature weight; W_word represents the word weight, which combines the position weight, the part-of-speech weight and the word length weight; TF(i) represents the term frequency of the i-th word; IDF(i) represents the inverse document frequency of the i-th word, that is, the fewer contract texts contain the i-th word, the larger the value; n_i represents the number of times the i-th word appears; n represents the total number of all keywords; N represents the total number of contract texts; df(i) represents the number of documents in which the i-th word appears; α, β and γ represent weight coefficients; W_l represents the position weight; W_c represents the part-of-speech weight; W_len represents the word length weight; i_len represents the word length of the i-th word; and avg(len) represents the average word length.

Further, the step S40 specifically includes:

step S41, splicing the first keywords and the corresponding contract titles to obtain first key information, and splicing the second keywords and the corresponding contract titles to obtain second key information;

step S42, inputting the first key information and the second key information into a BERT model and an average pooling layer in sequence for feature extraction to obtain a first feature sentence vector and a second feature sentence vector;

And step S43, sequentially calculating cosine similarity of the first feature sentence vector and each second feature sentence vector, and matching similar historical contract texts based on the cosine similarity.
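Steps S41 and S42 can be sketched in Python as follows. No BERT model is called here: `token_vectors` is a hypothetical stand-in for the per-token output of BERT, and joining the title and keywords with spaces is an assumed splicing format, since the source does not specify one.

```python
def splice_key_info(title, keywords):
    """Step S41 (sketch): concatenate a contract title with its keywords."""
    return " ".join([title] + list(keywords))


def mean_pool(token_vectors):
    """Average pooling: average per-token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


key_info = splice_key_info("Server Purchase Contract", ["server", "delivery", "warranty"])

# Hypothetical 4-dimensional token vectors standing in for BERT's output.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
]
sentence_vector = mean_pool(token_vectors)
```

Average pooling gives a fixed-length vector regardless of how many tokens the spliced key information contains, which is what allows the vectors to be compared in step S43.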

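Further, in the step S43, the calculation formula of the cosine similarity is:

$$\mathrm{sim} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^{2}}\,\sqrt{\sum_{i=1}^{m} y_i^{2}}}$$

where sim represents the cosine similarity, x_i represents the i-th component of the first feature sentence vector, y_i represents the i-th component of the second feature sentence vector, and m represents the number of components of the feature sentence vectors.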

Further, the step S50 specifically includes:

Step S51, setting a likelihood threshold and creating a confusion word set, wherein the confusion word set comprises one-to-one correspondences between a plurality of confusion words and correct words;

Step S52, calculating, based on the language model perplexity, a likelihood estimate for each sentence in the contract text to be analyzed in turn, and judging whether the likelihood estimate is lower than the likelihood threshold; if so, a suspected confusion word exists and the method proceeds to step S53;

Step S53, ranking the sentences containing suspected confusion words with the N-Gram language model, and selecting the highest-scoring words as confusion words based on the ranking result;

Step S54, matching the correct words corresponding to the confusion words by using the confusion word set.
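The flow of steps S51 to S54 can be sketched in Python as below. The source leaves the ranking in step S53 open to interpretation; this sketch assumes one reading, in which each candidate correction is substituted into the flagged sentence and the rescored sentence with the highest score wins. The scoring function `toy_score` is a deliberately simple stand-in for a language model, and all names and sample data are hypothetical.

```python
def find_corrections(sentences, score, threshold, confusion_set):
    """Sketch of steps S52 to S54 for whitespace-tokenized sentences.

    score(words) -> float is a hypothetical sentence likelihood estimate
    (higher means more plausible); confusion_set maps each confusion word
    to its correct word (step S51).
    """
    results = []
    for sentence in sentences:
        words = sentence.split()
        if score(words) >= threshold:
            continue  # S52: the sentence looks plausible, nothing to check
        best = None
        for i, word in enumerate(words):
            if word not in confusion_set:
                continue
            # S53 (assumed reading): rescore the sentence with the candidate fix
            candidate = words[:i] + [confusion_set[word]] + words[i + 1:]
            candidate_score = score(candidate)
            if best is None or candidate_score > best[0]:
                best = (candidate_score, word, confusion_set[word])
        if best is not None:
            # S54: report the confusion word together with its correct word
            results.append((sentence, best[1], best[2]))
    return results


# Toy stand-in for a language model: fraction of words found in a vocabulary.
VOCAB = {"the", "buyer", "shall", "pay", "deposit", "within", "ten", "days"}


def toy_score(words):
    return sum(w in VOCAB for w in words) / len(words)


confusion_set = {"deposite": "deposit", "byer": "buyer"}
sentences = [
    "the buyer shall pay the deposite",
    "the buyer shall pay within ten days",
]
corrections = find_corrections(sentences, toy_score, threshold=0.9,
                               confusion_set=confusion_set)
```

Passing the scorer in as a parameter keeps the control flow independent of the particular language model, so a perplexity-based or N-Gram-based scorer could be substituted without changing the loop.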

The invention has the advantages that:

The method builds multi-feature word weights by combining the position weight, part-of-speech weight and word length weight of each word with the traditional TF-IDF feature weight, and extracts keywords based on these multi-feature word weights. Because the position, part of speech and word length of each word are fully considered, the accuracy of keyword extraction is greatly improved. Similar historical contract texts are searched through keyword matching, which greatly improves search efficiency relative to full-text search. Confusion words are found through language model perplexity and an N-Gram language model, and the corresponding correct words are matched based on the established confusion word set, thereby correcting content errors in the contract text to be analyzed. Compared with traditional manual analysis, the quality and efficiency of contract text analysis are greatly improved.

Drawings

The invention will be further described with reference to examples of embodiments with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for intelligent analysis of contract text based on deep data mining of the present invention.

Detailed Description

The general idea of the technical scheme of the embodiment of the application is as follows: multi-feature word weights are built by combining the position weight, part-of-speech weight and word length weight of each word, and keywords are extracted with them to improve the accuracy of keyword extraction; similar historical contract texts are searched through keyword matching to improve search efficiency; confusion words are found through language model perplexity and an N-Gram language model, and the corresponding correct words are matched based on the established confusion word set to correct content errors. Together these improve the quality and efficiency of contract text analysis.

Referring to FIG. 1, a preferred embodiment of the intelligent contract text analysis method based on deep data mining according to the present invention includes the following steps:

Step S20, preprocessing the contract text set, that is, removing invalid data to improve text processing efficiency;

Step S30, based on a multi-feature word weight formula, extracting keywords from the preprocessed contract text to be analyzed and the preprocessed historical contract texts respectively to obtain a plurality of first keywords and a plurality of second keywords; because contract texts are often long and reading one in full takes a long time, extracting keywords allows staff to obtain the important information of a contract text conveniently and quickly;

Step S40, searching for similar historical contract texts based on the first keywords and the second keywords, so that reference information can be conveniently obtained from the similar historical contract texts;

Step S50, searching for confusion words in the contract text to be analyzed based on the language model perplexity (PPL) and the N-Gram language model, and matching the correct words corresponding to the confusion words;

Step S60, displaying the first keywords of the contract text to be analyzed, the similar historical contract texts, the confusion words and the correct words, completing the intelligent analysis of the contract text to be analyzed, and automatically replacing each confusion word with its correct word.

In the step S10, the obtaining a large number of historical contract texts specifically includes:

Setting a time span, and acquiring historical contract texts of a large number of different departments in different areas based on the time span so as to improve the richness of the sample.
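Selecting historical contract texts by time span can be sketched as a simple date filter. The record layout and field names (`title`, `department`, `region`, `signed_on`) are hypothetical; the source does not describe how contracts are stored.

```python
from datetime import date


def select_by_time_span(contracts, start, end):
    """Return the contracts whose signing date falls within [start, end]."""
    return [c for c in contracts if start <= c["signed_on"] <= end]


# Hypothetical records; the field names are illustrative only.
history = [
    {"title": "Lease Contract", "department": "Administration",
     "region": "Region A", "signed_on": date(2018, 3, 1)},
    {"title": "Server Purchase Contract", "department": "IT",
     "region": "Region B", "signed_on": date(2020, 6, 15)},
    {"title": "Service Contract", "department": "Legal",
     "region": "Region A", "signed_on": date(2021, 11, 30)},
]

selected = select_by_time_span(history, date(2020, 1, 1), date(2021, 12, 7))
departments = sorted({c["department"] for c in selected})
```

Inspecting `departments` (or the corresponding set of regions) after filtering is one simple way to check that the sample still covers several departments and areas.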

The step S20 specifically includes:

Step S22, removing noise data from each contract text in the contract text set;

In the step S22, the noise data includes at least URL addresses, special symbols, emoticons, pictures and zero-width characters.
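A minimal Python sketch of the noise removal in step S22 follows. The regular expressions are assumptions: the source names the categories of noise but not the exact patterns, the list of "special symbols" here is arbitrary, and pictures are not handled because they depend on the document format.

```python
import re

URL_RE = re.compile(r"https?://\S+")
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# Common emoji and pictograph ranges; not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Assumed set of special symbols to drop.
SPECIAL_RE = re.compile(r"[#@*^~|<>]")


def remove_noise(text):
    """Strip URLs, zero-width characters, emoji and special symbols."""
    for pattern in (URL_RE, ZERO_WIDTH_RE, EMOJI_RE, SPECIAL_RE):
        text = pattern.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


raw = ("Party A\u200b shall pay ## 5000 yuan \U0001F600 "
       "see https://example.com/terms for details")
clean = remove_noise(raw)
```

URLs are removed first so that the special-symbol pattern cannot cut a URL into fragments that would then survive the cleanup.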

In the step S30, the multi-feature term weight formula specifically includes:

W_NEW-TF-IDF = W_TF-IDF × W_word;

W_TF-IDF = TF(i) × IDF(i);

W_word = αW_l + βW_c + γW_len;

wherein W_NEW-TF-IDF represents the multi-feature word weight; W_TF-IDF represents the traditional TF-IDF feature weight; W_word represents the word weight, which combines the position weight, the part-of-speech weight and the word length weight; TF(i) represents the term frequency of the i-th word; IDF(i) represents the inverse document frequency of the i-th word, that is, the fewer contract texts contain the i-th word, the larger the value, which indicates that the i-th word distinguishes text types well; n_i represents the number of times the i-th word appears; n represents the total number of all keywords; N represents the total number of contract texts; df(i) represents the number of documents in which the i-th word appears; α, β and γ represent weight coefficients, whose values are preferably 0.6, 0.3 and 0.1 respectively; W_l represents the position weight; W_c represents the part-of-speech weight; W_len represents the word length weight; i_len represents the word length of the i-th word; and avg(len) represents the average word length.
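As a worked illustration of the formula, take the preferred coefficients α = 0.6, β = 0.3 and γ = 0.1 together with purely hypothetical values W_l = 1.0, W_c = 0.8, W_len = 0.5 and W_TF-IDF = 0.2 (none of these four values comes from the source):

```latex
\begin{aligned}
W_{\text{word}} &= 0.6 \times 1.0 + 0.3 \times 0.8 + 0.1 \times 0.5 = 0.89 \\
W_{\text{NEW-TF-IDF}} &= W_{\text{TF-IDF}} \times W_{\text{word}} = 0.2 \times 0.89 = 0.178
\end{aligned}
```

Because the three coefficients sum to 1, W_word acts as a weighted average of the three feature weights, with position contributing the most.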

According to the TF-IDF algorithm, feature words that occur frequently enough in a text and rarely enough in the other texts of the text set are the keywords of that text. However, the structure of the TF-IDF algorithm is too simple: it cannot effectively reflect the importance of a word or the positional distribution of feature words, and it cannot effectively adjust word weights, so its accuracy is not high. In particular, TF-IDF does not reflect the importance of a word's position, part of speech and word length. In a contract, content in different structural parts conveys different information, so weights should be assigned according to these structural characteristics. The invention therefore improves the traditional TF-IDF algorithm in light of the characteristics of the sample data: feature words with different positions, parts of speech and word lengths in the contract are given different coefficients, which are multiplied by the TF-IDF values of the feature words to enhance the expressiveness of the text representation.

Since the title of a contract text generally summarizes the main content of the contract, words appearing in the title are more likely to be keywords. Words appearing at the beginning or end may reflect hidden or related keywords of the contract and also deserve attention. Therefore, the position weight is set highest for the contract title, second highest for the beginning and end, and lowest for all other positions.

Parts of speech in Chinese are divided into two types: content words and function words. Content words include nouns, verbs, adjectives, pronouns, numerals, measure words and the like; function words include prepositions, conjunctions, interjections, auxiliary words and the like. Keywords are usually nouns or noun phrases first, followed by verbs, adverbs and other modifiers.

Keywords that are too short cannot convey enough information, while keywords that are too long carry so much information that they can be segmented further. Word-length statistics on the segmented contract texts show that the word length of keywords generally falls within [2, 7], so words that are too long or too short need to be filtered out.
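The multi-feature word weight can be sketched in Python as follows. Several parts are assumptions, because the corresponding formulas are not present in the text: TF(i) = n_i / n, IDF(i) = log(N / df(i)) and W_len = i_len / avg(len) are assumed forms, and the numeric position and part-of-speech weights are hypothetical values that only respect the orderings described above. The sample words are English for readability, although the [2, 7] length range was derived for segmented Chinese text.

```python
import math

ALPHA, BETA, GAMMA = 0.6, 0.3, 0.1  # preferred coefficients given in the text

# Assumed values: the text only fixes the order title > beginning/end > other.
POSITION_WEIGHT = {"title": 1.0, "edge": 0.8, "body": 0.5}
# Assumed values: the text only says nouns rank ahead of verbs and modifiers.
POS_WEIGHT = {"noun": 1.0, "verb": 0.6, "other": 0.3}


def rank_keywords(terms, doc_freq, n_docs, min_len=2, max_len=7):
    """Rank the terms of one segmented contract text by multi-feature weight.

    terms: list of (word, count, position, pos_tag) tuples.
    doc_freq: word -> number of contract texts containing the word.
    """
    total = sum(count for _, count, _, _ in terms)
    kept = [t for t in terms if min_len <= len(t[0]) <= max_len]  # length filter
    if not kept:
        return []
    avg_len = sum(len(word) for word, _, _, _ in kept) / len(kept)
    ranked = []
    for word, count, position, pos_tag in kept:
        tf = count / total                       # assumed TF(i) = n_i / n
        idf = math.log(n_docs / doc_freq[word])  # assumed IDF(i) = log(N / df(i))
        w_len = len(word) / avg_len              # assumed W_len = i_len / avg(len)
        w_word = (ALPHA * POSITION_WEIGHT[position]
                  + BETA * POS_WEIGHT[pos_tag]
                  + GAMMA * w_len)
        ranked.append((word, tf * idf * w_word))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts for one contract text in a set of 10 texts.
terms = [
    ("server", 3, "title", "noun"),
    ("buy", 2, "title", "verb"),
    ("buyer", 4, "edge", "noun"),
    ("a", 6, "body", "other"),
]
doc_freq = {"server": 2, "buy": 5, "buyer": 10, "a": 10}
ranking = rank_keywords(terms, doc_freq, n_docs=10)
```

In this example "buyer" occurs most often among the kept words but appears in every text, so its IDF is zero and it ranks last, while the one-character word "a" is removed by the length filter before weighting.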

The step S40 specifically includes:

In the step S43, the calculation formula of the cosine similarity is as follows:

$$\mathrm{sim} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^{2}}\,\sqrt{\sum_{i=1}^{m} y_i^{2}}}$$

where sim represents the cosine similarity, x_i represents the i-th component of the first feature sentence vector, y_i represents the i-th component of the second feature sentence vector, and m represents the number of components of the feature sentence vectors. The larger the value of sim, the smaller the angle between the two feature sentence vectors and the higher the similarity. Finally, the historical contract text with the highest similarity is returned.

Similar contract text search matches similar historical contract texts to the contract text currently being written or managed, providing references for the relevant personnel. Semantic search for similar contract texts amounts to judging the semantic similarity between a source text and a target text. Traditional semantic matching leans on lexical semantics, surface-form matching and syntactic similarity: predefined text features must be extracted and a similarity detection algorithm written to obtain the similarity between texts. A neural-network-based method instead considers, when the model is built, how to distinguish the semantic differences between two texts and how to model the relevance between them. Because contract texts are long, feature vectors extracted from the full text cannot represent the key information of the contract well, and the similar results finally retrieved would differ considerably from the actual results.
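The matching in step S43 can be sketched in Python as a cosine similarity computation followed by selection of the best match. The vectors below are hypothetical 3-dimensional stand-ins for the pooled BERT sentence vectors, and the function names are illustrative.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_x * norm_y)


def most_similar(query_vec, history_vecs):
    """Return (title, similarity) of the historical text closest to the query."""
    scored = ((title, cosine_similarity(query_vec, vec))
              for title, vec in history_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical sentence vectors.
query = [1.0, 2.0, 0.0]
history = {
    "Lease Contract": [0.0, 0.0, 3.0],
    "Server Purchase Contract": [2.0, 4.0, 0.0],
    "Service Contract": [1.0, 0.0, 1.0],
}
best_title, best_sim = most_similar(query, history)
```

Cosine similarity ignores vector length, so the second historical vector, which points in the same direction as the query but is twice as long, is still a perfect match up to floating-point rounding.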

The step S50 specifically includes:

Step S51, setting a likelihood threshold and creating a confusion word set, wherein the confusion word set comprises one-to-one correspondences between a plurality of confusion words and correct words and can be updated as required, so that it is highly extensible;

Step S52, calculating, based on the language model perplexity (PPL), a likelihood estimate for each sentence in the contract text to be analyzed in turn, and judging whether the likelihood estimate is lower than the likelihood threshold; if so, a suspected confusion word exists and the method proceeds to step S53;

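The language model perplexity is the inverse of the probability that the language model assigns to the text, normalized by the sentence length, and the formula is:

$$\mathrm{PPL}(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$

where S represents the input text, N represents the sentence length, and P(w_i) represents the probability of the i-th word;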
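The perplexity check of step S52 can be sketched in Python as follows. The sketch assumes the standard definition of perplexity as the N-th root of the inverse product of the word probabilities, and it assumes that the likelihood estimate compared against the threshold is the reciprocal of the perplexity; the source does not state how the two are related. The word probabilities and the threshold are hypothetical.

```python
import math


def perplexity(word_probs):
    """PPL(S) = (prod of 1 / P(w_i)) ** (1 / N), computed in log space."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)


def likelihood_estimate(word_probs):
    """Assumed likelihood estimate: the reciprocal of the perplexity."""
    return 1.0 / perplexity(word_probs)


THRESHOLD = 0.2  # hypothetical likelihood threshold

# Hypothetical per-word probabilities from a language model.
fluent = [0.5, 0.5, 0.5, 0.5]
suspect = [0.5, 0.5, 0.001, 0.5]  # one very unlikely word

flagged = [name for name, probs in (("fluent", fluent), ("suspect", suspect))
           if likelihood_estimate(probs) < THRESHOLD]
```

Working in log space avoids numerical underflow when a sentence is long and the product of many small probabilities would otherwise round to zero.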

The N-Gram language model is a statistical, probability-based model that judges the correctness of a text from its predicted score. It operates on an ordered sequence of N words: if the occurrence of a word depends only on the one word before it, the bigram model Bi-Gram (N=2) is used; if it depends on the two words before it, the trigram model Tri-Gram (N=3) is used; and so on. Assuming that a sentence s in the contract text is composed of a sequence of words q_1, q_2, …, q_n in a specific order, then according to the chain rule the probability of occurrence of the sentence s is:

$$P(s) = P(q_1)\,P(q_2 \mid q_1)\cdots P(q_n \mid q_1, \ldots, q_{n-1}) = \prod_{i=1}^{n} P(q_i \mid q_1, \ldots, q_{i-1})$$

The N-Gram language model assumes that the probability of occurrence of any word depends only on the N-1 words preceding it, namely:

$$P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-N+1}, \ldots, q_{i-1})$$

When the trigram model is used for modeling, the i-th word depends only on the preceding 2 words, namely:

$$P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-2}, q_{i-1})$$
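A count-based trigram model of the kind described above can be sketched in Python as follows. Add-one smoothing and the `<s>` and `</s>` padding tokens are additions made here so that unseen trigrams do not yield zero probability; the source does not mention smoothing. The training corpus is a hypothetical toy example.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi


def sentence_probability(words, tri, bi, vocab_size):
    """P(s) as the product of P(q_i | q_{i-2}, q_{i-1}), add-one smoothed."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        trigram = tuple(padded[i - 2:i + 1])
        prob *= (tri[trigram] + 1) / (bi[context] + vocab_size)
    return prob


# Hypothetical toy corpus of segmented contract sentences.
corpus = [
    "party a shall pay the deposit".split(),
    "party b shall pay the rent".split(),
    "party a shall deliver the goods".split(),
]
tri, bi = train_trigram(corpus)
vocab_size = len({w for s in corpus for w in s} | {"</s>"})

p_good = sentence_probability("party a shall pay the deposit".split(), tri, bi, vocab_size)
p_bad = sentence_probability("party a shall pay the deposite".split(), tri, bi, vocab_size)
```

The misspelled sentence receives a lower probability because the trigrams containing "deposite" were never observed, which is the signal used to single out a suspected confusion word.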

In summary, the invention has the advantages that:

While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that the specific embodiments described are illustrative only and not intended to limit the scope of the invention, and that equivalent modifications and variations of the invention in light of the spirit of the invention will be covered by the claims of the present invention.

Claims

1. An intelligent contract text analysis method based on deep data mining, characterized by comprising the following steps:

Step S20, preprocessing the contract text set;

2. The method for intelligent analysis of contract text based on deep data mining according to claim 1, wherein in the step S10, the step of obtaining a large amount of historical contract text is specifically as follows:

3. The method for intelligent analysis of contract text based on deep data mining according to claim 1, wherein said step S20 comprises the following steps:

Step S22, removing noise data from each contract text in the contract text set;

4. The method for intelligent analysis of contract text based on deep data mining according to claim 3, wherein in step S22, the noise data includes at least URL addresses, special symbols, emoticons, pictures and zero-width characters.

5. The method for intelligent analysis of contract text based on deep data mining according to claim 1, wherein in the step S30, the multi-feature word weight formula is specifically:

W_NEW-TF-IDF = W_TF-IDF × W_word;

W_TF-IDF = TF(i) × IDF(i);

W_word = αW_l + βW_c + γW_len;

6. The method for intelligent analysis of contract text based on deep data mining according to claim 1, wherein the step S40 specifically comprises the following steps:

7. The method for intelligent analysis of contract text based on deep data mining according to claim 6, wherein in the step S43, the cosine similarity calculation formula is:

8. The method for intelligent analysis of contract text based on deep data mining according to claim 1, wherein said step S50 comprises the following steps: