CN109726270A

CN109726270A - A method for detecting the degree of repetition of articles based on article segmentation and Pearson test

Info

Publication number: CN109726270A
Application number: CN201811511826.9A
Authority: CN
Inventors: 徐炜华
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2019-05-07
Anticipated expiration: 2038-12-11
Also published as: CN109726270B

Abstract

The present invention is a method for detecting the degree of repetition of articles based on article segmentation and the Pearson test. The invention is characterized in that the full text of an article is divided into multiple segments, the number of occurrences of each segment is counted, and the counts are arranged in a certain order; articles from a database or other sources are then processed in the same way. Because the resulting data can be plotted as curves, the Pearson test can be used to measure the correlation between two curves and thereby obtain the degree of repetition of the two articles. A correlation coefficient of 0.8-1.0 indicates that the two articles are highly repetitive; 0.6-0.8 indicates relatively high repetition; 0.4-0.6, moderate repetition; 0.2-0.4, low repetition; and 0.0-0.2, extremely low or no repetition. Combined with computer technology, this technique can be applied to paper repetition detection (commonly known as paper duplicate checking) and makes it harder to reduce the repetition rate manually (commonly known as "weight reduction"), which is of great significance for combating plagiarism in papers.

Description

A method for detecting the degree of repetition of articles based on article segmentation and the Pearson test

Technical field

The present invention relates to a method for detecting the degree of repetition of papers.

Background art

The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient. It is defined as the quotient of the covariance of two variables and the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

The formula above defines the population correlation coefficient, commonly denoted by the lowercase Greek letter ρ (rho). Estimating the covariance and standard deviations from a sample gives the sample Pearson correlation coefficient, commonly denoted by the lowercase letter r:

$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

r can also be estimated from the mean of the products of the sample points' standard scores, which gives an expression equivalent to the one above:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{Y_i-\bar{Y}}{\sigma_Y}\right)$$

where $\frac{X_i-\bar{X}}{\sigma_X}$, $\bar{X}$, and $\sigma_X$ are the standard score, sample mean, and sample standard deviation of the $X_i$ sample, respectively.
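As an illustration of the two sample expressions described above, the following is a minimal Python sketch, not part of the original disclosure. The function names `pearson_r` and `pearson_r_zscores` are hypothetical, and the standard-score form assumes the sample standard deviation is computed with an n - 1 divisor.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson coefficient via the sum-of-products form."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def pearson_r_zscores(xs, ys):
    """Same coefficient via standard scores (assumes n - 1 in the std. dev.)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    products = (
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)
```

Both functions should agree up to floating-point rounding, which is one way to check that the two expressions are equivalent.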

The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables; the larger its absolute value, the stronger the correlation. A coefficient of 0.8-1.0 indicates extremely strong correlation, 0.6-0.8 strong correlation, 0.4-0.6 moderate correlation, 0.2-0.4 weak correlation, and 0.0-0.2 extremely weak or no correlation.
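The bands above can be expressed as a small lookup. The sketch below is illustrative only: the function name `correlation_band` and the label strings are hypothetical, and because the source lists overlapping endpoints (for example, 0.8 appears in two ranges), it assumes that a boundary value belongs to the stronger band.

```python
def correlation_band(r):
    """Map a correlation coefficient to the strength band of its absolute value.

    Assumption: a value exactly on a boundary (0.2, 0.4, 0.6, 0.8)
    is assigned to the stronger band.
    """
    strength = abs(r)
    if strength > 1.0 + 1e-9:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if strength >= 0.8:
        return "extremely strong"
    if strength >= 0.6:
        return "strong"
    if strength >= 0.4:
        return "moderate"
    if strength >= 0.2:
        return "weak"
    return "extremely weak or none"
```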

Summary of the invention

The present invention creatively chooses not to use semantically complete sentences as the minimum unit of duplicate checking. This solves the problem that, in traditional paper duplicate-checking methods, the repetition rate can easily be reduced by measures such as adjusting word order, and it allows the data to be analyzed with a mature mathematical model from statistics. The present invention aims to provide a simple, accurate, and reliable method for paper duplicate checking.

Specific embodiment

First, segmentation sites are selected at random, and the paper under test is decomposed into segments of equal or unequal length. Note that the segments should not be too long, so as not to impair the sensitivity of detection: they generally should not exceed 5 characters, and sensitivity is highest when the full text is decomposed into single characters. Then the total number of times each resulting segment occurs in the paper is counted, and the counts are arranged in a certain order to obtain an array. Next, the reference paper in the database is decomposed at the same sites, and its segments are counted by number of occurrences. A segment that occurs in the paper under test but not in the reference paper is counted as 0; a segment that does not occur in the paper under test but occurs in the reference paper is not counted. The resulting data are arranged in the same order as those of the paper under test, which yields two arrays. Finally, the Pearson test is applied to the two arrays to obtain their correlation coefficient, which is taken as the degree of repetition of the two papers.
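The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: all function names are hypothetical; "the same sites" is interpreted as regenerating the same sequence of random segment lengths (same seed) for both texts; and "a certain order" is taken to be the sorted order of the segments found in the paper under test.

```python
import math
import random
from collections import Counter


def split_at_random_sites(text, max_len=5, seed=0):
    """Cut text into consecutive segments of random length 1..max_len.

    Assumption: reusing the same seed reproduces the same cut offsets,
    which stands in for "decomposing at the same sites".
    """
    rng = random.Random(seed)
    segments = []
    pos = 0
    while pos < len(text):
        step = rng.randint(1, max_len)
        segments.append(text[pos:pos + step])
        pos += step
    return segments


def pearson_r(xs, ys):
    """Sample Pearson coefficient; None when it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant array has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def repetition_degree(test_text, ref_text, max_len=5, seed=0):
    """Correlation between the segment counts of two texts."""
    test_counts = Counter(split_at_random_sites(test_text, max_len, seed))
    ref_counts = Counter(split_at_random_sites(ref_text, max_len, seed))
    # Only segments of the paper under test define the axis; segments that
    # occur only in the reference paper are never looked up ("not counted").
    order = sorted(test_counts)
    xs = [test_counts[seg] for seg in order]
    # Counter returns 0 for segments missing from the reference paper.
    ys = [ref_counts[seg] for seg in order]
    return pearson_r(xs, ys)
```

Calling `repetition_degree(paper, reference, max_len=1)` corresponds to the single-character decomposition that the text describes as most sensitive. With longer segments most counts tend to be 1, so the array for the paper under test can be constant and the coefficient undefined; the sketch returns `None` in that case.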

The method is described in detail by the following example:

The character-by-character decomposition of the full text, which has the highest sensitivity, is selected. The paper under test is decomposed, the number of occurrences of each resulting Chinese character is counted, and the resulting data are sorted by the initial letters of the characters to obtain an array. The reference paper in the database is then decomposed, the number of occurrences of each resulting Chinese character is counted, and the resulting data are sorted in the same order as the data of the paper under test. A character that occurs in the paper under test but not in the reference paper is counted as 0; a character that does not occur in the paper under test but occurs in the reference paper is not counted. This yields a second array. Finally, the Pearson test is applied to the two arrays to obtain the degree of repetition of the two papers. A correlation coefficient of 0.8-1.0 indicates a very high likelihood of plagiarism; 0.6-0.8, a relatively high likelihood; 0.4-0.6, some suspicion of plagiarism; 0.2-0.4, a relatively low likelihood; and 0.0-0.2, an extremely low likelihood. The above operations are repeated until every reference paper in the database has been checked.

Claims

1. A method for detecting the degree of repetition of articles based on article segmentation and the Pearson test, characterized in that: the article under test is divided into several segments by a specific number of characters or at specific identification sites, and the total number of times each segment occurs in the article is counted; with the article segments on the horizontal axis and the total number of occurrences of each segment in the article on the vertical axis, a curve is drawn, which is referred to as the characteristic curve of the article; other articles are segmented in the same manner at the same segmentation sites, the number of times each segment occurs in the article is counted, and, with the article segments on the horizontal axis and the total number of occurrences of each segment on the vertical axis, a curve is drawn, which is referred to as the characteristic curve of that article; the curve of the article under test is subjected to the Pearson test together with the curves of the other articles, and the result serves as the basis for determining the degree of repetition of the articles.

2. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the article under test is divided into multiple segments, the segment length affects the sensitivity of detection, and the shorter the segment length, the higher the sensitivity of detection.

3. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, when the article under test is divided into multiple segments, the segment length is not fixed.

4. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, after the article under test is divided into multiple segments, the number of times each segment occurs in the text is counted.

5. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, after the number of times each segment occurs in the text is counted, a curve is drawn with the article segments on the horizontal axis and the total number of occurrences of each segment in the article on the vertical axis, and this curve is referred to as the characteristic curve of the article.

6. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article can be drawn by computer.

7. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer need not be presented in graphical form.

8. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer is presented in the form of a mathematical expression or an array.

9. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curves drawn for several articles are subjected to the Pearson test.

10. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the degree of repetition of the articles is judged according to the result of the Pearson test.