CN109726270A - A method for detecting the degree of repetition of articles based on article segmentation and Pearson test - Google Patents
A method for detecting the degree of repetition of articles based on article segmentation and Pearson test
- Publication number
- CN109726270A
- Authority
- CN
- China
- Prior art keywords
- article
- pearson
- segmentation
- detecting method
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention is a method for detecting the degree of repetition between articles based on article segmentation and the Pearson test. The article text is divided into multiple segments, the number of occurrences of each segment is counted, and the counts are arranged in a fixed order. Articles from a database or other sources are processed in the same way. Because the resulting data can be plotted as curves, the Pearson test can be used to measure the correlation between two curves and thereby obtain the degree of repetition between two articles. A correlation coefficient of 0.8-1.0 indicates that the two articles are highly repetitive, 0.6-0.8 indicates fairly high repetition, 0.4-0.6 moderate repetition, 0.2-0.4 low repetition, and 0.0-0.2 very low or no repetition. Combined with computer technology, this technique can be applied to paper repetition detection (commonly called paper duplicate checking). It makes manual reduction of the repetition rate (commonly called "weight reduction") more difficult and is of great significance for combating paper plagiarism.
Description
Technical field
The present invention relates to a method for detecting the degree of repetition of papers.
Background technique
The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient. It is defined as the quotient of the covariance of two variables and the product of their standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

The formula above defines the population correlation coefficient, usually denoted by the lowercase Greek letter ρ (rho). Estimating the covariance and standard deviations from a sample gives the sample correlation coefficient, usually denoted by the lowercase letter r:

$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

r can also be estimated as the mean of the products of the standard scores of the sample points, which gives an expression equivalent to the one above:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{Y_i-\bar{Y}}{\sigma_Y}\right)$$

where $\frac{X_i-\bar{X}}{\sigma_X}$, $\bar{X}$ and $\sigma_X$ are, respectively, the standard score of the sample point $X_i$, the sample mean and the sample standard deviation.

The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables; the larger its absolute value, the stronger the correlation. A coefficient of 0.8-1.0 indicates very strong correlation, 0.6-0.8 strong correlation, 0.4-0.6 moderate correlation, 0.2-0.4 weak correlation, and 0.0-0.2 very weak or no correlation.
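As an illustration only, the two sample expressions described above (the product-moment form and the standard-score form) can be sketched in Python. The function names `pearson_r` and `pearson_r_zscores` are hypothetical and do not come from the patent; the standard-score form here assumes the sample standard deviation uses the n-1 divisor.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson coefficient, product-moment form."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    if denominator == 0:
        # A constant sample has zero standard deviation, so r is undefined.
        raise ValueError("correlation is undefined for a constant sample")
    return numerator / denominator


def pearson_r_zscores(xs, ys):
    """Same coefficient computed from standard scores (n-1 divisor assumed)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)
```

Both functions should agree up to floating-point rounding on any valid input, which is a quick way to check that the two expressions are equivalent.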
Summary of the invention
The present invention deliberately does not use the semantically complete sentence as the smallest unit of duplicate checking. This solves the problem that, with previous conventional paper duplicate-checking methods, the repetition rate can easily be lowered by techniques such as changing the word order. The data are analyzed with a mature mathematical model from statistics. The present invention aims to provide a simple, accurate and reliable paper duplicate-checking method.
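To illustrate why counting short segments resists word-order changes, the following minimal sketch compares single-character counts of a text and a reordered copy. The helper names `char_counts` and `same_profile` and the English sample sentences are assumptions made for this illustration; the patent itself works with Chinese characters.

```python
from collections import Counter


def char_counts(text):
    """Count each non-whitespace character; order of appearance is ignored."""
    return Counter(ch for ch in text if not ch.isspace())


def same_profile(text_a, text_b):
    """True when both texts contain exactly the same characters equally often."""
    return char_counts(text_a) == char_counts(text_b)


# Hypothetical sample: the second text only rearranges the words of the first.
original = "the quick brown fox jumps over the lazy dog"
reordered = "over the lazy dog the quick brown fox jumps"
print(same_profile(original, reordered))
```

Because reordering leaves every character count unchanged, the two count arrays are identical, so a correlation computed on them is not lowered by the rearrangement.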
Specific embodiment
First, segmentation sites are selected at random, and the paper under test is decomposed into segments of equal or unequal length. Note that the segments should not be too long, so that the sensitivity of detection is not impaired; they are generally no longer than 5 characters, and sensitivity is highest when the full text is decomposed into single characters. Next, the total number of occurrences of each resulting segment in the paper is counted, and the counts are arranged in a fixed order to give an array. The reference paper in the database is then decomposed at the same sites, and its segments are counted by number of occurrences. A segment that occurs in the paper under test but not in the reference paper is counted as 0; a segment that does not occur in the paper under test but occurs in the reference paper is not counted. The resulting data are arranged in the same order as for the paper under test, which gives a second array. Finally, the Pearson test is applied to the two arrays to obtain their correlation coefficient, which is the degree of repetition of the two papers.
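The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: segments are cut at fixed offsets of `size` characters (one possible reading of "the same sites"), the "fixed order" is taken to be sorted segment order, and all function names are hypothetical. The source does not say what to do when one array is constant, in which case the coefficient is undefined; this sketch returns `None`.

```python
import math
from collections import Counter


def split_fixed(text, size):
    """Cut text into consecutive segments of `size` characters (last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def aligned_arrays(test_text, ref_text, size=1):
    """Build the two count arrays, ordered by the segments of the paper under test."""
    test_counts = Counter(split_fixed(test_text, size))
    ref_counts = Counter(split_fixed(ref_text, size))
    segments = sorted(test_counts)  # assumed ordering; any fixed order works
    xs = [test_counts[s] for s in segments]
    # Segments missing from the reference count as 0 (Counter returns 0);
    # segments that occur only in the reference are never looked up.
    ys = [ref_counts[s] for s in segments]
    return segments, xs, ys


def pearson_r(xs, ys):
    """Sample Pearson coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def repetition_degree(test_text, ref_text, size=1):
    """Correlation of the two aligned count arrays."""
    _, xs, ys = aligned_arrays(test_text, ref_text, size)
    return pearson_r(xs, ys)
```

Ordering both arrays by the segments of the paper under test keeps the two arrays the same length, which the Pearson coefficient requires.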
The method is described in detail in the following example:
The most sensitive approach, splitting the full text character by character, is selected. The paper under test is decomposed, the number of occurrences of each resulting Chinese character is counted, and the data are sorted by the initial letters of the decomposed characters to give an array. The reference paper in the database is then decomposed, the number of occurrences of each resulting Chinese character is counted, and the data are sorted in the same order as the data of the paper under test. A character that occurs in the paper under test but not in the reference paper is counted as 0; a character that does not occur in the paper under test but occurs in the reference paper is not counted. This gives a second array. Finally, the Pearson test is applied to the two arrays to obtain the degree of repetition of the two papers. A correlation coefficient of 0.8-1.0 indicates a very high likelihood of plagiarism, 0.6-0.8 a high likelihood, 0.4-0.6 some suspicion of plagiarism, 0.2-0.4 a low likelihood, and 0.0-0.2 a very low likelihood. These operations are repeated until every reference paper in the database has been checked.
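The interpretation bands in this example can be expressed as a small lookup. The sketch below is illustrative only: the names `band` and `screen` are hypothetical, each band's lower bound is assumed to be inclusive (the source ranges share their endpoints), and coefficients below 0.2, including negative ones that the source does not discuss, are assumed to fall into the lowest band.

```python
import math

# (lower bound, label) pairs, checked from the highest band down.
BANDS = [
    (0.8, "very high likelihood of plagiarism"),
    (0.6, "high likelihood of plagiarism"),
    (0.4, "some suspicion of plagiarism"),
    (0.2, "low likelihood of plagiarism"),
]
LOWEST = "very low likelihood of plagiarism"


def band(r):
    """Map a correlation coefficient to the interpretation band it falls in."""
    if math.isnan(r) or abs(r) > 1.0 + 1e-9:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    for lower, label in BANDS:
        if r >= lower:
            return label
    return LOWEST  # assumption: anything below 0.2, including negatives


def screen(coefficients):
    """Label every reference paper; `coefficients` maps a reference id to r.

    Returns (reference id, r, label) tuples, highest coefficient first.
    """
    rows = [(ref_id, r, band(r)) for ref_id, r in coefficients.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `screen` once per paper under test, with one coefficient per reference paper, corresponds to repeating the check until the whole database has been covered.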
Claims (10)
1. A method for detecting the degree of repetition of articles based on article segmentation and the Pearson test, characterized in that: the article under test is divided into several segments by a specific number of characters or at specific identified sites, and the total number of times each segment occurs in the article is counted; with the article segments on the horizontal axis and the total number of occurrences of each segment in the article on the vertical axis, a curve is drawn, which is called the characteristic curve of that article; other articles are segmented in the same manner at the same segmentation sites, the number of occurrences of each segment in each article is counted, and a curve is drawn with the article segments on the horizontal axis and the total number of occurrences on the vertical axis, which is called the characteristic curve of that article; the curve of the article under test and the curves of the other articles are subjected to the Pearson test, and the result serves as the basis for determining the degree of repetition of the articles.
2. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the article under test is divided into multiple segments, the length of the segments affects the sensitivity of detection, and the shorter the segments, the higher the sensitivity of detection.
3. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, when the article under test is divided into multiple segments, the segment length is not fixed.
4. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, after the article under test is divided into multiple segments, the number of times each segment occurs in the text is counted.
5. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that, after the number of times each segment occurs in the text is counted, a curve is drawn with the article segments on the horizontal axis and the total number of occurrences of each segment in the article on the vertical axis, and this curve is called the characteristic curve of the article.
6. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article can be drawn by computer.
7. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer does not necessarily have to be presented in graphical form.
8. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer is presented in the form of a mathematical expression or an array.
9. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the characteristic curves drawn for several articles are subjected to the Pearson test.
10. The method for detecting the degree of repetition of articles based on article segmentation and the Pearson test according to claim 1, characterized in that the degree of repetition of the articles is judged according to the result of the Pearson test.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811511826.9A CN109726270B (en) | 2018-12-11 | 2018-12-11 | Article repetition degree detection method based on article segmentation and Pearson test |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811511826.9A CN109726270B (en) | 2018-12-11 | 2018-12-11 | Article repetition degree detection method based on article segmentation and Pearson test |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109726270A (en) | 2019-05-07 |
CN109726270B (en) | 2022-11-25 |
Family
ID=66295608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811511826.9A Expired - Fee Related CN109726270B (en) | 2018-12-11 | 2018-12-11 | Article repetition degree detection method based on article segmentation and Pearson test |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109726270B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114580836A (en) * | 2022-01-18 | 2022-06-03 | 广东电力通信科技有限公司 | A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000046739A1 (en) * | 1999-02-08 | 2000-08-10 | Zelson Amy S | Fingerprint analysis method |
CN106227897A (en) * | 2016-08-31 | 2016-12-14 | 青海民族大学 | A kind of Tibetan language paper copy detection method based on Tibetan language sentence level and system |
- 2018-12-11: CN application CN201811511826.9A, patent CN109726270B (not active, Expired - Fee Related)
Non-Patent Citations (1)
Title |
---|
Jin Bo et al.: "Copy detection algorithm based on document structure similarity", Journal of Dalian University of Technology * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114580836A (en) * | 2022-01-18 | 2022-06-03 | 广东电力通信科技有限公司 | A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution |
CN114580836B (en) * | 2022-01-18 | 2025-08-29 | 广东电力通信科技有限公司 | A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution |
Also Published As
Publication number | Publication date |
---|---|
CN109726270B (en) | 2022-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tong et al. | Mining frequent itemsets over uncertain databases | |
Jehad et al. | Fake news classification using random forest and decision tree (j48) | |
Gustafsson et al. | Comparison and validation of community structures in complex networks | |
Prokić et al. | Recognising groups among dialects | |
CN105975518A (en) | Information entropy-based expected cross entropy feature selection text classification system and method | |
CN106681985A (en) | Establishment system of multi-field dictionaries based on theme automatic matching | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
Lambert et al. | Axor and Monit: two new polythetic‐divisive strategies for hierarchical classification | |
CN109726270A (en) | A method for detecting the degree of repetition of articles based on article segmentation and Pearson test | |
Subeno et al. | Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collapsed Gibbs Sampling Inference Process. | |
Zhiqiang et al. | Measuring semantic similarity between words using wikipedia | |
Kuncheva et al. | Pca feature extraction for change detection in multidimensional unlabelled streaming data | |
Lamprier et al. | On evaluation methodologies for text segmentation algorithms | |
KR101585644B1 (en) | Apparatus, method and computer program for document classification using term association analysis | |
CN112101468B (en) | A method for determining abnormal sequences in sequence combinations | |
CN114783446B (en) | Voice recognition method and system based on contrast predictive coding | |
CN103336806B (en) | A kind of key word sort method that the inherent of spacing and external pattern entropy difference occur based on word | |
Lee | Temporal correlation analysis of programming language popularity | |
Lang et al. | Graph-based seed set expansion for relation extraction using random walk hitting times | |
CN113792141A (en) | Feature selection method based on covariance measure factor | |
CN106611057B (en) | The text classification feature selection approach of importance weighting | |
Bacon | A maximum likelihood approach to correlational outlier identification | |
Zhong | Hot topic discovery in online community using topic labels and hot features | |
Pritsos et al. | The impact of noise in web genre identification | |
Almutairi et al. | Developing Arabic Sentiment Analysis for Saudi Arabia's Telecommunication Companies using Deep and Ensemble Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20221125 |