
CN109726270A - A method for detecting the degree of repetition of articles based on article segmentation and Pearson test - Google Patents

Info

Publication number
CN109726270A
CN109726270A CN201811511826.9A CN201811511826A CN109726270A CN 109726270 A CN109726270 A CN 109726270A CN 201811511826 A CN201811511826 A CN 201811511826A CN 109726270 A CN109726270 A CN 109726270A
Authority
CN
China
Prior art keywords
article
pearson
segmentation
detecting method
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811511826.9A
Other languages
Chinese (zh)
Other versions
CN109726270B (en)
Inventor
徐炜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201811511826.9A
Publication of CN109726270A
Application granted
Publication of CN109726270B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is a method for detecting the degree of repetition between articles based on article segmentation and the Pearson test. The article text is divided into multiple fragments, the number of occurrences of each fragment is counted, and the counts are arranged in a fixed order. Articles from a database or other sources are processed in the same way. Because the resulting data can be plotted as curves, the Pearson test can be used to measure the correlation between two curves and thereby obtain the degree of repetition of the two articles. A correlation coefficient of 0.8-1.0 indicates high repetition, 0.6-0.8 fairly high repetition, 0.4-0.6 moderate repetition, 0.2-0.4 low repetition, and 0.0-0.2 very low or no repetition. Combined with computer technology, the technique can be applied to paper repetition detection (commonly called duplicate checking). It makes it harder to lower the repetition rate artificially (commonly called "weight reduction") and is therefore of great significance for combating plagiarism in papers.

Description

A method for detecting the degree of repetition of articles based on article segmentation and Pearson test
Technical field
The invention relates to a method for detecting the degree of repetition of papers.
Background art
The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a linear correlation coefficient. It is defined as the covariance of two variables divided by the product of their standard deviations:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
The formula above defines the population correlation coefficient, commonly denoted by the lowercase Greek letter ρ (rho). Estimating the covariance and the standard deviations from a sample gives the sample Pearson correlation coefficient, commonly denoted by the lowercase letter r:
$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
r can also be estimated as the mean of the products of the standard scores of the sample points, which gives an expression equivalent to the one above:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{Y_i-\bar{Y}}{\sigma_Y}\right)$$
where $\frac{X_i-\bar{X}}{\sigma_X}$, $\bar{X}$ and $\sigma_X$ are, respectively, the standard score, the sample mean, and the sample standard deviation of the $X_i$ sample.
The Pearson correlation coefficient is a statistic that reflects the degree of linear correlation between two variables; the larger its absolute value, the stronger the correlation. A coefficient of 0.8-1.0 indicates very strong correlation, 0.6-0.8 strong correlation, 0.4-0.6 moderate correlation, 0.2-0.4 weak correlation, and 0.0-0.2 very weak or no correlation.
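The bands above can be expressed as a small lookup. This is only a sketch: the function name `correlation_band` is hypothetical, and because the stated ranges share their endpoints (0.8 belongs to both 0.6-0.8 and 0.8-1.0), the code assumes that a boundary value falls into the upper band.

```python
import math


def correlation_band(r):
    """Map a Pearson coefficient to the strength bands listed in the text."""
    if math.isnan(r):
        raise ValueError("coefficient is undefined (NaN)")
    a = abs(r)  # the bands are stated for the absolute value
    if a > 1.0 + 1e-9:
        raise ValueError("a Pearson coefficient must lie in [-1, 1]")
    # Assumption: a value on a shared boundary (e.g. 0.8) goes to the upper band.
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak or none"
```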
Summary of the invention
The invention deliberately does not use the semantically complete sentence as the minimum unit of duplicate checking. This solves the problem that, with traditional paper-checking methods, the measured repetition rate can easily be lowered by tricks such as rearranging word order. The data are analyzed with a mature mathematical model from statistics. The invention aims to provide a simple, accurate, and reliable method for checking papers for duplication.
Specific embodiment
First, segmentation sites are selected at random and the paper under test is split into fragments of equal or unequal length. Note that the fragments should not be too long, or the sensitivity of detection suffers; they usually should not exceed 5 characters, and sensitivity is highest when the full text is split into single characters. Next, the total number of occurrences of each fragment in the paper is counted, and the counts are arranged in a fixed order to obtain an array. Then the reference paper from the database is split at the same sites and its fragments are counted by number of occurrences. A fragment that appears in the paper under test but not in the reference paper is counted as 0; a fragment that appears in the reference paper but not in the paper under test is not counted. The resulting data are arranged in the same order as for the paper under test, giving a second array. Finally, the Pearson test is applied to the two arrays; the resulting correlation coefficient is taken as the degree of repetition of the two papers.
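The splitting and counting steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: segmentation sites are taken to be character offsets, "a fixed order" is taken to be sorted fragment order, and the names `split_at` and `aligned_counts` are hypothetical.

```python
from collections import Counter


def split_at(text, sites):
    """Cut text into fragments at the given character offsets.

    Assumption: a segmentation site is a character offset; offsets that
    fall outside the text are ignored.
    """
    inner = sorted(s for s in set(sites) if 0 < s < len(text))
    bounds = [0] + inner + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b]]


def aligned_counts(test_fragments, ref_fragments):
    """Build the two count arrays in the same fragment order.

    Fragments missing from the reference are counted as 0; fragments that
    occur only in the reference are left out, as the text describes.
    """
    test_counts = Counter(test_fragments)
    ref_counts = Counter(ref_fragments)
    order = sorted(test_counts)  # assumption: sorted order is the fixed order
    xs = [test_counts[f] for f in order]
    ys = [ref_counts[f] for f in order]  # Counter yields 0 for missing keys
    return order, xs, ys


# Usage: split both texts at the same sites, then align the counts.
sites = [2, 4, 6]
order, xs, ys = aligned_counts(split_at("abababcd", sites),
                               split_at("ababcdef", sites))
```

In this usage example the fragment "ef" occurs only in the reference text, so it does not appear in either array.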
The method is illustrated in detail by the following example:
Character-by-character splitting of the full text, which gives the highest sensitivity, is selected. The paper under test is split, the number of occurrences of each Chinese character is counted, and the counts are sorted by the initial letter of each character to obtain an array. The reference paper from the database is then split in the same way, the number of occurrences of each character is counted, and the counts are arranged in the same order as the data for the paper under test. A character that appears in the paper under test but not in the reference paper is counted as 0; a character that appears in the reference paper but not in the paper under test is not counted. This gives a second array. Finally, the Pearson test is applied to the two arrays to obtain the degree of repetition of the two papers. A correlation coefficient of 0.8-1.0 indicates that plagiarism is very likely, 0.6-0.8 that it is fairly likely, 0.4-0.6 that there is some suspicion of plagiarism, 0.2-0.4 that plagiarism is unlikely, and 0.0-0.2 that it is very unlikely. These operations are repeated until every reference paper in the database has been checked.
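A minimal sketch of this character-level example, looping over a set of reference papers, might look like the following. Several details are assumptions rather than part of the patent: whitespace is skipped, Unicode code-point order stands in for the "initial letter" ordering, the database is modeled as a plain dictionary, and the names `char_count_arrays`, `pearson` and `scan_database` are hypothetical.

```python
import math
from collections import Counter


def char_count_arrays(test_text, ref_text):
    """Per-character count arrays, ordered by the characters of the test text."""
    # Assumption: whitespace is not treated as a fragment.
    test_counts = Counter(ch for ch in test_text if not ch.isspace())
    ref_counts = Counter(ref_text)
    # Assumption: code-point order stands in for the initial-letter ordering.
    order = sorted(test_counts)
    xs = [test_counts[ch] for ch in order]
    ys = [ref_counts[ch] for ch in order]  # 0 when absent from the reference
    return xs, ys


def pearson(xs, ys):
    """Sample Pearson coefficient; NaN when it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else float("nan")


def scan_database(test_text, references):
    """Compare the test text with every reference; return name -> coefficient."""
    results = {}
    for name, ref_text in references.items():
        xs, ys = char_count_arrays(test_text, ref_text)
        results[name] = pearson(xs, ys)
    return results


results = scan_database("aabbbc", {"same": "aabbbc", "other": "cccabz"})
```

Note that the coefficient can be negative or undefined (for example when every character occurs equally often), cases that the 0.0-1.0 bands in the text do not cover; how to treat them is left open here.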

Claims (10)

1. A method for detecting the degree of repetition of articles based on article segmentation and Pearson test, characterized in that: the article under test is divided into several fragments by a specific number of characters or at specific segmentation sites; the total number of times each fragment occurs in the article is counted; a curve is drawn with the article fragments on the horizontal axis and the total number of occurrences of each fragment in the article on the vertical axis, this curve being called the characteristic curve of the article; other articles are segmented in the same manner at the same segmentation sites, the number of times each fragment occurs in each article is counted, and a curve is drawn with the article fragments on the horizontal axis and the total number of occurrences of each fragment on the vertical axis, this curve being called the characteristic curve of that article; and the Pearson test is applied to the curve of the article under test together with the curves of the other articles, the result serving as the basis for determining the degree of repetition of the articles.
2. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the article under test is divided into multiple fragments, the fragment length affects the sensitivity of detection, and the shorter the fragments, the higher the sensitivity of detection.
3. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that, when the article under test is divided into multiple fragments, the fragment length is not fixed.
4. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that, after the article under test is divided into multiple fragments, the number of times each fragment occurs in the text is counted.
5. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that, after the number of times each fragment occurs in the text is counted, a curve is drawn with the article fragments on the horizontal axis and the total number of occurrences of each fragment in the article on the vertical axis, this curve being called the characteristic curve of the article.
6. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the characteristic curve of an article can be drawn by computer.
7. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer need not be presented in graphical form.
8. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the characteristic curve of an article drawn by computer can be presented in the form of a mathematical expression or an array.
9. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the Pearson test is applied to the characteristic curves drawn for several articles.
10. The method for detecting the degree of repetition of articles based on article segmentation and Pearson test according to claim 1, characterized in that the degree of repetition of the articles is judged from the result of the Pearson test.
CN201811511826.9A 2018-12-11 2018-12-11 Article repetition degree detection method based on article segmentation and Pearson test Expired - Fee Related CN109726270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811511826.9A CN109726270B (en) 2018-12-11 2018-12-11 Article repetition degree detection method based on article segmentation and Pearson test

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811511826.9A CN109726270B (en) 2018-12-11 2018-12-11 Article repetition degree detection method based on article segmentation and Pearson test

Publications (2)

Publication Number Publication Date
CN109726270A true CN109726270A (en) 2019-05-07
CN109726270B CN109726270B (en) 2022-11-25

Family

ID=66295608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811511826.9A Expired - Fee Related CN109726270B (en) 2018-12-11 2018-12-11 Article repetition degree detection method based on article segmentation and Pearson test

Country Status (1)

Country Link
CN (1) CN109726270B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580836A (en) * 2022-01-18 2022-06-03 广东电力通信科技有限公司 A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000046739A1 (en) * 1999-02-08 2000-08-10 Zelson Amy S Fingerprint analysis method
CN106227897A (en) * 2016-08-31 2016-12-14 青海民族大学 A kind of Tibetan language paper copy detection method based on Tibetan language sentence level and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
金博等: "基于篇章结构相似度的复制检测算法", 《大连理工大学学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114580836A (en) * 2022-01-18 2022-06-03 广东电力通信科技有限公司 A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution
CN114580836B (en) * 2022-01-18 2025-08-29 广东电力通信科技有限公司 A 5G power infrastructure co-construction and sharing support method and system for power transmission and distribution

Also Published As

Publication number Publication date
CN109726270B (en) 2022-11-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221125
