
CN112417305A - Website sensitive word detection system and method - Google Patents


Info

Publication number
CN112417305A
Authority
CN
China
Prior art keywords
information
website
module
sensitive
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011454305.1A
Other languages
Chinese (zh)
Inventor
王亚军
于航海
冯耀麟
尚国财
柳乐
赵雄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lujin Technology Co ltd
Beijing Gctech Technology Co ltd
Original Assignee
Beijing Lujin Technology Co ltd
Beijing Gctech Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-10
Filing date
2020-12-10
Publication date
2021-02-26
Application filed by Beijing Lujin Technology Co ltd, Beijing Gctech Technology Co ltd
Priority to CN202011454305.1A
Publication of CN112417305A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a website sensitive word detection system and method, belonging to the technical field of website information maintenance. The system comprises an input module, an image conversion module, a detection module, a path display module and an information display module. The input module receives an input domain name, an IP address and a title containing sensitive words. The image conversion module receives the title output by the input module and converts it into images rendered in different fonts. The detection module checks the website identified by the domain name and IP address for content matching the title or the images. The path display module displays the file paths on the website where the title or a matching picture was found. The information display module displays the domain name and IP address of the website containing sensitive information, together with the matching title or picture information found on that website.

Description

Website sensitive word detection system and method
Technical Field
The invention relates to the technical field of website information maintenance, in particular to a website sensitive word detection system and method.
Background
Ambiguous sensitive words are appearing on more and more websites. On most websites, sensitive words generally refer to words with a sensitive political tendency (or an anti-party tendency), a violent tendency or unhealthy overtones, as well as uncivilized language; some websites also define special sensitive words that apply only to that website, according to its actual circumstances. Many websites are blocked because of sensitive words, which causes economic loss. In other cases, hackers insert sensitive words through pop-up windows, and many visitors browsing the website see those words when an action triggers the pop-up. This can stir up public opinion or disturb social order, and the resulting legal liability is serious.
In the prior art, reference may be made to Chinese patent application publication No. CN110750981A, which discloses a high-accuracy website sensitive word detection method based on machine learning. In that method, the documents to be detected are first rule-matched against a sensitive word database to obtain a document set containing sensitive words; training data is processed and learned to produce a machine learning model; and the document set is then fed into the model to obtain the website sensitive word detection result. Because rule matching is applied to the crawled website pages before the machine learning analysis, the volume of data the model must predict is reduced, which improves detection speed and accuracy. Finally, the likelihood that a page contains sensitive words is obtained by statistical calculation.
The above prior art has the following drawback: although methods for detecting and intercepting website sensitive words already exist, sensitive words are now sent maliciously in many different ways, and it is difficult to remove all of them by relying on existing text recognition alone.
Disclosure of Invention
The invention aims to provide a website sensitive word detection method that can detect sensitive words sent in formats such as pictures, effectively improving the accuracy of sensitive word removal and widening the range of sensitive word detection.
The technical purpose of the invention is achieved by the following technical scheme:
a website sensitive word detection method comprises the following steps:
step one, inputting the domain name and IP address to be detected, and entering a title containing the sensitive words to be detected;
step two, converting the title into images rendered in different fonts;
step three, detecting the title and the images in the website corresponding to the domain name and IP address;
step four, displaying the file paths on the website that contain the detected title or picture;
and step five, displaying the domain name and IP address having sensitive information, together with the information under that domain name that contains the title or picture.
With this scheme, once the user has entered the sensitive words to be detected and the detection range, the method automatically displays the information containing the sensitive words and the corresponding file paths, so that the user can conveniently handle that information. During detection, the picture-format versions of the sensitive words make it possible to find pictures, animated images, videos and similar content maliciously uploaded by others, which effectively widens the detection range and improves the accuracy of sensitive word removal.
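The font-based conversion of step two can be illustrated with a toy sketch. The patent does not say how the title is rasterized; a real system would presumably use a font rendering library. Here two hypothetical 3x3 glyph tables (`FONT_THIN`, `FONT_BOLD`) stand in for real fonts, and all function names are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical 3x3 glyph tables standing in for two real fonts.
Glyphs = Dict[str, List[str]]

FONT_THIN: Glyphs = {
    "A": ["010", "111", "101"],
    "B": ["110", "111", "110"],
}
FONT_BOLD: Glyphs = {
    "A": ["111", "111", "101"],
    "B": ["111", "111", "111"],
}

GLYPH_HEIGHT = 3


def render_title(title: str, font: Glyphs) -> List[List[int]]:
    """Render the title as a binary bitmap (rows of 0/1 pixels)."""
    rows: List[List[int]] = [[] for _ in range(GLYPH_HEIGHT)]
    for ch in title:
        glyph = font[ch]  # raises KeyError if the font lacks the character
        for y in range(GLYPH_HEIGHT):
            rows[y].extend(int(pixel) for pixel in glyph[y])
            rows[y].append(0)  # one blank spacing column after each glyph
    return rows


def render_in_fonts(title: str, fonts: Dict[str, Glyphs]) -> Dict[str, List[List[int]]]:
    """Produce one bitmap of the title per font, keyed by font name."""
    return {name: render_title(title, glyphs) for name, glyphs in fonts.items()}


images = render_in_fonts("AB", {"thin": FONT_THIN, "bold": FONT_BOLD})
```

Rendering the same title in several fonts yields several template pictures, so that a picture on the website can be compared against each of them.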
The invention is further configured such that the method further comprises:
adding a blank character between the characters of the sensitive word;
and, in step three, detecting the sensitive word with the added blank characters in the website corresponding to the domain name and IP address, wherein each blank character matches any character during the search.
With this scheme, the detection range of sensitive words is further expanded, and people are prevented from evading detection by splitting sensitive words with simple characters such as spaces.
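One way to read the blank-character search is as a wildcard pattern: a placeholder is inserted between adjacent characters of the sensitive word and is allowed to match any single character. The sketch below expresses that with Python's standard `re` module. The function names are hypothetical, and treating the placeholder as exactly one arbitrary character is an assumption; the patent does not specify how many characters a blank may stand for.

```python
import re
from typing import List


def build_gapped_pattern(word: str) -> "re.Pattern[str]":
    """Insert a one-character wildcard between adjacent characters of the word."""
    if not word:
        raise ValueError("sensitive word must not be empty")
    escaped = [re.escape(ch) for ch in word]
    # "." is the blank character; DOTALL lets it match a newline as well.
    return re.compile(".".join(escaped), re.DOTALL)


def find_suspected(text: str, word: str) -> List[str]:
    """Return every substring of text that matches the gapped sensitive word."""
    return [m.group(0) for m in build_gapped_pattern(word).finditer(text)]


hits = find_suspected("x b*a*d y b a d", "bad")
```

Because the wildcard requires exactly one separating character, the unsplit word itself is not matched by this pattern; in the scheme described above, the original title is detected separately.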
The invention is further configured such that the method further comprises:
step six, generating data from the displayed information, creating a document, and storing the data in the specified document.
With this scheme, the user can view the information in the document at any time, so that when there is no time to handle the information containing sensitive words immediately, it can still be handled later at a convenient time.
The invention is further configured such that the method further comprises:
step seven, after the document is opened, selecting any entry consisting of a domain name with sensitive information, its IP address and the information whose title contains sensitive words, and automatically querying the corresponding file path according to the selected entry.
With this scheme, the user learns the file path directly upon selecting the information containing sensitive words, which makes that information convenient to handle.
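Steps six and seven can be sketched as saving the displayed findings to a document and later looking up the file path for a selected entry. The patent does not specify a document format; JSON, the record keys (`domain`, `ip`, `matched_info`, `file_path`) and the function names below are all assumptions chosen for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List

Record = Dict[str, str]


def save_findings(findings: List[Record], document_path: str) -> None:
    """Step six: store the displayed findings in the specified document."""
    text = json.dumps(findings, ensure_ascii=False, indent=2)
    Path(document_path).write_text(text, encoding="utf-8")


def lookup_file_paths(document_path: str, domain: str, ip: str, matched_info: str) -> List[str]:
    """Step seven: return the file paths recorded for the selected entry."""
    records = json.loads(Path(document_path).read_text(encoding="utf-8"))
    return [
        r["file_path"]
        for r in records
        if r["domain"] == domain and r["ip"] == ip and r["matched_info"] == matched_info
    ]
```

A caller would pass the selected domain name, IP address and matched information, and receive the stored file path or paths without searching the website again.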
The invention is further configured such that the method further comprises:
in step three, setting the fuzziness of picture detection before the pictures are detected, wherein the higher the fuzziness, the wider the range of pictures detected.
With this scheme, the user can set the detection fuzziness according to actual conditions and thereby control the retrieval range for pictures and similar content. This improves detection accuracy: information containing sensitive words is detected while false detection of normal information is avoided as far as possible.
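The effect of the fuzziness (blurring degree) setting can be illustrated with a small sketch. The patent does not describe how pictures are compared; here, as an assumption, pictures are equal-sized binary bitmaps, similarity is the fraction of differing pixels, and a candidate is reported when that fraction does not exceed the fuzziness. All names are hypothetical.

```python
from typing import Dict, List, Sequence

Bitmap = Sequence[Sequence[int]]


def mismatch_ratio(a: Bitmap, b: Bitmap) -> float:
    """Fraction of pixels that differ between two equal-sized bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def match_pictures(template: Bitmap, candidates: Dict[str, Bitmap], fuzziness: float) -> List[str]:
    """Report candidates whose mismatch ratio is within the fuzziness tolerance."""
    return [
        name
        for name, bitmap in candidates.items()
        if mismatch_ratio(template, bitmap) <= fuzziness
    ]


template = [[1, 0], [0, 1]]
candidates = {
    "exact": [[1, 0], [0, 1]],
    "near": [[1, 0], [0, 0]],
    "far": [[0, 1], [1, 0]],
}
strict = match_pictures(template, candidates, fuzziness=0.0)
loose = match_pictures(template, candidates, fuzziness=0.3)
```

Raising the fuzziness from 0.0 to 0.3 widens the result from the exact copy alone to the exact copy plus the near match, which mirrors the statement that a higher fuzziness yields a wider range of detected pictures, at the cost of more false positives.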
A further aim of the invention is to provide a website sensitive word detection system that can detect sensitive words sent in formats such as pictures, effectively improving the accuracy of sensitive word removal and widening the range of sensitive word detection.
This technical purpose is achieved by the following technical scheme:
a website sensitive word detection system comprises an input module, an image conversion module, a detection module, a path display module and an information display module;
the input module receives an input domain name, an IP address and a title containing sensitive words, and outputs the domain name, the IP address and the title;
the image conversion module receives the title output by the input module, converts the title into images rendered in different fonts, and outputs the images;
the detection module receives the domain name, IP address and title output by the input module and the pictures output by the image conversion module, locates the website according to the received domain name and IP address, detects information on that website that is the same as or similar to the received title and pictures, transmits the file paths containing the detected title or picture to the path display module, and transmits the domain name and IP address of the website with sensitive information, together with the matching title or picture information, to the information display module;
the path display module displays the received file paths on the website that contain the title or picture with sensitive information;
and the information display module displays the received domain name and IP address of the website with sensitive information, together with the matching title or picture information found on the website.
With this scheme, once the user has entered the sensitive words to be detected and the detection range into the system, the system automatically detects the sensitive words within the set range, so that the user can conveniently handle the information containing them. During detection, the picture-format versions of the sensitive words make it possible to find pictures, animated images, videos and similar content maliciously uploaded by others, which effectively widens the detection range and improves the accuracy of sensitive word removal.
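The data flow from the detection module to the two display modules can be sketched as follows. This is a minimal illustration limited to text titles; the website is represented by a hypothetical in-memory mapping from file path to page text rather than by real crawling, and the `Finding` type and function names are assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    domain: str
    ip: str
    file_path: str
    matched_info: str


def detect_titles(domain: str, ip: str, title: str, site_files: Dict[str, str]) -> List[Finding]:
    """Detection module: find lines containing the title in each file of the site."""
    findings = []
    for path, text in site_files.items():
        for line in text.splitlines():
            if title in line:
                findings.append(Finding(domain, ip, path, line.strip()))
    return findings


def path_display(findings: List[Finding]) -> List[str]:
    """Path display module: the distinct file paths that contain a match."""
    return sorted({f.file_path for f in findings})


def info_display(findings: List[Finding]) -> List[Tuple[str, str, str]]:
    """Information display module: domain, IP address and the matched information."""
    return [(f.domain, f.ip, f.matched_info) for f in findings]


site = {
    "/index.html": "Welcome\nBuy forbidden goods here",
    "/about.html": "About us",
}
found = detect_titles("example.com", "192.0.2.1", "forbidden", site)
```

Keeping the file path and the matched information in one record lets both display modules be derived from the same detection result.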
The invention is further configured such that the system further comprises a word processing module, which receives the title containing sensitive words output by the input module, adds a blank character between the characters of the sensitive word to form a suspected sensitive word, and transmits the suspected sensitive word to the detection module;
the detection module detects information on the located website that matches the suspected sensitive word, wherein each blank character matches any character during detection.
With this scheme, the word processing module further expands the detection range of sensitive words and prevents people from evading detection by splitting sensitive words with simple characters such as spaces.
The invention is further configured such that the system further comprises a storage module, which receives input information, creates a document and stores the received information in the document; the path display module transmits the file paths to the storage module for storage, and the information display module transmits its information to the storage module for storage.
With this scheme, the user can view the information in the document at any time through the storage module, so that when there is no time to handle the information containing sensitive words immediately, it can still be handled later at a convenient time.
The invention is further configured such that the system further comprises an automatic search module, which retrieves the document stored in the storage module according to a received instruction and selects information stored in the document, and then automatically queries and displays the corresponding file path according to the selected information.
With this scheme, when the user opens the document and selects information containing sensitive words, the automatic search module automatically finds the corresponding file path and displays it to the user, which makes the information convenient to handle.
The invention is further configured such that the detection module receives externally input information and adjusts the fuzziness of picture detection according to that information, wherein the higher the fuzziness, the wider the range of pictures detected.
With this scheme, the user can set the detection fuzziness at the detection module according to actual conditions. Controlling the retrieval range for pictures and similar content in this way improves detection accuracy: information containing sensitive words is detected without falsely flagging too much normal information.
In conclusion, the invention has the following beneficial effects:
1. when the method and system detect sensitive words, the picture-format versions of the sensitive words can be used to find pictures, animated images, videos and similar content maliciously uploaded by others, which effectively widens the detection range and improves the accuracy of sensitive word removal;
2. the word processing module further expands the detection range of sensitive words and prevents people from evading detection by splitting sensitive words with simple characters such as spaces.
Drawings
Fig. 1 is an overall system block diagram of the second embodiment.
In the figure: 1. input module; 2. image conversion module; 3. word processing module; 4. detection module; 5. path display module; 6. information display module; 7. storage module; 8. automatic search module.
Detailed Description
Embodiment one: a website sensitive word detection method, comprising the following specific steps:
Step one: input the domain name and IP address to be detected, and enter a title containing the sensitive words to be detected.
Step two: convert the title into pictures rendered in different fonts, and add a blank character between the characters of the sensitive word.
Step three: the user sets the fuzziness of picture detection; the higher the fuzziness, the wider the range of pictures detected. Once the setting is complete, detect the title, the pictures and the sensitive word with added blank characters in the website corresponding to the domain name and IP address, where each blank character matches any character during the search.
Step four: display the file paths on the website that contain the detected title or picture.
Step five: display the domain name and IP address having sensitive information, together with the information under that domain name that contains the title or picture.
Step six: generate data from the displayed information, create a document, and store the data in the specified document. The user can view the information in the document at any time, so that when there is no time to handle the information containing sensitive words immediately, it can still be handled later at a convenient time.
Step seven: after the document is opened, select any entry consisting of a domain name with sensitive information, its IP address and the information whose title contains sensitive words; the corresponding file path is queried automatically according to the selected entry. The user learns the file path directly upon selecting the information containing sensitive words, which makes that information convenient to handle.
Once the user has entered the sensitive words to be detected and the detection range, the method automatically displays the information containing the sensitive words and the corresponding file paths, so that the user can conveniently handle that information. During detection, the picture-format versions of the sensitive words make it possible to find pictures, animated images, videos and similar content maliciously uploaded by others, and the added blank characters prevent sensitive words from evading detection by being split with simple characters such as spaces. The detection range of sensitive words is thus effectively widened, and the accuracy of sensitive word removal is improved.
Embodiment two: a website sensitive word detection system, as shown in Fig. 1, comprises an input module 1, an image conversion module 2, a word processing module 3, a detection module 4, a path display module 5, an information display module 6, a storage module 7 and an automatic search module 8.
As shown in Fig. 1, the input module 1 receives an input domain name, an IP address and a title containing sensitive words, and outputs the domain name, the IP address and the title. The image conversion module 2 receives the title output by the input module 1, converts it into pictures rendered in different fonts, and outputs the pictures. The word processing module 3 receives the title output by the input module 1, adds a blank character between the characters of the sensitive word to form a suspected sensitive word, and transmits the suspected sensitive word to the detection module 4.
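A minimal sketch of the input module might bundle the three inputs into one request object and reject malformed values before they reach the other modules. Validation is not described in the patent; it is added here as an assumption, using the standard `ipaddress` module, and the `DetectionRequest` type and `make_request` function are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRequest:
    domain: str
    ip: str
    title: str


def make_request(domain: str, ip: str, title: str) -> DetectionRequest:
    """Normalize the inputs and raise ValueError if any of them is unusable."""
    domain = domain.strip().lower()
    title = title.strip()
    if not domain:
        raise ValueError("domain name must not be empty")
    if not title:
        raise ValueError("title must not be empty")
    # ipaddress.ip_address raises ValueError for malformed IPv4/IPv6 addresses.
    normalized_ip = str(ipaddress.ip_address(ip.strip()))
    return DetectionRequest(domain, normalized_ip, title)


request = make_request(" Example.COM ", "192.0.2.1", " bad word ")
```

The same request object can then be handed to the image conversion module, the word processing module and the detection module, so that all three work from identical inputs.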
As shown in Fig. 1, the detection module 4 receives externally input information and adjusts the fuzziness of picture detection according to that information; the higher the fuzziness, the wider the range of pictures detected. The user can set the detection fuzziness at the detection module 4 according to actual conditions. Controlling the retrieval range for pictures and similar content in this way improves detection accuracy: information containing sensitive words is detected while falsely flagging normal information is avoided as far as possible.
As shown in Fig. 1, the detection module 4 receives the domain name, IP address and title output by the input module 1, the pictures output by the image conversion module 2, and the suspected sensitive word output by the word processing module 3. The detection module 4 locates the website according to the received domain name and IP address, and detects information on that website that is the same as or similar to the received title, pictures and suspected sensitive word, where each blank character matches any character during detection. The detection module 4 transmits the file paths containing the detected title or picture to the path display module 5, and transmits the domain name and IP address of the website with sensitive information, together with the matching title or picture information, to the information display module 6.
As shown in Fig. 1, the path display module 5 displays the received file paths on the website that contain the title or picture with sensitive information. The information display module 6 displays the received domain name and IP address of the website with sensitive information, together with the matching title or picture information found on the website. Once the user has entered the sensitive words to be detected and the detection range into the system, the system automatically displays the information containing the sensitive words and the corresponding file paths, so that the user can conveniently handle that information. During detection, the picture-format versions of the sensitive words make it possible to find pictures, animated images, videos and similar content maliciously uploaded by others, and the added blank characters prevent sensitive words from evading detection by being split with simple characters such as spaces. The system thus effectively widens the detection range of sensitive words and improves the accuracy of their removal.
As shown in Fig. 1, the storage module 7 receives input information, creates a document and stores the received information in the document; the path display module 5 transmits the file paths to the storage module 7 for storage, and the information display module 6 transmits its information to the storage module 7 for storage. The automatic search module 8 retrieves the document stored in the storage module 7 according to a received instruction and selects information stored in the document, then automatically queries and displays the corresponding file path according to the selected information. When the user opens the document and selects information containing sensitive words, the automatic search module 8 automatically finds the corresponding file path and displays it to the user, which makes the information convenient to handle.
The embodiments described above are preferred embodiments of the invention and do not limit its scope of protection; all equivalent changes made according to the structure, shape and principle of the invention therefore fall within the scope of protection of the invention.

Claims (10)

1. A website sensitive word detection method is characterized by comprising the following steps:
firstly, inputting a domain name and an IP address to be detected, and filling a title with sensitive words to be detected;
secondly, converting the title into an image according to different fonts;
thirdly, detecting a title and a picture in a website corresponding to the domain name and the IP address;
fourthly, displaying the detected file path containing the title or the picture of the website;
and fifthly, displaying the domain name with the sensitive information, the IP address and the information with the title or the picture in the domain name.
2. The website sensitive word detection method according to claim 1, further comprising:
adding a blank character between the characters of the sensitive word;
and, in step three, detecting the sensitive word with the added blank characters in the website corresponding to the domain name and the IP address, wherein each blank character matches any character during the search.
3. The website sensitive word detection method according to claim 1, further comprising:
step six, generating data from the displayed information, creating a document, and storing the data in the specified document.
4. The website sensitive word detection method according to claim 3, further comprising:
step seven, after the document is opened, selecting any entry consisting of a domain name with sensitive information, its IP address and the information whose title contains sensitive words, and automatically querying the corresponding file path according to the selected entry.
5. The website sensitive word detection method according to claim 1, further comprising:
in step three, setting the fuzziness of picture detection before the pictures are detected, wherein the higher the fuzziness, the wider the range of pictures detected.
6. A website sensitive word detection system is characterized in that: the system comprises an input module (1), an image conversion module (2), a detection module (4), a path display module (5) and an information display module (6);
the input module (1) receives an input domain name, an IP address and a title with sensitive words and outputs the domain name, the IP address and the title with the sensitive words;
the image conversion module (2) receives the title with the sensitive words output by the input module (1), and the image conversion module (2) converts the title into an image according to different fonts and outputs the image;
the detection module (4) receives a domain name, an IP address and a title output by the input module (1) and a picture output by the image conversion module (2), the detection module (4) locks a website according to the received domain name and the IP address and detects information which is the same as or similar to the received title and picture on the locked website, the detection module (4) transmits a file path containing the title or the picture of the detected website with sensitive information to the path display module (5), and the detection module (4) transmits the domain name and the IP address of the detected website with sensitive information and the information with the title or the picture in the website to the information display module (6);
the path display module (5) displays the received file path containing the title or the picture of the website with the sensitive information;
and the information display module (6) displays the received domain name and IP address of the website with the sensitive information and the information with the title or picture in the website.
7. The website sensitive word detection system of claim 6, further comprising: the word processing module (3) receives the title with the sensitive words output by the input module (1), and the word processing module (3) adds blank characters among all the characters in the sensitive words to form suspected sensitive words and transmits the suspected sensitive words to the detection module (4);
the detection module (4) detects the same information as the suspected sensitive words on the locked website, and blank characters are any characters during detection.
8. The website sensitive word detection system of claim 6, further comprising: the storage module (7) receives input information and establishes a document, the received information is stored in the document, the path display module (5) transmits a file path to the storage module (7) for storage, and the information display module (6) transmits the information to the storage module (7) for storage.
9. The website sensitive word detection system of claim 8, further comprising: the automatic searching module (8) calls the documents stored in the storage module (7) according to the received instructions and selects the information stored in the documents, and the automatic searching module (8) automatically inquires and displays the corresponding file path according to the selected information.
10. The website sensitive word detection system of claim 6, wherein: the detection module (4) receives information input from the outside and adjusts the blurring degree of the detected picture according to the input information, and the higher the blurring degree is, the larger the detected picture range is.
CN202011454305.1A 2020-12-10 2020-12-10 Website sensitive word detection system and method Pending CN112417305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011454305.1A CN112417305A (en) 2020-12-10 2020-12-10 Website sensitive word detection system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011454305.1A CN112417305A (en) 2020-12-10 2020-12-10 Website sensitive word detection system and method

Publications (1)

Publication Number Publication Date
CN112417305A true CN112417305A (en) 2021-02-26

Family

ID=74775907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011454305.1A Pending CN112417305A (en) 2020-12-10 2020-12-10 Website sensitive word detection system and method

Country Status (1)

Country Link
CN (1) CN112417305A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210226