
CN1588879A - Internet content filtering system and method - Google Patents

Publication number
CN1588879A
CN1588879A CN 200410053683 CN200410053683A CN1588879A CN 1588879 A CN1588879 A CN 1588879A CN 200410053683 CN200410053683 CN 200410053683 CN 200410053683 A CN200410053683 A CN 200410053683A CN 1588879 A CN1588879 A CN 1588879A
Authority
CN
China
Prior art keywords
url
cfa
classification
cams
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200410053683
Other languages
Chinese (zh)
Inventor
薛向阳
石静
郭小鹏
许源
赵泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN 200410053683
Publication of CN1588879A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides an Internet content filtering system and a filtering method. The system framework comprises three parts: a Content Filtering Agent (CFA), a Query Server (QS), and a Content Analysis and Management Server (CAMS). The filtering process is as follows: when a user requests access to a URL, the CFA allows or denies the request according to the blacklist and whitelist set by the user. If the URL is in neither list, the CFA sends a query to the QS. The QS looks up the URL's rating information in its own URL database and returns the result to the CFA, which acts on it. In addition, the QS periodically downloads updated URL rating information from the CAMS. The invention can accurately identify harmful information on the network and proactively prevent Internet users from accessing harmful websites. While filtering harmful information, the system preserves network speed as far as possible.

Description

An Internet content filtering system and filtering method

Technical field

The invention belongs to the field of Internet technology and specifically relates to an Internet content filtering system and a filtering method. It can be used to prevent users from accessing media data of all kinds on the Internet, including text, images, video, audio, graphics and animation.

Background art

The Internet has become an indispensable part of daily life. People live online and use the many services the network provides: online shopping, online banking, e-mail, information lookup and so on. However, alongside these benefits, the negative effects of the Internet are growing, for example teenagers becoming absorbed in adult websites, the spread of harmful information, and crime committed over the Internet.

According to statistics from the US company N2H2, roughly 8% of the world's web pages are pornographic, and a quarter of the requests submitted to search engines every day relate to pornographic content. Pornographic spam has also become one of the most troublesome problems for users: a typical mainstream free mailbox receives 3 to 10 such messages a day, and the senders do not care whether the mailbox owner is an adult.

Compared with Internet pornography, websites and web pages with anti-government or anti-social content are just as numerous. The words "Falun Gong" can be seen everywhere, and so-called "government secrets" are spreading widely. The public is misled and people's lives are disrupted. The scale of the Internet's negative effects and the breadth of harmful content are beyond what people anticipated.

How to safeguard the operational security and information security of the Internet has attracted widespread attention across society. To promote what is beneficial, eliminate what is harmful, and foster the healthy development of China's Internet, the Standing Committee of the National People's Congress adopted a decision on safeguarding Internet security in December 2000. The decision expressly stipulates: "In order to safeguard national security and social stability, anyone who commits any of the following acts, where the act constitutes a crime, shall be investigated for criminal responsibility in accordance with the relevant provisions of the Criminal Law:

(1) Using the Internet to spread rumors, to slander, or to publish or disseminate other harmful information in order to incite subversion of state power or the overthrow of the socialist system, or to incite the splitting of the country or the undermining of national unity;

(2) Stealing or leaking state secrets, intelligence or military secrets through the Internet;

(3) Using the Internet to incite ethnic hatred or ethnic discrimination, or to undermine ethnic unity;

(4) Using the Internet to organize cult organizations or to contact members of cult organizations, thereby undermining the implementation of national laws and administrative regulations."

At present, the Central Committee of the Communist Party of China and the State Council are emphasizing the further strengthening and improvement of the ideological and moral education of minors. In May 2004 the Ministry of Education also required that responsible Internet use and network security knowledge be made an important part of moral education in schools, in order to improve the ability of minors to resist harmful information.

To guard against illegal and harmful information, three technical measures are mainly used. The first is to delete documents from the server: once a hosting provider becomes aware that illegal information exists on its server, it must delete that information. The second is to block transmission: if the owner of the server hosting the illegal information, or the country in which it is located, does not regard the information as illegal or refuses to cooperate, other countries can only block the information and prohibit its retrieval. The third is to develop effective filtering software. Three generations of filtering software have been developed so far: the first generation is known as "blacklist" software, the second as "whitelist" software, and the third is the PICS system.

"Blacklist" software works by blocking URLs that should not be retrieved, whereas "whitelist" software permits retrieval only of URLs that have been explicitly allowed. Blacklist software was widely used in the first generation of filtering software. The best known is Cyber Patrol, which came into use in the early 1990s and can work together with the retrieval software of Internet search providers and online service providers. The software records about 7,000 URLs in 12 major categories of illegal and harmful information (violence/profanity, racism/inappropriate comments about minorities, devil worship, drugs, militant speech/extremism, gambling, etc.). A whitelist works in exactly the opposite way to a blacklist: it first blocks all Internet addresses and then selects the addresses that may be accessed. Because this approach runs counter to the logic of the Internet, its scope of application is very limited.

Another effective technical means of filtering illegal and harmful information is the Platform for Internet Content Selection (PICS), a "neutral labeling" system. The system was developed by Professor Jim Miller of the MIT Laboratory for Computer Science and resembles the V-chip television program selector, which filters sex and violence out of television programs. It was officially released by the World Wide Web Consortium (W3C) in May 1996 and is now widely used. PICS has received broad support from 39 international computer companies, hardware and software manufacturers, search service providers, online service providers, publishers and content providers, and it is built into Internet browsers for users to enable as they choose. The main work of PICS is to classify the content of each web page and attach labels according to the characteristics of the content; software then inspects the labels on a page in order to restrict retrieval of pages with particular content. A label on a web page can be either numeric characters or a code. Labels are embedded in the RFC-822 transmission format and the HTML text format, and can be transmitted together with the document over HTTP.

Today many software companies recognize the business opportunity in network content filtering, and new filtering products keep appearing. "Internet Dad" (网络爸爸), "Meiping Anti-Pornography Expert" (美萍反黄专家), "e Anti-Pornography Software" (e反黄软件) and "Justice Warrior" (正义战士) were among the first anti-pornography products to emerge in China. Among the relatively few web filtering products available domestically, some have distinctive features, such as "MyIE2", "Don't Touch NoPorn!" (别碰NoPorn!), "Filter Net" (过滤网) and "Flower Guardian" (护花使者). Overall, most domestic filtering software filters web pages using simple URL matching and keyword detection; there are essentially no products that filter network media files using genuinely content-based analysis.

By comparison, similar products abroad have developed faster than domestic ones, and their filtering technology is relatively mature. ZyXEL, WebSense, FilterLogix and SurfControl are all widely used network content filtering products, and each maintains a very large, categorized URL database. The techniques they generally use are likewise blacklists, whitelists and keyword matching.

Proventia Web Filter from ISS has the largest and most up-to-date content filtering database in the world. It relies not only on keyword queries and manual website classification but also uses a text and image analysis system to process media content.

The FortiGuard URL database contains more than 5 million URLs together with their category information. On each request, the system first asks the FortiGuard database how the page is categorized and then allows or denies the request according to the policy the customer has set in advance.

WebWacher, from South Korea, is a capable network image filtering product. Aimed at home users, it provides two main functions, controlling time spent online and filtering harmful network content, in order to help children use the Internet appropriately.

In addition, most US companies that develop filtering software also develop anti-spam and anti-virus software. As a result, on the one hand these companies can draw on their existing base to build very large URL databases for lookup quickly; on the other hand, the working model of their filtering software is essentially the same as that of anti-spam or anti-virus software, and the products are often sold only to enterprise customers.

Summary of the invention

The purpose of the invention is to propose a new Internet content filtering system and filtering method that give the system a self-learning capability, improve its classification accuracy and reduce labor cost. When a user accesses the network, the system proactively filters the media data of all kinds found on the Internet, including text, images, video, audio, graphics and animation.

The concept of a URL is introduced first.

URL stands for Uniform Resource Locator. Its format is: protocol://hostname:port/directory-path/filename.

A URL corresponds to a specific data object on a website or server. For example, a URL may correspond to a portal site or a BBS server, or to one particular picture in a directory of a site. Therefore, to prevent a user from accessing a given website, server or data object, it is sufficient to block the user's request for that URL.

The protocol segment identifies the type of Internet resource; for example, http denotes the Hypertext Transfer Protocol, i.e. the WWW. Other protocols include ftp (File Transfer Protocol), telnet (remote login), news (newsgroups), mailto (e-mail) and mms (streaming media).

The hostname segment gives the name of the Internet server, for example www.fudan.edu.cn. The directory path segment indicates where the file, or part of the file, is located on the Internet server. Directory levels are separated by a forward slash (/).

The filename segment is the actual name of the document, image or script to be accessed, for example index.html, logo.gif or script.cgi. The port number, directory path and filename are all optional components of a URL.

Some example URLs:

http://www.w3.org/index.html : corresponds to a website

http://10.64.130.4/images/advice.gif : corresponds to a picture

ftp://10.11.3.8 : corresponds to an FTP server

mms://10.11.4.6/abc.avi : used to play an audio-visual program on demand

telnet://bbs.fudan.edu.cn : corresponds to a BBS server
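The URL segments described above can be illustrated with a minimal Python sketch. It uses the standard library's `urllib.parse.urlsplit`; the helper name `describe_url` and the returned field names are assumptions made for this illustration and do not come from the patent.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the segments named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    # Everything before the last "/" is the directory path, the rest the filename.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,              # None when the URL omits the port
        "directory": directory or None,  # None for the root or when no path is given
        "filename": filename or None,    # None when the URL names no file
    }


# The example URLs listed in the text.
for example in (
    "http://www.w3.org/index.html",
    "http://10.64.130.4/images/advice.gif",
    "ftp://10.11.3.8",
    "mms://10.11.4.6/abc.avi",
    "telnet://bbs.fudan.edu.cn",
):
    print(describe_url(example))
```

A filter that wants to block a whole site rather than a single file could compare only the `protocol` and `host` fields; whether the patented system matches at that granularity is not stated in the text.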

The network content filtering system proposed by the invention comprises the following parts (see Fig. 1): a content filtering agent, a query server, and a content analysis and management server, all connected over the Internet and located between the user terminal and the target site. The elements are:

1. User Terminal (UT): a computer or any other device that can access the Internet. The user accesses network resources through the UT, for example browsing web pages, searching documents or downloading files.

2. Content Filtering Agent (CFA): stores a blacklist (sites or files to which access is prohibited) and a whitelist (sites or files to which access is allowed); both are in practice lists of URLs. This module can run in several forms on different types of platform.

3. Query Server (QS): holds a very large URL database with category and rating information. When the QS receives a URL submitted from a UT (through the CFA, as in the workflow described below), it looks the URL up in the category and rating database and returns the result. The basic reason for using a QS is that the CFA has limited resources: it cannot store much category and rating information, only small blacklists and whitelists, whereas a QS can store a large amount of such information and a single QS can serve a large number of CFAs concurrently. Several QSs can be deployed on the Internet, and a QS can also be deployed on an organization's intranet, in order to handle large numbers of concurrent queries.

4. Content Analysis and Management Server (CAMS): its main task is to classify and rate resources on the Internet, for example by maintaining a list of "websites or harmful URLs that host pornographic pictures or audio-visual material". An authorized QS can download the URL database with category and rating information from the CAMS. Different enterprises or departments are usually concerned with different kinds of content, so there can be several CAMSs of different types. A CAMS must also provide management and publishing functions, and it can also operate as a web portal.

5. Target Website or Server (TWS): any website or server that stores resources. The UT can access its public resources over the Internet.

The working steps of the network content filtering system are summarized as follows:

1. When a user requests access to a URL, the CFA allows or denies the request according to the blacklist or whitelist;

2. If the URL is in neither the blacklist nor the whitelist of the CFA, the CFA sends a query to the QS;

3. The QS looks up the rating information for the URL in its local URL database and returns the result to the CFA, which acts on it;

4. The QS periodically downloads updated URL rating information from the CAMS;

5. The CAMS automatically searches for, downloads, analyzes and processes multimedia data on the Internet, and uses manual interactive labeling together with automatic machine classification to classify and rate network content, producing a categorized and rated URL information database.
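Steps 1 to 3 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the class `ContentFilteringAgent`, the per-category threshold `max_levels`, the precedence of the blacklist over the whitelist, and the `allow_unrated` switch for URLs the QS cannot rate are all hypothetical choices that the text leaves open. The QS is stood in for by a plain dictionary lookup.

```python
from typing import Callable, Dict, Optional, Set

Ratings = Dict[str, int]                      # category name -> level (0..4)
QueryFn = Callable[[str], Optional[Ratings]]  # returns None when the QS cannot decide


class ContentFilteringAgent:
    """Hypothetical CFA: consult the local lists first, then query the QS."""

    def __init__(self, blacklist: Set[str], whitelist: Set[str],
                 query_qs: QueryFn, max_levels: Ratings,
                 allow_unrated: bool = False) -> None:
        self.blacklist = blacklist
        self.whitelist = whitelist
        self.query_qs = query_qs
        self.max_levels = max_levels        # assumed policy: highest level allowed per category
        self.allow_unrated = allow_unrated  # assumed policy for URLs without a rating

    def allows(self, url: str) -> bool:
        # Step 1: the user's own lists decide immediately.
        if url in self.blacklist:
            return False
        if url in self.whitelist:
            return True
        # Steps 2 and 3: ask the QS and act on its answer.
        ratings = self.query_qs(url)
        if ratings is None:
            return self.allow_unrated
        # Categories without a configured limit are treated as unrestricted (limit 4).
        return all(level <= self.max_levels.get(category, 4)
                   for category, level in ratings.items())


# Stand-in for the QS: a small rating table queried with dict.get.
qs_table = {
    "http://news.example/": {"violence": 1, "nudity": 0},
    "http://gore.example/": {"violence": 4},
}
agent = ContentFilteringAgent(
    blacklist={"http://blocked.example/"},
    whitelist={"http://school.example/"},
    query_qs=qs_table.get,
    max_levels={"violence": 2},
)
for url in ("http://blocked.example/", "http://school.example/",
            "http://news.example/", "http://gore.example/",
            "http://unknown.example/"):
    print(url, agent.allows(url))
```

Passing the QS lookup in as a function keeps the agent independent of how the query is transported, which fits the statement that the CFA may run on proxies, firewalls, browsers or access devices.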

The Internet content filtering system proposed by the invention can be applied in a variety of settings, for example:

1. Blocking access to sites that are politically reactionary or that endanger national security.

2. Blocking access to pornographic sites that harm the physical and mental health of young people.

3. Blocking access to e-sports gaming sites.

4. Blocking access to specific types of sites or resources, as determined by the requirements of the particular application.

The filtering agent CFA can run in several ways on many types of software and hardware platform, for example:

1. The CFA can run on a proxy server.

2. The CFA can run on a firewall.

3. The CFA can run in a browser as a browser plug-in.

4. The CFA can run in network access devices such as ADSL modems, cable modems, dial-up modems and ISDN PC adapters.

Brief description of the drawings

Fig. 1 is a diagram of the overall framework of the Internet content filtering system.

Fig. 2 is a diagram of the basic composition and workflow of the Content Analysis and Management Server (CAMS).

Reference numerals in the figures: 1 is the user terminal UT, 2 is the content filtering agent CFA, 3 is the query server QS, 4 is the content analysis and management server CAMS, and 5 is the target site TWS.

Detailed description

The invention is described further below by way of example.

About the Content Analysis and Management Server (CAMS)

The Internet carries content of every kind that changes constantly, such as text, images, video, audio, graphics, animation, dynamic web pages and Flash. Viewed globally, the volume of data on the Internet is truly massive.

The CAMS must continuously track this massive, constantly changing multimedia content and classify and rate it objectively and promptly. This is a difficult and challenging task that requires large-scale computing and storage equipment as well as a great deal of human assistance.

The table below gives example rating levels for categories such as "Violence" and "Nudity". Each row gives the level, a short label, and a description.

Category: Violence
Level 0 | No violence | No aggressive violence; no natural or accidental violence
Level 1 | Fighting | Creatures are injured or killed; living things are harmed
Level 2 | Killing | People or creatures are injured or killed; retaliatory harm to non-threatening creatures
Level 3 | Killing with blood and gore | People are killed or injured
Level 4 | Wanton, gratuitous violence | Malicious and unprovoked violence

Category: Nudity
Level 0 | None | No nudity
Level 1 | Revealing clothing | Revealing clothing
Level 2 | Partial nudity | Partial nudity
Level 3 | Frontal nudity | Frontal nudity
Level 4 | Provocative frontal nudity | Highly provocative display of frontal nudity

Category: Sex
Level 0 | None | No depiction of sexual activity; romance
Level 1 | Passionate kissing | Passionate kissing
Level 2 | Clothed sexual touching | Clothed sexual touching
Level 3 | Non-explicit sexual touching | Non-explicit sexual touching
Level 4 | Explicit sexual acts | Explicit sexual acts

Category: Politics
Level 0 | None | No reactionary content
Level 1 | Veiled content | Oblique references to related content
Level 2 | General discussion of politically sensitive content | General discussion touching on politically sensitive reactionary content
Level 3 | Reactionary speech | Plainly stated reactionary speech
Level 4 | Extremely reactionary speech | Extremely reactionary speech

Automatic or semi-automatic classification and rating of the data on websites and servers is a very important task of the CAMS. It must be pointed out that the standards for classification and rating should be formulated, published and enforced by the relevant state authorities.

Once standards for classifying and rating network content exist, different companies, organizations or portals can carry out rating for a particular category of data. For example, one CAMS may concentrate only on political content while another concentrates only on pornographic content, which can create many business opportunities.

Clearly, how completely and accurately a CAMS for a given category rates network content directly determines the accuracy of content filtering. Evaluating network content fully automatically, relying on computer processing and analysis alone, is very difficult. The invention therefore combines human guidance with machine learning to direct the computer in evaluating massive, time-varying network data.

Fig. 2 shows the content-based method for analyzing, processing and evaluating multimedia data (for a specific category; the category can be determined manually in advance). It can rate media content of all kinds, such as images, video, audio and text. Its working steps are:

1. Extract features from the media objects. For example, extract colors and color histograms from pictures and analyze the color and texture structure of image regions; extract camera or object motion, color and texture information from video data; extract keywords from text.

2. Manually label a small portion of the objects. These manually labeled objects serve as the samples for machine learning.

3. The system learns from the manual labels, acquires higher-level semantic information, and builds a knowledge base used for rating.

4. Finally, the system automatically rates the great majority of data objects, which have no manual labels, thereby greatly reducing labor cost.

To ensure that the machine classifies accurately enough, its results must also be spot-checked and evaluated manually. This manual re-evaluation further improves the machine's classification performance and is known as relevance feedback.

The main features of the method are: it uses content-based analysis, so that media objects are understood at the semantic level; it introduces manual interaction and labeling to enable machine learning and so improve the system's classification accuracy; and it uses a feedback mechanism, giving the system a self-learning capability. With appropriate human guidance and machine learning methods, machine classification accuracy can be improved considerably and labor cost greatly reduced.
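The four steps above can be illustrated with a toy Python sketch. The patent does not name a particular feature set or learning algorithm, so everything here is an assumption chosen for brevity: images are represented as lists of RGB pixel tuples, the feature is a coarse color histogram, and a nearest-centroid classifier stands in for the unspecified machine learning method. Real content rating would need far richer features and models.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 2) -> List[float]:
    """Step 1: normalized RGB histogram with bins**3 cells."""
    hist = [0.0] * (bins ** 3)
    for pixel in pixels:
        cell = 0
        for channel in pixel:
            cell = cell * bins + min(channel * bins // 256, bins - 1)
        hist[cell] += 1.0
    total = sum(hist) or 1.0
    return [count / total for count in hist]


def train_centroids(samples: Sequence[Tuple[List[float], int]]) -> Dict[int, List[float]]:
    """Steps 2 and 3: average the feature vectors of each manually assigned level."""
    grouped: Dict[int, List[List[float]]] = defaultdict(list)
    for features, level in samples:
        grouped[level].append(features)
    return {level: [sum(column) / len(vectors) for column in zip(*vectors)]
            for level, vectors in grouped.items()}


def classify(features: List[float], centroids: Dict[int, List[float]]) -> int:
    """Step 4: assign the level whose centroid is closest to the features."""
    def squared_distance(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda level: squared_distance(features, centroids[level]))


# Two manually labeled toy "images" (the labels 0 and 3 are arbitrary for the demo).
labeled = [
    (color_histogram([(10, 10, 250)] * 4), 0),
    (color_histogram([(250, 10, 10)] * 4), 3),
]
centroids = train_centroids(labeled)
print(classify(color_histogram([(200, 30, 40)] * 4), centroids))
```

Under the same assumptions, the relevance feedback described above would amount to appending manually corrected samples to `labeled` and calling `train_centroids` again.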

In addition, CAMS has other functional modules: one manages the URL classification and rating database, and one publishes it. A further essential module is the web crawler, which automatically explores the Internet, visits websites or servers, and fetches media files of all kinds. Many crawler programs with similar functionality already exist, so the crawler is not the focus of the present invention.

The detailed working steps of CAMS are given below (see Figure 2):

(1) Web crawler group: autonomously searches the Internet and downloads data of various types, such as web pages, images, video, and music [flow ①]; it also downloads the data objects requested by the suspicious-URL database [flow ⑦]. Note that the "suspicious-URL database" here consists mainly of lists of URLs sent by the query server (QS) that the QS cannot yet handle.

(2) Feature extraction: the downloaded multimedia data objects are analyzed and processed to extract features. For example, color, texture, and shape features are extracted from images, and features such as object motion and camera motion are extracted from video. The URL of each downloaded data object is stored together with its features [flow ②].

(3) Manual labeling: a subset of the downloaded multimedia data objects is selected and labeled manually with a category and a rating level. The results of automatic classification and rating are also checked manually, which both reduces errors and, as a form of relevance feedback, improves classification performance [flow ③].

(4) Classifier training: to classify and rate the data objects behind URLs automatically, a machine learning method can be used. The classifier is trained with the manual labels and the relevance feedback so that it produces high-accuracy classification and rating results [flow ④].

(5) Automatic classification and rating: the trained classifier automatically classifies and rates every downloaded data object, yielding the classified and rated URL database [flow ⑤]. This database can be updated and published periodically; because Internet content changes constantly, the update and publication cycle should be as short as possible [flow ⑥].
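As a non-limiting illustration, steps (3) to (5) can be sketched as follows. The description does not fix a particular learning algorithm, so the nearest-centroid classifier used here is only a stand-in, and the two-dimensional feature vectors, the function names `train_centroids` and `rate`, and the sample URLs are all assumptions made for the example.

```python
from collections import defaultdict
from math import dist


def train_centroids(labeled):
    """Step (4): average the feature vectors of each manually assigned level.

    labeled is a list of (feature_vector, level) pairs from step (3).
    """
    sums = {}
    counts = defaultdict(int)
    for features, level in labeled:
        if level not in sums:
            sums[level] = [0.0] * len(features)
        sums[level] = [s + f for s, f in zip(sums[level], features)]
        counts[level] += 1
    return {level: [s / counts[level] for s in total]
            for level, total in sums.items()}


def rate(centroids, features):
    """Step (5): assign the level whose centroid is nearest to the features."""
    return min(centroids, key=lambda level: dist(centroids[level], features))


# Step (3): a few manually labeled objects (hypothetical features and levels).
labeled = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 3), ([0.8, 0.9], 3)]
centroids = train_centroids(labeled)

# Step (5): rate the objects the crawler downloaded, keyed by URL.
downloaded = {"URL-1": [0.15, 0.10], "URL-2": [0.85, 0.95]}
url_levels = {url: rate(centroids, f) for url, f in downloaded.items()}
```

Relevance feedback in this sketch would amount to appending manually corrected pairs to `labeled` and calling `train_centroids` again.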

About the Query Server (QS)

The QS stores a massive URL classification and rating database, whose contents may come from one or more CAMS. The general data structure of this database in the QS is as follows (example):

No.   URL (string)   "Violence" level (integer)   "Nudity" level (integer)   "Politics" level (integer)   ...
1     URL-1          1                            0                          4                            ...
2     URL-2          2                            2                          0                            ...
...   ...            ...                          ...                        ...                          ...
L     URL-L          3                            2                          1                            ...

The main job of the QS is to return a verdict on each URL submitted by a content filtering agent (CFA), which is a simple table lookup. If the URL is present in the classification and rating table, the QS returns the lookup result (that is, the levels) to the CFA. Otherwise the QS does two things: (1) it returns "undecidable" (NAN) to the CFA; (2) it submits the URL to CAMS for analysis and processing. Because Internet content changes constantly, undecidable cases cannot be avoided. If CAMS analyzes, processes, and tracks changes in network content promptly, however, the probability of an "undecidable" result will be small.
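The lookup described above can be sketched, by way of example only, as follows. The class and field names (`QueryServer`, `Rating`, `pending_for_cams`, `load_update`) are hypothetical and are not taken from the description; the in-memory dictionary stands in for the real rating database.

```python
from dataclasses import dataclass

NAN = "NAN"  # the "undecidable" answer returned to the CFA


@dataclass(frozen=True)
class Rating:
    """One row of the example table: one integer level per category."""
    violence: int
    nudity: int
    politics: int


class QueryServer:
    def __init__(self, table):
        self.table = dict(table)      # URL -> Rating
        self.pending_for_cams = []    # URLs the QS could not decide

    def query(self, url):
        rating = self.table.get(url)
        if rating is None:
            # Unknown URL: queue it for CAMS and report "undecidable".
            if url not in self.pending_for_cams:
                self.pending_for_cams.append(url)
            return NAN
        return rating

    def load_update(self, published):
        """Merge a rating table published by a CAMS."""
        self.table.update(published)
        self.pending_for_cams = [u for u in self.pending_for_cams
                                 if u not in self.table]
```

A URL that was undecidable on the first query becomes decidable once a CAMS publishes a rating for it and `load_update` is called.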

Support for concurrent access must be considered when implementing the QS. The present invention uses a Trie-based URL index together with a main-memory caching strategy: frequently accessed URL entries are kept in the server's main memory, and infrequently used entries are kept on disk. This combination of index structure and cache greatly increases the verification speed of the QS and supports a high volume of concurrent accesses.
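A minimal sketch of such an index is given below, under stated assumptions: the trie is keyed on the host followed by the path segments (the description does not say how URLs are split), the full trie stands in for the disk-resident index, and a small least-recently-used cache stands in for the entries held in main memory. The names `url_keys`, `UrlTrie`, and `CachedIndex` are hypothetical.

```python
from collections import OrderedDict


def url_keys(url):
    """Split a URL into trie keys: the host first, then each path segment."""
    rest = url.split("://", 1)[-1]
    return [part for part in rest.split("/") if part]


class UrlTrie:
    def __init__(self):
        self.root = {}

    def insert(self, url, rating):
        node = self.root
        for key in url_keys(url):
            node = node.setdefault(key, {})
        node[""] = rating  # the empty key marks the end of a stored URL

    def lookup(self, url):
        node = self.root
        for key in url_keys(url):
            node = node.get(key)
            if node is None:
                return None
        return node.get("")


class CachedIndex:
    """Keeps recently used entries in a small cache in front of the trie."""

    def __init__(self, trie, capacity):
        self.trie = trie
        self.capacity = capacity
        self.hot = OrderedDict()
        self.slow_lookups = 0  # how often the full index had to be consulted

    def lookup(self, url):
        if url in self.hot:
            self.hot.move_to_end(url)
            return self.hot[url]
        self.slow_lookups += 1
        rating = self.trie.lookup(url)
        if rating is not None:
            self.hot[url] = rating
            if len(self.hot) > self.capacity:
                self.hot.popitem(last=False)  # evict the least recently used
        return rating
```

The sketch is not thread-safe; a real QS serving concurrent requests would need a lock or a similar mechanism around the cache.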

QS instances can be deployed in large numbers on the Internet or on an intranet to serve all kinds of users, including home users and enterprise users. Each QS downloads the classification and rating database from the CAMS that it is authorized to use. CAMS should promptly process the data behind URLs that the QS could not decide, and should periodically publish classification and rating information for the QS to download.

About the Content Filtering Agent (CFA)

The CFA is a very simple software module that runs in various forms on all kinds of hardware and software platforms. It stores a whitelist (WNList) and a blacklist (BNList); in essence, each is a list of URLs.

The data structure of the CFA's blacklist and whitelist is as follows:

No.   URL (string)                     Attribute (Boolean)
1     http://www.fudan.edu.cn/news/    0 (whitelisted)
2     http://www.private.com           1 (blacklisted)
3     ...                              ...

The basic working process of the CFA:

1. When the URL is in WNList, the CFA lets the request through and forwards it to the TWS, which returns its response to the UT.

2. When the URL is in BNList, the CFA blocks the request and sends an "access forbidden" or warning message directly to the UT, which in effect cuts off the UT's request.

3. When the URL is in neither WNList nor BNList, the CFA sends the URL to the QS, asks the QS to verify it, and acts on the verification result.

These operations are described in more detail in the workflow that follows.
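The three cases of the basic working process can be sketched as a single lookup, for illustration only. The function name `cfa_decide` and the three return constants are hypothetical; the dictionary uses the attribute encoding of the CFA list (0 for whitelist, 1 for blacklist).

```python
ALLOW, BLOCK, ASK_QS = "allow", "block", "ask_qs"


def cfa_decide(url, lists):
    """Decide what the CFA does with a requested URL.

    lists maps URL -> attribute, where 0 means whitelist and 1 means blacklist.
    """
    attribute = lists.get(url)
    if attribute == 0:
        return ALLOW    # case 1: forward the request to the TWS
    if attribute == 1:
        return BLOCK    # case 2: send a warning to the UT
    return ASK_QS       # case 3: neither list knows the URL


lists = {
    "http://www.fudan.edu.cn/news/": 0,
    "http://www.private.com": 1,
}
```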

Each CFA has an authorized account. Through the graphical interface on the client side, the authorized user can set the CFA options that make up that user's filtering policy, specifically:

1. The URL classification levels that decide whether a URL belongs to the blacklist or the whitelist

For example, suppose the user specifies that URLs rated "Violence" level 1 or higher, or "Nudity" level 2 or higher, are blacklisted. When the UT requests a URL that is in neither the blacklist nor the whitelist of the CFA, the CFA sends the URL to the QS. Suppose the classification and rating database of the QS rates that URL as "Violence" level 0 and "Nudity" level 3. When the QS returns this rating to the CFA, the CFA decides from the user's settings that the URL belongs to the blacklist and blocks it.

2. The attribute assigned to a URL when the QS returns "NAN"

If the user sets this option to "whitelist", the CFA automatically treats the URL as whitelisted whenever the QS returns "NAN" for it; otherwise the URL is treated as blacklisted.

3. The user can manage the blacklist and whitelist in the CFA manually, including browsing, adding, and deleting entries.

4. The user can change the password of the authorized account in the CFA.
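Settings 1 and 2 above can be sketched as one function, as a hedged example. The function name `judge`, the dictionary form of the rating, and the category keys `"violence"` and `"nudity"` are assumptions; the thresholds and the sample rating are the ones from the example in the text.

```python
NAN = "NAN"


def judge(rating, thresholds, nan_policy):
    """Return 1 (blacklist) or 0 (whitelist) for an answer from the QS.

    thresholds maps a category to the lowest level that is blacklisted.
    nan_policy is "whitelist" or "blacklist" and applies to NAN answers.
    """
    if rating == NAN:
        return 1 if nan_policy == "blacklist" else 0
    for category, limit in thresholds.items():
        if rating.get(category, 0) >= limit:
            return 1
    return 0


# "Violence" level 1 or higher, "Nudity" level 2 or higher are blacklisted.
thresholds = {"violence": 1, "nudity": 2}
verdict = judge({"violence": 0, "nudity": 3}, thresholds, "whitelist")
```

Here `verdict` is 1, because the "Nudity" level 3 reaches the threshold of 2 even though the "Violence" level does not reach its threshold.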

When the storage resources of the CFA are limited, a caching strategy is needed, for example keeping only the most recently and most frequently used blacklist and whitelist entries.

The computing power and storage resources available to a CFA are usually limited. For example, when the CFA runs inside an ADSL modem, computing power is clearly insufficient and the blacklist and whitelist that can be stored are quite small. For such applications the CFA must be simple, compact, and fast. The CFA proposed by the present invention needs no complex program: it only looks up a table and maintains a cache, and the caching mechanism greatly reduces the storage required.
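One possible bounded cache of this kind is sketched below. The description names the goal (keep recent and frequently used entries) but not an eviction rule, so the rule used here is an assumption: evict the entry with the fewest hits, and among those the one used longest ago. The class name `ListCache` is hypothetical.

```python
class ListCache:
    """Bounded store for blacklist/whitelist entries of the CFA."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # url -> [attribute, hits, last_used]
        self.clock = 0     # logical time, advanced on every access

    def get(self, url):
        entry = self.entries.get(url)
        if entry is None:
            return None
        self.clock += 1
        entry[1] += 1
        entry[2] = self.clock
        return entry[0]

    def put(self, url, attribute):
        self.clock += 1
        if url not in self.entries and len(self.entries) >= self.capacity:
            # Evict the least frequently used entry; break ties by age.
            victim = min(self.entries,
                         key=lambda u: (self.entries[u][1], self.entries[u][2]))
            del self.entries[victim]
        hits = self.entries[url][1] if url in self.entries else 0
        self.entries[url] = [attribute, hits, self.clock]
```

A known weakness of this simple rule is that a newly added entry has no hits yet and is the first candidate for eviction; a device with very little memory might accept that in exchange for the small code size.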

Finally, it should be noted that communication among CFA, QS, and CAMS can be implemented with socket programming or by other means. Communication between CFA and QS, and between QS and CAMS, must pass identity authentication.

The specific steps of Internet content filtering in the present invention are as follows (see Figure 1):

1. When a user wants to access a target website or server (TWS) to browse web pages, watch video on demand, or download files, an http (or ftp, mms, telnet, etc.) request is issued. The content filtering agent CFA immediately intercepts the requested URL and compares it with the URLs in its blacklist and whitelist [flow ①].

If the URL requested by the UT is in the CFA blacklist, the request is blocked and an error or warning message is returned to the UT [flow ②].

2. If the URL requested by the UT is in the CFA whitelist, the request is forwarded directly to the target site TWS [flow ③], and the TWS returns the corresponding response to the UT [flow ⑥].

3. If the requested URL is in neither the CFA blacklist nor the CFA whitelist, the CFA sends the URL to the query server QS [flow ④]. The QS looks up the URL, obtains either its rating information or NAN, and sends the result to the CFA [flow ⑤].

(1) If the URL is in the URL database of the QS and, under the user's settings, its rating falls in the blacklist range, the CFA treats the URL as blacklisted, immediately and automatically adds it to its blacklist, forbids the UT from accessing it, and returns an error or warning message to the UT [flow ②].

(2) If the URL is in the URL database of the QS and, under the user's settings, its rating falls in the whitelist range, the CFA treats the URL as whitelisted, immediately and automatically adds it to its whitelist, and forwards the request to the TWS [flow ③]; the TWS returns the corresponding response to the UT [flow ⑥].

(3) If the URL is not in the URL database of the QS, the QS informs the CFA that the URL is undecidable, and the CFA reacts according to the policy the user set in advance: either the URL is automatically treated as whitelisted, or it is automatically treated as blacklisted. In this case, however, the CFA does not update its blacklist or whitelist. Meanwhile, the QS hands the URL over to CAMS for processing [flow ⑦].
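The filtering steps above can be tied together in a short sketch, offered only as an illustration. The function `handle_request`, the stand-in `query_qs`, the dictionary form of the ratings, and the sample URLs are assumptions; the network round trips to the QS, TWS, and CAMS are replaced by plain Python objects.

```python
NAN = "NAN"


def handle_request(url, lists, query_qs, thresholds, nan_policy):
    """Return "forward" (to the TWS) or "block" (warn the UT) for one request.

    lists is the CFA list, URL -> 0 (whitelist) or 1 (blacklist); it is
    updated in place when the QS returns a rating.
    query_qs is a callable standing in for the round trip to the QS.
    """
    attribute = lists.get(url)                       # flow ①
    if attribute is None:
        rating = query_qs(url)                       # flows ④ and ⑤
        if rating == NAN:
            # Case (3): apply the user's NAN policy, do not update the lists.
            attribute = 1 if nan_policy == "blacklist" else 0
        else:
            over = any(rating.get(category, 0) >= limit
                       for category, limit in thresholds.items())
            attribute = 1 if over else 0
            lists[url] = attribute                   # cases (1) and (2)
    return "block" if attribute == 1 else "forward"  # flow ② or flow ③


# Stand-ins for the QS database and the hand-over to CAMS (flow ⑦).
qs_table = {"URL-1": {"violence": 1, "nudity": 0},
            "URL-2": {"violence": 0, "nudity": 0}}
sent_to_cams = []


def query_qs(url):
    if url in qs_table:
        return qs_table[url]
    sent_to_cams.append(url)
    return NAN
```

Because the lists are updated only when the QS returns a rating, a URL that was undecidable is queried again on its next request and picks up whatever rating CAMS has published in the meantime.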

The content analysis and management server CAMS periodically publishes the updated URL rating database to the QS [flow ⑩], so that the QS reflects changes in Internet content promptly. The performance of CAMS directly determines the filtering accuracy, so considerable effort must be spent on maintaining and updating CAMS.

To speed up the CFA's decisions and reduce its storage requirements, a caching mechanism is introduced in the CFA: it stores the blacklist and whitelist entries that the user's UT accesses most often. This reduces the number of verification requests that must be sent to the QS, each of which involves a certain waiting time.

The authorized user can manage the blacklist and whitelist in the CFA as needed, browsing, adding, or deleting entries [flows ⑧ and ⑨].

Claims (10)

1. An Internet content filtering system, characterized in that it consists of a content filtering agent (denoted CFA), a query server (denoted QS), and a content analysis and management server (denoted CAMS), wherein the content filtering agent stores a blacklist and a whitelist; the query server has a URL database with classification and rating information; and the content analysis and management server classifies and rates the resources on the Internet.
2. The Internet content filtering system according to claim 1, characterized in that the CFA provides user-specific configuration, comprising: (1) a setting of the URL classification levels that decide whether a URL belongs to the blacklist or the whitelist; (2) a setting of the attribute assigned to a URL when the QS reports that it has no entry for that URL; (3) manual management of the blacklist or whitelist in the CFA, including browsing, adding, and deleting entries.
3. The Internet content filtering system according to claim 1, characterized in that the CFA runs in various forms on the following kinds of hardware and software platforms: (1) proxy servers; (2) firewalls; (3) browsers; (4) network access devices such as ADSL modems, cable modems, telephone line modems, and ISDN PC adapters.
4. The Internet content filtering system according to claim 1, characterized in that the QS holds a massive amount of URL classification and rating information, quickly looks up the URLs submitted by the CFA, and returns the corresponding rating information.
5. The Internet content filtering system according to claim 1, characterized in that QS instances can be deployed in large numbers on the Internet or on an intranet and support concurrent queries in order to serve all kinds of users; and the QS downloads the classification and rating database from the CAMS that it is authorized to use.
6. The Internet content filtering system according to claim 1, characterized in that the CAMS uses content-based multimedia analysis and processing methods to analyze and evaluate the various kinds of media content on the Internet, and labels the content with rating levels under the different categories.
7. The Internet content filtering system according to claim 1, characterized in that the CAMS introduces human-computer interaction and manual labeling, and uses machine learning to improve the classification accuracy of the system.
8. The Internet content filtering system according to claim 1, characterized in that communication between the CFA and the QS, and between the QS and the CAMS, must pass identity authentication.
9. An Internet content filtering method, characterized in that it uses the Internet content filtering system according to claim 1, with the following specific steps:
(1) when the user issues a request to access a URL, the CFA forbids or allows the access request according to its blacklist or whitelist;
(2) if the URL is in neither the blacklist nor the whitelist of the CFA, the CFA sends a query request to the QS;
(3) the QS looks up the rating information of the URL in its local URL database and returns the result to the CFA, and the CFA reacts accordingly;
(4) the QS periodically downloads updated URL rating information from the CAMS;
(5) the CAMS automatically searches for, downloads, and analyzes multimedia data on the Internet, and uses interactive manual labeling together with automatic machine classification to classify and rate Internet content, forming the classified and rated URL database.
10. The Internet content filtering method according to claim 9, characterized in that the working steps of the CAMS are as follows:
(1) web crawler group: autonomously searches the Internet and downloads data of various types, and downloads data objects as requested by the suspicious-URL database;
(2) feature extraction: the downloaded multimedia data objects are analyzed and processed to extract features;
(3) manual labeling: a subset of the downloaded multimedia data objects is selected and labeled with a category and a rating level, and the results of automatic classification and rating are checked manually;
(4) classifier training: to classify and rate the data objects behind URLs automatically, a machine learning method is used, and the classifier is trained with the manual labels and the relevance feedback;
(5) automatic classification and rating: the trained classifier automatically classifies and rates each downloaded data object, yielding the classified and rated URL database, which is updated and published periodically.
CN 200410053683 2004-08-12 2004-08-12 Internet content filtering system and method Pending CN1588879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200410053683 CN1588879A (en) 2004-08-12 2004-08-12 Internet content filtering system and method


Publications (1)

Publication Number Publication Date
CN1588879A true CN1588879A (en) 2005-03-02

Family

ID=34602956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200410053683 Pending CN1588879A (en) 2004-08-12 2004-08-12 Internet content filtering system and method

Country Status (1)

Country Link
CN (1) CN1588879A (en)


Cited By (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105187290A (en) * 2005-03-25 2015-12-23 高通股份有限公司 Apparatus And Methods For Managing Content Exchange On A Wireless Device
CN105187290B (en) * 2005-03-25 2019-07-30 高通股份有限公司 For managing the device and method of content exchange on the wireless device
CN101208942B (en) * 2005-03-30 2013-04-10 西门子企业通讯有限责任两合公司 Method for preventing unwanted telephony advertising of a communication network
CN100367294C (en) * 2005-06-23 2008-02-06 复旦大学 Method for Segmenting Human Skin Regions in Color Digital Images and Videos
CN101283356B (en) * 2005-10-14 2012-10-10 微软公司 Search results injected into client applications
CN100362805C (en) * 2005-11-18 2008-01-16 郑州金惠计算机系统工程有限公司 Multifunctional management system for network pornographic image and bad information detection
WO2007076714A1 (en) * 2005-12-31 2007-07-12 Metaswarm (Hongkong) Ltd. System and method for generalizing an antispam blacklist
CN101317376B (en) * 2006-07-11 2011-04-20 华为技术有限公司 Method, device and system for contents filtering
US8055241B2 (en) 2006-07-11 2011-11-08 Huawei Technologies Co., Ltd. System, apparatus and method for content screening
CN101163161B (en) * 2007-11-07 2012-02-29 福建星网锐捷网络有限公司 United resource localizer address filtering method and intermediate transmission equipment
CN102027777A (en) * 2008-05-16 2011-04-20 日本电气株式会社 Base station device, information processing device, filtering system, filtering method, and program
WO2010121542A1 (en) * 2009-04-22 2010-10-28 中兴通讯股份有限公司 Home gateway-based anti-virus method and device thereof
CN101527721B (en) * 2009-04-22 2012-09-05 中兴通讯股份有限公司 Anti-virus method on the basis of household gateway and device thereof
CN104822146B (en) * 2009-04-27 2020-03-24 皇家Kpn公司 Managing undesired service requests in a network
CN102415119B (en) * 2009-04-27 2015-05-27 皇家Kpn公司 Managing undesired service requests in a network
CN104822146A (en) * 2009-04-27 2015-08-05 皇家Kpn公司 Managing undesired service requests in a network
CN102415119A (en) * 2009-04-27 2012-04-11 皇家Kpn公司 Managing undesired service requests in a network
CN101547197B (en) * 2009-04-30 2012-05-30 珠海金山软件有限公司 URL (Uniform resource locator) whitening device and method
CN101605129B (en) * 2009-06-23 2012-02-01 北京理工大学 URL lookup method for URL filtering system
CN102075502A (en) * 2009-11-24 2011-05-25 北京网御星云信息技术有限公司 Virus protection system based on cloud computing
CN102075502B (en) * 2009-11-24 2013-12-11 北京网御星云信息技术有限公司 Virus protection system based on cloud computing
CN101883180A (en) * 2010-05-11 2010-11-10 中兴通讯股份有限公司 Method, mobile terminal and system for shielding mobile terminal from accessing wireless network information
WO2011140784A1 (en) * 2010-05-11 2011-11-17 中兴通讯股份有限公司 Method for screening mobile terminal from accessing wireless network information, mobile terminal and system thereof
CN101937445A (en) * 2010-05-24 2011-01-05 中国科学技术信息研究所 Automatic file classification system
CN101923561A (en) * 2010-05-24 2010-12-22 中国科学技术信息研究所 Automatic document classifying method
CN101877704A (en) * 2010-06-02 2010-11-03 中兴通讯股份有限公司 Method and service gateway for network access control
WO2011150692A1 (en) * 2010-06-02 2011-12-08 中兴通讯股份有限公司 Method for controlling network access and service gateway thereof
CN101951379A (en) * 2010-09-27 2011-01-19 苏州昂信科技有限公司 Green browser and URL long-distance filtration mechanism used thereby
CN102469146B (en) * 2010-11-19 2015-11-25 北京奇虎科技有限公司 A kind of cloud security downloading method
WO2012065551A1 (en) * 2010-11-19 2012-05-24 北京奇虎科技有限公司 Method for cloud security download
CN102469146A (en) * 2010-11-19 2012-05-23 北京奇虎科技有限公司 Cloud security downloading method
CN102075617A (en) * 2010-12-02 2011-05-25 惠州Tcl移动通信有限公司 Method and device thereof for preventing short messages from being automatically sent through mobile phone virus
CN102110132B (en) * 2010-12-08 2013-06-19 北京星网锐捷网络技术有限公司 Uniform resource locator matching and searching method, device and network equipment
CN102110132A (en) * 2010-12-08 2011-06-29 北京星网锐捷网络技术有限公司 Uniform resource locator matching and searching method, device and network equipment
CN102682037B (en) * 2011-03-18 2016-09-28 阿里巴巴集团控股有限公司 A kind of data capture method, system and device
CN102682037A (en) * 2011-03-18 2012-09-19 阿里巴巴集团控股有限公司 Data acquisition method, system and device
CN102754488B (en) * 2011-04-18 2016-06-08 华为技术有限公司 User access control method, device and system
CN102754488A (en) * 2011-04-18 2012-10-24 华为技术有限公司 User access control method, device and system
CN102137111A (en) * 2011-04-20 2011-07-27 北京蓝汛通信技术有限责任公司 Method and device for preventing CC (Challenge Collapsar) attack and content delivery network server
CN103024092B (en) * 2011-09-28 2015-04-22 中国移动通信集团公司 Method, system and device for blocking domain
CN103024092A (en) * 2011-09-28 2013-04-03 中国移动通信集团公司 Method, system and device for blocking domain
WO2013067724A1 (en) * 2011-11-08 2013-05-16 北京捷通华声语音技术有限公司 Cloud end user mapping system and method
WO2013078825A1 (en) * 2011-11-30 2013-06-06 华为技术有限公司 Method, device and system for recommending accessible website to user
CN104506426B (en) * 2012-03-23 2019-03-01 北京奇虎科技有限公司 The information cuing method and device of mail
CN102663291A (en) * 2012-03-23 2012-09-12 奇智软件(北京)有限公司 Mail information prompt method and device
CN104506426A (en) * 2012-03-23 2015-04-08 北京奇虎科技有限公司 Information prompting method and device for E-mails
CN103390129A (en) * 2012-05-08 2013-11-13 腾讯科技(深圳)有限公司 Method and device for detecting security of uniform resource locator
CN103390129B (en) * 2012-05-08 2015-12-16 腾讯科技(深圳)有限公司 Detect the method and apparatus of security of uniform resource locator
CN103428187B (en) * 2012-05-25 2016-11-30 腾讯科技(深圳)有限公司 Access method, equipment and the system controlled
CN103428187A (en) * 2012-05-25 2013-12-04 腾讯科技(深圳)有限公司 Method and system for access controlling, and equipment
CN102724187A (en) * 2012-06-06 2012-10-10 奇智软件(北京)有限公司 Method and device for safety detection of universal resource locators
CN102724187B (en) * 2012-06-06 2016-05-25 北京奇虎科技有限公司 A kind of safety detection method for network address and device
CN102831149B (en) * 2012-06-25 2015-08-12 腾讯科技(深圳)有限公司 Method of sample analysis, device
CN102831149A (en) * 2012-06-25 2012-12-19 腾讯科技(深圳)有限公司 Sample analyzing method, device and storage medium
CN102946377A (en) * 2012-07-16 2013-02-27 珠海市君天电子科技有限公司 Antivirus system and method for preventing users from downloading virus documents from internet
CN102833258B (en) * 2012-08-31 2015-09-23 北京奇虎科技有限公司 Network address access method and system
CN102833258A (en) * 2012-08-31 2012-12-19 北京奇虎科技有限公司 Website access method and system
TWI490726B (en) * 2012-09-03 2015-07-01 Tencent Tech Shenzhen Co Ltd Method and device for protecting access to multiple applications by using single sign-on
CN103679014B (en) * 2012-09-04 2018-07-03 腾讯科技(深圳)有限公司 The intercepting processing method and device of webpage malicious Flash
CN103679014A (en) * 2012-09-04 2014-03-26 腾讯科技(深圳)有限公司 Method and device for intercepting processing of webpage malicious Flash
CN103973749A (en) * 2013-02-05 2014-08-06 腾讯科技(深圳)有限公司 Cloud server and website processing method based on same
CN104079528A (en) * 2013-03-26 2014-10-01 北大方正集团有限公司 Method and system of safety protection of Web application
CN104239369A (en) * 2013-06-24 2014-12-24 腾讯科技(深圳)有限公司 Method, device and system for filtering out webpage advertisements
WO2015007231A1 (en) * 2013-07-19 2015-01-22 腾讯科技(深圳)有限公司 Method and device for identification of malicious url
CN103338211A (en) * 2013-07-19 2013-10-02 腾讯科技(深圳)有限公司 Malicious URL (unified resource locator) authenticating method and device
CN104598508A (en) * 2013-09-18 2015-05-06 Ims保健公司 System and method for fast query response
CN104598508B (en) * 2013-09-18 2021-06-08 Iqvia 公司 System and method for fast query response
CN103984708B (en) * 2014-04-29 2017-11-28 暨南大学 Method and system of emergency decomposing and sorting for processing of big data of catastrophe risks
CN103984708A (en) * 2014-04-29 2014-08-13 暨南大学 Method and system of emergency decomposing and sorting for processing of big data of catastrophe risks
CN106055557A (en) * 2015-12-25 2016-10-26 中国科学技术信息研究所 Method and system for classification and pre-processing of big data under Internet environment
CN106408334A (en) * 2016-08-31 2017-02-15 微梦创科网络科技(中国)有限公司 Verification method and system of network advertisements
CN107528845A (en) * 2017-09-01 2017-12-29 华中科技大学 Intelligent URL filtering system and method based on crawler technology
CN107580004A (en) * 2017-10-31 2018-01-12 深圳竹云科技有限公司 New authentication method and authentication center framework
CN110709833B (en) * 2017-12-05 2023-09-05 谷歌有限责任公司 Identify videos with inappropriate content by processing search logs
CN110709833A (en) * 2017-12-05 2020-01-17 谷歌有限责任公司 Identify videos with inappropriate content by processing search logs
CN110472133A (en) * 2018-05-08 2019-11-19 上海利业律兴企业管理有限公司 Internet information exchange method and device
CN109063641A (en) * 2018-08-01 2018-12-21 任峰 Computer checking method
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 Text content safety protection method and device
CN110516066B (en) * 2019-07-23 2022-04-15 同盾控股有限公司 Text content safety protection method and device
CN113099441B (en) * 2021-03-29 2022-11-18 Oppo广东移动通信有限公司 Website management method, website management platform, electronic equipment and medium
CN113099441A (en) * 2021-03-29 2021-07-09 Oppo广东移动通信有限公司 Website management method, website management platform, electronic device and medium
CN114238962A (en) * 2021-09-29 2022-03-25 睿贸恒诚(山东)科技发展有限责任公司 Harmful information filtering system and method based on mobile internet
CN114491643A (en) * 2022-02-14 2022-05-13 Tcl通讯科技(成都)有限公司 Access control method, device, storage medium and server

Similar Documents

Publication Publication Date Title
CN1588879A (en) Internet content filtering system and method
US10922377B2 (en) Internet-based proxy service to limit internet visitor connection speed
US8359651B1 (en) Discovering malicious locations in a public computer network
Niu et al. A Quantitative Study of Forum Spamming Using Context-based Analysis.
US9985978B2 (en) Method and system for misuse detection
JP4906273B2 (en) Search engine spam detection using external data
CN1253813C (en) Contents-index search system and its method
US7987173B2 (en) Systems and methods of handling internet spiders
US8271650B2 (en) Systems and method of identifying and managing abusive requests
CN102957694B (en) Method and device for judging phishing website
US8695091B2 (en) Systems and methods for enforcing policies for proxy website detection using advertising account ID
US20120191693A1 (en) Systems and methods of identifying and handling abusive requesters
US20120060221A1 (en) Prioritizing Malicious Website Detection
US20150180899A1 (en) System and method of analyzing web content
US7860971B2 (en) Anti-spam tool for browser
CN1592229A (en) Electronic communications and web pages filtering based on URL
US20090300012A1 (en) Multilevel intent analysis method for email filtration
CN1601532A (en) Improved systems and methods for ordering documents based on structurally relevant information
CN1459064A (en) Method for searching and analysing information in data networks
CN1761961A (en) Method and apparatus for detecting invalid clicks on an internet search engine
US20090240684A1 (en) Image Content Categorization Database
Polakis et al. Using social networks to harvest email addresses
Yang et al. Dark web forum correlation analysis research
CN1863211A (en) Content filtering system and method thereof
US7634458B2 (en) Protecting non-adult privacy in content page search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication