CN102460417B - Domain state, role and kind - Google Patents
Domain state, role and kind
- Publication number
- CN102460417B CN201080024769.7A
- Authority
- CN
- China
- Prior art keywords
- page
- content
- domain
- additional
- territory
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/954—Navigation, e.g. using categorised browsing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Transfer Between Computers (AREA)
- Medicines That Contain Protein Lipid Enzymes And Other Medicines (AREA)
Description
Background
As the number of active websites on the Internet grows, so does the need to organize and evaluate the available content. Search engines make it easier for individual users to discover and access available information. A typical search engine may include algorithms that help identify relevant content based on a user's search terms. These may include, for example, consideration of the apparent popularity, based on website traffic, of a particular page containing the requested content. However, common services that download content from websites, index it, and then allow users to search the downloaded content are limited in what they can achieve.
Understanding how the Internet is organized helps in understanding the requirements for making effective use of the full range of information accessible through it. The Domain Name System (DNS) is the part of the Internet infrastructure that translates human-readable domain names into the Internet Protocol (IP) numbers needed to establish TCP/IP communication over the Internet. That is, DNS allows users to refer to websites and other resources using easy-to-remember domain names, such as "www.example.com", rather than the numeric IP addresses, such as "123.4.56.78", assigned to computers on the Internet. Each domain name is made up of a series of character strings (labels) separated by dots. The rightmost label in a domain name is known as the "top-level domain" (TLD). Well-known examples of TLDs are ".com", ".net", and ".org". Each TLD supports second-level domains, listed immediately to the left of the TLD, such as the "example" level in "www.example.com". Each second-level domain can include a number of third-level domains located immediately to its left, such as the "www" level in "www.example.com". Additional levels of domains can exist as well, with virtually no limit. For example, a domain with additional levels could be "www.photos.example.com".
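By way of illustration only, the level structure described above can be sketched in a few lines of Python. The function name `domain_levels` is hypothetical, and the sketch assumes a single-label TLD; it does not handle multi-label public suffixes such as "co.uk".

```python
def domain_levels(name: str) -> dict:
    """Split a dotted domain name into labels, numbered from the TLD leftward."""
    # Drop an optional trailing root dot and normalize case.
    labels = name.rstrip(".").lower().split(".")
    labels.reverse()  # level 1 is the TLD, level 2 the second-level domain, ...
    return {level: label for level, label in enumerate(labels, start=1)}


levels = domain_levels("www.photos.example.com")
print(levels)  # {1: 'com', 2: 'example', 3: 'photos', 4: 'www'}
```

Numbering from the right mirrors the way the text counts levels: the TLD first, then each further level to its left.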
It should also be noted that a single IP address, for example one assigned to a single server, can support numerous domain names. That is, different domain names may resolve to the same server, which can then determine what content to provide based on the requested domain name and/or additional non-domain information. This is sometimes referred to as virtual hosting.
Additional non-domain information may be included in a Uniform Resource Identifier ("URI") structure that includes the domain name. For example, a "path" part is a sequence of segments (conceptually similar to directories, though not necessarily representing them) separated by forward slashes ("/"). This information may be included immediately to the right of the domain name, such as the "blog" in "www.example.com/blog", and may be used by a server or other receiving device to identify and deliver specific content or to run particular code. Other examples of non-domain information include queries and fragments, the specifics of which are understood by those of ordinary skill in the art and are not discussed in detail here. Combinations of this information may be included in web page hyperlinks that direct a user to another section of the same page or to another web page that may be part of the same or a different domain.
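As a minimal sketch, the domain and non-domain parts of a URI can be separated with the Python standard library. The example URI is an assumption: the "http://" scheme, the "2010" path segment, the query, and the fragment are added for illustration and do not appear in the text above.

```python
from urllib.parse import urlsplit

# Hypothetical URI built around the "www.example.com/blog" example.
parts = urlsplit("http://www.example.com/blog/2010?tag=dns#comments")

print(parts.hostname)                    # domain part: www.example.com
print(parts.path.strip("/").split("/"))  # path segments: ['blog', '2010']
print(parts.query)                       # query: tag=dns
print(parts.fragment)                    # fragment: comments
```

Without a scheme, `urlsplit` would treat the whole string as a path, which is why the sketch adds one.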
Related domain names and content may be organized in a hierarchical or nested manner, such as "www.example.com"; "www.blog.example.com"; "www.example.com/blog"; or "blog.example.com", each with a different significance. Such related domains need not share any similarity in the actual IP addresses to which the various domain names resolve. In this regard, part of the domain name may signify a particular server that is desired; for example, "mail.example.com" and "www.example.com" may resolve to different servers, with different functions, for the same second-level domain.
Responsibility for operating each TLD (including maintaining a registry of the second-level domains within the TLD) is delegated to a particular organization, known as a domain name registry ("registry"). The registry is primarily responsible for converting domain names to IP addresses ("resolving") through DNS servers that maintain such information in large databases, and for operating its top-level domain.
Summary of the invention
The present subject matter can provide advantages in improving the accessibility and meaningfulness of domain-related content. By considering domain characteristics that have not previously been identified in a systematic way, users can search on domain attributes and, in broader applications, descriptions associated with domain websites can be built using information publicly available through DNS servers and websites. As used herein, domain characteristics may include, but are not limited to, the data content of part of a web page, the data content of linked web pages, and the data content of part of the DNS resolution infrastructure supporting the web page. Thus, given a domain name, relevant information such as attributes and characteristics of the website can be provided based on an extended search against the domain name. The relevant information can also provide improved search results based on attributes that describe the domain and/or its content. From these descriptions, improved analysis of subsets of the content available on the Internet can be performed, which can in turn be used to improve the efficiency and productivity of individual users, businesses, and web content support services.
Embodiments of the invention relate to systems and methods for analyzing domains. Embodiments may include identifying the status of a first page associated with a domain. Identifying the status may include attempting to resolve the domain name of the first page. If the domain name fails to resolve, the status of the domain may be identified as inoperable; if the domain name resolves but the content is inaccessible, the status may be identified as inaccessible.
If the first page is accessible, its content may then be retrieved. A plurality of additional pages may then be identified from the domain based on hyperlinks from the first page. The status of these additional pages may then be identified in the same manner as the status of the first page. That is, identifying the status of the additional pages may include attempting to resolve the domain name or network address of a first additional page and, if resolution fails, identifying the status of the first additional page as inoperable. If the domain name resolves but the content is inaccessible, the status of the first additional page may be identified as inaccessible, and so on for the other pages.
In addition, the hyperlinks may be prioritized based on the status and/or on a comparison with predetermined data. For example, character strings within a URI may be predetermined to have high or low importance, and hyperlinks containing such strings may be prioritized accordingly. Content may be extracted from the first page and from selected pages among the plurality of additional pages. The content may include the full range of domain characteristics discussed herein. Particular pages may be selected from among the additional pages based on their priority.
The content of the first page and the additional pages may be processed against a predetermined or generated collection of data to determine contextual matches within the content. This predetermined or generated data is referred to herein as a signature tag set. A signature tag set can be understood as a table of data that relates known elements to other objects. For example, a known term may be related to the frequency with which that term occurs across a sample of domains. Comparing a term found on a page with that term's domain frequency may be the first step in processing the term through the signature tag set. In embodiments, this can help determine the domain's description more precisely by focusing on terms with a low domain frequency, which may be more discriminating than terms with a high domain frequency. The signature tag set may also use techniques to reduce potential ambiguity of terms. For example, content may be compared with predetermined data representing known relationships among multiple pieces of data content. These can include known textual relationships, known data type relationships, and various combinations of similar data included in the domain characteristics. Thus, the significance of a first object can be determined more precisely by identifying a second, related object in the domain characteristics. As indicated above, domain characteristics can include data from, or linked to, the web page itself, or DNS information such as an IP address or URL.
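One way the domain-frequency idea might be sketched is shown below. This is an assumption-laden illustration, not the method of the embodiments: the table `DOMAIN_FREQUENCY`, its values, and the negative-logarithm weighting are all hypothetical choices made here to show how terms with low domain frequency can be ranked as more discriminating.

```python
import math

# Hypothetical signature tag set: term -> fraction of sampled domains on which
# the term appears. The values are invented for illustration only.
DOMAIN_FREQUENCY = {"home": 0.80, "contact": 0.60, "sedan": 0.004, "carburetor": 0.002}


def term_weight(term, table=DOMAIN_FREQUENCY):
    """Weight a term by the negative log of its domain frequency (assumed scheme)."""
    freq = table.get(term)
    if freq is None:
        return 0.0  # terms absent from the table carry no signal in this sketch
    return -math.log(freq)


def rank_terms(terms, table=DOMAIN_FREQUENCY):
    """Return the known terms from a page, most discriminating first."""
    known = {term for term in terms if term in table}
    return sorted(known, key=lambda term: term_weight(term, table), reverse=True)


print(rank_terms(["home", "sedan", "carburetor", "widget"]))
# ['carburetor', 'sedan', 'home']
```

Common terms such as "home" end up near the bottom of the ranking, while rare terms rise to the top.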
Based on the contextual matches, the status of the first page, and the status of the additional pages, a purpose of the domain may be determined. Embodiments may include those in which the content includes executable code embedded in the first page or the additional pages, and the signature tag set is configured to determine contextual matches within the executable code. Embodiments may also include those in which the content includes image, video, and audio information embedded in the first page or the additional pages, and the signature tag set is configured to determine contextual matches within the image, video, and audio information.
Once determined, the purpose of the domain may be used in a number of different ways, including searching, displaying, storing, and/or transmitting the determined purpose of the domain. Embodiments may include identifying and obtaining related domains based on the determined status and purpose of the domain.
Embodiments may include determining the purpose of the domain based on at least one of: a domain name identified in the content, a registrar-reseller indicator, the absence of a specified data type, an alternate publisher of the pages, a social community identifier, and a data type.
Embodiments may also include identifying one or more domains based on their purpose. A category of the identified domains, distinct from the respective purposes of the domains, may then be determined. Related domains may then be identified and obtained based on the predetermined purpose and category of the domain.
Embodiments may include following hyperlinks from the first page, including following redirects to an http version of the web page.
Embodiments may also include receiving a set of input domains from a user; identifying, from among the contextually matched content, attributes that the input domains have in common; and outputting the identified attributes to the user.
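As a minimal sketch, finding the attributes common to a set of input domains can be treated as a set intersection. The function name `common_attributes`, the domain names, and the attribute labels below are hypothetical; the sketch assumes the per-domain attributes have already been derived by the content processing described above.

```python
def common_attributes(domain_attributes):
    """Return the attributes shared by every input domain."""
    attribute_sets = [set(attrs) for attrs in domain_attributes.values()]
    if not attribute_sets:
        return set()
    return set.intersection(*attribute_sets)


# Hypothetical input: attributes previously identified for each domain.
analyzed = {
    "cars.example": {"automotive", "e-commerce", "blog"},
    "trucks.example": {"automotive", "e-commerce"},
    "parts.example": {"automotive", "e-commerce", "forum"},
}
print(sorted(common_attributes(analyzed)))  # ['automotive', 'e-commerce']
```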
Embodiments may include receiving an input set of purposes and/or categories from a user; identifying domains that correspond to the input set of purposes and/or categories; and outputting the identified domains to the user.
Embodiments may include performing the described method across a range of domains in an alternate manner and compiling a historical analysis of the related domain space.
Embodiments of the invention can also include systems that implement the described methods, together with computer-readable storage media encoded with instructions that cause a computer to perform the described methods. For example, an electronic system including a processor, memory, and an electronic communication device may be configured to: identify the status of a first page associated with a domain and obtain the first page via the electronic communication device; identify a plurality of additional pages from the domain based on hyperlinks from the first page; identify the status of the additional pages via the electronic communication device; prioritize the hyperlinks based on a comparison with predetermined data; extract content, via the electronic communication device, from the first page and from at least one of the plurality of additional pages, the at least one additional page being selected based on the prioritization; process the content through a signature tag set to determine contextual matches; determine a purpose of the domain based on the status of the first page, the status of the additional pages, and the results of processing the content; and at least one of display, store, and transmit the determined purpose of the domain.
Further advantages of the present subject matter will become apparent to those of ordinary skill in the art upon reading and understanding the following detailed description of the preferred embodiments.
Brief description of the drawings
Figure 1 depicts an exemplary system in accordance with an embodiment of the invention;
Figure 2 depicts an exemplary method in accordance with an embodiment of the invention;
Figure 3 depicts an exemplary web page and related content in accordance with an embodiment of the invention;
Figure 4 graphically depicts an exemplary hierarchical data resolution in accordance with an embodiment of the invention;
Figure 5 graphically depicts an exemplary process flow in accordance with an embodiment of the invention; and
Figure 6 depicts aspects of an exemplary method in accordance with an embodiment of the invention.
Detailed description
Embodiments of the invention can help users, web content providers, and/or registries understand how to determine the purpose of a domain name. A holistic view of the online activity of an industry segment can be built to provide an understanding of the online environment and to complement other micro-analysis tools. Thus, the systems and methods described herein can produce results that provide data on the purpose of a domain's site.
Given a domain name, the described systems and methods can derive attributes and characteristics associated with the site accessible through that domain name. The service can also allow users to search for domains based on criteria that may include several attributes describing the domain or the content on the domain. Embodiments can use information contained in publicly available DNS servers and content available on publicly available websites to build descriptions associated with domain websites.
Embodiments can gather and provide different types of domain-related data in a hierarchical manner, such as depicted in Figure 1. Embodiments can determine the domain status by attempting to load a web page, and can determine the domain purpose and the domain category by collecting textual or other data from the website and passing it through a signature tag set.
The systems and methods attempt to gather information about the following exemplary attributes of the website to which a domain refers: domain status; domain purpose; domain category; domain traffic; domain keywords; domain properties, features, and functions; and domain content. These attributes are described further below. Through the described data collection and analysis, embodiments of the invention can also provide an improved directory of TLDs, such as the .com and .net TLDs, and an overview of their use for hosting websites. For example, by determining a domain description for all domains, or a subset of domains, within a TLD, an improved directory can be created that classifies related domains according to patterns present within the domain space, rather than according to categories identified and used by the content of individual web pages. This may provide previously unrecognized advantages in managing Internet structures and services at various levels. For example, by determining the status, purpose, and category of domains, rather than of individual web pages alone, individual users, content providers, and registries can better understand how content is related and can identify patterns directly related to marks and other important aspects of the Internet's various uses.
The following exemplary method is described with reference to Figures 2 and 3. As shown in Figure 2, the method may begin at step S1000, in which the status of a first page is determined. Domain status generally concerns whether a domain resolves. For example, it can be determined whether there is a web server associated with the domain and, if so, whether the web server can be reached. Further information may include whether there are any specifically identified web server errors. For example, a domain name is entered and passed to a DNS server in an attempt to resolve it. If the domain name fails to resolve, the status of the domain may be determined to be inoperable. If the domain name resolves but the content is inaccessible, the status may be identified as inaccessible. Other status identifiers are also possible.
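The status determination of step S1000 might be sketched as follows, assuming Python's standard library. The function names, the status strings "inoperable", "inaccessible", and "active", and the choice of an "http://" request to the domain root are assumptions made for illustration; the resolver and fetcher are passed in as parameters so that the decision logic can be exercised without network access.

```python
import http.client
import socket
import urllib.error
import urllib.request


def resolve(domain):
    """Return an IP address for the domain, or None if resolution fails."""
    try:
        return socket.gethostbyname(domain)
    except OSError:  # includes socket.gaierror
        return None


def fetch(url, timeout=10):
    """Return the page body as bytes, or None if the content cannot be retrieved."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None


def page_status(domain, resolver=resolve, fetcher=fetch):
    """Classify a domain's first page using the two checks described above."""
    if resolver(domain) is None:
        return "inoperable"    # the domain name failed to resolve
    if fetcher("http://" + domain + "/") is None:
        return "inaccessible"  # resolved, but the content could not be retrieved
    return "active"
```

A call such as `page_status("www.example.com")` would perform a live lookup and request; the result depends on network conditions, so no output is shown here.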
The status of a domain may indicate the operational state of the associated website, such as active, HTTP error, and so on. The status may indicate whether a website is accessible from the particular domain and, if not, at what stage the attempt to access the website failed. Access failures may include error codes assigned to the domain's website, such as those indicated in Table 1.
Table 1:
If one or more of the domain queries results in an inoperable or otherwise inaccessible error, the method proceeds from step S1010 to step S1700, in which the purpose of the domain is determined based on the determined status. For example, a status code, such as a name server error, may be used to determine a generic inoperable purpose, or a more specific inoperable purpose associated with the error code.
If the domain name resolves but results in a redirect, this may be included in the determination of the domain status. For example, the status of a domain that redirects and has no active content of its own may be determined to be "redirected".
If the web server for the domain name is successfully contacted and does not result in a redirect, the content of the first page 500, an example of which is shown in Figure 3, may then be obtained in step S1100.
In step S1100, content from the first page is obtained, such as elements 502, 504, 506, and 508. The content can take various forms known to those skilled in the art, including, for example, text, multimedia, hyperlinks, or other executable code. By way of example, elements 502, 504, and 506 may be web page buttons that activate hyperlinks to web pages 510, 520, and 530, respectively. Element 508 may be text, graphics, or other multimedia data content. The content may be used for at least two purposes described herein. One purpose is to identify any other pages related to the first page, for example pages accessible through the first page via hyperlinks embedded in it, such as elements 502, 504, and 506. This function may be performed in step S1200. That is, a plurality of additional pages, such as 510, 520, and 530, may be identified based on hyperlinks detected in the first page. It should be noted that although exemplary pages 510, 520, and 530 share the same second-level domain, other pages related to page 500, for example pages hyperlinked from it, need not share the same domain. Another purpose of the content is to help determine the purpose of the domain, as discussed further below. The method continues to step S1300.
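The identification of additional pages in step S1200 can be sketched with Python's standard HTML parser. The class `LinkCollector`, the function `extract_links`, and the sample markup are hypothetical; the sketch only collects `href` values from anchor tags and resolves them against the first page's URL, and it ignores links created by scripts or other embedded code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the first page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical first-page markup with one relative and one absolute link.
first_page = '<a href="/products">Products</a> <a href="http://mail.example.com/">Mail</a>'
print(extract_links(first_page, "http://www.example.com/"))
# ['http://www.example.com/products', 'http://mail.example.com/']
```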
In step S1300, the status of the additional pages, such as 510, 520, and 530, may be identified. Identifying the status of the additional pages may include attempting to resolve the domain name or network address of a first additional page and, if resolution fails, identifying the status of the first additional page as inoperable. If the domain name resolves but the content is inaccessible, the status of the first additional page may be identified as operational but inaccessible. Other status identifiers are also possible, as described above with reference to the status determination for the first page 500. The method continues to step S1400.
In step S1400, the identified hyperlinks and the associated additional web pages may be prioritized based on a comparison with predetermined data. For example, hyperlink data, including domain and non-domain URI information, may be compared with predetermined indicators, such as lists of significant keywords, characters, or other values that suggest desired content. A structural analysis of the hyperlink data may also be performed as part of this processing to identify patterns of information, such as particular nesting formats. As a result, a prioritized list of the identified hyperlinks may be produced.
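A minimal sketch of the prioritization in step S1400 follows. The indicator table `TOKEN_WEIGHTS`, its weights, and the substring-matching rule are assumptions chosen for illustration; the text above only states that hyperlink data is compared with predetermined indicators.

```python
from urllib.parse import urlsplit

# Hypothetical predetermined indicators: substring -> importance weight.
TOKEN_WEIGHTS = {"about": 2, "products": 2, "login": -1, "mail": -2}


def link_score(url, weights=TOKEN_WEIGHTS):
    """Sum the weights of every indicator found in the host name or path."""
    parts = urlsplit(url)
    text = ((parts.hostname or "") + parts.path).lower()
    return sum(weight for token, weight in weights.items() if token in text)


def prioritize(urls, weights=TOKEN_WEIGHTS):
    """Return the hyperlinks ordered from highest to lowest priority."""
    return sorted(urls, key=lambda url: link_score(url, weights), reverse=True)


links = [
    "http://mail.example.com/",
    "http://www.example.com/contact",
    "http://www.example.com/products",
]
print(prioritize(links))
# ['http://www.example.com/products', 'http://www.example.com/contact', 'http://mail.example.com/']
```

Because the sort is stable, links with equal scores keep the order in which they were found on the page.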
A user may select the particular indicators to be used in order to customize the prioritization. This may include presenting the user with a list of indicators that have been recognized in the identified hyperlink data and allowing the user to select among them. To further assist the user, the user may optionally be provided with additional information related to the recognized indicators. For example, each recognized indicator may be given a quantitative or qualitative value that helps the user estimate which indicators matter most to the user or to the domain in question. For example, the user may have a particular interest in certain identified indicators, or a group of indicators may appear more relevant based on a quantitative measure of how often they occur in the hyperlink data. Thus, the user can select among the indicators based on additional predetermined information associated with the particular indicators presented. This can make it easier to identify the more relevant web pages from which to extract content that forms part of the domain description. The method continues to step S1500.
In step S1500, additional content, such as content 508, may be extracted from the first page 500 and, if desired, content may be extracted from selected pages among the plurality of additional pages. The selection of particular pages from among the additional pages may be based on their priority. For example, if additional pages 510, 520, and 530 are prioritized such that page 530 has the lowest priority, content may be gathered from pages 510 and 520 only. In embodiments, page 530 may be assigned a low priority based on the text string "mail" within its domain name. This can conserve resources and also lead to more accurate results in later steps by identifying the most relevant additional pages and content. The type of content gathered may also be determined based on the parameters of the signature tag set, discussed further below. For example, textual content may be all the data that is required if the signature tag set is configured only for text. This distinction can also provide efficiencies when gathering and analyzing data at large scale. If the signature tag set is configured to handle multiple data types, this may improve the overall accuracy of the analysis.
Following the hyperlink data from the first page may also include following redirects. For example, hyperlink data may not resolve directly to another web page, but may instead indicate a redirect that needs to be followed. Thus, an "additional page" may be indirectly linked or related to the first page. This may also include following redirects to an http version of the web page. The method continues to step S1600.
In step S1600, the content gathered from the first page and the additional pages may be processed through a signature tag set to determine contextual matches within the content. As noted above, a signature tag set may include linked data elements that provide a contextual match, or the significance of one of the data elements. By identifying contextual matches in the gathered content, appropriate weight can be given to the significance of individual portions of the content. For example, while the term "Ford" is the name of an automobile manufacturer, it is also a common surname and, by itself, is of uncertain significance. This can lead to the term being incorrectly considered or disregarded as part of the web page content. Disambiguating the term may require that "Ford" be used in close proximity to other recognized automotive terms before it is treated as relevant to the automotive category. An n-gram model is a type of probabilistic model for predicting the next item in a sequence. N-grams are used in various areas of statistical natural language processing and general sequence analysis, and may be used with the present subject matter to refine the content processing described herein. For example, an n-gram model predicts x_i based on x_{i-1}, x_{i-2}, …, x_{i-n}. As described above with reference to the tags used to prioritize the hyperlink data and additional pages, the method may allow the user to select desired associations from among the identified associations. For example, the user may identify certain identified contextual matches as valid and others as invalid or to be ignored. In addition, these methods can help the user recognize the significance of a contextual match by providing additional quantitative or qualitative information in conjunction with the contextual match. Thus, automated methods may be used to evaluate the weight given to a contextual match, as in the case of a predetermined signature tag set that is automatically applied to the extracted content, or the user may be assisted in doing so, as in the case of allowing the user to accept, weight or reject identified contextual matches.
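The n-gram prediction mentioned above may be illustrated with a small sketch. It assumes the simplest case of a bigram model, in which each token is predicted from the single preceding token, and estimates probabilities by unsmoothed relative frequency; both choices, and the function names, are assumptions for illustration rather than the method of the specification.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts


def next_token_probability(counts, prev, cur):
    """Estimate P(cur | prev) by relative frequency, without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0
```

Trained on gathered page text, such a model would assign a higher probability to automotive terms following "ford" on a manufacturer's site than on a page about a person with that surname, which is one way the surrounding context could inform the weight given to the term.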
With reference to the signature tag set, the particular types of data processed from the content are not limited to text or other descriptive data. For example, embodiments may include those in which the content includes executable code embedded in the first page or the additional pages, and the signature tag set is configured to determine contextual matches within the executable code or within the results of the executable code. Embodiments may also include those in which the content includes image, video and/or audio information embedded in the first page or the additional pages, and the signature tag set is configured to determine contextual matches within any of the image, video and audio information. For example, there may be contextual matches among various audio data, such as pattern recognition results, audio type, length, or any number of related attributes. A large volume of audio information with the same pattern may be a strong indicator of a particular type of web page, such as content directed to a particular musical artist, and may therefore be of particular significance.
For both the hyperlink tags and the signature tag set, tag performance reporting may be included to provide improved performance. For example, the user may be presented with a qualitative assessment of the performance of individual tags. Alternatively, or in conjunction with an automated assessment, the user may independently evaluate and rate the effectiveness of the tags. Thus, the system can ensure that the tags in use are effective by providing reports on tag performance and allowing changes in the use of the tags. The reports may be presented to a user who can implement changes, or the system may automatically discard poorly performing tags, such as those that fall below certain thresholds. After processing the content, the method continues with step S1700.
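One possible form of such a report is sketched below. The performance measure (the share of a tag's matches that were judged correct), the input layout and the function names are all assumptions for illustration; the description does not specify how tag performance is scored.

```python
def performance_report(stats):
    """Map each tag to the share of its matches that were judged correct.

    `stats` maps a tag to a (correct_matches, total_matches) pair.
    Tags with no matches score 0.0.
    """
    report = {}
    for tag, (correct, total) in stats.items():
        report[tag] = correct / total if total else 0.0
    return report


def retain_tags(report, threshold):
    """Return, in sorted order, the tags whose score meets the threshold."""
    return sorted(tag for tag, score in report.items() if score >= threshold)
```

The report could be shown to a user for manual review, or `retain_tags` could be applied automatically so that tags falling below the threshold are dropped.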
In step S1700, the purpose of the domain may be determined based on the status of the first page, the status of any additional pages, and any results of the contextual matching. The purpose of a domain can relate to the subject matter, overall meaning or intended use of the content associated with the domain. The purpose may reflect the apparent reason for which the domain is used. In cases where resolution of the domain fails, or other access errors are returned, the purpose may be that the domain is not in use or is restricted. Other identified purposes may include pay-per-click (PPC) parking, purposeful or covert redirection, redirection, microblogging and the like, as detailed in Figure 2 below.
The results of the contextual matching can be particularly useful in determining the designated purpose of an active domain. The accuracy of this assessment is improved by the additional analysis of the prioritized additional pages. Thus, a functioning website may be assigned a code for a non-unique purpose, such as those identified in Table 2, as well as any other suitable code used to indicate other purposes.
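A minimal sketch of the purpose determination of step S1700 follows. The status strings, the purpose labels other than those named in the description, and the rule that a domain is left undetermined when none of its additional pages could be reached are all assumptions for illustration; they are not the codes of Table 2.

```python
# Hypothetical status values indicating that the domain could not be accessed.
ERROR_STATUSES = {"NO_DNS", "SERVER_ERROR"}


def determine_purpose(first_status, additional_statuses, match_scores):
    """Pick a purpose from page statuses and contextual-match scores.

    `match_scores` maps a candidate purpose to an accumulated weight.
    """
    if first_status in ERROR_STATUSES:
        return "NOT_IN_USE"
    if first_status == "REDIRECT":
        return "REDIRECT"
    reachable = [s for s in additional_statuses if s == "OK"]
    # Assumed rule: with no scores, or with additional pages listed but none
    # reachable, the evidence is treated as insufficient.
    if not match_scores or (additional_statuses and not reachable):
        return "UNDETERMINED"
    return max(match_scores, key=match_scores.get)
```

Status is checked before content in this sketch because, as described above, a domain that fails to resolve has no content from which a purpose could be inferred.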
Table 2:
Additional information may also be included in the purpose determination. For example, measurements of domain transactions, such as statistics taken directly from DNS transaction processors, may be included to assess whether a website is an authorized retail site. Domain keywords, such as title, name and description, may also be given assigned weights in addition to the contextual match weighting factors described above. Domain characteristics, features and/or functions that indicate whether common features are present on the website and the additional pages may be used to determine a purpose, such as blog, retail, and so on.
It may also be advantageous to take technical details of the content into account in determining the purpose. For example, identifying the types of technology used by the website, such as mail servers associated with the domain, cookies, multimedia, SSL certificates for secure data, and the like, may provide an indication of purpose, such as retail as opposed to personal or other purposes. Additional technical data beyond the specified content may also be indicative, such as the average time taken to present the content on the relevant page, the geographic location of the web server, and so on.
Embodiments may include determining the purpose of the domain based on identifying, in the content, at least one of a domain name, a registrant-reseller marker, the absence of a specified data type, alternate publishers of the pages, a social community identifier, and a data type. That is, the presence or absence of particular content, such as the items listed above, may have independent significance in establishing the overall purpose of the domain. Examples may include indications of advertiser information, or the absence thereof, or a data type such as image data, or the absence thereof. Once the purpose of the domain has been determined, the method continues with step S1800.
In step S1800, the category of the domain may be determined, which is distinct from the purpose of the domain. The category of a domain may reflect the business sector to which the content on the site relates. This may involve determining to which category the content from the first page and the additional pages belongs. For example, the domain category may place the domain within a business classification taxonomy consistent with the North American Industry Classification System. Table 3 is a partial list of exemplary categories, which may include an assigned priority order for categories within the category codes.
Table 3:
The information used to determine the domain category may include the content obtained from the first page and the additional pages, and may even be the same content used to determine the purpose of the domain. However, the significance attributed to individual pieces of information may differ in each process. For example, as described above, the presence of "Ford" together with other contextually related information may be used to determine a COMPANY/ORGANIZATION purpose for the web page. The presence of the same "Ford" information together with other automotive information may also be used to determine that the category of the domain is in the automotive field.
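The idea that the same content may carry different weight in the purpose and category determinations can be sketched as follows. The weight tables, their values and the "AUTOMOTIVE" label are hypothetical; only the COMPANY/ORGANIZATION purpose and the "Ford" example come from the description.

```python
# Hypothetical term weights for the two separate determinations.
PURPOSE_WEIGHTS = {
    "COMPANY/ORGANIZATION": {"ford": 1.0, "investors": 2.0, "careers": 2.0},
}
CATEGORY_WEIGHTS = {
    "AUTOMOTIVE": {"ford": 2.0, "truck": 1.0, "sedan": 1.0},
}


def score(tokens, weights):
    """Sum, per label, the weights of the tokens that appear in the content."""
    totals = {}
    for label, term_weights in weights.items():
        totals[label] = sum(term_weights.get(t.lower(), 0.0) for t in tokens)
    return totals
```

Here the token "Ford" contributes to both scores, but with a different weight in each table, so one pass over the gathered content can feed both determinations.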
As with the determination of the domain's purpose, the domain categorization is not limited to the content of a particular web page. Rather, the domain categorization may draw on the content of the first page and of the prioritized additional pages in order to arrive at a comprehensive categorization of the domain. Embodiments may also include identifying one or more domains based on their purpose before performing the categorization of those domains. This may provide advantages in accurate categorization, such as assigning domains to categories with or without user interaction.
As discussed in embodiments of the invention, the characteristics of domain purpose and domain category can be distinct. That is, the domain purpose may include the purpose of the website associated with the domain, such as news, blog, parking, and so on. The purpose of the domain can be the primary purpose of the domain's website. The purpose can relate to the content of the particular website rather than to the company that owns it. That is, the corporate website of a given company can have a different purpose than, for example, a promotional blog site run by that company.
After the domain category has been determined, the exemplary method may proceed to step S1900, with at least one of storing the result to a computer-readable storage medium, displaying the result to a user, or otherwise communicating the result electronically to a requester over an electronic communication network. Embodiments may further include performing targeted searches in the stored data, and/or performing the method iteratively across a range of domains and compiling the results for historical analysis of the relevant domain space, as discussed further below.
Figure 4 depicts an example of a high-level process flow for a monthly iteration cycle in accordance with embodiments of the invention. It generally consists of inputs 610, components 620 and outputs 630. Additional tools 690 may also be included. Such additional tools may assist the user in making various selections/inputs 610, such as prioritizing identified hyperlinks, terms, and the like. The inputs may include, for example, zone files 612, signature tag sets 614 and training sets 618. In embodiments, the user may enter or select items within the signature tag set 614 and/or the training set 618. The components may include a web crawler 622 and an analyzer 624. The analyzer can be functionally divided into category 626 and classification 628 portions. A registry 616 may maintain the various processes (inputs, components and outputs) and apply them to designated portions of the network space 650 in order to better understand the related online activity by industry sector. For example, the registry may collect all DNS transactions from a given server. Data for all subdomains of a TLD may be collected periodically and retained in file storage 632 for a period of time. Reports 634 may also be generated based on the method, including the training set 618 working in conjunction with the additional tools 690 to process information from the designated portions of the network space 650. This can provide DNS transaction values and domain status, purpose and category for each domain name over a period of time, and can provide access to information that has not been available in the past.
Embodiments may also include receiving from a user, in step S2000 and as part of a query, a set of input domains for analysis. The system may automatically identify attributes that are common among these input domains. These attributes may be drawn from the contextually matched content or from other gathered information. The results of this analysis may include outputting the identified common attributes to the user in step S2100. This capability may provide advantages such as being able to automatically identify attributes related to a common purpose of the domains, including the additionally identified pages.
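The identification of common attributes among a set of input domains could be sketched as a set intersection. The function name and the representation of attributes as plain strings are assumptions for illustration.

```python
def common_attributes(domain_attributes):
    """Return the attributes shared by every input domain, in sorted order.

    `domain_attributes` maps a domain name to an iterable of its attributes.
    """
    sets = [set(attrs) for attrs in domain_attributes.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))
```

The sorted list returned here corresponds to the common attributes that would be output to the user in step S2100.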
Embodiments may also include receiving, in step S2000 and as part of a query, an input set of purposes and/or categories, such as from a user. Domains corresponding to the input set of purposes and/or categories may be identified based on the described method, and the identified domains may be output to the user in step S2100. This capability may have the advantage of providing improved classification and identification of relevant information and/or domains that would not be possible using common web page content analysis methods alone. The described method can provide prioritization and parsing of downloaded content, can identify and collect important attributes associated with a domain, including various kinds of direct and indirect content, and can allow a user or administrator to search based on the attributes of the domain.
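A query of this kind over stored results might look like the following sketch. The record layout (a dictionary with "purpose" and "category" keys per domain) and the function name are assumptions for illustration.

```python
def find_domains(records, purpose=None, category=None):
    """Return the domain names whose stored record matches the given filters.

    A filter left as None is not applied.
    """
    return [
        name
        for name, rec in records.items()
        if (purpose is None or rec["purpose"] == purpose)
        and (category is None or rec["category"] == category)
    ]
```

Either filter may be omitted, so the same function serves queries by purpose alone, by category alone, or by both.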
Additional details related to an exemplary process flow are provided in Figure 6. As shown in Figure 6, the process may start at S600 and proceed to S610, where a target zone file is obtained. For example, the target zone may be a specified domain, as discussed herein. The method proceeds to S620.
In S620, the process attempts to connect to the target zone, which may be a specified domain. Different types of errors may arise from the attempt to connect to the target zone. For example, there may be a lame delegation, in which the DNS server is not operational. If no DNS is identified, or if other errors similar to those described above occur at S620, the method proceeds to S624, which is referred to as a "No DNS" error. Based on this determination, a status may be reported at S680.
If the attempt to connect to the target zone is successful, the method continues to S630. A successful attempt may include the requested domain name being resolved to an IP address by the DNS server. However, there are errors that can occur at the web server level and that may prevent content from being retrieved from the requested address. For example, in S634, a server error may be identified, such as the name server timing out, or another error related to the server being indicated in response to the request to connect to the IP address. If a server error is indicated in S634, the method may proceed to S680, where the error may be reported.
If a server is found during S630, the method may proceed to S640-S648, where various responses may be received based on the attempt to crawl the specified website or address. These may include an indication that the requested domain does not have an active website, as in S640. There may also be an error with the web server responsible for serving the website after the server has been found, as in S642. The server or website may also restrict the ability of the web crawler to retrieve content, as in S644, or redirect the web crawler to another site, as in S646. These and other responses that amount to incomplete access to the content of the website may be reported in S680.
If the website is accessed and content is available, the method may proceed to S648, where the content at the website is identified as found. As described further herein, once content from the website or target domain has been found, the method may continue by accessing and analyzing the found content, as in S660. The results of the content retrieval and/or analysis may be reported in S680.
The methodology described in Figure 6 may thus result in a number of different reports in S680, depending on how far the method progresses through the steps of attempting to access the target zone. Some of these may reflect a non-functioning state of the domain, as in the case of DNS or web server errors, while others may include further status, purpose and category determinations, depending on the amount and type of information, including content, obtained during the process.
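The decision flow of Figure 6 can be summarized in a short sketch. The step identifiers are taken from the description, but the enumeration names, the boolean inputs and the crawl response keys are hypothetical simplifications; an actual implementation would derive them from DNS and HTTP results.

```python
from enum import Enum


class Status(Enum):
    """Outcomes of Figure 6, valued by the step at which each is identified."""
    NO_DNS = "S624"
    SERVER_ERROR = "S634"
    NO_ACTIVE_SITE = "S640"
    WEB_SERVER_ERROR = "S642"
    CRAWLER_RESTRICTED = "S644"
    REDIRECTED = "S646"
    CONTENT_FOUND = "S648"


# Hypothetical keys describing the response to the crawl attempt.
CRAWL_RESPONSES = {
    "no_site": Status.NO_ACTIVE_SITE,
    "web_error": Status.WEB_SERVER_ERROR,
    "restricted": Status.CRAWLER_RESTRICTED,
    "redirect": Status.REDIRECTED,
    "content": Status.CONTENT_FOUND,
}


def report_status(dns_resolved, server_responded, crawl_response):
    """Return the status to report in S680 for one target domain."""
    if not dns_resolved:
        return Status.NO_DNS
    if not server_responded:
        return Status.SERVER_ERROR
    # An unknown response key raises KeyError rather than being guessed.
    return CRAWL_RESPONSES[crawl_response]
```

Only a `Status.CONTENT_FOUND` result would lead on to the content analysis of S660; every other result is reported as is.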
Embodiments of the invention can include systems for implementing the described methods, as well as computer-readable storage media encoded with instructions for causing a computer to perform the described methods. For example, as shown in Figure 5, an electronic system 100 including a processor, a memory and an electronic communication device may be configured to identify, via a DNS server 140, the status of a first page associated with a domain. The system 100 may be represented as a user computer system, a wireless communication device such as 120, 170, a subnet such as 130, 190, a server, or any other network-capable device having the requisite functional capabilities. The system 100 may operate as part of a DNS server associated with a registry, or independently of that DNS server.
The first page may be received by the system 100 from a server 150 over an electronic communication network 170, such as the Internet. The system 100 may identify a plurality of additional pages from the domain based on hyperlinks from the first page. The system 100 may then identify the status of these additional pages via the DNS server 140. The system 100 may also prioritize the hyperlinks based on a comparison with predetermined data, as described above. Content may be extracted from the first page and from at least one of the plurality of additional pages via the servers 150, 160, for example from a web hosting server. The system 100 may then process the content through a signature tag set, which is stored on or otherwise electronically accessed by the system 100, to determine contextual matches.
The system 100 can further determine the purpose of the domain based on the status of the first page, the status of the additional pages, and the results of the content processing. The system 100 may receive the various user inputs described above, such as a selection of identified web pages from which to extract data content, a selection of items, and so on. The results of the described processing performed by the system 100 may be displayed, stored and/or transmitted according to known techniques.
The system 100 includes any number of processors (not shown) coupled to storage devices including a first storage (not shown, typically a random access memory, or "RAM") and a second storage (not shown, typically a read-only memory, or "ROM"). Each of these storage devices may include any suitable type of the computer-readable media described and/or mentioned above. A mass storage device (not shown) may also be used to store programs, data and the like, and is typically a secondary storage medium, such as a hard disk, that is slower than primary storage. It will be appreciated that the information retained within the mass storage device may, in appropriate cases, be incorporated in standard fashion as part of primary storage as virtual memory. A specific mass storage device such as a CD-ROM may also pass data uni-directionally to the processor.
The system 100 may also include an interface that includes one or more input/output devices such as video monitors, trackballs, mice 104, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other known input devices, including other computers 110. The system 100 may be coupled to a computer or other electronic communication network 170, 180 using a network connection as shown generally at 101. The network can connect various wired, optical, electronic and other known networks to exchange information among computers 110, servers 160, wireless communication devices 120, 170 and subnets 190, 130. With such a network connection, it is contemplated that the system 100 and the processor therein may receive information from the network, or may output information to the network, in the course of performing the method steps described above. The devices and materials described above will be familiar to those skilled in the computer hardware and software arts and need not be individually or exhaustively described to be understood by those skilled in the art. The hardware elements described above may be configured (usually temporarily) to act as one or more modules for performing the operations described above.
In addition, embodiments of the invention further include computer-readable storage media that include program instructions for performing various computer-implemented operations. The media may also include, alone or in combination with the program instructions, data files, data structures, tables and the like. The media and program instructions may be those specially designed and constructed for the purposes of the present subject matter, or they may be of the kind available to those skilled in the computer software arts. Examples of computer-readable storage media include magnetic media such as hard disks, floppy disks and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optically readable disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as that produced by a compiler, and files containing higher-level code that may be executed by the computer using an interpreter.
The invention has been described with reference to exemplary embodiments. Modifications and alterations of the described embodiments will be apparent to those of ordinary skill in the art upon a reading and understanding of this specification. The invention is intended to include all such modifications and alterations insofar as they come within the scope of the appended claims or their equivalents.
Claims (14)
Applications Claiming Priority (9)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16752109P | 2009-04-07 | 2009-04-07 | |
US16752309P | 2009-04-07 | 2009-04-07 | |
US16752809P | 2009-04-07 | 2009-04-07 | |
US61/167521 | 2009-04-07 | ||
US61/167528 | 2009-04-07 | ||
US61/167523 | 2009-04-07 | ||
US12/428208 | 2009-04-22 | ||
US12/428,208 US9292612B2 (en) | 2009-04-22 | 2009-04-22 | Internet profile service |
PCT/US2010/030211 WO2010118115A1 (en) | 2009-04-07 | 2010-04-07 | Domain status, purpose and categories |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102460417A CN102460417A (en) | 2012-05-16 |
CN102460417B true CN102460417B (en) | 2015-07-29 |
Family
ID=42936554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201080024769.7A Active CN102460417B (en) | 2009-04-07 | 2010-04-07 | Domain state, role and kind |
Country Status (9)
Country | Link |
---|---|
EP (1) | EP2417536A4 (en) |
JP (1) | JP2012523626A (en) |
KR (1) | KR101670700B1 (en) |
CN (1) | CN102460417B (en) |
AU (1) | AU2010234488B2 (en) |
BR (1) | BRPI1014177A2 (en) |
CA (1) | CA2757833C (en) |
RU (1) | RU2011144859A (en) |
WO (1) | WO2010118115A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104104556B (en) | 2013-04-12 | 2018-09-28 | 腾讯科技(北京)有限公司 | Carry out the method and system that recommendation information shows |
CN105243073A (en) * | 2014-07-11 | 2016-01-13 | 北京金山安全软件有限公司 | Bookmark access method and device and terminal |
US10606821B1 (en) | 2016-08-23 | 2020-03-31 | Microsoft Technology Licensing, Llc | Applicant tracking system integration |
CN111291284A (en) * | 2018-12-10 | 2020-06-16 | 北京京东金融科技控股有限公司 | Method and device for redirecting multi-level page |
CN110211581B (en) * | 2019-05-16 | 2021-04-20 | 济南市疾病预防控制中心 | A laboratory automatic speech recognition record identification system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786947A (en) * | 2004-12-07 | 2006-06-14 | 国际商业机器公司 | System, method and program for extracting web page core content based on web page layout |
CN101046820A (en) * | 2006-03-29 | 2007-10-03 | 国际商业机器公司 | System and method for prioritizing websites during a webcrawling process |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7114177B2 (en) * | 2001-03-28 | 2006-09-26 | Geotrust, Inc. | Web site identity assurance |
US7565402B2 (en) * | 2002-01-05 | 2009-07-21 | Eric Schneider | Sitemap access method, product, and apparatus |
US20030225876A1 (en) * | 2002-05-31 | 2003-12-04 | Peter Oliver | Method and apparatus for graphically depicting network performance and connectivity |
US7461257B2 (en) * | 2003-09-22 | 2008-12-02 | Proofpoint, Inc. | System for detecting spoofed hyperlinks |
US20080028443A1 (en) * | 2004-10-29 | 2008-01-31 | The Go Daddy Group, Inc. | Domain name related reputation and secure certificates |
US20080082662A1 (en) * | 2006-05-19 | 2008-04-03 | Richard Dandliker | Method and apparatus for controlling access to network resources based on reputation |
US20080163369A1 (en) * | 2006-12-28 | 2008-07-03 | Ming-Tai Allen Chang | Dynamic phishing detection methods and apparatus |
-
2010
- 2010-04-07 WO PCT/US2010/030211 patent/WO2010118115A1/en active Application Filing
- 2010-04-07 CA CA2757833A patent/CA2757833C/en active Active
- 2010-04-07 JP JP2012504817A patent/JP2012523626A/en active Pending
- 2010-04-07 CN CN201080024769.7A patent/CN102460417B/en active Active
- 2010-04-07 KR KR1020117026116A patent/KR101670700B1/en active Active
- 2010-04-07 RU RU2011144859/08A patent/RU2011144859A/en not_active Application Discontinuation
- 2010-04-07 AU AU2010234488A patent/AU2010234488B2/en active Active
- 2010-04-07 BR BRPI1014177A patent/BRPI1014177A2/en not_active IP Right Cessation
- 2010-04-07 EP EP10762358.9A patent/EP2417536A4/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
JP2012523626A (en) | 2012-10-04 |
AU2010234488B2 (en) | 2015-01-22 |
WO2010118115A1 (en) | 2010-10-14 |
CA2757833A1 (en) | 2010-10-14 |
EP2417536A1 (en) | 2012-02-15 |
BRPI1014177A2 (en) | 2016-04-05 |
CN102460417A (en) | 2012-05-16 |
KR20120005012A (en) | 2012-01-13 |
CA2757833C (en) | 2018-09-18 |
KR101670700B1 (en) | 2016-10-31 |
EP2417536A4 (en) | 2016-08-31 |
RU2011144859A (en) | 2013-05-20 |
AU2010234488A1 (en) | 2011-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9742723B2 (en) | | Internet profile service |
US9262767B2 (en) | | Systems and methods for generating statistics from search engine query logs |
KR100478019B1 (en) | | Method and system for generating a search result list based on local information |
CN100403305C (en) | | Search result generation systems including search by subdomain leads and sponsored results by subdomain |
US7421441B1 (en) | | Systems and methods for presenting information based on publisher-selected labels |
US20110314011A1 (en) | | Automatically generating training data |
US20090089278A1 (en) | | Techniques for keyword extraction from urls using statistical analysis |
US20070112777A1 (en) | | Identification and automatic propagation of geo-location associations to un-located documents |
US20110119268A1 (en) | | Method and system for segmenting query urls |
US20110060717A1 (en) | | Systems and methods for improving web site user experience |
US20150058712A1 (en) | | Method for assisting website design using keywords |
US9330168B1 (en) | | System and method for identifying website verticals |
US7216122B2 (en) | | Information processing device and method, recording medium, and program |
US9892189B1 (en) | | System and method for website categorization |
JP2007528520A (en) | | Method and system for managing websites registered with search engines |
US20150058339A1 (en) | | Method for automating search engine optimization for websites |
CN102460417B (en) | | Domain state, role and kind |
KR100901960B1 (en) | | Method and system for providing new advertisements available |
KR101020895B1 (en) | | Method and system for generating a search result list based on local information |
KR101048590B1 (en) | | A method of managing web sites registered in search engine and a system thereof |
KR100458458B1 (en) | | A method of managing web sites registered in search engine and a system thereof |
KR20040086731A (en) | | Method and system for generating a search result list based on local information |
KR20040103763A (en) | | A method of managing web sites registered in search engine |
Bardone et al. | | Estimation of Web Contents Geographic Provenience Exploiting Creative Commons Licensed Pages for Training Set Aggregation |
Bardone et al. | | Source geography estimation for web pages |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C14 | Grant of patent or utility model | |
| | GR01 | Patent grant | |