CN102799610B - Method and system for collecting network information

Info

Publication number: CN102799610B
Application number: CN201210180521.0A
Authority: CN
Inventors: 赵勇; 党书国; 阎飞飞
Original assignee: BEIJING QILEKE TECHNOLOGY CO LTD
Current assignee: Zhejiang Shuqin Technology Co Ltd
Priority date: 2012-06-01
Filing date: 2012-06-01
Publication date: 2017-04-12
Anticipated expiration: 2032-06-01
Also published as: CN102799610A

Abstract

The invention discloses a method and a system for collecting network information. The method comprises the following steps of: acquiring information to be collected which is required to be collected by a user according to a collection instruction of the user; analyzing the information to be collected to determine a classification of the information to be collected; and storing the information to be collected and the determined classification of the information to be collected, wherein the information to be collected comprises website information which is required to be collected by the user and/or information relevant to webpage contents. Preferably, during determination, based on a preset keyword library, at least one of text analysis, semantic analysis and word frequency statistical analysis is executed on information of the webpage contents corresponding to the information to be collected, so that whether at least one keyword is comprised in the information of the webpage contents corresponding to the information to be collected and also comprised in the preset keyword library is judged; and the classification of the information to be collected is determined according to a judgment result. By the method and the system, the convenience in network collection for the user is enhanced.

Description

Network information collection method and system

Technical Field

The present invention relates to network technologies, and in particular to a network information collection method and system for collecting network content such as network addresses and web page content.

Background

With the popularization and development of internet technology, the number and content of websites, blogs and microblogs are rapidly increasing. Some technical solutions for helping users to collect network contents are emerging.

In one collection mode, the user adds the access address and name of a website or web page to the browser's favorites as a bookmark (consisting of the web page name and the corresponding link). When the user wants to access a collected web page, clicking the corresponding bookmark in the favorites switches the browser to that page for reading. In this mode, unless the user manually places each newly added bookmark in a category under the favorites, the browser stores the bookmarks in the order in which the user added them. As a result, after collecting web page links for a period of time, the user may not easily find the bookmark corresponding to a particular web page. Further, if the user wishes to place bookmarks in the favorites by category, manual setting is required, which causes unnecessary trouble for the user.

In another collection mode, the user accesses a web page provided by a web service provider for collecting website links, and enters the name or link of the website to be collected together with its classification, in order to store the website links that the user likes to access. In this mode, the user must manually set or select the category of the website to be collected, and even manually enter the website name or link; these tedious manual operations greatly reduce the user-friendliness of the collection system.

The invention is provided in order to better help the user to organize and manage the content collected by the user.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a network information collecting method and system capable of automatically classifying the network information to be collected.

In order to solve the technical problem, the invention provides a network information collection method. The method comprises the following steps:

an acquisition step, namely acquiring information to be collected, which is to be collected by a user, according to a collection instruction of the user;

a determining step, namely analyzing the information to be collected to determine the classification of the information to be collected;

a collection step of storing the information to be collected and the determined classification of the information to be collected,

the information to be collected comprises information related to website information and/or webpage content to be collected by the user.

According to another aspect of the present invention, in the determining step, at least one of text analysis, semantic analysis and word frequency statistical analysis is performed on the web page content information corresponding to the information to be collected based on a preset keyword library to determine whether there is at least one keyword which is included in the web page content information corresponding to the information to be collected and is included in the preset keyword library, and a classification of the information to be collected is determined according to a determination result, wherein the preset keyword library includes a plurality of keywords, and each keyword corresponds to one or more of the classifications; the webpage content information corresponding to the information to be collected comprises webpage content pointed by website information in the information to be collected, part or all of webpage content of a website corresponding to the website information in the information to be collected, and/or webpage content included in the information to be collected.

According to another aspect of the invention, when the judgment result is yes, the classification corresponding to the at least one keyword is determined as the classification of the information to be collected; or, determining the classification corresponding to one or more keywords which appear more frequently in the webpage content information corresponding to the information to be collected in the at least one keyword as the classification of the information to be collected.

According to another aspect of the present invention, in the determining step, when the determination result is negative, the classification of the information to be collected is determined as a preset default classification; or the classification designated by the user is determined as the classification of the information to be collected.

According to still another aspect of the present invention, the method further comprises a setting step of adding, deleting and modifying the keywords of the preset keyword library according to the instruction of the user.

According to another aspect of the present invention, in the determining step, at least one of text analysis, semantic analysis and word frequency statistical analysis is performed on the web content information corresponding to the information to be collected to obtain keywords which are used for embodying features of the web content information corresponding to the information to be collected and are included in the web content information, and the classification of the information to be collected is determined according to the at least one keyword, where the web content information corresponding to the information to be collected includes web content pointed by website information in the information to be collected, part or all of the web content of a website corresponding to the website information in the information to be collected, and/or the web content included in the information to be collected.

According to another aspect of the present invention, in the determining step, a category corresponding to the one or more keywords is determined as a category of the information to be collected.

According to another aspect of the present invention, in the determining step, when none of the classifications corresponding to the one or more keywords is a classification used by the user in a previous collection process, the classification of the information to be collected is determined as a preset default classification; or determining the classification designated by the user as the classification of the information to be collected.

According to still another aspect of the present invention, the classification corresponding to each keyword is set in advance, or determined by performing text analysis and/or semantic analysis.

According to yet another aspect of the invention, the web page content includes all or part of text, images and/or movies in the web page.

According to another aspect of the present invention, when the information to be collected is website information to be collected by the user, the obtaining step further includes analyzing, according to a machine learning algorithm, the webpage content information corresponding to the website information to be collected by the user.

According to yet another aspect of the invention, the machine learning algorithm is naive Bayes, a support vector machine, a latent Dirichlet allocation model, and/or a neural network.

According to another aspect of the invention, a network information collection system is also provided. The system comprises: the acquisition unit is used for acquiring information to be collected, which is collected by the user, according to the collection instruction of the user; the determining unit is used for analyzing the information to be collected so as to determine the classification of the information to be collected; and the collection unit is used for storing the information to be collected and the determined classification of the information to be collected, wherein the information to be collected comprises information related to website information and/or webpage content to be collected by the user.

According to another aspect of the present invention, the determining unit further performs at least one of text analysis, semantic analysis and word frequency statistical analysis on the web content information corresponding to the information to be collected based on a preset keyword library to determine whether there is at least one keyword which is included in the web content information corresponding to the information to be collected and is included in the preset keyword library, and determines the classification of the information to be collected according to a determination result, wherein the preset keyword library includes a plurality of keywords, and each keyword corresponds to one or more of the classifications; the webpage content information corresponding to the information to be collected comprises webpage content pointed by website information in the information to be collected, part or all of webpage content of a website corresponding to the website information in the information to be collected, and/or webpage content included in the information to be collected.

According to another aspect of the present invention, when the determination result is yes, the determining unit determines the category corresponding to the at least one keyword as the category of the information to be collected; or, determining the classification corresponding to one or more keywords which appear more frequently in the webpage content information corresponding to the information to be collected in the at least one keyword as the classification of the information to be collected.

According to one or more embodiments of the invention, after the information to be collected, which is to be collected by the user, is acquired according to the collection indication of the user, the information to be collected is analyzed, and the classification of the information to be collected can be determined according to the analysis result without manually participating in the classification process. Therefore, the information to be collected can be automatically stored in the collection positions of the corresponding categories, and the convenience of using network collection by the user is enhanced.

In other words, one or more embodiments of the invention solve the problems of complicated content classification, lack of organization and the like in the favorites when the user does not perform manual classification, so that the collection operation is more convenient and faster.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic diagram illustrating a network information collection system according to the present embodiment;

FIGS. 2A and 2B illustrate flowcharts of a network information collection method according to first and second embodiments of the present invention, respectively;

FIG. 3 shows a flow diagram of an example of network favorites according to the present invention;

FIG. 4 illustrates a flow diagram of yet another example of network collections according to the present invention;

FIG. 5 illustrates a flow chart of yet another example of network collections according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

It should be noted that, if not conflicting, the embodiments of the present invention and the features of the embodiments may be combined with each other within the scope of protection of the present invention. Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system as a set of computer-executable instructions and, although a logical order is illustrated in the flow charts, in some cases the steps illustrated or described may be performed in an order different from that shown here.

First embodiment

Fig. 1 shows a schematic configuration diagram of a network information collection system according to the present embodiment. As shown in fig. 1, a server 10 is network-connected to a plurality of clients 20.

It should be noted that fig. 1 shows only one server 10; however, the server 10 of the present invention may be multiple servers. For example, multiple computer devices in a cloud platform may jointly function as a server. The client 20 may be a computer, a Personal Digital Assistant (PDA), a tablet computer, a smart phone, or other various computing devices.

The connection between the server 10 and the client 20 may be a wired network or a wireless network. In the client 20, a browser or other network information processing software for accessing network information may be provided.

Fig. 2A illustrates steps of a network information collection method according to the present embodiment. The method according to the present embodiment is described below with reference to fig. 2A.

Step 210, acquiring information to be collected, which is to be collected by a user, according to a collection instruction of the user, wherein the information to be collected includes website information and/or webpage content related information, such as a webpage address, a webpage name, and all or part of webpage content, which are to be collected by the user;

step 220, analyzing the information to be collected to determine the classification of the information to be collected;

step 230, storing the information to be collected and the determined classification of the information to be collected.
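The three steps above can be sketched as a small pipeline. This is only an illustrative outline, not the patented implementation: the names `CollectionItem`, `FavoritesStore` and `collect` are hypothetical, and the classifier is passed in as a plain function so that any of the analysis methods described later could be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CollectionItem:
    """Information to be collected (the result of step 210). Hypothetical type."""
    url: str
    title: str
    content: str = ""


@dataclass
class FavoritesStore:
    """Hypothetical store mapping a classification name to the items saved under it."""
    by_category: Dict[str, List[CollectionItem]] = field(default_factory=dict)

    def save(self, item: CollectionItem, categories: List[str]) -> None:
        # An item may belong to several classifications at once.
        for category in categories:
            self.by_category.setdefault(category, []).append(item)


def collect(
    item: CollectionItem,
    classify: Callable[[CollectionItem], List[str]],
    store: FavoritesStore,
) -> List[str]:
    """Run step 220 (determine classification) and step 230 (store) for one item."""
    categories = classify(item)   # step 220
    store.save(item, categories)  # step 230
    return categories
```

The store could live on the client or on the server; the sketch does not distinguish between the two.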

In step 210, more specifically, when the user clicks on the browser of the client 20, the browser receives the user's collection indication. According to the collection indication, the browser can acquire the information that the user wants to collect.

In one example, the user clicks a customized collection button or a browser plug-in on the browser, and the browser obtains the website information of the currently accessed web page as the information to be collected according to this operation (which corresponds to the collection indication of the user). Of course, the information to be collected may be not only website information but also web page content that the user wants to collect, including pictures, text, and even video in the web page.

Preferably, the browser may also send the received collection instruction of the user to the server 10, and the server 10 obtains the information to be collected, which is to be collected by the user, according to the instruction, so as to store the information to be collected in the server 10 in step 230.

For example, the user clicks a custom button, browser plug-in, menu, etc. on the browser for local collection or network collection; the server 10 receives a message (corresponding to the user's collection indication) from the browser indicating that the click event has occurred, and determines the information to be collected by the user based on the message. For example, the server 10 takes the website address included in the message and, according to that address, all or part of the content of the corresponding web page as the information to be collected. More specifically, the server 10 may use a machine learning algorithm such as a naive Bayes classifier (NBC), a support vector machine (SVM), a latent Dirichlet allocation (LDA) model, or a neural network to analyze the downloaded web content pointed to by the website information in the information to be collected, part or all of the web content of the website corresponding to the website information in the information to be collected, and/or the web content included in the information to be collected.
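As an illustration of one of the algorithms named above, the following is a minimal multinomial naive Bayes text classifier with add-one smoothing. It is a sketch under stated assumptions: the text does not say how the algorithm is trained or applied, so the class name, the training interface and the simple English tokenizer are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace-delimited English text; Chinese text would need word segmentation.
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # category -> number of training documents
        self.word_counts = defaultdict(Counter)   # category -> word -> count
        self.vocabulary = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def predict(self, text):
        """Return the most probable category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # log P(category) + sum of log P(word | category)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category
```

In practice such a classifier would be trained on pages whose classification is already known, which the text does not describe.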

In addition, other software or modules in the client 20 may also be used to perform the operation of acquiring the information to be collected, which is to be collected by the user, according to the collection instruction of the user.

The process of analyzing the information to be collected to determine the classification of the information to be collected in step 220 is described in detail below.

Firstly, analyzing the webpage content information corresponding to the information to be collected based on a preset keyword library to judge whether at least one keyword which is contained in the webpage content information corresponding to the information to be collected and contained in the preset keyword library exists.

For example, the frequency of occurrence of each keyword in the keyword library in the web content information corresponding to the information to be collected is analyzed through the word frequency analysis, so that the above-mentioned determination operation can be completed, that is, it is determined whether at least one keyword which is included in the web content information corresponding to the information to be collected and is included in the preset keyword library exists.
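A minimal sketch of this word frequency check might look as follows. The keyword library contents and the function names are hypothetical, and the tokenizer assumes whitespace-delimited English text.

```python
import re
from collections import Counter

# Hypothetical preset keyword library: keyword -> one or more classifications.
KEYWORD_LIBRARY = {
    "engine": ["automobile"],
    "sedan": ["automobile"],
    "recipe": ["food"],
    "stock": ["finance", "news"],
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def match_keywords(text, library):
    """Count how often each library keyword occurs in the page text.

    Only words that are in the library are counted; an empty result means
    the determination result is negative.
    """
    hits = Counter()
    for token in tokenize(text):
        if token in library:
            hits[token] += 1
    return hits
```

For instance, `match_keywords("The sedan has a new engine. Engine power is up.", KEYWORD_LIBRARY)` counts "engine" twice and "sedan" once.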

For another example, keywords that can be used to reflect characteristics of the web content information corresponding to the information to be collected in the web content information corresponding to the information to be collected may also be analyzed through semantic analysis or text analysis, and then whether the keywords are in a preset keyword library is determined, thereby completing the above determination operation.

If the result of the determination is yes, that is, if it is determined that at least one keyword is included in the web content information corresponding to the information to be collected and included in the preset keyword library, the category corresponding to the keyword may be determined as the category of the information to be collected. It should be noted that the number of the keywords may be one or more, one keyword may correspond to multiple categories, and multiple keywords may also correspond to the same category, so that the categories of the information to be collected may be multiple.

Preferably, when it is determined that multiple keywords are included both in the web content information corresponding to the information to be collected and in the preset keyword library, the category corresponding to the one or more keywords that appear more frequently in that web content information may be determined as the category of the information to be collected, which further improves the classification accuracy.
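This preference for the more frequent keywords can be sketched as below. The function name, the `top_n` parameter and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
def categories_of_top_keywords(hits, library, top_n=1):
    """Return the classifications of the top_n most frequent matched keywords.

    hits:    dict mapping matched keyword -> frequency in the page text
    library: dict mapping keyword -> list of classifications
    """
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    categories = []
    for keyword, _count in ranked[:top_n]:
        for category in library[keyword]:
            if category not in categories:
                categories.append(category)
    return categories
```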

In addition, during analysis, the three analysis modes of text analysis, semantic analysis and word frequency statistical analysis can be combined for use, so as to analyze and obtain the keywords which can most embody the characteristics of the webpage content information corresponding to the information to be collected.

In addition, among the classifications corresponding to the matched keywords, only the one or more classifications to which the most keywords correspond may be used as the classification of the information to be collected, which improves the classification accuracy.
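One way to read this is as a vote: each matched keyword votes for its classifications, and only the classifications with the most votes are kept. The sketch below follows that reading; the function name and the handling of ties (all tied classifications are returned) are assumptions.

```python
from collections import Counter


def categories_by_keyword_vote(matched_keywords, library):
    """Keep only the classifications backed by the largest number of matched keywords."""
    votes = Counter()
    for keyword in matched_keywords:
        for category in library.get(keyword, []):
            votes[category] += 1
    if not votes:
        return []
    best = max(votes.values())
    return sorted(category for category, n in votes.items() if n == best)
```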

Determining the classification of the information to be collected as a preset default classification when the judgment result is negative, namely, the judgment result is that no keyword which is contained in the webpage content information corresponding to the information to be collected and is contained in the preset keyword library exists; or determining the classification designated by the user as the classification of the information to be collected.
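The fallback for a negative determination result can be sketched as follows. The text offers the default classification and the user-designated classification as alternatives; giving the user's choice priority over the default when both are available is an assumption of this sketch, as are the names used.

```python
DEFAULT_CATEGORY = "uncategorized"  # hypothetical name for the preset default classification


def resolve_categories(matched_categories, user_choice=None, default=DEFAULT_CATEGORY):
    """Return the final classifications for an item.

    matched_categories: classifications found through keyword matching (may be empty)
    user_choice:        classification designated by the user, if any
    """
    if matched_categories:
        return list(matched_categories)
    if user_choice:
        return [user_choice]
    return [default]
```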

The process of step 220 may be performed by client 20, a browser or other software on client 20, or server 10.

Then, step 230 is entered, and the information to be collected and the determined classification of the information to be collected are stored.

It should be noted that the information to be collected and the determined classification of the information to be collected may be stored in the client 20, or may be stored in the server 10.

For example, when the user desires to locally store the website information as the collection information so as to conveniently access the web page by directly clicking a bookmark on the browser next time, the browser installed on the client 20 may store the website information and the name of the web page and the category determined according to this step as a bookmark of one browser. In this way, when the user wants to access the bookmark next time, the bookmark can be conveniently searched by category, so that the manual classification of the user is not needed, and the user friendliness is improved.

Similarly, when a user desires to store collection information on a network, the collection information and its categories may be stored by the server 10. In this way, the user can conveniently access the collected contents by category through the network.

In the foregoing embodiments, the classification corresponding to each keyword may be set in advance, or the classification corresponding to each keyword may be determined by performing text analysis and/or semantic analysis.

In addition, the preset keyword library can be manually configured in advance by research personnel or determined by a program, and a setting interface can also be provided for a user, so that the addition, deletion and modification operations can be carried out on the keywords of the preset keyword library according to the indication of the user.
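The add, delete and modify operations on the keyword library can be sketched with a small class. The class name and method names are hypothetical; the text does not specify how the library is stored or what the setting interface looks like.

```python
class KeywordLibrary:
    """Hypothetical editable keyword library: keyword -> list of classifications."""

    def __init__(self):
        self._entries = {}

    def add(self, keyword, categories):
        self._entries[keyword] = list(categories)

    def delete(self, keyword):
        # Deleting a keyword that is not present is treated as a no-op.
        self._entries.pop(keyword, None)

    def modify(self, keyword, categories):
        if keyword not in self._entries:
            raise KeyError(keyword)
        self._entries[keyword] = list(categories)

    def categories_for(self, keyword):
        return list(self._entries.get(keyword, []))
```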

In this embodiment, the web page content is analyzed only for the keywords in the preset keyword library; word frequency analysis does not need to be performed on every word of the web page content, so the complexity of the analysis processing is reduced.

Second embodiment

Fig. 2B illustrates a flowchart of a network information collection method according to the present embodiment. In fig. 2B, the same or similar steps as in fig. 2A are denoted by the same reference numerals.

Step 210 and step 230 of the present embodiment are substantially the same as those of the first embodiment and are therefore not described in detail. In this embodiment, step 221 replaces step 220, so that the web page information to be collected can be classified automatically without presetting a keyword library.

More specifically, in step 221 of this embodiment, text analysis, semantic analysis, word frequency analysis, and/or the like are performed on the web content information corresponding to the information to be collected to obtain keywords which are used for embodying features of the web content information corresponding to the information to be collected and are included in the web content information corresponding to the information to be collected, and then the classification of the information to be collected is determined according to the keywords.

Compared with the first embodiment, this embodiment does not necessarily need a preset keyword library; instead, text analysis, semantic analysis and/or word frequency statistical analysis is performed directly on the web page content to determine one or more keywords that best reflect the characteristics of the web page content. For example, the word with the highest frequency of occurrence is taken as a keyword. Words with specific meanings may also be taken as keywords: for example, if automobile brand names such as Volkswagen and Audi appear many times, semantic analysis can identify them as keywords, indicating that the website belongs to the automobile category.
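A simple word frequency version of this keyword extraction can be sketched as follows. It only covers the frequency part; the semantic analysis mentioned above is not modeled. The stop word list, the `top_n` parameter and the keyword-to-classification mapping are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

# Hypothetical mapping from extracted keywords to classifications.
KEYWORD_TO_CATEGORY = {"audi": "automobile", "volkswagen": "automobile"}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-words of the page text."""
    tokens = [
        token
        for token in re.findall(r"[a-z]+", text.lower())
        if token not in STOP_WORDS and len(token) > 2
    ]
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _count in ranked[:top_n]]


def categories_for(keywords, mapping):
    """Look up the classifications of the extracted keywords."""
    return sorted({mapping[keyword] for keyword in keywords if keyword in mapping})
```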

Then, the classification corresponding to the analyzed keywords can be determined as the classification of the information to be collected.

Further, it can also be determined whether the classifications corresponding to the keywords are classifications that the user has used in a previous collection process. If none of them is, the classification of the information to be collected is determined as a preset default classification, or the classification designated by the user is determined as the classification of the information to be collected.

In this embodiment, the automatic classification of the information to be collected can be realized without presetting a keyword library.
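The check against previously used classifications can be sketched as below. The function name and parameters are hypothetical, and preferring a user-designated classification over the default when both are available is an assumption.

```python
def filter_by_history(candidate_categories, used_categories,
                      user_choice=None, default="uncategorized"):
    """Keep only classifications the user has used before; otherwise fall back.

    candidate_categories: classifications derived from the extracted keywords
    used_categories:      classifications used in the user's previous collections
    """
    known = [c for c in candidate_categories if c in used_categories]
    if known:
        return known
    return [user_choice] if user_choice else [default]
```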

Example one

FIG. 3 shows a flow diagram of an example of network favorites according to the present invention.

After the user collects the favorite websites or links, the system automatically extracts keywords, classifies the contents to be collected and adds the contents to be collected into the favorites of the corresponding category.

The following describes in detail the steps by which this example automatically extracts keywords during network collection in order to classify content:

step 310, a user logs in a network favorite;

step 320, the user directly adds the website or link to be collected in the system, and clicks the collection; or in the process of browsing the webpage, selecting a mode of needing to collect through a browser plug-in customized by the system, and automatically collecting the current page link or the picture and content (corresponding to the information to be collected) selected by the user for the user by the system;

step 330, the system analyzes, compares and automatically extracts the content of the corresponding website or link according to the machine learning algorithms of naive Bayes, support vector machines, LDA, neural networks and the like, and extracts the keywords of the content description through the algorithms of word segmentation, word frequency statistics and the like;

step 340, the system performs text and semantic analysis according to the extracted keywords to obtain the classification of the keywords;

step 350, according to the classification result, the system adds the link or collected content to the favorites of the corresponding category.

Example two

FIG. 4 illustrates a flow chart of yet another example of network collections according to the present invention.

In this example, when the user collects the website, link or page, blog, microblog content, the system automatically classifies it into the corresponding favorite.

The steps of the present example are described in detail below.

step 410, the user clicks a website or a link to be collected;

step 420, the system crawler program automatically captures the content of the corresponding website or link;

step 430, the system automatically performs text analysis, semantic analysis, word frequency statistics and other work on the captured content;

step 440, the system extracts one or more keywords from the captured content according to a predefined keyword library, so as to realize automatic keyword extraction of the system;

step 450, the system automatically classifies the corresponding content according to the category to which the keyword belongs;

step 460, the system adds the corresponding web address or link to the favorites of the corresponding category.
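Steps 420 to 450 can be sketched as follows, with the crawler replaced by an HTML string that is assumed to have been fetched already. The sketch uses Python's standard `html.parser` module to extract the visible text; the class and function names and the keyword library contents are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_text(html):
    """Return the visible text of an HTML page with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def classify_page(html, library, default="uncategorized"):
    """Classify captured page content against a predefined keyword library."""
    words = Counter(re.findall(r"[a-z]+", page_text(html).lower()))
    categories = sorted(
        {cat for keyword, cats in library.items() if words[keyword] for cat in cats}
    )
    return categories or [default]
```

Step 460 would then add the address or link to the favorites under each returned classification.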

Example three

In this example, what the user wants to collect is the content of a page, a blog, or a microblog. The specific steps of this example are as follows:

step 510, the user clicks a page, blog or microblog to be collected;

step 520, the system automatically performs text analysis, semantic analysis, word frequency statistics and other work on the corresponding content;

step 530, which is substantially the same as step 440 above and is not described again;

step 540, which is substantially the same as step 450 and is not described again;

step 550, the system adds the corresponding page, blog, microblog to the favorites of the corresponding category.

Therefore, the website, link, page, blog or microblog content that the user wants to collect is automatically classified by the system into the favorites of the category to which the content belongs, and the user can conveniently access the corresponding website, link, page, blog or microblog content by category.

The system is intelligent: each collection action by the user reflects the user's collection habits and interests, and the website automatically classifies the user's collected content. The collection modes are varied: content can be collected through a website, through applications on various operating platforms (such as Android, iOS and Windows Phone), or through a customized browser plug-in.

The above embodiment has been described by taking a browser as an example of the network information processing software, and it should be noted that, alternatively, other network information processing software built in or installed on the client may be used.

In general, as described in the above embodiments, the client and the server are generally two different devices connected to the network, but as a specific example, when both the web server and the browser are installed in the same computer, the client and the server may be the same device.

Those skilled in the art will appreciate that the modules or steps of the invention described above can be implemented on general-purpose computing devices, either centralized on a single computing device or distributed across a network of computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they are stored in a memory device and executed by the computing device; alternatively, they can each be fabricated as a separate integrated circuit module, or several of them can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A network information collection method is characterized by comprising the following steps:

wherein,

the information to be collected comprises website information to be collected by the user and/or information related to webpage content;

in the step of determining,

performing at least one of text analysis and semantic analysis on the webpage content information corresponding to the information to be collected to acquire keywords which are used for embodying the characteristics of the webpage content information corresponding to the information to be collected and are contained in the webpage content information, determining the classification of the information to be collected according to at least one keyword, wherein,

the webpage content information corresponding to the information to be collected comprises all webpage contents of websites corresponding to the website information in the information to be collected;

in the determining step, when all the classifications corresponding to one or more keywords are not the classifications used by the user in the previous collection process, the classification specified by the user is determined as the classification of the information to be collected.

2. The method of claim 1, wherein, in the determining step,

performing at least one of text analysis and semantic analysis on the web content information corresponding to the information to be collected based on a preset keyword library to judge whether at least one keyword which is contained in the web content information corresponding to the information to be collected and contained in the preset keyword library exists or not, and determining the classification of the information to be collected according to the judgment result, wherein,

the preset keyword library comprises a plurality of keywords, and each keyword corresponds to one or more classifications;

the webpage content information corresponding to the information to be collected comprises webpage content pointed by website information in the information to be collected, all webpage content of a website corresponding to the website information in the information to be collected, and/or webpage content included in the information to be collected.

3. The method according to claim 2, wherein in the determining step, when the judgment result is yes,

determining the classification corresponding to the at least one keyword as the classification of the information to be collected; or,

determining, as the classification of the information to be collected, the classification corresponding to the one or more keywords, among the at least one keyword, that appear more frequently in the webpage content information corresponding to the information to be collected.

4. The method according to claim 2, wherein in the determining step, when the judgment result is no,

determining the classification of the information to be collected as a preset default classification; or

determining the classification designated by the user as the classification of the information to be collected.

5. The method of any of claims 2 to 4, further comprising:

a setting step of adding, deleting, and modifying keywords in the preset keyword library according to an instruction of the user.

6. The method of claim 1, wherein, in the determining step,

determining the classification corresponding to the one or more keywords as the classification of the information to be collected.

7. The method according to claim 3 or 6,

the classification corresponding to each keyword is preset, or determined by performing text analysis and/or semantic analysis.

8. The method according to any one of claims 1 to 4 or 6,

the web page content includes all or part of text, images and/or video in the web page.

9. The method according to any one of claims 2 to 4, wherein, when the information to be collected is website information to be collected by the user, the obtaining step further comprises,

analyzing, according to a machine learning algorithm, the webpage content information corresponding to the website information to be collected by the user.

10. The method of claim 9, wherein the machine learning algorithm is naive Bayes, a support vector machine, a latent Dirichlet allocation model, and/or a neural network.

11. A network information collection system, comprising:

the acquisition unit is used for acquiring information to be collected, which is collected by the user, according to the collection instruction of the user;

the determining unit is used for analyzing the information to be collected so as to determine the classification of the information to be collected;

a collection unit for storing the information to be collected and the determined classification of the information to be collected,

wherein,

the determining unit further performs at least one of text analysis and semantic analysis on the web content information corresponding to the information to be collected based on a preset keyword library to determine whether at least one keyword included in the web content information corresponding to the information to be collected and included in the preset keyword library exists, and determines the classification of the information to be collected according to the determination result,

the determining unit determines the classification specified by the user as the classification of the information to be collected when all the classifications corresponding to the one or more keywords are not the classifications used by the user in the previous collection process.

12. The system according to claim 11, wherein the determination unit, when the determination result is yes,