Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art described above and to provide a context-based method for extracting internet audiovisual material. The method can extract audiovisual material from websites across the internet without creating an extraction template for each specific website, ensures the accuracy of the captured audiovisual material information, and has a broad range of application.
To achieve this object, the context-based method for extracting internet audiovisual material of the present invention is constituted as follows:
The principal feature of the method is that it comprises the following steps:
(1) load a predefined audiovisual material metadata library;
(2) load the seed addresses of the websites from which audiovisual material is to be extracted;
(3) download the web page content of those websites;
(4) determine whether the downloaded web page is the playback page of an audiovisual material; if so, proceed to step (5), otherwise proceed to step (6);
(5) look up the preceding context of this audiovisual material and generate the audiovisual material list;
(6) quantify the downloaded web page content against the loaded audiovisual material metadata library, treat it as the preceding context of an audiovisual material, and store it in the preceding-context set.
Preferably, the audiovisual material metadata comprises the director, leading actors, cast, release time, update time and synopsis of the audiovisual material.
Preferably, loading the seed addresses of the websites from which audiovisual material is to be extracted is specifically:
loading, from an XML file or a database, the seed addresses of the websites whose audiovisual material is to be captured.
Preferably, downloading the web page content of the websites from which audiovisual material is to be extracted is specifically:
using an HTTP client or a crawler to download the content of the specified web pages of the specified target website from the server to local storage.
Preferably, looking up the preceding context of this audiovisual material and generating the audiovisual material list comprises the following steps:
(51) identify the playback type corresponding to this audiovisual material;
(52) look up the preceding context of this audiovisual material in the preceding-context set;
(53) merge the metadata of the preceding context with the content data of the downloaded web page to generate a complete record of this audiovisual material.
More preferably, identifying the playback type corresponding to this audiovisual material is specifically:
identifying the playback type corresponding to this audiovisual material and using the corresponding player to verify that the audiovisual material can be played.
Preferably, quantifying the downloaded web page content against the loaded audiovisual material metadata library, treating it as the preceding context of an audiovisual material and storing it in the preceding-context set comprises the following steps:
(61) determine whether this web page is the details page of an audiovisual material; if so, proceed to step (62), otherwise proceed to step (3);
(62) quantify this web page according to the rules defined in the audiovisual material metadata library and determine whether it is the preceding context of an audiovisual material; if so, proceed to step (63), otherwise proceed to step (64);
(63) treat this web page as the preceding context of an audiovisual material and store it in the preceding-context set, then proceed to step (64);
(64) determine whether this web page is the last web page of the website; if so, finish and exit, otherwise proceed to step (65);
(65) analyze the hyperlinks of this web page and add them to the queue of web pages to be downloaded, then proceed to step (3).
The context-based method for extracting internet audiovisual material of this invention has the following beneficial effects:
(1) The feature quantification of audiovisual material information avoids unnecessary interference during capture, which ensures that the captured audiovisual material is accurate.
(2) Because audiovisual material metadata is invariant, when a website's layout or content is updated, an incremental capture using this method is sufficient to obtain the website's most recently updated audiovisual material information.
(3) Verification against player rules ensures that the captured audiovisual material can actually be played.
(4) With only a small amount of configuration that is not specific to any website, the method identifies audiovisual material on the internet through the relationships between web pages and obtains the basic information and playback address of the audiovisual material. It can extract audiovisual material from websites across the internet without creating an extraction template for each specific website, and therefore has a wide range of application.
Embodiment
In order to describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.
Existing methods for capturing internet audiovisual material all configure templates for the page layout and content of each website in order to identify audiovisual material.
The present invention instead starts from the audiovisual material itself and abstracts its metadata. For example, an audiovisual material generally has a release time or update time, a director and performers. The invention configures templates for these metadata, identifies them in the main content display area of a web page, and thereby forms the preceding-context information record of the audiovisual material.
According to the invention, the audiovisual material metadata template needs to be configured only once (or a small number of times). This avoids both the configuration of a large number of per-website templates required by the prior art and the later maintenance needed after a website's layout is updated, because the basic metadata of an existing audiovisual material does not change; for example, the director and performers of the film "decisive battle greatly" never change.
An audiovisual material on the internet has a details page, which gathers most of the metadata of that material; these data can form part of the audiovisual material information. The details page contains a link to the playback page. The information on the playback page combined with the information on the details page forms the context of an audiovisual material, and from this context the system generates an audiovisual material record.
Implementation flow:
1. The system starts, loads the metadata categories and definitions in the predefined audiovisual material metadata library, and loads the web page player recognition features;
2. Load the configured seed addresses of the web crawler; the expected audiovisual material information may exist at these addresses;
3. Using the network download logic defined by the crawler, download the content of the web pages in the queue to be crawled;
4. Analyze the web page content:
First, the player identification module identifies whether the page is the playback page of an audiovisual material;
The audiovisual material metadata collection module identifies whether the web page is the details page of an audiovisual material;
The URL analysis module collects the hyperlinks of the page. A hyperlink may lead to the preceding or following context of an audiovisual material, or to the preceding context of a new audiovisual material. These hyperlinks are added to the crawler's queue to be crawled so that the next page can be captured, and in this way the whole website is traversed;
5. If the current page is the playback page of an audiovisual material, look up the preceding context of this page in the preceding-context set, merge the metadata of the preceding context with the metadata of this page, and generate a complete record of the audiovisual material;
6. If the current page is not the playback page of an audiovisual material, quantify the page according to the rules of the metadata definitions to determine whether it is the preceding context of an audiovisual material; if the quantified result satisfies the preceding-context rules of an audiovisual material, store the current page in the preceding-context set;
7. If the system needs to capture further pages, jump to step 3;
8. Once the system has analyzed all pending pages, the audiovisual material extraction is complete.
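The implementation flow above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function name `crawl`, the injected callables (`fetch`, `is_play_page`, `quantify`, `extract_links`), the `max_pages` limit and the choice to find a playback page's preceding context through the page that linked to it are all assumptions made for illustration.

```python
from collections import deque


def crawl(seeds, fetch, is_play_page, quantify, extract_links, max_pages=100):
    """Breadth-first traversal following steps 2 to 8 of the flow (sketch)."""
    queue = deque((url, None) for url in seeds)  # (url, referring page) pairs
    seen = set(seeds)
    contexts = {}  # preceding-context set: page URL -> metadata dict
    records = []   # completed audiovisual material records
    fetched = 0
    while queue and fetched < max_pages:
        url, referrer = queue.popleft()
        html = fetch(url)                                     # step 3
        fetched += 1
        if is_play_page(html):                                # steps 4 and 5
            # Assumption: the linking page holds the preceding context.
            context = contexts.get(referrer, {})
            records.append({**context, "play_url": url})
        else:                                                 # step 6
            metadata = quantify(html)
            if metadata:
                contexts[url] = metadata
        for link in extract_links(url, html):                 # URL analysis
            if link not in seen:
                seen.add(link)
                queue.append((link, url))
    return records                                            # step 8
```

Passing the page analysis steps in as callables keeps the traversal independent of how playback pages and details pages are actually recognized.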
To make the object, technical solution and advantages of the present invention clearer, specific embodiments of the invention are further described below with reference to the accompanying drawing.
Fig. 1 shows a context-based method for extracting internet audiovisual material provided by an embodiment of the present invention, comprising:
Step (1): load the audiovisual material feature library, that is, the metadata library.
Specifically, audiovisual material is accompanied by a director, leading actors, cast, release time, update time, synopsis and so on; different combinations of audiovisual material metadata can be configured for different types of audiovisual material.
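As an illustration of configuring different metadata combinations for different material types, the metadata library could be represented as a simple mapping. The type names (`film`, `series`), the field identifiers and the helper `fields_for` are hypothetical and are not defined by the invention.

```python
# Hypothetical metadata library: material type -> expected metadata fields.
METADATA_LIBRARY = {
    "film": ["director", "starring", "cast", "release_time", "synopsis"],
    "series": ["director", "starring", "cast", "update_time", "synopsis"],
}


def fields_for(material_type, library=METADATA_LIBRARY):
    """Return the metadata combination configured for a material type."""
    # Unknown types yield an empty combination rather than an error.
    return list(library.get(material_type, []))
```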
Step (2): load the seed addresses of the websites from which audiovisual material is to be extracted.
Specifically, the websites whose audiovisual material is to be captured can be loaded from an XML file or a database.
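Loading seed addresses from XML might look like the following sketch, which uses the Python standard library. The document layout (a root element containing `<seed>` elements) and the function name `load_seeds` are assumptions; the invention does not prescribe a schema.

```python
import xml.etree.ElementTree as ET


def load_seeds(xml_text):
    """Return the seed addresses listed in <seed> elements (assumed schema)."""
    root = ET.fromstring(xml_text)
    seeds = []
    for element in root.iter("seed"):
        # Skip empty elements and strip surrounding whitespace.
        if element.text and element.text.strip():
            seeds.append(element.text.strip())
    return seeds
```

For a file on disk, the text could be read first and passed to `load_seeds`, or `ET.parse` could be used in place of `ET.fromstring`.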
Step (3): download the web page content.
Specifically, an HTTP client or a crawler is used to download the specified web pages of the target website from the server to local storage.
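A minimal download step using the HTTP client in the Python standard library is sketched below. The function name `download_page`, the default timeout and the fallback to UTF-8 when the server declares no character set are assumptions; a production crawler would also need error handling, politeness delays and similar safeguards.

```python
from urllib.request import urlopen


def download_page(url, timeout=10.0):
    """Download one page and return its content as text (sketch)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the declared charset if any; otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```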
Step (4): analyze the web page content and determine whether the page is the playback page of an audiovisual material.
Specifically, the player identification module first identifies whether the page is the playback page of an audiovisual material and which kind of player it uses, such as a Flash player; the audiovisual material metadata collection module identifies whether the web page is the details page of an audiovisual material and sorts out the content required by the audiovisual material metadata definitions; the URL analysis module collects the links of the page so that the next page can be captured.
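The player identification and URL analysis described in this step can be sketched with the standard-library HTML parser. The recognition features used here (an `<embed>` or `<object>` element with the Flash MIME type, or a `<video>` element) and the names `PageAnalyzer` and `analyze` are assumptions chosen for illustration; the actual recognition features are whatever the system loads at start-up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageAnalyzer(HTMLParser):
    """Collects hyperlinks and the first recognized player type of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.player = None  # stays None if no player is recognized
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self.player = self.player or "html5"
        elif tag in ("embed", "object"):
            if (attributes.get("type") or "") == "application/x-shockwave-flash":
                self.player = self.player or "flash"
        elif tag == "a" and attributes.get("href"):
            # Resolve relative links against the page address.
            self.links.append(urljoin(self.base_url, attributes["href"]))


def analyze(base_url, html):
    """Return (player type or None, list of absolute hyperlinks)."""
    analyzer = PageAnalyzer(base_url)
    analyzer.feed(html)
    return analyzer.player, analyzer.links
```

A page for which `analyze` returns a player type would be treated as a playback page; the returned links would be added to the queue to be crawled.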
Step (5): look up the preceding context of the audiovisual material and generate the audiovisual material list.
Specifically, based on the audiovisual material information obtained in step (4), if the current page is the playback page of an audiovisual material, the preceding context of this page is looked up in the preceding-context set, the metadata of the preceding context is merged with the metadata of this page, and a complete record of the audiovisual material is generated.
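The merge in this step could be sketched as follows. The precedence rule (metadata from the preceding context is kept, and the playback page only contributes fields the context lacks) is an assumption, as are the function name `build_record` and the field names in the usage example.

```python
def build_record(context, play_page):
    """Merge preceding-context metadata with playback-page metadata."""
    record = dict(context)  # copy so the stored context is not modified
    for key, value in play_page.items():
        if value not in (None, ""):
            # Assumption: values already present in the context win.
            record.setdefault(key, value)
    return record
```

For example, merging a context `{"title": "Example Film", "director": "A. Director"}` with a playback page `{"title": "Example Film - watch online", "play_url": "http://example.test/p1"}` keeps the context's title and adds the playback address.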
Step (6): using the audiovisual material feature library as the criterion, quantify the web page content and treat it as the preceding context of an audiovisual material.
Specifically, based on the audiovisual material information obtained in step (4), the page is quantified to determine whether it is the preceding context of an audiovisual material; if the quantified result satisfies the preceding-context rules of an audiovisual material, the current page is stored in the preceding-context set. This set may be a hash table or a data table in a database.
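One possible reading of the quantification in this step is to count how many configured metadata fields can be found in the main content text of the page and to accept the page when the count reaches a threshold. The labels in `FIELD_LABELS`, the `label: value` text pattern, the threshold of two fields and the function names are all assumptions made for this sketch; the preceding-context set is modeled as a Python dict, which is a hash table.

```python
import re

# Hypothetical rules: metadata field -> label as it appears on a page.
FIELD_LABELS = {
    "director": "Director",
    "starring": "Starring",
    "release_time": "Release date",
}


def quantify(text, labels=FIELD_LABELS):
    """Return the metadata fields found in the page text as a dict."""
    found = {}
    for field, label in labels.items():
        # Match "Label: value" up to the end of the line.
        match = re.search(re.escape(label) + r"\s*:\s*(.+)", text)
        if match:
            found[field] = match.group(1).strip()
    return found


def store_if_context(url, text, contexts, threshold=2):
    """Store the page in the preceding-context set if enough fields match."""
    found = quantify(text)
    if len(found) >= threshold:
        contexts[url] = found
        return True
    return False
```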
The context-based method for extracting internet audiovisual material of this invention has the following beneficial effects:
(1) The feature quantification of audiovisual material information avoids unnecessary interference during capture, which ensures that the captured audiovisual material is accurate.
(2) Because audiovisual material metadata is invariant, when a website's layout or content is updated, an incremental capture using this method is sufficient to obtain the website's most recently updated audiovisual material information.
(3) Verification against player rules ensures that the captured audiovisual material can actually be played.
(4) With only a small amount of configuration that is not specific to any website, the method identifies audiovisual material on the internet through the relationships between web pages and obtains the basic information and playback address of the audiovisual material. It can extract audiovisual material from websites across the internet without creating an extraction template for each specific website, and therefore has a wide range of application.
In this specification, the invention has been described with reference to specific embodiments. It is nevertheless evident that various modifications and changes can be made without departing from the spirit and scope of the invention. The specification and drawings are therefore to be regarded as illustrative rather than restrictive.