
CN103440342B - Information-pushing method based on type of webpage and device - Google Patents

Information-pushing method based on type of webpage and device

Info

Publication number
CN103440342B
Authority
CN
China
Prior art keywords
type
page
word
description
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310410102.6A
Other languages
Chinese (zh)
Other versions
CN103440342A (en
Inventor
梁捷
李建兴
李建设
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201310410102.6A
Publication of CN103440342A
Application granted
Publication of CN103440342B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the field of mobile communication technology and discloses an information-pushing method and device based on webpage type. The method includes: using the co-occurrence relation of historical page description words obtained in advance to derive, for each historical page description word, a type weight for each page type; building a word classification attribute library with the type weights as word attributes; querying the word classification attribute library with the current page description words obtained in real time to get the type weight of each current page description word for each page type; summing, for each page type, the type weights of the current page description words and taking the page type with the largest sum as the type of the webpage currently being browsed; and pushing network information into the webpage the user is currently browsing based on that type. The invention determines webpage type simply, effectively, and accurately, which greatly improves the precision of information pushing based on webpage type.

Description

Information-pushing method based on type of webpage and device
Technical field
The invention relates to the field of mobile communication technology, and in particular to an information-pushing method and device based on webpage type.
Background art
Data clustering is a focus of current Internet applications. After decades of development, the number of network users and the scale of the Internet have grown explosively, and the small amount of useful information is often drowned out by massive Internet data; users can hardly obtain key information effectively just by actively browsing webpages themselves. As a result, the Internet has begun to shift from simply displaying information passively to actively pushing it. To push information quickly and accurately, all Internet information must first be screened, and data clustering is one information classification approach for establishing associations between pieces of Internet information.
Because pushed information is usually not something the user asked for, it easily annoys users, so push accuracy is particularly important. Pushed information generally includes search results, news, lifestyle and entertainment information, and advertisements. Accurate delivery of pushed information is receiving growing attention, and pushing related information based on the type of the webpage the user is currently browsing is one way to achieve it. Content-based advertisement targeting, for example, adds an advertisement to the page returned by the browser, with the advertisement category matching the webpage type as closely as possible. With data clustering, network push can select from more closely associated information, but because the webpage the user is currently browsing must be classified online in real time, the classification algorithm faces strict performance requirements.
Webpage classification currently tends to use machine learning algorithms such as naive Bayes, KNN (k-nearest neighbor), support vector machines (SVM), and artificial neural networks (ANN). These algorithms are all based on the vector space model of documents: a model is trained on a large number of documents with labeled categories, and the trained model predicts the category of a new webpage.
The main problems with these prior-art machine learning algorithms are:
(1) They need a large number of samples with labeled categories, which is a heavy workload, and classifier quality depends strongly on the quality of the labeled samples. Labeled webpages are usually obtained by manual labeling, which gives high-quality samples but consumes a lot of labor. Other approaches crawl pages from Internet directory sites or through targeted search-engine queries; labeling can then be automated, but sample quality is low, noise is high, and the categories do not necessarily match one's own needs. In other words, obtaining labeled webpages is inefficient and inaccurate.
(2) Some algorithms (such as ANN and SVM) are themselves complex and expensive to run, so they are suitable only for offline processing and cannot be used for online real-time processing with high performance requirements; that is, their real-time performance is poor.
When information is pushed on the basis of webpages obtained in this way, pushing is inefficient and real-time performance is poor.
Summary of the invention
In view of the defects of the prior art, the technical problem to be solved by the invention is how to push information accurately, in real time, and efficiently.
To solve this problem, one aspect of the invention provides an information-pushing method based on webpage type, the method including the steps of:
Using the co-occurrence relation of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types, where the co-occurrence relation represents how words occur together;
Building a word classification attribute library with the type weights as word attributes;
Querying the word classification attribute library with the current page description words obtained in real time to obtain the type weight of each current page description word for each page type;
Calculating, for each page type, the sum of the type weights of the current page description words, and setting the page type with the largest sum as the type of the webpage currently being browsed;
Pushing network information into the webpage the user is currently browsing based on the type of that webpage.
Preferably, the step of using the co-occurrence relation of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types includes:
Building a term network from the co-occurrence relation of the historical page description words;
Obtaining, according to the co-occurrence relation, the association strength between the historical page description words in the term network;
Traversing the term network to obtain the distance between the historical page description words;
Obtaining the type weight of each historical page description word for the different page types according to the initial weight assigned in advance to each set classification core word, the distance, the association strength, and a preset decay intensity.
Preferably, the step of obtaining, according to the co-occurrence relation, the association strength between the historical page description words in the term network includes:
Obtaining, according to the co-occurrence relation, the number of times the historical page description words occur together;
Obtaining the association strength between the historical page description words with the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word i and word j, and $\max(C)$ is the maximum co-occurrence count between any two words.
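The association strength formula can be illustrated with a minimal Python sketch. The function name `association_strengths`, the per-page counting rule (each unordered word pair counted once per page), and the sample pages are assumptions made for illustration; the source only defines $S_{ij} = C_{ij} / \max(C)$.

```python
from collections import Counter
from itertools import combinations

def association_strengths(pages):
    """Return {(word_i, word_j): S_ij} with S_ij = C_ij / Max(C).

    `pages` is a list of word lists, one per historical page (hypothetical input format).
    """
    cooccur = Counter()
    for words in pages:
        # Assumption: each unordered pair is counted once per page.
        for a, b in combinations(sorted(set(words)), 2):
            cooccur[(a, b)] += 1
    if not cooccur:
        return {}
    max_c = max(cooccur.values())  # Max(C): largest co-occurrence count
    return {pair: count / max_c for pair, count in cooccur.items()}

pages = [
    ["novel", "romance", "reading"],
    ["reading", "novel", "chapter", "romance"],
    ["reading", "magazine"],
]
strengths = association_strengths(pages)
print(strengths[("novel", "reading")])     # co-occur on two pages
print(strengths[("magazine", "reading")])  # co-occur on one page
```

Dividing by the largest count keeps every strength in the range (0, 1], so the most frequently co-occurring pair gets strength 1.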
Preferably, the step of obtaining the type weight of each historical page description word for the different page types according to the initial weight assigned to each classification core word, the distance, the association strength, and the preset decay intensity includes:
Obtaining the type weight of each historical page description word for the different page types with the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of term node j; i and j are two associated term nodes in the term network; $S_{ij}$ is the association strength between node i and node j; $w_i$ is the type weight of node i; $\alpha$ is the preset decay intensity; and $d_i$ is the distance between node i and the classification core word. In the first calculation, $w_i$ is the initial weight assigned to the classification core word.
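As a worked illustration of the type weight formula, read here as $w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i+1)}$, take the following assumed values, none of which come from the source: a classification core word i with initial weight $w_i = 1$ and distance $d_i = 0$, an edge strength $S_{ij} = 0.5$, a second edge $S_{jk} = 1$, and decay intensity $\alpha = 2$.

```latex
\begin{aligned}
w_j &= w_i \cdot S_{ij} \cdot \alpha^{-(d_i+1)} = 1 \cdot 0.5 \cdot 2^{-(0+1)} = 0.25 \\
w_k &= w_j \cdot S_{jk} \cdot \alpha^{-(d_j+1)} = 0.25 \cdot 1 \cdot 2^{-(1+1)} = 0.0625
\end{aligned}
```

Under this reading the weight shrinks faster the farther a word is from the classification core word, which presumably requires $\alpha > 1$.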
Preferably, the step of pushing network information into the webpage the user is currently browsing based on the type of that webpage includes:
Querying a preset network information database for network information whose type is the same as or similar to the type of the webpage currently being browsed;
Pushing the obtained network information to the webpage currently being browsed.
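The query step above might look like the following Python sketch. The in-memory `INFO_DB`, the `SIMILAR_TYPES` table, and the `limit` parameter are hypothetical stand-ins; the source does not say how the network information database is stored or how "similar" types are decided.

```python
# Hypothetical stand-in for the preset network information database:
# page type -> list of pushable items.
INFO_DB = {
    "sports": ["NBA live schedule", "football scores"],
    "fitness": ["home workout plans"],
    "reading": ["new novel releases"],
}

# Hypothetical table of types treated as similar (not defined in the source).
SIMILAR_TYPES = {"sports": ["fitness"]}

def select_push_items(page_type, db=INFO_DB, similar=SIMILAR_TYPES, limit=3):
    """Return items of the same type first, then items of similar types."""
    items = list(db.get(page_type, []))
    for other in similar.get(page_type, []):
        items.extend(db.get(other, []))
    return items[:limit]

print(select_push_items("sports"))
```

Listing same-type items ahead of similar-type items is one possible ordering; a real system might rank by other criteria.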
In another aspect, the invention also provides an information-pushing device based on webpage type, the device including:
A first weight acquisition module, configured to use the co-occurrence relation of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types, where the co-occurrence relation represents how words occur together;
An attribute library building module, configured to build a word classification attribute library with the type weights as word attributes;
A second weight acquisition module, configured to query the word classification attribute library with the current page description words obtained in real time to obtain the type weight of each current page description word for each page type;
A page type determination module, configured to calculate, for each page type, the sum of the type weights of the current page description words, and to set the page type with the largest sum as the type of the webpage currently being browsed;
An information push module, configured to push network information into the webpage the user is currently browsing based on the type of that webpage.
Preferably, the first weight acquisition module includes:
A term network building unit, configured to build a term network from the co-occurrence relation of the historical page description words obtained in advance;
A word association strength acquisition unit, configured to obtain, according to the co-occurrence relation, the association strength between the historical page description words in the term network;
A traversal unit, configured to traverse the term network and obtain the distance between the historical page description words;
An acquisition unit, configured to obtain the type weight of each historical page description word for the different page types according to the initial weight assigned in advance to each set classification core word, the distance, the association strength, and the preset decay intensity.
Preferably, the word association strength acquisition unit obtains, according to the co-occurrence relation, the number of times the historical page description words occur together, and obtains the association strength between the historical page description words with the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word i and word j, and $\max(C)$ is the maximum co-occurrence count between any two words.
Preferably, the acquisition unit obtains the type weight of each historical page description word for the different page types with the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of term node j; i and j are two associated term nodes in the term network; $S_{ij}$ is the association strength between node i and node j; $w_i$ is the type weight of node i; $\alpha$ is the preset decay intensity; and $d_i$ is the distance between node i and the classification core word. In the first calculation, $w_i$ is the initial weight assigned to the classification core word.
Preferably, the information push module includes:
An information query unit, configured to query a preset network information database for network information whose type is the same as or similar to the type of the webpage currently being browsed;
An information pushing unit, configured to push the obtained network information to the webpage currently being browsed.
Compared with the prior art, the invention provides an information-pushing method and device based on webpage type. The co-occurrence relation of historical page description words obtained in advance determines the type weight of each word for the different page types, and a word attribute library is built from these type weights. When the user browses a webpage, page description words are obtained in real time and used to query the word attribute library, yielding the type weight of each of those words for the different page types. The type weights are then summed within each page type, giving a type weight sum per page type, and the page type with the largest sum is set as the page type of the current page, so the current page type is determined more accurately. The corresponding network information is then selected and pushed according to the determined page type. Because the page type can be determined accurately, type determination and network information pushing need not be repeated, and accurate related information can be pushed to the user. The technical solution therefore offers good real-time performance and accurate determination, and substantially improves the accuracy and efficiency of information pushing.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the information-pushing method based on webpage type in an embodiment of the invention;
Fig. 2 is a schematic diagram of the topology of a term network constructed in an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the information-pushing device based on webpage type in an embodiment of the invention.
Detailed description of the invention
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are preferred ways of implementing the invention; the description is intended to illustrate the general principles of the invention and does not limit its scope. The scope of protection of the invention is defined by the claims, and all other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative work fall within that scope.
Prior-art machine learning algorithms depend on large numbers of labeled documents, and obtaining those documents is the bottleneck limiting prior-art performance. The technical solution of the invention no longer relies on labeling documents: it builds a term network from the co-occurrence statistics of webpage description words and uses the mapping relations between words in the term network to determine the type accurately in real time, which ensures push accuracy.
Fig. 1 is a schematic flowchart of the information-pushing method based on webpage type. In an embodiment of the invention, the method includes the steps of:
S1: use the co-occurrence relation of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types, where the co-occurrence relation represents how words occur together; proceed to step S2;
S2: build a word classification attribute library with the type weights as word attributes; then wait to proceed to step S3;
S3: query the word classification attribute library with the current page description words obtained in real time to obtain the type weight of each current page description word for each page type; proceed to step S4;
Preferably, step S3 includes, but is not limited to, the following steps:
Obtaining the current page description words from the webpage the user is currently browsing;
Querying the word classification attribute library using the current page description words as the index to obtain the business category of each current page description word and its type weight relative to that business category, where the word classification attribute library contains the mapping relations between page description words and page types.
S4: calculate, for each page type, the sum of the type weights of the current page description words, and set the page type with the largest sum as the type of the webpage currently being browsed; proceed to step S5;
Preferably, step S4 includes, but is not limited to, the following steps:
Calculating, within each business category, the sum of the type weights of the current page description words relative to that business category;
Setting the page type with the largest type weight sum as the type of the webpage currently being browsed.
S5: push network information into the webpage the user is currently browsing based on the type of that webpage.
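Steps S3 and S4 can be sketched in Python as a dictionary lookup followed by a per-type sum. The layout of `ATTRIBUTE_LIBRARY` (word mapped to page type mapped to type weight), the sample weights, and the function name `classify_page` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical word classification attribute library:
# word -> {page type: type weight}.
ATTRIBUTE_LIBRARY = {
    "novel": {"reading": 0.5},
    "romance": {"reading": 0.25},
    "game": {"games": 1.0},
    "chapter": {"reading": 0.125, "games": 0.0625},
}

def classify_page(words, library=ATTRIBUTE_LIBRARY):
    """Sum type weights per page type (S4) after looking up each word (S3)."""
    totals = defaultdict(float)
    for word in words:
        # Words missing from the library contribute nothing.
        for page_type, weight in library.get(word, {}).items():
            totals[page_type] += weight
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)  # page type with the largest sum
    return best, dict(totals)

page_type, totals = classify_page(["novel", "chapter", "romance", "unknown"])
print(page_type, totals)
```

Because the online part is only lookups and additions, its cost grows with the number of description words on the page, which fits the real-time requirement the text describes.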
In one or more embodiments, historical webpages can be processed offline; that is, steps S1 and S2 can run offline without occupying the system's real-time resources. The historical webpages may be accumulated from the webpages a local user visits over a period of time, or collected by a server. Accordingly, the word classification attribute library may be built or updated locally on the user's device, or built or updated on the server and then delivered to the user's device for storage and use. Processing the webpage the user is currently browsing and pushing information involve little computation but have strict real-time requirements, so they can run online in real time while the user browses; that is, steps S3 to S5 can run online.
In embodiments of the invention, a study of what HTML webpages have in common shows that existing HTML webpages contain several tags that describe the main features of the page, such as the common title, keywords, and description tags, and that the words appearing in the description information of these tags are tied together by some categorical attribute. Further analysis of the tags shows that these words can be divided into classification core words and classification description words. A classification core word is the name of a category, such as "sports" or "reading"; a classification description word describes a classification core word, as "football" and "NBA" describe "sports".
For example, the tags of the Sina sports channel page are:
<title>Sina Sports Storm_Sina.com</title>
<meta name="keywords" content="sports, sports news, Sina Sports Storm, Olympics, 2012, Olympic Games, NBA live, "/>
<meta name="description" content="Sina Sports provides the most professional sports news and event coverage, with the following main columns: domestic football, international football, basketball, NBA, general sports, Olympic Games, F1, tennis, golf, chess and cards, lottery, video, pictures, blog, sports microblog, community forum"/>
And the tags of the NetEase sports channel page are:
<title>NetEase Sports_The sports portal with attitude</title>
<meta name="keywords" content="sports, sports news, sports center, sports pictures, "/>
<meta name="description" content="Sports: the sports channel covers sports news, Premier League, Serie A, La Liga, Champions Cup, sports scores, football lottery, welfare lottery, sports photos, tennis, F1, chess and cards, table tennis and badminton, sports forum, Chinese Super League, Chinese football, general sports, and other content as a professional sports portal"/>
As can be seen, the words in the description information of the title, keywords, and description tags above were selected by people to describe the main features of a webpage, so they can be regarded as a natural form of manual labeling. They do not necessarily state the category of the webpage explicitly, but the selected words can be assumed to relate to the category, and the implicit category can be inferred from them indirectly.
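Collecting the description information from these tags could be sketched with Python's standard `html.parser` module. The class name `DescriptionParser`, the helper `split_words`, and the comma-based splitting are assumptions; real Chinese-language descriptions would normally need a proper word segmenter rather than a comma split.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.fields = {"title": "", "keywords": "", "description": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.fields["title"] += data

def page_description(html):
    parser = DescriptionParser()
    parser.feed(html)
    return parser.fields

def split_words(text):
    # Simplification: split on commas only (an assumption, not the source's method).
    return [w.strip() for w in text.split(",") if w.strip()]

SAMPLE = ('<html><head><title>NetEase Sports</title>'
          '<meta name="keywords" content="sports, sports news, "/></head></html>')
fields = page_description(SAMPLE)
print(fields["title"], split_words(fields["keywords"]))
```

Self-closing `<meta .../>` tags are routed through `handle_starttag` by the parser's default `handle_startendtag`, so no separate handler is needed.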
Preferably, in one or more embodiments of the invention, the step of using the co-occurrence relation of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types includes:
Building a term network from the co-occurrence relation of the historical page description words obtained in advance, where the term network has a fully connected topology;
Obtaining, according to the co-occurrence relation, the association strength between the historical page description words in the term network;
Traversing the term network to obtain the distance between the historical page description words;
Obtaining the type weight of each historical page description word for the different page types according to the initial weight assigned in advance to each set classification core word, the distance, the association strength, and the preset decay intensity. The classification core words can be set according to actual needs.
Preferably, in one or more embodiments of the invention, the step of obtaining, according to the co-occurrence relation, the association strength between the historical page description words in the term network includes:
Obtaining, according to the co-occurrence relation, the number of times the historical page description words occur together;
The more often two words co-occur, the greater their association strength. The association strength between the historical page description words is obtained with the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word i and word j, and $\max(C)$ is the maximum co-occurrence count between any two words.
Preferably, in one or more embodiments of the invention, the step of obtaining the type weight of each historical page description word for the different page types according to the initial weight assigned to each classification core word, the distance, the association strength, and the preset decay intensity includes:
Obtaining the type weight of each historical page description word for the different page types with the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where i and j are two associated term nodes in the term network; $S_{ij}$ is the association strength between node i and node j; $\alpha$ is the preset decay intensity; $d_i$ is the distance between node i and the classification core word; $w_j$ is the type weight of node j, which is to be calculated; and $w_i$ is the type weight of node i, which has already been calculated. In the first calculation, $w_i$ is the initial weight assigned to the classification core word.
The type weight formula above is applied recursively: node i is the classification core word of the page type only in the first calculation, and $w_i$ is the initial weight of the classification core word only in that first calculation; later calculations use the value obtained in the previous calculation.
The meanings of $w_j$ and $w_i$ depend on the corresponding page type. For example, if $w_i$ is the initial weight assigned to classification core word i for the first page type, then $w_j$ is the type weight of word j for the first page type; if $w_i$ is the initial weight assigned to classification core word i for the second page type, then $w_j$ is the type weight of word j for the second page type; and so on. In this way the type weight of each historical page description word for the different page types is obtained.
In the above embodiment, to reduce system overhead, the process of obtaining type weights stops when $w_j$ falls below a certain threshold.
More specifically, in a preferred embodiment of the invention, the historical page description words can be obtained in advance as follows: description information is obtained from the tags of historical webpages (such as the title, keywords, and description tags), and words are then extracted from the description information to obtain the historical page description words. Because this extraction can run offline, more complex extraction methods can be used, such as statistical or semantic word segmentation.
In the above embodiment, after the term network is obtained, it must further be determined which words are used to establish the page type categories. In a preferred embodiment of the invention, the words appearing in webpage tags are divided into two kinds: classification core words and classification description words. In the Sina and NetEase examples above, "sports" is the classification core word, while "scores", "Premier League", "NBA", and so on are classification description words. For a single webpage, however, a computer can hardly tell classification core words from classification description words. Through statistical analysis of a large number of webpages, the invention finds that classification core words occur more frequently than classification description words, so the classification core words and classification description words among the tag words can be distinguished by counting how often each word occurs.
As a specific embodiment, the results obtained after processing the description information of several webpages are given below (the words in each description after word segmentation):
Webpage 1: novel, romance, reading ...
Webpage 2: reading, novel, chapter, romance ...
Webpage 3: reading, magazine ...
Webpage 4: reading, time travel, novel ...
Webpage 5: Angry Birds, game, Need for Speed ...
Webpage 6: Counterattack of the Wild Boar, Angry Birds ...
Webpage 7: game, Plants vs. Zombies ...
Webpage 8: software, Huajun ...
Webpage 9: software, tool ...
Webpage 10: automobile, Volkswagen ...
Webpage 11: automobile, Cruze, Chevrolet ...
The term network that obtains after processing according to above-mentioned explanation is as in figure 2 it is shown, from figure 2 it can be seen that limit Power is determined by the co-occurrence number of times in different web pages between word, and co-occurrence number of times is the most, and the weights of incidence edge are the biggest.Fig. 2 associates Spend high node, such as reading, game, software, automobile etc., it is believed that being classification core word, other word is classified description word. According to Fig. 2 it can be seen that the degree of incidence of classification core word and other words is apparently higher than general category descriptor, in the drawings It is more that performance is exactly the number on the limit that node has, and therefore i.e. can determine that classification core word by the limit number of statistics node.Determine I.e. each page type can be divided by each classification core word, the type power of each classification core word under default situations after classification core word Value is set to 1, using it as the basis calculating each classified description part of speech type weights.
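The edge-counting idea can be tried on the eleven example pages with a short Python sketch. The English renderings of the words, the adjacency-dictionary layout, and the function names are assumptions; the example lists in the source are truncated, so the degrees below describe only this partial sample and not Fig. 2 itself.

```python
from collections import defaultdict
from itertools import combinations

# English renderings of the example pages (truncated in the source).
PAGES = [
    ["novel", "romance", "reading"],
    ["reading", "novel", "chapter", "romance"],
    ["reading", "magazine"],
    ["reading", "time travel", "novel"],
    ["Angry Birds", "game", "Need for Speed"],
    ["Counterattack of the Wild Boar", "Angry Birds"],
    ["game", "Plants vs. Zombies"],
    ["software", "Huajun"],
    ["software", "tool"],
    ["automobile", "Volkswagen"],
    ["automobile", "Cruze", "Chevrolet"],
]

def build_term_network(pages):
    """Return {word: {neighbour: co-occurrence count}} as an undirected graph."""
    network = defaultdict(lambda: defaultdict(int))
    for words in pages:
        for a, b in combinations(set(words), 2):
            network[a][b] += 1
            network[b][a] += 1
    return network

def rank_by_degree(network):
    """Words sorted by number of edges, highest first (ties broken by name)."""
    return sorted(network, key=lambda w: (-len(network[w]), w))

network = build_term_network(PAGES)
ranking = rank_by_degree(network)
print(ranking[0], len(network[ranking[0]]))
```

On this truncated sample "reading" stands out clearly, but "game" and "Angry Birds" tie on edge count, which illustrates why a larger corpus or a manual adjustment of the core words may be needed.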
Of course, the category core words computed automatically by machine may be imperfect and sometimes do not correspond one-to-one with the classification the business requires. In that case the page types can be adjusted on the basis of the machine result: some category core words are assigned to more accurate page types, and the corresponding type weight (the initial weight of the category core word under that page type) is specified at the same time. For example, one way of adjusting the category core words of Fig. 2 is as follows:
After the page types of the category core words have been determined, the page types are extended according to the type weights of the category core words: the type weight of a page type is diffused along the association edges with exponential decay (using the preset decay intensity). Each word thus obtains a type weight for each page type according to its association strength with, and its distance from, the category core words.
Specifically, starting from the initial type weight of a category core word for its page type, a breadth-first search (BFS) over the graph computes the type weights, for each page type, of every word associated with the category core word, then of the words associated with those words, and so on, until the type weights of every page type have been diffused across the whole term network.
The type weight of a page description word for a given page type is computed as follows:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of word $j$; $i$ and $j$ are two associated word nodes in the term network; $S_{ij}$ is the association strength between node $i$ and node $j$; $w_i$ is the type weight of node $i$ (in the first computation, $w_i$ is the initial weight given to category core word $i$); $\alpha$ is the preset decay intensity; and $d_i$ is the distance between node $i$ and the category core word.
To save computing resources, keep the word classification attribute library from growing too large, and retain only strongly associated words, expansion stops at node $j$ when its type weight $w_j$, computed with the formula above, falls below a given threshold.
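The breadth-first diffusion with a stopping threshold can be sketched as follows. Several points are assumptions rather than statements of the patent: the function name `diffuse_type_weights`, the choice $\alpha = 2$ (any $\alpha > 1$ makes the factor $\alpha^{-(d_i+1)}$ a decay), the threshold values, the rule that a word reachable from several parents at the same depth keeps the largest weight, and the association strengths in the toy network.

```python
from collections import defaultdict


def diffuse_type_weights(strength, core_word, initial_weight=1.0,
                         alpha=2.0, threshold=0.05):
    """Diffuse one page type's weight from its core word by BFS.

    strength maps an unordered word pair to its association strength
    S_ij. Each newly reached word j gets
    w_j = w_i * S_ij * alpha ** -(d_i + 1), where d_i is the BFS
    distance of the parent i from the core word. Words whose weight
    falls below threshold are neither stored nor expanded.
    """
    neighbors = defaultdict(dict)
    for (a, b), s in strength.items():
        neighbors[a][b] = s
        neighbors[b][a] = s

    weights = {core_word: initial_weight}
    frontier = [core_word]
    distance = 0  # d_i shared by every node in the current frontier
    while frontier:
        reached = {}
        for i in frontier:
            for j, s_ij in neighbors[i].items():
                if j in weights:
                    continue  # already assigned at a shorter distance
                w_j = weights[i] * s_ij * alpha ** -(distance + 1)
                # Assumption: keep the largest weight among parents.
                reached[j] = max(w_j, reached.get(j, 0.0))
        frontier = []
        for j, w_j in reached.items():
            if w_j >= threshold:
                weights[j] = w_j
                frontier.append(j)
        distance += 1
    return weights


# Hypothetical association strengths for a small "game" neighbourhood.
strength = {
    ("Angry Birds", "game"): 1.0,
    ("Need for Speed", "game"): 0.5,
    ("Angry Birds", "Counterattack of the Wild Boar"): 0.5,
}
game_weights = diffuse_type_weights(strength, "game")
pruned = diffuse_type_weights(strength, "game", threshold=0.1)
```

Running the function once per category core word and collecting the results per word gives each word a set of type weights, one per page type. Raising the threshold, as in `pruned`, drops weakly associated words and keeps the resulting library small.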
The whole term network is traversed in this way, and the type weights of each word for the different page types are output. Taking the term network of Fig. 2 again as an example, and assuming the category core words have been determined, the traversal yields:
Game: game: 1.0; software: 0.3
Angry Birds: game: 0.6; software: 0.2
Software: software: 1.0; game: 0.2
After the traversal is complete, the word classification attribute library can be built (if no such library exists locally) or updated (if one already exists locally). The library records the type weights each word has for the different page types.
Subsequently, as the user browses, the web page currently being browsed is classified in real time to determine the type of the current page; the classification proceeds as in the steps described above.
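One possible in-memory shape for the word classification attribute library is a mapping from word to a mapping from page type to type weight. The sketch below uses the traversal result listed above; the function name `update_attribute_library` and the merge rule (a newly computed weight overwrites the stored one for the same word and page type) are assumptions, since the text does not specify how an update is merged, and the second call uses a made-up weight purely to show an update.

```python
def update_attribute_library(library, traversal_result):
    """Merge traversal output into the library in place and return it."""
    for word, type_weights in traversal_result.items():
        # Assumption: a new weight replaces the stored weight for the
        # same (word, page type) pair; other stored pairs are kept.
        library.setdefault(word, {}).update(type_weights)
    return library


traversal_result = {
    "game": {"game": 1.0, "software": 0.3},
    "Angry Birds": {"game": 0.6, "software": 0.2},
    "software": {"software": 1.0, "game": 0.2},
}

# Build: no library exists yet, so start from an empty mapping.
library = update_attribute_library({}, traversal_result)
# Update: a later traversal (hypothetical value) refines one weight.
library = update_attribute_library(library, {"Angry Birds": {"game": 0.7}})
```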
In a preferred embodiment of the invention, the description information of the web page is first extracted from the tags (such as title, keywords and description) of the page the user is currently browsing, and words are then extracted from the description information to obtain the current page description words. Because this extraction must run online, a simple dictionary-matching word segmentation method (mechanical word segmentation) can be used. After extraction, the current page description words are used as keys to query the word classification attribute library, and the classification attributes of the words are aggregated. That is, the type weights of each current page description word for each page type are obtained; for each page type, the type weights of all current page description words are summed; and the page type with the largest sum is taken as the type of the currently browsed page.
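Dictionary-based mechanical segmentation is commonly implemented as forward maximum matching, sketched below. The text does not say which matching variant is used, so this choice is an assumption, as are the function name `forward_max_match` and the tiny dictionary. An unspaced Latin string stands in for unsegmented Chinese tag text.

```python
def forward_max_match(text, dictionary):
    """Segment text by taking the longest dictionary word at each position.

    Characters that start no dictionary word are skipped one at a time.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                words.append(candidate)
                pos += size
                break
        else:
            pos += 1  # no dictionary word starts here
    return words


# Hypothetical dictionary; "down" is included to show that the longer
# match "download" wins.
dictionary = {"game", "guide", "download", "down"}
words = forward_max_match("gameguidedownload", dictionary)
with_noise = forward_max_match("gamexguide", dictionary)
```

The method needs only set lookups and no trained model, which fits the online processing requirement stated above; the trade-off is that words missing from the dictionary are silently dropped.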
For example, the description information of a web page the user is browsing in real time is obtained, and extraction yields the following page description words:
entertainment, strategy guide, handbook, download
Querying the word classification attribute library gives the type weights of each page description word for each page type:
Entertainment: game: 0.7; video: 0.6
Strategy guide: game: 0.9; reading: 0.4
Handbook: game: 0.6
Download: software: 0.9
Summing the type weights for each page type separately gives:
Game: 0.7 + 0.9 + 0.6 = 2.2
Video: 0.6
Reading: 0.4
Software: 0.9
The sum of type weights is largest for the "game" page type, so this web page is classified as type "game".
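The worked example above can be reproduced with a short Python sketch. The weights are those of the example; the word names are English renderings, and the function name `classify_page` and the dictionary layout of the library are assumptions made for illustration.

```python
from collections import defaultdict


def classify_page(words, attribute_library):
    """Sum type weights per page type and return (best_type, totals).

    Words absent from the library contribute nothing. If no word is
    found at all, best_type is None.
    """
    totals = defaultdict(float)
    for word in words:
        for page_type, weight in attribute_library.get(word, {}).items():
            totals[page_type] += weight
    if not totals:
        return None, {}
    best_type = max(totals, key=totals.get)
    return best_type, dict(totals)


attribute_library = {
    "entertainment": {"game": 0.7, "video": 0.6},
    "strategy guide": {"game": 0.9, "reading": 0.4},
    "handbook": {"game": 0.6},
    "download": {"software": 0.9},
}
page_words = ["entertainment", "strategy guide", "handbook", "download"]
page_type, totals = classify_page(page_words, attribute_library)
```

Here `page_type` is "game", with a total of about 2.2 (subject to floating-point rounding), ahead of "software" at 0.9.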
Those skilled in the art will understand that, corresponding to the method of the invention, the invention also includes an information pushing device based on web page type. Fig. 3 is a structural diagram of this device, whose modules correspond to the method steps above. The device includes:
a first weight acquisition module, which uses the co-occurrence relations of historical page description words obtained in advance to obtain the type weight of each historical page description word for the different page types, where a co-occurrence relation represents the state of words appearing together;
an attribute library building module, which builds a word classification attribute library with the type weights as attributes of the words;
a second weight acquisition module, which queries the word classification attribute library with the current page description words obtained in real time to obtain the type weights of the current page description words for each page type;
a page type determination module, which computes, for each page type, the sum of the type weights of the current page description words and sets the page type with the largest sum as the type of the currently browsed web page;
an information pushing module, which pushes network information into the web page the user is currently browsing, based on the type of that web page.
Compared with the prior art, the invention provides an information pushing method and device based on web page type. The co-occurrence relations of historical page description words obtained in advance determine each word's type weight for the different page types, and a word attribute library is built from these type weights. When a user browses a web page, its page description words are obtained in real time and used to query the word attribute library, yielding the type weight of each of these words for the different page types. The type weights are then summed within each page type, and the page type with the largest sum is set as the page type of the current page, so the current page type is determined more accurately. The network information to push is then selected according to the determined page type. Because the page type is determined accurately, page type judgment and information pushing need not be repeated, and accurate related information can be pushed to the user. The technical solution therefore responds in real time, judges accurately, and greatly improves the accuracy and efficiency of information pushing.
Preferably, in one or more embodiments of the invention, the first weight acquisition module includes:
a term network building unit, which builds a term network from the co-occurrence relations of the historical page description words obtained in advance;
a word association strength acquiring unit, which obtains the association strength between the historical page description words in the term network according to the co-occurrence relations;
a traversal unit, which traverses the term network to obtain the distances between the historical page description words;
an acquiring unit, which obtains the type weight of each historical page description word for the different page types according to the initial weight given in advance to each set category core word, the distances, the association strengths, and the preset decay intensity.
Preferably, in one or more embodiments of the invention, the word association strength acquiring unit obtains, according to the co-occurrence relations, the number of times the historical page description words occur together, and obtains the association strength between the historical page description words according to the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word $i$ and word $j$, and $\max(C)$ is the largest co-occurrence count between any two words.
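The normalization can be sketched in a few lines of Python. The function name `association_strength` is hypothetical, and the counts are illustrative values consistent with the truncated example pages (for instance, "novel" and "reading" share three pages).

```python
def association_strength(cooccurrence):
    """Normalize co-occurrence counts: S_ij = C_ij / max(C)."""
    if not cooccurrence:
        return {}
    max_count = max(cooccurrence.values())
    return {pair: count / max_count for pair, count in cooccurrence.items()}


counts = {
    ("novel", "reading"): 3,
    ("novel", "romance"): 2,
    ("magazine", "reading"): 1,
}
strength = association_strength(counts)
```

Dividing by the global maximum keeps every strength in the interval (0, 1], so the most frequent pair gets strength 1 and the decay formula never amplifies a weight through the $S_{ij}$ factor.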
Preferably, in one or more embodiments of the invention, the acquiring unit obtains the type weight of each historical page description word for the different page types using the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of word node $j$; $i$ and $j$ are two associated word nodes in the term network; $S_{ij}$ is the association strength between node $i$ and node $j$; $w_i$ is the type weight of node $i$; $\alpha$ is the preset decay intensity; $d_i$ is the distance between node $i$ and the category core word; and in the first computation, $w_i$ is the initial weight given to the category core word.
Preferably, in one or more embodiments of the invention, the information pushing module includes:
an information query unit, which queries a preset network information database for network information whose type is the same as or similar to that of the currently browsed web page;
an information pushing unit, which pushes the retrieved network information to the currently browsed web page.
Compared with the prior art, the technical solution of the invention manages the category associations between web page description words through a term network and determines the page type of a web page from the type weights of its words, so that information can be pushed more efficiently. Its main advantages are as follows:
1. The algorithm is simple and efficient to implement. Tests on real data show that the method can basically meet the accuracy and performance requirements of online web page type determination.
2. No manually labeled web page classification samples are needed, which greatly reduces labor cost and workload.
3. The data model is simple to update and supports incremental updates. Adding sample data, or even adding page types, requires no retraining, which is difficult for general classification algorithms to achieve.
Those of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments above can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above. The storage medium may be a ROM/RAM, a magnetic disk, an optical disc, a memory card, or the like.
Although the invention has been described above with reference to preferred embodiments, those skilled in the art will appreciate that the method and system of the invention are not limited to the embodiments described in the detailed description, and that various modifications, additions and substitutions can be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An information pushing method based on web page type, characterized in that the method comprises the steps of:
using co-occurrence relations of historical page description words obtained in advance to obtain the type weight of each historical page description word for different page types; wherein the co-occurrence relations represent the state of words appearing together, the page types are defined by category core words, a web page contains tags describing the main feature information of the web page, and the description information of the tags contains the category core words;
building a word classification attribute library with the type weights as attributes of the words;
querying the word classification attribute library with current page description words obtained in real time to obtain the type weights of the current page description words for each page type;
computing, for each page type, the sum of the type weights of the current page description words, and setting the page type with the largest sum as the type of the currently browsed web page;
pushing network information into the web page the user is currently browsing, based on the type of the currently browsed web page.
2. The information pushing method based on web page type according to claim 1, characterized in that the step of using co-occurrence relations of historical page description words obtained in advance to obtain the type weight of each historical page description word for different page types comprises:
building a term network from the co-occurrence relations of the historical page description words obtained in advance;
obtaining the association strength between the historical page description words in the term network according to the co-occurrence relations;
traversing the term network to obtain the distances between the historical page description words;
obtaining the type weight of each historical page description word for different page types according to the initial weight given in advance to each set category core word, the distances, the association strengths, and a preset decay intensity.
3. The information pushing method based on web page type according to claim 2, characterized in that the step of obtaining the association strength between the historical page description words in the term network according to the co-occurrence relations comprises:
obtaining, according to the co-occurrence relations, the number of times the historical page description words occur together;
obtaining the association strength between the historical page description words according to the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word $i$ and word $j$, and $\max(C)$ is the largest co-occurrence count between any two words.
4. The information pushing method based on web page type according to claim 2, characterized in that the step of obtaining the type weight of each historical page description word for different page types according to the initial weight given to each category core word, the distances, the association strengths, and the preset decay intensity comprises:
obtaining the type weight of each historical page description word for different page types using the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of word node $j$; $i$ and $j$ are two associated word nodes in the term network; $S_{ij}$ is the association strength between word node $i$ and word node $j$; $w_i$ is the type weight of word node $i$; $\alpha$ is the preset decay intensity; $d_i$ is the distance between node $i$ and the category core word; and in the first computation, $w_i$ is the initial weight given to the category core word.
5. The information pushing method based on web page type according to any one of claims 1 to 4, characterized in that the step of pushing network information into the web page the user is currently browsing, based on the type of the currently browsed web page, comprises:
querying a preset network information database for network information whose type is the same as or similar to that of the currently browsed web page;
pushing the retrieved network information to the currently browsed web page.
6. An information pushing device based on web page type, characterized in that the device comprises:
a first weight acquisition module, which uses co-occurrence relations of historical page description words obtained in advance to obtain the type weight of each historical page description word for different page types; wherein the co-occurrence relations represent the state of words appearing together, the page types are defined by category core words, a web page contains tags describing the main feature information of the web page, and the description information of the tags contains the category core words;
an attribute library building module, which builds a word classification attribute library with the type weights as attributes of the words;
a second weight acquisition module, which queries the word classification attribute library with current page description words obtained in real time to obtain the type weights of the current page description words for each page type;
a page type determination module, which computes, for each page type, the sum of the type weights of the current page description words and sets the page type with the largest sum as the type of the currently browsed web page;
an information pushing module, which pushes network information into the web page the user is currently browsing, based on the type of the currently browsed web page.
7. The information pushing device based on web page type according to claim 6, characterized in that the first weight acquisition module comprises:
a term network building unit, which builds a term network from the co-occurrence relations of the historical page description words obtained in advance;
a word association strength acquiring unit, which obtains the association strength between the historical page description words in the term network according to the co-occurrence relations;
a traversal unit, which traverses the term network to obtain the distances between the historical page description words;
an acquiring unit, which obtains the type weight of each historical page description word for different page types according to the initial weight given in advance to each set category core word, the distances, the association strengths, and a preset decay intensity.
8. The information pushing device based on web page type according to claim 7, characterized in that
the word association strength acquiring unit obtains, according to the co-occurrence relations, the number of times the historical page description words occur together, and obtains the association strength between the historical page description words according to the following formula:
$$S_{ij} = C_{ij} / \max(C)$$
where $C_{ij}$ is the number of co-occurrences of word $i$ and word $j$, and $\max(C)$ is the largest co-occurrence count between any two words.
9. The information pushing device based on web page type according to claim 7, characterized in that the acquiring unit obtains the type weight of each historical page description word for different page types using the following formula:
$$w_j = w_i \cdot S_{ij} \cdot \alpha^{-(d_i + 1)}$$
where $w_j$ is the type weight of word node $j$; $i$ and $j$ are two associated word nodes in the term network; $S_{ij}$ is the association strength between word node $i$ and word node $j$; $w_i$ is the type weight of word node $i$; $\alpha$ is the preset decay intensity; $d_i$ is the distance between node $i$ and the category core word; and in the first computation, $w_i$ is the initial weight given to the category core word.
10. The information pushing device based on web page type according to any one of claims 6 to 9, characterized in that the information pushing module comprises:
an information query unit, which queries a preset network information database for network information whose type is the same as or similar to that of the currently browsed web page;
an information pushing unit, which pushes the retrieved network information to the currently browsed web page.
CN201310410102.6A 2013-09-10 2013-09-10 Information-pushing method based on type of webpage and device Active CN103440342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310410102.6A CN103440342B (en) 2013-09-10 2013-09-10 Information-pushing method based on type of webpage and device

Publications (2)

Publication Number Publication Date
CN103440342A CN103440342A (en) 2013-12-11
CN103440342B true CN103440342B (en) 2016-10-26

Family

ID=49694035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310410102.6A Active CN103440342B (en) 2013-09-10 2013-09-10 Information-pushing method based on type of webpage and device

Country Status (1)

Country Link
CN (1) CN103440342B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122367B (en) * 2016-02-25 2020-07-03 阿里巴巴集团控股有限公司 User attribute value calculation method and device based on user browsing behavior
CN108512879A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of information-pushing method and device
CN114637817A (en) * 2020-12-15 2022-06-17 深信服科技股份有限公司 Text classification method, device, equipment and computer readable storage medium
CN116932942A (en) * 2022-03-31 2023-10-24 贵州白山云科技股份有限公司 Webpage access method, device, medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890689A (en) * 2011-07-22 2013-01-23 北京百度网讯科技有限公司 Method and system for building user interest model
CN102902794A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Web page classification system and method
CN103116854A (en) * 2012-12-07 2013-05-22 大连奥林匹克电子城咨信商行 Network advertisement targeted delivery method based on types of advertisements visited by visitors

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7912458B2 (en) * 2005-09-14 2011-03-22 Jumptap, Inc. Interaction analysis and prioritization of mobile content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于衰减词共现图的多文档摘要研究;周进华 等;《小型微型计算机系统》;20090130(第1期);173-177 *

Also Published As

Publication number Publication date
CN103440342A (en) 2013-12-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 16 floor tower square

Applicant after: Guangzhou Dongjing Computer Technology Co., Ltd.

Address before: 3, building 16, building B, information tower, No. 510665, Yun Yun Road, Guangzhou, Guangdong, Tianhe District

Applicant before: Guangzhou Dongjing Computer Technology Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200422

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 16 floor tower square

Patentee before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.