
CN104462241A - Population property classification method and device based on anchor texts and peripheral texts in URLs - Google Patents


Info

Publication number
CN104462241A
Authority
CN
China
Prior art keywords
classification
category
classification model
urls
population
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410658093.7A
Other languages
Chinese (zh)
Inventor
张岩峰
梁东山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201410658093.7A priority Critical patent/CN104462241A/en
Publication of CN104462241A publication Critical patent/CN104462241A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558Details of hyperlinks; Management of linked annotations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a population property classification method and device based on anchor texts and peripheral texts in URLs. The method comprises the first step of acquiring the anchor texts and the peripheral texts in the URLs clicked by unknown users within a preset time period, the second step of classifying the URLs into different catalogues according to the anchor texts, the peripheral texts and a pre-established first classification model, wherein the first classification model is obtained by conducting classification training through classified catalogues on the Internet, and the third step of conducting population property classification prediction on the unknown users according to catalogue feature information under different catalogues and a pre-established second classification model, wherein the second classification model is obtained by conducting classification training according to the catalogue feature information under the catalogues which the URLs clicked by known users belong to and population properties.

Description

Population attribute classification method and device based on anchor characters and peripheral texts in Uniform Resource Locator (URL)
Technical Field
The invention relates to the technical field of data mining, in particular to a population attribute classification method and device based on anchor characters in a Uniform Resource Locator (URL) and peripheral texts.
Background
A person's demographic attributes include, but are not limited to, age, gender, household income, occupation category, education level, and life stage. Insight into demographic attributes is of great practical significance for personalized Web applications, personalized advertisement delivery and the like. For example, statistical insight can help website managers understand the demographic attributes of their visitors and optimize website content and presentation for target groups.
Existing population attribute classification methods generally extract text features from the Web pages browsed by a user and feed those text features into a pre-established population attribute classification model to classify the user's population attributes. The population attribute classification model is trained using, as sample data, the population attribute information of known users and the text features contained in the Web pages they browsed.
However, the above method needs to acquire keyword information from the Web pages browsed by a user. A Web page carries a large amount of information and considerable noise, and cannot directly reflect the purpose of the user's click. In addition, the population attribute classification model in that method is built from sample information of known users; the number of known-user samples is limited, so the text features of the browsed Web pages are highly sparse.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for classifying population attributes based on anchor characters and surrounding texts in a URL, which can quickly and accurately classify the population attributes of a user.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a population attribute classification method based on anchor characters and peripheral texts in a Uniform Resource Locator (URL), which comprises the following steps of:
acquiring anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
classifying the URLs into different catalog categories according to the anchor characters, the peripheral texts and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training;
and performing population attribute classification prediction on the unknown user according to the category characteristic information under the different directory categories and a pre-established second classification model, wherein the second classification model is obtained by performing classification training according to the population attributes of known users and the category characteristic information under the directory categories to which the URLs clicked by the known users belong.
Further, the category characteristic information includes the number of URLs;
obtaining the second classification model by performing classification training according to the category characteristic information and the population attributes under the directory categories to which the URLs clicked by known users belong comprises:
generating feature vectors according to the number of URLs under each directory category to which the URLs clicked by the known users belong, and training with a classification algorithm to obtain a correspondence between the feature vectors and population attributes;
performing population attribute classification prediction on the unknown user according to the category characteristic information under the different directory categories and the pre-established second classification model comprises:
generating a feature vector to be classified according to the number of URLs under the different directory categories; determining the feature vector in the second classification model that best matches the feature vector to be classified; and determining the population attributes corresponding to the feature vector to be classified according to the best-matching feature vector.
Further, the classification algorithm is any one of the following:
a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm and a Bayesian classification algorithm.
Further, the first classification model is obtained by performing classification training by using an internet classification directory, and includes:
crawling a directory tree from a classification service website provided on the Internet to serve as the taxonomy, wherein the directory tree comprises different directory categories;
and training on the text content contained in the web pages under the different directory categories to obtain the first classification model.
Further, the training on the text content contained in the web pages under the different directory categories includes:
extracting feature words from the web page content and constructing feature vectors;
and classifying the URLs of the web pages with a classification algorithm according to the feature vectors and the directory categories.
The invention also provides a population attribute classification device based on the anchor characters in the URL and the peripheral text, which comprises:
the acquisition module is used for acquiring anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
the catalog classification module is used for classifying the URL into different catalog categories according to the anchor characters, the peripheral text and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training;
and the population attribute forecasting module is used for carrying out population attribute classification forecasting on the unknown user according to the category characteristic information under different directory categories and a pre-established second classification model, and the second classification model is obtained by carrying out classification training according to the category characteristic information and the population attributes under the directory category to which the URL clicked by the known user belongs.
Further, the category characteristic information includes the number of URLs;
the device further comprises:
the second classification model establishing module is used for generating a feature vector according to the number of URLs under the category to which the URL clicked by the known user belongs, and training by using a classification algorithm to obtain a corresponding relation between the feature vector and population attributes;
the population attribute forecasting module is specifically used for generating feature vectors to be classified according to the URL quantity under different directory categories; determining a feature vector which is most matched with the feature vector to be classified in the second classification model; and determining population attributes corresponding to the feature vectors to be classified according to the most matched feature vectors.
Further, the classification algorithm is any one of the following:
a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm and a Bayesian classification algorithm.
Further, the apparatus further includes:
the first classification model establishing module is specifically used for capturing a directory tree as a classification from a classification service website provided on the internet, wherein the directory tree comprises different directory categories, and training text contents contained in webpages under the different directory categories to obtain a first classification model.
Further, the first classification model building module is specifically configured to extract feature words in the web page content, construct feature vectors, and classify the URLs of the web pages by using a classification algorithm according to the feature vectors and the category of the catalog.
According to the method, the anchor text and surrounding text of the URL links clicked by the unknown user are used as the basis for classification. Compared with the Web pages browsed by the user, the anchor text and surrounding text of a URL link are short, concise and low in noise, so they directly reflect the purpose of the user's click and make the population attribute prediction more accurate. In addition, when predicting the population attributes of an unknown user, the URLs clicked by the user are first classified with the first classification model. Because the first classification model is trained on an Internet directory rather than on the URLs clicked by known users, it has wide coverage and a complete taxonomy, which overcomes the sparsity problem caused by using a limited set of known users as training samples.
Drawings
Fig. 1 is a schematic flowchart of a population attribute classification method based on anchor characters and surrounding text in a URL according to embodiment 1 of the present invention;
Fig. 2 is a schematic flow chart of a first classification model establishment method according to embodiment 1 of the present invention;
Fig. 3 is a schematic flow chart of a second classification model establishment method according to embodiment 1 of the present invention;
Fig. 4 is a schematic structural diagram of a demographic property classification apparatus based on anchor text and surrounding text in a URL according to embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of a demographic property classification apparatus based on anchor text and surrounding text in a URL according to embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of a demographic property classification apparatus based on anchor text and surrounding text in a URL according to embodiment 4 of the present invention.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings.
Fig. 1 is a schematic flow chart of a population attribute classification method based on anchor characters and surrounding text in a URL according to embodiment 1 of the present invention, as shown in fig. 1, including the following steps:
S101, obtaining anchor characters and peripheral texts in a URL clicked within a preset time period of an unknown user.
Specifically, when population attribute classification prediction is performed on an unknown user, the anchor text and surrounding text of the URLs clicked by the user within a preset time period are obtained first. The preset time period may be determined according to actual conditions, for example one week, several weeks, one month, or several months, and is not specifically limited herein. The anchor text is the text of a URL link, that is, the clickable text of a URL link in a web page, and it reflects the topics of both the current page and the page pointed to. The surrounding text is the descriptive text around the URL link; it lies in the same text paragraph of the same web page as the anchor text and helps explain the URL. The anchor text and surrounding text of a URL link are also collectively referred to as the context information of the link. Specifically, the user's click data stream over a certain period is tracked, and the anchor text and surrounding text of each clicked URL link are extracted by parsing the structure of the web page in which the link is located.
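The extraction described above can be sketched as follows. This is a minimal illustration that assumes the page is available as an HTML string and that a "text paragraph" can be approximated by a small set of block-level tags; the names `LinkContextParser`, `extract_link_context` and `BLOCK_TAGS` are hypothetical and do not come from the source.

```python
from html.parser import HTMLParser

# Tags treated as paragraph-like containers (an assumption of this sketch).
BLOCK_TAGS = {"p", "li", "div", "td"}


class LinkContextParser(HTMLParser):
    """Collects the anchor text and same-paragraph surrounding text of one URL."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.surrounding = []   # text of the current block outside the link
        self.anchor = []        # text inside the matching <a> element
        self.in_target = False
        self.found = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._close_block()
        elif tag == "a" and not self.found:
            if dict(attrs).get("href") == self.target_url:
                self.in_target = True
                self.found = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target = False
        elif tag in BLOCK_TAGS:
            self._close_block()

    def handle_data(self, data):
        (self.anchor if self.in_target else self.surrounding).append(data)

    def _close_block(self):
        # When the block holding the link ends, freeze the result once.
        if self.found and self.result is None:
            anchor = " ".join("".join(self.anchor).split())
            around = " ".join("".join(self.surrounding).split())
            self.result = (anchor, around)
        self.surrounding = []


def extract_link_context(html, target_url):
    """Returns (anchor_text, surrounding_text), or None if the link is absent."""
    parser = LinkContextParser(target_url)
    parser.feed(html)
    parser.close()
    parser._close_block()  # flush a trailing block that has no closing tag
    return parser.result
```

Only the first occurrence of the target link is used, and nested block tags simply start a new paragraph; a production extractor would need a more careful notion of "same paragraph".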
S102, classifying the URLs into different catalog categories according to the anchor characters, the peripheral texts and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training.
Specifically, the anchor text and surrounding text correspond to a URL that links to the corresponding web page content, and they directly reflect the user's click intention. The feature words in the anchor text and surrounding text are extracted and formed into a feature vector, and the URL corresponding to the feature words is classified into one of the directory categories of the pre-established first classification model with a matching algorithm. Specifically, the feature vector generated from the feature words in the anchor text and surrounding text is matched against the feature vectors in the first classification model to determine the feature vector in the model that best matches it. The directory category corresponding to the best-matching feature vector is taken as the directory category of the anchor text and surrounding text, and the URL corresponding to the anchor text and surrounding text is classified under that category. The first classification model is obtained by classification training on an Internet classification directory; for its establishment, refer to the embodiment described with reference to fig. 2 below.
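A minimal sketch of this matching step is shown below. It assumes, purely for illustration, that the first classification model can be represented as one bag-of-words vector per directory category and that "best match" means highest cosine similarity; the source does not fix either choice, and the function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts; whitespace tokenization is a simplification."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_link(anchor, surrounding, category_vectors):
    """Returns the directory category whose vector best matches the link context."""
    query = to_vector(anchor + " " + surrounding)
    return max(category_vectors,
               key=lambda c: cosine(query, category_vectors[c]))
```

For instance, with hypothetical category vectors built from "shoes hats clothing fashion" and "hotel flight travel tickets", a link whose context is "cheap running shoes" plus "new fashion" is assigned to the first category.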
S103, conducting population attribute classification forecasting on the unknown user according to class characteristic information under different directory classes and a pre-established second classification model, wherein the second classification model is obtained through classification training according to class characteristic information and population attributes under the directory class to which the URL clicked by the known user belongs.
Specifically, the second classification model is obtained by classification training according to the population attributes of known users and the category characteristic information under the directory categories to which the URLs clicked by the known users belong; the model has thus established the correspondence between the category characteristic information under each directory category and the population attributes. Demographic attributes include, but are not limited to: age, gender, occupation, education level, industry (e.g., IT, law, agriculture, medicine, processing, business, non-profit), household income level, life stage (e.g., student, job hunting, marriage, pregnancy, child-rearing, career maturity, old age), place of residence, etc.
Specifically, after the URLs of the unknown user are classified into the different directory categories of the first classification model, the number of URLs under each directory category, that is, the category characteristic information, is counted. A feature vector to be classified is generated from these URL counts, and population attribute classification prediction is performed on the unknown user according to the feature vector to be classified and the pre-established second classification model. The second classification model has established the correspondence between the feature vectors of known users and their population attributes. It is only necessary to input the feature vector to be classified of the unknown user into the second classification model, search the model for the feature vector that best matches it, and determine the population attribute corresponding to the feature vector to be classified from the best-matching feature vector; that population attribute is the population attribute of the unknown user.
For example, suppose an unknown user clicked a total of 100 URLs in the most recent time period. The feature words contained in the anchor text and surrounding text of these 100 URLs are extracted, and the first classification model classifies the 100 URLs into different directory categories. Suppose 10 URLs fall under category A, 50 under category B, and 40 under category C. The URL counts 10, 50 and 40 for categories A, B and C then form the feature vector [10, 50, 40]. This feature vector is matched against all feature vectors over categories A, B and C established in the second classification model; alternatively, the feature vector is normalized first and the normalized vector is matched. For example, the feature vector in the model with the shortest Euclidean distance to this feature vector may be determined; the population attribute corresponding to that closest feature vector is the population attribute corresponding to this feature vector, and hence the population attribute of the unknown user. For the establishment of the second classification model, refer to the embodiment described with reference to fig. 3.
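The worked example can be sketched in a few lines. The known-user vectors and the labels `group_1` and `group_2` below are invented for illustration only, and the second classification model is assumed to be a plain list of (feature vector, population attribute) pairs searched by shortest Euclidean distance after normalization.

```python
import math
from collections import Counter


def count_vector(url_categories, category_order):
    """Counts clicked URLs per directory category, in a fixed category order."""
    counts = Counter(url_categories)
    return [counts[c] for c in category_order]


def normalize(vec):
    """Scales counts to proportions so users with different click totals compare."""
    total = sum(vec)
    return [x / total for x in vec] if total else list(vec)


def nearest_attribute(query, known_users):
    """Returns the attribute of the known vector closest to the query vector."""
    q = normalize(query)
    best = min(known_users,
               key=lambda item: math.dist(q, normalize(item[0])))
    return best[1]


# 100 clicked URLs already assigned to categories A, B and C.
clicked = ["A"] * 10 + ["B"] * 50 + ["C"] * 40
vector = count_vector(clicked, ["A", "B", "C"])

# Hypothetical known users: (URL counts per category, population attribute).
known = [([20, 60, 20], "group_1"), ([70, 10, 20], "group_2")]
prediction = nearest_attribute(vector, known)
```

Normalizing before matching keeps a heavy clicker and a light clicker with the same category mix close together, which is why the text offers it as an alternative to matching raw counts.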
According to the embodiment of the invention, the anchor text and surrounding text of the URL links clicked by the unknown user are used as the basis for classification. Compared with the Web pages browsed by the user, the anchor text and surrounding text of a URL link are short, concise and low in noise, so they directly reflect the purpose of the user's click and make the population attribute prediction more accurate. In addition, when predicting the population attributes of an unknown user, the URLs clicked by the user are first classified with the first classification model. Because the first classification model is trained on an Internet directory rather than on the URLs clicked by known users, it has wide coverage and a complete taxonomy, which overcomes the sparsity problem caused by using a limited set of known users as training samples.
Fig. 2 is a schematic flow chart of a first classification model establishing method according to embodiment 1 of the present invention, as shown in fig. 2, including the following steps:
S201, extracting a directory tree from a classification service website provided on the Internet as a classification, wherein the directory tree comprises different directory categories.
Specifically, because the number of known users is limited and the text features of the web page content corresponding to the URLs they browsed are clearly sparse, an Internet classification directory is used to train the first classification model. For example, a directory tree may be crawled as the taxonomy from a classification service website on the Internet, such as http://dmoz.org or http://www.chinadmoz.com.cn/. Table 1 below gives examples of the primary and secondary directory categories that such a directory tree uses to classify websites. The primary directory categories include commercial economy and life services. The commercial economy directory contains secondary categories such as agriculture, forestry and animal husbandry, energy and chemical engineering, machinery and electronics, and building environment; the life services directory contains secondary categories such as clothing, shoes and hats, catering and food, housing and home, hotels and travel, and traffic and logistics.
TABLE 1
S202, capturing webpages in different directory categories in the directory tree, and screening the directory categories.
Specifically, after the directory tree is crawled from the classification service website, the web pages under the different categories are crawled, for example with crawler technology. To provide enough category features for demographic prediction, web pages under second-level or even third-level directory categories need to be crawled. To ensure there is enough web page content for training, directory categories containing fewer web pages than a preset threshold are deleted. For example, the preset threshold may be set to 20; this processing yields roughly 1000 to 2000 directory categories, each with at least 20 web pages as training samples.
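The screening step amounts to a simple filter. The sketch below assumes the crawl result is held as a mapping from directory category to its list of pages; the function name and the data layout are hypothetical.

```python
def filter_categories(pages_by_category, min_pages=20):
    """Keeps only directory categories with at least `min_pages` crawled pages."""
    return {
        category: pages
        for category, pages in pages_by_category.items()
        if len(pages) >= min_pages
    }


# Hypothetical crawl result: category path -> list of page identifiers.
crawled = {
    "life services/catering and food": ["page"] * 35,
    "life services/hotels and travel": ["page"] * 20,
    "commercial economy/energy": ["page"] * 7,
}
kept = filter_categories(crawled, min_pages=20)
```

A category with exactly 20 pages is kept, matching the statement that each remaining category has no fewer than 20 training pages.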
S203, training the text content contained in the web pages under different catalog categories to obtain a first classification model.
Specifically, feature words are extracted from the web page content, feature vectors are constructed, and the URLs of the web pages are classified with a classification algorithm according to the feature vectors and the directory categories. That is, once the different directory categories and the web page content under each of them have been obtained, a classification model based on text features is established. This can be achieved, for example, by the following method:
Step A, extracting the text content of each web page under the different directory categories to form different documents.
Specifically, the Chinese content in each document is segmented with a word segmentation tool, and stop words among the segmented terms are removed by looking them up in a stop word list. For English content in a web page, stop words are removed by looking them up in an English stop word list, and each word is reduced to its stem with the Porter stemming algorithm. Feature words are then extracted from each segmented document with a text feature extraction algorithm, forming a new document that contains only the feature words. The feature extraction algorithm may be, for example, any of information gain, mutual information, word frequency, or chi-square. For example, if the directory categories contain 10000 web pages in total, the word segmentation and feature extraction above produce 10000 documents, one per web page.
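A reduced version of step A is sketched below for English text only. It makes several simplifying assumptions that are not in the source: the stop word list is a tiny illustrative set, stemming is omitted, and feature selection uses plain document frequency as a stand-in for the listed criteria (information gain, mutual information, word frequency, chi-square).

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real list would be far longer.
STOP_WORDS = {"the", "a", "of", "and", "in", "for", "to", "is"}


def tokenize(text):
    """Lowercases, splits on non-letters and drops stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def select_features(documents, top_n):
    """Picks the `top_n` words that occur in the most documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return [word for word, _ in doc_freq.most_common(top_n)]


def to_feature_document(text, features):
    """Rewrites a page as the sequence of its feature words only."""
    keep = set(features)
    return [w for w in tokenize(text) if w in keep]
```

Chinese pages would first need a word segmentation tool in place of the regular expression used here.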
Step B, generating feature vectors from the feature words in the documents.
Specifically, the number of feature words extracted in step A should exceed ten thousand. The feature words contained in each document form a feature vector, and the feature vectors of all documents form the feature vector set A_mn = (a_ij), where m and n are the total numbers of feature words and documents respectively, each row represents a feature word, each column represents a document, and a_ij is the weight of the i-th feature word (0 ≤ i ≤ m-1) in the j-th document (0 ≤ j ≤ n-1). The weights may be calculated with the TF-IDF algorithm. After the weights are calculated, they are normalized; for example, the normalized weight may be calculated with the following formula:
$$W(i,\bar{d}) = \frac{\left(1+\log_2 tf(i,\bar{d})\right)\times\log_2\left(\frac{n}{n(i)}\right)}{\sqrt{\sum_{i\in\bar{d}}\left[\left(1+\log_2 tf(i,\bar{d})\right)\times\log_2\left(\frac{n}{n(i)}\right)\right]^{2}}}$$
where tf(i, d̄) is the frequency of feature word i in document d̄, n denotes the total number of documents, n(i) denotes the number of documents containing feature word i, which is called the document frequency, and IDF = log2(n/n(i)) denotes the inverse document frequency. The matrix A_mn = (a_ij) considers only the information of individual feature words, treating them as mutually independent, orthogonal features, and ignores the semantic relations between them. In fact, the co-occurrence of feature words in text and their inherent semantic structure are also important information. Latent semantic indexing (LSI) is a method that explores the inherent semantic relations between feature words based on their co-occurrence information. It applies a special matrix decomposition to the feature word-document matrix and approximately maps the matrix into a K-dimensional latent semantic space, where K is the number of largest singular values retained; the singular vectors after mapping reflect the dependency between feature words and documents to the greatest extent. The latent semantic space maps co-occurring feature words onto the same dimension rather than onto different dimensions, so its dimensionality is much smaller than that of the original space, which achieves dimension reduction. After the mapping, documents that originally share few or no feature words may still have high similarity because of the co-occurrence relations among their feature words. Here, LSI is implemented by applying singular value decomposition to the matrix A_mn; this gives a good decomposition and has good extensibility.
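The weighting can be sketched as below. The sketch assumes the weight is read as (1 + log2 tf) × log2(n / n(i)), followed by division by the Euclidean length of the document's weight vector; documents are assumed to be lists of feature words already, and the function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of feature-word lists. Returns one {word: weight} per document."""
    n = len(documents)
    doc_freq = Counter()                  # n(i): documents containing word i
    for doc in documents:
        doc_freq.update(set(doc))

    weighted = []
    for doc in documents:
        tf = Counter(doc)                 # tf(i, d): occurrences of word i in d
        raw = {
            word: (1 + math.log2(count)) * math.log2(n / doc_freq[word])
            for word, count in tf.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({w: (v / norm if norm else 0.0) for w, v in raw.items()})
    return weighted
```

A word that appears in every document gets weight 0 because its inverse document frequency is log2(1) = 0, and each non-empty weight vector has unit length after normalization.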
Singular value decomposition decomposes the feature word-document matrix A_mn into the product of three matrices, A_mn = T_ma · S_aa · (D_na)^T, where m is the total number of feature words, that is, the dimension of the original feature space, n is the total number of documents, a = min(m, n), T and D are both orthogonal matrices, and S is a diagonal matrix whose diagonal values are non-negative real numbers arranged from largest to smallest. The values on the diagonal of S are the singular values of A, that is, the square roots of the eigenvalues of A^T A. Taking the first k columns of T, S and D gives an approximation of A, Â_mn = T_mk · S_kk · (D_nk)^T. From this, the dimension-reduced matrix B_kn of A is obtained, reducing the feature space from m dimensions to k dimensions: B_kn = S_kk · (D_nk)^T.
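The truncation can be illustrated with NumPy, which is an assumption of this sketch rather than a tool named in the source. `numpy.linalg.svd` returns the factors T, the singular values, and D^T directly; the function name `lsi_reduce` is hypothetical.

```python
import numpy as np


def lsi_reduce(A, k):
    """A: m x n feature word-document matrix. Returns (T_k, S_k, B) with B of shape k x n."""
    # Thin SVD: A = T @ diag(s) @ Dt, singular values sorted in descending order.
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    T_k = T[:, :k]            # first k left singular vectors (m x k)
    S_k = np.diag(s[:k])      # k largest singular values (k x k)
    B = S_k @ Dt[:k, :]       # reduced document representation (k x n)
    return T_k, S_k, B


# Toy matrix: 4 feature words, 3 documents, rank 2.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 2.0]])
T_k, S_k, B = lsi_reduce(A, k=2)
```

Each column of `B` is a document expressed in k dimensions, and `T_k @ B` gives the rank-k approximation of `A`; because the toy matrix has rank 2, the approximation is exact here.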
C. Obtain a first classification model from the feature vectors and the directory categories by applying a classification algorithm.
Specifically, the first classification model may be obtained by applying a classification algorithm to the dimension-reduced feature vectors from step B and the directory categories. Many classification algorithms can be used, such as a support vector machine classification algorithm, a logistic regression classification algorithm, a Bayesian classification algorithm, a k-nearest neighbor classification algorithm, a neural network classification algorithm, a random forest classification algorithm, or a boosted decision tree classification algorithm such as AdaBoost.
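Of the algorithms listed, the Bayesian classifier is the simplest to sketch without external libraries. The example below is a minimal multinomial naive Bayes with Laplace smoothing. As a simplification it trains directly on feature-word counts and skips the dimension-reduction step, since LSI vectors can be negative and do not fit this particular model. The category names and training samples are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (feature_words, directory_category) pairs."""
    class_docs = Counter()              # documents seen per category
    word_counts = defaultdict(Counter)  # word occurrences per category
    vocabulary = set()
    for words, category in samples:
        class_docs[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary

def classify(model, words):
    """Return the category with the highest log posterior for the given words."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_category, best_score = None, float("-inf")
    for category, doc_count in class_docs.items():
        total_words = sum(word_counts[category].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing of the word likelihood
            likelihood = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical directory categories and page feature words
training = [
    (["football", "league", "score"], "Sports"),
    (["match", "football", "coach"], "Sports"),
    (["stock", "market", "fund"], "Finance"),
    (["bank", "stock", "loan"], "Finance"),
]
model = train_naive_bayes(training)
predicted = classify(model, ["football", "score", "news"])
```

The same `classify` call can later be applied to the feature words taken from a link's anchor text and peripheral text.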
According to this embodiment of the invention, the first classification model is trained on an Internet directory rather than on the URLs clicked by known users. The resulting model has wide coverage and a complete set of categories, which avoids the sparsity problem that arises in the prior art when a limited set of known users serves as the training sample.
Fig. 3 is a schematic flow chart of a second classification model establishment method provided in embodiment 1 of the present invention, and as shown in fig. 3, the method includes the following steps:
S301, acquiring the population attribute information of known users.
Specifically, before the second classification model is trained, training data is collected. The training data consists of the demographic attribute information of known users and the historical URL browsing records of those users. The population attribute information can be obtained through offline questionnaires, online surveys, user registration data, and the like. The URL access records can be obtained by installing proxy software or browser Cookies on the user's local computer, or by intercepting user traffic at a router and parsing the protocol. For example, demographic attribute information as shown in Table 2 may be obtained and defined for classification.
TABLE 2 examples of demographic attributes and attribute classification definitions therefor
Table 2 covers age, gender, occupation, education, and so on, but is not limited to these; it may be extended to other aspects such as industry category (e.g., IT, law, agriculture, medicine, processing, business, non-profit), monthly household income band, life stage (e.g., student, job hunting, marriage, pregnancy, child-rearing, career maturity, old age), and area of residence. Table 3 shows an example of a specific demographic survey; to protect the privacy of the respondents, each person is identified by a 32-bit random ID. Then, with the respondent's permission, the URLs accessed by the user over a period of time are recorded through a software agent or a Cookie, with the user's ID and the recording start time used as the file name. The recorded result is shown in Table 4 below.
TABLE 3
"7F 64F0CAB28DDE6781F430FCCFF09F3D 2", "female", "1982", "undergraduate degree", "clerk", "2001-3000 yuan", "Tianjin", "Diwu", "suburb"
"6A 64W3CAB24GD96A81F530FCH6FY7E4 BE", "female", "1970", "undergraduate degree", "manager", "8001-13000 yuan", "Anhui", "Fuyang", "city"
"CE 6ABB45B97FVE6781F430F9D3ED46E 5B", "male", "1974", "senior high school/technical secondary school/technical school", "office clerk", "3001-5000 yuan", "Henan", "Luoyang", "city"
TABLE 4
P<=>360chrome.exe[=]I<=>3492[=]U<=>www.cqedu.cn/site/html/cqjwportal/portal/index/index.htm
P<=>360chrome.exe[=]I<=>3492[=]U<=>so.5ipatent.com/SearchResult.aspx
P<=>360chrome.exe[=]I<=>3492[=]U<=>http://vip.163.com/?b08abh1
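The records in Table 4 appear to be key/value pairs, with `[=]` separating fields and `<=>` separating each key from its value. A minimal parsing sketch under that assumption follows. The source does not define the keys `P`, `I`, and `U`, so the function (a hypothetical helper named `parse_record`) returns them unchanged rather than interpreting them.

```python
def parse_record(line):
    """Split one 'key<=>value[=]key<=>value' record into a dict."""
    fields = {}
    for part in line.strip().split("[=]"):
        key, separator, value = part.partition("<=>")
        if separator:  # skip fragments that carry no key/value separator
            fields[key] = value
    return fields

record = parse_record(
    "P<=>360chrome.exe[=]I<=>3492[=]U<=>so.5ipatent.com/SearchResult.aspx"
)
```

Because `partition` splits only on the first `<=>`, a URL value that itself contained that sequence would be kept whole.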
S302, acquiring the anchor text and peripheral text of the URLs clicked by the known users.
S303, classifying the URLs into different directory categories according to the anchor text, the peripheral text, and a pre-established first classification model.
S304, generating feature vectors from the number of URLs under each directory category to which the URLs clicked by the known users belong, and training with a classification algorithm to obtain the correspondence between the feature vectors and the population attributes.
Specifically, the classification algorithm used here is any one of the following: a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm, and a Bayesian classification algorithm. For a detailed description of S302 to S304, refer to the related description of the embodiments shown in fig. 1 and fig. 2.
Fig. 4 is a schematic structural diagram of a population attribute classification device based on anchor characters in a URL and peripheral text according to embodiment 2 of the present invention, as shown in fig. 4, including: an acquisition module 11, a catalog classification module 12 and a population attribute forecasting module 13. Wherein,
the acquisition module 11 is configured to acquire anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
specifically, when population attribute classification forecasting is performed for an unknown user, the acquisition module 11 first acquires the anchor text and peripheral text of the URLs clicked by that user within a preset time period. The preset time period may be set according to actual conditions, for example one week, several weeks, one month, or several months, and is not specifically limited here. The anchor text is the text of a URL link, that is, the clickable text of a link in a web page, and it reflects the topics of both the current page and the page pointed to. The peripheral text is the descriptive text around the URL link; it lies in the same text paragraph of the same web page as the anchor text and helps explain the URL. The anchor text and peripheral text of a URL link are also collectively called the context information of the link.
The catalog classification module 12 is used for classifying the URLs into different catalog categories according to anchor characters, peripheral texts and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training;
specifically, the anchor text and peripheral text correspond to the URL and are tied to the content of the linked web page, so they directly reflect the user's click intention. The catalog classification module 12 extracts the feature words in the anchor text and peripheral text to form feature vectors and, using a matching algorithm and the pre-established first classification model, classifies the corresponding URLs into the different directory categories of the model.
And the population attribute forecasting module 13 is configured to perform population attribute classification forecasting on the unknown user according to category feature information under different category categories and a pre-established second classification model, where the second classification model is obtained by performing classification training according to category feature information and population attributes under a category to which the URL clicked by the known user belongs.
Specifically, the second classification model is obtained by classification training on the category feature information and population attributes under the directory categories to which the URLs clicked by known users belong, so the model already holds the correspondence between the category feature information under each directory category and the population attributes. Demographic attributes include, but are not limited to, the following: age, gender, occupation, education level, industry (e.g., IT, law, agriculture, medicine, processing, business, non-profit), household income level, life stage (e.g., student, job hunting, marriage, pregnancy, child-rearing, career maturity, old age), and area of residence. After the catalog classification module 12 classifies the URLs of an unknown user into the different directory categories of the first classification model, the number of URLs under each directory category, that is, the category feature information, is counted. A feature vector to be classified is generated from these counts and input into the population attribute forecasting module 13, which performs population attribute classification forecasting on the unknown user according to the feature vector to be classified and the pre-established second classification model.
The second classification model holds the correspondence between the feature vectors of known users and their population attributes. It therefore suffices to input the feature vector to be classified for the unknown user into the second classification model: the model is searched for the feature vector that best matches it, and the population attribute associated with that best-matching vector is taken as the population attribute of the unknown user.
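The counting and matching steps can be sketched as follows. The source does not say how the best match is measured, so cosine similarity is used here as an assumption. The category list, the age-group labels, and the click data are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical directory categories produced by the first classification model
CATEGORIES = ["Sports", "Finance", "Games", "Parenting"]

def count_vector(url_categories):
    """Turn the categories of a user's clicked URLs into a per-category count vector."""
    counts = Counter(url_categories)
    return [counts[category] for category in CATEGORIES]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(known_users, vector):
    """known_users: list of (feature_vector, demographic_label) pairs."""
    _, best_label = max(known_users, key=lambda item: cosine(item[0], vector))
    return best_label

# Toy training data: each known user has a count vector and a known attribute
known_users = [
    (count_vector(["Sports"] * 6 + ["Games"] * 3 + ["Finance"]), "age 18-24"),
    (count_vector(["Parenting"] * 5 + ["Finance"] * 2 + ["Sports"]), "age 25-34"),
]
unknown = count_vector(["Games", "Sports", "Sports", "Finance"])
label = predict(known_users, unknown)
```

A trained classifier such as logistic regression or a support vector machine would replace `predict` in practice; the nearest-match lookup is only the most direct reading of the paragraph above.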
The apparatus of this embodiment is used to perform the steps of the population attribute classification method based on anchor text and peripheral text in URLs shown in fig. 1. Its technical principle and technical effect are similar; for details, refer to the related description of the embodiment shown in fig. 1.
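The acquisition step performed by the acquisition module can be illustrated with Python's standard `html.parser`. This is a minimal sketch, assuming that the "same text paragraph" corresponds to an HTML `<p>` element and that the peripheral text is the whole paragraph including the anchor text. The class name `LinkContextParser` and the sample URL are hypothetical.

```python
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    """Collect (url, anchor_text, paragraph_text) triples for links inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.links = []         # finished (url, anchor, paragraph) triples
        self._paragraph = []    # text chunks of the current paragraph
        self._pending = []      # [url, anchor chunks] for links in this paragraph
        self._open_link = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open_link = [href, []]
                self._pending.append(self._open_link)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_link = None
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        self._paragraph.append(data)
        if self._open_link is not None:
            self._open_link[1].append(data)

    def _flush(self):
        # Emit one triple per link seen in the paragraph, then reset the buffers
        paragraph = " ".join("".join(self._paragraph).split())
        for href, chunks in self._pending:
            anchor = " ".join("".join(chunks).split())
            self.links.append((href, anchor, paragraph))
        self._paragraph, self._pending, self._open_link = [], [], None

    def close(self):
        super().close()
        self._flush()

html = '<p>Latest <a href="http://example.com/nba">NBA playoff scores</a> and match reports.</p>'
parser = LinkContextParser()
parser.feed(html)
parser.close()
```

Real pages often place links in list items or table cells rather than paragraphs, so a production version would need a broader notion of the surrounding block.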
Fig. 5 is a schematic structural diagram of a demographic property classification apparatus based on anchor characters in a URL and surrounding text according to embodiment 3 of the present invention, as shown in fig. 5, including: an acquisition module 21, a catalog classification module 22 and a population attribute forecasting module 23. Wherein,
the acquisition module 21 is configured to acquire anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
the catalog classification module 22 is configured to classify the URLs into different catalog categories according to the anchor characters, the surrounding texts and a pre-established first classification model, where the first classification model is obtained by performing classification training by using an internet classification catalog;
and the population attribute forecasting module 23 is configured to perform population attribute classification forecasting on the unknown user according to category feature information under different category categories and a pre-established second classification model, where the second classification model is obtained by performing classification training according to category feature information and population attributes under a category to which the URL clicked by the known user belongs.
Further, the above apparatus further includes:
the first classification model establishing module 24 is specifically configured to capture a directory tree as a classification from a classification service website provided on the internet, where the directory tree includes different directory categories, and train text contents included in webpages under different directory categories to obtain a first classification model.
Further, the first classification model building module is specifically configured to extract feature words in the web page content, construct feature vectors, and classify the URLs of the web pages by using a classification algorithm according to the feature vectors and the category of the catalog.
The apparatus of this embodiment is used to perform the steps of the population attribute classification method based on anchor text and peripheral text in URLs shown in fig. 1 and fig. 2. Its technical principle and technical effect are similar; for details, refer to the related description of the embodiments shown in fig. 1 and fig. 2.
Fig. 6 is a schematic structural diagram of a population attribute classification device based on anchor characters in a URL and peripheral text according to embodiment 4 of the present invention, as shown in fig. 6, including: an acquisition module 31, a catalog classification module 32 and a demographic property forecasting module 33. Wherein,
the obtaining module 31 is configured to obtain anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
the catalog classification module 32 is configured to classify the URLs into different catalog categories according to the anchor characters, the surrounding texts and a pre-established first classification model, where the first classification model is obtained by performing classification training by using an internet classification catalog;
and the population attribute forecasting module 33 is configured to perform population attribute classification forecasting on the unknown user according to the category characteristic information under different category categories and a pre-established second classification model, where the second classification model is obtained by performing classification training according to the category characteristic information and the population attribute under the category to which the URL clicked by the known user belongs.
Further, the category characteristic information includes the number of URLs;
the above apparatus further includes:
the second classification model establishing module 34 is configured to generate feature vectors from the number of URLs in the category to which the URLs clicked by the known user belong, and train the feature vectors and the population attributes by using a classification algorithm to obtain a corresponding relationship between the feature vectors and the population attributes;
the population attribute forecasting module 33 is specifically configured to generate feature vectors to be classified according to the number of URLs in the different directory categories; determining a feature vector which is most matched with the feature vector to be classified in the second classification model; and determining population attributes corresponding to the feature vectors to be classified according to the most matched feature vectors.
Further, the classification algorithm is any one of the following:
a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm and a Bayesian classification algorithm.
Further, the above apparatus further includes:
the first classification model establishing module 35 is specifically configured to capture a directory tree as a classification from a classification service website provided on the internet, where the directory tree includes different directory categories, and train text contents included in webpages under different directory categories to obtain a first classification model.
Further, the first classification model building module 35 is specifically configured to extract feature words in the web page content, construct feature vectors, and classify the URLs of the web pages by using a classification algorithm according to the feature vectors and the category of the directory.
The apparatus of this embodiment is used to perform the steps of the population attribute classification method based on anchor text and peripheral text in URLs shown in fig. 1, fig. 2 and fig. 3. Its technical principle and technical effect are similar; for details, refer to the related description of the embodiments shown in fig. 1, fig. 2 and fig. 3.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A population attribute classification method based on anchor characters and peripheral texts in a URL is characterized by comprising the following steps:
acquiring anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
classifying the URLs into different catalog categories according to the anchor characters, the peripheral texts and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training;
and performing population attribute classification forecast on the unknown user according to the category characteristic information under different category of the directory and a pre-established second classification model, wherein the second classification model is obtained by performing classification training according to the category characteristic information and the population attribute under the category of the directory to which the URL clicked by the known user belongs.
2. The method according to claim 1, wherein the category characteristic information includes a number of URLs;
the second classification model is obtained by performing classification training according to the category characteristic information and the population attributes under the category of the directory to which the URL clicked by the known user belongs, and comprises the following steps:
generating feature vectors according to the number of URLs under the category to which the URLs clicked by the known user belong, and training by utilizing a classification algorithm to obtain a corresponding relation between the feature vectors and population attributes;
the method for conducting population attribute classification forecasting on the unknown user according to the category characteristic information under different category categories and a pre-established second classification model comprises the following steps:
generating feature vectors to be classified according to the URL quantity under different directory categories; determining a feature vector which is most matched with the feature vector to be classified in the second classification model; and determining population attributes corresponding to the feature vectors to be classified according to the most matched feature vectors.
3. The method of claim 2, wherein the classification algorithm is any one of:
a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm and a Bayesian classification algorithm.
4. The method according to any one of claims 1 to 3, wherein the first classification model is obtained by performing classification training by using an internet classification catalogue, and comprises the following steps:
capturing a directory tree as a classification from a classification service website provided on the Internet, wherein the directory tree comprises different directory categories;
and training the text contents contained in the webpages under different catalog categories to obtain a first classification model.
5. The method of claim 3, wherein training the text content contained in the web pages under different category of contents comprises:
extracting feature words in the webpage content and constructing feature vectors;
and classifying the URLs of the webpages by adopting a classification algorithm according to the feature vectors and the category of the catalogue.
6. A population attribute classification device based on anchor characters and peripheral texts in a URL (uniform resource locator), comprising:
the acquisition module is used for acquiring anchor characters and peripheral texts in a URL clicked within a preset time period by an unknown user;
the catalog classification module is used for classifying the URL into different catalog categories according to the anchor characters, the peripheral text and a pre-established first classification model, wherein the first classification model is obtained by utilizing an internet classification catalog to perform classification training;
and the population attribute forecasting module is used for carrying out population attribute classification forecasting on the unknown user according to the category characteristic information under different directory categories and a pre-established second classification model, and the second classification model is obtained by carrying out classification training according to the category characteristic information and the population attributes under the directory category to which the URL clicked by the known user belongs.
7. The apparatus of claim 6, wherein the category characteristic information comprises a number of URLs;
the apparatus further comprises:
the second classification model establishing module is used for generating a feature vector according to the number of URLs under the category to which the URL clicked by the known user belongs, and training by using a classification algorithm to obtain a corresponding relation between the feature vector and population attributes;
the population attribute forecasting module is specifically used for generating feature vectors to be classified according to the URL quantity under different directory categories; determining a feature vector which is most matched with the feature vector to be classified in the second classification model; and determining population attributes corresponding to the feature vectors to be classified according to the most matched feature vectors.
8. The apparatus of claim 7, wherein the classification algorithm is any one of:
a logistic regression classification algorithm, a support vector machine classification algorithm, a decision tree classification algorithm and a Bayesian classification algorithm.
9. The apparatus of any one of claims 6 to 8, further comprising:
the first classification model establishing module is specifically used for capturing a directory tree as a classification from a classification service website provided on the internet, wherein the directory tree comprises different directory categories, and training text contents contained in webpages under the different directory categories to obtain a first classification model.
10. The apparatus according to claim 8, wherein the first classification model building module is specifically configured to extract feature words in the web page content, construct feature vectors, and classify the URLs of the web pages by using a classification algorithm according to the feature vectors and category categories.
CN201410658093.7A 2014-11-18 2014-11-18 Population property classification method and device based on anchor texts and peripheral texts in URLs Pending CN104462241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410658093.7A CN104462241A (en) 2014-11-18 2014-11-18 Population property classification method and device based on anchor texts and peripheral texts in URLs

Publications (1)

Publication Number Publication Date
CN104462241A true CN104462241A (en) 2015-03-25

Family

ID=52908277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410658093.7A Pending CN104462241A (en) 2014-11-18 2014-11-18 Population property classification method and device based on anchor texts and peripheral texts in URLs

Country Status (1)

Country Link
CN (1) CN104462241A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051313A (en) * 2007-05-09 2007-10-10 崔志明 Integrated data source finding method for deep layer net page data source
US20090083244A1 (en) * 2007-09-25 2009-03-26 Nec (China) Co., Ltd. Method and system for subject relevant web page filtering based on navigation paths information
CN102867265A (en) * 2011-07-08 2013-01-09 北京亿赞普网络技术有限公司 Online advertising weight calculation system and calculation method
CN103186574A (en) * 2011-12-29 2013-07-03 北京百度网讯科技有限公司 Method and device for generating searching result
CN103514266A (en) * 2013-09-04 2014-01-15 快传(上海)广告有限公司 Method and system for issuing network information to mobile terminal
CN103778555A (en) * 2014-01-21 2014-05-07 北京集奥聚合科技有限公司 User attribute mining method and system based on user tags

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991899A (en) * 2015-06-02 2015-10-21 广州酷狗计算机科技有限公司 Identification method and apparatus of user property
CN104991899B (en) * 2015-06-02 2018-06-19 广州酷狗计算机科技有限公司 The recognition methods of user property and device
CN105956004A (en) * 2016-04-20 2016-09-21 广州精点计算机科技有限公司 Method and device for analyzing mobile user internet behavior based on URL analysis model
CN106126681A (en) * 2016-06-29 2016-11-16 泰华智慧产业集团股份有限公司 A kind of increment type stream data clustering method and system
CN106126681B (en) * 2016-06-29 2019-10-15 泰华智慧产业集团股份有限公司 A kind of increment type stream data clustering method and system
CN108280104A (en) * 2017-02-13 2018-07-13 腾讯科技(深圳)有限公司 The characteristics information extraction method and device of target object
CN108280104B (en) * 2017-02-13 2020-06-02 腾讯科技(深圳)有限公司 Method and device for extracting characteristic information of target object
US11436430B2 (en) 2017-02-13 2022-09-06 Tencent Technology (Shenzhen) Company Limited Feature information extraction method, apparatus, server cluster, and storage medium
CN115658607A (en) * 2022-10-25 2023-01-31 上海数慧系统技术有限公司 Classification method, device, system, equipment and storage medium of a data directory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150325
