
CN110704624B - Geographic information service metadata text multi-level multi-label classification method - Google Patents


Info

Publication number
CN110704624B
CN110704624B
Authority
CN
China
Prior art keywords
text
classification
topic
samples
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910942287.2A
Other languages
Chinese (zh)
Other versions
CN110704624A (en)
Inventor
桂志鹏
张敏
彭德华
吴华意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910942287.2A priority Critical patent/CN110704624B/en
Publication of CN110704624A publication Critical patent/CN110704624A/en
Application granted granted Critical
Publication of CN110704624B publication Critical patent/CN110704624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/387 Retrieval using geographical or spatial information, e.g. location
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques with fixed number of clusters, e.g. K-means clustering
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


Figure 201910942287

The invention discloses a multi-level, multi-label classification method for geographic information service metadata text, comprising: 1) acquiring a geographic information service metadata text set, performing text preprocessing, and dividing each data sample into combinations of text feature words; 2) setting a first-level classification directory and generating a typical word list semantically associated with the classification categories; 3) screening the text feature words against the typical word list; 4) selecting ML-KNN as one base model for co-training; 5) establishing the topic prediction model ML-CSW as the other base model for co-training; 6) designing a cooperation mechanism that matches multi-label topics to the metadata texts as the first-level coarse-grained topic classification result; 7) selecting the metadata texts corresponding to a given classification label to obtain fine-grained topic category directories at different levels. The method takes into account the domain characteristics and text semantics of geographic information service metadata, relies on only a small number of labeled data samples, and its classification results outperform traditional multi-label classification methods overall.


Description

Geographic information service metadata text multi-level multi-label classification method
Technical Field
The invention relates to natural language processing technology, and in particular to a multi-level, multi-label classification method for geographic information service metadata texts.
Background
Accurate text classification is an important means of data analysis, is key to improving the retrieval quality of geographic information resources, and has wide application scenarios. Traditional classification methods are mostly suited to binary or single-label scenarios, and their heavy dependence on large numbers of labeled samples for training limits both the accuracy and comprehensiveness of text classification and the applicability of the models. Geographic information service metadata in particular usually lacks a topic-labeled sample dataset; its text content is heterogeneous, and the feature vocabulary is complicated by the mixture of geoscience terms and general-knowledge words. Moreover, overlap and membership relations between topics give metadata text topics multi-granularity, multi-category characteristics, further increasing the difficulty of topic classification. To address the lack of training samples and the need for multi-category matching, some researchers have proposed semi-supervised and weakly supervised mechanisms to reduce a classifier's dependence on training samples, and have realized multi-label text classification with methods such as ML-KNN, BR-KNN and TSVM. However, these methods usually do not incorporate domain features or consider the semantics of the technical terms in the text, and thus cannot effectively fit the textual characteristics of geographic information service metadata.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for classifying geographic information service metadata texts in a multi-level and multi-label manner aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a geographic information service metadata text multi-level multi-label classification method comprises the following steps:
1) acquiring a geographic information service metadata text set containing unmarked samples and marked samples to perform text preprocessing, and dividing each data sample into text feature word combinations;
2) defining a primary classification catalogue based on the domain application theme category of the geographic information resource, and generating a typical word list which is closely associated with the semantics of the classification category (hereinafter referred to as theme);
3) screening text characteristic words according to the typical word list, filtering out the characteristics of which the distance from the typical words is greater than a threshold value, and obtaining a characteristic subset screened according to the theme classification;
4) selecting the classic multi-label classification algorithm ML-KNN (Multi-Label K-Nearest Neighbors) as one base model for co-training, denoted H1;
5) calculating the semantic distance from features to topics according to the corpus, and establishing the topic prediction model ML-CSW (Multi-label Classification based on SWEET & WordNet) as the other base model for co-training, denoted H2;
6) Designing a cooperative mechanism based on the two basic models, and matching a multi-label theme for the metadata text to serve as a primary coarse-grained theme classification result;
7) selecting a metadata text corresponding to a certain classification label according to a primary coarse-grained theme classification result, extracting a text theme to serve as a fine-grained theme of a next level, and simultaneously obtaining a matching relation between the metadata text and a double-layer theme catalog;
8) repeating step 7) to obtain fine-grained topic category directories at different levels and the matching relation between the metadata texts and the topic directories.
According to the scheme, defining the first-level classification directory based on the domain application topic categories of geographic information resources in step 2) means obtaining the first-level classification by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations for the geoscience field.
According to the scheme, the typical word list in step 2) is generated as follows:
taking the SBAs as the topic classification directory, extracting the hypernyms, hyponyms and synonyms of each topic in the SWEET and WordNet definitions as typical words semantically related to the topic, and generating the typical word list.
According to the scheme, the text characteristic words are screened according to the typical word list in the step 3), which specifically comprises the following steps:
S31, representing the typical words and the text feature words as two-dimensional space word vectors based on the Word2vec algorithm;
S32, calculating the cosine distance between the typical word and text feature word vectors;
S33, setting a distance threshold T and filtering out text feature words whose cosine distance to the typical words is greater than T.
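As an illustration of steps S31-S33, the sketch below filters feature words by cosine distance to the typical words. The two-dimensional vectors are toy stand-ins for Word2vec output; the words and the threshold are invented for the example.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity between two word vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def filter_features(feature_vecs, typical_vecs, threshold):
    """Keep a feature word only if it lies within `threshold` cosine
    distance of at least one typical word (steps S31-S33)."""
    kept = []
    for word, vec in feature_vecs.items():
        d_min = min(cosine_distance(vec, t) for t in typical_vecs.values())
        if d_min <= threshold:
            kept.append(word)
    return kept

# Toy 2-D vectors standing in for Word2vec embeddings (illustrative only).
typical = {"agriculture": np.array([1.0, 0.1])}
features = {"crop": np.array([0.9, 0.2]), "login": np.array([-0.2, 1.0])}
print(filter_features(features, typical, threshold=0.3))  # → ['crop']
```

The surviving subset is what feeds the two base classifiers in the later steps.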
According to the scheme, the topic prediction model in step 5) is established as follows:
according to the network definitions of the SWEET ontology library and the WordNet English lexical database, calculating the semantic distance d_pi between a text feature f and each topic pi;
taking the minimum of the distances d_pi from feature f to the topics as the maximum semantic relevance s_f of the text feature f to the set P of all topics:
s_f = min_{pi ∈ P} d_pi;
defining feature weights based on this shortest distance between text features and topics, establishing the topic prediction model, and predicting multi-label topics for the unlabeled samples;
assuming the training set contains n text features in total, the vector of maximum semantic relevance from all features to all topics, S = [s1, s2, …, sn], can be calculated; the weight w(x) of a single data sample x is defined as a 1×n vector whose entries correspond to the n text features, the entry for feature f being s_f if f appears in sample x and 0 otherwise;
establishing the topic prediction model
Y = w(x)·F + α,
where F is the adjustment vector of the features and α is a smoothing parameter; based on the labeled sample data, iteratively optimizing the model Y with a BP neural network, computing the optimal F and α under minimum loss to obtain the final model, and predicting the category set of an unlabeled sample t with it.
According to the scheme, step 6) designs a cooperation mechanism and matches multi-label topics for the metadata texts as the first-level coarse-grained topic classification result, specifically:
S61, generating two subsets L1 and L2 from the labeled samples in the geographic information service metadata text set, to serve as the training sets of the co-training base models H1 and H2 respectively;
S62, training the base models H1 and H2 with their training sets, and predicting the category vectors of the unlabeled samples with the trained base models;
S63, selecting from the unlabeled samples those for which classifiers H1 and H2 give the same prediction, assigning them pseudo-labels, adding the pseudo-labeled samples to the two training subsets L1 and L2 respectively to update the training sets, and repeating steps S62-S63 until the classification results of the two classifiers no longer change significantly, thereby obtaining the category sets of all unlabeled samples and the final updated training set;
S64, training classifier H1 on all labeled samples and matching a set of topic categories for the test samples.
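The cooperation mechanism of S61-S64 can be sketched as the generic loop below. The toy NearestLabel classifier and the scalar samples are invented stand-ins for the two base models; only the agree-then-pseudo-label loop structure follows the text, and it is a sketch rather than the patented implementation.

```python
# Toy stand-in for a base model: 1-NN on scalar features.
class NearestLabel:
    def fit(self, labeled):
        self.data = list(labeled)           # list of (feature, label) pairs
    def predict(self, x):
        return min(self.data, key=lambda p: abs(p[0] - x))[1]

def co_train(model_a, model_b, labeled_a, labeled_b, unlabeled, max_rounds=10):
    """Sketch of S61-S64: each round both models train on their own subsets,
    predict the unlabeled pool, and agreed-on samples are pseudo-labeled and
    added to both training subsets, until no new agreement appears."""
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model_a.fit(labeled_a)
        model_b.fit(labeled_b)
        agreed = [(x, model_a.predict(x)) for x in pool
                  if model_a.predict(x) == model_b.predict(x)]
        if not agreed:
            break
        for x, y in agreed:
            labeled_a.append((x, y))
            labeled_b.append((x, y))
            pool.remove(x)
    return labeled_a, labeled_b, pool

seed = [(0.0, "water"), (10.0, "energy")]   # two labeled seeds per subset
a, b, pool = co_train(NearestLabel(), NearestLabel(),
                      list(seed), list(seed), [1.0, 9.0])
print(pool)  # empty once every unlabeled sample has been pseudo-labeled
```

In the method itself the two models are heterogeneous (ML-KNN and ML-CSW), which is what makes their agreement informative.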
According to the scheme, the classic multi-label classification algorithm ML-KNN is selected as a base model for co-training in step 4), specifically:
S41, selecting the ML-KNN algorithm as co-training base model H1. Specify the number k of neighbor samples and let N(x) denote the set of k nearest neighbor samples of sample x in the training set. Let y_x(l) = 1 when sample x belongs to topic category l and y_x(l) = 0 otherwise. The number of samples in N(x) belonging to topic category l is then
C_x(l) = Σ_{a ∈ N(x)} y_a(l),
from which the frequency arrays are built: c[j] counts the training samples that belong to category l and have exactly j neighbors in category l, and c'[j] counts those with exactly j such neighbors that do not belong to category l, for j = 0, 1, …, k.
S42, calculating the prior probability that an unlabeled sample t belongs to topic category l and the posterior probabilities, where H_1^l denotes the event that sample t belongs to topic category l, H_0^l the event that it does not, E_j^l the event that exactly j of the k neighbor samples of t belong to category l, s is a smoothing parameter and m is the number of training samples:
P(H_1^l) = (s + Σ_{i=1}^{m} y_{xi}(l)) / (2s + m),  P(H_0^l) = 1 − P(H_1^l),
P(E_j^l | H_1^l) = (s + c[j]) / (s(k+1) + Σ_{p=0}^{k} c[p]),
P(E_j^l | H_0^l) = (s + c'[j]) / (s(k+1) + Σ_{p=0}^{k} c'[p]).
S43, predicting the category set of the unlabeled sample t according to the maximum a posteriori probability and the Bayes principle:
y_t(l) = arg max_{b ∈ {0,1}} P(H_b^l) · P(E_{C_t(l)}^l | H_b^l).
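A minimal single-category reading of the ML-KNN computation in S41-S43 might look like the sketch below, assuming Euclidean neighbors, leave-one-out counting over the training set, and the smoothed prior/posterior estimates described above. The data are invented and this is an illustrative reconstruction, not the patented implementation.

```python
import numpy as np

def ml_knn_label(train_X, train_Y, t, k=3, s=1.0):
    """Decide whether unlabeled sample t carries topic l.
    train_X: (m, d) feature matrix; train_Y: (m,) 0/1 indicators for topic l."""
    m = len(train_X)
    # Prior: P(H1) = (s + sum of indicators) / (2s + m)
    p_h1 = (s + train_Y.sum()) / (2 * s + m)
    p_h0 = 1.0 - p_h1

    def neighbors(x, exclude=None):
        d = np.linalg.norm(train_X - x, axis=1)
        if exclude is not None:
            d[exclude] = np.inf       # leave-one-out for training samples
        return np.argsort(d)[:k]

    # Frequency arrays c[j], c'[j] over the training set.
    c = np.zeros(k + 1)
    c_prime = np.zeros(k + 1)
    for i in range(m):
        j = int(train_Y[neighbors(train_X[i], exclude=i)].sum())
        if train_Y[i] == 1:
            c[j] += 1
        else:
            c_prime[j] += 1

    # Membership count C_t(l) for the query sample.
    C_t = int(train_Y[neighbors(np.asarray(t))].sum())

    # Smoothed posteriors P(E_{C_t} | H_b).
    p_e_h1 = (s + c[C_t]) / (s * (k + 1) + c.sum())
    p_e_h0 = (s + c_prime[C_t]) / (s * (k + 1) + c_prime.sum())

    # Maximum a posteriori decision.
    return 1 if p_h1 * p_e_h1 >= p_h0 * p_e_h0 else 0

# Two well-separated clusters; label 1 plays the role of one topic l.
train_X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
train_Y = np.array([1, 1, 1, 0, 0, 0])
print(ml_knn_label(train_X, train_Y, [0.05], k=2))  # → 1
```

In the full algorithm this decision is made independently per topic label, yielding the multi-label category set of sample t.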
According to the scheme, the text topics in step 7) are extracted based on the Latent Dirichlet Allocation (LDA) algorithm.
The invention has the following beneficial effects: the invention provides a novel multi-level, multi-label classification process for OGC Web Map Service (WMS) and other geographic information network resource metadata texts. The process introduces the geoscience ontology library SWEET and the general English lexical database WordNet into the classification workflow, and co-trains the traditional classification algorithm ML-KNN with the classification algorithm ML-CSW, which closely fits domain characteristics and text semantics, to obtain the matching relation between geographic information service metadata texts and a multi-level topic directory. By considering the domain characteristics and text semantics of geographic information service metadata, the method relies on only a small number of labeled data samples; at the same time, compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, its classification results perform better overall.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of an embodiment of the invention;
FIG. 3 is a diagram of an exemplary word for an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of calculating a shortest distance between a text feature and a topic in the ML-CSW algorithm according to an embodiment of the present invention;
FIG. 5 is a classification result of an exemplary text of an embodiment of the present invention;
FIG. 6 is a comparison of classification results of different classification algorithms according to an embodiment of the present invention;
FIG. 7 is a comparison of classification results based on different feature selection algorithms according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
There are 46,000 Web Map Service (WMS) text data records, 400 of which are labeled with SBAs topics, with the topics uniformly distributed. The text content comes from the URL, Abstract, Keywords and Title fields of the Service tag in the WMS GetCapabilities document. Because the text content is heterogeneous and mashed up, passages vary in length, a single record corresponds to multiple topic categories, and the amount of topic-labeled sample data is small, traditional multi-label classification algorithms struggle to classify the data accurately and comprehensively and cannot produce multi-level topic matching results.
The invention builds on the theoretical basis of co-training in semi-supervised learning, and introduces a geoscience ontology library and the general English lexical database WordNet to design a base classification model fitted to the characteristics of the geoscience domain. During classification it performs co-training with a widely applied classic multi-label classification model, and extracts multi-level fine-grained topics so as to match multi-level, multi-label topics to the WMS metadata texts.
The algorithm process of the present invention will be described in detail below with reference to the accompanying drawings, in which:
as shown in fig. 1 and 2, a method for multi-level and multi-label classification of meta-data text of geographic information service includes the following steps:
1) performing text preprocessing on all WMS metadata, including three steps: word segmentation, stop-word removal and lemmatization, and segmenting each text into combinations of text feature words;
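A minimal sketch of the three preprocessing steps; the stop-word list is abbreviated and a crude suffix rule stands in for a real lemmatizer (e.g. a WordNet-based one) purely for illustration.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "is", "this"}

def preprocess(text):
    """Tokenize, drop stop words, and reduce word forms (sketch only)."""
    tokens = re.findall(r"[a-z]+", text.lower())          # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    # Crude plural-stripping rule standing in for true lemmatization.
    lemmas = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return lemmas

print(preprocess("Maps of the rivers and lakes in this region"))
# → ['map', 'river', 'lake', 'region']
```

Each resulting token list is the "text feature word combination" that feeds the later screening and classification steps.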
2) The first-level classification is obtained by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience field. The SBAs comprise 9 major topics of interest: Agriculture, Biodiversity, Climate, Disaster, Ecosystem, Energy, Health, Water, and Weather. The topic classification catalog of this embodiment extends the SBAs by adding Geology as the 10th topic; all topic classification catalogs and first-level topic catalogs referred to in this embodiment therefore denote these 10 topics.
Using SBAs as a topic classification directory, extracting hypernyms, hyponyms and synonyms of topics in the SWEET and WordNet definitions as typical words related to topic semantics, and generating a typical word list, wherein a diagram in FIG. 3(a) is a typical word example corresponding to a topic "Agriculture" extracted from the SWEET, a diagram in FIG. 3(b) is a typical word example corresponding to a topic "Agriculture" extracted from the WordNet, and different colors represent different semantic sets;
3) the CBOW model based on the Word2vec algorithm represents the typical words and the text characteristic words as two-dimensional space Word vectors, and calculates cosine distances between the typical words and the text characteristic Word vectors;
4) setting a distance threshold, screening text feature words based on the distance threshold, and filtering features with the distance from the typical words larger than the threshold, thereby obtaining a feature subset with larger contribution to topic classification as model input of a classification algorithm;
5) designing the multi-label classification algorithm ML-CSW, which fits the WMS domain characteristics and considers text semantics, as co-training base model H1, and training a topic prediction model using the semantic relevance between text features and topics calculated from the corpus as feature weights:
5.1) calculating the semantic shortest distance between text features and topics, based primarily on the network definition of SWEET and secondarily on WordNet;
if the text feature word is recorded in SWEET, the shortest distance between the feature word and the topic is computed from the SWEET network definition; as shown in fig. 4(a), the distance between the feature "Glacier" and the topic "Water" is 3;
if the text feature is not included in SWEET, hypernyms are searched upward layer by layer in WordNet as substitute words of the text feature until a substitute word included in SWEET is found, and the shortest distance D1 from the feature to the substitute word in the WordNet definition is calculated; as shown in fig. 4(b), the substitute word of the feature "new" (snow) is "Ice", with shortest distance 1. The shortest distance D2 between the substitute word and the topic is then calculated with the Dijkstra algorithm over the SWEET network definition; in fig. 4(b) the shortest distance from the substitute word "Ice" to the topic "Water" is 2. The final distance between the text feature and the topic is the sum of the feature-to-substitute and substitute-to-topic distances, i.e. D = D1 + D2; the shortest distance from the feature "new" to the topic "Water" in fig. 4(b) is thus 3.
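The substitute-word-to-topic distance D2 is found with Dijkstra's algorithm over the SWEET network. A generic sketch on a toy unit-weight graph loosely echoing fig. 4(b) — the intermediate node "Cryosphere" is invented for the example, and the real SWEET graph is much larger:

```python
import heapq

def shortest_distance(graph, source, target):
    """Dijkstra's algorithm over a weighted term graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for nb, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return float("inf")

graph = {
    "Ice":        [("Cryosphere", 1)],
    "Cryosphere": [("Ice", 1), ("Water", 1)],
    "Water":      [("Cryosphere", 1)],
}
d2 = shortest_distance(graph, "Ice", "Water")   # substitute word -> topic
d1 = 1   # WordNet hops from the feature to its substitute word (fig. 4(b))
print(d1 + d2)  # → 3, the total feature-to-topic distance D = D1 + D2
```
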
5.2) defining feature weights based on the shortest distance between text features and topics, establishing the topic prediction model, and predicting multi-label topics for the unlabeled samples:
a) from step 5.1), the semantic distance d_pi between a text feature f and each topic pi can be calculated; the shortest of these distances is taken as the maximum semantic relevance s_f of the text feature f to the set P of all topics:
s_f = min_{pi ∈ P} d_pi;
b) if all texts contain n text features in total, the maximum semantic relevance vector from all features to all topics in the training set, S = [s1, s2, …, sn], can be calculated; the weight w(x) of a single data sample x is defined as a 1×n vector whose entries correspond to the n text features, the entry for feature f being s_f if f appears in sample x and 0 otherwise;
c) establishing the topic prediction model
Y = w(x)·F + α,
where F is the adjustment vector of the features and α is a smoothing parameter; based on the labeled sample data, iteratively optimizing the topic prediction model with a BP neural network, computing the optimal F and α under minimum loss to obtain the final model, and predicting the category set of an unlabeled sample t with it.
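Steps a)-c) can be sketched as below. The vocabulary, relevance scores s_f, adjustment matrix F and α are all invented for illustration; in the method, F and α would be learned by BP-network iteration on labeled samples rather than fixed by hand.

```python
import numpy as np

features = ["glacier", "rain", "crop"]
# Maximum semantic relevance s_f of each feature to the topic set (assumed
# values that would come out of the SWEET/WordNet shortest-distance step 5.1).
s = np.array([0.33, 1.0, 0.5])

def weight_vector(sample_words):
    """w(x): the entry for feature f is s_f if f occurs in the sample, else 0."""
    return np.array([s[i] if f in sample_words else 0.0
                     for i, f in enumerate(features)])

# F (feature adjustment, here one column per topic) and alpha would be learned;
# they are fixed here for the demo.
F = np.array([[0.9, 0.1],   # glacier: leans toward "Water"
              [0.8, 0.2],   # rain:    leans toward "Water"
              [0.1, 0.9]])  # crop:    leans toward "Agriculture"
alpha = 0.05

w = weight_vector({"glacier", "rain"})
scores = w @ F + alpha      # Y = w(x)·F + α, one score per topic
print(scores)               # higher first entry: "Water" is favoured
```

Thresholding or ranking the per-topic scores then yields the multi-label topic set for the sample.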
6) selecting the widely applied classic multi-label classification algorithm ML-KNN as co-training base model H2:
specify the number k of neighbor samples and let N(x) denote the set of k nearest neighbor samples of sample x in the training set L1. Let y_x(l) = 1 when sample x belongs to topic category l and y_x(l) = 0 otherwise. The number of samples in N(x) belonging to topic category l is
C_x(l) = Σ_{a ∈ N(x)} y_a(l),
from which the frequency arrays are built: c[j] counts the training samples that belong to category l and have exactly j neighbors in category l, and c'[j] counts those with exactly j such neighbors that do not belong to category l, for j = 0, 1, …, k.
Calculate the prior probability that an unlabeled sample t belongs to topic category l and the posterior probabilities, where H_1^l denotes the event that sample t belongs to topic category l, H_0^l the event that it does not, E_j^l the event that exactly j of the k neighbor samples of t belong to category l, s is a smoothing parameter and m is the number of training samples:
P(H_1^l) = (s + Σ_{i=1}^{m} y_{xi}(l)) / (2s + m),  P(H_0^l) = 1 − P(H_1^l),
P(E_j^l | H_1^l) = (s + c[j]) / (s(k+1) + Σ_{p=0}^{k} c[p]),
P(E_j^l | H_0^l) = (s + c'[j]) / (s(k+1) + Σ_{p=0}^{k} c'[p]).
Predict the category set of the unlabeled sample t according to the maximum a posteriori probability and the Bayes principle:
y_t(l) = arg max_{b ∈ {0,1}} P(H_b^l) · P(E_{C_t(l)}^l | H_b^l).
7) dividing 80% of all labeled samples, by repeated random sampling, into two subsets L1 and L2, serving as the training sets of classifiers H1 and H2 respectively, and predicting the category sets of all unlabeled samples with the two classifiers;
8) selecting the samples for which classifiers H1 and H2 give the same prediction, assigning them pseudo-labels, adding the pseudo-labeled samples to the two training subsets L1 and L2 respectively, updating the training sets, and repeating 7) until the classification results of the two classifiers no longer change significantly, thereby obtaining the category sets of the unlabeled samples;
9) taking 10% of all labeled samples as test samples and matching a topic category set for the test samples with the trained classifier; for example, the SBAs category labels of the example text in fig. 5 comprise Biodiversity, Climate, Disaster, Ecosystem, Water and Weather.
10) specifying the number of topic levels N; for each level, selecting the metadata texts of a single topic category and extracting fine-grained text topics based on the Latent Dirichlet Allocation (LDA) algorithm until an N-level topic directory is generated, matching the WMS metadata texts with N levels of topics. In fig. 5, the secondary topics corresponding to Biodiversity are wildlife, species and diversity; to Climate, forest and meteorology; to Disaster, polarization; to Ecosystem, habitat, resource and containment; to Water, rain; and to Weather, meteorology.
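The per-label hierarchical loop of step 10) can be sketched as below. A simple term-frequency ranking stands in here for the LDA topic extraction the method actually uses, and the example documents and labels are invented.

```python
from collections import Counter

def sub_topics(docs, labels, label, n_top=3):
    """Gather the metadata texts carrying one coarse-grained label and surface
    candidate fine-grained sub-topic terms (term frequency stands in for LDA)."""
    terms = Counter()
    for doc, doc_labels in zip(docs, labels):
        if label in doc_labels:
            terms.update(doc.split())
    return [t for t, _ in terms.most_common(n_top)]

docs = ["rain river flood rain", "habitat species habitat", "rain lake"]
labels = [{"Water"}, {"Ecosystem"}, {"Water"}]
print(sub_topics(docs, labels, "Water"))  # top terms among Water-labeled texts
```

Applying the same selection-then-extraction step to each new label at each level, N times, yields the N-level topic directory described above.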
The method considers the field characteristics and text semantics of the geographic information service metadata, and only depends on a small number of marked data samples; as shown in fig. 6, compared with the conventional multi-label classification algorithm such as a classifier chain and a voting classifier, the classification result of the method of the present invention is better in overall performance.
As shown in fig. 7, the text feature selection process of the present invention can filter out features that do not contribute to the classification result compared to the chi-square test and WordNet-based feature selection method. The method can be popularized and applied to geographic information portals and data directory services, and assists in the retrieval and discovery of various geographic information resources.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (8)

1. A multi-level, multi-label classification method for geographic information service metadata text, characterized by comprising the following steps:

1) acquiring a geographic information service metadata text set containing unlabeled and labeled samples, performing text preprocessing, and segmenting each data sample into a combination of text feature words;

2) defining a first-level classification catalogue based on the domain application topic categories of geographic information resources to obtain the classification categories, i.e. the topics, and then generating a typical-word vocabulary semantically associated with each classification category;

3) screening the text feature words against the typical-word vocabulary, filtering out features whose distance to the typical words exceeds a threshold, and obtaining a feature subset screened by topic category;

4) selecting the classic multi-label classification algorithm ML-KNN as one base model for co-training, denoted H1;

5) computing feature-to-topic semantic distances from a corpus and building a topic prediction model ML-CSW as the other base model for co-training, denoted H2;

6) designing a collaboration mechanism over the two base models to match multi-label topics to metadata texts as the first-level, coarse-grained topic classification result;

7) selecting the metadata texts corresponding to a given classification label, extracting text topics as the fine-grained topics of the next level, and obtaining the matching relation between the metadata texts and the two-level topic catalogue;

8) repeating step 7) to obtain fine-grained topic category catalogues at different levels, together with the matching relations between metadata texts and the topic catalogues.

2. The method according to claim 1, characterized in that in step 2) the first-level classification catalogue based on the domain application topic categories of geographic information resources is obtained by extending the Societal Benefit Areas (SBAs) proposed for the geosciences by the Group on Earth Observations.

3. The method according to claim 1, characterized in that in step 2) the typical-word vocabulary is generated as follows: taking the SBAs as the topic classification catalogue, the hypernyms, hyponyms and synonyms of each topic in the SWEET and WordNet definitions are extracted as the typical words semantically related to that topic, yielding the typical-word vocabulary.

4. The method according to claim 1, characterized in that in step 3) the text feature words are screened against the typical-word vocabulary as follows:

S31. represent the typical words and text feature words as word vectors in a two-dimensional space based on the Word2vec algorithm;

S32. compute the cosine distance between each typical-word vector and each text-feature-word vector;

S33. set a distance threshold T and filter out the text feature words whose cosine distance to the typical words exceeds T.

5. The method according to claim 1, characterized in that in step 5) the topic prediction model is built as follows:

S51. from the network definitions of the SWEET ontology library and the WordNet English lexical network, compute the semantic distance d(f, p_i) between a text feature f and each topic p_i:
if the feature f is included in SWEET, the semantic distance d(f, p_i) between f and each topic p_i is obtained directly from the SWEET network via Dijkstra's algorithm;

if the feature f is not included in SWEET, its hypernyms are looked up level by level until one included in SWEET is found as a substitute word for f, and the distance between f and the substitute word in WordNet is summed with the distance between the substitute word and each topic p_i in SWEET as the semantic distance d(f, p_i);
S52. take the minimum of the semantic distances d(f, p_i) between the feature f and each topic p_i, and use its reciprocal as the maximum semantic relevance s_f between the text feature f and the topic set P, where P is the set of all topics:

s_f = 1 / min_{p_i ∈ P} d(f, p_i)
S53. define feature weights from the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

S54. assuming the training set contains n text features in total, the vector of maximum semantic relevances of all features to all topics, S = [s_1, s_2, …, s_n], can be computed; the weight w(x) of a single data sample x is defined as a 1×n vector whose entries correspond to the weights of the n text features: if feature f appears in sample x, the entry is defined as s_f, otherwise it is 0;

S55. build the topic prediction model Y, where F is the adjustment vector of the features and α is a smoothing parameter; based on the labeled samples, iteratively optimize the model with a BP neural network, solve for the F and α that minimize the loss to obtain the final model, and predict the category set of an unlabeled sample t with it:

Y = w(x)*F + α.
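The S54-S55 weighting scheme reduces to a sparse linear scorer. A minimal plain-Python sketch, assuming F is an n×k adjustment matrix with one column per topic and α a per-topic offset (the claim leaves both shapes open), with the BP-network optimization of S55 omitted; the vocabulary, relevance values, and topics are hypothetical:

```python
def sample_weight(sample_tokens, vocab, S):
    # w(x): 1-by-n vector; entry i equals S[i] when vocabulary feature i
    # occurs in the sample, and 0 otherwise (claim 5, S54).
    return [S[i] if f in sample_tokens else 0.0 for i, f in enumerate(vocab)]

def predict_scores(w, F, alpha):
    # Y = w(x)*F + alpha (claim 5, S55); F is read here as an n-by-k
    # matrix of per-topic feature adjustments, alpha as a length-k offset.
    n, k = len(F), len(F[0])
    return [sum(w[i] * F[i][j] for i in range(n)) + alpha[j] for j in range(k)]

# Hypothetical toy example: 3 features, 2 topics.
vocab = ["flood", "river", "granite"]
S = [0.9, 0.8, 0.5]                                  # max semantic relevance per feature
w = sample_weight({"flood", "granite"}, vocab, S)    # -> [0.9, 0.0, 0.5]
F = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alpha = [0.1, 0.1]
scores = predict_scores(w, F, alpha)                 # -> [1.0, 0.6]
```

In the full method, F and α would be fitted on the labeled samples by the BP network rather than fixed by hand as here.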
6. The method according to claim 1, characterized in that in step 6) the collaboration mechanism matches multi-label topics to metadata texts as the first-level, coarse-grained topic classification result, specifically as follows:

S61. generate two subsets L1 and L2 from the labeled samples in the geographic information service metadata text set as the respective training sets of the co-training base models H1 and H2;

S62. train the base models H1 and H2 on their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;

S63. select from the unlabeled samples those for which classifiers H1 and H2 give identical predictions, assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets L1 and L2 respectively, update the training sets, and repeat steps S62-S63 until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of all unlabeled samples;

S64. train classifier H1 on all labeled samples and match topic category sets for the test samples.

7.
The method according to claim 1, characterized in that in step 4) the classic multi-label classification algorithm ML-KNN is selected as one base model for co-training, specifically as follows:

S41. specify the number of nearest neighbours k, let N(x) denote the set of the k nearest neighbours of sample x in the training set, and count the number c[j] of samples in N(x) belonging to topic category l and the number c'[j] of samples in N(x) not belonging to topic category l; in the formulas below, y_x(l) is 1 and ȳ_x(l) is 0 when sample x belongs to topic category l, and conversely y_x(l) is 0 and ȳ_x(l) is 1:

c[j] = Σ_{a ∈ N(x)} y_a(l)

S42. compute the prior probability P(H_1^l) that the unlabeled sample t belongs to topic category l and the posterior probability P(E_j^l | H_b^l), where b takes the values 0 and 1, H_1^l denotes the event that sample t belongs to topic category l, H_0^l denotes the event that it does not, s is a smoothing parameter, m is the number of training samples, and E_j^l denotes the event that exactly j of the k nearest neighbours of sample t belong to category l:

P(H_1^l) = (s + Σ_x y_x(l)) / (2s + m)

P(E_j^l | H_1^l) = (s + c[j]) / (s(k + 1) + Σ_{r=0}^{k} c[r])

P(E_j^l | H_0^l) = (s + c'[j]) / (s(k + 1) + Σ_{r=0}^{k} c'[r])

S43. predict the category set of the unlabeled sample t according to the maximum a posteriori probability and the Bayes principle, where C_t(l) is the number of samples belonging to category l among the k nearest neighbours of t:

y_t(l) = argmax_{b ∈ {0,1}} P(H_b^l) P(E_{C_t(l)}^l | H_b^l)
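The S41-S43 procedure is the standard ML-KNN decision rule (MAP over neighbourhood label counts). A self-contained plain-Python sketch for a single label over toy 1-D features; the distance metric and the data are illustrative assumptions, not part of the claim:

```python
def neighbours(x, X, k, exclude=None):
    # Indices of the k training samples nearest to x (toy 1-D distance).
    idx = [i for i in range(len(X)) if i != exclude]
    idx.sort(key=lambda i: abs(X[i] - x))
    return idx[:k]

def mlknn_train(X, Y, l, k, s=1.0):
    # Prior P(H1) and posteriors P(Ej | Hb) for one label l (S41-S42),
    # with smoothing parameter s over the m training samples.
    m = len(X)
    prior1 = (s + sum(y[l] for y in Y)) / (2 * s + m)
    c  = [0.0] * (k + 1)   # label-l samples whose k-NN hold exactly j label-l samples
    cp = [0.0] * (k + 1)   # the same count for non-label-l samples
    for i in range(m):
        j = sum(Y[n][l] for n in neighbours(X[i], X, k, exclude=i))
        (c if Y[i][l] else cp)[j] += 1
    post1 = [(s + c[j])  / (s * (k + 1) + sum(c))  for j in range(k + 1)]
    post0 = [(s + cp[j]) / (s * (k + 1) + sum(cp)) for j in range(k + 1)]
    return prior1, post1, post0

def mlknn_predict(x, X, Y, l, k, model):
    # S43: MAP decision -- does the unseen sample x carry label l?
    prior1, post1, post0 = model
    j = sum(Y[n][l] for n in neighbours(x, X, k))
    return prior1 * post1[j] >= (1.0 - prior1) * post0[j]

# Toy data: one binary label; small values carry it, large ones do not.
X = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
Y = [[1], [1], [1], [0], [0], [0]]
model = mlknn_train(X, Y, l=0, k=2)
print(mlknn_predict(2.5, X, Y, 0, 2, model))    # True
print(mlknn_predict(10.5, X, Y, 0, 2, model))   # False
```

In the metadata-text setting of the claims, the Euclidean toy distance would be replaced by a distance over text feature vectors, and the procedure repeated per label l.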
8. The method according to claim 1, characterized in that in step 7) the text topics are extracted based on the Latent Dirichlet Allocation algorithm.
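The claim-6 collaboration mechanism (S61-S63) can be sketched as agreement-based co-training. In this plain-Python illustration both base models are stand-in 1-nearest-neighbour learners rather than the ML-KNN and ML-CSW models of the claims, and the data are hypothetical:

```python
def make_1nn(train_set):
    # Toy base learner: 1-nearest-neighbour over a numeric feature,
    # returning the label set of the closest labelled sample.
    def predict(x):
        _, labels = min(train_set, key=lambda pair: abs(pair[0] - x))
        return labels
    return predict

def co_train(L1, L2, unlabeled, max_rounds=20):
    # S62-S63: train both models, pseudo-label the unlabeled samples on
    # which they agree, grow both training sets, and repeat until the
    # predictions stop changing (or nothing is left to label).
    L1, L2, U = list(L1), list(L2), list(unlabeled)
    prev = None
    for _ in range(max_rounds):
        h1, h2 = make_1nn(L1), make_1nn(L2)
        preds = [(x, h1(x), h2(x)) for x in U]
        current = [(p1, p2) for _, p1, p2 in preds]
        if current == prev:
            break
        prev = current
        for x, p1, p2 in preds:
            if p1 == p2:          # agreement -> pseudo-label (S63)
                L1.append((x, p1))
                L2.append((x, p1))
        U = [x for x, p1, p2 in preds if p1 != p2]
        if not U:
            break
    # S64: the final classifier is trained on all labelled samples.
    return make_1nn(L1), L1, L2

# Hypothetical labelled subsets and unlabeled pool.
L1 = [(1.0, frozenset({"Water"})), (9.0, frozenset({"Energy"}))]
L2 = [(2.0, frozenset({"Water"})), (8.0, frozenset({"Energy"}))]
h, L1_out, _ = co_train(L1, L2, unlabeled=[3.0, 7.0])
print(h(4.0))   # frozenset({'Water'})
```

Label sets are frozensets so that the multi-label predictions of the two learners can be compared for exact agreement, as S63 requires.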
CN201910942287.2A 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method Active CN110704624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Publications (2)

Publication Number Publication Date
CN110704624A CN110704624A (en) 2020-01-17
CN110704624B true CN110704624B (en) 2021-08-10

Family

ID=69197772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942287.2A Active CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Country Status (1)

Country Link
CN (1) CN110704624B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460097B (en) * 2020-03-26 2024-06-07 华泰证券股份有限公司 TPN-based small sample text classification method
CN111611801B (en) * 2020-06-02 2021-09-14 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN112464010B (en) * 2020-12-17 2021-08-27 中国矿业大学(北京) An automatic image labeling method based on Bayesian network and classifier chain
CN112256938B (en) * 2020-12-23 2021-03-19 畅捷通信息技术股份有限公司 Message metadata processing method, device and medium
CN112465075B (en) * 2020-12-31 2021-05-25 杭银消费金融股份有限公司 Metadata management method and system
CN113792081B (en) * 2021-08-31 2022-05-17 吉林银行股份有限公司 Method and system for automatically checking data assets
CN114330333B (en) * 2021-12-29 2025-09-05 北京百度网讯科技有限公司 Method for processing skill information, model training method and device
CN114358208A (en) * 2022-01-13 2022-04-15 辽宁工程技术大学 Science and collaboration activity text title recognition method based on deep learning
CN115408525B (en) * 2022-09-29 2023-07-04 中电科新型智慧城市研究院有限公司 Method, device, equipment and medium for classifying petition texts based on multi-level tags
CN116343104B (en) * 2023-02-03 2023-09-15 中国矿业大学 Map scene recognition method and system for visual feature and vector semantic space coupling
CN116541752B (en) * 2023-07-06 2023-09-15 杭州美创科技股份有限公司 Metadata management method, device, computer equipment and storage medium
CN118114060B (en) * 2024-02-01 2025-05-23 郑州大学 Disaster metadata automatic matching method and system based on word2vec model

Citations (7)

Publication number Priority date Publication date Assignee Title
US7958068B2 (en) * 2007-12-12 2011-06-07 International Business Machines Corporation Method and apparatus for model-shared subspace boosting for multi-label classification
US7975039B2 (en) * 2003-12-01 2011-07-05 International Business Machines Corporation Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering
US8340405B2 (en) * 2009-01-13 2012-12-25 Fuji Xerox Co., Ltd. Systems and methods for scalable media categorization
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN104991974A (en) * 2015-07-31 2015-10-21 中国地质大学(武汉) Particle swarm algorithm-based multi-label classification method
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN101283353B * 2005-08-03 2015-11-25 搜索引擎科技有限责任公司 System and method for finding relevant documents by analyzing tags
CN102129470A (en) * 2011-03-28 2011-07-20 中国科学技术大学 Tag clustering method and system
CN104951554B * 2015-06-29 2018-03-06 浙江大学 Method for matching landscape photos with verses that fit their artistic conception
CN105354593B * 2015-10-22 2018-10-30 南京大学 Three-dimensional model classification method based on NMF
CN105868905A (en) * 2016-03-28 2016-08-17 国网天津市电力公司 Managing and control system based on sensitive content perception
US9928448B1 (en) * 2016-09-23 2018-03-27 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US7975039B2 (en) * 2003-12-01 2011-07-05 International Business Machines Corporation Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering
US7958068B2 (en) * 2007-12-12 2011-06-07 International Business Machines Corporation Method and apparatus for model-shared subspace boosting for multi-label classification
US8340405B2 (en) * 2009-01-13 2012-12-25 Fuji Xerox Co., Ltd. Systems and methods for scalable media categorization
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN104991974A (en) * 2015-07-31 2015-10-21 中国地质大学(武汉) Particle swarm algorithm-based multi-label classification method
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs

Non-Patent Citations (2)

Title
"Multi-label classification and interactive NLP-based visualization of electric";Djavan De Clercq et al.;《https://doi.org/10.1016/j.wpi.2019.101903》;2019-07-16;pp. 1-10 *
"Label propagation algorithm based on the LDA topic model" (基于LDA主题模型的标签传递算法);刘培奇 (Liu Peiqi);《计算机应用》(Journal of Computer Applications);2012-02-01;pp. 403-410 *

Also Published As

Publication number Publication date
CN110704624A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN110704624B (en) Geographic information service metadata text multi-level multi-label classification method
CN109271477B (en) Method and system for constructing classified corpus by means of Internet
Yang et al. Learning transferred weights from co-occurrence data for heterogeneous transfer learning
Gao et al. Visual-textual joint relevance learning for tag-based social image search
CN112434168B (en) Knowledge graph construction method and fragmented knowledge generation method based on library
CN110298042A (en) Based on Bilstm-crf and knowledge mapping video display entity recognition method
Van Laere et al. Georeferencing Flickr resources based on textual meta-data
Xing et al. Exploring geo-tagged photos for land cover validation with deep learning
CN103605729A (en) POI (point of interest) Chinese text categorizing method based on local random word density model
Cai et al. A new clustering mining algorithm for multi-source imbalanced location data
CN110162601B (en) Biomedical publication contribution recommendation system based on deep learning
CN108710672B (en) A Topic Crawler Method Based on Incremental Bayesian Algorithm
Yuan et al. An effective pattern-based Bayesian classifier for evolving data stream
Wang et al. Capturing joint label distribution for multi-label classification through adversarial learning
Jeawak et al. Predicting the environment from social media: a collective classification approach
Calumby et al. Diversity-based interactive learning meets multimodality
Kordopatis-Zilos et al. Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features.
Chen et al. Toward the understanding of deep text matching models for information retrieval
Sheehan et al. Learning to interpret satellite images using wikipedia
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Yang et al. Multi-Label Learning Based on Transfer Learning and Label Correlation.
Park et al. Estimating comic content from the book cover information using fine-tuned VGG model for comic search
Tóth et al. Multilabel clustering analysis of the Croatian-English parallel corpus based on Latent Dirichlet Allocation Algorithm
Zhang et al. Web service discovery based on information gain theory and bilstm with attention mechanism
Aggarwal SATLAB-an End to End Framework for Labelling Satellite Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant