
CN115687314B - A Thangka cultural knowledge graph display system and its construction method - Google Patents

A Thangka cultural knowledge graph display system and its construction method

Info

Publication number
CN115687314B
Authority
CN
China
Prior art keywords
tang
entity
cavan
thangka
model
Prior art date
Legal status
Active
Application number
CN202211136388.9A
Other languages
Chinese (zh)
Other versions
CN115687314A (en)
Inventor
李长哲
刘晓静
Current Assignee
Qinghai University
Original Assignee
Qinghai University
Priority date
Filing date
Publication date
Application filed by Qinghai University filed Critical Qinghai University
Priority to CN202211136388.9A priority Critical patent/CN115687314B/en
Publication of CN115687314A publication Critical patent/CN115687314A/en
Application granted granted Critical
Publication of CN115687314B publication Critical patent/CN115687314B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present invention discloses a thangka culture knowledge graph display system and a construction method thereof. The system completes the construction of a thangka culture knowledge graph display system based on a web terminal through four parts: the construction of a thangka data set, the recognition of thangka culture named entities, the joint extraction of entities and relationships between entities based on the Bs-Spert model, and the construction of a thangka culture knowledge graph display system. The system realizes the named entity recognition and relationship extraction visualization query function of thangka natural language text, and alleviates the current predicament of portal websites with thangka culture as the theme, such as the lack of resources, scattered data, and shallow knowledge. The system can divide roles according to the characteristics of different audiences and grant different degrees of authority, so that the system can increase and continuously improve functions based on the needs of different groups of people. In addition, the system is simple to operate, with a concise and rich interface, and is suitable for users who are older or not very good at using the Internet.

Description

Thangka culture knowledge graph display system and construction method thereof
Technical Field
The invention relates to the technical field of computer image description, in particular to a Thangka culture knowledge graph display system and a construction method thereof.
Background
Thangka texts are the natural language texts that people have left behind over the past thousand years by recording, researching, describing and re-creating the content of Thangka paintings during the development of Thangka art. These texts mostly exist in the form of paper books or oral transmission, and because of this single form and the limitations of the text carriers, Thangka texts are often lost, omitted or corrupted. Such losses clearly run counter to China's goal of protecting cultural heritage, so the protection of Thangka texts is an urgent task to be solved.
As times have developed, in addition to the more traditional ways of protecting Thangka culture, such as publishing books and building archives, new ways of preserving Thangka text resources on the Internet keep emerging, such as digital Thangka resource websites and Baidu Encyclopedia entries. The construction of Thangka resource websites is mainly undertaken by university research institutions and libraries in minority regions, which suffers from scattered resources and inconsistent research across different institutions and regions. Encyclopedia websites mainly present Thangka knowledge in the form of Baidu Encyclopedia entries, which are shallow and one-dimensional and cannot explain in detail the inherent meaning of Thangka texts. Therefore, a solution that can unify scattered resources and interpret Thangka culture in depth has become a pressing problem in the development of Thangka text protection.
The definition of a knowledge graph differs depending on the application scenario and the technical domain. This application constructs a knowledge graph for Thangka culture and focuses mainly on Thangka corpus texts, building the graph from a natural language perspective. The construction process can therefore be regarded as extracting the semantics and structural information from Thangka texts, that is, extracting the entities in the Thangka texts and the dependency relationships between them from the perspective of natural language processing (Natural Language Processing, NLP). In short, a knowledge graph can be regarded as a tool for expressing real-world knowledge in the form of a graph, where each node represents an entity and each edge represents a relationship between two entities.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a Thangka culture knowledge graph display system and a construction method thereof.
The invention provides a construction method of a Thangka culture knowledge graph display system, which specifically comprises the following steps:
S1, constructing the Thangka knowledge graph dataset, namely collecting and sorting the Thangka culture dataset and annotating the Thangka culture dataset with the Brat tool;
S1-1, dataset acquisition, namely first performing targeted analysis of Buddhism-related websites strongly correlated with Thangka texts, writing corresponding web crawlers and crawling the dataset with them, then acquiring data manually for supplementary correction, manually consulting and extracting entries strongly correlated with Thangka texts, and further supplementing and correcting the crawled dataset with entries recognized through OCR;
S1-2, data sorting and cleaning, namely sorting and cleaning the acquired dataset for heavy noise, missing values, duplicates and outliers: missing values are filled with the global constant Unknown, outliers are either deleted directly or assigned the global constant Unknown, duplicate data are de-duplicated, and the entry data recognized by OCR are used as a reference during the sorting and cleaning;
S1-3, dataset annotation with the Brat annotation tool, namely first generating, for each Thangka text to be extracted, a file with the same name and the suffix .ann, then configuring the initial entity types, entity relation types and text position information in the annotation.conf file, selecting BIOES as the entity tagging scheme, dividing the dataset in the ratio training set : test set = 8 : 2 once the entities and entity relations are obtained, and installing an Ubuntu system environment with VMware virtual machine software under Windows to deploy the Brat annotation tool;
S2, Thangka named entity recognition, namely adopting the Bi-Lstm+CRF model to recognize Thangka named entities: a conditional random field is introduced on top of the Bi-Lstm model, Bi-Lstm serves as the feature extractor, its final output is used as the input of the conditional random field model, and the conditional random field is used to obtain the state transition rules between labels of the tag sequence;
S3, joint extraction of entities and relationships among entities based on the Bs-Spert model:
S3-1, constructing the Bs-Spert model, which mainly comprises a Bert pre-training model module, a beam search module, a span classification module, a span filtering module and a relation classification module, with the Bert pre-training model serving as the base on which the Bs-Spert model is developed for the joint extraction of Thangka entities and the relationships between them;
S3-2, training the Bs-Spert model on the basis of the constructed Thangka text dataset, with Bert-Base-Chinese as the Bert pre-training model;
S3-3, performing the Thangka entity and relation joint extraction experiments, namely first testing the performance of the Bs-Spert model under different Beam Width values, then selecting the pooling function by testing the Precision, Recall and F1-Score of different pooling functions in the span classification module, and finally, on the basis of the results of the first two steps, comparing horizontally with the classic information-extraction models Bert-CNN and LSTM-RNN to obtain the experimental performance of the Bs-Spert model on the Thangka culture dataset;
S4, constructing the Thangka culture knowledge graph display system, namely first importing the Thangka culture knowledge graph into a Neo4j graph database through the two steps of storing the Thangka entities and storing the relationships between them, and then completing the web-based Thangka culture knowledge graph display system through the three steps of system requirement analysis, system design and system testing.
Further, in step S3-3, the Beam Width values are set to 3, 5, 7, 9 and 11, and the different pooling functions are Average Pooling, Sum Pooling and Max Pooling.
Further, in step S4, the requirement analysis includes interface requirement analysis and functional requirement analysis, the system design includes interface design and functional design, the interface design includes the login interface and the functional interface, the system testing includes login testing and functional testing, the login testing covers guest mode, user mode and administrator mode, and the functional testing covers Thangka named entity recognition, Thangka entity query, Thangka entity relation query, and addition of entities or entity relations.
The invention also protects a Thangka culture knowledge graph display system, which comprises a requirement analysis layer, an interface login layer, a functional layer and a data storage layer;
the requirement analysis layer comprises interface requirement analysis and functional requirement analysis, wherein the interface requirement analysis is used for analyzing whether the interface requirements of different users are met in terms of operation and use and whether the font colors and sizes are reasonably matched with the main colors of the interface;
the interface login layer is used for the login interfaces of different users: it distinguishes the goals of different users based on the data analyzed by the requirement analysis layer and pops up the access login interface appropriate for those goals;
the functional layer is used for the operating functions of different users, wherein the operating functions comprise Thangka named entity recognition, Thangka entity query, Thangka relation query, addition of Thangka entities or Thangka entity relations, modification and correction of Thangka entities or Thangka entity relations, and deletion of Thangka entities or Thangka entity relations;
the data storage layer is used for storing the Thangka entities and the relationships between them.
Furthermore, the interface login layer comprises a login interface, a new user registration interface and a password modification interface; the login interface serves Thangka enthusiasts, Thangka culture researchers and platform administrators; by default the administrator option is unchecked and the system logs in as an ordinary user; in guest mode, clicking the guest-mode button in the lower right corner jumps automatically to a platform interface with restricted functions; and when the administrator checkbox is checked, the system switches to administrator login mode.
Furthermore, users accessing the system in guest mode are automatically classified as Thangka enthusiasts and granted only restricted functions, being able to use only the three functions of named entity recognition, entity query and entity relation query; users who register and log in normally through the system are additionally granted the functions of adding Thangka culture entities and Thangka entity relations on top of those three permissions; and the system administrator is granted all control authority over the Thangka culture knowledge graph display platform.
Compared with the prior art, the invention has the following beneficial effects:
The method completes the construction of the web-based Thangka culture knowledge graph display system through four parts: construction of the Thangka dataset, Thangka named entity recognition, joint extraction of entities and relationships between entities based on the Bs-Spert model, and construction of the Thangka culture knowledge graph display system. It realizes named entity recognition and visual relation-extraction query for Thangka natural language texts, and relieves the current predicament of Thangka-themed portal websites, namely scarce resources, scattered data and shallow knowledge presentation. On the Thangka culture dataset, Bi-Lstm+CRF achieves a better Thangka entity recognition effect than the CRF and Bi-Lstm models while keeping the time overhead acceptable, and the Bs-Spert model shows excellent joint entity and relation extraction performance. The system can divide roles according to the characteristics of different audience groups and grant different degrees of authority, so that it can keep improving its functions while serving different groups of people. In addition, the system is simple to operate, has a concise yet rich interface, and is suitable for users who are older or less adept at using the Internet.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic illustration of the Brat tool annotating the Thangka dataset;
FIG. 3 is the Bi-Lstm+CRF model architecture diagram;
FIG. 4 shows the experimental results of Bi-Lstm+CRF under different Epoch values;
FIG. 5 shows the experimental results of Bi-Lstm+CRF under different Hidden_Size values;
FIG. 6 is the overall architecture diagram of the Bs-Spert model;
FIG. 7 is a schematic diagram of Beam Search;
FIG. 8 shows the experimental results under different Beam Width values;
FIG. 9 is the design architecture diagram of the web-based Thangka culture knowledge graph display system;
FIG. 10 is a schematic illustration of Thangka named entity recognition;
FIG. 11 is a schematic illustration of a Thangka entity query;
FIG. 12 is a schematic illustration of a Thangka relation query;
FIG. 13 is a schematic illustration of Thangka entity addition;
FIG. 14 is a schematic illustration of Thangka relation modification.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Construction method of the Thangka culture knowledge graph display system
1. Construction of the Thangka knowledge graph dataset
This part mainly comprises collecting and sorting the Thangka culture dataset and annotating the Thangka culture dataset with the Brat tool;
(1) Data set acquisition
First, through targeted analysis of Buddhism-related websites strongly correlated with Thangka culture, corresponding web crawlers were written and the data were crawled with them; for example, about 95,000 records were accumulated from websites in the Thangka-related field (such as Buddhist dictionaries), and targeted crawling of thematic websites such as China Thangka Net yielded roughly 340,000 characters. The data were then supplemented and corrected manually: entries strongly correlated with Thangka culture were consulted and extracted by hand, and the dataset obtained by the web crawlers was further supplemented and corrected with more than 150 entries totaling about 36,000 characters recognized by OCR.
(2) Data sorting and cleaning
The acquired data suffer from heavy noise, missing values, duplicates and outliers. Missing values are filled with the global constant Unknown, outliers are either deleted directly or assigned the global constant Unknown, duplicate data are de-duplicated, and the more than 150 entries recognized by OCR are used as a reference during the sorting and cleaning;
(3) Dataset annotation
Dataset annotation is performed with the Brat annotation tool. First, for each Thangka text from which entities are to be extracted, a file with the same name and the suffix .ann is generated; then the initial entity types, entity relation types and text position information are configured in the annotation.conf file. At present 24 entity types have been configured and collected, totaling 3942 entities, such as [fo], [age], [name], [pusa], [tianmu], [tinanv], [shangshi], [jingang], [hufashen], [monk], [zunzhe], [book], [direction], [zuozi], [Loc], [jiaopai], [believers], [shuliang], [shoushi], [faqi], [zhenyan], [yuyi], [faxiang], [qiguan], [nation], [job], [zuoji] and the like, together with 26 entity relation types, totaling 1756 relations, such as [Mentoring], [ToEast], [OtherName], [belong], [as], [previouslife], [previouslifeMother], [previouslifejob], [previouslifeBelong], [home], [relatives], [wife], [son], [verb], [own], [Means], [successful], [othername], [Create], [Birthtime], [Dietime], [BirthLoc], [DieLoc], [father], [mother], [xieshi] and the like. The dataset annotated with Brat is shown in figure 2. BIOES is selected as the entity tagging scheme, where the label "B" marks the beginning of an entity, "I" the inside of an entity, "O" a non-entity, "E" the end of an entity, and "S" a single-character entity (see Table 1 for details). Once the entities and entity relations are obtained, the dataset is divided in the ratio training set : test set = 8 : 2.
Table 1 Named entity recognition data tagging format
Table 2 Thangka culture dataset partitioning
The Ubuntu system environment is installed with VMware virtual machine software under Windows, and the Brat annotation tool is then deployed (see Table 3 for details).
Table 3 Thangka dataset annotation experiment environment
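The BIOES scheme described above assigns one of the B/I/O/E/S labels to every character of a Thangka text according to the annotated entity spans. The following is a minimal illustrative sketch of this conversion from Brat-style character-offset annotations; the function name, the example sentence and the example offsets and entity type are assumptions, not part of the patented implementation.

```python
# A minimal illustrative sketch (not the patented implementation) of converting
# Brat-style character-offset entity annotations into the BIOES tags described
# above. The function name, the example sentence and the example offsets and
# entity type are assumptions.

def to_bioes(text, entities):
    """text: str; entities: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"S-{label}"            # single-character entity
        else:
            tags[start] = f"B-{label}"            # beginning of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"            # inside of the entity
            tags[end - 1] = f"E-{label}"          # end of the entity
    return tags

if __name__ == "__main__":
    sentence = "现在佛为释迦牟尼佛"
    # hypothetical annotation: characters 4-8 ("释迦牟尼佛") labeled with type "fo"
    print(list(zip(sentence, to_bioes(sentence, [(4, 9, "fo")]))))
```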
2. Thangka named entity recognition
(1) Construction of the Bi-Lstm+CRF model
The final output of Bi-Lstm classifies the labels through a fully connected neural network and selects only the most probable label for the current word, so each label is decided solely from its local context. From the construction of the conditional random field model it is known that the hidden states between different words in a Thangka text correspond to the state transition rules between labels of the tag sequence. A conditional random field is therefore introduced on top of the Bi-Lstm model to overcome the drawbacks that the CRF depends on large-scale hand-crafted feature rules and is slow, while the Bi-Lstm model alone cannot make good use of the state transition rules hidden in the sequence; the goal is a stable improvement in entity recognition precision. This application uses Bi-Lstm as the feature extractor, feeds its final output to the conditional random field model, and uses the conditional random field to obtain the state transition rules between labels of the tag sequence. The Bi-Lstm+CRF model architecture diagram is shown in figure 3.
Let the output dimension of Bi-Lstm equal the number of labels, representing the probability of each word mapping to each label, and let the output matrix of the Bi-Lstm model be P, where P_{i,j} is the non-normalized probability that word x_i maps to the j-th label. Combining this with the state transition matrix A of the conditional random field, the scoring function for an input sequence X = {x_1, x_2, x_3, ..., x_{m-1}, x_m} and an output label sequence Y = {y_1, y_2, y_3, ..., y_{m-1}, y_m} is defined as formula (1):
score(X, Y) = Σ_{i=1}^{m} P_{i, y_i} + Σ_{i=1}^{m-1} A_{y_i, y_{i+1}} (1)
The probability of each label sequence is then defined with the Softmax function, as shown in formula (2):
P(Y|X) = exp(score(X, Y)) / Σ_{Y' ∈ Y_X} exp(score(X, Y')) (2)
where Y_X denotes all label sequences, including those that are not possible. Training maximizes the log-likelihood log(P(Y|X)), so the loss function of the Bi-Lstm+CRF model is the negative log-likelihood, as shown in formula (3):
Loss = -log(P(Y|X)) = -(score(X, Y) - log Σ_{Y' ∈ Y_X} exp(score(X, Y'))) (3)
Finally, the network is trained with a stochastic gradient descent algorithm.
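As a concrete illustration of formulas (1) to (3), the following PyTorch sketch shows a Bi-Lstm feature extractor producing the emission matrix P together with the scoring function of formula (1). It is a minimal sketch under assumed shapes and hyperparameters (the four layers and hidden size 64 follow the experiments reported below; the vocabulary and embedding sizes are assumptions), not the patent's exact implementation.

```python
# A minimal PyTorch sketch of the Bi-Lstm feature extractor and the scoring
# function of formula (1); an illustration under assumed shapes, not the
# patent's exact implementation.
import torch
import torch.nn as nn

class BiLstmEmitter(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_size, num_layers=4,
                              bidirectional=True, batch_first=True)
        self.to_tags = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):                 # token_ids: (batch, m)
        h, _ = self.bilstm(self.embed(token_ids))
        return self.to_tags(h)                    # emission matrix P: (batch, m, num_tags)

def crf_score(P, A, y):
    """Formula (1): sum of emission scores P[i, y_i] plus transition scores
    A[y_i, y_{i+1}] for one sequence. P: (m, num_tags); A: (num_tags, num_tags);
    y: (m,) gold tag indices."""
    emission = P[torch.arange(P.size(0)), y].sum()
    transition = A[y[:-1], y[1:]].sum()
    return emission + transition

if __name__ == "__main__":
    model = BiLstmEmitter(vocab_size=3000, embed_dim=128, hidden_size=64, num_tags=9)
    ids = torch.randint(0, 3000, (1, 12))         # one sentence of 12 characters
    P = model(ids)[0]                             # (12, 9)
    A = torch.zeros(9, 9)                         # transition matrix, learned in practice
    y = torch.randint(0, 9, (12,))
    print(crf_score(P, A, y))
    # The loss of formula (3) is the negative log-likelihood: logsumexp of
    # crf_score over all tag sequences (computed with the forward algorithm,
    # e.g. by a CRF library) minus crf_score of the gold sequence.
```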
(2) Bi-Lstm+CRF experiment comparison and analysis under different settings
The results reported for this experiment are the average Precision, average Recall and corresponding average F1-Score computed over all entity categories in the dataset. All experiments were conducted on an Ubuntu 20.04.2 LTS system; the experimental environment configuration is shown in Table 4.
Table 4 Thangka named entity recognition task experiment environment
1) Comparison and analysis of Bi-Lstm+CRF experiments under different Epochs
First, in order to determine which Epoch value gives the Bi-Lstm+CRF model the best experimental effect, figure 4 shows the Precision, Recall and F1-Score of Bi-Lstm+CRF for Epoch = 50, 100, 150, 200, 250, 300. As can be seen from figure 4, the model obtains the best experimental effect when Epoch = 100, so the Epoch variable is set to 100 in the following experimental comparisons.
2) Comparison and analysis of Bi-Lstm+CRF experiments under different Hidden_Size
With the optimal Epoch fixed, different hidden sizes Hidden_Size = 64, 128, 256, 512 of the Bi-Lstm are searched. The final experimental results are shown in figure 5; the model obtains the best result when Hidden_Size = 64.
3) Comparison and analysis of Bi-Lstm+CRF experiments under different numbers of Bi-Lstm layers
In the experiments, changing the number of Bi-Lstm layers has a larger effect on the final result, so the number of layers is also examined: layers one to ten are tried in turn, and then ten layers and twenty layers are taken out for comparison. The experimental results in Table 5 show that the effect of the model first gradually increases and then gradually decreases, reaching the best effect at four layers.
Table 5 Experimental results of Bi-Lstm+CRF under different numbers of Bi-Lstm layers
4) Experimental comparison and analysis under different models
On the basis of the above experiments, the Bi-Lstm+CRF iteration number is set to Epoch = 100, the hidden feature dimension of the Bi-Lstm model is set to Hidden_Size = 64, and with four Bi-Lstm layers the performance of the CRF, Bi-Lstm and Bi-Lstm+CRF models on the Thangka dataset is compared. The detailed parameter settings of the Bi-Lstm+CRF model are shown in Table 6.
Table 6 Parameter settings of the Bi-Lstm+CRF model
As shown in Table 7, among the three models CRF, Bi-Lstm and Bi-Lstm+CRF, the Bi-Lstm model performed the worst on the Thangka culture dataset; the CRF model improved Precision by 0.78%, Recall by 2.46%, and F1-Score by 1.64% over the Bi-Lstm model. There are three main reasons why the traditional statistical CRF model outperforms the deep-learning Bi-Lstm model on the Thangka dataset. First, the CRF model is a statistical probabilistic graphical model: the text is scanned and input through feature templates, linear weighted combinations of local features of the input sequence are considered, and the CRF computes a joint probability over the whole sequence, optimizing the sequence globally rather than simply concatenating the best solution at each time step. Second, when producing the output at each time step, the Bi-Lstm model does not take the output of the previous time step into account, so it cannot model the labeling rules of the sequence well, and using the Bi-Lstm model alone for named entity recognition loses contextual semantic dependencies. Third, the Thangka text dataset is relatively small, and the CRF is good at handling small datasets. Moreover, as shown in Table 8, in the time-overhead comparison of the three models the CRF model costs more time than the Bi-Lstm model, which amounts to trading time resources for performance.
Because the Bi-Lstm model is based on a recurrent neural network, for which Nvidia and others provide various training acceleration technologies such as GPU acceleration and multi-stream asynchrony, it can complete feature extraction of the observation sequence faster than the CRF model, while the CRF model can make better use of the state transition rules in the hidden states of the sequence and thus better learn contextual semantic dependencies. The Bi-Lstm+CRF model combines the complementary advantages of the CRF and Bi-Lstm models, and finally achieves improvements of 0.7% in Precision, 1.01% in Recall and 0.86% in F1-Score over the CRF model, as shown in Table 7. Moreover, in the time-overhead comparison of the different models, the Bi-Lstm+CRF model greatly shortens the time cost compared with the CRF model.
Table 7 experimental results for comparison of different models
TABLE 8 time overhead for different models
3. Joint extraction of entities and relationships between entities based on Bs-Spert model
(1) Bs-Spert model construction
The components of the model mainly comprise a Bert pre-training model module, a beam search module, a span classification module, a span filtering module and a relation classification module. The Bert pre-training model serves as the base on which the Bs-Spert model is developed for the joint extraction of Thangka entities and the relationships between them. A Thangka text (for example, a short sentence about Sakyamuni) is input into the Bs-Spert model character by character; after data preprocessing a group of m byte-pair encodings P = {p_1, p_2, p_3, ..., p_{m-1}, p_m} is obtained, and after processing by the Bert pre-training model a group of m+1 embedded vectors Q = {q_1, q_2, q_3, ..., q_{m-1}, q_m, c} is obtained, where c denotes the special classifier token used in processing the Thangka text, which mainly captures the contextual history information of the Thangka text;
1) Span classification module
Any possible candidate entity span may be input into the span classifier (the Span Classifier block in figure 6). Assume the input span has length k+1. The entity type set ψ is predefined with reference to the Thangka entity classes described above; ψ contains all recognizable Thangka entity classes, and O represents a span that is not an entity or cannot be recognized.
The input to the span classifier module consists of three parts:
A. Span embedding: the token embeddings of the span obtained from the Bert pre-training model are fused into f(q_i, q_{i+1}, ..., q_{i+k-1}, q_{i+k}) with a fusion function f. The influence of different fusion functions on the final test result is studied, and the Max-Pooling function is found to give the highest test precision.
B. Span length embedding: a learned width embedding w_{k+1} is concatenated with f(q_i, q_{i+1}, ..., q_{i+k-1}, q_{i+k}) to obtain the representation of a span of width k+1, where ∘ denotes concatenation:
q(s) = f(q_i, q_{i+1}, ..., q_{i+k-1}, q_{i+k}) ∘ w_{k+1} (4)
C. Sentence vector c: obtained from the pre-trained Bert model, it represents the contextual history information of the sentence text and acts like a keyword in a sentence. For example, in the character sequence { "现", "在", "佛", "为", "释", "迦", "牟", "尼", "佛" } ("the present Buddha is Sakyamuni Buddha"), the keyword "佛" (Buddha) is an important marker for the entity class "fo". At the same time, the sentence vector c can use the contextual history information to effectively resolve entity ambiguity. The input of the span classification module after concatenating the sentence vector c is shown in formula (5):
x_s = q(s) ∘ c (5)
Finally, x_s is fed into the span classification module, where the softmax function serves as the classifier. A posterior value is produced for each entity class, as shown in formula (6), including the posterior of O (a span that is not an entity or cannot be recognized):
y_s = softmax(W_s · x_s + b_s) (6)
where W_s denotes the weight matrix applied to the input x_s and b_s denotes the bias term.
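A minimal sketch of the span classification input of formulas (4) to (6) is given below: the token embeddings of a span are fused with Max-Pooling, concatenated with a learned width embedding and the sentence vector c, and passed through a softmax classifier. The class, parameter and dimension names are illustrative assumptions rather than the patent's code.

```python
# An illustrative sketch of the span classification input of formulas (4)-(6);
# class, parameter and dimension names are assumptions, not the patent's code.
import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    def __init__(self, bert_dim, max_width, num_entity_types, width_dim=25):
        super().__init__()
        self.width_embed = nn.Embedding(max_width + 1, width_dim)       # w_{k+1}
        self.classifier = nn.Linear(2 * bert_dim + width_dim, num_entity_types)

    def forward(self, token_embs, span, cls_vec):
        """token_embs: (m, bert_dim) Bert outputs; span: (i, j) inclusive indices;
        cls_vec: (bert_dim,) sentence vector c."""
        i, j = span
        fused = token_embs[i:j + 1].max(dim=0).values                   # f(q_i..q_{i+k}), Max-Pooling
        q_s = torch.cat([fused, self.width_embed(torch.tensor(j - i + 1))])  # formula (4)
        x_s = torch.cat([q_s, cls_vec])                                  # formula (5)
        return torch.softmax(self.classifier(x_s), dim=-1)              # formula (6), includes class O

if __name__ == "__main__":
    clf = SpanClassifier(bert_dim=768, max_width=10, num_entity_types=25)
    embs, c = torch.randn(9, 768), torch.randn(768)
    print(clf(embs, (4, 8), c).shape)             # posterior over 24 entity types plus O
```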
2) Span filtering module
Each entity class is scored separately and the class with the highest score is selected as the entity recognition result; the O class, that is, non-entity spans, is filtered out, and only the span embeddings belonging to the ψ entity classes are concatenated and passed on to the relation classification module.
3) Beam search module
Beam Search has only one parameter, the Beam Width, set to k: the k best results with the highest current conditional probability are selected as the first tokens of the candidate output sequences; based on the current tokens, the k combinations with the highest conditional probability among all combinations are then selected, so that the k best candidates are kept throughout the selection process, and the best result is finally chosen from all candidates. Figure 7 shows the search flow of beam search when the search tree width k = 2.
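The following is a small illustrative beam-search sketch with the single Beam Width parameter k described above; the scoring callback and the toy distribution are assumptions used only to keep the example self-contained.

```python
# A small illustrative beam-search sketch; the scoring callback and the toy
# distribution are assumptions used only to keep the example self-contained.
import math

def beam_search(step_log_probs, k, max_len):
    """step_log_probs(prefix) returns {next_token: log-probability}."""
    beams = [([], 0.0)]                            # (partial sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep the k best
    return max(beams, key=lambda c: c[1])          # best completed candidate

if __name__ == "__main__":
    def toy(prefix):                               # toy next-token distribution
        return {"a": math.log(0.6), "b": math.log(0.4)}
    print(beam_search(toy, k=2, max_len=3))        # search tree width k = 2, as in FIG. 7
```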
4) Relationship classification module
The input of the relation classification module mainly consists of the span embeddings q(s), which already fuse the span tokens and span length through the Bert model, and the contextual history information between the two valid entities. The fused candidate entity pair (s_1, s_2) is represented by q(s_1) and q(s_2), which can be obtained directly from the span classification module.
The other component of the input of the relation classification module is the contextual history information between the entity pair (s_1, s_2). Since this concerns contextual history information, the sentence vector c obtained from the Bert model naturally comes to mind; here, however, the context embedding d(s_1, s_2) between the two valid entities is obtained by applying the pooling function to the Bert pre-training model outputs of the tokens between them, and d(s_1, s_2) is input into the relation classification module as the context information. If the entities overlap, the range between the valid entity pair is empty, so in that case the context embedding is defined as d(s_1, s_2) = 0.
Next, a concatenation operation similar to that of the span classification module is adopted: the span embeddings of the two valid entities are concatenated with the context embedding between them, and the asymmetry of the relation between the entity pair is taken into account, so the inputs of the relation classification module are expressed as formula (7):
x_r^1 = q(s_1) ∘ d(s_1, s_2) ∘ q(s_2), x_r^2 = q(s_2) ∘ d(s_1, s_2) ∘ q(s_1) (7)
x_r^1 and x_r^2 are input into the single-layer classifier of the relation classification module, as shown in formula (8):
y_r^1 = σ(W_r^1 · x_r^1 + b_r^1), y_r^2 = σ(W_r^2 · x_r^2 + b_r^2) (8)
where W_r^1 and W_r^2 denote the weight matrices of x_r^1 and x_r^2 respectively, and b_r^1 and b_r^2 denote the corresponding bias terms. σ denotes the Sigmoid function, whose output dimension equals the size of the relation set R; any high response on the Sigmoid layer indicates that the valid entity pair (s_1, s_2) has the corresponding relationship. A response threshold α is set: when the response score of a relation is greater than or equal to α, a known relation r ∈ R can be considered to exist between the two valid entities; conversely, if the response score is less than α, no known relation is considered to exist between them.
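An illustrative sketch of the relation classification step of formulas (7) and (8) follows: the two span representations are concatenated with the max-pooled context embedding d(s1, s2) between them, scored by sigmoid layers and thresholded with the response threshold α (0.4 in the training settings below). The class and parameter names are assumptions, not the patent's code.

```python
# An illustrative sketch of the relation classification step of formulas (7)-(8);
# class and parameter names are assumptions, not the patent's code.
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, span_dim, bert_dim, num_relations, alpha=0.4):
        super().__init__()
        self.bert_dim = bert_dim
        self.scorer1 = nn.Linear(2 * span_dim + bert_dim, num_relations)  # W_r^1, b_r^1
        self.scorer2 = nn.Linear(2 * span_dim + bert_dim, num_relations)  # W_r^2, b_r^2
        self.alpha = alpha

    def forward(self, q_s1, q_s2, context_embs):
        """q_s1, q_s2: span representations q(s1), q(s2); context_embs: (n, bert_dim)
        Bert embeddings of the tokens between the two spans (may be empty)."""
        if context_embs.numel() == 0:
            d = torch.zeros(self.bert_dim)               # d(s1, s2) = 0 for overlapping entities
        else:
            d = context_embs.max(dim=0).values           # d(s1, s2) via max pooling
        x1 = torch.cat([q_s1, d, q_s2])                  # formula (7), direction s1 -> s2
        x2 = torch.cat([q_s2, d, q_s1])                  # formula (7), direction s2 -> s1
        y1 = torch.sigmoid(self.scorer1(x1))             # formula (8)
        y2 = torch.sigmoid(self.scorer2(x2))
        return (y1 >= self.alpha), (y2 >= self.alpha)    # relations whose response reaches alpha
```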
(2) Bs-Spert model training
The Bs-Spert model is trained on the basis of the Thangka text dataset constructed above. The Bert pre-training model is Bert-Base-Chinese, whose parameters are: number of encoder layers Bert_Layers = 12, number of self-attention heads = 12, and word-vector dimension Bert_Dimension = 768. Random numbers drawn from a normal distribution (0, 0.02) are used as the initial weights of the relation classifier in the model, and an Adam optimizer with a warm-up learning rate and linear learning-rate decay is adopted. To prevent overfitting, a Dropout rate of 0.5 is set for the entity extraction and relation extraction modules respectively, and the relation response threshold of the relation classification module is set to α = 0.4. The detailed parameter settings of the Bs-Spert model are shown in Table 9.
Table 9 Parameter settings of the Bs-Spert model for joint entity and relation extraction
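The warm-up plus linear-decay optimizer setup described above could look like the following sketch. bert-base-chinese follows the text, while the learning rate, total step count and warm-up proportion are illustrative assumptions, and get_linear_schedule_with_warmup from the transformers library is one common way to realize this schedule, not necessarily the authors' own code.

```python
# A hedged sketch of the Adam optimizer with warm-up and linear learning-rate
# decay described above; learning rate and step counts are assumptions.
import torch
from transformers import BertModel, get_linear_schedule_with_warmup

bert = BertModel.from_pretrained("bert-base-chinese")    # 12 layers, 12 heads, 768-dim
optimizer = torch.optim.Adam(bert.parameters(), lr=5e-5)
total_steps = 1000                                       # assumed: epochs * batches per epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),             # warm-up phase
    num_training_steps=total_steps)                      # then linear decay to zero

for step in range(total_steps):
    # forward pass and loss.backward() on the joint extraction loss would go here
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```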
(3) Thangka entity and relation joint extraction experiments:
First, the performance of the Bs-Spert model was tested under different Beam Width values (Width = {3, 5, 7, 9, 11}), keeping the other experimental parameters unchanged, and the best Beam Width value was chosen to determine the further experiments. As can be seen from figure 8, Beam Width = 11 achieves the best effect. The pooling function was then selected by testing the Precision, Recall and F1-Score of the different pooling functions (Average Pooling, Sum Pooling and Max Pooling) in the span classification module; as shown in Table 10, the Max Pooling function brings a relatively significant improvement. Finally, on the basis of the results of the first two steps, the experimental performance of the Bs-Spert model on the Thangka culture dataset is obtained by horizontal comparison with the classic information-extraction models Bert-CNN and LSTM-RNN (see Table 11 for details).
Table 10 Experimental comparison under different pooling functions
Table 11 Experimental comparison of different models on the Thangka dataset
4. Construction of the Thangka culture knowledge graph display system
This part first imports the Thangka culture knowledge graph into the Neo4j graph database through the two steps of storing the Thangka entities and storing the relationships between them, and then completes the web-based Thangka culture knowledge graph display system through the three steps of system requirement analysis, system design and system testing (see figure 9 for details). The requirement analysis includes interface requirement analysis and functional requirement analysis; the system design includes interface design and functional design; the interface design includes the login interface and the functional interface; the system testing includes login testing and functional testing; the login testing covers guest mode, user mode and administrator mode; and the functional testing covers Thangka named entity recognition, Thangka entity query, Thangka entity relation query, and addition of entities or entity relations.
(1) Thangka culture knowledge graph storage
1) Storage of Thangka culture entities
The entity file Thangka_entity.csv of the Thangka culture is placed in the import folder under the path "file:///Neo4j-community-3.5.14/import", and then the command LOAD CSV WITH HEADERS FROM "file:///Thangka_entity1.csv" AS line CREATE (:entity1 {Entity: line.entity}) is used to import the Thangka culture entity data into the Neo4j graph database.
2) Storage of relationships between Thangka culture entities
The relationship file Thangka_relationship.csv between the Thangka culture entities is placed in the import folder under the path "file:///neo4j-community-3.5.14/import", and then the command LOAD CSV WITH HEADERS FROM "file:///Thangka_Relation.csv" AS line MATCH (entity1:fo {name: line.fo}), (entity2:faqi {name: line.faqi}) CREATE (entity1)-[r1:line.relation]->(entity2) is used to link the entities that have a relationship with each other.
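For illustration only, the two import statements above could also be executed from Python with the official neo4j driver, as in the following hedged sketch. The connection URI and credentials are assumptions about a local deployment, and because Cypher does not accept a dynamic relationship type in the [r1:line.relation] position, the sketch stores the relation name as a property on a generic RELATION edge instead.

```python
# A hedged sketch of running the CSV import statements with the official neo4j
# driver; URI, credentials and the generic RELATION edge are assumptions.
from neo4j import GraphDatabase

ENTITY_IMPORT = (
    'LOAD CSV WITH HEADERS FROM "file:///Thangka_entity1.csv" AS line '
    'CREATE (:entity1 {Entity: line.entity})'
)
RELATION_IMPORT = (
    'LOAD CSV WITH HEADERS FROM "file:///Thangka_Relation.csv" AS line '
    'MATCH (e1:fo {name: line.fo}), (e2:faqi {name: line.faqi}) '
    'CREATE (e1)-[:RELATION {type: line.relation}]->(e2)'
)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(ENTITY_IMPORT)
    session.run(RELATION_IMPORT)
driver.close()
```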
(2) System requirement analysis
1) Interface requirement analysis: analyze whether the interface requirements are met in terms of the operation and use of different users and whether the font colors and sizes are reasonably matched with the main colors of the interface.
2) Functional requirement analysis: analyze the Thangka-related functional requirements of the three role levels of Thangka enthusiasts, Thangka culture researchers and platform administrators.
(3) System design
1) Interface design: design the login and registration interfaces and the functional interface for users, distinguish the goals of different users based on the data analyzed by the requirement analysis layer, and design the platform access interface appropriate for those goals.
2) Functional design: set different login functions and different operating functions on the Thangka culture knowledge graph for different users. The login functions include registration and password modification; the operating functions comprise query, addition, modification and deletion operations on the Thangka culture knowledge graph, and these four basic operations are applied to the two objects of entities and the relationships between entities. The query operation identifies and queries all data in the Thangka text input by a user and can also query the relationship between two entities; the addition operation can input a Thangka entity into the platform or add a relationship between Thangka entities; the modification operation can modify and correct a Thangka entity or an entity relationship already entered into the platform; and the deletion operation can delete a Thangka entity or an entity relationship entered into the platform. Users accessing the system in guest mode are automatically classified as Thangka enthusiasts and granted only restricted functions, being able to use only the three functions of named entity recognition, entity query and entity relation query; users who register and log in normally through the system are additionally granted the functions of adding Thangka culture entities and Thangka entity relations on top of those three permissions; and the system administrator is granted all control authority over the Thangka culture knowledge graph display platform.
(4) System testing
1) Login interface test
This test shows the effects displayed, based on the different permissions granted according to role characteristics, after users of different roles log into the system, as well as the registration interface for a first-time user and the password modification operation. The login interface serves Thangka enthusiasts, Thangka culture researchers and platform administrators. By default the administrator option is unchecked and the system logs in as an ordinary user; when a guest accesses, clicking the guest-mode button in the lower right corner automatically jumps to the platform interface with restricted functions; and when the administrator checkbox is checked, the system switches to administrator login mode. When a user needs to modify the password, clicking "modify password" on the login interface jumps to the password modification interface.
2) Functional testing
Five functional modules are selected for the functional test display: Thangka named entity recognition, Thangka entity query, Thangka relation query, Thangka entity addition and Thangka relation modification.
A. Thangka named entity recognition function test: as shown in figure 10, the administrator role enters the overall overview of the system functions, and the default first page is the Thangka named entity recognition interface. Its main function is that, after the user inputs a Thangka text requiring entity recognition into the input text box, for example { "The Great White Umbrella Buddha Mother; her two main arms are placed in front of the chest, the left hand holds a vajra pestle, the right hand holds a white parasol with a handle, and her feet tread upon the beings of the six realms, whom she can protect." }, named entity recognition is performed after the confirm button is clicked. The recognition result is { "Great White Umbrella Buddha Mother/tianmu | main arms/qiguan | chest/qiguan | left hand/qiguan | vajra pestle/faqi | right hand/qiguan | white parasol/faqi | feet/qiguan | beings of the six realms/faqi" }, where a type output such as /tianmu indicates that the entity type of "Great White Umbrella Buddha Mother" is tianmu, and "|" serves as the separator.
B. Thangka entity query: as shown in figure 11, the user types a Bodhisattva name into the input field and clicks query; the entity relationships related to that Bodhisattva are shown in the relationship diagram column below, where the center node of the diagram is the queried Bodhisattva entity, the surrounding circles represent entities of other types, and the lines between entities represent the relationships between them. Through this function, one can query whether a Thangka field is a Thangka entity and, if it is, display that entity and the relationships connecting it with other entities.
C. Thangka relation query: as shown in figure 12, the user enters two specific Thangka entities to query whether an entity relationship exists between them. For example, the user inputs the two specific Thangka entities "Sakyamuni" and "Manjushri Bodhisattva" and selects the Unknown field in the drop-down column to indicate that the relationship between the two input fields is unknown; the main goal is to query the relationship between the two fields, which is shown in the figure as xieshi, i.e., Manjushri Bodhisattva is the attendant at the side of Sakyamuni. A second approach is to input entity 1 in the first input field, select a relation in the drop-down field, and query the entities matching the first field and the selected relation; its test effect is much the same as the first approach, so it is not shown.
D. Thangka entity addition: as shown in figure 13, this functional module is mainly open to Thangka culture researchers who log into the system through normal registration. In this module the user can input the Thangka entity to be added through the input field; clicking add sends a request to add the Thangka entity to the background administrator, and after the platform administrator confirms the correctness of the added entity, the request is granted and the entity is added to the Thangka culture database. The operation flow for adding a relationship between Thangka entities is the same.
E. Thangka relation modification: during the operation of the system, it is inevitable that some Thangka entities or definitions of relationships between Thangka entities are added incorrectly; such errors can be discovered and corrected by the platform administrator. For example, to change the relationship between the entity pair "Manjushri Bodhisattva" and "Sakyamuni" to "xieshi", the corresponding Thangka entity names are entered in the two input fields as shown in figure 14, and then the entity relationship to be modified is selected. Clicking the modify button completes the modification of the entity relationship from "Manjushri Bodhisattva" -[othername]-> "Sakyamuni Buddha" to "Manjushri Bodhisattva" -[xieshi]-> "Sakyamuni Buddha".
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A construction method of a Thangka culture knowledge graph display system, characterized by comprising the following steps:
S1, constructing the Thangka knowledge graph dataset, namely collecting and sorting the Thangka culture dataset and annotating the Thangka culture dataset with the Brat tool;
S1-1, dataset acquisition, namely first performing targeted analysis of Buddhism-related websites strongly correlated with Thangka texts, writing corresponding web crawlers, crawling the dataset with them, and then acquiring data manually for supplementary correction;
S1-2, data sorting and cleaning, namely sorting and cleaning the acquired dataset for heavy noise, missing values, duplicates and outliers: missing values are filled with the global constant Unknown, outliers are either deleted directly or assigned the global constant Unknown, duplicate data are de-duplicated, and the entry data recognized by OCR are used as a reference during the sorting and cleaning;
S1-3, dataset annotation with the Brat annotation tool, namely first generating, for each Thangka text to be extracted, a file with the same name and the suffix .ann, then configuring the initial entity types, entity relation types and text position information in the annotation.conf file, selecting BIOES as the entity tagging scheme, dividing the dataset in the ratio training set : test set = 8 : 2 once the entities and entity relations are obtained, and installing an Ubuntu system environment with VMware virtual machine software under Windows to deploy the Brat annotation tool;
S2, Thangka named entity recognition, namely adopting the Bi-Lstm+CRF model to recognize Thangka named entities: a conditional random field is introduced on top of the Bi-Lstm model, Bi-Lstm serves as the feature extractor, its final output is used as the input of the conditional random field model, and the conditional random field is used to obtain the state transition rules between labels of the tag sequence;
S3, joint extraction of entities and relationships among entities based on the Bs-Spert model:
S3-1, constructing the Bs-Spert model, which mainly comprises a Bert pre-training model module, a beam search module, a span classification module, a span filtering module and a relation classification module, with the Bert pre-training model serving as the base on which the Bs-Spert model is developed for the joint extraction of Thangka entities and the relationships between them;
S3-2, training the Bs-Spert model on the basis of the constructed Thangka text dataset, with Bert-Base-Chinese as the Bert pre-training model;
S3-3, performing the Thangka entity and relation joint extraction experiments, namely first testing the performance of the Bs-Spert model under different Beam Width values, then selecting the pooling function by testing the Precision, Recall and F1-Score of different pooling functions in the span classification module, and finally, on the basis of the results of the first two steps, comparing horizontally with the classic information-extraction models Bert-CNN and LSTM-RNN to obtain the experimental performance of the Bs-Spert model on the Thangka culture dataset;
S4, constructing the Thangka culture knowledge graph display system, namely first importing the Thangka culture knowledge graph into a Neo4j graph database through the two steps of storing the Thangka entities and storing the relationships between them, and then completing the web-based Thangka culture knowledge graph display system through the three steps of system requirement analysis, system design and system testing.
2. The construction method of the Thangka culture knowledge graph display system according to claim 1, wherein in step S3-3 the Beam Width values are set to 3, 5, 7, 9 and 11, and the pooling functions are Average Pooling, Sum Pooling and Max Pooling.
3. The construction method of the Thangka culture knowledge graph display system according to claim 1, wherein in step S4 the requirement analysis includes interface requirement analysis and functional requirement analysis, the system design includes interface design and functional design, the interface design includes the login interface and the functional interface, the system testing includes login testing and functional testing, the login testing covers guest mode, user mode and administrator mode, and the functional testing covers Thangka named entity recognition, Thangka entity query, Thangka entity relation query, and addition of entities or entity relations.
4. A Thangka culture knowledge graph display system, characterized in that the system is constructed by the construction method of the Thangka culture knowledge graph display system according to any one of claims 1-3 and comprises a requirement analysis layer, an interface login layer, a functional layer and a data storage layer;
the requirement analysis layer comprises interface requirement analysis and functional requirement analysis, wherein the interface requirement analysis is used for analyzing whether the interface requirements of different users are met in terms of operation and use and whether the font colors and sizes are reasonably matched with the main colors of the interface;
the interface login layer is used for the login interfaces of different users: it distinguishes the goals of different users based on the data analyzed by the requirement analysis layer and pops up the access login interface appropriate for those goals;
the functional layer is used for the operating functions of different users, wherein the operating functions comprise Thangka named entity recognition, Thangka entity query, Thangka relation query, addition of Thangka entities or Thangka entity relations, modification and correction of Thangka entities or Thangka entity relations, and deletion of Thangka entities or Thangka entity relations;
the data storage layer is used for storing the Thangka entities and the relationships between them.
5. The Thangka culture knowledge graph display system according to claim 4, wherein the interface login layer comprises a login interface, a new user registration interface and a password modification interface; the login interface serves Thangka enthusiasts, Thangka culture researchers and platform administrators; by default the administrator option is unchecked and the system logs in as an ordinary user; in guest mode, clicking the guest-mode button in the lower right corner automatically jumps to a platform interface with restricted functions; and when the administrator checkbox is checked, the system switches to administrator login mode.
6. The Thangka culture knowledge graph display system according to claim 5, wherein users accessing the system in guest mode are automatically classified as Thangka enthusiasts and granted only restricted functions, being able to use only the three functions of named entity recognition, entity query and entity relation query; users who register and log in normally through the system are additionally granted the functions of adding Thangka culture entities and Thangka entity relations on top of those three permissions; and the system administrator is granted all control authority over the Thangka culture knowledge graph display platform.
CN202211136388.9A 2022-09-19 2022-09-19 A Thangka cultural knowledge graph display system and its construction method Active CN115687314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211136388.9A CN115687314B (en) 2022-09-19 2022-09-19 A Thangka cultural knowledge graph display system and its construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211136388.9A CN115687314B (en) 2022-09-19 2022-09-19 A Thangka cultural knowledge graph display system and its construction method

Publications (2)

Publication Number Publication Date
CN115687314A CN115687314A (en) 2023-02-03
CN115687314B true CN115687314B (en) 2025-09-05

Family

ID=85062439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211136388.9A Active CN115687314B (en) 2022-09-19 2022-09-19 A Thangka cultural knowledge graph display system and its construction method

Country Status (1)

Country Link
CN (1) CN115687314B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118504572B (en) * 2024-07-17 2024-10-18 西安电子科技大学 A multi-level feature entity extraction method and system for digital reservoirs
CN118627318B (en) * 2024-08-13 2024-11-12 浙江业视数智科技有限公司 A method for analyzing cultural and tourism information data for interaction

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909549B (en) * 2019-10-11 2021-05-18 北京师范大学 Method, device and storage medium for segmenting ancient Chinese
CN111143574A (en) * 2019-12-05 2020-05-12 大连民族大学 Query and visualization system construction method based on minority culture knowledge graph
CN111563383A (en) * 2020-04-09 2020-08-21 华南理工大学 A Chinese Named Entity Recognition Method Based on BERT and SemiCRF
CN112988999B (en) * 2021-03-17 2024-07-12 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing Buddha study answer pairs
CN115062156B (en) * 2022-04-24 2025-03-21 天津大学 Knowledge graph construction method based on function word enhanced small sample relation extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Construction and Research of a Knowledge Graph for Thangka Culture; Li Changzhe (李长哲); China Master's Theses Full-text Database; 2023-03-15; F088-230 *

Also Published As

Publication number Publication date
CN115687314A (en) 2023-02-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant