CN109977406A

CN109977406A - A method for extracting keywords from TCM condition text based on disease location

Info

Publication number: CN109977406A
Application number: CN201910232088.2A
Authority: CN
Inventors: 姜晓红; 陈广; 吴健; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-03-26
Filing date: 2019-03-26
Publication date: 2019-07-05

Abstract

The invention discloses a method for extracting keywords from traditional Chinese medicine (TCM) condition text based on disease location, comprising the following steps: segmenting the TCM condition text into words and generating a TCM condition dictionary from the segmentation result; calculating the IDF value and TF value of each word in the TCM condition dictionary; raising the importance of each word according to its IDF value, its TF value, and whether the word contains a disease location; and, according to the importance of each word, selecting the top m words as the keywords of the text. Because most keywords in TCM condition text are disease-location words and symptom words, the invention takes disease location as its basis and improves the accuracy of keyword extraction from TCM condition text by applying a disease-location weight to the TF-IDF value.

Description

A method for extracting keywords from TCM disease text based on disease location

Technical Field

The invention belongs to the technical field of natural language processing, and in particular relates to a disease-location-based method for extracting keywords from TCM condition text.

Background

TCM syndrome differentiation and disease identification commonly rely on trial and counter-proof, reasoning by analogy, and the combined use of the four diagnostic methods (inspection, listening and smelling, inquiry, and pulse-taking) to diagnose and treat patients. The practitioner typically asks about the location of the complaint, the severity of the symptoms, the presence or absence of symptoms, and the patient's diet and daily life. With the continuing development of digital laboratory testing, TCM diagnosis and treatment also often includes Western medical test data, such as routine blood and urine tests. Compared with general text, such as People's Daily articles or online news, TCM condition text has the following characteristics:

1) The main sentence components (subject, predicate, and object) are often not explicit in TCM condition text, and some may be missing altogether. Coordination is also common: for example, "无压痛、反跳痛" ("no tenderness, rebound tenderness") should be read as "no tenderness" and "no rebound tenderness".

2) TCM condition text often includes Western medical test data, such as body temperature, which poses difficulties for algorithms based on text analysis.

3) TCM condition text contains many domain-specific terms. For example, the compound term "干湿性罗音" ("dry and moist rales") does not appear in general text.

4) The key semantic information of TCM condition text consists mainly of words or phrases expressing symptoms, disease locations, the presence or absence of symptoms, and symptom severity.

The commonly used keyword extraction algorithms are TF-IDF and TextRank. TF-IDF measures the importance of a word in a text by the product of its term frequency and its inverse document frequency; the words are then sorted in descending order of TF-IDF value, and the top-ranked words are selected as the keywords of the text. TextRank borrows the idea of PageRank and discovers keywords by building a word graph in which the words of the text are the nodes. With a text window of size k, an undirected edge connects any two words whose distance in the text is no greater than k. The importance of each word node is then computed on this graph by a random walk, and the most important words are the keywords of the text.
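The word-graph construction described above can be sketched as follows. This is only an illustration of the prior-art TextRank idea, not part of the claimed method. The function name `textrank`, the damping factor 0.85, and the fixed iteration count are assumptions chosen for the sketch; only the window size k comes from the text.

```python
from collections import defaultdict

def textrank(tokens, k=2, damping=0.85, iterations=50):
    """Score words with a PageRank-style iteration on a co-occurrence graph."""
    # Undirected edge between two different words at distance <= k.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + k + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    nodes = list(dict.fromkeys(tokens))  # unique words, original order
    score = {word: 1.0 for word in nodes}
    for _ in range(iterations):
        updated = {}
        for word in nodes:
            # Each neighbor shares its score equally among its own neighbors.
            rank = sum(score[u] / len(neighbors[u]) for u in neighbors[word])
            updated[word] = (1 - damping) + damping * rank
        score = updated
    return score

scores = textrank(["a", "b", "c", "b", "d"], k=1)
print(max(scores, key=scores.get))
```

In this toy input the word "b" is adjacent to every other word, so it receives the highest score.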

Both TF-IDF and TextRank share the following problem: before the algorithm runs, every word in the text is treated as equally important. Neither takes into account the relationship between the characters inside a word and the content of the text. For TCM condition text, most keywords should be the disease-location words and symptom words in the text; that is, words that contain a disease-location character are more likely to be keywords. For example, the word "腹泻" (diarrhea) contains the character "腹" (abdomen) and should therefore be a keyword of a TCM condition text.

Summary of the Invention

The object of the invention is to provide a disease-location-based method for extracting keywords from TCM condition text, which uses disease location as a feature for keyword extraction and extracts keywords by calculating a weighted TF-IDF value.

To achieve the above object, the invention provides the following technical solution:

A disease-location-based method for extracting keywords from TCM condition text, comprising the following steps:

segmenting the TCM condition text into words, and generating a TCM condition dictionary based on the segmentation result;

calculating the IDF value and TF value of each word in the TCM condition dictionary;

raising the importance of each word according to its IDF value and TF value and whether the word contains a disease location;

selecting, according to the importance of each word, the top m words as the keywords of the text.

The method provided by the invention overcomes the problem that traditional keyword extraction methods, such as TF-IDF and TextRank, initially treat every word as equally important. Considering that most keywords in TCM condition text are disease-location words and symptom words, the invention takes disease location as its basis and applies a disease-location weight to the TF-IDF value, thereby improving the accuracy of keyword extraction from TCM condition text.

Description of the Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of the disease-location-based method for extracting keywords from TCM condition text according to the invention.

Detailed Description

To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit its scope of protection.

As shown in Fig. 1, this embodiment provides a disease-location-based method for extracting keywords from TCM condition text. The method takes disease location as its basis and applies a disease-location weight to the TF-IDF value, which improves the accuracy of keyword extraction from TCM condition text. It comprises the following steps:

S101: segment the TCM condition text into words.

Specifically, the set of condition texts to be segmented is segmented according to a medical dictionary and a stop-word dictionary stopWords, and the stop words are removed, yielding a segmented text set.

The condition text set contains representative TCM diagnostic text cases. The medical dictionary is a dictionary of medicinal materials, prescriptions, and basic medical terms. The stop-word dictionary stopWords is a stop-word list specific to the TCM domain, containing words such as "病人" (patient) and "病史" (medical history).
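As a rough illustration of S101, the sketch below segments a sentence by forward maximum matching against a dictionary and then drops stop words. The matching strategy is a stand-in chosen to keep the example self-contained (the experimental example later uses the jieba tool instead), and `MEDICAL_DICT`, `STOP_WORDS`, and `segment` are hypothetical names with toy contents.

```python
def segment(text, dictionary, stop_words, max_len=4):
    """Forward maximum matching, followed by stop-word removal."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return [t for t in tokens if t not in stop_words]

# Toy dictionaries, assumed for illustration only.
MEDICAL_DICT = {"咳嗽", "压痛", "反跳痛", "出现", "病人"}
STOP_WORDS = {"病人", "等", "，", "、", "。"}

print(segment("病人出现咳嗽，无压痛、反跳痛等。", MEDICAL_DICT, STOP_WORDS))
# → ['出现', '咳嗽', '无', '压痛', '反跳痛']
```

Punctuation is treated as stop words here for simplicity; a real stop-word list for the TCM domain would be much larger.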

S102: generate the TCM condition dictionary based on the segmentation result.

Generating the TCM condition dictionary based on the segmentation result comprises:

counting the words in the segmented text set, and adding the words whose frequency of occurrence falls within the interval [α₁, α₂] to the TCM condition dictionary.
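The frequency filter of S102 can be sketched as below. The function name `build_dictionary` and the toy input are assumptions; the interval is treated as closed at both ends, as the bracket notation suggests.

```python
from collections import Counter

def build_dictionary(segmented_texts, alpha1, alpha2):
    """Keep the words whose total occurrence count lies in [alpha1, alpha2]."""
    counts = Counter(tok for text in segmented_texts for tok in text)
    return {word for word, c in counts.items() if alpha1 <= c <= alpha2}

# Toy segmented texts, assumed for illustration only.
texts = [["咳嗽", "压痛"], ["咳嗽"], ["升高"]]
print(build_dictionary(texts, 2, 10))  # → {'咳嗽'}
```

The lower bound removes rare words and the upper bound removes words that are too common to be informative.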

S103: calculate the IDF value and TF value of each word in the TCM condition dictionary.

The IDF value is the inverse document frequency. For the i-th word in the TCM condition dictionary, the IDF value idf_i is computed from |D|, the total number of TCM condition texts, and n_i, the number of TCM condition texts that contain the i-th word.
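The IDF formula itself is not reproduced in this text. A standard definition consistent with the symbols |D| and n_i, assuming no smoothing term and the natural logarithm used later in the experimental example, would be:

```latex
\mathrm{idf}_i = \ln \frac{|D|}{n_i}
```

Under this assumed form, a word that appears in every text has an IDF of 0, and rarer words receive larger values.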

TF is the term frequency. Before the TF values are calculated, the condition texts in the segmented text set that belong to the same department are concatenated into one text D_{all,k}, where k is the department index.

For the i-th word in the TCM condition dictionary, the TF value tf_i is computed from n_i, which here denotes the number of times the i-th word appears in the text D_{all,k}.
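The two quantities of S103 can be sketched together as below. Both formulas are assumptions, since neither is reproduced in this text: IDF is taken as ln(|D| / n_i), and TF as the count of the word in the merged department text divided by the total number of tokens in that text. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def idf_values(segmented_texts, dictionary):
    """Assumed form: idf = ln(number of texts / number of texts containing the word)."""
    total = len(segmented_texts)
    idf = {}
    for word in dictionary:
        n_docs = sum(1 for text in segmented_texts if word in text)
        if n_docs:
            idf[word] = math.log(total / n_docs)
    return idf

def tf_values(segmented_texts, departments, dictionary):
    """Merge texts by department, then compute the relative frequency per merged text."""
    merged = {}
    for text, dept in zip(segmented_texts, departments):
        merged.setdefault(dept, []).extend(text)
    tf = {}
    for dept, tokens in merged.items():
        counts = Counter(tokens)
        tf[dept] = {w: counts[w] / len(tokens) for w in dictionary if counts[w]}
    return tf

# Toy data, assumed for illustration only.
texts = [["咳嗽", "压痛"], ["咳嗽"], ["神志", "清"]]
departments = ["k1", "k2", "k1"]
vocab = {"咳嗽", "压痛", "神志", "清"}
print(idf_values(texts, vocab))
print(tf_values(texts, departments, vocab))
```

Merging by department before counting means TF reflects how characteristic a word is of a department's texts as a whole, rather than of a single record.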

S104: raise the importance of each word according to its IDF value and TF value and whether the word contains a disease location.

When a word in the TCM condition dictionary contains a disease location, its importance is raised; if the word contains no disease location, its importance is not raised. In the importance formula, word_i denotes the importance of the i-th word in the TCM condition dictionary, α adjusts the weight given to disease locations, with α ≥ 0, and w_j denotes the weight of the j-th disease location.

The weight w_j is in turn computed from c_j, the frequency with which the j-th disease location occurs, and n, the total number of disease locations.

In the invention, disease locations include not only the characters for parts of the body but also certain basic symptom characters, such as "痛" (pain) and "虚" (deficiency); these characters reflect the theoretical basis of TCM syndrome differentiation.
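The weighting step of S104 can be sketched as below. Neither formula is reproduced in this text, so both forms here are assumptions: the location weight is taken as w_j = c_j / (c_1 + ... + c_n), and the importance as tf · idf · (1 + α · Σ w_j) summed over the disease-location characters that occur in the word. The assumed form leaves a word without any disease location at its plain TF-IDF value, which matches the description, but the patent's actual formula may differ.

```python
def location_weights(location_counts):
    """Assumed form: each location's count divided by the sum of all counts."""
    total = sum(location_counts.values())
    if total == 0:
        return {}
    return {loc: c / total for loc, c in location_counts.items()}

def importance(word, tf, idf, weights, alpha):
    """Assumed form: TF-IDF scaled up by the weights of the locations in the word."""
    boost = sum(w for loc, w in weights.items() if loc in word)
    return tf * idf * (1 + alpha * boost)

# Toy counts, assumed for illustration only.
weights = location_weights({"痛": 2, "神": 1})
print(importance("压痛", 0.25, 1.0, weights, alpha=10))  # boosted by the weight of "痛"
print(importance("咳嗽", 0.25, 1.0, weights, alpha=10))  # no location, plain TF-IDF
```

With α = 0 the boost disappears and the ranking reduces to ordinary TF-IDF, which is consistent with the constraint α ≥ 0.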

S105: according to the importance of each word, select the top m words as the keywords of the text.

Specifically, the words can be ranked by importance, and the top m words are selected as the keywords W = {W₁, W₂, ..., W_m} of the text D_all.
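The selection in S105 amounts to a top-m query over the importance values, which can be sketched as below. The function name `top_m_keywords` and the scores are hypothetical.

```python
import heapq

def top_m_keywords(importance_by_word, m):
    """Return the m words with the highest importance, highest first."""
    return heapq.nlargest(m, importance_by_word, key=importance_by_word.get)

# Toy scores, assumed for illustration only.
scores = {"压痛": 1.9, "咳嗽": 0.25, "神志": 1.2}
print(top_m_keywords(scores, 2))  # → ['压痛', '神志']
```

When several words have the same importance, this sketch keeps them in insertion order; the method itself does not prescribe a tie-breaking rule.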

Experimental Example

Suppose the condition text set contains text A, text B, text C, text D, and text E:

Text A: 病人1周前无明显诱因下出现咳嗽，神志清，精神可。无压痛、反跳痛等。 (One week ago the patient developed a cough with no obvious trigger; clear consciousness, fair spirits. No tenderness or rebound tenderness.)

Text B: 病人出现咳嗽，无压痛、反跳痛等。 (The patient developed a cough, with no tenderness or rebound tenderness.)

Text C: 病人出现咳嗽。 (The patient developed a cough.)

Text D: 病人神志清。 (The patient has clear consciousness.)

Text E: 病人血肌酐升高。 (The patient has elevated serum creatinine.)

Texts A to E are processed according to S101 to S105, where texts A and C belong to the same category, texts B and D belong to the same category, and text E forms a category of its own.

After word segmentation and stop-word removal in S101, the following text set D′ is obtained; the jieba (结巴) word segmentation tool is used for segmentation:

Text A: 1周前无明显诱因出现咳嗽神志清精神可无压痛反跳痛 (cough appeared 1 week ago with no obvious trigger; clear consciousness; fair spirits; no tenderness or rebound tenderness)

Text B: 出现咳嗽无压痛反跳痛 (cough appeared; no tenderness or rebound tenderness)

Text C: 出现咳嗽 (cough appeared)

Text D: 神志清 (clear consciousness)

Text E: 血肌酐升高 (elevated serum creatinine)

Set α₁ = 1 and α₂ = 10. From the text set D′ obtained in S101, the screening in S102 generates the TCM condition dictionary X, which contains the words shown in Table 1:

Table 1: Generation of dictionary X

No.   Word                           Frequency
1     出现 (appear)                  3
2     咳嗽 (cough)                   3
3     无 (none)                      2
4     压痛 (tenderness)              2
5     反跳痛 (rebound tenderness)    2
6     神志 (consciousness)           2
7     清 (clear)                     2

The IDF value of each word is calculated according to S103, using the natural logarithm (base e). The results are shown in Table 2.

Table 2: IDF calculation results for dictionary X

The TF value of each word is calculated according to S103. First, the condition texts in the text set D′ that belong to the same department are concatenated into one text D_all: texts A and C, which are of the same type, are merged into D_all-ac; texts B and D are merged into D_all-bd; and text E becomes D_all-e. The TF values of the words in the TCM condition dictionary X are then calculated for each merged text. The results are shown in Table 3:

Table 3: TF calculation results for each category

The importance of the words is raised according to S104. In this example, "痛" (pain) and "神" (spirit) are chosen as the disease locations. The initialized disease-location weights are shown in Table 4:

Table 4: Disease-location initialization results

Set α = 10. The importance of each word, reflecting the influence of the disease locations it contains, is calculated according to S104 and shown in Table 5:

Table 5: Calculated word_i values

According to S105, two words are selected as the keywords of each text. The results are shown in Table 6:

Table 6: Text keyword results

The top-2 keywords of D_all-ac are 压痛 (tenderness) and 反跳痛 (rebound tenderness). The top-2 keywords of D_all-bd can be any two of 压痛 (tenderness), 反跳痛 (rebound tenderness), and 神志 (consciousness). D_all-e has no keywords. This example only illustrates the calculation process of the invention and does not represent real data.

The specific embodiments described above explain the technical solution and beneficial effects of the invention in detail. It should be understood that they are only the most preferred embodiments of the invention and are not intended to limit it; any modification, addition, or equivalent substitution made within the principles of the invention shall fall within its scope of protection.

Claims

1. A method for extracting keywords from TCM condition text based on disease location, comprising the following steps:

segmenting the TCM condition text into words, and generating a TCM condition dictionary based on the segmentation result;

calculating the IDF value and TF value of each word in the TCM condition dictionary;

raising the importance of each word according to its IDF value and TF value and whether the word contains a disease location;

selecting, according to the importance of each word, the top m words as the keywords of the text.

2. The method for extracting keywords from TCM condition text based on disease location according to claim 1, characterized in that, when the TCM condition text is segmented, the set of condition texts to be segmented is segmented according to a medical dictionary and a stop-word dictionary, and the stop words are removed, to obtain a segmented text set.

3. The method for extracting keywords from TCM condition text based on disease location according to claim 2, characterized in that generating the TCM condition dictionary based on the segmentation result comprises:

counting the words in the segmented text set, and adding the words whose frequency of occurrence falls within the interval [α₁, α₂] to the TCM condition dictionary.

4. The method for extracting keywords from TCM condition text based on disease location according to claim 2, characterized in that, for the i-th word in the TCM condition dictionary, the IDF value idf_i is determined from |D|, the total number of TCM condition texts, and n_i, the number of TCM condition texts that contain the i-th word.

5. The method for extracting keywords from TCM condition text based on disease location according to claim 2, characterized in that, before the TF values of the words are calculated, the condition texts in the segmented text set that belong to the same department are concatenated into one text D_{all,k}, where k is the department index;

for the i-th word in the TCM condition dictionary, the TF value tf_i is determined from n_i, the number of times the i-th word appears in the text D_{all,k}.

6. The method for extracting keywords from TCM condition text based on disease location according to claim 2, characterized in that, when a word in the TCM condition dictionary contains a disease location, the importance of the word is raised, wherein, in the importance formula, word_i denotes the importance of the i-th word in the TCM condition dictionary, α adjusts the weight given to disease locations, with α ≥ 0, and w_j denotes the weight of the j-th disease location;

the weight w_j is determined from c_j, the frequency with which the j-th disease location occurs, and n, the total number of disease locations.