CN111177412B

CN111177412B - Bilingual parallel corpus system for public signs

Info

Publication number: CN111177412B
Application number: CN201911388415.XA
Authority: CN
Inventors: 李伟彬; 张洁; 刘小蓉; 毛智; 田娜; 阳程
Original assignee: Chengdu University of Information Technology; Chengdu University of Technology
Current assignee: Chengdu University of Information Technology; Chengdu University of Technology
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2023-03-31
Anticipated expiration: 2039-12-30
Also published as: CN111177412A

Abstract

The invention relates to a bilingual parallel corpus system for public signs, comprising a corpus collection module, a classification and labeling module, parallel corpus information sub-databases, corpus information sub-libraries, a category index table, and a query information extraction module. The corpus information collected is labeled by the classification and labeling module and stored by category in the parallel corpus information sub-databases, and the corpus information sub-libraries use the secondary classification to establish associations with other main classifications, so that the corresponding bilingual parallel corpus information can be found quickly within the required range at query time. Because public signs span many fields, the invention uses an interrelated multi-level labeling scheme to classify and label the stored public sign corpus information. Combined with semantic labeling, this allows all potentially related corpus entries to be surfaced quickly in a query while unrelated entries are excluded from the search, which improves the efficiency of using the bilingual parallel corpus of public signs.

Description

Bilingual parallel corpus system for public signs

Technical field

The invention relates to a bilingual parallel corpus system for public signs.

Background

Public signs, also called public notices, are directive language provided in cities for the convenience of residents and visitors. They cover service facilities, institution names, billboards, public facilities, public transportation, tourist attractions, street signs, slogans, shop signs, and the like, and their purpose is to give the public useful information in concise language. With economic and cultural development, and the growth of tourism in particular, many cities attract large numbers of foreign visitors, so the translation of public signs is especially important: it represents a city's linguistic and cultural environment and plays an important role in promoting the tourism industry. Correct and appropriate translations of public signs give visitors from all countries convenient help and improve the city's overall image. Conversely, wrong or inappropriate translations create obstacles to understanding, or even misunderstandings, for foreign visitors. It is therefore necessary to ensure that public signs are translated accurately.

In improving the accuracy of public sign translation, building a sound and accurate bilingual parallel corpus of public signs is essential. Because public signs span many fields, those skilled in the art urgently need a way for users to obtain the required bilingual corpus information quickly and accurately.

Summary of the invention

To address the above technical problem, the present invention provides a bilingual parallel corpus system for public signs, which uses computer information processing technology to improve the efficiency of retrieving bilingual parallel corpus information of public signs.

To achieve this object, the technical solution adopted by the present invention is as follows:

A bilingual parallel corpus system for public signs, comprising:

a corpus collection module, used to collect bilingual parallel corpus information of public signs from external information channels;

a classification and labeling module, used to label the bilingual parallel corpus information obtained by the corpus collection module according to preset categories and to output the corresponding category identifiers, where the preset categories include at least a first-level category corresponding to the main classification of the bilingual parallel corpus information and a second-level category corresponding to its secondary classification, and the category identifiers include a first-class identifier corresponding to the first-level category and a second-class identifier corresponding to the second-level category;

parallel corpus information sub-databases, whose number matches the number of first-level categories of the bilingual parallel corpus information, used to store the bilingual parallel corpus information separately and independently by main classification;

a corpus information sub-library, belonging to each parallel corpus information sub-database, used to hold the second-class identifiers that the bilingual parallel corpus information of the current classification produces under its secondary classification, and to associate those second-class identifiers with other first-level categories;

a category index table, used to record and store the category identifiers, and to configure a jump interface on each second-class identifier that is associated with a first-level category;

a query information extraction module, used at query time to label the input with the category identifiers it may involve according to its semantics, so that the query is checked directly against the category index table and a traversal search is performed in the corresponding parallel corpus information sub-databases.
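
The storage side of the modules listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Entry`, `SubDatabase`, `Corpus`, and `linked_primaries`, as well as the category labels in the usage note, are hypothetical, and the "jump interface" is modeled simply as a set of first-level categories reachable from an identifier.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    source: str     # sign text in the source language
    target: str     # its translation
    primary: str    # first-class identifier (main classification)
    secondary: str  # second-class identifier (secondary classification)


@dataclass
class SubDatabase:
    """One parallel corpus store per first-level category."""
    entries: list = field(default_factory=list)
    # Child store: second-class identifier -> OTHER first-level categories
    # that the identifier is associated with.
    secondary_links: dict = field(default_factory=dict)


class Corpus:
    def __init__(self, primary_categories):
        # The number of sub-databases matches the number of first-level categories.
        self.subdbs = {c: SubDatabase() for c in primary_categories}
        # Category index table: identifier -> first-level categories to visit.
        self.index = {c: {c} for c in primary_categories}

    def add(self, entry, linked_primaries=()):
        """Store an entry and register its identifiers (assumed, simplified logic)."""
        db = self.subdbs[entry.primary]
        db.entries.append(entry)
        links = db.secondary_links.setdefault(entry.secondary, set())
        links.update(p for p in linked_primaries
                     if p in self.subdbs and p != entry.primary)
        # The index lets a second-class identifier "jump" to every linked category.
        self.index.setdefault(entry.secondary, set()).update({entry.primary} | links)
```

For example, adding `Entry("小心台阶", "Mind the step", "transport", "safety-warning")` with `linked_primaries=["tourism"]` stores the entry only in the `transport` sub-database, while the index entry for `safety-warning` points to both `transport` and `tourism`.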

Specifically, each second-class identifier is associated, through the semantic context of the secondary classification, with the other first-level categories that match it.

Further, the bilingual parallel corpus information of public signs stored in each parallel corpus information sub-database is configured with a priority value, and the priority values are ranked by how frequently the information is queried.
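
One way to realize a frequency-based priority value is a simple hit counter. The sketch below assumes that the priority of an entry is just the number of times it has been returned for a query; the class `PriorityTracker` and its methods are hypothetical names, and the source does not specify how the value is computed or updated.

```python
from collections import Counter


class PriorityTracker:
    """Tracks how often each entry is queried and ranks entries accordingly."""

    def __init__(self):
        self.hits = Counter()  # entry id -> number of times queried

    def record(self, entry_id):
        self.hits[entry_id] += 1

    def rank(self, entry_ids):
        # Most frequently queried first; sorted() is stable, so ties keep input order.
        return sorted(entry_ids, key=lambda e: -self.hits[e])
```

Entries that have never been queried count as zero hits and therefore sort after all queried entries.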

Further, the corpus information sub-library is configured with a relevance value, which indicates the semantic correlation between different items of bilingual parallel corpus information of public signs.
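
The source does not say how the relevance value is calculated. Purely as an illustration, the sketch below substitutes a Jaccard overlap of word tokens for "semantic correlation"; the function name `relevance` and the choice of measure are assumptions, and a real system might use a different similarity measure entirely.

```python
def relevance(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word tokens, in the range 0.0 to 1.0 (assumed stand-in)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Under this stand-in, "No smoking" and "No parking" share one of three distinct tokens, so their relevance value is one third.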

Compared with the prior art, the present invention has the following beneficial effects:

Because public signs span many fields, the present invention uses an interrelated multi-level labeling scheme to classify and label the stored public sign corpus information. Combined with semantic labeling, this allows all potentially related corpus entries to be surfaced quickly in a query while unrelated entries are excluded from the search. It improves the efficiency of using the bilingual parallel corpus of public signs and helps promote the application of public signs.

Description of drawings

Fig. 1 is a schematic block diagram of the present invention.

Detailed description

The present invention is further described below with reference to the drawing and an embodiment. Implementations of the present invention include, but are not limited to, the following embodiment.

Embodiment

As shown in Fig. 1, the bilingual parallel corpus system for public signs comprises:

a corpus information sub-library, belonging to each parallel corpus information sub-database, used to hold the second-class identifiers that the bilingual parallel corpus information of the current classification produces under its secondary classification, and to associate each second-class identifier, through the semantic context of the secondary classification, with the other first-level categories that match it; the corpus information sub-library is configured with a relevance value, which indicates the semantic correlation between different items of bilingual parallel corpus information of public signs;

In addition, the bilingual parallel corpus information of public signs stored in each parallel corpus information sub-database is configured with a priority value, and the priority values are ranked by how frequently the information is queried.

In practice, when a user queries a piece of public sign text, the query information extraction module first labels it with category identifiers. The system then uses those identifiers to select the associated parallel corpus information sub-databases to be searched, and searches within those sub-databases by the actual keywords, so that the required bilingual parallel corpus information is obtained quickly and accurately.
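
The query flow described above (label the query, look up the index, search only the selected sub-databases) can be sketched as follows. Everything here is illustrative: the keyword lexicon `LEXICON` is a hypothetical stand-in for semantic labeling, and the table contents, category names, and function names are assumptions made for the example.

```python
# Hypothetical keyword lexicon standing in for semantic labeling of the query.
LEXICON = {"站台": "transport", "景区": "tourism", "小心": "safety-warning"}

# Category index table: identifier -> first-level sub-databases to search.
# "safety-warning" is a second-class identifier linked to two first-level categories.
INDEX = {
    "transport": {"transport"},
    "tourism": {"tourism"},
    "safety-warning": {"transport", "tourism"},
}

# One list of (source, translation) pairs per first-level category.
SUBDBS = {
    "transport": [("请在站台等候", "Please wait on the platform")],
    "tourism": [("景区入口", "Scenic area entrance"), ("小心地滑", "Caution: wet floor")],
    "commerce": [("营业中", "Open")],
}


def label(query):
    """Return the keywords found in the query and the identifiers they map to."""
    keywords = [kw for kw in LEXICON if kw in query]
    return keywords, {LEXICON[kw] for kw in keywords}


def search(query):
    keywords, identifiers = label(query)
    targets = set()
    for ident in identifiers:
        targets |= INDEX.get(ident, set())
    hits = []
    for name in sorted(targets):  # only the selected sub-databases are scanned
        for zh, en in SUBDBS[name]:
            if any(kw in zh for kw in keywords):
                hits.append((name, zh, en))
    return hits
```

In this sketch a query that receives no identifier returns nothing, and the `commerce` sub-database is never scanned unless some identifier in the index points to it, which is the exclusion of unrelated corpus entries that the description refers to.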

The above embodiment is only one preferred implementation of the present invention and should not be used to limit its scope of protection. Any change or refinement without substantive significance that is made within the main design concept and spirit of the present invention, and that solves a technical problem consistent with that of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A bilingual parallel corpus system of public signs, characterized in that it comprises:

The corpus collection module is used to collect and obtain bilingual parallel corpus information of public signs from external information channels;

The classification and labeling module is used to label the bilingual parallel corpus information of public signs obtained by the corpus collection module according to preset categories, and to output the corresponding category identifiers; the preset categories include at least a first-level category corresponding to the main classification of the bilingual parallel corpus information and a second-level category corresponding to its secondary classification, and the category identifiers include a first-class identifier corresponding to the first-level category and a second-class identifier corresponding to the second-level category;

Parallel corpus information sub-database, the number of which matches the number of first-level categories of bilingual parallel corpus information of public signs, and is used to independently store the bilingual parallel corpus information of public signs according to main classifications;

The corpus information sub-library belongs to each parallel corpus information sub-database, and is used to hold the second-class identifiers that the bilingual parallel corpus information of the current classification produces under its secondary classification, and to associate those second-class identifiers with other first-level categories; wherein each second-class identifier is associated, through the semantic context of the secondary classification, with the other first-level categories that match it;

The category index table is used to record and store the category identifiers, and to configure a jump interface on each second-class identifier that is associated with a first-level category;

The query information extraction module is used at query time to label the input with the category identifiers it may involve according to its semantics, so that the query is checked directly against the category index table and a traversal search is performed in the corresponding parallel corpus information sub-databases.

2. The bilingual parallel corpus system of public signs according to claim 1, characterized in that the bilingual parallel corpus information of public signs stored in each said parallel corpus information sub-database is configured with a priority value, and the priority values are ranked by how frequently the information is queried.

3. The bilingual parallel corpus system of public signs according to claim 2, characterized in that the corpus information sub-library is configured with a relevance value, which indicates the semantic correlation between different items of bilingual parallel corpus information of public signs.