CN111177412B - Public logo bilingual parallel corpus system - Google Patents
Public logo bilingual parallel corpus system
- Publication number
- CN111177412B (application number CN201911388415.XA)
- Authority
- CN
- China
- Prior art keywords
- information
- parallel corpus
- public
- category
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/381—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Tourism & Hospitality (AREA)
- Educational Administration (AREA)
- General Health & Medical Sciences (AREA)
- Development Economics (AREA)
- Animal Behavior & Ethology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- Computational Linguistics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a bilingual parallel corpus system for public signs, comprising a corpus collection module, a classification and labeling module, parallel corpus information sub-databases, corpus information sub-libraries, a category index table, and a query information extraction module. The classification and labeling module stores the collected corpus information by category in the parallel corpus information sub-databases, and the corpus information sub-libraries use the secondary classification to establish associations with other primary classifications, so that a query can quickly find the corresponding bilingual parallel corpus information within the required scope. To address the fact that public signs span many fields, the invention provides an interrelated multi-level labeling scheme for classifying and labeling the stored public sign corpus information. Combined with semantic labeling, this allows corpus entries that may be related to be surfaced quickly in a query and effectively excludes unrelated corpus entries from the search, improving the efficiency of using the bilingual parallel corpus of public signs.
Description
Technical field
The invention relates to a bilingual parallel corpus system for public signs.
Background
Public signs, also known as public notices, are mainly indicative language provided in cities for the convenience of the public and of tourists. They cover service facilities, institution names, billboards, public facilities, public transportation, tourist attractions, street signs, slogans, shop signs, and the like, and their purpose is to give the public useful information in concise language. With economic and cultural development, and especially the growth of tourism, many cities attract large numbers of foreign visitors, so the translation of public signs has become particularly important: it not only represents a city's linguistic and cultural environment but also plays an important role in promoting the tourism industry. Correct and appropriate translations of public signs give visitors from all countries convenient help and improve a city's overall image. Conversely, wrong or inappropriate translations create obstacles to understanding, or even misunderstandings, for foreign visitors. Ensuring that public signs are translated accurately is therefore necessary.
Building a sound and accurate bilingual parallel corpus of public signs is in turn essential to improving translation accuracy. Because public signs span many fields, those skilled in the art urgently need a way for users to obtain the required bilingual corpus information quickly and accurately.
Summary of the invention
In view of the above technical problem, the invention provides a bilingual parallel corpus system for public signs that uses computer information processing technology to improve the efficiency of retrieving bilingual parallel corpus information for public signs.
To achieve the above object, the invention adopts the following technical solution:
A bilingual parallel corpus system for public signs, comprising:
a corpus collection module, configured to collect bilingual parallel corpus information of public signs from external information channels;
a classification and labeling module, configured to label the bilingual parallel corpus information obtained by the corpus collection module according to preset categories and to output the corresponding category identifiers, wherein the preset categories include at least a first-level category corresponding to the primary classification of the corpus information and a second-level category corresponding to its secondary classification, and the category identifiers include a first-class identifier corresponding to the first-level category and a second-class identifier corresponding to the second-level category;
parallel corpus information sub-databases, whose number matches the number of first-level categories of the corpus information, configured to store the bilingual parallel corpus information separately and independently by primary classification;
a corpus information sub-library, belonging to each parallel corpus information sub-database, configured to store the second-class identifiers produced by the secondary classification of the corpus information in the current category, and to associate those second-class identifiers with other first-level categories;
a category index table, configured to record the category identifiers and to configure a jump interface on each second-class identifier that is associated with a first-level category;
a query information extraction module, configured to label the input information at query time, according to its semantics, with the category identifiers it may involve, so that the query is checked directly against the category index table and a traversal search is performed in the corresponding parallel corpus information sub-databases.
Specifically, each second-class identifier is associated with the other first-level categories that match it through the semantic context of the secondary classification.
Further, the bilingual parallel corpus information stored in each parallel corpus information sub-database is assigned a priority value, and the entries are ordered by priority according to how often they are queried.
Further, the corpus information sub-library is configured with association degree values that indicate the semantic relevance between different bilingual parallel corpus entries of public signs.
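The storage layout described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names (`CorpusEntry`, `SubDatabase`, `Corpus`), the category identifiers such as `"TRANSPORT"` and `"TICKETING"`, and the `links` mapping that says which first-level categories a second-class identifier relates to are all hypothetical. The patent does not specify how those semantic links are derived, so the sketch takes them as given.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusEntry:
    source: str      # sign text in the source language
    target: str      # its parallel translation
    primary: str     # first-class identifier (hypothetical, e.g. "TRANSPORT")
    secondary: list  # second-class identifiers (hypothetical, e.g. ["TICKETING"])


@dataclass
class SubDatabase:
    """One independent store per first-level category, with its sub-library."""
    primary: str
    entries: list = field(default_factory=list)
    # Sub-library: second-class identifier -> other first-level categories it links to.
    sub_library: dict = field(default_factory=dict)


class Corpus:
    def __init__(self, primaries, links):
        # The number of sub-databases matches the number of first-level categories.
        self.sub_dbs = {p: SubDatabase(p) for p in primaries}
        # Assumed input: second-class identifier -> semantically related first-level categories.
        self.links = links
        # Category index table: identifier -> first-level categories reachable through it.
        self.index = {p: {p} for p in primaries}

    def add(self, entry):
        db = self.sub_dbs[entry.primary]
        db.entries.append(entry)
        for tag in entry.secondary:
            others = set(self.links.get(tag, ())) - {entry.primary}
            db.sub_library.setdefault(tag, set()).update(others)
            # Stand-in for the "jump interface": the index records the home
            # category plus every linked first-level category for this tag.
            self.index.setdefault(tag, set()).update({entry.primary} | others)
```

With this layout, an entry stored under one first-level category stays in its own sub-database, while the index lets a lookup on a second-class identifier reach the other categories it is linked to.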
Compared with the prior art, the invention has the following beneficial effects:
To address the fact that public signs span many fields, the invention provides an interrelated multi-level labeling scheme for classifying and labeling the stored public sign corpus information. Combined with semantic labeling, this allows corpus entries that may be related to be surfaced quickly in a query and effectively excludes unrelated corpus entries from the search. It thereby improves the efficiency of using the bilingual parallel corpus of public signs and helps promote the application of public signs.
Description of the drawings
Fig. 1 is a schematic block diagram of the invention.
Detailed description
The invention is further described below with reference to the drawing and an embodiment. Implementations of the invention include, but are not limited to, the following embodiment.
Embodiment
As shown in Fig. 1, the bilingual parallel corpus system for public signs comprises:
a corpus collection module, configured to collect bilingual parallel corpus information of public signs from external information channels;
a classification and labeling module, configured to label the bilingual parallel corpus information obtained by the corpus collection module according to preset categories and to output the corresponding category identifiers, wherein the preset categories include at least a first-level category corresponding to the primary classification of the corpus information and a second-level category corresponding to its secondary classification, and the category identifiers include a first-class identifier corresponding to the first-level category and a second-class identifier corresponding to the second-level category;
parallel corpus information sub-databases, whose number matches the number of first-level categories of the corpus information, configured to store the bilingual parallel corpus information separately and independently by primary classification;
a corpus information sub-library, belonging to each parallel corpus information sub-database, configured to store the second-class identifiers produced by the secondary classification of the corpus information in the current category, and to associate each second-class identifier, through the semantic context of the secondary classification, with the other first-level categories that match it, wherein the corpus information sub-library is configured with association degree values that indicate the semantic relevance between different bilingual parallel corpus entries of public signs;
a category index table, configured to record the category identifiers and to configure a jump interface on each second-class identifier that is associated with a first-level category;
a query information extraction module, configured to label the input information at query time, according to its semantics, with the category identifiers it may involve, so that the query is checked directly against the category index table and a traversal search is performed in the corresponding parallel corpus information sub-databases.
In addition, the bilingual parallel corpus information stored in each parallel corpus information sub-database is assigned a priority value, and the entries are ordered by priority according to how often they are queried.
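One way to read the priority value is as a per-entry query counter that decides the order in which matches are returned. The sketch below assumes exactly that. The class name `PrioritizedSubDatabase`, the `record_query` method, and the substring matching are hypothetical choices for illustration, since the patent only states that priority follows query frequency.

```python
from collections import Counter


class PrioritizedSubDatabase:
    """A single sub-database whose entries are ranked by query frequency."""

    def __init__(self):
        self.entries = []              # (source, target) pairs in insertion order
        self.query_counts = Counter()  # entry -> number of recorded queries

    def add(self, source, target):
        self.entries.append((source, target))

    def record_query(self, entry):
        # Called whenever an entry is actually retrieved by a user (assumed trigger).
        self.query_counts[entry] += 1

    def search(self, keyword):
        kw = keyword.lower()
        matches = [e for e in self.entries if kw in e[0] or kw in e[1].lower()]
        # Most frequently queried entries first; sorted() is stable, so entries
        # with equal counts keep their insertion order.
        return sorted(matches, key=lambda e: -self.query_counts[e])
```

Keeping the counter separate from the stored pairs means the ranking can be reset or recomputed without touching the corpus data itself.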
In practical use, when a user queries a piece of public sign text, the query information extraction module first labels it with category identifiers. The system then uses those identifiers to select the associated parallel corpus information sub-databases to be searched, and finally searches within those sub-databases using the actual keywords, so that the required bilingual parallel corpus information is obtained quickly and accurately.
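The three-step query flow described above (label the query, select sub-databases through the index table, then search by keyword only inside them) can be sketched as follows. All data here is invented for illustration: `LEXICON` is a hypothetical keyword lookup standing in for the semantic labeling step, which the patent does not specify, and the category names and sample entries are assumptions.

```python
# Hypothetical keyword -> category identifier lexicon (stand-in for semantic labeling).
LEXICON = {"ticket": "TICKETING", "票": "TICKETING",
           "platform": "TRANSPORT", "站台": "TRANSPORT"}

# Category index table: identifier -> first-level categories (sub-databases) to search.
INDEX = {"TICKETING": ["TRANSPORT", "TOURISM"],
         "TRANSPORT": ["TRANSPORT"],
         "TOURISM": ["TOURISM"],
         "MEDICAL": ["MEDICAL"]}

# One sub-database per first-level category, holding (source, target) pairs.
SUB_DBS = {
    "TRANSPORT": [("售票处", "Ticket Office"), ("站台", "Platform")],
    "TOURISM": [("门票", "Admission Ticket")],
    "MEDICAL": [("急诊", "Emergency Room")],
}


def label_query(text):
    """Step 1: attach the category identifiers the query may involve."""
    q = text.lower()
    return sorted({ident for kw, ident in LEXICON.items() if kw in q})


def select_sub_databases(identifiers):
    """Step 2: follow the index table to the sub-databases worth searching."""
    selected = []
    for ident in identifiers:
        for primary in INDEX.get(ident, []):
            if primary not in selected:
                selected.append(primary)
    return selected


def query(text):
    """Step 3: keyword search restricted to the selected sub-databases."""
    q = text.lower()
    results = []
    for primary in select_sub_databases(label_query(text)):
        for source, target in SUB_DBS[primary]:
            if q in source or q in target.lower():
                results.append((primary, source, target))
    return results
```

In this sketch a query that receives no category identifier returns nothing, and the `MEDICAL` sub-database is never scanned for a ticketing query. A real system would need a policy for unlabeled queries, such as falling back to searching every sub-database.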
The above embodiment is only one of the preferred embodiments of the invention and should not be used to limit its scope of protection. Any insubstantial modification or refinement made within the main design concept and spirit of the invention, where the technical problem solved remains the same as that of the invention, shall fall within the scope of protection of the invention.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911388415.XA CN111177412B (en) | 2019-12-30 | 2019-12-30 | Public logo bilingual parallel corpus system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911388415.XA CN111177412B (en) | 2019-12-30 | 2019-12-30 | Public logo bilingual parallel corpus system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111177412A (en) | 2020-05-19 |
| CN111177412B (en) | 2023-03-31 |
Family
ID=70655838
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911388415.XA Active CN111177412B (en) | 2019-12-30 | 2019-12-30 | Public logo bilingual parallel corpus system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111177412B (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101079028A (en) * | 2007-05-29 | 2007-11-28 | 中国科学院计算技术研究所 | On-line translation model selection method of statistic machine translation |
| WO2010134752A2 (en) * | 2009-05-21 | 2010-11-25 | 주식회사 아이네크 | Semantic search method and system in which a plurality of classification systems are linked |
| US8145636B1 (en) * | 2009-03-13 | 2012-03-27 | Google Inc. | Classifying text into hierarchical categories |
| CN109145301A (en) * | 2018-08-29 | 2019-01-04 | 上海汽车集团股份有限公司 | Information classification approach and device, computer readable storage medium |
| CN109948160A (en) * | 2019-03-15 | 2019-06-28 | 智者四海(北京)技术有限公司 | Short text classification method and device |
| CN110209764A (en) * | 2018-09-10 | 2019-09-06 | 腾讯科技(北京)有限公司 | The generation method and device of corpus labeling collection, electronic equipment, storage medium |
Family To Family Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9330167B1 (en) * | 2013-05-13 | 2016-05-03 | Groupon, Inc. | Method, apparatus, and computer program product for classification and tagging of textual data |
Non-Patent Citations (1)
| Title |
|---|
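| Research on the construction of an event-oriented multilingual parallel corpus (面向事件的多语平行语料库构建研究); Zhang Shu (张姝) et al.; Application Research of Computers (《计算机应用研究》); 2005-11-28 (No. 11); full text * |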
Also Published As
| Publication number | Publication date |
|---|---|
| CN111177412A (en) | 2020-05-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Fernandez et al. | Semantic search meets the web | |
| CN104166651B (en) | Method and device for data search based on integration of similar data objects | |
| US20150058309A1 (en) | Keyword presenting system and method based on semantic depth structure | |
| CN101093478A (en) | Method and system for identifying Chinese full name based on Chinese shortened form of entity | |
| CN103186662A (en) | System and method for extracting dynamic public sentiment keywords | |
| CN103631948A (en) | Identifying method of named entities | |
| RU2010107150A (en) | IDENTIFICATION OF SEMANTIC RELATIONS IN INDIRECT SPEECH | |
| CN103049575A (en) | Topic-adaptive academic conference searching system | |
| CN107992608B (en) | An automatic generation method of SPARQL query statement based on keyword context | |
| CN109597895B (en) | A document search method based on knowledge graph | |
| CN102779135A (en) | Method and device for obtaining cross-linguistic search resources and corresponding search method and device | |
| Ahlers et al. | Location-based Web search | |
| CN117112595A (en) | Information query method and device, electronic equipment and storage medium | |
| CN106980639B (en) | Short text data aggregation system and method | |
| CN112148938B (en) | Cross-domain heterogeneous data retrieval system and retrieval method | |
| CN117272073B (en) | Text unit semantic distance precalculation method and device, query method and device | |
| CN102819600A (en) | Keyword searching method facing to relational database of power production management system | |
| CN102622413A (en) | Method and device for answering natural language questions | |
| US8745022B2 (en) | Full text search based on interwoven string tokens | |
| CN103020083B (en) | The automatic mining method of demand recognition template, demand recognition methods and corresponding device | |
| Nobata et al. | Kleio: a knowledge-enriched information retrieval system for biology | |
| CN110928978A (en) | Standard literature classification retrieval method | |
| CN111177412B (en) | Public logo bilingual parallel corpus system | |
| CN108959540A (en) | A kind of more relationship fusion methods and intellectualizing system for the discovery of recessive association knowledge | |
| CN103514214B (en) | Data query method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |