CN117688176B - Pseudo language family clustering method and device based on multilingual pre-training large model
Classifications
- G06F16/35: Information retrieval of unstructured textual data; clustering; classification
- G06F40/58: Handling natural language data; use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G06N3/0455: Neural networks; auto-encoder networks; encoder-decoder networks
- G06N3/0499: Neural networks; feedforward networks
- G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of text machine translation, and in particular to a pseudo language family clustering method and device based on a multilingual pre-training large model. The method comprises the following steps: establishing a shared language pool; calculating a Fisher information matrix for the language pairs in the shared language pool based on the multilingual pre-training large model to obtain a characterization result for the language pairs in the shared language pool; calculating the similarity between the language pairs according to the characterization results to obtain similarity values; and sorting the language pairs by similarity value and selecting, according to a preset boundary value, the auxiliary language pairs that fall within the boundary, thereby completing pseudo language family clustering based on the multilingual pre-training large model. The invention uses the capability of the multilingual pre-trained model itself to characterize language pairs, selects and clusters auxiliary languages more effectively, improves their generalization across different models and data sets, and ultimately improves the translation quality of low-resource language pairs under multilingual co-training.
Description
Technical Field
The invention relates to the technical field of machine translation, in particular to a pseudo language family clustering method and device based on a multilingual pre-training large model.
Background
Neural Machine Translation (NMT) has become the dominant Machine Translation (MT) paradigm in both academic research and commercial use. In recent years, the NMT framework has been found to integrate multiple languages naturally. As a result, research on MT systems covering multiple languages has increased dramatically. Researchers refer to NMT systems that handle translation of more than one language pair as Multilingual NMT (MNMT) systems. The ultimate goal of MNMT research is to develop a single model that translates between as many languages as possible by efficiently utilizing the available language resources. Although MNMT brings encouraging improvements in translation quality, these models all rely on large parallel corpora. Since such corpora exist for only a few language pairs, translation performance remains far from expected for most low-resource languages. Related studies have shown that for low-resource language translation, multilingual co-training can outperform conventional fine-tuning methods in some cases by introducing additional auxiliary language pairs during the fine-tuning stage. However, subsequent studies further indicate that co-training does not always bring a positive effect and sometimes even degrades translation quality, depending on the choice of auxiliary language pairs.
Recent studies at home and abroad have shown that fine-tuning a model with a language similar to the target language can improve the translation quality of the target language pair without using the target language pair's own data, which further indicates that there is a synergistic effect between language pairs. However, not every language pair achieves this effect under co-training, so screening auxiliary language pairs becomes a key step in improving the translation quality of MNMT on low-resource language pairs. Languages within a language family generally share a common territory and linguistic background, so they tend to be similar at the character or word level; from a linguistic perspective, such language pairs share similar linguistic features such as script and grammar. At present, academic research in this field mainly falls into two directions. On the one hand, researchers often integrate different kinds of prior knowledge, including language similarity, resource availability, language typology, and task-specific requirements. On the other hand, researchers try to apply language embeddings, representing each language with an embedded vector and clustering them in the embedding space; for example, a Language Embedding layer is added to the model, embedded vectors are built for each language pair after multilingual training, and language families are then built through hierarchical clustering to improve the translation quality of the language pairs; alternatively, with the parameters of the pre-trained model kept unchanged, an Adapter structure is embedded into the model, and translation quality is improved by training a language-family Adapter on the downstream task.
Although these methods can improve the translation quality of language pairs, they also face certain difficulties in practical applications. In particular, training new models or modifying model structures complicates these methods, and they are difficult to reproduce when the original structure and data of large language models are hard to obtain.
Disclosure of Invention
In order to solve the technical problem in the prior art that training a new model or changing a model's structure complicates these methods and makes them irreproducible when the original structure and data of a large language model are difficult to acquire, the embodiments of the invention provide a pseudo language family clustering method and device based on a multilingual pre-training large model. The technical scheme is as follows:
in one aspect, a method for clustering pseudo language families based on a multilingual pre-training large model is provided, the method is implemented by a pseudo language family clustering device based on the multilingual pre-training large model, and the method comprises:
S1, establishing a shared language pool;
S2, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multi-language pre-training large model, and obtaining a characterization result of the language pairs in the shared language pool;
S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
S4, sorting the similarity among the language pairs according to the similarity value, selecting auxiliary language pairs conforming to the boundary value according to the preset boundary value, and completing pseudo language family clustering based on the multilingual pre-training large model.
Optionally, in step S1, establishing the shared language pool includes:
acquiring a TED data set;
Extracting multiple languages from the TED data set and taking their language pairs translated into English as a basic data set, thereby establishing a shared language pool.
Optionally, in step S2, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model to obtain a characterization result of the language pairs in the shared language pool includes:
acquiring a parallel corpus corresponding to the languages in the shared language pool, and equally dividing data in the parallel corpus into j small batch data sets;
Sequentially inputting the small batch data sets into a multilingual pre-training large model, and outputting a Fisher information matrix of each small batch data set;
Calculating an average Fisher information matrix of each small batch data set after one input round, and taking the average Fisher information matrix as an estimated value to obtain the Fisher information weight of each small batch data set;
And characterizing the distribution of the corresponding language pairs in the shared language pool according to the weight of the Fisher information.
Optionally, in step S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value, including:
obtaining a characterization result;
and calculating the distance between the target language pair and the auxiliary language pair by adopting a mean square error method, wherein the closer the distance, the higher the similarity.
Optionally, in step S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value, including:
and calculating the KL divergence from the auxiliary language to the target language by using the Fisher information matrix to obtain the distance between the target language pair and the auxiliary language pair, wherein the closer the distance, the higher the similarity.
Optionally, in step S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value, including:
selecting the top-K parameters and assigning them a value of 1, while assigning a value of 0 to the remaining parameters, to create a Fisher information mask;
And calculating the distance between the target language pair and the auxiliary language pair according to the number of parameters activated simultaneously and the number of parameters activated in the target direction, wherein the closer the distance, the higher the similarity.
Optionally, in step S4, the similarity between the language pairs is ordered according to the similarity value, and an auxiliary language pair conforming to the boundary value is selected according to the preset boundary value, so as to complete the pseudo language family clustering based on the multilingual pre-training large model, including:
traversing and calculating the similarity between all language pairs;
Descending order according to the similarity between the language pairs;
presetting an initial searching radius, and defining a boundary range according to the initial searching radius;
Integrating the nearest language pair in the boundary range into an auxiliary language list;
Updating the search radius according to the similarity between the latest added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing pseudo language family clustering based on the multilingual pre-training large model.
In another aspect, a pseudo language family clustering device based on a multilingual pre-training large model is provided, the device is applied to a pseudo language family clustering method based on the multilingual pre-training large model, and the device comprises:
the language pool module is used for establishing a shared language pool;
The characterization module is used for calculating a Fisher information matrix of the language pairs in the shared language pool based on the multi-language pre-training large model to obtain a characterization result of the language pairs in the shared language pool;
the similarity calculation module is used for calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
And the clustering module is used for sequencing the similarity among the language pairs according to the similarity value, selecting auxiliary language pairs conforming to the boundary value according to the preset boundary value, and completing pseudo language family clustering based on the multilingual pre-training large model.
In another aspect, a pseudo language family clustering apparatus based on a multilingual pre-training large model is provided, the pseudo language family clustering apparatus based on the multilingual pre-training large model including: a processor; a memory having stored thereon computer readable instructions which, when executed by the processor, implement any of the pseudo language family clustering methods based on a multilingual pre-training large model as described above.
In another aspect, a computer-readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement any of the above-described pseudo language family clustering methods based on a multilingual pre-training large model is provided.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
Aiming at the limitation in the prior art that additional prior knowledge is needed or the model architecture must be modified, the invention provides a clustering method for constructing more effective language groupings for multilingual collaborative training. The core objective is to characterize the language pairs using the capability of the multilingual pre-trained model itself, select and cluster the auxiliary languages more effectively, improve their generalization across different models and data sets, and ultimately improve the translation quality of low-resource language pairs under multilingual co-training.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a pseudo language family clustering method based on a multilingual pre-training large model provided by an embodiment of the invention;
FIG. 2 is a schematic illustration of a language pair (XX-en) provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of the distribution of the top 40% of parameters by Fisher information in the model structure provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a pseudo language family clustering device based on a multilingual pre-training large model provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a pseudo language family clustering device based on a multilingual pre-training large model according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.
In the embodiments of the present invention, "image" and "picture" may be sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized. "of", "corresponding (corresponding, relevant)" and "corresponding (corresponding)" are sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized.
In the embodiment of the present invention, sometimes a subscript such as W1 may be wrongly expressed in a non-subscript form such as W1, and the meaning of the subscript is consistent when the distinction is not emphasized.
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a pseudo language family clustering method based on a multi-language pre-training large model, which can be realized by pseudo language family clustering equipment based on the multi-language pre-training large model, and the pseudo language family clustering equipment based on the multi-language pre-training large model can be a terminal or a server. The process flow of the pseudo language family clustering method based on the multilingual pre-training large model as shown in fig. 1 can comprise the following steps:
S101, establishing a shared language pool;
In a possible implementation manner, in step S101, the establishing a shared language pool includes:
acquiring a TED data set;
Extracting multiple languages from the TED data set and taking their language pairs translated into English as a basic data set, thereby establishing a shared language pool.
In one possible implementation, for language pair characterization studies, a shared language pool is first created, containing high-resource and low-resource language pairs and spanning multiple language families. The invention selects the TED data set and uses 17 language-to-English (en for short) translation directions as its basic data set. These language pairs together constitute a pool of candidate shared languages that will be used in subsequent selection as candidate auxiliary languages for the low-resource languages. These languages span seven different language families: Balto-Slavic, Austronesian, Indo-Iranian, Turkic, Japonic, Koreanic and Germanic, as shown in FIG. 2.
S102, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multi-language pre-training large model, and obtaining a characterization result of the language pairs in the shared language pool;
In a possible implementation manner, in step S102, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model to obtain a characterization result of the language pairs in the shared language pool includes:
acquiring a parallel corpus corresponding to the languages in the shared language pool, and equally dividing data in the parallel corpus into j small batch data sets;
Sequentially inputting the small batch data sets into a multilingual pre-training large model, and outputting a Fisher information matrix of each small batch data set;
Calculating an average Fisher information matrix of each small batch data set after one input round, and taking the average Fisher information matrix as an estimated value to obtain the Fisher information weight of each small batch data set;
And characterizing the distribution of the corresponding language pairs in the shared language pool according to the weight of the Fisher information.
The invention selects the FIM to evaluate the parameters of the pre-trained model, computing the Fisher information weight of each parameter. The weight reflects the importance of the parameter: a parameter with a large weight is sensitive to a specific translation direction. The FIM is thus used to evaluate and select those parameters that are sensitive to a specific translation direction, i.e., parameters that require extensive weight updates during the fine-tuning phase, and it is an important indicator for evaluating the importance and potential value of a specific parameter. Essentially, it quantifies the variance of the first derivative of the log-likelihood function. By measuring this quantity, the necessity of fine-tuning a particular parameter in a subsequent task can be inferred. The original calculation formula is as follows:
Wherein X and Y represent the input and output of the model, respectively; θ represents the parameters of the model; p represents the probability distribution of the output Y given the input X and the parameters θ; the superscript T denotes the matrix transpose; and E denotes the expectation. For the i-th parameter, using a diagonal matrix helps to estimate the Fisher information matrix:
While using a diagonal matrix helps to estimate the FIM, obtaining an accurate probability estimate remains a difficult task. In view of this, we approximate the FIM using the following formula:
Here, D represents the entire data set and |D| represents the number of data items; the entire data set is divided into j small batches of equal size and sequentially input into the model for training. The invention inputs the parallel corpus corresponding to each language pair into the model and calculates the FIM of each small batch over only one round (epoch). During this pass we accumulate the FIM of each small batch using equation (3), but do not perform back-propagation updates to the parameters. After the epoch is completed, we take the average FIM over the small batches as the final estimate.
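The three numbered formulas referenced above are rendered as images in the original patent and are not reproduced in this text. A LaTeX reconstruction that is consistent with the surrounding symbol definitions (the standard Fisher information forms) is:

```latex
% (1) Fisher information matrix of the model parameters \theta:
F(\theta) = \mathbb{E}_{X,Y}\left[ \nabla_{\theta} \log p(Y \mid X;\theta)\,
            \nabla_{\theta} \log p(Y \mid X;\theta)^{T} \right]
% (2) Diagonal entry for the i-th parameter:
F_{i}(\theta) = \mathbb{E}_{X,Y}\left[ \left( \frac{\partial}{\partial\theta_{i}}
                \log p(Y \mid X;\theta) \right)^{2} \right]
% (3) Empirical estimate over the data set D, accumulated mini-batch by mini-batch:
\hat{F}_{i}(\theta) = \frac{1}{|D|} \sum_{(x,y) \in D}
                \left( \frac{\partial}{\partial\theta_{i}} \log p(y \mid x;\theta) \right)^{2}
```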
Further, the invention analyzes the distribution of high-Fisher-information parameters in the pre-trained model structure. The distribution of the top 40% of parameters observed in the study is shown in FIG. 3. The pre-trained model is divided into 5 parts: the encoder attention layer (E_a), the encoder fully connected layer (E_f), the decoder self-attention layer (D_a), the decoder cross-attention layer (D_c), and the decoder fully connected layer (D_f). More than 60% of these parameters are distributed in the feed-forward network (FFN), so the FFN layers are selected for the FIM calculation and the similarity measurement is carried out on them.
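As an illustration of the per-mini-batch accumulation just described, the following is a minimal PyTorch-style sketch. It assumes a Hugging Face-style sequence-to-sequence model whose forward pass returns a loss when labels are supplied and whose FFN weights have parameter names containing "fc"; both details are assumptions, not specified by the patent.

```python
# Minimal sketch: accumulate the diagonal FIM of the FFN parameters over one epoch
# of a single language pair, following equation (3).  The model interface and the
# "fc" parameter-name convention are assumptions.
import torch

def diagonal_fim(model, mini_batches, device="cuda"):
    model.to(device)
    model.eval()
    fim = {name: torch.zeros_like(p) for name, p in model.named_parameters()
           if "fc" in name}                 # restrict to feed-forward (FFN) layers
    n_batches = 0
    for batch in mini_batches:              # one pass (epoch) over the j mini-batches
        model.zero_grad()
        outputs = model(**{k: v.to(device) for k, v in batch.items()})
        outputs.loss.backward()             # gradients only; no optimizer step,
                                            # so the parameters are never updated
        for name, p in model.named_parameters():
            if name in fim and p.grad is not None:
                fim[name] += p.grad.detach() ** 2   # squared first derivative
        n_batches += 1
    # average over the mini-batches as the final FIM estimate
    return {name: v / max(n_batches, 1) for name, v in fim.items()}
```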
S103, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
In a possible implementation manner, in step S103, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
obtaining a characterization result;
And calculating the distance between each language pair in the shared language pool and the target language pair by adopting a mean square error method, wherein the closer the distance, the higher the similarity.
In one possible implementation, the mean square error (Mean Square Error, MSE) is used. The calculation formula is as follows, wherein t and a denote the target language pair and the auxiliary language pair, S(t,a) is the distance between t and a (the closer the distance, the higher the similarity), F is the FIM, and |F_t| denotes the number of parameters.
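The MSE formula itself is rendered as an image in the original patent; a reconstruction consistent with the symbols just described is:

```latex
% Reconstructed MSE distance between target pair t and auxiliary pair a:
S_{(t,a)} = \frac{1}{|F_{t}|} \sum_{i=1}^{|F_{t}|} \left( F_{t}^{(i)} - F_{a}^{(i)} \right)^{2}
```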
In a possible implementation manner, in step S103, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
And calculating the KL divergence between the language pairs in the shared language pool and the target language pair by using the Fisher information matrix to obtain the distance between them, wherein the closer the distance, the higher the similarity.
In one possible embodiment, the KL divergence (Kullback-Leibler divergence, abbreviated KL) is used: the FIM is used directly to calculate the KL divergence between the language pairs in the shared language pool and the target language pair, thereby representing the distance between language pairs more accurately. The calculation formula is as follows, where the notation is consistent with that described for the MSE and |·| denotes the absolute value.
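The KL formula is likewise rendered as an image in the original patent. One plausible reading, treating the Fisher information weights of each language pair as a distribution over parameters and applying |·| as described, is the following; this is an assumption, not the patent's verbatim equation:

```latex
% Plausible reconstruction only -- the exact rendered equation is not reproduced here:
S_{(t,a)} = \sum_{i=1}^{|F_{t}|} F_{a}^{(i)} \left| \log \frac{F_{a}^{(i)}}{F_{t}^{(i)}} \right|
```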
In a possible implementation manner, in step S103, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
selecting the top-K parameters and assigning them a value of 1, while assigning a value of 0 to the remaining parameters, to create a Fisher information mask;
And calculating the distance between the target language pair and the auxiliary language pair according to the number of parameters activated simultaneously and the number of parameters activated in the target direction, wherein the closer the distance, the higher the similarity.
In one possible implementation, overlap similarity (Overlap Similarity, Overlap) is used. Unlike the first two calculations, overlap similarity does not use the FIM directly: a Fisher information mask (abbreviated as M) is created by selecting the top-K parameters and assigning them a value of 1, while the remaining parameters are assigned a value of 0. The calculation is as follows, wherein Overlapping and Activated represent the number of parameters activated simultaneously and the number of parameters activated in the target direction, respectively.
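The overlap formula is rendered as an image in the original patent; a reconstruction consistent with the description (Overlapping counts parameters activated in both masks, Activated counts parameters activated in the target direction) is:

```latex
% Reconstructed overlap similarity between target mask M_t and auxiliary mask M_a:
S_{(t,a)} = \frac{\mathrm{Overlapping}}{\mathrm{Activated}}
          = \frac{\bigl|\{\, i : M_{t}^{(i)} = 1 \wedge M_{a}^{(i)} = 1 \,\}\bigr|}
                 {\bigl|\{\, i : M_{t}^{(i)} = 1 \,\}\bigr|}
```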
In the Overlap method, the invention uses 40% as the default value of K, since this achieves the best translation effect.
S104, sorting the similarity among the language pairs according to the similarity value, and selecting auxiliary language pairs conforming to the boundary value according to the preset boundary value to complete pseudo language family clustering based on the multilingual pre-training large model.
In a possible implementation manner, in step S104, the method includes sorting the similarities between the language pairs according to the similarity values, selecting the auxiliary language pairs conforming to the boundary values according to the preset boundary values, and completing the pseudo language family clustering based on the multilingual pre-training large model, including:
traversing and calculating the similarity between all language pairs;
Descending order according to the similarity between the language pairs;
presetting an initial searching radius, and defining a boundary range according to the initial searching radius;
Integrating the nearest language pair in the boundary range into an auxiliary language list;
Updating the search radius according to the similarity between the latest added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing pseudo language family clustering based on the multilingual pre-training large model.
In a possible embodiment, the invention designs a simple algorithm to select the auxiliary languages. First, we calculate the similarity of the other language pairs to the target language pair using the similarity measurement methods described above and rank them, then set an initial search radius. Within this predefined boundary, the closest language pairs are integrated into the auxiliary language list. The radius is then adjusted based on the similarity measure between the newly added language pair and the target language pair. This process is repeated until no new language pairs are added. The invention refers to such a clustered language family as a pseudo language family. The algorithm for selecting the auxiliary language pairs is as follows (a sketch follows the list):
1. Sort the similarities to create a list L, with the closest language pairs first (ascending order for MSE and KL, descending order for Overlap);
2. Initialize Gap as |L[1] - L[0]| and add the first language to the auxiliary list;
3. Iterate from i = 2 to the end of L; in each loop, select languages as follows:
a) If |L[i-1] - L[i]| < Gap, add the i-th language to the auxiliary list and update Gap = |L[i-1] - L[i]|;
b) If |L[i-1] - L[i]| < Gap, add the i-th language to the auxiliary list and update
c) If |L[i-1] - L[i]| > Gap, terminate the loop;
4. The language pairs in the auxiliary list, together with the target language pair, constitute the pseudo language family of the target language pair.
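The following is a minimal Python sketch of steps 1 to 4. Because the update rule of branch b) is truncated in the text above, the sketch applies the branch a) update whenever the gap does not grow; this is an interpretation, not the patent's verbatim rule.

```python
# Sketch of steps 1-4.  The branch b) update is truncated in the source, so branch
# a)'s update (Gap = |L[i-1] - L[i]|) is applied whenever the gap does not grow.
def select_pseudo_family(scores, ascending=True):
    """scores: {auxiliary language: distance/similarity to the target pair}.
    ascending=True for MSE and KL (smaller = closer), False for Overlap."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=not ascending)
    langs = [lang for lang, _ in ranked]
    L = [score for _, score in ranked]
    if len(L) < 2:
        return langs
    gap = abs(L[1] - L[0])              # step 2: initial search radius
    family = [langs[0]]                 # step 2: the closest language seeds the list
    for i in range(1, len(L)):          # step 3 (starting at i = 1 is equivalent,
        diff = abs(L[i] - L[i - 1])     # since |L[1]-L[0]| equals the initial Gap)
        if diff <= gap:                 # branch a): still within the boundary range
            family.append(langs[i])
            gap = diff                  # shrink the search radius
        else:                           # branch c): radius exceeded, stop expanding
            break
    return family                       # step 4: together with the target pair, this
                                        # list forms the pseudo language family
```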
In a possible embodiment, to evaluate the method of the invention, the following baselines were designed based on the M2M100-418M model:
Pre-trained (Pre-training): directly translating the target language pair with the pre-trained model, without any fine-tuning;
FT (fine-tune): fine-tuning the base model with the bilingual data of the target language pair;
LF (language family): fine-tuning using the traditional language families divided in FIG. 2, with temperature sampling and the temperature set to 1.5;
LF+FT: based on the LF method, further fine-tuning with the data of the target language pair.
For the training phase, the batch size is set to 4096. The method of the invention up-samples the training data to equal sizes so that the proportion of each language in every mini-batch is the same; the exception is Hindi (hi for short), for which, when the Overlap method is adopted, the same sampling mode as LF is used. Optimization is performed with the Adam optimizer, where β1 = 0.98, β2 = 0.98 and ε = 10e-6. The learning rate is lr = 3e-5. Translation quality is evaluated in the present invention using the BLEU (bilingual evaluation understudy) score.
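As a concrete illustration of the stated hyper-parameters, the following is a minimal sketch of the optimizer configuration only; the surrounding training loop, up-sampling, and batching are framework details not specified by the patent and are omitted.

```python
# Sketch only: Adam with the hyper-parameters stated above
# (beta1 = beta2 = 0.98, eps = 10e-6 as written, learning rate 3e-5).
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    return torch.optim.Adam(
        model.parameters(),
        lr=3e-5,
        betas=(0.98, 0.98),
        eps=10e-6,
    )
```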
Table 1. Pseudo language families under different method choices
Table 2. BLEU scores for various low-resource language pairs under different approaches
Table 1 shows the pseudo language families clustered for the low-resource language pairs in the shared language pool by the method of the invention, and Table 2 shows the test results on the TED data set for Persian (fa for short), Hindi (hi for short), Bengali (bn for short), Indonesian (id for short) and Malay (ms) to English.
Models (1) to (4) represent the baselines under the currently common fine-tuning approaches, and models (5) to (7) represent our implementations of the method using the different calculation methods.
The invention uses the three measurements MSE, KL and Overlap to calculate the distance or similarity between language pairs, clusters pseudo language families accordingly, and verifies them on low-resource language pairs. The evaluation results show that all three calculation modes further improve the BLEU score, the final improvements are comparable, and model (7) achieves the best improvement.
FIG. 4 is a block diagram illustrating a pseudo language family clustering apparatus based on a multilingual pre-training large model, used for a pseudo language family clustering method based on a multilingual pre-training large model, according to an exemplary embodiment. Referring to FIG. 4, the apparatus includes a language pool module 410, a characterization module 420, a similarity calculation module 430, and a clustering module 440. For ease of illustration, FIG. 4 shows only the main components of the apparatus 400:
A language pool module 410 for creating a shared language pool;
The characterization module 420 is configured to calculate a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model, and obtain a characterization result of the language pairs in the shared language pool;
The similarity calculation module 430 is configured to calculate a similarity between the language pairs according to the characterization result, and obtain a similarity value;
The clustering module 440 is configured to sort the similarities between the language pairs according to the similarity values, select an auxiliary language pair conforming to the boundary values according to the preset boundary values, and complete pseudo language family clustering based on the multilingual pre-training large model.
Optionally, a language pool module 410 for obtaining a TED dataset;
Extracting multiple languages from the TED data set and taking their language pairs translated into English as a basic data set, thereby establishing a shared language pool.
Optionally, the characterization module 420 is configured to obtain a parallel corpus corresponding to a language in the shared language pool, and divide data in the parallel corpus into j small batch data sets;
Sequentially inputting the small batch data sets into a multilingual pre-training large model, and outputting a Fisher information matrix of each small batch data set;
Calculating an average Fisher information matrix of each small batch data set after one input round, and taking the average Fisher information matrix as an estimated value to obtain the Fisher information weight of each small batch data set;
And characterizing the distribution of the corresponding language pairs in the shared language pool according to the weight of the Fisher information.
Optionally, a similarity calculation module 430 is configured to obtain a characterization result;
And calculating the distance between each language pair in the shared language pool and the target language pair by adopting a mean square error method, wherein the closer the distance, the higher the similarity.
Optionally, the similarity calculation module 430 is configured to calculate the KL divergence between the language pairs in the shared language pool and the target language pair by using the Fisher information matrix, so as to obtain the distance between them, wherein the closer the distance, the higher the similarity.
Optionally, the similarity calculation module 430 is configured to select the top-K parameters and assign them a value of 1, while assigning a value of 0 to the remaining parameters, to create a Fisher information mask;
and calculate the distance between each language pair in the shared language pool and the target language pair according to the number of parameters activated simultaneously and the number of parameters activated in the target direction, wherein the closer the distance, the higher the similarity.
Optionally, a clustering module 440 for computing the similarity between all language pairs in a traversal manner;
Descending order according to the similarity between the language pairs;
presetting an initial searching radius, and defining a boundary range according to the initial searching radius;
Integrating the nearest language pair in the boundary range into an auxiliary language list;
Updating the search radius according to the similarity between the latest added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing pseudo language family clustering based on the multilingual pre-training large model.
Aiming at the limitation in the prior art that additional prior knowledge is needed or the model architecture must be modified, the invention provides a clustering method for constructing more effective language groupings for multilingual collaborative training. The core objective is to characterize the language pairs using the capability of the multilingual pre-trained model itself, select and cluster the auxiliary languages more effectively, improve their generalization across different models and data sets, and ultimately improve the translation quality of low-resource language pairs under multilingual co-training.
Fig. 5 is a schematic structural diagram of a pseudo language family clustering device based on a multilingual pre-training large model according to an embodiment of the present invention, where, as shown in fig. 5, the pseudo language family clustering device based on the multilingual pre-training large model may include the pseudo language family clustering device based on the multilingual pre-training large model shown in fig. 4. Optionally, a pseudo language family clustering device 510 based on a multilingual pre-trained large model may include a processor 2001.
Optionally, the pseudo language family clustering device 510 based on a multilingual pre-trained large model may also include a memory 2002 and a transceiver 2003.
The processor 2001 may be connected to the memory 2002 and the transceiver 2003 via a communication bus, for example.
The following describes the components of the pseudo language family clustering device 510 based on the multilingual pre-training large model in detail with reference to fig. 5:
the processor 2001 is a control center of the pseudo language family clustering device 510 based on a multilingual pre-training large model, and may be one processor or a collective name for a plurality of processing elements. For example, the processor 2001 is one or more central processing units (CPUs), but may also be an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSPs) or one or more field programmable gate arrays (FPGAs).
Alternatively, the processor 2001 may perform various functions of the pseudo language family clustering device 510 based on the multilingual pre-training large model by running or executing a software program stored in the memory 2002, and invoking data stored in the memory 2002.
In a particular implementation, as an example, the processor 2001 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 5.
In a particular implementation, as one embodiment, the pseudo language family clustering device 510 based on a multilingual pre-trained large model may also include multiple processors, such as the processor 2001 and the processor 2004 shown in FIG. 5. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 2002 is used for storing a software program for executing the solution of the present invention, and is controlled by the processor 2001 to execute the solution, and the specific implementation may refer to the above method embodiment, which is not described herein again.
Alternatively, the memory 2002 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation. The memory 2002 may be integrated with the processor 2001, or may exist separately and be coupled to the processor 2001 through an interface circuit (not shown in FIG. 5) of the pseudo language family clustering device 510 based on a multilingual pre-trained large model, which is not specifically limited in the embodiments of the present invention.
A transceiver 2003 for communicating with a network device or with a terminal device.
Alternatively, transceiver 2003 may include a receiver and a transmitter (not separately shown in fig. 5). The receiver is used for realizing the receiving function, and the transmitter is used for realizing the transmitting function.
Alternatively, the transceiver 2003 may be integrated with the processor 2001, or may exist separately, and be coupled to the processor 2001 through an interface circuit (not shown in fig. 5) of the pseudo language family clustering device 510 based on a multilingual pre-trained large model, as embodiments of the present invention are not particularly limited.
It should be noted that the structure of the pseudo language family clustering device 510 based on the multilingual pre-training large model shown in FIG. 5 does not constitute a limitation of the device; an actual device may include more or fewer components than illustrated, may combine some components, or may have a different arrangement of components.
In addition, for the technical effects of the pseudo language family clustering apparatus based on the multilingual pre-training large model, reference may be made to the technical effects of the pseudo language family clustering method based on the multilingual pre-training large model described in the above method embodiments, which are not repeated here.
It is to be appreciated that the processor 2001 in embodiments of the invention may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present invention may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A pseudo language family clustering method based on a multilingual pre-training large model, the method comprising:
S1, establishing a shared language pool;
S2, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multi-language pre-training large model, and obtaining a characterization result of the language pairs in the shared language pool;
in the step S2, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model and obtaining a characterization result of the language pairs in the shared language pool includes:
acquiring a parallel corpus corresponding to the languages in the shared language pool, and equally dividing the data in the parallel corpus into j small batch data sets;
Sequentially inputting the small batch data sets into a multilingual pre-training large model, and outputting a Fisher information matrix of each small batch data set;
Calculating an average Fisher information matrix of each small batch data set after one input round, and taking the average Fisher information matrix as an estimated value to obtain Fisher information weight of each small batch data set;
Characterizing the distribution of the corresponding language pairs in the shared language pool according to the Fisher information weight;
S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
in the step S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value, including:
obtaining a characterization result; selecting a target language pair;
calculating the distance between each language pair in the shared language pool and the target language pair by adopting a mean square error method, wherein the closer the distance, the higher the similarity;
Or selecting a target language pair;
Calculating the KL divergence between the language pairs in the shared language pool and the target language pair by using the Fisher information matrix to obtain the distance between them, wherein the closer the distance, the higher the similarity;
Or selecting a target language pair;
selecting the top-K parameters and assigning them a value of 1, while assigning a value of 0 to the remaining parameters, to create a Fisher information mask;
According to the number of parameters activated simultaneously and the number of parameters activated in the target direction, calculating the distance between each language pair in the shared language pool and the target language pair, wherein the closer the distance, the higher the similarity;
S4, sorting the similarity among the language pairs according to the similarity value, selecting auxiliary language pairs conforming to the boundary value according to a preset boundary value, and completing pseudo language family clustering based on a multilingual pre-training large model;
In the step S4, the similarity between the language pairs is ordered according to the similarity value, and an auxiliary language pair conforming to the boundary value is selected according to a preset boundary value, so as to complete pseudo language family clustering based on the multilingual pre-training large model, including:
traversing and calculating the similarity between all language pairs;
sorting in descending order according to the similarity between the language pairs;
presetting an initial search radius, and defining a boundary range according to the initial search radius;
integrating the nearest language pair within the boundary range into an auxiliary language list;
updating the search radius according to the similarity between the most recently added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing pseudo language family clustering based on the multilingual pre-training large model.
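A minimal sketch of the radius-based expansion in S4, assuming the similarities to the target language pair have already been computed and sorted in descending order; the particular radius-update rule (relaxing the boundary from the most recently added pair by a factor `alpha`) is only one plausible reading of the claim, and all identifiers are illustrative.

```python
def cluster_pseudo_family(similarities, initial_radius, alpha=0.95):
    """similarities: list of (language_pair, similarity_to_target) tuples,
    pre-sorted in descending order of similarity.
    initial_radius: minimum similarity a pair must reach to be included.
    alpha: hypothetical relaxation factor used when updating the radius."""
    auxiliary = []
    radius = initial_radius
    added = True
    while added:                                     # repeat until no new pair joins
        added = False
        for pair, sim in similarities:
            if pair in dict(auxiliary):
                continue
            if sim >= radius:                        # inside the current boundary range
                auxiliary.append((pair, sim))
                radius = alpha * sim                 # update radius from the newest pair
                added = True
                break                                # re-scan with the updated radius
    return [pair for pair, _ in auxiliary]
```

For example, with `initial_radius = 0.8` and similarities `[("fr-en", 0.92), ("es-en", 0.90), ("ja-en", 0.41)]`, the first two pairs would be grouped into the pseudo language family while "ja-en" falls outside the final boundary.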
2. The method according to claim 1, wherein in the step S1, the step of creating the shared language pool includes:
acquiring a TED data set;
Extracting multiple languages from the TED data set, taking the language pairs that translate these languages into English as the basic data set, and establishing the shared language pool.
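For concreteness only, a shared language pool built from the X-to-English directions of a TED-style corpus could be represented as simply as the sketch below; the directory layout and file names are assumptions rather than part of the claimed method.

```python
from pathlib import Path

def build_shared_language_pool(ted_dir):
    """Collect X->en language pairs from a TED-style corpus directory.
    Assumes (hypothetically) one sub-directory per source language holding
    parallel files 'train.<lang>' and 'train.en'."""
    pool = {}
    for lang_dir in sorted(Path(ted_dir).iterdir()):
        if not lang_dir.is_dir():
            continue
        src = lang_dir / f"train.{lang_dir.name}"
        tgt = lang_dir / "train.en"
        if src.exists() and tgt.exists():
            pool[f"{lang_dir.name}-en"] = (src, tgt)
    return pool   # e.g. {"de-en": (...), "fr-en": (...), ...}
```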
3. A pseudo-language family clustering device based on a multilingual pre-training large model, the device comprising:
the language pool module is used for establishing a shared language pool;
the characterization module is used for calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model to obtain a characterization result of the language pairs in the shared language pool;
The characterization module is used for acquiring a parallel corpus corresponding to the languages in the shared language pool and equally dividing the data in the parallel corpus into j small batch data sets;
Sequentially inputting the small batch data sets into a multilingual pre-training large model, and outputting a Fisher information matrix of each small batch data set;
Calculating the average Fisher information matrix of the small batch data sets after one round of input, and taking the average Fisher information matrix as the estimated value to obtain the Fisher information weight of each small batch data set;
characterizing the distribution of corresponding language pairs in the shared language pool according to the weight of the Fisher information;
the similarity calculation module is used for calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
the similarity calculation module is used for obtaining a characterization result; selecting a target language pair;
calculating the distance between the language pair in the shared language pool and the target language pair by adopting a mean square error method, wherein the closer the distance, the higher the similarity;
the similarity calculation module is used for selecting a target language pair; calculating the KL divergence between the language pairs in the shared language pool and the target language pair by using the Fisher information matrix to obtain the distance between them, wherein the closer the distance, the higher the similarity;
the similarity calculation module is used for selecting a target language pair; assigning a value of 1 to the top-K parameters and a value of 0 to the remaining parameters to create a Fisher information mask;
calculating the distance between the language pair in the shared language pool and the target language pair according to the number of parameters activated simultaneously and the number of parameters activated for the target direction, wherein the closer the distance, the higher the similarity;
The clustering module is used for sequencing the similarity among the language pairs according to the similarity value, selecting auxiliary language pairs conforming to the boundary value according to a preset boundary value, and completing pseudo language family clustering based on a multilingual pre-training large model;
The clustering module is used for traversing and calculating the similarity between all language pairs;
sorting in descending order according to the similarity between the language pairs;
presetting an initial search radius, and defining a boundary range according to the initial search radius;
integrating the nearest language pair within the boundary range into the auxiliary language list;
updating the search radius according to the similarity between the most recently added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing pseudo language family clustering based on the multilingual pre-training large model.
4. A pseudo language family clustering device based on a multilingual pre-training large model, the pseudo language family clustering device based on the multilingual pre-training large model comprising:
A processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 2.
5. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for executing the method according to any one of claims 1 to 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311653724.1A CN117688176B (en) | 2023-12-04 | 2023-12-04 | Pseudo language family clustering method and device based on multilingual pre-training large model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117688176A CN117688176A (en) | 2024-03-12 |
CN117688176B true CN117688176B (en) | 2024-09-24 |
Family
ID=90134409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311653724.1A Active CN117688176B (en) | 2023-12-04 | 2023-12-04 | Pseudo language family clustering method and device based on multilingual pre-training large model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117688176B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119323264B (en) * | 2024-12-19 | 2025-03-25 | 中国科学技术大学 | Large language model reasoning optimization method, system, device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116957070A (en) * | 2023-03-31 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Multitasking training method and device, storage medium and electronic equipment |
CN117094334A (en) * | 2023-08-21 | 2023-11-21 | 腾讯科技(深圳)有限公司 | Data processing method, device and equipment based on large language model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1983445B1 (en) * | 2006-02-17 | 2018-12-26 | Google LLC | Encoding and adaptive, scalable accessing of distributed models |
CN110941966A (en) * | 2019-12-10 | 2020-03-31 | 北京小米移动软件有限公司 | Training method, device and system for machine translation model |
CN112257468B (en) * | 2020-11-03 | 2023-08-22 | 沈阳雅译网络技术有限公司 | Multilingual neural machine translation performance improving method |
CN114048760B (en) * | 2021-09-27 | 2025-03-25 | 中国科学院自动化研究所 | Multilingual machine translation model training method, multilingual translation method and device |
2023-12-04: Application CN202311653724.1A filed in China (CN); granted as CN117688176B, status: Active.
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||