Disclosure of Invention
In view of the above problems, the invention provides a large-model-based method and device for quality inspection of sensitive words in debt collection, which address the high cost and low efficiency of traditional collection quality inspection, which relies mainly on manual work.
The technical scheme is as follows. A large-model-based quality inspection method for sensitive words in debt collection comprises: acquiring collection recordings generated online; calling a transcription model API interface to convert the collection recordings into recording text; preprocessing the recording text and segmenting long texts in it to obtain input text; importing the input text into the original model for quality inspection and outputting a first quality inspection result; constructing a local collection compliance vector knowledge base, calling the vector knowledge base based on a large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result; performing P-tuning training based on the large language model using historical collection recording data to obtain a customized large language model; importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result; and, if at least one of the first, second and third quality inspection results is non-compliant, determining that the final quality inspection result is non-compliant.
Preferably, constructing the local collection compliance vector knowledge base comprises: collecting collection recordings based on expert experience or historical customer complaint cases; after converting the collection recordings into recording text, screening out the collector's speech; converting the recording text into 512-dimensional vectors using encoding software and storing the vectors in a database; converting the recording text to be evaluated into a 512-dimensional vector and computing its inner product with every vector in the database, where a larger inner product indicates higher similarity; and, if the similarity exceeds a set threshold, determining that the corresponding recording exhibits a non-compliance problem that has occurred historically.
Preferably, before the input text is imported into the large language model for quality inspection, the method further comprises: reading the content of the local vector knowledge base to obtain context related to the user request; filling a template with the request content and the context to obtain a prompt; and inputting the prompt into the large language model.
Preferably, preprocessing the recording text comprises removing the text of recordings shorter than 30 seconds and adding target label information based on expert experience and historical complaint information.
Preferably, performing P-tuning training based on the large language model using historical collection recording data to obtain the customized large language model comprises: collecting recordings and text data in the debt collection field and preprocessing the data; recognizing the recording data with ASR technology, distinguishing the collector from the overdue user, and converting the recording data into text data; having experts label the text data and generating training samples with positive or negative labels according to whether the speech is appropriate; dividing the training samples into a training set used for P-tuning and a test set used for evaluating model performance; configuring the P-tuning model parameters, the customized training of the large language model being complete when model performance reaches a set threshold; and deploying the customized large language model in a production environment, where the collection system calls it through an API.
Preferably, recognizing the recording data with ASR technology, distinguishing the collector from the overdue user, and converting the recording data into text data comprises: using the whisperX model with the language specified as Chinese and the number of speakers set to 2; inputting the recording file into the whisperX model, which outputs the speakers and the text of their speech; and screening out the collector's speech text according to the collector's fixed opening line.
Preferably, deploying the customized large language model in the production environment comprises: importing the customized large language model into the production environment and setting the model to eval mode; exposing an API service using a FastAPI interface; and submitting to the API the recording text to be evaluated, with the prompt added, the API returning the evaluation result for the recording text.
The invention further provides a large-model-based quality inspection device for sensitive words in debt collection, which comprises an acquisition module, a recording conversion module, a preprocessing module, a first quality inspection module, a second quality inspection module, a model training module, a third quality inspection module and a quality inspection result module. The acquisition module is used for acquiring collection recordings generated online. The recording conversion module is used for calling a transcription model API interface to convert the collection recordings into recording text. The preprocessing module is used for preprocessing the recording text and segmenting long texts in it to obtain input text. The first quality inspection module is used for importing the input text into the original model for quality inspection and outputting a first quality inspection result. The second quality inspection module is used for constructing a local collection compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result. The model training module is used for performing P-tuning training based on the large language model using historical collection recording data to obtain a customized large language model. The third quality inspection module is used for importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result. The quality inspection result module is used for determining that the final quality inspection result is non-compliant if at least one of the first, second and third quality inspection results is non-compliant.
Compared with the prior art, the invention has the following beneficial effects. Unstructured collection recording data accumulated in the collection business of financial institutions is mined; after preprocessing such as data cleaning, the speakers are identified and a collection sensitive-word model is generated, so that sensitive content in the speech is identified more accurately, the check against quality inspection standards is completed, and the risk posed by sensitive words that may appear in collection calls is output. With collection sensitive-word quality inspection, collection speech and text can be analyzed automatically, which reduces labor cost. With large language model technology, sensitive words can be found by analyzing collection speech and text data, which improves collection efficiency. Quality inspectors can carry out spot checks in a more targeted way, which reduces the manual quality inspection workload and improves working efficiency. The collection sensitive-word quality inspection method helps ensure compliance, improve efficiency, and reduce cost and disputes, and promotes the development of large language model technology in the financial field.
Detailed Description
It is to be understood that, according to the technical solution of the present invention, those skilled in the art may propose various alternative structural modes and implementation modes without changing the true spirit of the present invention. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit the invention to the precise form disclosed.
An embodiment of the invention is described with reference to Figs. 1 and 2. A large-model-based quality inspection method for sensitive words in debt collection comprises the following steps:
S101, acquiring collection recordings generated online.
S102, calling a transcription model API interface to convert the collection recordings into recording text.
S103, preprocessing the recording text and segmenting long texts in it to obtain input text. Quality inspection prediction is then performed on each segment separately, which works around the input length limit of the large language model.
Preprocessing the recording text comprises removing the text of recordings shorter than 30 seconds and adding target label information based on expert experience and historical complaint information.
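The preprocessing in S103 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Transcript` type, the 500-character segment size and the preference for cutting at sentence ends are assumptions, since the source states only the 30-second cutoff and that long texts are segmented.

```python
from dataclasses import dataclass

MIN_DURATION_SECONDS = 30  # from the text: shorter recordings are removed
MAX_CHARS = 500            # assumed segment size; the source gives no number


@dataclass
class Transcript:
    """Hypothetical container for one transcribed recording."""
    recording_id: str
    duration_seconds: float
    text: str


def drop_short(transcripts):
    """Keep only transcripts whose recording lasts at least 30 seconds."""
    return [t for t in transcripts if t.duration_seconds >= MIN_DURATION_SECONDS]


def segment(text, max_chars=MAX_CHARS):
    """Split text into segments of at most max_chars characters.

    A segment is closed early at a sentence-ending mark once it is at least
    half full, so that most cuts fall on sentence boundaries.
    """
    segments, current = [], ""
    for ch in text:
        current += ch
        at_sentence_end = ch in "。！？.!?"
        if len(current) >= max_chars or (
            at_sentence_end and len(current) >= max_chars // 2
        ):
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments
```

Each returned segment would then be sent for quality inspection on its own, and the per-segment results combined for the recording.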
S104, importing the input text into the original model for quality inspection and outputting a first quality inspection result. Calling the original model can identify some obviously NSFW (uncivil or inappropriate) terms.
S105, constructing a local compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result. The large language model calls the local sensitive-word vector knowledge base to identify content that does not comply with business rules.
Specifically, the construction of the local compliance vector knowledge base includes:
(1) Based on expert experience or historical customer complaint cases, collection recordings are collected to expand the collection compliance domain information available to the pre-trained large language model.
(2) After the collection recordings are converted into recording text, the collector's speech is screened out.
(3) The recording text is converted into 512-dimensional vectors using encoding software, and the vectors are stored in a database. The encoding software is google-universal-encoding (presumably Google's Universal Sentence Encoder, whose embeddings are 512-dimensional).
(4) The recording text to be evaluated is converted into a 512-dimensional vector, and its inner product with every vector in the database is computed; the larger the inner product, the higher the similarity.
(5) If the similarity exceeds the set threshold, the corresponding recording is judged to exhibit a non-compliance problem that has occurred historically. For example, the similarity threshold may be set to 0.8.
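The lookup in steps (3) to (5) can be sketched as below. The sketch assumes the embeddings have already been computed by some encoder and uses 3-dimensional toy vectors in place of 512-dimensional ones. The class name `ComplianceKnowledgeBase` is hypothetical, and normalizing vectors to unit length (so that the inner product lies in [-1, 1] and a threshold such as 0.8 is meaningful) is an assumption not stated in the source.

```python
def inner_product(a, b):
    """Plain inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (assumption: makes 0.8 a cosine threshold)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm else list(v)


class ComplianceKnowledgeBase:
    """Hypothetical in-memory store of vectors for known non-compliant speech."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (case label, unit vector)

    def add(self, label, vector):
        self.entries.append((label, normalize(vector)))

    def check(self, vector):
        """Return (is_non_compliant, best_label, best_score) for a query vector."""
        query = normalize(vector)
        best_label, best_score = None, float("-inf")
        for label, stored in self.entries:
            score = inner_product(query, stored)
            if score > best_score:
                best_label, best_score = label, score
        return best_score > self.threshold, best_label, best_score


kb = ComplianceKnowledgeBase(threshold=0.8)
kb.add("repayment to private account", [1.0, 0.0, 0.0])
kb.add("threatening language", [0.0, 1.0, 0.0])
flagged, label, score = kb.check([0.9, 0.1, 0.0])
```

A linear scan is enough for a small knowledge base; a larger one would likely use a vector index, which the source does not specify.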
Before the input text is imported into the large language model for quality inspection, the method further comprises: reading the content of the local vector knowledge base to obtain context related to the user request; filling a template with the request content and the context to obtain a prompt; and inputting the prompt into the large language model.
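Filling the template can be illustrated as follows. The template wording, the `build_prompt` name and the cap of three retrieved cases are assumptions made for illustration; the source says only that the request content and the retrieved context fill a template to produce the prompt.

```python
# Assumed template wording; the source does not give the actual template.
PROMPT_TEMPLATE = (
    "Known non-compliant cases:\n"
    "{context}\n\n"
    "Please determine whether the following collection recording is compliant:\n"
    "{request}"
)


def build_prompt(request_text, retrieved_cases, max_cases=3):
    """Fill the template with the request and the retrieved knowledge-base context."""
    lines = [f"- {case}" for case in retrieved_cases[:max_cases]]
    context = "\n".join(lines) if lines else "- (none found)"
    return PROMPT_TEMPLATE.format(context=context, request=request_text)


prompt = build_prompt(
    "Please transfer the money to my personal account today.",
    ["Collector asks the customer to repay into a private account."],
)
```

The resulting string is what would be passed to the large language model for the second quality inspection.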
S106, performing P-tuning based on the large language model using historical collection recording data to obtain the customized large language model.
Specifically, the method comprises the following steps:
1) Recordings and text data in the debt collection field are collected and preprocessed. Preprocessing includes cleaning, denoising and labeling the data. For example, the text of recordings shorter than 30 seconds is removed, and target label information is added based on expert experience and historical complaint information.
2) The recording data is recognized with ASR technology, the collector and the overdue user are distinguished, and the recording data is converted into text data locally in batches.
The ASR technique is specifically the DiarizationPipeline module of the whisperX model. When a recording is recognized, the language is specified as Chinese and the number of speakers as 2; the recording file (WAV or MP3 format) is input into the whisperX model, and the model directly outputs the speakers and the text of their speech. The collector's speech text can be screened out according to the collector's fixed opening line, for example "I am from XX Bank...".
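Screening out the collector's speech from diarized output can be sketched as below. The sketch does not call whisperX. It assumes the diarized transcript is available as a list of segments, each a dictionary with `speaker` and `text` keys, and the function names and sample phrases are hypothetical.

```python
def find_collector(segments, opening_phrases):
    """Return the label of the first speaker who utters a fixed opening phrase."""
    for seg in segments:
        text = seg.get("text", "")
        if any(phrase in text for phrase in opening_phrases):
            return seg.get("speaker")
    return None


def collector_text(segments, opening_phrases):
    """Join all text spoken by the collector; empty string if none is found."""
    speaker = find_collector(segments, opening_phrases)
    if speaker is None:
        return ""
    return " ".join(
        seg["text"].strip() for seg in segments if seg.get("speaker") == speaker
    )


# Toy diarized transcript with two speakers (labels are illustrative).
segments = [
    {"speaker": "SPEAKER_00", "text": "Hello?"},
    {"speaker": "SPEAKER_01", "text": "Hello, I am from XX Bank."},
    {"speaker": "SPEAKER_00", "text": "What is this about?"},
    {"speaker": "SPEAKER_01", "text": "Your payment is overdue."},
]
spoken = collector_text(segments, ["I am from XX Bank"])
```

Matching on the opening line avoids relying on which speaker label the diarization step happens to assign to the collector.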
3) Experts label the text data, and training samples with positive or negative labels are generated according to whether the speech is appropriate.
For example, the training sample format of the model is:
{ "input": "Please determine whether the following collection recording is compliant: [collection recording text]",
"output": "non-compliant" }
4) The training samples are divided into a training set used for P-tuning and a test set used for evaluating model performance. Typically, several thousand labeled samples are prepared for training.
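Building the labeled samples and dividing them can be sketched as follows. The instruction wording, the "compliant" label for positive samples, the 80/20 split and the JSON Lines serialization are assumptions; the source specifies only the input/output sample shape and that a training set and a test set are produced.

```python
import json
import random

# Assumed instruction text prepended to every sample.
INSTRUCTION = "Please determine whether the following collection recording is compliant: "


def make_sample(text, compliant):
    """Build one training sample from collector text and an expert label."""
    return {
        "input": INSTRUCTION + text,
        "output": "compliant" if compliant else "non-compliant",
    }


def split_samples(samples, test_ratio=0.2, seed=42):
    """Shuffle deterministically and return (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def to_jsonl(samples):
    """Serialize samples one JSON object per line, keeping Chinese text readable."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


samples = [make_sample(f"recording text {i}", compliant=(i % 2 == 0)) for i in range(10)]
train_set, test_set = split_samples(samples)
```

A fixed seed keeps the split reproducible, so that successive training runs are evaluated on the same test set.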
5) The P-tuning model parameters are configured, and the customized training of the large language model is complete when model performance reaches a set threshold. The parameter with the largest influence on the result is the learning rate. The training data is loaded and trained on the GPU, and evaluation is performed on the test set after training is completed.
P-tuning is used to fine-tune the large language model: the base parameters of the pre-trained large language model are left unchanged, and only the prompt embedding layer is trained. Because there are few trainable parameters, training can be completed on a single GPU. The P-tuned customized model outputs sensitive-word quality inspection results, and its output is considerably more stable and accurate than that of the pre-trained model alone.
Fine-tuning the large language model with P-tuning on historical collection recordings yields a customized model that can identify quality inspection risks in a recording end to end.
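The check of model performance against a set threshold can be illustrated as below. The choice of accuracy as the metric, the 0.9 threshold and the function names are assumptions; the source does not say which metric or value is used. Recall on the non-compliant class is included because missed violations are usually the costlier error in quality inspection.

```python
NON_COMPLIANT = "non-compliant"


def accuracy(predictions, labels):
    """Fraction of test samples whose predicted label matches the expert label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def recall_non_compliant(predictions, labels):
    """Share of truly non-compliant samples that the model flagged."""
    relevant = [p for p, y in zip(predictions, labels) if y == NON_COMPLIANT]
    if not relevant:
        return 0.0
    return sum(p == NON_COMPLIANT for p in relevant) / len(relevant)


def training_complete(predictions, labels, threshold=0.9):
    """True when test-set accuracy reaches the set threshold (assumed 0.9)."""
    return accuracy(predictions, labels) >= threshold


labels = ["non-compliant", "compliant", "compliant", "non-compliant", "compliant"]
predictions = ["non-compliant", "compliant", "compliant", "compliant", "compliant"]
acc = accuracy(predictions, labels)
rec = recall_non_compliant(predictions, labels)
```

If the threshold is not reached, the learning rate and other P-tuning parameters would be adjusted and training repeated.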
6) The customized large language model is deployed in a production environment, where the collection system can call it through an API. The sensitive word detection method based on a deep learning algorithm detects sensitive words in the collection field efficiently and reduces missed and false detections.
Deploying the customized large language model in the production environment comprises: importing the customized large language model into the production environment and setting the model to eval mode; exposing an API service using a FastAPI interface; and submitting to the API the recording text to be evaluated, with the prompt added, the API returning the evaluation result for the recording text.
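The deployment step can be sketched as below, under stated assumptions. `generate` stands for any callable that wraps the customized model (already switched to eval mode by the caller) and maps a prompt string to an answer string. The `/evaluate` route, the request schema, the prompt prefix and the answer normalization are hypothetical. FastAPI and pydantic are imported only inside `create_app`, so the evaluation logic can be exercised without them.

```python
# Assumed prompt prefix added to the recording text before calling the model.
PROMPT_PREFIX = "Please determine whether the following collection recording is compliant: "


def evaluate_text(text, generate, prompt_prefix=PROMPT_PREFIX):
    """Add the prompt to the recording text, call the model, normalize the answer."""
    answer = generate(prompt_prefix + text)
    # Assumption: the model answers with "compliant" or "non-compliant".
    result = "non-compliant" if "non-compliant" in answer.lower() else "compliant"
    return {"text": text, "result": result}


def create_app(generate):
    """Build a FastAPI app exposing the evaluation as a POST endpoint."""
    from fastapi import FastAPI
    from pydantic import BaseModel

    class EvaluateRequest(BaseModel):
        text: str  # recording text to be evaluated

    app = FastAPI()

    @app.post("/evaluate")
    def evaluate(request: EvaluateRequest):
        return evaluate_text(request.text, generate)

    return app


def fake_generate(prompt):
    """Stand-in for the customized model, used only for illustration."""
    return "Non-compliant" if "private account" in prompt else "Compliant"


response = evaluate_text("Pay into my private account.", fake_generate)
```

In a real deployment the app returned by `create_app` would be served by an ASGI server such as uvicorn, and the collection system would post recording text to the endpoint.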
S107, importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result. Calling the customized large language model can identify non-compliant speech that is otherwise difficult to detect.
S108, if at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant, the final quality inspection result is non-compliant.
Specifically, as long as any one of the first, second and third quality inspection results is non-compliant, the final quality inspection result is non-compliant. The three quality inspections evaluate the text from three different angles, each with its own emphasis: the first simply identifies obvious profanity through a large language model, the second finds cases similar to non-compliance cases that have occurred historically, and the third uses the model to predict potential non-compliance cases that have not appeared before. Together, the three inspections cover different kinds of violations, which improves quality inspection efficiency and accuracy.
For example, a collector cursing on the phone can be identified directly by the first quality inspection. A collector asking the customer on the phone to repay into the collector's private account (a historically frequent non-compliance case) can be identified by the second quality inspection. Hints and inducements that are less clearly defined can be identified by the third quality inspection.
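The decision rule of S108 amounts to a logical OR over the three results, as in this minimal sketch. The label strings and the function name are assumptions, and returning the list of failed inspections is an addition for traceability rather than something the source requires.

```python
COMPLIANT = "compliant"
NON_COMPLIANT = "non-compliant"


def final_result(first, second, third):
    """Combine the three inspection results: any non-compliant result decides.

    Returns the final label and the names of the inspections that failed.
    """
    results = {"first": first, "second": second, "third": third}
    failed = [name for name, result in results.items() if result == NON_COMPLIANT]
    return (NON_COMPLIANT if failed else COMPLIANT), failed


label, failed = final_result(COMPLIANT, NON_COMPLIANT, COMPLIANT)
```

Knowing which inspection failed tells a reviewer whether the issue was obvious profanity, a match with a historical case, or a pattern flagged only by the customized model.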
Optionally, the detected sensitive words can be compared with the quality inspection standards to optimize the quality inspection flow and improve quality inspection efficiency and accuracy. The corpus is adjusted and continuously optimized according to the latest industry requirements and the specifications of financial institutions.
Referring to Fig. 3, the invention also provides a large-model-based quality inspection device for sensitive words in debt collection, which comprises:
an acquisition module 101, configured to acquire collection recordings generated online;
a recording conversion module 102, configured to call a transcription model API interface to convert the collection recordings into recording text;
a preprocessing module 103, configured to preprocess the recording text and segment long texts in it to obtain input text;
a first quality inspection module 104, configured to import the input text into the original model for quality inspection and output a first quality inspection result;
a second quality inspection module 105, configured to construct a local compliance vector knowledge base, call the vector knowledge base based on the large language model, import the input text into the large language model for quality inspection, and output a second quality inspection result;
a model training module 106, configured to perform P-tuning based on the large language model using historical collection recording data to obtain a customized large language model;
a third quality inspection module 107, configured to import the input text into the customized large language model for quality inspection and obtain a third quality inspection result;
a quality inspection result module 108, configured to determine that the final quality inspection result is non-compliant if at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant.
In summary, the invention has the following beneficial effects. Unstructured collection recording data accumulated in the collection business of financial institutions is mined; after preprocessing such as data cleaning, the speakers are identified and a collection sensitive-word model is generated, so that sensitive content in the speech is identified more accurately, verification against quality inspection standards is completed, and the risk posed by sensitive words that may appear in collection calls is output. With collection sensitive-word quality inspection, collection speech and text can be analyzed automatically, which reduces labor cost. With large language model technology, sensitive words can be found by analyzing collection speech and text data, which improves collection efficiency. Quality inspectors can carry out spot checks in a more targeted way, which reduces the manual quality inspection workload and improves working efficiency. The collection sensitive-word quality inspection method helps ensure compliance, improve efficiency, and reduce cost and disputes, and promotes the development of large language model technology in the financial field.
The invention is a targeted application of intelligent speech recognition and large language models in the financial field, and an innovative attempt to apply large models to collection quality inspection and compliance. The technology can be applied to quality inspection of voice, text and other data in the collection field. It effectively identifies sensitive words, improves quality inspection efficiency and accuracy, and helps protect consumer rights and improve the image of the industry. The method can also be applied to other fields that require compliance control, such as financial product recommendation, live streaming and other emerging industries. In addition to the output of the trained large language model, the method retains rule-based judgment from the knowledge base, such as expert labeling, and gives a combined quality inspection result, which effectively combines the strengths of human expertise and the large model.
The invention provides a large-model-based method and device for quality inspection of sensitive words in debt collection, which use a machine learning algorithm to inspect voice, text and other data in the collection field. The patent discloses a specific implementation and application method of large-model-based collection sensitive-word quality inspection, which helps promote technical exchange and cooperation and the development of related technologies. Using a large model for sensitive-word and compliance quality inspection reduces the manual quality inspection workload and labor cost and improves quality inspection efficiency. The large language model detects sensitive words and compliance problems in the collection field with higher accuracy, which improves quality inspection accuracy. The technology allows the compliance of the collection industry to be monitored more effectively and protects the rights and interests of consumers. For the collection industry, it can improve the industry's image and strengthen public trust. Once the method is widely adopted, it can help standardize the operation of the collection industry, prevent illegal collection practices and promote the healthy development of the industry.
It should be appreciated that the integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The technical scope of the present invention is not limited to the above description, and those skilled in the art may make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and these changes and modifications should be included in the scope of the present invention.