CN117149972B

CN117149972B - Large-model-based method and device for quality inspection of sensitive words in debt collection

Info

Publication number: CN117149972B
Application number: CN202311103890.4A
Authority: CN
Inventors: 陈希; 徐维; 段祖宁
Original assignee: Jiangsu Sushang Bank Co ltd
Current assignee: Jiangsu Sushang Bank Co ltd
Priority date: 2023-08-30
Filing date: 2023-08-30
Publication date: 2025-01-17
Anticipated expiration: 2043-08-30
Also published as: CN117149972A

Abstract

The present invention proposes a large-model-based method and device for quality inspection of sensitive words in debt collection. The method comprises: obtaining a debt collection recording generated online; calling a translation model API interface to convert the debt collection recording into recording text; preprocessing the recording text and segmenting long text to obtain input text; importing the input text into the original model for quality inspection and outputting a first quality inspection result; constructing a local debt collection compliance vector knowledge base, calling the vector knowledge base based on a large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result; training on historical debt collection recording data based on the large language model to obtain a customized large language model; importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result; and, if any one of the three results is non-compliant, determining that the final quality inspection result is non-compliant. The present invention can ensure compliant collection, improve efficiency, reduce costs, reduce disputes, and promote the development of large language models in the financial field.

Description

Large-model-based method and device for quality inspection of sensitive words in debt collection

Technical Field

The invention relates to the technical field of finance, and in particular to a large-model-based method and device for quality inspection of sensitive words in debt collection.

Background

As supervision of the financial industry tightens, compliance requirements for debt collection behavior keep rising. Enterprises need to identify and filter sensitive words in the collection process to ensure that collection behavior complies with relevant laws, regulations and industry standards, and to reduce potential legal risks. Post-loan collection is an important link in the risk management of financial institutions, and it is also the link with the most manual intervention. Traditional collection quality inspection in particular relies mainly on manual work, which is costly and inefficient and cannot meet the development needs of the financial collection industry.

Disclosure of Invention

In view of these problems, the invention provides a large-model-based method and device for quality inspection of sensitive words in debt collection, which address the high cost and low efficiency of traditional collection quality inspection, which is performed mainly by hand.

The technical scheme is as follows. A large-model-based method for quality inspection of sensitive words in debt collection comprises: obtaining a debt collection recording generated online; calling a translation model API interface to convert the collection recording into recording text; preprocessing the recording text and segmenting long text in it to obtain input text; importing the input text into the original model for quality inspection and outputting a first quality inspection result; constructing a local debt collection compliance vector knowledge base, calling the vector knowledge base based on a large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result; performing P-tuning training based on the large language model using historical collection recording data to obtain a customized large language model; importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result; and, if at least one of the first, second and third quality inspection results is non-compliant, determining that the final quality inspection result is non-compliant.

Preferably, constructing the local debt collection compliance vector knowledge base comprises: collecting collection recordings based on expert experience or historical customer complaint cases; after converting a collection recording into recording text, screening out the collector's speech; converting the recording text into 512-dimensional vectors using encoding software and storing the converted vectors in a database; converting the recording text to be evaluated into a 512-dimensional vector and computing its inner product with all vectors in the database, where a larger inner product indicates higher similarity; and, if the similarity exceeds a set threshold, indicating that the corresponding recording has a non-compliance problem that has occurred historically.

Preferably, before the input text is imported into the large language model for quality inspection, the method further comprises: reading the content of the local vector knowledge base to obtain the context related to the user request; filling a template with the request content and the context content to obtain prompt words; and inputting the prompt words into the large language model.

Preferably, preprocessing the recording text comprises removing recording text from recordings shorter than 30 seconds, and adding target label information based on expert experience and historical complaint information.

Preferably, performing P-tuning training based on the large language model using historical collection recording data comprises: collecting recording and text data in the debt collection field and preprocessing the data; using ASR technology to recognize the recording data, distinguish collectors from overdue users, and convert the recording data into text data; using expert labeling to classify the text data, and generating training samples with positive and negative labels according to whether the speech is compliant; dividing the training samples into a training set and a test set, the training set being used for P-tuning training and the test set for evaluating the model effect; configuring the P-tuning model parameters, the training of the customized large language model being complete when the model effect reaches a set threshold; and deploying the customized large language model in a production environment so that the collection system can call it through an API.

Preferably, using ASR technology to recognize the recording data, distinguish collectors from overdue users, and convert the recording data into text data comprises: using the whisperX model, with the language specified as Chinese and the number of speakers as 2; inputting the recording file into the whisperX model and outputting the speakers and the text of their speech; and screening out the collector's speech text according to the collector's fixed opening remarks.

Preferably, deploying the customized large language model in a production environment comprises: importing the customized large language model into the production environment and setting the model state to eval mode; using a fastapi interface to provide an API service externally; and submitting the recording text to be evaluated, with the prompt words added, to the API, which returns the evaluation result for the recording text.

The invention further provides a large-model-based device for quality inspection of sensitive words in debt collection, comprising: an acquisition module, used for obtaining a debt collection recording generated online; a recording conversion module, used for calling a translation model API interface to convert the collection recording into recording text; a preprocessing module, used for preprocessing the recording text and segmenting long text in it to obtain input text; a first quality inspection module, used for importing the input text into the original model for quality inspection and outputting a first quality inspection result; a second quality inspection module, used for constructing a local debt collection compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result; a model training module, used for performing P-tuning training based on the large language model using historical collection recording data to obtain a customized large language model; a third quality inspection module, used for importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result; and a quality inspection result module, used for determining that the final quality inspection result is non-compliant if at least one of the first, second and third quality inspection results is non-compliant.

Compared with the prior art, the invention has the following beneficial effects. By mining the unstructured collection recording data accumulated in the collection business of financial institutions, and after preprocessing operations such as data cleaning, the speakers are identified and a collection sensitive word model is generated; sensitive content in the speech is identified more accurately, the check against quality inspection standards is completed, and the risk of sensitive words possibly involved in the collection speech is output. With collection sensitive word quality inspection technology, collection speech and text can be analyzed automatically, reducing labor costs. With large language model technology, sensitive words can be found by analyzing collection speech and text data, improving collection efficiency. Quality inspectors can carry out spot checks in a more targeted way, which reduces the manual quality inspection workload and improves working efficiency. The collection sensitive word quality inspection method can ensure compliant collection, improve efficiency, reduce costs and disputes, and promote the development of large language model technology in the financial field.

Drawings

The disclosure of the present invention is described with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:

FIG. 1 is a flow chart of the debt collection sensitive word quality inspection method according to an embodiment of the invention;

FIG. 2 is another schematic flow chart of the debt collection sensitive word quality inspection method according to an embodiment of the invention;

FIG. 3 is a schematic structural diagram of the debt collection sensitive word quality inspection device according to an embodiment of the invention.

Detailed Description

It is to be understood that, according to the technical solution of the present invention, those skilled in the art may propose various alternative structural modes and implementation modes without changing the true spirit of the present invention. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit the invention to the precise form disclosed.

An embodiment according to the invention is shown in FIG. 1 and FIG. 2. A large-model-based method for quality inspection of sensitive words in debt collection comprises the following steps:

S101, obtaining a debt collection recording generated online.

S102, calling a translation model API interface to convert the collection recording into recording text.

S103, preprocessing the recording text and segmenting long text in it to obtain input text. Quality inspection prediction is performed on each segment separately, which works around the input length limit of the large language model.

Preprocessing the recording text comprises removing recording text from recordings shorter than 30 seconds, and adding target label information based on expert experience and historical complaint information.
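The preprocessing and segmentation in S103 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Transcript` type, the function names and the 500-character segment limit are assumptions introduced here, and the 30-second rule is read as applying to the duration of the source recording.

```python
from dataclasses import dataclass

SENTENCE_ENDS = "。！？.!?"  # assumed sentence-final punctuation


@dataclass
class Transcript:
    duration_seconds: float  # length of the source recording
    text: str                # recording text produced by speech-to-text


def drop_short(transcripts, min_seconds=30.0):
    """Keep only transcripts whose recording lasts at least min_seconds."""
    return [t for t in transcripts if t.duration_seconds >= min_seconds]


def split_sentences(text):
    """Split text after each sentence-final punctuation mark."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in SENTENCE_ENDS:
            sentences.append(current)
            current = ""
    if current:
        sentences.append(current)
    return sentences


def segment(text, max_chars=500):
    """Pack whole sentences into segments of at most max_chars characters."""
    segments, current = [], ""
    for sentence in split_sentences(text):
        # Hard-split a single sentence that alone exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            segments.append(current)
            current = ""
        current += sentence
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be inspected on its own, and joining the segments reproduces the original text, so no speech is lost by the split.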

S104, importing the input text into the original model for quality inspection and outputting a first quality inspection result. Calling the original model identifies obvious NSFW (uncivil or inappropriate) wording.

S105, constructing a local debt collection compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result. The large language model calls the local sensitive-word vector knowledge base to identify content that violates business compliance rules.

Specifically, the construction of the local compliance vector knowledge base includes:

(1) Based on expert experience or historical customer complaint cases, collect collection recordings, thereby extending the collection compliance domain information available to the pre-trained large language model.

(2) After converting the collection recording into recording text, screen out the collector's speech.

(3) Convert the recording text into 512-dimensional vectors using encoding software, and store the converted vectors in a database. The encoding software is google-universal-encoding.

(4) Convert the recording text to be evaluated into a 512-dimensional vector and compute its inner product with all vectors in the database; the larger the inner product, the higher the similarity.

(5) If the similarity exceeds the set threshold, the corresponding recording has a non-compliance problem that has occurred historically. For example, with a similarity threshold of 0.8, a recording whose similarity exceeds 0.8 is flagged as having such a problem.
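Steps (4) and (5) amount to an inner-product lookup against the stored vectors. The sketch below uses plain Python lists and toy 4-dimensional vectors in place of the 512-dimensional encoder output; the function names and the dictionary-based "database" are hypothetical. It also assumes the vectors are L2-normalized before comparison, which the text does not state but which makes a fixed threshold such as 0.8 meaningful.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def find_matches(query_vec, knowledge_base, threshold=0.8):
    """Return (case_id, similarity) pairs above the threshold, best first."""
    q = normalize(query_vec)
    scored = [
        (case_id, inner_product(q, normalize(vec)))
        for case_id, vec in knowledge_base.items()
    ]
    hits = [(case_id, score) for case_id, score in scored if score > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy knowledge base: one vector per historical non-compliant case.
knowledge_base = {
    "repay_to_private_account": [1.0, 0.0, 0.0, 0.0],
    "threatening_language": [0.0, 1.0, 0.0, 0.0],
}
matches = find_matches([0.9, 0.1, 0.0, 0.0], knowledge_base)
```

A non-empty `matches` list means the evaluated text resembles at least one historical non-compliant case closely enough to be flagged.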

Before the input text is imported into the large language model for quality inspection, the method further comprises: reading the content of the local vector knowledge base to obtain the context related to the user request; filling a template with the request content and the context content to obtain prompt words; and inputting the prompt words into the large language model.
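The template-filling step can be illustrated with a short sketch. The template wording, the `build_prompt` name and the bullet-list layout of the context are assumptions made for illustration; the patent does not disclose the actual template.

```python
# Hypothetical template; the real wording is not given in the source.
PROMPT_TEMPLATE = (
    "Known non-compliant cases:\n{context}\n\n"
    "Please determine whether the following collection recording "
    "is compliant:\n{request}"
)


def build_prompt(request_text, context_snippets):
    """Fill the template with retrieved context and the text to evaluate."""
    context = "\n".join(f"- {s}" for s in context_snippets) or "(none)"
    return PROMPT_TEMPLATE.format(context=context, request=request_text)


prompt = build_prompt(
    "Please transfer the repayment to my personal account.",
    ["Collector asked the customer to repay into a private account."],
)
```

The resulting string is what would be passed to the large language model as its input.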

S106, performing P-tuning training based on the large language model using historical collection recording data to obtain the customized large language model.

Specifically, the method comprises the following steps:

1) Collect recording and text data in the debt collection field and preprocess the data. Preprocessing includes cleaning, denoising and labeling the data. For example, recording text from recordings shorter than 30 seconds is removed, and target label information is added based on expert experience and historical complaint information.

2) Use ASR technology to recognize the recording data, distinguish collectors from overdue users, and convert the recording data into text data locally in batches.

The ASR technology is specifically the DiarizationPipeline module of the whisperX model. When a recording is recognized, the language is specified as Chinese and the number of speakers as 2; the recording file (WAV or MP3 format) is input into the whisperX model, and the model directly outputs the speakers and the text of their speech. The collector's speech text can be screened out according to the collector's fixed opening remarks, for example "I am from XX Bank...".
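Screening the collector's speech from diarized output might look like the sketch below. It does not call whisperX itself; it assumes the diarized result has already been reduced to a list of dictionaries with `speaker` and `text` keys, and the marker phrases and function name are hypothetical.

```python
def collector_utterances(segments, opening_markers=("I am from", "我是")):
    """Return the texts spoken by the speaker who uses a fixed opening remark.

    segments: list of {"speaker": str, "text": str} in time order
    (an assumed shape for diarized speech-to-text output).
    """
    collector = None
    for seg in segments:
        # The first speaker who utters an opening marker is taken as the collector.
        if any(marker in seg["text"] for marker in opening_markers):
            collector = seg["speaker"]
            break
    if collector is None:
        return []  # no opening remark found; nothing can be attributed
    return [seg["text"] for seg in segments if seg["speaker"] == collector]


segments = [
    {"speaker": "SPEAKER_00", "text": "Hello, I am from XX Bank."},
    {"speaker": "SPEAKER_01", "text": "What is this about?"},
    {"speaker": "SPEAKER_00", "text": "Your loan is overdue."},
]
collector_text = collector_utterances(segments)
```

Only the collector's utterances are then passed on for labeling and quality inspection.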

3) Use expert labeling to classify the text data, and generate training samples with positive and negative labels according to whether the speech is compliant.

For example, the training sample format of the model is:

{ "input": "Please determine whether the following collection recording is compliant: [collection recording text]",

"output": "non-compliant" }

4) The training samples are divided into a training set used for P-tuning training and a test set used for evaluating the model effect. Typically, several thousand labeled samples are prepared.
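Building samples in this format and splitting them can be sketched as follows. The instruction wording, the 80/20 split ratio, the fixed random seed and the function names are assumptions for illustration only.

```python
import json
import random

INSTRUCTION = (
    "Please determine whether the following collection recording is compliant: "
)


def make_sample(text, compliant):
    """Build one input/output training sample from an expert label."""
    return {
        "input": INSTRUCTION + text,
        "output": "compliant" if compliant else "non-compliant",
    }


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle reproducibly and split into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


labeled = [(f"recording text {i}", i % 2 == 0) for i in range(10)]
samples = [make_sample(text, ok) for text, ok in labeled]
train_set, test_set = split_samples(samples)

# One sample per line (JSON Lines) is a common on-disk layout for such data.
first_line = json.dumps(train_set[0], ensure_ascii=False)
```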

5) Configure the P-tuning model parameters; training of the customized large language model is complete when the model effect reaches a set threshold. The model parameter with the largest influence on the result is the learning rate. The training data are read in and trained on the GPU, and evaluation is performed on the test set after training completes.

P-tuning is used to fine-tune the large language model: the base parameter values of the pre-trained large language model are left unchanged, and only the prompt embedding layer of the large language model is fine-tuned. Because there are few trainable parameters, training can be completed on a single GPU. The P-tuning customized model can output sensitive word quality inspection results, and the stability and accuracy of its output are greatly improved compared with using the pre-trained model alone.

The large language model is fine-tuned with P-tuning on historical collection recordings, and the fine-tuned customized model can identify quality inspection risks in a recording end to end.
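A rough back-of-the-envelope calculation shows why training only a prompt embedding layer fits on a single GPU. The figures below (128 virtual prompt tokens, hidden size 4096, a 6-billion-parameter base model) are hypothetical and not taken from the patent, and the count assumes a single trainable embedding table; P-tuning variants that add a prompt encoder or per-layer prefixes would train somewhat more.

```python
def prompt_embedding_params(num_virtual_tokens, hidden_size):
    """Trainable parameters in one prompt embedding table."""
    return num_virtual_tokens * hidden_size


def trainable_fraction(num_virtual_tokens, hidden_size, base_params):
    """Share of all parameters that receive gradient updates."""
    trainable = prompt_embedding_params(num_virtual_tokens, hidden_size)
    return trainable / (base_params + trainable)


# Hypothetical sizes chosen only to illustrate the order of magnitude.
trainable = prompt_embedding_params(128, 4096)
fraction = trainable_fraction(128, 4096, 6_000_000_000)
print(f"trainable parameters: {trainable:,}")
print(f"fraction of total: {fraction:.6%}")
```

Under these assumptions, roughly half a million parameters are updated while billions stay frozen, so optimizer state and gradients take only a small amount of memory.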

6) Deploy the customized large language model in a production environment so that the collection system can call it through an API. This deep-learning-based sensitive word detection method detects sensitive words in the collection field efficiently and reduces missed and false detections.

Deploying the customized large language model in a production environment comprises: importing the customized large language model into the production environment and setting the model state to eval mode; using a fastapi interface to provide an API service externally; and submitting the recording text to be evaluated, with the prompt words added, to the API, which returns the evaluation result for the recording text.

S107, importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result. Calling the customized large language model can identify speech that is otherwise difficult to identify.

S108, if at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant, the final quality inspection result is non-compliant.

Specifically, as long as any one of the first, second and third quality inspection results is non-compliant, the final quality inspection result is non-compliant. The three quality inspections evaluate from three different angles, each with its own emphasis: the first simply identifies obvious foul language through the large language model; the second finds cases similar to historical non-compliance cases; and the third uses the model to predict potential non-compliance cases that have not appeared before. With three quality inspection processes, different kinds of content can each be inspected accurately, improving quality inspection efficiency and accuracy.

For example, a collector swearing on the phone can be identified directly by the first quality inspection. A collector asking the customer on the phone to repay into the collector's private account (a historically frequent non-compliance case) can be identified by the second quality inspection. Hints and inducements that are less clearly defined may be identified by the third quality inspection.
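The combination rule in S108 is a logical OR over the three verdicts. A minimal sketch, with the `Verdict` enum and `final_verdict` function introduced here purely for illustration:

```python
from enum import Enum


class Verdict(Enum):
    COMPLIANT = "compliant"
    NON_COMPLIANT = "non-compliant"


def final_verdict(first, second, third):
    """Non-compliant if any of the three quality inspections says so."""
    results = (first, second, third)
    if any(r is Verdict.NON_COMPLIANT for r in results):
        return Verdict.NON_COMPLIANT
    return Verdict.COMPLIANT


# Only the knowledge-base check (second inspection) flags the recording.
outcome = final_verdict(
    Verdict.COMPLIANT, Verdict.NON_COMPLIANT, Verdict.COMPLIANT
)
```

Because the rule is an OR, adding a check can only make the overall inspection stricter, which trades some extra false alarms for fewer missed violations.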

Optionally, the detected sensitive words can be compared against quality inspection standards to optimize the quality inspection process and improve its efficiency and accuracy. The corpus is adjusted and continuously optimized according to the latest industry requirements and the specifications of financial institutions.

Referring to FIG. 3, the invention also provides a large-model-based device for quality inspection of sensitive words in debt collection, comprising:

an acquisition module 101, used for obtaining a debt collection recording generated online;

a recording conversion module 102, used for calling a translation model API interface to convert the collection recording into recording text;

a preprocessing module 103, used for preprocessing the recording text and segmenting long text in it to obtain input text;

a first quality inspection module 104, used for importing the input text into the original model for quality inspection and outputting a first quality inspection result;

a second quality inspection module 105, used for constructing a local debt collection compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result;

a model training module 106, used for performing P-tuning training based on the large language model using historical collection recording data to obtain a customized large language model;

a third quality inspection module 107, used for importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result;

a quality inspection result module 108, used for determining that the final quality inspection result is non-compliant if at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant.

In summary, the invention has the following beneficial effects. By mining the unstructured collection recording data accumulated in the collection business of financial institutions, and after preprocessing operations such as data cleaning, the speakers are identified and a collection sensitive word model is generated; sensitive content in the speech is identified more accurately, the check against quality inspection standards is completed, and the risk of sensitive words possibly involved in the collection speech is output. With collection sensitive word quality inspection technology, collection speech and text can be analyzed automatically, reducing labor costs. With large language model technology, sensitive words can be found by analyzing collection speech and text data, improving collection efficiency. Quality inspectors can carry out spot checks in a more targeted way, which reduces the manual quality inspection workload and improves working efficiency. The collection sensitive word quality inspection method can ensure compliant collection, improve efficiency, reduce costs and disputes, and promote the development of large language model technology in the financial field.

The invention is a targeted application optimization of intelligent speech recognition and large language model projects in the financial field, and an innovative attempt to apply large models to collection quality inspection and compliance. The technology can be applied to quality inspection of speech, text and other data in the collection field; it effectively identifies sensitive words, improves quality inspection efficiency and accuracy, and helps protect consumer rights and improve the industry's image. The method can also be applied to other fields that require compliance control, such as financial product recommendation, live streaming and other emerging industries. In addition to the output of the trained large language model, the method retains knowledge-base rule judgment such as expert labeling and gives a combined quality inspection result, effectively combining the strengths of human expertise and large models.

The invention provides a large-model-based method and device for quality inspection of sensitive words in debt collection, which use machine learning algorithms to inspect speech, text and other data in the collection field and are technically innovative. The patent discloses a specific implementation process and application method of large-model-based collection sensitive word quality inspection, which helps promote technical exchange and cooperation and the development of related technologies. Using a large model for sensitive word and compliance quality inspection reduces cost and increases efficiency in the collection field: it greatly reduces the manual quality inspection workload, lowers labor costs and improves quality inspection efficiency. The large language model has high detection accuracy and can effectively find sensitive words and compliance problems in the collection field, improving quality inspection accuracy. Through large-language-model-based sensitive word and compliance quality inspection, the patent enables better monitoring of compliance in the collection industry and protects consumers' rights and interests. For the collection industry, the technology can improve the industry's image and strengthen public trust in it. Once the method is widely adopted, it can standardize the operation of the collection industry, prevent illegal collection behavior, and promote the healthy development of the industry.

It should be appreciated that the integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The technical scope of the present invention is not limited to the above description, and those skilled in the art may make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and these changes and modifications should be included in the scope of the present invention.

Claims

1. A collection sensitive word quality inspection method based on a large model, characterized by comprising the following steps:

Obtain collection recordings generated online;

Calling the translation model API interface to convert the collection recording into recording text;

Preprocessing the recorded text and segmenting the long text in the recorded text to obtain input text;

Importing the input text into the original model for quality inspection, and outputting a first quality inspection result;

Building a local debt collection compliance vector knowledge base, calling the vector knowledge base based on the large language model, importing the input text into the large language model for quality inspection, and outputting a second quality inspection result;

Based on historical collection recording data and P-tuning training based on the large language model, a customized large language model is obtained;

The method of obtaining a customized large language model by performing P-tuning training based on the historical collection recording data and the large language model includes:

Collect audio and text data in the field of debt collection and pre-process the data;

Use ASR technology to identify recording data, distinguish between debt collectors and overdue users, and convert recording data into text data;

Using experts to label the text data, assigning positive and negative labels based on compliance, and generating training samples;

Dividing the training samples into a training set and a test set, wherein the training set is used for p-tuning training and the test set is used for evaluating the model effect;

Configure the p-tuning model parameters. When the model effect reaches the set threshold, the customized large language model training is completed.

Deploy the customized large language model in a production environment so that it can be called by the collection system through an API;

Importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result;

If at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant, the final quality inspection result is non-compliant.

2. The method for quality inspection of sensitive words for debt collection based on a large model according to claim 1, characterized in that the construction of a local debt collection compliance vector knowledge base comprises:

Collect collection recordings based on expert experience or historical customer complaint cases;

After converting the debt collection recording into a text recording, the speech portion of the debt collector is screened out;

Using encoding software to convert the recorded text into a 512-dimensional vector, and storing the converted vector in a database;

After converting the recorded text to be evaluated into a 512-dimensional vector, the inner product is calculated with all the vectors in the database. The larger the inner product, the higher the similarity;

If the similarity exceeds the set threshold, it means that the corresponding recording has historical non-compliance issues.

3. The method for quality inspection of sensitive words in debt collection based on a large model according to claim 1, characterized in that before the input text is imported into the large language model for quality inspection, it also includes:

Read the content and obtain the context related to the user request;

Fill the template with the request content and context content to obtain the prompt words;

The prompt words are input into the large language model.

4. The large-model-based collection sensitive word quality inspection method according to claim 1 is characterized in that the recording text is preprocessed, including: removing recording text less than 30 seconds, and adding target label information based on expert experience and historical complaint information.

5. The method for quality inspection of sensitive words for debt collection based on a large model according to claim 1 is characterized in that the use of ASR technology to identify recording data, distinguish between debt collectors and overdue users, and convert the recording data into text data includes:

Using the whisperX model, specify the language as Chinese and the speakers as 2 people;

Input the recording file to the whisperX model and output the speaker and speech content text;

The speech text data of the debt collectors are filtered out according to their fixed opening remarks.

6. The method for quality inspection of sensitive words in debt collection based on a large model according to claim 1, characterized in that the customized large language model is deployed in a production environment, comprising:

Import the customized large language model into the production environment, and adjust the model state to eval mode;

Use fastapi interface to provide API services to the outside world;

Provide the recording text to be evaluated with the prompt words on the API, and the evaluation result of the recording text will be returned.

7. A debt collection sensitive word quality inspection device based on a large model, characterized by comprising:

An acquisition module, used to obtain the collection recordings generated online;

A recording conversion module, used to call the translation model API interface to convert the collection recording into a recording text;

A preprocessing module, used to preprocess the recorded text and segment the long text in the recorded text to obtain input text;

A first quality inspection module, used for importing the input text into the original model for quality inspection and outputting a first quality inspection result;

A second quality inspection module is used to build a local debt collection compliance vector knowledge base, call the vector knowledge base based on the large language model, import the input text into the large language model for quality inspection, and output a second quality inspection result;

The model training module is used to perform P-tuning training based on the large language model according to the historical collection recording data to obtain a customized large language model; the P-tuning training based on the large language model according to the historical collection recording data to obtain a customized large language model includes:

A third quality inspection module, used for importing the input text into the customized large language model for quality inspection to obtain a third quality inspection result;

Quality inspection result module: if at least one of the first quality inspection result, the second quality inspection result and the third quality inspection result is non-compliant, the final quality inspection result is non-compliant.