CN111291535A

CN111291535A - Script processing method and device, electronic equipment and computer readable storage medium

Info

Publication number: CN111291535A
Application number: CN202010136869.4A
Authority: CN
Inventors: 郏昕; 阳任科; 赵冲翔
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2020-03-02
Filing date: 2020-03-02
Publication date: 2020-06-16
Anticipated expiration: 2040-03-02
Also published as: CN111291535B

Abstract

Embodiments of the present invention provide a script processing method, apparatus, electronic device, and computer-readable storage medium, belonging to the field of computer technology. In the method, the script to be processed is divided into multiple episodes according to a preset episode number expression range; an episode is divided into multiple scene texts according to a preset scene number expression range; the scene information characters contained in a scene text are extracted; the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs are determined as the information to be organized of the scene text; and the information to be organized of the scene text and the body text in the scene text are combined according to a preset form. Taking a single scene text as the processing object reduces the coupling within the script to a certain extent, thereby improving extraction accuracy. Recombining the scene text according to the preset form keeps the internal form of the scene texts consistent, which facilitates processing.

Description

Script processing method, apparatus, electronic device, and computer-readable storage medium

Technical Field

The present invention belongs to the field of computer technology, and in particular relates to a script processing method, apparatus, electronic device, and computer-readable storage medium.

Background

Application scenarios such as standardized management, overall shooting coordination, and intelligent script evaluation often involve analyzing the information to be organized in a script. The information to be organized refers to the episode number, the scene number, and the scene information characters for time, place, and characters. These scene information characters are often scattered throughout the script content. In the prior art, a few fixed scene information format templates are usually defined in advance, and the information to be organized is extracted directly from the script using these fixed-format templates.

Because screenwriters have different writing habits, the structure of the text differs considerably from script to script. As a result, when the format of a script differs greatly from the format in the fixed-format templates, the information to be organized that is extracted according to those templates has low accuracy.

Summary of the Invention

The present invention provides a script processing method, apparatus, electronic device, and computer-readable storage medium, so as to solve the problem that the extracted information to be organized has low accuracy.

In a first aspect of the present invention, a script processing method applied to an electronic device is provided, the method including:

determining, according to a preset episode number expression range, the episode numbers contained in the script to be processed and the positions of the episode numbers, and dividing the script to be processed into multiple episodes according to the contained episode numbers and their positions;

for at least one of the episodes, determining, according to a preset scene number expression range, the scene numbers contained in the episode and the positions of the scene numbers, and dividing the episode into multiple scene texts according to the scene numbers and their positions;

for at least one of the scene texts, extracting the scene information characters contained in the scene text;

determining the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be organized of the scene text;

combining the information to be organized of the scene text and the body text in the scene text according to a preset form to form a target script.

In a second aspect of the present invention, a script processing apparatus applied to an electronic device is also provided, the apparatus including:

a first determining module, configured to determine, according to a preset episode number expression range, the episode numbers contained in the script to be processed and the positions of the episode numbers, and to divide the script to be processed into multiple episodes according to the contained episode numbers and their positions;

a second determining module, configured to, for at least one of the episodes, determine, according to a preset scene number expression range, the scene numbers contained in the episode and the positions of the scene numbers, and to divide the episode into multiple scene texts according to the scene numbers and their positions;

an extraction module, configured to, for at least one of the scene texts, extract the scene information characters contained in the scene text;

a third determining module, configured to determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be organized of the scene text;

a combining module, configured to combine the information to be organized of the scene text and the body text in the scene text according to a preset form to form a target script.

In yet another aspect of the present invention, a computer-readable storage medium is also provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute any one of the script processing methods described above.

In yet another aspect of the present invention, a computer program product containing instructions is also provided which, when run on a computer, causes the computer to execute any one of the script processing methods described above.

The script processing method provided by the embodiments of the present invention determines, according to a preset episode number expression range, the episode numbers contained in the script to be processed and their positions, and divides the script into multiple episodes accordingly. It then determines, according to a preset scene number expression range, the scene numbers contained in an episode and their positions, and divides the episode into multiple scene texts accordingly. For at least one scene text, it extracts the scene information characters contained in the scene text; determines those characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be organized of the scene text; and combines the information to be organized and the body text in the scene text according to a preset form to form a target script. By first dividing the script to be processed into scene texts and extracting from a single scene text at a time, the embodiments reduce the coupling within the script to a certain extent, which in turn reduces the interference of script format with scene information extraction and improves extraction accuracy. In addition, after the scene information is extracted, the scene text is recombined according to the preset form, so that the internal form of every scene text in the script is consistent, which facilitates subsequent processing of the script.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below.

FIG. 1 is a flowchart of the steps of a script processing method provided by an embodiment of the present invention;

FIG. 2-1 is a flowchart of the steps of another script processing method provided by an embodiment of the present invention;

FIG. 2-2 is a schematic diagram of preprocessing provided by an embodiment of the present invention;

FIG. 2-3 is a schematic diagram of a processing flow provided by an embodiment of the present invention;

FIG. 2-4 is a schematic diagram of processing provided by an embodiment of the present invention;

FIG. 2-5 is a schematic diagram of the composition of a scene text provided by an embodiment of the present invention;

FIG. 3 is a block diagram of a script processing apparatus provided by an embodiment of the present invention;

FIG. 4 is a structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.

FIG. 1 is a flowchart of the steps of a script processing method provided by an embodiment of the present invention. The method may be applied to an electronic device. As shown in FIG. 1, the method may include:

Step 101: Determine, according to a preset episode number expression range, the episode numbers contained in the script to be processed and the positions of the episode numbers, and divide the script to be processed into multiple episodes according to the contained episode numbers and their positions.

In this embodiment of the present invention, the preset episode number expression range may be obtained in advance, before this step is performed. The episode number expression range may contain episode numbers in multiple expression forms. Specifically, to obtain episode numbers in these expression forms, a large number of sample scripts may be acquired; the sample scripts may be scripts crawled from the network. The characters that represent episode numbers are then extracted from the sample scripts and grouped by expression form, yielding episode numbers in multiple different expression forms. The expression form of an episode number may be text, digits, English, Roman numerals, and so on.
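The grouping of sample episode numbers by expression form can be illustrated with a minimal Python sketch. The four form names, the detector patterns, and the sample labels below are assumptions made for illustration; the text does not specify how the grouping is carried out.

```python
import re
from collections import defaultdict

# Hypothetical detectors, checked in order; each one recognizes a single expression form.
FORM_PATTERNS = [
    ("digits", re.compile(r"\d+")),
    ("chinese_numerals", re.compile(r"[零一二三四五六七八九十百]+")),
    ("roman_numerals", re.compile(r"\b[IVXLCDM]+\b")),
    ("english_words", re.compile(r"\b(?:one|two|three|four|five|six|seven|eight|nine|ten)\b",
                                 re.IGNORECASE)),
]


def classify_form(label):
    """Return the name of the first expression form whose detector matches the label."""
    for name, pattern in FORM_PATTERNS:
        if pattern.search(label):
            return name
    return "other"


def summarize_forms(labels):
    """Group episode-number strings taken from sample scripts by expression form."""
    groups = defaultdict(set)
    for label in labels:
        cleaned = label.strip()
        groups[classify_form(cleaned)].add(cleaned)
    return {form: sorted(items) for form, items in groups.items()}


if __name__ == "__main__":
    samples = ["第1集", "第一集", "Episode IV", "Episode One", "第十二集"]
    for form, items in summarize_forms(samples).items():
        print(form, items)
```

The groups collected this way would make up the expression range that later steps match against.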

Further, in practical application scenarios, a script often includes multiple episodes, and each episode often contains different scenes, where each scene corresponds to a piece of scene text that describes the content of the scene. Each episode and each scene text has a corresponding number to make them easy to distinguish. Therefore, in this step, the episode numbers contained in the script to be processed and their positions may first be determined based on the preset episode number expression range. Specifically, the episode numbers contained in the preset episode number expression range may be matched against the content of the script to be processed; the matching characters are determined as an episode number, and the position of the matching characters is determined as the position of that episode number. The script may then be divided according to the identified episode numbers and positions to obtain the episodes contained in the script to be processed. Because episode numbers in multiple different expression forms are collected in advance, matching with these preset episode numbers avoids, to a certain extent, the problem that episode numbers cannot be identified accurately because script writers have different writing habits, and thus improves the accuracy of episode division.

Step 102: For at least one of the episodes, determine, according to a preset scene number expression range, the scene numbers contained in the episode and the positions of the scene numbers, and divide the episode into multiple scene texts according to the scene numbers and their positions.

In this embodiment of the present invention, the preset scene number expression range may be obtained in advance, before this step is performed. The scene number expression range may contain scene numbers in multiple expression forms. Specifically, to obtain scene numbers in these expression forms, a large number of sample scripts may be acquired; the characters that represent scene numbers are then extracted from the sample scripts and grouped by expression form, yielding scene numbers in multiple different expression forms. The expression form of a scene number may be text, digits, English, Roman numerals, and so on.

Further, when the scene numbers contained in an episode and their positions are determined based on the preset scene number expression range, the scene numbers contained in the preset scene number expression range may be matched against the content of the episode; the matching characters are determined as a scene number, and the position of the matching characters is determined as the position of that scene number. The episode may then be divided according to the identified scene numbers and positions to obtain the scene texts contained in the episode. Because scene numbers in multiple different expression forms are collected in advance, matching with these preset scene numbers avoids, to a certain extent, the problem that scene numbers cannot be identified accurately because script writers have different writing habits, and thus improves the accuracy of scene text division.

Step 103: For at least one of the scene texts, extract the scene information characters contained in the scene text.

In this embodiment of the present invention, the scene information characters contained in a scene text may be the characters, words, numbers, English characters, and so on that represent scene information. By extracting the scene information characters contained in the scene text, the scene information contained in the scene text can be determined.

Step 104: Determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be organized of the scene text.

For example, assuming that the scene information characters contained in the scene text are "raining, school operations, all teachers", the scene number of the scene text is 23, and the episode number of the episode to which the scene text belongs is 1, then "1, 23, raining, school operations, all teachers" may be determined as the information to be organized of the scene text.
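The example above can be expressed as a small record type. This is a minimal sketch; the class name `SceneRecord`, its field names, and the comma-separated rendering are illustrative assumptions rather than anything prescribed by the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneRecord:
    """Information to be organized for one scene text (hypothetical structure)."""
    episode_number: int
    scene_number: int
    scene_info: List[str]

    def as_text(self) -> str:
        # Episode number first, then scene number, then the scene information characters.
        parts = [str(self.episode_number), str(self.scene_number), *self.scene_info]
        return ", ".join(parts)


if __name__ == "__main__":
    record = SceneRecord(
        episode_number=1,
        scene_number=23,
        scene_info=["raining", "school operations", "all teachers"],
    )
    print(record.as_text())  # 1, 23, raining, school operations, all teachers
```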

Step 105: Combine the information to be organized of the scene text and the body text in the scene text according to a preset form to form a target script.

In this embodiment of the present invention, the body text refers to the text in the scene text other than the information to be organized. The preset form may be set according to actual requirements. For example, the preset form may be that the scene information is located in the first paragraph of the scene text and the body text follows the first paragraph. By recombining the scene texts according to the preset form in this way, the internal form of every scene text in the script is kept consistent, which facilitates subsequent processing of the script.
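One possible reading of this preset form, with the organized information as the first paragraph and the body text after it, is sketched below in Python. The function names, the dictionary keys, and the blank line between scenes are assumptions chosen for illustration.

```python
def combine_scene(episode_number, scene_number, scene_info, body_text):
    """Put the information to be organized in the first paragraph, body text after it."""
    header = ", ".join([str(episode_number), str(scene_number), *scene_info])
    return header + "\n" + body_text.strip()


def build_target_script(scenes):
    """Join the recombined scene texts; a blank line between scenes is an assumption."""
    return "\n\n".join(combine_scene(**scene) for scene in scenes)


if __name__ == "__main__":
    scenes = [
        {"episode_number": 1, "scene_number": 23,
         "scene_info": ["raining", "school", "all teachers"],
         "body_text": "  The teachers gather.\n"},
        {"episode_number": 1, "scene_number": 24,
         "scene_info": ["night", "office"],
         "body_text": "The lights go out."},
    ]
    print(build_target_script(scenes))
```

Because every scene goes through the same `combine_scene` call, each scene text in the output has the same internal layout regardless of how the source script was written.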

In summary, the script processing method provided by the embodiments of the present invention determines, according to a preset episode number expression range, the episode numbers contained in the script to be processed and their positions, and divides the script into multiple episodes accordingly. It then determines, according to a preset scene number expression range, the scene numbers contained in an episode and their positions, and divides the episode into multiple scene texts accordingly. For at least one scene text, it extracts the scene information characters contained in the scene text; determines those characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the information to be organized of the scene text; and combines the information to be organized and the body text in the scene text according to a preset form to form a target script. By first dividing the script to be processed into scene texts and extracting from a single scene text at a time, the embodiments reduce the coupling within the script to a certain extent, which in turn reduces the interference of script format with scene information extraction and improves extraction accuracy. In addition, after the scene information is extracted, the scene text is recombined according to the preset form, so that the internal form of every scene text in the script is consistent, which facilitates subsequent processing of the script.

FIG. 2-1 is a flowchart of the steps of another script processing method provided by an embodiment of the present invention. The method may be applied to an electronic device. As shown in FIG. 2-1, the method may include:

Step 201: Perform a preprocessing operation on the script to be processed.

In this step, the preprocessing operation may be an operation that normalizes the script to be processed. Specifically, the preprocessing operation may include one or more of the following operations:

(1) Delete the interference information in the script to be processed.

The interference information may be defined according to the information that actually interferes with script processing. For example, the interference information may include at least one or more of: page number information, line number information, spaces at the start and end of each line, and tab characters. The page number information may be the page number of each page of the script to be processed, and the line number information may be the line number of each line of the script to be processed; these may be introduced when the script to be processed is a PDF file. When deleting page numbers and line numbers, all the page number and line number formats that PDF files introduce may be summarized in advance, and the script to be processed is then matched against these formats. Because page numbers and line numbers are both numeric, deleting them prevents them from interfering with the recognition of episode numbers and scene numbers. Because spaces and tab characters at the start and end of each line affect the recognition of text paragraphs, deleting these useless characters reduces, to a certain extent, the processing required for the script to be processed and thereby improves the processing result. Specifically, during deletion, the script to be processed may be traversed line by line to detect whether each line contains interference information, and any interference information detected is deleted directly. Of course, the interference information may also include other information, for example, information introduced by Word files.

(2) Convert the font in the script to be processed into a preset font.

The preset font may be set in advance; for example, the preset font may be simplified Chinese characters. By converting all traditional characters in the script to be processed into simplified characters, recognition errors caused by inconsistent character forms can be avoided. Specifically, the simplified character corresponding to each character in the script to be processed may be looked up on the network and then used to replace that character, thereby implementing the conversion. Of course, the conversion may also be implemented in other ways, which is not limited in this embodiment of the present invention.

(3) According to a preset symbol mapping relationship, convert each punctuation mark in the script to be processed into the punctuation mark to which it corresponds in the symbol mapping relationship.

In this operation, when the preset symbol mapping relationship is established, punctuation marks that play the same actual role in text may be assigned to the same category according to the role they play, yielding multiple categories. Then, for each category, the most frequently used punctuation mark in the category is selected as the representative punctuation mark of the category, and a mapping is established between every punctuation mark in the category and the representative punctuation mark, which yields the symbol mapping relationship. That is, the punctuation mark to which every punctuation mark in the category corresponds in the symbol mapping relationship is the representative punctuation mark. Further, when this operation is performed, the punctuation mark corresponding to each punctuation mark in the script to be processed may be looked up in the mapping relationship and used to replace the original punctuation mark in the script to be processed, thereby completing the punctuation mapping. By converting punctuation marks in this way, punctuation marks that play the same role are expressed in the same way, which makes the script to be processed more standardized and reduces, to a certain extent, interference with subsequent processing.

(4) According to a preset range of available punctuation marks, delete the punctuation marks in the script to be processed that do not belong to the range of available punctuation marks.

In this operation, the range of available punctuation marks may be set according to the actual situation. For example, the range may be built from the punctuation marks that do not interfere with subsequent processing operations. Specifically, when this operation is performed, each punctuation mark in the script to be processed may be compared one by one with the punctuation marks contained in the range of available punctuation marks. If the range contains the same punctuation mark, the punctuation mark is kept; if the range does not contain the same punctuation mark, the punctuation mark is deleted. By selectively deleting some punctuation marks in this way, the script to be processed becomes more standardized, which reduces, to a certain extent, interference with subsequent processing. Of course, the preprocessing operation may also include other operations to further improve standardization, for example, converting full-width symbols in the script to be processed into half-width symbols, or deleting all characters in the script to be processed that fall outside a specified range, where the out-of-range part may be the part predefined according to actual requirements as not needing to be processed. This is not limited in this embodiment of the present invention.
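Operations (1), (3), and (4) can be sketched together in Python as follows. The page-number formats, the punctuation mapping, and the allowed set are all hypothetical placeholders; a real deployment would derive them from sample scripts as described above. Operation (2) is left out because it requires a traditional-to-simplified conversion table.

```python
import re
import unicodedata

# Hypothetical page-number formats; a real list would be summarized from sample PDF files.
PAGE_NUMBER = re.compile(r"^(?:-\s*\d+\s*-|第\s*\d+\s*页|Page\s+\d+)$", re.IGNORECASE)

# Hypothetical symbol mapping: every variant points to one representative mark per category.
PUNCT_MAP = str.maketrans({
    "：": ":", "﹕": ":", "︰": ":",
    "（": "(", "）": ")",
    "【": "[", "】": "]",
})

# Hypothetical range of available punctuation marks.
ALLOWED_PUNCT = set(":()[],.!?，。！？、")


def strip_noise(text):
    """Operation (1): drop page-number lines and leading/trailing spaces and tabs."""
    kept = []
    for line in text.splitlines():
        line = line.strip(" \t\u3000")  # \u3000 is the full-width space
        if PAGE_NUMBER.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)


def map_punctuation(text):
    """Operation (3): replace each punctuation mark with its representative mark."""
    return text.translate(PUNCT_MAP)


def drop_disallowed_punctuation(text):
    """Operation (4): delete punctuation marks that are outside the allowed set."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") or ch in ALLOWED_PUNCT
    )


def preprocess(text):
    return drop_disallowed_punctuation(map_punctuation(strip_noise(text)))


if __name__ == "__main__":
    raw = "  第1集\t\n- 3 -\n1【日】（内）教室#\n\u3000张三：你好*。 "
    print(preprocess(raw))
```

Mapping runs before filtering so that variants such as `【` are first folded into an allowed representative instead of being deleted.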

示例的，图2-2是本发明实施例提供的一种预处理示意图，如图2-2所示，对于待处理剧本可以进行环节二中的预处理操作，最后得到预处理后的待处理剧本。其中，环节一中读取的范围值及正则表达式，可以是进行预处理以及执行后续步骤时会用到的。这样，通过提前读取，一定程度上可以提高处理效率。需要说明的是，图中仅示出了读取的部分内容，实际应用中，读取的内容还可以包括其他内容。By way of example, Fig. 2-2 is a schematic diagram of a preprocessing provided by an embodiment of the present invention. As shown in Fig. 2-2, the preprocessing operation in link 2 can be performed for the script to be processed, and finally the preprocessed to-be-processed is obtained. script. Among them, the range value and regular expression read in link 1 may be used for preprocessing and subsequent steps. In this way, by reading in advance, processing efficiency can be improved to a certain extent. It should be noted that the figure only shows part of the read content, and in practical applications, the read content may also include other content.

步骤202、根据预设的集编号表述范围，确定所述待处理剧本中包含的集编号以及所述集编号的位置，并根据所述包含的集编号及所述集编号的位置将所述待处理剧本分割为多个剧集。Step 202: Determine the episode number included in the script to be processed and the position of the episode number according to a preset expression range of episode numbers, and assign the to-be-processed episode number according to the included episode number and the position of the episode number. Process the script split into multiple episodes.

具体的，本步骤可以通过下述子步骤(1)～子步骤(3)实现：Specifically, this step can be implemented through the following sub-steps (1) to (3):

子步骤(1)：根据所述预设的集编号表述范围，生成集编号正则表达式；所述集编号正则表达式中定义有所述集编号表述范围中包含的集编号。Sub-step (1): According to the preset expression range of the episode number, a regular expression of the episode number is generated; the episode number regular expression is defined with the episode number included in the expression range of the episode number.

In this sub-step, the differently expressed episode numbers contained in the episode-number expression range are used as parameters of the regular expression to generate a group of characters that describes the features of a string. This group of characters expresses a filtering logic and constitutes the episode-number regular expression. Using the different expressions as parameters ensures that the generated regular expression covers episode numbers written in several different forms.

Sub-step (2): Perform regular-expression matching on the script to be processed using the episode-number regular expression, and determine the episode number of each episode contained in the script and the position of each episode number.

Here, regular-expression matching means matching and filtering the content of the script to be processed with a regular expression. For example, the characters in the script are matched against the characters defined in the regular expression, and the matching characters are filtered out, which yields the episode number and position of each episode contained in the script.

Sub-step (3): For any episode number, take the text between the position of that episode number and the position of the next episode number as the episode denoted by that episode number, and split it off, thereby obtaining the multiple episodes contained in the script to be processed.

Because each episode number usually marks the start of an episode, this sub-step first determines the text between the position of an episode number and the position of the next episode number, then treats that text as one episode and splits it off, yielding the multiple episodes contained in the script to be processed. The next episode number is the episode number that immediately follows the current one in the writing order of the script. In this embodiment, an episode-number regular expression is first generated from the preset expression range and then used to determine the episode numbers and perform the split. Because the regular expression completes matching and filtering quickly and covers the several forms of episode number contained in the expression range, it can improve the efficiency and accuracy of episode-number determination to some extent, and in turn the efficiency of episode division. Note that when supplementary episode information exists, it may be extracted during processing and split into episodes as well, so that nothing is omitted. Supplementary episode information refers to episodes added to the script to be processed at a later time.
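Sub-steps (1) to (3) can be sketched as follows. The entries of `EPISODE_FORMS` are hypothetical examples of an expression range, the function names are illustrative, and only Arabic-numeral headings at the start of a line are handled; Chinese numerals such as 第一集 would need an extra conversion step.

```python
import re

# Hypothetical episode-number expression range: several ways a heading may be written.
EPISODE_FORMS = [r"第\s*(\d+)\s*集", r"EP\s*(\d+)", r"Episode\s+(\d+)"]


def build_number_regex(forms):
    # Sub-step (1): join the alternative forms into one expression that is
    # anchored at the start of a line.
    return re.compile(r"^[ \t]*(?:" + "|".join(forms) + r")",
                      re.MULTILINE | re.IGNORECASE)


def split_by_number(text, pattern):
    # Sub-step (2): find every number and its position.
    matches = list(pattern.finditer(text))
    parts = []
    for i, m in enumerate(matches):
        # Sub-step (3): a part runs from this number to the next one (or to the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        number = int(next(g for g in m.groups() if g is not None))
        parts.append((number, text[m.start():end]))
    return parts
```

Text that precedes the first matched heading is not returned by this sketch. The same two functions could be reused for scene numbers by passing a different list of forms.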

Step 203: For at least one of the episodes, determine the scene numbers contained in the episode and the position of each scene number according to a preset scene-number expression range, and split the episode into multiple scene texts according to the scene numbers and their positions.

Specifically, this step can be implemented as follows. A scene-number regular expression is generated according to the preset scene-number expression range; the regular expression defines the scene numbers contained in that range. The scene-number regular expression is then used to perform regular-expression matching on the episode, determining the scene number of each scene contained in the episode and the position of each scene number. For any scene number, the text between its position and the position of the next scene number is taken as the scene text denoted by that scene number and split off, yielding the multiple scene texts contained in the episode. For the implementation details of each operation, refer to the description of step 202 above; the embodiments of the present invention place no limitation on this.

Further, in this embodiment the scene-number regular expression is generated from the preset scene-number expression range and then used to determine the scene numbers and perform the split. Because the regular expression completes matching and filtering quickly and covers the several forms of scene number contained in the expression range, it can improve the efficiency and accuracy of scene-number determination to some extent, and in turn the efficiency of scene-text division. Moreover, because the episode-number regular expression is generated from several different forms of episode number, and the scene-number regular expression from several different forms of scene number, matching with them amounts to fuzzy matching: a match is declared as soon as any one of the forms is satisfied. Compared with exact matching, in which every condition must be satisfied before a match is declared, this broadens, to some extent, the range of script formats that can be handled. Note that when supplementary scene information exists, it may be extracted during processing and split into scenes as well, so that nothing is omitted. Supplementary scene information refers to scene text added to the script to be processed at a later time.

Further, because body text may itself begin with a digit, a scene number determined in this way may be a genuine number denoting a scene, or merely the digits at the start of a line of text. Accordingly, in this embodiment a secondary determination may be performed after the scene texts have been split, to ensure that the split is correct. The secondary determination may proceed as follows: determine the text length of the scene text, and determine whether the scene number of the scene text is consecutive with the scene numbers of the adjacent scene texts. To determine the text length, count the characters contained in the scene text and take that count as the text length. To determine whether the numbers are consecutive, compute the absolute value of the difference between the scene number of the scene text and that of the preceding adjacent scene text, and likewise for the following adjacent scene text. If the absolute value equals 1, the numbers are considered consecutive; otherwise they are considered non-consecutive.

If the text length is less than a preset length threshold and the scene number of the scene text is not consecutive with the scene numbers of the adjacent scene texts, the scene text is removed. The preset length threshold may be set according to the minimum number of characters that a scene text denoting a scene would contain. A text length below the threshold suggests that the scene text may not be a genuine scene text; a scene number that is also non-consecutive with its neighbours strengthens that suspicion. Therefore, when the text length is below the preset length threshold and the numbering is non-consecutive, the scene text is judged not to be a genuine scene text and is removed, to ensure accuracy. Removing a scene text means dropping it from the scene texts obtained by the split; specifically, it may be reclassified as body text. In this embodiment, performing a secondary determination after the scene texts are split improves their accuracy and thereby safeguards the results of subsequent processing.

To improve the accuracy of the secondary determination, further conditions may be imposed on the scene text. For example, it may additionally be determined whether the scene text contains scene information characters, and the scene text is removed only when it contains no scene information characters, its text length is below the preset length threshold, and its numbering is non-consecutive. This helps, to some extent, to avoid removing genuine scene texts by mistake and improves the accuracy of the secondary determination.
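The secondary determination can be sketched as follows. The `Scene` type, the function name, and the parameters are hypothetical. Two points the text leaves open are resolved here by assumption: a scene counts as consecutive if its number differs by exactly 1 from either neighbour, and a removed scene is reclassified as body text by appending it to the preceding scene.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    number: int
    text: str


def secondary_check(scenes, min_length, info_words=()):
    kept = []
    for i, scene in enumerate(scenes):
        # Scene numbers of the adjacent scenes in the original split.
        neighbours = [scenes[j].number for j in (i - 1, i + 1)
                      if 0 <= j < len(scenes)]
        consecutive = any(abs(scene.number - n) == 1 for n in neighbours)
        # Optional extra condition: does the text contain a scene information word?
        has_info = any(word in scene.text for word in info_words)
        spurious = (len(scene.text) < min_length
                    and not consecutive and not has_info)
        if spurious and kept:
            # Reclassify as body text of the preceding scene.
            kept[-1] = Scene(kept[-1].number, kept[-1].text + scene.text)
        else:
            kept.append(scene)
    return kept
```

With `info_words` left empty the function applies only the length and continuity conditions; passing scene information words adds the stricter third condition.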

Correspondingly, in this embodiment a secondary determination may also be performed on the episodes to improve the accuracy of the episode division. It may proceed as follows: determine the text length of the episode, and determine whether the episode number of the episode is consecutive with the episode numbers of the adjacent episodes. If the text length is less than a preset length threshold and the episode number is not consecutive with those of the adjacent episodes, the episode is removed. For the implementation of each step, refer to the related description above; the embodiments of the present invention place no limitation on this. Performing a secondary determination after the episodes are divided improves their accuracy and thereby safeguards the results of subsequent processing.

步骤204、对于至少一个所述场景文本，提取所述场景文本中包含的场景信息字符。Step 204: For at least one of the scene texts, extract scene information characters contained in the scene texts.

具体的，本步骤可以通过下述子步骤(4)～子步骤(6)实现：Specifically, this step can be implemented through the following sub-steps (4) to (6):

子步骤(4)：根据预设的场景信息提示词范围值，对所述场景文本进行遍历，以确定所述场景文本中是否包含场景信息提示词。Sub-step (4): According to the preset scene information prompt word range value, traverse the scene text to determine whether the scene information prompt word is included in the scene text.

In this sub-step, a scene information prompt word is a word indicating that the text following it is a scene information character. For example, some scripts place a prompt word before information such as time, place, and characters. Suppose the content of script A is “1-23本场天气：下雨发生地点:学校操场出场人物：全体老师” ("1-23 Weather in this scene: rain; Location: school playground; Characters present: all teachers"). Here “本场天气” (weather in this scene), “发生地点” (location), and “出场人物” (characters present) are scene information prompt words. The preset scene information prompt word range value may contain a variety of prompt words, which may be collected in advance; for example, prompt words may be extracted from sample scripts, or commonly used prompt words may be gathered from the Internet, to serve as the preset scene information prompt words.

Further, in making this determination, each scene information prompt word in the range value may be compared with the words in the scene text. If the scene text contains an identical word, it is determined to contain a scene information prompt word; if not, it is determined not to contain one. Traversing the entire scene text in this way gives full coverage and improves detection accuracy to some extent. Note that when a script is written, the explanatory information, that is, the text conveying the scene information, is usually placed at the front of the scene text. Therefore, in this embodiment it is also possible to traverse only the beginning of the scene text to determine whether it contains a scene information prompt word, which reduces the amount of text to traverse and saves processing resources.

子步骤(5)：若所述场景文本中包含场景信息提示词，将与所述场景信息提示词相邻的字符确定为场景信息字符，并进行提取。Sub-step (5): if the scene text contains scene information prompt words, determine the characters adjacent to the scene information prompt words as scene information characters, and extract them.

Because a scene information prompt word is usually followed immediately by characters representing scene information, once the scene text is determined to contain a scene information prompt word, the characters adjacent to that prompt word can be extracted directly as scene information characters.
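Sub-steps (4) and (5) can be sketched as follows, using the script A example. `PROMPT_WORDS` stands in for the preset prompt word range value and is an assumption, as is the rule that the text adjacent to a prompt word runs up to the next prompt word or the end of the line.

```python
import re

# Hypothetical scene information prompt word range value.
PROMPT_WORDS = ["本场天气", "发生地点", "出场人物"]


def extract_by_prompt_words(scene_text, prompt_words=PROMPT_WORDS):
    # Sub-step (4): locate every prompt word that occurs in the scene text.
    pattern = re.compile("|".join(map(re.escape, prompt_words)))
    hits = list(pattern.finditer(scene_text))
    info = {}
    for i, m in enumerate(hits):
        # Sub-step (5): take the text after the prompt word, up to the next
        # prompt word or the end of the current line.
        end = hits[i + 1].start() if i + 1 < len(hits) else len(scene_text)
        value = scene_text[m.end():end].split("\n", 1)[0]
        info[m.group()] = value.strip(" \t:：")
    return info
```

An empty result indicates that the scene text contains no prompt word, in which case processing would fall through to sub-step (6).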

子步骤(6)：若所述场景文本中不包含场景信息提示词，将所述场景文本划分为多个子文本；根据预设的场景信息字符范围值和/或所述子文本中词语的词性，从所述子文本中提取场景信息字符。Sub-step (6): if the scene text does not contain scene information prompt words, divide the scene text into multiple sub-texts; according to the preset scene information character range value and/or the part of speech of the words in the sub-text , extracting scene information characters from the sub-text.

本步骤中，在划分子文本时，可以按照固定字数，将该场景文本等分为多个子文本，其中，一个子文本即为一个待处理的候选字符串。或者，也可以在出现特定符号时，执行一次划分操作，进而得到多个子文本。其中，该特定符号可以是换行符、tab制表符、空格、逗号等，本发明实施例对此不作限定。In this step, when sub-text is divided, the scene text may be equally divided into multiple sub-texts according to a fixed number of characters, wherein one sub-text is a candidate character string to be processed. Alternatively, when a specific symbol appears, a division operation can be performed to obtain multiple subtexts. The specific symbol may be a newline, a tab, a space, a comma, or the like, which is not limited in this embodiment of the present invention.

Further, the scene information character range value may be a set of the characters used to express scene information. The preset range value may be extracted in advance from sample scripts or collected in advance from the Internet. It may include commonly used characters representing time, place, weather, and person names. These four kinds of characters may each belong to their own range value, with the characters of those range values together forming the scene information character range value. The range values for time and place may store words that express an approximate time or an approximate place; scene information characters extracted on the basis of these range values may accordingly be called approximate-time words and approximate-place words. Alternatively, all of the characters may be mixed together in a single scene information character range value. The various range values, mapping relationships, sets, and regular expressions mentioned in the embodiments of the present invention may be read from the device in advance, before they are used.

Further, the characters representing person names may be extracted from the dialogue portions of the script to be processed according to a preset character-dialogue format, which may be set in advance based on the features of character dialogue in scripts. For example, character dialogue in a script usually includes the character's name, so extracting names from the dialogue portions by way of the dialogue format helps ensure, to some extent, that the extracted names are accurate. Specifically, the content of the script is first compared against the preset character-dialogue format, and content whose format matches is determined to be a dialogue portion. The word at a specific position in the dialogue portion is then extracted as a person name. The specific position may be immediately before a punctuation combination, for example a colon followed by a double quotation mark.
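Name extraction from dialogue can be sketched as follows. The pattern is one possible encoding of the dialogue format, assumed here to be a name at the start of a line followed by a colon and an opening double quotation mark; the 10-character cap on name length is likewise an assumption.

```python
import re

# Assumed dialogue format: <name><colon><opening double quote> at the start of a line.
DIALOGUE = re.compile(r'^[ \t]*([^\s:：“"]{1,10})[:：][“"]', re.MULTILINE)


def extract_names(script):
    # Collect each distinct name once, in order of first appearance.
    names = []
    for m in DIALOGUE.finditer(script):
        if m.group(1) not in names:
            names.append(m.group(1))
    return names
```

The resulting list could then serve as the person-name part of the scene information character range value.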

Further, when extracting scene information characters from a sub-text according to the part of speech of its words, the part of speech of each word contained in the sub-text is determined first. Words that have a preset part of speech and contain a specific character are then determined to be scene information characters and extracted. To determine parts of speech, the sub-text may be segmented into words and the part of speech of each word looked up from the network. The preset part of speech and the specific characters may be set as needed. For example, the preset part of speech may be noun, and the specific characters may be characters that mark a place, such as 厅 (hall), 楼 (building), 路 (road), and 屋 (house), or characters that mark people, such as 们 (plural suffix) and 若干 (several). Because this approach relies only on parts of speech and specific characters, only a small collection of single characters has to be gathered in advance, so the implementation cost is low.

Further, when extracting scene information characters from a sub-text according to the preset scene information character range value, the sub-text is first traversed to determine whether it contains any character that exists in the range value; if it does, that character is determined to be a scene information character and extracted. Traversing the sub-text in this way means that when the whole sub-text represents a single scene information character, that character is extracted directly, and when the sub-text consists of several scene information characters, it is split accurately into each of them. For example, when a sub-text consists of a scene information character representing time and one representing place, two scene information characters are extracted. Likewise, when a sub-text consists of two scene information characters representing places, one a specific place and one an approximate place, so that the place is expressed by more than one word, extraction yields two scene information characters.

Specifically, each character in the scene information character range value may be compared with the characters in the sub-text; a character that matches is regarded as a scene information character and extracted. Further, after a character has been determined to be a scene information character and extracted, if it represents a person's name, the characters adjacent to it may also be determined to represent a person's name and extracted. The reason is that when time, place, and characters appear on the same line of a script without prompt words, the characters are usually listed last. So if an extracted scene information character is a person's name, the characters after it can be taken to be names as well, and extraction can continue. New scene information characters are thus obtained quickly without any other operation, which improves extraction efficiency to some extent.
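The sub-text division and range-value traversal of sub-step (6) can be sketched as follows. The contents of `RANGE_VALUES`, its category keys, and the longest-match-first scanning strategy are assumptions; the part-of-speech route and the rule for names that follow a name are not covered.

```python
import re

# Hypothetical scene information character range values, grouped by category.
RANGE_VALUES = {
    "time": ["日", "夜", "清晨", "黄昏"],
    "side": ["内", "外"],
    "location": ["学校操场", "教室"],
}


def split_sub_texts(line):
    # Divide on newline, tab, space, and comma, one of the options in the text.
    return [s for s in re.split(r"[\n\t ,，]+", line) if s]


def extract_from_sub_text(sub_text, range_values=RANGE_VALUES):
    # Try longer entries first so that a long entry is not shadowed by a
    # shorter one starting at the same position.
    entries = sorted(((w, c) for c, ws in range_values.items() for w in ws),
                     key=lambda e: -len(e[0]))
    found, i = [], 0
    while i < len(sub_text):
        for word, category in entries:
            if sub_text.startswith(word, i):
                found.append((category, word))
                i += len(word)
                break
        else:
            i += 1  # no entry starts here; move on one character
    return found
```

Single-character entries such as 日 can also match inside unrelated words, so in practice this scan would likely be limited to the header portion of a scene text.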

In this embodiment, whether the scene text contains a scene information prompt word is first determined according to the preset scene information prompt word range value. When it does, the scene information characters can be determined by direct extraction, which avoids unnecessary further operations and improves processing efficiency. When it does not, the scene information characters are determined according to the scene information character range value and the part of speech of the words in the sub-texts. Compared with relying on the prompt word range value alone, this helps prevent scene information characters from being missed, so that they are extracted completely and accurately. Of course, in another optional embodiment of the present invention, extraction may be based only on scene information prompt words, or only on the scene information character range value and/or the part of speech of the words in the sub-texts of the scene text; alternatively, some scene texts may be processed by prompt-word extraction and others by extraction based on the range value and/or part of speech.

For example, Fig. 2-3 is a schematic diagram of a processing flow provided by an embodiment of the present invention. As shown in Fig. 2-3, the preprocessed script is split into episodes and then into scenes using the episode-number regular expression and the scene-number regular expression, yielding multiple scene texts. Next, it is determined whether each scene text contains a scene information prompt word. If it does, extraction is performed according to the prompt word; if it does not, extraction is performed according to the preset scene information character range value, part of speech, and specific characters. Finally, the scene information is obtained.

Step 205: Determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the to-be-arranged information of the scene text.

具体的，本步骤的实现方式可以参照前述步骤104，本发明实施例对此不作限定。Specifically, for an implementation manner of this step, reference may be made to the foregoing step 104, which is not limited in this embodiment of the present invention.

Step 206: Combine the to-be-arranged information of the scene text and the body text of the scene text in a preset form to produce the target script.

具体的，本步骤可以通过下述子步骤(7)～子步骤(8)实现：Specifically, this step can be implemented through the following sub-steps (7) to (8):

Sub-step (7): According to the information category to which each item of to-be-arranged information belongs, set the corresponding information category identifier for that item, and set a body text identifier for the body text.

In this sub-step, the information categories may include episode number, scene number, time, place, characters, weather, and so on. The information category identifiers and the body text identifier may be set in advance according to actual needs; the embodiments of the present invention place no limitation on this. For example, the identifier for the episode number may be "episode_id", for the scene number "setting_id", for time "time", for place "side", "location", and so on, and for weather "weather". The body text identifier may be "content".

Specifically, to set the identifiers, the information category identifier is used as the key and the scene information character as the value, forming a key-value pair that represents the scene information character with its category identifier set. Likewise, the body text identifier is used as the key and the body text as the value, forming a key-value pair that represents the body text with its identifier set. Note that in this embodiment the key-value pairs for the scene information characters may also be stored in a preset storage area, so that to-be-arranged information of different categories is stored in key-value form.

Sub-step (8): Combine the to-be-arranged information, for which information category identifiers have been set, with the body text, for which the body text identifier has been set.

In this sub-step, the combination may follow a preset order, which may be set in advance according to actual requirements. For example, the preset order may be: episode number, scene number, time, place, weather, body text.

示例的，图2-4是本发明实施例提供的一种处理示意图，如图2-4所示，原始的待处理剧本的内容为：By way of example, FIG. 2-4 is a schematic diagram of a processing provided by an embodiment of the present invention. As shown in FIG. 2-4, the content of the original script to be processed is:

"1-44.政治保卫总署大门外日外

余则成有些紧张，深出一口气走了进去。"

(In English: "1-44. Gate of the General Administration of Political Defense, exterior, day. Yu Zecheng was a little nervous, took a deep breath, and walked in.")

For this script, preprocessing may be performed first, followed by scene information extraction. Both operations may be implemented with a business dictionary and regular expressions. The business dictionary may consist of the various range values involved in step 201 above, and the regular expressions may be those involved in the preceding steps. Body text normalization, that is, this step, may then be performed, finally yielding a script in a normalized format. For example, the content of the resulting script may be: {'episode_id': 1, 'setting_id': 44, 'time': '日', 'side': '外', 'location': '政治保卫总署大门', 'weather': '', 'content': '余则成有些紧张，深出一口气走了进去。'} (that is, time "day", side "exterior", location "gate of the General Administration of Political Defense", an empty weather field, and the body text "Yu Zecheng was a little nervous, took a deep breath, and walked in.").

Further, to improve the quality of the body text, this embodiment of the present invention may also process the body text as follows: detect whether the body text contains a line whose character count is below the preset line capacity and whose end carries no ending punctuation mark. The preset line capacity may be the maximum number of characters that one line of the script to be processed can hold, and an ending punctuation mark is one that marks the end of a sentence, such as a period. If such a line exists, and the line contains fewer characters than a first preset character-count threshold and/or the first word of the next adjacent line is not a person's name, the next adjacent line is merged with the line. Specifically, if a line has not reached the preset line capacity and does not end with an ending punctuation mark, an erroneous line break may have occurred there. To improve detection accuracy, it may further be determined whether the line contains fewer characters than the first preset character-count threshold and whether the first word of the next adjacent line is a person's name. If the line is shorter than the first threshold and/or the first word of the next adjacent line is not a person's name, it may further be concluded that an erroneous line break occurred at the end of the line. The next adjacent line can therefore be merged with the line, repairing the line break. Merging may consist of joining the head of the next adjacent line to the tail of the line.
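The line-break repair described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, "and/or" is read here as a logical "or", the set of ending punctuation marks is an example, and "the first word is a person's name" is approximated by a prefix match against a caller-supplied set of names rather than by real word segmentation.

```python
# Illustrative sketch only; all names and the END_PUNCT set are assumptions.
END_PUNCT = set("。！？!?")  # example ending punctuation marks


def is_suspicious(line: str, max_line_chars: int) -> bool:
    """A non-empty line below the line capacity that lacks ending punctuation."""
    return 0 < len(line) < max_line_chars and line[-1] not in END_PUNCT


def starts_with_name(line: str, person_names: set) -> bool:
    """Approximate 'first word is a person name' with a prefix match."""
    return any(line.startswith(name) for name in person_names)


def repair_line_breaks(lines, max_line_chars, short_threshold, person_names):
    """Merge a line into the previous one where an erroneous break is likely."""
    repaired = []
    for line in lines:
        if repaired:
            prev = repaired[-1]
            if is_suspicious(prev, max_line_chars) and (
                len(prev) < short_threshold
                or not starts_with_name(line, person_names)
            ):
                # Join the head of this line to the tail of the previous line.
                repaired[-1] = prev + line
                continue
        repaired.append(line)
    return repaired


fixed = repair_line_breaks(
    ["余则成有些紧张，深出一口气", "走了进去。", "余则成：你好。"],
    max_line_chars=20,
    short_threshold=5,
    person_names={"余则成"},
)
```

Here the first two lines are merged because the first one is short of the line capacity, has no ending punctuation, and the following line does not begin with a known name; the third line stays separate because the merged line already ends with a period.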

Further, paragraphs in the body text whose character count exceeds a second preset character-count threshold may also be detected. For each line in such a paragraph, the ending punctuation marks contained in the line are determined, and the text after each ending punctuation mark is moved to the next line. The second preset character-count threshold may be set according to the maximum number of characters in a paragraph of the script. If a paragraph contains more characters than this threshold, it may contain places where a line break is needed but missing. The ending punctuation marks in each line of the paragraph can therefore be determined and the text after them moved to the next line, thereby breaking long sentences. Moving the text after an ending punctuation mark to the next line may be implemented by adding a newline character after the ending punctuation mark. In this embodiment of the present invention, line-break repair and long-sentence line breaking make the body text more standardized, which facilitates later processing of the script.
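The long-sentence line breaking described above can be sketched with a regular expression substitution. The names `break_long_paragraph` and `max_paragraph_chars` and the example set of ending punctuation marks are assumptions made for illustration; this simple pattern also does not treat closing quotation marks specially.

```python
import re

# Illustrative sketch only; names and the punctuation set are assumptions.
END_PUNCT = "。！？!?"  # example ending punctuation marks

# A run of ending marks that is followed by more text on the same line.
_SPLIT_RE = re.compile(rf"([{END_PUNCT}]+)(?=[^{END_PUNCT}\n])")


def break_long_paragraph(paragraph: str, max_paragraph_chars: int) -> str:
    """Add a newline after each ending mark in an over-long paragraph."""
    if len(paragraph) <= max_paragraph_chars:
        return paragraph  # short enough, leave untouched
    lines = paragraph.split("\n")
    return "\n".join(_SPLIT_RE.sub(r"\1\n", line) for line in lines)


broken = break_long_paragraph("他来了。她走了！天黑了。", max_paragraph_chars=5)
```

The lookahead keeps the substitution from adding a trailing newline after the final punctuation mark of a line, so existing line ends are not doubled.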

For example, FIG. 2-5 is a schematic diagram of the composition of a scene text provided by an embodiment of the present invention. As shown in FIG. 2-5, the scene text may include scene information characters representing the time, place, characters, and weather, together with the body text. The body text undergoes body format normalization, that is, line-break repair and long-sentence line breaking, to yield normalized body text.

In summary, the script processing method provided by this embodiment of the present invention first preprocesses the script to be processed, reducing interference factors in the script and thereby improving subsequent processing to some extent. Next, according to the preset episode number expression range, the method determines the episode numbers contained in the script and their positions, and divides the script into multiple episodes accordingly. According to the preset scene number expression range, it determines the scene numbers contained in an episode and their positions, and divides the episode into multiple scene texts accordingly. For at least one scene text, it extracts the scene information characters contained in the scene text; determines the scene information characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the to-be-arranged information of the scene text; and combines the to-be-arranged information and the body text of the scene text according to a preset form to form the target script. Because the script is first divided into scene texts and extraction operates on a single scene text at a time, coupling within the script is reduced to some extent, which reduces the interference of script formatting with scene information extraction and improves extraction accuracy. In addition, once the scene information has been extracted, each scene text is recombined according to the preset form, so that all scene texts in the script share a consistent internal form, which facilitates subsequent processing of the script.

FIG. 3 is a block diagram of a script processing apparatus provided by an embodiment of the present invention. The apparatus can be applied to an electronic device. As shown in FIG. 3, the apparatus 30 may include:

A first determining module 301, configured to determine, according to a preset episode number expression range, the episode numbers contained in the script to be processed and the positions of the episode numbers, and to divide the script to be processed into multiple episodes according to the contained episode numbers and their positions.

A second determining module 302, configured to, for at least one of the episodes, determine, according to a preset scene number expression range, the scene numbers contained in the episode and the positions of the scene numbers, and to divide the episode into multiple scene texts according to the scene numbers and their positions.

An extraction module 303, configured to, for at least one of the scene texts, extract the scene information characters contained in the scene text.

A third determining module 304, configured to determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the to-be-arranged information of the scene text.

A combining module 305, configured to combine the to-be-arranged information of the scene text and the body text in the scene text according to a preset form to form a target script.

Optionally, the extraction module 303 is specifically configured to:

Traverse the scene text according to preset scene information prompt word range values to determine whether the scene text contains a scene information prompt word.

If the scene text contains a scene information prompt word, determine the characters adjacent to the scene information prompt word as scene information characters and extract them.

If the scene text does not contain a scene information prompt word, divide the scene text into multiple sub-texts, and extract scene information characters from the sub-texts according to preset scene information character range values and/or the parts of speech of the words in the sub-texts.

Optionally, the extraction module 303 is further specifically configured to:

Determine the parts of speech of the words contained in the sub-text, determine words whose part of speech is a preset part of speech and which contain a specific character as scene information characters, and extract them.

And/or, traverse the sub-text according to the preset scene information character range values to determine whether the sub-text contains characters that exist in the preset scene information character range values; if so, determine those characters as scene information characters and extract them.

The scene information character range values include at least one of the following: commonly used characters representing time, characters representing place, characters representing weather, and characters representing persons' names.
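The extraction flow above (prompt words first, range values as a fallback) can be sketched in Python as follows. This is a minimal sketch under assumptions: the prompt words, range values, category names, and function names are hypothetical examples, "adjacent characters" is approximated as the token following a prompt word and a colon, and the part-of-speech branch is omitted because it would require a word segmentation and tagging tool that the text does not name.

```python
import re

# Illustrative sketch only; every table and name below is a hypothetical example.
PROMPT_WORDS = {"时间": "time", "地点": "location", "人物": "person", "天气": "weather"}
RANGE_VALUES = {
    "time": {"日", "夜", "晨", "黄昏"},
    "weather": {"晴", "雨", "雪"},
}
_SEPARATORS = r"[\s，,。；;]+"


def extract_by_prompt_words(scene_text: str) -> dict:
    """Take the token right after each prompt word as scene information."""
    found = {}
    for word, category in PROMPT_WORDS.items():
        pattern = re.escape(word) + r"\s*[:：]\s*([^\s，,。；;]+)"
        match = re.search(pattern, scene_text)
        if match:
            found[category] = match.group(1)
    return found


def extract_by_range_values(scene_text: str) -> dict:
    """Split into sub-texts and keep those listed in the range values."""
    sub_texts = [s for s in re.split(_SEPARATORS, scene_text) if s]
    found = {}
    for sub_text in sub_texts:
        for category, values in RANGE_VALUES.items():
            if sub_text in values and category not in found:
                found[category] = sub_text
    return found


def extract_scene_info(scene_text: str) -> dict:
    """Use prompt words when present, otherwise fall back to range values."""
    found = extract_by_prompt_words(scene_text)
    return found if found else extract_by_range_values(scene_text)


with_prompts = extract_scene_info("时间：夜 地点：书房")
without_prompts = extract_scene_info("书房 夜 雨")
```

In the second call no prompt word is present, so only the sub-texts that appear in the range values ("夜" and "雨") are recognized; "书房" would need a place list or the part-of-speech branch.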

Optionally, the first determining module 301 is specifically configured to:

Generate an episode number regular expression according to the preset episode number expression range, the episode number regular expression defining the episode numbers contained in the episode number expression range.

Perform regular expression matching on the script to be processed using the episode number regular expression, and determine the episode number of each episode contained in the script to be processed and the position of the episode number.

Optionally, the second determining module 302 is specifically configured to:

Generate a scene number regular expression according to the preset scene number expression range, the scene number regular expression defining the scene numbers contained in the scene number expression range.

Perform regular expression matching on the episode using the scene number regular expression, and determine the scene number of each scene contained in the episode and the position of the scene number.
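The regular-expression-based splitting performed by the first and second determining modules can be sketched as follows. This is an illustrative sketch only: the expression forms in `EPISODE_FORMS` and `SCENE_FORMS`, the helper names, and the sample script are assumptions, chosen so that a scene heading such as "1-44." from the earlier example would match.

```python
import re

# Illustrative sketch only; the forms below are hypothetical expression ranges.
EPISODE_FORMS = [r"^第(\d+)集", r"^EP(\d+)"]
SCENE_FORMS = [r"^\d+-(\d+)\.", r"^第(\d+)场"]


def build_number_regex(forms):
    """Combine the allowed number expressions into one multi-line regex."""
    return re.compile("|".join(f"(?:{form})" for form in forms), re.MULTILINE)


def find_numbers(text, regex):
    """Return (number, position) for every heading that matches the regex."""
    hits = []
    for match in regex.finditer(text):
        number = next(g for g in match.groups() if g is not None)
        hits.append((int(number), match.start()))
    return hits


def split_at(text, hits):
    """Cut the text at each heading position, keeping the number with its part."""
    parts = []
    for index, (number, start) in enumerate(hits):
        end = hits[index + 1][1] if index + 1 < len(hits) else len(text)
        parts.append((number, text[start:end]))
    return parts


EPISODE_RE = build_number_regex(EPISODE_FORMS)
SCENE_RE = build_number_regex(SCENE_FORMS)

script = "第1集\n1-1.书房 夜\n正文一\n1-2.大门 日\n正文二\n第2集\n2-1.街道 日\n正文三\n"
episodes = split_at(script, find_numbers(script, EPISODE_RE))
first_episode_scenes = split_at(episodes[0][1], find_numbers(episodes[0][1], SCENE_RE))
```

The same two helpers serve both levels: the script is first cut at episode headings, and each episode is then cut at scene headings.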

Optionally, the apparatus 30 further includes:

A preprocessing module, configured to perform a preprocessing operation on the script to be processed.

The preprocessing operation includes at least one of the following operations:

Deleting interference information from the script to be processed.

Converting the fonts in the script to be processed into a preset font.

Converting, according to a preset symbol mapping relationship, the punctuation marks in the script to be processed into the punctuation marks to which they correspond in the symbol mapping relationship.

Deleting, according to preset available punctuation mark range values, the punctuation marks in the script to be processed that do not belong to the available punctuation mark range values.
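The two punctuation-related preprocessing operations can be sketched in Python as follows. The mapping table, the allowed set, and the function name are hypothetical examples, and treating "punctuation" as any character whose Unicode general category starts with "P" is an assumption of this sketch. Interference removal and font conversion are not covered here.

```python
import unicodedata

# Illustrative sketch only; both tables are hypothetical examples.
SYMBOL_MAP = {",": "，", ":": "：", "?": "？", "!": "！", ";": "；"}
ALLOWED_PUNCT = set("，。：？！；、“”‘’（）《》-.")

_TRANSLATION = str.maketrans(SYMBOL_MAP)


def normalize_punctuation(text: str) -> str:
    """Map punctuation to its preset counterpart, then drop unlisted marks."""
    mapped = text.translate(_TRANSLATION)
    return "".join(
        ch
        for ch in mapped
        # Keep non-punctuation as is; keep punctuation only if it is allowed.
        if not unicodedata.category(ch).startswith("P") or ch in ALLOWED_PUNCT
    )


cleaned = normalize_punctuation("你好,世界!#")
```

Mapping runs before deletion so that a mark which has an allowed counterpart is converted rather than dropped.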

Optionally, the combining module 305 is specifically configured to:

Set, for the to-be-arranged information, the information category identifier corresponding to the information category to which it belongs, and set a body text identifier for the body text.

Combine the to-be-arranged information tagged with information category identifiers with the body text tagged with the body text identifier.

In summary, in the script processing apparatus provided by this embodiment of the present invention, the first determining module can determine, according to the preset episode number expression range, the episode numbers contained in the script to be processed and their positions, and divide the script into multiple episodes accordingly. The second determining module can determine, according to the preset scene number expression range, the scene numbers contained in an episode and their positions, and divide the episode into multiple scene texts accordingly. The extraction module can, for at least one scene text, extract the scene information characters contained in the scene text. The third determining module can determine the scene information characters, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as the to-be-arranged information of the scene text. The combining module can combine the to-be-arranged information and the body text of the scene text according to a preset form to form the target script. Because the script is first divided into scene texts and extraction operates on a single scene text at a time, coupling within the script is reduced to some extent, which reduces the interference of script formatting with scene information extraction and improves extraction accuracy. In addition, once the scene information has been extracted, each scene text is recombined according to the preset form, so that all scene texts in the script share a consistent internal form, which facilitates subsequent processing of the script.

Because the above apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding parts of the method embodiment.

An embodiment of the present invention further provides an electronic device. As shown in FIG. 4, it includes a processor 401, a communication interface 402, a memory 403, and a communication bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404;

the memory 403 is configured to store a computer program;

the processor 401 is configured to implement the following steps when executing the program stored in the memory 403:

determining, according to a preset episode number expression range, the episode numbers contained in the script to be processed and the positions of the episode numbers, and dividing the script to be processed into multiple episodes according to the episode numbers and their positions;

The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above terminal and other devices.

The memory may include random access memory (RAM), and may also include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment provided by the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the script processing method described in any one of the above embodiments.

In yet another embodiment provided by the present invention, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to execute the script processing method described in any one of the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, wireless, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding parts of the method embodiments.

The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A script processing method applied to an electronic device, the method comprising:

determining an episode number contained in a script to be processed and a position of the episode number according to a preset episode number expression range, and dividing the script to be processed into a plurality of episodes according to the episode number and the position of the episode number;

for at least one episode, determining a scene number contained in the episode and a position of the scene number according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene number and the position of the scene number;

for at least one scene text, extracting scene information characters contained in the scene text;

determining the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as to-be-arranged information of the scene text;

and combining the to-be-arranged information of the scene text and the body text in the scene text according to a preset form to form a target script.

2. The method according to claim 1, wherein said extracting, for at least one of the scene texts, a scene information character contained in the scene text comprises:

traversing the scene text according to preset scene information prompt word range values to determine whether the scene text contains a scene information prompt word;

if the scene text contains a scene information prompt word, determining characters adjacent to the scene information prompt word as scene information characters, and extracting the characters;

if the scene text does not contain a scene information prompt word, dividing the scene text into a plurality of sub-texts; and extracting scene information characters from the sub-texts according to preset scene information character range values and/or the parts of speech of the words in the sub-texts.

3. The method according to claim 2, wherein the extracting scene information characters from the sub-text according to preset scene information character range values and/or parts of speech of words in the sub-text comprises:

determining the part of speech of the words contained in the sub-text; determining words whose part of speech is a preset part of speech and which contain a specific character as scene information characters, and extracting the words;

and/or traversing the sub-text according to a preset scene information character range value to determine whether the sub-text contains characters existing in the preset scene information character range value; if yes, determining the character as a scene information character, and extracting;

wherein the scene information character range values comprise at least one of the following: commonly used characters representing time, characters representing place, characters representing weather, and characters representing persons' names.

4. The method according to claim 1, wherein the determining the episode number contained in the script to be processed and the position of the episode number according to a preset episode number expression range comprises:

generating an episode number regular expression according to the preset episode number expression range, the episode number regular expression defining the episode numbers contained in the episode number expression range;

and performing regular expression matching on the script to be processed by using the episode number regular expression, and determining the episode number of each episode contained in the script to be processed and the position of the episode number.

5. The method according to claim 1, wherein the determining the scene number and the position of the scene number included in the episode according to a preset scene number expression range comprises:

generating a scene number regular expression according to the preset scene number expression range; scene numbers contained in the scene number expression range are defined in the scene number regular expression;

and performing regular matching on the episode by using the scene number regular expression, and determining the scene number of each scene contained in the episode and the position of the scene number.

6. The method of claim 1, further comprising:

performing a preprocessing operation on the script to be processed;

wherein the preprocessing operation comprises at least one of the following operations:

deleting interference information from the script to be processed;

converting the fonts in the script to be processed into preset fonts;

converting punctuation marks in the script to be processed into punctuation marks corresponding to the punctuation marks in the symbol mapping relation according to a preset symbol mapping relation;

and deleting punctuation marks which do not belong to the available punctuation mark range value in the script to be processed according to a preset available punctuation mark range value.

7. The method according to claim 1, wherein the combining the to-be-arranged information of the scene text and the body text in the scene text according to a preset form comprises:

setting, for the to-be-arranged information, the information category identifier corresponding to the information category to which the to-be-arranged information belongs, and setting a body text identifier for the body text;

and combining the to-be-arranged information tagged with the information category identifier with the body text tagged with the body text identifier.

8. A script processing apparatus applied to an electronic device, the apparatus comprising:

a first determining module, configured to determine an episode number contained in a script to be processed and a position of the episode number according to a preset episode number expression range, and to divide the script to be processed into a plurality of episodes according to the contained episode number and the position of the episode number;

the second determining module is used for determining scene numbers and positions of the scene numbers contained in at least one episode according to a preset scene number expression range, and dividing the episode into a plurality of scene texts according to the scene numbers and the positions of the scene numbers;

the extraction module is used for extracting scene information characters contained in the scene text for at least one scene text;

a third determining module, configured to determine the scene information characters contained in the scene text, the scene number of the scene text, and the episode number of the episode to which the scene text belongs as to-be-arranged information of the scene text;

and a combining module, configured to combine the to-be-arranged information of the scene text and the body text in the scene text according to a preset form to form a target script.

9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method of any one of claims 1 to 7 when executing a program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.