CN111027560B - Text detection method and related device - Google Patents
- Publication number: CN111027560B
- Application number: CN201911084168A
- Authority
- CN
- China
- Prior art keywords
- text
- target
- target area
- region
- candidate region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a text detection method and a related device. The text detection method comprises the following steps: acquiring an original image captured by an imaging device of a scene to be detected; detecting the original image with a text detection model to obtain candidate regions; removing, from each candidate region, the portion that exceeds the image boundary of the original image; and analyzing the trimmed candidate regions to determine target regions related to text in the scene to be detected. By means of this scheme, the accuracy of text detection can be improved.
Description
Technical Field
The present application relates to the field of information technology, and in particular to a text detection method and a related device.
Background
With the advancement of urban construction, the pressure on urban management keeps increasing. Among the various business demands of urban management, text content such as randomly posted advertisements or illegally sprayed advertisements seriously affects the appearance of the city, so inspection of such text content is one of the key tasks of urban management.
Currently, image pickup devices such as monitoring cameras are increasingly densely distributed in urban communities, commercial streets and other places, and urban monitoring systems are continuously being perfected. With this, city management is gradually being freed from heavy labor costs and its level of intelligence keeps rising; on this basis, the inspection of text content likewise depends on the constantly improving city monitoring system. In this context, how to improve the accuracy of text detection is a problem to be solved.
Disclosure of Invention
The application mainly solves the technical problem of providing a text detection method and a related device, which can improve the accuracy of text detection.
In order to solve the above problem, a first aspect of the present application provides a text detection method, including: acquiring an original image captured by an imaging device of a scene to be detected; detecting the original image with a text detection model to obtain candidate regions; removing, from each candidate region, the portion that exceeds the image boundary of the original image; and analyzing the trimmed candidate regions to determine target regions related to text in the scene to be detected.
In order to solve the above-mentioned problems, a second aspect of the present application provides a text detection device, including a memory and a processor coupled to each other; the processor is configured to execute the program instructions stored in the memory to implement the text detection method in the first aspect.
In order to solve the above-described problems, a third aspect of the present application provides a storage device storing program instructions executable by a processor for implementing the text detection method in the above-described first aspect.
According to the above scheme, an original image captured by the imaging device of the scene to be detected is obtained, the original image is detected with the text detection model to obtain candidate regions, and the portion of each candidate region that exceeds the image boundary of the original image is removed; the trimmed candidate regions are then analyzed to determine the target regions related to text in the scene to be detected. In this way, the detected candidate regions, particularly those close to the image boundary, can be corrected, an accurate basis is provided for subsequent analysis, and the accuracy of text detection is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of a text detection method of the present application;
FIG. 2 is a schematic diagram of one embodiment of an original image;
FIG. 3 is a schematic diagram of one embodiment of removing the portion of a candidate region beyond an image boundary of the original image;
FIG. 4 is a flowchart of an embodiment of step S14 in FIG. 1;
FIG. 5 is a flowchart illustrating the step S14 of FIG. 1 according to another embodiment;
FIG. 6 is a flowchart of a further embodiment of step S14 in FIG. 1;
FIG. 7 is a schematic diagram of a text detection device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a frame of another embodiment of the text detection device of the present application;
FIG. 9 is a schematic diagram of a frame of an embodiment of a storage device of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a text detection method according to the present application. Specifically, the method may include the steps of:
step S11: and acquiring an original image obtained by shooting a scene to be detected by the image pickup device.
In this embodiment, the scene to be detected may be set according to a specific application scene, for example, for a shop application scene, the scene to be detected may be a shop window of the shop; for street application scenes, the scene to be detected may be a wall of a street; for road application scenarios, the scenario to be detected may be a telegraph pole or a bus stop, which is not exemplified here.
In this embodiment, the image capturing device may be set according to a specific application scenario, for example, in an outdoor application scenario, the image capturing device may be a waterproof camera; for indoor application scenes, the image pickup device may be a general network camera, and the embodiment is not particularly limited herein.
In one implementation scenario, in order to implement real-time detection of a scene to be detected, a plurality of frames of original images obtained by shooting the scene to be detected by the camera device may be obtained in real time, and specifically, the plurality of frames of original images may be obtained by using an RTSP (Real Time Streaming Protocol, real-time streaming protocol) transmission protocol. In another implementation scenario, the multi-frame original image accumulated by the image capturing device in the to-be-detected scenario during a period of time may also be obtained in an offline manner, and the embodiment is not limited herein.
Step S12: and detecting the original image by using the text detection model to obtain a candidate region.
In this embodiment, the candidate region refers to a region suspected of containing text. In one implementation scenario, to improve the accuracy of text detection, the text detection model may be a deep-learning-based text detection model, such as any of the following: a PixelLink-based text detection model, a TextBoxes-based text detection model, or a TextBoxes++-based text detection model. The deep-learning-based text detection model may also be another model, for example the SegLink model or the EAST model, which is not particularly limited here.
In a natural scene, text content is very likely to be inclined relative to the horizontal plane. In this embodiment, in order to match the direction of the target object suspected to be text and to reduce, as far as possible, the image data in the candidate region that is unrelated to the detected target object (thereby reducing interference from irrelevant image data and improving the accuracy of subsequent analysis), predictive regression may be performed on the initial region corresponding to the target object obtained by detecting the original image with the text detection model, so as to obtain a candidate region matching the direction of the target object.
In a specific implementation scenario, when the text detection model is a TextBoxes++-based text detection model, since this model itself already contains the predictive regression process described above, detecting the original image with it directly yields a candidate region matching the direction of the target object suspected to be text.
In another specific implementation scenario, when the text detection model is a model other than the TextBoxes++-based text detection model, the text detection model is used to detect the original image to obtain an initial region corresponding to a target object suspected to be text, and predictive regression is then performed on the obtained initial region to obtain a candidate region matching the direction of the target object. Specifically, referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of an original image. As shown in fig. 2, the dashed box is the detected initial region, the arrow indicates the regression direction, and the solid box is the candidate region output after predictive regression. In a specific implementation scenario, the initial region may be denoted as b_0 = (x_0, y_0, w_0, h_0), where (x_0, y_0) represents the center of the initial region, w_0 its width and h_0 its height. The initial region can equivalently be represented by the coordinates of its four corner points q_i = (x_i^0, y_i^0), i = 1, ..., 4, where the relationship between the parameters can be expressed by the following formulas (corners listed clockwise from the top-left):

x_1^0 = x_0 - w_0/2, y_1^0 = y_0 - h_0/2
x_2^0 = x_0 + w_0/2, y_2^0 = y_0 - h_0/2
x_3^0 = x_0 + w_0/2, y_3^0 = y_0 + h_0/2
x_4^0 = x_0 - w_0/2, y_4^0 = y_0 + h_0/2
in addition, when the obtained initial region is subjected to predictive regression, the text detection model can also obtain predictive regression information, and in the case of the above representation using four corner points, the predictive regression information can be expressed as (Δx, Δy, Δw, Δh, Δx 1 ,Δy 1 ,Δx 2 ,Δy 2 ,Δx 3 ,Δy 3 ,Δx 4 ,Δy 4 C), wherein c represents confidence, and the final output candidate regionThe parameters of the method can be calculated by adopting the following formula:
Besides the representation using four corner coordinates, a representation using the top-left point, the top-right point and the height of the rotated rectangle may also be adopted; in that case, the top-left point, the top-right point and the height of the rotated rectangle of the candidate region are finally calculated from the predictive regression information, thereby determining the candidate region. The specific calculation process is not repeated here.
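For illustration, the corner-point construction and the regression decoding above can be sketched in Python. This is a minimal sketch: the function name and tuple layout are assumptions, and the decode formulas follow the standard TextBoxes++-style offset parameterization described above, not a verbatim reproduction of the patent's formula images.

```python
import math

def decode_candidate(b0, deltas):
    """Decode a candidate region from an initial region and regression offsets.

    b0     -- initial region (x0, y0, w0, h0): center, width, height
    deltas -- (dx, dy, dw, dh, dx1, dy1, dx2, dy2, dx3, dy3, dx4, dy4)
    Returns (center, size, corners) of the candidate region.
    """
    x0, y0, w0, h0 = b0
    dx, dy, dw, dh = deltas[:4]
    # Corner points of the initial region, clockwise from top-left.
    corners0 = [
        (x0 - w0 / 2, y0 - h0 / 2),
        (x0 + w0 / 2, y0 - h0 / 2),
        (x0 + w0 / 2, y0 + h0 / 2),
        (x0 - w0 / 2, y0 + h0 / 2),
    ]
    # Center/size regression; the exponential keeps width and height positive.
    x = x0 + w0 * dx
    y = y0 + h0 * dy
    w = w0 * math.exp(dw)
    h = h0 * math.exp(dh)
    # Per-corner offsets, scaled by the initial region's width and height.
    corners = [
        (cx + w0 * deltas[4 + 2 * i], cy + h0 * deltas[5 + 2 * i])
        for i, (cx, cy) in enumerate(corners0)
    ]
    return (x, y), (w, h), corners
```

With all offsets zero, decoding reproduces the initial region, which is a quick sanity check on the formulas.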
Step S13: and eliminating the part exceeding the image boundary of the original image in the candidate area.
As shown in fig. 2, the detected candidate region, especially a candidate region close to the image boundary, is very prone to exceeding the image boundary. In order to provide an accurate data basis for the subsequent analysis based on the candidate region, in this embodiment the part of the candidate region that exceeds the image boundary is removed. Specifically, referring to fig. 3, fig. 3 is a schematic diagram of an embodiment of removing the portion of a candidate region beyond the image boundary of the original image. As shown in fig. 3, if the candidate region abcd exceeds the upper boundary of the original image, the intersection points of ab and bc with the upper boundary may be obtained and the excess portion cut off, leaving the remaining portion aefcd. Cases where the candidate region exceeds another boundary of the original image are handled analogously, and this embodiment gives no further examples here.
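The removal step above can be illustrated with a Sutherland–Hodgman polygon clip of the candidate quadrilateral against the image rectangle. This is a sketch under assumed conventions (origin at the top-left, names illustrative), not the patent's exact procedure; clipping a quad across the upper boundary yields a five-vertex polygon like aefcd in fig. 3.

```python
def clip_polygon(poly, w, h):
    """Clip a polygon (list of (x, y) vertices) to the rectangle [0,w] x [0,h]."""
    def clip_edge(pts, inside, intersect):
        # One Sutherland-Hodgman pass against a single half-plane.
        out = []
        for i in range(len(pts)):
            cur, nxt = pts[i], pts[(i + 1) % len(pts)]
            if inside(cur):
                out.append(cur)
                if not inside(nxt):
                    out.append(intersect(cur, nxt))
            elif inside(nxt):
                out.append(intersect(cur, nxt))
        return out

    def isect_x(p, q, x):  # intersection with the vertical line X = x
        t = (x - p[0]) / (q[0] - p[0])
        return (x, p[1] + t * (q[1] - p[1]))

    def isect_y(p, q, y):  # intersection with the horizontal line Y = y
        t = (y - p[1]) / (q[1] - p[1])
        return (p[0] + t * (q[0] - p[0]), y)

    pts = poly
    pts = clip_edge(pts, lambda p: p[0] >= 0, lambda p, q: isect_x(p, q, 0))
    pts = clip_edge(pts, lambda p: p[0] <= w, lambda p, q: isect_x(p, q, w))
    pts = clip_edge(pts, lambda p: p[1] >= 0, lambda p, q: isect_y(p, q, 0))
    pts = clip_edge(pts, lambda p: p[1] <= h, lambda p, q: isect_y(p, q, h))
    return pts
```

For a tilted quad whose lowest corner pokes above the top boundary (y < 0), the clip inserts two intersection points and drops the out-of-bounds corner, exactly the aefcd situation.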
Step S14: and analyzing the candidate areas after the elimination to determine target areas related to the text in the scene to be detected.
In this embodiment, the analysis of the trimmed candidate region may use a texture-based analysis algorithm: the image data of the trimmed candidate region is scanned at a plurality of scales, and the pixel points are then classified using text characteristics such as high-density edges, gray-level changes and waveform distribution. Alternatively, a region-based analysis algorithm may be used: pixels are organized into connected domains using properties in which they are similar (such as color), and connected domains that are unlikely to be text are then excluded using geometric or texture information. Alternatively, an analysis algorithm based on stroke width transform may be adopted: the Canny edges of the image data of the trimmed candidate region are computed first, the stroke width of the image is calculated from the direction information of the edges, pixels are grouped into connected domains according to the stroke width information, and the connected domains are filtered using geometric reasoning (such as the aspect ratio of the connected domain, and the variance, mean and median of its stroke widths), so as to determine the target region related to text in the scene to be detected.
According to the above scheme, an original image captured by the imaging device of the scene to be detected is obtained, the original image is detected with the text detection model to obtain candidate regions, and the portion of each candidate region that exceeds the image boundary of the original image is removed; the trimmed candidate regions are then analyzed to determine the target regions related to text in the scene to be detected. In this way, the detected candidate regions, particularly those close to the image boundary, can be corrected, an accurate basis is provided for subsequent analysis, and the accuracy of text detection is improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating an embodiment of step S14 in fig. 1. Specifically, the method may include the steps of:
step S41: and screening the candidate areas after the elimination by adopting a non-maximum suppression mode.
In practical applications, the number of candidate regions obtained after the original image is detected and the removal processing is performed may be greater than one. In this case, in order to obtain the candidate region that best matches the target object suspected to be text and improve the accuracy of text detection, non-maximum suppression (NMS) may be used to screen the trimmed candidate regions.
For example, suppose there are two target objects suspected to be text in the original image: after removal, target object A corresponds to candidate regions 01, 02 and 03, and target object B corresponds to candidate regions 04 and 05. During screening, the candidate region with the highest confidence, say candidate region 01, is selected from the trimmed candidate regions, and it is determined for each of candidate regions 02, 03, 04 and 05 whether its overlap ratio IoU (Intersection-over-Union) with candidate region 01 is greater than a preset threshold; if so, the corresponding candidate region is discarded, and candidate region 01 is retained and marked. Since candidate regions 02 and 03 correspond to the same target object A as candidate region 01, their overlap ratios with it are greater than the preset threshold, so candidate regions 02 and 03 are discarded and candidate region 01 is retained and marked. At this point candidate regions 04 and 05 remain; the one with the highest confidence, say candidate region 04, is selected, and it is determined whether the overlap ratio of candidate region 05 with candidate region 04 is greater than the preset threshold; if so, candidate region 05 is discarded and candidate region 04 is retained and marked. In the end, candidate region 01 corresponding to target object A and candidate region 04 corresponding to target object B are retained.
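The NMS procedure described above can be sketched as follows. This is a minimal illustration with axis-aligned boxes and illustrative names; the patent's candidate regions may be rotated quadrilaterals, for which the IoU computation is more involved.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-confidence box per cluster, suppressing overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        # Discard every remaining box that overlaps it too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

With two clusters of boxes (three around object A, two around object B), the result is one surviving index per cluster, mirroring the candidate-region-01/04 example above.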
Step S42: and taking the candidate area obtained by screening as a target area suspected to contain text.
And taking the candidate areas obtained by screening as target areas suspected to contain texts, so that the target areas corresponding to the target objects suspected to be texts can be reserved.
Different from the previous embodiment, the candidate regions after being removed are screened by adopting a non-maximum suppression mode, so that the candidate regions can be screened more accurately, and texts, particularly texts close to the image boundary, can be detected accurately.
Referring to fig. 5, fig. 5 is a flowchart illustrating another embodiment of step S14 in fig. 1. Specifically, the method may include the steps of:
step S51: and screening the candidate areas after the elimination by adopting a non-maximum suppression mode.
In detail, please refer to step S41 in the above embodiment.
Step S52: and taking the candidate area obtained by screening as a target area suspected to contain text.
In detail, please refer to step S42 in the above embodiment.
Step S53: and screening in each pixel point of the target area by adopting a maximum stable extremum area (maximally stable extremal regions, MSER) mode to obtain the target pixel point.
In the practical application process, the target area may include objects unrelated to the text, such as patterns and textures, in addition to the target object suspected to be the text, and in this embodiment, the maximum stable extremum area mode is adopted to screen each pixel point in the target area, so as to obtain a target pixel point, where the target pixel point in this embodiment is a pixel point with a high probability related to the text.
In a specific implementation scenario, before screening in each pixel point of the target area by adopting the maximum stable extremum area mode, gray processing may further be performed on the image data of the target area to obtain a gray-level image corresponding to the image data of the target area. Specifically, binarization processing may be performed on the gray-level image (gray values 0 to 255), with the threshold increased successively from 0 to 255. During this process, the area of some connected domains changes little as the threshold increases; such a region is a maximum stable extremum region. Refer to the following formula:

v_i = |Q_{i+Δ} − Q_{i−Δ}| / |Q_i|

where Q_i represents the area of the i-th connected domain and Δ represents a small threshold change.

When v_i is smaller than a preset threshold, the i-th connected domain is considered a maximum stable extremum region, and the pixel points in this connected domain are the target pixel points of this embodiment.
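The stability criterion can be sketched as follows, assuming (as an illustration) that the area of a single connected domain has been tracked across the successive binarization thresholds; the function name and data layout are not part of the patent.

```python
def stability_ratios(areas, delta):
    """Compute v_i = |Q(i+delta) - Q(i-delta)| / Q(i) for one connected domain.

    areas[i] -- area of the connected domain when the binarization threshold is i
    delta    -- the small threshold change from the formula above
    Returns a dict mapping threshold i to its stability ratio v_i.
    """
    return {
        i: abs(areas[i + delta] - areas[i - delta]) / areas[i]
        for i in range(delta, len(areas) - delta)
    }
```

A region whose area stays constant over the threshold sweep has v_i = 0 at every threshold, i.e. it is maximally stable; a region whose area doubles at every step has a large, constant ratio and would be rejected.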
Step S54: and updating the target area suspected to contain the text based on the target pixel points obtained by screening.
Specifically, the minimum circumscribed rectangle of the target pixel points obtained by screening can be computed, and this minimum circumscribed rectangle is taken as the updated target area; in this way, the pixel points unrelated to the target object are filtered out, improving the accuracy of text detection.
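A minimal sketch of the bounding-rectangle update, axis-aligned for simplicity; the "minimum circumscribed rectangle" of the patent may equally be a rotated rectangle (e.g. as computed by OpenCV's `cv2.minAreaRect`), and the function name here is illustrative.

```python
def bounding_rect(points):
    """Axis-aligned minimum bounding rectangle of the target pixel points.

    points -- iterable of (x, y) pixel coordinates
    Returns (x_min, y_min, x_max, y_max), the updated target region.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```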
Different from the foregoing embodiment, after the candidate region obtained by screening by adopting non-maximum suppression is used as the target region suspected to contain text, the maximum stable extremum region mode is further adopted to screen in each pixel point of the target region, so as to obtain the target pixel point, and the target region suspected to contain text is updated based on the screened target pixel point, so that false detection on patterns, textures and the like can be reduced, and the accuracy of text detection can be improved.
Referring to fig. 6, fig. 6 is a flowchart of another embodiment of step S14 in fig. 1. Specifically, the method may include the steps of:
step S61: and screening the candidate areas after the elimination by adopting a non-maximum suppression mode.
In detail, please refer to step S41 in the above embodiment.
Step S62: and taking the candidate area obtained by screening as a target area suspected to contain text.
In detail, please refer to step S42 in the above embodiment.
Step S63: and screening in each pixel point of the target area by adopting a maximum stable extremum area mode to obtain the target pixel point.
Specifically, refer to step S53 in the above embodiment.
Step S64: and updating the target area suspected to contain the text based on the target pixel points obtained by screening.
In detail, please refer to step S54 in the above embodiment.
Step S65: and counting gradient values of all pixel points in the updated target area.
Specifically, the image data of the updated target area may be convolved with Sobel convolution kernels, so as to obtain the gradient values of all pixel points in the updated target area.
Step S66: based on the statistically derived gradient values, it is determined whether text is contained in the target area after updating.
Specifically, the pixel points whose gradient values in the updated target area are greater than a first preset threshold are selected, and it is determined whether the average of the gradient values of the selected pixel points is greater than a second preset threshold; if so, it is determined that the updated target area contains text, otherwise that it does not. In this embodiment, the first preset threshold and the second preset threshold may be set according to the specific situation and are not particularly limited here.
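Steps S65 and S66 can be sketched in pure Python as follows. This is illustrative only: a real implementation would use an optimized convolution (e.g. `cv2.Sobel`), the threshold values are arbitrary, and the image is assumed to be a 2-D list of gray values.

```python
# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradient_magnitudes(img):
    """Gradient magnitude at every interior pixel of a 2-D gray image."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            out.append((gx * gx + gy * gy) ** 0.5)
    return out

def contains_text(img, t1, t2):
    """Step S66: keep gradients above t1, then compare their mean against t2."""
    strong = [g for g in gradient_magnitudes(img) if g > t1]
    return bool(strong) and sum(strong) / len(strong) > t2
```

A flat region has no strong gradients and is rejected, while a sharp step edge (as text strokes produce) passes both thresholds.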
Different from the foregoing embodiment, after updating the target area, gradient values of all pixels in the updated target area are further counted, and based on the gradient values obtained by counting, whether the updated target area contains text is determined, so that the area containing texture and pattern can be further excluded, false detection of non-text objects such as pattern and texture can be further reduced, and further accuracy of text detection is improved.
Referring to fig. 7, fig. 7 is a schematic diagram of a text detection device 70 according to an embodiment of the application. The text detection device 70 comprises an image acquisition module 71, a text detection module 72, an out-of-range rejection module 73 and an image analysis module 74, wherein the image acquisition module 71 is used for acquiring an original image shot by an imaging device on a scene to be detected, the text detection module 72 is used for detecting the original image by using a text detection model to obtain a candidate region, the out-of-range rejection module 73 is used for rejecting a part of the candidate region beyond the image boundary of the original image, and the image analysis module 74 is used for analyzing the candidate region after rejection and determining a target region related to the text in the scene to be detected.
According to the above scheme, an original image captured by the imaging device of the scene to be detected is obtained, the original image is detected with the text detection model to obtain candidate regions, and the portion of each candidate region that exceeds the image boundary of the original image is removed; the trimmed candidate regions are then analyzed to determine the target regions related to text in the scene to be detected. In this way, the detected candidate regions, particularly those close to the image boundary, can be corrected, an accurate basis is provided for subsequent analysis, and the accuracy of text detection is improved.
In some embodiments, the image analysis module 74 includes a first screening sub-module for screening the candidate region after the culling in a non-maximum suppression manner, and the image analysis module 74 further includes a first updating sub-module for taking the candidate region obtained by the screening as the target region suspected to include text.
Different from the previous embodiment, the candidate regions after being removed are screened by adopting a non-maximum suppression mode, so that the candidate regions can be screened more accurately, and texts, particularly texts close to the image boundary, can be detected accurately.
In some embodiments, the image analysis module 74 further includes a second screening sub-module for screening among the pixels in the target area by using the maximum stable extremum area mode to obtain the target pixel, and the image analysis module 74 further includes a second updating sub-module for updating the target area suspected to include the text based on the screened target pixel.
Different from the foregoing embodiment, after the candidate region obtained by screening by adopting non-maximum suppression is used as the target region suspected to contain text, the maximum stable extremum region mode is further adopted to screen in each pixel point of the target region, so as to obtain the target pixel point, and the target region suspected to contain text is updated based on the screened target pixel point, so that false detection on patterns, textures and the like can be reduced, and the accuracy of text detection can be improved.
In some embodiments, the second updating sub-module is specifically configured to obtain a minimum bounding rectangle of the filtered target pixel point, and update the minimum bounding rectangle to the target area.
In some embodiments, the image analysis module 74 further includes a gradient statistics sub-module for counting gradient values of all pixels in the target region after the updating, and the image analysis module 74 further includes a determination sub-module for determining whether text is included in the target region after the updating based on the counted gradient values.
Different from the foregoing embodiment, after updating the target area, gradient values of all pixels in the updated target area are further counted, and based on the gradient values obtained by counting, whether the updated target area contains text is determined, so that the area containing texture and pattern can be further excluded, false detection of non-text objects such as pattern and texture can be further reduced, and further accuracy of text detection is improved.
In some embodiments, the determining submodule includes a pixel screening unit configured to select the pixel points whose gradient values in the updated target area are greater than a first preset threshold, and a gradient judging unit configured to judge whether the average of the gradient values of the selected pixel points is greater than a second preset threshold, to determine that the updated target area contains text when it is, and to determine that the updated target area does not contain text otherwise.
In some embodiments, the gradient statistics sub-module is specifically configured to perform convolution processing on the image data of the updated target area with Sobel convolution kernels, so as to obtain the gradient values of all pixel points in the updated target area.
Referring to fig. 8, fig. 8 is a schematic diagram of a text detection device 80 according to an embodiment of the application. The text detection device 80 comprises a memory 81 and a processor 82 coupled to each other, the processor 82 being configured to execute program instructions stored in the memory 81 to implement the steps of any of the above-described text detection method embodiments.
In particular, the processor 82 is configured to control itself and the memory 81 to implement the steps of any of the text detection method embodiments described above. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be implemented jointly by a plurality of integrated circuit chips.
In this embodiment, the processor 82 is configured to obtain an original image of the scene to be detected captured by the imaging device; the processor 82 is further configured to detect the original image using a text detection model to obtain a candidate region; the processor 82 is further configured to remove the portion of the candidate region that extends beyond the image boundary of the original image; and the processor 82 is further configured to analyze the candidate region after removal to determine a target region related to text in the scene to be detected.
According to the above scheme, an original image of the scene to be detected is obtained from the imaging device, the original image is detected with a text detection model to obtain a candidate region, and the portion of the candidate region extending beyond the image boundary of the original image is removed before the remaining candidate region is analyzed to determine the target region related to text in the scene to be detected. The detected candidate region, particularly a candidate region close to the image boundary, can thus be corrected, providing an accurate basis for subsequent analysis and improving the accuracy of text detection.
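The boundary-removal step can be illustrated with a minimal sketch (the function name and the axis-aligned (x1, y1, x2, y2) box format are illustrative assumptions, not specifics from the patent):

```python
def clip_to_image(box, width, height):
    """Clamp an axis-aligned candidate box (x1, y1, x2, y2) to the image bounds,
    discarding any portion that extends beyond the image boundary."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))

# A candidate that spills past the right and bottom edges of a 640x480 image:
print(clip_to_image((600, 450, 700, 520), 640, 480))  # (600, 450, 640, 480)
```

The clipped box is what the subsequent analysis steps operate on, so candidates near the image boundary no longer carry out-of-image coordinates.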
In some embodiments, the processor 82 is configured to filter the candidate regions after removal using non-maximum suppression (NMS), and to take the candidate regions retained by the filtering as target regions suspected of containing text.
Different from the preceding embodiment, filtering the candidate regions after removal by non-maximum suppression allows the candidate regions to be screened more accurately, so that text, particularly text close to the image boundary, can be detected accurately.
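Non-maximum suppression greedily keeps the highest-scoring candidate and discards any remaining candidate that overlaps a kept one too much. A minimal sketch (the box format, the score source, and the 0.5 IoU threshold are illustrative assumptions):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it does
    not overlap an already-kept box beyond iou_thresh. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

For example, two near-duplicate detections of the same text line collapse to the higher-scoring one, while a distant detection survives.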
In some embodiments, the processor 82 is further configured to screen the pixel points of the target area using a maximally stable extremal regions (MSER) approach to obtain target pixel points, and to update the target area suspected of containing text based on the screened target pixel points.
Different from the foregoing embodiment, after the candidate region retained by non-maximum suppression is taken as the target region suspected of containing text, the pixel points of the target region are further screened using the maximally stable extremal regions approach to obtain target pixel points, and the target region is updated based on these target pixel points. False detections of patterns, textures and the like can thereby be reduced, improving the accuracy of text detection.
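MSER proper sweeps all grey-level thresholds and tracks every extremal region; the following is a greatly simplified, single-seed sketch of only the underlying stability idea (dark-on-bright regions only; the `delta` and `max_growth` parameters are illustrative, not values from the patent):

```python
import numpy as np

def component_area(binary, seed):
    """Area of the 4-connected foreground component of `binary` containing `seed`."""
    h, w = binary.shape
    if not binary[seed]:
        return 0
    seen, stack = {seed}, [seed]
    while stack:
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and (ny, nx) not in seen:
                seen.add((ny, nx))
                stack.append((ny, nx))
    return len(seen)

def is_stable_region(gray, seed, delta=10, max_growth=0.1):
    """Simplified MSER-style stability test: the component containing `seed` is
    'stable' if its area grows little when the threshold is raised by `delta`.
    Character strokes tend to be stable; smooth gradients and textures are not."""
    t = int(gray[seed])
    a0 = component_area(gray <= t, seed)
    a1 = component_area(gray <= t + delta, seed)
    return a0 > 0 and (a1 - a0) / a0 <= max_growth
```

A dark, sharply bounded blob barely changes size as the threshold moves, while a component seeded inside a smooth ramp balloons and is rejected.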
In some embodiments, the processor 82 is further configured to obtain the minimum bounding rectangle of the screened target pixel points, and to update the target area to this minimum bounding rectangle.
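Taking the bounding rectangle as axis-aligned for simplicity (the patent's "minimum circumscribed rectangle" could equally be a rotated rectangle; this sketch and its names are illustrative):

```python
def min_bounding_rect(points):
    """Axis-aligned minimum bounding rectangle (x1, y1, x2, y2) of a set of
    screened (x, y) pixel coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

print(min_bounding_rect([(5, 7), (2, 9), (8, 3)]))  # (2, 3, 8, 9)
```

The target area is then replaced by this rectangle, tightening it around the surviving pixel points.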
In some embodiments, the processor 82 is further configured to compute the gradient values of all pixel points in the updated target area, and to determine, based on the computed gradient values, whether the updated target area contains text.
Different from the foregoing embodiment, after the target area is updated, the gradient values of all pixel points in the updated target area are computed, and whether the updated target area contains text is determined from these gradient values. Regions containing only texture or pattern can thus be excluded, further reducing false detections of non-text objects and improving the accuracy of text detection.
In some embodiments, the processor 82 is further configured to screen out the pixel points whose gradient values in the updated target area are greater than a first preset threshold, to judge whether the average of the gradient values of the screened pixel points is greater than a second preset threshold, to determine that the updated target area contains text when the average is greater than the second preset threshold, and to determine that the updated target area does not contain text when the average is not greater than the second preset threshold.
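The two-threshold decision can be sketched as follows (the function name and threshold values are illustrative; the patent does not fix concrete values for the first and second preset thresholds):

```python
def region_contains_text(gradient_map, first_threshold, second_threshold):
    """Screen the pixels whose gradient exceeds the first threshold, then decide
    by comparing the mean gradient of the screened pixels against the second."""
    strong = [g for row in gradient_map for g in row if g > first_threshold]
    if not strong:
        return False  # nothing passed the first screen: no text
    return sum(strong) / len(strong) > second_threshold
```

Text strokes produce many high-gradient pixels whose mean stays high, whereas soft textures pass few pixels or a low mean and are rejected.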
In some embodiments, the processor 82 is further configured to convolve the image data of the updated target area with a Sobel convolution kernel to obtain the gradient values of all pixel points in the updated target area.
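A plain-NumPy sketch of computing gradient magnitudes with the 3x3 Sobel kernels (zero padding at the borders is an assumption of this sketch; an implementation could equally call a library filtering routine):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
SOBEL_Y = SOBEL_X.T

def sobel_gradient(img):
    """Per-pixel gradient magnitude from 3x3 Sobel kernels, zero-padded borders."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Accumulate the 3x3 neighbourhood response one kernel tap at a time.
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
            gx += SOBEL_X[dy, dx] * window
            gy += SOBEL_Y[dy, dx] * window
    return np.hypot(gx, gy)
```

On a sharp vertical step edge the horizontal kernel responds with magnitude 4 per unit step, while flat regions away from the edge give zero, which is what the gradient-statistics step relies on.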
In some embodiments, the text detection apparatus 80 further includes an imaging device for capturing an original image of the scene to be detected.
Referring to fig. 9, fig. 9 is a schematic frame diagram of a storage device 90 according to an embodiment of the application. The storage device 90 stores program instructions 901 executable by a processor, the program instructions 901 being for implementing the steps of any of the text detection method embodiments described above.
According to the above scheme, an original image of the scene to be detected is obtained from the imaging device, the original image is detected with a text detection model to obtain a candidate region, and the portion of the candidate region extending beyond the image boundary of the original image is removed before the remaining candidate region is analyzed to determine the target region related to text in the scene to be detected. The detected candidate region, particularly a candidate region close to the image boundary, can thus be corrected, providing an accurate basis for subsequent analysis and improving the accuracy of text detection.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a logical functional division, and other divisions may be used in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Claims (10)
1. A text detection method, comprising:
acquiring an original image obtained by shooting a scene to be detected with an imaging device;
detecting the original image by using a text detection model to obtain an initial area containing suspected text;
carrying out predictive regression on the initial region to obtain a candidate region matched with the text direction;
removing the part of the candidate region that extends beyond the image boundary of the original image;
analyzing the candidate region after removal, and determining a target region related to text in the scene to be detected;
wherein the analyzing the candidate region after removal and determining the target region related to text in the scene to be detected comprises:
calculating Canny edges of the image data of the candidate region after removal;
calculating the stroke width of the image according to the direction information of the edges; and
gathering pixels into connected domains according to the stroke width, filtering the connected domains using geometric reasoning, and determining the target region related to text in the scene to be detected.
2. The text detection method according to claim 1, wherein the analyzing the candidate region after removal to determine the target region related to text in the scene to be detected further comprises:
filtering the candidate region after removal by non-maximum suppression; and
taking the candidate region retained by the filtering as a target region suspected of containing text.
3. The text detection method according to claim 2, wherein after the candidate region obtained by screening is taken as the target region suspected to contain text, the method further comprises:
screening the pixel points of the target area using a maximally stable extremal regions approach to obtain target pixel points;
and updating the target area suspected to contain the text based on the target pixel points obtained by screening.
4. The text detection method of claim 3, wherein updating the target region suspected of containing text based on the filtered target pixel points comprises:
obtaining a minimum bounding rectangle of the screened target pixel points; and
updating the target area to the minimum bounding rectangle.
5. The text detection method of claim 3, wherein after updating the target area suspected of containing text based on the screened target pixel points, the method further comprises:
computing the gradient values of all pixel points in the updated target area; and
determining, based on the computed gradient values, whether the updated target area contains text.
6. The text detection method of claim 5, wherein the determining, based on the computed gradient values, whether the updated target area contains text comprises:
screening pixel points with gradient values larger than a first preset threshold value in the updated target area;
if the average value of the gradient values of the pixel points obtained through screening is larger than a second preset threshold value, determining that the updated target area contains text;
and if the mean value of the gradient values of the pixel points obtained through screening is not greater than the second preset threshold value, determining that the updated target area does not contain text.
7. The text detection method of claim 5, wherein the computing the gradient values of all pixel points in the updated target area comprises:
convolving the image data of the updated target area with a Sobel convolution kernel to obtain the gradient values of all pixel points in the updated target area.
8. A text detection device comprising a memory and a processor coupled to each other;
the processor is configured to execute the program instructions stored in the memory to implement the text detection method of any one of claims 1 to 7.
9. The text detection device of claim 8, further comprising an imaging device for capturing an original image of a scene to be detected.
10. A storage device storing program instructions executable by a processor for implementing the text detection method of any one of claims 1 to 7.
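The edge-and-stroke-width analysis recited in claim 1 (Canny edges, stroke widths from edge direction information, connected domains filtered by geometric reasoning) follows the classic stroke width transform. Below is a greatly simplified, horizontal-only sketch of the stroke-width idea on a binary text mask — illustrative only, since the real transform casts rays along edge gradient directions rather than scanning rows:

```python
def horizontal_stroke_widths(mask):
    """Record the length of each foreground run in each row of a binary mask as
    a crude horizontal 'stroke width' (the full SWT casts rays along edge
    normals derived from Canny gradient directions)."""
    widths = []
    for row in mask:
        run = 0
        for value in list(row) + [0]:  # trailing sentinel closes a run at row end
            if value:
                run += 1
            elif run:
                widths.append(run)
                run = 0
    return widths

# Text strokes have nearly uniform widths; geometric reasoning over the
# connected domains (e.g. low width variance) can then reject non-text blobs.
print(horizontal_stroke_widths([[0, 1, 1, 0, 1], [1, 1, 1, 1, 1]]))  # [2, 1, 5]
```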
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911084168.4A CN111027560B (en) | 2019-11-07 | 2019-11-07 | Text detection method and related device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111027560A CN111027560A (en) | 2020-04-17 |
| CN111027560B (en) | 2023-09-29 |
Family
ID=70201169
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911084168.4A Active CN111027560B (en) | 2019-11-07 | 2019-11-07 | Text detection method and related device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111027560B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113191293B (en) * | 2021-05-11 | 2023-04-07 | 创新奇智(重庆)科技有限公司 | Advertisement detection method, device, electronic equipment, system and readable storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6466340B1 (en) * | 1998-03-02 | 2002-10-15 | Konica Corporation | Image reading apparatus |
| CN102201053A (en) * | 2010-12-10 | 2011-09-28 | 上海合合信息科技发展有限公司 | Method for cutting edge of text image |
| CN102842119A (en) * | 2012-08-18 | 2012-12-26 | 湖南大学 | Quick document image super-resolution method based on image matting and edge enhancement |
| US8731297B1 (en) * | 2007-09-28 | 2014-05-20 | Amazon Technologies, Inc. | Processing a digital image of content to remove border artifacts |
| CN105868758A (en) * | 2015-01-21 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Method and device for detecting text area in image and electronic device |
| CN110363785A (en) * | 2019-07-15 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A text super frame detection method and device |
Family Cites Families (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH04154368A (en) * | 1990-10-18 | 1992-05-27 | Fujitsu Ltd | Area dividing system for document image |
| US20030198386A1 (en) * | 2002-04-19 | 2003-10-23 | Huitao Luo | System and method for identifying and extracting character strings from captured image data |
| US20090086275A1 (en) * | 2007-09-28 | 2009-04-02 | Jian Liang | Processing a digital image of content |
| AU2008229847A1 (en) * | 2008-10-09 | 2010-04-29 | Canon Kabushiki Kaisha | Automated image cropping |
| KR101023389B1 (en) * | 2009-02-23 | 2011-03-18 | 삼성전자주식회사 | Apparatus and Method for Improving Character Recognition Performance |
| CN102254171A (en) * | 2011-07-13 | 2011-11-23 | 北京大学 | Method for correcting Chinese document image distortion based on text boundaries |
| US8837830B2 (en) * | 2012-06-12 | 2014-09-16 | Xerox Corporation | Finding text in natural scenes |
| CN105005764B (en) * | 2015-06-29 | 2018-02-13 | 东南大学 | The multi-direction Method for text detection of natural scene |
| CN107220579B (en) * | 2016-03-21 | 2020-02-04 | 杭州海康威视数字技术股份有限公司 | License plate detection method and device |
| CN108171104B (en) * | 2016-12-08 | 2022-05-10 | 腾讯科技(深圳)有限公司 | Character detection method and device |
| CN108470172B (en) * | 2017-02-23 | 2021-06-11 | 阿里巴巴集团控股有限公司 | Text information identification method and device |
| KR101831204B1 (en) * | 2017-04-11 | 2018-02-22 | 주식회사 한글과컴퓨터 | Method and apparatus for document area segmentation |
| US10719937B2 (en) * | 2017-12-22 | 2020-07-21 | ABBYY Production LLC | Automated detection and trimming of an ambiguous contour of a document in an image |
| CN108304835B (en) * | 2018-01-30 | 2019-12-06 | 百度在线网络技术(北京)有限公司 | character detection method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |