CN114596571B - An intelligent lens-free text recognition system - Google Patents
An intelligent lens-free text recognition system
- Publication number: CN114596571B (application CN202210246740.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- mask
- model
- recognition
- image
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of lensless imaging, and in particular relates to an intelligent lensless text recognition system.
Background Art
With the rapid development and application of vision tasks, cameras are being integrated into all kinds of hardware devices. Some application scenarios impose strict requirements on camera size. A lensless camera is an imaging system that replaces the lens with a thin mask, which greatly reduces the size of the camera.
Compared with lensed cameras, a lensless camera must apply computational imaging to the data collected on the sensor in order to recover an image. However, images reconstructed from lensless measurements suffer from blur and low resolution, making them unsuitable for many vision tasks. To date there has been no research on lensless detection and recognition of text beyond single characters.
A lensless text recognition system is therefore needed.
Summary of the Invention
Given that current lensless imaging technology, owing to its poor imaging quality, has not been applied to locating and recognizing text beyond single letters, the present invention provides a lensless text localization and recognition system with high recognition accuracy and general applicability.
The technical solution adopted by the present invention is as follows:
The intelligent lensless text recognition system of the present invention comprises an optical module and a computational imaging and text localization/recognition module. The optical module consists mainly of a modulatable amplitude mask and an optical sensor placed parallel to each other. The target to be recognized is placed in front of the optical module; light emitted by the target is scattered by the modulatable amplitude mask and projected onto the plane of the optical sensor, forming a projection image (the raw data). The optical sensor transmits the projection image to the computational imaging and text localization/recognition module.
The computational imaging and text recognition module comprises a computational imaging model, a text localization model, and a text recognition model, connected in series. Its input is the projection image obtained on the sensor after the light passes through the optical module, and its output is the text appearing in the projected scene, in plain-text form.
The modulatable amplitude mask is a binary mask composed of k*k cells, each of value 1 or 0: 1 means light can pass through the cell, 0 means it cannot.
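As an illustration, the k*k binary amplitude mask described above can be represented as a simple array. The following is a minimal sketch; the value of k, the fraction of open cells, and the function name are ours, not taken from the patent:

```python
import numpy as np

def random_binary_mask(k: int, p_open: float = 0.5, seed: int = 0) -> np.ndarray:
    """Generate a k*k binary amplitude mask: 1 = cell transmits light, 0 = opaque."""
    rng = np.random.default_rng(seed)
    return (rng.random((k, k)) < p_open).astype(np.uint8)

mask = random_binary_mask(8)  # an 8x8 randomly initialized mask pattern
```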
The computational imaging model takes the projection image and outputs a predicted reconstructed image; the text localization model processes the reconstructed image and outputs the positions of the text in the image; the output of the text localization model is fed into the text recognition model, which outputs the text recognition result.
During training of the computational imaging and text recognition module, only the computational imaging model participates in training and has its parameters updated; the text localization model and the text recognition model do not participate in training.
The computational imaging model is an encoder-decoder neural network, specifically U-Net; the text localization model may be any text localization architecture, specifically CTPN; the text recognition model may be any text recognition architecture, specifically CRNN.
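The serial arrangement, with only the imaging model trainable, can be sketched as follows. The three modules are single stand-in layers here, since the patent names U-Net, CTPN, and CRNN but their real implementations would replace these placeholders:

```python
import torch.nn as nn

# Stand-in single layers for the three serial models; in the real system these
# would be the U-Net (imaging), CTPN (localization) and CRNN (recognition).
imaging_model   = nn.Conv2d(1, 1, 3, padding=1)
text_detector   = nn.Conv2d(1, 1, 3, padding=1)
text_recognizer = nn.Conv2d(1, 1, 3, padding=1)

# Only the computational imaging model is trained; the downstream
# localization and recognition models stay frozen.
for frozen in (text_detector, text_recognizer):
    for p in frozen.parameters():
        p.requires_grad_(False)
```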
The pattern on the modulatable amplitude mask is displayed on a liquid crystal display; it is either randomly generated or determined through training-based optimization. The method for determining the mask pattern through training comprises the following steps:
1) Model the imaging process between the target to be recognized and the optical module as a two-dimensional convolutional layer, specifically:
m = w * o,  where  m_{x,y} = Σ_{i=1..k} Σ_{j=1..k} w_{i,j} · o_{x+i,y+j}
where w denotes the amplitude distribution on the mask, i.e., the distribution of cell values on the mask; with a coordinate system whose origin is the mask center, (i, j) is the coordinate of a cell's center on the mask, and w_{i,j} is the value of the cell at (i, j);
o denotes the image of the target, scaled onto the sensor plane, as it would appear without the mask (i.e., through an open aperture); with a coordinate system whose origin is the sensor-plane center, (x, y) is the coordinate of a pixel of the projection image on the sensor plane, o_{x,y} is the pixel value at (x, y) on the sensor plane without the mask, and o_{x+i,y+j} is the pixel value at (x+i, y+j);
m denotes the image of the target projected onto the sensor plane after passing through the mask; m_{x,y} is its pixel value at (x, y);
k denotes the number of rows (or columns) of cells on the mask, with i, j ∈ [1, k].
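The per-pixel form of this convolution can be sketched directly from the definitions above. This is a naive reference implementation for clarity, not an optimized one, and the function name is ours:

```python
import numpy as np

def mask_projection(o: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive evaluation of m_{x,y} = sum_{i,j} w_{i,j} * o_{x+i, y+j},
    computed wherever the k*k window fits entirely inside o."""
    k = w.shape[0]
    H, W = o.shape
    m = np.zeros((H - k + 1, W - k + 1))
    for x in range(m.shape[0]):
        for y in range(m.shape[1]):
            m[x, y] = np.sum(w * o[x:x + k, y:y + k])
    return m
```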
2) Binarize the two-dimensional convolutional layer to obtain the convolutional layer of a binary neural network:

w_b = (sign(w) + 1) / 2

where w_b denotes the result of binarizing w.
Since the mask takes only the values 0 and 1, a binary neural network is used for training: the sign function maps the continuous weights to -1 or +1, after which 1 is added and the result is divided by 2.
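The binarization step can be sketched as a custom autograd function. The forward pass follows the sign-then-rescale rule described above; the straight-through estimator in the backward pass is a common choice for binary networks that we assume here, since the patent describes only the forward mapping:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: w_b = (sign(w) + 1) / 2, mapping continuous weights to {0, 1}.
    Backward: a straight-through estimator passes the gradient unchanged
    (an assumption -- the patent only specifies the forward mapping)."""
    @staticmethod
    def forward(ctx, w):
        return (torch.sign(w) + 1.0) / 2.0

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

w = torch.tensor([-0.7, 0.3, 1.2], requires_grad=True)
w_b = BinarizeSTE.apply(w)  # tensor([0., 1., 1.])
```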
3) Treat the parameters w_b of the binary-neural-network convolutional layer as model parameters and optimize them jointly with the computational imaging and text localization/recognition module:
3.1) During training, randomly initialize the mask pattern via the control circuit, and use the random initialization as the initial parameters of the binary-neural-network convolutional layer.
3.2) Forward pass of the system: fix the target to be recognized, measure in the real physical scene the projection image formed on the optical sensor plane after the light passes through the mask, and use it as the input to the computational imaging and text localization/recognition module.
Backward pass: compute the loss function Loss between the predicted image output by the computational imaging and text localization/recognition module and the ground-truth image label, backpropagate Loss to the binary-neural-network convolutional layer, update its parameters w_b, and modulate the adjustable mask according to the updated w_b; the modulated pattern serves as the mask pattern in the forward pass of the next training round.
3.3) The mask pattern obtained when training completes is the optimized result.
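Steps 3.1-3.3 can be sketched as a simulated training round. Note the assumptions: in the real system the projection m is measured on the physical sensor and the mask is re-displayed on the LCD after each update, whereas here the optics are replaced by a convolution so the update loop can be shown end to end, and a simple convolution stands in for the U-Net:

```python
import torch
import torch.nn.functional as F

# Simulated sketch of mask optimization. The physical measurement and LCD
# re-modulation steps are replaced by an in-software convolution.
k = 7
w = torch.randn(k, k, requires_grad=True)            # latent continuous mask weights
imaging_model = torch.nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the U-Net
opt = torch.optim.SGD([w, *imaging_model.parameters()], lr=0.1)

scene = torch.rand(1, 1, 16, 16)                     # ground-truth image label
for step in range(3):
    # Binarize with a straight-through estimator so gradients reach w.
    w_b = ((torch.sign(w) + 1) / 2 - w).detach() + w
    m = F.conv2d(scene, w_b.view(1, 1, k, k), padding=k // 2)  # simulated projection
    pred = imaging_model(m)                          # predicted reconstruction
    loss = F.mse_loss(pred, scene)                   # image part of the loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```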
The cell size of the modulatable mask equals the pixel size on the sensor plane. The distance d1 between the target to be recognized and the modulatable amplitude mask is much larger than the distance d2 between the mask and the optical sensor (d1 > 100·d2); the amplitude distribution on the mask is therefore approximately equal to its projection onto the sensor plane.
The loss function Loss of the computational imaging and text localization/recognition module during training is:
Loss = a × Loss1 + b × Loss2
where Loss1 is the error between the predicted image output by the computational imaging model and the ground-truth image label (the image of the target to be recognized); Loss2 is the error between the predicted text finally output by the module and the ground-truth text label of the target (the text content in the target image); a and b are weights.
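The combined loss can be sketched as follows. The choice of MSE for Loss1 and cross-entropy for Loss2 is illustrative, as are the default weights; the patent does not fix the individual error measures:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_img, true_img, pred_text_logits, true_text_ids, a=1.0, b=0.1):
    """Loss = a * Loss1 + b * Loss2. MSE for the image error and cross-entropy
    for the text error are illustrative choices, as are the weights a and b."""
    loss1 = F.mse_loss(pred_img, true_img)                    # image reconstruction error
    loss2 = F.cross_entropy(pred_text_logits, true_text_ids)  # text prediction error
    return a * loss1 + b * loss2
```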
Beneficial effects of the present invention:
The lensless text recognition system of the present invention removes the size constraint imposed by a lens, making it easier to integrate the camera into other devices.
The present invention realizes jointly optimized, hardware-software-integrated deep learning for lensless imaging and text recognition, improving the accuracy of lensless text localization and recognition. Every module of the system is generic and broadly applicable, giving the system strong practical value.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows the overall data flow of the present invention.
FIG. 2 is a schematic diagram of the optical module of the present invention.
FIG. 3 is a schematic diagram of the computational imaging and text localization/recognition module of the present invention.
DETAILED DESCRIPTION
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the lensless text recognition system of the present invention comprises an optical module and a computational imaging and text localization/recognition module. The target to be recognized yields raw data through the optical module, and the raw data yields the text, in plain-text form, through the computational imaging and text localization/recognition module.
The optical module consists of a modulatable amplitude mask and a sensor. The modulatable amplitude mask is a binary mask divided into k*k cells, each of value 1 or 0 (1 meaning light can pass through, 0 meaning it cannot). It is placed in front of and parallel to the optical sensor, and the object is placed in front of the mask. Light emitted by the object is scattered by the modulatable amplitude mask and casts a specific projection image (the raw data) onto the sensor plane, where it is recorded and transmitted to the computational imaging and text localization/recognition module. The pattern on the mask can be controlled by a circuit in real time, specifically by display on a liquid crystal display; the pattern can be randomly generated or updated through training-based optimization, and an optimized mask improves the recognition performance of the computational imaging and text localization/recognition module. Even when the mask pattern is fixed and not optimized, the text localization and recognition results can still be obtained from the raw data on the sensor through the computational imaging and text localization/recognition module.
As shown in FIG. 2, the mask pattern is trained and optimized as follows:
1) Model the interaction between the imaging target and the mask as a two-dimensional convolutional layer:
m = w * o,  where  m_{x,y} = Σ_{i=1..k} Σ_{j=1..k} w_{i,j} · o_{x+i,y+j}
where w denotes the amplitude distribution on the mask, i.e., the distribution of cell values on the mask; with a coordinate system whose origin is the mask center, (i, j) is the coordinate of a cell's center on the mask, and w_{i,j} is the value of the cell at (i, j);
o denotes the image of the target scaled onto the sensor plane when the modulatable amplitude mask is fully transparent, i.e., when the target is not affected by the mask; with a coordinate system whose origin is the sensor-plane center, (x, y) is the coordinate of a pixel on the sensor plane, and o_{x,y} is the pixel value at (x, y) on the sensor plane without the mask;
m denotes the image of the target projected onto the sensor plane after passing through the mask; m_{x,y} is its pixel value at (x, y);
k denotes the number of rows (or columns) of cells on the mask, with i, j ∈ [1, k].
Since the mask cell size equals the sensor pixel size, and d1 >> d2 (d1 is more than 100 times d2), w is approximately equal to the projection of the mask's amplitude distribution onto the sensor plane.
2) Treat the parameter w of the two-dimensional convolutional layer as a model parameter and optimize it jointly with the downstream computational imaging and text localization/recognition module.
2.1) Binarize the two-dimensional convolutional layer to obtain a binary neural network:

w_b = (sign(w) + 1) / 2

where w_b denotes the result of binarizing w.
Since the mask takes only the values 0 and 1, a binary neural network is used for training: the sign function maps w to -1 or +1, after which 1 is added and the result is divided by 2.
2.2) During training, randomly initialize the mask pattern via the control circuit and use this value as the initial parameters of the binary-neural-network convolutional layer.
In the forward pass of the system: fix the target object, measure in the real physical scene the raw data obtained on the sensor after the light passes through the mask, and feed it to the computational imaging and text localization/recognition module.
The module's output is then compared against the ground-truth label to obtain the error.
In the gradient backward pass: compute the error between the module's output and the ground-truth label, compute the gradients and backpropagate the error to the binary-neural-network convolutional layer, update that layer, and modulate the adjustable mask according to the updated weights; the result serves as the mask pattern for the forward pass of the next training round.
训练完成后,固定掩膜版图案。After training is completed, the mask pattern is fixed.
As shown in FIG. 3, the computational imaging and text recognition module comprises a computational imaging model, a text localization model, and a text recognition model connected in series. In use, the input is the raw data obtained on the sensor after the light passes through the optical module, and the output is the predicted text in plain-text form. During training, only the computational imaging model has its parameters updated; the text localization model and text recognition model do not participate in training.
The computational imaging model is a deep-learning-based imaging method that computes a predicted reconstructed image (the model output) from the raw data obtained on the sensor (the model input). It is an encoder-decoder neural network, for which a U-Net can be used. During training, images containing letters and digits serve as the targets to be recognized, and the raw data they produce on the sensor through the optical module serves as the model input. The training loss has two parts: 1) Loss1, the error between the target image (the ground-truth image label) and the predicted image output by the model; 2) Loss2, obtained by feeding the predicted reconstructed image into the downstream text localization and recognition models and computing the error between the predicted text and the target's text label. The total loss is Loss = a × Loss1 + b × Loss2.
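A heavily reduced encoder-decoder in the spirit of U-Net, with a single skip connection, can be sketched as follows; a real implementation would use the full U-Net architecture, and all layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder with one skip connection, in the spirit of
    U-Net; channel counts and depth are illustrative only."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Conv2d(16, 1, 3, padding=1)  # 16 = 8 skip + 8 upsampled

    def forward(self, x):
        e = self.enc(x)                             # encoder features (skip path)
        m = self.up(self.mid(self.down(e)))         # bottleneck, then upsample
        return self.dec(torch.cat([e, m], dim=1))   # fuse skip and decoded features
```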
Gradient backpropagation: the gradients of this loss are propagated back to update the computational imaging model and the two-dimensional convolutional layer described above, finally yielding the trained computational imaging model.
The text localization model is a neural network whose input is an image and whose output is the positions of the text. Any text localization architecture can be used; a concrete choice is CTPN: the image is passed through a VGG16 network, a 3*3 sliding window is computed over each row of the network's last convolutional feature map, the results are connected through a BLSTM, and finally a fully connected layer outputs the predicted coordinates and confidence scores. This model is pretrained and does not participate in the backpropagation parameter updates in this system.
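The CTPN-style flow just described — a 3*3 sliding window over the last convolutional feature map, a row-wise BLSTM, then a fully connected prediction layer — can be sketched as follows. Channel sizes and anchor count are illustrative, and the VGG16 backbone is omitted:

```python
import torch
import torch.nn as nn

class CTPNStyleHead(nn.Module):
    """Sketch of a CTPN-style detection head. Real CTPN runs on VGG16 conv5
    features; the sizes here are illustrative placeholders."""
    def __init__(self, c_in=64, hidden=32, n_anchors=10):
        super().__init__()
        self.window = nn.Conv2d(c_in, c_in, 3, padding=1)    # 3x3 sliding window
        self.blstm = nn.LSTM(c_in, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_anchors * 3)       # (dy, dh, score) per anchor

    def forward(self, feat):                                 # feat: (B, C, H, W)
        x = self.window(feat)
        B, C, H, W = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(B * H, W, C)    # one sequence per row
        out, _ = self.blstm(rows)
        return self.fc(out).reshape(B, H, W, -1)             # per-position predictions
```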
The text recognition model is a neural network whose input is an image containing letters and digits and whose output is the recognized text. Any text recognition architecture can be used, such as CRNN: the image first passes through several convolutional layers that extract a feature sequence, which enters a BLSTM, finally outputting per-character prediction scores. This model is pretrained and does not participate in the backpropagation parameter updates in this system.
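The CRNN-style flow — convolutional feature-sequence extraction, a BLSTM, then per-time-step character scores — can be sketched as follows. Layer sizes and the class count are illustrative (e.g., 26 letters + 10 digits + 1 blank symbol for a CTC-style decoder):

```python
import torch
import torch.nn as nn

class CRNNStyleRecognizer(nn.Module):
    """Sketch of a CRNN-style recognizer; all sizes are illustrative."""
    def __init__(self, n_classes=37, hidden=32):   # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.blstm = nn.LSTM(32 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, img):                        # img: (B, 1, 32, W)
        f = self.cnn(img)                          # (B, 32, 8, W/2)
        B, C, H, W = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(B, W, C * H)  # width becomes time axis
        out, _ = self.blstm(seq)
        return self.fc(out)                        # (B, W/2, n_classes) per-step scores
```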
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210246740.8A CN114596571B (en) | 2022-03-14 | 2022-03-14 | An intelligent lens-free text recognition system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114596571A CN114596571A (en) | 2022-06-07 |
CN114596571B true CN114596571B (en) | 2024-10-18 |
Family
ID=81818206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210246740.8A Active CN114596571B (en) | 2022-03-14 | 2022-03-14 | An intelligent lens-free text recognition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114596571B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447078A (en) * | 2018-10-23 | 2019-03-08 | 四川大学 | A kind of detection recognition method of natural scene image sensitivity text |
WO2021076075A2 (en) * | 2019-10-14 | 2021-04-22 | İzmi̇r Yüksek Teknoloji̇ Ensti̇tüsü | Cell viability analysis and counting from holograms by using deep learning and appropriate lensless holographic microscope |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6211508B1 (en) * | 1999-01-28 | 2001-04-03 | Syscan, Inc. | Lensless optical system in imaging sensing module |
US7128266B2 (en) * | 2003-11-13 | 2006-10-31 | Metrologic Instruments. Inc. | Hand-supportable digital imaging-based bar code symbol reader supporting narrow-area and wide-area modes of illumination and image capture |
FR3046239A1 (en) * | 2015-12-28 | 2017-06-30 | Commissariat Energie Atomique | DEVICE AND METHOD FOR BIMODAL OBSERVATION OF AN OBJECT |
EP4116929A4 (en) * | 2020-03-02 | 2023-07-05 | Panasonic Intellectual Property Corporation of America | INFORMATION PROCESSING METHOD, INFORMATION PROCESSING SYSTEM AND INFORMATION PROCESSING DEVICE |
- 2022-03-14: application CN202210246740.8A filed in China; granted as patent CN114596571B (active)
Also Published As
Publication number | Publication date |
---|---|
CN114596571A (en) | 2022-06-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||