CN116468903A - A method and related device for processing point cloud data - Google Patents
A method and related device for processing point cloud data
- Publication number
- CN116468903A (application number CN202310328338.9A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- image
- text
- cloud data
- features
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
A method for processing point cloud data, applied to the technical field of artificial intelligence. In the method, based on point cloud data and an image acquired for the same object, the object contained in the image is first identified, and the point cloud cluster in the point cloud data that corresponds to the object in the image is then determined through the mapping relationship between the point cloud data and the image. In this way, in the model training stage, features of the point cloud cluster and of the text corresponding to the same object can be extracted by the model, so that contrastive learning training of point cloud features and text features is realized, and object identification in the point cloud data can be realized based on the model obtained through training.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a processing method and a related device of point cloud data.
Background
The goal of a three-dimensional (3D) object localization and recognition task is to locate and classify objects in a 3D scene constructed from point cloud data. With the continuous development of intelligent robots and automatic driving, localization and recognition of 3D objects in open-world environments is becoming an important goal of perception.
With the rapid development of artificial intelligence technology, good results can be achieved by using an artificial-intelligence-based model to locate and identify 3D objects. However, when 3D objects are located and identified by such a model, a large amount of annotated data is usually required to train the model so that it performs well. Because 3D scenes in the open world are extremely broad, practical applications can only bear the annotation cost of part of the data, which limits the training of 3D object localization and recognition tasks on open-world semantics. Therefore, how to build an open-world localization and recognition model in the absence of data annotation has become a development direction that draws attention in the industry.
At present, in the related art, point cloud data is projected into a plurality of depth maps, and the depth maps and text vocabularies are respectively input into an image feature extractor and a text feature extractor in a model for feature extraction, so as to realize contrastive learning training of image features and text features. In the test process of the model, a depth map obtained by projecting the point cloud data and a test text can be input into the model, the feature distance between the image feature and the text feature output by the model is calculated, and the text corresponding to the point cloud data is then determined, thereby realizing identification of the object in the point cloud data. However, the related art needs to project the point cloud data into depth maps, which often loses much of the original 3D structure, so the actual performance of the model tends to be poor.
Disclosure of Invention
The application provides a method for processing point cloud data, which can effectively improve the performance of a model used to identify objects in point cloud data.
The first aspect of the application provides a method for processing point cloud data, applied to the technical field of artificial intelligence. The method comprises the following steps: first, first point cloud data and a first image are acquired, wherein the first point cloud data and the first image are acquired for a first object. In addition to the first object, other objects may also be included in the first point cloud data and the first image, that is, the first point cloud data and the first image may contain a plurality of objects.
Then, an image block of the first object in the first image and a text name corresponding to the first object are acquired, that is, alignment between the image and the text is realized, and the text semantics of the image block in the first image are determined.
Secondly, a first point cloud cluster corresponding to the first object in the first point cloud data is determined based on the image block and the mapping relationship between the first point cloud data and the first image. The mapping relationship enables conversion between points in the first point cloud data and pixels in the first image. That is, based on the mapping relationship, the pixel in the first image corresponding to a point in the first point cloud data can be determined, and the point in the first point cloud data corresponding to a pixel in the first image can also be determined.
Then, features are extracted from the first point cloud cluster and the text name through a first model to obtain a first point cloud feature and a first text feature, wherein the first model comprises a first network and a second network, the first point cloud feature is obtained by extracting features from the first point cloud cluster through the first network, and the first text feature is obtained by extracting features from the text name through the second network. The first network and the second network are neural networks that are structurally different and independent of each other.
Finally, the first model is updated based on a first loss function to obtain a second model, wherein the first loss function is related to a first difference between the first point cloud feature and the first text feature. The first loss function may have a positive correlation with the first difference. In this way, in the process of updating the first model based on the first loss function, the update target of the first model is to reduce the first loss function as much as possible, that is, to reduce the difference between the point cloud feature and the text feature corresponding to the same object, so that the second model obtained by training can output a point cloud feature and a text feature with as small a difference as possible when extracting the point cloud feature and the text feature corresponding to the same object.
In this scheme, based on point cloud data and an image acquired for the same object, the object included in the image is first identified, and the point cloud cluster in the point cloud data corresponding to the object in the image is then determined through the mapping relationship between the point cloud data and the image. In this way, in the model training stage, features of the point cloud cluster and of the text corresponding to the same object can be extracted by the model, so that contrastive learning training of point cloud features and text features is realized, and object identification in the point cloud data can be realized based on the model obtained through training.
In this scheme, the image serves as an intermediary between the point cloud data and the text, and the text corresponding to the point cloud data is determined in advance through the mapping relationship between the point cloud data and the image that is established at the acquisition stage, so that alignment between the point cloud data and the text is realized. This avoids the loss of original data structure caused by projecting the point cloud data into depth maps, facilitates object understanding in the 3D scene, and can effectively improve the performance of the model.
In one possible implementation, the method further includes: acquiring a second point cloud feature and a second text feature, wherein the second point cloud feature is obtained by extracting features from a point cloud cluster corresponding to a second object through the first network, the second text feature is obtained by extracting features from a text name of the second object through the second network, and the first object and the second object are different objects. For example, the first object and the second object may be different objects both appearing in the first image; the second object may also be an object in another image, for example an image belonging to the same training batch as the first image.
The first loss function is related to a first difference and a second difference, wherein the second difference is a difference between the first text feature and the second point cloud feature.
In one possible implementation, the first loss function has a positive correlation with the first difference and the first loss function has a negative correlation with the second difference.
In this scheme, the first loss function is constructed based on both the first difference and the second difference, so that during training the first model simultaneously learns to pull together the point cloud features and text features corresponding to the same object and to push apart the point cloud features and text features corresponding to different objects. In this way, after the second model is obtained by training, the second model can output a point cloud feature and a text feature with as small a difference as possible when extracting the point cloud feature and text feature corresponding to the same object, and can output a point cloud feature and a text feature with as large a difference as possible when extracting point cloud features and text features corresponding to different objects, so that the second model has good performance when used to identify the category of point cloud data.
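For illustration only, the following is a minimal sketch of a symmetric contrastive loss of this general shape, written in PyTorch. It is not taken from the application itself: it assumes the point cloud features and text features of a batch of objects have already been produced by the first and second networks, and uses cosine similarity (after normalization) as the inverse of the difference measure.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pc_feats: torch.Tensor, txt_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of aligned (point cloud, text) pairs.

    pc_feats, txt_feats: [N, D] features for N objects; row i of both tensors
    corresponds to the same object (a positive pair), and every other row is
    treated as a negative.
    """
    pc = F.normalize(pc_feats, dim=-1)    # unit-length point cloud features
    txt = F.normalize(txt_feats, dim=-1)  # unit-length text features
    logits = pc @ txt.t() / temperature   # [N, N] pairwise similarities
    targets = torch.arange(pc.size(0), device=pc.device)
    # The diagonal entries (same object) are pulled together while the
    # off-diagonal entries (different objects) are pushed apart, in both the
    # point-cloud-to-text and text-to-point-cloud directions.
    loss_p2t = F.cross_entropy(logits, targets)
    loss_t2p = F.cross_entropy(logits.t(), targets)
    return (loss_p2t + loss_t2p) / 2
```

The temperature value and the symmetric two-direction form are common choices in contrastive pre-training, not requirements stated by the application.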
In one possible implementation, the first model may further include a third network. The third network is used to extract features from images, that is, the input of the third network is an image and its output is the image's features. The third network is a neural network that is structurally different from, and independent of, the first network and the second network. The third network may be, for example, an attention network.
The method further comprises the steps of: and extracting the characteristics of the image blocks through a third network to obtain first image characteristics. The first loss function is related to a first difference and a third difference, wherein the third difference is a difference between the first image feature and the first point cloud feature.
In this scheme, the loss function is constructed by introducing both the difference between point cloud features and text features and the difference between point cloud features and image features, so that in the process of training the model based on this loss function, the model can learn to pull together, as much as possible, the point cloud feature, the text feature and the image feature corresponding to the same object, which can effectively improve the performance of the finally trained model.
In one possible implementation, the method further includes: acquiring second point cloud data, wherein the second point cloud data is point cloud data on which object identification is to be performed.
Inputting the second point cloud data into the first network in the second model to obtain a third point cloud characteristic;
and inputting a plurality of texts into the second network in the second model to obtain a plurality of text features. The plurality of text features correspond one-to-one to the plurality of texts. Moreover, the plurality of texts may be determined according to the acquisition scene of the second point cloud data. For example, in an autonomous driving scenario, the plurality of texts may include texts such as "car", "truck", "bicycle", "pedestrian", "roadblock" and "tree".
And determining a target text corresponding to the second point cloud data according to the difference between the third point cloud feature and each text feature in the plurality of text features, wherein the target text corresponds to the target text feature, and the target text feature is the text feature with the smallest difference with the third point cloud feature in the plurality of text features.
That is, after the difference between the third point cloud feature and each of the plurality of text features is determined, the differences may be sorted in ascending order, the text feature with the smallest difference from the third point cloud feature (i.e., the above-mentioned target text feature) is determined, and the text corresponding to that text feature is then taken as the target text.
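As an illustrative sketch only (not part of the claims), the following PyTorch snippet shows how such a nearest-text lookup could look, assuming cosine distance is used as the difference measure between features; the function and parameter names are invented for this example.

```python
import torch
import torch.nn.functional as F

def classify_point_cloud(pc_feat: torch.Tensor, text_feats: torch.Tensor,
                         texts: list) -> str:
    """Return the text whose feature differs least from the point cloud feature.

    pc_feat: [D] third point cloud feature produced by the first network;
    text_feats: [K, D] features of the K candidate texts produced by the
    second network; texts: the K candidate names (e.g. "car", "pedestrian").
    """
    pc = F.normalize(pc_feat, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    distances = 1.0 - txt @ pc              # cosine distance to each text feature
    return texts[int(distances.argmin())]   # target text = smallest difference
```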
In a possible implementation manner, the mapping relationship is obtained based on a device for acquiring the first point cloud data and a device for acquiring the first image. That is, the means for acquiring the first point cloud data and the means for acquiring the first image affect the conversion relationship between the first point cloud data and the first image.
In one possible implementation, the first point cloud data is acquired by a lidar, the first image is acquired by an image sensor, and the lidar and the image sensor are disposed on the same device.
In one possible implementation manner, determining a first point cloud cluster corresponding to the first object in the first point cloud data based on a mapping relationship between the first point cloud data and the first image and the image block includes: determining a second point cloud cluster corresponding to the image block in the first point cloud data based on a coordinate system conversion matrix between the laser radar and the image sensor; and clustering the points in the second point cloud cluster to obtain a first point cloud cluster, wherein the second point cloud cluster comprises the first point cloud cluster.
In this scheme, by clustering the points in the second point cloud cluster, the points belonging to the foreground object in the second point cloud cluster, that is, the points belonging to the first object, can be effectively extracted, so that points not belonging to the first object are removed, which ensures the accuracy of the finally obtained first point cloud cluster.
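Purely as an illustration of this kind of processing, the sketch below projects lidar points into the image with an extrinsic matrix and camera intrinsics, keeps the points that fall inside the image block, and then keeps the largest density-based cluster as the foreground. The matrix conventions, the use of DBSCAN and all parameter values are assumptions of this example, not values given by the application.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def image_block_to_cluster(points, lidar_to_cam, intrinsics, box):
    """points: [N, 3] lidar coordinates; lidar_to_cam: [4, 4] coordinate system
    conversion matrix; intrinsics: [3, 3] camera matrix; box: (u_min, v_min,
    u_max, v_max) bounds of the image block in pixels."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])    # [N, 4]
    cam = (lidar_to_cam @ homo.T).T[:, :3]                        # camera frame
    in_front = cam[:, 2] > 0
    uvw = (intrinsics @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)            # pixel coords
    u_min, v_min, u_max, v_max = box
    in_box = in_front & (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) \
                      & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    second_cluster = points[in_box]               # "second point cloud cluster"
    if len(second_cluster) == 0:
        return second_cluster
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(second_cluster)
    valid = labels >= 0                           # -1 marks noise points
    if not valid.any():
        return second_cluster
    keep = labels == np.bincount(labels[valid]).argmax()   # largest cluster
    return second_cluster[keep]                   # "first point cloud cluster"
```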
In one possible implementation, the first point cloud data and the first image are acquired by the same depth camera.
In one possible implementation manner, determining a first point cloud cluster corresponding to the first object in the first point cloud data based on a mapping relationship between the first point cloud data and the first image and the image block includes: extracting a foreground part in the image block to obtain a foreground image block; and determining a first point cloud cluster corresponding to the foreground image block in the first point cloud data based on the coordinate system conversion matrix of the depth camera.
In this scheme, the foreground part of the image is extracted first and the image of the foreground part is then converted into a point cloud cluster, so that objects other than the first object can be effectively removed before the image is converted into a point cloud cluster, which ensures the accuracy of the finally obtained first point cloud cluster.
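For illustration only, a minimal back-projection sketch under the usual pinhole camera model is given below; it assumes a boolean foreground mask for the image block and a depth map from the depth camera, and the names and conventions are invented for this example.

```python
import numpy as np

def foreground_block_to_cluster(depth, mask, intrinsics):
    """depth: [H, W] depth of the image block in meters; mask: [H, W] boolean
    foreground mask; intrinsics: [3, 3] camera matrix of the depth camera."""
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    v, u = np.nonzero(mask & (depth > 0))   # foreground pixels with valid depth
    z = depth[v, u]
    x = (u - cx) * z / fx                   # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)     # [M, 3] first point cloud cluster
```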
In one possible implementation manner, acquiring an image block of the first object in the first image and a text name corresponding to the first object includes: and carrying out target recognition on the first image through a second model to obtain an image block of the first object in the first image and a text name corresponding to the first object, wherein the second model is a pre-training model.
A second aspect of the present application provides a processing device for point cloud data, including:
the acquisition module is used for acquiring first point cloud data and a first image, wherein the first point cloud data and the first image are acquired from a first object;
the acquisition module is also used for acquiring an image block of the first object in the first image and a text name corresponding to the first object;
the processing module is used for determining a first point cloud cluster corresponding to the first object in the first point cloud data based on the mapping relation between the first point cloud data and the first image and the image block;
The processing module is further used for extracting features of the first point cloud cluster and the text name through a first model to obtain first point cloud features and first text features, wherein the first model comprises a first network and a second network, the first point cloud features are obtained by extracting features of the first point cloud cluster through the first network, and the first text features are obtained by extracting features of the text name through the second network;
the processing module is further configured to update the first model based on a first loss function to obtain a second model, where the first loss function is related to a first difference between the first point cloud feature and the first text feature.
In a possible implementation manner, the obtaining module is further configured to obtain a second point cloud feature and a second text feature, where the second point cloud feature is obtained by extracting features from a point cloud cluster corresponding to a second object by using a first network, the second text feature is obtained by extracting features from a text name of the second object by using a second network, and the first object and the second object are different objects;
the first loss function is related to a first difference and a second difference, the second difference being a difference between the first text feature and the second point cloud feature.
In one possible implementation, the first loss function has a positive correlation with the first difference and the first loss function has a negative correlation with the second difference.
In one possible implementation, the first model further includes a third network;
the processing module is further used for:
extracting the characteristics of the image blocks through a third network to obtain first image characteristics;
the first loss function is related to a first difference and a third difference, wherein the third difference is a difference between the first image feature and the first point cloud feature.
In one possible implementation manner, the obtaining module is further configured to obtain second point cloud data;
the processing module is also used for inputting the second point cloud data into the first network in the second model to obtain a third point cloud characteristic;
the processing module is also used for inputting a plurality of texts into a second network in the second model to obtain a plurality of text features;
and the processing module is further used for determining a target text corresponding to the second point cloud data according to the difference between the third point cloud feature and each text feature in the plurality of text features, wherein the target text corresponds to the target text feature, and the target text feature is the text feature with the smallest difference with the third point cloud feature in the plurality of text features.
In one possible implementation, the mapping relationship is based on the means for acquiring the first point cloud data and the means for acquiring the first image.
In one possible implementation, the first point cloud data is acquired by a lidar, the first image is acquired by an image sensor, and the lidar and the image sensor are disposed on the same device.
In one possible implementation, the processing module is further configured to:
determining a second point cloud cluster corresponding to the image block in the first point cloud data based on a coordinate system conversion matrix between the laser radar and the image sensor;
and clustering the points in the second point cloud cluster to obtain a first point cloud cluster, wherein the second point cloud cluster comprises the first point cloud cluster.
In one possible implementation, the first point cloud data and the first image are acquired by the same depth camera.
In one possible implementation, the processing module is further configured to:
extracting a foreground part in the image block to obtain a foreground image block;
and determining a first point cloud cluster corresponding to the foreground image block in the first point cloud data based on the coordinate system conversion matrix of the depth camera.
In one possible implementation, the processing module is further configured to:
and carrying out target recognition on the first image through a second model to obtain an image block of the first object in the first image and a text name corresponding to the first object, wherein the second model is a pre-training model.
A third aspect of the present application provides a processing device for point cloud data, which may include a processor, the processor being coupled to a memory, the memory storing program instructions, the program instructions stored in the memory, when executed by the processor, implementing the method of the first aspect or any implementation manner of the first aspect. For the steps in each possible implementation manner of the first aspect executed by the processor, reference may be specifically made to the first aspect, which is not described herein.
A fourth aspect of the present application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of any of the implementations of the first aspect.
A fifth aspect of the present application provides circuitry comprising processing circuitry configured to perform the method of any of the implementations of the first aspect described above.
A sixth aspect of the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of any of the implementations of the first aspect described above.
A seventh aspect of the present application provides a chip system comprising a processor configured to support a server or the above point cloud data processing device in implementing the functions involved in any implementation manner of the first aspect, for example, sending or processing the data and/or information involved in the method. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the server or the communication device. The chip system may be composed of chips, or may include a chip and other discrete devices.
The advantages of the second to seventh aspects may be referred to the description of the first aspect, and are not described here again.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of another convolutional neural network according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of processing point cloud data in the related art;
FIG. 4 is a schematic diagram of a system architecture 400 according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device 101 according to an embodiment of the present application;
fig. 6 is a flow chart of a method for processing point cloud data according to an embodiment of the present application;
FIG. 7 is a schematic diagram of identifying an object in an image by a second model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a system architecture according to an embodiment of the present disclosure;
fig. 9A is a schematic view of object recognition of an outdoor open world according to an embodiment of the present application;
fig. 9B is a schematic view of object recognition of an indoor open world according to an embodiment of the present application;
FIG. 10 is a schematic flow chart for locating and identifying 3D objects in the open world according to an embodiment of the present application;
fig. 11 is a schematic flow chart of tri-modal data alignment in an outdoor scene according to an embodiment of the present application;
fig. 12 is a schematic flow chart of tri-modal data alignment in an indoor scene according to an embodiment of the present application;
FIG. 13 is a schematic flow chart of performing contrast pre-training on text, image and point cloud data using a tri-modal contrast pre-training tri-tower model according to an embodiment of the present application;
fig. 14 is a schematic flow chart of a three-mode joint feature discrimination provided in the embodiment of the present application;
Fig. 15 is a schematic structural diagram of a processing device for point cloud data according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a chip according to an embodiment of the present disclosure;
fig. 18 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can appreciate, with the development of technology and the appearance of new scenes, the technical solutions provided in the embodiments of the present application are applicable to similar technical problems.
The terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a particular sequential or chronological order. It is to be understood that terms used in this way are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects of the same nature when describing the embodiments of the application. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article or apparatus.
For ease of understanding, some technical terms related to embodiments of the present application are described below.
(1) Open World (Open-World)
The open world refers to a generalized open scene for which no prior labels of the environment are provided, that is, an open scene in which the individual objects appearing in the environment are not labeled.
(2) Point cloud data
Point cloud data refers to a set of vectors in a three-dimensional coordinate system. Typically, the point cloud data is obtained by scanning the environment with a laser scanner (e.g., a lidar). The point cloud data is recorded in the form of points, and each point contains three-dimensional coordinates. In some cases, points in the point cloud data may contain color information or reflection intensity information. The reflected intensity information refers to the echo intensity collected by the laser scanner receiving device, and is related to the surface material, roughness, incident angle direction of the target, the emission energy of the instrument and the laser wavelength.
(3) Laser radar (lidar)
A lidar is a radar system that detects characteristic quantities of a target, such as its position and speed, by emitting a laser beam. The working principle of a lidar is to emit a detection signal (a laser beam) towards a target, then compare the received signal (the target echo) reflected from the target with the emitted signal and process it appropriately, so as to obtain relevant information about the target, such as its distance, azimuth, height, speed, attitude and shape, thereby realizing detection of targets such as a road surface or obstacles on the ground. Generally, a lidar may be composed of a laser transmitter, an optical receiver, an information processing system and the like, where the laser transmitter converts electric pulses into optical pulses and emits them, and the optical receiver restores the optical pulses reflected from the target into electric pulses and sends them to the information processing system for processing.
(4) Image sensor
An image sensor uses the photoelectric conversion function of photoelectric devices to convert the light image on its light-sensing surface into an electrical signal in a corresponding proportional relationship with the light image. In contrast to "point" light-sensitive elements such as photodiodes and phototransistors, an image sensor is a functional device that divides the light image on its light-receiving surface into many small units and converts them into usable electrical signals. In short, an image sensor is a device for acquiring images.
(5) Depth camera
A depth camera is a type of camera that is capable of simultaneously acquiring an image of a target and a distance between the target and the depth camera. Based on the depth camera, the distance between each pixel point in the image and the depth camera can be obtained, and the three-dimensional space coordinate of each pixel point in the image can be obtained by adding the two-dimensional coordinate of the pixel point in the image.
That is, the depth camera can acquire a spatial distance between each object in the image and the depth camera in addition to the image, as compared to the conventional camera.
(6) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes x_s (i.e., input data) and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

h_{W,b}(x) = f(W^T x) = f( Σ_{s=1}^{n} W_s · x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be an area composed of several neural units.
(7) Deep neural network (Deep Neural Network DNN)
Deep neural networks (DNN), also known as multi-layer neural networks, can be understood as neural networks with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all intermediate layers are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Since a DNN has a large number of layers, the number of coefficients W and offset vectors b is also large. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: suppose that, in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_jk. It should be noted that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final objective is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
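As a toy illustration of the per-layer operation y = α(W·x + b) described above (layer sizes and the tanh activation are arbitrary choices for this sketch):

```python
import numpy as np

def layer_forward(x, W, b, activation=np.tanh):
    """One fully connected layer: y = activation(W @ x + b)."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # output layer parameters
y = layer_forward(layer_forward(x, W1, b1), W2, b2)   # forward pass
```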
(8) Convolutional neural network (Convolutional Neural Network, CNN)
A convolutional neural network is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving an input image or feature map with a trainable filter. A convolutional layer refers to a layer of neurons in the convolutional neural network that performs convolution processing on the input signal (for example, the first convolutional layer and the second convolutional layer in this embodiment). In a convolutional layer of a convolutional neural network, one neural unit may be connected to only part of the neural units of the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangular pattern. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of the image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; in general, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
Specifically, as shown in fig. 1, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
The structure formed by the convolutional layer/pooling layer 120 and the neural network layer 130 may correspond to the first convolutional layer and the second convolutional layer described in this application. The input layer 110 is connected to the convolutional layer/pooling layer 120, the convolutional layer/pooling layer 120 is connected to the neural network layer 130, the output of the neural network layer 130 may be input to an activation layer, and the activation layer may perform nonlinear processing on the output of the neural network layer 130.
Convolutional layer/pooling layer 120. Convolutional layer: as shown in fig. 2, the convolutional layer/pooling layer 120 may include, for example, layers 121-126. In one implementation, layer 121 is a convolutional layer, 122 is a pooling layer, 123 is a convolutional layer, 124 is a pooling layer, 125 is a convolutional layer and 126 is a pooling layer; in another implementation, 121 and 122 are convolutional layers, 123 is a pooling layer, 124 and 125 are convolutional layers, and 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking the convolutional layer 121 as an example, the convolutional layer 121 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output of a single depth dimension; however, in most cases a single weight matrix is not used, and multiple weight matrices of the same dimension are applied instead. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, yet another weight matrix is used to blur unwanted noise in the image, and so on. The dimensions of these weight matrices are the same, so the dimensions of the feature maps they extract are also the same, and the extracted feature maps of the same dimension are combined to form the output of the convolution operation.
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 100 increases, features extracted by the later convolutional layers (e.g., 126) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer: since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. Among the layers 121-126 illustrated at 120 in fig. 1, there may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
Neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate an output of the required number of classes, or a group of such outputs. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in fig. 1) and an output layer 140, where the parameters included in the multiple hidden layers may be pre-trained according to the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers of the neural network layer 130, the final layer of the overall convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the overall convolutional neural network 100 (e.g., propagation from 110 to 140 in fig. 2) is completed, back propagation (e.g., propagation from 140 to 110 in fig. 2) begins to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the desired result.
It should be noted that, the convolutional neural network 100 shown in fig. 1 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network prediction models, for example, multiple convolutional layers/pooling layers shown in fig. 2 are parallel, and the features extracted respectively are all input to the full neural network layer 130 for processing.
(9) Attention network
An attention network is a network model that uses an attention mechanism to speed up model training. Currently, a typical attention network is the Transformer model. A model applying the attention mechanism can give different weights to each part of the input sequence, thereby extracting the more important feature information in the input sequence, so that the final output of the model is more accurate.
In deep learning, the attention mechanism may be implemented by a weight vector describing importance: when an element is predicted or inferred, the association between the element and other elements is determined by a weight vector. For example, for a certain pixel in an image or a certain word in a sentence, the correlation between the target element and other elements may be quantitatively estimated using the attention vector, and the weighted sum of the attention vectors is taken as an approximation of the target value.
The attention mechanism in deep learning simulates the attention mechanism of the human brain. For example, when a human views a picture, although the human eye can see the full view of the picture, when looking deeply and carefully the eye focuses on only a portion of the picture, and at that moment the brain mainly pays attention to this small patch. That is, when a human carefully observes an image, the attention the brain pays to different parts of the whole image is not balanced but is distinguished by certain weights, which is the core idea of the attention mechanism.
In brief, human vision processing systems tend to selectively focus on certain portions of an image, while ignoring other irrelevant information, thereby facilitating perception of the human brain. Similarly, in deep learning attention mechanisms, certain portions of the input may be more relevant than others in some questions involving language, speech, or vision. Thus, by means of the attention mechanism in the attention model, the attention model can be caused to perform different processing on different parts of the input data, such that the attention model only dynamically focuses on data related to the task.
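As an illustration of how such importance weights can be computed, the following is a minimal sketch of scaled dot-product attention, the building block of Transformer-style attention networks; it is a generic example, not a description of the third network used in this application.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: [N, D]. Each output row is a weighted sum of the rows of v,
    where the weights express how relevant each element is to the query."""
    scores = q @ k.t() / (q.size(-1) ** 0.5)   # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)        # importance weights, sum to 1
    return weights @ v
```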
(10) Multilayer perceptron (Multilayer Perceptron, MLP)
The multi-layer perceptron is a feed-forward artificial neural network model capable of mapping multiple data sets of an input onto a single data set of an output. A typical multi-layer perceptron includes three network layers: an input layer, a hidden layer and an output layer. Also, in a multi-layer perceptron, the different network layers are fully connected (i.e., any neuron in the upper layer is connected to all neurons in the lower layer).
(11) Loss function
When training a neural network, because the output of the neural network is expected to be as close as possible to the truly desired value, the predicted value of the current network can be compared with the truly desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which is the purpose of the loss function (loss function) or objective function (objective function); these are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the neural network becomes a process of reducing this loss as much as possible.
(12) Back propagation algorithm
During training, a neural network can use the back propagation (BP) algorithm to correct the parameter values in the initial prediction model, so that the error loss of the prediction model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial prediction model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain the parameters of the optimal prediction model, such as the weight matrix.
(13) Gradient descent method (Gradient Descent)
Gradient descent is a first-order optimization algorithm commonly used in machine learning to iteratively approach the prediction model with minimum deviation. To find a local minimum of a function using the gradient descent method, one must iteratively search, with a specified step size, in the direction opposite to the gradient (or approximate gradient) at the current point of the function. The gradient descent method is one of the most commonly used methods for solving the prediction model parameters of machine learning algorithms, that is, unconstrained optimization problems.
Specifically, when the minimum value of the loss function is to be solved, the minimum loss function and the prediction model parameter values can be obtained through step-by-step iterative solution with the gradient descent method. Conversely, if the maximum of the loss function needs to be solved, the iteration should use the gradient ascent method.
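As a toy illustration of the idea (the objective function, step size and number of iterations are arbitrary choices for this sketch):

```python
def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping opposite to its gradient.

    grad_fn: returns the gradient at a point; x0: starting point; lr: step size.
    """
    x = x0
    for _ in range(steps):
        x = x - lr * grad_fn(x)   # move against the gradient
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # converges towards 3
```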
(14) Softmax function
The Softmax function, also called the normalized exponential function, is a generalization of the logistic function. The Softmax function can transform a K-dimensional vector Z containing arbitrary real numbers into another K-dimensional vector σ(Z), such that each element of the transformed vector σ(Z) lies in the range (0, 1) and the sum of all elements is 1. The Softmax function may be calculated as shown in Equation 1:

σ(z)_j = e^(z_j) / Σ_(k=1)^(K) e^(z_k), j = 1, 2, ..., K (Equation 1)

where σ(z)_j represents the value of the j-th element of the vector after the Softmax transform, z_j represents the value of the j-th element of the vector Z, z_k represents the value of the k-th element of the vector Z, and Σ represents summation.
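A short numerical check of Equation 1 (subtracting max(z) before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Softmax of a K-dimensional vector: each output lies in (0, 1), sum is 1."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

sigma = softmax(np.array([1.0, 2.0, 3.0]))  # approx. [0.090, 0.245, 0.665]
```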
(15) Pre-training model
The pre-training model is a well-trained, stored network that has been trained on a large data set.
(16) Density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, DBSCAN)
DBSCAN is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a spatial database containing noise.
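A minimal usage sketch with the scikit-learn implementation (the point set and the eps/min_samples values are arbitrary and only for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(200, 3)                       # toy 3D point set
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(points)
# labels[i] is the cluster index of point i; -1 marks noise points that do not
# belong to any sufficiently dense region.
```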
(17) Open domain vocabulary
The open domain vocabulary is a generalized dictionary library based on general world understanding.
With the rapid development of artificial intelligence technology, good results can be achieved by using artificial-intelligence-based models (such as attention networks) to locate and identify 3D objects. However, when 3D objects are located and identified by a model, a large amount of labeled data (i.e., point cloud data in which the category of each object is annotated) is often required to train the model so that it performs well. Because 3D scenes in the open world are extremely broad, practical applications can only bear the annotation cost of part of the data, which limits the training of 3D object localization and recognition tasks on open-world semantics. Therefore, how to build an open-world localization and recognition model in the absence of data annotation has become a development direction that draws attention in the industry.
Referring to fig. 3, fig. 3 is a schematic flow chart of processing point cloud data in the related art. As shown in fig. 3, in the related art, first, point cloud data corresponding to an object is manually selected, and the point cloud data is projected into a plurality of depth maps according to a plurality of camera positions. For example, the point cloud data corresponding to one aircraft is projected as a depth map at a plurality of angles.
Then, the plurality of depth maps and the text vocabulary corresponding to the depth maps are respectively input into an image feature extractor and a text feature extractor in the model for feature extraction. After the image feature extractor extracts a plurality of image features, the image features are input into an adaptation network composed of several fully convolutional layers, and the adaptation network outputs a fused feature of the image features under multiple viewing angles, so as to perform contrastive learning training of the fused feature and the text features.
In the test process of the model, a depth map obtained by projecting the point cloud data and a test text can be input into the model, the feature distance between the image feature and the text feature output by the model is calculated, and the text corresponding to the point cloud data is then determined, thereby realizing identification of the object in the point cloud data. However, the related art needs to project the point cloud data into depth maps, which often loses much of the original 3D structure, so the actual performance of the model tends to be poor. Moreover, object recognition in the point cloud data is performed based on the input of point cloud data of a single object (that is, the input data is point cloud data containing only a single object), which is difficult to generalize to 3D scene understanding.
In view of this, the embodiments of the application provide a method for processing point cloud data. Based on point cloud data and an image acquired for the same object, the object included in the image is first identified, and the point cloud cluster in the point cloud data corresponding to the object in the image is then determined through the mapping relationship between the point cloud data and the image. In this way, in the model training stage, features of the point cloud cluster and of the text corresponding to the same object can be extracted by the model, so that contrastive learning training of point cloud features and text features is realized, and object identification in the point cloud data can be realized based on the model obtained through training. In this scheme, the image serves as an intermediary between the point cloud data and the text, and the text corresponding to the point cloud data is determined in advance through the mapping relationship between the point cloud data and the image that is established at the acquisition stage, so that alignment between the point cloud data and the text is realized. This avoids the loss of original data structure caused by projecting the point cloud data into depth maps, facilitates object understanding in the 3D scene, and can effectively improve the performance of the model.
For easy understanding, a scenario and a system architecture to which the method for processing point cloud data provided in the embodiments of the present application is applied are described below.
In one possible implementation manner, the method for processing point cloud data provided by the embodiment may be applied to an autopilot scenario. For example, a point cloud data acquisition device and an image acquisition device are deployed on an automatic driving vehicle, the automatic driving vehicle acquires point cloud data and images of the same object in the driving process, and the acquired point cloud data and images are uploaded to a server. The processing method of the point cloud data provided by the embodiment is adopted on the server to process the point cloud data and the image so as to train and obtain a model which can be used for executing object identification in the point cloud data. And then, the server transmits the trained model to the automatic driving vehicle, and the automatic driving vehicle identifies objects in the open world through the model in the automatic driving process, so that the automatic driving vehicle can execute corresponding automatic driving strategies based on the identification result of the objects.
In addition, after the point cloud data and the image are acquired by the automatic driving vehicle, the point cloud data and the image may be processed by the automatic driving vehicle itself by adopting the processing method of the point cloud data provided by the embodiment, so as to train and obtain a model capable of being used for executing object identification in the point cloud data. That is, the autonomous vehicle no longer needs to interact with the server, but rather training and application of the model is accomplished by the autonomous vehicle.
In another possible implementation manner, the method for processing point cloud data provided by the embodiment may be applied to an object recognition scene of a robot. For example, a point cloud data acquisition device and an image acquisition device are deployed on an indoor or outdoor robot, the robot acquires point cloud data and images of the same object in the running process, and the acquired point cloud data and images are uploaded to a server. The processing method of the point cloud data provided by the embodiment is adopted on the server to process the point cloud data and the image so as to train and obtain a model which can be used for executing object identification in the point cloud data. Then, the server issues the trained model to the robot, and the robot identifies objects (such as indoor objects or outdoor obstacles) in the open world through the model in the driving process, so that the robot can execute a corresponding obstacle avoidance strategy based on the identification result of the objects.
Similarly, after the robot collects the point cloud data and the image, the point cloud data and the image may be processed by the robot itself by adopting the processing method of the point cloud data provided in the embodiment, so as to train to obtain a model capable of being used for executing object recognition in the point cloud data, instead of uploading the point cloud data and the image to the server. That is, the robot no longer needs to interact with the server, but rather the robot completes the training and application of the model.
In another possible implementation manner, the method for processing point cloud data provided by the embodiment may be applied to an object recognition scene of an unmanned aerial vehicle. For example, a point cloud data acquisition device and an image acquisition device are deployed on the unmanned aerial vehicle, the unmanned aerial vehicle acquires point cloud data and images of the same object in the flight process, and the acquired point cloud data and images are uploaded to a server. The processing method of the point cloud data provided by the embodiment is adopted on the server to process the point cloud data and the image so as to train and obtain a model which can be used for executing object identification in the point cloud data. Then, the server issues the model obtained through training to the unmanned aerial vehicle, and the unmanned aerial vehicle identifies objects in the open world (such as trees, houses and other obstacles encountered in the flight process) through the model in the flight process, so that the unmanned aerial vehicle can execute a corresponding obstacle avoidance strategy based on the identification result of the objects.
In addition, in the case that the user controls the unmanned aerial vehicle through the user equipment such as a remote controller, a smart phone, a tablet personal computer, a notebook computer or a personal computer, the unmanned aerial vehicle can return the point cloud data and the image acquired through the point cloud data acquisition device and the image acquisition device to the user equipment. In this way, the user can obtain a model through the point cloud data and the image training acquired from the user equipment; or, the user can feed back the point cloud data and the image to the server through the user equipment, and the server trains to obtain a model and sends the model to the user equipment. In the application stage of the model, the unmanned aerial vehicle can return the acquired point cloud data to the user equipment in real time, the user equipment carries out object recognition on the point cloud data through the model obtained through training, and an object recognition result is fed back to the unmanned aerial vehicle. Or after the user equipment acquires the model, the model can be deployed on the unmanned aerial vehicle, and the unmanned aerial vehicle can execute object recognition in real time through the model in the flight process.
Referring to fig. 4, a schematic diagram of a system architecture 400 is provided in an embodiment of the present application. As shown in fig. 4, in the system architecture 400, the execution device 410 may be implemented by one or more servers, optionally in conjunction with other computing devices, such as: data storage, routers, load balancers and other devices; the execution device 410 may be disposed on one physical site or distributed across multiple physical sites. The execution device 410 may use the data in the data storage system 420, or call the program code in the data storage system 420 to implement the processing method of the point cloud data provided in the embodiment of the application, so as to obtain a model.
The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with execution device 410. Each local device may represent any computing device, such as a robot, smart car, personal computer, computer workstation, smart phone, tablet, smart camera, etc.
The local device of each user may interact with the execution device 410 through a communication network of any communication mechanism/communication standard, which may be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
In one implementation, the execution device 410 is configured to implement the method for processing point cloud data provided in the embodiments of the present application, and send the obtained model to the local device 401 and the local device 402 through a communication network, so that the local device 401 and the local device 402 can implement deployment and operation of the model.
In another implementation, one or more aspects of the execution device 410 may be implemented by each local device, for example, the local device 401 may provide local data or feedback calculation results for the execution device 410, or perform a method for processing point cloud data provided in an embodiment of the present application.
It should be noted that all functions of the execution device 410 may also be implemented by the local device. For example, local device 401 implements the functionality of executing device 410 and providing services to its own users, or to users of local device 402.
In general, the method for processing point cloud data provided in the embodiments of the present application may be applied to an electronic device, for example, the above-mentioned executing device 410, the local device 401, or the local device 402.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device 101 according to an embodiment of the present application. As shown in fig. 5, the electronic device 101 includes a processor 103, and the processor 103 is coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. The electronic device 101 further includes a display adapter 107, which may drive a display 109; the display 109 is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus via a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (e.g., a touch screen), an external memory 121 (e.g., a hard disk, floppy disk, optical disk, or USB drive), a multimedia interface, a transceiver 123 (which may transmit and/or receive radio communication signals), a camera 155 (which may capture still and moving digital video images), and an external USB port 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.
The processor 103 may be any conventional processor including a reduced instruction set computing (reduced instruction set Computing, RISC) processor, a complex instruction set computing (complex instruction set computing, CISC) processor, or a combination thereof. In the alternative, the processor may be a dedicated device such as an ASIC.
Electronic device 101 may communicate with software deploying server 149 through network interface 129. The network interface 129 is illustratively a hardware network interface, such as a network card. The network 127 may be an external network, such as the Internet, or an internal network, such as an Ethernet or virtual private network (virtual private network, VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network, cellular network, or the like.
The hard disk drive interface 131 is coupled to the system bus 105 and is connected to the hard disk drive 133. The internal memory 135 is coupled to the system bus 105. The data running in the internal memory 135 may include an operating system (OS) 137, applications 143, and a schedule of the electronic device 101.
The operating system includes a shell 139 and a kernel 141. The shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system and manages the interaction between the user and the operating system: it waits for user input, interprets the user input for the operating system, and processes the various output results of the operating system.
The kernel 141 is made up of those parts of the operating system that manage memory, files, peripherals, and system resources. The kernel 141 interacts directly with the hardware; the operating system kernel typically runs processes and provides inter-process communication, CPU time-slice management, interrupt handling, memory management, I/O management, and the like.
Referring to fig. 6, fig. 6 is a flow chart of a processing method of point cloud data according to an embodiment of the present application. As shown in fig. 6, the processing method of the point cloud data includes the following steps 601 to 605.
In step 601, first point cloud data and a first image are acquired, where the first point cloud data and the first image are acquired from a first object.
In this embodiment, the first point cloud data and the first image may be acquired from the same object (i.e., the first object) at the same time point. That is, the first object is included in both the first point cloud data and the first image. In addition, in addition to the first object, other objects may be included in the first point cloud data and the first image, that is, a plurality of objects may be included in the first point cloud data and the first image.
For example, in an autopilot scenario, both the first point cloud data and the first image may be acquired by the autonomous vehicle at the same time. A vehicle in front of the autonomous vehicle (i.e., the first object) may be included in the first point cloud data and the first image, and obstacles near the autonomous vehicle, such as pedestrians, water-filled barriers, or railings, may also be included in the first point cloud data and the first image.
The first point cloud data may be acquired by a laser radar, the first image is acquired by an image sensor, and the laser radar and the image sensor are disposed on the same device, which may be an autonomous vehicle, for example. On an autonomous vehicle, both the lidar and the image sensor may be used to acquire objects in the same direction so that the acquired first point cloud data and the first image can include the same object.
Alternatively, the first point cloud data and the first image may be acquired by the same depth camera, which may be deployed on a robot, for example. The depth camera can acquire the first image and, for each pixel point in the first image, the spatial distance between that pixel point and the depth camera. Therefore, based on these spatial distances, each pixel point in the first image can be converted into a point in the point cloud, thereby obtaining the first point cloud data.
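As an illustration only, the following minimal sketch shows how such a conversion can be performed for a pinhole depth camera; the intrinsic parameters fx, fy, cx, cy and the function name are assumptions for illustration and are not taken from this embodiment.

```python
import numpy as np

def depth_image_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, in metres) into an N x 3 point cloud.

    fx, fy, cx, cy are assumed pinhole intrinsics of the depth camera;
    pixels with zero depth (no return) are discarded.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # points in the camera coordinate frame
```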
Step 602, obtaining an image block of the first object in the first image and a text name corresponding to the first object.
It will be appreciated that, since the first image generally includes other objects in addition to the first object, the image block of the first object in the first image may be acquired in this step to locate the actual position of the first object in the first image. In addition, the text name corresponding to the first object can be further determined while the image block of the first object in the first image is determined. In simple terms, this step is actually to align the image with the text, and determine the text semantics of the image block in the first image.
In this step, the second model may be used to perform object recognition on the first image, so as to obtain an image block of the first object in the first image and a text name corresponding to the first object. That is, the second model can identify the object included in the first image, mark the image block of the identified first object in the first image, and give the text name corresponding to the first object. The second model is a pre-training model and is used for executing target recognition on the image. The second model may specifically be a convolutional neural network structure, which is not specifically limited in this embodiment.
For example, referring to fig. 7, fig. 7 is a schematic diagram of identifying an object in an image through a second model according to an embodiment of the present application. As shown in fig. 7, the first image is input into a pre-trained second model, and one image block (i.e., an image block containing an automobile) in the first image may be output from the second model, and a text name "automobile" corresponding to the image block may be output. In addition, the first image includes objects such as trees, roadblocks, street lamps, fire hydrants and the like in addition to the automobiles. Therefore, when the second model has the function of identifying the above objects, the second model can also output image blocks corresponding to objects such as trees, road blocks, street lamps, fire hydrants, and the like, and text names corresponding to the objects.
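As a rough, non-authoritative sketch of this step, the snippet below assumes a hypothetical pre-trained detector callable standing in for the second model; the callable, its output format, and the score threshold are illustrative assumptions.

```python
def recognize_objects(image, vocabulary, detector, score_threshold=0.5):
    """Return (image_block, text_name) pairs for the objects found in an image.

    `detector(image, vocabulary)` is a hypothetical callable standing in for the
    pre-trained second model; it is assumed to yield tuples of
    (x0, y0, x1, y1, label, score) for an open vocabulary of text names.
    """
    results = []
    for x0, y0, x1, y1, label, score in detector(image, vocabulary):
        if score < score_threshold:
            continue
        block = image[y0:y1, x0:x1]      # crop the image block of the object
        results.append((block, label))   # label is the text name, e.g. "automobile"
    return results
```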
Step 603, determining a first point cloud cluster corresponding to the first object in the first point cloud data based on the mapping relationship between the first point cloud data and the first image and the image block.
Because the first point cloud data and the first image are acquired by acquiring the same object at the same time point, a mapping relation exists between the first point cloud data and the first image, and the mapping relation can realize conversion between points in the first point cloud data and pixel points in the first image. That is, based on the mapping relation, the pixel point corresponding to the point in the first point cloud data in the first image can be determined, and the pixel point corresponding to the pixel point in the first image in the first point cloud data can also be determined.
Since the first point cloud data and the first image may each include a plurality of objects, after determining an image block corresponding to the first object in the first image, a point corresponding to the image block in the first point cloud data may be determined based on the mapping relationship, so as to obtain a first point cloud cluster. The first point cloud cluster comprises a plurality of points, and the points included in the first point cloud cluster are part of points in the first point cloud data, namely the first point cloud cluster is a subset of the first point cloud data.
Specifically, the mapping relationship may be obtained based on the device that collects the first point cloud data and the device that collects the first image. That is, the means for acquiring the first point cloud data and the means for acquiring the first image affect the conversion relationship between the first point cloud data and the first image.
For example, where the first point cloud data is acquired by a lidar and the first image is acquired by an image sensor, there may be a coordinate system conversion matrix between the lidar and the image sensor. The coordinate system conversion matrix is used for realizing conversion between a two-dimensional coordinate system of the image and a three-dimensional coordinate system of the point cloud data. In this way, based on the coordinate system conversion matrix between the laser radar and the image sensor, a second point cloud cluster corresponding to the image block in the first point cloud data can be determined; then, clustering is performed on the points in the second point cloud cluster, so that a first point cloud cluster can be obtained, wherein the second point cloud cluster comprises the first point cloud cluster, namely the first point cloud cluster is a subset of the second point cloud cluster. The clustering processing of the points in the second point cloud cluster may be performed by using a DBSCAN clustering algorithm, so that the points belonging to the foreground object in the second point cloud cluster can be extracted.
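A minimal sketch of this two-stage procedure is given below, assuming NumPy and scikit-learn are available and that the coordinate system conversion matrix is a 3×4 (or 4×4) projection matrix mapping lidar coordinates to pixel coordinates; the matrix name G_out, the DBSCAN parameters, and the keep-largest-cluster rule are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def frustum_points_for_box(points, G_out, box):
    """Select the second point cloud cluster: lidar points whose projection
    falls inside the 2D image block `box` = (u0, v0, u1, v1)."""
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])   # N x 4 homogeneous lidar points
    proj = (G_out[:3] @ homo.T).T                 # N x 3: (u*z, v*z, z)
    z = proj[:, 2]
    u, v = proj[:, 0] / z, proj[:, 1] / z
    u0, v0, u1, v1 = box
    mask = (z > 0) & (u >= u0) & (u < u1) & (v >= v0) & (v < v1)
    return points[mask]

def foreground_cluster(frustum_points, eps=0.5, min_samples=10):
    """Extract the first point cloud cluster (the foreground object) from the
    frustum cluster with DBSCAN, keeping the largest non-noise cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(frustum_points)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return frustum_points
    keep = np.bincount(valid).argmax()
    return frustum_points[labels == keep]
```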
It will be appreciated that, after the image block corresponding to the first object is determined, the image block may include some other objects besides the first object; that is, the first object does not occupy the entire image block. In this case, when the second point cloud cluster corresponding to the image block in the first point cloud data is determined based on the coordinate system transformation matrix, the second point cloud cluster also includes parts of other objects in addition to the first object.
Therefore, in this embodiment, by performing clustering processing on the points in the second point cloud cluster, the points belonging to the foreground object in the second point cloud cluster can be effectively extracted, that is, the points belonging to the first object are extracted, so that the points not belonging to the first object are removed, and the accuracy of the finally obtained first point cloud cluster is ensured.
For another example, in the case where the first point cloud data and the first image are both acquired by the depth camera, the depth camera may also have a coordinate system conversion matrix for indicating a conversion relationship between the first point cloud data and the first image acquired in the depth direction.
Similarly, since the image block may include some other objects in addition to the first object, after the image block is obtained, a foreground portion of the image block may first be extracted to obtain a foreground image block, where the foreground image block includes only the first object. Then, the first point cloud cluster corresponding to the foreground image block in the first point cloud data is determined based on the coordinate system conversion matrix of the depth camera. The foreground portion of the image block may be extracted by using a GrabCut segmentation algorithm, for example.
In this solution, the foreground portion of the image is first extracted and then converted into a point cloud cluster, so that objects other than the first object can be effectively excluded before the image is converted into the point cloud cluster, which ensures the accuracy of the finally obtained first point cloud cluster. A sketch of the foreground extraction step follows.
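The following is a minimal sketch of the GrabCut-based foreground extraction, assuming OpenCV is available; initialising GrabCut with the image block minus a small border as the probable-foreground rectangle is an illustrative assumption.

```python
import cv2
import numpy as np

def extract_foreground_block(image_block):
    """Extract the foreground part of an 8-bit BGR image block with GrabCut.

    The whole block minus a small border is assumed to be the initial
    probable-foreground rectangle; background pixels are zeroed out.
    """
    h, w = image_block.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    rect = (2, 2, max(w - 4, 1), max(h - 4, 1))   # (x, y, width, height)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_block, mask, rect, bgd_model, fgd_model, 5,
                cv2.GC_INIT_WITH_RECT)
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    return image_block * fg[:, :, None]           # keep only foreground pixels
```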
In step 604, features of the first point cloud cluster and the text name are extracted through a first model, so as to obtain a first point cloud feature and a first text feature, wherein the first model comprises a first network and a second network, the first point cloud feature is obtained by extracting features of the first point cloud cluster through the first network, and the first text feature is obtained by extracting features of the text name through the second network.
In this embodiment, the first network and the second network are neural networks that are different in structure and independent of each other. And the input of the first network is a first point cloud cluster, and the output of the first network is a first point cloud characteristic corresponding to the first point cloud cluster. The input of the second network is the text name of the first object, and the output of the second network is the first text feature corresponding to the text name.
The first network may be, for example, an MLP, and the second network may be, for example, an attention network (such as a Transformer).
In step 605, the first model is updated based on a first loss function to obtain a second model, where the first loss function is related to a first difference between the first point cloud feature and the first text feature.
In this embodiment, updating the first model based on the first loss function may specifically be updating parameters of the first network and the second network in the first model.
After the first point cloud feature and the first text feature are extracted by the first model, a first loss function may be constructed based on the first point cloud feature and the first text feature. The first loss function may have a positive correlation with the first difference between the first point cloud feature and the first text feature. That is, the larger the first difference between the first point cloud feature and the first text feature, the larger the first loss function; the smaller the first difference, the smaller the first loss function.
In this way, when the first model is updated based on the first loss function, the update target of the first model is to reduce the first loss function as much as possible, that is, to reduce the difference between the first point cloud feature and the first text feature corresponding to the same object. By training the first model based on the above steps, the first model learns, during training, to pull the point cloud features and text features corresponding to the same object as close together as possible, so that the second model obtained by training can output point cloud features and text features with the smallest possible difference when extracting features corresponding to the same object.
In the embodiment of the application, based on the point cloud data and the image acquired for the same object, the object included in the image is firstly identified, and then the point cloud cluster corresponding to the object in the image in the point cloud data is determined through the mapping relation between the point cloud data and the image. In this way, in the model training stage, the characteristics of the point cloud clusters and the text corresponding to the same object can be extracted through the model, so that the contrast learning training of the point cloud data characteristics and the text characteristics is realized, and the object identification in the point cloud data can be realized based on the model obtained through training.
In the scheme, the image is adopted as an intermediary between the point cloud data and the text, and the text corresponding to the point cloud data is determined in advance through the mapping relation between the point cloud data and the image brought by the acquisition stage, so that the alignment between the point cloud data and the text is realized, the original data structure loss brought by projecting the point cloud data into the depth map is avoided, the object understanding in the 3D scene is facilitated, and the performance of the model can be effectively improved.
The first loss function above is constructed based on the first difference, so that the first model learns to pull the point cloud feature and the text feature corresponding to the same object closer during training. In some embodiments, the first loss function may also be constructed in other ways, so that the first model learns to pull closer the point cloud features and text features corresponding to the same object while also learning to push apart the point cloud features and text features corresponding to different objects.
Illustratively, on the basis of the above steps 601 to 605, the method for processing point cloud data may further include the following step: acquiring a second point cloud feature and a second text feature. The second point cloud feature is obtained by the first network extracting features from the point cloud cluster corresponding to a second object, and the second text feature is obtained by the second network extracting features from the text name of the second object. The first object and the second object are different objects. For example, the first object and the second object may be different objects that both appear in the first image; alternatively, the second object may be an object in another image, for example an image in the same training batch as the first image.
And, in step 605, the first loss function used to train the first model is related to a first difference and a second difference, the second difference being a difference between the first text feature and the second point cloud feature. In addition, the first loss function has a positive correlation with the first difference, and the first loss function has a negative correlation with the second difference.
Since the first text feature corresponds to the first object and the second point cloud feature corresponds to the second object, the second difference between the first text feature and the second point cloud feature is actually a difference between a text feature and a point cloud feature corresponding to different objects. By building the first loss function based on both the first difference and the second difference, the first model can learn, during training, to pull closer the point cloud features and text features corresponding to the same object and to push apart the point cloud features and text features corresponding to different objects. In this way, after the second model is obtained by training, it can output point cloud features and text features with the smallest possible difference when extracting features corresponding to the same object, and point cloud features and text features with the largest possible difference when extracting features corresponding to different objects, so that the second model has good performance when used to identify the category of point cloud data.
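This embodiment does not fix a concrete form for such a loss; as one hedged illustration, a margin-style difference-of-distances loss that grows with the first difference and shrinks as the second difference grows could look as follows (the Euclidean distance and the margin value are assumptions, not the formula of this application).

```python
import torch

def pairwise_contrastive_loss(p1, t1, p2, margin=1.0):
    """Toy loss consistent with the described correlations: it increases with
    the first difference d(p1, t1) (same object) and decreases as the second
    difference d(t1, p2) (different objects) increases, clamped by a margin."""
    d_same = torch.norm(p1 - t1, dim=-1)   # first difference
    d_diff = torch.norm(p2 - t1, dim=-1)   # second difference
    return torch.clamp(d_same - d_diff + margin, min=0).mean()
```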
In one possible embodiment, the first model may further include a third network. The third network is used for extracting features of images, that is, the input of the third network is an image and the output is the features of the image. The third network is a neural network that differs in structure from, and is independent of, the first network and the second network. The third network may be, for example, an attention network (such as a Transformer).
In the above method, after obtaining the image block, the features of the image block may also be extracted through a third network, so as to obtain the first image feature. And, in step 605, the first loss function is related to a first difference and a third difference, the third difference being a difference between the first image feature and the first point cloud feature. And, there is a positive correlation between the first loss function and the third difference. That is, the larger the third difference between the first point cloud feature and the first image feature, the larger the first loss function, and the smaller the third difference between the first point cloud feature and the first image feature, the smaller the first loss function.
In this solution, the loss function is constructed by introducing the differences between the point cloud features and the text features and between the point cloud features and the image features, so that in the process of training the model based on this loss function, the model learns to pull the point cloud features, text features, and image features corresponding to the same object as close together as possible, which can effectively improve the performance of the finally trained model.
Optionally, the first loss function may also be related to the first difference, the second difference, and the third difference described above.
The process of training the model is described above, and how to apply the trained model will be described below.
For example, after training to obtain the second model, second point cloud data may be acquired, where the second point cloud data is point cloud data of object identification to be performed. For example, the second point cloud data may be point cloud data to be identified, which is collected by the autonomous vehicle during driving; or, the second point cloud data may be point cloud data to be identified, which is collected by the robot in the moving process.
And then, inputting the second point cloud data into the first network in the second model to obtain a third point cloud characteristic extracted by the first network.
Next, a plurality of texts are input into the second network in the second model to obtain a plurality of text features extracted by the second network. The plurality of text features correspond one-to-one to the plurality of texts, and the plurality of texts may be determined according to the acquisition scene of the second point cloud data. For example, in an autopilot scenario, the plurality of texts may include texts such as "car," "truck," "bicycle," "pedestrian," "roadblock," and "tree." In an indoor robot scenario, the plurality of texts may be texts such as "table," "chair," "sofa," "bed," "trash can," and "door."
And finally, determining the difference between the third point cloud feature and each text feature in the plurality of text features, and determining the target text corresponding to the second point cloud data according to the difference between the third point cloud feature and each text feature in the plurality of text features. The target text corresponds to a target text feature, and the target text feature is a text feature with the smallest difference with the third point cloud feature in the text features.
That is, after determining the difference between the third point cloud feature and each of the plurality of text features, the difference between the third point cloud feature and each of the text features may be ordered in order from small to large, and one text feature having the smallest difference from the third point cloud feature (i.e., the above-described target text feature) may be determined, and then the text corresponding to the text feature may be regarded as the target text.
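As a small sketch of this discrimination step, the snippet below assumes cosine distance as the measure of feature difference (the embodiment does not mandate a particular distance) and returns the target text with the smallest difference.

```python
import torch
import torch.nn.functional as F

def classify_point_cloud(point_cloud_feature, text_features, texts):
    """Pick the target text whose feature is closest to the point cloud feature.

    point_cloud_feature: (C,) third point cloud feature from the first network.
    text_features:       (K, C) text features from the second network.
    texts:               list of K candidate texts.
    """
    pc = F.normalize(point_cloud_feature, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    distances = 1.0 - txt @ pc            # cosine distance; smaller = more similar
    return texts[int(torch.argmin(distances))]
```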
The execution flow of the processing method of the point cloud data provided by the embodiment of the application is introduced above, and the execution process of the processing method of the point cloud data in the actual application process will be described in detail below with reference to specific examples.
Referring to fig. 8, fig. 8 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 8, a sensor data collection module, a general perception module, and a decision inference module are included in the system architecture.
The sensor data collection module is used for collecting point cloud data and images and transmitting the point cloud data and the images to the universal perception module.
The universal perception module is used for realizing object recognition in the open world, such as obstacle recognition or universal object recognition in outdoor or indoor environments.
The decision-making reasoning module is used for executing corresponding decision-making reasoning based on the object recognition result transmitted by the universal perception module.
Referring to fig. 9A and 9B, fig. 9A is a schematic view of object recognition of an outdoor open world according to an embodiment of the present application; fig. 9B is a schematic view of object recognition of an indoor open world according to an embodiment of the present application. As shown in fig. 9A, the above-mentioned general perception module may be applied to a vehicle-mounted system of an automatic driving vehicle, so as to assist the automatic driving vehicle to perceive any kind of obstacle object in the open world, thereby enabling the decision-making inference module of the automatic driving vehicle to perform obstacle avoidance processing and route planning control according to different object kinds, and realizing safe driving.
As shown in fig. 9B, the above-mentioned general sensing module may be applied to a robot, to assist the robot in positioning and identifying any object in the open world environment, so that the decision-making inference module of the robot executes corresponding instructions such as obstacle avoidance or cleaning.
In addition, the universal perception module further includes a tri-modal data alignment module, a tri-modal contrast pre-training tri-tower model, and a multi-modal joint discrimination module. The three modalities refer to data of three different modalities, namely text, images, and point cloud data.
The tri-modal data alignment module is used to associate and align the two modalities of point cloud data and image based on the coordinate system conversion matrix of the sensors that collect the point cloud data and the image. The text and the image are aligned through a pre-trained image-text model, so that alignment of the three modalities, namely point cloud data, image, and text, is realized with the image as an intermediary.
The tri-modal contrast pre-training tri-tower model is used to perform contrastive pre-training based on the point cloud data, image, and text corresponding to the same object, and to pull the features of the three modalities corresponding to the same object closer in feature space, so that the three feature extractors can output aligned, associated features.
The multi-modal joint discrimination module is used to combine the image features and the point cloud features, compute their association with the open-domain text by using the feature relationship with the text, and obtain the classification result of the general-class object.
Referring to fig. 10, fig. 10 is a schematic flow chart for locating and identifying a 3D object in the open world according to an embodiment of the present application. As shown in fig. 10, the procedure of locating and identifying the 3D object of the open world includes the following steps S1 to S3.
And S1, aligning the tri-mode data.
(1) Registering text and images
In an outdoor autopilot scenario, this embodiment takes the point cloud data P_s acquired by the lidar and the scene image I_s acquired by the image sensor as input, where s ∈ |S| denotes the scene, and performs positioning, classification, and discrimination of general-class 3D objects.
Fig. 11 is a schematic flow chart of tri-modal data alignment in an outdoor scenario according to an embodiment of the present application. As shown in fig. 11, the open-domain vocabulary library X_T and the 2D image I_s (i.e., the first image described above) are taken as input, and a pre-trained image-text detection model M_VLM (i.e., the second model described above) is used to obtain the set of image regions in the 2D image I_s corresponding to the objects named by the text in the open-domain vocabulary library X_T, where each 2D image block in this set corresponds to one open text vocabulary word.
(2) Aligning image and point cloud features
As shown in fig. 11, the coordinate system conversion matrix G_OUT between the lidar and the image sensor is used to acquire the 3D point cloud frustum cluster corresponding to each 2D image block. Specifically, the projection from the 3D point cloud to the 2D image pixels is:

[u, v] = G_OUT × [x, y, z]

where [u, v] are the coordinates of a pixel point in the 2D image, [x, y, z] are the coordinates of a point in the 3D point cloud data, and G_OUT is determined by the internal reference matrix I of the image sensor, the external reference matrix R_C of the image sensor, and the external reference matrix R_L of the lidar. The 3D points whose projected pixels [u, v] fall inside a 2D image block form the acquired frustum point cloud cluster. Then, a DBSCAN clustering algorithm is used to obtain the 3D object inside the frustum point cloud cluster, thereby removing the background portion of the point cloud cluster.
In this way, a text-image-point cloud triplet data pair set D can finally be obtained.
The above describes the alignment of images, point cloud data and text obtained in an outdoor scene. In an indoor scene, since the manner of acquiring the point cloud data and the image is different, the process of aligning the image and the point cloud data is also different. In addition, the process of aligning images and text is the same in indoor and outdoor scenes.
Fig. 12 is a schematic flow chart of tri-modal data alignment in an indoor scene according to an embodiment of the present application. As shown in fig. 12, in an indoor scene, both the image and the point cloud data are acquired by a depth camera.
After aligning the image and the text, the GrabCut segmentation algorithm is first used to extract the foreground part of each 2D object image block. The coordinate system conversion matrix G_IN of the depth camera is then used to acquire the 3D object point cloud cluster corresponding to the foreground image region. Specifically, the coordinates (x, y, z) of each point in the point cloud cluster may be obtained by projection from the positions (u', v', d') of the pixel points in the image, with:

G_IN = I × R_C

where [u', v', d'] are the coordinates and depth of a pixel point in the 2D image, [x, y, z] are the coordinates of a point in the 3D point cloud data, I is the internal reference matrix of the camera, and R_C is the external reference matrix of the camera.
Thus, a text-image-point cloud triplet data pair set can be finally obtained
And S2, pre-training the tri-modal characteristics.
After aligning the text, image, and point cloud data, a tri-modal contrast pre-training tri-tower model may be employed to perform contrast pre-training on the text, image, and point cloud data. Referring to fig. 13, fig. 13 is a schematic flow chart of performing contrast pre-training on text, image and point cloud data by using a tri-modal contrast pre-training tri-tower model according to an embodiment of the present application.
As shown in fig. 13, the tri-modal contrast pre-training tri-tower model (i.e., the first model described above) includes a point cloud feature extractor (i.e., the first network described above), an image feature extractor (i.e., the third network described above), and a text feature extractor (i.e., the second network described above). Based on the text-image-point cloud triplet data pair set D acquired in step S1, the aligned point cloud data, image, and text are respectively input into the three feature extractors to obtain the corresponding 3D point cloud feature f_P, 2D image feature f_I, and text feature f_T. The features of the three modalities are then pulled closer in feature space by contrastive pre-training, so that the three feature extractors can output aligned, associated features.
The loss function constructed in the contrastive pre-training process is denoted L(T, I, P), where N represents the number of text-image-point cloud triplet data pairs and τ is the temperature coefficient used to adjust the strength of the contrastive pre-training. The goal of performing contrastive pre-training based on this loss function is to pull closer the features between the text and point cloud corresponding to the same object and between the image and point cloud corresponding to the same object, and to push apart the features between the text and point cloud corresponding to different objects and between the image and point cloud corresponding to different objects.
Therefore, based on the above manner of training the tri-modal contrast pre-training tri-tower model, the resulting model can output similar features when processing text, image, and point cloud data corresponding to the same object, and can output clearly different features when processing text, image, and point cloud data corresponding to different objects. A sketch of such a loss is given below.
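The exact formula is not recoverable from the extracted text above, so the sketch below assumes a symmetric InfoNCE-style form with temperature τ applied to the text/point-cloud and image/point-cloud feature pairs; the function names and the use of in-batch negatives are assumptions rather than the formula of this application.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau):
    """Symmetric InfoNCE between two batches of aligned features (N x C)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau                           # N x N similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def trimodal_loss(f_T, f_I, f_P, tau=0.07):
    """Sketch of L(T, I, P): pull together text/point-cloud and image/point-cloud
    features of the same triplet, push apart those of different triplets."""
    return info_nce(f_T, f_P, tau) + info_nce(f_I, f_P, tau)
```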
Step S3, three-mode joint feature discrimination
Step S3 is performed in the test phase (i.e., the application phase of the model), based on the tri-modal contrast pre-training tri-tower model obtained by training. Referring to fig. 14, fig. 14 is a flow chart of tri-modal joint feature discrimination provided in an embodiment of the present application.
As shown in fig. 14, in an actual application scenario, an image and a point cloud cluster in the open world are acquired by the sensors, and the image and the point cloud cluster are respectively input into the image feature extractor and the point cloud feature extractor to obtain the corresponding image feature and point cloud feature.
Then, the vocabulary of the text library is input into the text feature extractor to obtain a K×C matrix of text features, where K represents the length of the text library (i.e., the number of texts) and C represents the feature dimension (i.e., the feature vector of each text is C-dimensional).
Next, the sum of the image feature and the point cloud feature is computed, the feature distance between this sum and each text feature is calculated, and the text feature with the shortest distance is selected as the discrimination result with the largest classification probability. In this calculation, the feature distance between the sum of the image feature and the point cloud feature and each text feature determines the classification probability corresponding to each text.
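A minimal sketch of this joint discrimination follows; normalising the summed feature and converting the cosine similarities into probabilities with a temperature-scaled softmax are assumptions for illustration rather than the exact computation of the embodiment.

```python
import torch
import torch.nn.functional as F

def joint_discrimination(f_I, f_P, f_T, tau=0.07):
    """Combine the image feature f_I and point cloud feature f_P (each (C,)),
    compare the sum with the K x C text features f_T, and return the per-text
    classification probabilities and the index of the best-matching text."""
    query = F.normalize(f_I + f_P, dim=-1)    # fused query feature
    txt = F.normalize(f_T, dim=-1)
    sims = txt @ query                        # similarity to each text feature
    probs = torch.softmax(sims / tau, dim=0)  # classification probability per text
    return probs, int(torch.argmax(probs))
```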
The method provided by the embodiment of the present application is described above in detail, and the apparatus for performing the method provided by the embodiment of the present application will be described next.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a processing device for point cloud data according to an embodiment of the present application.
As shown in fig. 15, the processing device for point cloud data includes:
an acquiring module 1501, configured to acquire first point cloud data and a first image, where the first point cloud data and the first image are acquired from a first object;
the obtaining module 1501 is further configured to obtain an image block of the first object in the first image and a text name corresponding to the first object;
a processing module 1502, configured to determine, based on a mapping relationship between the first point cloud data and the first image and the image block, a first point cloud cluster corresponding to the first object in the first point cloud data;
the processing module 1502 is further configured to extract, through a first model, features of a first point cloud cluster and a text name, to obtain a first point cloud feature and a first text feature, where the first model includes a first network and a second network, the first point cloud feature is obtained by extracting features of the first point cloud cluster by the first network, and the first text feature is obtained by extracting features of the text name by the second network;
The processing module 1502 is further configured to update the first model based on a first loss function to obtain a second model, where the first loss function is related to a first difference between the first point cloud feature and the first text feature.
In a possible implementation manner, the obtaining module 1501 is further configured to obtain a second point cloud feature and a second text feature, where the second point cloud feature is obtained by extracting features from a point cloud cluster corresponding to a second object by a first network, the second text feature is obtained by extracting features from a text name of the second object by a second network, and the first object and the second object are different objects;
the first loss function is related to a first difference and a second difference, the second difference being a difference between the first text feature and the second point cloud feature.
In one possible implementation, the first loss function has a positive correlation with the first difference and the first loss function has a negative correlation with the second difference.
In one possible implementation, the first model further includes a third network;
the processing module 1502 is further configured to:
extracting the characteristics of the image blocks through a third network to obtain first image characteristics;
the first loss function is related to a first difference and a third difference, wherein the third difference is a difference between the first image feature and the first point cloud feature.
In one possible implementation, the obtaining module 1501 is further configured to obtain second point cloud data;
the processing module 1502 is further configured to input the second point cloud data into the first network in the second model to obtain a third point cloud feature;
the processing module 1502 is further configured to input a plurality of texts into a second network in the second model, to obtain a plurality of text features;
the processing module 1502 is further configured to determine, according to a difference between the third point cloud feature and each text feature in the plurality of text features, a target text corresponding to the second point cloud data, where the target text corresponds to a target text feature, and the target text feature is a text feature in the plurality of text features that has a smallest difference from the third point cloud feature.
In one possible implementation, the mapping relationship is based on the means for acquiring the first point cloud data and the means for acquiring the first image.
In one possible implementation, the first point cloud data is acquired by a lidar, the first image is acquired by an image sensor, and the lidar and the image sensor are disposed on the same device.
In one possible implementation, the processing module 1502 is further configured to:
Determining a second point cloud cluster corresponding to the image block in the first point cloud data based on a coordinate system conversion matrix between the laser radar and the image sensor;
and clustering the points in the second point cloud cluster to obtain a first point cloud cluster, wherein the second point cloud cluster comprises the first point cloud cluster.
In one possible implementation, the first point cloud data and the first image are acquired by the same depth camera.
In one possible implementation, the processing module 1502 is further configured to:
extracting a foreground part in the image block to obtain a foreground image block;
and determining a first point cloud cluster corresponding to the foreground image block in the first point cloud data based on the coordinate system conversion matrix of the depth camera.
In one possible implementation, the processing module 1502 is further configured to:
and carrying out target recognition on the first image through a second model to obtain an image block of the first object in the first image and a text name corresponding to the first object, wherein the second model is a pre-training model.
Referring to fig. 16, fig. 16 is a schematic structural diagram of an execution device provided in the embodiment of the present application, and the execution device 1600 may be specifically represented by a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, etc., which is not limited herein. Specifically, the execution device 1600 includes: a receiver 1601, a transmitter 1602, a processor 1603, and a memory 1604 (where the number of processors 1603 in the execution device 1600 may be one or more, one processor is illustrated in fig. 16), where the processor 1603 may include an application processor 16031 and a communication processor 16032. In some embodiments of the present application, the receiver 1601, transmitter 1602, processor 1603, and memory 1604 may be connected by a bus or other means.
Memory 1604 may include read only memory and random access memory, and provides instructions and data to processor 1603. A portion of the memory 1604 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1604 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1603 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The methods disclosed in the embodiments of the present application may be applied to the processor 1603 or implemented by the processor 1603. Processor 1603 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in processor 1603. The processor 1603 may be a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.
The processor 1603 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium is located in the memory 1604, and the processor 1603 reads the information in the memory 1604 and completes the steps of the above method in combination with its hardware.
The receiver 1601 is operable to receive input digital or character information and to generate signal inputs related to performing device related settings and function control. The transmitter 1602 is operable to output numeric or character information via a first interface; the transmitter 1602 may also be used to send instructions to the disk group through the first interface to modify data in the disk group; the transmitter 1602 may also include a display device such as a display screen.
The electronic device provided in this embodiment of the present application may specifically be a chip, and the chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device performs the processing method of point cloud data described in the above embodiments, or so that the chip in the training device performs the processing method of point cloud data described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 17, fig. 17 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be represented as a neural network processor (NPU) 1700. The NPU 1700 is mounted as a coprocessor on a host CPU, and the host CPU distributes tasks. The core part of the NPU is the arithmetic circuit 1703, and the controller 1704 controls the arithmetic circuit 1703 to extract matrix data from the memory and perform multiplication.
In some implementations, the arithmetic circuit 1703 includes a plurality of processing units (PEs) inside. In some implementations, the operational circuit 1703 is a two-dimensional systolic array. The arithmetic circuit 1703 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry 1703 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1702 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 1701 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 1708.
The unified memory 1706 is used for storing input data and output data. The weight data is directly transferred to the weight memory 1702 through the memory cell access controller (Direct Memory Access Controller, DMAC) 1705. The input data is also carried into the unified memory 1706 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 1710 is used for interaction between the AXI bus, the DMAC, and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1709.
The bus interface unit 1710 is used by the instruction fetch buffer 1709 to fetch instructions from the external memory, and by the memory unit access controller 1705 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external memory (DDR) to the unified memory 1706, to transfer weight data to the weight memory 1702, or to transfer input data to the input memory 1701.
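Purely as an illustration of this data movement (the dictionary-based memories and names below are assumptions made for the sketch, not the actual hardware interface), the transfers can be pictured as copies from external DDR into the three on-chip destinations:

```python
import numpy as np

# Toy model of the DMAC transfers described above: external DDR holds the
# tensors, and each transfer copies one of them to its on-chip destination.
ddr = {
    "input_a": np.random.rand(4, 32).astype(np.float32),
    "weight_b": np.random.rand(32, 8).astype(np.float32),
}
unified_memory, weight_memory, input_memory = {}, {}, {}

def dma_transfer(src: dict, key: str, dst: dict) -> None:
    """Copy one named tensor from the source memory to the destination memory."""
    dst[key] = src[key].copy()

dma_transfer(ddr, "input_a", unified_memory)   # input data -> unified memory 1706
dma_transfer(ddr, "weight_b", weight_memory)   # weight data -> weight memory 1702
dma_transfer(ddr, "input_a", input_memory)     # input data -> input memory 1701
```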
The vector calculation unit 1707 includes a plurality of operation processing units and, when needed, further processes the output of the arithmetic circuit 1703, for example with vector multiplication, vector addition, exponential operations, logarithmic operations, or magnitude comparison. It is mainly used for non-convolutional/fully connected layer computation in the neural network, such as batch normalization (Batch Normalization), pixel-level summation, and up-sampling of feature maps.
In some implementations, the vector calculation unit 1707 can store the vector of processed outputs to the unified memory 1706. For example, the vector calculation unit 1707 may apply a linear or nonlinear function to the output of the arithmetic circuit 1703, such as linearly interpolating the feature maps extracted by a convolutional layer, or applying a nonlinear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1707 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1703, for example for use in subsequent layers of the neural network.
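The post-processing role of the vector calculation unit can be sketched in the same spirit (again a hypothetical simplification; the choice of batch normalization followed by a ReLU activation is one example of the operations listed above, not the only possibility):

```python
import numpy as np

def postprocess(acc_out: np.ndarray,
                gamma: np.ndarray, beta: np.ndarray,
                eps: float = 1e-5) -> np.ndarray:
    """Batch-normalize the accumulated output and apply a ReLU activation.

    acc_out : output of the matrix operation (rows are samples, columns are channels)
    gamma, beta : per-channel scale and shift, as in batch normalization
    """
    mean = acc_out.mean(axis=0)                  # per-channel mean
    var = acc_out.var(axis=0)                    # per-channel variance
    normalized = (acc_out - mean) / np.sqrt(var + eps)
    activated = np.maximum(gamma * normalized + beta, 0.0)  # ReLU activation
    return activated

# usage example: normalize and activate a 4x8 block of accumulated values
out = postprocess(np.random.randn(4, 8), gamma=np.ones(8), beta=np.zeros(8))
```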
The instruction fetch buffer (IFB) 1709 is connected to the controller 1704 and stores the instructions used by the controller 1704. The unified memory 1706, the input memory 1701, the weight memory 1702, and the instruction fetch buffer 1709 are all on-chip memories. The external memory is memory external to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
Referring to fig. 18, fig. 18 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present application. The present application also provides a computer readable storage medium. In some embodiments, the method disclosed in fig. 6 above may be implemented as computer program instructions encoded in a machine readable format on a computer readable storage medium, or encoded on another non-transitory medium or article of manufacture.
Fig. 18 schematically illustrates a conceptual partial view of an example computer-readable storage medium comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein.
In one embodiment, computer-readable storage medium 1800 is provided using signal bearing medium 1801. The signal bearing medium 1801 may include one or more program instructions 1802 that, when executed by one or more processors, may provide the functionality or portions of the functionality described above with respect to fig. 6.
In some examples, the signal bearing medium 1801 may comprise a computer readable medium 1803, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, memory, ROM, or RAM.
In some implementations, the signal bearing medium 1801 may comprise a computer recordable medium 1804, such as, but not limited to, memory, read/write (R/W) CD, R/W DVD, and the like. In some implementations, the signal bearing medium 1801 may include a communication medium 1805 such as, but not limited to, a digital and/or analog communication medium (e.g., fiber optic cable, waveguide, wired communications link, wireless communications link, etc.). Thus, for example, the signal bearing medium 1801 may be conveyed by a communication medium 1805 in wireless form (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or other transmission protocol).
The one or more program instructions 1802 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device may be configured to provide various operations, functions, or actions in response to the program instructions 1802 conveyed to it through one or more of the computer readable medium 1803, the computer recordable medium 1804, and/or the communication medium 1805.
It should be further noted that the above-described apparatus embodiments are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, a connection relation between modules indicates that they have communication connections between them, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is the preferred embodiment in most cases. Based on such an understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods of the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives (SSDs)), among others.
Claims (25)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310328338.9A CN116468903A (en) | 2023-03-23 | 2023-03-23 | A method and related device for processing point cloud data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310328338.9A CN116468903A (en) | 2023-03-23 | 2023-03-23 | A method and related device for processing point cloud data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116468903A true CN116468903A (en) | 2023-07-21 |
Family
ID=87183465
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310328338.9A Pending CN116468903A (en) | 2023-03-23 | 2023-03-23 | A method and related device for processing point cloud data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116468903A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116911353A (en) * | 2023-08-08 | 2023-10-20 | 山东海量信息技术研究院 | Data pair acquisition method, device, equipment, server, cluster, and medium thereof |
| CN117132964A (en) * | 2023-08-30 | 2023-11-28 | 北京百度网讯科技有限公司 | Model training method, point cloud encoding method, object processing method and device |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10650278B1 (en) * | 2017-07-21 | 2020-05-12 | Apple Inc. | Semantic labeling of point clouds using images |
| WO2022088104A1 (en) * | 2020-10-30 | 2022-05-05 | 华为技术有限公司 | Method and apparatus for determining point cloud set corresponding to target object |
| CN114550116A (en) * | 2022-02-17 | 2022-05-27 | 京东鲲鹏(江苏)科技有限公司 | Object identification method and device |
| WO2022242416A1 (en) * | 2021-05-21 | 2022-11-24 | 北京百度网讯科技有限公司 | Method and apparatus for generating point cloud data |
| WO2022256976A1 (en) * | 2021-06-07 | 2022-12-15 | 深圳市大疆创新科技有限公司 | Method and system for constructing dense point cloud truth value data and electronic device |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10650278B1 (en) * | 2017-07-21 | 2020-05-12 | Apple Inc. | Semantic labeling of point clouds using images |
| WO2022088104A1 (en) * | 2020-10-30 | 2022-05-05 | 华为技术有限公司 | Method and apparatus for determining point cloud set corresponding to target object |
| WO2022242416A1 (en) * | 2021-05-21 | 2022-11-24 | 北京百度网讯科技有限公司 | Method and apparatus for generating point cloud data |
| WO2022256976A1 (en) * | 2021-06-07 | 2022-12-15 | 深圳市大疆创新科技有限公司 | Method and system for constructing dense point cloud truth value data and electronic device |
| CN114550116A (en) * | 2022-02-17 | 2022-05-27 | 京东鲲鹏(江苏)科技有限公司 | Object identification method and device |
Non-Patent Citations (3)
| Title |
|---|
| GUANGZHI WANG ET AL.: "Text to Point Cloud Localization with Relation-Enhanced Transformer", ARXIV:2301.05372, 13 January 2023 (2023-01-13) * |
| HAIYANG WANG ET AL.: "CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds", ARXIV:2210.04264, 9 October 2022 (2022-10-09) * |
| 林钦壮; 何昭水: "Efficient point cloud recognition method based on attention mechanism" (基于注意力机制的高效点云识别方法), 计算机与现代化 (Computer and Modernization), no. 08, 15 August 2020 (2020-08-15) * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116911353A (en) * | 2023-08-08 | 2023-10-20 | 山东海量信息技术研究院 | Data pair acquisition method, device, equipment, server, cluster, and medium thereof |
| CN116911353B (en) * | 2023-08-08 | 2025-12-16 | 山东海量信息技术研究院 | Data pair acquisition method, device, equipment, server, cluster and medium thereof |
| CN117132964A (en) * | 2023-08-30 | 2023-11-28 | 北京百度网讯科技有限公司 | Model training method, point cloud encoding method, object processing method and device |
| CN117132964B (en) * | 2023-08-30 | 2025-07-15 | 北京百度网讯科技有限公司 | Model training method, point cloud coding method, object processing method and device |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN111291809B (en) | Processing device, method and storage medium | |
| CN113705769B (en) | A neural network training method and device | |
| CN112990211B (en) | A neural network training method, image processing method and device | |
| CN111368972B (en) | A convolutional layer quantization method and its device | |
| CN113807399B (en) | A neural network training method, detection method and device | |
| CN108496127B (en) | Efficient 3D reconstruction focused on objects | |
| CN111401517B (en) | A perceptual network structure search method and its device | |
| US9630318B2 (en) | Feature detection apparatus and methods for training of robotic navigation | |
| US20190087975A1 (en) | Methods and Apparatus for Autonomous Robotic Control | |
| WO2021043112A1 (en) | Image classification method and apparatus | |
| WO2022042713A1 (en) | Deep learning training method and apparatus for use in computing device | |
| CN114332845B (en) | A method and device for 3D target detection | |
| WO2020192736A1 (en) | Object recognition method and device | |
| WO2022179581A1 (en) | Image processing method and related device | |
| CN113065575A (en) | Image processing method and related device | |
| US20240273742A1 (en) | Depth completion using image and sparse depth inputs | |
| CN116739071A (en) | A model training method and related devices | |
| CN116468903A (en) | A method and related device for processing point cloud data | |
| CN113065637A (en) | Perception network and data processing method | |
| CN117994754A (en) | Vehicle position acquisition method, model training method and related equipment | |
| CN115115016A (en) | Method and device for training neural network | |
| Tian | Effective image enhancement and fast object detection for improved UAV applications | |
| WO2025026251A1 (en) | Model training method and apparatus, and point cloud representation method and apparatus | |
| CN116883961A (en) | Target perception method and device | |
| CN114445688B (en) | Target detection method for spherical unmanned system of distributed multi-camera |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |