WO2025200079A1 - Learnable encoder converting point cloud to grid for visual recognition - Google Patents
Info
- Publication number
- WO2025200079A1 (PCT/CN2024/090762)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- point cloud
- graph
- generating
- nodes
- cloud graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
Definitions
- the CNN 150 may output information indicating conditions of objects.
- Example conditions may include classification, gesture, pose, movement, action, mood, orientation, interest, traffic-related condition, other types of conditions, or some combination thereof.
- Conditions of objects may be used in various applications, such as human pose lifting, skeleton-based human action recognition, 3D mesh reconstruction, traffic navigation, social network analysis, recommender systems, scientific computing, and so on.
- the training module 120 may jointly train at least part of the PC2G encoder 140 and at least part of the CNN 150.
- the training module 120 may train the up-sampling model 170, the reshaping module 180, and the CNN 150 through the same training process.
- the training module 120 may input one or more training samples into the visual recognition module 110, e.g., directly input into the up-sampling model 170.
- the training module 120 may cause executions of the up-sampling model 170, the reshaping module 180, and the CNN 150 on the one or more training samples.
- the CNN 150 may also be further trained in the second training stage.
- the values of the internal parameters of the CNN 150 may start with the values that are learned in the first training stage.
- the training module 120 may further adjust the values of the internal parameters of the CNN 150 to further train the CNN 150 in the second training stage.
- the up-sampling model 170, the reshaping module 180, and the CNN 150 are jointly trained in the second training stage.
- the training module 120 also determines hyperparameters for training the visual recognition module 110.
- Hyperparameters are variables specifying the training process. Hyperparameters are different from parameters inside the visual recognition module 110 ( “internal parameters, ” e.g., internal parameters of the up-sampling model 170, internal parameters of the reshaping module 180, weights for convolution operations in the CNN 150, etc. ) .
- hyperparameters include variables determining the architecture of at least part of the visual recognition module 110, such as number of hidden layers in the CNN 150, and so on. Hyperparameters also include variables which determine how the visual recognition module 110 is trained, such as batch size, number of epochs, etc.
- the training module 120 also adds an activation function to a hidden layer or the output layer.
- An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer.
- the activation function may be, for example, a rectified linear unit (ReLU) activation function, a tangent activation function, or other types of activation functions.
- the training module 120 may train the visual recognition module 110 for a predetermined number of epochs.
- the number of epochs is a hyperparameter that defines the number of times that the DL algorithm will work through the entire training dataset.
- One epoch means that each sample in the training dataset has had an opportunity to update the internal parameters of the visual recognition module 110.
- the training module 120 may stop updating the internal parameters of the visual recognition module 110, and the visual recognition module 110 is considered trained.
- the width W may be the total number of elements in a row of the feature map 300.
- the height H may be the total number of elements in a column of the feature map 300.
- Each element is represented by a dark circle in FIG. 3.
- an element may be a node in the up-sampled graph, e.g., a node 210 or a node 215.
- the up-sampled graph is input into the reshaping module 180, and the reshaping module 180 outputs the feature map 300.
- the feature map 300 is a 2D tensor.
- the feature map 300 may be a 3D tensor, e.g., a tensor with a spatial size H ⁇ W ⁇ C.
- C may be 2, 3, or a larger number.
- FIG. 4 illustrates a convolution 400 executed on a grid 410, in accordance with various embodiments.
- the grid 410 may be generated by reshaping a graph.
- the grid 410 is two-dimensional and has a spatial size of 6 × 6, i.e., there are six elements in each row and six elements in each column. Every element of the grid 410 is represented by a black circle in FIG. 4. An element may be a node in the graph.
- the convolution 400 may be a deep learning operation in a CNN, e.g., the CNN 150.
- the convolution 400 has a kernel 420 with a spatial size of 3 × 3, i.e., the kernel 420 has nine weights arranged in three rows and three columns.
- the grid 410 is used as an IFM of the convolution 400.
- multiply-accumulate (MAC) operations are performed as the kernel 420 slides through the grid 410, as indicated by the arrows in FIG. 4.
- the stride for applying the kernel on the grid 410 is 1, meaning the kernel slides one data element at a time. In other embodiments, the stride may be more than one.
- the convolution 400 generates an OFM 430, which is a 6 × 6 tensor and has the same spatial size as the grid 410.
- the OFM 430 may have a different spatial size.
- the convolution 400 may include padding, through which additional data elements are added to the grid 410 before the kernel is applied on the grid 410. In an example where the padding factor is 1, one additional row is added to the top of the grid 410, one additional row is added to the bottom of the grid 410, one additional column is added to the right of the grid 410, and one additional column is added to the left of the grid 410. The size of the grid 410 after the padding would become 8 × 8.
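As a rough illustration of the sliding-window arithmetic described above, the following Python sketch applies a 3 × 3 kernel to a 6 × 6 grid with a stride of 1 and a padding factor of 1. The random values merely stand in for the grid 410 and the kernel 420, which are not given numerically in this excerpt.

```python
import numpy as np

def conv2d_single_channel(grid, kernel, stride=1, padding=1):
    """Sliding-window multiply-accumulate over a 2D grid, as in the convolution 400 sketch."""
    k = kernel.shape[0]
    padded = np.pad(grid, padding)                      # zero padding on all four sides
    out_h = (padded.shape[0] - k) // stride + 1
    out_w = (padded.shape[1] - k) // stride + 1
    ofm = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[i * stride:i * stride + k, j * stride:j * stride + k]
            ofm[i, j] = np.sum(patch * kernel)          # MAC over one kernel-sized patch
    return ofm

grid = np.random.rand(6, 6)      # stands in for grid 410: 6 x 6 elements
kernel = np.random.rand(3, 3)    # stands in for kernel 420: 3 x 3 weights
ofm = conv2d_single_channel(grid, kernel)
print(ofm.shape)                 # (6, 6) -- same spatial size as the grid, as in OFM 430
```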
- the OFM 430 may be processed in additional deep learning operations in the CNN, e.g., another convolution, activation function, pooling operation, linear transformation, and so on.
- the CNN may output information indicating one or more conditions of the object 310.
- the real matrix 501 may be a continuous approximation of the binary matrix 503.
- the real matrix 501 may be used to assist the learning of the binary matrix 503.
- directly learning the binary matrix 503 would cut off the gradient flow in the backward path 520, which can make the training non-differentiable.
- Straight-through estimator (STE) may be used for parameter update to solve this problem.
- the binarization module 502 may convert the real matrix 501 to the binary matrix 503 by binarizing the real matrix 501 row by row.
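The exact row-wise binarization rule is not reproduced in this excerpt. The sketch below assumes each row of the real matrix is binarized to a one-hot vector at its largest entry, with a straight-through estimator so that gradients still reach the real matrix in the backward path 520; the 25 × 25 matrix shape is also an assumption for illustration.

```python
import torch

def binarize_rows_ste(real_matrix):
    """Row-wise binarization with a straight-through estimator (STE).

    Forward pass: each row becomes a one-hot row with the 1 at the row's largest
    entry (an assumed rule). Backward pass: the gradient flows to the real matrix
    as if the binarization were the identity, keeping training differentiable.
    """
    hard = torch.zeros_like(real_matrix)
    hard.scatter_(1, real_matrix.argmax(dim=1, keepdim=True), 1.0)  # one-hot per row
    # STE trick: use `hard` in the forward pass, route gradients through `real_matrix`.
    return hard + (real_matrix - real_matrix.detach())

real = torch.randn(25, 25, requires_grad=True)   # e.g., maps 25 nodes to 25 grid cells (assumed)
binary = binarize_rows_ste(real)
binary.sum().backward()                          # gradients reach `real` despite the hard step
```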
- the binary matrix 503 may then be used to convert an up-sampled graph 504 to a grid 505.
- the grid 505 is a 3D tensor that has a spatial size of 5 × 5 × 3. In other embodiments, the grid may have a different shape or dimension. For instance, the grid 505 may have one or more larger dimensions. Each element in the grid 505 may be a different node in the up-sampled graph 504.
- the grid 505 is input into a CNN 506 and may be processed by the CNN 506 as an IFM.
- the CNN 506 may be an example of the CNN 150 in FIG. 1.
- the CNN 506 may generate a label that indicates visual recognition of an object represented by the up-sampled graph 504.
- the multiplication applied between a kernel-sized patch of the IFM 640 and a kernel may be a dot product.
- a dot product is the elementwise multiplication between the kernel-sized patch of the IFM 640 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product. ”
- Using a kernel smaller than the IFM 640 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 640 multiple times at different points on the IFM 640. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 640, left to right, top to bottom.
- In the depthwise convolution 683, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 6, the depthwise convolution 683 produces a depthwise output tensor 680.
- the depthwise output tensor 680 is represented by a 5 × 5 × 3 3D matrix.
- the depthwise output tensor 680 includes 3 output channels, each of which is represented by a 5 × 5 2D matrix.
- the 5 × 5 2D matrix includes five output elements in each row and five output elements in each column.
- Each output channel is a result of MAC operations of an input channel of the IFM 640 and a kernel of the filter 650.
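For illustration, a depthwise convolution can be expressed as a grouped convolution in which each group contains exactly one channel. The 7 × 7 × 3 input size below is an assumption chosen so that 3 × 3 kernels produce a 5 × 5 × 3 output like the depthwise output tensor 680.

```python
import torch

# Minimal sketch of the depthwise convolution 683: each input channel is convolved
# with its own 3 x 3 kernel, and the channels are not combined.
ifm = torch.randn(1, 3, 7, 7)          # stands in for IFM 640 (7 x 7 x 3 is assumed here)
depthwise = torch.nn.Conv2d(
    in_channels=3, out_channels=3, kernel_size=3, groups=3, bias=False
)                                      # groups == channels -> one kernel per input channel
out = depthwise(ifm)
print(out.shape)                       # torch.Size([1, 3, 5, 5]) -- like depthwise output tensor 680
```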
- the OFM 660 is then passed to the next layer in the sequence.
- the OFM 660 is passed through an activation function.
- An example activation function is ReLU.
- ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less.
- the convolutional layer 610 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 660 is passed to the subsequent convolutional layer 610 (i.e., the convolutional layer 610 following the convolutional layer 610 generating the OFM 660 in the sequence) .
- the subsequent convolutional layers 610 perform a convolution on the OFM 660 with new kernels and generate a new feature map.
- the new feature map may also be normalized and resized.
- the new feature map can be kernelled again by a further subsequent convolutional layer 610, and so on.
- the pooling layers 620 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps.
- a pooling layer 620 is placed between two convolution layers 610: a preceding convolutional layer 610 (the convolution layer 610 preceding the pooling layer 620 in the sequence of layers) and a subsequent convolutional layer 610 (the convolution layer 610 subsequent to the pooling layer 620 in the sequence of layers) .
- a pooling layer 620 is added after a convolutional layer 610, e.g., after an activation function (e.g., ReLU, etc. ) has been applied to the OFM 660.
- a pooling layer 620 receives feature maps generated by the preceding convolution layer 610 and applies a pooling operation to the feature maps.
- the pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the CNN and avoids over-learning.
- the pooling layers 620 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map) , max pooling (calculating the maximum value for each patch of the feature map) , or a combination of both.
- the size of the pooling operation is smaller than the size of the feature maps.
- the pooling operation is 2 × 2 pixels applied with a stride of two pixels, so that the pooling operation reduces each dimension of a feature map by a factor of 2, i.e., the number of pixels or values in the feature map is reduced to one quarter of the original size.
- a pooling layer 620 applied to a feature map of 6 × 6 results in an output pooled feature map of 3 × 3.
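A minimal sketch of the 2 × 2 max pooling with a stride of two applied to a 6 × 6 feature map; the values are illustrative.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2 x 2 max pooling with a stride of two, as described for the pooling layers 620."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(36, dtype=float).reshape(6, 6)   # a 6 x 6 feature map
pooled = max_pool_2x2(fm)
print(pooled.shape)                             # (3, 3) -- one quarter of the original values
```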
- the output of the pooling layer 620 is inputted into the subsequent convolution layer 610 for further feature extraction.
- the pooling layer 620 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
- the fully-connected layers 630 are the last layers of the CNN.
- the fully-connected layers 630 may be convolutional or not.
- the fully-connected layers 630 may also be referred to as linear layers.
- a fully-connected layer 630 (e.g., the first fully-connected layer in the CNN 600) may receive an input operand.
- the input operand may define the output of the convolutional layers 610 and pooling layers 620 and includes the values of the last feature map generated by the last pooling layer 620 in the sequence.
- the fully-connected layer 630 may apply a linear transformation to the input operand through a weight matrix.
- the weight matrix may be a kernel of the fully-connected layer 630.
- the linear transformation may include a tensor multiplication between the input operand and the weight matrix.
- the result of the linear transformation may be an output operand.
- the fully-connected layer may further apply a non-linear transformation (e.g., by using a non-linear activation function) on the result of the linear transformation to generate an output operand.
- the output operand may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the sum of all the elements is equal to one. These probabilities are calculated by the last fully-connected layer 630 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.
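A minimal sketch of the last fully-connected layer with a SoftMax activation; the input size of 128 and the 10 classes are assumptions for illustration.

```python
import numpy as np

def fully_connected_softmax(x, weights, bias):
    """Linear transformation followed by SoftMax, as in the last fully-connected layer 630."""
    logits = x @ weights + bias                    # tensor multiplication with the weight matrix
    exp = np.exp(logits - logits.max())            # subtract the max for numerical stability
    return exp / exp.sum()                         # probabilities in [0, 1] that sum to one

x = np.random.rand(128)                            # flattened output of the last pooling layer (assumed size)
weights = np.random.rand(128, 10)                  # 10 classes (assumed)
bias = np.zeros(10)
probs = fully_connected_softmax(x, weights, bias)
print(probs.sum())                                 # ~1.0
```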
- the input tensor 710 has a spatial size H_in × W_in × C_in, where H_in is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 3D matrix of each input channel), W_in is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 3D matrix of each input channel), and C_in is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels).
- the input tensor 710 has a spatial size of 7 × 7 × 3, i.e., the input tensor 710 includes three input channels and each input channel has a 7 × 7 2D matrix.
- Each input element in the input tensor 710 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 710 may be different.
- Each filter 720 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN.
- a filter 720 has a spatial size H_f × W_f × C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, C_f equals C_in.
- each filter 720 in FIG. 7 has a spatial size of 3 × 3 × 3, i.e., the filter 720 includes three convolutional kernels with a spatial size of 3 × 3.
- the height, width, or depth of the filter 720 may be different.
- the spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 710.
- each filter 720 slides across the input tensor 710 and generates a 2D matrix for an output channel in the output tensor 730.
- the 2D matrix has a spatial size of 5 × 5.
- the output tensor 730 includes activations (also referred to as “output activations, ” “elements, ” or “output element” ) arranged in a 3D matrix.
- An output activation is a data point in the output tensor 730.
- MAC operations can be performed on a 3 × 3 × 3 subtensor 715 (which is highlighted with a dotted pattern in FIG. 7) in the input tensor 710 and each filter 720.
- the result of the MAC operations on the subtensor 715 and one filter 720 is an output activation.
- an output activation may include 8 bits, e.g., one byte.
- an output activation may include more than one byte. For instance, an output element may include two bytes.
- a vector 735 is produced.
- the vector 735 is highlighted with slashes in FIG. 7.
- the vector 735 includes a sequence of output activations, which are arranged along the Z axis.
- the output activations in the vector 735 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates.
- the dimension of the vector 735 along the Z axis may equal the total number of output channels in the output tensor 730.
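The following sketch reproduces the shapes described for FIG. 7: each 3 × 3 × 3 filter slides over the 7 × 7 × 3 input tensor and produces one 5 × 5 output channel. The number of filters (four) is an assumption, since this excerpt does not state how many output channels the output tensor 730 has.

```python
import torch

input_tensor = torch.randn(1, 3, 7, 7)                      # C_in = 3, H_in = W_in = 7
conv = torch.nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, bias=False)
output_tensor = conv(input_tensor)
print(output_tensor.shape)                                  # torch.Size([1, 4, 5, 5])
# The activations at a fixed (X, Y) position across all output channels form a
# vector like the vector 735: one output activation per output channel.
print(output_tensor[0, :, 0, 0].shape)                      # torch.Size([4])
```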
- FIG. 8 illustrates an AI-based visual recognition environment 800, in accordance with various embodiments.
- the AI-based visual recognition environment 800 includes a visual recognition module 810, client devices 820 (individually referred to as client device 820) , and a third-party system 830.
- In other embodiments, the AI-based visual recognition environment 800 may include fewer, more, or different components.
- the AI-based visual recognition environment 800 may include a different number of client devices 820 or more than one third-party system 830.
- the visual recognition module 810 performs visual recognition tasks, e.g., detection of conditions of objects. For instance, the visual recognition module 810 may track 3D motions of an object by estimating 3D poses of the object. In some embodiments, the visual recognition module 810 may receive one or more point clouds captured by one or more sensors placed in a local area where an object is located. The visual recognition module 810 may receive the point clouds from one or more client devices 820 or the third-party system 830. Also, the visual recognition module 810 may transmit information indicating visual recognition of the object to one or more client devices 820 or the third-party system 830. Additionally or alternatively, the visual recognition module 810 may transmit content items generated using the estimated 3D poses of the object to one or more client devices 820 or the third-party system 830. An example of the visual recognition module 810 is the visual recognition module 110 in FIG. 1.
- the client devices 820 are in communication with the visual recognition module 810.
- the client device 820 may receive 3D pose graphical representations from the visual recognition module 810 and display the 3D pose graphical representations to one or more users associated with the client device 820.
- a client device 820 may facilitate an interface with one or more depth cameras in a local area and may send commands to the depth cameras to capture depth images to be used by the visual recognition module 810.
- the client device 820 may facilitate an interface with one or more projectors in a local area and may provide content items to the projectors for the projectors to present the content items in the local area.
- the client device 820 may generate the content items using motion tracking results from the visual recognition module 810.
- a client device may have one or more users, whose motions may be tracked by the visual recognition module 810.
- a client device 820 may execute one or more applications allowing one or more users of the client device 820 to interact with the visual recognition module 810.
- a client device 820 executes a browser application to enable interaction between the client device 820 and the visual recognition module 810.
- a client device 820 interacts with the visual recognition module 810 through an application programming interface (API) running on a native operating system of the client device 820, such as ANDROID™.
- a client device 820 may be one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via the network 840.
- a client device 820 is a conventional computer system, such as a desktop or a laptop computer.
- a client device 820 may be a device having computer functionality, such as a personal digital assistant (PDA) , a mobile telephone, a smartphone, an autonomous vehicle, or another suitable device.
- a client device 820 is configured to communicate via the network 840.
- a client device 820 is an integrated computing device that operates as a standalone network-enabled device.
- the client device 820 includes a display, speakers, a microphone, a camera, and an input device.
- a client device 820 is a computing device for coupling to an external media device such as a television or other external display and/or audio output system.
- the client device 820 may couple to the external media device via a wireless interface or wired interface and may utilize various functions of the external media device such as its display, speakers, microphone, camera, and input devices.
- the client device 820 may be configured to be compatible with a generic external media device that does not have specialized software, firmware, or hardware specifically for interacting with the client device 820.
- the third-party system 830 is an online system that may communicate with the visual recognition module 810 or at least one of the client devices 820.
- the third-party system 830 may provide data to the visual recognition module 810 for 3D pose estimation.
- the data may include depth images, data for training DNNs, data for validating DNNs, and so on.
- the third-party system 830 may be a social media system, an online image gallery, an online searching system, and so on. Additionally or alternatively, the third-party system 830 may use results of 3D pose estimation in various applications. For instance, the third-party system 830 may use motion tracking results from the visual recognition module 810 for action recognition, sport analysis, virtual reality, augmented reality, film and game production, telepresence, and so on.
- the visual recognition module 810, client devices 820, and third-party system 830 are connected through a network 840.
- the network 840 may comprise any combination of local area and/or wide area networks, using both wired and/or wireless communication systems.
- the network 840 may use standard communications technologies and/or protocols.
- the network 840 may include communication links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 3G, 4G, code division multiple access (CDMA), digital subscriber line (DSL), etc.
- networking protocols used for communicating via the network 840 may include multiprotocol label switching (MPLS) , transmission control protocol/Internet protocol (TCP/IP) , hypertext transport protocol (HTTP) , simple mail transfer protocol (SMTP) , and file transfer protocol (FTP) .
- Data exchanged over the network 840 may be represented using any suitable format, such as hypertext markup language (HTML) or extensible markup language (XML) .
- all or some of the communication links of the network 840 may be encrypted using any suitable technique or techniques.
- FIG. 9 is a flowchart showing a method 900 of visual recognition, in accordance with various embodiments.
- the method 900 may be a method of 3D visual recognition.
- the method 900 may be performed by the visual recognition module 110 in FIG. 1.
- the method 900 is described with reference to the flowchart illustrated in FIG. 9, many other methods for visual recognition may alternatively be used.
- the order of execution of the steps in FIG. 9 may be changed.
- some of the steps may be changed, eliminated, or combined.
- the visual recognition module 110 generates 910 a point cloud graph by removing one or more points from a point cloud capturing an object.
- the point cloud graph comprises a first group of nodes.
- a node encodes a feature in the object, such as a portion of the object.
- the point cloud graph also includes one or more edges. An edge connects two adjacent nodes. In some embodiments, the edge represents a topological connection between two features in the object that are encoded by the two nodes.
- the visual recognition module 110 generates the feature map in two stages. In some embodiments, the visual recognition module 110 generates an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes. The visual recognition module 110 transforms the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by interpolating one or more new nodes between two nodes in the point cloud graph. The two nodes are connected through an edge in the point cloud graph. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
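A minimal sketch of the two-stage generation under assumed shapes and an assumed formulation: a learnable up-sampling matrix and the adjacency matrix expand K nodes into H·W nodes, and a binary assignment matrix rearranges them into an H × W × C grid. The matrix product U @ A @ X and the identity stand-in for the learned binary matrix are illustrative choices, not the exact formulation of the disclosure.

```python
import numpy as np

K, H, W, C = 16, 5, 5, 3
X = np.random.rand(K, C)                         # features of the first group of nodes
A = (np.random.rand(K, K) < 0.2).astype(float)   # adjacency matrix (illustrative)
U = np.random.rand(H * W, K)                     # learnable up-sampling matrix (illustrative values)
X_up = U @ A @ X                                 # second group of nodes: (H*W) x C
M = np.eye(H * W)                                # learnable binary matrix (identity as a stand-in)
feature_map = (M @ X_up).reshape(H, W, C)        # grid-structured feature map
print(feature_map.shape)                         # (5, 5, 3)
```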
- the visual recognition module 110 inputs the point cloud graph into a trained model.
- the trained model comprises a learnable binary matrix.
- the visual recognition module 110 generates the feature map using the learnable binary matrix.
- the trained model is trained by updating one or more parameters of a real matrix and binarizing the real matrix row by row to obtain the binary matrix.
- the visual recognition module 110 executes 930 one or more deep learning operations in a neural network on the up-sampled grid representation of the object.
- the neural network may be a CNN, e.g., the CNN 150 or the CNN 600.
- the one or more deep learning operations comprises a convolution.
- the convolution is executed on the up-sampled grid representation of the object.
- the convolution has a kernel. The kernel has a smaller size than the feature map.
- the visual recognition module 110 determines 940 a condition of the object based on an output of the neural network.
- the neural network outputs information describing the condition of the object.
- the condition of the object may be a pose, movement, gesture, orientation, mood, color, shape, size, or other types of conditions of the object.
- FIG. 10 is a block diagram of an example computing device 1000, in accordance with various embodiments.
- the computing device 1000 can be used as at least part of the computer vision system 100.
- a number of components are illustrated in FIG. 10 as included in the computing device 1000, but any one or more of these components may be omitted or duplicated, as suitable for the application.
- some or all of the components included in the computing device 1000 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die.
- the computing device 1000 may not include one or more of the components illustrated in FIG. 10, but the computing device 1000 may include interface circuitry for coupling to the one or more components.
- the computing device 1000 may not include a display device 1006, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1006 may be coupled.
- the computing device 1000 may not include an audio input device 1018 or an audio output device 1008, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1018 or audio output device 1008 may be coupled.
- the computing device 1000 may include a processing device 1002 (e.g., one or more processing devices) .
- the processing device 1002 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory.
- the computing device 1000 may include a memory 1004, which may itself include one or more memory devices such as volatile memory (e.g., DRAM) , nonvolatile memory (e.g., read-only memory (ROM) ) , high bandwidth memory (HBM) , flash memory, solid state memory, and/or a hard drive.
- the memory 1004 may include memory that shares a die with the processing device 1002.
- the memory 1004 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for performing 3D visual recognition, e.g., the method 900 described above in conjunction with FIG. 9 or some operations performed by the computer vision system 100 described above in conjunction with FIG. 1.
- the instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1002.
- the computing device 1000 may include a communication chip 1012 (e.g., one or more communication chips) .
- the communication chip 1012 may be configured for managing wireless communications for the transfer of data to and from the computing device 1000.
- wireless and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
- the communication chip 1012 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family), IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment), and the Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2"), etc.).
- the communication chip 1012 may operate in accordance with a Global System for Mobile Communication (GSM) , General Packet Radio Service (GPRS) , Universal Mobile Telecommunications System (UMTS) , High Speed Packet Access (HSPA) , Evolved HSPA (E-HSPA) , or LTE network.
- the communication chip 1012 may operate in accordance with Enhanced Data for GSM Evolution (EDGE) , GSM EDGE Radio Access Network (GERAN) , Universal Terrestrial Radio Access Network (UTRAN) , or Evolved UTRAN (E-UTRAN) .
- the communication chip 1012 may operate in accordance with CDMA, Time Division Multiple Access (TDMA) , Digital Enhanced Cordless Telecommunications (DECT) , Evolution-Data Optimized (EV-DO) , and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond.
- the communication chip 1012 may operate in accordance with other wireless protocols in other embodiments.
- the computing device 1000 may include an antenna 1022 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions) .
- the communication chip 1012 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet) .
- the communication chip 1012 may include multiple communication chips. For instance, a first communication chip 1012 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1012 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others.
- a first communication chip 1012 may be dedicated to wireless communications
- a second communication chip 1012 may be dedicated to wired communications.
- the computing device 1000 may include a display device 1006 (or corresponding interface circuitry, as discussed above) .
- the display device 1006 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD) , a light-emitting diode display, or a flat panel display, for example.
- the computing device 1000 may include an audio output device 1008 (or corresponding interface circuitry, as discussed above) .
- the audio output device 1008 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.
- the computing device 1000 may include an audio input device 1018 (or corresponding interface circuitry, as discussed above) .
- the audio input device 1018 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output) .
- the computing device 1000 may include a GPS device 1016 (or corresponding interface circuitry, as discussed above) .
- the GPS device 1016 may be in communication with a satellite-based system and may receive a location of the computing device 1000, as known in the art.
- the computing device 1000 may include another output device 1010 (or corresponding interface circuitry, as discussed above) .
- Examples of the other output device 1010 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.
- the computing device 1000 may include another input device 1020 (or corresponding interface circuitry, as discussed above) .
- Examples of the other input device 1020 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.
- the computing device 1000 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a PDA, an ultramobile personal computer, etc. ) , a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system.
- the computing device 1000 may be any other electronic device that processes data.
- Example 1 provides a method, including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
- Example 4 provides the method of any one of examples 1-3, in which generating the point cloud graph includes selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and generating a node in the point cloud graph from the selected point.
- Example 5 provides the method of any one of examples 1-4, in which generating the feature map includes generating an additional point cloud graph from the point cloud graph, the point cloud graph including the second group of nodes; and transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
- Example 7 provides the method of example 5 or 6, in which generating the additional point cloud graph includes applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
- Example 8 provides the method of any one of examples 1-7, in which generating the feature map includes inputting the point cloud graph into a trained model, the trained model including a learnable binary matrix and generating the feature map using the learnable binary matrix.
- Example 9 provides the method of example 8, in which the trained model is trained by: updating one or more parameters of a real matrix; and binarizing the real matrix row by row to obtain the binary matrix.
- Example 10 provides the method of any one of examples 1-9, in which the one or more deep learning operations includes a convolution.
- Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
- Example 12 provides the one or more non-transitory computer-readable media of example 11, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
- Example 13 provides the one or more non-transitory computer-readable media of example 11 or 12, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
- Example 20 provides the one or more non-transitory computer-readable media of any one of examples 11-19, in which the one or more deep learning operations includes a convolution.
- Example 21 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes, generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes, executing one or more deep learning operations in a neural network on the feature map, and determining a condition of the object based on an output of the neural network.
- Example 22 provides the apparatus of example 21, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
A learnable point cloud-to-grid (PC2G) encoder and a neural network may be used to perform point cloud-based visual recognition tasks. The PC2G encoder may receive a point cloud capturing at least part of an object. The PC2G encoder may remove one or more points from the point cloud and generate a sparse point cloud graph. The PC2G encoder may convert the sparse point cloud graph to an up-sampled point cloud graph that has more nodes than the sparse point cloud graph. The PC2G encoder may further transform the up-sampled point cloud graph into a feature map with a grid structure by rearranging the nodes in the up-sampled point cloud graph. The feature map may then be processed by the neural network, which may be a convolutional neural network. The neural network may output data indicating one or more conditions (e.g., pose, movement, gesture, orientation, shape, color, etc.) of the object.
Description
Cross-Reference to Related Application
This application claims the benefit of and hereby incorporates by reference, for all purposes, the entirety of the contents of International Application No. PCT/CN2024/084402, filed March 28, 2024, and entitled “LEARNABLE POINT CLOUD TO GRID ENCODER FOR THREE-DIMENSIONAL VISUAL RECOGNITION. ”
This disclosure relates generally to computer vision, and more specifically, to three-dimensional (3D) visual recognition using a learnable encoder that converts a point cloud to a grid.
The last decade has witnessed a rapid rise in AI (artificial intelligence) based data processing, particularly based on deep neural networks (DNNs), such as convolutional neural networks (CNNs), and so on. DNNs are widely used in the domains of computer vision, speech recognition, and image and video processing, mainly due to their ability to achieve beyond human-level accuracy. Computer vision tasks include methods for acquiring, processing, analyzing, or understanding visual images to produce information, such as descriptions of objects in the images, decisions, and so on. Many computer vision tasks are performed using graph-structured data. A graph is a data structure comprising a collection of nodes (or vertices) and one or more edges. An edge is a connection of two nodes.
Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
FIG. 1 illustrates an example computer vision system, in accordance with various embodiments.
FIG. 2A illustrates an example sparse point cloud graph, in accordance with various embodiments.
FIG. 2B illustrates an example up-sampled point cloud graph, in accordance with various embodiments.
FIG. 3 illustrates an example up-sampled grid converted from an up-sampled point cloud graph, in accordance with various embodiments.
FIG. 4 illustrates a convolution executed on an up-sampled grid, in accordance with various embodiments.
FIG. 5 illustrates a process of training a learnable encoder that converts point graphs to feature maps, in accordance with various embodiments.
FIG. 6 illustrates an example CNN, in accordance with various embodiments.
FIG. 7 illustrates an example convolution, in accordance with various embodiments.
FIG. 8 illustrates an AI-based visual recognition environment, in accordance with various embodiments.
FIG. 9 is a flowchart showing a method of visual recognition, in accordance with various embodiments.
FIG. 10 is a block diagram of an example computing device, in accordance with various embodiments.
Overview
A point cloud may be a set of data points in a 3D coordinate system that represents the spatial positions of objects or surfaces in a 3D space. It can be obtained from various sources, such as 3D scanners, Light Detection and Ranging (LiDAR) sensors, depth cameras, and so on. Point cloud-based 3D visual recognition is an important AI application with great potential and value in real scenarios, such as autonomous driving, robotics, virtual reality, augmented reality, mixed reality, simultaneous localization and mapping (SLAM) , and so on.
Currently available approaches for point cloud-based 3D visual recognition include PointNet and its variants, which typically apply a series of fully-connected layers to directly process the raw point cloud data for learning local point features. Another type of point cloud-based 3D visual recognition approach is the voxel-based approach, which typically converts point clouds into 3D voxel grids, with each voxel representing the occupancy or density of points in the voxel, and then uses 3D CNNs for end-to-end feature extraction.
Despite the success of these solutions in addressing the representation problem of point clouds, they fail to address the critical topological relations of point cloud data. For example, PointNet treats each point independently and disregards the global context and relationships between points. A typical voxel-based approach partitions the 3D space of the point cloud into a voxel grid and voxelizes the point cloud data, which can cause loss of the point cloud structure and limited spatial precision. That can degrade the performance of 3D visual recognition.
Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by providing a combination of a learnable point cloud-to-grid (PC2G) encoder and a DNN to perform point cloud-based 3D visual recognition tasks. The present disclosure provides an approach that can improve the point cloud representation learning for point cloud-based 3D visual recognition. It can utilize the efficient 2D convolutions and capture the complex relationships within the point cloud graph for improved feature learning.
In various embodiments of the present disclosure, a computer vision system may use a PC2G encoder and a CNN to determine conditions of objects based on point clouds. Example conditions of objects may include classification, gesture, pose, movement, action, mood, facial expression, orientation, interest, traffic-related condition, other types of conditions, or some combination thereof. Conditions of objects may be used in various applications, such as human pose lifting, skeleton-based human action recognition, 3D mesh reconstruction, traffic navigation, social network analysis, recommender systems, scientific computing, and so on. In an example, a point cloud capturing at least part of an object may be obtained, e.g., by a sensor detecting the object. The point cloud may include points distributed in a 3D space. A point in the point cloud may have 3D coordinates that indicate the location of the point in the 3D space. A point may also have a state value that encodes metadata of the point, such as a degree of validity, a degree of invalidity, classification, and so on. The PC2G encoder may remove one or more points from the point cloud, e.g., based on the state values of the points, and generate a sparse point cloud graph. The total number of nodes in the sparse point cloud graph is less than the total number of points in the point cloud.
The PC2G encoder may further convert the sparse point cloud graph to an up-sampled point cloud graph, which has more nodes than the sparse point cloud graph. In an example, the PC2G encoder may interpolate one or more new nodes between adjacent nodes in the sparse point cloud graph. A new node may be placed along an edge connecting the adjacent nodes. The PC2G encoder may also transform the up-sampled point cloud graph into a feature map with a grid structure by rearranging the nodes in the up-sampled point cloud graph. The feature map may be a grid representation of the object. The feature map may be input into and processed by the CNN. The CNN may output data indicating one or more conditions of the object.
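A minimal sketch of the interpolation step described above, assuming one new node is placed at the midpoint of each edge of the sparse point cloud graph; the node coordinates and the edge list are illustrative.

```python
import numpy as np

nodes = np.random.rand(8, 3)                       # coordinates of the sparse graph's nodes
edges = [(0, 1), (1, 2), (2, 5), (5, 7)]           # illustrative edge list
new_nodes = np.array([(nodes[i] + nodes[j]) / 2.0 for i, j in edges])  # midpoint per edge
up_sampled = np.vstack([nodes, new_nodes])         # 8 original + 4 interpolated nodes
print(up_sampled.shape)                            # (12, 3) -- the up-sampled point cloud graph has more nodes
```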
The grid representation is different from, and more advantageous than, voxel-based point cloud inputs in currently available visual recognition approaches. For example, the point cloud input is down-sampled to construct a sparse graph and then is up-sampled through end-to-end learning conditioned on the target task. Also, the grid representation (e.g., a grid patch) is filled by up-sampled point cloud nodes, with its layout reflecting the topological relations of point cloud features for 3D visual modeling. Further, the grid representation can maintain the high spatial resolution of the point cloud compared with voxels. Instead of being down-sampled as in CNNs, the spatial size of the grid patch can be fixed during representation learning to maintain the point cloud semantics. Moreover, multiple parts of the PC2G encoder (e.g., the module for up-sampling and the module for transforming) can be jointly learned. In some embodiments, these parts of the PC2G encoder may even be jointly learned with the CNN. This can enable the learning of a grid patch with a decent layout for 3D visual recognition.
The computation cost of the PC2G encoder can be negligible. Specifically, let K denote the number of point cloud graph nodes, let H×W denote the spatial size of the grid patch, and let C (here usually C = 3) denote the dimension of the point cloud feature. As defined above, the number of multiply-accumulate (MAC) operations is HWKC for up-sampling and HWC for transforming. The PC2G encoder can achieve significant performance improvement with this negligible extra compute.
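For a concrete sense of scale, the sketch below evaluates the HWKC + HWC MAC count for illustrative sizes; the values of K, H, W, and C are assumptions, not sizes stated in the disclosure.

```python
# Quick sanity check of the stated PC2G compute cost, using illustrative sizes.
K, H, W, C = 64, 16, 16, 3        # graph nodes, grid height, grid width, feature dimension
macs_upsampling = H * W * K * C   # HWKC MACs for up-sampling
macs_transform = H * W * C        # HWC MACs for transforming to the grid
print(macs_upsampling + macs_transform)   # 49920 MACs in total -- negligible next to a CNN's cost
```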
For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without the specific details and/or that the present disclosure may be practiced with only some of the described aspects. In other instances, well known features are omitted or simplified in order not to obscure the illustrative implementations.
Further, references are made to the accompanying drawings that form a part hereof, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized, and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense.
Various operations may be described as multiple discrete actions or operations in turn, in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as to imply that these operations are necessarily order dependent. In particular, these operations may not be performed in the order of presentation. Operations described may be performed in a different order from the described embodiment. Various additional operations may be performed or described operations may be omitted in additional embodiments.
For the purposes of the present disclosure, the phrase “A or B” or the phrase "A and/or B" means (A) , (B) , or (A and B) . For the purposes of the present disclosure, the phrase “A, B, or C” or the phrase "A, B, and/or C" means (A) , (B) , (C) , (A and B) , (A and C) , (B and C) , or (A, B, and C) . The term "between, " when used with reference to measurement ranges, is inclusive of the ends of the measurement ranges.
The description uses the phrases "in an embodiment" or "in embodiments, " which may each refer to one or more of the same or different embodiments. The terms "comprising, " "including, " "having, " and the like, as used with respect to embodiments of the present disclosure, are synonymous. The disclosure may use perspective-based descriptions such as "above, " "below, " "top, " "bottom, " and "side" to explain various features of the drawings, but these terms are simply for ease of discussion, and do not imply a desired or required orientation. The accompanying drawings are not necessarily drawn to scale. Unless otherwise specified, the use of the ordinal adjectives “first, ” “second, ” and “third, ” etc., to describe a common object, merely indicates that different instances of like objects are being referred to and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.
In the following detailed description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art.
The terms “substantially,” “close,” “approximately,” “near,” and “about” generally refer to being within +/- 20% of a target value as described herein or as known in the art. Similarly, terms indicating orientation of various elements, e.g., “coplanar,” “perpendicular,” “orthogonal,” “parallel,” or any other angle between the elements, generally refer to being within +/- 5-20% of a target value as described herein or as known in the art.
In addition, the terms “comprise, ” “comprising, ” “include, ” “including, ” “have, ” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators. Also, the term “or” refers to an inclusive “or” and not to an exclusive “or. ”
The systems, methods and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for all desirable attributes disclosed herein. Details of one or more implementations of the subject matter described in this specification are set forth in the description below and the accompanying drawings.
Example Computer Vision System
FIG. 1 illustrates an example computer vision system 100, in accordance with various embodiments. The computer vision system 100 includes a visual recognition module 110, a training module 120, a validating module 125, and a datastore 130. In other embodiments, alternative configurations, different or additional components may be included in the computer vision system 100. Further, functionality attributed to a component of the computer vision system 100 may be accomplished by a different component included in the computer vision system 100 or by a different system.
The visual recognition module 110 performs point cloud-based computer vision tasks. The visual recognition module 110 may receive point clouds, which capture objects, and detect various conditions of the objects using the point clouds. As shown in FIG. 1, the visual recognition module 110 includes a PC2G encoder 140 and a CNN 150. In other embodiments, the visual recognition module 110 may include fewer, more, or different components. The PC2G encoder 140 may convert point clouds of objects to grid representations of objects. A grid representation may include nodes arranged in a grid structure. A grid representation may be processed by the CNN 150 as an input feature map of a convolutional layer in the CNN 150. As shown in FIG. 1, the PC2G encoder 140 includes a down-sampling module 160, an up-sampling model 170, and a reshaping module 180. In other embodiments, alternative configurations, different or additional components may be included in the PC2G encoder 140. Further, functionality attributed to a component of the PC2G encoder 140 may be accomplished by a different component included in the visual recognition module 110 or in the computer vision system 100 or by a different system.
The down-sampling module 160 down-samples point clouds to generate sparse graphs. In some embodiments, the down-sampling module 160 may perform graph construction transformation. The down-sampling module 160 may transform a point cloud to a sparse graph by removing at least one point (e.g., invalid point or noisy point) from the point cloud. The sparse graph is also referred to as a point cloud graph. The sparse graph may be a graph representation of the object. A graph representation of an object may be a graph representing the object. The sparse graph may include nodes representing features of the object. A feature of the object may be a portion of the object, an attribute (e.g., location, position, color, shape, size, texture, material, etc. ) of a portion of the object, an attribute of the object, other types of features, or some combination thereof. The nodes may be connected by one or more edges that represent relationships between the features. Examples of the object may include human body, human body part, animal, plant, vehicle, robot, and so on.
The total number of nodes in the sparse graph may be smaller than the total number of points in the point cloud. The down-sampling module 160 may generate a node in the point cloud graph using at least one point in the point cloud. In an example, the down-sampling module 160 may select one or more points from the point cloud and generate a node from each selected point. The down-sampling module 160 may remove one or more unselected points from the point cloud. Alternatively, the down-sampling module 160 may select one or more points to be removed from the point cloud and generate a node in the sparse graph using an unselected point. In another example, the down-sampling module 160 may generate a node in the sparse graph based on multiple points in the point cloud. For
instance, the down-sampling module 160 may select multiple points and generate a node based on the selected points.
In some embodiments, the down-sampling module 160 may divide the point cloud into regions. The regions may have no overlap with each other. The down-sampling module 160 may generate a node in the sparse graph from each of the regions. In some embodiments, the down-sampling module 160 may down-sample the point cloud to a specific number of points. In an example, the down-sampling module 160 may select a maximum point from each region and generate a node based on the maximum point. In another example, the down-sampling module 160 may average all or some of the points in a region and generate a node based on the average of the points. For instance, the down-sampling module 160 may identify valid points in the point cloud, e.g., based on state values of the points, and average the valid points within each of non-overlapping regions.
In some embodiments, a point cloud with N points may be denoted as P = {p1, …, pN}, where pi = (xi, si) is a point with 3D coordinates xi and a state value si. In some embodiments (e.g., embodiments where the down-sampling module 160 divides the point cloud into regions), the point cloud P may be split into K spatially non-overlapping regions {R1, …, RK}. The down-sampled K points may be denoted as P′ = {p′1, …, p′K}, where p′i may be calculated upon the region Ri. The down-sampling module 160 may generate a sparse graph denoted as G = (P′, E, A), which uses P′ as the vertices and connects a point to its neighbors within a fixed radius r:

E = {(p′i, p′j) : ‖x′i − x′j‖2 < r}

A ∈ {0, 1}^(K×K) may denote an adjacency matrix. When there is an edge between points p′i and p′j, the entry A(i, j) = 1; otherwise, A(i, j) = 0. The corresponding feature of the sparse graph may be denoted as X ∈ R^(K×C), where C denotes the point cloud feature dimension.
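For illustration only, the following sketch shows one way the region-based down-sampling and radius-graph construction described above could be implemented, assuming NumPy arrays, an axis-aligned voxel split into non-overlapping regions, and a positive-state validity test; the function names downsample_point_cloud and build_radius_graph, the voxel split, and the validity criterion are assumptions made for illustration rather than the specific implementation of the down-sampling module 160.

```python
import numpy as np

def downsample_point_cloud(coords, states, voxel_size=0.25):
    """Averages the valid points within non-overlapping voxel regions.

    coords: (N, 3) array of 3D coordinates; states: (N,) array of state values.
    Points with a state value above zero are treated as valid (illustrative criterion).
    Returns a (K, 3) array of down-sampled points, one per non-empty region.
    """
    coords = coords[states > 0]                                  # drop invalid points
    region_ids = np.floor(coords / voxel_size).astype(np.int64)  # assign each point to a region
    _, inverse = np.unique(region_ids, axis=0, return_inverse=True)
    K = inverse.max() + 1
    sums = np.zeros((K, 3))
    counts = np.zeros(K)
    np.add.at(sums, inverse, coords)                             # accumulate coordinates per region
    np.add.at(counts, inverse, 1.0)
    return sums / counts[:, None]                                # per-region average -> one node each

def build_radius_graph(points, r):
    """Connects down-sampled points whose Euclidean distance is below the radius r."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    A = (dist < r).astype(np.float32)                            # K x K adjacency matrix
    np.fill_diagonal(A, 0.0)                                     # no self-loops
    return A

# Example usage with a random point cloud of N = 1000 points.
rng = np.random.default_rng(0)
coords = rng.uniform(-1.0, 1.0, size=(1000, 3))
states = rng.integers(0, 2, size=1000)
P_prime = downsample_point_cloud(coords, states)
A = build_radius_graph(P_prime, r=0.4)
```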
The up-sampling model 170 generates up-sampled graph representations of objects. In some embodiments, the up-sampling model 170 generates an up-sampled graph from a sparse graph generated by the down-sampling module 160. For instance, the up-sampling model 170 may interpolate one or more new nodes into the sparse graph, which constitutes at least part of an up-sampling transform. In some embodiments, the down-sampling transform performed by the down-sampling module 160 may lead to information loss as raw
points are removed, which may impact the accuracy of visual recognition. The up-sampling transform performed by the up-sampling model 170 may address this problem, e.g., by interpolating expressive nodes conditioned on the target task, which can improve the efficiency and feature learning for point cloud data.
In an example, the up-sampling model 170 may transform a graph from K nodes to H×W nodes, where H×W ≥ K. The up-sampling model 170 may generate the up-sampled graph using a transform denoted as X′ = (Λ·A)·X, where Λ denotes the up-sampling matrix and X′ denotes the up-sampled graph. By using the adjacency matrix A, the up-sampling model 170 can interpolate new nodes using adjacent nodes along existing edges in the sparse graph. For instance, the up-sampling model 170 may add one or more new nodes between two adjacent nodes that are connected by an edge in the sparse graph. This allows the PC2G encoder 140 to incorporate the topological prior of the point cloud for an improved 3D visual representation.
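As a hedged illustration of the transform X′ = (Λ·A)·X, the sketch below implements the up-sampling as a small PyTorch module with a learnable matrix Λ of shape (H·W)×K; the class name GraphUpSampler and the random initialization are assumptions made for illustration, not a definitive implementation of the up-sampling model 170.

```python
import torch
import torch.nn as nn

class GraphUpSampler(nn.Module):
    """Up-samples a K-node sparse graph feature X (K x C) to H*W nodes via X' = (Lambda @ A) @ X."""

    def __init__(self, K, H, W):
        super().__init__()
        # Learnable up-sampling matrix Lambda with shape (H*W, K).
        self.Lambda = nn.Parameter(torch.randn(H * W, K) * 0.01)

    def forward(self, X, A):
        # X: (K, C) node features of the sparse graph; A: (K, K) adjacency matrix.
        # Multiplying Lambda by A propagates the learned interpolation weights along
        # existing edges, so new nodes are built from adjacent nodes in the sparse graph.
        return (self.Lambda @ A) @ X        # (H*W, C) up-sampled node features

# Example: up-sample a 64-node sparse graph to the 256 nodes of a 16 x 16 grid.
K, H, W, C = 64, 16, 16, 32
X = torch.randn(K, C)
A = (torch.rand(K, K) < 0.1).float()        # random adjacency for illustration
up = GraphUpSampler(K, H, W)
X_up = up(X, A)                             # shape (256, 32)
```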
The reshaping module 180 reshapes up-sampled graphs generated by the up-sampling model 170 into feature maps with grid structures. In some embodiments, the reshaping module 180 may perform an index transform on an up-sampled graph, in which the nodes in the up-sampled graph are rearranged into one or more rows and one or more columns, which constitute a grid representation of the object. A grid representation may be a grid representing an object. The grid may include elements arranged in a grid structure that has rows and columns. In some embodiments, the reshaping module 180 may allocate the nodes in the up-sampled graph one by one to the desired grid elements in the grid structure.
In some embodiments, the reshaping module 180 performs a mapping from the index of nodes in the up-sampled point cloud graph X′ to the spatial index of grid cells in a grid patch DH×W, which is defined by a binary matrix Φ ∈ {0, 1}^(HW×HW). Each row φi ∈ {0, 1}^HW in Φ may be a one-hot vector indicating the selected index of a node in the up-sampled point cloud graph, that is, a grid cell di is filled with a specific graph node vj when the entry φi,j = 1. The reshaping module 180 may obtain a feature map Y from the up-sampled point cloud graph X′ through Y = reshape(Φ·X′), where “·” is a row-by-column multiplication, with each row of the product matrix being the selected graph node feature. The layout of the grid patch DH×W can be learned along with Φ. The reshape(·) operation of the reshaping module 180 rearranges the output of Φ·X′ into a grid representation of the object. The grid representation may be used as an input feature map (IFM) having a spatial size denoted as H×W×C, where H is the height of the feature map, which may be equal to the total number of grid cells in a column, W is the width of the feature map, which may be equal to the total number of grid cells in a row, and C may indicate the number of input channels.
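A minimal sketch of this index transform, assuming a PyTorch tensor for X′ and a one-hot-per-row binary matrix Φ, is shown below; the function name graph_to_grid and the (C, H, W) output layout are illustrative assumptions.

```python
import torch

def graph_to_grid(X_up, Phi, H, W):
    """Rearranges up-sampled graph nodes into a grid feature map: Y = reshape(Phi @ X_up).

    X_up: (H*W, C) up-sampled node features.
    Phi:  (H*W, H*W) binary matrix; each row is one-hot and selects the node for one grid cell.
    Returns a (C, H, W) feature map that can be used as a CNN input.
    """
    Y = Phi @ X_up                          # row i of Y is the feature of the node placed in cell i
    return Y.t().reshape(-1, H, W)          # rearrange into a (C, H, W) grid layout

# Example: place 256 up-sampled nodes onto a 16 x 16 grid using a permutation as Phi.
H, W, C = 16, 16, 32
X_up = torch.randn(H * W, C)
Phi = torch.eye(H * W)[torch.randperm(H * W)]   # a valid one-hot-per-row binary matrix
feature_map = graph_to_grid(X_up, Phi, H, W)    # shape (32, 16, 16)
```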
The CNN 150 may receive grid representations generated by the PC2G encoder 140. The CNN 150 may include one or more convolutional layers. One or more convolutions may be performed on each grid representation generated by the PC2G encoder 140. A grid representation may be used as an input tensor (e.g., an IFM) of a convolution. Each grid element (or grid cell), which may be a node in the corresponding up-sampled graph, may be used as an activation of the convolution. The convolution may have a kernel including a plurality of weights. The kernel may have a smaller size than the IFM. For instance, the height or width of the kernel may be smaller than the height or width of the IFM. An example of the convolution may be the convolution illustrated in FIG. 7. The convolution may produce an output feature map (OFM), which may be further processed by other layers of the CNN 150. The CNN 150 may also include other types of layers, such as linear layers, pooling layers, and so on. An example of the CNN 150 is the CNN 600 in FIG. 6.
The CNN 150 may output information indicating conditions of objects. Example conditions may include classification, gesture, pose, movement, action, mood, orientation, interest, traffic-related condition, other types of conditions, or some combination thereof. Conditions of objects may be used in various applications, such as human pose lifting, skeleton-based human action recognition, 3D mesh reconstruction, traffic navigation, social network analysis, recommendation systems, scientific computing, and so on.
The training module 120 trains the visual recognition module 110. In the training process, the training module 120 may form a training dataset. The training dataset includes training samples and ground-truth labels. The training samples may be point clouds. Each training sample may be associated with one or more ground-truth labels. A ground-truth label of a training sample may be a known or verified label that answers the problem or question that the visual recognition module 110 will be used to answer, such as known or verified visual recognition of one or more conditions of an object captured by the point
cloud in the training sample. In an example where the visual recognition module 110 is used to estimate pose, a ground-truth label may indicate a ground-truth pose of an object in the training sample. The ground-truth label may be a numerical value that indicates a pose or a likelihood of the object having a pose.
The training module 120 may modify internal parameters of the visual recognition module 110 based on the ground-truth labels of the training samples and the outputs of the visual recognition module 110 that are generated by processing the training samples. The internal parameters may include one or more internal parameters of the PC2G encoder 140 or one or more internal parameters of the CNN 150. In some embodiments, the training module 120 modifies the internal parameters of the visual recognition module 110 to minimize the error between labels of the training samples that are generated by the visual recognition module 110 and the ground-truth labels. In some embodiments, the training module 120 uses a cost function or loss function to minimize the error.
In some embodiments, the training module 120 may jointly train at least part of the PC2G encoder 140 and at least part of the CNN 150. In an example, the training module 120 may train the up-sampling model 170, the reshaping module 180, and the CNN 150 through the same training process. The training module 120 may input one or more training samples into the visual recognition module 110, e.g., directly input into the up-sampling model 170. The training module 120 may cause executions of the up-sampling model 170, the reshaping module 180, and the CNN 150 on the one or more training samples. The training module 120 may further update one or more internal parameters of the up-sampling model 170, the reshaping module 180, and the CNN 150 based on the outputs of the visual recognition module 110 and one or more ground-truth labels associated with the one or more training samples. The outputs of the visual recognition module 110 may be outputs of the CNN 150. By jointly training the up-sampling model 170, the reshaping module 180, and the CNN 150 in an end-to-end manner, relationships between features of objects may be automatically learned.
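For illustration, a minimal sketch of such end-to-end joint training is shown below, assuming PyTorch modules named up_sampler, reshaper, and cnn, a data loader yielding (sparse-graph features, adjacency matrix, label) tuples, and a cross-entropy loss; all of these names and choices are assumptions for illustration rather than the specific training procedure of the training module 120.

```python
import torch
import torch.nn as nn

def train_jointly(up_sampler, reshaper, cnn, train_loader, num_epochs=150, lr=1e-3):
    """Jointly trains the up-sampling model, the reshaping module, and the CNN end to end.

    train_loader is assumed to yield (X, A, label) tuples: sparse-graph node features,
    the adjacency matrix, and a ground-truth label for the captured object.
    """
    params = (list(up_sampler.parameters()) + list(reshaper.parameters())
              + list(cnn.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        for X, A, label in train_loader:
            X_up = up_sampler(X, A)            # sparse graph -> up-sampled graph
            grid = reshaper(X_up)              # up-sampled graph -> grid feature map
            logits = cnn(grid)                 # grid -> predicted condition of the object
            loss = criterion(logits, label)    # error against the ground-truth label
            optimizer.zero_grad()
            loss.backward()                    # gradients flow through all three modules
            optimizer.step()
```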
In other embodiments, the training module 120 may conduct progressive training of the up-sampling model 170 and the reshaping module 180. The training module 120 may train the up-sampling model 170 and the reshaping module 180 in multiple training stages for multiple tasks to be performed by the up-sampling model 170 and the reshaping module
180. In an example, the training module may train the up-sampling model 170 and the reshaping module 180 jointly with the CNN 150 in the first training stage. In the first training stage, the up-sampling model 170 may be trained for up-sampling sparse graphs to generate up-sampled graphs, the reshaping module 180 may be trained to transform up-sampled graphs generated by the up-sampling model 170 to grids, and the CNN 150 may be trained to process grids generated by the reshaping module 180 to determine conditions of objects.
In the second training stage, the training module 120 may update the internal parameters of the up-sampling model 170 and the reshaping module 180 to train the up-sampling model 170 and the reshaping module 180 for tasks to be performed in a different stage. For instance, the up-sampling model 170 may be trained for converting grids generated by the reshaping module 180 to up-sampled graphs, and the reshaping module 180 may be trained for transforming up-sampled graphs generated by the up-sampling model 170 to up-sampled grids. The CNN 150, which has been trained to process grids in the first training stage, can process the up-sampled grids generated by the reshaping module 180. The values of the internal parameters of the up-sampling model 170 or the reshaping module 180 that are determined in the first training stage may be fixed and may not be adjustable in the second training stage. The internal parameters of the up-sampling model 170 or the reshaping module 180 that are updated in the second training stage may be different internal parameters from the internal parameters determined in the first training stage.
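A minimal sketch of fixing the first-stage parameters while leaving later-stage parameters adjustable is shown below, assuming PyTorch modules and an illustrative "stage1" naming convention for grouping parameters by training stage; the convention and the commented optimizer setup are assumptions for illustration.

```python
import torch

def second_stage_parameters(module, first_stage_keyword="stage1"):
    """Fixes the parameters learned in the first training stage and returns the rest.

    The "stage1" naming convention for grouping parameters by training stage is an
    illustrative assumption; any other grouping would work the same way.
    """
    trainable = []
    for name, param in module.named_parameters():
        if first_stage_keyword in name:
            param.requires_grad = False        # first-stage values stay fixed
        else:
            trainable.append(param)            # later-stage parameters remain adjustable
    return trainable

# The second-stage optimizer then only receives the adjustable parameters, e.g.:
# optimizer = torch.optim.Adam(second_stage_parameters(up_sampler)
#                              + second_stage_parameters(reshaper)
#                              + list(cnn.parameters()), lr=1e-4)
```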
The CNN 150 may also be further trained in the second training stage. The values of the internal parameters of the CNN 150 may start with the values that are learned in the first training stage. The training module 120 may further adjust the values of the internal parameters of the CNN 150 to further train the CNN 150 in the second training stage. In some embodiments, the up-sampling model 170, the reshaping module 180, and the CNN 150 are jointly trained in the second training stage.
In some embodiments, the training module 120 may further train the up-sampling model 170 or the reshaping module 180 in one or more subsequent training stages for additional tasks to be performed by the up-sampling model 170 or the reshaping module 180. In each training stage, the up-sampling model 170 or the reshaping module 180 of the corresponding stage may be trained. With multiple training stages, the up-sampling model
170 or the reshaping module 180 can learn progressively.
In some embodiments, the training module 120 may also form validation datasets for validating performance of the visual recognition module 110 after training by the validating module 125. A validation dataset may include validation samples and ground-truth labels of the validation samples. The validation dataset may include different samples from the training dataset used for training the visual recognition module 110. In an embodiment, a part of a training dataset may be used to initially train the visual recognition module 110, and the rest of the training dataset may be held back as a validation subset used by the validating module 125 to validate performance of the visual recognition module 110. The portion of the training dataset not including the validation subset may be used to train the visual recognition module 110.
The training module 120 also determines hyperparameters for training the visual recognition module 110. Hyperparameters are variables specifying the training process. Hyperparameters are different from parameters inside the visual recognition module 110 (“internal parameters,” e.g., internal parameters of the up-sampling model 170, internal parameters of the reshaping module 180, weights for convolution operations in the CNN 150, etc.). In some embodiments, hyperparameters include variables determining the architecture of at least part of the visual recognition module 110, such as the number of hidden layers in the CNN 150, and so on. Hyperparameters also include variables which determine how the visual recognition module 110 is trained, such as batch size, number of epochs, etc. The batch size defines the number of training samples to work through before updating the parameters of the visual recognition module 110. The batch size is the same as or smaller than the number of samples in the training dataset. The training dataset can be divided into one or more batches. The number of epochs defines how many times the entire training dataset is passed forward and backward through the entire network, i.e., the number of times the deep learning algorithm works through the entire training dataset. One epoch means that each training sample in the training dataset has had an opportunity to update the internal parameters of the visual recognition module 110. An epoch may include one or more batches. The number of epochs may be 15, 150, 500, 1500, or even larger.
The training module 120 may define the architecture of the visual recognition
module 110 (or part of the visual recognition module 110, e.g., the CNN 150), e.g., based on some of the hyperparameters. The architecture of the CNN 150 includes an input layer, an output layer, and a plurality of hidden layers. The input layer of the CNN 150 may include tensors (e.g., a multi-dimensional array) specifying attributes of the IFM, such as the height of the IFM, the width of the IFM, and the depth of the IFM (e.g., the number of channels in the IFM). The output layer includes labels of objects in the input layer. The hidden layers are layers between the input layer and the output layer. The hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, softmax or logistic layers, and so on. The convolutional layers of the CNN 150 may convert the IFM to a feature map that is represented by a tensor specifying the feature map height, the feature map width, and the feature map channels. A pooling layer is used to reduce the spatial volume of the IFM after convolution. It is used between two convolutional layers. A fully-connected layer involves weights, biases, and neurons. It connects neurons in one layer to neurons in another layer. It is used to classify images into different categories by training.
In the process of defining the architecture of the CNN 150, the training module 120 also adds an activation function to a hidden layer or the output layer. An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer. The activation function may be, for example, a rectified linear unit (ReLU) activation function, a tangent activation function, or other types of activation functions.
The training module 120 may train the visual recognition module 110 for a predetermined number of epochs. The number of epochs is a hyperparameter that defines the number of times that the DL algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal parameters of the visual recognition module 110. After the training module 120 finishes the predetermined number of epochs, the training module 120 may stop updating the internal parameters of the visual recognition module 110, and the visual recognition module 110 is considered trained.
The validating module 125 verifies accuracy of the visual recognition module 110 after the visual recognition module 110 is trained. In some embodiments, the validating module 125 inputs samples in a validation dataset into the visual recognition module 110
and uses the outputs of the visual recognition module 110 to determine the model accuracy. In some embodiments, a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets. In some embodiments, the validating module 125 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN. The validating module 125 may use the following metrics to determine the accuracy score: Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where precision may be the number of objects the visual recognition module 110 correctly predicted (TP, or true positives) out of the total number it predicted (TP + FP, where FP is false positives), and recall may be the number of objects the visual recognition module 110 correctly predicted (TP) out of the total number of objects that did have the property in question (TP + FN, where FN is false negatives). The F-score (F-score = 2·P·R / (P + R), where P is precision and R is recall) unifies precision and recall into a single measure.
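For illustration, a minimal sketch of computing these metrics from binary predicted and ground-truth labels follows; the function name and the toy example values are assumptions for illustration.

```python
def precision_recall_fscore(predicted, ground_truth):
    """Computes precision, recall, and F-score for binary labels (1 = has the property)."""
    tp = sum(1 for p, g in zip(predicted, ground_truth) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, ground_truth) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, ground_truth) if p == 0 and g == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score

# Example: 3 true positives, 1 false positive, 1 false negative.
print(precision_recall_fscore([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))  # (0.75, 0.75, 0.75)
```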
The validating module 125 may compare the accuracy score with a threshold score. In an example where the validating module 125 determines that the accuracy score is lower than the threshold score, the validating module 125 instructs the training module 120 to re-train the visual recognition module 110. In one embodiment, the training module 120 may iteratively re-train the visual recognition module 110 until the occurrence of a stopping condition, such as the accuracy measurement indicating that the visual recognition module 110 is sufficiently accurate, or a number of training rounds having taken place.
The datastore 130 stores data received, generated, used, or otherwise associated with the computer vision system 100. For example, the datastore 130 stores the datasets used by the training module 120 and validating module 125. The datastore 130 may also store data generated by the training module 120 and validating module 125, such as the hyperparameters for training the visual recognition module 110, internal parameters of the visual recognition module 110, and so on. As another example, the datastore 130 may store point clouds to be processed by the visual recognition module 110 for performing computer vision tasks. The datastore 130 may also store outputs of the visual recognition module 110 or components of the visual recognition module 110, such as sparse graphs, up-sampled graphs, feature maps, labels indicating conditions of objects, and so on. In the embodiment of FIG. 1, the datastore 130 is a component of the computer vision system 100. In other
embodiments, the datastore 130 may be external to the computer vision system 100 and communicate with the computer vision system 100 through a network.
Example Point Cloud Graphs
FIG. 2A illustrates an example sparse point cloud graph 200, in accordance with various embodiments. The sparse point cloud graph 200 is a graph representing a chair. The sparse point cloud graph 200 includes information encoding features of the chair. The sparse point cloud graph 200 may be generated by the down-sampling module 160 in FIG. 1 using a point cloud that captures the chair.
For the purpose of illustration, FIG. 2A also shows a zoomed-in view of a subset 205 of the sparse point cloud graph 200. As shown in FIG. 2A, the sparse point cloud graph 200 includes a plurality of nodes 210 (individually referred to as “node 210”) and various edges 220 (individually referred to as “edge 220”) connecting the nodes 210. The nodes 210 are shown as black circles in FIG. 2A, and the edges 220 are shown as short lines in FIG. 2A. Each node 210 may represent a point of the chair. In some embodiments, a node 210 may have 3D coordinates that indicate a position of the point in a 3D space. The node 210 may also include information that indicates one or more attributes of the point, such as validity or invalidity of the point, importance of the point, and so on. The edges 220 may represent topological relationships between the points. For instance, an edge 220 connects two nodes 210 representing two adjoining points.
FIG. 2B illustrates an example up-sampled point cloud graph 250, in accordance with various embodiments. The up-sampled point cloud graph 250 is another graph representation of the chair. The up-sampled point cloud graph 250 may be converted from the sparse point cloud graph 200, e.g., by the up-sampling model 170. For the purpose of illustration, FIG. 2B also shows a zoomed-in view of a subset 255 of the up-sampled point cloud graph 250. The subset 255 may correspond to the subset 205. For instance, the subset 255 may be generated from up-sampling the subset 205. The up-sampled point cloud graph 250 may include all the nodes 210 in the sparse point cloud graph 200 plus nodes 215. The nodes 215 are represented by dotted circles in FIG. 2B. The nodes 215 are new nodes that do not exist in the sparse point cloud graph 200. In some embodiments, the position of a node 215 is determined based on the positions of some nodes 210. For instance, a node 215 may be between two adjacent nodes 210. In some embodiments, a node 215 may be placed
on an edge 220. With the addition of the nodes 215, the up-sampled point cloud graph 250 has more nodes than the sparse point cloud graph 200. As shown in FIGS. 2A and 2B, the up-sampled point cloud graph 250 appears to be denser than the sparse point cloud graph 200.
Example Grid-structured Feature Maps
FIG. 3 illustrates a feature map 300, in accordance with various embodiments. The feature map 300 may be a grid representation of an object, e.g., the chair in FIGS. 2A and 2B. The feature map 300 may be generated from an up-sampled graph, e.g., the up-sampled graph 250 in FIG. 2B. For the purpose of illustration, the feature map 300 has a 2D grid structure, which may be denoted as DH×W. The grid structure has rows along the X axis and columns along the Y axis. The feature map 300 may have a width W along the X axis and a height H along the Y axis. The spatial size of the feature map 300 may be the width W, the height H, or the area H×W. In some embodiments, the width W may be the total number of elements in a row of the feature map 300. The height H may be the total number of elements in a column of the feature map 300. Each element is represented by a dark circle in FIG. 3. In some embodiments, an element may be a node in the up-sampled graph, e.g., a node 210 or a node 215. In some embodiments, the up-sampled graph is input into the reshaping module 180, and the reshaping module 180 outputs the feature map 300. For the purpose of illustration, the feature map 300 is a 2D tensor. In other embodiments, the feature map 300 may be a 3D tensor, e.g., a tensor with a spatial size H×W×C. In an example, C may be 2, 3, or a larger number.
A subtensor 305 in the feature map 300 is shown in FIG. 3. The subtensor 305 is a portion of the feature map 300. The subtensor 305 may correspond to the subset 205 of the sparse point cloud graph 200 in FIG. 2A or the subset 255 of the up-sampled point cloud graph 250 in FIG. 2B. For instance, the subtensor 305 may be generated by rearranging the nodes in the subset 255 of the up-sampled point cloud graph 250 into a grid.
FIG. 4 illustrates a convolution 400 executed on a grid 410, in accordance with various embodiments. The grid 410 may be generated by reshaping a graph. For the purpose of illustration, the grid 410 is two-dimensional and has a spatial size of 6×6, i.e., there are six elements in each row and six elements in each column. Every element of the grid 410 is represented by a black circle in FIG. 4. An element may be a node in the graph.
The convolution 400 may be a deep learning operation in a CNN, e.g., the CNN 150. In the embodiments of FIG. 4, the convolution 400 has a kernel 420 with a spatial size of 3×3, i.e., the kernel 420 has nine weights arranged in three rows and three columns. The grid 410 is used as an IFM of the convolution 400. During the convolution 400, multiply-accumulate (MAC) operations are performed as the kernel 420 slides through the grid 410, as indicated by the arrows in FIG. 4. For the purpose of illustration, the stride for applying the kernel on the grid 410 is 1, meaning the kernel slides one data element at a time. In other embodiments, the stride may be more than one.
In the embodiments of FIG. 4, the convolution 400 generates an OFM 430, which is a 6×6 tensor and has the same spatial size as the grid 410. In other embodiments, the OFM 430 may have a different spatial size. The convolution 400 may include padding, through which additional data elements are added to the grid 410 before the kernel is applied on the grid 410. In an example where the padding factor is 1, one additional row is added to the top of the grid 410, one additional row is added to the bottom of the grid 410, one additional column is added to the right of the grid 410, and one additional column is added to the left of the grid 410. The size of the grid 410 after the padding would become 8×8. The OFM 430 may be processed in additional deep learning operations in the CNN, e.g., another convolution, activation function, pooling operation, linear transformation, and so on. The CNN may output information indicating one or more conditions of the object 310.
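For illustration, the sketch below reproduces this setup in PyTorch with a single-channel 6×6 grid, a 3×3 kernel, a stride of 1, and a padding of 1, so the output keeps the 6×6 spatial size; the single input and output channel is an assumption made for simplicity.

```python
import torch
import torch.nn as nn

# A single-channel 6 x 6 grid, matching the spatial layout of the grid 410 in FIG. 4.
grid = torch.randn(1, 1, 6, 6)             # (batch, channels, height, width)

# 3 x 3 kernel, stride 1, padding 1: the padded input is 8 x 8, so the OFM stays 6 x 6.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)
ofm = conv(grid)
print(ofm.shape)                           # torch.Size([1, 1, 6, 6])
```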
Example Learning Process
FIG. 5 illustrates a process of training a learnable PC2G encoder that converts point graphs to feature maps, in accordance with various embodiments. The training process has a forward path 510 and a backward path 520. The learnable PC2G encoder has a real matrix 501, a binarization module 502, and a binary matrix 503. The learnable PC2G encoder may have other components, which are not shown in FIG. 5. The learnable PC2G encoder may be an example of the PC2G encoder 140 in FIG. 1.
In some embodiments, the real matrix 501 may be a continuous approximation of the binary matrix 503, which may be denoted as Φ. The real matrix 501 may be used to assist the learning of the binary matrix 503. In some embodiments, directly learning the binary matrix 503 would cut off the gradient flow in the backward path 520, which can make the training non-differentiable. A straight-through
estimator (STE) may be used for parameter update to solve this problem.
In the forward path 510, the binarization module 502 may convert the real matrix 501 to the binary matrix 503 by binarizing the real matrix 501 row by row. For each row in the binary matrix 503, the entry in the column that has the maximum value in the corresponding row of the real matrix 501 may be set to one; the other entries in that row of the binary matrix 503 may be set to zero. In the backward path 520, a continuous gradient may be used to update the real matrix 501, instead of the binary matrix 503. By introducing the real matrix 501 as a continuous approximation of the binary matrix 503, the training process can search for a decent layout of the grid patch for improved visual expressiveness.
The binary matrix 503 may then be used to convert an up-sampled graph 504 to a grid 505. The grid 505 is a 3D tensor that has a spatial size of 5×5×3. In other embodiments, the grid may have a different shape or dimension. For instance, the grid 505 may have one or more larger dimensions. Each element in the grid 505 may be a different node in the up-sampled graph 504. The grid 505 is input into a CNN 506 and may be processed by the CNN 506 as an IFM. The CNN 506 may be an example of the CNN 150 in FIG. 1. The CNN 506 may generate a label that indicates visual recognition of an object represented by the up-sampled graph 504.
In the backward path 520, the real matrix 501 may be updated based on a loss. The loss may indicate a difference between the label generated by the CNN 506 and a ground-truth label associated with the up-sampled graph 504, e.g., a known or verified visual recognition of the object. The update of the real matrix 501 may include a change of the value of at least one element of the real matrix 501. In some embodiments, internal parameters (e.g., weights) of the CNN 506 may be updated as well in the backward path 520. After the backward path 520 is complete, the forward path 510 may be performed again, e.g., by using a different up-sampled graph. The backward path 520 may be performed again as well. The training process may be an iterative process, in which the forward path 510 and the backward path 520 are repeated till the learnable PC2G encoder is fully trained.
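A minimal sketch of the forward row-wise binarization and the straight-through gradient is shown below, using a custom PyTorch autograd function; the class name RowWiseBinarize and the toy loss are assumptions for illustration rather than the actual loss used for training.

```python
import torch

class RowWiseBinarize(torch.autograd.Function):
    """Forward: one-hot binarization of each row at its maximum column.
    Backward: straight-through estimator that passes the gradient to the real matrix unchanged."""

    @staticmethod
    def forward(ctx, real_matrix):
        idx = real_matrix.argmax(dim=1)                    # column with the maximum value per row
        binary = torch.zeros_like(real_matrix)
        binary.scatter_(1, idx.unsqueeze(1), 1.0)          # set that column to one, others stay zero
        return binary

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                 # STE: continuous gradient updates the real matrix

# Example: a learnable real matrix whose binarized version assigns one node per grid cell.
real = torch.randn(9, 9, requires_grad=True)               # e.g., a 3 x 3 grid patch (HW = 9)
Phi = RowWiseBinarize.apply(real)                          # binary matrix, one-hot per row
loss = (Phi.sum(dim=0) - 1.0).pow(2).sum()                 # toy loss for illustration only
loss.backward()                                            # real.grad is populated via the STE
print(real.grad.shape)                                     # torch.Size([9, 9])
```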
Example CNN
FIG. 6 illustrates an example CNN 600, in accordance with various embodiments. The CNN 600 in FIG. 6 may be an example of the CNN 150 in FIG. 1 or the CNN 506 in FIG. 5. The
CNN 600 is trained to receive grid-structured data and output information indicating conditions of objects. In the embodiments of FIG. 6, the CNN 600 includes a sequence of layers comprising a plurality of convolutional layers 610 (individually referred to as “convolutional layer 610” ) , a plurality of pooling layers 620 (individually referred to as “pooling layer 620” ) , and a plurality of fully-connected layers 630 (individually referred to as “fully-connected layer 630” ) . In other embodiments, the CNN 600 may include fewer, more, or different layers. In an inference of the CNN 600, the layers of the CNN 600 execute tensor computation that includes many tensor operations, such as convolution (e.g., MAC operations, etc. ) , pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc. ) , other types of tensor operations, or some combination thereof.
The convolutional layers 610 summarize the presence of features in the input to the CNN 600. The convolutional layers 610 function as feature extractors. The first layer of the CNN 600 is a convolutional layer 610. In an example, a convolutional layer 610 performs a convolution on an input tensor 640 (also referred to as IFM 640) and a filter 650. As shown in FIG. 6, the IFM 640 is represented by a 7×7×3 three-dimensional (3D) matrix. The IFM 640 includes 3 input channels, each of which is represented by a 7×7 two-dimensional (2D) matrix. The 7×7 2D matrix includes 7 input elements (also referred to as input points) in each row and seven input elements in each column. The filter 650 is represented by a 3×3×3 3D matrix. The filter 650 includes 3 kernels, each of which may correspond to a different input channel of the IFM 640. A kernel is a 2D matrix of weights, where the weights are arranged in columns and rows. A kernel can be smaller than the IFM. In the embodiments of FIG. 6, each kernel is represented by a 3×3 2D matrix. The 3×3 kernel includes 3 weights in each row and three weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 650 in extracting features from the IFM 640.
The convolution includes MAC operations with the input elements in the IFM 640 and the weights in the filter 650. The convolution may be a standard convolution 663 or a depthwise convolution 683. In the standard convolution 663, the whole filter 650 slides across the IFM 640. All the input channels are combined to produce an output tensor 660 (also referred to as OFM 660) . The OFM 660 is represented by a 5×5 2D matrix. The 5×5 2D
matrix includes 5 output elements (also referred to as output points) in each row and five output elements in each column. For the purpose of illustration, the standard convolution includes one filter in the embodiments of FIG. 6. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 660.
The multiplication applied between a kernel-sized patch of the IFM 640 and a kernel may be a dot product. A dot product is the elementwise multiplication between the kernel-sized patch of the IFM 640 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product. ” Using a kernel smaller than the IFM 640 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 640 multiple times at different points on the IFM 640. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 640, left to right, top to bottom. The result from multiplying the kernel with the IFM 640 one time is a single value. As the kernel is applied multiple times to the IFM 640, the multiplication result is a 2D matrix of output elements. As such, the 2D output matrix (i.e., the OFM 660) from the standard convolution 663 is referred to as an OFM.
In the depthwise convolution 683, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 6, the depthwise convolution 683 produces a depthwise output tensor 680. The depthwise output tensor 680 is represented by a 5×5×3 3D matrix. The depthwise output tensor 680 includes 3 output channels, each of which is represented by a 5×5 2D matrix. The 5×5 2D matrix includes 5 output elements in each row and five output elements in each column. Each output channel is a result of MAC operations of an input channel of the IFM 640 and a kernel of the filter 650. For instance, the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots) , the second output channel (patterned with horizontal strips) is a result of MAC operations of the second input channel (patterned with horizontal strips) and the second kernel (patterned with horizontal strips) , and the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes) . In such a depthwise convolution, the number of input channels equals
the number of output channels, and each output channel corresponds to a different input channel. The input channels and output channels are referred to collectively as depthwise channels. After the depthwise convolution, a pointwise convolution 693 is then performed on the depthwise output tensor 680 and a 1×1×3 tensor 690 to produce the OFM 660.
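For illustration, the sketch below reproduces this depthwise-plus-pointwise factorization in PyTorch, using a grouped convolution for the depthwise part; the single output channel of the pointwise convolution is an assumption matching the single-channel OFM shown in FIG. 6.

```python
import torch
import torch.nn as nn

# Depthwise convolution: each of the 3 input channels is convolved with its own 3 x 3 kernel
# (groups equals the number of channels), so the input channels are not combined.
depthwise = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, groups=3, bias=False)

# Pointwise convolution: a 1 x 1 kernel spanning the 3 depthwise channels combines them
# into a single output channel.
pointwise = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1, bias=False)

ifm = torch.randn(1, 3, 7, 7)              # 7 x 7 x 3 input feature map as in FIG. 6
depthwise_out = depthwise(ifm)             # shape (1, 3, 5, 5): one 5 x 5 channel per input channel
ofm = pointwise(depthwise_out)             # shape (1, 1, 5, 5): the combined output feature map
print(depthwise_out.shape, ofm.shape)
```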
The OFM 660 is then passed to the next layer in the sequence. In some embodiments, the OFM 660 is passed through an activation function. An example activation function is ReLU. ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less. The convolutional layer 610 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 660 is passed to the subsequent convolutional layer 610 (i.e., the convolutional layer 610 following the convolutional layer 610 generating the OFM 660 in the sequence) . The subsequent convolutional layers 610 perform a convolution on the OFM 660 with new kernels and generate a new feature map. The new feature map may also be normalized and resized. The new feature map can be kernelled again by a further subsequent convolutional layer 610, and so on.
In some embodiments, a convolutional layer 610 has four hyperparameters: the number of kernels, the kernel size F (e.g., a kernel has dimensions of F×F×D pixels), the stride S with which the window corresponding to the kernel is dragged over the image (e.g., a stride of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 610). The convolutional layers 610 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on. The CNN 600 includes 66 convolutional layers 610. In other embodiments, the CNN 600 may include a different number of convolutional layers.
The pooling layers 620 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps. A pooling layer 620 is placed between two convolution layers 610: a preceding convolutional layer 610 (the convolution layer 610 preceding the pooling layer 620 in the sequence of layers) and a subsequent convolutional layer 610 (the convolution layer 610 subsequent to
the pooling layer 620 in the sequence of layers) . In some embodiments, a pooling layer 620 is added after a convolutional layer 610, e.g., after an activation function (e.g., ReLU, etc. ) has been applied to the OFM 660.
A pooling layer 620 receives feature maps generated by the preceding convolution layer 610 and applies a pooling operation to the feature maps. The pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the CNN and avoids over-learning. The pooling layers 620 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map) , max pooling (calculating the maximum value for each patch of the feature map) , or a combination of both. The size of the pooling operation is smaller than the size of the feature maps. In various embodiments, the pooling operation is 2×2 pixels applied with a stride of two pixels, so that the pooling operation reduces the size of a feature map by a factor of 2, e.g., the number of pixels or values in the feature map is reduced to one quarter the size. In an example, a pooling layer 620 applied to a feature map of 6×6 results in an output pooled feature map of 3×3. The output of the pooling layer 620 is inputted into the subsequent convolution layer 610 for further feature extraction. In some embodiments, the pooling layer 620 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
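For illustration, the sketch below applies a 2×2 max pooling with a stride of two to a 6×6 feature map, reducing it to 3×3 as in the example above.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 2 x 2 window applied with a stride of two
feature_map = torch.randn(1, 1, 6, 6)          # a 6 x 6 feature map from a convolutional layer
pooled = pool(feature_map)
print(pooled.shape)                            # torch.Size([1, 1, 3, 3])
```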
The fully-connected layers 630 are the last layers of the CNN. The fully-connected layers 630 may be convolutional or not. The fully-connected layers 630 may also be referred to as linear layers. In some embodiments, a fully-connected layer 630 (e.g., the first fully-connected layer in the CNN 600) may receive an input operand. The input operand may define the output of the convolutional layers 610 and pooling layers 620 and includes the values of the last feature map generated by the last pooling layer 620 in the sequence. The fully-connected layer 630 may apply a linear transformation to the input operand through a weight matrix. The weight matrix may be a kernel of the fully-connected layer 630. The linear transformation may include a tensor multiplication between the input operand and the weight matrix. The result of the linear transformation may be an output operand. In some embodiments, the fully-connected layer may further apply a non-linear transformation (e.g., by using a non-linear activation function) on the result of the linear transformation to generate an output operand. The output operand may contain as many elements as there
are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and the sum of all elements is one. These probabilities are calculated by the last fully-connected layer 630 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.
FIG. 7 illustrates an example convolution, in accordance with various embodiments. The convolution may be a deep learning operation in a convolutional layer of a CNN, e.g., the CNN 150 in FIG. 1, the CNN 506 in FIG. 5, and the CNN 600 in FIG. 6. The convolution can be executed on an input tensor 710 and filters 720 (individually referred to as “filter 720” ) . The result of the convolution is an output tensor 730. In some embodiments, the convolution is performed by a DNN accelerator.
In the embodiments of FIG. 7, the input tensor 710 includes activations (also referred to as “input activations, ” “elements, ” or “input elements” ) arranged in a 3D matrix. An input element is a data point in the input tensor 710. The input tensor 710 has a spatial size Hin×Win×Cin, where Hin is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of activations in a column in the 3D matrix of each input channel) , Win is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of activations in a row in the 3D matrix of each input channel) , and Cin is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of input channels) . For the purpose of simplicity and illustration, the input tensor 710 has a spatial size of 7×7×3, i.e., the input tensor 710 includes three input channels and each input channel has a 7×7 2D matrix. Each input element in the input tensor 710 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 710 may be different.
Each filter 720 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN. A filter 720 has a spatial size Hf×Wf×Cf, where Hf is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel), Wf is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel), and Cf is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels). In some embodiments, Cf equals Cin. For purpose of simplicity and illustration, each filter 720 in FIG. 7 has a spatial size of 3×3×3, i.e., the filter 720 includes 3 convolutional
kernels with a spatial size of 3×3. In other embodiments, the height, width, or depth of the filter 720 may be different. The spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 710.
An activation or weight may take one or more bytes in a memory. The number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation takes one byte. When the activation or weight has a FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.
In the convolution, each filter 720 slides across the input tensor 710 and generates a 2D matrix for an output channel in the output tensor 730. In the embodiments of FIG. 7, the 2D matrix has a spatial size of 5×5. The output tensor 730 includes activations (also referred to as “output activations,” “elements,” or “output elements”) arranged in a 3D matrix. An output activation is a data point in the output tensor 730. The output tensor 730 has a spatial size Hout×Wout×Cout, where Hout is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel), Wout is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel), and Cout is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels). Cout may equal the number of filters 720 in the convolution. Hout and Wout may depend on the heights and widths of the input tensor 710 and each filter 720.
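As a general fact about convolution arithmetic (not specific to this disclosure), the output spatial size can be computed from the input size, kernel size, padding, and stride; the sketch below checks the FIG. 7 setup of a 7×7 input, a 3×3 kernel, a stride of 1, and no padding, which yields a 5×5 output per channel.

```python
def conv_output_size(in_size, kernel_size, padding=0, stride=1):
    """Standard convolution output size along one spatial dimension."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

# FIG. 7 setup: 7 x 7 input, 3 x 3 kernels, stride 1, no padding -> 5 x 5 output per channel.
print(conv_output_size(7, 3), conv_output_size(7, 3))   # 5 5
```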
As a part of the convolution, MAC operations can be performed on a 3×3×3 subtensor 715 (which is highlighted with a dotted pattern in FIG. 7) in the input tensor 710 and each filter 720. The result of the MAC operations on the subtensor 715 and one filter 720 is an output activation. In some embodiments (e.g., embodiments where the convolution is an integral convolution) , an output activation may include 8 bits, e.g., one byte. In other embodiments (e.g., embodiments where the convolution is a floating-point convolution) , an output activation may include more than one byte. For instance, an output element may include two bytes.
After the MAC operations on the subtensor 715 and all the filters 720 are finished, a vector 735 is produced. The vector 735 is highlighted with slashes in FIG. 7. The vector 735
includes a sequence of output activations, which are arranged along the Z axis. The output activations in the vector 735 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates. The dimension of the vector 735 along the Z axis may equal the total number of output channels in the output tensor 730. After the vector 735 is produced, further MAC operations are performed to produce additional vectors till the output tensor 730 is produced.
In some embodiments, the output activations in the output tensor 730 may be further processed based on one or more activation functions before they are stored or inputted into the next layer of the CNN. The processing based on the one or more activation functions may be at least part of the post processing of the convolution. In some embodiments, the post processing may include one or more other computations, such as offset computation, bias computation, and so on. The results of the post processing may be stored in a local memory of the compute block and be used as input to the next layer. In some embodiments, the input activations in the input tensor 710 may be results of post processing of the previous layer. Even though the input tensor 710, filters 720, and output tensor 730 are 3D tensors in FIG. 7, the input tensor 710, a filter 720, or the output tensor 730 may be a 2D tensor in other embodiments.
Example AI-based Visual Recognition Environment
FIG. 8 illustrates an AI-based visual recognition environment 800, in accordance with various embodiments. The AI-based visual recognition environment 800 includes a visual recognition module 810, client devices 820 (individually referred to as client device 820) , and a third-party system 830. In other embodiments, the AI-based visual recognition environment 800 may include fewer, more, or different components. For instance, the AI-based visual recognition environment 800 may include a different number of client devices 820 or more than one third-party system 830.
The visual recognition module 810 performs visual recognition tasks, e.g., detection of conditions of objects. For instance, the visual recognition module 810 may track 3D motions of an object by estimating 3D poses of the object. In some embodiments, the visual recognition module 810 may receive one or more point clouds captured by one or more sensors placed in a local area where an object is located. The visual recognition module 810 may receive the point clouds from one or more client devices 820 or the third-party system
830. Also, the visual recognition module 810 may transmit information indicating visual recognition of the object to one or more client devices 820 or the third-party system 830. Additionally or alternatively, the visual recognition module 810 may transmit content items generated using the estimated 3D poses of the object to one or more client devices 820 or the third-party system 830. An example of the visual recognition module 810 is the visual recognition module 110 in FIG. 1.
The client devices 820 are in communication with the visual recognition module 810. For example, the client device 820 may receive 3D pose graphical representations from the visual recognition module 810 and display the 3D pose graphical representations to one or more users associated with the client device 820. As another example, a client device 820 may facilitate an interface with one or more depth cameras in a local area and may send commands to the depth cameras to capture depth images to be used by the visual recognition module 810. Additionally or alternatively, the client device 820 may facilitate an interface with one or more projectors in a local area and may provide content items to the projectors for the projectors to present the content items in the local area. The client device 820 may generate the content items using motion tracking results from the visual recognition module 810. A client device may have one or more users, whose motions may be tracked by the visual recognition module 810.
In some embodiments, a client device 820 may execute one or more applications allowing one or more users of the client device 820 to interact with the visual recognition module 810. For example, a client device 820 executes a browser application to enable interaction between the client device 820 and the visual recognition module 810. In another embodiment, a client device 820 interacts with the visual recognition module 810 through an application programming interface (API) running on a native operating system of the client device 820, such as IOS® or ANDROIDTM.
A client device 820 may be one or more computing devices capable of receiving user input as well as transmitting and/or receiving data via the network 840. In one embodiment, a client device 820 is a conventional computer system, such as a desktop or a laptop computer. Alternatively, a client device 820 may be a device having computer functionality, such as a personal digital assistant (PDA) , a mobile telephone, a smartphone, an autonomous vehicle, or another suitable device. A client device 820 is configured to
communicate via the network 840. In an embodiment, a client device 820 is an integrated computing device that operates as a standalone network-enabled device. For example, the client device 820 includes display, speakers, microphone, camera, and input device. In another embodiment, a client device 820 is a computing device for coupling to an external media device such as a television or other external display and/or audio output system. In this embodiment, the client device 820 may couple to the external media device via a wireless interface or wired interface and may utilize various functions of the external media device such as its display, speakers, microphone, camera, and input devices. Here, the client device 820 may be configured to be compatible with a generic external media device that does not have specialized software, firmware, or hardware specifically for interacting with the client device 820.
The third-party system 830 is an online system that may communicate with the visual recognition module 810 or at least one of the client devices 820. In some embodiments, the third-party system 830 may provide data to the visual recognition module 810 for 3D pose estimation. The data may include depth images, data for training DNNs, data for validating DNNs, and so on. The third-party system 830 may be a social media system, an online image gallery, an online searching system, and so on. Additionally or alternatively, the third-party system 830 may use results of 3D pose estimation in various applications. For instance, the third-party system 830 may use motion tracking results from the visual recognition module 810 for action recognition, sport analysis, virtual reality, augmented reality, film and game production, telepresence, and so on.
The visual recognition module 810, client devices 820, and third-party system 830 are connected through a network 840. The network 840 may comprise any combination of local area and/or wide area networks, using both wired and/or wireless communication systems. In one embodiment, the network 840 may use standard communications technologies and/or protocols. For example, the network 840 may include communication links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 3G, 4G, code division multiple access (CDMA), digital subscriber line (DSL), etc. Examples of networking protocols used for communicating via the network 840 may include multiprotocol label switching (MPLS), transmission control protocol/Internet protocol (TCP/IP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), and file
transfer protocol (FTP) . Data exchanged over the network 840 may be represented using any suitable format, such as hypertext markup language (HTML) or extensible markup language (XML) . In some embodiments, all or some of the communication links of the network 840 may be encrypted using any suitable technique or techniques.
Example Method of Visual Recognition
FIG. 9 is a flowchart showing a method 900 of visual recognition, in accordance with various embodiments. The method 900 may be a method of 3D visual recognition. The method 900 may be performed by the visual recognition module 110 in FIG. 1. Although the method 900 is described with reference to the flowchart illustrated in FIG. 9, many other methods for visual recognition may alternatively be used. For example, the order of execution of the steps in FIG. 9 may be changed. As another example, some of the steps may be changed, eliminated, or combined.
The visual recognition module 110 generates 910 a point cloud graph by removing one or more points from a point cloud capturing an object. The point cloud graph comprises a first group of nodes. In some embodiments, a node encodes a feature in the object, such as a portion of the object. The point cloud graph also includes one or more edges. An edge connects two adjacent nodes. In some embodiments, the edge represents a topological connection between two features in the object that are encoded by the two nodes.
In some embodiments, the visual recognition module 110 generates the point cloud graph by dividing the point cloud into one or more regions and for each region of the point cloud, removing one or more points in the region. In some embodiments, the visual recognition module 110 generates the point cloud graph by dividing the point cloud into one or more regions and for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
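For illustration only, the following Python sketch shows one way this region-based down-sampling could be realized: points are assigned to cubic regions, each region is averaged into a node, and edges connect each node to its nearest neighbors. The region size, the k-nearest-neighbor edge rule, and the function name build_point_cloud_graph are assumptions of this sketch, not requirements of the embodiments described above.

```python
import numpy as np

def build_point_cloud_graph(points, region_size=0.05, k=4):
    """Sketch: down-sample a point cloud into graph nodes by averaging the
    points that fall into each cubic region, then connect each node to its
    k nearest neighbors (an illustrative edge rule)."""
    # Assign every point to a region (voxel) index.
    region_ids = np.floor(points / region_size).astype(np.int64)
    _, inverse = np.unique(region_ids, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    # One node per region: the mean of the points in that region.
    num_nodes = int(inverse.max()) + 1
    counts = np.bincount(inverse, minlength=num_nodes).astype(np.float64)
    nodes = np.stack([
        np.bincount(inverse, weights=points[:, dim], minlength=num_nodes) / counts
        for dim in range(points.shape[1])
    ], axis=1)

    # Edges: connect each node to its nearest neighbors, symmetrically.
    k_eff = min(k, num_nodes - 1)
    dists = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k_eff]
    adjacency = np.zeros((num_nodes, num_nodes))
    rows = np.repeat(np.arange(num_nodes), k_eff)
    adjacency[rows, neighbors.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)  # undirected edges
    return nodes, adjacency
```

A graph built this way has fewer nodes than the original point cloud has points, corresponding to the first group of nodes described above.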
The visual recognition module 110 generates 920 a feature map from the point cloud graph. The feature map comprises a second group of nodes that are arranged in a grid structure. The second group of nodes comprises more nodes than the first group of nodes. In some embodiments, the feature map is a grid patch.
In some embodiments, the visual recognition module 110 generates the feature map in two stages. In some embodiments, the visual recognition module 110 generates an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes. The visual recognition module 110 transforms the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by interpolating one or more new nodes between two nodes in the point cloud graph. The two nodes are connected through an edge in the point cloud graph. In some embodiments, the visual recognition module 110 generates the additional point cloud graph by applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
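As an illustration of the up-sampling-matrix variant, the following PyTorch sketch aggregates node features with a normalized adjacency matrix and then applies a learnable up-sampling matrix that maps the first group of nodes to a larger set of nodes. The symmetric normalization, the linear projection, and the class name GraphUpsampler are assumptions of this sketch; the disclosure does not prescribe a particular form for the up-sampling matrix or for the aggregation.

```python
import torch
import torch.nn as nn

class GraphUpsampler(nn.Module):
    """Sketch: produce an additional point cloud graph with more nodes by
    applying an up-sampling matrix U after mixing node features through a
    normalized adjacency matrix."""

    def __init__(self, num_nodes_in, num_nodes_out, feature_dim):
        super().__init__()
        # Learnable up-sampling matrix: maps N_in nodes to N_out > N_in nodes.
        self.up = nn.Parameter(torch.randn(num_nodes_out, num_nodes_in) * 0.01)
        self.proj = nn.Linear(feature_dim, feature_dim)

    def forward(self, x, adjacency):
        # x: (N_in, feature_dim), adjacency: (N_in, N_in)
        a = adjacency + torch.eye(adjacency.size(0), device=adjacency.device)
        deg = a.sum(dim=1).clamp(min=1.0)
        a_norm = a / deg.sqrt().unsqueeze(0) / deg.sqrt().unsqueeze(1)
        aggregated = a_norm @ x                  # mix features of neighboring nodes
        return self.proj(self.up @ aggregated)   # (N_out, feature_dim)
```

Interpolating a new node halfway along each edge, as in the other variant described above, corresponds to one particular fixed (non-learned) choice of the up-sampling matrix.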
In some embodiments, the visual recognition module 110 inputs the point cloud graph into a trained model. The trained model comprises a learnable binary matrix. The visual recognition module 110 generates the feature map using the learnable binary matrix. In some embodiments, the trained model is trained by updating one or more parameters of a real matrix and binarizing the real matrix row by row to obtain the binary matrix.
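One way such a model could be trained, shown purely as a sketch, is to keep a real-valued matrix, binarize it row by row in the forward pass (here by taking a one-hot vector at each row's largest entry), and back-propagate through a straight-through estimator. The straight-through estimator and the softmax relaxation are assumptions of this sketch rather than details stated in the embodiments above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGridAssignment(nn.Module):
    """Sketch: a learnable binary matrix that maps graph nodes to grid cells.
    A real-valued matrix is updated by gradient descent; each row is
    binarized so that every grid cell selects one node."""

    def __init__(self, num_grid_cells, num_nodes):
        super().__init__()
        self.real = nn.Parameter(torch.randn(num_grid_cells, num_nodes) * 0.01)

    def binary_matrix(self):
        # Row-by-row binarization: keep the largest entry of each row as 1.
        hard = F.one_hot(self.real.argmax(dim=1),
                         num_classes=self.real.size(1)).float()
        soft = self.real.softmax(dim=1)
        # Straight-through estimator: the forward pass uses the binary matrix,
        # the backward pass uses the softmax gradient.
        return hard + soft - soft.detach()

    def forward(self, node_features):
        # node_features: (num_nodes, feature_dim) -> (num_grid_cells, feature_dim)
        return self.binary_matrix() @ node_features
```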
The visual recognition module 110 executes 930 one or more deep learning operations in a neural network on the feature map, i.e., the up-sampled grid representation of the object. The neural network may be a CNN, e.g., the CNN 150 or the CNN 600. In some embodiments, the one or more deep learning operations comprise a convolution that is executed on the feature map. The convolution has a kernel, and the kernel has a smaller size than the feature map.
The visual recognition module 110 determines 940 a condition of the object based on an output of the neural network. In some embodiments, the neural network outputs information describing the condition of the object. The condition of the object may be a pose, movement, gesture, orientation, mood, color, shape, size, or other types of conditions of the object.
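The following sketch ties steps 930 and 940 together: a small CNN whose 3×3 kernels are smaller than the grid is run over a grid-shaped feature map, and a classification head reads out a condition of the object. The 8×8 grid, the 64-dimensional node features, and the ten condition classes are illustrative values only.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only.
grid_h, grid_w, feature_dim, num_classes = 8, 8, 64, 10

# Grid-shaped feature map produced by the encoder: (batch, channels, H, W).
feature_map = torch.randn(1, feature_dim, grid_h, grid_w)

cnn = nn.Sequential(
    nn.Conv2d(feature_dim, 128, kernel_size=3, padding=1),  # 3x3 kernel < 8x8 map
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, num_classes),
)

logits = cnn(feature_map)         # output of the neural network
condition = logits.argmax(dim=1)  # predicted condition of the object
```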
Example Computing Device
FIG. 10 is a block diagram of an example computing device 1000, in accordance with various embodiments. In some embodiments, the computing device 1000 can be used as at least part of the computer vision system 100. A number of components are illustrated in FIG. 10 as included in the computing device 1000, but any one or more of these components may be omitted or duplicated, as suitable for the application. In some embodiments, some or all of the components included in the computing device 1000 may be attached to one or more
motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 1000 may not include one or more of the components illustrated in FIG. 10, but the computing device 1000 may include interface circuitry for coupling to the one or more components. For example, the computing device 1000 may not include a display device 1006, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1006 may be coupled. In another set of examples, the computing device 1000 may not include an audio input device 1018 or an audio output device 1008, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1018 or audio output device 1008 may be coupled.
The computing device 1000 may include a processing device 1002 (e.g., one or more processing devices) . The processing device 1002 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory. The computing device 1000 may include a memory 1004, which may itself include one or more memory devices such as volatile memory (e.g., DRAM) , nonvolatile memory (e.g., read-only memory (ROM) ) , high bandwidth memory (HBM) , flash memory, solid state memory, and/or a hard drive. In some embodiments, the memory 1004 may include memory that shares a die with the processing device 1002. In some embodiments, the memory 1004 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for performing 3D visual recognition, e.g., the method 900 described above in conjunction with FIG. 9 or some operations performed by the computer vision system 100 described above in conjunction with FIG. 1. The instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1002.
In some embodiments, the computing device 1000 may include a communication chip 1012 (e.g., one or more communication chips) . For example, the communication chip 1012 may be configured for managing wireless communications for the transfer of data to and from the computing device 1000. The term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation
through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
The communication chip 1012 may implement any of a number of wireless standards or protocols, including but not limited to Institute of Electrical and Electronics Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family) , IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment) , Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2" ) , etc. ) . IEEE 802.16 compatible Broadband Wireless Access (BWA) networks are generally referred to as WiMAX networks, an acronym that stands for worldwide interoperability for microwave access, which is a certification mark for products that pass conformity and interoperability tests for the IEEE 802.16 standards. The communication chip 1012 may operate in accordance with a Global System for Mobile Communication (GSM) , General Packet Radio Service (GPRS) , Universal Mobile Telecommunications System (UMTS) , High Speed Packet Access (HSPA) , Evolved HSPA (E-HSPA) , or LTE network. The communication chip 1012 may operate in accordance with Enhanced Data for GSM Evolution (EDGE) , GSM EDGE Radio Access Network (GERAN) , Universal Terrestrial Radio Access Network (UTRAN) , or Evolved UTRAN (E-UTRAN) . The communication chip 1012 may operate in accordance with CDMA, Time Division Multiple Access (TDMA) , Digital Enhanced Cordless Telecommunications (DECT) , Evolution-Data Optimized (EV-DO) , and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond. The communication chip 1012 may operate in accordance with other wireless protocols in other embodiments. The computing device 1000 may include an antenna 1022 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions) .
In some embodiments, the communication chip 1012 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet) . As noted above, the communication chip 1012 may include multiple communication chips. For instance, a first communication chip 1012 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1012 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others. In
some embodiments, a first communication chip 1012 may be dedicated to wireless communications, and a second communication chip 1012 may be dedicated to wired communications.
The computing device 1000 may include battery/power circuitry 1014. The battery/power circuitry 1014 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1000 to an energy source separate from the computing device 1000 (e.g., AC line power) .
The computing device 1000 may include a display device 1006 (or corresponding interface circuitry, as discussed above) . The display device 1006 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD) , a light-emitting diode display, or a flat panel display, for example.
The computing device 1000 may include an audio output device 1008 (or corresponding interface circuitry, as discussed above) . The audio output device 1008 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.
The computing device 1000 may include an audio input device 1018 (or corresponding interface circuitry, as discussed above) . The audio input device 1018 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output) .
The computing device 1000 may include a GPS device 1016 (or corresponding interface circuitry, as discussed above) . The GPS device 1016 may be in communication with a satellite-based system and may receive a location of the computing device 1000, as known in the art.
The computing device 1000 may include another output device 1010 (or corresponding interface circuitry, as discussed above) . Examples of the other output device 1010 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.
The computing device 1000 may include another input device 1020 (or corresponding interface circuitry, as discussed above) . Examples of the other input device 1020 may include
an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.
The computing device 1000 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a PDA, an ultramobile personal computer, etc. ) , a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system. In some embodiments, the computing device 1000 may be any other electronic device that processes data.
Select Examples
The following paragraphs provide various examples of the embodiments disclosed herein.
Example 1 provides a method, including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
Example 2 provides the method of example 1, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
Example 3 provides the method of example 1 or 2, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
Example 4 provides the method of any one of examples 1-3, in which generating the point cloud graph includes selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and generating a node in
the point cloud graph from the selected point.
Example 5 provides the method of any one of examples 1-4, in which generating the feature map includes generating an additional point cloud graph from the point cloud graph, the additional point cloud graph including the second group of nodes; and transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
Example 6 provides the method of example 5, in which generating the additional point cloud graph includes interpolating one or more new nodes between two nodes in the point cloud graph, in which the two nodes are connected through an edge in the point cloud graph.
Example 7 provides the method of example 5 or 6, in which generating the additional point cloud graph includes applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
Example 8 provides the method of any one of examples 1-7, in which generating the feature map includes inputting the point cloud graph into a trained model, the trained model including a learnable binary matrix and generating the feature map using the learnable binary matrix.
Example 9 provides the method of example 8, in which the trained model is trained by: updating one or more parameters of a real matrix; and binarizing the real matrix row by row to obtain the binary matrix.
Example 10 provides the method of any one of examples 1-9, in which the one or more deep learning operations includes a convolution.
Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes; generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes; executing one or more deep learning operations in a neural network on the feature map; and determining a condition of the object based on an output of the neural network.
Example 12 provides the one or more non-transitory computer-readable media of
example 11, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
Example 13 provides the one or more non-transitory computer-readable media of example 11 or 12, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
Example 14 provides the one or more non-transitory computer-readable media of any one of examples 11-13, in which generating the point cloud graph includes selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and generating a node in the point cloud graph from the selected point.
Example 15 provides the one or more non-transitory computer-readable media of any one of examples 11-14, in which generating the feature map includes generating an additional point cloud graph from the point cloud graph, the additional point cloud graph including the second group of nodes; and transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
Example 16 provides the one or more non-transitory computer-readable media of example 15, in which generating the additional point cloud graph includes interpolating one or more new nodes between two nodes in the point cloud graph, in which the two nodes are connected through an edge in the point cloud graph.
Example 17 provides the one or more non-transitory computer-readable media of example 15 or 16, in which generating the additional point cloud graph includes applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
Example 18 provides the one or more non-transitory computer-readable media of any one of examples 11-17, in which generating the feature map includes inputting the point cloud graph into a trained model, the trained model including a learnable binary matrix and generating the feature map using the learnable binary matrix.
Example 19 provides the one or more non-transitory computer-readable media of example 18, in which the trained model is trained by: updating one or more parameters of a real matrix; and binarizing the real matrix row by row to obtain the binary matrix.
Example 20 provides the one or more non-transitory computer-readable media of any one of examples 11-19, in which the one or more deep learning operations includes a convolution.
Example 21 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph including a first group of nodes, generating a feature map from the point cloud graph, the feature map including a second group of nodes that are arranged in a grid structure, the second group of nodes including more nodes than the first group of nodes, executing one or more deep learning operations in a neural network on the feature map, and determining a condition of the object based on an output of the neural network.
Example 22 provides the apparatus of example 21, in which generating the point cloud graph includes dividing the point cloud into one or more regions; and for each region of the point cloud, removing one or more points in the region.
Example 23 provides the apparatus of example 21 or 22, in which generating the point cloud graph includes selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and generating a node in the point cloud graph from the selected point.
Example 24 provides the apparatus of example 21 or 22, in which generating the feature map includes generating an additional point cloud graph from the point cloud graph, the additional point cloud graph including the second group of nodes; and transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
Example 25 provides the apparatus of any one of examples 21-24, in which generating the feature map includes inputting the point cloud graph into a trained model, the trained model including a learnable binary matrix and generating the feature map using the learnable binary matrix.
The above description of illustrated implementations of the disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosure
to the precise forms disclosed. While specific implementations of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. These modifications may be made to the disclosure in light of the above detailed description.
Claims (25)
- A method, comprising:
  generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph comprising a first group of nodes;
  generating a feature map from the point cloud graph, the feature map comprising a second group of nodes that are arranged in a grid structure, the second group of nodes comprising more nodes than the first group of nodes;
  executing one or more deep learning operations in a neural network on the feature map; and
  determining a condition of the object based on an output of the neural network.
- The method of claim 1, wherein generating the point cloud graph comprises:
  dividing the point cloud into one or more regions; and
  for each region of the point cloud, removing one or more points in the region.
- The method of claim 1, wherein generating the point cloud graph comprises:
  dividing the point cloud into one or more regions; and
  for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
- The method of claim 1, wherein generating the point cloud graph comprises:
  selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and
  generating a node in the point cloud graph from the selected point.
- The method of claim 1, wherein generating the feature map comprises:
  generating an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes; and
  transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
- The method of claim 5, wherein generating the additional point cloud graph comprises:
  interpolating one or more new nodes between two nodes in the point cloud graph, wherein the two nodes are connected through an edge in the point cloud graph.
- The method of claim 5, wherein generating the additional point cloud graph comprises:
  applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
- The method of claim 1, wherein generating the feature map comprises:
  inputting the point cloud graph into a trained model, the trained model comprising a learnable binary matrix and generating the feature map using the learnable binary matrix.
- The method of claim 8, wherein the trained model is trained by:
  updating one or more parameters of a real matrix; and
  binarizing the real matrix row by row to obtain the binary matrix.
- The method of claim 1, wherein the one or more deep learning operations comprises a convolution.
- One or more non-transitory computer-readable media storing instructions executable to perform operations, the operations comprising:
  generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph comprising a first group of nodes;
  generating a feature map from the point cloud graph, the feature map comprising a second group of nodes that are arranged in a grid structure, the second group of nodes comprising more nodes than the first group of nodes;
  executing one or more deep learning operations in a neural network on the feature map; and
  determining a condition of the object based on an output of the neural network.
- The one or more non-transitory computer-readable media of claim 11, wherein generating the point cloud graph comprises:
  dividing the point cloud into one or more regions; and
  for each region of the point cloud, removing one or more points in the region.
- The one or more non-transitory computer-readable media of claim 11, wherein generating the point cloud graph comprises:
  dividing the point cloud into one or more regions; and
  for each region of the point cloud, generating a node in the point cloud graph by averaging points in the region.
- The one or more non-transitory computer-readable media of claim 11, wherein generating the point cloud graph comprises:
  selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and
  generating a node in the point cloud graph from the selected point.
- The one or more non-transitory computer-readable media of claim 11, wherein generating the feature map comprises:
  generating an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes; and
  transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
- The one or more non-transitory computer-readable media of claim 15, wherein generating the additional point cloud graph comprises:
  interpolating one or more new nodes between two nodes in the point cloud graph, wherein the two nodes are connected through an edge in the point cloud graph.
- The one or more non-transitory computer-readable media of claim 15, wherein generating the additional point cloud graph comprises:
  applying an up-sampling matrix and an adjacency matrix on the point cloud graph.
- The one or more non-transitory computer-readable media of claim 11, wherein generating the feature map comprises:
  inputting the point cloud graph into a trained model, the trained model comprising a learnable binary matrix and generating the feature map using the learnable binary matrix.
- The one or more non-transitory computer-readable media of claim 18, wherein the trained model is trained by:
  updating one or more parameters of a real matrix; and
  binarizing the real matrix row by row to obtain the binary matrix.
- The one or more non-transitory computer-readable media of claim 11, wherein the one or more deep learning operations comprises a convolution.
- An apparatus, comprising:
  a computer processor for executing computer program instructions; and
  a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising:
  generating a point cloud graph by removing one or more points from a point cloud capturing at least part of an object, the point cloud graph comprising a first group of nodes,
  generating a feature map from the point cloud graph, the feature map comprising a second group of nodes that are arranged in a grid structure, the second group of nodes comprising more nodes than the first group of nodes,
  executing one or more deep learning operations in a neural network on the feature map, and
  determining a condition of the object based on an output of the neural network.
- The apparatus of claim 21, wherein generating the point cloud graph comprises:
  dividing the point cloud into one or more regions; and
  for each region of the point cloud, removing one or more points in the region.
- The apparatus of claim 21, wherein generating the point cloud graph comprises:
  selecting a point in the point cloud based on a state value of the point, the state value indicating a degree of invalidity of the point; and
  generating a node in the point cloud graph from the selected point.
- The apparatus of claim 21, wherein generating the feature map comprises:
  generating an additional point cloud graph from the point cloud graph, the additional point cloud graph comprising the second group of nodes; and
  transforming the additional point cloud graph into the feature map by rearranging the second group of nodes into the grid structure.
- The apparatus of claim 21, wherein generating the feature map comprises:
  inputting the point cloud graph into a trained model, the trained model comprising a learnable binary matrix and generating the feature map using the learnable binary matrix.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2024084402 | 2024-03-28 | | |
| CNPCT/CN2024/084402 | 2024-03-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025200079A1 (en) | 2025-10-02 |
Family
ID=97216013
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/090762 (WO2025200079A1, pending) | Learnable encoder converting point cloud to grid for visual recognition | 2024-03-28 | 2024-04-30 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025200079A1 (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150213646A1 (en) * | 2014-01-28 | 2015-07-30 | Siemens Aktiengesellschaft | Method and System for Constructing Personalized Avatars Using a Parameterized Deformable Mesh |
| CN109964222A (en) * | 2016-11-03 | 2019-07-02 | 三菱电机株式会社 | System and method for processing an input point cloud with multiple points |
| CN113970922A (en) * | 2020-07-22 | 2022-01-25 | 商汤集团有限公司 | Point cloud data processing method and intelligent driving control method and device |
| US20220277514A1 (en) * | 2021-02-26 | 2022-09-01 | Adobe Inc. | Reconstructing three-dimensional scenes portrayed in digital images utilizing point cloud machine-learning models |
| CN115909319A (en) * | 2022-12-15 | 2023-04-04 | 南京工业大学 | A Hierarchical Graph Network Based Method for 3D Object Detection on Point Clouds |
| CN116310104A (en) * | 2023-03-08 | 2023-06-23 | 武汉纺织大学 | Method, system and storage medium for three-dimensional reconstruction of human body in complex scene |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24932703; Country of ref document: EP; Kind code of ref document: A1 |