CN116801102B

CN116801102B - Method for controlling camera, video conference system, electronic device and storage medium

Info

Publication number: CN116801102B
Application number: CN202311054957.XA
Authority: CN
Inventors: 邱恩; 刘思妤
Original assignee: Rockchip Electronics Co Ltd
Current assignee: Rockchip Electronics Co Ltd
Priority date: 2023-08-22
Filing date: 2023-08-22
Publication date: 2024-02-09
Anticipated expiration: 2043-08-22
Also published as: CN116801102A

Abstract

The invention discloses a method for controlling a camera, a video conference system, an electronic device and a storage medium. The method comprises: acquiring sound source information collected by a microphone array associated with a current scene; compiling statistics on the persons in the scene based on the sound source information to obtain sound source statistical information; calculating, according to the sound source statistical information, a scene area in which the persons in the scene are located; and controlling a camera associated with the scene according to current sound source information and the sound source statistical information, so that the focus of the camera changes within the scene area according to the current sound source information. In this technical scheme, the conference rectangular area is calculated to obtain global information about the conference, which ensures the quality of the global conference image. When the current sound source information indicates that someone is speaking, the focus of the camera is switched to the instantaneous focus, so that the camera is adjusted promptly to follow the speaker during the conference and the quality of the local image is ensured when switching to a close-up view.

Description

Method for controlling camera, video conference system, electronic device and storage medium

Technical Field

The present invention relates to the field of machine vision, and in particular, to a method for controlling a camera, a video conference system, an electronic device, and a storage medium.

Background

Currently, a video conferencing system generally consists of a far-field voice acquisition unit, a rotatable camera, a conference set-top box, and a large-screen display system. The far-field voice acquisition unit captures sound, and the rotatable camera captures images. The conference set-top box is the brain of the video conference system: it manages the entire flow of the video call service as well as the input and output devices. However, current cameras cannot be flexibly adjusted according to the voice acquisition data.

Disclosure of Invention

The invention provides a method for controlling a camera, a video conference system, electronic equipment and a storage medium, which can dynamically control the camera according to voice acquisition data.

In one aspect of the invention, a method of controlling a camera is provided. The method comprises the following steps: acquiring sound source information acquired by a microphone array associated with a current scene, and counting personnel in the scene based on the sound source information to obtain sound source statistic information; calculating a scene area where a person in the scene is located according to the sound source statistical information; and controlling a camera associated with the scene according to the current sound source information and the sound source statistics information, so that the focus of the camera changes in the scene area according to the current sound source information.

In another aspect of the invention, a video conferencing system is provided. The system includes a microphone array; at least one camera; and a controller configured to: acquire sound source information collected by the microphone array; compile statistics on the persons in the conference based on the sound source information to obtain sound source statistical information; calculate, according to the sound source statistical information, a conference area in which the persons in the conference are located; and control the camera according to the current sound source information and the sound source statistical information, so that the focus of the camera changes within the conference area according to the current sound source information.

In yet another aspect of the present invention, an electronic device is provided. The device includes a memory configured to store an executable program; and a processor configured to execute the executable program to perform the above-described method of controlling a camera.

In yet another aspect of the present invention, a computer-readable storage medium is provided. The medium has stored thereon a computer program to be executed by a processor for implementing the method of controlling a camera as described above.

According to the invention, sound source information in a current scene is acquired from the microphone array, statistics are compiled on the persons in the scene based on the sound source information to obtain sound source statistical information, a scene area in which the persons in the scene are located is calculated, and a camera associated with the scene is controlled according to the current sound source information and the sound source statistical information, so that the focus of the camera changes within the scene area according to the current sound source information. The current sound source information and the sound source statistical information are thus combined to control the switching of the camera focus, and the camera can be adjusted promptly according to the sound source conditions in the scene. In this way, the camera is controlled dynamically according to the sound data collected by the microphone, and the coupling between camera control and the data collected by the microphone is strengthened.

Drawings

FIG. 1 is a flowchart of a method of controlling a camera according to an embodiment of the present invention;

FIG. 2 is a flowchart of calculating a conference scene focus based on sound source information according to an embodiment of the present invention;

FIG. 3 is a flowchart of controlling camera parameters based on sound source information according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for smooth zooming of a camera based on sound intensity according to an embodiment of the present invention;

FIG. 5 is a block diagram of a video conferencing system according to an embodiment of the present invention;

FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to describe the technical contents, the achieved objects and effects of the present invention in detail, the following description will be made with reference to the embodiments in conjunction with the accompanying drawings.

In the prior art, voice acquisition and the conference camera in a teleconference are only weakly coupled, and the camera cannot be flexibly and dynamically adjusted according to the acquired voice data.

In order to solve at least the above technical problems, the present disclosure provides a method of controlling a camera. According to the method, sound source information in a current scene is acquired from a microphone array, statistics are compiled on the persons in the scene based on the sound source information to obtain sound source statistical information, a scene area in which the persons in the scene are located is calculated, and a camera associated with the scene is controlled according to the current sound source information and the sound source statistical information so that the focus of the camera changes within the scene area according to the current sound source information. The current sound source information and the sound source statistical information are thus combined to control the switching of the camera focus, and the camera can be adjusted promptly according to the sound source conditions in the scene. In this way, the camera is controlled dynamically according to the sound data collected by the microphone, and the coupling between camera control and the data collected by the microphone is strengthened.

According to some embodiments of the disclosure, the scene comprises a video conference scene, and the persons in the scene comprise participants. However, it should be understood that the scenarios in accordance with the present disclosure are not limited to video conference scenarios, and in other embodiments may also include any other scenarios that use a microphone array and a camera.

According to some embodiments of the present disclosure, the maximum conference rectangular area is calculated from the sound source information of the microphone array using a statistical method, and the camera parameters are adjusted to improve the digital image quality of the maximum conference rectangular area. The instantaneous focus of the conference scene is calculated from the instantaneous sound source information together with the sound source statistical information, which makes slow switching and delayed switching of the instantaneous focus possible and gives a better user experience. The camera parameters are adjusted based on both the maximum conference rectangular area and the instantaneous focus of the conference scene: the maximum conference rectangular area ensures high quality of the global digital image, and the instantaneous focus ensures high quality of the local digital image.

Hereinafter, a technical scheme according to the present disclosure will be described with reference to specific embodiments and with reference to the accompanying drawings.

Fig. 1 is a step schematic diagram illustrating a method 100 of controlling a camera according to an embodiment of the present disclosure. Referring to fig. 1, the method 100 includes the following steps 102-108.

In step 102, sound source information collected by a microphone array associated with a current scene is acquired.

In some embodiments, the sound source information including at least one of a sound source direction, a sound source position, and a sound source intensity is acquired. In some embodiments, the current scene may include a video conference scene.

In step 104, statistics are performed for people in the scene based on the sound source information to obtain sound source statistics.

In some embodiments, the persons are counted using the sound source position information, and the statistical information is stored in a circular queue or first-in first-out queue so that each person is characterized by sound source position information, wherein sound sources whose position variance lies within a certain range are determined to belong to the same person. In addition, the activity level of each person is tracked using the sound source intensity information: the statistical information is stored in a circular queue or first-in first-out queue, and the average value of the sound source intensity in the statistics queue is calculated to represent the activity level of the person. Determining the activity level from the mean sound source intensity makes it convenient to take activity into account later when switching the focus of the camera.
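By way of non-limiting illustration, the statistics queue described above may be sketched in Python as follows. The class name `PersonStats`, the queue length, and the sample values are assumptions introduced only for this sketch and are not specified in the disclosure.

```python
from collections import deque
from statistics import mean


class PersonStats:
    """Bounded first-in first-out history of one person's sound source samples.

    Hypothetical helper: the disclosure only requires a circular or FIFO queue.
    """

    def __init__(self, maxlen=50):
        # deque(maxlen=...) discards the oldest sample once the queue is full.
        self.positions = deque(maxlen=maxlen)
        self.intensities = deque(maxlen=maxlen)

    def add(self, position, intensity):
        """Record one localized sound source sample."""
        self.positions.append(position)
        self.intensities.append(intensity)

    def activity(self):
        """Mean intensity over the queue; 0.0 when no sample has been stored."""
        return mean(self.intensities) if self.intensities else 0.0


stats = PersonStats(maxlen=3)
samples = [((100, 200), 40.0), ((101, 199), 50.0),
           ((100, 201), 60.0), ((102, 200), 70.0)]
for position, intensity in samples:
    stats.add(position, intensity)

# Only the three most recent samples remain, so the mean covers 50, 60 and 70.
print(stats.activity())  # 60.0
```

A bounded queue keeps the activity level responsive to recent speech while old samples age out automatically.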

In step 106, a scene area in which the persons in the scene are located is calculated according to the sound source statistical information.

In some embodiments, the sound source position is expanded to obtain a rectangular area by taking the sound source position as the center according to the sound source statistical information, and the rectangular area represents the area where the personnel corresponding to the sound source position are located. Further, rectangular combination operation is performed for rectangular areas corresponding to a plurality of persons to obtain the scene area. In some embodiments, the camera is controlled such that a capture area of the camera covers at least the scene area. In this way, conference global information can be obtained to ensure the quality of conference global images.

In some embodiments, the position information corresponding to sound source information with the same voiceprint feature is taken from the sound source statistical information; if the variance of this position information is within a preset range, a minimum first rectangular area covering the position information is calculated and taken as the person area in which that person is located. In this way, sound source information whose position variance lies within a certain range is attributed to the same person. In some embodiments, a minimum second rectangular area that includes at least all of the person areas is calculated and taken as the scene area. In this way, global information about the conference is obtained to ensure the quality of the global conference image.
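By way of non-limiting illustration, the expansion of each sound source position into a person rectangle and the merging of the person rectangles into a scene area may be sketched as follows. The names `Rect`, `person_rect` and `merge_rects`, the half-width and half-height values, and the pixel coordinates are assumptions of this sketch; the merge is taken here to be the smallest axis-aligned rectangle containing all person rectangles.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in image coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def person_rect(center, half_width, half_height):
    """Expand a sound source position into a rectangle centered on it."""
    cx, cy = center
    return Rect(cx - half_width, cy - half_height,
                cx + half_width, cy + half_height)


def merge_rects(rects):
    """Return the smallest rectangle that contains every input rectangle."""
    rects = list(rects)
    if not rects:
        raise ValueError("at least one rectangle is required")
    return Rect(min(r.left for r in rects), min(r.top for r in rects),
                max(r.right for r in rects), max(r.bottom for r in rects))


people = [person_rect((300, 400), 80, 120),
          person_rect((900, 450), 80, 120)]
scene = merge_rects(people)
print(scene)  # Rect(left=220, top=280, right=980, bottom=570)
```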

In step 108, a camera associated with the scene is controlled according to the current sound source information and the sound source statistical information such that the focus of the camera changes within the scene area according to the current sound source information. In some embodiments, if the current sound source information meets a threshold condition, an instantaneous focus associated with the current sound source information in the scene is calculated by combining the current sound source information and the sound source statistical information, and the focus of the camera is switched to the instantaneous focus. In some embodiments, the distance from the camera to the instantaneous focus of the scene is calculated, and the focus of the camera is controlled to switch to the instantaneous focus through a smooth zoom process. Because the switching of the camera focus takes the intensity of the current sound source information into account, the camera is adjusted promptly according to the sound source conditions in the scene.

In some embodiments, the current sound source information is updated to the sound source statistics information, the first sound source information with the highest intensity is selected from the updated sound source statistics information, and the position corresponding to the first sound source information is used as the instantaneous focus. In this way, a strong correlation of sound source information and camera control is achieved.
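By way of non-limiting illustration, selecting the instantaneous focus may be sketched as follows. The function name `instantaneous_focus` and the dictionary keys `position` and `intensity` are assumptions of this sketch, not a data format given in the disclosure.

```python
def instantaneous_focus(statistics, current):
    """Update the statistics with the current sample and return the position
    of the sample with the highest intensity.

    `statistics` is a list of dicts with hypothetical keys "position" and
    "intensity"; it is modified in place.
    """
    statistics.append(current)
    loudest = max(statistics, key=lambda sample: sample["intensity"])
    return loudest["position"]


history = [
    {"position": (100, 200), "intensity": 40.0},
    {"position": (500, 220), "intensity": 55.0},
]
focus = instantaneous_focus(history, {"position": (800, 210), "intensity": 70.0})
print(focus)  # (800, 210)
```

Because the maximum is taken over the updated statistics rather than the current sample alone, a quiet current sample does not pull the focus away from a louder recent speaker.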

In some embodiments, if the current sound source information does not meet the threshold condition, the focus of the camera is controlled to move sequentially among the persons in the scene area according to the sound source statistical information. In some embodiments, a weight value corresponding to each person in the scene is determined according to the sound source statistical information, and the focus of the camera is controlled to move smoothly among the persons in the scene area according to the weight values. The camera is moved smoothly using damping: the specific position of the camera is calculated by a built-in programmable damping routine, and the motor parameters are set accordingly to perform the rotation. Through the damped smoothing effect, the transition is fast and smooth during the main phase of the change and slow and smooth as the change ends, which improves the viewing experience.

In some embodiments, it is determined whether the difference between adjacent position samples of sound source information with the same voiceprint feature in the sound source statistical information is larger than a preset distance, and if so, that sound source information is filtered out. In this way, sound source information corresponding to a rapidly moving participant position is discarded.
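By way of non-limiting illustration, the filtering of rapidly moving positions may be sketched as follows. The function name `filter_fast_moves` and the threshold value are assumptions of this sketch. The sketch further assumes that each sample is compared with the most recently kept sample, which is one possible reading of "adjacent position information".

```python
import math


def filter_fast_moves(positions, max_step):
    """Drop position samples that jump more than `max_step` from the last
    kept sample of the same voiceprint (hypothetical reading of "adjacent")."""
    kept = []
    for position in positions:
        if kept and math.dist(kept[-1], position) > max_step:
            continue  # jump exceeds the preset distance, so discard the sample
        kept.append(position)
    return kept


track = [(0, 0), (3, 4), (100, 100), (4, 4)]
print(filter_fast_moves(track, max_step=10))  # [(0, 0), (3, 4), (4, 4)]
```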

In some embodiments, the average intensity value of the sound source information of the scene area is taken, according to the sound source statistical information, as the activity level of each person, and controlling the focus of the camera to move sequentially among the persons in the scene area according to the sound source statistical information comprises: ranking the persons by activity level, and switching the focus of the camera, in the ranked order, to the person area corresponding to each activity level. In this way, the focus of the camera is controlled to move slowly among the persons according to their activity levels.
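By way of non-limiting illustration, ranking the persons by activity to obtain the order in which the focus visits them may be sketched as follows. The function name `focus_order`, the person labels, and the choice of descending order (most active first) are assumptions of this sketch; the disclosure does not state the direction of the ranking.

```python
def focus_order(activity_by_person):
    """Return person identifiers sorted from most to least active.

    `activity_by_person` maps a hypothetical person identifier to the mean
    sound source intensity stored for that person.
    """
    return sorted(activity_by_person, key=activity_by_person.get, reverse=True)


activity = {"A": 42.0, "B": 61.5, "C": 17.0}
print(focus_order(activity))  # ['B', 'A', 'C']
```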

Hereinafter, application scenarios of the method of controlling a camera according to an embodiment of the present invention will be described by way of example.

Fig. 2 is a flowchart illustrating calculation of conference scene focus based on sound source information according to an embodiment of the present disclosure. Referring to fig. 2, the method includes the following steps 202 to 214.

In step 202, sound source information is acquired using a sound source localization component of a microphone array. Specifically, the sound source information includes: sound source direction (azimuth and pitch), sound position, sound intensity, and the like.

In step 204, the sound source information is filtered to remove sound source information corresponding to a rapidly moving participant position.

In step 206, the participants are counted using the sound source position information. Specifically, each participant is characterized by position information, the sound source statistical information is stored in a circular queue or FIFO queue, and samples whose position variance lies within a certain range are attributed to the same participant; for example, for 1080P images, samples whose position variance is within 2-5 pixels are attributed to the same participant.
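By way of non-limiting illustration, the position variance test may be sketched as follows. The function name `is_same_participant` and the default threshold are assumptions of this sketch. The sketch also assumes that the threshold is applied to the population variance of each pixel coordinate separately, which the disclosure does not specify.

```python
from statistics import pvariance


def is_same_participant(positions, max_variance=5.0):
    """Return True when the per-axis variance of the pixel positions stays
    within `max_variance` (hypothetical threshold for a 1080P image)."""
    if len(positions) < 2:
        return True  # a single sample has no spread
    xs = [x for x, _ in positions]
    ys = [y for _, y in positions]
    return pvariance(xs) <= max_variance and pvariance(ys) <= max_variance


print(is_same_participant([(100, 200), (102, 201), (101, 199)]))  # True
print(is_same_participant([(100, 200), (400, 210)]))              # False
```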

In step 208, the activity level of each participant is tracked using the sound source intensity information. Specifically, the sound source statistical information is stored in a circular queue or FIFO queue, the average of the sound source intensities in the statistics queue is calculated, and this average is used to characterize the activity level of the participant.

In step 210, the maximum conference rectangular area of the participants is calculated. Specifically, assuming that the statistics queue holds 6 participants together with their position information, the maximum conference rectangular area of the participants can be calculated. A rectangular area expanded around each sound source position as its center represents the area of that participant. When 6 participants are detected in the conference, 6 rectangular areas are generated, and the maximum conference rectangular area is obtained from them through a rectangle merging operation.

In step 212, the camera is controlled to cover 120% of the maximum conference rectangular area. Specifically, the maximum conference rectangular area is expanded outward from its center to obtain a rectangular area that is 120% of the maximum conference rectangular area. The video conference device then controls camera parameters such as focal length and viewing angle: the focus of the camera is preferably the center of the maximum conference rectangular area or the most active participant; if the camera has a physical mechanism for changing its rotation angle, the camera direction is preferably perpendicular to the maximum conference rectangular area. This adjustment improves the zoom-follow effect centered on the maximum conference rectangular area. The camera covers 110-130% of the maximum conference rectangular area, preferably 120%, to improve the digital image quality of that area.
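By way of non-limiting illustration, expanding the maximum conference rectangular area from its center may be sketched as follows. The function name `scale_rect` and the tuple layout `(left, top, right, bottom)` are assumptions of this sketch. The sketch further assumes that the 120% factor is applied to the width and the height; if the factor were instead meant to apply to the surface area, the square root of the factor would be used for each dimension.

```python
def scale_rect(rect, factor=1.2):
    """Scale a (left, top, right, bottom) rectangle about its center.

    Assumption: `factor` multiplies both width and height.
    """
    left, top, right, bottom = rect
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2
    half_width = (right - left) / 2 * factor
    half_height = (bottom - top) / 2 * factor
    return (center_x - half_width, center_y - half_height,
            center_x + half_width, center_y + half_height)


conference_area = (0, 0, 100, 50)
capture_area = scale_rect(conference_area, factor=1.2)
print(capture_area)
```

In a real device the result would additionally be clamped to the field of view that the camera can reach; that step is omitted here.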

In step 214, when the conference is silent, the focus of the camera is controlled to switch slowly among the participant positions according to weights. Specifically, the video conference device uses the activity level of each participant in the statistics queue as a weight value and, based on the instantaneous activity of the participants and the sound source statistical information, decides whether the focus stays or switches, thereby controlling the focus of the camera to move slowly among the participants.

In this embodiment, Table 1 shows the sound source statistical information stored in the circular queue or FIFO queue:

TABLE 1 Sound source statistics

Therefore, the maximum conference rectangular area is calculated from the sound source information of the microphone array using a statistical method, and the camera parameters are adjusted to improve the digital image quality of the maximum conference rectangular area. Using the sound source information of the microphone array thus indirectly improves the image quality of the effective video conference scene captured by the camera, and the microphone array and the camera no longer operate in isolation from each other.

Fig. 3 is a flowchart illustrating controlling camera parameters based on sound source information according to an embodiment of the present disclosure. Referring to fig. 3, the method includes the following steps 302 to 310.

In step 302, sound source information is acquired using a sound source localization component of the microphone array. The sound source information includes: sound source direction (azimuth and pitch), sound position, sound intensity, and the like.

In step 304, the instantaneous focus of the conference scene is calculated using the current sound source information and the sound source statistics. Specifically, the instantaneous sound intensity is stored in the sound source statistical information, then the maximum sound intensity is selected from the sound source statistical information, and the position information corresponding to the maximum sound intensity is used as the instantaneous focus of the conference scene.

In step 306, the distance of the camera to the instantaneous focus of the conference scene is calculated.

In step 308, the focus of the camera is controlled to switch to the instantaneous focus of the conference scene.

In step 310, a smoothing process is applied to the switching of the instantaneous focus of the conference scene.

Calculating the instantaneous focus of the conference scene from the instantaneous sound source information alone can lead to situations that degrade the user experience; for example, when 2 participants talk in rapid alternation, the instantaneous focus of the conference scene switches frequently. Calculating the instantaneous focus from both the instantaneous sound source information and the sound source statistical information achieves slow switching and delayed switching of the instantaneous focus, and therefore a better user experience.
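By way of non-limiting illustration, one possible way to obtain delayed switching is to change the focus only after a new speaker has been the loudest source for several consecutive updates. The class name `DelayedFocus`, the parameter `hold_frames`, and this particular hold rule are assumptions of this sketch; the disclosure states the effect but does not prescribe the rule.

```python
class DelayedFocus:
    """Switch the focus only after a candidate has persisted long enough."""

    def __init__(self, hold_frames=3):
        self.hold_frames = hold_frames  # hypothetical persistence requirement
        self.current = None
        self._candidate = None
        self._count = 0

    def update(self, loudest):
        """Feed the loudest speaker of this update; return the active focus."""
        if self.current is None:
            self.current = loudest
        elif loudest == self.current:
            # The current speaker is loudest again, so forget any candidate.
            self._candidate, self._count = None, 0
        else:
            if loudest == self._candidate:
                self._count += 1
            else:
                self._candidate, self._count = loudest, 1
            if self._count >= self.hold_frames:
                self.current = loudest
                self._candidate, self._count = None, 0
        return self.current


tracker = DelayedFocus(hold_frames=3)
trace = [tracker.update(s) for s in ["A", "B", "A", "B", "B", "B"]]
print(trace)  # ['A', 'A', 'A', 'A', 'A', 'B']
```

With this rule, the brief interjection by "B" in the second update does not move the focus, while three consecutive updates dominated by "B" do.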

Fig. 4 is a flowchart illustrating a camera smoothing zoom method based on sound intensity according to an embodiment of the present disclosure. Referring to fig. 4, the method includes the following steps 402 to 408.

In step 402, sound source information is acquired using a sound source localization component of a microphone array.

In step 404, the instantaneous focus of the conference scene is calculated using the current sound source information and the sound source statistics.

In step 406, the instantaneous focus of the conference scene is switched from point A to point B with a smoothing effect. In particular, the smoothing effect is preferably a damped smoothing effect, that is, a fast smooth transition during the main phase of the change and a slow smooth transition towards the end of the change.
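By way of non-limiting illustration, a damped transition from point A to point B may be sketched as exponential easing, in which each update moves the focus by a fixed fraction of the remaining distance. The function name `damped_path` and the parameters `damping`, `tolerance` and `max_steps` are assumptions of this sketch; the disclosure does not specify the damping formula.

```python
def damped_path(start, target, damping=0.3, tolerance=0.5, max_steps=200):
    """Return intermediate focus positions from `start` towards `target`.

    Each step covers `damping` times the remaining distance, so steps are
    large at first and shrink as the target is approached.
    """
    x, y = start
    target_x, target_y = target
    path = []
    for _ in range(max_steps):
        x += (target_x - x) * damping
        y += (target_y - y) * damping
        path.append((x, y))
        if abs(target_x - x) <= tolerance and abs(target_y - y) <= tolerance:
            break  # close enough to the target to stop
    return path


path = damped_path((0.0, 0.0), (100.0, 0.0))
steps = [b[0] - a[0] for a, b in zip([(0.0, 0.0)] + path, path)]
print(len(path), round(steps[0], 3), round(steps[-1], 3))
```

The step sizes decrease monotonically, which corresponds to the fast transition at the start and the slow transition near the end described above.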

In step 408, disturbances of the instantaneous focus of the conference scene are filtered, keeping the instantaneous focus stable.

Switching the instantaneous focus of the conference scene can produce effects that degrade the user experience, such as abrupt, jumping pictures. The advantage of this method is that it makes full use of the sound source information to control the camera parameters: the instantaneous focus of the conference scene is switched from point A to point B with a smoothing effect, which noticeably improves the user experience.

According to another aspect of the present invention, fig. 5 is a block diagram illustrating a videoconferencing system, according to an embodiment of the present disclosure. The system includes a microphone array, at least one camera, and a controller.

The camera is preferably a dual camera with long and short focus, or a wide zoom camera with optical zoom. The parameters of the camera may include: resolution, frame rate, shooting mode, flash, exposure, white balance, focus, and focal length.

The microphone array is preferably a microphone array based on steerable beamforming with maximum output power, or a microphone array based on high-resolution spectral estimation, or a microphone array based on sound source localization using time differences of arrival.

The controller is configured to acquire sound source information acquired by the microphone array, and to count persons in the conference based on the sound source information to obtain sound source statistics information. The controller is configured to calculate a conference area in which a person in the conference is located based on the sound source statistics. Further, the controller is configured to control the camera according to current sound source information and the sound source statistics information such that a focus of the camera is changed according to the current sound source information in the conference area.

According to yet another aspect of the invention, fig. 6 is a block diagram illustrating an electronic device 600 according to an embodiment of the invention. Referring to fig. 6, the electronic device 600 includes a memory 602 and a processor 604. The memory 602 stores executable programs. The processor 604, when executing the executable program, performs the steps of the method of controlling a camera as described above.

According to still another aspect of the present invention, there is provided a storage medium. The medium has stored thereon a computer program to be executed by a processor for implementing the method of controlling a camera as described above.

In summary, the present invention provides a method for controlling a camera, a video conference system, an electronic device, and a storage medium. Sound source statistical information for a remote conference is obtained from a microphone array, and the participant areas and the conference area are calculated from the sound source statistical information. When the intensity of the current sound source information reaches a threshold value, the instantaneous focus of the conference is calculated by combining the current sound source information and the sound source statistical information, and the focus of the camera is switched to it; otherwise, the focus of the camera is switched sequentially among the participant areas. The conference rectangular area is thus calculated to obtain global information about the conference, which ensures the quality of the global conference image, and when the current sound source information indicates that someone is speaking, the focus of the camera is switched to the instantaneous focus, so that the camera is adjusted promptly to follow the speaker and the quality of the local image is ensured when switching to a close-up view. In this way, the camera is controlled dynamically according to the sound data collected by the microphone.

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent changes made by the specification and drawings of the present invention, or direct or indirect application in the relevant art, are included in the scope of the present invention.

Claims

1. A method of controlling a camera, comprising:

acquiring sound source information acquired by a microphone array associated with a current scene, wherein the sound source information comprises at least one of a sound source direction, a sound source position and a sound source intensity;

compiling statistics on the persons in the scene based on the sound source information to obtain sound source statistical information, comprising: counting the persons using the sound source position information, and storing the counted information in a circular or first-in first-out queue so as to characterize the persons by the sound source position information, wherein sound sources whose position variance is within a certain range are judged to be the same person; and compiling statistics on the activity of the persons using the sound source intensity information, and storing the counted information in the circular or first-in first-out queue so as to calculate the average value of the sound source intensity in the statistics queue to characterize the activity of the persons;

calculating a scene area where a person in the scene is located according to the sound source statistical information, including: according to the sound source statistical information, expanding the sound source position as the center to obtain a rectangular area, and carrying out rectangular combination operation on the rectangular area corresponding to a plurality of people to obtain the scene area, wherein the rectangular area represents the area where the person corresponding to the sound source position is located;

if the difference value of adjacent position information of sound source information of the same voiceprint characteristic in the sound source statistical information is larger than a preset distance, filtering the sound source information;

taking the intensity average value of the sound source information of the scene area as the activity of the personnel according to the sound source statistical information; and

controlling a camera associated with the scene according to current sound source information and the sound source statistics information such that a focus of the camera changes in the scene area according to the current sound source information, comprising: if the current sound source information meets a threshold condition, calculating an instantaneous focus associated with the current sound source information in the scene by combining the current sound source information and the sound source statistical information, and switching the focus of the camera to the instantaneous focus; if the current sound source information does not meet the threshold condition, controlling the focus of the camera to sequentially move among the people in the scene area according to the sound source statistical information, including: determining a weight value corresponding to a person in the scene according to the sound source statistical information, controlling the focus of the camera to smoothly move between the persons in the scene area according to the weight value,

wherein controlling the focus of the camera to move sequentially among the persons in the scene area according to the sound source statistical information comprises: ranking the persons by activity, and switching the focus of the camera sequentially, in the ranked order, to the person area corresponding to each activity.

2. The method of claim 1, wherein switching the focal point of the camera to the instantaneous focal point comprises:

calculating the distance from the camera to the instantaneous focus of the scene; and

controlling the focus of the camera to switch to the instantaneous focus through a smooth zooming process.

3. The method as recited in claim 1, further comprising:

controlling the camera so that the shooting area of the camera at least covers the scene area.

4. The method of claim 1, wherein calculating a scene area in which a person in the scene is located based on the sound source statistics comprises:

for the position information corresponding to sound source information having the same voiceprint feature in the sound source statistical information, if the variance of the position information is within a preset range, calculating a minimum first rectangular area enclosing the position information, and taking the first rectangular area as the person area where a person is located.

5. The method of claim 4, wherein calculating a scene area in which a person in the scene is located based on the sound source statistics further comprises:

calculating a second rectangular area that contains at least all of the person areas, and taking the second rectangular area as the scene area.
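The rectangle calculations of claims 4 and 5 can be sketched as follows. This is an illustrative assumption rather than the patented procedure: positions are taken as 2D points already grouped by voiceprint, the "preset range" is modeled as a per-axis population variance no greater than a hypothetical `max_variance`, and rectangles are axis-aligned tuples `(x_min, y_min, x_max, y_max)`.

```python
from statistics import pvariance
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def person_rect(positions: Sequence[Point],
                max_variance: float) -> Optional[Rect]:
    """Smallest rectangle enclosing one person's recorded positions.

    Returns None when the positions are too scattered to be treated
    as a single stationary person (assumed variance test).
    """
    if not positions:
        return None
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    if pvariance(xs) > max_variance or pvariance(ys) > max_variance:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def scene_rect(rects: Sequence[Optional[Rect]]) -> Optional[Rect]:
    """Smallest rectangle containing every person rectangle."""
    valid: List[Rect] = [r for r in rects if r is not None]
    if not valid:
        return None
    return (min(r[0] for r in valid), min(r[1] for r in valid),
            max(r[2] for r in valid), max(r[3] for r in valid))
```

Using the bounding box of the person rectangles is the tightest choice that satisfies "contains at least all of the person areas"; a real system might add a margin around it.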

6. The method of claim 1, wherein calculating an instantaneous focus associated with the current sound source information in the scene in combination with the current sound source information and the sound source statistics comprises:

updating the sound source statistical information with the current sound source information;

selecting first sound source information with the maximum intensity from the updated sound source statistical information; and

taking the position corresponding to the first sound source information as the instantaneous focus.
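The three steps of claim 6 map directly onto a short Python sketch. The `SoundSource` record and the use of a plain list for the statistical information are assumptions made for illustration; the claim does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class SoundSource:
    position: Point   # assumed (x, y) position of the source
    intensity: float  # measured sound intensity


def instantaneous_focus(current: SoundSource,
                        statistics: List[SoundSource]) -> Point:
    """Apply the three steps of the claim in order."""
    # Step 1: update the statistics with the current sound source.
    statistics.append(current)
    # Step 2: select the entry with the maximum intensity.
    loudest = max(statistics, key=lambda s: s.intensity)
    # Step 3: its position becomes the instantaneous focus.
    return loudest.position
```

Because the current source is added before the maximum is taken, a quiet current source does not pull the focus away from a louder entry already recorded in the statistics.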

7. A video conferencing system, comprising:

a microphone array;

at least one camera; and

a controller configured to:

acquiring sound source information collected by the microphone array, wherein the sound source information comprises at least one of a sound source direction, a sound source position and a sound source intensity;

counting persons in a conference based on the sound source information to obtain sound source statistical information, including: counting the persons using sound source position information, and storing the counted information in a circular or first-in first-out queue so that each person is represented by the sound source position information, wherein sound sources whose position variance is within a certain range are judged to be the same person; and counting the activity of the persons using sound source intensity information, and storing the counted information in a circular or first-in first-out queue, wherein the average value of the sound source intensity in the statistics queue is calculated to represent the activity of the person;

according to the sound source statistical information, expanding outward from the sound source position as the center to obtain a rectangular area, and performing a rectangle merging operation on the rectangular areas corresponding to a plurality of persons to obtain a scene area, wherein each rectangular area represents the area where the person corresponding to the sound source position is located;

controlling the camera according to current sound source information and the sound source statistical information, such that the focus of the camera changes within the scene area according to the current sound source information, wherein: if the current sound source information meets a threshold condition, calculating an instantaneous focus associated with the current sound source information in the scene by combining the current sound source information with the sound source statistical information, and switching the focus of the camera to the instantaneous focus; if the current sound source information does not meet the threshold condition, controlling the focus of the camera to move sequentially among the persons in the scene area according to the sound source statistical information, including determining a weight value corresponding to each person in the scene according to the sound source statistical information, and controlling the focus of the camera to move smoothly among the persons in the scene area according to the weight value,
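The queue-based statistics and the center-expanded rectangle described for the controller of claim 7 might look like the following Python sketch. The class `PersonStats`, the function `expand_rect`, the queue capacity and the half-width and half-height parameters are hypothetical; the claim only requires a circular or first-in first-out queue and an average of the stored intensities.

```python
from collections import deque
from typing import Deque, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


class PersonStats:
    """Fixed-size FIFO history of one person's sound source samples."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest sample once it is full,
        # which gives first-in first-out behavior.
        self.positions: Deque[Point] = deque(maxlen=capacity)
        self.intensities: Deque[float] = deque(maxlen=capacity)

    def add(self, position: Point, intensity: float) -> None:
        self.positions.append(position)
        self.intensities.append(intensity)

    def activity(self) -> float:
        """Mean intensity over the queue, used as the activity value."""
        if not self.intensities:
            return 0.0
        return sum(self.intensities) / len(self.intensities)


def expand_rect(center: Point, half_width: float, half_height: float) -> Rect:
    """Rectangle obtained by expanding outward from a source position."""
    cx, cy = center
    return (cx - half_width, cy - half_height,
            cx + half_width, cy + half_height)
```

A bounded queue keeps the activity value responsive: a person who stops speaking sees older, louder samples age out as new ones arrive.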

8. An electronic device, comprising:

a memory configured to store an executable program; and

a processor configured to execute the executable program to perform the method according to any one of claims 1 to 6.

9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed, implements the method according to any one of claims 1 to 6.