
CN113905221B - Stereoscopic panoramic video asymmetric transport stream self-adaption method and system - Google Patents


Info

Publication number
CN113905221B
CN113905221B (application CN202111165065.8A)
Authority
CN
China
Prior art keywords
quality
code rate
slice
view
video
Prior art date
Legal status
Active
Application number
CN202111165065.8A
Other languages
Chinese (zh)
Other versions
CN113905221A (en)
Inventor
兰诚栋
梁昊霖
饶迎节
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202111165065.8A
Publication of CN113905221A
Application granted
Publication of CN113905221B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/194 Transmission of image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/146 Data rate or code amount at the encoder output
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention relates to an adaptive method and system for asymmetric transport streams of stereoscopic panoramic video, wherein the method comprises the following steps: S1, a server cuts video data into segments in time and cuts the segments into slices in space; S2, the cut video is cached in an HTTP server at different video qualities and code rates; S3, probability prediction is carried out by combining a 3DCNN network and an LSTM network; S4, joint code rate control of the left and right viewpoints is performed with AC-based multi-agent reinforcement learning, so as to balance the mutual influence of single-view quality and overall quality; S5, a reward function is designed so that the system selects a more suitable code rate; and S6, the downloaded data are decoded, spliced and stitched, stored in the play buffer of the client, and rendered and played by playback software. The method helps improve the joint code rate control of stereoscopic panoramic video and the quality of experience of users under limited bandwidth.

Description

Stereoscopic panoramic video asymmetric transport stream self-adaption method and system
Technical Field
The invention belongs to the field of stereoscopic panoramic video transmission, and particularly relates to a stereoscopic panoramic video asymmetric transmission stream self-adaption method and system based on multi-agent reinforcement learning.
Background
Virtual Reality (VR) is a new computer technology developed by integrating computer graphics, multimedia, sensor, human-computer interaction, networking, stereoscopic display and simulation technologies. Cisco predicted that by 2021 the internet traffic generated by immersive applications would increase 20-fold, so current network bandwidth cannot keep up with the development of VR video. While monoscopic 360° videos are perhaps the most popular type of current VR content, they lack 3D information and therefore cannot be viewed with full six degrees of freedom (DoF). Stereoscopic panoramic video has received attention as a way to further enhance immersion beyond 3DoF 360° video. In conventional panoramic video transmission, Chakareski J et al. proposed a rate-distortion model mapping the relation between the quantization parameter (QP) and bit rate, and on that basis developed a tile-based bit stream allocation algorithm. Van der Hooft J et al. allocate the bit stream according to the distance from the tile center to the view center: with each tile guaranteed the lowest quality, the surplus bandwidth assigns more bits to nearer tiles and fewer to farther ones. Xie L et al. built a tile probability prediction model with a mathematical method and then selected different code rates for each tile with a bit stream adaptation strategy based on a target buffer. Reinforcement learning methods can obtain optimal decisions over a long time horizon, and rate selection for panoramic video using reinforcement learning has been studied by many researchers.
Jiang X et al. used the A3C reinforcement learning algorithm to select code rates for the viewport region, neighboring region and outer region from inputs such as the previous bandwidth, the previous prediction accuracy and the current bandwidth; this has become a classical algorithm for reinforcement-learning-based panoramic video region code rate selection. Kan N et al. also used the A3C algorithm for bit stream allocation over three regions. They argued that the buffer should not be made excessively large, so as to preserve viewport prediction accuracy, and used the buffer size in the reward function to encourage the algorithm toward a buffer size that balances prediction accuracy against playback stalling; this shows that the design of the reward function has an important influence on system operation. Zhang Y et al. used the AC algorithm and embedded an LSTM network into the state transitions, using the LSTM's predictive ability to adjust states in an orderly way, reducing the search space and aiding decision making. At present, stereoscopic panoramic video transmission is little studied. Based on the binocular suppression principle, Naik D et al. evaluated the quality of asymmetric stereoscopic panoramic video, listing DMOS values under various QPs and spatial scaling ratios; they concluded that binocular suppression also applies to stereoscopic panoramic video, and that scaling the spatial resolution of one view can save 25%-50% of bandwidth under acceptable conditions. Xu G et al. downsample one view horizontally and vertically and upsample it at the decoding end, while the other view remains unchanged, to achieve asymmetric transmission.
These methods all perform asymmetric coding with a fixed code rate or by downsampling, and do not fully consider the influence of real-time changes in network bandwidth on QoE.
Disclosure of Invention
The invention aims to provide an adaptive method and system for asymmetric transport streams of stereoscopic panoramic video that improve the joint rate control of stereoscopic panoramic video.
In order to achieve the above purpose, the invention adopts the following technical scheme: a stereoscopic panoramic video asymmetric transport stream self-adaption method comprises the following steps:
s1, a server side cuts video data into fragments in time and cuts the fragments into slices in space;
s2, caching the cut video in an HTTP server according to different video quality and different code rates;
s3, carrying out probability prediction by combining a 3DCNN network and an LSTM network;
s4, performing joint code rate control on the left and right view points by utilizing multi-agent reinforcement learning based on the AC so as to balance the mutual influence of the quality of the single-path view point and the overall quality;
s5, designing a reward function so that the system can select a more proper code rate;
and S6, decoding, splicing and stitching the downloaded data, storing the data in a play buffer of the client, and rendering and playing the data through playing software.
Further, in the step S3, feature extraction is performed on the obtained static saliency information, dynamic saliency information and binocular disparity information of the main-viewpoint sequence slices by using a 3DCNN network; meanwhile, the LSTM network is used to predict head motion data, which is then concatenated and fused with the feature information extracted by the 3DCNN network; finally, the fused result is fed into several fully connected layers to obtain the viewing probabilities with which the left and right viewpoints attend to different information; the viewing probability of the i-th slice obtained by this probability prediction method is denoted P_i.
Further, the step S4 specifically includes the following steps:
the left and right viewpoints of the panoramic video are each divided temporally into N segments of length T, each segment comprises K slices, and each slice has M code rate levels; each slice in each segment selects a code rate a_i, where i ∈ {0, …, M−1}; q(a_i) represents the mapping from code rate to perceived quality; the viewing probabilities of each slice of the left and right viewpoints are P_i^L and P_i^R, respectively; using Actor-Critic-based multi-agent reinforcement learning, each slice is taken as an agent, the agents share a state and act jointly, thereby realizing code rate allocation;
when multi-agent reinforcement learning is adopted for slice code rate allocation within the left and right viewpoints, the reward of each agent mixes the local reward obtained from the environment by the joint action of the agents within a single viewpoint with the global reward obtained when the agents of the left and right viewpoints are combined;
the global rewards and the local rewards are separated and optimized respectively by introducing a Global-Critic for supervision, so as to ensure the stability of the model; the modified policy gradient of each agent is:
∇_θ J(θ) = E_ep[ ∇_θ log π_θ(a_i | o_i) · (Q_l(o_i, a_i) + Q_g(s, a_1, …, a_K)) ]
where ep denotes the sample replay buffer, o_i is the local environment, i.e. the environment of the agent, a_i is the code rate selected by the agent, s is the overall environment, i.e. the combination of the environment states of all agents, θ is the parameter trained by the network model, Q_l(o_i, a_i) is the local value function of each agent, and Q_g(s, a_1, …, a_K) is the global value function composed of all the agents;
the loss function of Q_l(o_i, a_i) is:
L_l = E[ (y_l − Q_l(o_i, a_i))² ],  y_l = r_l + γ · Q_l(o_i′, a_i′)
wherein y_l is the estimate of the local value function, r_l is the local reward, and γ is the discount factor;
the loss function of Q_g(s, a_1, …, a_K) is:
L_g = E[ (y_g − Q_g(s, a_1, …, a_K))² ]
wherein y_g is the estimate of the global value function and r_g is the global reward; y_g, the Q value that drives the agents to take the jointly optimal action in the global state composed of the left and right viewpoints, is expressed as:
y_g = r_g + γ · max_{a′} Q_g(s′, a_1′, …, a_K′)
further, the step S5 specifically includes the following steps:
assuming that each agent shares a state at each moment in the left and right viewpoints, the input states are respectively:
wherein,representing network throughput of past k segments; />Representing an optional code rate set; b t Representing the current buffer size; z t Representing the average code rate of the last segment; />And->Download time of k past clips respectively representing left and right viewpoints; />And->The viewing probability of each slice of the left and right viewpoints is respectively represented; />And->Respectively representing the set of code rates selected by the slices of the left and right view points of the last segment;
the size of the watching probability of each slice determines the contribution degree of the whole video quality; when the slice is in the viewport region,1, otherwise 0; therefore, the average quality of view ports of the left and right viewpoint segments is:
spatial quality change:
wherein,and->Respectively represent the spatial domain quality change of the viewing port of the left and right view points, q (a) i ) Representing code rateMapping to perceived quality;
the average quality change of the left and right viewpoint viewport regions at the front and rear moments reflects the fluctuation of video quality in the time domain; time domain quality change:
wherein,and->Respectively representing time domain quality changes of the left and right view points and view ports;
the fragments of the left and right viewpoints are continuously downloaded, the fragments form the final downloading time, and the left and right viewpoints together influence the buffering time of the system; meanwhile, the agents with different code rates are selected for the left and right view points to be in a cooperative relationship; when the requested data is completely downloadedData size b of buffer memory larger than sending request time t-1 When the data is not completely downloaded and the buffer memory is exhausted, a buffer phenomenon occurs; the buffer time is as follows:
the quality difference of the slices at the corresponding positions of the left and right viewpoints is too large, and the QoE is seriously reduced when the quality difference exceeds a set range; and the symmetric coding has better performance when the quality of the left and right view points is smaller; in order to avoid the too large quality difference of the slices corresponding to the left and right viewpoints, a punishment item A is designed t To limit the code rate difference of the corresponding slices of the left and right view pointsSize of:
wherein,representing right view slice quality, < >>Representing a difference in left and right view slice quality; when->When larger, the person is in need of->Allow a variation in a larger range, but +.>Does not change significantly; when->Less time, ->In the case of large range changes, +.>Can vary significantly; therefore, the penalty term constrains that the left and right view quality is poor, but has higher acceptance for asymmetric coding in the case of high right view slice quality;
the local rewards aim at single view points, and in order to make the space domain and time domain changes in the view points as small as possible, the space domain and time domain changes are set as negative rewards; the global rewards are aimed at the whole formed by the left viewpoint and the right viewpoint, and the average quality is set as positive rewards in order to obtain higher average quality; to reduce the buffering time and avoid too large difference in quality between left and right view points, the buffering time is reducedAnd the left and right viewpoint quality difference constraint term is a negative reward; setting left and right viewpoint local rewards r t L,l ,r t R,l And global rewards r t g The function expressions are respectively as follows:
wherein λ and η are weights;
and by utilizing head motion data acquired from the playing equipment, selecting different code rates for the intra-view and extra-view region slices by means of view prediction and combining current bandwidth data, reducing the code rate of the slice with lower significance in each path of view, improving the code rate of the slice with higher significance in each path of view, and reasonably distributing network bandwidth data.
The invention also provides a stereoscopic panoramic video asymmetric transport stream adaptive system, comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor, the computer program instructions, when executed by the processor, implementing the steps of the above method.
Compared with the prior art, the invention has the following beneficial effects: the method and system take into account the saliency differences among the slices within the stereoscopic panoramic video viewpoints, i.e. the different contributions of the corresponding left-view and right-view slices to subjective quality; they reasonably reduce the code rate of low-saliency slices in each view, raise the code rate of high-saliency slices, allocate network bandwidth through reinforcement learning, and set an appropriate reward function according to the binocular suppression principle, thereby improving the overall video quality. The invention uses multi-agent reinforcement learning to select the code rate of each slice of the left and right viewpoints separately, avoiding the action-space explosion caused by selecting code rates for many slices with conventional reinforcement learning. Finally, to ensure the effectiveness of the system, a step-by-step update strategy is employed to balance the overall rewards with the local rewards of the left and right views.
Drawings
FIG. 1 is a schematic diagram of a multi-agent reinforcement learning model in an embodiment of the invention;
FIG. 2 is a block diagram of an adaptive system for stereoscopic panoramic video asymmetric transport stream in accordance with an embodiment of the invention;
FIG. 3 is a schematic view of a tile-based view prediction probability model in an embodiment of the present invention;
FIG. 4 is a diagram of a joint rate control method architecture based on multi-agent reinforcement learning in an embodiment of the present invention;
FIG. 5 is a 4G and 5G bandwidth trace in an embodiment of the invention;
FIG. 6 is a graph comparing the performance of the methods in an embodiment of the invention;
in fig. 6, (a) 4K video is transmitted for 4G bandwidth, (b) 8K video is transmitted for 4G bandwidth, (c) 4K video is transmitted for 5G bandwidth, and (d) 8K video is transmitted for 5G bandwidth;
FIG. 7 is a comparative CDF chart for various methods in an embodiment of the invention;
in FIG. 7, (a) is the average QoE value measured at 4G-4K, (b) is the average QoE value measured at 4G-8K, (c) is the average QoE value measured at 5G-4K, and (d) is the average QoE value measured at 5G-8K.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in fig. 1-4, the present embodiment provides a stereoscopic panoramic video asymmetric transport stream adaptive method, which includes the following steps:
s1, a server side cuts video data into segment segments in time and cuts the segment segments into slice tiles in space;
s2, caching the cut video in an HTTP server according to different video quality and different code rates;
s3, carrying out probability prediction by combining a 3DCNN network (3Dimension convolution neuron network,3-dimensional convolutional neural network) and an LSTM network (Long Short-Term Memory network);
s4, performing joint code rate control on the left and right view points by utilizing multi-agent reinforcement learning based on an AC (Actor-Critic) so as to balance the mutual influence of the quality of the single-path view point and the overall quality;
s5, designing a reward function so that the system can select a more proper code rate;
and S6, decoding, splicing and stitching the downloaded data, storing the data in a play buffer of the client, and rendering and playing the data through playing software such as head-mounted equipment.
In this embodiment, the specific implementation method of steps S1 to S2 is as follows:
the panoramic video is cut into segments in time and tiles in space by a specific tool, and meanwhile, a media description file MPD is generated. When in transmission, the MPD file is transmitted preferentially, and the client analysis module analyzes the MPD file so as to analyze the information such as the code rate, resolution, frame rate, download address and the like of the cut video clips. After the client control module analyzes the information of the available video, in order to ensure that video data in the future time viewing area can select a high code rate, an estimation must be made on the future time viewing port. The code rate selection needs to make optimal code rate decisions for tiles in different areas according to the current network situation and the future view port positions. And the client selects proper code rates for tiles in the view field and tiles outside the view field at future time according to the network bandwidth condition and the predicted view port position. And the client sends a downloading request to the server through the HTTP module, and finally downloads the streaming media file according to the URL address. After the client downloads the requested file, the file can be decoded and played.
In this example, experimental verification was performed using the simulated transmission platform of the literature [Jiang X, Chiang Y-H, Zhao Y, et al. Plato: Learning-based Adaptive Streaming of 360-Degree Videos [C]. 2018 IEEE 43rd Conference on Local Computer Networks (LCN), 2018: 393-400]. The platform assumes that the client communicates with the server over HTTP/2; when the server receives a client request, it sends all tiles contained in a segment. A packet load rate of 95% and a round-trip time of 80 ms are assumed. Following the default repeat-request time of the DASH player, the re-request time is set to 500 ms when the buffer is full.
The play buffer size is set to 3 s. The main framework is implemented in Python and PyTorch.
As no stereoscopic panoramic video data set is publicly available, four stereoscopic panoramic videos with resolutions of 4K and 8K were downloaded from YouTube for this embodiment. The data set was sliced in the spatial and temporal domains with FFmpeg and encoded with HEVC, and the encoded data was packaged with MP4Box. The data reflecting actual user head movements come from the published data set in the literature [Corbillon X, De Simone F, Simon G. 360-Degree Video Head Movement Dataset [C]. Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17), 2017: 199-204]. As 5G technology becomes increasingly widespread, using it to transmit high-definition video is a foreseeable choice. We therefore verify the effect of transmitting 4K and 8K stereoscopic panoramic video with different algorithms at 4G and 5G bandwidths, respectively. The bandwidth data sets are the 4G bandwidth data set measured in Belgium by van der Hooft J et al. and the 5G bandwidth data set measured by Darijo Raca. The bandwidth traces are shown in FIG. 5. The viewport size is 110° horizontally and 90° vertically. The common ERP projection is adopted, with a 6x4 tile layout, i.e. K is 24.
Following the recommendation of YouTube, the world's largest video website, the code rate selection range is set to [40, 16, 8, 5, 2.5, 1] Mbps, i.e. M is 6. Following the proposal in the literature [Jiang X, Chiang Y-H, Zhao Y, et al. Plato: Learning-based Adaptive Streaming of 360-Degree Videos [C]. 2018 IEEE 43rd Conference on Local Computer Networks (LCN), 2018: 393-400], the mapping q(a_i) from code rate to quality can be set as follows:
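The concrete mapping is not reproduced in this text; as an illustration, the following sketch assumes a logarithmic quality law over the six code rate levels listed above. The log form and the `r_min` normalisation are assumptions of this sketch (a common choice in rate-adaptation work), not the patent's exact mapping.

```python
import math

RATES_MBPS = [1, 2.5, 5, 8, 5 + 3, 16, 40][0:1] + [2.5, 5, 8, 16, 40]  # M = 6 levels from the text

def q(rate_mbps: float, r_min: float = 1.0) -> float:
    """Map a code rate to a perceived-quality score (assumed logarithmic form)."""
    return math.log(rate_mbps / r_min + 1.0)

# Quality should grow monotonically with code rate, with diminishing returns.
qualities = [q(r) for r in RATES_MBPS]
```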
in the step S3, feature extraction is performed on the obtained static significant information, dynamic significant information and parallax information of the binocular viewpoint of the main viewpoint sequence slice by using a 3DCNN network; meanwhile, the LSTM network is utilized to predict head motion data, and then the head motion data is spliced and fused with the characteristic information extracted by the 3DCNN network; finally, inputting the spliced and fused results into a plurality of full-connection layers to respectively acquire the viewing probabilities of the left and right view points focusing on different information; the viewing probability of the ith slice is recorded as P by the probability prediction method i
In this embodiment, OpenCV is used to obtain the static saliency map, dynamic saliency map and disparity map of the two video streams, and two fully connected layers are used to predict the left-view and right-view probabilities, respectively.
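The probability-prediction architecture described above can be sketched in PyTorch as follows. The channel counts, layer sizes, and the yaw/pitch/roll encoding of the head-motion trace are illustrative assumptions, not the patent's exact configuration; only the overall structure (3D CNN on stacked saliency/disparity maps, LSTM on head motion, concatenation, per-eye fully connected heads) follows the text.

```python
import torch
import torch.nn as nn

class ViewProbNet(nn.Module):
    """Sketch: 3D CNN extracts spatio-temporal features from stacked static
    saliency, dynamic saliency and disparity maps; an LSTM encodes the head
    motion trace; the fused vector feeds two FC heads, one per eye."""
    def __init__(self, num_tiles: int = 24):  # 6x4 layout, K = 24 tiles
        super().__init__()
        # Input: (B, 3, T, H, W) = static saliency, dynamic saliency, disparity
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())            # -> (B, 8)
        self.lstm = nn.LSTM(input_size=3, hidden_size=16,     # head motion: yaw/pitch/roll
                            batch_first=True)
        self.head_l = nn.Linear(8 + 16, num_tiles)            # left-view probabilities
        self.head_r = nn.Linear(8 + 16, num_tiles)            # right-view probabilities

    def forward(self, sal, head):
        f = self.cnn3d(sal)                  # saliency/disparity features
        _, (h, _) = self.lstm(head)          # last hidden state of the head trace
        fused = torch.cat([f, h[-1]], dim=1)
        p_l = torch.softmax(self.head_l(fused), dim=1)
        p_r = torch.softmax(self.head_r(fused), dim=1)
        return p_l, p_r

net = ViewProbNet()
p_l, p_r = net(torch.randn(2, 3, 4, 16, 32), torch.randn(2, 10, 3))
```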
In this embodiment, the step S4 specifically includes the following steps:
The left and right viewpoints of the panoramic video are each divided temporally into N segments of length T, each segment comprises K slices, and each slice has M code rate levels; each slice in each segment selects a code rate a_i, where i ∈ {0, …, M−1}. In the single-viewpoint case, selecting a code rate for every tile by reinforcement learning yields M^K possibilities at each time step; such a huge action space is infeasible in practice. Actor-Critic-based multi-agent reinforcement learning is therefore used: each slice is taken as an agent, the agents share a state and act jointly, thereby realizing code rate allocation.
When multi-agent reinforcement learning is adopted for slice code rate selection within the left and right viewpoints, the reward of each agent mixes the local rewards (e.g., Q_avg, Q_sv, Q_tv) obtained from the environment by the joint action of the agents within a single viewpoint with the global rewards (e.g., T_rb) obtained when the agents of the left and right viewpoints are combined.
The global rewards and the local rewards are separated and optimized respectively by introducing a Global-Critic for supervision, so as to ensure the stability of the model; the modified policy gradient of each agent is:
∇_θ J(θ) = E_ep[ ∇_θ log π_θ(a_i | o_i) · (Q_l(o_i, a_i) + Q_g(s, a_1, …, a_K)) ]
where ep denotes the sample replay buffer, o_i is the local environment, i.e. the environment of the agent, a_i is the code rate selected by the agent, s is the overall environment, i.e. the combination of the environment states of all agents, θ is the parameter trained by the network model, Q_l(o_i, a_i) is the local value function of each agent, and Q_g(s, a_1, …, a_K) is the global value function composed of all the agents.
The loss function of Q_l(o_i, a_i) is:
L_l = E[ (y_l − Q_l(o_i, a_i))² ],  y_l = r_l + γ · Q_l(o_i′, a_i′)
wherein y_l is the estimate of the local value function, r_l is the local reward, and γ is the discount factor.
The loss function of Q_g(s, a_1, …, a_K) is:
L_g = E[ (y_g − Q_g(s, a_1, …, a_K))² ]
wherein y_g is the estimate of the global value function and r_g is the global reward; y_g, the Q value that drives the agents to take the jointly optimal action in the global state composed of the left and right viewpoints, is expressed as:
y_g = r_g + γ · max_{a′} Q_g(s′, a_1′, …, a_K′)
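As an illustration of the two critic targets just described, the following toy sketch computes the local TD target, the max-based global target, and the squared TD error with scalar value estimates; the numbers are illustrative, not outputs of the patent's networks.

```python
# Toy numeric sketch of the critic targets, assuming scalar value estimates:
#   y_l = r_l + gamma * Q_l(o', a')          (local critic, bootstrap)
#   y_g = r_g + gamma * max_a' Q_g(s', a')   (Global-Critic, best joint action)
gamma = 0.99  # discount factor, as in the embodiment

def local_target(r_l: float, q_l_next: float) -> float:
    return r_l + gamma * q_l_next

def global_target(r_g: float, q_g_next_all) -> float:
    # The Global-Critic bootstraps from the best joint action in the next state.
    return r_g + gamma * max(q_g_next_all)

def critic_loss(y: float, q: float) -> float:
    return (y - q) ** 2  # squared TD error

y_l = local_target(r_l=0.5, q_l_next=1.2)
y_g = global_target(r_g=1.0, q_g_next_all=[0.8, 1.5, 0.3])
```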
in this embodiment, the step S5 specifically includes the following steps:
the reward function determines the reinforcement learning direction, so that the proper reward function is designed to determine the working performance of the system. Assuming that each agent shares a state at each moment in the left and right viewpoints, the input states are respectively:
wherein,representing network throughput of past k segments; />Representing an optional code rate set; b t Representing the current buffer size; z t Representing the average code rate of the last segment; />And->Download time of k past clips respectively representing left and right viewpoints; />And->The viewing probability of each slice of the left and right viewpoints is respectively represented; />And->Respectively representing the set of code rates selected by the slices of the left and right view of the last slice.
The size of the watching probability of each slice determines the contribution degree of the whole video quality; when the slice is in the viewport region,1, otherwise 0. Therefore, the average quality of view ports of the left and right viewpoint segments is:
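A minimal numeric sketch of this viewport-average computation, assuming a probability-weighted mean over the tiles whose viewport indicator is 1 (the exact normalisation is an assumption of this sketch):

```python
# Viewport average quality: tiles outside the viewport (indicator 0) are ignored;
# inside tiles contribute their perceived quality weighted by viewing probability.
def viewport_avg_quality(in_viewport, probs, qualities):
    num = sum(d * p * q for d, p, q in zip(in_viewport, probs, qualities))
    den = sum(d * p for d, p in zip(in_viewport, probs))
    return num / den if den > 0 else 0.0

q_avg = viewport_avg_quality(
    in_viewport=[1, 1, 0, 0],       # first two tiles lie in the viewport
    probs=[0.5, 0.3, 0.1, 0.1],     # per-tile viewing probabilities P_i
    qualities=[4.0, 2.0, 1.0, 1.0]) # per-tile perceived qualities q(a_i)
```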
the spatial quality of the left and right viewpoint views varies:
wherein,and->Respectively represent the spatial domain quality change of the viewing port of the left and right view points, q (a) i ) A mapping representing code rate to perceived quality;
the average quality change of the left and right viewpoint viewport regions at the front and rear moments reflects the fluctuation of video quality in the time domain; time domain quality change:
wherein,and->Respectively show left and right viewsTime domain quality change of point view port;
the segments of the left and right views can be considered as continuous downloads, which form the final download time, and the left and right views together affect the buffer time of the system; meanwhile, the intelligent agents with different code rates are selected for the left and right view points to be in a complete cooperation relationship; when the requested data is completely downloadedData size b of buffer memory larger than sending request time t-1 When the data is not completely downloaded and the buffer memory is exhausted, a buffer phenomenon occurs; the buffer time is as follows:
the quality difference of the slices at the corresponding positions of the left and right viewpoints is too large, and the QoE is seriously reduced when the quality difference exceeds a set range; and the symmetric coding has better performance when the quality of the left and right view points is smaller; in order to avoid the too large quality difference of the slices corresponding to the left and right viewpoints, a punishment item A is designed t To limit the code rate gap size of the corresponding slices of the left and right view points:
The penalty is a function of the right-view slice quality and the left-right slice quality difference. When the right-view slice quality is high, the difference is allowed to vary over a larger range without the penalty changing significantly; when the right-view slice quality is low, a large difference makes the penalty change significantly. The penalty term therefore constrains the left-right quality difference while remaining tolerant of asymmetric coding when the right-view slice quality is high. A parameter β is used to appropriately reduce the effect of the viewing probability on the quality-gap penalty for intra-viewport slices.
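One shape that matches this behavior is to discount the quality gap by the right-view quality, with the viewing probability raised to β softening its influence. The original formula for A t is not reproduced here, so the function below is an illustrative stand-in, not the patent's exact expression:

```python
def quality_gap_penalty(q_right: float, q_diff: float, view_prob: float,
                        beta: float = 0.7, eps: float = 1e-6) -> float:
    """Illustrative penalty A_t: the left-right gap q_diff is discounted by
    the right-view quality, so a large gap is tolerated when q_right is high
    and punished when q_right is low; view_prob**beta moderates the effect
    of the viewing probability for in-viewport slices."""
    return (view_prob ** beta) * q_diff / (q_right + eps)
```

With the same gap and viewing probability, a higher right-view quality yields a smaller penalty, which is exactly the asymmetric-coding tolerance described above.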
The local rewards target a single viewpoint: to keep the spatial and temporal variations within a viewpoint as small as possible, both are set as negative rewards. The global reward targets the whole formed by the left and right viewpoints: to obtain higher average quality, the average quality is set as a positive reward, while the buffering time and the left-right quality-difference constraint terms are set as negative rewards to shorten stalls and avoid an excessive quality gap. The left and right local rewards r t L,l , r t R,l and the global reward r t g are weighted by λ and η; the reward thus comprises two local reward functions and one global reward function.
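The sign structure of the rewards can be sketched directly; the placement of the weights λ and η below is an assumption (the patent only states that they are weights), and the penalty argument stands in for the constraint term A t :

```python
def local_reward(spatial_change: float, temporal_change: float) -> float:
    """Local reward for one viewpoint: in-viewport spatial and temporal
    fluctuations enter as negative terms (sketch)."""
    return -(spatial_change + temporal_change)

def global_reward(avg_quality: float, rebuffer: float, gap_penalty: float,
                  lam: float = 15.0, eta: float = 11.2) -> float:
    """Global reward for the left/right pair: average quality is positive;
    buffering time and the left-right quality-gap penalty are negative,
    weighted by lam and eta (weight placement assumed)."""
    return avg_quality - lam * rebuffer - eta * gap_penalty
```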
In this embodiment, an experience replay mechanism (Experience Replay) is used to train the multi-agent reinforcement learning model. β is set to 0.7, λ to 15.0, η to 11.2, the discount factor γ to 0.99, k to 8, and T to 1.
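A minimal replay buffer for this training loop might look as follows; the transition tuple layout (state, joint action, per-agent local rewards, global reward, next state) is an assumption consistent with the reward structure above:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer for the multi-agent actor-critic (sketch)."""

    def __init__(self, capacity: int = 10000):
        self.buf = deque(maxlen=capacity)  # old transitions fall off the left

    def push(self, state, joint_action, local_rewards, global_reward, next_state):
        """Store one transition for later off-policy updates."""
        self.buf.append((state, joint_action, local_rewards, global_reward, next_state))

    def sample(self, batch_size: int):
        """Draw a uniform random mini-batch (capped at the buffer size)."""
        return random.sample(self.buf, min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```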
Using head motion data acquired from the playback device, viewport prediction is combined with current bandwidth data to select different code rates for slices inside and outside the viewport region: the code rate of less salient slices in each viewpoint is reduced, the code rate of more salient slices is raised, and network bandwidth is thereby allocated reasonably.
This embodiment also provides a stereoscopic panoramic video asymmetric transport stream adaptation system comprising a memory, a processor, and computer program instructions stored on the memory and executable by the processor; when the processor runs the computer program instructions, the method steps above are implemented.
In order to verify the effectiveness of the present invention, comparative experiments were performed as follows.
The method of the present invention was compared with the following 2 methods:
(1) Adaptive streaming based on a reinforcement learning method [Jiang X, Chiang Y-H, Zhao Y, et al. Plato: Learning-based Adaptive Streaming of 360-Degree Videos [C]. 2018 IEEE 43rd Conference on Local Computer Networks (LCN), 2018: 393-400].
(2) Adaptive streaming method based on conventional theory [Nguyen D V, Tran H T, Pham A T, et al. An Optimal Tile-Based Approach for Viewport-Adaptive 360-Degree Video Streaming [J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9(1): 29-42].
The reinforcement-learning-based adaptive streaming method adopts fixed area expansion, while the adaptive streaming method based on conventional theory adopts real-time area expansion. In both method (1) and method (2), the code rates of the corresponding areas of the left and right views are equal.
According to the literature [Saygili G, Gurler C G, Tekalp A M. Evaluation of Asymmetric Stereo Video Coding and Rate Scaling for Adaptive 3D Video Streaming [J]. IEEE Transactions on Broadcasting, 2011, 57(2): 593-601], the 3D perceived quality of asymmetric coding is better than that of symmetric coding when the left and right view PSNR is above a 32 dB threshold; below the 32 dB threshold, the perceived quality of symmetric coding is better than that of asymmetric coding. The objective quality of stereoscopic video can therefore be measured by combining the left and right view PSNR values according to this threshold.
the average PSNR of the stereoscopic panoramic video viewpoint area is:
wherein,kth tile, which is the view area of the nth segment,/for the view area>The size of the view region tile set is indicated.
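The viewport averaging and the 32 dB threshold rule can be sketched together. The combination rule below is an illustrative reading of the cited Saygili et al. result (binocular suppression lets the higher-quality view dominate above the threshold, while perception follows the lower-quality view below it); the patent's exact combining formula is not reproduced in this text:

```python
THRESH_DB = 32.0

def viewport_avg_psnr(tile_psnrs):
    """Average PSNR over the viewport tile set of one segment."""
    return sum(tile_psnrs) / len(tile_psnrs)

def stereo_quality(psnr_left: float, psnr_right: float,
                   thresh: float = THRESH_DB) -> float:
    """Illustrative threshold rule: when both views exceed the 32 dB
    threshold, binocular suppression makes the better view dominate;
    otherwise perceived quality is driven by the worse view."""
    if min(psnr_left, psnr_right) >= thresh:
        return max(psnr_left, psnr_right)
    return min(psnr_left, psnr_right)
```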
Under the same objective experimental environment, the buffering time, the temporal and spatial smoothness of the three algorithms, and the viewport-region PSNR value built from the average viewport-region PSNR of the stereoscopic panoramic video are obtained, reflecting overall perceived quality.
Fig. 6 compares the data of the three methods under 4G and 5G network bandwidths: Figs. 6(a), 6(b) and Figs. 6(c), 6(d) show the results for 4K and 8K stereoscopic panoramic video transmission under 4G and 5G bandwidths, respectively. The experimental results show that overall performance under 5G bandwidth is better than under 4G bandwidth, which accords with common understanding and objective fact. However, under 5G bandwidth the buffering time of the methods increases, because 5G bandwidth fluctuates more severely than 4G, as shown by the 4G and 5G bandwidth traces in Fig. 5. Severely fluctuating bandwidth poses a greater challenge to an algorithm's code rate selection strategy: buffering time increases and temporal smoothness decreases. By continually learning from past decisions and their future effects, the reinforcement-learning-based code rate selection method performs well under such complex conditions and makes correct decisions.
The per-tile code rate allocation of the spatially cut slices causes a certain degree of non-smoothness in the time and space domains, which is why the method's temporal and spatial smoothness is weaker. However, the proposed method, with its asymmetric transmission mechanism based on binocular suppression, achieves the highest perceived-quality value of the three algorithms and also the relatively lowest buffering time. Once a certain threshold is exceeded, asymmetric coding outperforms symmetric coding, and compared with transmitting left and right tiles at equal code rates, asymmetric transmission reduces the bandwidth consumed and thus the buffering time. The effect is more pronounced in worse environments, for example under 4G rather than 5G, and when transmitting 8K stereoscopic panoramic video.
Fig. 7 shows the QoE CDF curves of the different algorithms when transmitting 4K and 8K stereoscopic panoramic video under 4G and 5G bandwidths. The figure shows that the proposed method achieves a good balance: compared with the other two algorithms, the average QoE improves by 20% on average under 4G bandwidth and by 12% on average under 5G bandwidth. A 5G network can, to some extent, alleviate the quality degradation incurred when transmitting stereoscopic panoramic video over a 4G network, and the asymmetric transmission method further improves the overall video quality.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the invention in any way, and any person skilled in the art may make modifications or alterations to the disclosed technical content to the equivalent embodiments. However, any simple modification, equivalent variation and variation of the above embodiments according to the technical substance of the present invention still fall within the protection scope of the technical solution of the present invention.

Claims (4)

1. The method for adapting the asymmetric transmission stream of the stereoscopic panoramic video is characterized by comprising the following steps of:
s1, a server side cuts video data into fragments in time and cuts the fragments into slices in space;
s2, caching the cut video in an HTTP server according to different video quality and different code rates;
s3, carrying out probability prediction by combining a 3DCNN network and an LSTM network;
s4, performing joint code rate control on the left and right view points by utilizing multi-agent reinforcement learning based on Actor-critic so as to balance the interaction between the quality of the single-path view point and the overall quality;
s5, designing a reward function so that the system can select a more proper code rate;
s6, decoding, splicing and stitching the downloaded data, storing the data in a play cache of the client, and rendering and playing the data through playing software;
the step S4 specifically includes the following steps:
the left and right viewpoints of the panoramic video are each divided into N segments in time, each segment has length T, each segment comprises K slices, and each slice has M code rate levels; the code rate selected for each slice in each segment is a i , where i ∈ {0, ..., M-1}; q (a i ) denotes the mapping from code rate to perceived quality; the viewing probabilities of each slice of the left and right viewpoints are given for the two views; using Actor-Critic-based multi-agent reinforcement learning, each slice is treated as an agent, and the agents share a state and act jointly, thereby realizing the code rate allocation;
when multi-agent reinforcement learning is adopted for slice code rate allocation within the left and right viewpoints, the reward of each agent mixes the local reward obtained from the environment by the joint action of the agents within a single viewpoint with the global reward obtained when the agents of the left and right viewpoints are combined;
the Global rewards and the local rewards are separated and optimized respectively by introducing Global-Critic for supervision so as to ensure the stability of the model; the policy gradient of each agent after modification is:
where ep denotes the sample replay buffer, o i is the local environment, i.e., the agent's own observation, a i is the code rate selected by an agent, s is the overall environment, i.e., the combination of the environment states of all agents, θ i is a network model parameter, each agent has its own local value function, and all agents together form a global value function;
the loss function of (2) is:
wherein y^l is the estimate of the local value function, r^l is the local reward, and γ is the discount factor;
the loss function of (2) is:
wherein y^g is the estimate of the global value function and r^g is the global reward; the Q value that drives the agents to jointly take the optimal action in the global state formed by the left and right viewpoints is expressed accordingly.
2. The method according to claim 1, wherein in step S3, the 3DCNN network is used to extract features of the static saliency information, the dynamic saliency information, and the binocular parallax information from the obtained main viewpoint sequence slices; meanwhile, the LSTM network is used to predict head motion data, which is then concatenated and fused with the feature information extracted by the 3DCNN network; finally, the concatenated and fused result is input into a plurality of fully connected layers to obtain the viewing probabilities with which the left and right viewpoints attend to different information; the viewing probability of the i-th slice obtained by this probability prediction method is denoted P i .
3. The method for adapting a stereoscopic panoramic video asymmetric transport stream according to claim 1, wherein said step S5 specifically comprises the steps of:
assuming that the agents within the left and right viewpoints share a state at each moment, the input state comprises: the network throughput of the past k segments; the optional code rate set; b t , the current buffer size; z t , the average code rate of the last segment; the download times of the past k segments of the left and right viewpoints; the viewing probabilities of each slice of the left and right viewpoints; and the sets of code rates selected by the slices of the left and right viewpoints of the last segment;
the viewing probability of each slice determines its contribution to the overall video quality; the probability is 1 when the slice lies in the viewport region and 0 otherwise; the average viewport quality of the left and right viewpoint segments is therefore the viewing-probability-weighted average of the per-slice perceived qualities;
the spatial quality change of the left and right viewpoint viewports measures how the perceived quality q (a i ), the mapping from code rate to perceived quality, varies across the slices within a viewport;
the change in average quality of the left and right viewport regions between consecutive moments reflects the fluctuation of video quality in the time domain; the temporal quality change of each viewpoint's viewport is the difference between these consecutive average qualities;
the segments of the left and right viewpoints are downloaded consecutively, so together they determine the final download time and the buffering time of the system; the agents selecting code rates for the left and right viewpoints are in a cooperative relationship; if the requested data finishes downloading before the buffer held at request time, b t-1 , is drained, no stall occurs, and if the buffer is exhausted before the download completes, rebuffering occurs; the buffering time is the portion of the download time that exceeds b t-1 ;
an excessive quality difference between slices at corresponding positions of the left and right viewpoints severely degrades QoE once it exceeds a set range, and symmetric coding performs better when the left and right viewpoint quality is low; to avoid an excessive quality difference between corresponding slices, a penalty term A t is designed to limit the code rate gap between corresponding left- and right-viewpoint slices; the penalty is a function of the right-view slice quality and the left-right slice quality difference: when the right-view slice quality is high, the difference may vary over a larger range without the penalty changing significantly, while when the right-view slice quality is low, a large difference makes the penalty change significantly; the penalty term therefore constrains the left-right quality difference while remaining tolerant of asymmetric coding when the right-view slice quality is high;
the local rewards target a single viewpoint: to keep the spatial and temporal changes within a viewpoint as small as possible, both are set as negative rewards; the global reward targets the whole formed by the left and right viewpoints: to obtain higher average quality, the average quality is set as a positive reward, and to shorten the buffering time and avoid an excessive left-right quality difference, the buffering-time and left-right quality-difference constraint terms are set as negative rewards; the left and right local rewards r t L,l , r t R,l and the global reward r t g are weighted by λ and η;
and by utilizing head motion data acquired from the playback device, viewport prediction is combined with current bandwidth data to select different code rates for slices inside and outside the viewport region, reducing the code rate of less salient slices in each viewpoint, raising the code rate of more salient slices, and reasonably allocating network bandwidth.
4. A stereoscopic panoramic video asymmetric transport stream adaptation system comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, are capable of carrying out the method steps of any one of claims 1 to 3.
CN202111165065.8A 2021-09-30 2021-09-30 Stereoscopic panoramic video asymmetric transport stream self-adaption method and system Active CN113905221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111165065.8A CN113905221B (en) 2021-09-30 2021-09-30 Stereoscopic panoramic video asymmetric transport stream self-adaption method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111165065.8A CN113905221B (en) 2021-09-30 2021-09-30 Stereoscopic panoramic video asymmetric transport stream self-adaption method and system

Publications (2)

Publication Number Publication Date
CN113905221A CN113905221A (en) 2022-01-07
CN113905221B true CN113905221B (en) 2024-01-16

Family

ID=79189919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111165065.8A Active CN113905221B (en) 2021-09-30 2021-09-30 Stereoscopic panoramic video asymmetric transport stream self-adaption method and system

Country Status (1)

Country Link
CN (1) CN113905221B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979089B (en) 2022-04-25 2023-03-24 北京邮电大学 System and method for transmitting panoramic video in real time
CN114979799B (en) * 2022-05-20 2024-07-26 北京字节跳动网络技术有限公司 Panoramic video processing method, device, equipment and storage medium
CN115022546B (en) * 2022-05-31 2023-11-14 咪咕视讯科技有限公司 Panoramic video transmission method, device, terminal equipment and storage medium
CN115037962B (en) * 2022-05-31 2024-03-12 咪咕视讯科技有限公司 Video self-adaptive transmission method, device, terminal equipment and storage medium
CN114900506B (en) * 2022-07-12 2022-09-30 中国科学技术大学 User experience quality-oriented 360-degree video viewport prediction method
CN117768669A (en) * 2022-09-19 2024-03-26 腾讯科技(深圳)有限公司 Data transmission method, device, electronic equipment and storage medium
CN117156175B (en) * 2023-10-30 2024-01-30 山东大学 QoE optimization method for panoramic video streaming based on viewport prediction distance control

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020043126A1 (en) * 2018-08-29 2020-03-05 中兴通讯股份有限公司 Video data processing and transmission methods and apparatus, and video data processing system
CN111711810A (en) * 2020-06-30 2020-09-25 福州大学 Stereoscopic video transmission method based on asymmetric bit rate allocation
CN112584119A (en) * 2020-11-24 2021-03-30 鹏城实验室 Self-adaptive panoramic video transmission method and system based on reinforcement learning
CN112822564A (en) * 2021-01-06 2021-05-18 鹏城实验室 Viewpoint-based panoramic video adaptive streaming media transmission method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9846960B2 (en) * 2012-05-31 2017-12-19 Microsoft Technology Licensing, Llc Automated camera array calibration
WO2016123721A1 (en) * 2015-02-07 2016-08-11 Zhou Wang Method and system for smart adaptive video streaming driven by perceptual quality-of-experience estimations
US20170195561A1 (en) * 2016-01-05 2017-07-06 360fly, Inc. Automated processing of panoramic video content using machine learning techniques

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020043126A1 (en) * 2018-08-29 2020-03-05 中兴通讯股份有限公司 Video data processing and transmission methods and apparatus, and video data processing system
CN111711810A (en) * 2020-06-30 2020-09-25 福州大学 Stereoscopic video transmission method based on asymmetric bit rate allocation
CN112584119A (en) * 2020-11-24 2021-03-30 鹏城实验室 Self-adaptive panoramic video transmission method and system based on reinforcement learning
CN112822564A (en) * 2021-01-06 2021-05-18 鹏城实验室 Viewpoint-based panoramic video adaptive streaming media transmission method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A CNN-based Quality Model for Image Interpolation; Yuting Lin et al.; 2020 Cross Strait Radio Science & Wireless Technology Conference; full text *
Reinforcement Learning Based Rate Adaptation for 360-Degree Video Streaming; Zhiqian Jiang et al.; IEEE Transactions on Broadcasting; Vol. 67, No. 2; full text *
3D video quality assessment based on HTTP adaptive streaming; Zhai Yuxuan et al.; Journal of Beijing University of Aeronautics and Astronautics; Vol. 45, No. 12; full text *
A machine-learning-based stereoscopic panoramic video adaptive streaming system; Rao Yingjie et al.; Video Engineering; Vol. 44, No. 12; full text *
Viewpoint-based panoramic video coding and transmission optimization; Xie Wenjing; Wang Yue; Zhang Xinfeng; Wang Shanshe; Ma Siwei; Journal of Yangzhou University (Natural Science Edition) (02); full text *

Also Published As

Publication number Publication date
CN113905221A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN113905221B (en) Stereoscopic panoramic video asymmetric transport stream self-adaption method and system
Xie et al. 360ProbDASH: Improving QoE of 360 video streaming using tile-based HTTP adaptive streaming
Sun et al. Flocking-based live streaming of 360-degree video
Yuan et al. Spatial and temporal consistency-aware dynamic adaptive streaming for 360-degree videos
TWI511544B (en) Techniques for adaptive video streaming
Yaqoob et al. A combined field-of-view prediction-assisted viewport adaptive delivery scheme for 360° videos
Liu et al. JET: Joint source and channel coding for error resilient virtual reality video wireless transmission
US20140292751A1 (en) Rate control bit allocation for video streaming based on an attention area of a gamer
Park et al. Volumetric media streaming for augmented reality
CN115037962B (en) Video self-adaptive transmission method, device, terminal equipment and storage medium
US20250097399A1 (en) Processing system for streaming volumetric video to a client device
US11575894B2 (en) Viewport-based transcoding for immersive visual streams
US11373380B1 (en) Co-viewing in virtual and augmented reality environments
Park et al. Navigation graph for tiled media streaming
WO2021092821A1 (en) Adaptively encoding video frames using content and network analysis
CN115633143B (en) An adaptive video streaming transmission system with edge-to-edge collaborative super-resolution
Aksu et al. Viewport-driven rate-distortion optimized scalable live 360° video network multicast
US20240283986A1 (en) Live Streaming Media
WO2018133709A1 (en) Method, device and system for streaming media transmission, server and terminal
Zong et al. Progressive frame patching for FoV-based point cloud video streaming
Tanjung et al. Qoe optimization in dash-based multiview video streaming
CN119172571A (en) Multi-channel collaborative acceleration method for set-top box data processing and transmission
CN117714700B (en) Video coding method, device, equipment, readable storage medium and product
Xie et al. Perceptually optimized quality adaptation of viewport-dependent omnidirectional video streaming
Zhang et al. Exploiting layer and spatial correlations to enhance SVC and tile based 360-degree video streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant