
CN119203100A - A smart speaker permission design management method and system - Google Patents


Info

Publication number
CN119203100A
CN119203100A
Authority
CN
China
Prior art keywords
user
permission
authority
feature vector
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411217025.7A
Other languages
Chinese (zh)
Other versions
CN119203100B (en)
Inventor
程佳能
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ruigao Intelligent System Co ltd
Original Assignee
Guangzhou Ruigao Intelligent System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Ruigao Intelligent System Co ltd filed Critical Guangzhou Ruigao Intelligent System Co ltd
Priority to CN202411217025.7A priority Critical patent/CN119203100B/en
Publication of CN119203100A publication Critical patent/CN119203100A/en
Application granted
Publication of CN119203100B publication Critical patent/CN119203100B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31User authentication
    • G06F21/32User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • G06F16/636Filtering based on additional data, e.g. user or group profiles by using biological or physiological data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Physiology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a smart speaker permission design management method and system. The method comprises: collecting user audio data and preprocessing it; constructing a user-context joint feature vector and tuning recognition accuracy; designing a fuzzy matching and dynamic adjustment model to manage smart speaker permissions in real time when complex voice environments with multiple languages and accents arise; constructing a behavior monitoring and anomaly detection model that analyzes user permissions in real time through feature engineering and time-series modeling, dynamically adjusting the permission level according to the anomaly detection and behavior monitoring results; providing a mapping function from emotional state to permission response that dynamically adjusts the permission level and immediately applies the new permission configuration to the smart speaker; and constructing a reinforcement-learning-based adaptive permission optimization model that adaptively learns the speaker's permission strategy for long-term policy optimization. The invention meets the broad application requirements of smart speakers in diverse scenarios such as homes and offices.

Description

Smart speaker permission design management method and system
Technical Field
The invention belongs to the technical field of permission design management, and in particular relates to a smart speaker permission design management method and system.
Background
With the rapid development of Internet of Things (IoT) technology, smart speakers have become indispensable devices in home and office environments. Through voice interaction they provide users with many convenient services, such as music playback, information queries, and smart-home control. However, the widespread use of smart speakers has also raised problems concerning privacy protection, permission management, and multi-user collaboration. Existing smart speakers mainly control device usage through simple static permission settings, an approach with several limitations.
First, existing static permission management lacks dynamism and intelligence and cannot adapt to the complex requirements of multiple users and scenarios. In a home environment, different members may have different usage and permission requirements for the speaker's functions. For example, children and adults should have different content-access permissions, yet static settings cannot flexibly accommodate these differences, often allowing children to access unsuitable content or preventing adults from reaching the functions they need in time. Second, traditional static rule-based permission management cannot provide a personalized experience: it lacks deep understanding of and learning from individual user behaviors and usage habits, resulting in a poor user experience. As a device used with extremely high daily frequency, the smart speaker should offer stronger user-experience optimization.
In addition, as smart speakers spread through home, office, and other environments, their permission management faces more serious privacy and security challenges. Traditional permission settings usually require users to adjust permissions manually through tedious, error-prone operations. More seriously, speakers are often placed in shared environments where unauthorized users may perform unauthorized operations through voice commands, posing significant security risks. For example, unauthorized users could access confidential information, control smart-home devices, or even make improper voice purchases, causing property and privacy losses. Existing solutions to these security problems are mostly after-the-fact remedies and lack a proactive protection strategy.
Finally, smart speaker permission management lacks an effective mechanism for multi-user scenarios; especially when several people use the device at once, distinguishing user identities and providing personalized services is a major challenge. Current permission systems can only distinguish simple user roles or rely on manual login to switch users, which is inconvenient and unreliable in practice and cannot meet diverse usage requirements. When users of different ages and roles take turns using the device, the prior art cannot provide an intelligent, dynamic permission management scheme, limiting both the speaker's functionality and its user experience.
Therefore, current smart speaker permission management technology has obvious deficiencies, chiefly a lack of dynamic intelligence, personalized experience, and security guarantees, together with insufficient support for multi-user scenarios. These problems severely limit the potential deployment and market applications of smart speakers.
Disclosure of Invention
The invention aims to design a smart speaker permission design management method and system that introduces advanced artificial-intelligence techniques and data-analysis methods, overcomes the shortcomings of the prior art, and provides a more intelligent permission management system with high security and an excellent user experience.
To achieve the above object, a first aspect of the present invention provides a smart speaker permission design management method, the method comprising:
S1, collecting user audio data, preprocessing the audio data, constructing a Gaussian mixture model from the preprocessed data, and outputting an audio signature;
S2, taking the audio signature as input to represent the user's identity features, collecting and preprocessing real-time environment data, constructing a user-context joint feature vector from the user identity features and the preprocessed environment data, adding a regularization term to the joint feature vector to tune recognition accuracy, and then performing context awareness and dynamic permission configuration initialization to obtain an optimized permission level and permission score;
S3, according to the optimized permission level and permission score, if a complex voice environment with multiple languages and accents arises, designing a fuzzy matching and dynamic adjustment model for real-time dynamic management of smart speaker permissions, wherein the fuzzy matching and dynamic adjustment model computes a credibility score γ for the voice feature vector C_new using a feature-weighted aggregation method;
where γ denotes the credibility score of the voice input, representing how well the current voice features match; M is the total number of voice features; w_j is the weight of the j-th feature, representing its importance in computing voice credibility; α_j is the fuzzy adjustment parameter of the j-th feature, controlling the sensitivity of feature matching; C_new,j is the j-th component of the voice feature vector; and μ_j is the expected value of the j-th feature, its typical value under normal conditions;
using the credibility score γ of the voice input and the optimized permission score, the user's permission level is dynamically adjusted by a permission adjustment function;
where P'_u denotes the permission level after dynamic adjustment, i.e., the permission setting adjusted according to the voice credibility and the original permission score; P_u denotes the initial permission level output by step S2; η is a permission-adjustment gain coefficient controlling the adjustment amplitude; γ is the credibility score of the voice input; δ is an intermediate threshold for the credibility score; S_u denotes the optimized permission score; and τ is the reference threshold for the permission score. According to the new permission level P'_u, the smart speaker system adjusts the user's permission configuration in real time and dynamically changes which functions are accessible;
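The formula for γ is rendered as an image in the published patent and is not reproduced in this text; a reconstruction consistent with the symbol definitions above (assuming an exponential fuzzy-membership kernel, which is one plausible choice and not confirmed by the source) is:

```latex
\gamma = \sum_{j=1}^{M} w_j \exp\!\big(-\alpha_j \,(C_{\mathrm{new},j} - \mu_j)^2\big)
```

Under this form, features whose components lie close to their expected values μ_j contribute most to γ, with α_j sharpening or flattening each feature's tolerance.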
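The permission adjustment function is likewise missing from this text; a sketch consistent with the listed symbols (gain η, credibility threshold δ, score threshold τ; the exact functional form in the patent may differ) is:

```latex
P'_u = P_u + \eta\,(\gamma - \delta)\,(S_u - \tau)
```

so the adjustment is proportional both to how far the voice credibility γ sits from the threshold δ and to how far the optimized score S_u sits from the reference τ.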
S4, according to the new permission level optimized for the complex voice environment, combined with real-time user behavior data, constructing a behavior monitoring and anomaly detection model, analyzing user permissions in real time through feature engineering and time-series modeling, performing anomaly detection and behavior monitoring on user behavior, and dynamically adjusting the permission level according to the anomaly detection and behavior monitoring results;
S5, according to the permission level and the user's emotional state, classifying the emotional state through an emotion classifier model and generating an emotion label; according to the emotion-state label and the abnormal-behavior response strategy, providing a mapping function from emotional state to permission response that dynamically adjusts the permission level, computing the new permission level from this mapping function, and having the smart speaker immediately apply the new permission configuration;
S6, according to the permission level, constructing a reinforcement-learning-based adaptive permission optimization model and, combined with the user's long-term behaviors and usage patterns, adaptively learning the smart speaker's permission strategy through reinforcement learning for long-term policy optimization.
Preferably, the preprocessing denoises and normalizes the collected audio data x(t);
The probability density function of the Gaussian mixture model is expressed as:
p(C|Θ) = Σ_{k=1}^{K} π_k · N(C; μ_k, Σ_k)
where C denotes the extracted Mel-frequency cepstral coefficient (MFCC) feature vector, used to represent the voiceprint features of the audio; Θ = {π_k, μ_k, Σ_k} denotes the parameter set of the Gaussian mixture model; π_k is the mixing weight of the k-th Gaussian component, satisfying Σ_k π_k = 1 and π_k ≥ 0; μ_k is the mean vector and Σ_k the covariance matrix of the k-th Gaussian component; K is the number of Gaussian components, i.e., the number of Gaussian distributions in the model; and N(C; μ_k, Σ_k) denotes the multivariate normal distribution:
N(C; μ, Σ) = (2π)^(−d/2) · |Σ|^(−1/2) · exp(−(1/2)(C − μ)ᵀ Σ^(−1) (C − μ))
where d denotes the dimension of the feature vector C.
Preferably, for each new audio input, the user identity is identified by computing the log-likelihood of the new audio's feature vectors under each user model:
logL(C_new|S_i) = Σ_{t=1}^{T} log P(C_new(t)|Θ_i)
where logL(C_new|S_i) is the log-likelihood of the new feature vectors under the i-th user model, measuring how well the audio input matches that model; C_new is the feature-vector sequence of the new audio input; S_i is the audio signature of the i-th user, i.e., the GMM parameter set Θ_i; T is the total number of frames of the new audio input after segmentation of the audio signal; and P(C_new(t)|Θ_i) is the probability of the t-th frame's feature vector C_new(t) under user model Θ_i. The identified user is the one whose model maximizes the log-likelihood.
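The GMM scoring and log-likelihood identification described above can be sketched in Python with NumPy (full-covariance components; enrollment and EM fitting of the per-user models are out of scope here, so the parameter sets are taken as given):

```python
import numpy as np

def mvn_pdf(c, mu, cov):
    # Multivariate normal density N(c; mu, Sigma)
    d = len(mu)
    diff = c - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def gmm_pdf(c, weights, means, covs):
    # p(C | Theta) = sum_k pi_k * N(C; mu_k, Sigma_k)
    return sum(w * mvn_pdf(c, m, s) for w, m, s in zip(weights, means, covs))

def log_likelihood(frames, theta):
    # logL(C_new | S_i) = sum_t log p(C_new(t) | Theta_i)
    weights, means, covs = theta
    return sum(np.log(gmm_pdf(f, weights, means, covs)) for f in frames)

def identify_user(frames, user_models):
    # Pick the enrolled user whose GMM best explains the new audio frames
    return max(user_models, key=lambda uid: log_likelihood(frames, user_models[uid]))
```

With a single zero-mean unit-covariance component, the density at the mean is 1/(2π)^(d/2), a quick sanity check for the implementation.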
Preferably, the real-time environment data is collected through various sensors, wherein the sensors comprise a microphone, a camera, a light sensor, a temperature and humidity sensor and an accelerometer, and a real-time environment feature vector is expressed as E;
The user-context joint feature vector is fused by a weighted multi-layer perceptron model, expressed as follows:
Ce=σ(W2·σ(W1·[Enorm,Si]+b1)+b2)
where C_e denotes the generated context feature vector, representing the user's behavior features in the current environment; [E_norm, S_i] is the input feature obtained by concatenating the normalized environment feature vector with the user audio signature; W_1 and W_2 are weight matrices of dimensions m×(n+1) and m×m, learned through model training; b_1 and b_2 are bias vectors of dimension m, learned through model training; and σ is a nonlinear activation function used to capture nonlinear relations;
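The fusion step above can be sketched in Python (a toy sketch: the dimensions and the choice of sigmoid as the activation σ are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_context(e_norm, s_i, W1, b1, W2, b2):
    # C_e = sigma(W2 . sigma(W1 . [E_norm, S_i] + b1) + b2)
    x = np.concatenate([e_norm, s_i])   # joint input [E_norm, S_i]
    h = sigmoid(W1 @ x + b1)            # hidden layer
    return sigmoid(W2 @ h + b2)         # context feature vector C_e
```

With all-zero weights and biases the output is 0.5 everywhere (sigmoid of zero), a quick sanity check.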
Combining the context feature vector C_e and the user audio signature S_i yields the user-context joint feature vector F = [S_i, C_e];
adding a regularization term to the user-context joint feature vector F = [S_i, C_e] achieves higher recognition accuracy through a weighted combination of user and environment features, as follows:
Fopt=F+λ·(F⊙Wreg)
where F_opt denotes the optimized joint feature vector; λ is the regularization parameter balancing the influence of the original features and the regularization term; ⊙ denotes the element-wise product used for element-by-element weighting; and W_reg is a weight matrix encoding the interrelationships between the features.
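The regularized combination F_opt = F + λ·(F ⊙ W_reg) is a one-liner; a minimal sketch:

```python
import numpy as np

def regularize_features(F, W_reg, lam):
    # F_opt = F + lambda * (F ⊙ W_reg); ⊙ is the element-wise product
    return F + lam * F * W_reg
```

For F = [1, 2], W_reg = [1, 0.5], and λ = 0.1, each component is scaled by its own regularization weight before being added back.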
Preferably, the configuration initialization includes:
designing a permission scoring function from the adjusted user-context joint feature vector to compute the user's permission score in the current context;
where the permission score reflects the user's permission level in the current context; β is a weight-adjustment coefficient controlling the overall weight of the joint features; the two-norm of the joint features serves as a measure of feature importance, adding a balancing constraint between features; and a bias term completes the score;
the permission level is then dynamically adjusted by comparing the computed permission score against a preset permission-level threshold via a permission adjustment function;
where the dynamically adjusted permission level results from this comparison, and the permission threshold is set according to user requirements and system configuration.
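The published scoring and adjustment functions are rendered as images and not reproduced here; this sketch assumes a sigmoid of the weighted feature two-norm plus bias, and illustrative thresholds — the names `permission_score`, `permission_level`, and the level labels are all hypothetical:

```python
import numpy as np

def permission_score(f_opt, beta, bias):
    # Hypothetical scoring: sigmoid of beta * ||F_opt||_2 + bias
    return 1.0 / (1.0 + np.exp(-(beta * np.linalg.norm(f_opt) + bias)))

def permission_level(score, thresholds=(0.3, 0.7)):
    # Map the score onto discrete levels via preset thresholds (illustrative values)
    lo, hi = thresholds
    if score < lo:
        return "guest"
    if score < hi:
        return "standard"
    return "admin"
```

In practice the thresholds would come from user requirements and system configuration, as the text states.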
Preferably, in S3, the fuzzy matching and dynamic adjustment model includes an error-correction feedback mechanism that dynamically adjusts the weights and parameters of the fuzzy matching model. The correction is expressed as:
Δw=ε·(γtarget−γ)·Cnew
where Δw is the weight-adjustment vector used to modify the speech-feature weights; ε is the learning rate controlling the step size of the weight adjustment; γ_target is the target credibility score, typically set to the confidence level the system desires; γ is the credibility score of the current speech input; and C_new is the feature vector of the current speech input.
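The update Δw = ε·(γ_target − γ)·C_new is a simple delta rule; a minimal sketch:

```python
import numpy as np

def correction_step(weights, c_new, gamma, gamma_target, eps):
    # New weights after one error-correction step:
    # w <- w + eps * (gamma_target - gamma) * C_new
    return weights + eps * (gamma_target - gamma) * c_new
```

When the current credibility γ falls short of γ_target, weights grow in proportion to the feature components; when it overshoots, they shrink.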
Preferably, a monitoring and anomaly detection model based on a long short-term memory (LSTM) network is constructed from the weighted behavior feature vector B_w; it predicts the behavior features at the next time step and identifies anomalous behavior patterns. The state update of the behavior monitoring and anomaly detection model is expressed as:
ht=fLSTM(Bw,t,ht-1)
where h_t is the hidden state at the current time t, capturing the temporal dependence of user behavior; B_w,t is the weighted behavior feature vector at time t; h_{t−1} is the hidden state at the previous time t−1; and f_LSTM is the state-update function of the long short-term memory network;
whether anomalous behavior exists is judged by the residual between actual and predicted behavior features, with the anomaly score ξ expressed as:
ξ = ||B_w,t − B̂_w,t||² / σ²
where ξ is the anomaly score measuring the degree of difference between the current and predicted behavior; B_w,t is the actual weighted behavior feature vector at time t; B̂_w,t is the predicted behavior feature vector; and σ² is the variance of the prediction residual, used to normalize the anomaly score;
whether to trigger a permission adjustment or warning mechanism is judged by comparing the anomaly score ξ with a preset anomaly threshold θ, with the following adjustment strategy:
if ξ ≤ θ, the current permission level is maintained without adjustment;
if ξ > θ, the behavior is flagged as anomalous, and permissions are tightened or the user is prompted to re-authenticate according to the severity of the anomaly.
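The residual-based check above can be sketched as follows (the squared-residual form of ξ is a reconstruction from the definitions; the LSTM predictor is abstracted away as a given prediction):

```python
import numpy as np

def anomaly_score(b_actual, b_pred, var):
    # xi = ||B_w,t - B_pred||^2 / sigma^2  (residual normalized by variance)
    r = b_actual - b_pred
    return float(r @ r) / var

def react(xi, theta):
    # Keep permissions if xi <= theta, otherwise tighten / re-authenticate
    return "keep" if xi <= theta else "tighten"
```

The threshold θ trades false alarms against missed anomalies and would be tuned on observed residuals.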
Preferably, collecting the user's emotional state includes collecting audio and video features, after which a fusion model based on a self-attention mechanism captures the interaction between the multi-modal features, with the formula:
E = softmax((A·W_A)(V·W_V)ᵀ / √d_k) · (V·W_V)
where E is the fused emotion feature vector, used to describe the user's current overall emotional state; A is the audio feature vector of dimension m, and V is the video feature vector of dimension n; W_A and W_V are projection matrices of dimensions m×d_k and n×d_k that map the audio and video features into the same space; d_k is the dimension of the key vectors in the attention mechanism, used to scale the dot-product operation; and softmax is the normalization function used to compute the attention weights;
the fused emotion feature vector E is classified by an emotion classifier model C(E) to generate an emotion label c, expressed as:
c = argmax_i softmax(w_iᵀ·E + b_i)
where w_i is the weight vector of the classifier model corresponding to the i-th class, of dimension d_k, and b_i is the bias term of the classifier model corresponding to the i-th class;
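The fusion and classification formulas follow standard scaled dot-product attention; this sketch treats the audio and video features as frame sequences and mean-pools the attended output into a single vector E, both illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_emotion(A, V, W_A, W_V):
    # E = softmax((A·W_A)(V·W_V)^T / sqrt(d_k)) · (V·W_V)
    Qa, Kv = A @ W_A, V @ W_V               # project both modalities to d_k
    d_k = Qa.shape[-1]
    attn = softmax(Qa @ Kv.T / np.sqrt(d_k))
    return (attn @ Kv).mean(axis=0)         # pool over audio frames -> E

def classify_emotion(E, W, b):
    # c = argmax_i softmax(w_i·E + b_i); argmax of logits equals argmax of softmax
    return int(np.argmax(W @ E + b))
```

Because softmax is monotone, taking the argmax over raw logits gives the same label as the normalized probabilities.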
according to the emotion-state label c and the abnormal-behavior response strategy, a mapping function from emotional state to permission response is provided for dynamically adjusting the permission level;
where ρ(c) denotes a permission-adjustment increment function whose value is an integer, positive or negative, depending on the user's emotional state c.
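A minimal sketch of the mapping ρ(c), assuming hypothetical emotion labels and integer increments (the patent does not list concrete label-to-increment pairs):

```python
def adjust_for_emotion(level, label, rho=None):
    # rho(c): hypothetical integer permission increments per emotion label
    if rho is None:
        rho = {"calm": 0, "agitated": -1, "positive": 1}
    return level + rho.get(label, 0)
```

An agitated user is bumped down one level, matching the text's goal of preventing improper actions during emotional fluctuations.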
Preferably, the reinforcement-learning-based adaptive permission optimization model combines the user's current permission level and emotional state with the user's historical behavior pattern and system feedback into a state vector S, where H denotes the feature vector of the user's historical behavior pattern and F the system-feedback feature vector; the actions executable by the system are defined as a set of permission adjustment operations, where each action a_i represents an adjustment to the permission level; reinforcement learning maximizes the expected cumulative reward by optimizing the parameters θ of the policy π_θ(S) through the objective function J(θ):
J(θ) = E[ Σ_t γ^t · r(S_t, a_t) ]
where J(θ) is the optimization objective over the policy parameters θ, representing the expected cumulative reward; S is the current state vector, composed of permission level, emotional state, historical behavior pattern, and system feedback; the expectation is taken over the state distribution, i.e., the possible state space; γ is a discount factor with 0 ≤ γ ≤ 1, weighting the importance of future rewards; r(S_t, a_t) is the immediate reward obtained when action a_t is executed in state S_t, reflecting the validity and rationality of the permission adjustment; and a_t is the action executed at time t, determined by the policy π_θ(S);
during policy optimization, a policy regularization term Ω(θ) is introduced to keep the policy from over-fitting the user's short-term behavior patterns, and the regularized optimization target for the policy parameters θ is updated as:
θ_{t+1} = θ_t + α·(∇_θ J(θ_t) − λ·∇_θ Ω(θ_t))
where θ_{t+1} is the updated policy parameter; θ_t is the policy parameter at the current time t; α is the learning rate controlling the parameter-update step size; ∇_θ J(θ_t) is the policy gradient, i.e., the gradient of the optimization objective with respect to the policy parameters; λ is a regularization-strength parameter controlling how strongly the regularization term affects the policy update; and Ω(θ_t) is the policy regularization term, here a sparsity regularizer encouraging simple, generalizable policy parameters:
Ω(θ) = (1/N)·Σ_{i=1}^{N} |θ_i|
where N denotes the total number of policy parameters and θ_i the i-th policy parameter.
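The regularized update and the sparsity term can be sketched as follows (the 1/N scaling follows the definition of Ω(θ) above; the gradient of J is taken as given rather than estimated from rollouts):

```python
import numpy as np

def l1_reg_grad(theta):
    # Gradient of the sparsity regularizer Omega(theta) = (1/N) * sum_i |theta_i|
    return np.sign(theta) / len(theta)

def update_policy(theta, grad_J, alpha, lam):
    # theta_{t+1} = theta_t + alpha * (grad J(theta_t) - lam * grad Omega(theta_t))
    return theta + alpha * (grad_J - lam * l1_reg_grad(theta))
```

With a zero policy gradient, the update shrinks each parameter toward zero, illustrating how the regularizer discourages over-fitting to short-term behavior.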
In a second aspect of the present invention, there is provided a smart speaker permission design management system, the system comprising:
a user data collection module for collecting user audio data, preprocessing the audio data, constructing a Gaussian mixture model from the preprocessed data, and outputting an audio signature;
a configuration initialization module for taking the audio signature as input to represent the user's identity features, collecting and preprocessing real-time environment data, constructing a user-context joint feature vector from the user identity features and the preprocessed environment data, adding a regularization term to the joint feature vector to tune recognition accuracy, and then performing context awareness and dynamic permission configuration initialization to obtain an optimized permission level and permission score;
a permission initialization module for designing, according to the optimized permission level and permission score, a fuzzy matching and dynamic adjustment model for real-time dynamic management of smart speaker permissions if a complex voice environment with multiple languages and accents arises, wherein the fuzzy matching and dynamic adjustment model computes a credibility score γ for the voice feature vector C_new using a feature-weighted aggregation method;
where γ denotes the credibility score of the voice input, representing how well the current voice features match; M is the total number of voice features; w_j is the weight of the j-th feature, representing its importance in computing voice credibility; α_j is the fuzzy adjustment parameter of the j-th feature, controlling the sensitivity of feature matching; C_new,j is the j-th component of the voice feature vector; and μ_j is the expected value of the j-th feature, its typical value under normal conditions;
using the credibility score γ of the voice input and the optimized permission score, the user's permission level is dynamically adjusted by a permission adjustment function;
where P'_u denotes the permission level after dynamic adjustment, i.e., the permission setting adjusted according to the voice credibility and the original permission score; P_u denotes the initial permission level output by the configuration initialization module; η is a permission-adjustment gain coefficient controlling the adjustment amplitude; γ is the credibility score of the voice input; δ is an intermediate threshold for the credibility score; S_u denotes the optimized permission score; and τ is the reference threshold for the permission score. According to the new permission level P'_u, the smart speaker system adjusts the user's permission configuration in real time and dynamically changes which functions are accessible;
a permission management module for constructing, according to the new permission level optimized for the complex voice environment combined with real-time user behavior data, a behavior monitoring and anomaly detection model that analyzes user permissions in real time through feature engineering and time-series modeling, performs anomaly detection and behavior monitoring on user behavior, and dynamically adjusts the permission level according to the anomaly detection and behavior monitoring results; and for classifying, according to the permission level and the user's emotional state, the emotional state through an emotion classifier model, generating emotion labels, providing a mapping function from emotional state to permission response that dynamically adjusts the permission level according to the emotion-state label and the abnormal-behavior response strategy, computing the new permission level from this mapping function, and having the smart speaker immediately apply the new permission configuration;
a system optimization module for constructing, according to the permission level, a reinforcement-learning-based adaptive permission optimization model and, combined with the user's long-term behaviors and usage patterns, adaptively learning the smart speaker's permission strategy through reinforcement learning for long-term policy optimization.
The beneficial technical effects of the invention include at least the following:
First, the invention adopts user audio-signature recognition, identifying the user's identity in real time from audio features while the speaker is in use. This not only effectively distinguishes different users in a home or office environment but also automatically adjusts permission settings according to each user's usage habits and permission requirements, achieving genuinely personalized service and overcoming the inflexibility of traditional static permission settings.
Second, the invention introduces context-aware permission management: through multi-sensor data fusion (microphone, camera, light sensor, and the like) combined with user behavior data and environmental information such as time, location, and schedule, the system intelligently perceives the current context and dynamically adjusts permission settings. For example, in an evening home mode the system can automatically enable child protection and restrict access to unsuitable functions, greatly improving the system's intelligence and user experience and addressing the prior art's lack of dynamic adjustment capability.
The invention further designs an adaptive voice fuzzy matching algorithm, which adaptively adjusts voice recognition precision across various accents and speech rates, ensuring that the user's intention is accurately understood while the permission settings are flexibly adjusted to guarantee security. This adaptive mechanism solves the problem of low recognition accuracy in multi-language, multi-accent usage scenarios in the prior art and reduces the risk of erroneous permission operations.
In addition, the invention also integrates an abnormal authority behavior detection module and an emotion reasoning authority management module. The abnormal authority behavior detection module monitors and analyzes the authority use behavior of the user in real time by using a machine learning algorithm, can rapidly identify abnormal operation and take corresponding safety measures, and enhances the safety and the protection capability of the system. The emotion reasoning authority management module analyzes the emotion state of the user through voice, automatically adjusts the authority when the emotion of the user is excited, and prevents improper behaviors caused by emotion fluctuation. The innovation points effectively solve the defect that the prior art lacks response to the emotion and abnormal behavior of the user, and remarkably improve the safety and humanized experience of the system.
Through the innovation points, the invention provides a comprehensive solution, can effectively overcome various defects in the prior art, and realizes dynamic management, personalized service, safety protection and multi-user support of the authority of the intelligent sound box. The multi-layer and omnibearing intelligent authority management system not only improves user experience, but also obviously enhances the safety and adaptability of equipment, and meets the wide application requirements of intelligent sound boxes in diversified scenes such as families, offices and the like.
Drawings
The invention will be further described with reference to the accompanying drawings, in which embodiments do not constitute any limitation of the invention, and other drawings can be obtained by one of ordinary skill in the art without inventive effort from the following drawings.
Fig. 1 is a flowchart of a method for rights design management of an intelligent sound box according to an embodiment of the invention.
Fig. 2 is a frame diagram of an intelligent sound box authority design management system according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In one or more embodiments, as shown in fig. 1, the invention discloses a method for managing the authority design of an intelligent sound box, which comprises steps S1 to S6:
s1, collecting user audio data, preprocessing the audio data, constructing a Gaussian mixture model according to the preprocessed data, and outputting an audio signature.
Specifically, during preprocessing, the acquired audio data x(t) is processed to remove noise and normalized:

y(t)=x(t)−n̂(t)

Where y(t) represents the denoised audio signal. x(t) represents the original audio signal as a function of time. n̂(t) represents the estimated noise signal, extracted from the environment using an adaptive filter.
Signal normalization process, the normalized signal is expressed as:

ynorm(t)=(y(t)−μ(y))/σ(y)

Where ynorm(t) represents the normalized audio signal. μ(y) represents the mean value of the denoised signal y(t), calculated as μ(y)=(1/N)·Σt y(t), where N is the number of signal samples. σ(y) represents the standard deviation of the denoised signal y(t), calculated as σ(y)=√((1/N)·Σt (y(t)−μ(y))²).
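As an illustrative aside (not part of the patent text), the denoising and normalization step above can be sketched in a few lines of Python; the noise estimate `n_hat` stands in for the adaptive filter's output, which is assumed to be given:

```python
import numpy as np

def preprocess_audio(x, n_hat):
    """Denoise by subtracting the estimated noise, then z-score normalize.

    x     -- raw audio signal x(t), 1-D array
    n_hat -- estimated noise signal from the adaptive filter (assumed given)
    """
    y = x - n_hat                # y(t) = x(t) - n_hat(t)
    mu = y.mean()                # mean of the denoised signal
    sigma = y.std()              # standard deviation of the denoised signal
    return (y - mu) / sigma      # y_norm(t)

# toy example: a sine wave with a constant offset the filter "recovers"
x = np.sin(np.linspace(0, 2 * np.pi, 100)) + 0.1
n_hat = np.full(100, 0.1)
y_norm = preprocess_audio(x, n_hat)
```

By construction the output has zero mean and unit standard deviation, which is what the normalization formula guarantees.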
Further, the audio features of the user are modeled using a Gaussian mixture model (GMM) to form an audio signature. The probability density function of the GMM is expressed as:

P(C|Θ)=Σk=1..K πk·N(C;μk,Σk)

Where C represents an extracted mel-frequency cepstral coefficient (MFCC) feature vector, used to represent the voiceprint features of the audio. Θ={πk,μk,Σk} represents the parameter set of the GMM. πk represents the mixing weight of the kth Gaussian component, satisfying Σk=1..K πk=1 and πk≥0. μk represents the mean vector of the kth Gaussian component. Σk represents the covariance matrix of the kth Gaussian component. K represents the number of Gaussian components, i.e. the number of Gaussian distributions in the model. N(C;μk,Σk) represents a multivariate normal distribution, expressed as:

N(C;μk,Σk)=(2π)^(−d/2)·|Σk|^(−1/2)·exp(−(1/2)·(C−μk)ᵀ·Σk⁻¹·(C−μk))

Where d is the dimension of the feature vector C.
Further, for each new audio input, the user identity is identified by computing the log-likelihood of its feature vector under each user model:

logL(Cnew|Si)=Σt=1..T log P(Cnew(t)|Θi)

Wherein logL(Cnew|Si) represents the log-likelihood value of the new feature vector under the ith user model, used to measure the degree of matching between the audio input and the user model. Cnew denotes the feature vector of the new audio input. Si represents the audio signature of the ith user, i.e. the GMM parameter set Θi. T represents the total number of frames of the new audio input, i.e. the number of frames after segmentation of the audio signal. P(Cnew(t)|Θi) represents the probability of the feature vector Cnew(t) of the tth frame under the user model Θi. The user whose model yields the highest log-likelihood is taken as the identified identity.
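The identification step can be sketched with plain numpy, assuming each user signature Si is stored as a diagonal-covariance GMM (weights, means, variances); the hypothetical `identify` helper returns the user whose model gives the highest total log-likelihood:

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Total log-likelihood of MFCC frames under one diagonal-covariance GMM.

    frames    -- (T, d) feature vectors C_new(t)
    weights   -- (K,) mixing weights pi_k
    means     -- (K, d) component means mu_k
    variances -- (K, d) diagonal covariances
    """
    T, d = frames.shape
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, d)
    # log N(C; mu_k, Sigma_k) per (frame, component), diagonal Sigma_k
    log_norm = -0.5 * (d * np.log(2 * np.pi)
                       + np.log(variances).sum(axis=1)     # log|Sigma_k|
                       + (diff ** 2 / variances).sum(axis=2))
    # log P(C_new(t)) = logsumexp_k [log pi_k + log N_k]
    joint = np.log(weights)[None, :] + log_norm            # (T, K)
    m = joint.max(axis=1, keepdims=True)
    log_p = (m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True))).ravel()
    return float(log_p.sum())

def identify(frames, signatures):
    """Index of the user signature with the highest log-likelihood."""
    return int(np.argmax([gmm_frame_loglik(frames, *s) for s in signatures]))

# toy usage: two synthetic one-component signatures, input near user 1's mean
sig_a = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
sig_b = (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]]))
frames = np.full((10, 2), 5.0)
best = identify(frames, [sig_a, sig_b])
```

In practice the per-user GMMs would be trained on enrollment audio (e.g. via EM); this sketch only shows the scoring and argmax decision.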
S2, taking the audio signature as input to represent the identity characteristic of the user, collecting real-time environment data, preprocessing, constructing a user-situation joint feature vector according to the identity characteristic of the user and the preprocessed real-time environment data, adding a regular term into the user-situation joint feature vector to carry out recognition accuracy adjustment, and then carrying out situation awareness and dynamic authority configuration initialization to obtain the optimized authority level and authority score.
The intelligent sound box collects environmental data in real time through various sensors (such as a microphone, a camera, a light sensor, a temperature and humidity sensor, an accelerometer and the like) to form an environmental characteristic vector E= [ E 1,e2,…,en ]. These environmental features include:
e 1 represents an ambient noise level, ranging from 0 to 100 db.
E 2 represents the illumination intensity, ranging from 0 to 10,000 lux.
E 3 denotes the distance of the user from the loudspeaker in the range of 0 to 5 meters.
E 4 represents the current time, 24 hours, ranging from 0 to 23.99.
E 5 denotes the number of detected persons, and the integer range is 0 to 10.
E 6 denotes the user activity state, classifies the variables, and takes the value range { stationary, walking, jumping, etc }.
Other sensor data such as temperature e 7, humidity e 8, etc.
Further, since the data ranges and units of different sensors differ, the environmental feature vector E needs to be standardized to unify the data scales:

einorm=(ei−min(ei))/(max(ei)−min(ei))

Wherein einorm represents the normalized environmental feature. min(ei) and max(ei) represent the minimum and maximum values of the environmental feature ei, used for the normalization processing.
Further, the user identity and the environmental characteristics are fused by using the user audio signature S i obtained in the first step, and a novel fusion method is introduced, so that the user identity characteristics and the environmental data are fused, and a situation characteristic vector C e is constructed. The specific method is that a weighted multi-layer perceptron (MLP) model is used, and the model can dynamically adjust the weights of the user characteristics and the environment characteristics in the learning process so as to reflect the behavior modes of the user in different environments.
Ce=σ(W2·σ(W1·[Enorm,Si]+b1)+b2)
Wherein Ce represents the generated contextual feature vector, representing the behavioral characteristics of the user in the current environment. [Enorm, Si] represents the input feature obtained by combining the environmental feature vector and the user audio signature. W1, W2 represent weight matrices, with dimensions m×(n+1) and m×m respectively, learned through model training. b1, b2 represent bias vectors, both of dimension m, learned through model training. σ represents a nonlinear activation function (e.g., ReLU or Sigmoid) for capturing nonlinear relationships.
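A minimal numpy sketch of the MLP fusion above; the weight matrices are random here purely for illustration, whereas the patent assumes they are learned through model training:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def context_vector(e_norm, s_i, W1, b1, W2, b2):
    """Two-layer MLP fusing environment features with the audio signature:
    C_e = sigma(W2 . sigma(W1 . [E_norm, S_i] + b1) + b2)."""
    x = np.concatenate([e_norm, s_i])      # [E_norm, S_i]
    return relu(W2 @ relu(W1 @ x + b1) + b2)

# illustrative dimensions: n environment features, p signature values, m hidden units
n, p, m = 8, 4, 6
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(m, n + p)), np.zeros(m)   # trained in practice
W2, b2 = rng.normal(size=(m, m)), np.zeros(m)
c_e = context_vector(rng.uniform(size=n), rng.normal(size=p), W1, b1, W2, b2)
```

The output is an m-dimensional non-negative vector (ReLU), one context feature vector Ce per (user, environment) pair.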
Further, a user-context joint feature vector f= [ S i,Ce ] is generated in combination with the context feature vector C e and the user audio signature S i. This is a fused feature vector that represents the comprehensive state of a particular user in the current context.
In order to better capture the complex relationship between the user features and the environmental features, a joint feature modeling method with innovative regularization terms is provided. This regularization term achieves higher recognition accuracy by a weighted combination of the user and environmental features. The model formula is as follows:
Fopt=F+λ·(F⊙Wreg)
Wherein Fopt represents the optimized joint feature vector, taking into account the weighted relationship of the user and the environmental features. λ represents a regularization parameter for balancing the effects of the original features and the regularization term, typically determined by cross-validation. ⊙ denotes the Hadamard product, used for element-wise weighting. Wreg represents a weight matrix describing the interrelationship between features, obtained through dynamic learning during optimization.
Further, based on the optimized joint feature vector Fopt, a permission scoring function with an additional term is provided for calculating the permission score of the user in the current context:

Pscore=wᵀ·Fopt+β·‖Fopt‖2+b

Wherein Pscore represents the permission score, reflecting the permission level of the user in the current context. w represents a weight vector obtained through model training, representing the importance of each feature. β represents a weight adjustment coefficient, controlling the overall weight of the joint feature. ‖Fopt‖2 represents the two-norm of the joint feature vector, taken as a measure of feature importance and adding a balancing constraint among the features. b represents the bias term, learned from training data.
According to the calculated permission score Pscore and preset permission level thresholds, dynamic permission adjustment is performed. The permission adjustment function is defined as:

Plevel = L1, if Pscore&lt;T1
Plevel = L2, if T1≤Pscore&lt;T2
Plevel = L3, if Pscore≥T2

Wherein Plevel represents the dynamically adjusted permission level, according to which the specific permission configuration is set, with L1, L2, L3 denoting increasingly permissive preset levels. T1, T2 represent permission thresholds, set according to user requirements and system configuration.
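The threshold-based mapping above can be sketched as a simple piecewise function; the level names low/medium/high are illustrative stand-ins, since the patent leaves the concrete level labels unspecified:

```python
def permission_level(p_score, t1, t2):
    """Map the permission score to a discrete level using thresholds T1 < T2.

    Returns 'low' below T1, 'medium' between T1 and T2, 'high' at or above T2.
    """
    if p_score < t1:
        return "low"
    if p_score < t2:
        return "medium"
    return "high"
```

For example, with T1=0.4 and T2=0.7, a score of 0.5 maps to the medium level.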
The finally determined permission level Plevel is applied to the intelligent sound box system and dynamically controls the functions and data permissions accessible to the user.
And S3, according to the optimized authority level and authority score, if complex speech environments of multiple languages and multiple accents appear, designing fuzzy matching and dynamic regulation models to dynamically manage the authority of the intelligent sound box in real time.
Specifically, when a user sends a voice command, the sound box collects a current voice input signal x (t), and performs preprocessing (such as denoising and standardization) on the voice signal by using the method in the first step to generate a processed voice feature vector C new.
Meanwhile, in order to adapt to a multi-language and multi-accent environment, an adaptive voice fuzzy matching model is provided. The model uses a feature-weighted aggregation method to calculate the credibility score γ of the voice feature vector Cnew:

γ=Σj=1..M wj·exp(−αj·(Cnew,j−μj)²)

Where γ represents the confidence score of the voice input, indicating the confidence of the matching degree of the current voice features. M represents the total number of voice features. wj represents the weight of the jth feature, indicating the importance of that feature in calculating voice confidence. αj denotes the fuzzy tuning parameter of the jth feature, controlling the sensitivity of feature matching. Cnew,j denotes the jth component of the voice feature vector. μj represents the expected value of the jth feature, representing the typical value of that feature under normal conditions.
Further, a confidence-based permission adjustment policy is defined, utilizing the confidence score γ of the voice input and the permission score Pscore from step S2. A permission adjustment function is provided to dynamically adjust the permission level of the user:

Pnew=Pinit+η·(γ−δ)·(Pscore−τ)

Wherein Pnew represents the dynamically adjusted permission level, set according to the voice credibility and the original permission score. Pinit represents the initial permission level output by step S2. η represents a gain factor of the permission adjustment, controlling the adjustment amplitude. γ denotes the confidence score of the voice input. δ represents an intermediate threshold of the confidence score, typically set to 0.5. Pscore represents the permission score calculated in step S2. τ represents a baseline threshold of the permission score, typically set based on actual usage.
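Under the assumption that the credibility score uses a Gaussian-style membership per feature (one plausible reading of the feature-weighted aggregation described above), the confidence computation and the permission adjustment can be sketched as:

```python
import math

def confidence_score(c_new, weights, alphas, mus):
    """gamma = sum_j w_j * exp(-alpha_j * (C_new,j - mu_j)^2).

    The exponential membership form is an assumed reconstruction, not a
    verbatim formula from the patent.
    """
    return sum(w * math.exp(-a * (c - m) ** 2)
               for w, a, c, m in zip(weights, alphas, c_new, mus))

def adjust_permission(p_init, p_score, gamma, eta=0.5, delta=0.5, tau=0.6):
    """Raise the level when the input is credible (gamma > delta) and the
    permission score exceeds its baseline tau; lower it otherwise."""
    return p_init + eta * (gamma - delta) * (p_score - tau)

# a well-matched utterance whose features equal their expected values
gamma = confidence_score([1.0, 2.0], weights=[0.6, 0.4],
                         alphas=[1.0, 1.0], mus=[1.0, 2.0])
new_level = adjust_permission(p_init=2.0, p_score=0.9, gamma=gamma)
```

A perfect feature match gives γ = Σ wj = 1.0 here, so the level is nudged upward.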
Further, according to the calculated new permission level Pnew, the intelligent sound box system adjusts the permission configuration of the user in real time and dynamically changes the accessibility of functions.
In order to further improve the robustness and accuracy of the system, an "error correction feedback mechanism" is introduced. The mechanism dynamically adjusts the weight and parameters of the fuzzy matching model by detecting and analyzing user feedback or use behaviors after rights adjustment. The core formula of error correction is as follows:
Δw=ϵ·(γtarget−γ)·Cnew

Where Δw represents a weight adjustment vector for modifying the weights of the voice features. ϵ denotes the learning rate, controlling the step size of the weight adjustment. γtarget represents the target confidence score, typically set to the desired confidence level of the system. γ denotes the confidence score of the current voice input. Cnew denotes the feature vector of the current voice input.
The parameters of the voice fuzzy matching model are updated according to the error correction formula, so that the authority configuration can be more accurately adjusted when the system faces to voice input of different users and contexts.
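The error-correction step can be sketched directly from the Δw formula; a single update nudges the feature weights toward the target confidence:

```python
import numpy as np

def correct_weights(w, c_new, gamma, gamma_target, lr=0.01):
    """One error-correction step: delta_w = lr * (gamma_target - gamma) * C_new.

    When the achieved confidence gamma falls short of gamma_target, weights
    grow in proportion to the input features; when it overshoots, they shrink.
    """
    return w + lr * (gamma_target - gamma) * np.asarray(c_new)

w = np.array([0.6, 0.4])
w_updated = correct_weights(w, c_new=[1.0, 2.0], gamma=0.3, gamma_target=0.8)
```

Repeating this update over many feedback events is what the patent describes as continuously tuning the fuzzy matching model.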
Further, in combination with the instant feedback and long-term usage behavior data of the user, the permission adjustment function and the fuzzy matching model are continuously optimized. The system parameters are dynamically adjusted by monitoring the operating habits and permission usage of the user in real time, improving the adaptability and accuracy of permission adjustment.
And collecting the use data (such as the behavior, satisfaction feedback and the like of the user under the adjusted authority level) of the user, and periodically retraining the fuzzy matching model and the authority adjustment function to ensure that the model parameters are consistent with the dynamic changes of the user demands.
S4, according to the new permission level optimized for the complex voice environment, combined with real-time user behavior data, constructing a behavior monitoring and anomaly detection model, analyzing user permissions in real time through feature engineering and time-series modeling, performing anomaly detection and behavior monitoring on user behavior, and dynamically adjusting the permission level according to the anomaly detection and behavior monitoring results.
Specifically, extracting the user behavior features includes:
The intelligent sound box continuously collects operation behavior data of a user and generates a behavior feature vector B= [ B 1,b2,…,bk ], wherein the features comprise:
b 1 represents the type of voice instruction (such as playing music, setting alarm clock, etc.) issued by the user, and is represented as a classification variable.
B 2 denotes the time interval (unit: second) at which the instruction occurs, expressed as a continuous variable.
B 3 denotes the frequency of instructions (units: times/min) issued by the user during a specific time period, expressed as a continuous variable.
B 4 represents the authority level change amplitude corresponding to each instruction, and represents an integer.
B 5 represents the identified emotional state of the user, expressed as a classification variable (e.g., calm, excited, etc.).
And (3) normalizing and weighting the behavior feature vector B to eliminate scale difference among different features. Then, weighted combination is performed according to the feature importance, and a weighted behavior feature vector B w is generated:
Bw=WB·Bnorm
wherein B w represents a weighted behavior feature vector, which is represented as a behavior feature weighted by different weights. W B denotes a weight matrix for weighting of different features. B norm denotes the normalized behavior feature vector, expressed as normalized feature values.
Further, a behavior monitoring and anomaly detection model based on a long-short-time memory network (LSTM) is constructed using the weighted behavior feature vector B w for capturing the time dependence of the user behavior. The goal of the model is to predict the behavior characteristics at the next moment, thereby identifying the abnormal pattern of behavior. The updated formula of the LSTM model is:
ht=fLSTM(Bw,t,ht-1)
Wherein h t represents the hidden state at the current time t, capturing the time dependence of the user behavior. B w,t denotes the weighted behavior feature vector at the current time t. h t-1 denotes the hidden state at the previous time t-1. f LSTM represents the state update function of LSTM, defined as a conventional LSTM cell calculation formula.
Whether abnormal behavior exists is judged by calculating the residual between the actual and predicted behavior features. The anomaly score ξ is calculated by the following formula:

ξ=‖Bw,t−B̂w,t‖²/σ²

Where ξ represents the anomaly score, representing the degree of difference between the current behavior and the predicted behavior. Bw,t represents the actual weighted behavior feature vector at the current time t. B̂w,t represents the weighted behavior feature vector predicted by the LSTM model for time t. σ² represents the variance of the prediction residual, used to normalize the anomaly score.
Further, according to the anomaly score ζ and a preset anomaly threshold value θ, it is determined whether to trigger a permission adjustment or a warning mechanism. The adjustment strategy is as follows:
If xi is less than or equal to theta, the current authority level is maintained without adjustment
If xi > theta, identifying the abnormal behavior, and carrying out authority tightening or prompting the user to carry out identity verification according to the severity of the abnormality.
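A compact sketch of the anomaly scoring and decision rule above, assuming the LSTM's prediction and the residual variance σ² are already available:

```python
import numpy as np

def anomaly_score(b_actual, b_pred, sigma2):
    """xi = ||B_w,t - B_hat_w,t||^2 / sigma^2 -- squared residual between
    observed and predicted behavior features, normalized by residual variance."""
    r = np.asarray(b_actual) - np.asarray(b_pred)
    return float(r @ r) / sigma2

def decide(xi, theta):
    """Keep the permission level when xi <= theta; flag an anomaly otherwise."""
    return "keep" if xi <= theta else "tighten_or_verify"

xi = anomaly_score([1.0, 0.5], [0.9, 0.6], sigma2=0.01)
action = decide(xi, theta=1.0)
```

The "tighten_or_verify" branch corresponds to the patent's permission tightening or prompting the user for identity verification, graded by anomaly severity.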
Further, a permission dynamic adjustment function under abnormal behavior conditions is defined, dynamically adjusting the permission level based on the anomaly score ξ:

Pnew=Pprev−ζ·(ξ−θ), if ξ&gt;θ

Wherein Pnew represents the dynamically adjusted permission level, taking abnormal behavior into account. Pprev represents the permission level output by step S3. ζ represents the adjustment step parameter, controlling the adjustment range of the permission. ξ represents the current anomaly score. θ represents the anomaly detection threshold, generally set based on actual user behavior data.
Further, the time sequence behavior model and the abnormality detection mechanism are continuously optimized by combining feedback of the user and actual behavior data. Through online learning and periodic model updating, the self-adaptability of the model to the behavior change of the user is ensured. User behavior data (such as abnormal behavior times, authority adjustment frequency and the like) are collected, parameters of the LSTM model are updated periodically, and an abnormality detection threshold value theta and a weight matrix W B are adjusted to ensure that the system can stably run in a changeable user behavior environment.
S5, according to the permission level, combined with the emotional state of the user, classifying the emotional state through an emotion classifier model to generate an emotion label, providing a mapping function between emotional states and permission responses so as to dynamically adjust the permission level according to the emotion label and the abnormal-behavior response strategy, and the intelligent sound box instantly applies the new permission configuration calculated from the emotion-permission response mapping function.
Further, the user emotion state characteristics are extracted, and the intelligent sound box is combined with the microphone and the camera to collect the multi-mode emotion data of the user in real time, wherein the multi-mode emotion data comprises audio and video characteristics. The audio feature vector a= [ a 1,a2,…,am ] may include:
a 1 represents the voice pitch frequency variation, expressed as a continuous variable.
A 2 represents the speech rate variation, expressed as a continuous variable.
A 3 represents the intensity of the voice volume, expressed as a continuous variable.
Other audio features such as speech discontinuities, timbre, etc.
The video feature vector v= [ V 1,v2,…,vn ] may include:
v 1 denotes a facial expression feature (such as eye-horn up, mouth-horn down, etc.), expressed as a classification variable.
V 2 denotes an eye movement characteristic (such as eye movement speed, direction, etc.), expressed as a continuous variable.
V 3 denotes head pose (e.g., rotation angle, pitch angle, etc.), expressed as a continuous variable.
Further, the audio feature vector A and the video feature vector V are fused to generate a fused emotion feature vector E. A fusion model based on the self-attention mechanism is used to capture interactions between the multi-modal features, with the specific formula:

E=softmax((A·WA)·(V·WV)ᵀ/√dk)·(V·WV)

Where E represents the fused emotion feature vector, used to describe the current comprehensive emotional state of the user. A represents the audio feature vector, with dimension m. V denotes the video feature vector, with dimension n. WA and WV represent projection matrices for mapping the audio and video features into the same space, with dimensions m×dk and n×dk respectively. dk represents the dimension of the key vectors in the attention mechanism, used for scaling the dot-product operation. softmax represents the normalization function used to calculate the attention weights.
Further, the fused emotion feature vector E is classified through an emotion classifier model to generate an emotion label c:

c=argmaxi softmax(wiᵀ·E+bi)

Where c represents the emotional state label of the current user, expressed as a discrete category (e.g., calm, anger, happiness, etc.). wi represents the weight vector of the classifier model corresponding to the ith class, with dimension dk. bi represents the bias term of the classifier model corresponding to the ith class, a scalar.
Further, according to the emotion state label c and the abnormal-behavior response strategy, a mapping function between emotional state and permission response is provided for dynamically adjusting the permission level:

Pnew=Pprev+ρ(c)

Wherein Pnew represents the permission level dynamically adjusted according to the emotional state, taking the user's emotion into account in the permission setting. Pprev represents the permission level output by step S4. ρ(c) represents a permission adjustment delta function, expressed as an integer that can be positive or negative depending on the user's emotional state c.
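Classification and the ρ(c) mapping can be sketched together; the classifier weights and the per-emotion deltas in `RHO` are illustrative values, not taken from the patent:

```python
import numpy as np

def classify_emotion(e, W, b, labels):
    """c = argmax_i softmax(w_i^T E + b_i); softmax is monotonic, so the
    argmax over the raw logits suffices."""
    return labels[int(np.argmax(W @ e + b))]

# rho(c): per-emotion permission delta -- concrete values are hypothetical
RHO = {"calm": 0, "happy": 0, "angry": -1, "excited": -1}

def emotion_adjusted_level(p_prev, label):
    """P_new = P_prev + rho(c): tighten permissions under agitated states."""
    return p_prev + RHO[label]

labels = ["calm", "happy", "angry", "excited"]
W = np.eye(4)                       # toy classifier weights; trained in practice
b = np.zeros(4)
e = np.array([0.1, 0.2, 0.9, 0.3])  # fused emotion features leaning "angry"
label = classify_emotion(e, W, b, labels)
new_level = emotion_adjusted_level(3, label)
```

Here the agitated state lowers the level by one, matching the patent's intent of restraining permissions during emotional excitement.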
Further, according to the permission level calculated by the emotion state and permission response mapping function, the intelligent sound box immediately applies the new permission configuration, ensuring that the user operates safely and reasonably in the current emotional state.
And continuously optimizing the emotion classification model and the permission response strategy by combining the instant feedback of the user with long-term behavior pattern data. By monitoring the behavior and emotional changes of the user in real time, the parameters of the emotion classifier are updated regularly, and the permission response function and the emotion feature fusion model are adjusted to ensure that the model can adapt to dynamic changes in the user's emotional state.
S6, according to the permission level, constructing an adaptive permission optimization model based on reinforcement learning, performing adaptive learning of the permission strategy of the intelligent sound box through reinforcement learning, and carrying out long-term strategy optimization in combination with the long-term behaviors and usage patterns of the user.
Specifically, the state of the reinforcement-learning-based adaptive permission optimization model is defined by the current permission level and emotional state of the user together with the user's historical behavior pattern and system feedback, i.e. a state vector S=[Plevel, c, H, F]. Wherein Plevel represents the dynamically adjusted permission level output in step S5, representing the current permission configuration optimized according to the emotional state of the user. c represents the current emotional state label of the user, expressed as a discrete category (e.g., calm, anger, happiness, etc.). H represents the historical behavior pattern feature vector of the user, representing statistical features of the user behavior over a period of time, such as operating frequency and preferred operation types. F represents the system feedback feature vector, representing user feedback data of the system under different permission configurations, such as user satisfaction scores and erroneous-operation frequencies.
The system-executable actions are defined as a set of permission adjustment operations A={a1, a2, …, an}, wherein each action ai represents an adjustment to the permission level (e.g., raise, lower, hold). The design of these actions needs to take into account permission security, user experience, and the functional limitations of the smart speaker.
Further, to enable the system to maintain an optimal permission configuration under varying user behavior and emotional states, a policy-optimization reinforcement learning approach is adopted. An optimization objective function J(θ) is defined to maximize the expected cumulative reward by optimizing the parameter θ of the policy πθ(S):

J(θ)=E S~ρ [Σt=0..∞ γᵗ·r(St,at)]

Where J(θ) represents the optimization objective function of the policy parameter θ, representing the expected cumulative reward. S represents the current state vector, consisting of the permission level, emotional state, historical behavior pattern, and system feedback. ρ represents the state distribution, describing the possible state space. γ represents a discount factor with value range 0≤γ≤1, measuring the importance of future rewards. r(St,at) represents the instant reward obtained when action at is performed in state St, reflecting the validity and rationality of the permission adjustment. at represents the action performed at time t, as determined by the policy πθ(S).
In the policy optimization process, to prevent the policy from over-fitting the short-term behavior pattern of the user, a policy regularization term Ω(θ) is introduced, and the policy parameters θ are updated to maximize the regularized optimization objective:

θt+1=θt+α·(∇θJ(θt)−λ·∇θΩ(θt))

Where θt+1 represents the updated policy parameters. θt represents the policy parameters at the current time t. α represents the learning rate, controlling the step size of the parameter update. ∇θJ(θt) represents the policy gradient, i.e. the gradient of the optimization objective function with respect to the policy parameters. λ represents a regularization strength parameter, controlling the extent to which the regularization term affects the policy update. Ω(θt) represents the policy regularization term, designed as sparsity regularization to encourage conciseness and generalization of the policy parameters, with the formula:

Ω(θ)=(1/N)·Σi=1..N |θi|

Where N represents the total number of policy parameters. θi represents the ith policy parameter.
Further, the design of the instant reward function r(St,at) combines user behavior feedback, emotional state change, and the rationality of the permission adjustment, aiming to balance system security and user experience. The formula of the reward function is:

r(St,at)=ω1·f(Ut)−ω2·g(Et)−ω3·h(|ΔPt|)

Where r(St,at) represents the instant reward, measuring the effect of executing action at in state St. Ut represents the change in the user satisfaction score at the current time t, and the function f(Ut) represents a positive reward, positively correlated with a reasonable permission adjustment. Et represents the change in the fluctuation of the user's emotional state, and the function g(Et) represents a negative reward, related to the negative emotional response caused by the permission adjustment. |ΔPt| represents the magnitude of the permission adjustment, and the function h(|ΔPt|) represents a negative reward, penalizing excessively frequent permission adjustment operations. ω1, ω2, ω3 represent weight coefficients, controlling the influence of each factor on the instant reward.
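A sketch of the reward computation under the assumption that f, g, h take simple identity/absolute-value shapes (the patent leaves their exact forms unspecified):

```python
def reward(delta_u, delta_e, delta_p, w1=1.0, w2=0.5, w3=0.2):
    """r = w1*f(U_t) - w2*g(E_t) - w3*h(|dP_t|).

    delta_u -- change in user satisfaction score (positive is good)
    delta_e -- change in emotional-state fluctuation (positive is bad)
    delta_p -- magnitude of the permission adjustment
    """
    f = delta_u                 # positive reward: satisfaction improvement
    g = max(0.0, delta_e)       # negative reward: worsened emotional fluctuation
    h = abs(delta_p)            # negative reward: large/frequent adjustments
    return w1 * f - w2 * g - w3 * h

r = reward(delta_u=0.4, delta_e=0.2, delta_p=1.0)
```

The three weights ω1–ω3 let the designer trade off user experience against stability of the permission policy.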
To further improve the adaptability and generalization ability of the model, we introduced an adaptive learning mechanism based on user behavior patterns. The mechanism dynamically adjusts the learning rate and regularization parameters of the strategy model by continuously monitoring the behavior characteristics and the emotion state changes of the user so as to adapt to the personalized requirements of the user. Learning rate and regularization parameter dynamic adjustment formula:
αt+1=αt·(1-η·|ΔHt|)
λt+1=λt·(1+ζ·|ΔFt|)
Where α t+1 denotes the learning rate at the next time. Alpha t represents the learning rate at the current time. η represents a learning rate adjustment coefficient, and the adjustment range of the learning rate is controlled. The |Δh t | represents the variation amplitude of the user behavior pattern feature vector, and represents the variation degree of the user behavior. Lambda t+1 represents the regularized intensity parameter at the next time instant. Lambda t represents the regularized strength parameter at the current time. ζ represents a regularization parameter adjustment coefficient, and an adjustment amplitude of the regularization intensity is controlled. The |Δf t | represents the variation amplitude of the system feedback feature vector, and represents the variation degree of the user feedback.
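The two hyperparameter updates can be sketched in one helper; a large behavior shift shrinks the learning rate, while volatile feedback strengthens regularization:

```python
def update_hyperparams(alpha, lam, dH, dF, eta=0.1, zeta=0.1):
    """alpha_{t+1} = alpha_t * (1 - eta*|dH_t|): damp learning when the user's
    behavior pattern shifts sharply.
    lambda_{t+1} = lambda_t * (1 + zeta*|dF_t|): regularize harder when the
    system feedback is volatile."""
    return alpha * (1 - eta * abs(dH)), lam * (1 + zeta * abs(dF))

alpha_next, lam_next = update_hyperparams(alpha=0.05, lam=0.01, dH=2.0, dF=1.0)
```

In a real deployment eta and zeta would be tuned so that alpha never goes negative for the observed range of |ΔH|.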
Further, a policy optimization and feedback loop is established in combination with the long-term behavioral data and immediate feedback of the user. And the optimal performance of the system in long-term operation is ensured by continuously monitoring the performance of the system under different authority configurations and periodically updating parameters of the strategy model.
And collecting long-term behavior data (such as authority adjustment frequency, user satisfaction change, emotion state fluctuation and the like) of the user, and periodically updating parameters of the reinforcement learning model, including strategy parameters theta, regularization parameters lambda and learning rate alpha, so as to ensure that the model can adapt to long-term changes of the user behavior and emotion state.
The embodiment of the application also provides a smart speaker permission design management system, as shown in fig. 2, which comprises:
a user data collection module 101 for collecting user audio data and preprocessing it, constructing a Gaussian mixture model from the preprocessed data, and outputting an audio signature;
a configuration initialization module 102 configured to take the audio signature as input representing the user's identity features, collect and preprocess real-time environment data, construct a user-context joint feature vector from the identity features combined with the preprocessed environment data, add a regularization term to the joint feature vector to adjust recognition accuracy, and then perform context awareness and dynamic permission configuration initialization to obtain an optimized permission level and permission score;
a permission initialization module 103 configured to apply, according to the optimized permission level and permission score, a fuzzy matching and dynamic adjustment model for real-time dynamic management of the smart speaker's permissions when a complex speech environment with multiple languages and accents appears, where the fuzzy matching and dynamic adjustment model calculates a credibility score γ for the speech feature vector C_new by a feature-weighted aggregation method, expressed as follows:
wherein γ denotes the credibility score of the speech input, representing the credibility of the current speech-feature match; M denotes the total number of speech features; w_j denotes the weight of the j-th feature, representing its importance when computing speech credibility; α_j denotes the fuzzy adjustment parameter of the j-th feature, controlling the sensitivity of feature matching; C_new,j denotes the j-th component of the speech feature vector; μ_j denotes the expected value of the j-th feature, representing its typical value under normal conditions;
using the credibility score γ of the speech input and the optimized permission score, the user's permission level is adjusted dynamically according to a permission adjustment function, expressed as follows:
wherein the dynamically adjusted permission level denotes the permission setting after adjustment according to the speech credibility and the original permission score; the initial permission level is that output by the second step; η denotes the permission-adjustment gain coefficient, controlling the adjustment amplitude; γ denotes the credibility score of the speech input; δ denotes the intermediate threshold of the credibility score; τ denotes the baseline threshold of the optimized permission score; according to the newly calculated permission level, the smart speaker system adjusts the user's permission configuration in real time, dynamically changing the accessibility of functions;
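The published formulas for the credibility score and the permission adjustment function appear only as images and did not survive extraction. The sketch below therefore assumes one plausible form for each, chosen to be consistent with the symbols listed in the text (w_j, α_j, C_new,j, μ_j, η, δ, τ); it is not the patent's actual formula.

```python
import numpy as np

def confidence_score(C_new, mu, w, alpha):
    """Feature-weighted aggregation of per-feature match credibility.
    Assumes a Gaussian-style fuzzy membership per feature, consistent
    with the listed symbols: weights w_j, fuzzy parameters alpha_j,
    components C_new_j, expected values mu_j."""
    C_new, mu, w, alpha = map(np.asarray, (C_new, mu, w, alpha))
    membership = np.exp(-alpha * (C_new - mu) ** 2)   # each term in (0, 1]
    return float(np.dot(w, membership) / np.sum(w))   # normalized to [0, 1]

def adjust_permission(level0, score, gamma, eta=1.0, delta=0.5, tau=0.6):
    """Hypothetical permission-adjustment function: shift the initial
    level in proportion to how far gamma sits above the mid threshold
    delta, gated on the permission score clearing its baseline tau."""
    if score < tau:
        return max(level0 - 1, 0)      # score below baseline: tighten
    shift = eta * (gamma - delta)      # credibility-driven shift
    return max(int(round(level0 + shift)), 0)
```

A perfect feature match (C_new equal to μ) yields γ = 1; mismatches decay the per-feature membership at a rate set by α_j before the weighted average is taken.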
a permission management module 104 for constructing, based on the new permission level optimized for the complex speech environment combined with real-time user behavior data, a behavior monitoring and anomaly detection model that analyzes user permissions in real time through feature engineering and time-series modeling, performing anomaly detection and behavior monitoring on user behavior, and dynamically adjusting the permission level according to the results of the anomaly detection and behavior monitoring; and for classifying, according to the permission level combined with the user's emotional state, the emotional state through an emotion classifier model to generate emotion labels, providing a mapping function between emotional state and permission response that dynamically adjusts the permission level according to the emotion-state labels and an abnormal-behavior response strategy, the smart speaker instantly applying the new permission configuration at the permission level calculated by that mapping function;
a system optimization module 105 for constructing, according to the permission level, a reinforcement-learning-based adaptive permission optimization model that combines the user's long-term behavior and usage patterns and adaptively learns the smart speaker's permission strategy through reinforcement learning for long-term strategy optimization.
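The anomaly-scoring and threshold-based adjustment strategy of module 104 (detailed in claim 7) can be sketched as follows. The exact residual norm and normalization in the published score are assumptions; only the "residual between actual and predicted behavior, normalized by the residual variance σ²" structure is given in the text.

```python
import numpy as np

def anomaly_score(b_actual, b_pred, sigma2):
    """Residual-based anomaly score xi for the LSTM behavior monitor:
    squared residual between the actual weighted behavior features and
    the LSTM prediction, normalized by the residual variance sigma^2
    (the specific norm is an assumption)."""
    r = np.asarray(b_actual, dtype=float) - np.asarray(b_pred, dtype=float)
    return float(np.dot(r, r) / sigma2)

def respond(level, xi, theta=3.0):
    """Adjustment strategy from claim 7: keep the level if xi <= theta,
    otherwise tighten permissions (the re-authentication prompt for
    severe anomalies is omitted in this sketch)."""
    return level if xi <= theta else max(level - 1, 0)
```

A behavior sequence that the LSTM predicts well produces a score near zero and leaves the permission level untouched; a large residual crosses the threshold θ and triggers tightening.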
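A single update step of the regularized policy optimization described for module 105 (and in claim 9) might look like the following. The exact placement of λ relative to α in the published update formula is assumed; the L1 sparsity term Ω(θ) = (1/N)·Σ|θ_i| matches the claim's description.

```python
import numpy as np

def policy_update(theta, grad_J, alpha=0.01, lam=0.001):
    """One regularized policy-gradient step following the claim-9
    description: theta_{t+1} = theta_t + alpha * (grad_J - lam * dOmega),
    where Omega(theta) = (1/N) * sum_i |theta_i| is the sparsity
    regularizer discouraging overfit to short-term behavior."""
    theta = np.asarray(theta, dtype=float)
    grad_omega = np.sign(theta) / theta.size  # subgradient of the L1 term
    return theta + alpha * (np.asarray(grad_J, dtype=float) - lam * grad_omega)
```

The ascent direction follows the policy gradient of the expected discounted reward J(θ), minus a small pull of each parameter toward zero, which is what keeps the learned permission policy simple and generalizable.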
The foregoing discloses only preferred embodiments of the present invention and is not intended to limit its scope; as will be understood by those skilled in the art, implementations realizing all or part of the above embodiments, and equivalent variations thereof made within the scope of the appended claims, fall within the scope of the invention.

Claims (10)

1. A smart speaker permission design management method, characterized in that the method comprises:
S1, collecting user audio data and preprocessing the audio data, and constructing a Gaussian mixture model from the preprocessed data to output an audio signature;
S2, taking the audio signature as input representing the user's identity features, collecting real-time environment data and preprocessing it, constructing a user-context joint feature vector from the identity features combined with the preprocessed real-time environment data, adding a regularization term to the joint feature vector to adjust recognition accuracy, and then performing context awareness and dynamic permission configuration initialization to obtain an optimized permission level and permission score;
S3, according to the optimized permission level and permission score, if a complex speech environment with multiple languages and multiple accents appears, designing a fuzzy matching and dynamic adjustment model for real-time dynamic management of the smart speaker's permissions; wherein the fuzzy matching and dynamic adjustment model calculates the credibility score γ of the speech feature vector C_new using a feature-weighted aggregation method, expressed as follows:
wherein γ denotes the credibility score of the speech input, representing the credibility of the current speech-feature match; M denotes the total number of speech features; w_j denotes the weight of the j-th feature, representing its importance when computing speech credibility; α_j denotes the fuzzy adjustment parameter of the j-th feature, controlling the sensitivity of feature matching; C_new,j denotes the j-th component of the speech feature vector; μ_j denotes the expected value of the j-th feature, representing its typical value under normal conditions;
using the credibility score γ of the speech input and the optimized permission score, dynamically adjusting the user's permission level according to a permission adjustment function, expressed as follows:
wherein the dynamically adjusted permission level denotes the permission setting after adjustment according to the speech credibility and the original permission score; the initial permission level is that output by the second step; η denotes the permission-adjustment gain coefficient, controlling the adjustment amplitude; γ denotes the credibility score of the speech input; δ denotes the intermediate threshold of the credibility score; τ denotes the baseline threshold of the optimized permission score; according to the newly calculated permission level, the smart speaker system instantly adjusts the user's permission configuration, dynamically changing the accessibility of functions;
S4, based on the new permission level optimized for the complex speech environment combined with real-time user behavior data, constructing a behavior monitoring and anomaly detection model that analyzes user permissions in real time through feature engineering and time-series modeling, performing anomaly detection and behavior monitoring on user behavior, and dynamically adjusting the permission level according to the results of the anomaly detection and behavior monitoring;
S5, according to the permission level combined with the user's emotional state, classifying the emotional state through an emotion classifier model to generate emotion labels, and, according to the emotion-state labels and an abnormal-behavior response strategy, proposing a mapping function between emotional state and permission response to dynamically adjust the permission level, the smart speaker instantly applying the new permission configuration at the permission level calculated by the emotion-state-to-permission-response mapping function;
S6, according to the permission level, constructing a reinforcement-learning-based adaptive permission optimization model that combines the user's long-term behavior and usage patterns and adaptively learns the smart speaker's permission strategy through reinforcement learning for long-term strategy optimization.
2. The smart speaker permission design management method according to claim 1, characterized in that the preprocessing processes the collected audio data x(t) to remove noise and perform normalization;
wherein the probability density function of the Gaussian mixture model is expressed as:
P(C|Θ) = Σ_{k=1}^{K} π_k·N(C; μ_k, Σ_k)
wherein C denotes the extracted Mel-frequency cepstral coefficient feature vector, used to represent the voiceprint features of the audio; Θ denotes the parameter set of the Gaussian mixture model; π_k denotes the mixture weight of the k-th Gaussian component, satisfying Σ_{k=1}^{K} π_k = 1 and π_k ≥ 0; μ_k denotes the mean vector of the k-th Gaussian component; Σ_k denotes the covariance matrix of the k-th Gaussian component; K denotes the number of Gaussian components, i.e., the number of Gaussian distributions in the model; N(C; μ_k, Σ_k) denotes the multivariate normal distribution, expressed as:
N(C; μ_k, Σ_k) = (2π)^(−d/2)·|Σ_k|^(−1/2)·exp(−(1/2)·(C − μ_k)^T·Σ_k^(−1)·(C − μ_k))
wherein d denotes the dimension of the feature vector C.
3. The smart speaker permission design management method according to claim 2, characterized in that, for each new audio input, the user identity is identified by computing the log-likelihood of the new audio's feature vector under each user model, expressed as follows:
logL(C_new|S_i) = Σ_{t=1}^{T} log P(C_new(t)|Θ_i)
wherein logL(C_new|S_i) denotes the log-likelihood of the new feature vector under the i-th user model, used to measure how well the audio input matches that user model; C_new denotes the feature vector of the new audio input; S_i denotes the audio signature of the i-th user, i.e., the GMM parameter set Θ_i; T denotes the total number of frames of the new audio input, i.e., the number of frames after the audio signal is segmented; P(C_new(t)|Θ_i) denotes the probability of the t-th frame's feature vector C_new(t) under user model Θ_i.
4. The smart speaker permission design management method according to claim 1, characterized in that the real-time environment data are collected through multiple sensors, including a microphone, a camera, a light sensor, a temperature-humidity sensor, and an accelerometer, the real-time environment feature vector being denoted E;
the user-context joint feature vector is fused through a weighted multilayer perceptron model, expressed as follows:
C_e = σ(W_2·σ(W_1·[E_norm, S_i] + b_1) + b_2)
wherein C_e denotes the generated context feature vector, representing the user's behavior characteristics in the current environment; [E_norm, S_i] denotes the input features formed by concatenating the environment feature vector and the user audio signature; W_1, W_2 denote weight matrices of dimensions m×(n+1) and m×m respectively, learned through model training; b_1, b_2 denote bias vectors, both of dimension m, learned through model training; σ denotes a nonlinear activation function used to capture nonlinear relationships;
combining the context feature vector C_e and the user audio signature S_i generates the user-context joint feature vector F = [S_i, C_e];
a regularization term is added to the user-context joint feature vector F = [S_i, C_e] to achieve higher recognition accuracy through a weighted combination of user and environment features, expressed as follows:
F_opt = F + λ·(F⊙W_reg)
wherein F_opt denotes the optimized joint feature vector; λ denotes the regularization parameter, used to balance the influence of the original features and the regularization term; ⊙ denotes the element-wise product operation, used for element-by-element weighting; W_reg denotes the weight matrix representing the interrelations between features.
5. The smart speaker permission design management method according to claim 1, characterized in that the configuration initialization comprises:
designing a permission scoring function from the adjusted user-context joint feature vector, used to calculate the user's permission score in the current context, expressed as follows:
wherein the permission score reflects the user's permission level in the current context; the weight vector represents the importance of each feature; β denotes the weight adjustment coefficient, controlling the overall weight of the joint features; the two-norm of the joint features serves as a measure of feature importance, adding a constraint that balances the features; and a bias term is included;
dynamically adjusting the permission according to the calculated permission score and a preset permission-level threshold, wherein the permission adjustment function is expressed as follows:
wherein the dynamically adjusted permission level is determined relative to a permission threshold set according to user needs and system configuration.
6. The smart speaker permission design management method according to claim 1, characterized in that, in S3, an error-correction feedback mechanism is designed for the fuzzy matching and dynamic adjustment model to dynamically adjust the weights and parameters of the fuzzy matching model, the error-correction mechanism being expressed as follows:
Δw = ∈·(γ_target − γ)·C_new
wherein Δw denotes the weight adjustment vector, used to correct the weights of the speech features; ∈ denotes the learning rate, used to control the step size of the weight adjustment; γ_target denotes the target credibility score, usually set to the credibility level expected by the system; γ denotes the credibility score of the current speech input; C_new denotes the feature vector of the current speech input.
7. The smart speaker permission design management method according to claim 1, characterized in that a monitoring and anomaly detection model based on a long short-term memory (LSTM) network is constructed from the weighted behavior feature vector B_w to predict the behavior features at the next moment and identify abnormal behavior patterns, wherein the update formula of the behavior monitoring and anomaly detection model is expressed as follows:
h_t = f_LSTM(B_w,t, h_{t−1})
wherein h_t denotes the hidden state at the current time t, capturing the time dependency of user behavior; B_w,t denotes the weighted behavior feature vector at the current time t; h_{t−1} denotes the hidden state at the previous time t−1; f_LSTM denotes the state-update function of the LSTM network;
whether abnormal behavior exists is judged by computing the residual between the actual and predicted behavior features; the anomaly score ξ is expressed as follows:
wherein ξ denotes the anomaly score, representing the degree of difference between the current and predicted behavior; B_w,t denotes the actual weighted behavior feature vector at the current time t; the LSTM-predicted weighted behavior feature vector for the next moment is used as the prediction; σ² denotes the variance of the prediction residual, used to normalize the anomaly score;
whether a permission adjustment or warning mechanism is triggered is judged from the anomaly score ξ and a preset anomaly threshold θ, with the following adjustment strategy:
if ξ ≤ θ, no adjustment is made and the current permission level is maintained;
if ξ > θ, the behavior is identified as abnormal, and permissions are tightened or the user is prompted to authenticate according to the severity of the anomaly.
8. The smart speaker permission design management method according to claim 1, characterized in that the collection of the user's emotional state includes audio and video feature collection, after which a fusion model based on a self-attention mechanism is used to capture the mutual influence between the multimodal features, expressed as follows:
wherein E denotes the fused emotion feature vector, used to describe the user's current overall emotional state; A denotes the audio feature vector of dimension m; V denotes the video feature vector of dimension n; W_A and W_V denote projection matrices mapping the audio and video features into the same space, with dimensions m×d_k and n×d_k; d_k denotes the dimension of the key vectors in the attention mechanism, used to scale the dot-product operation; softmax denotes the normalization function used to compute the attention weights;
using the fused emotion feature vector E, the emotional state is classified through the emotion classifier model C(E) to generate the emotion label c, expressed as follows:
c = argmax_i (w_i^T·E + b_i)
wherein c denotes the current user's emotional-state label, expressed as a discrete category; w_i denotes the classifier model's weight vector for the i-th class, of dimension d_k; b_i denotes the classifier model's bias term for the i-th class;
according to the emotion-state label c and the abnormal-behavior response strategy, a mapping function between emotional state and permission response is proposed to dynamically adjust the permission level, expressed as follows:
wherein the emotionally adjusted permission level denotes the permission setting that takes the user's emotion into account; ρ(c) denotes the permission-adjustment increment function, depending on the user's emotional state c and expressed as an integer that may be positive or negative.
9. The smart speaker permission design management method according to claim 8, characterized in that the reinforcement-learning-based adaptive permission optimization model takes the user's current permission level and emotional state, together with the user's historical behavior pattern and system feedback, as the state vector S, wherein H denotes the user's historical behavior-pattern feature vector and F denotes the system feedback feature vector; the actions executable by the system are defined as the set of permission-adjustment operations, each action a_i representing an adjustment to the permission level; an optimization objective function J(θ) is defined, and the expected cumulative reward is maximized by optimizing the parameters θ of the policy π_θ(S);
the formula of the optimization objective function J(θ) is as follows:
J(θ) = E_{S∼D}[Σ_t γ^t·r(S_t, a_t)]
wherein J(θ) denotes the optimization objective function of the policy parameters θ, representing the expected cumulative reward; S denotes the current state vector, composed of the permission level, emotional state, historical behavior pattern, and system feedback; D denotes the state-distribution set, representing the space of possible states; γ denotes the discount factor, with 0 ≤ γ ≤ 1, used to measure the importance of future rewards; r(S_t, a_t) denotes the immediate reward obtained when action a_t is executed in state S_t, reflecting the effectiveness and reasonableness of the permission adjustment; a_t denotes the action executed at time t, determined by the policy π_θ(S);
in the policy-optimization process, a policy regularization term Ω(θ) is introduced to prevent the policy from overfitting the user's short-term behavior patterns, and the policy parameters θ are updated to maximize the regularized optimization objective, expressed as follows:
θ_{t+1} = θ_t + α·(∇_θ J(θ_t) − λ·∇_θ Ω(θ_t))
wherein θ_{t+1} denotes the updated policy parameters; θ_t denotes the policy parameters at the current time t; α denotes the learning rate, controlling the step size of the parameter update; ∇_θ J denotes the policy gradient, i.e., the gradient of the optimization objective with respect to the policy parameters; λ denotes the regularization strength parameter, controlling the influence of the regularization term on the policy update; Ω(θ_t) denotes the policy regularization term, a sparsity regularization that encourages simplicity and generalization of the policy parameters, expressed as follows:
Ω(θ) = (1/N)·Σ_{i=1}^{N} |θ_i|
wherein N denotes the total number of policy parameters; θ_i denotes the i-th policy parameter.
10. A smart speaker permission design management system, characterized in that the system comprises:
a user data collection module for collecting user audio data and preprocessing it, and constructing a Gaussian mixture model from the preprocessed data to output an audio signature;
a configuration initialization module for taking the audio signature as input representing the user's identity features, collecting and preprocessing real-time environment data, constructing a user-context joint feature vector from the identity features combined with the preprocessed environment data, adding a regularization term to the joint feature vector to adjust recognition accuracy, and then performing context awareness and dynamic permission configuration initialization to obtain an optimized permission level and permission score;
a permission initialization module for designing, according to the optimized permission level and permission score, a fuzzy matching and dynamic adjustment model for real-time dynamic management of the smart speaker's permissions if a complex speech environment with multiple languages and multiple accents appears; wherein the fuzzy matching and dynamic adjustment model calculates the credibility score γ of the speech feature vector C_new using a feature-weighted aggregation method, expressed as follows:
wherein γ denotes the credibility score of the speech input, representing the credibility of the current speech-feature match; M denotes the total number of speech features; w_j denotes the weight of the j-th feature, representing its importance when computing speech credibility; α_j denotes the fuzzy adjustment parameter of the j-th feature, controlling the sensitivity of feature matching; C_new,j denotes the j-th component of the speech feature vector; μ_j denotes the expected value of the j-th feature, representing its typical value under normal conditions;
using the credibility score γ of the speech input and the optimized permission score, the user's permission level is adjusted dynamically according to a permission adjustment function, expressed as follows:
wherein the dynamically adjusted permission level denotes the permission setting after adjustment according to the speech credibility and the original permission score; the initial permission level is that output by the second step; η denotes the permission-adjustment gain coefficient, controlling the adjustment amplitude; γ denotes the credibility score of the speech input; δ denotes the intermediate threshold of the credibility score; τ denotes the baseline threshold of the optimized permission score; according to the newly calculated permission level, the smart speaker system instantly adjusts the user's permission configuration, dynamically changing the accessibility of functions;
a permission management module for constructing, based on the new permission level optimized for the complex speech environment combined with real-time user behavior data, a behavior monitoring and anomaly detection model that analyzes user permissions in real time through feature engineering and time-series modeling, performing anomaly detection and behavior monitoring on user behavior, and dynamically adjusting the permission level according to the results of the anomaly detection and behavior monitoring; and for classifying, according to the permission level combined with the user's emotional state, the emotional state through an emotion classifier model to generate emotion labels, proposing a mapping function between emotional state and permission response according to the emotion-state labels and an abnormal-behavior response strategy to dynamically adjust the permission level, the smart speaker instantly applying the new permission configuration at the permission level calculated by the emotion-state-to-permission-response mapping function;
a system optimization module for constructing, according to the permission level, a reinforcement-learning-based adaptive permission optimization model that combines the user's long-term behavior and usage patterns and adaptively learns the smart speaker's permission strategy through reinforcement learning for long-term strategy optimization.
CN202411217025.7A 2024-09-02 2024-09-02 A smart speaker permission design management method and system Active CN119203100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411217025.7A CN119203100B (en) 2024-09-02 2024-09-02 A smart speaker permission design management method and system


Publications (2)

Publication Number Publication Date
CN119203100A true CN119203100A (en) 2024-12-27
CN119203100B CN119203100B (en) 2025-06-20

Family

ID=94057533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411217025.7A Active CN119203100B (en) 2024-09-02 2024-09-02 A smart speaker permission design management method and system

Country Status (1)

Country Link
CN (1) CN119203100B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119741171A (en) * 2025-03-03 2025-04-01 Longyan University A smart education management system based on multi-user collaboration

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763892A (en) * 2018-04-18 2018-11-06 Oppo Guangdong Mobile Communications Co., Ltd. Authority management method, device, mobile terminal and storage medium
WO2022268136A1 (en) * 2021-06-22 2022-12-29 Hisense Visual Technology Co., Ltd. Terminal device and server for voice control
CA3177530A1 (en) * 2021-07-14 2023-01-14 Strong Force TX Portfolio 2018, LLC Systems and methods with integrated gaming engines and smart contracts
CN118042355A (en) * 2024-04-11 2024-05-14 Jiangxi Tianchuang Intelligent Technology Co., Ltd. Automatic control system and method for intelligent sound control sound equipment of stage
CN118411983A (en) * 2024-04-16 2024-07-30 Beijing Sifang Zhihui Information Technology Co., Ltd. Data processing method based on speech recognition model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALICE COUCKE等: "Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces", ARXIV, 25 May 2018 (2018-05-25), pages 1 - 29 *
LI Zonglin: "Core technologies of tri-network convergence based on a dynamic network concept", Frontier Science, no. 01, 28 March 2010 (2010-03-28) *
CHEN Ru: "Implementation of an intelligent hotel property management system", Journal of Anhui Institute of Mechanical and Electrical Engineering, 31 January 2000 (2000-01-31), pages 75 - 78 *


Also Published As

Publication number Publication date
CN119203100B (en) 2025-06-20

Similar Documents

Publication Publication Date Title
CN118673165B (en) Multi-mode false news detection method and system based on text emotion characteristics and multi-level fusion
KR20190094319A (en) An artificial intelligence apparatus for performing voice control using voice extraction filter and method for the same
CN119203100B (en) A smart speaker permission design management method and system
CN119357888A (en) An intelligent behavior analysis system
CN120163653A (en) Dynamic risk control method, device, equipment and storage medium
CN119153132B (en) An artificial intelligence remote psychological consultation platform and method based on emotional cognition
CN120145312A (en) An intelligent security management system based on AI big model
CN120257050A (en) Dynamic adjustment method of robot behavior mode based on multimodal perception
CN119939229B (en) Network content propagation method and system based on fusion cognitive understanding and intelligent management
CN120781259A (en) Risk detection method, device, equipment and medium for big data abnormal behavior
CN120725209A (en) Marketing strategy optimization method based on consumer group behavior analysis
Gade et al. Speaker recognition using improved butterfly optimization algorithm with hybrid long short term memory network
CN120388565B (en) Voice interaction method and system based on 3D (three-dimensional) virtual
CN119441995B (en) A multimodal student classroom mental health early warning method and system
US20230342108A1 (en) Enhanced computing device representation of audio
Namburi Speaker recognition based on mutated monarch butterfly optimization configured artificial neural network
CN119132308A (en) A fraud prevention communication method based on voice change recognition
Alexeevskaya et al. Recognizing human emotions using a convolutional neural network
Bhardwaj et al. Identification of speech signal in moving objects using artificial neural network system
CN120354176B (en) Training methods for behavioral decision-making models and adaptive interaction methods for digital humans
CN117271743B (en) Multi-mode dialogue emotion recognition method and system
Samanta et al. RETRACTED ARTICLE: An energy-efficient voice activity detector using reconfigurable Gaussian base normalization deep neural network
Lyu et al. Global and local feature fusion via long and short-term memory mechanism for dance emotion recognition in robot
US20250356874A1 (en) Artificial intelligence device and operating method thereof
Segarceanu et al. Evaluation of deep learning techniques for acoustic environmental events detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant