EP0012767B1

EP0012767B1 - Speech analyser

Info

Publication number: EP0012767B1
Application number: EP19790900422
Authority: EP
Inventors: John Decatur Williamson
Original assignee: OMNITRONICS RESEARCH Corp
Current assignee: OMNITRONICS RESEARCH CORPORATION
Priority date: 1978-04-11
Filing date: 1979-11-19
Publication date: 1983-07-27
Also published as: JPS55500275A; DK525679A; EP0012767A1; WO1979000913A1; DE2965975D1

Abstract

A speech analyser is provided for determining the emotional state of a person by analysing pitch or frequency perturbations in the speech pattern. The analyser determines null points or "flat" spots in an FM demodulated speech signal and produces an output indicative of the nulls. The output can be analysed by the operator of the device to determine the emotional state of the person whose speech pattern is being monitored.

Description

This invention is related to an apparatus for analysing an individual's speech and more particularly, to an apparatus for analysing pitch perturbations to determine the individual's emotional state such as stress, depression, anxiety, fear, happiness, etc., which can be indicative of subjective attitudes, character, mental state, physical state, gross behavioral patterns, veracity, etc. In this regard, the apparatus has commercial applications as a criminal investigative tool, a medical and/or psychiatric diagnostic aid, a public opinion polling aid, etc.
One type of technique for speech analysis to determine emotional stress is disclosed in Bell, Jr., et al., U.S. Patent 3,971,034. In the technique disclosed in this patent a speech signal is processed to produce an FM demodulated speech signal. This FM demodulated signal is recorded on a chart recorder and then is manually analysed by an operator. This technique has several disadvantages. First, the output is not a real time analysis of the speech signal. Another disadvantage is that the operator must be very highly trained in order to perform a manual analysis of the FM demodulated speech signal and the analysis is a very time consuming endeavor. Still another disadvantage of the technique disclosed in Bell, Jr., et al. is that it operates on the fundamental frequencies of the vocal cords and, in the Bell, Jr., et al. technique, tedious re-recording and special time expansion of the voice signal are required. In practice, all these factors result in an unnecessarily low sensitivity to the parameter of interest, specifically stress.
Another technique for voice analysing to determine emotional states is disclosed in Fuller, U.S. Patents 3,855,416, 3,855,417 and 3,855,418. The technique disclosed in the Fuller patents analyses amplitude characteristics of a speech signal and operates on distortion products of the fundamental frequency commonly called vibrato and on proportional relationships between various harmonic overtone or higher order formant frequencies.
Although this technique appears to operate in real time, in practice, each voice sample must be calibrated or normalized against each individual for reliable results. Analysis is also limited to the occurrence of stress, and other characteristics of an individual's emotional state cannot be detected.
A technique for voice analysis is disclosed in Coulter U.S. Patent No. 3,268,661 which determines the frequency loci of voice consonants by extrapolating the slope of the consonant sound to its initial extrapolation. The apparatus disclosed in the Coulter patent comprises a microphone whose output is fed in parallel to a formant analyser and a voice/unvoice silence detector. The detector serves simply to detect whether a voice signal, i.e. one having a particular pitch frequency, is present from the microphone. There is no disclosure in the Coulter patent that indicates that it would be used for determining the emotional state of a person. Further, the Coulter patent makes use of the second formant.
The present invention is directed to an apparatus for analysing a person's speech to determine their emotional state. The analyser operates on the real time frequency or pitch components within the first formant band of human speech. In analysing the speech, the apparatus analyses certain value occurrence patterns in terms of differential first formant pitch, rate of change of pitch, duration and time distribution patterns. These factors relate in a complex but very fundamental way to both transient and long term emotional states.
Human speech is initiated by two basic sound generating mechanisms. The vocal cords, thin stretched membranes under muscle control, oscillate when expelled air from the lungs passes through them. They produce a characteristic "buzz" sound at a fundamental frequency between 80 Hz and 240 Hz. This frequency is varied over a moderate range by both conscious and unconscious muscle contraction and relaxation. The wave form of the fundamental "buzz" contains many harmonics, some of which excite resonance in various fixed and variable cavities associated with the vocal tract. The second basic sound generated during speech is a pseudo-random noise having a fairly broad and uniform frequency distribution. It is caused by turbulence as expelled air moves through the vocal tract and is called a "hiss" sound. It is modulated, for the most part, by tongue movements and also excites the fixed and variable cavities. It is this complex mixture of "buzz" and "hiss" sounds, shaped and articulated by the resonant cavities, which produces speech.
In an energy distribution analysis of speech sounds, it will be found that the energy falls into distinct frequency bands called formants. There are three significant formants. The system described here utilizes the first formant band which extends from the fundamental "buzz" frequency to approximately 1000 Hz. This band has not only the highest energy content but reflects a high degree of frequency modulation as a function of various vocal tract and facial muscle tension variations.
In effect, by analysing certain first formant frequency distribution patterns, a qualitative measure of speech-related muscle tension variations and interactions is performed. Since these muscles are predominantly biased and articulated through secondary unconscious processes which are in turn influenced by emotional state, a relative measure of emotional activity can be determined independent of a person's awareness or lack of awareness of that state. Research also bears out a general supposition that since the mechanisms of speech are exceedingly complex and largely autonomous, very few people are able to consciously "project" a fictitious emotional state. In fact, an attempt to do so usually generates its own unique psychological stress "fingerprint" in the voice pattern.
Because of the characteristics of the first formant speech sounds, the present invention analyses an FM demodulated first formant speech signal and produces an output indicative of nulls thereof.
The frequency or number of nulls or "flat" spots in the FM demodulated signal, the length of the nulls and the ratio of the total time that nulls exist during a word period to the overall time of the word period are all indicative of the emotional state of the individual. By looking at the output of the device, the user can see or feel the occurrence of the nulls and thus can determine the emotional state of the individual by observing the number or frequency of nulls, the length of the nulls and the ratio of the total time nulls exist during a word period to the length of the word period.
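The three null measures named above (number of nulls, length of the nulls, and ratio of null time to word period) can be illustrated with a short sketch. This is not part of the disclosed apparatus, in which the user judges these quantities by observing the output; the function name, the sampled boolean representation of a word period and the sample period are all assumptions made for illustration.

```python
from typing import Dict, List


def null_statistics(is_null: List[bool], sample_period_s: float) -> Dict[str, float]:
    """Summarise the nulls found within one word period.

    is_null[i] is True where the FM demodulated signal is judged "flat"
    (hypothetical sampled representation, not taken from the patent).
    """
    run_lengths: List[int] = []  # length, in samples, of each consecutive null run
    current = 0
    for flag in is_null:
        if flag:
            current += 1
        elif current:
            run_lengths.append(current)
            current = 0
    if current:  # a null that lasts until the end of the word period
        run_lengths.append(current)

    total = len(is_null)
    return {
        "null_count": len(run_lengths),
        "longest_null_s": max(run_lengths, default=0) * sample_period_s,
        "null_ratio": sum(run_lengths) / total if total else 0.0,
    }
```

For a word period of ten samples containing one null of two samples and one null of a single sample, the sketch reports two nulls and a null ratio of 3/10.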
In the present invention, there is provided a speech analyser for determining the emotional state of a person, the speech analyser having FM demodulator means for detecting the first formant of a person's speech and producing an FM demodulated signal therefrom, the speech analyser characterized by

(a) word detector means for detecting the presence of an FM demodulated signal;
(b) null detector means for detecting the absence of a change in frequency of the speech utilized to produce the FM demodulated signal and for producing an output indicative thereof; and,
(c) output means coupled to said word detector means and said null detector means, wherein said output means is enabled by said word detector means when said word detector means detects the presence of an FM demodulated signal and wherein said output means produces an output indicative of the presence or absence of a change in frequency of the person's speech utilized to produce the FM demodulated signal.

The user of the device thus monitors the nulls and can thereby determine the emotional state of the individual whose speech is being analysed.
An advantage of the speech analyser is that it can be made small and portable.
In order that the present invention be more readily understood, an embodiment thereof will now be described by way of example with reference to the accompanying drawings, in which:-

Figure 1 is a block diagram of the system of the present invention;
Figures 2A-2K illustrate the electrical signals produced by the system shown in Figure 1;
Figure 3 illustrates an alternative embodiment of the output of the present invention; and
Figure 4 illustrates still another alternative embodiment of the output of the present invention.

Detailed Description of the Preferred Embodiment

Referring to Figs. 1 and 2A-2K, speech, for the purposes of convenience, is introduced into the speech analyser by means of a built-in microphone 2. The low level signal from the microphone 2 shown in Fig. 2A is amplified by the preamplifier 4 which also removes the low frequency components of the signal by means of a high pass filter section. The amplified speech signal is then passed through the low pass filter 6 which removes the high frequency components above the first formant band. The resultant signal, illustrated in Fig. 2B, represents the frequency components to be found in the first formant band of speech, the first formant band being 250 Hz-800 Hz. The signal from low pass filter 6 is then passed through the zero axis limiter circuit 8 which removes all amplitude variations and produces a uniform square wave output illustrated in Fig. 2C which contains only the period or instantaneous frequency component of the first formant speech signal. This signal is then applied to the pulse generator circuit 10 which produces an output pulse of constant amplitude and width, hence constant energy, upon each positive going transition of the input signal. The output of pulse generator circuit 10 is illustrated in Fig. 2D. The pulse signal in Fig. 2D is integrated by the low pass filter circuit 12 whose output is shown in Figs. 2E1 and 2E2. The D.C. level or amplitude of the output of the filter as shown in Fig. 2E thus represents the instantaneous frequency of the first formant speech signal. The output of the low pass filter 12 will thus vary as a function of the frequency modulation of the first formant speech signal by various vocal cord and other vocal tract muscle systems. The overall combination of the zero axis limiter 8, the pulse generator 10, and the low pass filter 12 comprises a conventional FM demodulator designed to operate over the first formant speech frequency band.
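The limiter, pulse generator and low pass filter chain described above is an analog circuit. A minimal discrete-time sketch of the same pulse-counting principle is given below by way of illustration only. The function name, the sample rate, the pulse width and the filter cutoff are assumptions and are not taken from the specification, and a one-pole digital filter stands in for low pass filter 12.

```python
import math
from typing import List


def fm_demodulate(signal: List[float], sample_rate: float,
                  pulse_width_s: float = 0.0005,
                  cutoff_hz: float = 30.0) -> List[float]:
    """Pulse-counting FM demodulator: limiter -> pulse generator -> low pass filter.

    pulse_width_s and cutoff_hz are illustrative values, not from the patent.
    """
    # Zero axis limiter: discard amplitude, keep only the sign of the signal.
    limited = [1 if x >= 0.0 else 0 for x in signal]

    # Pulse generator: one pulse of fixed amplitude and width
    # on each positive-going transition of the limited signal.
    pulse_len = max(1, round(pulse_width_s * sample_rate))
    pulses = [0.0] * len(signal)
    for i in range(1, len(limited)):
        if limited[i] == 1 and limited[i - 1] == 0:
            for j in range(i, min(i + pulse_len, len(pulses))):
                pulses[j] = 1.0

    # Low pass filter (one-pole): its output level follows the pulse rate,
    # and therefore the instantaneous frequency of the input.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out: List[float] = []
    level = 0.0
    for p in pulses:
        level += alpha * (p - level)
        out.append(level)
    return out
```

Because every pulse has the same area, the settled output level is approximately the input frequency multiplied by the pulse width, so doubling the input frequency roughly doubles the output level.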
The FM demodulated output signal from the low pass filter 12 is applied to word detector circuit 14 which is a voltage comparator with a reference voltage set to a level representative of a first formant frequency of 250 Hz. When this reference level is exceeded by the FM demodulated signal, the comparator output switches from OFF to ON as illustrated in Fig. 2F.
The FM demodulated output signal from the low pass filter 12 is also applied to differentiator circuit 16 which produces an output signal proportional to the instantaneous rate of change of frequency of the first formant speech signal. The output of differentiator 16, which is shown in Fig. 2G, corresponds to the degree of frequency modulation of the first formant speech signal.
The signal from differentiator 16 is applied to a full wave rectifier circuit 18. This circuit passes the positive portion of the signal unchanged. The negative portion is inverted and added to the positive portion. The composite signal is then applied to pulse stretching circuit 19 which comprises a parallel circuit of a resistor and capacitor in series with a diode. The pulse stretching circuit 19 provides a fast rise, slow decay function which eliminates false null information as the differentiated signal passes through zero. The output of pulse stretching circuit 19 is illustrated in Fig. 2H.
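The relation between the comparator reference voltage and the 250 Hz threshold can be sketched as follows. For a train of pulses of constant amplitude and width, the averaged level equals frequency multiplied by pulse width multiplied by pulse amplitude. The pulse width and amplitude used below are hypothetical values chosen for illustration; the specification does not state them, and the function names are likewise assumptions.

```python
from typing import List


def reference_level(freq_hz: float, pulse_width_s: float, pulse_amplitude_v: float) -> float:
    """Averaged level of a constant-width, constant-amplitude pulse train at freq_hz."""
    return freq_hz * pulse_width_s * pulse_amplitude_v


def word_detector(demodulated: List[float], reference_v: float) -> List[bool]:
    """Comparator: ON (True) while the FM demodulated level exceeds the reference."""
    return [level > reference_v for level in demodulated]


# Hypothetical circuit values: 0.5 ms pulses of 5 V amplitude.
ref = reference_level(250.0, 0.0005, 5.0)  # about 0.625 V
gate = word_detector([0.1, 0.7, 0.9, 0.3], ref)
```

With these assumed values the gate is ON only for the two middle samples, whose levels correspond to frequencies above 250 Hz.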
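The differentiator, full wave rectifier and pulse stretching stages can be modelled in discrete time as sketched below, by way of illustration only. The function name, the sample rate and the decay time constant are assumptions. The stretcher is modelled as a peak hold with exponential decay, which approximates a diode charging a capacitor that discharges through a parallel resistor.

```python
import math
from typing import List


def null_detector_chain(demodulated: List[float], sample_rate: float,
                        decay_s: float = 0.02) -> List[float]:
    """Differentiator -> full wave rectifier -> pulse stretcher.

    decay_s is an illustrative RC time constant, not a value from the patent.
    """
    # Differentiator: rate of change of the FM demodulated level.
    rate = [0.0] + [(demodulated[i] - demodulated[i - 1]) * sample_rate
                    for i in range(1, len(demodulated))]

    # Full wave rectifier: positive part unchanged, negative part inverted.
    rectified = [abs(r) for r in rate]

    # Pulse stretcher: rises at once to any larger input, then decays slowly,
    # so a brief zero crossing of the derivative is not read as a null.
    decay = math.exp(-1.0 / (decay_s * sample_rate))
    stretched: List[float] = []
    held = 0.0
    for r in rectified:
        held = max(r, held * decay)
        stretched.append(held)
    return stretched
```

In this model the output stays well above zero while the demodulated level keeps swinging, even at the instants where its slope changes sign, and falls towards zero only when the level remains flat for several time constants.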
The output signal of the pulse stretching circuit 19 is applied to an output circuit 17 including a comparator circuit 20 and a display 21. Comparator circuit 20 comprises a three level voltage comparator gated ON or OFF by the output of word detector circuit 14. Thus, when speech is present, the comparator circuit 20 evaluates, in terms of amplitude level, the output of the pulse stretching circuit 19. Reference levels of the comparator circuit 20 are set so that when normal levels of frequency modulation are present in the first formant speech signal an output as shown in Fig. 2I is produced and a display 21 having an appropriate visual indicator, such as a green LED 22, is turned ON. When there is only a small amount of frequency modulation present, such as under mild stress conditions, an output such as shown in Fig. 2J is produced and the comparator circuit 20 turns on the yellow LED 24. When there is a full null, such as produced by more intense stress conditions, an output such as shown in Fig. 2K is produced and the comparator circuit turns on the red LED 26.
Referring to Fig. 3, comparator circuit 20 can have an output coupled to a tactile device 28 for producing a tactile output so that the user can place the device close to his body and sense the occurrence of nulls through a physical stimulation to his body rather than through a visual display. In this embodiment the user can maintain eye contact with the individual whose speech is being analysed, which could in turn reduce the anxiety of that individual caused by the user constantly looking at the speech analyser.
In the embodiment shown in Fig. 4 the word detector 14 and the pulse stretching circuit 19 are connected to a voltage meter circuit 30 which is substituted for the comparator circuit 20. The meter circuit 30 is turned on when word detector 14 is ON and meter 32 provides an indication of the voltage output of pulse stretching circuit 19.
Since the pitch or frequency null perturbations contained within the first formant speech signal define, by their pattern of occurrence, certain emotional states of the individual whose speech is being analysed, a visual integration and interpretation of the displayed output provides adequate information to the user of the instrument for making certain decisions with regard to the emotional state, in real time, of the person speaking.
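The gated three level comparison can be expressed as a small decision function, sketched here for illustration. The function name and the two reference levels are hypothetical; the specification states only that the reference levels are set to separate normal modulation, a small amount of modulation and a full null.

```python
from typing import Optional


def comparator_output(stretched_level: float, word_present: bool,
                      low_ref: float, high_ref: float) -> Optional[str]:
    """Three level comparator gated by the word detector.

    Returns the LED that would be lit, or None when no speech is present.
    low_ref and high_ref are illustrative reference levels (low_ref < high_ref).
    """
    if not word_present:
        return None  # comparator gated OFF: nothing is displayed
    if stretched_level >= high_ref:
        return "green"  # normal level of frequency modulation
    if stretched_level >= low_ref:
        return "yellow"  # small amount of frequency modulation
    return "red"  # full null


# Hypothetical reference levels of 1.0 and 3.0 (arbitrary units).
indication = comparator_output(2.0, True, 1.0, 3.0)
```

With the assumed reference levels, a stretched level of 2.0 during speech falls between the two references and lights the yellow indicator.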
The speech analyser of the present invention can be constructed using integrated circuits and therefore can be constructed in a very small size which allows it to be portable and capable of being carried in one's pocket, for example.

Claims

1. A speech analyser for determining the emotional state of a person, the speech analyser having FM demodulator means (8, 10, 12) for detecting the first formant of a person's speech and producing an FM demodulated signal therefrom, the speech analyser characterized by

(a) word detector means (14) for detecting the presence of an FM demodulated signal;

(b) null detector means (16, 18, 19) for detecting the absence of a change in frequency of the speech utilized to produce the FM demodulated signal and for producing an output indicative thereof; and,

(c) output means (17) coupled to said word detector means (14) and said null detector means (16, 18, 19), wherein said output means (17) is enabled by said word detector means (14) when said word detector means (14) detects the presence of an FM demodulated signal and wherein said output means (17) produces an output indicative of the presence or absence of a change in frequency of the person's speech utilized to produce the FM demodulated signal.

2. A speech analyser, as set forth in claim 1, characterized by said word detector means (14) and said null detector means (16, 18, 19) each coupled to the output of said FM demodulator means (8, 10, 12).

3. A speech analyser, as set forth in claim 1, wherein said null detector means (16, 18, 19) is characterized by

(a) a differentiator means (16) for differentiating the FM demodulated signal;

(b) a full wave rectifier means (18) for rectifying the FM demodulated signal; and

(c) pulse stretching circuit means (19) for eliminating the detection of the absence of a change in frequency of the speech utilized to produce the FM demodulated signal when the differentiated FM demodulated signal passes through zero.

4. A speech analyser, as set forth in claim 1, wherein said output means (17) is characterized by a meter means (30, 32) for depicting voltage magnitude.

5. A speech analyser, as set forth in claim 1, wherein said output means (17) is characterized by

(a) comparator means (20) for detecting the level of the output of the null detector means (16, 18, 19) and comparing the level with predetermined voltage levels wherein when said level is below a first predetermined level there exists an absence of a change in frequency of the speech utilized to produce the FM demodulated signal and when said level is above a second predetermined level a change in the frequency of the speech utilized to produce the FM demodulated signal is present; and

(b) display means (21) for displaying the output of said comparator means (20).

6. A speech analyser, as set forth in claim 5, wherein said display means (21) is characterized by a tactile display (28).

7. A speech analyser, as set forth in claim 5, wherein said display means (21) is characterized by at least two lights (22, 26) one of said lights (26) being turned on when the output of the comparator means (20) is indicative of an absence of a change in frequency of the speech utilized to produce the FM demodulated signal and the other light (22) being turned on when the output of the comparator means (20) is indicative of a change in frequency of the speech utilized to produce the FM demodulated signal.

8. A speech analyser, as set forth in claim 7, wherein said display means (21) is further characterized by a third light (24), said third light (24) being turned on when the level of the output of the comparator means (20) is indicative of a transition between the presence and absence of a change in frequency of the speech utilized to produce the FM demodulated signal.

9. A speech analyser, as set forth in claim 1, further characterized by filter means (4, 6) receiving the person's speech and passing signals only in the range of 250 Hz to 800 Hz to the FM demodulator means (8, 10, 12).