IL192367A - System for indicating emotional attitudes through intonation analysis and methods thereof - Google Patents
System for indicating emotional attitudes through intonation analysis and methods thereof
- Publication number
- IL192367A
- Authority
- IL
- Israel
- Prior art keywords
- intonation
- speaker
- emotional
- frequency
- sound
- Prior art date
Links
- 230000002996 emotional effect Effects 0.000 title claims description 86
- 238000000034 method Methods 0.000 title claims description 61
- 238000004458 analytical method Methods 0.000 title claims description 17
- 230000001755 vocal effect Effects 0.000 claims description 45
- 230000008451 emotion Effects 0.000 claims description 35
- 241001465754 Metazoa Species 0.000 claims description 20
- 238000012545 processing Methods 0.000 claims description 19
- 230000006870 function Effects 0.000 description 21
- 230000008569 process Effects 0.000 description 13
- 238000011160 research Methods 0.000 description 11
- 230000035882 stress Effects 0.000 description 10
- 230000003340 mental effect Effects 0.000 description 8
- 238000004891 communication Methods 0.000 description 7
- 230000000694 effects Effects 0.000 description 7
- 238000012937 correction Methods 0.000 description 6
- 238000005259 measurement Methods 0.000 description 6
- 230000009471 action Effects 0.000 description 5
- 238000006243 chemical reaction Methods 0.000 description 4
- 230000008909 emotion recognition Effects 0.000 description 4
- 230000011218 segmentation Effects 0.000 description 4
- 210000004556 brain Anatomy 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 230000007257 malfunction Effects 0.000 description 3
- 230000008447 perception Effects 0.000 description 3
- 208000019901 Anxiety disease Diseases 0.000 description 2
- 230000036506 anxiety Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 2
- 230000019771 cognition Effects 0.000 description 2
- 230000001149 cognitive effect Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 2
- 230000004064 dysfunction Effects 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 230000005055 memory storage Effects 0.000 description 2
- 230000009894 physiological stress Effects 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 230000006641 stabilisation Effects 0.000 description 2
- 238000011105 stabilization Methods 0.000 description 2
- 238000002560 therapeutic procedure Methods 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 241000282412 Homo Species 0.000 description 1
- 241000124008 Mammalia Species 0.000 description 1
- 241000282320 Panthera leo Species 0.000 description 1
- 206010034719 Personality change Diseases 0.000 description 1
- 206010042464 Suicide attempt Diseases 0.000 description 1
- 208000003443 Unconsciousness Diseases 0.000 description 1
- 230000003542 behavioural effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000001413 cellular effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 230000007123 defense Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000006397 emotional response Effects 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000013011 mating Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
- 241000894007 species Species 0.000 description 1
- 230000004083 survival effect Effects 0.000 description 1
- 230000001225 therapeutic effect Effects 0.000 description 1
Landscapes
- Electrically Operated Instructional Devices (AREA)
Description
SYSTEM FOR INDICATING EMOTIONAL ATTITUDES THROUGH INTONATION ANALYSIS AND METHODS THEREOF

FIELD OF THE INVENTION
This invention relates to methods and a system for indicating the emotional attitudes of a speaker through intonation analysis.

BACKGROUND OF THE INVENTION
The intonation with which words are spoken creates with the listener an understanding of the type of emotional attitude intended by the speaker. A significant part of non-verbal communication between people, and even among animals, is based upon the interpretation of intonation. In human communication, this interpretation is combined with the verbal content to form a complete understanding.
Interpretation of intonation is used alongside the interpretation of body language and the interpretation of content as an emotional communication tool. This interpretation is intuitive and, to date, no accurate description and decoding of it has been provided.
Indeed, various studies have been conducted to understand a speaker's emotional state, but those studies did not relate to the way specific words are pronounced nor were they able to attach a specific emotion to each and every tone.
Among the systems developed to date are the following: US patent No. 3,971,034 presents a method of detecting psychological stress by evaluating manifestations of physiological change in the human voice; US patent No. 6,006,188 compares models of speech with models in a database and reaches conclusions from the comparison between the models; and US patent No. 6,427,137 and US patent No. 6,591,238 identify situations of stress, characteristic of lying and even of attempted suicide.
WO 2006/059325 discloses a method and system for indicating a condition of an individual by analysis of non-discernible sounds.
A method and system for a) indicating emotional attitudes by intonation analysis related to the way specific words are pronounced, or b) attaching a specific emotion to each and every tone, thus meet a long-felt need.
SUMMARY OF THE INVENTION
The present invention thus provides a method and system for indicating emotional attitudes of a speaker by intonation analysis. This analysis is related to the way specific words are pronounced and is also related to the finding that specific emotions are attached to each and every tone.
It is an object of the present invention to provide a method for indicating the emotional attitude of a speaker according to intonation. The method comprises defining a set of words; obtaining a database comprising reference intonations and reference emotional attitudes connected to each of the words within said set of words; repeatedly pronouncing at least one of the words from this set of words by the speaker; recording a plurality of such pronunciations to obtain a signal representing sound magnitude as a function of time; processing said signal to obtain voice characteristics comprising a description of sound magnitude as a first function of frequency; decoding these voice characteristics to identify an intonation; comparing this intonation to the reference intonations; and retrieving at least one of the reference emotional attitudes.
It is also an object of the present invention to provide a method for indicating emotional attitude of a speaker according to intonation. This method comprises recording a speaker to obtain a sample of speech; processing the sample of speech to obtain digital data representing sound magnitude as a function of time; processing the digital data to obtain voice characteristics comprising a description of sound magnitude as a first function of frequency; decoding the voice characteristics to identify dominant tones; and attributing an emotional attitude to the speaker based on the determined dominant tones.
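The patent does not prescribe a particular signal-processing technique for this step. The following is a minimal sketch, assuming the recording is already available as a NumPy array, of how sound magnitude as a function of frequency and the dominant tones could be obtained; the function names, windowing and peak-picking rule are illustrative assumptions, not part of the claims.

```python
import numpy as np

def magnitude_vs_frequency(samples, sample_rate):
    """Return (frequencies, magnitudes): sound magnitude as a function of frequency."""
    window = np.hanning(len(samples))                    # reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(samples * window))
    frequencies = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return frequencies, magnitudes

def dominant_frequencies(frequencies, magnitudes, count=3):
    """Return the `count` frequencies carrying the most energy (simple local maxima)."""
    peaks = [i for i in range(1, len(magnitudes) - 1)
             if magnitudes[i] > magnitudes[i - 1] and magnitudes[i] >= magnitudes[i + 1]]
    peaks.sort(key=lambda i: magnitudes[i], reverse=True)
    return [float(frequencies[i]) for i in peaks[:count]]

# Synthetic check: a "voice" made of 179 Hz and 195 Hz components (roughly FA and SOL).
rate = 8000
t = np.arange(0, 1.0, 1.0 / rate)
voice = np.sin(2 * np.pi * 179 * t) + 0.8 * np.sin(2 * np.pi * 195 * t)
freqs, mags = magnitude_vs_frequency(voice, rate)
print(dominant_frequencies(freqs, mags))                 # strongest peaks near 179 Hz and 195 Hz
```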
It is within the scope of the present invention to provide a method as defined above, wherein the step of retrieving further comprises interpolating between emotional attitudes according to intonations.
It is further within the scope of the present invention to provide a method as defined above, wherein the step of decoding further comprises calculating a maximum over a range of said first function of frequency to obtain a second function of frequency.
It is further within the scope of the present invention to provide a method as defined above, wherein the step of decoding further comprises calculating an average over a range of said first function of frequency to obtain a second function of frequency.
It is further within the scope of the present invention to provide a method as defined above, wherein the step of decoding further comprises calculating a maximum over a range of said second function of frequency to obtain an intonation.
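The claims leave the exact ranges open. One plausible reading, sketched below, folds the first function of frequency across dyadic multiples (octaves) onto the 120-240 Hz base range of Table 1, using either the maximum or the average as the combining rule, and reads the intonation off the maximum of the resulting second function; the bin width and octave count are illustrative assumptions.

```python
import numpy as np

def fold_to_base_octave(freqs, mags, base_low=120.0, base_high=240.0, n_octaves=4, reduce="max"):
    """Second function of frequency: combine, for every frequency in the base octave,
    the magnitudes found at that frequency and at all of its dyadic multiples."""
    base = np.arange(base_low, base_high, 1.0)           # 1 Hz bins across the base octave
    stacked = np.vstack([np.interp(base * 2 ** k, freqs, mags) for k in range(n_octaves)])
    folded = stacked.max(axis=0) if reduce == "max" else stacked.mean(axis=0)
    return base, folded

def intonation_frequency(freqs, mags):
    """The intonation is read off as the base frequency maximising the folded function."""
    base, folded = fold_to_base_octave(freqs, mags)
    return float(base[int(np.argmax(folded))])
```

Given the frequency and magnitude arrays from the earlier sketch, intonation_frequency(freqs, mags) returns a value in the 120-240 Hz base range that can then be matched against the tone glossary of Table 1 below.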
It is further within the scope of the present invention to provide a method as defined above, wherein the step of retrieving further comprises determining the difference of the speaker's emotional attitudes from the reference emotional attitudes.
It is another object of the present invention to provide a method for indicating emotional attitudes of an animal, other than a human being, according to intonation. This method comprises defining a set of sounds that such an animal may emit, and obtaining a database comprising reference intonations and reference emotional attitudes per each sound of said set of sounds; repeatedly producing at least one sound of the set of sounds by said animal; recording a plurality of these produced sounds to obtain a signal representing sound magnitude as a function of time; processing this signal to obtain sound characteristics; decoding these sound characteristics to identify an intonation; comparing the calculated intonation to the reference intonations; and retrieving at least one of the reference emotional attitudes.
Another object of the present invention is to provide an automated, computer-based system for indicating the emotional attitudes of a speaker by automated intonation analysis. This system comprises the following components: a. a sound recorder adapted to record a word that is repeatedly pronounced by the speaker and to produce a signal representing the recorded sound magnitude as a function of time; b. a first processor, with processing software running on it, adapted to process this signal and obtain voice characteristics such as intonation; c. a database comprising an intonation glossary; d. a second processor, with computing software running on it, adapted to collect a set of predefined words, to retrieve from the database relations between reference intonations and reference emotional attitudes per each word of said set of words, to connect to the first processor, to compare a computed intonation to the reference intonations, and to retrieve at least one of the reference emotional attitudes; and e. an indicator connected to either the first or the second processor.
It is yet another object of the present invention to provide a method for advertising, marketing, educating, or lie detecting by indicating emotional attitudes of a speaker according to intonation. The method comprises obtaining a database comprising a plurality of vocal objects, such as words or sounds; playing-out a first group of at least one of these vocal objects for the speaker to hear; replying to said play-out by the speaker; recording this reply; processing this recording to obtain some voice characteristics; decoding these voice characteristics to identify an intonation; comparing this calculated intonation to the reference intonations; and playing-out a second group of at least one of the vocal objects for the speaker to hear again.
It is yet another object of the present invention to provide a system for indicating emotional attitudes of a speaker according to intonation, comprising a glossary of intonations relating intonations to emotional attitudes; it is within the scope of the present invention for the glossary to comprise any part of the information listed in Table 1 herein below.
It is yet another object of the present invention to provide a method of providing remote service by a group comprising at least one observer to at least one speaker. This method comprises identifying the emotional attitude of a speaker according to intonation, and advising the observers how to provide said service.
It is in the scope of the present invention to provide such a method, wherein the step of advising comprises selecting at least one of said group of observers to provide the service.
BRIEF DESCRIPTION OF THE INVENTION
In order to understand the invention and to see how it may be implemented in practice, a preferred embodiment will now be described, by way of a non-limiting example only, with reference to the accompanying drawings, in which:
figure 1 schematically presents a system according to the present invention;
figure 2 schematically presents a system according to the present invention in use;
figure 3 schematically presents some of the main software modules in a system according to the present invention;
figure 4 schematically presents a method according to the present invention; and
figures 5a and 5b elucidate and demonstrate intonation and its independence of language.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The following description is provided, alongside all chapters of the present invention, so as to enable any person skilled in the art to make use of this invention. This description sets forth the best modes contemplated by the inventor for carrying out this invention, but various modifications will be apparent to those skilled in the art. The generic principles of the present invention have been defined specifically to provide an apparatus and methods for diagnosing emotional attitudes of a speaker according to intonation.
The term "word" refers in the present invention to a unit of speech. Words selected for use according to the present invention usually carry a well defined emotional meaning. For example, "anger" is an English language word that may be used according to the present invention, while the word "regna" is not; the latter carrying no meaning, emotional or otherwise, to most English speakers.
The term "tone" refers in the present invention to a sound characterized by a certain dominant frequency. Several tones are defined by frequency in Table 1 herein below. Among them, for example, are the tones named FA and SOL.
The term "intonation" refers in the present invention to a tone or a set of tones, produced by the vocal chords of a human speaker or an animal. For example the word "love" may be pronounced by a human speaker with such an intonation so that the tones FA and SOL are dominant.
The term "Dominant tones" refers in the present invention to tones produced by the speaker with more energy and intensity than other tones. The magnitude or intensity of intonation can be expressed as a table, or graph, relating relative magnitude (measured, for example, in units of dB) to frequency (measured, for example, in units of Hz.) The term "reference intonation", as used in the present invention, relates to an intonation that is commonly used by many speakers while pronouncing a certain word or, it relates to an intonation that is considered the normal intonation for pronouncing a certain word. For example, the intonation FA-SOL may be used as a reference intonation for the word "love" because many speakers will use the FA-SOL intonation when pronouncing the word "love".
The term "emotional attitude", as used in the present invention, refers to an emotion felt by the speaker, and possibly affecting the. behavior of the speaker, or predisposing a speaker to act in a certain manner. It may also refer to an instinct driving an animal. For example "anger" is an emotion that may be felt by a speaker and "angry" is an emotional attitude typical of a speaker feeling this emotion.
The term "reference emotional attitude", as used in the present invention, refers to an emotional attitude normally associated with the meaning of a certain word. For example, the word "baby" may be associated with reference emotional attitudes of affection.
The term "voice characteristic", as used in the present invention, refers to a measurable quantity that can be calculated from a sample of recorded sound. For example the frequency of the sound carrying the most energy is a voice characteristic.
The term "decoding voice characteristic", as used in the present invention, refers to a calculation of voice characteristics, as derived from a sample of recorded sound, and the identification of the intonation expressed by this sample of recorded sound.
Research conducted by the applicants, looking at speakers of different languages, has found that certain words which carry an emotional value or meaning, such as love, happiness, hate, sadness, war, father, mother, baby, family, etc., have common cores of intonation throughout the languages examined and for the majority of speakers tested.
Our research revealed that each and every tone is associated with a common emotional attitude. Tone combinations are usually found to be universal. The results of our intonation analysis of spoken emotive words showed that principal emotional values can be assigned to each and every tone, as described in Table 1 below. Table 1 divides the range of frequencies between 120 Hz and 240 Hz into seven tones. These tones have corresponding harmonics in higher frequency ranges: 240 to 480 Hz, 480 to 960 Hz, etc. Per each tone, the table gives a name and a frequency range, and relates its accepted emotional significance.
Table 1: Glossary of Tones
1. DO (C), 128 Hz ± 8 Hz and all dyadic multiples: the need for activity in general and for survival activity in particular (defense, attack, work etc.).
2. RE (D), 146 Hz ± 8 Hz and all dyadic multiples: impulsiveness and/or creativity (generally creativity, pleasure from food, sex, drink etc.).
3. MI (E), 162 Hz ± 8 Hz and all dyadic multiples: self control, action within restraints.
4. FA (F), 179 Hz ± 8 Hz and all dyadic multiples: especially deep emotions such as love, hatred, sadness, joy, happiness.
5. SOL (G), 195 Hz ± 8 Hz and all dyadic multiples: deep level inter-personal communication (intimacy).
6. LA (A), 220 Hz ± 10 Hz and all dyadic multiples: speech out of profound conviction, referral to principles of reliability.
7. SI (B), 240 Hz ± 10 Hz and all dyadic multiples: tones of command and leadership, ownership or sense of mission.

After conducting extensive research, it was found by the inventors that certain combinations of these tones (listed and described above) serve as an intuitive basis for specific words. For example, the inventors found that the word "love", spoken in any one of the many languages examined, was pronounced by over 90% of the speakers using one of the following three intonations: 1. FA-SOL (signifying, according to the table above, emotional communication); 2. SOL-RE (signifying communication and impulsiveness); and 3. RE-FA (signifying a combination of impulsiveness and emotion).
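For illustration, Table 1 can be held as a small lookup structure. The sketch below folds an arbitrary dominant frequency into the 120-240 Hz base octave (the "dyadic multiples" of the table) and names the matching tone; the tolerance handling and the folding threshold are assumptions, not part of the patent text.

```python
# Table 1 as a lookup structure: tone name -> (centre frequency in Hz, tolerance in Hz).
TONE_GLOSSARY = {
    "DO":  (128.0, 8.0),    # need for activity, survival
    "RE":  (146.0, 8.0),    # impulsiveness and/or creativity
    "MI":  (162.0, 8.0),    # self control, action within restraints
    "FA":  (179.0, 8.0),    # especially deep emotions
    "SOL": (195.0, 8.0),    # deep level inter-personal communication
    "LA":  (220.0, 10.0),   # speech out of profound conviction
    "SI":  (240.0, 10.0),   # command, leadership, ownership
}

def tone_name(frequency_hz):
    """Fold a frequency into the 120-240 Hz base range (dyadic multiples map onto
    the same tone) and return the name of the matching tone, or None."""
    f = float(frequency_hz)
    while f >= 250.0:
        f /= 2.0
    while f < 120.0:
        f *= 2.0
    for name, (centre, tolerance) in TONE_GLOSSARY.items():
        if abs(f - centre) <= tolerance:
            return name
    return None                                          # falls between two glossary tones

# 358 Hz and 390 Hz fold to 179 Hz and 195 Hz: the FA-SOL reference intonation of "love".
print([tone_name(f) for f in (358.0, 390.0)])            # ['FA', 'SOL']
```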
In addition to the three common intonations mentioned above, every speaker adds his or her own "personal" tones, probably signifying their individual attitude or their momentary attitudes in the given situation. Therefore, for example, adding the tone "SI" to the word "love" adds the notion of ownership to that person's perception of love.
The inventors offer, in a non-limiting manner, a suggested explanation for understanding these observations. Presuming the existence of certain centers of emotion in the brain, each responsible for a specific emotional state, it is plausible that every emotional center in the brain is associated with a certain tone, and vice versa, so that whenever a center is active and initiates a verbal emotional response, the tone that is associated with that active center in the brain is expressed together with the verbal response.
While studying sounds emitted by animals during various activities, it was found that specific tones are expressed during specific activities. For example, analysis of the sounds of songbirds' mating calls showed an abundance of RE tones. Analyzing the lion's roar, on the other hand, showed an emphasis of the tone SI above the other tones; in other words, SI is the tone expressed with a higher volume, or magnitude, when compared to the other tones.
These findings lead to the assumption that some genetic structure exists, shared partially or fully, by humans and animals, that is responsible for the observed universal language of tones.
The current invention is based on the analysis of human and animal sounds, the identification of emphasized tones and the development of insights and interpretations of the intended meanings. All those are based on the glossary of tones presented herein above.
Based on the decoding of the intonation, it is possible to conduct a number of commercially valuable activities. Those activities are selected in a non-limiting manner from the following: 1. Decoding a person's emotional positions, conscious and unconscious, for the purpose of intelligence, negotiations, improved dialogue, etc; 2. Psychoanalysis based on decoding intonation, serving as basis for therapeutic dialogue between therapist and patient; 3. Devices for games or for human and animal behavior research; and, 4. Aids for improving inter-cultural communication.
For example, an automated system could be built that will sample a speaker's voice and determine the emotional attitude of the speaker by: a) finding the dominant tones used by the speaker, b) comparing those dominant tones to a list of tones and their meanings (such as detailed in Table 1) and, c) making a determination about the nature of the emotional attitude of the speaker based on the tones used. Such a system could find many commercial applications. For example, such a system could help a salesperson better understand the character and/or the needs of his customer(s); whether the selling transaction is taking place over the phone or in person. Another useful application, for example, could be the use of an automated system, such as described above, to follow the emotional attitude of a caller and determine the nature of his base emotional attitude and whether his emotional attitude changes in the course of a phone conversation. Such a system, as just described, could aid in the improvement of customer service and in a retroactive analysis of customer calls for the purpose of improving future service. Another application, for example, could be the use of such a system to alert a supervisor about an on-going conversation between a customer and a company representative and the changing emotional attitudes of the customer during the conversation.
Another commercial example could be the use of one of the systems described herein to screen potential hires, by an employer, and conclude whether the speaker fits the demands of the job, with which co-workers the speaker could co-operate most productively and finally, which incentives could be best used by an employer to motivate the tested individual (speaker).
Reference is thus made now to figure 1, presenting a schematic and generalized presentation of the aforementioned novel system for indicating emotional attitudes of a speaker through intonation analysis [100]. Voice recorder [110] converts sound into a signal such as an electrical or optical signal, digital or analog. The voice recorder typically comprises a microphone. The signal is fed to computer or processor [120], running software code [150] which accesses database [140]. According to a specific embodiment of the present invention the computer comprises a personal computer. According to a specific embodiment of the present invention the computer comprises a digital signal processor embedded in a portable device. Database [140] comprises definitions of certain tones and a glossary relating tones to emotions. The results of the computation and signal processing are displayed by indicator [130] connected to the computer. According to one specific embodiment of the present invention, the indicator [130] comprises a visual display of text or graphics. According to another specific embodiment of the present invention, it comprises an audio output such as sounds or spoken words.
Reference is now made to figure 2, presenting a schematic and generalized presentation of the aforementioned novel system for indicating emotional attitudes of a speaker through intonation analysis [100] in action. Speaker [200] is repeating a certain word into the system and observer [300] is observing an indication of the emotional attitudes of the speaker.
Reference is now made to figure 3, presenting a schematic and generalized presentation of the software [150] of the aforementioned novel system for indicating emotional attitudes of a speaker through intonation. For the sake of clarity and brevity, infrastructure software, e.g. the operating system, is not described here in detail. The relevant software comprises three main components: 1) the signal processing component, which processes the audio signal received from the recorder and produces voice characteristics such as frequency, amplitude and phase; 2) the tonal characteristics calculation component, which identifies the frequency ranges in which sound amplitude reaches maximum levels and compares them to reference values found in a glossary of words and tones stored in the database; and 3) the variables definition component, which defines the intonation specific to the speaker and defines the speaker's emotional attitudes accordingly.
Reference is now made to figure 4, presenting a schematic and generalized representation of the method for using intonation analysis to decipher the emotional attitudes of a speaker.
The method comprises inter alia the following steps, executed serially per each word pronounced, but they can be executed in parallel for several words as a pipeline. First a word is selected, and then it is repeatedly pronounced by a speaker. This repetitive pronunciation is recorded in a digital format and the recorded voice is then processed to obtain sound characteristics. The obtained characteristics are further processed to identify the dominant tones. The obtained results are compared to a database of tones in order to obtain a reading on the corresponding emotional attitudes. The described process is repeated for several utterances of the same word, and also for several words, until finally some output is displayed indicating the results of the calculation. The displayed results include a description of the emotional attitudes of the speaker, optionally accompanied by some recommendations for further action. In the case that the calculated intonation is found similar to a specific reference intonation stored in the database, then a specific emotional attitude related to the specific reference intonation can be retrieved from the database.
Otherwise, the emotional attitude can be interpolated from stored values. For example, if a dominant tone is found for a certain word that is close to the mean frequency between two adjacent tones stored in the database, and two emotional attitudes are found in the database corresponding to these two tones, then the output may comprise a mixture of the two attitudes in equal parts.
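A minimal sketch of such an interpolation, assuming each glossary tone is stored with its centre frequency; the linear weighting rule is an illustrative assumption.

```python
def interpolate_attitude(frequency_hz, glossary):
    """Mix the attitudes of the two tones bracketing the given base-octave frequency,
    weighted linearly by how close the frequency lies to each tone's centre."""
    tones = sorted(glossary.items(), key=lambda item: item[1])   # sort by centre frequency
    for (lo_name, lo_f), (hi_name, hi_f) in zip(tones, tones[1:]):
        if lo_f <= frequency_hz <= hi_f:
            weight = (frequency_hz - lo_f) / (hi_f - lo_f)
            return {lo_name: round(1.0 - weight, 2), hi_name: round(weight, 2)}
    return {}

# 187 Hz sits half-way between FA (179 Hz) and SOL (195 Hz): equal parts of both attitudes.
print(interpolate_attitude(187.0, {"RE": 146.0, "FA": 179.0, "SOL": 195.0}))
# -> {'FA': 0.5, 'SOL': 0.5}
```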
Reference is now made to figures 5a and 5b, presenting some research data to elucidate and demonstrate the use of the present invention for indicating emotional attitudes of a speaker through intonation analysis. Both figures show a graph of relative sound volume versus sound frequency from 0 to 1000 Hz. Such sound characteristics can be obtained from processing sound as described in reference to figure 4, by signal processing software described in reference to figure 3, and by equipment described in reference to figure 1. Each graph is the result of processing 30 seconds of speech. Dominant tones can be identified in figures 5a and 5b, and the dominant tones in 5a are similar to those of 5b. Both graphs result from speaking a word whose meaning is 'love'. The language was Turkish in the case of figure 5a, and English for figure 5b. Thus these figures demonstrate the concept of dominant tones and their independence of language.
The following is a set of examples which illustrate a few best modes for practicing the present invention. These examples should not be construed as limiting. In the first example a personal computer comprising a CPU made by Intel TM is running the Windows XP TM operating system made by Microsoft TM. The personal computer is equipped, inter alia, with a CRT monitor display and a socket accepting a plug connected to a microphone, as is commercially available from numerous manufacturers, for example Dell TM. The computer is also equipped with a hard disk and solid state memory. The computer is optionally connected to a network such as the internet. The operating system handles these peripherals and provides services to the software described herein below. The software is written in a programming language such as C++ and compiled using tools such as Code Warrior TM by Symantec TM or Visual Studio TM by Microsoft TM. Alternatively, the code is written in an interpreted language such as Basic or Matlab TM and is interpreted as it runs.
A set of words is selected in a language spoken by a speaker [200]. The words are selected so that they carry some emotional value to the speaker. The process described herein below is repeated for each word in the set. An English language example comprises the following words: love, hate, happiness, mother, father, baby, dream, jealousy, anger.
The speaker pronounces the word in front of a microphone. The speaker repeats pronouncing the same word several times, at least twice and preferably about 10 times. The voice of the speaker is recorded. The microphone translates it to an electrical signal which is accepted by the computer, converted to digital form and stored either temporarily or permanently. In the case of permanent storage, a file may be written in the hard drive or transmitted elsewhere through the network. WAV and MP3 are well known examples of file formats for storing voice recordings.
The recorded voice data is analyzed by the following two software modules, possibly working in parallel as a pipeline of operations, and possibly also in parallel with the recording of the pronunciations of further words. A first software module analyzes the voice data and calculates the following two voice characteristic functions. Function A is the maximum of sound volume as a function of the sound frequency, where the frequencies range, for example, from 0 to 3500 Hz, and function B is the average of sound volume for each and every frequency in such a range. The range is sampled at a reasonable rate, for example one sample per one Hz. A second software module calculates the maximum points of functions A and B, and creates the order of tone emphasis according to the following algorithm. Tone No. 1 is the tone projecting to the frequency at which the global maximum point of function B is received. Tone No. 2 is the tone for which the greatest number of maximum points was registered for function B, i.e. the same tone appears as the maximum tone on several octaves, and this number of octaves is higher than for any of the other tones. Tone No. 3 is the tone projecting to the frequency at which the second-highest maximum point is received, i.e., after deducting the points attributed to tone No. 1 and tone No. 2. Tone No. 4 is the tone projecting to the frequency at which the third-highest maximum point in function B is received, i.e., after deducting the points attributed to tones 1, 2 and 3. Tone No. 5 is the tone projecting to the frequency at which the fourth-highest maximum point in function B is received, on condition that it is also received as a maximum point of function A.
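A sketch of that ordering rule, assuming the recording has already been reduced to per-frame magnitude spectra and that a helper such as the tone_name sketch above maps a frequency to a tone; the peak-picking details are assumptions, not taken from the text.

```python
import numpy as np
from collections import Counter

def emphasis_order(frames, freqs, tone_of):
    """Rank up to five emphasised tones from function A (per-frequency maximum over
    all analysis frames) and function B (per-frequency average over all frames).
    `frames` is a 2-D array of magnitudes, one row per frame; `tone_of` maps a
    frequency in Hz to a tone name (or None)."""
    func_a = frames.max(axis=0)      # function A: maximum sound volume at each frequency
    func_b = frames.mean(axis=0)     # function B: average sound volume at each frequency

    def peaks(values):
        """Local maxima as (tone name, height) pairs, strongest first."""
        idx = [i for i in range(1, len(values) - 1)
               if values[i] > values[i - 1] and values[i] >= values[i + 1]]
        idx.sort(key=lambda i: values[i], reverse=True)
        return [(tone_of(freqs[i]), values[i]) for i in idx if tone_of(freqs[i]) is not None]

    peaks_a = {name for name, _ in peaks(func_a)}
    peaks_b = peaks(func_b)
    if not peaks_b:
        return []

    order = [peaks_b[0][0]]                               # tone 1: global maximum of B
    counts = Counter(name for name, _ in peaks_b)
    tone2 = max((n for n in counts if n not in order), key=counts.get, default=None)
    if tone2 is not None:
        order.append(tone2)                               # tone 2: most octaves with a B maximum
    for name, _ in peaks_b:                               # tones 3 and 4: next-highest B peaks
        if name not in order:
            order.append(name)
        if len(order) >= 4:
            break
    for name, _ in peaks_b:                               # tone 5: next B peak that also peaks in A
        if name not in order and name in peaks_a:
            order.append(name)
            break
    return order
```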
This group of five tones is compared to a standard group of tones calculated in advance, which is called herein "the norms". The norms are a part of a database available to the software. It may reside in the computer's permanent storage, or it may be available through the network. Based on closest correspondence, the most suitable characteristic is defined. Any deviations, e.g., addition or absence of tones, are also defined per spoken word and are added to or deducted from the above characterization. Finally, some indication of emotional attitude, possibly with recommendations for further action, is presented to an observer, or the speaker's attitude examiner, who may be a clinical psychologist, the speaker himself or herself, a scientific researcher, etc. The indication is given through the CRT display, or as spoken words through a loudspeaker, or possibly sent as a message through the network.
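One plausible way to score the five-tone group against stored norm groups and to report deviations; the overlap-count scoring and the example norm groups below are assumptions, not values from the patent.

```python
def match_norm(speaker_tones, norms):
    """Pick the stored norm group sharing the most tones with the speaker's group,
    and report added / missing tones as deviations."""
    speaker = set(speaker_tones)
    best_label, best_group = max(norms.items(), key=lambda kv: len(speaker & set(kv[1])))
    return {
        "closest_norm": best_label,
        "added_tones": sorted(speaker - set(best_group)),
        "missing_tones": sorted(set(best_group) - speaker),
    }

norms = {"love": ["FA", "SOL"], "anger": ["DO", "SI"]}      # illustrative norm groups only
print(match_norm(["FA", "SOL", "SI"], norms))
# -> {'closest_norm': 'love', 'added_tones': ['SI'], 'missing_tones': []}
```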
In the second example the personal computer is replaced by a mobile device equipped with a microphone and display means, as well as processing power and memory, and possibly connection to a network, for example a mobile cellular telephone.
In the third example the system resembles that of the first example, but the speaker is actually an animal such as a mammal or a bird. In this case the database comprises a glossary of tones corresponding to similar animals, for example animals of the same species and/or gender. The glossary is composed by recording several similar animals at similar behavioral situations, and identifying the core, norm, or reference tones.
In the fourth example the system of the first example is expanded by storing some recordings of reference words as sounds, and playing-out these words through a loudspeaker or earphones plugged into the personal computer, so that the speaker may hear them. This feature creates a form of conversation.
In the fifth example the system is split between two locations and connected via the telephone network. The speaker is located at a first location and is speaking into a telephone receiver. The rest of the system is located at a second location and supplies an observer with an indication of the speaker's emotions. For example, the system is located at a service center and the observer provides some service to the speaker. The system provides the observer with a profile of the personality of the speaker, enabling the observer to give optimal service. The system may recommend a course of action to the observer or even select between observers so that the speaker receives optimal service from the observer best equipped or trained to deal with his or her specific condition. For example, speakers may be roughly divided into three groups: early adopters, conservatives and survivors, and the service may be dealing with complaints about equipment malfunction. When a speaker is complaining about a malfunction, the system may indicate which replies the speaker may find acceptable. An early adopter may accept the answer that the equipment is state-of-the-art and thus not yet free of problems. A conservative may accept the answer that the equipment is the most commonly used and that its failure rate is accepted by most users, and a survivor may need to be shown ways of working around the malfunction. Mixing answers between personality types may increase the tension between speaker (customer) and observer (service provider) typical of this situation. The present invention allows for the best service, for the benefit of both parties.
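A toy sketch of that routing idea; the mapping from detected dominant tones to the three speaker groups and the suggested reply texts are purely illustrative assumptions, not taken from the patent.

```python
# Hypothetical mapping from a detected dominant tone to one of the three speaker groups.
GROUP_BY_TONE = {"RE": "early adopter", "MI": "conservative", "DO": "survivor"}

SUGGESTED_REPLY = {
    "early adopter": "Explain that the equipment is state-of-the-art and still maturing.",
    "conservative":  "Explain that the equipment is widely used and its failure rate is accepted.",
    "survivor":      "Walk the caller through a workaround for the malfunction.",
}

def advise_observer(dominant_tones):
    """Return the speaker group and suggested reply for the first tone that maps to a group."""
    for tone in dominant_tones:
        group = GROUP_BY_TONE.get(tone)
        if group is not None:
            return group, SUGGESTED_REPLY[group]
    return None, "No group detected; handle as a standard call."

print(advise_observer(["MI", "FA"]))    # -> conservative reply suggestion
```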
The voice signal may be emitted from or received by the present invention. Optionally, the emotion associated with the voice signal is identified upon the emotion being provided. In such a case, it should be determined whether the automatically determined emotion or the user-determined emotion matches the identified emotion. The user may be awarded a prize upon the user-determined emotion matching the identified emotion. Further, the emotion may be automatically determined by extracting at least one feature from the voice signals, such as in a manner discussed above.
To assist a user in recognizing emotion, an emotion recognition game can be played in accordance with one embodiment of the present invention. The game could allow a user to compete against the computer or another person to see who can best recognize emotion in recorded speech. One practical application of the game is to help autistic people in developing better emotional skills at recognizing emotion in speech.
In accordance with one embodiment of the present invention, an apparatus may be used to create data about voice signals that can be used to improve emotion recognition. In such an embodiment, the apparatus accepts vocal sound through a transducer such as a microphone or sound recorder. The physical sound wave, having been transduced into electrical signals, is applied in parallel to a typical, commercially available bank of electronic filters covering the audio frequency range. Setting the center frequency of the lowest filter to any value that passes the electrical energy representation of the vocal signal amplitude that includes the lowest vocal frequency signal establishes the center values of all subsequent filters up to the last one passing the energy, generally between 8 kHz and 16 kHz or between 10 kHz and 20 kHz, and also determines the exact number of such filters. The specific value of the first filter's center frequency is not significant, so long as the lowest tones of the human voice, approximately 70 Hz, are captured.
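As a sketch of that construction, the centre frequencies of such a 1/3-octave bank can be generated from the lowest centre upward; the 70 Hz starting point follows the text, while the 16 kHz stopping rule is an assumption.

```python
def third_octave_centres(lowest_hz=70.0, highest_hz=16000.0):
    """Centre frequencies of a 1/3-octave filter bank: each centre is 2**(1/3) times
    the previous one, so the lowest centre fixes both the whole bank and its size."""
    centres = [lowest_hz]
    while centres[-1] * 2 ** (1 / 3) <= highest_hz:
        centres.append(centres[-1] * 2 ** (1 / 3))
    return centres

bank = third_octave_centres()
print(len(bank), [round(f, 1) for f in bank[:4]], round(bank[-1], 1))
```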
Essentially any commercially available bank is applicable if it can be interfaced to any commercially available digitizer and then microcomputer. The specification section describes a specific set of center frequencies and microprocessor in the preferred embodiment. The filter quality is also not particularly significant because a refinement algorithm disclosed in the specification brings any average quality set of filters into acceptable frequency and amplitude values. The ratio 1/3, of course, defines the band width of all the filters once the center frequencies are calculated.
Following this segmentation process with filters, the filter output voltages are digitized by a commercially available set of digitizers or preferably a multiplexer and digitizer, or, in the case of the disclosed preferred embodiment, a digitizer built into the same identified commercially available filter bank, to eliminate interfacing logic and hardware. Again, the quality of the digitizer in terms of speed of conversion or discrimination is not significant, because average presently available commercial units exceed the requirements needed here, due to a correcting algorithm (see specifications) and the low sample rate necessary.
Any complex sound that is carrying constantly changing information can be approximated with a reduction of bits of information by capturing the frequency and amplitude of peaks of the signal. This, of course, is old knowledge, as is performing such an operation on speech signals.
However, in speech research, several specific regions where such peaks often occur have been labeled "formant" regions. These region approximations do not always coincide with each speaker's peaks under all circumstances. Speech researchers and the prior inventive art tend to go to great effort to measure and name "legitimate" peaks as those that fall within the typical formant frequency regions, as if their definition did not involve estimates but rather absoluteness. This has caused numerous research and formant measuring devices to artificially exclude pertinent peaks needed to adequately represent a complex, highly variable sound wave in real time. Since the present disclosure is designed to be suitable for animal vocal sounds as well as all human languages, artificial restrictions such as formants are not of interest, and the sound wave is treated as a complex, varying sound wave, so that any such sound can be analyzed.
In order to normalize and simplify peak identification, regardless of variation in filter band width, quality and digitizer discrimination, the actual values stored for amplitude and frequency are "representative values". This is so that the broadness of upper frequency filters is numerically similar to lower frequency filter band width. Each filter is simply given consecutive values from 1 to 25, and a soft to loud sound is scaled from 1 to 40, for ease of CRT screen display. A correction on the frequency representation values is accomplished by adjusting the number of the filter to a higher decimal value toward the next integer value, if the filter output to the right of the peak filter has a greater amplitude than the filter output on the left of the peak filter. The details of a preferred embodiment of this algorithm are described in the specifications of this disclosure. This correction process must occur prior to the compression process, while all filter amplitude values are available.
Rather than slowing down the sampling rate, the preferred embodiment stores all filter amplitude values, at 10 to 15 samples per second, for an approximately 10 to 15 second speech sample before further processing. As with any complex sound, the amplitude across the audio frequency range for a 0.1 second "time slice" will not be constant or flat; rather there will be peaks and valleys. The frequency representative values of the peaks of this signal, 1219, are made more accurate by noting the amplitude values on each side of the peaks and adjusting the peak values toward the adjacent filter value having the greater amplitude. This is done because, as is characteristic of adjacent 1/3 octave filters, energy at a given frequency spills over into adjacent filters to some extent, depending on the cut-off qualities of the filters. In order to minimize this effect, the frequency of a peak filter is assumed to be the center frequency only if the two adjacent filters have amplitudes within 10% of their average. To guarantee discrete, equally spaced, small values for linearizing and normalizing the values representing the unequal frequency intervals, each of the 25 filters is given a number value 1 through 25, and these numbers are used throughout the remainder of the processing. This way the 3,500 Hz difference between filters 24 and 25 becomes a value of 1, which in turn is also equal to the 17 Hz difference between the first and second filter.
To prevent more than five sub-divisions of each filter number and to continue to maintain equal valued steps between each sub-division of the 1 to 25 filter numbers, they are divided into 0.2 steps and are further assigned as follows. If the amplitude difference of the two adjacent filters to a peak filter is greater than 30% of their average, then the peak filter's number is assumed to be nearer to the half-way point to the next filter number than to the peak filter itself. This would cause the filter number of a peak filter, say filter number 6.0, to be increased to 6.4 or decreased to 5.6, if the bigger adjacent filter represents a higher or lower frequency, respectively. All other peak filter values are automatically given the value of the filter number +0.2 or -0.2, if the greater of the adjacent filter amplitudes represents a higher or lower frequency, respectively.
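A sketch of that refinement rule on the 1-to-25 filter-number scale; the edge handling for peaks at the ends of the bank is an added assumption.

```python
def corrected_peak_number(amplitudes, peak_index):
    """Refine the filter number (1-based) of a spectral peak using its two neighbours:
    shift by 0.4 toward the louder neighbour when the neighbours' amplitude difference
    exceeds 30% of their average, and by 0.2 otherwise."""
    left = amplitudes[peak_index - 1] if peak_index > 0 else 0.0
    right = amplitudes[peak_index + 1] if peak_index + 1 < len(amplitudes) else 0.0
    number = float(peak_index + 1)                 # filters are numbered 1..25
    average = (left + right) / 2.0 or 1.0
    step = 0.4 if abs(right - left) > 0.3 * average else 0.2
    return number + step if right > left else number - step

# Filter 6 (index 5) peaks; its right neighbour is much louder, so 6.0 becomes 6.4.
amps = [2, 3, 4, 9, 4, 12, 8, 1]
print(corrected_peak_number(amps, 5))              # -> 6.4
```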
The segmented and digitally represented vocal utterance signal 1219, after the aforementioned frequency correction 1220, is compressed to save memory storage by discarding all but six amplitude peaks. The inventor found that six peaks were sufficient to capture the style characteristics, so long as the following conditions are observed: at least one peak is near the fundamental frequency; exactly one peak is allowed between the region of the fundamental frequency and the peak amplitude frequency, where the one nearest to the maximum peak is preserved; and the first two peaks above the maximum peak are saved, plus the peak nearest the 16,000 Hz end (or the 25th filter if above 8 kHz), for a total of six peaks saved and stored in microprocessor memory. This guarantees that the maximum peak is always the third peak.
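A sketch of that six-peak selection, assuming the peaks are supplied as (filter number, amplitude) pairs already sorted by frequency and that the first peak approximates the fundamental; the tie-breaking details are assumptions.

```python
def compress_to_six_peaks(peaks):
    """Keep six peaks from a frequency-sorted list of (filter_number, amplitude) pairs:
    the fundamental, the peak nearest below the loudest peak, the loudest peak,
    the first two peaks above it, and the highest-frequency peak."""
    loudest = max(range(len(peaks)), key=lambda i: peaks[i][1])
    fundamental = peaks[0]
    between = peaks[loudest - 1] if loudest >= 2 else None      # nearest peak below the maximum
    above = peaks[loudest + 1:loudest + 3]                      # first two peaks above the maximum
    top_end = peaks[-1]                                         # peak nearest the 25th filter
    kept = [fundamental, between, peaks[loudest], *above, top_end]
    return [p for p in kept if p is not None]

peaks = [(2, 10), (5, 14), (8, 30), (11, 22), (14, 9), (20, 6), (24, 4)]
print(compress_to_six_peaks(peaks))
# -> [(2, 10), (5, 14), (8, 30), (11, 22), (14, 9), (24, 4)]  (loudest peak is the third kept)
```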
Another patent teaches the measurement of the presence or absence of a low frequency vocal component as it relates to physiological stress (U.S. Pat. No. 3,971,034). However, this prior art is not concerned with most of the speech spectrum, and must be calibrated to each individual, meaning that the stress level obtained cannot be compared to a population mean or standard and does not involve normal perceptual style dimensions. Each of these measures one or two specific vocal parameters and then indicates the presence or absence of these. The assumption is made that stress, lying, or proper speaking is, or is not, being exhibited by the user. These inventions do not measure the entire amplitude frequency distribution, determine speech or vocal style elements and dimensions, or relate these to perceptual style dimensions through both a built-in and a user supplied coefficient array.
Another speech analyzer reads lip and face movements, air velocities and acoustical sounds, which are compared and digitally stored and processed (U.S. Pat. No. 3,383,466). A disadvantage is that the sound characteristics are not disassembled and reassembled into speech style elements or dimensions, nor related to perceptual style dimensions. There is a great deal of art relating to speech recognition devices wherein a voice's digital representation is compared to a battery of previously stored ones. Some of these use filters, others use analytic techniques, but none relate normal and typical voice and speech styles to normal and typical perceptual or cognitive style dimensions.
Another technique for analyzing voice involves determining the emotional state of the subject, as disclosed in Fuller, U.S. Pat. Nos. 3,855,416; 3,855,417; and 3,855,418. These analyze the amplitude of the speech, voice vibrato, and relationships between harmonic overtones of higher frequencies. However, these inventions are not concerned with natural and typical voice and speech style elements and dimensions and typical perceptual style dimensions, and are limited to stress measurement and the presence or absence of specific emotional states.
The presence of specific emotional content such as fear, stress, or anxiety, or the probability of lying on specific words, is not of interest to the invention disclosed herein. The invention disclosed herein also is not calibrated to a specific individual, as is typical of the prior art, but rather measures all speakers against one standard because of the inventor's scientific discovery that there exist universal standards of style.
The user can evaluate the similarity of the various vocal style dimensions of his or her voice (in biofeedback mode) or his client's voice (in a therapy setting) to those of target groups such as recording and entertainment stars, successful and unsuccessful people, psychologically dysfunctional people (or a variety of different dysfunctions), self-actualizing people, etc. Any and all naturally occurring groupings of people, occupationally or cognitively, can be assumed to have one or more specific and predictable vocal style components with ranges characteristic of that specific category of people, according to the following citations: Jones, J. M., 98th Meeting: Acoustical Society of America, Fall 1979; Jones, J. M., Differences in the Amplitude-Frequency Distribution of Vocal Energy Among Ph.D. Managers, Engineers, and Enlisted Military Personnel, Masters Thesis, UWF, 1979; Jones, J. M., Voice Style, Perceptual Style and Process Psychology, book in press, 1982; and Jones, J., Vocal Differences Between Members of Two Occupations: An Example of Potential Vocal/Mental Relationships That May Affect Voice Measurement of Pilot Mental Workload, AD-TR-80-57, July 1980.
SUMMARY OF THE INVENTION
A profile of an individual's speech, voice or perceptual profile, derived unobtrusively from any speech or language, or even animal sounds, must involve a time and frequency domain disassembly into fundamental vocal properties, and a reassembly capability into building block elements and finally into dimensions of mental (or, in the case of animals, behavioral) processes related to perception or awareness. This has not even been previously specifically theorized, much less attempted in an invented machine. This invention discloses such a machine. Its application is so broad as to include not only speech therapists and psychologists as potential users, but career counselors, artificial intelligence research, cross-cultural comparative cognitions, even population stabilization properties. The fields needing such a machine are as disparate as all the humanities, as well as industrial psychology, environmental engineering, and mental workload. In short, it should revolutionize both psychological theory and practice as well as the speech sciences, and on to style standardization. The present disclosed invention then accepts vocal sound through a transducer such as a microphone or sound recorder and applies the transduced signal, in parallel, to the bank of 1/3-octave filters, digitizer and microcomputer described above.
3,855,416; 3,855,417; and ters covering the audio frequency range. Setting the 3,855,418. These analyze the amplitude of the speech, center frequency of the lowest filter to any value that voice vibrato, and relationships between harmonic 35 passes the electrical energy representation of the vocal overtones of higher frequencies. However, these invensignal amplitude that includes the lowest vocal fretions are not concerned with natural and typical voice quency signal establishes the center values of all subseand speech style elements and dimensions and typical quent filters up to the last one passing the energy-gener-perceptual style dimensions, and are limited to stress ally between 8k hz to 16k hz or between 10k hz and 20k measurement and the presence or absence of specific 40 Hz, and also determine the exact number of such filters. emotional states. The specific value of the first filter's center frequency is The presence of specific emotional content such as not significant, so long as the lowest tones of the human fear, stress, or anxiety, or the probability of lying on voice is captured, approximately 70 Hz. Essentially any specific words, is not of interest to the invention discommercially available bank is applicable if it can be closed herein. The invention disclosed herein also is not 45 interfaced to any commercially available digitizer and calibrated to a specific individual, such as is typical of then microcomputer. The specification section dethe prior art, but rather measures all speakers against scribes a specific set of center frequencies and microone standard because of the inventor's scientific discovprocessor in the preferred embodiment. The filter qualery that there exist universal standards of style. ity is also not particularly significant because a refine¬ The user can evaluate the similarlity of the various 50 ment algorithm disclosed in the specification brings any vocal style dimensions of his or her voice (in biofeedaverage quality set of filters into acceptable frequency back mode) or his client's voice (in therapy setting) to and implitude values. The ratio i, of course, defines the those of target groups such as recording and entertainband width of all the filters once the center frequencies ment stars, successful and unsuccessful people, psychoare calculated. logically dysfunctional people (or a variety of different 55 Following this segmentation process with filters, the dysfunctions), self-actualizing people, etc. Any and all filter output voltages are digitized by a commercially naturally occurring groupings of people, occupationally available set of digitizers or preferably multiplexer and or cognitively, can be assumed to have one or more digitizer, on in the case of the disclosed preferred emspecific and predictable vocal style components with bodiment, a digitizer built into the same identified comranges characteristic of that specific category of people, 60 mercially available filter bank, to eliminate interfacing according to the following citations. logic and hardware. Again quality of digitizer in terms Jones, J. M., 98th Meeting: Acoustical Society of of speed of conversion or discrimination is not signifiAmerica, Fall 1979; Jones, J. M., Differences in the cant because average presently available commercial Amplitude-Frequency Distribution of Vocal Energy units exceed the requirements needed here, due to a Among Ph.D. 
managers, Engineers, arjd Enlisted Mili- 65 correcting algorithm (see specifications) and the low tary Personnel, Masters Thesis UWF 1979; Voice Style, sample rate necessary.
Perceptual Style and Process Psychology, book in Press Any complex sound that is carrying constantly 1982; and, Jones, J., Vocal Differences Between Mem- changing information can be approximated with a re-
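The six-peak compression rule above is essentially a small selection algorithm. The following is a minimal Python sketch of it, assuming the amplitude peaks have already been located in the filter outputs; the names (Peak, select_six_peaks), the 16 kHz default band edge, and the handling of a missing "in-between" peak are illustrative assumptions rather than the patent's own implementation.

```python
# Sketch of the six-peak compression rule described above (assumed names).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Peak:
    freq_hz: float   # center frequency of the filter where the peak occurred
    amp: float       # digitized amplitude at that filter

def select_six_peaks(peaks: List[Peak], fundamental_hz: float,
                     band_top_hz: float = 16_000.0) -> List[Peak]:
    """Keep up to six peaks; the maximum-amplitude peak ends up third
    whenever an 'in-between' peak exists."""
    peaks = sorted(peaks, key=lambda p: p.freq_hz)
    maximum = max(peaks, key=lambda p: p.amp)

    # 1) the peak nearest the fundamental frequency
    near_fund = min(peaks, key=lambda p: abs(p.freq_hz - fundamental_hz))

    # 2) exactly one peak between the fundamental region and the maximum,
    #    keeping the one closest in frequency to the maximum peak
    between = [p for p in peaks
               if near_fund.freq_hz < p.freq_hz < maximum.freq_hz]
    mid: Optional[Peak] = (min(between, key=lambda p: maximum.freq_hz - p.freq_hz)
                           if between else None)

    # 3) the first two peaks above the maximum peak
    above = [p for p in peaks if p.freq_hz > maximum.freq_hz][:2]

    # 4) the peak nearest the top of the analysed band (e.g. 16 kHz)
    top = min(peaks, key=lambda p: abs(p.freq_hz - band_top_hz))

    kept = [near_fund, mid, maximum, *above, top]
    return [p for p in kept if p is not None]
```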
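The filter-bank construction in the summary can be illustrated the same way: a single ratio J between adjacent center frequencies, anchored near the lowest vocal tones, generates every center frequency and bandwidth up to the top of the audio band. The sketch below assumes a concrete value of J (roughly third-octave spacing) and a simple bandwidth rule purely for illustration; neither number is given in the text above.

```python
# Illustrative sketch (not the patent's reference design) of how a constant
# center-frequency ratio J generates the whole filter bank: start near the
# lowest vocal tones (~70 Hz) and multiply by J until the upper edge of the
# analysed band (8-16 kHz) is reached.
def filter_bank_centers(f_lowest_hz: float = 70.0,
                        j_ratio: float = 1.26,        # assumed spacing
                        f_top_hz: float = 16_000.0) -> list[tuple[float, float]]:
    """Return (center_frequency, bandwidth) pairs for the whole bank."""
    bank = []
    f = f_lowest_hz
    while f <= f_top_hz:
        # the same ratio J that spaces the centers also sets each bandwidth
        bandwidth = f * (j_ratio - 1.0)
        bank.append((f, bandwidth))
        f *= j_ratio
    return bank

# With these assumed values the bank spans 70 Hz to 16 kHz with roughly two
# dozen filters, the same order of magnitude as the 25 filters implied by the
# peak-compression rule above.
print(len(filter_bank_centers()))
```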
Claims (14)
1. A method for indicating emotional attitudes of a speaker according to voice intonation, said method comprising: a. defining a set of words; b. obtaining a database comprising reference intonations and reference emotional attitudes per each word of said set of words; c. repeatedly pronouncing at least one word of said set of words by said speaker; d. recording a plurality of said pronunciations to obtain a signal representing sound magnitude as a function of time; e. processing said signal to obtain voice characteristics comprising a description of sound magnitude as a first function of frequency; f. decoding said voice characteristics to identify an intonation; g. comparing said intonation to said reference intonations; and, h. retrieving at least one of said reference emotional attitudes.
2. A method for indicating emotional attitudes of a speaker according to voice intonation, said method comprising: a. recording a speaker to obtain a sample of speech; b. processing said sample of speech to obtain digital data representing sound magnitude as a function of time; c. processing said digital data to obtain voice characteristics comprising a description of sound magnitude as a first function of frequency; d. decoding said voice characteristics to identify dominant tones; and, e. attributing an emotional attitude to the speaker based on said dominant tones.
3. A method according to claim 1, wherein said step of retrieving further comprises interpolating between emotional attitudes according to intonations.
4. A method according to either claim 1 or claim 2, wherein said step of decoding further comprises calculating a maximum over a range of said first function of frequency to obtain a second function of frequency.
5. A method according to either claim 1 or claim 2, wherein said step of decoding further comprises calculating an average over a range of said first function of frequency to obtain a second function of frequency.
6. A method according to either claim 3 or claim 4, wherein said step of decoding further comprises calculating a maximum over a range of said second function of frequency to obtain an intonation.
7. A method according to claim 1, wherein said step of comparing further comprises calculating the variation of a calculated intonation from at least one of said reference intonations.
8. A method according to claim 7, wherein said step of retrieving further comprises determining the difference of said speaker's emotional attitudes from said reference emotional attitudes.
9. A method for indicating emotional attitudes of an animal that is not a human being, said method comprising: a. defining a set of sounds that said animal emits; b. obtaining a database comprising reference intonations and reference emotional attitudes per each sound of said set of sounds; c. repeatedly producing at least one sound of said set of emitted sounds; d. recording a plurality of said produced sounds to obtain a signal representing sound magnitude as a function of time; e. processing said signal to obtain sound characteristics comprising a description of sound magnitude as a function of frequency; f. decoding said sound characteristics to identify an intonation; g. comparing said intonation to said reference intonations; and, h. retrieving at least one of said reference emotional attitudes.
10. A system for indicating the emotional attitudes of a speaker by automated intonation analysis, said system comprising: a. a sound recorder adapted to record a word that is repeatedly pronounced by said speaker, and to produce a signal representing the recorded sound magnitude as a function of time; b. a first processor, with processing software running on said first processor, adapted for processing said signal and obtaining voice characteristics such as intonation; c. a database comprising an intonation glossary; d. a second processor, with computing software running on said second processor, adapted for collecting a set of predefined words, retrieving relations between reference intonations and reference emotional attitudes per each word of said set of words from said database, connecting to said first processor, comparing said intonation to said reference intonations, and retrieving at least one of said reference emotional attitudes; and e. an indicator connected to either said first or second processor.
11. A method for advertising, marketing, educating, or lie detecting by indicating emotional attitudes of a speaker according to intonation; said method comprising: a. obtaining a database comprising a plurality of vocal objects, especially words or sounds; b. playing-out a first group of at least one of said vocal objects for said speaker to hear; c. replying to said play-out by said speaker; d. recording said reply; e. processing said recording to obtain voice characteristics; f. decoding said voice characteristics to identify an intonation; g. comparing said intonation to said reference intonations; and, h. playing-out a second group of at least one of said vocal objects for said speaker to hear.
12. A system for indicating emotional attitudes of a speaker; said system comprising a glossary of intonations, said glossary relating intonations to emotional attitudes.
13. A system according to claim 12 wherein said glossary comprises any part of the information listed in Table 1 herein.
14. A method of providing remote service by a group comprising at least one observer to at least one speaker; said method comprising: a. identifying the emotional attitude of said speaker according to intonation; and, b. advising said group of observers how to provide said service.
15. A method according to claim 14, wherein said step of advising further comprises selecting at least one of said group of observers to provide the service.
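A minimal sketch, in Python with NumPy, of the processing flow recited in claims 1-2 and 4-6 above: sound magnitude as a function of time is converted to a first function of frequency, reduced to a second function by taking maxima over frequency bands, decoded to a dominant intonation, and compared against a reference glossary. The band edges, the two-entry glossary, the nearest-reference comparison and all function names are illustrative assumptions; the patent's own glossary is the one referred to as Table 1 in claim 13.

```python
# Non-normative sketch of the claimed pipeline with assumed names and values.
import numpy as np

def first_function_of_frequency(signal: np.ndarray, sample_rate: int):
    """Claims 1(e)/2(c): sound magnitude as a function of frequency."""
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, magnitude

def second_function_of_frequency(freqs, magnitude, bands):
    """Claims 4-5: a maximum (or average) over ranges of the first function."""
    out = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        out.append(magnitude[mask].max() if mask.any() else 0.0)
    return np.array(out)

def dominant_intonation(band_values, bands):
    """Claim 6: the maximum over the second function identifies the intonation."""
    lo, hi = bands[int(np.argmax(band_values))]
    return (lo + hi) / 2.0          # represent the intonation by its band center

def attribute_emotion(intonation_hz, glossary):
    """Claims 1(g)-(h), 7-8: compare to reference intonations, retrieve attitude."""
    nearest = min(glossary, key=lambda ref_hz: abs(ref_hz - intonation_hz))
    return glossary[nearest], abs(nearest - intonation_hz)

# Toy usage with a synthetic "utterance" and an assumed two-entry glossary.
sample_rate = 8_000
t = np.arange(0, 1.0, 1.0 / sample_rate)
utterance = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
bands = [(50, 300), (300, 1200), (1200, 4000)]
glossary = {175.0: "calm/confident", 950.0: "stressed/defensive"}  # assumed, not Table 1

freqs, mag = first_function_of_frequency(utterance, sample_rate)
band_vals = second_function_of_frequency(freqs, mag, bands)
tone = dominant_intonation(band_vals, bands)
attitude, deviation = attribute_emotion(tone, glossary)
print(tone, attitude, deviation)
```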
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| IL192367A IL192367A (en) | 2005-12-22 | 2008-06-22 | System for indicating emotional attitudes through intonation analysis and methods thereof |
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US75239905P | 2005-12-22 | 2005-12-22 | |
| PCT/IL2006/001464 WO2007072485A1 (en) | 2005-12-22 | 2006-12-20 | System for indicating emotional attitudes through intonation analysis and methods thereof |
| IL192367A IL192367A (en) | 2005-12-22 | 2008-06-22 | System for indicating emotional attitudes through intonation analysis and methods thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| IL192367A0 IL192367A0 (en) | 2008-12-29 |
| IL192367A true IL192367A (en) | 2014-03-31 |
Family
ID=42617260
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| IL192367A IL192367A (en) | 2005-12-22 | 2008-06-22 | System for indicating emotional attitudes through intonation analysis and methods thereof |
Country Status (1)
| Country | Link |
|---|---|
| IL (1) | IL192367A (en) |
Also Published As
| Publication number | Publication date |
|---|---|
| IL192367A0 (en) | 2008-12-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8078470B2 (en) | System for indicating emotional attitudes through intonation analysis and methods thereof | |
| KR101248353B1 (en) | Speech analyzer detecting pitch frequency, speech analyzing method, and speech analyzing program | |
| US6480826B2 (en) | System and method for a telephonic emotion detection that provides operator feedback | |
| US6697457B2 (en) | Voice messaging system that organizes voice messages based on detected emotion | |
| Eyben et al. | The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing | |
| EP1222448B1 (en) | System, method, and article of manufacture for detecting emotion in voice signals by utilizing statistics for voice signal parameters | |
| US6353810B1 (en) | System, method and article of manufacture for an emotion detection system improving emotion recognition | |
| US6427137B2 (en) | System, method and article of manufacture for a voice analysis system that detects nervousness for preventing fraud | |
| US7606701B2 (en) | Method and apparatus for determining emotional arousal by speech analysis | |
| Seneviratne et al. | Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression. | |
| US20080045805A1 (en) | Method and System of Indicating a Condition of an Individual | |
| Morrison et al. | Introduction to forensic voice comparison | |
| Solomon et al. | Objective methods for reliable detection of concealed depression | |
| Usman et al. | Heart rate detection and classification from speech spectral features using machine learning | |
| RU2559689C2 (en) | Method of determining risk of development of individual's disease by their voice and hardware-software complex for method realisation | |
| JP7307507B2 (en) | Pathological condition analysis system, pathological condition analyzer, pathological condition analysis method, and pathological condition analysis program | |
| Ingrisano et al. | Environmental noise: a threat to automatic voice analysis | |
| van Brenk et al. | Investigating acoustic correlates of intelligibility gains and losses during slowed speech: A hybridization approach | |
| Dubey et al. | Hypernasality Severity Detection Using Constant Q Cepstral Coefficients. | |
| KR20220069854A (en) | Method and device for predicting anxiety and stress using phonetic features | |
| He | Stress and emotion recognition in natural speech in the work and family environments | |
| Grigorev et al. | An Electroglottographic Method for Assessing the Emotional State of the Speaker | |
| IL192367A (en) | System for indicating emotional attitudes through intonation analysis and methods thereof | |
| Le et al. | The use of spectral information in the development of novel techniques for speech-based cognitive load classification | |
| Sigmund et al. | Statistical analysis of glottal pulses in speech under psychological stress |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FF | Patent granted | ||
| KB | Patent renewed | ||
| KB | Patent renewed | ||
| MM9K | Patent not in force due to non-payment of renewal fees |