
WO2009007512A1 - A gesture-controlled music synthesis system - Google Patents


Info

Publication number
WO2009007512A1
Authority
WO
WIPO (PCT)
Prior art keywords
player
cha
musical
computer program
pose
Prior art date
Application number
PCT/FI2008/050421
Other languages
French (fr)
Inventor
Perttu HÄMÄLÄINEN
Original Assignee
Virtual Air Guitar Company Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Virtual Air Guitar Company Oy
Publication of WO2009007512A1

Classifications

    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63HTOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H5/00Musical or noise-producing devices for additional toy effects other than acoustical
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63HTOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H33/00Other toys
    • A63H33/30Imitations of miscellaneous apparatus not otherwise provided for, e.g. telephones, weighing-machines, cash-registers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/46Volume control
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/10Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals
    • A63F2300/1087Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera
    • A63F2300/1093Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by input arrangements for converting player-generated signals into game device control signals comprising photodetecting means, e.g. a camera using visible light
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/8047Music games
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/155User input interfaces for electrophonic musical instruments
    • G10H2220/201User input interfaces for electrophonic musical instruments for movement interpretation, i.e. capturing and recognizing a gesture or a specific kind of movement, e.g. to control a musical instrument
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/155User input interfaces for electrophonic musical instruments
    • G10H2220/441Image sensing, i.e. capturing images or optical patterns for musical purposes or musical control purposes
    • G10H2220/455Camera input, e.g. analyzing pictures from a video camera and using the analysis results as control data

Definitions

  • the present invention is related to gesture control, user interfaces and music synthesis.
  • the present invention discloses a method for creating synthesized music in response to a player's gestures.
  • the method is characterized in that, first, a current player state is recognized. Differences are then calculated between the recognized player state and at least one pre-stored player state, wherein each pre-stored player state is associated with a musical passage. Finally, the playback volumes of the associated musical passages are adjusted according to the calculated differences, so that a small difference yields a high volume and a large difference yields a low volume.
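The difference-to-volume mapping described above can be sketched in Python. The Euclidean feature distance, the `falloff` constant, and the state names are illustrative assumptions, not part of the patent text:

```python
import math

def passage_volumes(current_state, stored_states, falloff=1.0):
    """Map each pre-stored player state to a playback volume.

    A small difference between the recognized state and a stored state
    yields a high volume; a large difference yields a low one.
    States are plain feature vectors (lists of floats).
    """
    volumes = {}
    for name, stored in stored_states.items():
        # Euclidean distance between feature vectors.
        diff = math.dist(current_state, stored)
        # Inverse mapping: volume 1.0 at zero difference, decaying with distance.
        volumes[name] = 1.0 / (1.0 + falloff * diff)
    return volumes
```

The inverse-distance curve is one plausible choice; any monotonically decreasing mapping would satisfy the claim.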
  • the player state comprises a pose.
  • the player state comprises playing speed.
  • the playing speed is determined in relation to the frequency of consecutive trigger gestures performed by the player.
  • such trigger gestures are recognized that mimic the strumming of the strings of a guitar.
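The playing speed derived from consecutive trigger gestures could be estimated as in the following sketch; the sliding-window size and timestamp representation are assumptions:

```python
def playing_speed(trigger_times, window=4):
    """Estimate playing speed (strums per second) from the timestamps
    of the most recent trigger gestures (e.g. downward strumming motions)."""
    recent = trigger_times[-window:]
    if len(recent) < 2:
        return 0.0
    span = recent[-1] - recent[0]
    if span <= 0:
        return 0.0
    # Frequency = number of intervals / elapsed time.
    return (len(recent) - 1) / span
```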
  • the player state comprises the distance between the hands of the player.
  • the method further comprises the step of producing a plurality of pre-stored player states and associated musical passages so that the player states with a relatively high distance between the hands of the player are associated with musical passages that have notes with relatively low pitches.
  • the method further comprises producing the musical passages so that each passage has a tempo that is an integer multiple of a base tempo, and playing the musical passages in sync with each other.
  • the method further comprises the step of playing back a musical accompaniment track that has the same musical key as the musical passages and a tempo that is an integer multiple of the base tempo.
  • the playback of an associated musical passage is stopped if its playback volume is below a threshold value, and a stopped musical passage whose playback volume is above a threshold value is restarted at the playback position where it would be had the musical passage not been stopped.
  • the lengths of musical passages are defined as integer multiples of a base length, and the musical passages are looped consecutively and continuously.
  • a threshold speed for the playing is defined, and playback of a single non-synchronized sound is started with each trigger gesture when playing at slower speed than the threshold speed.
  • the playback of single non-synchronized sounds is started when predefined special poses and gestures are recognized.
  • a pose is recognized among a predetermined discrete set of poses, and only the musical passages, which are associated with the pre-stored player state comprising the same discrete pose as the recognized player state, are played audibly.
  • playing speed is recognized among a predetermined discrete set of playing speed values, and only the musical passages, which are associated with the pre-stored player state comprising the same discrete playing speed as the recognized player state, are played audibly.
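Recognizing the playing speed among a discrete set of values amounts to nearest-value quantization; a minimal sketch follows, where the class names and strums-per-second values are hypothetical:

```python
# Hypothetical strums-per-second values for the discrete speed classes.
SPEED_CLASSES = {"slow": 1.0, "medium": 2.0, "fast": 4.0}

def classify_speed(speed):
    """Map a continuous playing speed onto the nearest discrete speed class,
    so that only licks associated with that class are played audibly."""
    return min(SPEED_CLASSES, key=lambda name: abs(SPEED_CLASSES[name] - speed))
```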
  • instruction images representing predefined poses are displayed for giving movement instructions to the player.
  • the instruction images are highlighted as a function of the distance between the pose of the displayed image and the pose of the player.
  • visual selection objects are provided for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
  • a level of matching is calculated between the instruction images and the player's current visual representation, and the desired pose is chosen as the best match among the instruction images.
  • the player's hands and feet are included in said matching calculation.
  • a limited subset of instruction images is provided at a time, showing the best matches according to the player's current visual representation.
  • the method steps described above are implemented in a form of a computer program.
  • the computer program comprises program code configured to control a data-processing device to perform the previously disclosed method steps where applicable.
  • the computer program is embodied on a computer readable medium.
  • the parameters and additional features mentioned regarding different embodiments of the method are also applied to the program code of the computer program and to the system according to the invention.
  • a system for creating synthesized music in response to a player's gestures comprises a motion tracking tool and data processing means configured to recognize a current player state.
  • the system further comprises a memory configured to store a plurality of musical passages which each comprise an association to pre-stored player states.
  • the system comprises a calculating means configured to calculate differences between the recognized player state and at least one pre-stored player state stored in the memory.
  • the system also comprises a control unit configured to adjust playback volumes of the associated musical passages according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume.
  • the system comprises a sound production device configured to produce the synthesized music according to the control unit output.
  • a further embodiment of the invention is a system comprising a computing device controlled by said computer program.
  • the system further comprises location indicators worn by the player in order to define the locations of the player's hands and possibly feet, and means for capturing the location data with the motion tracking tool.
  • the system further comprises a camera in the motion tracking tool for taking a plurality of pictures of the player, and means for using the picture data for the player state recognition.
  • the system further comprises a screen and graphics rendering means configured to show the picture of the player composited inside three-dimensional computer graphics as a billboard texture that stays facing the virtual camera if the virtual camera moves.
  • a screen and graphics rendering means are configured to display instruction images representing predefined poses for giving movement instructions to the player.
  • visual selection objects are provided for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
  • the calculating means is configured to calculate a level of matching between the instruction images and the player' s current visual representation, and the control unit is configured to choose the desired pose as the best match among the instruction images.
  • the device is a useful virtual musical instrument which provides a new manner of transforming the player's gestures and postures into a realistic audio experience.
  • the invention is useful e.g. for creating a virtual air guitar playing system and corresponding guitar sound creation.
  • Fig. 1 is a simple block diagram of the architecture according to an embodiment of the invention.
  • Fig. 2 is a flow chart of the procedure with the needed apparatus in an embodiment of the invention.
  • Fig. 3 is a flow chart of a more detailed embodiment of the procedure according to the invention.
  • Fig. 4 is a flow chart of another more detailed embodiment of the procedure according to the invention.
  • Fig. 5 is an example of computer graphics rendered by a software based embodiment of the invention.
  • Fig. 6 is a second example of computer graphics rendered by a software based embodiment of the invention.
  • Fig. 7 is a third example of computer graphics rendered by a software based embodiment of the invention.
  • the present invention discloses a method for synthesizing music by playing a so-called air guitar.
  • the system according to the invention is able to capture the gestures and convert them to different licks, that is, musical passages such as a real guitar would produce.
  • the musical passages may be stored in computer memory or on a computer readable medium, for example, as MIDI or as sound waveform data.
  • the present invention allows the user to control a virtual musical instrument without any physical user interface of an actual musical instrument.
  • the physical user interface is exchanged for that of a virtual full-body interface, i.e. controlling the created sound with postures assumed by the user's body, here referred to as poses, and movements of the body, here referred to as gestures.
  • the user may use three controls to select the sound being played. These controls may be for instance the pose, playing speed defined by the frequency of performed trigger gestures, and distance between the left and right hands of the player.
  • player's state is observed with a suitable apparatus.
  • the player state may comprise gesture data of the player and the posture of the player.
  • Gesture data comprises at least the aforementioned distance between the hands and the playing speed.
  • Each combination of the three controls is mapped to a specific musical passage, called a lick, which can be stored in a database, for instance.
  • the database with the lick data can be stored in a memory, which can form part of a computer or be a separate memory unit. For example, standing in a certain pose, playing at slow speed (where the frequency of the player's trigger gestures is low), and holding the hands at a far mutual distance corresponds to one lick. Keeping the other parameters the same but playing at a high speed changes the lick to a different one, as does moving the hands closer to each other. In one embodiment of the invention, changing the pose opens up a new collection of licks for each combination of playing speed and the mutual distance of hands.
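The mapping of control combinations to licks can be pictured as a lookup table. In this sketch the pose names, class labels, and lick identifiers are purely illustrative:

```python
# Hypothetical lick database keyed by (pose, speed class, hand-distance class).
LICKS = {
    ("basic", "slow", "far"):  "lick_01",
    ("basic", "fast", "far"):  "lick_02",
    ("basic", "slow", "near"): "lick_03",
    ("solo",  "fast", "near"): "lick_04",
}

def select_lick(pose, speed_class, distance_class):
    """Return the lick mapped to one combination of the three controls,
    or None if no lick is stored for that combination."""
    return LICKS.get((pose, speed_class, distance_class))
```

Changing any one control (pose, speed, or hand distance) selects a different lick, matching the behaviour described in the text.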
  • each pose can be seen to contain a subset of licks, where each lick is activated by changing playing speed and the hand distance.
  • the reason for this categorization is that changing poses is subjectively a major change, while changes in hand distance and playing speed are minor.
  • the control parameters may be categorized into discrete parameter sets, but the user's movements and gestures are continuous by nature.
  • licks are selected based on the combination of control parameters that the user is nearest to. For example, if the user's position is between two defined poses, the lick is selected based on which pose the user is closer to.
  • selection is performed, which means that the volumes of the other licks are decreased either completely or partially.
  • the volumes may be normalized so that the overall loudness perceived by the player stays constant even though the relative volumes of the licks change.
  • the pose may be described as a plurality of features, such as the locations of the player's hand and feet, body joint angles, or image features computed from an image acquired by a camera or from data provided by some other motion tracking tool, such as the control devices of Nintendo Wii game console.
  • the pose may be determined as one of a predetermined discrete set of reference poses using pattern recognition and classification techniques, such as selecting the predetermined pose with a feature vector closest to the feature vector computed based on the motion tracking data.
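Selecting the predetermined pose with the closest feature vector is a nearest-neighbour classification; a minimal sketch, with reference vectors invented for illustration:

```python
import math

def classify_pose(features, reference_poses):
    """Pick the predetermined pose whose feature vector is closest
    (in Euclidean distance) to the features computed from the
    motion tracking data."""
    return min(reference_poses,
               key=lambda name: math.dist(features, reference_poses[name]))
```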
  • the predetermined poses may include, for example, "low grinding", "basic playing", or "intense solo".
  • Each predetermined pose may have an associated plurality of input image reference areas, and the pose may be determined by computing the overlap of the player's body parts and the reference areas.
  • the poses represent the emotional and expressive qualities of the licks (e.g., "relaxed", "anxious", "heroic"), whereas hand distance and playing speed are more directly mapped to musical parameters, such as the amount and pitch of notes in the lick.
  • Learning to express emotion through playing a real instrument requires considerably more training and fine-motor skill than simply producing the correct notes.
  • the benefit of the present invention is that the player can produce professional sounding music by expressing emotion with full body and controlling the essential musical parameters with simple gestures.
  • frantic solo licks may be played by dropping on one's knees and arching one's back intensely, whereas confident, low, and crunchy licks may be played by assuming a stable low stance with the legs spread widely apart but keeping the upper body relaxed.
  • the player states may be categorized into discrete sets, but the user's movement is continuous by nature.
  • licks are selected based on the player state that the user is nearest to. For example, if the user is between two poses, the lick is selected based on which pose the user is closest to.
  • selection means that the volumes of the other licks are decreased either completely or partially.
  • the playing speed parameter is achieved by detecting the movements of one hand in a top-down motion similar to strumming the strings of a real guitar.
  • the motion tracking tool 11 may be, for example, an ordinary digital camera that is capable of providing images at a certain desired resolution and rate.
  • the motion tracking tool 11 may comprise computing means configured to perform data analysis, such as detecting a pose or filtering the signals.
  • the motion tracking tool 11 may also comprise a number of devices held by or attached to the user 10 that are capable of sensing their location and/or orientation in space.
  • these devices may be gloves which are worn by the virtual guitar player.
  • the gloves can be connected to the computing unit 12 with wires or wirelessly via radio transmission, or the gloves can simply act as passive markers that can be recognized from a picture by computer vision software executed by the computing unit.
  • the computing unit 12 may be, for example, an ordinary computer having sufficient computing power to provide the result at the desired quality level. Furthermore, the computing unit 12 includes common means, such as a processor and a memory, in order to execute a computer program or a computer-implemented method according to the present invention. The computing device also includes storage capacity for storing the recorded music pieces that are played back during use of the present invention.
  • the sound production device 13 may be, for example, an ordinary loudspeaker capable of generating sound waves or a device that is capable of recording audio signals into digital or analog format on a storage device.
  • Figure 2 shows a flow chart of the different functional units in one embodiment of the invention.
  • synthesized musical passages are chosen based on the poses and gestures of the user.
  • the motion tracking tool 21 monitors the user 20 and transmits data concerning the user's body and movements to a player state analysis system 22.
  • the player state analysis system 22 analyses the data and determines the posture of the user and/or gestures which the user is performing.
  • the player state analysis system 22 outputs data that corresponds to the control parameters. These parameters can include, for example, a pose, the speed or rate of playing, the distance between the player's hands, the angle between the player's hands, and/or the locations of the player's hands and/or feet.
  • This data is transmitted to a mixing unit 25, which may be simply a multiplexer, but also, for example, a volume controller.
  • the mixing unit 25 reads the licks 24 (in this example, N pieces of choosable licks) from a lick database 23 and modifies their playback states and playback volumes so that the lick corresponding to the control parameters is being played back.
  • the mixed audio is sent to the sound production device 26 for audio output.
  • Figure 3 discloses a flow chart of a more detailed embodiment according to the invention.
  • the motion tracking tool 31 monitors the player 30, and transmits data to a pose analysis unit 32 and a musical parameter analysis unit 33.
  • the pose analysis unit 32 outputs pose data, such as a pose identifier.
  • the musical parameter analysis unit 33 outputs musical gesture data, such as the hand distance and playing speed.
  • the group selector 36 selects a lick group from the database 35 based on the pose data.
  • the group selector 36 outputs the individual licks that belong to the selected group.
  • the lick mixer 37 selects a lick based on the musical gesture data.
  • the lick mixer 37 outputs the selected lick for audio output 34.
  • Figure 4 discloses a flow chart of yet another embodiment of the invention where the selection of licks proceeds hierarchically.
  • the motion tracking tool 41 monitors the player 40, and transmits data to the pose analysis unit 42, playing speed analysis unit 43 and hand distance analysis unit 44.
  • the pose based selector 47 selects and outputs the licks from the lick database 46 that correspond to the detected pose.
  • the speed based selector 48 selects and outputs the licks that correspond to the detected playing speed.
  • the hand distance based selector 49 selects and outputs the lick that corresponds to the detected hand distance.
  • Units 47, 48 and 49 may also be placed in any other mutual order than the one shown in Figure 4.
  • the selected lick is directed to the audio output device 45.
  • the content of the pre-recorded licks can correspond to the combinations of pose, speed and distance, for example. This means that each combination can be mapped to a certain lick and any chosen lick corresponds to one specific combination of the playing gesture parameters.
  • For each playing speed there are licks containing passages where the recorded instrument is played at different speeds. These speeds may be, for example, musical 4th, 8th and 16th notes, corresponding to slow, medium and fast playing speeds, respectively. More accurately, the passages contain the perception of being played with 4th, 8th and 16th notes, where the actual sounds may contain syncopations or deviations in timing to make them musically rich and interesting while still maintaining the perception of slow, medium or fast playing speed.
  • For each hand distance, the same or similar lick exists with a certain pitch: a broad grip (a large distance between the hands) results in a lower pitch, whereas a narrow grip results in a higher pitch of the same lick.
  • Playing a certain lick at, for example, medium speed and then moving the hands from near to far distance changes the lick from high to low pitch.
  • the licks within one subset of poses are typically fairly similar to each other compared to the licks with a different pose.
  • an accompaniment track can be played throughout the performance.
  • the user's perception is that they are playing the lead instrument in an orchestra or band.
  • the method of selecting discrete licks based on continuous movements is prone to fluctuations.
  • ways of mitigating these fluctuations can be incorporated into the system.
  • the system knows the user's 'distance' to each defined pose at all times.
  • the same theory applies to hand distance and playing speed as well.
  • the second way of the mitigation process is performing a cross-fade between two licks.
  • all licks in the current subset of the pose are being played back, but only one of them is at full volume while the other licks are at zero volume.
  • the playing volume of the two licks related to that control parameter's extreme values changes as a function of the parameter change. For example, moving from a near hand distance towards a far hand distance, the volume of the lick corresponding to the near distance is reduced, and the volume of the lick corresponding to the far distance is increased correspondingly. Since the user cannot be assumed to be proficient with any instrument, the system must ensure that the result sounds musical. This is accomplished by adhering to certain musical rules both in the recording of the musical content and in the real-time control.
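The cross-fade between the two extreme-value licks can be sketched as a linear interpolation. The `near`/`far` distance values and lick names are hypothetical placeholders:

```python
def crossfade_volumes(hand_distance, near=0.3, far=1.0):
    """Linear cross-fade between the licks mapped to a control parameter's
    extreme values: moving from the near towards the far hand distance
    lowers the 'near' lick's volume and raises the 'far' lick's volume."""
    # Normalize the distance into [0, 1] between the two extremes.
    t = (hand_distance - near) / (far - near)
    t = max(0.0, min(1.0, t))
    return {"near_lick": 1.0 - t, "far_lick": t}
```

Because the two volumes always sum to 1.0, the overall loudness stays roughly constant, in line with the volume-normalization embodiment mentioned earlier.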
  • each lick in the database is recorded with a tempo (beats per minute) that is an integer multiple of a base tempo.
  • For example, relative to a lick with 4th notes containing approximately 120 notes, a lick with 8th notes is played twice as fast, containing approximately 240 notes, and a lick with 16th notes is four times as fast (480).
  • the length of each lick is an integer multiple of a base length, e.g. a musical measure, so that memory and disk capacity can be saved by looping the licks (restarting playback once the lick has reached its end) while a steady tempo is maintained.
  • all licks have also been recorded in the same musical key and overall style so that they fit together well.
  • the licks must be synchronized in time. Usually, this is accomplished by playing back all licks at the same time and controlling the volumes of each lick. However, for the sake of efficient calculation, licks to be played at zero volume may be stopped entirely and restarted, when the user picks up one of them. In this case, the playback must be started at a position that is synchronized to the global time.
  • For example, if the global playback position within a lick's loop is 1.317 seconds when a stopped lick is reactivated, the new lick must start playing from the position of 1.317 seconds, not the beginning, and not the position it was previously stopped at. This ensures that each lick is always synchronized to a global tempo, thus ensuring that they fit into the accompaniment track.
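Because lick lengths are integer multiples of a base length and all licks loop against a global clock, the synchronized restart position reduces to a modulo operation; a minimal sketch:

```python
def restart_position(global_time, lick_length):
    """Playback position (seconds) for a restarted lick, synchronized
    to global time.

    A stopped lick must resume where it *would* be had it never been
    stopped, i.e. at the global time modulo the looped lick length."""
    return global_time % lick_length
```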
  • the user may also trigger single, non- synchronized sounds called long notes by playing at a speed even slower than the slowest licks.
  • each playing gesture is interpreted as a trigger for a long note.
  • the recorded content of long notes may be, for example, individual notes or chords that are let ring for a long time.
  • the long notes may also be miscellaneous, non-musical sounds such as explosions or similar kinds of special sounds.
  • triggering a long note immediately mutes any lick being played and begins the playback of a long note from the beginning.
  • the note will continue playing until the user plays another note or chooses to mute the currently playing note with a special gesture.
  • the user may mute the lick by lifting the right hand back up above the centerline, as if on top of an imaginary guitar's strings.
  • An exception to this might be a case where the user moves the right hand horizontally away to the side of the imaginary guitar and only then lifts the hand back up, in which case the note would continue to be played.
  • the pose and hand distance affect which long note is chosen. Typically, there may be 2-6 different hand distances which produce a different note, each of which is perceptually higher or lower in pitch corresponding to the distance.
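Quantizing the hand distance into one of the 2-6 long notes can be sketched as follows; the distance range, note count, and the direction of the pitch mapping use index 0 for the highest-pitched note, all illustrative assumptions:

```python
def long_note_index(hand_distance, num_notes=4, min_d=0.2, max_d=1.0):
    """Quantize the hand distance into one of `num_notes` long notes
    (typically 2-6); a larger distance selects a higher note index,
    here taken to mean a perceptually lower pitch."""
    # Normalize into [0, 1), clamping to the supported distance range.
    t = (hand_distance - min_d) / (max_d - min_d)
    t = max(0.0, min(t, 1.0 - 1e-9))
    return int(t * num_notes)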
  • the user may also trigger single, non-synchronized sounds called special effects with predefined trigger gestures or poses. For example, assuming a certain pose may trigger the sound of an explosion, or moving the right hand in a circular motion may trigger the sound of hitting a drum.
  • An embodiment of the invention may be a gaming system that comprises a screen configured to display the picture of the user captured by the motion tracking tool.
  • the background may be removed from the picture using computer vision software, and the picture may be rendered as a billboard texture inside 3d graphics.
  • the 3d graphics are displayed from the point of view of a virtual camera that can move, and the billboard texture is rotated so that it always faces the camera to maintain the illusion and not reveal the 2d nature of the texture.
  • the present invention may also comprise instruction methods and visualizations that solve the problem of communicating to the player what player states the system is able to recognize.
  • a plurality of instruction images 51, 52, 53, one for each recognized player pose, are visualized on screen in addition to the user 50.
  • feedback provided by the system to the player about the pose analysis is important. The feedback helps the player determine how to move so that the system can detect a desired pose.
  • the instruction images may be modified depending on their difference from the current pose of the player. For example, the instruction image depicting a pose closest to the player may be made larger than the other instruction images.
  • all instruction images may be scaled according to the distances between the poses corresponding to the instruction images and the player's current pose. Additionally, the instruction image depicting a pose closest to the player may be highlighted, e.g., using a glow effect.
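Scaling the instruction images by their pose distance to the player can be sketched as below; the scale range and the inverse-distance closeness measure are illustrative choices:

```python
def image_scales(pose_distances, min_scale=0.6, max_scale=1.2):
    """Scale each instruction image inversely with the distance between
    its pose and the player's current pose, so the closest pose appears
    largest (it could additionally be highlighted, e.g. with a glow)."""
    scales = {}
    for name, d in pose_distances.items():
        # 1/(1+d) maps distance 0 -> 1.0 and large distances -> near 0.
        closeness = 1.0 / (1.0 + d)
        scales[name] = min_scale + (max_scale - min_scale) * closeness
    return scales
```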
  • the pose analysis may comprise the player's visual representation 60 interacting on screen with visual elements, e.g., by touching at least one of a plurality of visual selection objects 61, 62, 63 to define the current pose.
  • the user experience may become similar to the player playing a virtual guitar and selecting a lick group by touching virtual foot pedals on screen, and then modifying the volumes of the licks within the group with other player state information, such as the distance between hands and the playing speed.
  • the pose analysis may comprise displaying a plurality of instruction images 71, 72, 73, 74, each corresponding to a pre-defined pose, and determining the recognized pose by measuring how well the player's current visual representation 70 on screen matches the instruction images.
  • the match may be measured by computing the distance between the on-screen locations of the player's body parts and the corresponding on-screen locations of the body parts of the instruction images.
  • the body parts used in the distance computation may comprise hands and feet.
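The body-part matching just described can be sketched as a summed on-screen distance; the body-part names, coordinate tuples, and image labels are hypothetical:

```python
import math

def pose_match_distance(player_parts, image_parts):
    """Total on-screen distance between the player's tracked body parts
    (hands and feet) and the corresponding parts of one instruction
    image. A smaller value means a better match."""
    return sum(math.dist(player_parts[part], image_parts[part])
               for part in ("left_hand", "right_hand", "left_foot", "right_foot"))

def best_matching_image(player_parts, images):
    """Choose the instruction image the player's current pose matches best."""
    return min(images,
               key=lambda name: pose_match_distance(player_parts, images[name]))
```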
  • some of the instruction images may be hidden part of the time based on, e.g., the player's location on screen or on a virtual stage 75 so that the player may move around to search for the instruction images.
  • the instruction images may move on screen so that they stay close to the player, and to minimize visual clutter, only a subset of the instruction images are visible at a given moment. The visible subset may be determined by computing which instruction images are matching the player's current pose best.

Abstract

A method, software, and system for gesture-controlled music synthesis are disclosed. The user's state is recognized and compared against pre-stored states, each associated with a musical passage. Based on state data, such as pose, frequency of trigger gestures (e.g., strumming the strings of an air guitar), and distance between the user's hands, suitable passages are chosen for playback. The invention lets the user generate high-quality music with no musical training. Typically, the invention is used for creating the illusion of the user playing lead guitar in a rock band, especially in embodiments that comprise displaying a picture of the user on a virtual stage in front of a virtual audience. The user may control simple musical parameters, such as pitch and the number of notes, using simple gestures, and use full-body poses for controlling emotional aspects of the musical output.

Description

A GESTURE-CONTROLLED MUSIC SYNTHESIS SYSTEM

FIELD OF THE INVENTION
The present invention is related to gesture control, user interfaces and music synthesis.
BACKGROUND OF THE INVENTION
Traditional musical instruments, such as the guitar, violin and piano, are very expressive, and thus difficult to learn to play well. Since the beginning of the 20th century, gesture-controlled instruments without any tangible user interfaces have been developed, such as the Theremin (US Patent US1661058). However, many of these interfaces, like the Theremin, have been difficult to learn to play - the Theremin in particular, because highly developed muscular precision is required for producing steady tones and keeping to a musical scale, as even minute variations in the player's hand positions cause changes in the sound.
At the same time, ways of interacting with music have been developed that allow users to produce music creatively without the need for extensive training. Many of these systems are based on combining pre-recorded musical passages on a timeline. Playing back the created arrangement of these passages results in a musical piece. However, these systems are more composition tools than musical instruments, since they cannot be controlled in real time but require a period of preparation before the musical piece can be played back. This represents a significant problem in prior art.
SUMMARY OF THE INVENTION
The present invention discloses a method for creating synthesized music in response to a player's gestures. The method is characterized in that first a current player state is recognized. After that, differences are calculated between the recognized player state and at least one pre-stored player state, wherein the pre-stored player states are each associated with a musical passage. Finally, playback volumes of the associated musical passages are adjusted according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume. In an embodiment of the invention, the player state comprises a pose.
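As an illustrative sketch only (not part of the claimed invention), the distance-to-volume mapping above could be implemented as follows. The function names, the feature-vector representation of a player state, and the inverse-distance falloff curve are all assumptions; the patent specifies only that a small difference yields a high volume and a large difference a low volume.

```python
import math

def state_distance(a, b):
    """Euclidean distance between two player-state feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def playback_volumes(current_state, stored_states, falloff=1.0):
    """Map each pre-stored state's distance from the current state to a
    playback volume in (0, 1]: small distance -> high volume,
    large distance -> low volume."""
    volumes = {}
    for name, state in stored_states.items():
        d = state_distance(current_state, state)
        volumes[name] = 1.0 / (1.0 + falloff * d)
    return volumes
```

For example, with `stored_states = {"slow_far": (0.2, 0.9), "fast_near": (0.8, 0.1)}`, the passage whose stored state best matches the recognized state receives the highest volume.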
In an embodiment of the invention, the player state comprises playing speed. In a further embodiment, the playing speed is determined in relation to the frequency of consecutive trigger gestures performed by the player. In a further embodiment, trigger gestures that mimic the strumming of the strings of a guitar are recognized.
In an embodiment of the invention, the player state comprises the distance between the hands of the player.
In an embodiment of the invention, the method further comprises the step of producing a plurality of pre-stored player states and associated musical passages so that the player states with a relatively high distance between the hands of the player are associated with musical passages that have notes with relatively low pitches.
In an embodiment of the invention, the method further comprises producing the musical passages so that each passage has a tempo that is an integer multiple of a base tempo, and playing the musical passages in sync with each other.
In an embodiment of the invention, the method further comprises the step of playing back a musical accompaniment track that has the same musical key as the musical passages and a tempo that is an integer multiple of the base tempo.
In an embodiment of the invention, the playback of an associated musical passage is stopped if its playback volume is below a threshold value, and a stopped musical passage whose playback volume rises above a threshold value is restarted at the playback position it would have reached had it not been stopped.
In an embodiment of the invention, the lengths of musical passages are defined as integer multiples of a base length, and the musical passages are looped consecutively and continuously.
In an embodiment of the invention, a threshold speed for the playing is defined, and playback of a single non-synchronized sound is started with each trigger gesture when playing at a slower speed than the threshold speed.
In an embodiment of the invention, the playback of single non-synchronized sounds is started when predefined special poses and gestures are recognized.

In an embodiment of the invention, a pose is recognized among a predetermined discrete set of poses, and only the musical passages which are associated with the pre-stored player state comprising the same discrete pose as the recognized player state are played audibly.

In an embodiment of the invention, playing speed is recognized among a predetermined discrete set of playing speed values, and only the musical passages which are associated with the pre-stored player state comprising the same discrete playing speed as the recognized player state are played audibly.
In an embodiment of the present invention, instruction images representing predefined poses are displayed for giving movement instructions to the player.
In an embodiment of the present invention, the instruction images are highlighted as a function of the distance between the pose of the displayed image and the pose of the player.
In an embodiment of the present invention, visual selection objects are provided for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
In an embodiment of the present invention, a level of matching is calculated between the instruction images and the player's current visual representation, and the desired pose is chosen as the best match among the instruction images.
In an embodiment of the present invention, the player's hands and feet are included in said matching calculation.
In an embodiment of the present invention, a limited subset of instruction images is provided at a time, showing the best matches according to the player's current visual representation.

According to a second aspect of the invention, the method steps described above are implemented in the form of a computer program. The computer program comprises program code configured to control a data-processing device to perform the previously disclosed method steps while applicable. In one embodiment of the invention, the computer program is embodied on a computer readable medium. In one embodiment of the invention, the parameters and additional features mentioned regarding different embodiments of the method are also applied to the program code of the computer program and to the system according to the invention.
According to a third aspect of the present invention, a system for creating synthesized music in response to a player's gestures is disclosed. The system comprises a motion tracking tool and data processing means configured to recognize a current player state. The system further comprises a memory configured to store a plurality of musical passages which each comprise an association to pre-stored player states. Furthermore, the system comprises a calculating means configured to calculate differences between the recognized player state and at least one pre-stored player state stored in the memory. The system also comprises a control unit configured to adjust playback volumes of the associated musical passages according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume. Finally, the system comprises a sound production device configured to produce the synthesized music according to the control unit output.
A further embodiment of the invention is a system comprising a computing device controlled by said computer program.
In an embodiment of the invention, the system further comprises location indicators worn by the player in order to define the locations of the player's hands and possibly feet, and means for capturing the location data with the motion tracking tool.
In an embodiment of the invention, the system further comprises a camera in the motion tracking tool for taking a plurality of pictures of the player, and means for using the picture data for the player state recognition. In a further embodiment of the invention, the system further comprises a screen and graphics rendering means configured to show the picture of the player composited inside three-dimensional computer graphics as a billboard texture that stays facing the virtual camera if the virtual camera moves.
In an embodiment of the invention, a screen and graphics rendering means are configured to display instruction images representing predefined poses for giving movement instructions to the player. In an embodiment of the invention, visual selection objects are provided for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
In an embodiment of the invention, the calculating means is configured to calculate a level of matching between the instruction images and the player's current visual representation, and the control unit is configured to choose the desired pose as the best match among the instruction images.

As an advantage of the present invention, the device is a useful virtual musical instrument which provides a new manner of transforming the player's gestures and postures into a realistic audio experience. The invention is useful e.g. for creating a virtual air guitar playing system and corresponding guitar sound creation.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description help to explain the principles of the invention. In the drawings:
Fig. 1 is a simple block diagram of the architecture according to an embodiment of the invention,
Fig. 2 is a flow chart of the procedure with the needed apparatus in an embodiment of the invention,
Fig. 3 is a flow chart of a more detailed embodiment of the procedure according to the invention,
Fig. 4 is a flow chart of another more detailed embodiment of the procedure according to the invention,
Fig. 5 is an example of computer graphics rendered by a software based embodiment of the invention,
Fig. 6 is a second example of computer graphics rendered by a software based embodiment of the invention, and
Fig. 7 is a third example of computer graphics rendered by a software based embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings.
The present invention discloses a method for synthesizing music by playing a so-called air guitar. By using different playing gestures, the system according to the invention is able to capture the gestures and convert them to different licks, that is, musical passages such as a real guitar would produce. In an embodiment of the invention, the musical passages may be stored in computer memory or on a computer readable medium, for example, as MIDI or as sound waveform data.
The present invention allows the user to control a virtual musical instrument without any physical user interface of an actual musical instrument. The physical user interface is exchanged for a virtual full-body interface, i.e. the created sound is controlled with postures assumed by the user's body, here referred to as poses, and movements of the body, here referred to as gestures. While using the method according to an example of the present invention, the user may use three controls to select the sound being played. These controls may be, for instance, the pose, the playing speed defined by the frequency of performed trigger gestures, and the distance between the left and right hands of the player. In other words, the player's state is observed with a suitable apparatus. The player state may comprise gesture data of the player and the posture of the player. Gesture data comprises at least the aforementioned distance between the hands and the playing speed.
Each combination of the three controls is mapped to a specific musical passage, called a lick, which can be stored in a database, for instance. The database with the lick data can be stored in a memory, which can form a part of a computer or can be a separate memory unit. For example, standing in a certain pose, playing at slow speed (where the frequency of the player's trigger gestures is low), and holding the hands at a far mutual distance corresponds to one lick. Keeping the other parameters the same but playing at a high speed changes the lick to a different one, as does moving the hands closer to each other. In one embodiment of the invention, changing the pose opens up a new collection of licks for each combination of playing speed and the mutual distance of the hands. Thus, each pose can be seen to contain a subset of licks, where each lick is activated by changing the playing speed and the hand distance. The reason for this categorization is that changing poses is subjectively a major change, while the hand distance and playing speed are minor changes.
The control parameters may be categorized into discrete parameter sets, but the user's movements and gestures are continuous by nature. Thus, in one embodiment of the invention, licks are selected based on the combination of control parameters which the user is the nearest to. For example, if the user's position is between the two defined poses, the lick is selected based on which pose the user is closer to. In one embodiment of the invention, selection is performed which means that the volumes of the other licks are decreased either completely or partially. In a further embodiment, the volumes may be normalized so that the overall loudness perceived by the player stays constant even though the relative volumes of the licks change.
The pose may be described as a plurality of features, such as the locations of the player's hands and feet, body joint angles, or image features computed from an image acquired by a camera or from data provided by some other motion tracking tool, such as the control devices of the Nintendo Wii game console. Furthermore, the pose may be determined as one of a predetermined discrete set of reference poses using pattern recognition and classification techniques, such as selecting the predetermined pose with a feature vector closest to the feature vector computed from the motion tracking data. In the case of a virtual guitar, the predetermined poses may include, for example, "low grinding", "basic playing", or "intense solo". Each predetermined pose may have an associated plurality of input image reference areas, and the pose may be determined by computing the overlap of the player's body parts and the reference areas.

Typically, the poses represent the emotional and expressive qualities of the licks (e.g., "relaxed", "anxious", "heroic"), whereas hand distance and playing speed are more directly mapped to musical parameters, such as the number and pitch of notes in the lick. Learning to express emotion through playing a real instrument requires considerably more training and fine-motor skill than simply producing the correct notes. The benefit of the present invention is that the player can produce professional-sounding music by expressing emotion with the full body and controlling the essential musical parameters with simple gestures. For example, if the virtual instrument is a guitar, frantic solo licks may be played by dropping on one's knees and arching one's back intensely, whereas confident, low, and crunchy licks may be played by assuming a stable low stance with the legs spread widely apart while keeping the upper body relaxed.
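The nearest-feature-vector classification mentioned above can be sketched as follows. This is a minimal illustration; the reference pose names are taken from the examples in the text, while the function name and the use of squared Euclidean distance are assumptions (the specification allows any pattern recognition technique).

```python
def classify_pose(features, reference_poses):
    """Return the name of the reference pose whose feature vector is
    closest (squared Euclidean distance) to the measured features."""
    best_name, best_dist = None, float("inf")
    for name, ref in reference_poses.items():
        d = sum((x - y) ** 2 for x, y in zip(features, ref))
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name
```

A real system would compute the feature vector from camera images or controller data; here any fixed-length numeric tuple stands in for it.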
The player states (pose and gesture data) may be categorized into discrete sets, but the user's movement is continuous by nature. Thus, licks are selected based on the player state that the user is nearest to. For example, if the user is between two poses, the lick is selected based on which pose the user is closest to. Here, selection means that the volumes of the other licks are decreased either completely or partially.
In an example according to the invention, the playing speed parameter is obtained by detecting the movements of one hand in a top-down motion similar to strumming the strings of a real guitar.

Referring now to the drawings, an embodiment of an apparatus according to the present invention is disclosed in Figure 1. The apparatus comprises a motion tracking tool 11, a computing unit 12 and a sound production device 13, while a user 10 is located in the field of view of the motion tracking tool 11.
The motion tracking tool 11 may be, for example, an ordinary digital camera that is capable of providing images at a certain desired resolution and rate. In an embodiment of the invention, the motion tracking tool 11 may comprise computing means configured to perform data analysis, such as detecting a pose or filtering the signals. The motion tracking tool 11 may also comprise a number of devices held by or attached to the user 10 that are capable of sensing their location and/or orientation in space. In an example, these devices may be gloves which are worn by the virtual guitar player. The gloves can be connected to the computing unit 12 with wires or wirelessly via radio transmission, or the gloves can simply act as passive markers that can be recognized from a picture by computer vision software executed by the computing unit.
The computing unit 12 may be, for example, an ordinary computer having sufficient computing power to provide the result at the desired quality level. Furthermore, the computing unit 12 includes common means, such as a processor and a memory, in order to execute a computer program or a computer-implemented method according to the present invention. Furthermore, the computing device includes storage capacity for storing the recorded music pieces that are played back in the usage of the present invention. The sound production device 13 may be, for example, an ordinary loudspeaker capable of generating sound waves, or a device that is capable of recording audio signals into digital or analog format on a storage device.
Figure 2 discloses a flow chart disclosing different functional units in one embodiment of the invention. Generally speaking, synthesized musical passages are chosen based on the poses and gestures of the user. The motion tracking tool 21 monitors the user 20 and transmits data concerning the user's body and movements to a player state analysis system 22. The player state analysis system 22 analyses the data and determines the posture of the user and/or gestures which the user is performing. The player state analysis system 22 outputs data that corresponds to the control parameters. These parameters can include, for example, a pose, the speed or rate of playing, the distance between the player's hands, the angle between the player's hands, and/or the locations of the player's hands and/or feet. This data is transmitted to a mixing unit 25, which may be simply a multiplexer, but also, for example, a volume controller. The mixing unit 25 reads the licks 24 (in this example, N choosable licks) from a lick database 23 and modifies their playback states and playback volumes so that the lick corresponding to the control parameters is being played back. The mixed audio is sent to the sound production device 26 for audio output.
Figure 3 discloses a flow chart of a more detailed embodiment according to the invention. The motion tracking tool 31 monitors the player 30, and transmits data to a pose analysis unit 32 and a musical parameter analysis unit 33. The pose analysis unit 32 outputs pose data, such as a pose identifier. The musical parameter analysis unit 33 outputs musical gesture data, such as the hand distance and playing speed. The group selector 36 selects a lick group from the database 35 based on the pose data. The group selector 36 outputs the individual licks that belong to the selected group. The lick mixer 37 selects a lick based on the musical gesture data. The lick mixer 37 outputs the selected lick for audio output 34.
Figure 4 discloses a flow chart of yet another embodiment of the invention where the selection of licks proceeds hierarchically. The motion tracking tool 41 monitors the player 40, and transmits data to the pose analysis unit 42, playing speed analysis unit 43 and hand distance analysis unit 44. The pose based selector 47 selects and outputs the licks from the lick database 46 that correspond to the detected pose. The speed based selector 48 selects and outputs the licks that correspond to the detected playing speed. The hand distance based selector 49 selects and outputs the lick that corresponds to the detected hand distance. Units 47, 48 and 49 may also be placed to any other mutual order than the one shown in Figure 4. Finally, the selected lick is directed to the audio output device 45.
Generally in the present invention, the content of the pre-recorded licks can correspond to the combinations of pose, speed and distance, for example. This means that each combination can be mapped to a certain lick and any chosen lick corresponds to one specific combination of the playing gesture parameters. For each playing speed, there are licks containing passages, where the recorded instrument is played at different speeds. These speeds may be, for example, musical 4th, 8th and 16th notes, corresponding to slow, medium and fast playing speeds, respectively. More accurately, the passages contain the perception of being played with 4th, 8th and 16th notes where the actual sounds may contain syncopations or deviations in timing to make them musically rich and interesting while still maintaining the perception of slow, medium or fast playing speed.
For each mutual hand distance (for example, corresponding to a broad or narrow grip of the guitar), the same or a similar lick exists with a certain pitch. For example, a broad grip (large distance between the hands) results in a lower pitch and a narrow grip results in a higher pitch of the same lick. Playing a certain lick at, for example, medium speed and then moving the hands from a near to a far distance changes the lick from high to low pitch.
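The grip-width-to-pitch mapping can be sketched as a simple quantization of a normalized hand distance into a small set of pitch variants. The function name, the normalization to [0, 1], and the number of variants are assumptions for illustration; the specification only states that a broader grip selects a lower pitch.

```python
def pitch_variant(hand_distance, n_variants=3):
    """Select a pitch variant of the current lick from a normalized
    hand distance in [0, 1]: a broad grip (large distance) picks the
    lowest-pitched variant, a narrow grip the highest.
    Variant 0 is the lowest pitch, n_variants - 1 the highest."""
    idx = int((1.0 - hand_distance) * n_variants)
    return min(max(idx, 0), n_variants - 1)
```

With three variants, a fully broad grip selects variant 0 (lowest pitch) and a fully narrow grip variant 2 (highest pitch), mirroring how a longer string sounds lower on a real guitar.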
Because changing poses represents a major change, the licks within one subset of poses are typically fairly similar to each other compared to the licks with a different pose.
In addition to the licks, an accompaniment track can be played throughout the performance. Thus, the user's perception is that they are playing the lead instrument in an orchestra or band. The method of selecting discrete licks based on continuous movements is prone to fluctuations. Thus, ways of mitigating these fluctuations can be incorporated into the system.
In one example taking this issue into account, the system knows the user's 'distance' to each defined pose at all times. Let x be a normalized distance measure so that x = 0.0 indicates the user matching the first pose and x = 1.0 indicates the user matching the second pose. If the user is located at x = 0.5, that is, equally far away from both poses, minor fluctuations to either side can cause the system to change between two licks very rapidly. The same applies to hand distance and playing speed as well. The first way of mitigation is to apply a hysteresis filter to the pose distance data so that the change from one lick to the next happens only after the distance has passed well past the midpoint towards the other pose. Thus, if the user hovers around x = 0.5, no accidental switching will occur.
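The hysteresis filter described above can be sketched as follows. The class name and the margin value are assumptions; the specification only requires that switching happens well past the midpoint.

```python
class HysteresisSelector:
    """Switch between two licks based on a normalized distance x in
    [0, 1]; the switch happens only after x passes well beyond the
    midpoint, so hovering near x = 0.5 causes no rapid toggling."""

    def __init__(self, margin=0.15):
        self.margin = margin   # how far past 0.5 x must travel to switch
        self.current = 0       # 0 = first lick, 1 = second lick

    def update(self, x):
        if self.current == 0 and x > 0.5 + self.margin:
            self.current = 1
        elif self.current == 1 and x < 0.5 - self.margin:
            self.current = 0
        return self.current
```

The same selector can be applied independently to the hand distance and playing speed parameters.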
The second way of mitigation is to perform a cross-fade between two licks. At any given moment, all licks in the current pose's subset are being played back, but only one of them is at full volume while the other licks are at zero volume. When one of the control parameters changes, the playing volumes of the two licks related to that control parameter's extreme values change as a function of the parameter change. For example, when moving from a near hand distance towards a far hand distance, the volume of the lick corresponding to the near distance is reduced, and the volume of the lick corresponding to the far distance is increased correspondingly.

Since the user cannot be assumed to be proficient with any instrument, the system must ensure that the result sounds musical. This is accomplished by adhering to certain musical rules both in the recording of the musical content and in the real-time control.
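The cross-fade above can be sketched as an equal-power fade. The specific curve (sine/cosine) is an assumption — the patent does not name a fade law — but it keeps the perceived overall loudness roughly constant, matching the volume normalization mentioned earlier in the text.

```python
import math

def crossfade_volumes(x):
    """Equal-power cross-fade for a control parameter x in [0, 1]:
    x = 0 plays only the first lick, x = 1 only the second, and the
    summed power of the two volumes stays constant in between."""
    near = math.cos(x * math.pi / 2)   # volume of the first lick
    far = math.sin(x * math.pi / 2)    # volume of the second lick
    return near, far
```

A linear fade (`1 - x`, `x`) would also work but produces a perceived loudness dip around the midpoint, which is why the equal-power curve is a common choice in audio mixing.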
In an embodiment of the invention, each lick in the database is recorded with a tempo (beats per minute) that is an integer multiple of a base tempo. For a tempo of 120 4th notes per minute, a lick with 4th notes contains approximately 120 notes in a minute. A lick with 8th notes is played twice as fast, containing approximately 240 notes, and a lick with 16th notes is four times as fast (480).
In an embodiment of the invention, the length of each lick is an integer multiple of a base length, e.g. a musical measure, so that memory and disk capacity can be saved by looping the licks (restarting playback once a lick has reached its end) while a steady tempo is maintained. In a further embodiment, all licks have also been recorded in the same musical key and overall style so that they fit together well.
To maintain tempo and overall musical appeal during actual playback, the licks must be synchronized in time. Usually, this is accomplished by playing back all licks at the same time and controlling the volumes of each lick. However, for the sake of efficient calculation, licks to be played at zero volume may be stopped entirely and restarted, when the user picks up one of them. In this case, the playback must be started at a position that is synchronized to the global time.
For example, if the user changes the lick 1.317 seconds after the lick has started, the new lick must start playing from the position of 1.317 seconds, not the beginning, and not the position it was previously stopped at. This ensures that each lick is always synchronized to a global tempo, thus ensuring that they fit into the accompaniment track.
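The synchronized restart can be sketched with simple modular arithmetic over the global playback time, as in the 1.317-second example above. The helper names are assumptions; `measure_length` also illustrates how a base length in seconds follows from the base tempo.

```python
def restart_position(global_time, lick_length):
    """Playback position (seconds) at which a stopped, looping lick
    must be restarted so it stays in sync with the global tempo."""
    return global_time % lick_length

def measure_length(base_tempo_bpm, beats_per_measure=4):
    """Length of one musical measure in seconds at the base tempo,
    e.g. 2.0 s for a 4/4 measure at 120 beats per minute."""
    return beats_per_measure * 60.0 / base_tempo_bpm
```

Switching licks 1.317 seconds into a 2-second measure thus starts the new lick at position 1.317, not at its beginning, keeping it locked to the accompaniment track.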
In one embodiment of the invention, in addition to licks, the user may also trigger single, non-synchronized sounds called long notes by playing at a speed even slower than the slowest licks. In this case, each playing gesture is interpreted as a trigger for a long note. The recorded content of long notes may be, for example, individual notes or chords that are let ring for a long time. The long notes may also be miscellaneous, non-musical sounds such as explosions or similar kinds of special sounds.
For example, triggering a long note immediately mutes any lick being played and begins the playback of a long note from the beginning. The note will continue playing until the user plays another note or chooses to mute the currently playing note with a special gesture.
For example, the user may mute the note by lifting the right hand back up above the centerline, as if on top of an imaginary guitar's strings. An exception to this might be a case where the user moves the right hand horizontally away to the side of the imaginary guitar and only then lifts the hand back up, in which case the note would continue to be played.
The pose and hand distance affect which long note is chosen. Typically, there may be 2-6 different hand distances which produce a different note, each of which is perceptually higher or lower in pitch corresponding to the distance.
In addition to licks and long notes, the user may also trigger single, non-synchronized sounds called special effects with predefined trigger gestures or poses. For example, assuming a certain pose may trigger the sound of an explosion, or moving the right hand in a circular motion may trigger the sound of hitting a drum.
An embodiment of the invention may be a gaming system that comprises a screen configured to display the picture of the user captured by the motion tracking tool. In order to provide a convincing illusion of the user playing on a virtual stage in front of a virtual audience, the background may be removed from the picture using computer vision software, and the picture may be rendered as a billboard texture inside 3d graphics. In a further embodiment, the 3d graphics are displayed from the point of view of a virtual camera that can move, and the billboard texture is rotated so that it always faces the camera to maintain the illusion and not reveal the 2d nature of the texture.

The present invention may also comprise instruction methods and visualizations that solve the problem of communicating to the player what player states (poses, hand distances and playing speeds) the system is able to recognize and for which sound content has been prerecorded or can be generated in real time. In an embodiment of the invention shown in Figure 5, a plurality of instruction images 51, 52, 53, one for each recognized player pose, are visualized on screen in addition to the user 50. In an interactive system, the feedback provided by the system to the player about the pose analysis is important. The feedback helps the player determine how to move for allowing the system to detect a desired pose. To provide stronger feedback than the audible changes in the musical content, the instruction images may be modified depending on their difference from the current pose of the player. For example, the instruction image depicting a pose closest to the player may be made larger than the other instruction images. In an alternative embodiment, all instruction images may be scaled according to the distances between the poses corresponding to the instruction images and the player's current pose. Additionally, the instruction image depicting a pose closest to the player may be highlighted, e.g., using a glow effect.
In a further embodiment of the present invention shown in Figure 6, the pose analysis may comprise the player's visual representation 60 interacting on screen with visual elements, e.g., by touching at least one of a plurality of visual selection objects 61, 62, 63 to define the current pose. In this case, the user experience may become similar to the player playing a virtual guitar and selecting a lick group by touching virtual foot pedals on screen, and then modifying the volumes of the licks within the group with other player state information, such as the distance between the hands and the playing speed.
In an embodiment of the invention shown in Figure 7, the pose analysis may comprise displaying a plurality of instruction images 71, 72, 73, 74, each corresponding to a pre-defined pose, and determining the recognized pose by measuring how well the player's current visual representation 70 on screen matches the instruction images. In Figure 7, the instruction image 74
(illustrated as the outline of a body silhouette) is the best match for the player's visual representation 70
(illustrated as a solid black silhouette). For example, if the player's video image is shown on screen together with the instruction images, the match may be measured by computing the distance between the on-screen locations of the player's body parts and the corresponding on-screen locations of the body parts of the instruction images. For example, the body parts used in the distance computation may comprise hands and feet. In a further embodiment, some of the instruction images may be hidden part of the time based on, e.g., the player's location on screen or on a virtual stage 75 so that the player may move around to search for the instruction images. In another embodiment, the instruction images may move on screen so that they stay close to the player, and to minimize visual clutter, only a subset of the instruction images are visible at a given moment. The visible subset may be determined by computing which instruction images match the player's current pose best.
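The body-part matching may be sketched as below. The choice of hands and feet as the compared parts follows the text; the summed Euclidean distance as the match score and the size of the visible subset are illustrative assumptions.

```python
import math

def match_score(player_parts, instruction_parts):
    # Sum of on-screen distances between corresponding body parts
    # (e.g. left hand, right hand, left foot, right foot).
    # A lower score means a better match.
    return sum(math.dist(p, q) for p, q in zip(player_parts, instruction_parts))

def best_matching_image(player_parts, instruction_images):
    # Index of the instruction image whose body-part locations are
    # closest to the player's current visual representation.
    scores = [match_score(player_parts, img) for img in instruction_images]
    return scores.index(min(scores))

def visible_subset(player_parts, instruction_images, count=2):
    # Indices of the `count` best-matching images, e.g. to limit
    # visual clutter by hiding the rest.
    ranked = sorted(range(len(instruction_images)),
                    key=lambda i: match_score(player_parts, instruction_images[i]))
    return ranked[:count]
```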
It is obvious to a person skilled in the art that with the advancement of technology, the basic idea of the invention may be implemented in various ways. The invention and its embodiments are thus not limited to the examples described above; instead, they may vary within the scope of the claims.

Claims

1. A method for creating synthesized music in response to a player's gestures, characterized in that the method comprises the steps of: recognizing a current player state; calculating differences between the recognized player state and at least one pre-stored player state, wherein the pre-stored player states are each associated with a musical passage; and adjusting playback volumes of the associated musical passages according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume.
2. The method according to claim 1, characterized in that the player state comprises a pose.
3. The method according to any of preceding claims 1 - 2, characterized in that the player state comprises playing speed.
4. The method according to claim 3, characterized in that the method further comprises the step of: determining the playing speed in relation to the frequency of consecutive trigger gestures performed by the player.
5. The method according to claim 4, characterized in that the method further comprises the step of: recognizing trigger gestures that mimic the strumming of the strings of a guitar.
6. The method according to any of preceding claims 1 - 5, characterized in that the player state comprises the distance between the hands of the player.
7. The method according to any of preceding claims 1 - 6, characterized in that the method further comprises the step of: producing a plurality of pre-stored player states and associated musical passages so that the player states with a relatively high distance between the hands of the player are associated with musical passages that have notes with relatively low pitches.
8. The method according to any of preceding claims 1 - 7, characterized in that the method further comprises the steps of: producing the musical passages so that each passage has a tempo that is an integer multiple of a base tempo; and playing the musical passages in sync with each other.
9. The method according to any of preceding claims 1 - 8, characterized in that the method further comprises the step of: playing back a musical accompaniment track that has the same musical key as the musical passages and a tempo that is an integer multiple of the base tempo.
10. The method according to any of preceding claims 1 - 9, characterized in that the method further comprises the steps of: stopping the playback of an associated musical passage if its playback volume is below a threshold value; and restarting a stopped musical passage if its playback volume is above a threshold value, and setting the playback position of the restarted musical passage to where it would be if the musical passage had not been stopped.
11. The method according to any of preceding claims 1 - 10, characterized in that the method further comprises the step of: displaying instruction images representing predefined poses for giving movement instructions to the player.
12. The method according to claim 11, characterized in that the method further comprises the step of: highlighting the instruction images as a function of the distance between the pose of the displayed image and the pose of the player.
13. The method according to any of preceding claims 1 - 12, characterized in that the method further comprises the step of: providing visual selection objects for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
14. The method according to any of preceding claims 11-13, characterized in that the method further comprises the steps of: calculating a level of matching between the instruction images and the player's current visual representation; and choosing the desired pose as the best match among the instruction images.
15. The method according to claim 14, characterized in that the method further comprises the step of: including player's hands and feet in said matching calculation.
16. The method according to any of preceding claims 11-15, characterized in that the method further comprises the step of: providing a limited subset of instruction images at a time which show the best matches according to player's current visual representation.
17. A computer program for creating synthesized music in response to a player's gestures, characterized in that the computer program comprises program code configured to control a data-processing device to perform: recognizing a current player state; calculating differences between the recognized player state and at least one pre-stored player state, wherein the pre-stored player states are each associated with a musical passage; and adjusting playback volumes of the associated musical passages according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume.
18. The computer program according to claim 17, characterized in that the computer program further comprises the step of: controlling the determining of the playing speed in relation to the frequency of consecutive trigger gestures performed by the player.
19. The computer program according to any of preceding claims 17 - 18, characterized in that the computer program further comprises the step of: controlling the producing of a plurality of pre-stored player states and associated musical passages so that the player states with a relatively high distance between the hands of the player are associated with musical passages that have notes with relatively low pitches.
20. The computer program according to any of preceding claims 17 - 19, characterized in that the computer program further comprises the steps of: controlling the producing of the musical passages so that each passage has a tempo that is an integer multiple of a base tempo; and controlling the playing of the musical passages in sync with each other.
21. The computer program according to any of preceding claims 17 - 20, characterized in that the computer program further comprises the step of: controlling the playback of a musical accompaniment track that has the same musical key as the musical passages and a tempo that is an integer multiple of the base tempo.
22. The computer program according to any of preceding claims 17 - 21, characterized in that the computer program further comprises the steps of: controlling the stopping of the playback of an associated musical passage if its playback volume is below a threshold value; and controlling the restarting of a stopped musical passage if its playback volume is above a threshold value, and setting the playback position of the restarted musical passage to where it would be if the musical passage had not been stopped.
23. The computer program according to any of preceding claims 17 - 22, characterized in that the computer program further comprises the step of: controlling the displaying of instruction images representing predefined poses for giving movement instructions to the player.
24. The computer program according to claim 23, characterized in that the computer program further comprises the step of: controlling the highlighting of the instruction images as a function of the distance between the pose of the displayed image and the pose of the player.
25. The computer program according to any of preceding claims 17 - 24, characterized in that the computer program further comprises the step of: providing visual selection objects for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
26. The computer program according to any of preceding claims 23-25, characterized in that the computer program further comprises the steps of: calculating a level of matching between the instruction images and the player's current visual representation; and choosing the desired pose as the best match among the instruction images.
27. The computer program according to claim 26, characterized in that the computer program further comprises the step of: controlling the providing of a limited subset of instruction images at a time which show the best matches according to player's current visual representation.
28. The computer program according to any of preceding claims 17 - 27, characterized in that the computer program is embodied on a computer readable medium.
29. A system for creating synthesized music in response to a player's gestures, characterized in that the system comprises: a motion tracking tool (11, 21) and data processing means configured to recognize a current player state; a memory (23) configured to store a plurality of musical passages which each comprise an association to pre-stored player states; a calculating means (12) configured to calculate differences between the recognized player state and at least one pre-stored player state stored in the memory (23); a control unit (25) configured to adjust playback volumes of the associated musical passages according to the calculated differences so that a small difference yields a high volume and a large difference yields a low volume; and a sound production device (13, 26) configured to produce the synthesized music according to the control unit (25) output.
30. The system according to claim 29, characterized in that the system further comprises: location indicators worn by the player in order to define the locations of player's hands; and means for capturing the location data with the motion tracking tool (11, 21).
31. The system according to any of preceding claims 29 - 30, characterized in that the system further comprises: a camera in the motion tracking tool (11, 21) for taking a plurality of pictures of the player; and means (11, 12, 21, 22) for using the picture data for the player state recognition.
32. The system according to claim 31, characterized in that the system further comprises: a screen and graphics rendering means configured to show the picture of the player composited inside three-dimensional computer graphics as a billboard texture that stays facing the virtual camera when the virtual camera moves.
33. The system according to any of preceding claims 29 - 32, characterized in that the system further comprises: a screen and graphics rendering means configured to display instruction images (51, 52, 53) representing predefined poses for giving movement instructions to the player.
34. The system according to any of preceding claims 29 - 33, characterized in that the system further comprises: visual selection objects (61, 62, 63) for allowing the player to choose a desired pose by virtually touching the visual selection objects on screen.
35. The system according to any of preceding claims 29 - 34, characterized in that the system further comprises: calculating means (12) configured to calculate a level of matching between the instruction images (71, 72, 73, 74) and the player's current visual representation (70); and control unit (25) configured to choose the desired pose as the best match among the instruction images (74).
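As an illustration of the volume adjustment recited in claim 1, where a small difference yields a high volume and a large difference yields a low volume, one possible mapping uses normalized inverse-distance weights. Both the formula and the flat-tuple state representation are assumptions made for this sketch; the claims do not prescribe any particular mapping.

```python
import math

def mix_volumes(current_state, stored_states, epsilon=1e-6):
    # Map each pre-stored player state to a playback volume in [0, 1]:
    # the smaller the difference to the current state, the higher the
    # volume. States are flat tuples (e.g. pose coordinates, hand
    # distance, playing speed).
    diffs = [math.dist(current_state, s) for s in stored_states]
    weights = [1.0 / (d + epsilon) for d in diffs]
    total = sum(weights)
    return [w / total for w in weights]
```

A passage whose volume falls below a threshold could then be stopped, and later restarted at the position it would have reached had it kept playing, as in claim 10.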
PCT/FI2008/050421 2007-07-09 2008-07-09 A gesture-controlled music synthesis system WO2009007512A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20075530 2007-07-09
FI20075530A FI20075530A0 (en) 2007-07-09 2007-07-09 Gesture-controlled music synthesis system

Publications (1)

Publication Number Publication Date
WO2009007512A1 true WO2009007512A1 (en) 2009-01-15

Family

ID=38331618

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2008/050421 WO2009007512A1 (en) 2007-07-09 2008-07-09 A gesture-controlled music synthesis system

Country Status (2)

Country Link
FI (1) FI20075530A0 (en)
WO (1) WO2009007512A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2494432B1 (en) 2009-10-27 2019-05-29 Harmonix Music Systems, Inc. Gesture-based user interface

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4658427A (en) * 1982-12-10 1987-04-14 Etat Francais Represente Per Le Ministre Des Ptt (Centre National D'etudes Des Telecommunications) Sound production device
WO1993022762A1 (en) * 1992-04-24 1993-11-11 The Walt Disney Company Apparatus and method for tracking movement to generate a control signal
US6506969B1 (en) * 1998-09-24 2003-01-14 Medal Sarl Automatic music generating method and device
WO2005094958A1 (en) * 2004-03-23 2005-10-13 Harmonix Music Systems, Inc. Method and apparatus for controlling a three-dimensional character in a three-dimensional gaming environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KARJALAINEN M. ET AL.: "Virtual Air Guitar", AUDIO ENGINEERING SOCIETY, CONVENTION PAPER, SAN FRANCISCO, 28 October 2004 (2004-10-28) - 31 October 2004 (2004-10-31) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103329145A (en) * 2010-11-25 2013-09-25 无线电广播技术研究所有限公司 Method and assembly for improved audio signal presentation of sounds during a video recording
US9240213B2 (en) 2010-11-25 2016-01-19 Institut Fur Rundfunktechnik Gmbh Method and assembly for improved audio signal presentation of sounds during a video recording
WO2012069614A1 (en) * 2010-11-25 2012-05-31 Institut für Rundfunktechnik GmbH Method and assembly for improved audio signal presentation of sounds during a video recording
CN103329145B (en) * 2010-11-25 2017-06-27 无线电广播技术研究所有限公司 Method and assembly for improved audio signal rendering of sound during video recording
US8618405B2 (en) 2010-12-09 2013-12-31 Microsoft Corp. Free-space gesture musical instrument digital interface (MIDI) controller
GB2496521B (en) * 2011-11-11 2019-01-16 Fictitious Capital Ltd Computerised percussion instrument
GB2496521A (en) * 2011-11-11 2013-05-15 Fictitious Capital Ltd Computerised musical instrument using motion capture and analysis
US9224377B2 (en) 2011-11-11 2015-12-29 Fictitious Capital Limited Computerized percussion instrument
US9573049B2 (en) 2013-01-07 2017-02-21 Mibblio, Inc. Strum pad
US11039267B2 (en) 2015-09-16 2021-06-15 Magic Leap, Inc. Head pose mixing of audio files
US11778412B2 (en) 2015-09-16 2023-10-03 Magic Leap, Inc. Head pose mixing of audio files
US10681489B2 (en) 2015-09-16 2020-06-09 Magic Leap, Inc. Head pose mixing of audio files
US12185086B2 (en) 2015-09-16 2024-12-31 Magic Leap, Inc. Head pose mixing of audio files
WO2017048713A1 (en) * 2015-09-16 2017-03-23 Magic Leap, Inc. Head pose mixing of audio files
US11438724B2 (en) 2015-09-16 2022-09-06 Magic Leap, Inc. Head pose mixing of audio files
US10188957B2 (en) 2016-10-18 2019-01-29 Mattel, Inc. Toy with proximity-based interactive features
WO2020007179A1 (en) * 2018-07-05 2020-01-09 腾讯科技(深圳)有限公司 Posture adjustment method and device, storage medium, and electronic device
US12023585B2 (en) 2018-07-05 2024-07-02 Tencent Technology (Shenzhen) Company Limited Posture adjustment method and apparatus, storage medium, and electronic device
US10839778B1 (en) * 2019-06-13 2020-11-17 Everett Reid Circumambient musical sensor pods system
GB2585060A (en) * 2019-06-27 2020-12-30 Sony Interactive Entertainment Inc Audio generation system and method
CN113299256A (en) * 2021-05-14 2021-08-24 上海锣钹信息科技有限公司 MIDI digital music performance interaction method
CN116504205A (en) * 2023-03-01 2023-07-28 广州感音科技有限公司 Musical performance control method, system, medium and computer
CN116504205B (en) * 2023-03-01 2023-11-24 广州感音科技有限公司 Musical performance control method, system, medium and computer

Also Published As

Publication number Publication date
FI20075530A0 (en) 2007-07-09

Similar Documents

Publication Publication Date Title
WO2009007512A1 (en) A gesture-controlled music synthesis system
US9981193B2 (en) Movement based recognition and evaluation
US8444464B2 (en) Prompting a player of a dance game
US7589727B2 (en) Method and apparatus for generating visual images based on musical compositions
US9358456B1 (en) Dance competition game
US7893337B2 (en) System and method for learning music in a computer game
US6541692B2 (en) Dynamically adjustable network enabled method for playing along with music
JP6137935B2 (en) Body motion evaluation apparatus, karaoke system, and program
US20150103019A1 (en) Methods and Devices and Systems for Positioning Input Devices and Creating Control
JP5198766B2 (en) Recording method of metronome and beat interval corresponding to tempo change
WO2012020242A2 (en) An augmented reality system
US6878869B2 (en) Audio signal outputting method and BGM generation method
WO2015194509A1 (en) Video generation device, video generation method, program, and information storage medium
Fels et al. Musikalscope: A graphical musical instrument
JP2001215963A (en) Music playing device, music playing game device, and recording medium
JP2021140065A (en) Processing systems, sound systems and programs
Petersen et al. Musical-based interaction system for the Waseda Flutist Robot: Implementation of the visual tracking interaction module
Baba et al. ''VirtualPhilharmony'': A Conducting System with Heuristics of Conducting an Orchestra
JP2007271739A (en) Concert parameter display device
Hoang et al. Multimodal Metronome—Rhythm game for musical instruments
CN108877754A (en) System and implementation method are played in artificial intelligence music's letter
JP2005321514A (en) Game machine and musical interval imparting effect sound generation program and method
Sourin Music in the air with leap motion controller
Feitsch et al. Tangible and body-related interaction techniques for a singing voice synthesis installation
WO2024190759A1 (en) Information processing method, information processing system, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08787698

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08787698

Country of ref document: EP

Kind code of ref document: A1