
HK1255278A1 - Method and system for automated generation of coordinated audiovisual work - Google Patents


Info

Publication number
HK1255278A1
Authority
HK
Hong Kong
Prior art keywords
visual
vocal
performer
captured
performance
Prior art date
Application number
HK18114438.1A
Other languages
Chinese (zh)
Other versions
HK1255278B (en)
Inventor
Kevin Sung
Bona Kim
Jon MOLDOVER
John SHIMMIN
Jeannie Yang
Perry Cook
Original Assignee
Smule, Inc. (思妙公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smule, Inc. (思妙公司)
Publication of HK1255278A1
Publication of HK1255278B


Abstract

Vocal audio of a user together with performance synchronized video is captured and coordinated with audiovisual contributions of other users to form composite duet-style or glee club-style or window-paned music video-style audiovisual performances. In some cases, the vocal performances of individual users are captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. Contributions of multiple vocalists are coordinated and mixed in a manner that selects for presentation, at any given time along a given performance timeline, performance synchronized video of one or more of the contributors. Selections are in accord with a visual progression that codes a sequence of visual layouts in correspondence with other coded aspects of a performance score such as pitch tracks, backing audio, lyrics, sections and/or vocal parts.

Description

Automatically generating coordinated audiovisual works based on content captured from remotely distributed performers
Technical Field
The present invention relates generally to the capture and/or processing of audiovisual performances, and in particular to techniques suitable for use in connection with portable device embodiments of vocal performance capture.
Background
The number of cell phones and other portable computing devices grows daily, in both absolute terms and computing power. They are ubiquitous and deeply rooted in the lifestyles of people around the world, transcending nearly all cultural and economic barriers. Computationally, today's cell phones offer speed and storage capabilities comparable to desktop computers of less than a decade ago, making them well suited for real-time sound synthesis and other music applications. Partly as a result, some modern cell phones (e.g., iPhone® handheld digital devices available from Apple Inc.) are capable of supporting audio and video playback.
Like traditional acoustic instruments, cell phones can be intimate sound-producing and capturing devices. However, compared to most traditional instruments, they are somewhat limited in acoustic bandwidth and power. Despite these disadvantages, handsets have the advantages of ubiquity, strength in numbers, and hyper-mobility, making it possible (at least in theory) to bring artists together for performance almost anywhere, anytime. The field of mobile music has been explored by several developing research communities. Indeed, experience with applications such as the Smule Ocarina™, Smule Magic Piano®, and Smule Sing! Karaoke™ applications (all available from Smule, Inc.) has shown that sophisticated digital acoustic techniques can be delivered in ways that provide a compelling user experience.
As digital acoustic researchers seek to transition their innovations to modern handheld devices (e.g., handsets) and other platforms, significant practical challenges present themselves: such devices must operate within real-world constraints imposed by processors, memory, and other limited computing resources, and/or within the communication bandwidth and transmission-delay constraints typical of wireless networks. Improved technical and functional capabilities are desired, particularly with respect to video.
Disclosure of Invention
It has been discovered that, despite the many practical limitations imposed by mobile device platforms and application execution environments, audiovisual performances, including vocal music, can be captured and coordinated with the audiovisual performances of other users in ways that create a compelling user experience. In some cases, the vocal performances of individual users are captured (together with performance-synchronized video) on mobile devices in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of a backing track. In some cases, pitch cues may be presented to the vocalist in connection with the karaoke-style presentation of lyrics, and, optionally, continuous automatic pitch correction (or pitch shifting into harmony) may be provided.
Vocal audio of a user together with performance-synchronized video is captured and coordinated with the audiovisual contributions of other users to form composite duet-style, glee club-style, or window-paned music video-style audiovisual performances. In some cases, the vocal performances of individual users are captured (together with performance-synchronized video) on a mobile device, a television-type display, and/or set-top box equipment in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of a backing track. Contributions of multiple vocalists are coordinated and mixed in a manner that selects, for presentation at any given time along a given performance timeline, the performance-synchronized video of one or more of the contributors. Selections are in accord with a visual progression that codes a sequence of visual layouts in correspondence with other coded aspects of a performance score, such as pitch tracks, backing audio, lyrics, sections, and/or vocal parts.
In some embodiments of the present invention(s), a method of preparing a coordinated audiovisual work from contributions of remotely distributed performers includes: receiving, via a communication network, a plurality of audiovisual encodings of performances captured at respective remote devices in temporal correspondence with respective audible renderings of a seed, each of the received audiovisual encodings including respective performer vocals and video temporally synchronized therewith; retrieving a visual progression that codes, in temporal correspondence with the seed, a succession of templated screen layouts, each of the templated screen layouts specifying a number and arrangement of visual elements in which respective ones of the videos are visually rendered; associating each of the captured performances, including the respective performer vocals and synchronized video, to respective ones of the visual elements; and rendering the coordinated audiovisual work as an audio mix of the captured performances and a coordinated visual presentation thereof in accordance with the visual progression and the associations.
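The method above (a visual progression as a timed succession of templated layouts, each with a set of visual elements) can be pictured with a minimal, purely illustrative Python sketch. Everything here (the `VisualElement`, `LayoutStep`, and `active_layout` names, and the data-structure shapes) is an assumption of this sketch, not an implementation from the patent:

```python
from dataclasses import dataclass

@dataclass
class VisualElement:
    x: float   # fractional screen coordinates in [0, 1]
    y: float
    w: float
    h: float

@dataclass
class LayoutStep:
    start: float   # seconds along the performance timeline
    cells: list    # VisualElement instances; their number/arrangement are templated

def active_layout(progression, t):
    """Return the templated layout active at time t (progression sorted by start)."""
    current = progression[0]
    for step in progression:
        if step.start <= t:
            current = step
        else:
            break
    return current

# Toy visual progression: a full-screen solo tile, then a side-by-side duet layout.
solo = LayoutStep(0.0, [VisualElement(0.0, 0.0, 1.0, 1.0)])
duet = LayoutStep(12.5, [VisualElement(0.0, 0.0, 0.5, 1.0),
                         VisualElement(0.5, 0.0, 0.5, 1.0)])
progression = [solo, duet]
```

Each captured performance's video would then be associated to one `VisualElement` of whichever `LayoutStep` is active at a given point along the timeline.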
In some cases or embodiments, successive ones of the templated screen layouts change the spatial arrangement or number of visual elements, or both. In some cases or embodiments, the audio mix includes, in accord with the particular templated screen layout that is active at a given point in the visual progression, performer vocals of respective ones of the captured performances associated to visual elements of the then-active templated screen layout. In some cases or embodiments, the performer vocals included at a given point in time in the audio mix are only those of captured performances associated to visual elements of the templated screen layout active at that time.
In the visual progression employed in some cases or embodiments, at least some transitions from one templated screen layout to another temporally coincide with boundaries between musical sections. In the visual progression employed in some cases or embodiments, at least some transitions from one templated screen layout to another temporally coincide with transitions between portions selected from a set that includes: a first vocal part; a second vocal part; and multiple singers together. In the visual progression employed in some cases or embodiments, at least some transitions from one templated screen layout to another temporally coincide with every Nth beat (N ≥ 1) of the underlying song to which the seed corresponds. In the visual progression employed in some cases or embodiments, the number of visual elements in at least some successive templated screen layouts increases in correspondence with intensity of the underlying song to which the seed corresponds. In the visual progression employed in some cases or embodiments, the spatial arrangement or size of at least some visual elements changes from one templated screen layout to the next successive one.
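The timing rules above (layout changes at section boundaries and/or on every Nth beat) admit a small sketch. This is a hypothetical helper, not the patent's algorithm; the function name and parameters are assumptions:

```python
def layout_change_times(section_starts, bpm, every_n_beats=None):
    """Candidate times (seconds) for switching templated layouts: boundaries
    between musical sections, optionally augmented with every-Nth-beat pulses
    derived from the song tempo."""
    times = set(section_starts)
    if every_n_beats:
        beat = 60.0 / bpm               # seconds per beat
        t, end = 0.0, max(section_starts)
        while t <= end:
            times.add(round(t, 6))
            t += every_n_beats * beat
    return sorted(times)

# Sections start at 0 s, 15 s, and 30 s of a 120-BPM song; also allow a
# layout cut on every 8th beat (i.e., every 4 seconds at this tempo).
changes = layout_change_times([0.0, 15.0, 30.0], bpm=120, every_n_beats=8)
```

A visual progression generator could then place each templated layout transition only at one of these candidate times, keeping cuts musically coherent.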
In some embodiments, the method further includes generating the visual progression in accordance with a structured musical arrangement corresponding to the seed. In some cases or embodiments, the structured musical arrangement includes a coding of musical sections consistent with either or both of a pitch track for the performer vocals and lyrics corresponding thereto. In the visual progression employed in some cases or embodiments, at least some transitions from one templated screen layout to another temporally coincide with boundaries between musical sections of the structured musical arrangement. In some cases or embodiments, the structured musical arrangement includes a coding of a backing track. In the visual progression employed in some cases or embodiments, at least some transitions from one templated screen layout to another temporally coincide with every Nth beat (N ≥ 1) computationally extracted from the backing track.
In some cases or embodiments, each of the templated screen layouts defines a visual extent for a set of visual elements in which video captured in correspondence with vocals of respective ones of the performers is rendered. In some cases or embodiments, the templated screen layouts include: at least one single-performer layout; at least one two-performer layout; a plurality of three- and four-performer layouts; and, for at least one number N of performers, a plurality of N-performer layouts, wherein N ≥ 4.
In some embodiments, the method further includes transitioning from a first templated screen layout to a next successive templated screen layout in correspondence with the retrieved visual progression, wherein, for video captured in correspondence with vocals of a particular performer, the transition is from a first visual element of the first layout to a second visual element of the next successive layout. In some cases or embodiments, the transition from the first visual element to the second visual element includes one or more of: a slide transition; a fade-in or fade-out transition; a zoom transition; and a cut transition.
In some cases or embodiments, the rendering is to an audiovisual encoding or container format suitable for storage or for transmission via a communication network. In some cases or embodiments, the rendering is to a display and audio transducer.
In some embodiments, the method further includes scaling, in the audio mix, an audio amplitude of performer vocals for a captured performance that has been associated to a particular visual element of the then-active templated screen layout, wherein the scaled amplitude of the particular performer's vocals is consistent with the size of the particular visual element to which that performer's video is associated. In some embodiments, the method further includes panning (left-to-right), in the audio mix, performer vocals for a captured performance that has been associated to a particular visual element of the then-active templated screen layout, wherein the panning of the particular performer's vocals is consistent with the lateral placement of the particular visual element to which that performer's video is associated.
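The size-driven amplitude scaling and placement-driven panning described above can be sketched as a tiny mix-parameter helper. This is an illustrative assumption (the `mix_params` name, the linear gain law, and the -1..+1 pan convention are choices of this sketch, not details from the patent):

```python
def mix_params(cell_x, cell_w, cell_area, full_area=1.0):
    """Derive per-performer mix settings from the visual element that the
    performer's video occupies: amplitude tracks cell size, and pan tracks
    the cell's lateral placement (-1.0 = hard left, +1.0 = hard right)."""
    gain = cell_area / full_area        # larger on-screen tile -> louder vocal
    center = cell_x + cell_w / 2.0      # fractional horizontal center of the cell
    pan = 2.0 * center - 1.0            # map [0, 1] center to [-1, +1] pan
    return gain, pan

# Right-hand half-screen tile of a side-by-side duet layout.
gain, pan = mix_params(cell_x=0.5, cell_w=0.5, cell_area=0.5)
```

With this convention, a performer whose tile fills the right half of the screen is mixed at half amplitude and panned partway right, and the values update whenever the visual progression advances to a new layout.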
In some embodiments, the method further includes transmitting an encoding of the coordinated audiovisual work to one or more of the remotely distributed performers. In some embodiments, the method further includes receiving, via the communication network, an audiovisual encoding of a seed performance including first performer vocals and temporally synchronized video captured at a first remote device against an audible rendering of a backing track, wherein the seed includes the seed performance of the first performer.
These and other embodiments in accordance with the invention(s) will be understood with reference to the following description and appended claims.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, with reference to the accompanying figures in which like reference numerals generally indicate similar elements or features.
Fig. 1 depicts information flows between an illustrative cell phone-type portable computing device and a content server in which a seed performance is captured and mixed with contributions of additional performers, in accordance with some embodiments of the present invention(s).
Fig. 2 depicts information flows amongst illustrative cell phone-type portable computing devices employed for audiovisual content capture and a content server at which the captured audiovisual performances are accreted and coordinated, in accordance with some embodiments of the present invention(s).
Fig. 3 depicts templated screen layouts for various numbers of singers, as may be employed in accordance with some embodiments of the present invention(s) to encode a visual process to be used for coordinating multiple audiovisual performances.
Fig. 4A, 4B, and 4C are sequential screenshots of video synchronized to a vocal performance along a coordinated audio-visual performance timeline in which video of a plurality of contributing singers is coordinated using a visual process encoded in correspondence with the musical score, according to some embodiments of the invention(s).
Fig. 5 depicts score coding in accordance with some embodiments of the invention(s), wherein the visual progression of a templated screen layout is coded in addition to (but in concert with) lyrics, pitch tracks for vocal cues and/or continuous pitch correction of a captured user's vocal work, and accompaniment.
Fig. 6 is a flow diagram depicting, for a captured audiovisual performance, optional real-time continuous pitch correction and harmony generation based on score-coded pitch correction settings, in accordance with some embodiments of the present invention(s).
Figure 7 is a functional block diagram of hardware and software components executable on an illustrative cell phone-type portable computing device for facilitating processing of captured audiovisual performances in accordance with some embodiments of the invention(s).
Fig. 8 depicts features of a mobile device that may serve as a platform for executing software implementations, including audiovisual capture, in accordance with some embodiments of the invention(s).
Fig. 9 is a network diagram depicting cooperation of exemplary devices according to some embodiments of the invention(s).
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or saliency of some of the depicted elements or features may be exaggerated relative to other elements or features to help to improve understanding of embodiments of the present invention.
Detailed Description
Mode(s) for carrying out invention(s)
Techniques have been developed to facilitate the capture, pitch correction, harmonization, encoding, and/or rendering of audiovisual performances on portable computing devices and living-room-style entertainment equipment. Vocal audio and performance-synchronized video are captured and coordinated with the audiovisual contributions of other users to form duet-style, glee club-style, or window-paned music video-style audiovisual performances.
In some cases, vocal performances (and performance-synchronized videos) of individual users are captured on a mobile device, a television display, and/or a set-top box device in concert with a vocal rendering of an accompaniment in the context of a karaoke-style lyric presentation. In some cases, a pitch prompt may be presented to the singer in conjunction with a karaoke-style lyric presentation, and optionally, a continuous automatic pitch correction (or pitch conversion to harmony) may be provided.
The contributions of multiple singers are coordinated and mixed in such a way that: the performance-synchronized video of one or more of the contributors is selected for presentation at any given time along a given performance timeline. This selection is consistent with a visual process that encodes the visual layout sequence in accordance with other encoding aspects of the performance score (e.g., pitch track, accompaniment audio, lyrics, chapters, and/or vocals). The number, visual arrangement, and size of the individual visual elements in such a layout vary throughout a given coordinated performance.
In general, for a given song, musical-structure aspects of the song are used to create the mapped sequence of visual layouts. For example, in some cases, situations, or embodiments, song form (e.g., {verse, chorus, bridge, ...}) is used to constrain the mapping. In some cases, such as in a duet, the ordering of vocal parts (e.g., you sing a line, I sing a line, you sing two words, I sing three, we sing together, ...) provides structural information for creating the sequence of visual layouts. In some cases, situations, or embodiments, the building intensity of a song (e.g., as measured by acoustic power, tempo, or some other measure) can result in a sequence of visual layouts in which more and more singers are added in correspondence with the measured intensity.
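The intensity-driven case in the preceding paragraph can be sketched as a simple threshold mapping. The function name, the normalized-intensity assumption, and the particular thresholds are all illustrative choices of this sketch, not values from the patent:

```python
def singers_for_intensity(rms, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map a measured intensity (e.g., normalized acoustic power in [0, 1])
    to a number of on-screen singers: quiet passages show one performer,
    and each threshold crossed adds another tile."""
    return 1 + sum(rms >= t for t in thresholds)

# Building intensity across successive sections of a song yields a
# layout sequence that adds singers as the song swells.
counts = [singers_for_intensity(v) for v in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

The resulting counts would then select among the one-, two-, three-, four-, and N-performer templated layouts described earlier.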
In some cases, situations, or embodiments, the selection of a particular contribution, the mapping of video to particular visual elements of a coordinated screen layout, and/or the highlighting of video and/or audio synchronized to a particular performance may be based at least in part on computationally defined audio characteristics extracted from (or calculated from) captured vocal music audio. Similarly, in some cases, situations, or embodiments, selection of a particular contribution, mapping of video to particular visual elements of a coordinated screen layout, and/or highlighting of video and/or audio of a particular performance synchronization may be based at least in part on computationally defined video characteristics extracted from (or calculated from) captured video.
Depending on the location and/or prominence of the video synchronized for a particular performance, the operational audio mixing settings may be modified accordingly. For example, in some cases, situations, or embodiments, a spatialization filter may be employed to pan the captured audio left or right according to the current screen layout position for the corresponding video. Similarly, a spatialization filter may be used to change the captured audio according to (i) the higher (or lower) current screen layout position of the particular visual unit in which the corresponding video is presented and/or (ii) the apparent depth of the singer's stack. For example, for captured vocal works mapped to a chorus, some embodiments apply greater reverberation to those vocal works that present video in smaller (and significantly more distant) visual units.
Optionally, and in some cases or embodiments, vocal audio can be pitch-corrected in real time at the mobile device (or, more generally, at a portable computing device such as a cell phone, personal digital assistant, laptop computer, notebook computer, tablet computer, or netbook) in accord with pitch correction settings. In some cases, the pitch correction settings code a particular key or scale for the vocal performance or for portions thereof. In some cases, the pitch correction settings include score-coded melody and/or harmony sequences supplied with, or for association with, the lyrics and backing track. Harmony notes or chords may be coded as explicit targets or, if desired, relative to the score-coded melody or even to actual pitches sounded by the vocalist. Machine-usable, musical instrument digital interface-style (MIDI-style) codings may be employed for lyrics, backing track, note targets, vocal parts (e.g., vocal part 1, vocal part 2, ... together), and musical section information (e.g., intro/outro, verse, pre-chorus, chorus, bridge, transition, and/or other section codings), among other things. In some cases or embodiments, conventional MIDI-style codings may be extended to also code a score-aligned visual progression of mappings to visual elements of a succession of templated screen layouts, as depicted and described herein.
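An extended score of the kind described above (conventional fields plus a score-aligned visual progression) can be pictured with a small json-serializable sketch. The field names (`sections`, `visual`, `layout`, the layout identifiers) are assumptions of this illustration, not the patent's actual schema:

```python
import json

# Hypothetical extended score: section and vocal-part information alongside
# a "visual" track that maps spans of the timeline to templated layouts.
score = {
    "sections": [
        {"name": "verse",  "start": 0.0,  "vocal_part": 1},
        {"name": "verse",  "start": 15.0, "vocal_part": 2},
        {"name": "chorus", "start": 30.0, "vocal_part": "together"},
    ],
    "visual": [
        {"start": 0.0,  "layout": "solo"},
        {"start": 15.0, "layout": "duet"},
        {"start": 30.0, "layout": "grid4"},
    ],
}
encoded = json.dumps(score)
```

Because the visual track shares the score's timeline, layout changes (solo tile, duet split, four-up grid) land exactly on the coded section and vocal-part boundaries.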
Based on the compelling and transformative nature of pitch-corrected vocals, performance-synchronized video, and score-coded harmony mixes, user/vocalists can overcome an otherwise natural shyness or anxiety about sharing their vocal performances. Instead, even geographically distributed vocalists are encouraged to share and collaborate with friends and family, and to contribute vocal performances as part of a social music network. In some embodiments, these interactions are facilitated through social-network- and/or eMail-mediated sharing of performances and invitations to join in a group performance. Using uploaded vocals captured at clients such as the aforementioned portable computing devices, a content server (or service) can mediate such coordinated performances by manipulating and mixing the audiovisual content uploaded by multiple contributing vocalists. Depending on the goals and implementation of a particular system, in addition to video content, uploads may include pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Social music can be mediated in any of a variety of ways. For example, in some embodiments, a first user's vocal performance, captured against a backing track at a portable computing device and typically pitch-corrected in accord with score-coded melody and/or harmony cues, is supplied to other potential vocal performers as a seed performance. Performance-synchronized video is also captured and may be supplied together with the pitch-corrected, captured vocals. The supplied vocals are typically mixed with backing instrumentals/vocals and form the backing track against which the vocals of a second (and potentially successive) user are captured. Often, successive vocal contributors are geographically separated and may be personally unknown (at least a priori) to one another, yet the intimacy of the vocals together with the collaborative experience itself tends to minimize this separation. As successive vocal performances and video are captured (e.g., at respective portable computing devices) and accreted as part of the social music experience, the backing track against which respective vocals are captured may evolve to include previously captured vocals of other contributors. In general, first-, second-, or Nth-generation audiovisual performances may be used as seeds; however, for simplicity of description, many of the examples and illustrations herein assume a first-generation seed.
In some cases, captivating visual animations and/or facilities for listener comment and ranking, as well as duet, glee club, or choir formation or accretion logic, are provided in association with an audible rendering of a vocal performance (e.g., one captured and pitch-corrected at another similarly configured mobile device) mixed with backing instrumentals and/or vocals. Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still another location and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or of individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts suggestive of a performance or endorsement emanating from a particular geographic locale on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise mundane mobile devices into social instruments that foster a sense of global connectivity, collaboration, and community.
Karaoke-style vocal performance capture
Although embodiments of the present invention(s) are not necessarily limited thereto, pitch-corrected, karaoke-style vocal capture using cell phone-type and/or television-type audiovisual equipment provides a useful descriptive context. For example, in some embodiments such as that illustrated in FIG. 1, an iPhone® handset available from Apple Inc. (or, more generally, handset 101) hosts software that executes in cooperation with a content server 110 to provide vocal capture with continuous, real-time, score-coded pitch correction and harmonization of the captured vocals. Performance-synchronized video may be captured using a camera provided by, or in connection with, a television or other audiovisual media device, or a connected set-top box device such as an Apple TV™ device (not separately shown in FIG. 1). Performance-synchronized video may also, or alternatively, be captured using an on-board camera provided by handset 101.
In a karaoke-style application (e.g., the Sing! Karaoke™ application available from Smule, Inc.), a backing track of instrumentals and/or vocals can be audibly rendered for a user/vocalist to sing along with. In such cases, lyrics may be displayed (102) in correspondence with the audible rendering so as to facilitate a karaoke-style vocal performance by the user. In the configuration of FIG. 1, lyrics, timing information, pitch and harmony cues, backing tracks (e.g., instrumentals/vocals), performance-coordinated video, and the like may all be sourced from a network-connected content server 110. In some cases or situations, backing audio and/or video may instead be rendered from a media store (e.g., an iTunes™ library) or other audiovisual content store resident at, or accessible from, the handset, a set-top box, a media streaming device, etc.
For simplicity, it may be assumed that a wireless local area network provides communications between handset 101, any audiovisual and/or set-top box equipment, and a wide-area network gateway to hosted service platforms such as content server 110. FIG. 9 depicts an exemplary network configuration. Nonetheless, based on the description herein, persons of skill in the art will recognize that any of a variety of data communications facilities, including 802.11 Wi-Fi, Bluetooth™, 4G-LTE wireless, wired data networks, and wired or wireless audiovisual interconnects (e.g., in accord with HDMI, AVI, or Wi-Di standards or facilities), may be employed, individually or in combination, to facilitate the communications and/or audiovisual rendering described herein.
Referring again to the example of FIG. 1, the user's vocals 103 are captured at handset 101, optionally pitch-corrected continuously and in real time, and audibly rendered (see 104), mixed with the backing track, at the handset itself or via a computing facility employing an audiovisual display and/or set-top box equipment (not specifically shown), so as to provide the user with an improved tonal rendition of his/her own vocal performance. Note that while the captured vocals 103 and audible rendering 104 are illustrated using convenient visual symbology centered on the microphone and speaker facilities of handset 101, persons of ordinary skill in the art having benefit of the present disclosure will appreciate that, in many cases, microphone and speaker functionality may be provided using attached or wirelessly connected earbuds, headphones, speakers, feedback-isolated microphones, etc. Accordingly, unless specifically limited, vocal capture and audible rendering should be understood broadly and are not restricted to particular audio transducer configurations.
When provided, pitch correction is typically based on score-coded note sets or cues (e.g., pitch and harmony cues 105) that supply continuous pitch correction algorithms with performance-synchronized sequences of target notes in a current key or scale. In addition to performance-synchronized melody targets, score-coded harmony note sequences (or sets) provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track, and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user's own captured vocals. In some cases, the pitch correction settings may be characteristic of a particular artist, such as the artist who originally performed (or popularized) vocals associated with the particular backing track.
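The core of continuous pitch correction toward score-coded targets can be illustrated with a deliberately simplified sketch; real implementations operate on audio frames with pitch detection and resynthesis, whereas this toy works directly on MIDI note numbers, and the `correct_pitch` name and `strength` parameter are assumptions of the sketch:

```python
def correct_pitch(detected_midi, target_notes, strength=1.0):
    """Pull a detected pitch (in MIDI note numbers, possibly fractional)
    toward the nearest score-coded target note. strength=1.0 snaps fully;
    smaller values correct more gently."""
    nearest = min(target_notes, key=lambda n: abs(n - detected_midi))
    return detected_midi + strength * (nearest - detected_midi)

# A slightly flat A4 (MIDI 69) is pulled to the score-coded melody note.
corrected = correct_pitch(68.6, target_notes=[64, 67, 69, 72])
```

In the score-coded scheme described above, the `target_notes` for a given instant would come from the melody track (or, for harmony generation, from harmony offsets relative to it), updated continuously along the performance timeline.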
Additionally, lyrics, melody and harmony track note sets, and related timing and control information may be packaged as a musical score coded in an appropriate container or object (e.g., in a Musical Instrument Digital Interface (MIDI) or JavaScript object notation (json) type format) for supply together with the backing track(s). Using such information, handset 101, the audiovisual display and/or set-top box equipment, or both, may display lyrics, and even visual cues related to target notes, harmonies, and currently detected vocal pitch, in correspondence with an audible rendering of the backing track(s), so as to facilitate a karaoke-style vocal performance by the user. Thus, if an aspiring vocalist selects "When I'm Gone" as popularized by Anna Kendrick, the corresponding json-format score and m4a-format audio may be downloaded from the content server 110 (if not already available or cached based on a prior download) and used, in turn, to provide background music, synchronized lyrics, and, in some cases or embodiments, score-coded note tracks for continuous, real-time pitch correction as the user sings.
Optionally, at least for some embodiments or genres, harmony note tracks may also be score-coded for harmony shifting of the captured vocals. Typically, the captured pitch-corrected (and possibly harmonized) vocal performance, together with the performance-synchronized video, is saved locally at the handset or set-top box as one or more audiovisual files and is then compressed and encoded for upload (106) to the content server 110 as an MPEG-4 container file. MPEG-4 is an international standard for the coded representation and transmission of digital multimedia content for internet, mobile network, and advanced broadcast applications. Other suitable codecs, compression techniques, coding formats, and/or containers may be employed if desired.
Depending on the embodiment, encodings of dry (i.e., uncorrected) and/or pitch-corrected vocals may be uploaded (106) to the content server 110. In general, such vocals (encoded, e.g., in an MPEG-4 container or otherwise), whether pitch-corrected at the capture device or at the content server 110, can then be mixed (111) with, e.g., backing audio and other captured (and possibly pitch-shifted) vocal performances to produce files or streams whose quality or coding characteristics are selected in accord with the capabilities of a particular target device or network (e.g., handset 120, audiovisual display and/or set-top box equipment, a social media platform, etc.).
As described in further detail herein, the performances of multiple singers (including performance-synchronized video) may be mixed and combined such that the resulting music video-type composition or vocal presentation appears as a duet-style performance, a glee club-style chorus, or a window-paned presentation. In some embodiments, a performance-synchronized video contribution (e.g., in the illustration of fig. 1, the performance-synchronized video 122 comprising a seed performance captured at handset 101 or using an audiovisual display and/or set-top box device) may be presented in the resulting mixed audiovisual performance rendering 123 with screen positioning, size, or other visual prominence that changes dynamically throughout the rendering. The visual progression of positioning, sizing, or other visual prominence is based, at least in part, on a sequence of templated screen layouts, as explained in more detail herein.
To simplify the initial illustration, fig. 1 depicts performance-synchronized audio (103) and video (105) captured and uploaded to a content server 110 (or service platform) as an initial seed performance 106, which is distributed to one or more possible contributing singers or performers, and against which those other contributing singers or performers (#2, #3 … #N) capture additional audiovisual (AV) performances. Fig. 1 further depicts the supply of these other captured AV performances #2, #3 … #N for audio mixing and visual arrangement 111 at the content server 110 to produce the performance-synchronized video 122.
Fig. 2 depicts, in somewhat greater detail, the supply of accompaniment/vocal works 107, lyrics/timing information 108, pitch and voice cues 109, and the seed performance 106 to the additional singers or performers (#2 … #N). These additional singers or performers are typically geographically distributed and, in some cases, may never meet in person. As with the first or seed performer, audio (103.2 … 103.N) and video (105.2 … 105.N) for the second through Nth performers may be captured in the karaoke style described above using a handset, an audiovisual display and/or set-top box device, or both. It is noted that although the illustrations of figs. 1 and 2 assume that the initial seed performance is captured using a handset (101) or a living-room audiovisual display and/or set-top box device, persons of ordinary skill in the art having benefit of the present disclosure will appreciate that, in some cases or embodiments, studio equipment or even existing music video content may serve as the seed performance 106.
The captured audiovisual performances (#2 … #N), including the vocals of the second through Nth performers, are supplied to the content server 110, where they are combined with other AV performances (typically including the seed performance) and supplied or presented (e.g., at handset 120) as the performance-synchronized audiovisual work 122. Referring again to fig. 1 and in general, the number and layout of the visual representations of performances (and corresponding videos), the visual positioning and/or prominence of individual performers, etc. can vary throughout the mixed audiovisual performance rendering 123 in accordance with the coded visual progression.
In the illustration of fig. 1, two performers (from among the two, three, or more performers from whom captured AV performances (e.g., #2, #3 …) are derived) are selected based on the current state of the coded visual progression. However, persons of ordinary skill in the art having benefit of the present disclosure will appreciate that different numbers, selections, arrangements, and/or visual layouts of performers may appear in the mixed audiovisual performance rendering 123 at any given time based on the coded visual progression. In general, the coded visual progression encodes or otherwise selects a changing number of presented performers and an on-screen placement layout, while remaining time-aligned with the section sequencing or other musical structure of the underlying backing track against which the AV performances have been captured. In some cases or embodiments, a particular performer may be selected for inclusion (or for prominence) based on audio (or video) feature analysis of the corresponding vocals (or video).
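As a minimal sketch (not any claimed implementation; entry names and times are illustrative assumptions), a coded visual progression can be represented as a time-ordered sequence of layout entries, with the entry in effect at any point along the performance timeline looked up as follows:

```python
# A coded visual progression: a time-ordered sequence of templated
# layout entries, each naming a layout and the performers assigned
# to its visual elements. All identifiers here are illustrative.
progression = [
    {"start_ms": 0,     "layout": "solo", "performers": [1]},
    {"start_ms": 8000,  "layout": "duet", "performers": [1, 2]},
    {"start_ms": 16000, "layout": "quad", "performers": [1, 2, 3, 4]},
]

def active_entry(progression, t_ms):
    """Return the progression entry in effect at time t_ms."""
    current = progression[0]
    for entry in progression:
        if entry["start_ms"] <= t_ms:
            current = entry
        else:
            break
    return current
```

With such a lookup, the renderer can select, at any point on the timeline, both the layout to draw and the subset of performer videos to place in it.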
In some embodiments of the invention, social network constructs may facilitate the pairing or grouping of geographically distributed singers. For example, referring to FIG. 1, a first singer may capture vocal audio and performance-synchronized video and upload (106) it to a content server or service platform. The audiovisual content so captured may, in turn, be distributed to the first singer's social media contacts via a content-server-mediated open call or through electronic communications initiated by the first singer. In this manner, the first singer (and/or the content server or service platform acting on his or her behalf) may invite others to join the coordinated audiovisual performance.
Audiovisual captures such as those shown and described may include vocals (typically pitch-corrected) captured from an original or previous contributor, together with performance-synchronized video. Such an audiovisual capture can be (or can form the basis of) an accompanying audiovisual track for a subsequent audiovisual capture (see, e.g., other captured video performances #2, #3 … #N) from another (possibly remote) user/singer. In general, subsequent audiovisual capture may be performed locally, on another (geographically separated) handheld device, or using another (geographically separated) audiovisual display and/or set-top box configuration. In some cases or embodiments, particularly in conjunction with a living-room audiovisual display and/or set-top box configuration (e.g., using a network-connected Apple TV device and television monitor), initial and subsequent audiovisual captures by additional performers may be accomplished using a common (collocated) set of handheld devices and audiovisual display and/or set-top box equipment.
Where the supply and use of accompaniment is shown and described herein, it will be understood that captured pitch-corrected (and possibly, but not necessarily, harmonized) vocals may themselves be mixed to produce an "accompaniment" for stimulating, guiding, or framing subsequent vocal capture. Likewise, additional singers may be invited to sing a particular part (e.g., tenor, part B of a duet, etc.) or simply to sing, after which the content server 110 may pitch-shift and place their captured vocals at one or more positions within a duet, quartet, or virtual glee club. These and other aspects of performance accretion are described in greater detail in commonly owned U.S. Patent No. 8,983,829, entitled "Coordinating and Mixing Vocals Captured from Geographically Distributed Performers" (Cook, Lazier, Lieber, and Kirk).
Visual progression and templated screen layouts
FIG. 3 depicts templated screen layouts for various numbers of singers, as may be employed in encoding a visual progression useful for coordinating multiple audiovisual performances in accordance with some embodiments of the present invention(s). Exemplary layouts are shown for use over the course of a mixed multi-performer audiovisual performance (recall the mixed audiovisual performance rendering 123, see fig. 1). Layouts are shown for a single singer (131) and for multiple singers (132, 133, 134, 135, 136, 138 …), including multiple alternative layouts for at least some numbers of singers. Illustratively, for the five-singer case, three alternative layouts 135.1, 135.2 and 135.3 are shown.
Generally, embodiments in accordance with the invention(s) will employ various layouts throughout the mixed audiovisual performance rendering timeline, including multiple layout variations for a given number of performers, to provide visual interest in the resulting rendering. Figs. 4A, 4B, and 4C depict a series of layouts (122A, 122B, and 122C) employed along a coordinated AV performance timeline 130. In some cases, one or more of the layout variations for a given number of performers make a particular singer (or particular singers) the most prominent (or a more prominent) feature relative to the others. Referring again to fig. 3, examples of such prominence include layouts 135.1, 136.1 and 138.1 of the five-, six- and eight-singer layouts shown, respectively. As previously described, the visual prominence of a particular performer may be determined from audio feature analysis (e.g., audio power, spectral flux, and score-based quality metrics). In some cases and embodiments, a prominent visual position may be afforded to the seed performer.
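A minimal sketch of prominence selection based on one such audio feature follows; it uses per-frame audio power only (a real system might also weight spectral flux or score-based quality metrics, as noted above), and the identifiers are illustrative.

```python
import math

def rms_power(samples):
    """Root-mean-square power of one performer's vocal frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def most_prominent(vocal_frames):
    """Pick the performer whose current frame has the greatest power.

    vocal_frames maps a performer id to a list of audio samples for
    the current analysis frame. Power alone is used here as a sketch.
    """
    return max(vocal_frames, key=lambda pid: rms_power(vocal_frames[pid]))

frames = {1: [0.1, -0.1, 0.1], 2: [0.5, -0.4, 0.5], 3: [0.05, 0.0, -0.05]}
```

Under such a scheme, performer 2 (the loudest in this synthetic frame) would be assigned the prominent position of the active layout.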
While certain exemplary layouts are depicted, those of ordinary skill in the art, having the benefit of this disclosure, will appreciate numerous suitable variations. It is also noted that, while a generally square form factor with generally rectangular constituent panes has been shown for simplicity, other form factors and pane geometries may be employed in some cases and embodiments. For example, landscape, portrait, and letterbox form factors may be desirable for many handheld device deployments.
Score-coded visual progression and pitch tracks
Fig. 5 depicts score coding in accordance with some embodiments of the invention(s), in which a visual progression of templated screen layouts is coded in addition to (and generally in temporal correspondence with) lyrics 108, a pitch track 109 for vocal cues and/or continuous pitch correction of the captured user vocals, and the accompaniment 107. Generally, the score-coded visual progression 151 encodes a time-varying number of performers and can select a sequence of visual layouts 153 corresponding to that time-varying number of performers (152).
Consistent with commonly employed styles of musical arrangement, the overall score is broken into musical sections (here, in the example of fig. 5, depicted as intro, verse, pre-chorus, and chorus), and the singer count tends to build within a typical section. For example, in the verse 161, the singer count builds from one to two, to four, to six (then five, then six), and finally up to eight singers. An exemplary selection of visual layouts 153 corresponding to the time-varying number of performers is shown as part of the sequence 152. Successive templated screen layouts change the spatial arrangement, the number, or both the spatial arrangement and number of visual elements.
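Such a build can be sketched as a mapping from the time-varying singer count to candidate templated layouts, with multiple variants per count (as in fig. 3) cycled through for visual interest. The layout names below are illustrative assumptions, not identifiers from any actual implementation.

```python
# Illustrative mapping from singer count to candidate templated
# layouts; several counts offer multiple variants, as in fig. 3.
LAYOUTS = {
    1: ["solo"],
    2: ["duet_side_by_side"],
    4: ["quad_grid"],
    5: ["five_1", "five_2", "five_3"],
    6: ["six_1", "six_2"],
    8: ["eight_1"],
}

def layout_sequence(counts):
    """Choose a layout for each successive singer count, cycling among
    variants so that repeated counts get differing layouts."""
    seen = {}
    seq = []
    for c in counts:
        variants = LAYOUTS[c]
        i = seen.get(c, 0)
        seq.append(variants[i % len(variants)])
        seen[c] = i + 1
    return seq

# The verse build described above: 1 -> 2 -> 4 -> 6 -> 5 -> 6 -> 8.
seq = layout_sequence([1, 2, 4, 6, 5, 6, 8])
```

Note that the second occurrence of the six-singer count selects a different six-singer variant, matching the stated goal of varying layouts for a given number of performers.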
Typically, the audio mix of the overall mixed AV performance rendering 123 (recall fig. 1) includes, in accordance with the particular templated screen layout active at a given point in the visual progression, the performer vocals for each of the captured performances that have been assigned to visual elements of the then-active templated screen layout. In some cases or embodiments, at a given point in time (e.g., point P1) in the score, the performer vocals included in the corresponding audio mix are only those of the respective captured performances assigned to visual elements of the templated screen layout (e.g., layout L1) active at that time. Accordingly, for point in time P1 and the corresponding layout L1, the vocals of six performers in the mixed AV performance are rendered along with the performance-synchronized visual presentation, with performer 1 (typically the seed performer) featured in the prominent position of layout L1.
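The rule that only visually presented performers contribute to the audio mix can be sketched as follows (a simplified sum-and-normalize mix; variable names are illustrative and a production mixer would apply proper gain staging):

```python
def mix_active_vocals(vocal_frames, active_performers):
    """Sum only the vocal frames of performers whose video occupies a
    visual element of the then-active templated layout."""
    length = len(next(iter(vocal_frames.values())))
    mixed = [0.0] * length
    for pid in active_performers:
        for i, s in enumerate(vocal_frames[pid]):
            mixed[i] += s
    # Simple normalization to limit clipping when many parts sum.
    n = max(1, len(active_performers))
    return [s / n for s in mixed]

frames = {1: [0.2, 0.4], 2: [0.4, 0.0], 3: [1.0, 1.0]}
out = mix_active_vocals(frames, [1, 2])  # performer 3 is off-screen
```

Here performer 3's vocals are excluded from the mix because that performer is not assigned to any visual element of the active layout.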
As will be appreciated by those skilled in the art having the benefit of this disclosure, at least some transitions from one templated screen layout to another temporally coincide with boundaries between musical sections, while others fall within a given section. For example, a transition from one templated screen layout to another may coincide in time with a transition between vocal parts (e.g., a first vocal part, a second vocal part, and a multi-vocalist part). Similarly, a transition from one templated screen layout to another may coincide in time with internal boundaries within a given section (e.g., the verse 161 shown in FIG. 5). In particular, the visual progressions employed in some cases or embodiments, particularly within a given section, may coincide in time with an Nth-beat rhythm (N ≥ 1) of the underlying song to which the performance corresponds.
In visual progressions employed in some cases or embodiments, the number of visual elements in at least some successive templated screen layouts increases in correspondence with the building intensity of the underlying song to which the seed corresponds. Typically, the spatial arrangement or size of at least some visual elements changes from one templated screen layout to the next. In some embodiments of the invention(s), computer-readable codings of the visual progression 151 shown in fig. 5 are prepared in accordance with a structured musical arrangement corresponding to the accompaniment or to the seed performance.
Fig. 6 is a flow diagram depicting optional real-time continuous pitch correction and harmony generation, based on score-encoded pitch correction settings, for a captured audiovisual performance in accordance with some embodiments of the invention(s). In the illustrated configuration, the user/singer sings along with the accompaniment in karaoke style. Vocals captured (651) from the microphone input 601 are continuously pitch-corrected (652) and harmonized (655) in real time for mixing (653) with the accompaniment audibly rendered at the one or more acoustic transducers 202.
Both the pitch correction and the added harmonies are chosen to correspond to the pitch track 609 of the musical score, which, in the illustrated configuration, is wirelessly communicated (661), together with the lyrics 608 and an audio encoding of the accompaniment 607, to the device(s) (e.g., from the content server 110 to the handset 101 or set-top box device, recall fig. 1) on which vocal capture and pitch correction are to be performed. In some embodiments of the techniques described herein, the note (in the current scale or key) closest to the note voiced by the user/singer is determined based on the pitch track 609 of the musical score. While this closest note may typically be the main pitch corresponding to the score-encoded vocal melody, it need not be. Indeed, in some cases, the user/singer may intend to sing harmony, and the notes voiced may be closer to a harmony track.
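The closest-note determination can be sketched in MIDI note space as follows (the function and labels are illustrative assumptions; a real implementation would work per-frame against the score-coded melody and harmony tracks):

```python
def nearest_target(sung_midi, melody_note, harmony_notes):
    """Return which score-coded target (melody or a harmony note)
    lies closest, in semitones, to the pitch the user sang."""
    candidates = [("melody", melody_note)] + [
        ("harmony", n) for n in harmony_notes
    ]
    label, note = min(candidates, key=lambda ln: abs(sung_midi - ln[1]))
    return label, note

# A singer at MIDI 64.4 against melody 60 with harmonies {64, 67}
# is judged closer to the harmony note 64 than to the melody.
```

The returned label then serves both as the pitch correction target and, as described next, as an input to the visual prominence determination.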
Thus, a computed determination that a given vocal performance is closer to the melody or to a harmony may drive a corresponding determination of visual prominence, such as placement at the prominent position of a multi-performer visual layout (recall the layouts 135.1, 136.1 and 138.1 of fig. 3, and the position of performer 1 in layout L1 of the sequence 152 of visual layouts shown in fig. 5). In some cases or embodiments, video synchronized with performances whose vocals are determined (or pitch-corrected) to be melody may be presented more prominently, while video synchronized with performances whose vocals are determined (or pitch-corrected) to be harmony may be presented with less prominence.
In the computational flow of fig. 6, the pitch-corrected or shifted vocals may be combined (654) or aggregated for mixing (653) with the audibly rendered accompaniment and/or communicated (662) to the content server 110 or a remote device (e.g., handset 120 or 620, television and/or set-top box equipment, or some other media-capable computing system 611). In some embodiments, pitch correction or shifting of the vocals, and the resulting determination of desired visual prominence, may be performed at the content server 110.
Audio-visual capture on handheld devices
Fig. 7 is a functional block diagram of hardware and software components executable on an illustrative mobile-phone-type portable computing device for processing a captured audiovisual performance in accordance with some embodiments of the invention(s). In some embodiments (recall fig. 1), capture of vocal audio and performance-synchronized video may be performed using the facilities of a television-type display and/or set-top box equipment. However, in other embodiments, the handheld device (e.g., handheld device 101) itself may support capture of both vocal audio and performance-synchronized video. Accordingly, fig. 7 depicts a basic signal processing flow (750), in accordance with certain embodiments, suitable for a mobile-phone-type handheld device 101 to capture vocal audio and performance-synchronized video, to generate pitch-corrected and optionally harmonized vocals for audible rendering (locally and/or at a remote target device), and to communicate with a content server or service platform 110.
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to the functional blocks (e.g., decoder(s) 752, digital-to-analog (D/A) converter 751, capture 753, and encoder 755) of a software-implemented signal processing flow 750, as shown in fig. 7. Likewise, with respect to fig. 6, the signal processing flow 650, and the illustrative score-coded note targets (including harmony note targets), persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques and data representations to the functional blocks and signal processing constructs (e.g., decoder(s) 658, capture 651, digital-to-analog (D/A) converter 656, mixers 653 and 654, and encoder 657) shown in fig. 6, implemented at least in part as software executable on a handset (101) or other portable computing device.
As will be appreciated by persons of ordinary skill in the art, pitch detection and pitch correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature-picking, time-domain, and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accordance with the present invention. Given this, and recognizing that implementations in accordance with the present invention(s) are generally not dependent on any particular pitch detection or pitch correction technology, this description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable for various designs or implementations. Instead, we simply note that, in some embodiments in accordance with the present invention(s), pitch detection methods compute an average magnitude difference function (AMDF) and execute logic to pick a peak corresponding to an estimate of the pitch period. Building on such estimates, a pitch synchronous overlap-add (PSOLA) technique is used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing aperiodic splicing artifacts. Implementations based on AMDF/PSOLA techniques are described in greater detail in commonly owned U.S. Patent No. 8,983,829, entitled "Coordinating and Mixing Vocals Captured from Geographically Distributed Performers," naming Cook, Lazier, Lieber, and Kirk as inventors.
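The AMDF step can be illustrated with a short sketch; real detectors add peak-picking logic and octave-error checks, and the signal, sample rate, and lag bounds below are illustrative assumptions.

```python
import math

def amdf(samples, lag):
    """Average magnitude difference at a candidate pitch lag."""
    n = len(samples) - lag
    return sum(abs(samples[i] - samples[i + lag]) for i in range(n)) / n

def estimate_period(samples, min_lag, max_lag):
    """Pick the lag whose AMDF is minimal; that lag approximates
    the pitch period (the AMDF step sketched above)."""
    return min(range(min_lag, max_lag + 1),
               key=lambda lag: amdf(samples, lag))

sr = 8000
# A 200 Hz sinusoid has an exact 40-sample period at 8 kHz.
signal = [math.sin(2 * math.pi * 200 * t / sr) for t in range(800)]
period = estimate_period(signal, 20, 60)
```

With the period estimated, a PSOLA-style resampler can splice pitch-synchronous grains to shift pitch while preserving duration.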
Exemplary Mobile device and network
Fig. 8 depicts features of a mobile device that may serve as a platform for execution of software implementations, including audiovisual capture, in accordance with some embodiments of the invention(s). More specifically, fig. 8 is a block diagram of a mobile device 800 generally consistent with commercially available versions of an iPhone™ mobile digital device. Although embodiments of the present invention are certainly not limited to iPhone deployments or applications (or even to iPhone-type devices), the iPhone device platform, together with its rich complement of sensors, multimedia facilities, application programmer interfaces, and wireless application delivery model, provides a highly capable platform on which to deploy certain implementations. Based on the description herein, persons of ordinary skill in the art will appreciate a wide range of additional mobile device platforms that may be suitable (now or hereafter) for a given implementation or deployment of the techniques described herein.
Briefly, the mobile device 800 includes a display 802 that can be sensitive to tactile and/or haptic contact with a user. Touch sensitive display 802 is capable of supporting multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, extent, and/or location of each touch point. Such processing facilitates gesturing and interacting with multiple fingers, as well as other interactions. Of course, other touch sensitive display technologies can also be used, such as a display that is contacted using a stylus or other pointing device.
Typically, the mobile device 800 presents a graphical user interface on the touch-sensitive display 802 that provides the user with access to various system objects and for communicating information. In some embodiments, the graphical user interface can include one or more display objects 804, 806. In the example shown, the display objects 804, 806 are graphical representations of system objects. Examples of system objects include device functions, applications, windows, files, alarms, events, or other identifiable system objects. In some embodiments of the invention, the application, when executed, provides at least some of the digital acoustic functionality described herein.
Typically, the mobile device 800 supports network connectivity, including, for example, mobile radio and wireless network interconnection functions that enable a user to travel with the mobile device 800 and its associated network-enabled functions. In some cases, the mobile device 800 may be capable of interacting with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, the mobile device 800 can be configured to interact as a peer with, or as a base station for, one or more other devices. In this regard, mobile device 800 may grant or deny network access to other wireless devices.
The mobile device 800 includes various input/output (I/O) devices, sensors, and transducers. For example, a speaker 860 and a microphone 862 are typically included to facilitate audio functions, such as the capture of vocal performances and the audible rendering of accompaniments and mixed pitch-corrected vocal performances as described elsewhere herein. In some embodiments of the invention, the speaker 860 and microphone 862 may provide suitable transducers for the techniques described herein. An external speaker port 864 can be included to enable hands-free voice functionality, such as speakerphone functionality. An audio jack 866 can also be included for use with a headset and/or microphone. In some embodiments, an external speaker and/or microphone may be used as transducers for the techniques described herein.
Other sensors can also be used or provided. A proximity sensor 868 can be included to detect user positioning of the mobile device 800. In some embodiments, an ambient light sensor 870 can be utilized to adjust the brightness of the touch-sensitive display 802. An accelerometer 872 can be utilized to detect movement of the mobile device 800, as indicated by directional arrow 874. Accordingly, display objects and/or media can be presented according to a detected orientation (e.g., portrait or landscape). In some embodiments, the mobile device 800 may include circuitry and sensors to support location determination capabilities, such as those provided by a Global Positioning System (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, or Uniform Resource Locators (URLs)), for geocoding as described herein. The mobile device 800 also includes a camera lens and imaging sensor 880. In some embodiments, instances of the camera lens and sensor 880 are located on the front and back surfaces of the mobile device 800. The camera allows still images and/or video to be captured for association with captured pitch-corrected vocals.
Mobile device 800 can also include one or more wireless communication subsystems, such as an 802.11b/g/n/ac communication device and/or a Bluetooth™ communication device 888. Other communication protocols can also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth-generation protocols and modulation (4G-LTE), Code Division Multiple Access (CDMA), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), and so forth. A port device 890, such as a Universal Serial Bus (USB) port or a docking port or some other wired port connection, can be included and used to establish a wired connection to other computing devices, such as other communication devices 800, network access devices, personal computers, printers, or other processing devices capable of receiving and/or transmitting data. The port device 890 may also allow the mobile device 800 to synchronize with host devices using one or more protocols (e.g., TCP/IP, HTTP, UDP, and any other known protocols).
Fig. 9 is a network diagram depicting the cooperation of exemplary devices in accordance with some embodiments of the invention(s). In particular, fig. 9 depicts various instances of handheld or portable computing devices (e.g., mobile device 800) employed in audiovisual capture (103, 103.2 … 103.N) and programmed with vocal audio and video capture code, user interface code, pitch correction code, an audio rendering pipeline, and feedback code in accordance with the functional descriptions herein. A first device instance is depicted as employed in vocal audio and performance-synchronized video capture (103) of a seed performance, while device instance 520 operates in a presentation or playback mode for a mixed audiovisual performance with dynamic visual prominence of performance-synchronized video. An additional television-type display and/or set-top box device 920A is also depicted operating in a presentation or playback mode, although, as described elsewhere herein, such devices may also operate as part of a vocal audio and performance-synchronized video capture facility. Each of the aforementioned devices communicates, via wireless data transmission and/or intervening networks 904, with a server 912 or service platform hosting the storage and/or functionality explained herein with respect to the content server 110. As described herein, captured pitch-corrected vocal performances mixed with performance-synchronized video to render a mixed audiovisual performance rendering, based on a visual progression of templated screen layouts, may (optionally) be streamed and audiovisually rendered at the laptop 911.
Other embodiments
While the invention(s) have been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited thereto. Many variations, modifications, additions, and improvements are possible. For example, although particular templated screen layouts, transitions, and audio mixing techniques are shown and described, persons of ordinary skill in the art having benefit of the present disclosure will appreciate variations and adaptations appropriate to a given deployment, implementation, musical genre, or user demographic. Likewise, while pitch-corrected vocal performances captured from a karaoke-style interface have been described, other variations and adaptations will be appreciated. Moreover, although certain illustrative signal processing techniques have been described in the context of certain illustrative applications and device/system configurations, persons of ordinary skill in the art will recognize that the described techniques can readily be modified to accommodate other suitable signal processing techniques and effects.
Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computing system (such as an iPhone handset, mobile or portable computing device, or content server platform) to perform the methods described herein. In general, a machine-readable medium can include tangible articles that encode information in a form readable by a machine (e.g., a computing facility of a computer, mobile device, or portable computing device, etc.), as well as tangible storage incident to transmission of the information. A machine-readable medium may include, but is not limited to, magnetic storage media (e.g., disk and/or tape storage devices); optical storage media (e.g., CD-ROM, DVD, etc.); magneto-optical storage media; read-only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of media suitable for storing electronic instructions, operation sequences, functional descriptive information encodings, and the like.
In general, plural instances may be provided for components, operations, or structures described herein as a single instance. Boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).

Claims (24)

1. A method of preparing a coordinated audiovisual work from the contributions of geographically distributed performers, the method comprising:
receiving, via a communication network, a plurality of audiovisual encodings of performances captured at respective remote devices in temporal correspondence with respective audible renderings of a seed, each of the received audiovisual encodings comprising respective performer vocals and temporally synchronized video;
retrieving a visual progression that encodes, in temporal correspondence with the seed, a sequence of templated screen layouts, each of the templated screen layouts specifying a number and arrangement of visual elements in which respective ones of the videos are to be visually rendered;
associating individual ones of the captured performances, including the respective performer vocals and synchronized video, with respective ones of the visual elements; and
rendering the coordinated audiovisual work as an audio mix of the captured performances and a coordinated visual presentation thereof in accordance with the visual progression and the associations.
2. The method of claim 1,
wherein successive templated screen layouts change the spatial arrangement or number of the visual elements, or both.
3. The method of claim 1,
wherein the audio mix includes performer vocals for individual ones of the captured performances that have been associated with visual elements of the particular templated screen layout active at a given point in the visual progression.
4. The method of claim 3,
wherein, at a given point in time in the audio mix, the included performer vocals are only those performer vocals for the respective captured performances associated with the then-active visual cells of the templated screen layout.
5. The method of claim 1,
wherein, in the visual progression, at least some transitions from one templated screen layout to another temporally coincide with boundaries between musical sections.
6. The method of claim 5,
wherein, in the visual progression, at least some of the transitions from one templated screen layout to another temporally coincide with transitions between respective vocal parts selected from the set of:
a first vocal part;
a second vocal part; and
a part for multiple vocalists.
7. The method of claim 5,
wherein, in the visual progression, at least some transitions from one templated screen layout to another temporally coincide with an Nth-beat rhythm of the underlying song to which the seed corresponds, where N ≥ 1.
8. The method of claim 2,
wherein, in the visual progression, the number of visual cells in at least some successive templated screen layouts increases in correspondence with intensity of the underlying song to which the seed corresponds.
9. The method of claim 2,
wherein, in the visual progression, the spatial arrangement or size of at least some visual cells changes from one templated screen layout to the next successive templated screen layout.
10. The method of claim 1, further comprising:
generating the visual progression in accordance with a structured musical arrangement corresponding to the seed.
11. The method of claim 10,
wherein the structured musical arrangement includes an encoding of musical sections in correspondence with either or both of:
a pitch track for performer vocals; and
lyrics for performer vocals.
12. The method of claim 11,
wherein, in the visual progression, at least some transitions from one templated screen layout to another temporally coincide with boundaries between the musical sections of the structured musical arrangement.
13. The method of claim 10,
wherein the structured musical arrangement includes an encoding of a backing track.
14. The method of claim 13,
wherein, in the visual progression, at least some transitions from one templated screen layout to another temporally coincide with an Nth-beat rhythm computationally extracted from the backing track, where N ≥ 1.
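The Nth-beat alignment of claims 7 and 14 amounts to quantizing proposed layout-transition times onto a beat grid. A sketch under the simplifying assumption of a fixed tempo (the claims say only that the rhythm is computationally extracted; function names are hypothetical):

```python
def beat_times(bpm: float, duration: float, n: int = 1):
    """Times (seconds) of every Nth beat over a song of the given
    duration, assuming a fixed tempo. N >= 1 per the claims."""
    period = 60.0 / bpm * n          # seconds between successive Nth beats
    times, t = [], 0.0
    while t <= duration:
        times.append(round(t, 6))
        t += period
    return times

def snap_to_beat(t: float, bpm: float, n: int = 1) -> float:
    """Quantize a proposed layout-transition time to the nearest
    Nth-beat boundary."""
    period = 60.0 / bpm * n
    return round(round(t / period) * period, 6)
```

For music with tempo drift, the fixed `bpm` grid would be replaced by beat times from an onset- or beat-tracking analysis of the backing track.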
15. The method of claim 1,
wherein each of the templated screen layouts defines a visual extent for a set of visual cells in which videos captured in correspondence with vocals of individual ones of the performers are rendered.
16. The method of claim 15, wherein the templated screen layouts include:
at least one layout with a single performer;
at least one layout with two performers;
plural layouts with three and with four performers; and
at least one layout with N performers, where N ≥ 4.
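One simple way to realize a family of layouts for one, two, three, four, or N performers is an approximately square grid of visual cells. This is only an illustrative construction (the patent's actual layouts are not specified here):

```python
import math

def grid_layout(n: int):
    """Visual cells for n performers arranged as an approximately
    square grid. Each cell is (x, y, w, h) in normalized [0, 1]
    screen coordinates."""
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    w, h = 1.0 / cols, 1.0 / rows
    # Fill row-major: performer i occupies row i // cols, column i % cols.
    return [(c * w, r * h, w, h)
            for i in range(n)
            for r, c in [divmod(i, cols)]]
```

For example, `grid_layout(4)` yields a 2×2 window-paned arrangement, matching the "window-paned music video" style described in the abstract.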
17. The method of claim 1, further comprising:
transitioning, in accordance with the retrieved visual progression, from a first templated screen layout to a next successive templated screen layout, wherein, for video captured in correspondence with a particular performer vocal, the transition is from a first visual cell of the first layout to a second visual cell of the next successive layout.
18. The method of claim 17, wherein the transition from the first visual cell to the second visual cell includes one or more of:
a slide transition;
a fade transition;
a scale transition; and
a cut transition.
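The transition types of claim 18 can all be expressed as interpolating a cell's visual parameters over the transition interval. A sketch of fade (alpha interpolation) and slide (position interpolation); a cut is the degenerate case where progress jumps from 0 to 1. Names and the linear interpolation choice are assumptions:

```python
def fade(progress: float, fade_in: bool = True) -> float:
    """Alpha for a fade transition; progress runs 0.0 -> 1.0 over
    the transition interval."""
    p = min(max(progress, 0.0), 1.0)   # clamp to [0, 1]
    return p if fade_in else 1.0 - p

def slide(progress: float, x_from: float, x_to: float) -> float:
    """Position for a slide transition: linear interpolation of the
    cell's coordinate from the first layout to the next."""
    p = min(max(progress, 0.0), 1.0)
    return x_from + (x_to - x_from) * p
```

A scale transition would interpolate the cell's width and height in the same way.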
19. The method of claim 1,
wherein the rendering is to an audiovisual encoding or container format suitable for storage or transmission via the communication network.
20. The method of claim 1,
wherein the rendering is to a display and audio transducer.
21. The method of claim 1, further comprising:
scaling, in the audio mix, an audio amplitude of performer vocals for a captured performance that has been associated with a particular visual cell of the then-active templated screen layout, wherein the scaled amplitude for a particular performer's vocals is in correspondence with a size of the particular visual cell with which that performer's video is associated.
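Claim 21 ties a performer's mix amplitude to the on-screen size of the associated cell. One plausible reading (an assumption, not the claimed formula) scales linear gain by cell area relative to the largest cell in the active layout:

```python
def cell_gain(cell, cells):
    """Linear gain for a performer's vocals, proportional to the area
    of the performer's visual cell relative to the largest cell in the
    active templated screen layout. cell = (x, y, w, h)."""
    area = cell[2] * cell[3]
    max_area = max(c[2] * c[3] for c in cells)
    return area / max_area
```

Under this reading, a performer shown full-screen mixes at unity gain while a performer in a quarter-size pane is attenuated proportionally; a perceptual (e.g. logarithmic) mapping would be an equally valid design choice.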
22. The method of claim 1, further comprising:
panning, in the audio mix and in left-right correspondence, performer vocals for a captured performance that has been associated with a particular visual cell of the then-active templated screen layout, wherein the panning for a particular performer's vocals is in correspondence with the lateral placement of the particular visual cell with which that performer's video is associated.
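Claim 22 maps a cell's lateral placement to a left-right pan position. A sketch using the common constant-power pan law (the claim does not specify a pan law, so this is an assumption):

```python
import math

def pan_gains(cell):
    """Left/right channel gains derived from a visual cell's horizontal
    center. cell = (x, y, w, h) in normalized coords; a center of 0.5
    pans to the middle. Constant-power law: L = cos(theta), R = sin(theta),
    which keeps L**2 + R**2 == 1 across the stereo field."""
    center = cell[0] + cell[2] / 2.0          # lateral placement in [0, 1]
    theta = center * math.pi / 2.0
    return math.cos(theta), math.sin(theta)
```

A cell centered on screen yields equal left and right gains, while a cell flush left sends the associated performer's vocals entirely to the left channel.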
23. The method of claim 1, further comprising:
transmitting an encoding of the coordinated audiovisual work to one or more of the geographically distributed performers.
24. The method of claim 1, further comprising:
receiving, via the communication network, an audiovisual encoding of a seed performance comprising a first performer vocal together with performance synchronized video captured at a first remote device against an audible rendering of a backing track,
wherein the seed comprises the seed performance of the first performer.
HK18114438.1A 2015-06-03 2016-06-03 Method and system for automated generation of coordinated audiovisual work HK1255278B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US62/170,255 2015-06-03

Publications (2)

Publication Number Publication Date
HK1255278A1 true HK1255278A1 (en) 2019-08-09
HK1255278B HK1255278B (en) 2022-09-09
