NL2037362B1 - Subtitling system and method - Google Patents
Subtitling system and method
- Publication number
- NL2037362B1
- Authority
- NL
- Netherlands
- Prior art keywords
- sound
- text
- audio
- output
- series
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04H—BROADCAST COMMUNICATION
- H04H60/00—Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
- H04H60/02—Arrangements for generating broadcast information; Arrangements for generating broadcast-related information with a direct linking to broadcast information or to broadcast space-time; Arrangements for simultaneous generation of broadcast information and broadcast-related information
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention provides a live subtitling system for subtitling a series of people while they are speaking to a series of clients, in particular the series of people performing in a theatre, for instance playing in a play, singing in a musical, opera or operetta, with the clients forming part of a live audience. The live subtitling system comprises at least one speech-to-text conversion system and a wireless broadcast system.
Description
P100917NL00
Subtitling system and method
The invention relates to a live subtitling system and a method for live subtitling an event.
WO2020017961 according to its abstract relates to “Methods for a voice processing system comprising P microphone units (102A...102D) and a central unit (104) are disclosed. Each microphone unit is linked to a person and derives from N microphone signals a source localisation signal. The source localisation signal is used to control an adaptive beam form process to obtain a beam formed audio signal. The microphone unit is further configured to derive metadata from the N microphone signals, such as the direction the sound is coming from. Packages with the metadata and beam formed audio signal are transmitted to the central unit. The central unit processes the metadata to determine which parts of the P beam formed audio signals comprise speech from a person that is linked to another microphone unit. By removing said parts from the audio signals before transcription, the quality of the transcription is improved. The transcriptions are displayed on a remote device.”
GB2568656 according to its abstract relates to “A system for displaying captions during a live performance, comprises: a memory storing a follower script, including waypoints associated with performance cues, and a caption script, a speech follower component to recognise performance spoken dialogue and compare it with the follower script to track the location in the follower script of the spoken dialogue, identifying when a caption is displayed; a caption output module, accessing from the caption script a caption for display at each location in the follower script associated with a caption; and a cue handler storing performance cue identifiers with associated cue metadata and which receives detected performance cues and outputs cue signals to the speech follower, assisting the speech follower to determine the location based on waypoints at detected cues. Also, a method of delivering an information output to a live performance viewer, the information being displayed text or an audio description at predefined times relative to stage events. A follower script with entries organized along a timeline, and metadata at timepoints between at least some entries are provided. Metadata is associated with stage events. Speech recognition tracks spoken dialogue against the follower script entries, and the stage events, aiding following the live performance.”
A disadvantage of the prior art is that subtitles or captions of different performers are mixed, and that, for instance, language difficulties remain a problem. This problem grows bigger if the performance includes other sound, like singing, music, and the like.
Hence, it is an aspect of the invention to provide an alternative system and method, which preferably further at least partly obviates one or more of the above-described drawbacks.
There is provided a live subtitling system for subtitling a series of people while they are speaking to a series of clients, in particular the series of people performing in a theatre, for instance playing in a play, singing in a musical, opera or operetta, with the clients forming part of a live audience, the live subtitling system comprising:
- a series of sound input devices, for instance microphones, each sound input device dedicated to one of the series of people and each providing, in operation, a sound signal;
- a mixing console having the series of sound input devices each functionally coupled to a respective input channel for receiving a sound signal on each input channel, the mixing console comprising a digital output channel for each sound input device of the series of sound input devices, at least one fader for setting an output level for each of the output channels separately, and a further output channel for providing an output sound signal to a sound output system;
- a sound processing system comprising a sound system input functionally coupled to one of the digital output channels of the mixing console, and comprising a data processor running a computer program which, when running on the data processor, performs at least one selected from a sound compression, a sound multiplexing, a labelling of sound data with mixing console input channels, and a combination thereof, for producing at least one sound data package that comprises labels indicative of a respective one of the series of sound input devices and that sound input device's sound output if the sound of that sound input device is above the preset output level, and transmitting said at least one sound data package to at least one sound processing output;
- at least one speech-to-text conversion system functionally coupled to the at least one sound processing output for receiving said at least one sound data package, the speech-to-text conversion system adapted for providing for each sound data package a text data package to said sound processing system, said sound processing system labelling the text in said text data package to the series of sound input devices, providing a labelled text data package;
- a wireless broadcast system operationally coupled to the sound processing system for receiving the labelled text data package and adapted for wirelessly broadcasting the generated labelled text to client display devices for functionally live subtitling the sound produced by the series of people, in particular for allowing display of the subtitles in synchronisation with the output sound signal, more in particular in synchronisation with the sound played by the sound system resulting from the output sound signal.
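By way of illustration only, the sound data package recited above can be thought of as a labelled container that keeps, per mixing console input channel, the channel label and that channel's audio whenever its level exceeds the preset output level. The following minimal Python sketch shows such a structure; all names and the threshold value are hypothetical assumptions for the sketch, not part of the claimed system.

```python
from dataclasses import dataclass, field
import time

@dataclass
class ChannelFrame:
    """Audio from one sound input device, tagged with its console channel."""
    channel_label: str   # e.g. the mixing console input channel id (hypothetical)
    samples: bytes       # raw audio for this time segment
    level: float         # measured level, 0.0..1.0

@dataclass
class SoundDataPackage:
    """Package sent from the sound processing system to speech-to-text."""
    timestamp: float = field(default_factory=time.time)
    frames: list[ChannelFrame] = field(default_factory=list)

def build_package(frames: list[ChannelFrame],
                  level_threshold: float = 0.1) -> SoundDataPackage:
    """Keep only channels whose sound exceeds the preset output level."""
    return SoundDataPackage(frames=[f for f in frames if f.level > level_threshold])
```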
There is further provided a method for live subtitling an event in which a series of people talk to a series of clients, in particular the series of people performing in a theatre with the clients forming part of an audience, in particular using a live subtitling system as described above, wherein each person of the series of people is provided with a sound input device, in particular a microphone, the sound input devices provide a series of sound streams to a mixing console, and the mixing console couples each of the data streams to a speech-to-text conversion system comprising at least one sound input channel, the method comprising providing a separate digital output from the mixing console for each of the series of input channels, coupling each digital output to a sound-to-text system, the sound-to-text system providing digital data comprising a text with a label indicating one of the series of microphones to a wireless broadcast system, and said wireless broadcast system wirelessly broadcasting the generated labelled texts to the clients.
There is further provided a computer program product which, when executed on a data processing device, performs receiving a series of sound streams, applying sound processing comprising at least one selected from a sound compression and a multiplexing to combine the series of sound streams into a sound data package, transmitting the sound data package to at least one speech-to-text conversion system, receiving text data from said speech-to-text conversion system, which text data is a conversion of said sound data package, converting the text data into a subtitle data package comprising a series of subtitles, including with each subtitle of the series of subtitles an indication labelling the subtitle to a sound stream of the series of sound streams, and broadcasting the subtitle data package to a series of display devices, wherein the subtitle data is displayed live with respect to a time at which the series of sound streams were produced.
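The computer program product recites a fixed sequence of steps. The sketch below illustrates that control flow in Python; the speech-to-text conversion and the broadcast are passed in as stand-in callables, since the text does not prescribe a particular implementation for either.

```python
def run_subtitling_cycle(sound_streams, stt_convert, broadcast):
    """One processing cycle of the described computer program product.

    sound_streams: dict mapping stream label -> audio bytes
    stt_convert:   callable(sound_package) -> dict label -> text (stand-in)
    broadcast:     callable(subtitle_package) -> None (stand-in)
    """
    # Combine the series of sound streams into one sound data package.
    package = {label: audio for label, audio in sound_streams.items()}
    # The speech-to-text conversion system returns text per labelled stream.
    texts = stt_convert(package)
    # Convert the text data into a subtitle data package: each subtitle
    # keeps the label tying it back to its sound stream.
    subtitles = [{"label": label, "text": text} for label, text in texts.items()]
    broadcast({"subtitles": subtitles})
```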
In the current context, reference is made to “subtitling”. In some instances, this may also be referred to as “captioning”. In the current technology, sound is generated by people or persons and that sound comprises words. This sound is converted into readable text and presented to one or more persons that are present live, hearing the sound. Usually, this is an audience. The current system may also be applied in large meetings like a UN assembly meeting.
Important in the current invention is that the people perform or speak live. For instance, the person or people give a live performance. In this respect, live means that for a spectator or person attending the performance the motions and lips of the people performing or speaking are in synchronization (“in sync”) with the produced sound. In some embodiments, additional sound, like music or pre-recorded sound, is added to the performance. This in some ways complicates the subtitling process. The requirement of being “in sync” means that usually there can only be a little time between the motions of the lips, hearing the sound, and displaying the subtitling. If there is a time difference, this is also referred to as latency. In practice, usually there is less than 1 second of time difference. For a good synchronisation, there is less than 10 milliseconds of time difference. In more optimal situations, the time difference is less than 1 millisecond.
In the current description, the people are speaking. This usually includes public speaking, like giving a lecture or a speech, or reciting a poem or part of a book. Singing is here included as well. In fact, this can also be seen as performing.
In the current context, the clients are at the same physical location as the people that are performing. This means that the performance is in fact live. The current system can help people with a hearing disability to understand and follow a performance. It can also help people follow and understand a performance in general. This may also include people that do not have a hearing disability. It may even include presenting a translated version of the performance. In an embodiment, the clients may even be allowed to select a language in which the text is displayed. Such a selection of language may be made individually.
With respect to a mixing console, such a device comprises a series of input channels. To these input channels, a series of sound input devices can be functionally coupled, each to a respective input channel. The mixing console further comprises at least one digital output channel for the series of sound input devices. It further comprises a fader for setting an output level for each of the output channels separately. The mixing console has one or more outputs for providing output to sound output devices, like for instance speakers.
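Numerically, a fader is simply a per-channel gain. The sketch below, using NumPy and hypothetical names, illustrates how a console can expose both the individual post-fader channels (the further digital outputs) and their sum (the mixed output to the speakers); it is an illustration of the principle, not a description of any particular console.

```python
import numpy as np

def mix(inputs: np.ndarray, fader_levels: np.ndarray):
    """Apply per-channel fader gains and sum into one mixed output.

    inputs:       (channels, samples) float array of input signals
    fader_levels: (channels,) fader setting per input channel

    Returns the per-channel post-fader outputs (the individual digital
    outputs) and the single mixed output fed to the speaker system.
    """
    per_channel = inputs * fader_levels[:, None]   # individual "out-2" channels
    mixed = per_channel.sum(axis=0)                # summed "out" signal
    return per_channel, mixed
```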
In particular, a mixing console or mixing desk is an electronic device for mixing audio signals, used in sound recording and reproduction and sound reinforcement systems. Inputs to the console include microphones, signals from electric or electronic instruments, or pre-recorded sounds. Mixers may control analog or digital signals. The modified signals are summed to produce the combined output signals, which can then be broadcast, amplified through a sound reinforcement system or recorded.
Examples of suitable mixing consoles are from Alesis, Allen & Heath, Audient, Automated Processes, Inc., AMS Neve, Avid, Behringer, Cadac Electronics, Calrec, Crest Audio, D&R, DHD audio, DiGiCo, Electro-Voice, Euphonix, Fairlight, Focusrite, Harrison Audio Consoles, Klotz Digital, Lawo, Logitek, Mackie, MCI, Midas, Peavey, Phonic, PreSonus, QSC, Rane, Roland, Shure, Solid State Logic (SSL), Soundcraft, Speck Electronics, Stage Tec, Studer, Studiomaster, TASCAM, Telos Alliance, Ward-Beck Systems, Wheatstone, Yamaha, Yorkville.
As described above, there is provided a live subtitling system for subtitling a series of people while they are speaking to a series of clients, in particular the series of people performing in a theatre, for instance playing in a play, singing in a musical, opera or operetta, with the clients forming part of a live audience. Below, some specific embodiments are discussed. It should be noted that combinations of these embodiments are also explicitly foreseen.
In an embodiment, the computer program when running further retrieves a label coupled to each sound input device, and couples each label with a sound output from said sound input device.
In an embodiment, the mixing console comprises a series of said at least one fader. In an embodiment, the mixing console comprises at least one fader per sound input device. In an embodiment with at least one fader per sound input device, said faders are provided for at least one of setting a sound level pass percentage per sound input device and setting relative mutual levels for the sound input devices.
In an embodiment, the mixing console provides a sound output for a mixed sound signal as output from the at least one fader, and the series of individual output channels from the at least one fader.
In an embodiment, the live subtitling system comprises a time division multiplexing device for providing the sound of said sound input devices as a train of time slices in a series of channels. In particular, the multiplexer reduces the series of sound channels to one channel.
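A minimal illustration of such time division multiplexing in Python is given below; the slice length and the (index, slice) framing are assumptions chosen for the sketch, not features of the claimed device.

```python
def tdm_multiplex(channels: list[bytes], slice_len: int = 960) -> list[tuple[int, bytes]]:
    """Interleave fixed-length time slices of several channels into one train."""
    if not channels:
        return []
    train = []
    longest = max(len(c) for c in channels)
    for offset in range(0, longest, slice_len):
        for idx, channel in enumerate(channels):
            piece = channel[offset:offset + slice_len]
            if piece:
                # Each slice carries its channel index so the receiver can
                # demultiplex the train back into the original channels.
                train.append((idx, piece))
    return train
```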
In an embodiment, the computer program performs a lossless sound compression, in particular compressing the timeline of a time segment of the sound input device sound output.
In an embodiment, the sound output of the series of sound input devices is processed in parallel. In a particular embodiment, the sound output is processed by applying time compression. In this way, it may be possible to reduce latency between production of sound and display of the subtitles.
In an embodiment, the computer program applies sound compression and subsequently multiplexing. In a particular embodiment, the computer program applies time division multiplexing to produce a reduced number of sound data packages.
In an embodiment, the label includes sound input device data. In a particular embodiment, the label comprises position data of at least one of the sound input devices. In this way, a display system like smart glasses may project the subtitling close to the performer that produced that text.
In an embodiment, the live subtitling system comprises a speech-to-text conversion system per sound data stream. In this way, the relevant sound input devices are all processed in parallel. This may reduce latency.
In an embodiment, the wireless broadcast comprises a WIFI transmission system adapted for providing a broadcast. The WIFI transmission system may be an open system. Broadcasting may include transmitting via WIFI or via Bluetooth or a similar digital system, allowing the clients/audience to easily receive the subtitling live.
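On a venue WIFI network, such a one-to-many transmission could for instance be realised as a UDP broadcast; the sketch below is one possible realisation, with the port number an arbitrary assumption.

```python
import json
import socket

def broadcast_subtitles(subtitle_package: dict, port: int = 5005) -> None:
    """Send one labelled subtitle package to all clients on the local network."""
    payload = json.dumps(subtitle_package).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Enable broadcast so a single datagram reaches every listening client.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", port))
```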
In an embodiment, the mixing console comprises a mixing console data processor and a mixing console computer program which, when running on said mixing console data processor, applies a trained neural network to operate said at least one switch, said trained neural network trained using a series of follower scripts and resulting mixing console switch settings.
In an embodiment, the live subtitling system further comprises a text-translation system for translating at least part of the text output by said speech-to-text conversion system. This improves involvement of the clients/users. For instance, the performer/actor may use different languages, or sing and speak in different languages. Auto-translation may present everything in one language. The client may select a preferred language, increasing involvement and understanding.
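One possible way to realise per-client language selection is to translate each labelled text once per requested language and broadcast each language on its own channel. In the sketch below, the translate callable is a hypothetical stand-in for whatever machine-translation backend is used; it is not part of the described system.

```python
def fan_out_translations(labelled_text: dict, languages: set[str], translate) -> dict:
    """Produce one subtitle package per requested language.

    labelled_text: {"label": ..., "text": ...} in the source language
    translate:     callable(text, target_language) -> str (stand-in for a
                   real machine-translation backend)
    """
    return {
        lang: {"label": labelled_text["label"],
               "text": translate(labelled_text["text"], lang)}
        for lang in languages
    }
```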
In an embodiment, the labelled text is received on a personal display device of each client. In an embodiment, such a personal display device comprises a wearable device. In an embodiment, the wearable device comprises a pair of glasses comprising a projecting device for projecting data on at least one glass of the pair of glasses, including a smart contact lens. A pair of glasses or contact lenses or the like may increase the experience and combine visual information with sound-to-text converted visual information.
In an embodiment, the label comprises an additional indication of a nature of said text, said additional indication selected from at least one of an indication that the text was sung and an indication that the text resulted from a relatively loud sound. This makes the user experience more intense.
In an embodiment, the live subtitling system further comprises a positioning system for tracking the position of the people talking in public. In an embodiment of such a positioning system, the labelled text is displayed on the personal display device near a representation of the person.
The terms “upstream” and “downstream” relate to an arrangement of items or features relative to the propagation of the light from a light generating means or the flow of water or the transmission of sound. Relative to a first position within a beam of light from the light generating means, a second position in the beam of light closer to the light generating means is “upstream”, and a third position within the beam of light further away from the light generating means is “downstream”. In the current system, the source of sound is referred to as upstream, and the displaying of the subtitle is referred to as downstream.
The term “substantially” herein, such as in “substantially consists”, will be understood by the person skilled in the art. The term “substantially” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective substantially may also be removed. Where applicable, the term “substantially” may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%. The term “comprise” includes also embodiments wherein the term “comprises” means “consists of”.
The term "functionally" will be understood by, and be clear to, a person skilled in the art. The term “substantially” as well as “functionally” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective functionally may also be removed. When used, for instance in “functionally parallel”, a skilled person will understand that the adjective “functionally” includes the term substantially as explained above. Functionally in particular is to be understood to include a configuration of features that allows these features to function as if the adjective “functionally” was not present. The term “functionally” is intended to cover variations in the feature to which it refers, and which variations are such that in the functional use of the feature, possibly in combination with other features it relates to in the invention, that combination of features is able to operate or function. For instance, if an antenna is functionally coupled or functionally connected to a communication device, received electromagnetic signals that are receives by the antenna can be used by the communication device. The word “functionally” as for instance used in “functionally parallel” is used to cover exactly parallel, but also the embodiments that are covered by the word “substantially” explained above. For instance, “functionally parallel” relates to embodiments that in operation function as if the parts are for instance parallel. This covers embodiments for which it is clear to a skilled person that it operates within its intended field of use as if it were parallel.
Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.
The devices or apparatus herein are amongst others described during operation.
As will be clear to the person skilled in the art, the invention is not limited to methods of operation or devices in operation.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device or apparatus claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The invention further applies to an apparatus or device comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings.
The various aspects discussed in this patent can be combined in order to provide additional advantages. Furthermore, some of the features can form the basis for one or more divisional applications.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings in which corresponding reference symbols indicate corresponding parts, and in which:
Figure 1 schematically depicts an embodiment of a live subtitling system;
Figure 2 schematically shows an alternative embodiment of a live subtitling system.
The drawings are schematic and not necessarily on scale.
In general, the currently claimed system is used in venues, which include theatre halls, but can also include indoor venues like general halls and classrooms. It may also include outdoor venues. In such venues at least one person speaks, sings, or produces sound. In particular, such a sound is spoken sound, like a recital, a speech, or text in a play. In many circumstances, it may be difficult for a person, like a member of the audience, to hear or understand the sound. For instance, for some people the lyrics may be difficult to hear or understand. Thus, the current system can assist hearing-disabled people, but may also help to understand lyrics, for instance. It may even be combined with or include “on the fly” translation. In fact, the available text may be translated into different languages at the same time. Such translated text may be subsequently broadcast over separate channels, allowing the audience to switch to a text in a selected language. For translating, currently text is sent to a computer translation system, usually based on a trained neural network.
Figure 1 schematically depicts an embodiment of a live subtitling system 1. Such a system is usually applied in a venue 9 like a hall, theatre hall, or the like. In such a venue 9, an audience 3 is present. One or more persons 2 perform in front of the audience 3. Usually, more than one person performs. In fact, the complexity and problems multiply when the performance comprises more than one person. The complexity further multiplies when the performance includes music and spoken or sung text, and further becomes more complex if more than one person produces audio text. Often, the performers use a microphone 4, and the produced text (including lyrics) is output via at least one speaker 8, often amplified using a sound system.
Usually in these venues, there is a mixing console 5. The mixing console 5 may be part of the sound system that plays sound or music via the speakers 8 so that the audience can hear the performance of the people 2, for instance artists, singers, speakers. The sound system may include one or more amplifiers.
Usually, during the performance a sound technician operates the mixing console 5. In recent developments, attempts are made to use automated systems, like trained neural networks or other artificial intelligence (AI), to operate the mixing console.
Using the mixing console, a sound technician mixes the incoming channels (“in”) of the mixing console. The incoming input channels of the mixing console 5 include signals from the microphones (which may be wirelessly coupled), but may additionally also include music or other sources of sound, for instance pre-recorded sound. Usually, each source of incoming sound has at least one input channel “in”. The sound levels of the input channels are usually individually adjusted using a fader.
The mixed sound resulting from the mixing console 5 is coupled, via an output channel “out”, to a speaker system. The speakers 8 of the speaker system may be wire-coupled or wirelessly coupled. With respect to the output channel “out”, the mixing console mixes all the incoming channels into (usually one) sound output that is coupled to a sound system, usually speakers playing the result of the mixed channels.
The mixing console 5 usually comprises at least one further output channel out-2. In an embodiment, this further output channel out-2 is digital. In an embodiment, each input channel “in” is coupled to an individual further output channel out-2. In an embodiment, the further output channel out-2 comprises a series of digital further output channels out-2, individually coupled to one of the input channels “in”. Usually, the output channel(s) out-2 is/are coupled via a fader. This means that the relative levels of the input channels are changed. After the faders, the input channels are individually coupled to out-2, allowing identification of individual sound input devices. The output from the faders is also coupled to a mixer, to provide one mixed output via output channel “out”.
Via the mixing console further output out-2, the mixing console 5 has a coupling 6 with a computer system 13. The coupling 6 may be a wired or wireless coupling. The computer system often is remote from the mixing console 5. This may mean that the computer system 13 is at a distance from the mixing console 5. It may be in the vicinity of the mixing console 5. In an embodiment, it is in the same building.
Alternatively or in combination, it is coupled via a LAN, WLAN or even via a cabled coupling.
For processing the output received from the mixing console 5 from the further output out-2 via coupling 6, the computer system 13 comprises a sound processing system 14. Such a sound processing system 14 in an embodiment comprises a data processor and software running on that data processor. The software can perform a sound compression, for instance using known sound compression algorithms. The software can also perform sound multiplexing. This can be included after sound compression. In sound multiplexing, several channels are placed one behind the other (time division multiplexing), or any other multiplexing method suitable for sound and known to a skilled person is applied. The software can further provide labelling of sound data with mixing console input channels. These software-implemented functions can be combined. The processing using the software produces at least one sound data package that comprises labels indicative of a respective one of the series of sound input devices and that sound input device's sound output if the sound of that sound input device is above the preset output level. The software running on the sound processing system 14 furthermore transmits the at least one sound data package to at least one sound processing output. Via that output, sound data packages are coupled to a speech-to-text conversion system.
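By way of example, lossless compression and labelling of a single channel could look as follows in Python, here using the general-purpose zlib codec from the standard library as a stand-in for a dedicated audio codec; the preset level value is an assumption.

```python
import zlib

def process_channel(label: str, samples: bytes, level: float,
                    preset_level: float = 0.1):
    """Compress one channel's audio and attach its console-channel label.

    Returns None when the channel's level is below the preset output
    level, so silent channels never reach the speech-to-text system.
    """
    if level <= preset_level:
        return None
    return {"label": label, "audio": zlib.compress(samples)}
```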
In an embodiment, the sound processing system 14 comprises a so-called Dante system.
Dante is the product name for a combination of software, hardware, and network protocols that delivers uncompressed, multi-channel, low-latency digital audio over a standard Ethernet network using Layer 3 IP packets. It was developed in 2006 by the Sydney-based company Audinate. Dante builds on previous audio-over-Ethernet and audio-over-IP technologies.
The sound that is transmitted from the mixing console 5 can, in an embodiment, also be input to the sound processing system 14 via an automixer 18. An automixer is a device that is known in the working field of sound technicians. It automatically reduces the level of, or even silences, microphones or other sound input devices 4 that are not used. An automixer 18 is also referred to as an automatic microphone mixer. Usually, the automixer reduces extraneous noise pick-up. As the applicant found, application in the current system allows a better sound-to-text conversion.
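The behaviour of an automixer can be approximated by a gain-sharing rule, in which each channel's gain is its share of the total measured level, so that microphones that are not being used are pulled down. The following NumPy sketch illustrates this principle; the floor gain value is an assumption, and the sketch does not describe any particular automixer product.

```python
import numpy as np

def automix_gains(levels: np.ndarray, floor_gain: float = 0.05) -> np.ndarray:
    """Gain-sharing automix: each channel's gain is its share of the total
    level, so unused microphones are attenuated towards floor_gain."""
    levels = np.asarray(levels, dtype=float)
    total = float(levels.sum())
    if total <= 0.0:
        # No signal anywhere: hold every channel at the floor gain.
        return np.full_like(levels, floor_gain)
    return np.maximum(levels / total, floor_gain)
```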
The computer system 13 is functionally coupled with a sound-to-text conversion system, also referred to as a speech-to-text conversion system 17. The coupling can be a wireless coupling, for instance via the internet (IP protocol). The speech-to-text system as such often comprises an artificial intelligence system that is trained to convert sound, often spoken words but also singing, to text. These systems as such are known to a skilled person. In the current system, various channels of sound are often available, often producing sound, including spoken or sung sound, via separate channels but at the same time. Using the sound system, in an embodiment the different channels can be multiplexed. In order to reduce time delay as much as possible, sound data may be compressed and/or time-compressed. Often, a time multiplexing is performed. In such a setup, n incoming channels are cut up into time slices. Time slices from the various channels are combined. A sound-to-text conversion system may comprise a setting for applying a frequency shift to the sound.
As described, the computer system 13 is coupled to a speech-to-text conversion system 17. The one or more sound channels are transmitted from the computer system 13 to the speech-to-text conversion system 17. The speech-to-text conversion system 17 converts incoming sound to text. In order to relate the text to the original source (microphone), each part of text is labelled, yielding a labelled text fragment. The labelled text fragments are transmitted from the speech-to-text conversion system 17 to the computer system 13. The computer system 13 can process the labelled text fragments into subtitles. The subtitles are subsequently broadcast via transmitter 10 in the venue. There, display devices comprising screens, tablets 11, or for instance smart glasses 12 or even smart contact lenses, can display the subtitles. The subtitles may include an indication linking the subtitles to the specific source, like microphone 4.
Figure 2 schematically illustrates an alternative system. The references show the same or functionally the same elements and features as figure 1. In some instances, differences are explained.
In figure 2, the mixing console 5 has a controller or switch 7 for controlling an output level for each incoming input channel. In this embodiment, each incoming channel has its own output, providing a series of mixing console couplings 6. These can be physically separate channels. Alternatively, they can be multiplexed into a single sound data stream. Each input channel in this embodiment is separately input into an input channel of a speech-to-text system 17. In an embodiment, only those channels 6 that have an output above a predefined level are input into an input channel of a speech-to-text system. In an embodiment, multiple speech-to-text systems 17 are provided in order to limit possible delay between production of a sound and displaying a subtitle representation of that sound.
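Routing only the above-threshold channels to parallel speech-to-text workers, as described in this embodiment, could be sketched as follows; the speech-to-text call is a stand-in, and worker threads here merely stand in for the multiple speech-to-text systems 17.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_transcribe(channels: dict, stt_convert, level_threshold: float = 0.1):
    """Send each above-threshold channel to its own speech-to-text worker.

    channels:    {label: (audio_bytes, level)}
    stt_convert: callable(audio_bytes) -> str (stand-in for one
                 speech-to-text system instance)
    """
    # Only channels with output above the predefined level are transcribed.
    active = {label: audio for label, (audio, level) in channels.items()
              if level > level_threshold}
    with ThreadPoolExecutor(max_workers=len(active) or 1) as pool:
        futures = {label: pool.submit(stt_convert, audio)
                   for label, audio in active.items()}
        return {label: fut.result() for label, fut in futures.items()}
```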
The various subtitle texts are output by a respective speech-to-text system. A wireless transmission or broadcast system 10 can label each subtitle text before it is displayed on a display device, for instance a screen 11. This allows a user in the audience to see which performer produced which text.
The current live subtitling system has various functional systems and devices that perform a function in the system. These various functions may be transferred to, or also be included in, other systems of the current live subtitling system.
It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection.
Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.
Reference numbers
1 live subtitling system
2 performers
3 audience
4 microphone
5 mixing console
6 mixing console coupling
7 switch
8 speakers
9 venue, theatre hall
10 wireless transmission/broadcast
11 display device, screen
12 display device, smart glass
13 computer system
14 sound processor/sound processing system
15 theatre seats
16 functional coupling, including internet
17 speech-to-text conversion system
18 automixer
Out mixing console output (to speaker system)
Out-2 mixing console further output
In mixing console input
Claims (18)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| NL2037362A NL2037362B1 (en) | 2024-03-28 | 2024-03-28 | Subtitling system and method |
| PCT/NL2025/050153 WO2025206956A1 (en) | 2024-03-28 | 2025-03-28 | Subtitling system and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| NL2037362A NL2037362B1 (en) | 2024-03-28 | 2024-03-28 | Subtitling system and method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| NL2037362B1 (en) | 2025-10-10 |
Family
ID=91129832
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| NL2037362A NL2037362B1 (en) | 2024-03-28 | 2024-03-28 | Subtitling system and method |
Country Status (2)
| Country | Link |
|---|---|
| NL (1) | NL2037362B1 (en) |
| WO (1) | WO2025206956A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170092274A1 (en) * | 2015-09-24 | 2017-03-30 | Otojoy LLC | Captioning system and/or method |
| GB2568656A (en) | 2017-09-28 | 2019-05-29 | The Royal Nat Theatre | Caption delivery system |
| WO2020017961A1 (en) | 2018-07-16 | 2020-01-23 | Hazelebach & Van Der Ven Holding B.V. | Methods for a voice processing system |
| US11500226B1 (en) * | 2019-09-26 | 2022-11-15 | Scott Phillip Muske | Viewing area management for smart glasses |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017179040A1 (en) * | 2016-04-15 | 2017-10-19 | Gala Prompter Ltd. | System and method for distribution and synchronized presentation of content |
| US20250006226A1 (en) * | 2023-06-30 | 2025-01-02 | Adobe Inc. | Script based video effects for live video |
- 2024-03-28 NL NL2037362A patent/NL2037362B1/en active
- 2025-03-28 WO PCT/NL2025/050153 patent/WO2025206956A1/en active Pending
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170092274A1 (en) * | 2015-09-24 | 2017-03-30 | Otojoy LLC | Captioning system and/or method |
| GB2568656A (en) | 2017-09-28 | 2019-05-29 | The Royal Nat Theatre | Caption delivery system |
| WO2020017961A1 (en) | 2018-07-16 | 2020-01-23 | Hazelebach & Van Der Ven Holding B.V. | Methods for a voice processing system |
| US11500226B1 (en) * | 2019-09-26 | 2022-11-15 | Scott Phillip Muske | Viewing area management for smart glasses |
Non-Patent Citations (1)
| Title |
|---|
| "Captioning and Subtitling for d/Deaf and Hard of Hearing Audiences", 14 January 2021, UCL PRESS, London WC1E 6BT, ISBN: 978-1-78735-710-5, article SOLEDAD ZÁRATE: "Captioning and Subtitling for d/Deaf and Hard of Hearing Audiences", pages: 1 - 178, XP093080436, DOI: . https://doi.org/10.14324/111.9781787357105 * |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025206956A1 (en) | 2025-10-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3633671B1 (en) | Audio guidance generation device, audio guidance generation method, and broadcasting system | |
| US10726842B2 (en) | Caption delivery system | |
| US20080195386A1 (en) | Method and a Device For Performing an Automatic Dubbing on a Multimedia Signal | |
| EP2165531B1 (en) | An audio animation system | |
| US20060285654A1 (en) | System and method for performing automatic dubbing on an audio-visual stream | |
| USRE42647E1 (en) | Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same | |
| US20090079833A1 (en) | Technique for allowing the modification of the audio characteristics of items appearing in an interactive video using rfid tags | |
| EP3224834B1 (en) | Apparatus and method for generating visual content from an audio signal | |
| US20230345086A1 (en) | System and method for providing descriptive video | |
| Huwiler | A Narratology of Audio Art: Telling Stories by Sound¹ | |
| Janer et al. | Immersive orchestras: audio processing for orchestral music VR content | |
| CN113545096B (en) | Information processing device and information processing system | |
| NL2037362B1 (en) | Subtitling system and method | |
| Simon et al. | MPEG-H Audio for Improving Accessibility in Broadcasting and Streaming | |
| Walczak et al. | Artificial voices | |
| US20220264193A1 (en) | Program production apparatus, program production method, and recording medium | |
| JP2008294722A (en) | Movie playback apparatus and movie playback method | |
| Lodge et al. | Helping blind people to watch television-the AUDETEL project | |
| Baumgartner et al. | Speech Intelligibility in TV | |
| JP2003259320A (en) | Video and audio synthesizer | |
| WO2025120784A1 (en) | Translated content generation system | |
| Abraitienė et al. | Translation as a Means of Social Integration. | |
| WO2021255831A1 (en) | Transmission device, communication method, and program | |
| WALLEY et al. | Practical Implementation of Automated Next Generation Audio Production for Live Sports | |