GB2600932A - Audio personalisation method and system - Google Patents
- Publication number
- GB2600932A (application GB2017766.3A / GB202017766A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- user
- head
- ear
- audio
- personalisation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
- G06T7/62—Analysis of geometric attributes of area, perimeter, diameter or volume
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/11—Application of ambisonics in stereophonic audio systems
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Geometry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Stereophonic System (AREA)
Abstract
An audio personalisation method for a user comprises the steps of capturing at least a first image of a user comprising a view of their head (wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the user's head), and analysing the or each captured image to generate data characteristic of the morphology of the user's head. Following this, the generated data from the user is compared with corresponding data for a corpus of reference individuals for whom respective head related transfer functions (HRTFs) have been generated. A reference individual whose generated data best matches the generated data from the user is then identified, before the HRTF of the identified reference individual is used for the user. Preferably the size or shape of the user's ear is also considered; features of the ear to be considered include the concha, pinna, tragus, antitragus, lower crus of antihelix and ear notch.
Description
AUDIO PERSONALISATION METHOD AND SYSTEM
BACKGROUND OF THE INVENTION
Field of the invention
The present invention relates to an audio personalisation method and system.
Description of the Prior Art
Consumers of media content, including interactive content such as videogames, enjoy a sense of immersion whilst engaged with that content. As part of that immersion, it is also desirable for the audio to sound more realistic. However, techniques for achieving this realism tend to be complex and require specialist equipment.
The present invention seeks to mitigate or alleviate this problem.
SUMMARY OF THE INVENTION
Various aspects and features of the present invention are defined in the appended claims and within the text of the accompanying description and include at least:
- In a first aspect, an audio personalisation method for a user is provided in accordance with claim 1.
- In another aspect, an audio personalisation system for a user is provided in accordance with claim 12.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
- Figure 1 is a schematic diagram of an entertainment device in accordance with embodiments of the present description;
- Figures 2A and 2B are schematic diagrams of head related audio properties;
- Figures 3A and 3B are schematic diagrams of ear related audio properties;
- Figures 4A and 4B are schematic diagrams of audio systems used to generate data for the computation of a head related transfer function in accordance with embodiments of the present description;
- Figure 5 is a schematic diagram of an impulse response for a user's left and right ears in the time and frequency domains;
- Figure 6 is a schematic diagram of a head related transfer function spectrum for a user's left and right ears;
- Figure 7 is a schematic diagram of two views of a user and example measurements characteristic of their head morphology;
- Figure 8 is a schematic diagram of a user's ear and example measurements characteristic of their ear morphology; and
- Figure 9 is a flow diagram of an audio personalisation method for a user, in accordance with embodiments of the present description.
DESCRIPTION OF THE EMBODIMENTS
An audio personalisation method and system are disclosed. In the following description, a number of specific details are presented in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to a person skilled in the art that these specific details need not be employed to practice the present invention. Conversely, specific details known to the person skilled in the art are omitted for the purposes of clarity where appropriate.
In an example embodiment of the present invention, a suitable system and/or platform for implementing the methods and techniques herein may be an entertainment device such as the Sony PlayStation 4 ® or PlayStation 5 ® videogame consoles.
For the purposes of explanation, the following description is based on the PlayStation 4 ® but it will be appreciated that this is a non-limiting example.
Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, Figure 1 schematically illustrates the overall system architecture of a Sony® PlayStation 4® entertainment device. A system unit 10 is provided, with various peripheral devices connectable to the system unit.
The system unit 10 comprises an accelerated processing unit (APU) 20 being a single chip that in turn comprises a central processing unit (CPU) 20A and a graphics processing unit (GPU) 20B. The APU 20 has access to a random access memory (RAM) unit 22.
The APU 20 communicates with a bus 40, optionally via an I/O bridge 24, which may be a discrete component or part of the APU 20.
Connected to the bus 40 are data storage components such as a hard disk drive 37, and a Blu-ray ® drive 36 operable to access data on compatible optical discs 36A. Additionally the RAM unit 22 may communicate with the bus 40.
Optionally also connected to the bus 40 is an auxiliary processor 38. The auxiliary processor 38 may be provided to run or support the operating system.
The system unit 10 communicates with peripheral devices as appropriate via an audio/visual input port 31, an Ethernet ® port 32, a Bluetooth ® wireless link 33, a Wi-Fi ® wireless link 34, or one or more universal serial bus (USB) ports 35. Audio and video may be output via an AV output 39, such as an HDMI ® port. The peripheral devices may include a monoscopic or stereoscopic video camera 41 such as the PlayStation ® Eye; wand-style videogame controllers 42 such as the PlayStation ® Move and conventional handheld videogame controllers 43 such as the DualShock ® 4; portable entertainment devices 44 such as the PlayStation ® Portable and PlayStation ® Vita; a keyboard 45 and/or a mouse 46; a media controller 47, for example in the form of a remote control; and a headset 48. Other peripheral devices may similarly be considered, such as a printer, a 3D printer (not shown), or a mobile phone 49 connected for example via Bluetooth ® or Wi-Fi Direct ®.
The GPU 20B, optionally in conjunction with the CPU 20A, generates video images and audio for output via the AV output 39. Optionally the audio may be generated in conjunction with or instead by an audio processor (not shown).
The video and optionally the audio may be presented to a television 51. Where supported by the television, the video may be stereoscopic. The audio may be presented to a home cinema system 52 in one of a number of formats such as stereo, 5.1 surround sound or 7.1 surround sound. Video and audio may likewise be presented to a head mounted display unit 53 worn by a user 60.
In operation, the entertainment device defaults to an operating system such as a variant of FreeBSD 9.0. The operating system may run on the CPU 20A, the auxiliary processor 38, or a mixture of the two. The operating system provides the user with a graphical user interface such as the PlayStation ® Dynamic Menu. The menu allows the user to access operating system features and to select games and optionally other content.
When playing such games, or optionally other content, the user will typically be receiving audio from a stereo or surround sound system 52, or headphones, when viewing the content on a static display 51, or similarly receiving audio from a stereo surround sound system 52 or headphones, when viewing content on a head mounted display ('HMD') 53.
In either case, whilst the positional relationship of in game objects either to a static screen or the user's head position (or a combination of both) can be displayed visually with relative ease, producing a corresponding audio effect is more difficult.
This is because an individual's perception of direction for sound relies on a physical interaction with the sound around them caused by physical properties of their head; but everyone's head is different and so the physical interactions are unique.
Referring to figure 2A, an example physical interaction is the interaural delay or time difference (ITD), which is indicative of the degree to which a sound is positioned to the left or right of the user (resulting in relative changes in arrival time at the left and right ears), which is a function of the listener's head size and face shape.
Similarly, referring to figure 2B, interaural level difference (ILD) relates to different loudness for left and right ears and is indicative of the degree to which a sound is positioned to the left or right of the user (resulting in different degrees of attenuation due to the relative obscuring of the ear from the sound source), and again is a function of head size and face shape.
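By way of a purely illustrative aside (not forming part of the claimed method), the dependence of ITD on listener geometry can be seen in the classic Woodworth spherical-head approximation, ITD = (r/c)(θ + sin θ); the Python sketch below assumes a hypothetical head radius and source azimuth.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, in air at roughly 20 degrees C

def woodworth_itd(head_radius_m: float, azimuth_rad: float) -> float:
    """Woodworth spherical-head approximation of the interaural time
    difference for a source at the given azimuth (0 = straight ahead).
    Different listeners have different head radii, hence different ITDs."""
    return (head_radius_m / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

# Hypothetical example: a source 45 degrees to one side, 8.75 cm head radius
print(f"{woodworth_itd(0.0875, math.radians(45.0)) * 1e6:.0f} microseconds")  # ~381
```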
In addition to such horizontal (left-right) discrimination, referring also to Figure 3A the outer ear comprises asymmetric features that vary between individuals and provide additional vertical discrimination for incoming sound; referring to Figure 3B, the small difference in path lengths between direct and reflected sounds from these features cause so-called spectral notches that change in frequency as a function of sound source elevation.
Furthermore, these features are not independent; horizontal factors such as ITD and ILD also change as a function of source elevation, due to the changing face/head profile encountered by the sound waves propagating to the ears. Similarly, vertical factors such as spectral notches also change as a function of left/right positioning, as the physical shaping of the ear with respect to the incoming sound, and the resulting reflections, also change with horizontal incident angle.
The result is a complex two-dimensional response for each ear that is a function of monaural cues such as spectral notches, and binaural or inter-aural cues such as ITD and ILD. An individual's brain learns to correlate this response with the physical source of objects, enabling them to distinguish between left and right, up and down, and indeed forward and back, to estimate an object's location in 3D with respect to the user's head.
It would be desirable to provide a user with sound (for example using headphones) that replicated these features so as to create the illusion of in-game objects (or other sound sources in other forms of consumed content) being at specific points in space relative to the user, as in the real world. Such sound is typically known as binaural sound.
However, it will be appreciated that because each user is unique and so requires a unique replication of features, this would be difficult to do without extensive testing.
In particular, it is necessary to determine the in-ear impulse or frequency response of the user for a plurality of positions, for example in a sphere around them; figure 4A shows a fixed speaker arrangement for this purpose, whilst figure 4B shows a simplified system where, for example, the speaker rig or the user can rotate by fixed increments so that the speakers successively fill in the remaining sample points in the sphere.
Referring to figure 5, for a sound (e.g. an impulse such as a single delta or click) at each sampled position, a recorded impulse response within the ear (for example using a microphone positioned at the entrance to the ear canal) is obtained, as shown in the upper graph. A Fourier transform of such an impulse response is referred to as a frequency response, as shown in the lower graph of Figure 5. Collectively, these impulse responses or frequency responses can be used to define a so-called head-related transfer function (HRTF) describing the effect for each ear of the user's head on the received frequency spectrum for that point in space.
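As a purely illustrative sketch of this relationship (the sampling rate and the stand-in impulse are assumptions, not values from the description):

```python
import numpy as np

def frequency_response(impulse_response: np.ndarray, sample_rate_hz: float):
    """Fourier-transform a recorded in-ear impulse response into a one-sided
    magnitude spectrum (in dB), as in the lower graph of Figure 5."""
    spectrum = np.fft.rfft(impulse_response)
    freqs = np.fft.rfftfreq(len(impulse_response), d=1.0 / sample_rate_hz)
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)  # guard against log(0)
    return freqs, magnitude_db

# Hypothetical stand-in for a recorded click response (512 samples at 48 kHz)
ir = np.zeros(512)
ir[0] = 1.0
freqs, mag_db = frequency_response(ir, 48000.0)
```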
Measured over many positions, a full HRTF can be computed, as partially illustrated in figure 6 for both left and right ears (showing frequency on the y-axis versus azimuth on the x-axis). Brightness is a function of the Fourier transform values, with dark regions corresponding to spectral notches.
An HRTF typically comprises a time or frequency filter (e.g. based on an impulse or frequency response) for a series of positions on a sphere or partial sphere surrounding the user's head (e.g. for both azimuth and elevation), so that a sound, when played through a respective one of these filters, appears to come from the corresponding position/direction. The more measured positions on which filters are based, the better the HRTF is. For positions in between measured positions, interpolation between filters can be used; again, the closer the measurement positions are to each other, the less interpolation is needed and the more accurate it is.
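Purely for illustration (not the patent's implementation), rendering with such filters reduces to a convolution per ear, and a naive interpolation between adjacent measured positions can be a simple blend:

```python
import numpy as np

def render_binaural(mono: np.ndarray, hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with the left/right impulse responses measured
    for one position, giving a 2-channel signal that appears to come from
    that position when heard over headphones."""
    return np.stack([np.convolve(mono, hrir_left),
                     np.convolve(mono, hrir_right)], axis=-1)

def interpolate_hrir(hrir_a: np.ndarray, hrir_b: np.ndarray, t: float) -> np.ndarray:
    """Naive linear blend between two adjacent measured positions (0 <= t <= 1);
    practical systems interpolate more carefully (e.g. treating interaural
    delay separately), but denser measurements make this matter less."""
    return (1.0 - t) * hrir_a + t * hrir_b
```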
It will be appreciated that obtaining an HRTF for each of potentially tens of millions of users of an entertainment device using systems such as those shown in figures 4A and 4B is impractical, as is supplying some form of array system to individual users in order to perform a self-test.
Accordingly, in embodiments of the present description, a different technique is disclosed.
In these embodiments, full HRTFs for a plurality of reference individuals are obtained using systems such as those shown in figures 4A and 4B, to generate a library of HRTFs. This library may initially be small, with for example individual representatives of several ages, ethnicities and each sex being tested, or simply a random selection of volunteers, beta testers, quality assurance testers, early adopters or the like. One reference individual can be a default, such as a dummy head. However, over time more and more individuals may be tested, with their resulting HRTFs being added to the library.
Subsequently in these embodiments it is intended to match home users to the morphologically most similar reference individual, and then use the HRTF obtained for that most similar reference individual for the respective home user.
To do this, images are captured of reference individuals, together with reference features of known size.
The reference features may be any suitable feature that will also be available to home users, for whom corresponding images are subsequently captured.
Hence a non-exhaustive list of reference features includes the following.
i. One or more stickers to be attached to the reference individual's face. The or each sticker will be of a known absolute size and shape, and may optionally be of a colour (such as green or blue) that provides a strong discrimination with respect to skin tones. Stickers may be positioned on suitable locations on the user's head (for example under guidance provided by an app on a videogame console or mobile phone). Locations include but are not limited to the reference individual's neck, chin, cheek(s), nose, forehead, and ears (e.g. earlobes). The stickers can be provided with the audio content generator.
ii. One or more clip-on earrings. Like the stickers, the earring(s) will be of a known absolute size and shape, and may optionally be of a contrasting colour. The earrings enable a reference feature to be placed in a known relationship with respect to the reference individual's ear(s), which as explained herein can be of importance. Again the clip-on earring(s) can be provided with the audio content generator.
iii. A peripheral item provided with the audio content generator. The or each peripheral item will be of a known absolute size and shape. Examples include a remote control for a particular TV, set-top TV box, or optical disk player, or a videogame controller for a particular videogame console. The peripheral item may be provided specifically for the purposes of being a reference feature, and for example comprise a piece of plastic of known size and shape. In any event, the peripheral item may be positioned on or adjacent to suitable locations on the user's head (for example under guidance provided by an app on a videogame console or mobile phone). Locations include but are not limited to the reference individual's neck, chin, cheek(s), nose, forehead, and ears.
iv. A commonplace item, to be used in an identical manner to the peripheral item above. A typical example would be a CD, DVD, or Blu-Ray ® disk, which share a standard size. Other examples include a sheet of A4, A5 or US Letter paper, or such paper folded into equal parts (e.g. half, third, quarter or sixth). Bank cards are also typically of a known and standard size, as are playing cards. In the case of a bank card, the user may be advised to ensure the side comprising account details is facing away from the camera.
v. Structured light. If a depth-sensing camera is available that emits structured light (e.g. a near-infrared grid pattern or the like), then the pitch of the grid as projected onto the reference individual's face/head at the detected distance can be known to an absolute size.
vi. AR markers. If a depth-sensing camera is available that provides a feed to a computing device such as a PC or videogame console, and this in turn can augment the image on a screen, then the person can be invited to place their head exactly between two markers superposed on the displayed image. This will indicate relative head size, and in conjunction with a depth measure to the head, this can be resolved to an absolute head size. Once known, other features of the head/face/ears can be inferred from this absolute head size.
In any event, the reference features within the captured images allow for calculation of absolute feature size for features of the reference individual's face, head, and/or shoulders.
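The underlying arithmetic is a simple scale factor, sketched below purely for illustration (the bank-card width is a published standard; the pixel counts are hypothetical, and the reference feature is assumed to lie at approximately the same distance from the camera as the measured feature):

```python
def pixels_to_metres(measured_px: float, reference_px: float,
                     reference_size_m: float) -> float:
    """Convert a pixel measurement to an absolute size using a reference
    feature of known absolute size appearing in the same image."""
    return measured_px * (reference_size_m / reference_px)

# Hypothetical example: a bank card (85.6 mm wide, per ISO/IEC 7810 ID-1)
# spans 240 px, and the head spans 430 px in the same image.
head_width_m = pixels_to_metres(430.0, 240.0, 0.0856)  # ~0.153 m
```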
Referring now to Figure 7, a non-exhaustive list of features for which absolute measurements may be optionally calculated includes the following.
- Head height (710), e.g. y-axis head dimension;
- Head width (720), e.g. x-axis head dimension;
- Head depth (730), e.g. z-axis head dimension;
- Ear size (740A);
- Ear height (740B), e.g. y-axis ear position relative to a feature of the head;
- Ear distance from face (740C), e.g. z-axis ear position relative to a feature of the head;
  - e.g. collectively, ear position relative to one or more features of the head;
- Ear overhang (750A), e.g. angle of rotation about the z-axis;
- Ear tilt (750B), e.g. angle of rotation about the x-axis;
  - e.g. collectively, ear orientation.
It will also be appreciated that whilst absolute feature size for one or more of the above listed features may be explicitly calculated in this manner, alternatively or in addition a model of a head (for example a mesh or similar parametric model of a default head) may be modified (e.g. morphed) to match features of the reference individual's face. Whilst one or more of the above listed features may be used to modify the model, other features may include eye size and position, eyebrow position, nose size and position and mouth size and position. Similarly, the model may be modified to match the face-on outline of the reference individual and similarly their side-on profile. If images at other angles are provided, the profile of the model at those angles can also be matched, in a manner similar to photogrammetry.
Such a model can be constrained to provide a continuous surface between known points obtained from the images that are consistent with a human head, e.g. scaling it according to head height, width and depth, and scaling and positioning the ear according to its measurements. Conforming to a face-on, side-on or other profile, if done, is then a finer-resolution scaling or morphing of model points or parameters.
In addition to at least a first image of the user's face, preferably an image of at least one of the reference individual's ears, and more preferably images of each ear, are captured.
Referring now to Figure 8, as described with reference to figures 3A and 3B, the morphology of the ear determines the relationship between sound source angle and the frequency notches caused by resonances within the outer ear structure. Typical measurements characteristic of the outer ear's morphology as it relates to these effects include but are not limited to the distance between the tragus (or the ear canal if visible) and antitragus (A); between the tragus (or the ear canal if visible) and concha (B); between the ear notch and lower crus of antihelix (C); from adjacent the lower crus of antihelix to the concha (D); and the width (E) and height (F) of the outer ear encompassed by the helix.
Similar more general features include concha width and/or height, and pinna length, width, and/or angles (e.g. angles of rotation about the x, y, and/or z axes as per Figure 7).
Again it will be appreciated that whilst absolute feature size for one or more of the above listed features may be explicitly calculated in this manner, alternatively or in addition a model of an ear (for example a mesh or similar parametric model of a default ear) may be modified (e.g. morphed) to match features of the reference individual's ear. Whilst one or more of the above listed features may be used to modify the model, other features may be used such as the curve or profile of the helix, lower crus of antihelix, antihelix, concha, ear notch, antitragus, tragus and ear canal (if visible).
Again such a model can be constrained to provide a continuous surface between known points obtained from the images that are consistent with a human ear, e.g. scaling it according to ear height and width, and scaling or positioning certain features such as one or more of those listed above according to such measurements. Conforming to curve or profile features of one or more parts of the outer ear, if done, is then a finer-resolution scaling or morphing of model points or parameters.
This model of the ear can then be positioned and orientated with respect to the model of the head, optionally combining the models into one mesh. Alternatively or in addition, the or each measurement for the ear(s) can be added to the measurement(s) for the face/head.
The captured images referred to herein may be captured for example using any suitable digital camera, whether a dedicated camera (e.g. a so-called SLR) or a camera associated with a computer or videogame console, such as a webcam or PlayStation Eye ® (41), or a structured light emitting camera as referred to previously, or, typically, a camera on a mobile phone.
The captured images may be passed to an image analysis application for processing. Such an app may be on a computer, or on a videogame console (for example in the case of a camera in communication with such a console). The app may similarly be on the mobile phone that takes the photos. In each case, the image analysis application may perform the processing locally or pass some or all of it to a remote server (not shown).
Captured images that may assist in generating the morphological data described herein include but are not limited to a face-on picture of the whole head, a profile picture of the whole head, a close-up (for example occupying more than 50% of the height/width of an image) of one or preferably both ears taken at an angle substantially parallel to the plane of the pinna (the person's outer ear) on their head, and additional images of the person's head or ears at other angles (e.g. angles of rotation about the y or z axes as per Figure 7), which can assist with determining and/or compensating for certain features such as ear angles. Other additional images may include the person's bust (head, neck and upper torso) to determine relative positions of these. Such additional images can improve robustness by allowing some measurements or model / mesh points to be obtained from multiple images, and optionally be compared to remove outliers and/or provide confidence values, which in turn may optionally affect the weighting of such elements.
As described herein, the app may analyse the images to determine one or more of the measurements described, and/or to warp or modify a default model / mesh of a head and/or ear(s) in response to these measurements and/or features within the images.
As described herein, the measurements or model / mesh may be subject to constraints and/or assumed relationships; for example if only a face-on image of the reference individual's whole head is available, then the visible height and width of the head can also be used to estimate the depth, based on an expected relationship between the measurements (for example with reference to a default head). A similar approach can be used when warping or modifying a model / mesh, with some mesh values changing in response to changes imposed on other ones, based on an expected relationship between the model / mesh points (for example with reference to a default head model / mesh).
Similarly if an image of an ear enables some measurements to be performed, or features to be substantially matched to a model/mesh, but not others (e.g. due to shadow, poor contrast, or low resolution), then the others can be inferred based on expected relationships between the measurements, or the model / mesh.
Conversely, where a feature of a head or ear is notionally constrained by other measurements or model / mesh values, then optionally if a new measurement or model / mesh value corresponding to that feature indicates a value more than a threshold different to the constrained value, then the new value may be rejected, and/or the person taking the photos could be invited to take another, for example with different lighting or from a different angle, in order to capture an image from which a more accurate measurement / model / mesh value can be obtained.
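A minimal sketch of such a consistency check follows, purely for illustration (the relative threshold is an assumed value, not specified by the description):

```python
def accept_measurement(new_value: float, constrained_value: float,
                       rel_threshold: float = 0.15) -> bool:
    """Accept a new measurement only if it lies within a relative threshold
    of the value implied by the model's constraints; on rejection, the
    caller might prompt for a retake with different lighting or angle."""
    return abs(new_value - constrained_value) <= rel_threshold * abs(constrained_value)
```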
In this way, a description of the morphology of the reference individual's face and ears can be obtained (e.g. in the form of the above measurements, models/mesh, and/or images), for storage in association with the reference individual's HRTF.
Consequently the library of reference individual HRTFs also comprises a corresponding description of morphology for the respective reference individuals.
Subsequently, a home user is asked to capture similar images (e.g. one face-on, and of one or both ears, plus optional additional photos such as whole-head profile shots at one or more angles), as described previously herein.
Again an app, for example on the user's phone, tablet, computer, videogame console, smart TV, set top box or the like, analyses these images (or sends them to a remote server for analysis), as described previously herein.
The results are similar to those for the reference individuals, namely measurements, model/mesh, and/or images that describe the morphology of the home user's face and ears, as described previously herein.
Once data characterising the morphology of the home user has been generated, this can be compared with the equivalent morphology data in the HRTF library.
The morphology data in the HRTF library that is the closest match can thus be identified, and the HRTF corresponding to this closest matching data and reference individual can then be selected for use with the home user as a good approximation of an HRTF for that home user.
The comparison may for example determine cumulative absolute errors between measurements, or average absolute errors between measurements, and/or cumulative absolute errors between model / mesh values, or average absolute errors between model / mesh values, and/or cumulative absolute errors between normalised images (e.g. normalised for absolute scale as indicated by reference feature, image size, resolution, colour and the like), or average absolute errors between similarly normalised images, as a cost function for determining the closest match.
Optionally certain measurements or values for model / mesh points or regions may be weighted differently.
For example the distance between the user's ears (720) may be considered particularly important, as may the distance between the tragus (or the ear canal if visible) and antitragus (A), and between the tragus (or the ear canal if visible) and concha (B). The contribution of these measurements or corresponding model / mesh points / regions may be weighted accordingly in any assessment.
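Purely as an illustration of such a weighted cost function (the vector layout of the morphology data and the library structure are assumptions):

```python
import numpy as np

def match_cost(user: np.ndarray, reference: np.ndarray,
               weights: np.ndarray | None = None) -> float:
    """Weighted cumulative absolute error between the user's measurement
    vector and one reference individual's vector; lower is better."""
    if weights is None:
        weights = np.ones_like(user)
    return float(np.sum(weights * np.abs(user - reference)))

def closest_reference(user: np.ndarray, library: dict[str, np.ndarray],
                      weights: np.ndarray | None = None) -> str:
    """Identify the reference individual whose data best matches the user."""
    return min(library, key=lambda k: match_cost(user, library[k], weights))
```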
Alternatively, measurements or corresponding model / mesh points / regions may be compared with equivalent values for reference individuals' data in successive elimination rounds.
Hence as a non-limiting example the three measurements above (720, A, B) or equivalent model / mesh points / regions could be compared with the equivalent morphology data in the HRTF library to identify a top N candidate reference individuals with the closest match on that basis only. Again even at this stage, different elements of the data could be weighted in this comparison. The actual measurements or equivalent model / mesh points / regions used to select a top N candidate reference individuals may for example be chosen based upon their effect on HRTFs in general, their relative variance and hence potential discriminatory capability within the library data, and empirical testing.
A further or full data comparison would then be performed only on the data for these N candidate reference individuals. In this way the most important M measurements or model / mesh points / regions can be used for primary selection but then the overall model data can be used for final selection. One reason for taking this approach rather than just heavily weighting the most important features within a global comparison of the library data is that in order for that weighting to be effective it may be so heavy that it prevents more refined selectivity of other features. By allowing for an initial selection of a subset of candidates with the main features close to those of the home user, a further assessment with different weightings can be used to obtain a good overall match.
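A sketch of this two-stage selection, purely for illustration (the shortlist size and the indices of the primary measurements are assumptions):

```python
import numpy as np

def two_stage_match(user: np.ndarray, library: dict[str, np.ndarray],
                    primary_idx: list[int], n_candidates: int = 10) -> str:
    """Shortlist the N references closest on the most important measurements
    (e.g. 720, A and B), then pick the overall closest of the shortlist
    using all measurements."""
    def cost(ref: np.ndarray, idx: list[int] | None = None) -> float:
        u, r = (user, ref) if idx is None else (user[idx], ref[idx])
        return float(np.sum(np.abs(u - r)))

    shortlist = sorted(library, key=lambda k: cost(library[k], primary_idx))[:n_candidates]
    return min(shortlist, key=lambda k: cost(library[k]))
```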
In any event, the closest matching morphology data from the reference individuals in the library can be identified in this manner, and the corresponding HRTF for the respective reference individual can then be used for sound processing for the home user.
It will be appreciated that other data descriptive of the reference individuals and the home user may be factored into any comparison / matching process, including for example ethnicity, gender and/or age, which can have an effect on interaural delay due to head size, and also a tendency for higher frequency notches in women and the young. These demographic aspects may be used in an overall weighted matching or as an initial filter, in a similar manner to certain other measurements as described previously herein. The person could provide these demographic details as part of the data capture process, or they may be estimated automatically from image analysis and optionally audio analysis if the person is invited to speak (e.g. a predetermined utterance).
In any event, as noted above, when a best match between the morphological data for the home user and the equivalent data for the reference individuals is found (whether based on unweighted or weighted data, and/or based on one or more initial elimination rounds using respective subsets of one or more user morphological data values and/or one or more user demographic values), the head related transfer function of the corresponding reference individual is thus identified as the best match in the library for the current home user.
This HRTF may then be used to process sound for the home user, for example when presenting them with binaural sound.
This will provide a more immersive audio experience than normal stereo or surround sound, or a default HRTF based upon a reference dummy head, without the need for the equipment, resources, or time to perform measurements to generate a direct HRTF for the home user themselves.
It will be appreciated that some variants of the techniques herein may also be considered.
Hence for example optionally, having obtained the home user's morphological data, it can then be kept on record; consequently if a new reference individual is added to the library, the user's morphological data can be tested against that of the new reference individual to see if they are a better match, for example as a background service provided by a remote server. If a better match is found, then the HRTF of the closer matching reference individual may be installed as the HRTF for that user, thereby improving their experience further.
The individuals chosen to expand the library can also be selected judiciously; one may assume that for a representative set of reference individuals, a random distribution of home users will match to each reference individual in roughly equal proportions; however if a comparatively high number of users match to a reference individual (for example above a threshold variance in the number of users matching to reference individuals), then this is indicative of at least one of the following:
i. The population of users is not random (e.g. due to demographics), and so there are more people similar to this reference individual than the norm; and
ii. The set of reference individuals is not sufficiently representative of the users, and there is a gap in the proxy result space surrounding this particular reference individual, causing people who in fact are not that similar to the individual to be matched to them for lack of a better match.
In either case, it would be desirable to find other reference individuals who are morphologically similar to the one currently in the library, in order to provide more refined discrimination within this sub-group of the user population. Such individuals may optionally be found for example by comparing photographs of the candidate reference individual, for example face-on and side-on (showing an ear), to help with automatically assessing head shape and outer ear shape.
Morphologically similar individuals may also be found using other methods, such as identifying individuals with similar demographics, or inviting close family relatives of the existing reference individual.
Similarly for example home users whose submitted morphological data indicates that they occupy a gap in the representations within the library may be invited to become a reference individual, for example by visiting a facility capable of producing the HRTFs. As a side effect of this, using the techniques herein this home user will then match with themselves as a reference individual and be able to use their own HRTF at home.
In this way, optionally the HRTF library can be grown over time in response to the characteristics of the user base.
Where it is not possible to find a suitable new reference individual, or whilst waiting for one to be added to the library, then optionally for a user that is close to two or more reference individuals but not within a threshold degree of match of any of them, a blend of the HRTFs of the two or more reference individuals may be generated to provide a better estimate of their own HRTF. This blend may be a weighted average or other combination responsive to the relative degree of match for the two or more reference individuals' HRTFs.
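One possible blend is sketched below, purely for illustration (weighting each candidate inversely to its match cost is an assumed scheme; the description leaves the exact combination open):

```python
import numpy as np

def blend_hrirs(hrirs: list[np.ndarray], match_costs: list[float]) -> np.ndarray:
    """Blend the impulse-response sets of the closest reference individuals,
    weighting each inversely to its match cost, for use when no single
    reference is within the threshold degree of match."""
    inv = np.array([1.0 / (c + 1e-9) for c in match_costs])
    weights = inv / inv.sum()
    return sum(w * h for w, h in zip(weights, hrirs))
```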
The user can re-do the test from scratch as they wish; for example a growing child may wish to do so annually as their head shape changes as they grow. Similarly an older individual may re-take the test if they suspect some hearing loss in either ear.
Referring now also to Figure 9, in a summary embodiment of the present description, an audio personalisation method for a user (e.g. home users) comprises the following steps.
- In a first step s910, capturing at least a first image of a user comprising a view of their head, wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the user's head, as described herein;
- In a second step s920, analysing the or each captured image to generate data characteristic of the morphology of the user's head, responsive to the known absolute size of the reference feature (e.g. using the reference feature to enable accurate measurements, model parameters, or mesh points reflecting the true size of the user's head), as described herein;
- In a third step s930, for a corpus of reference individuals for whom respective head related transfer functions 'HRTF's have been generated, comparing some or all of the generated data from the user with corresponding data of some or all respective reference individuals in the corpus, as described herein;
- In a fourth step s940, identifying a reference individual whose generated data best matches the generated data from the user, as described herein; and
- In a fifth step s950, using the HRTF of the identified reference individual for the user, as described herein.
It will be apparent to a person skilled in the art that variations in the above method corresponding to operation of the various embodiments of the method and/or apparatus as described and claimed herein are considered within the scope of the present disclosure, including but not limited to that:
- the generated data comprises measurements between predetermined points on the user's head identified by image analysis of one or more captured images of the user, as described herein;
- the generated data comprises values for a parametric model of the user's head, with at least some of the values being estimated by image analysis of one or more captured images of the user, as described herein;
- the generated data comprises a mesh of the user's head, with at least some of the mesh positions being estimated by image analysis of one or more captured images of the user, as described herein;
- at least one captured image comprises an image of an ear of the user, and the step of analysing the or each captured image to generate data characteristic of the morphology of the user's head also comprises similarly generating data characteristic of the morphology of the user's ear, as described herein;
- the generated data characterises one or more selected from the list consisting of head height, head width, ear size, ear position relative to one or more features of the head, and ear orientation (for example either as measurements, or as part of a parameterised model, or a mesh), as described herein;
- the generated data characterises one or more selected from the list consisting of concha width, concha height, pinna width, pinna height, the distance between the tragus (or the ear canal if visible) and antitragus, the distance between the tragus (or the ear canal if visible) and concha, the distance between the ear notch and lower crus of antihelix, and the distance from adjacent the lower crus of antihelix to the concha (for example either as measurements, or as part of a parameterised model, or a mesh), as described herein;
- the reference feature comprises one or more selected from the list consisting of one or more stickers, one or more clip-on earrings, a peripheral item provided with an audio content generator, a commonplace item of standardised size, structured light used in conjunction with a depth sensing camera, and AR markers used in conjunction with a depth sensing camera, as described herein;
- the predetermined relationship comprises one selected from the list consisting of: for a sticker, being stuck on to the user's face or ear; for a clip-on earring, being clipped onto the user's ear; for a peripheral item or commonplace item, being positioned in contact with a predetermined part of the user's head, face or ear; for structured light, being projected onto the user's head, face, or ear; and for an AR marker, being adjacent within an image to a predetermined part of the user's head, face or ear, as described herein; and
- for each of a corpus of reference individuals, generating a respective head related transfer function 'HRTF', capturing at least a first image of the reference individual comprising a view of their head (wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the reference individual's head), analysing the or each captured image to generate data characteristic of the morphology of the reference individual's head, responsive to the known absolute size of the reference feature, and associating the generated data with the respective reference individual and their HRTF.
It will be appreciated that the above methods may be carried out on conventional hardware suitably adapted as applicable by software instruction or by the inclusion or substitution of dedicated hardware.
Thus the required adaptation to existing parts of a conventional equivalent device may be implemented in the form of a computer program product comprising processor implementable instructions stored on a non-transitory machine-readable medium such as a floppy disk, optical disk, hard disk, solid state disk, PROM, RAM, flash memory or any combination of these or other storage media, or realised in hardware as an ASIC (application specific integrated circuit) or an FPGA (field programmable gate array) or other configurable circuit suitable to use in adapting the conventional equivalent device. Separately, such a computer program may be transmitted via data signals on a network such as an Ethernet, a wireless network, the Internet, or any combination of these or other networks.
Accordingly, in a summary embodiment of the present description, an audio personalisation system (for example comprising an audio content generator such as a videogame console 10 or mobile phone 49) for a user comprises the following.
- First, a camera (for example a PlayStation ® camera 41, or a camera of a mobile phone 49) operable to capture at least a first image of a user comprising a view of their head, wherein at least one of the captured images comprises a reference feature of known absolute size (such as a sticker, clip-on item, peripheral of the audio content generator, commonplace standardised item, structured light or AR marker) in a predetermined relationship to the user's head, as described herein.
- Second, an image analysis processor (for example CPU 20A, and/or an equivalent CPU in a mobile phone 49 or remote server, not shown) configured (for example by suitable software instruction) to analyse the or each captured image to generate data characteristic of the morphology of the user's head, responsive to the known absolute size of the reference feature, as described herein.
- Third, a comparison processor (for example CPU 20A, and/or an equivalent CPU in a mobile phone 49 or remote server, not shown) configured (for example by suitable software instruction) to compare, for a corpus of reference individuals for whom respective head related transfer functions 'HRTF's have been generated, some or all of the generated data from the user with corresponding data of some or all respective reference individuals in the corpus, the comparison processor also being configured (again for example by suitable software instruction) to identify a reference individual whose generated data best matches the generated data from the user, as described herein.
- And fourth, an audio processor (for example CPU 20A) configured (for example by suitable software instruction) to use the HRTF of the identified reference individual for the user, as described herein.
In an instance of this summary embodiment, the audio personalisation system comprises a remote server configured to operate as one or more selected from the list consisting of the image analysis processor, and the comparison processor, as described herein.
In an instance of this summary embodiment, the audio personalisation system comprises an object for use as the reference feature (such as a sticker, clip-on item, or peripheral), as described herein.
In an instance of this summary embodiment, the audio personalisation system comprises one or more selected from the list consisting of a videogame console operable to receive image data from a camera, and a mobile phone comprising a camera.
It will be appreciated that the audio personalisation system may be adapted (for example by suitable software instruction) to implement any of the methods and techniques described herein.
The foregoing discussion discloses and describes merely exemplary embodiments of the present invention. As will be understood by those skilled in the art, the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting of the scope of the invention, as well as other claims. The disclosure, including any readily discernible variants of the teachings herein, defines, in part, the scope of the foregoing claim terminology such that no inventive subject matter is dedicated to the public.
Claims (15)
CLAIMS
- 1. An audio personalisation method for a user, comprising the steps of: capturing at least a first image of a user comprising a view of their head, wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the user's head; analysing the or each captured image to generate data characteristic of the morphology of the user's head, responsive to the known absolute size of the reference feature; for a corpus of reference individuals for whom respective head related transfer functions 'HRTF's have been generated, comparing some or all of the generated data from the user with corresponding data of some or all respective reference individuals in the corpus; identifying a reference individual whose generated data best matches the generated data from the user; and using the HRTF of the identified reference individual for the user.
- 2. An audio personalisation method for a user according to claim 1, in which the generated data comprises measurements between predetermined points on the user's head identified by image analysis of one or more captured images of the user.
- 3. An audio personalisation method for a user according to any preceding claim, in which the generated data comprises values for a parametric model of the user's head, with at least some of the values being estimated by image analysis of one or more captured images of the user.
- 4. An audio personalisation method for a user according to any preceding claim, in which the generated data comprises a mesh of the user's head, with at least some of the mesh positions being estimated by image analysis of one or more captured images of the user.
- 5. An audio personalisation method for a user according to any preceding claim, in which at least one captured image comprises an image of an ear of the user; and the step of analysing the or each captured image to generate data characteristic of the morphology of the user's head also comprises similarly generating data characteristic of the morphology of the user's ear.
- 6. An audio personalisation method for a user according to any preceding claim, in which the generated data characterises one or more selected from the list consisting of: i. head height; ii. head width; iii. ear size; iv. ear position relative to one or more features of the head; and v. ear orientation.
- 7. An audio personalisation method for a user according to any preceding claim, in which the generated data characterises one or more selected from the list consisting of: i. concha width; ii. concha height; iii. pinna width; iv. pinna height; v. the distance between the tragus (or the ear canal if visible) and antitragus; vi. the distance between the tragus (or the ear canal if visible) and concha; vii. the distance between the ear notch and lower crus of antihelix; and viii. the distance from adjacent the lower crus of antihelix to the concha.
- 8. An audio personalisation method for a user according to any preceding claim, in which the reference feature comprises one or more selected from the list consisting of: i. one or more stickers; ii. one or more clip-on earrings; iii. a peripheral item provided with an audio content generator; iv. a commonplace item of standardised size; v. structured light used in conjunction with a depth sensing camera; and vi. AR markers used in conjunction with a depth sensing camera.
- 9. An audio personalisation method for a user according to any preceding claim, in which the predetermined relationship comprises one selected from the list consisting of: i. for a sticker, being stuck on to the user's face or ear; ii. for a clip-on earring, being clipped onto the user's ear; iii. for a peripheral item or commonplace item, being positioned in contact with a predetermined part of the user's head, face or ear; iv. for structured light, being projected onto the user's head, face, or ear; and v. for an AR marker, being adjacent within an image to a predetermined part of the user's head, face or ear.
- 10. An audio personalisation method for a user according to any preceding claim, comprising the steps of: for each of a corpus of reference individuals: generating a respective head related transfer function 'HRTF'; capturing at least a first image of the reference individual comprising a view of their head, wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the reference individual's head; analysing the or each captured image to generate data characteristic of the morphology of the reference individual's head, responsive to the known absolute size of the reference feature; and associating the generated data with the respective reference individual and their HRTF.
- 11. A computer program comprising computer executable instructions adapted to cause a computer system to perform the method of any one of the preceding claims.
- 12. An audio personalisation system for a user, comprising a camera operable to capture at least a first image of a user comprising a view of their head, wherein at least one of the captured images comprises a reference feature of known absolute size in a predetermined relationship to the user's head; an image analysis processor configured to analyse the or each captured image to generate data characteristic of the morphology of the user's head, responsive to the known absolute size of the reference feature; a comparison processor configured to compare, for a corpus of reference individuals for whom respective head related transfer functions 'HRTF's have been generated, some or all of the generated data from the user with corresponding data of some or all respective reference individuals in the corpus; the comparison processor being configured to identify a reference individual whose generated data best matches the generated data from the user; and an audio processor configured to use the HRTF of the identified reference individual for the user.
- 13. An audio personalisation system according to claim 12, comprising: a remote server configured to operate as one or more selected from the list consisting of: i. the image analysis processor; and ii. the comparison processor.
- 14. An audio personalisation system according to claim 12 or claim 13, comprising: an object for use as the reference feature.
- 15. An audio personalisation system according to any one of claims 12 to 14, in which the system comprises one or more selected from the list consisting of: a videogame console operable to receive image data from a camera; and a mobile phone comprising a camera.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2017766.3A GB2600932A (en) | 2020-11-11 | 2020-11-11 | Audio personalisation method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2017766.3A GB2600932A (en) | 2020-11-11 | 2020-11-11 | Audio personalisation method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB202017766D0 (en) | 2020-12-23 |
| GB2600932A (en) | 2022-05-18 |
Family
ID=74046418
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB2017766.3A Withdrawn GB2600932A (en) | 2020-11-11 | 2020-11-11 | Audio personalisation method and system |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2600932A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024220003A1 (en) * | 2023-04-19 | 2024-10-24 | Telefonaktiebolaget Lm Ericsson (Publ) | Creating a large scale head-related filter database |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120611123B (en) * | 2025-08-12 | 2025-10-21 | 歌尔股份有限公司 | HRTF generation method, device and computer readable storage medium |
| CN120611122B (en) * | 2025-08-12 | 2025-11-07 | 歌尔股份有限公司 | HRTF generation method, apparatus and computer-readable storage medium |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2611216A1 (en) * | 2011-12-30 | 2013-07-03 | GN Resound A/S | Systems and methods for determining head related transfer functions |
| US20150373477A1 (en) * | 2014-06-23 | 2015-12-24 | Glen A. Norris | Sound Localization for an Electronic Call |
| WO2017197156A1 (en) * | 2016-05-11 | 2017-11-16 | Ossic Corporation | Systems and methods of calibrating earphones |
- 2020-11-11: GB application GB2017766.3A filed; published as GB2600932A (en); status: not active (withdrawn)
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2611216A1 (en) * | 2011-12-30 | 2013-07-03 | GN Resound A/S | Systems and methods for determining head related transfer functions |
| US20150373477A1 (en) * | 2014-06-23 | 2015-12-24 | Glen A. Norris | Sound Localization for an Electronic Call |
| WO2017197156A1 (en) * | 2016-05-11 | 2017-11-16 | Ossic Corporation | Systems and methods of calibrating earphones |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024220003A1 (en) * | 2023-04-19 | 2024-10-24 | Telefonaktiebolaget Lm Ericsson (Publ) | Creating a large scale head-related filter database |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202017766D0 (en) | 2020-12-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11663778B2 (en) | Method and system for generating an image of a subject from a viewpoint of a virtual camera for a head-mountable display | |
| US12407997B2 (en) | Audio personalisation method and system | |
| US11770669B2 (en) | Audio personalisation method and system | |
| CN111818441B (en) | Sound effect realization method and device, storage medium and electronic equipment | |
| CN112602053B (en) | Audio device and audio processing method | |
| US12207068B2 (en) | Audio apparatus and method of operation therefor | |
| CN108885690A (en) | Arrangement for generating head-related transfer function filters | |
| US12108240B2 (en) | Acoustic processing apparatus, acoustic processing method, and acoustic processing program | |
| GB2600932A (en) | Audio personalisation method and system | |
| EP3595337A1 (en) | Audio apparatus and method of audio processing | |
| CN113366863B (en) | Compensating for Headset Effects on Head-Related Transfer Functions | |
| GB2609014A (en) | Audio personalisation method and system | |
| Privitera et al. | On the effect of user tracking on perceived source positions in mobile audio augmented reality | |
| US11765539B2 (en) | Audio personalisation method and system | |
| Privitera et al. | Preliminary evaluation of the auralization of a real indoor environment for augmented reality research | |
| Daugintis et al. | Perceptual evaluation of an auditory model–based similarity metric for head-related transfer functions | |
| Sikström et al. | Virtual reality exploration with different head-related transfer functions | |
| Heck et al. | Comparison of the auditory and auditory-visual perception in four concert halls | |
| Dinakaran | Anthropometrics Based Individualization of Head-related Transfer Functions | |
| CN118044231A (en) | Information processing apparatus and data structure | |
| CN121587029A (en) | Generating an audio data signal | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | |