HK1169897A - System and method for video compression using feedback including data related to the successful receipt of video content - Google Patents
- Publication number
- HK1169897A (Application No. HK12110542.8A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- video
- game
- frame
- user
- video frame
- Prior art date
Abstract
A computer-implemented system and method for performing video compression is described. A method according to one embodiment includes encoding a first plurality of video frames or portions thereof, wherein each encoded video frame or portion is dependent on a previously-encoded video frame or portion thereof. The method further includes transmitting the first plurality of encoded video frames or portions to a client device and receiving feedback information from the client device, the feedback information usable to determine whether data contained in the video frames or portions has not been successfully received and/or decoded; in response to detecting that a video frame has not been successfully received and/or decoded, encoding a current video frame or portion thereof to be dependent on a previously-encoded video frame or portion thereof known to have been successfully received and/or decoded; and transmitting the current video frame or portion thereof to the client device.
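The encoding loop described in the abstract can be sketched in code. The following is an illustrative model only; the class and method names are invented for this sketch and are not taken from the patent. It shows the key idea: frames are normally encoded relative to the immediately preceding frame, but when client feedback reports a loss, the next frame is encoded relative to the most recent frame known to have been successfully received and decoded.

```python
class FeedbackEncoder:
    """Illustrative sketch of the feedback-driven encoding scheme.

    Each frame is normally encoded dependent on the previous frame.
    When the client reports a lost or undecodable frame, the next
    frame is instead encoded dependent on the newest frame the
    client has acknowledged as successfully received and decoded.
    """

    def __init__(self):
        self.frame_id = 0
        self.last_acknowledged = None  # newest frame confirmed by the client

    def encode_next(self, loss_reported):
        """Choose a reference frame for the next encoded frame."""
        if loss_reported and self.last_acknowledged is not None:
            # Fall back to a known-good reference instead of the lost frame.
            reference = self.last_acknowledged
        else:
            # Normal case: depend on the immediately preceding frame.
            reference = self.frame_id - 1 if self.frame_id > 0 else None
        frame = {"id": self.frame_id, "ref": reference}
        self.frame_id += 1
        return frame

    def on_ack(self, frame_id):
        # Feedback from the client: frame_id was received and decoded.
        if self.last_acknowledged is None or frame_id > self.last_acknowledged:
            self.last_acknowledged = frame_id
```

For example, if frames 0, 1, and 2 are sent, the client acknowledges 0 and 1, and frame 2 is then reported lost, the next frame is encoded relative to frame 1 rather than the lost frame 2, so the client can decode it without waiting for a full key frame.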
Description
RELATED APPLICATIONS
The present application claims priority from U.S. provisional patent application No. 61/210,888, entitled "System And Method For Compressing Video Using Feedback", filed March 23, 2009, and is a continuation-in-part of co-pending U.S. application No. 12/359,150, entitled "System And Method For Protecting Certain Types Of Multimedia Data Transmitted Over A Communication Channel", filed January 23, 2009, and of co-pending U.S. application No. 11/999,475, entitled "Hosting And Broadcasting Virtual Events Using Streaming Interactive Video", filed December 5, 2007, which is a continuation-in-part of U.S. application No. 10/315,460, entitled "Apparatus And Method For Wireless Video Gaming", filed December 10, 2002, assigned to the assignee of the present application.
Technical Field
The present disclosure relates generally to the field of data processing systems that improve the ability of users to operate and access audio and video media.
Background
Recorded audio and motion picture media have been an aspect of society since the era of Thomas Edison. At the start of the 20th century there was wide distribution of recorded audio media (cylinders and records) and motion picture media (nickelodeons and movies), but both technologies were still in their infancy. In the late 1920s motion pictures were combined with audio on a mass-market basis, followed by color motion pictures with audio. Radio broadcasting gradually evolved into a largely advertising-supported form of broadcast mass-market audio media. When a Television (TV) broadcast standard was established in the mid-1940s, television joined radio as a form of broadcast mass-market media, bringing previously recorded or live motion pictures into the home.
By the middle of the 20th century, most American homes had a phonograph record player for playing recorded audio media, a radio for receiving live broadcast audio, and a television for playing live broadcast audio/video (A/V) media. Very often these three "media players" (record player, radio and TV) were combined into one cabinet sharing common speakers, becoming the "media center" of the home. Although the media choices were limited for the consumer, the media "ecosystem" was quite stable. Most consumers knew how to use the "media players" and were able to enjoy the full extent of their capabilities. At the same time, the publishers of the media (largely the motion picture and television studios and the music companies) were able to distribute their media both to theaters and to homes without suffering from widespread piracy or "secondary sales" (i.e., the resale of used media). Typically, publishers do not derive revenue from secondary sales, and as such, secondary sales reduce the revenue that publishers might otherwise derive from new sales to buyers who would otherwise have purchased the media new. Although there certainly were used records sold during the middle of the 20th century, such sales did not have a large impact on record publishers because, unlike a motion picture or video program (which is typically watched once or only a few times), a music track may be listened to hundreds or even thousands of times. Thus, music media is far "longer-lived" (i.e., it has lasting value for the consumer) than motion picture/video media. Once a record was purchased, if the consumer liked the music, the consumer was likely to keep it for a long time.
From the middle of the 20th century until the present day, the media ecosystem has undergone a series of radical changes, to the benefit and detriment of both consumers and publishers. With the widespread introduction of audio tape recorders, especially cassette tapes with high-quality stereo sound, there certainly was a higher degree of consumer convenience. But it also marked the beginning of what is now a widespread consumer practice with media: piracy. Certainly, many consumers used cassette tapes purely for taping their own records for convenience, but an increasing number of consumers (e.g., students in a dormitory with ready access to one another's record collections) made pirated copies. Also, rather than buying a record or tape from the publisher, consumers would tape music played over the radio.
The advent of the consumer VCR led to still more consumer convenience, since now a VCR could be set to record a TV program which could be watched at a later time, and the VCR also led to the creation of the video rental industry, where movies as well as TV programming could be accessed on an "on demand" basis. The rapid development of mass-market home media devices since the mid-1980s has led to an unprecedented level of choice and convenience for the consumer, and has also led to a rapid expansion of the media publishing market.
Today, consumers are faced with a plethora of media choices as well as a plethora of media devices, many of which are tied to particular forms of media or particular publishers. An avid consumer of media may have a stack of devices connected to TVs and computers in various rooms of the house, resulting in a "rat's nest" of cables to one or more TV sets and/or Personal Computers (PCs), as well as a group of remote controls. (In the context of the present application, the term "Personal Computer" or "PC" refers to any sort of computer suitable for use in the home or office, including desktop computers, Macintosh or other non-Windows computers, Windows-compatible devices, Unix variations, notebook computers, etc.) These devices may include a video game console, VCR, DVD player, audio surround-sound processor/amplifier, satellite set-top box, cable TV set-top box, etc. And, for an avid consumer, there may be multiple devices of similar function because of compatibility issues. For example, a consumer may own both an HD-DVD player and a Blu-ray DVD player, or both a Microsoft Xbox and a Sony Playstation video game system. Indeed, because of the incompatibility of some games across versions of game consoles, the consumer may own both an XBox and a later version, such as an Xbox 360. Frequently, consumers are befuddled as to which video input and which remote to use. Even after a disc is placed into the correct player (e.g., DVD, HD-DVD, Blu-ray, Xbox or Playstation), the video and audio input is selected for that device, and the correct remote control is found, the consumer is still faced with technical challenges. For example, in the case of a wide-screen DVD, the user may need to first determine and then set the correct aspect ratio (e.g., 4:3, Full, Zoom, Wide Zoom, Cinema Wide, etc.) on his TV or monitor screen.
Similarly, the user may need to first determine and then set the correct audio surround-sound system format (e.g., AC-3, Dolby Digital, DTS, etc.). Often, consumers are unaware that they may not be enjoying media content to the full capability of their television or audio system (e.g., watching a movie squeezed at the wrong aspect ratio, or listening to audio in stereo rather than in surround sound).
Increasingly, Internet-based media devices have been added to the stack of devices. Audio devices like the Sonos Digital Music System stream audio directly from the Internet. Likewise, devices like the Slingbox entertainment player record video and stream it through a home network or out through the Internet, where it can be watched remotely on a PC. And Internet Protocol Television (IPTV) services offer cable TV-like services through Digital Subscriber Line (DSL) or other home Internet connections. There have also been recent efforts to integrate multiple media functions into a single device, such as the Moxi Media Center and PCs running a version of Windows XP Media Center Edition. While each of these devices offers an element of convenience for the functions it performs, each lacks ubiquitous and simple access to most media. Further, such devices frequently cost hundreds of dollars to manufacture, often because of the need for expensive processing and/or local storage. Additionally, these modern consumer electronic devices typically consume a great deal of power, even while idle, which means they are expensive over time and wasteful of energy resources. For example, a device may continue to operate if the consumer neglects to turn it off or switches it to a different video input. And, because none of the devices is a complete solution, it must be integrated with the other stack of devices in the home, which still leaves the user with a rat's nest of wires and a sea of remote controls.
Further, even when they do work properly, many newer Internet-based devices typically offer media in a more generic form than it might otherwise be available in. For example, devices that stream video through the Internet often stream just the video material, not the interactive "extras" that often accompany DVDs, such as the "making of" video, games, or director's commentary. This is because the interactive material is often produced in a particular format intended for a particular device that handles interactivity locally. For example, each of DVD, HD-DVD and Blu-ray discs has its own particular interactive format. Any home media device or local computer that might be developed to support all popular formats would require a level of sophistication and flexibility that would likely be impractically expensive and complex for the consumer to operate.
Exacerbating the problem, if a new format were to be introduced later in the future, the local device may not have the hardware capability to support the new format, which would mean that the consumer would have to purchase an upgraded local media device. For example, if higher-resolution video or stereoscopic video (e.g., one video stream for each eye) were introduced at a later date, the local device may not have the computational capability to decode the video, or it may not have the hardware to output the video in the new format (e.g., assuming stereoscopy is achieved through 120fps video synchronized with shutter glasses, with 60fps delivered to each eye; if the consumer's video hardware can only support 60fps video, this option would be unavailable absent an upgraded hardware purchase).
The issue of media device obsolescence and complexity is a serious problem when it comes to sophisticated interactive media, especially video games.
Modern video game applications are largely divided among four major non-portable hardware platforms: Sony PlayStation 1, 2 and 3 (PS1, PS2, and PS3); Microsoft Xbox and Xbox 360; Nintendo Gamecube and Wii; and PC-based games. Each of these platforms is different from the others, so games written to run on one platform usually do not run on another platform. There may also be compatibility problems from one generation of device to the next. Even though the majority of software game developers create software games that are designed independently of a particular platform, in order to run a particular game on a particular platform, a proprietary layer of software (frequently called a "game development engine") is needed to adapt the game for use on the particular platform. Each platform is sold to the consumer as a "console" (i.e., a standalone box attached to a TV or monitor/speakers) or it is a PC itself. Typically, the video games are sold on optical media such as a Blu-ray DVD, DVD-ROM or CD-ROM, which contains the video game embodied as a sophisticated real-time software application. As home broadband speeds have increased, video games are becoming increasingly available for download.
Because of the real-time nature and high computational requirements of advanced video games, the specificity requirements for achieving platform compatibility with video game software are extremely demanding. For example, one might expect full game compatibility from one generation of video games to the next (e.g., from XBox to XBox 360, or from Playstation 2 ("PS2") to Playstation 3 ("PS3")), just as there is general compatibility of productivity applications (e.g., Microsoft Word) from one PC to another with a faster processing unit or core. However, this is not the case with video games. Because video game manufacturers typically seek the highest possible performance for a given price point when a generation of video games is released, dramatic architectural changes are frequently made to the system, such that many games written for the prior-generation system do not work on the later-generation system. For example, the XBox was based on the x86 family of processors, whereas the XBox 360 was based on the PowerPC family.
Techniques can be utilized to emulate a prior architecture, but given that video games are real-time applications, it is often infeasible to achieve exactly the same behavior in an emulation. This is a detriment to the consumer, the video game console manufacturer, and the video game software publisher. For the consumer, it means the necessity of keeping both an old and a new generation of video game console hooked up to the TV in order to be able to play all games. For the console manufacturer, it means the cost associated with the emulation and the slower adoption of new consoles. And for the publisher, it means that multiple versions of new games may have to be released in order to reach all potential consumers: not only a version for each brand of video game (e.g., XBox, Playstation), but often a version for each version of a given brand (e.g., PS2 and PS3). For example, a separate version of Electronic Arts' "Madden NFL 08" was developed for the XBox, XBox 360, PS2, PS3, Gamecube, Wii, and PC platforms, among others.
Portable devices, such as mobile phones and portable media players, also present challenges to game developers. Increasingly, such devices are connected to wireless data networks and are able to download video games. But there is a wide variety of mobile phones and media devices on the market, with a wide range of different display resolutions and computing capabilities. Also, because such devices typically have power consumption, cost and weight constraints, they typically lack advanced graphics acceleration hardware like a Graphics Processing Unit ("GPU"), such as the devices made by NVIDIA of Santa Clara, CA. Consequently, game software developers typically develop a given game title simultaneously for many different types of portable devices. A user may find that a given game title is not available for his particular mobile phone or portable media player.
In the case of home game consoles, the hardware platform manufacturers typically charge a royalty to the software game developers for the ability to publish a game on their platform. Mobile phone wireless carriers also typically charge a royalty to the game publisher to download a game to the mobile phone. In the case of PC games, there is no royalty paid to publish games, but game developers typically face high costs due to the higher customer service burden of supporting a wide range of PC configurations and the installation issues that may arise. Also, PCs typically present less of a barrier to the piracy of game software, since they can be readily reprogrammed by a technically knowledgeable user, and games can be more easily pirated and more easily distributed (e.g., through the Internet). Thus, for a software game developer, there are costs and disadvantages in publishing on game consoles, mobile phones and PCs.
For game publishers of console and PC software, the costs do not end there. To distribute games through retail channels, publishers charge retailers a wholesale price below the retail price so that the retailer has a profit margin. The publisher also typically has to pay the cost of manufacturing and distributing the physical media holding the game. The retailer frequently also charges the publisher a "price protection fee" to cover possible contingencies, such as a game not selling, the game's price being reduced, or the retailer having to refund part or all of the wholesale price and/or take the game back from a buyer. Additionally, retailers typically also charge publishers fees to help market the games in advertising flyers. Moreover, retailers increasingly buy back games from users who have finished playing them, and then sell them as used games, typically sharing none of the used-game revenue with the game publisher. Adding to the cost burden placed upon game publishers is the fact that games are often pirated and distributed through the Internet for users to download and copy for free.
As Internet broadband speeds have increased and broadband connectivity has become more widespread in the US and around the world, particularly to homes and to "Internet cafes" where Internet-connected PCs are rented, games are increasingly being distributed to PCs or consoles via downloads. Also, broadband connections are increasingly used for playing multiplayer and massively multiplayer online games (both of which are referred to in the present disclosure by the acronym "MMOG"). These changes mitigate some of the costs and issues associated with retail distribution. Downloading online games addresses some of the disadvantages to game publishers, in that distribution costs are typically less and there is little or no cost from unsold media. But downloaded games are still subject to piracy, and because of their size (often many gigabytes), they can take a very long time to download. In addition, multiple games can fill up small disk drives, such as those sold with portable computers or with video game consoles. However, to the extent that a game or MMOG requires an online connection for the game to be playable, the piracy problem is mitigated, since the user is usually required to have a valid user account. Unlike linear media (e.g., video and music), which can be copied by a camera shooting video of the display screen or a microphone recording audio from the speakers, each video game experience is unique and cannot be copied using simple video/audio recording. Thus, even in regions where copyright laws are not strongly enforced and piracy is rampant, MMOGs can be shielded from piracy, and therefore a business can be supported. For example, Vivendi SA's "World of Warcraft" MMOG has been successfully deployed without suffering from piracy throughout the world.
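The "very long time to download" noted above can be put in rough numbers. The following back-of-the-envelope calculation is purely illustrative; the game size and link speed are assumed figures, not values from this disclosure:

```python
# Illustrative assumptions: a 10 GB game downloaded over a
# 5 Mbit/s home broadband link (both figures hypothetical).
game_size_gb = 10
link_mbps = 5

bits_to_transfer = game_size_gb * 1024 ** 3 * 8   # bytes -> bits
seconds = bits_to_transfer / (link_mbps * 1_000_000)
hours = seconds / 3600

print(round(hours, 1))  # → 4.8 (nearly five hours for a single title)
```

Even doubling or quadrupling the link speed still leaves a multi-hour wait, which illustrates why download distribution alone did not remove the friction that the streaming approach described in this disclosure is meant to address.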
And many online or MMOG games, such as Linden Lab's "Second Life" MMOG, generate revenue for the games' operators through economic models built into the games, where assets can be bought, sold, and even created using online tools. Thus, mechanisms in addition to conventional game software purchases or subscriptions can be used to pay for the use of online games.
While piracy can often be mitigated due to the nature of online games or MMOGs, online game operators still face remaining challenges. Many games require substantial local (i.e., in-home) processing resources for the online game or MMOG to work properly. If a user has a low-performance local computer (e.g., one without a GPU, such as a low-end notebook computer), he may not be able to play the game. Additionally, as game consoles age, they fall further behind the state of the art and may not be able to handle more advanced games. Even assuming the user's local PC is able to handle the computational requirements of a game, there are often installation complexities. There may be driver incompatibilities (e.g., if a new game is downloaded, it may install a new version of a graphics driver that renders a previously-installed game, reliant upon an old version of the graphics driver, inoperable). A console may run out of local disk space as more games are downloaded. Complex games typically receive downloaded patches over time from the game developer as bugs are found and fixed, or if modifications are made to the game (e.g., if the game developer finds that a level of the game is too hard or too easy to play). These patches require new downloads. But sometimes not all users complete downloading of all the patches. Other times, the downloaded patches introduce other compatibility or disk space consumption issues.
Also, during game play, large data downloads may be required to provide graphical or behavioral information to the local PC or console. For example, if a user enters a room in an MMOG and encounters a scene or a character made up of graphical data, or with behaviors, that are not available on the user's local machine, the data for that scene or character must be downloaded. This can result in a substantial delay during game play if the Internet connection is not fast enough. And, if the encountered scene or character requires storage space or computational capability beyond that of the local PC or console, it can create a situation where the user cannot proceed in the game, or must continue with reduced-quality graphics. Thus, online or MMOG games often limit their storage and/or computational complexity requirements. Additionally, they often limit the amount of data transfer during the game. Online or MMOG games can also narrow the market of users who can play the games.
Furthermore, technically knowledgeable users are increasingly reverse-engineering local copies of games and modifying the games so that they can cheat. Cheats may be as simple as making a button press repeat faster than is humanly possible (e.g., so as to shoot a gun very rapidly). In games that support in-game asset transactions, cheating can reach a level of sophistication that results in fraudulent transactions involving assets of actual economic value. When an online or MMOG economic model is based on such asset transactions, this can result in substantial harmful consequences to the game operator.
The cost of developing a new game has grown as PCs and consoles have become able to produce increasingly sophisticated games (e.g., with more realistic graphics, such as real-time ray tracing, and more realistic behaviors, such as real-time physics simulation). In the early days of the video game industry, video game development was a process very similar to application software development; that is, most of the development cost was in the development of the software, as opposed to the development of the graphical, audio, and behavioral elements or "assets", such as those that may be developed for a motion picture with extensive special effects. Today, many sophisticated video game development efforts more closely resemble special-effects-rich motion picture development than software development. For instance, many video games provide simulations of 3D worlds and generate characters, props and environments that are increasingly photorealistic (i.e., computer graphics that seem as realistic as live-action imagery shot photographically). One of the most challenging aspects of photorealistic game development is creating a computer-generated human face that is indistinguishable from a live-action human face. Face-capture technologies, such as the Contour Reality Capture system developed by Mova of San Francisco, CA, capture and track the precise geometry of a performer's face at high resolution while the performer is in motion. Such technology allows a 3D face to be rendered on a PC or game console that is virtually indistinguishable from the captured live-action face. Capturing and rendering a "photoreal" human face precisely is useful in several respects. First, highly recognizable celebrities or athletes are often used in video games (often hired at high cost), and imperfections may be apparent to the user, making the viewing experience distracting or unpleasant.
Often, a high degree of detail is required to achieve a high degree of photorealism, potentially requiring the rendering of a large number of polygons and high-resolution textures, with the polygons and/or textures possibly changing on a frame-by-frame basis as the face moves.
When scenes with high polygon counts and detailed textures change rapidly, the PC or game console supporting the game may not have enough RAM to store enough polygon and texture data for the requisite number of animation frames generated in the game segment. Further, the single optical drive or single disk drive typically available on a PC or game console is usually much slower than the RAM, and typically cannot keep up with the maximum data rate that the GPU can accept while rendering polygons and textures. Current games typically load most of the polygons and textures into RAM, which means that a given scene is largely limited in complexity and duration by the capacity of the RAM. In the case of facial animation, for example, this may limit a PC or game console to either a low-resolution face that is not photoreal, or to a photoreal face that can only be animated for a limited number of frames before the game pauses and loads polygons and textures (and other data) for more frames.
Watching a progress bar move slowly across the screen while a PC or console displays a message similar to "Loading..." is accepted as an inherent drawback by users of today's complex video games. The delay while the next scene loads from disk (unless otherwise qualified, "disk" herein refers to non-volatile optical or magnetic media, as well as to non-disk media such as semiconductor "flash" memory) can take several seconds or even several minutes. This is a waste of time and can be quite frustrating to the game player. As discussed previously, much or all of the delay may be due to the load time for polygon, texture, or other data from the disk, but it may also be the case that part of the load time is spent while the processor and/or GPU in the PC or console prepares data for the scene. For example, a soccer video game may allow the players to choose among a large number of players, teams, stadiums and weather conditions. So, depending on what particular combination is chosen, different polygons, textures and other data (collectively "objects") may be required for the scene (e.g., different teams have different colors and patterns on their uniforms). It may be possible to enumerate many or all of the various permutations, pre-compute many or all of the objects in advance, and store the objects on the disk used to store the game. But if the number of permutations is large, the amount of storage required for all of the objects may be too large to fit on the disk (or too impractical to download). Thus, existing PC and console systems are typically constrained in both the complexity and the play duration of a given scene, and suffer long load times for complex scenes.
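The permutation-storage point above can be made concrete with some rough arithmetic. All of the counts and the per-combination data size below are invented for illustration; they are not figures from this disclosure:

```python
# Hypothetical option counts for a sports game's selectable scene:
players, teams, stadiums, weather = 300, 32, 20, 6

# Number of distinct configurations if objects were pre-computed
# for every combination of selections.
combinations = players * teams * stadiums * weather

# Assume (purely for illustration) 50 MB of pre-computed
# polygon/texture data per configuration.
mb_per_combination = 50
total_gb = combinations * mb_per_combination / 1024

print(combinations)      # → 1152000 configurations
print(round(total_gb))   # → 56250 GB, far beyond any game disk
```

Even with these modest per-option counts, exhaustive pre-computation balloons into tens of thousands of gigabytes, which is why existing systems instead load and prepare objects at scene time and accept the resulting load delays.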
Another significant limitation of prior art video game systems and application software systems is that they increasingly use large databases of, for example, 3D objects such as polygons and textures that need to be loaded into the PC or game console for processing. As discussed above, such databases can take a long time to load when stored locally on a disk. Load time, however, is usually far more severe if the database is stored at a remote location and accessed through the Internet. In such a situation, it may take minutes, hours, or even days to download a large database. Further, such databases are often created at great expense (e.g., a detailed 3D model of a tall-masted sailing ship for use in a game, movie, or historical documentary) and are intended for sale to the local end-user. However, once downloaded to the local user, the database is at risk of being pirated. In many cases, a user wants to download a database simply for the sake of evaluating it to see if it suits his needs (e.g., whether a 3D costume for a game character has a satisfactory look or appearance when the user performs a particular move). A long load time can be a deterrent to the user evaluating a 3D database before deciding to make a purchase.
Similar issues occur in MMOGs, particularly games that allow users to utilize increasingly customized characters. For a PC or game console to display a character, it needs access to a database of 3D geometry (polygons, textures, etc.) as well as the character's behaviors (e.g., if the character has a shield, whether the shield is strong enough to deflect a spear). Typically, when an MMOG is first played by a user, a large database for the characters is already available with the initial copy of the game, which is available locally on the game's optical disk or downloaded to a disk. But as the game progresses, if the user encounters a character or object whose database is not available locally (e.g., if another user has created a customized character), then before that character or object can be displayed, its database must be downloaded. This can result in a substantial delay in the game.
Given the sophistication and complexity of video games, another challenge for video game developers and publishers with prior art video game consoles is that it frequently takes two to three years and tens of millions of dollars to develop a video game. Given that new video game console platforms are introduced at a rate of roughly once every five years, game developers need to start development of those games years in advance of the release of a new game console in order for the video games to be available concurrently with the release of the new platform. Several consoles from competing manufacturers are sometimes released at about the same time (e.g., within a year or two of one another), but what remains to be seen is the popularity of each console (e.g., which console will produce the largest video game software sales). For example, in a recent console cycle, the Microsoft XBox 360, the Sony Playstation 3, and the Nintendo Wii were scheduled to be introduced in about the same general time frame. But years before those introductions, the game developers essentially had to "place their bets" on which console platforms would be more successful than others, and devote their development resources accordingly. Motion picture production companies similarly have to apportion their limited production resources based on their estimate of the likely success of a movie well in advance of its release. Given the growing level of investment required for video games, game production is increasingly like motion picture production, and game production companies routinely devote their production resources based on their estimate of the future success of a particular video game. But, unlike motion picture companies, this wager is not simply based on the success of the production itself; instead, it is premised on the success of the game console the game is intended to run on.
Publishing games on multiple consoles at the same time may mitigate the risk, but this additional effort increases cost, and often delays the actual release of the game.
Application software and user environments on PCs are becoming more computationally intensive, dynamic, and interactive, not only to make them more visually appealing to users, but also to make them more useful and intuitive. For example, both the new Windows Vista™ operating system and successive versions of the Macintosh® operating system incorporate visual animation effects. Advanced graphics tools, such as Maya™ from Autodesk, Inc., provide very sophisticated 3D rendering and animation capabilities that push the limits of state-of-the-art CPUs and GPUs. However, the computational requirements of these new tools create a number of practical issues for users and software developers of such products.
Because the visual display of an operating system (OS) must work on a wide range of computers, including prior-generation computers that are no longer sold but can still be upgraded to the new OS, OS graphics requirements are largely limited by a least common denominator of the computers for which the OS is intended, which typically includes computers that do not include a GPU. This severely limits the graphics capabilities of the OS. Furthermore, battery-powered portable computers (e.g., laptops) limit visual display capability, since high computational activity in the CPU or GPU typically results in higher power consumption and shorter battery life. Portable computers typically include software that automatically lowers processor activity to reduce power consumption when the processor is not being utilized. In some computer models the user may lower processor activity manually. For example, Sony's VGN-SZ280P laptop contains a switch labeled "Stamina" on one side (for low performance, longer battery life) and "Speed" on the other (for high performance, shorter battery life). An OS running on a portable computer must be able to function usably even if the computer is running at a fraction of its peak performance capability. Thus, OS graphics performance is often kept far below the computing power available in the current state of the art.
High-end, computationally intensive applications, such as Maya, are frequently sold with the expectation that they will be used on high-performance PCs. This typically establishes a much higher-performance, far more expensive and less portable, least-common-denominator hardware requirement. As a consequence, such applications have a much more limited target audience than a general-purpose OS (or general-purpose productivity applications, like Microsoft Office), and typically sell in much lower volumes than general-purpose OS software or general-purpose application software. The potential audience is further limited because it is often difficult for a prospective user to try out such a computationally intensive application in advance. For example, suppose a student wishes to learn how to use Maya, or a potential purchaser who already knows the application wishes to try Maya before making the investment in a purchase (which may well involve also buying a high-end computer capable of running Maya). While the student or potential purchaser can download, or get a physical media copy of, a demo version of Maya, if they lack a computer capable of running Maya to its full potential (e.g., handling a complex 3D scene), they will be unable to make a fully-informed assessment of the product. This substantially limits the audience for such high-end applications. It also contributes to a high selling price, since the development cost is usually amortized across a much smaller number of purchases than those of general-purpose applications.
High-priced applications also create more of an incentive for individuals and businesses to use pirated copies of the application software. As a result, high-end application software suffers from rampant piracy, despite significant efforts by publishers of such software to mitigate piracy through various techniques. Still, even when using pirated high-end applications, users cannot eliminate the need to invest in an expensive state-of-the-art PC to run the pirated copy. So, while a user may obtain the use of a software application for a fraction of its actual retail price, the user of pirated software is still required to purchase or obtain an expensive PC in order to fully utilize the application.
The same holds true for users of high-performance pirated video games. Although pirates may get the games for a small fraction of their actual price, they are still required to purchase the expensive computing hardware (e.g., a GPU-enhanced PC, or a high-end video game console like the Xbox 360) needed to properly play the game. Given that video games are typically a pastime for consumers, the additional cost for a high-end video game system can be prohibitive. This situation is worse in countries (e.g., China) where the average annual income of workers is quite low relative to that of workers in the United States. As a consequence, a much smaller percentage of the population owns a high-end video game system or a high-end PC. In such countries, "Internet cafes", in which users pay a fee to use a computer connected to the Internet, are quite common. Frequently, these Internet cafes have older-model or low-end PCs without high-performance features, such as a GPU, that would otherwise enable players to play computationally intensive video games. This is a key factor in the success of games that run on low-end PCs, such as Vivendi's "World of Warcraft", which is highly successful in China and is commonly played in Internet cafes there. In contrast, a computationally intensive game, like "Second Life", is much less likely to be playable on a PC installed in a Chinese Internet cafe. Such games are virtually inaccessible to users who only have access to low-performance PCs in Internet cafes.
There are also barriers for users who would consider purchasing a video game and who would first like to try out a demonstration version of the game by downloading the demo to their home via the Internet. A video game demo is often a full-fledged version of the game with some features disabled, or with limits placed on the amount of game play. This may involve a long process (perhaps hours) of downloading gigabytes of data before the game can be installed and executed on either a PC or a console. In the case of a PC, it may also involve figuring out what special drivers are needed for the game (e.g., DirectX or OpenGL drivers), downloading the correct version, installing it, and then determining whether the PC is capable of playing the game. This latter step may involve determining whether the PC has enough processing (CPU and GPU) capability, sufficient RAM, and a compatible OS (e.g., some games run on Windows XP, but not Vista). Thus, after a long process of attempting to run a video game demo, the user may well discover that the demo cannot be played, given the user's PC configuration. Worse, once the user has downloaded new drivers in order to try the demo, those driver versions may be incompatible with other games or applications the user regularly uses on the PC, so installing a demo may render previously operable games or applications inoperable. Not only are these barriers frustrating for the user, but they also create barriers for video game software publishers and video game developers to market their games.
Another problem that leads to economic inefficiency relates to the fact that a given PC or game console is usually designed to accommodate a certain level of performance requirements for applications and/or games. For example, some PCs have more or less RAM, slower or faster CPUs, and slower or faster GPUs (if they have a GPU at all). Some games or applications take advantage of the full computing power of a given PC or console, while many do not. If a user's selection of games or applications never reaches the peak performance capabilities of the local PC or console, then the user may have wasted money on the PC or console for unused features. In the case of a console, the console manufacturer may have paid more than was necessary to subsidize the console cost.
Another problem that exists in the sale and enjoyment of video games involves allowing a user to watch others playing a game before committing to its purchase. Several prior art approaches exist for recording video games for replay at a later time. For example, U.S. Patent No. 5,558,339 teaches recording game state information, including game controller actions, in the video game client computer (owned by the same or a different user) during "game play". This state information can be used at a later time to replay some or all of the game action on a video game client computer (e.g., a PC or console). A significant drawback of this approach is that, for a user to view the recorded game, the user must possess a video game client computer capable of playing the game and must have the video game application running on that computer, such that the game play is identical when the recorded game state is replayed. Beyond that, the video game application must be written in such a way that there is no possible execution difference between the recorded game and the played-back game.
For example, game graphics are generally computed on a frame-by-frame basis. For many games, the game logic may sometimes take less than one frame time, or longer than one frame time, to compute the graphics displayed for the next frame, depending on whether the scene is particularly complex, or whether there are other delays that slow execution (e.g., on a PC, another process may be running that takes CPU cycles away from the game application). In such a game, a "threshold" frame that is computed in slightly less than one frame time (say, a few CPU clock cycles less) can eventually occur. When that same scene is recomputed using exactly the same game state information, it could easily take a few CPU clock cycles more than one frame time (e.g., if an internal CPU bus is slightly out of phase with the external DRAM bus, adding a few CPU cycle times of delay, even without a large delay from another process taking away milliseconds of CPU time from the game processing). Thus, when the game is played back, the frame ends up being computed in two frame times rather than in a single frame time. Some behaviors are based on how often the game computes a new frame (e.g., when the game samples the input from the game controllers). When the game is played, this deviation in the time reference for different behaviors does not affect game play, but it can cause the played-back game to produce a different result. For example, if the trajectory of a basketball is computed at a steady 60 fps rate, but the game controller input is sampled based on the rate of computed frames, the rate of computed frames may have been 53 fps when the game was recorded and 52 fps when the game was replayed, which can make a difference as to whether the basketball is stopped from going into the basket, producing a different outcome.
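The divergence described above can be illustrated with a brief sketch (a hypothetical illustration, not part of the disclosed system): when controller input is sampled once per computed frame, the number of samples taken over a fixed interval of game time depends on the achieved frame rate, so a replay that runs slightly slower sees different input data.

```python
# Hypothetical sketch: controller input is sampled once per computed frame,
# so the number of samples taken during a fixed one-second interval depends
# on the achieved frame rate, not on wall-clock time alone.

def samples_during_interval(frame_time_ms: float, interval_ms: float = 1000.0) -> int:
    """Number of input samples taken while `interval_ms` of game time elapses."""
    return int(interval_ms // frame_time_ms)

# Recording session: frames complete just under the 60 fps budget (16.6 ms each).
recorded = samples_during_interval(frame_time_ms=16.6)

# Replay session: a few extra CPU cycles push "threshold" frames over budget,
# so the same scenes are computed in two frame times (33.3 ms) instead of one.
replayed = samples_during_interval(frame_time_ms=33.3)

# The replay sees roughly half as many controller samples over the same
# interval; any physics (e.g., a basketball's trajectory) driven by per-frame
# sampling can therefore produce a different outcome than the recorded game.
print(recorded, replayed)  # 60 30
```

This is why, as the next paragraph notes, recording via game state demands very careful software design to guarantee bit-identical replay.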
Therefore, recording a video game using game state requires a very careful game software design to ensure that playback using the same game state information produces exactly the same result.
Another prior art method for recording a video game is to simply record the video output of the PC or video game system (e.g., to a VCR, a DVD recorder, or to a video capture board on a PC). The video can then be rewound and replayed, or alternatively, the recorded video can be uploaded to the Internet (typically after the video is compressed). A disadvantage of this approach is that, when a 3D game sequence is played back, the user is limited to viewing the sequence only from the viewpoint from which it was recorded. In other words, the user cannot change the viewpoint of the scene.
In addition, when compressed video of a recorded game sequence played on a home PC or game console is made available to other users via the Internet, even if the video is compressed in real time, it may be impossible to upload it to the Internet in real time. The reason is that many homes in the world connected to the Internet have highly asymmetric broadband connections (e.g., DSL and cable modems typically have far higher downstream bandwidth than upstream bandwidth). Compressed high-resolution video sequences often have higher bandwidth than the upstream bandwidth capacity of the network, making real-time upload impossible. Thus, there would be a significant delay after the game sequence is played (perhaps minutes or even hours) before another user on the Internet could view the game. Although this delay is tolerable in certain situations (e.g., watching a game player's accomplishments that occurred at a prior time), it eliminates the ability to watch a game live (e.g., a basketball tournament played by championship players), or the ability to provide "instant replay" as the game is played live.
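A quick back-of-the-envelope calculation makes the asymmetry point concrete (the bitrates below are illustrative assumptions, not figures from this description): real-time upload is only possible when the compressed video bitrate fits within the connection's upstream capacity.

```python
# Illustrative arithmetic (assumed bitrates): when the compressed video
# bitrate exceeds the upstream capacity, the upload cannot keep pace with
# real time and a delay accumulates.

def upload_time_s(video_bitrate_mbps: float, duration_s: float,
                  upstream_mbps: float) -> float:
    """Seconds needed to upload `duration_s` of video; a result greater than
    duration_s means the upload cannot keep up with real time."""
    return duration_s * video_bitrate_mbps / upstream_mbps

# A hypothetical asymmetric DSL line: fast downstream, but only 0.5 Mbps up.
t = upload_time_s(video_bitrate_mbps=4.0, duration_s=60.0, upstream_mbps=0.5)
print(t)  # 480.0 -> one minute of 4 Mbps video takes 8 minutes to upload
```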
Another prior art approach allows a viewer with a television receiver to watch video games live, but only under the control of the television production staff. Some television channels, in the United States and other countries, provide video game viewing channels on which the television audience can watch certain video game users (e.g., top-rated players playing in tournaments). This is accomplished by feeding the video output of the video game systems (PCs and/or consoles) into the video distribution and processing equipment of the television channel. It is much the same as when a television channel broadcasts a live basketball game, in which several cameras provide live feeds from different angles around the court. The television channel is then able to make use of its video/audio processing and effects equipment to manipulate the output from the various video game systems. For example, a television channel can overlay text indicating the status of different players on top of the video from a video game (just as it might overlay text during a live basketball game), and the television channel can overdub audio from commentators who can discuss the action occurring during the games. Additionally, the video game output can be combined with cameras recording video of the actual players of the games (e.g., showing their emotional reactions to the game).
One problem with this approach is that the live video feeds must be available to the television channel's video distribution and processing equipment in real time in order to have the excitement of a live broadcast. As previously discussed, however, this is often impossible when the video game system is running from the home, especially if part of the broadcast includes live video from a camera capturing real-world video of the game player. Further, in a tournament situation, there is a concern that an in-home player might modify the game and cheat, as previously described. For these reasons, such video game broadcasts on television channels are often arranged with the players and video game systems aggregated at a common location (e.g., at a television studio or in an arena) where the television production equipment can accept video feeds from multiple video game systems and potentially live cameras.
While such prior art video game television channels can provide a very exciting show to the television audience, an experience akin to a live sporting event (e.g., with the video game players presented as "athletes"), both in terms of their actions in the video game world and in terms of their actions in the real world, these video game systems are often limited to situations where the players are in close physical proximity to one another. And, because television channels are broadcast, each broadcast channel can show only one video stream, which is selected by the television channel's production staff. Because of these limitations and the high cost of airtime, production equipment, and production staff, such television channels typically show only top-rated players playing in top tournaments.
In addition, a given television channel that broadcasts a full-screen image of a video game to the entire television audience shows only one video game at a time. This severely limits a television viewer's choices. For example, a television viewer may not be interested in the game(s) shown at a given time. Another viewer may only be interested in watching the game play of a particular player who is not featured by the television channel at a given time. In other cases, a viewer may only be interested in watching how an expert player handles a particular level in a game. Still other viewers may wish to control the viewpoint from which the video game is seen, which differs from that chosen by the production team, etc. In short, a television viewer may have a myriad of preferences in watching video games that are not accommodated by the particular broadcast of a television network, even if several different television channels are available. For all of the aforementioned reasons, prior art video game television channels have significant limitations in presenting video games to television viewers.
Another drawback of prior art video game systems and application software systems is that they are complex, and commonly suffer from errors, crashes, and/or unintended and undesired behavior (collectively, "defects"). Although games and applications typically go through a debugging and tuning process (frequently called "software quality assurance" or SQA) before release, almost invariably, once the game or application is released to a wide audience in the field, defects crop up. Unfortunately, it is difficult for the software developer to identify and track down many of the defects after release. It can be difficult for software developers even to become aware of a defect. And even when they learn about a defect, there may only be a limited amount of information available to identify what caused it. For example, a user may call a game developer's customer service line and leave a message stating that, when playing the game, the screen started to flash, then changed to a solid blue color, and the PC froze. That provides the SQA team with very little information useful in tracking down the defect. Some games or applications that are connected online can sometimes provide more information in certain cases. For example, a "watchdog" process can sometimes be used to monitor the game or application for "crashes". The watchdog process can gather statistics about the status of the game or application process (e.g., the state of memory and stack usage, how far the game or application had progressed, etc.) when it crashes and then upload that information to the SQA team via the Internet. But in a complex game or application, such information can take a very long time to decipher in order to accurately determine what the user was doing at the time of the crash. Even then, it may be impossible to determine what sequence of events led to the crash.
Yet another problem associated with PCs and game consoles is that they suffer from service issues that greatly inconvenience the consumer. The service issues also impact the manufacturer of the PC or game console, since it typically must send a special box to safely ship the broken PC or console, and then incur the cost of repair if the PC or console is under warranty. The game or application software publisher can also be impacted by the loss of sales (or online service use) while PCs and/or consoles are in a state of repair.
FIG. 1 illustrates a prior art video game system, such as a Sony Playstation® 3, Microsoft Xbox 360®, Nintendo Wii™, Windows-based personal computer, or Apple Macintosh. Each of these systems includes a central processing unit (CPU) for executing program code, typically a graphics processing unit (GPU) for performing advanced graphics operations, and multiple forms of input/output (I/O) for communicating with external devices and users. For simplicity, these components are shown combined together as a single unit 100. The prior art video game system of FIG. 1 is also shown to include an optical media drive 104 (e.g., a DVD-ROM drive); a hard drive 103 for storing video game program code and data; a network connection 105 for playing multiplayer games and for downloading games, patches, demos, or other media; a random access memory (RAM) 101 for storing program code currently being executed by the CPU/GPU 100; a game controller 106 for receiving input commands from the user during game play; and a display device 102 (e.g., an SDTV/HDTV or a computer monitor).
The prior art system shown in FIG. 1 suffers from several limitations. First, the optical drive 104 and hard drive 103 tend to have much slower access speeds than that of the RAM 101. When working directly through the RAM 101, the CPU/GPU 100 can, in practice, process far more polygons per second than is possible when the program code and data are read directly off of the hard drive 103 or optical drive 104, because RAM 101 generally has much higher bandwidth and does not suffer from the relatively long seek delays of disc mechanisms. But only a limited amount of RAM is provided in these prior art systems (e.g., 256-512 megabytes). Therefore, a "Loading…" sequence is frequently required, in which the RAM 101 is periodically filled up with the data for the next scene of the video game.
Some systems attempt to overlap the loading of the program code with the game play, but this can only be done when there is a known sequence of events (e.g., if a car is driving down a road, the geometry for the approaching buildings on the roadside can be loaded while the car is driven). For complex and/or rapid scene changes, this type of overlapping usually does not work. For example, in the case where the user is in the midst of a battle and the RAM 101 is completely filled with data representing the objects within view at that moment, if the user moves the view rapidly to the left to view objects that are not presently loaded in the RAM 101, a discontinuity in the action will result, since there is not enough time to load the new objects from the hard drive 103 or optical media 104 into the RAM 101.
Another problem with the system of FIG. 1 arises due to limitations of the storage capacity of the hard drive 103 and optical media 104. Although disk storage devices can be manufactured with relatively large storage capacities (e.g., 500 megabytes or more), they still do not provide enough storage capacity for certain scenarios encountered in current video games. For example, as previously mentioned, a soccer video game might allow the user to choose among many teams, players, and stadiums throughout the world. For each team, each player, and each stadium, a large number of texture maps and environment maps are needed to characterize the 3D surfaces in the world (e.g., each team has a unique jersey, with each requiring a unique texture map).
One technique used to address this latter problem is for the game to pre-compute texture and environment maps once they are selected by the user. This can involve a number of computationally intensive processes, including decompressing images, 3D mapping, shading, organizing data structures, etc. As a result, there may be a delay for the user while the video game performs these calculations. One way to reduce this delay, in principle, is to perform all of these computations initially, when the game is developed, including every permutation of team, player roster, and stadium. The released version of the game would then include all of this pre-processed data stored on the optical media 104, or on one or more servers on the Internet, with just the selected pre-processed data for a given team, player roster, and stadium selection downloaded through the Internet to the hard drive 103 when the user makes a selection. As a practical matter, however, such pre-loaded data of every permutation possible in game play could easily be several terabytes of data, which far exceeds the capacity of today's optical media devices. Furthermore, the data for a given team, player roster, and stadium selection could easily be several megabytes of data or more. With a home network connection of, say, 10 Mbps, downloading this data through the network connection 105 could take longer than computing the data locally.
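The combinatorial growth behind the "several terabytes" estimate can be sketched as follows. All of the counts and the per-selection data size below are invented for illustration; the point is only that pre-computing every matchup permutation multiplies quickly.

```python
# Hypothetical illustration of why pre-computing every permutation is
# impractical; the counts and per-selection size are assumptions, not
# figures from the specification.

teams, rosters_per_team, stadiums = 20, 4, 20
size_per_selection_mb = 50  # pre-processed texture/environment data per selection

# Ordered pairs of distinct teams, each with its own roster choice, per stadium.
permutations = teams * (teams - 1) * rosters_per_team ** 2 * stadiums
total_tb = permutations * size_per_selection_mb / 1_000_000  # MB -> TB

print(permutations)  # 121600 distinct team/roster/stadium selections
print(total_tb)      # 6.08 TB of pre-processed data, even for modest counts
```

Even with these deliberately modest counts, the pre-computed data reaches the multi-terabyte range the text describes, far beyond optical media capacity.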
Thus, the prior art game architecture shown in FIG. 1 subjects the user to significant delays for major scene transitions in complex games.
Another problem with prior art approaches, such as that shown in FIG. 1, is that, over the years, video games have tended to become more advanced and to require more CPU/GPU processing power. Thus, even assuming an unlimited amount of RAM, video game hardware requirements go beyond the peak level of processing power available in these systems. As a result, users are required to upgrade gaming hardware every few years to keep pace (or play newer games at lower quality levels). One consequence of this trend to ever more advanced video games is that video game playing machines for home use are typically economically inefficient, because their cost is usually determined by the requirements of the highest-performance game they can support. For example, an Xbox 360 might be used to play a game like "Gears of War", which demands a high-performance CPU and GPU and a great deal of RAM, or the Xbox 360 might be used to play "Pac Man", a game from the 1970s that requires only kilobytes of RAM and a very low-performance CPU. Indeed, an Xbox 360 has enough computing power to host many simultaneous "Pac Man" games at once.
Video game machines are typically turned off for most of the hours of a week. According to a July 2006 Nielsen Entertainment study of active gamers 13 years and older, on average, active gamers spend only 12% of the total hours of the week playing console video games. This means that the average video game console is idle 88% of the time, which is an inefficient use of an expensive resource. This is particularly significant given that video game consoles are often subsidized by the manufacturer to bring down the purchase price (with the expectation that the subsidy will be earned back through royalties from future video game software purchases).
Video game consoles also incur costs associated with almost any consumer electronic device. For instance, the electronics and mechanisms of the system need to be housed in an enclosure. The manufacturer needs to offer a service warranty. The retailer who sells the system needs to collect a margin on the sale of the system and/or on the sale of video game software. All of these factors add to the cost of the video game console, which must either be subsidized by the manufacturer, passed along to the consumer, or both.
In addition, piracy is a major problem for the video game industry. The security mechanisms utilized on virtually every major video game system have been "cracked" over the years, resulting in unauthorized copying of video games. For example, the Xbox 360 security system was cracked in July 2006, and users are now able to download illegal copies online. Games that are downloadable (e.g., games for the PC or the Mac) are particularly vulnerable to piracy. In certain regions of the world where piracy is weakly policed, there is essentially no viable market for standalone video game software, because users can buy pirated copies as readily as legal copies, for a tiny fraction of the cost. Also, in many parts of the world, the cost of a game console is such a high percentage of income that, even if piracy were controlled, few people could afford a state-of-the-art gaming system.
In addition, the used game market reduces revenue for the video game industry. When a user has become tired of a game, they can sell the game to a store, which resells the game to other users. This unauthorized but common practice significantly reduces the revenues of game publishers. Similarly, a reduction in sales on the order of 50% commonly occurs when there is a platform transition every few years. This is because users stop buying games for the older platforms when they know that a newer version of the platform is about to be released (e.g., when Playstation 3 was about to be released, users stopped buying Playstation 2 games). Combined, the loss of sales and the increased development costs associated with the new platforms can have a very significant adverse impact on the profitability of game developers.
New game consoles are also very expensive. The Xbox 360, the Nintendo Wii, and the Sony Playstation 3 all retail for hundreds of dollars. High-powered personal computer gaming systems can cost up to $8000. This represents a significant investment for users, especially considering that the hardware becomes obsolete after a few years and the fact that many systems are purchased for children.
One approach to the aforementioned problems is online gaming, in which the gaming program code and data are hosted on a server and delivered to client machines on demand as compressed video and audio streamed over a digital broadband network. Some companies, such as G-Cluster in Finland (now a subsidiary of Japan's SOFTBANK Broadmedia), currently provide these services online. Similar gaming services have become available in local networks, such as those within hotels and those provided by DSL and cable television providers. A major drawback of these systems is the problem of latency, i.e., the time it takes for a signal to travel to and from the game server, which is typically located in an operator's "head-end". Fast-action video games (also known as "twitch" video games) require very low latency between the time the user performs an action with the game controller and the time the display screen is updated to show the result of the user action. Low latency is needed so that the user has the perception that the game is responding "instantly". Users may be satisfied with different latency intervals depending on the type of game and the skill level of the user. For example, 100 ms of latency may be tolerable for a slow casual game (like checkers) or a slow-action role-playing game, but in a fast-action game, latency in excess of 70 or 80 ms may cause the user to perform more poorly in the game, and is thus unacceptable. For instance, in a game that requires fast reaction time, there is a sharp decline in accuracy as latency increases from 50 to 100 ms.
When a game or application server is installed in a nearby, controlled network environment, or one where the network path to the user is predictable and/or can tolerate bandwidth peaks, it is far easier to control latency, both in terms of maximum latency and in terms of the consistency of the latency (e.g., so the user observes steady motion from digital video streaming through the network). Such a level of control can be achieved between a cable TV network head-end and a cable TV subscriber's home, or from a DSL central office to a DSL subscriber's home, or in a commercial office local area network (LAN) environment between a server and a user. Also, it is possible to obtain specially-graded point-to-point private connections between businesses that have guaranteed bandwidth and latency. But in a game or application system that hosts games in a server center connected to the general Internet and then streams compressed video to the user through a broadband connection, latency is incurred by many factors, resulting in severe limitations in the deployment of prior art systems.
In a typical broadband-connected home, a user may have a DSL or cable modem for broadband service. Such broadband services commonly introduce as much as a 25-millisecond round-trip latency (and at times more) between the user's home and the general Internet. In addition, there is round-trip latency incurred by routing data through the Internet to a server center. The latency through the Internet varies based on the route the data is given and the delays it incurs as it is routed. In addition to routing delays, round-trip latency is also incurred due to the speed of light traveling through the optical fiber that interconnects much of the Internet. For example, roughly 22 milliseconds of round-trip latency is incurred for every 1000 miles, due to the speed of light through the optical fiber and other overhead.
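The ~22 ms per 1000 miles figure quoted above can be sanity-checked with a short calculation. The following sketch is purely illustrative and not part of the disclosure; the refractive index of fiber (~1.47) and the flat "overhead" multiplier are our assumptions.

```python
# Back-of-the-envelope check of the ~22 ms round-trip figure.
# Assumptions (not from this disclosure): silica fiber refractive index
# ~1.47, so light propagates at roughly c / 1.47; router hops and
# serialization ("other overhead") are modeled as a flat multiplier.

C_VACUUM_KM_S = 299_792   # speed of light in vacuum, km/s
FIBER_INDEX = 1.47        # typical refractive index of silica fiber
KM_PER_MILE = 1.609

def fiber_rtt_ms(one_way_miles: float, overhead_factor: float = 1.3) -> float:
    """Round-trip propagation delay through optical fiber, in milliseconds."""
    one_way_km = one_way_miles * KM_PER_MILE
    speed_km_s = C_VACUUM_KM_S / FIBER_INDEX
    rtt_s = 2 * one_way_km / speed_km_s
    return rtt_s * 1000 * overhead_factor

# For a server 1000 miles away, raw fiber propagation alone is ~15.8 ms
# round trip; with modest overhead it lands near the ~22 ms cited above.
```

Note that this is a hard physical floor: no improvement in server or client hardware can reduce propagation delay, which is why server-center placement relative to the user matters.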
Additional latency can occur due to the data rate of the data streamed over the Internet. For example, if a user has DSL service sold as "6 Mbps DSL service", in practice the user will probably get less than 5 Mbps of downstream throughput at best, and will likely see the connection degrade periodically due to various factors, such as congestion at the Digital Subscriber Line Access Multiplexer (DSLAM) during peak load times. A similar issue can occur if there is congestion in the local shared coaxial cable looped through the neighborhood, or elsewhere in the cable modem system network, reducing the data rate of a cable modem used for a connection sold as "6 Mbps cable modem service" to far less than that data rate. If data packets at a steady rate of 4 Mbps are streamed one-way in User Datagram Protocol (UDP) format from a server center through such a connection, and everything is working well, the packets will pass through without incurring additional latency; but if there is congestion (or other impediments) and only 3.5 Mbps is available to stream data to the user, then in a typical situation either packets will be dropped, resulting in lost data, or packets will queue up at the point of congestion until they can be sent, thereby introducing additional latency. Different points of congestion have different queuing capacities for holding delayed packets, so in some cases packets that cannot make it through the congestion are dropped immediately. In other cases, megabits of data are queued up and eventually sent. But in almost all cases, queues at points of congestion have capacity limits, and once those limits are exceeded, the queues will overflow and packets will be dropped. Thus, to avoid incurring additional latency (or, worse, packet loss), it is necessary to avoid exceeding the data-rate capacity from the game or application server to the user.
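The queue-then-drop behavior described above can be illustrated with a toy model. This sketch is not from the disclosure; the per-second time step, the function name, and the 1 Mbit queue limit are assumptions chosen to match the 4 Mbps vs. 3.5 Mbps example in the text.

```python
# Toy illustration of why sending above the available capacity either
# queues packets (adding latency) or drops them once the queue at the
# congestion point overflows.

def simulate_congestion(send_mbps: float, capacity_mbps: float,
                        queue_limit_mbit: float, seconds: int):
    """Return (final queue backlog in Mbit, total Mbit dropped)."""
    backlog = dropped = 0.0
    for _ in range(seconds):
        backlog += send_mbps                     # bits arriving this second
        backlog -= min(backlog, capacity_mbps)   # bits drained this second
        if backlog > queue_limit_mbit:           # queue overflow: drop excess
            dropped += backlog - queue_limit_mbit
            backlog = queue_limit_mbit
    return backlog, dropped

# Sending 4 Mbps into a 3.5 Mbps bottleneck with a 1 Mbit queue:
# the queue fills within ~2 seconds, after which 0.5 Mbit is dropped
# every second, and a standing 1 Mbit backlog adds constant latency.
backlog, dropped = simulate_congestion(4.0, 3.5, 1.0, 10)
```

The standing backlog is the key point for interactive video: even when no packets are dropped, a persistently full queue translates directly into added latency for every frame.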
Latency is also incurred by the time required to compress video in the server and decompress video in the client device. Further latency is incurred while a video game running on the server is computing the next frame to be displayed. Currently available video compression algorithms suffer from either high data rates or high latency. For example, motion JPEG is an intraframe-only lossy compression algorithm characterized by low latency. Each frame of video is compressed independently of each other frame of video. When a client device receives a frame of compressed motion JPEG video, it can immediately decompress the frame and display it, resulting in very low latency. But because each frame is compressed separately, the algorithm is unable to exploit similarities between successive frames, and as a result intraframe-only video compression algorithms suffer from very high data rates. For example, 60 fps (frames per second) 640 × 480 motion JPEG video may require 40 Mbps (megabits per second) of data or more. Such high data rates for such low-resolution video windows would be prohibitively expensive in many broadband applications (and certainly for most consumer Internet-based applications). Further, because each frame is compressed independently, artifacts in the frames that may result from the lossy compression are likely to appear in different places in successive frames. This can result in what appears to the viewer as moving visual artifacts when the video is decompressed.
Other compression algorithms, such as MPEG2, H.264, or VC9 from Microsoft Corporation, as used in prior-art configurations, can achieve high compression ratios, but at the cost of high latency. Such algorithms utilize interframe compression as well as intraframe compression. Periodically, such algorithms perform an intraframe-only compression of a frame. Such a frame is known as a key frame (typically referred to as an "I" frame). Then, these algorithms typically compare the I frame with both prior frames and successive frames. Rather than compressing the prior frames and successive frames independently, the algorithm determines what has changed in the image from the I frame to the prior and successive frames, and then stores those changes as "B" frames (for the changes preceding the I frame) and "P" frames (for the changes following the I frame). This results in much lower data rates than intraframe-only compression. But it typically comes at the cost of higher latency. An I frame is typically much larger than a B or P frame (often 10 times larger), and as a result, it takes proportionately longer to transmit at a given data rate.
Consider, for example, a situation in which an I frame is 10 times the size of a B or P frame, and there are 29 B frames + 30 P frames = 59 interframes for every single I frame, or 60 frames total for each "Group of Pictures" (GOP). So, at 60 fps, there is one 60-frame GOP each second. Suppose the transmission channel has a maximum data rate of 2 Mbps. To achieve the highest-quality video in the channel, the compression algorithm would produce a 2 Mbps data stream, and given the above ratios, this would result in 2 Megabits (Mb)/(59 + 10) = 30,394 bits per interframe and 10 × 30,394 = 303,935 bits per I frame. When the compressed video stream is received by the decompression algorithm, in order for the video to play steadily, each frame needs to be decompressed and displayed at a regular interval (e.g., 60 fps). To achieve this result, if any frame is subject to transmission latency, all of the frames need to be delayed by at least that latency, so the worst-case frame latency defines the latency of every video frame. The I frames, being the largest, introduce the longest transmission latencies, and the entire I frame must be received before the I frame (or any interframe dependent upon it) can be decompressed and displayed. Given the channel data rate of 2 Mbps, transmitting an I frame would take 303,935/2Mb = 145 milliseconds.
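The per-frame arithmetic above can be worked out explicitly. The following sketch simply reproduces the example's own assumptions (one 60-frame GOP per second, an I frame 10 times the size of an interframe, a 2 Megabit budget taken as 2^21 bits); the variable names are ours.

```python
# Worked GOP bit-budget arithmetic from the example above.

GOP_FRAMES = 60
INTERFRAMES = 59          # 29 B frames + 30 P frames
I_FRAME_WEIGHT = 10       # an I frame is 10x the size of an interframe
BUDGET_BITS = 2 ** 21     # "2 Megabits" per second = per 1-second GOP

# Solve: 59 * x + 10 * x = budget, where x is the interframe size.
interframe_bits = BUDGET_BITS / (INTERFRAMES + I_FRAME_WEIGHT)  # ~30,394
i_frame_bits = I_FRAME_WEIGHT * interframe_bits                 # ~303,935

# Time to push one I frame through the 2 Mbit/s channel: ~145 ms.
i_frame_delay_ms = i_frame_bits / BUDGET_BITS * 1000

# Peak data rate if every frame were I-frame-sized: ~18.2 Mbps.
peak_mbps = i_frame_bits * GOP_FRAMES / 1e6
```

The 145 ms figure is the crux: a single frame of the stream monopolizes the channel for nearly nine 60 fps frame times, and every subsequent frame inherits that delay.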
An interframe video compression system as described above that uses a large percentage of the bandwidth of the transmission channel will be subject to long latencies due to the large size of an I frame relative to the average size of a frame. Or, to put it another way, while prior-art interframe compression algorithms achieve a lower average per-frame data rate than intraframe-only compression algorithms (e.g., 2 Mbps vs. 40 Mbps), they still suffer from a high peak per-frame data rate (e.g., 303,935 × 60 = 18.2 Mbps) because of the large I frames. Bear in mind, though, that the above analysis assumes that the P and B frames are all much smaller than the I frames. While this is generally true, it is not true for frames with high image complexity uncorrelated with the prior frame, high motion, or scene changes. In such situations, the P or B frames can become as large as I frames (and if a P or B frame becomes larger than an I frame, a sophisticated compression algorithm will typically "force" an I frame and replace the P or B frame with an I frame). So, I-frame-sized data-rate peaks can occur at any moment in a digital video stream. Thus, with compressed video, when the average video data rate approaches the data-rate capacity of the transmission channel (as is frequently the case, given the high data-rate demands of video), the high peak data rates from I frames or large P or B frames result in high frame latency.
Of course, the above discussion only characterizes the compression-algorithm latency created by a large B, P, or I frame within a GOP. If B frames are used, the latency will be even higher. The reason is that before a B frame can be displayed, all of the B frames after it and the I frame must be received. Thus, in a Group of Pictures (GOP) sequence such as BBBBBIPPPPPBBBBBIPPPPP, where there are 5 B frames before each I frame, the first B frame can be displayed by the video decompressor only after the subsequent B frames and the I frame are received. So, if video is being streamed at 60 fps (i.e., 16.67 ms/frame), then it will take 16.67 × 6 = 100 ms to receive the five B frames and the I frame before the first B frame can be decompressed, regardless of the channel bandwidth, and this is with only 5 B frames. Compressed video sequences with 30 B frames are quite common. And at low channel bandwidths, such as 2 Mbps, the latency impact caused by the size of the I frame adds substantially to the latency impact due to waiting for the B frames to arrive. Thus, on a 2 Mbps channel, with a large number of B frames it is quite easy to exceed 500 milliseconds of latency using prior-art video compression techniques. If B frames are not used (at the cost of a lower compression ratio for a given quality level), the B-frame latency is not incurred, but the latency caused by the peak frame sizes described above is still incurred.
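The reordering delay just described is a few lines of arithmetic. This sketch is purely illustrative; the function name and the simplification that exactly one referenced frame (the I frame) follows the run of B frames are ours, not the disclosure's.

```python
# Frame-reordering delay from B frames: a B frame cannot be decoded until
# the B frames after it and the I frame they reference have all arrived,
# so the decoder waits (num_b_frames + 1) frame times regardless of
# channel bandwidth.

FPS = 60
FRAME_TIME_MS = 1000 / FPS   # ~16.67 ms per frame at 60 fps

def b_frame_wait_ms(num_b_frames: int) -> float:
    """Time to receive the run of B frames plus the I frame they
    reference, before the first B frame can be decoded."""
    return (num_b_frames + 1) * FRAME_TIME_MS

# 5 B frames -> 6 x 16.67 ms = 100 ms; 30 B frames -> ~517 ms, which is
# why long B-frame runs alone can push latency past 500 ms.
```

Note that this delay is independent of channel bandwidth: it is imposed by frame ordering, so no amount of extra capacity removes it.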
This problem is only exacerbated by the very nature of many video games. Video compression algorithms utilizing the GOP structure described above have been largely optimized for live video or movie material intended for passive viewing. Typically, the camera (whether a real camera, or a virtual camera in the case of a computer-generated animation) and the scene are relatively steady, simply because if the camera or scene moves around too jerkily, the video or movie material is (a) typically unpleasant to watch, and (b) if it is being watched, the viewer is usually not closely following the action when the camera jerks around suddenly (e.g., if the camera is bumped while shooting a child blowing out the candles on a birthday cake and suddenly jerks around away from the cake, the viewers are typically focused on the child and the cake, and disregard the brief interruption when the camera suddenly moves). In the case of a video interview or a video teleconference, the camera may be held in a fixed position and not move at all, resulting in very few data peaks at all. But 3D high-action video games are characterized by constant motion (e.g., consider a 3D racing game, where the entire frame is in rapid motion for the duration of the race, or consider a first-person shooter, where the virtual camera is constantly moving around jerkily). Such video games can produce frame sequences with large and frequent peaks, where the user may need to see clearly what is happening during those sudden motions. As such, compression artifacts are far less tolerable in 3D high-action video games. Thus, the video output of many video games, by their very nature, produces a compressed video stream with very high and frequent peaks.
Given that users of fast-action video games have little tolerance for high latency, and given all of the above causes of latency, to date there have been limitations for server-hosted video games that stream video over the Internet. Further, users of applications that require a high degree of interactivity suffer similar limitations if the applications are hosted on the general Internet and stream video. Such services require a network configuration in which the hosting servers are set up directly in a head end (in the case of cable broadband) or central office (in the case of Digital Subscriber Lines (DSL)), or within a LAN (or on a specially-graded private connection) in a commercial setting, so that the route and distance from the client devices to the servers is controlled to minimize latency, and peaks can be accommodated without incurring latency. LANs (typically rated at 100 Mbps-1 Gbps) and leased lines with adequate bandwidth can typically support peak bandwidth requirements (e.g., an 18 Mbps peak bandwidth is a small fraction of a 100 Mbps LAN's capacity).
Peak bandwidth requirements can also be accommodated by residential broadband infrastructure if special accommodations are made. For example, on a cable TV system, digital video traffic can be given dedicated bandwidth that can handle peaks such as large I frames. And on a DSL system, a higher-speed DSL modem can be provisioned, allowing for high peaks, or a specially-graded connection that can handle higher data rates can be provisioned. But conventional cable modem and DSL infrastructure attached to the general Internet has far less tolerance for the peak bandwidth requirements of compressed video. So, online services that host video games or applications in server centers a long distance from the client devices, and then stream the compressed video output over the Internet through conventional residential broadband connections, suffer from significant latency and peak bandwidth limitations, especially for games and applications that require very low latency (e.g., first-person shooters and other multi-user interactive action games, or applications requiring fast response times).
Drawings
The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings, which, however, should not be taken to limit the disclosed subject matter to the specific embodiments shown, but are for explanation and understanding only.
FIG. 1 illustrates the architecture of a prior art video-game system.
Fig. 2 a-2 b illustrate a high-level system architecture according to one embodiment.
Fig. 3 shows the actual, nominal and required data rates for communication between the client and the server.
FIG. 4a illustrates a hosting service and a client used according to one embodiment.
FIG. 4b illustrates an exemplary delay associated with communication between a client and a hosting service.
Fig. 4c shows a client device according to an embodiment.
Fig. 4d shows a client device according to another embodiment.
Fig. 4e illustrates an example block diagram of the client device in fig. 4 c.
FIG. 4f illustrates an example block diagram of the client device in FIG. 4 d.
Fig. 5 illustrates one example of video compression that may be used in accordance with one embodiment.
Fig. 6a shows an example of video compression that may be used in another embodiment.
Fig. 6b shows peaks in data rate associated with transmitting a low complexity, low motion video sequence.
Fig. 6c shows peaks in data rates associated with transmitting a high complexity, high motion video sequence.
Fig. 7 a-7 b illustrate an example video compression technique used in one embodiment.
Fig. 8 illustrates an additional example video compression technique used in one embodiment.
Fig. 9a to 9c illustrate a frame rate processing technique used in one embodiment of the present invention.
Fig. 10 a-10 b illustrate one embodiment of efficiently encapsulating image tiles within packets.
Fig. 11a to 11d show embodiments using forward error correction techniques.
FIG. 12 illustrates one embodiment of compression using a multi-core processing unit.
Fig. 13 a-13 b illustrate geolocation and communication between host services according to various embodiments.
FIG. 14 illustrates exemplary delays associated with communication between a client and a hosting service.
FIG. 15 illustrates an example hosting service server center architecture.
FIG. 16 illustrates an example screen shot of one embodiment of a user interface including multiple live video windows.
Fig. 17 shows the user interface of fig. 16 after selection of a particular video window.
Fig. 18 shows the user interface of fig. 17 after zooming in a particular video window to full screen size.
FIG. 19 illustrates example collaborative user video data overlaid on a screen of a multiplayer game.
FIG. 20 illustrates an example user page for a game player on a hosting service.
FIG. 21 illustrates an example 3D interactive ad.
Fig. 22 illustrates an example sequence of steps for generating a photorealistic image with a textured surface from surface capture of a live performance.
FIG. 23 illustrates an example user interface page that allows for selection of linear media content.
FIG. 24 is a graph illustrating the amount of time elapsed before a web page is live versus the connection speed.
Fig. 25a-b illustrate an embodiment of the present invention that employs a feedback channel from a client device to a hosting service.
Fig. 26a-b illustrate an embodiment wherein an image block/frame is encoded based on the last known successfully received image block/frame.
Figs. 27a-b illustrate an embodiment in which the state of a game or application is ported from a first hosting service or server to a second hosting service or server.
FIG. 28 illustrates an embodiment in which the state of a game or application is conveyed by using difference data.
Fig. 29 illustrates an embodiment of the present invention that employs a temporary decoder on the client device.
FIG. 30 illustrates how "I tiles" are interspersed between "R frames" in accordance with an embodiment of the present invention.
Fig. 31a-h illustrate embodiments of the present invention that generate live streams and/or one or more HQ streams.
Detailed Description
In the following description, specific details are set forth (such as device types, system configurations, communication methods, etc.) in order to provide a thorough understanding of the present disclosure. However, it will be understood by those of ordinary skill in the art that these specific details may not be required to practice the described embodiments.
Figs. 2a-2b provide a high-level architecture of two embodiments in which video games and software applications are hosted by a hosting service 210 and accessed by client devices 205 at user premises 211 over the Internet 206 (or other public or private network) under a subscription service (note that "user premises" means wherever the user is located, including outdoors when using a mobile device). The client devices 205 may be general-purpose computers with wired or wireless connections to the Internet, with internal or external display devices 222, such as Microsoft Windows- or Linux-based PCs or Apple, Inc. Macintosh computers, or they may be dedicated client devices, such as a set-top box that outputs video and audio to a monitor or TV set 222 (with a wired or wireless connection to the Internet), or they may be mobile devices, presumably with a wireless connection to the Internet.
Any of these devices may have their own user input devices (e.g., keyboard, buttons, touch screen, track pad, inertial-sensing wand, video capture camera and/or motion-tracking camera, etc.), or they may use external input devices 221 (e.g., keyboard, mouse, game controller, inertial-sensing wand, video capture camera and/or motion-tracking camera, etc.), connected by wires or wirelessly. As described in greater detail below, the hosting service 210 includes servers of various levels of performance, including those with high-powered CPU/GPU processing capabilities. During play of a game or use of an application on the hosting service 210, a home or office client device 205 receives keyboard and/or controller input from the user, and then it transmits the controller input through the Internet 206 to the hosting service 210, which in response executes the game program code and generates successive frames of video output (a sequence of video images) for the game or application software (e.g., if the user presses a button that would direct a character on the screen to move to the right, the game program would then create a sequence of video images showing the character moving to the right). This sequence of video images is then compressed using a low-latency video compressor, and the hosting service 210 then transmits the low-latency video stream through the Internet 206. The home or office client device then decodes the compressed video stream and renders the decompressed video images on a monitor or TV. Consequently, the computing and graphics hardware requirements of the client device 205 are significantly reduced.
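The per-frame dataflow just described (client forwards input, service runs the game and compresses a frame, client decompresses and displays) can be sketched in a highly simplified form. Everything here is illustrative: the function names, the mock "compression" (a plain encode), and the toy game state are our assumptions, not the disclosure's mechanism.

```python
# Highly simplified sketch of the per-frame loop: input flows up to the
# hosting service; a compressed video frame flows back down.

def hosting_service_frame(game_state: dict, controller_input: str) -> bytes:
    """Advance the game state by one frame and return a 'compressed' frame.
    Compression is mocked here; only the dataflow is being illustrated."""
    if controller_input == "right":
        game_state["x"] += 1
    frame = f"character at x={game_state['x']}"
    return frame.encode()            # stand-in for low-latency compression

def client_display(compressed_frame: bytes) -> str:
    """Stand-in for client-side decompression and rendering."""
    return compressed_frame.decode()

state = {"x": 0}
shown = client_display(hosting_service_frame(state, "right"))
```

The point the architecture relies on is visible even in this toy: all game computation and state live server-side, and the client's only jobs are forwarding input and decoding frames.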
The client 205 only needs the processing power to forward the keyboard/controller input to the Internet 206 and to decode and decompress a compressed video stream received from the Internet 206, which virtually any personal computer today is capable of doing in software on its CPU (e.g., a dual-core Intel Corporation CPU running at approximately 2 GHz is capable of decompressing 720p HDTV encoded using compressors such as H.264 and Microsoft Windows Media VC9). And in the case of any client devices, dedicated chips can also perform video decompression for such standards in real time at far lower cost and with far less power consumption than a general-purpose CPU (such as would be required of a modern PC). Notably, to perform the functions of forwarding controller input and decompressing video, home client devices 205 do not require any specialized Graphics Processing Units (GPUs), optical drives, or hard drives (such as the prior-art video game system shown in Fig. 1).
As games and application software become more complex and more photo-realistic, they will require higher-performance CPUs, GPUs, more RAM, and larger and faster disk drives, and the computing power at the hosting service 210 may be continually upgraded, but the end user will not be required to update the home or office client platform 205, since its processing requirements will remain constant for a given display resolution and frame rate with a given video decompression algorithm. Thus, the hardware limitations and compatibility issues seen today do not exist in the systems illustrated in Figs. 2a-2b.
Further, because the game and application software executes only in servers in the hosting service 210, there never is a copy of the game or application software (either in the form of optical media, or as downloaded software) in the user's home or office ("office" as used herein, unless otherwise qualified, shall include any non-residential setting, including, for example, classrooms). This significantly mitigates the likelihood of a game or application software being illegally copied (pirated), as well as mitigating the likelihood of a valuable database usable by a game or application software being pirated. Indeed, if specialized servers are required to play the game or application software that are not practical for home or office use (e.g., requiring very expensive, large, or noisy equipment), then even if a pirated copy of the game or application software were obtained, it would not be operable in the home or office.
In one embodiment, the hosting service 210 provides software development tools to game or application software developers 220 (which refers generally to software development companies, game or movie studios, or game or application software publishers) who design video games, so that they may design games capable of being executed on the hosting service 210. Such tools allow developers to exploit features of the hosting service that would not normally be available in a standalone PC or game console (e.g., fast access to very large databases of complex geometry ("geometry" unless otherwise qualified shall be used herein to refer to polygons, textures, rigging, lighting, behaviors, and other components and parameters that define 3D datasets)).
Different business models are possible under this architecture. Under one model, the hosting service 210 collects a subscription fee from the end user and pays a royalty to the developers 220, as shown in Fig. 2a. In an alternate implementation, shown in Fig. 2b, the developers 220 collect a subscription fee directly from the user and pay the hosting service 210 for hosting the game or application content. These underlying principles are not limited to any particular business model for providing online gaming or application hosting.
Compressed video characteristics
As previously discussed, one of the significant problems with providing a video game service or application software service online is latency. A latency of 70-80 ms (from the point a user actuates an input device to the point where a response is displayed on the display device) is at the upper limit for games and applications requiring a fast response time. However, this is very difficult to achieve in the context of the architecture shown in Figs. 2a and 2b due to a number of practical and physical constraints.
As indicated in Fig. 3, when a user subscribes to an Internet service, the connection is typically rated by a nominal maximum data rate 301 to the user's home or office. Depending on the provider's policies and routing equipment capabilities, that maximum data rate may be more or less strictly enforced, but typically the actual available data rate is lower for one of many different reasons. For example, there may be too much network traffic at the DSL central office or on the local cable modem loop, or there may be noise on the cabling causing dropped packets, or the provider may establish a maximum number of bits per month per user. Currently, the maximum downstream data rate for cable and DSL services typically ranges from several hundred kilobits per second (Kbps) to 30 Mbps. Cellular services are typically limited to hundreds of Kbps of downstream data. However, the speed of broadband services and the number of users who subscribe to broadband services will increase dramatically over time. Currently, some analysts estimate that 33% of U.S. broadband subscribers have a downstream data rate of 2 Mbps or more. For example, some analysts predict that by 2010, over 85% of U.S. broadband subscribers will have a data rate of 2 Mbps or more.
As indicated in Fig. 3, the actual available maximum data rate 302 may fluctuate over time. Thus, in the context of low-latency online gaming or application software, it is sometimes difficult to predict the actual available data rate for a particular video stream. If the data rate 303 required to sustain a given level of quality at a given number of frames per second (fps) at a given resolution (e.g., 640 × 480 at 60 fps) for a given amount of scene complexity and motion rises above the actual available maximum data rate 302 (as indicated by the peaks in Fig. 3), several problems can occur. For example, some Internet services will simply drop packets, resulting in lost data and distorted/lost images on the user's video screen. Other services will temporarily buffer (i.e., queue up) the additional packets and provide the packets to the client at the available data rate, resulting in an increase in latency, an unacceptable result for many video games and applications. Finally, some Internet service providers will see the increase in data rate as a malicious attack, such as a denial-of-service attack (a well-known technique used by hackers to disable network connections), and will cut off the user's Internet connection for a specified period of time. Thus, the embodiments described herein seek to ensure that the required data rate for a video game does not exceed the maximum available data rate.
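One way to honor the constraint stated above (required data rate must not exceed the available maximum) is to derive the encoder's target rate from recent throughput measurements with a safety margin. This is a minimal sketch under our own assumptions, not the disclosure's mechanism; the function name, the use of the worst recent measurement, and the 0.85 headroom factor are all illustrative.

```python
# Minimal sketch: pick an encoder target rate that stays below the
# measured available channel rate, with headroom, so that required-rate
# peaks do not exceed the available maximum (data rate 302 in Fig. 3).

def choose_target_rate_mbps(recent_throughput_mbps: list[float],
                            headroom: float = 0.85) -> float:
    """Derive an encoder target rate from recent throughput measurements,
    using the worst recent measurement reduced by a safety headroom."""
    return min(recent_throughput_mbps) * headroom

# If recent measurements show the link dipping to 3.5 Mbps, the encoder
# should target well under that, rather than the nominal "6 Mbps" rating.
target = choose_target_rate_mbps([5.1, 4.8, 3.5, 4.9])
```

Using the worst recent measurement (rather than the mean) reflects the point made in the surrounding text: a single excursion above the available rate causes drops, queuing latency, or even disconnection, so the cost of over-estimating capacity is asymmetric.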
Hosting service architecture
Fig. 4a illustrates an architecture of the hosting service 210 according to one embodiment. The hosting service 210 can either be located in a single server center, or can be distributed across a plurality of server centers (to provide lower-latency connections to users who have lower-latency paths to certain server centers than to others, to provide load balancing among users, and to provide redundancy in the case of a failure of one or more server centers). The hosting service 210 may eventually include thousands or even millions of servers 402, serving a very large user base. A hosting service control system 401 provides overall control of the hosting service 210, and directs routers, servers, video compression systems, billing and accounting systems, etc. In one embodiment, the hosting service control system 401 is implemented on a distributed Linux-based processing system tied to RAID arrays used to store the databases for user information, server information, and system statistics. In the foregoing descriptions, the various actions implemented by the hosting service 210 are initiated and controlled by the hosting service control system 401, unless attributed to other specific systems.
The hosting service 210 includes a number of servers 402, such as those currently available from Intel, IBM, Hewlett Packard, and others. Alternatively, the servers 402 can be assembled in a custom component configuration, or can eventually be integrated so that an entire server is implemented as a single chip. Although this figure shows a small number of servers 402 for the sake of illustration, in an actual deployment there may be as few as one server 402, or as many as millions of servers 402, or more. The servers 402 may all be configured in the same way (as an example of some of the configuration parameters: with the same CPU type and performance; with or without a GPU, and if with a GPU, with the same GPU type and performance; with the same number of CPUs and GPUs; with the same amount and type/speed of RAM; and with the same RAM configuration), or various subsets of the servers 402 may have the same configuration (e.g., 25% of the servers configured one way, 50% a different way, and 25% yet another way), or every server 402 may be different.
In one embodiment, the servers 402 are diskless, i.e., rather than having their own local mass storage (be it optical or magnetic storage, or semiconductor-based storage such as Flash memory, or other mass storage means serving a similar function), each server accesses shared mass storage through a fast backplane or network connection. In one embodiment, this fast connection is a Storage Area Network (SAN) 403 connected to a series of Redundant Arrays of Independent Disks (RAID) 405, with connections between devices implemented using Gigabit Ethernet. As is known to those of skill in the art, a SAN 403 may be used to combine many RAID arrays 405 together, resulting in extremely high bandwidth, approaching or potentially exceeding the bandwidth available from the RAM used in current game consoles and PCs. And while RAID arrays based on rotating media, such as magnetic media, often have significant seek-time access latency, RAID arrays based on semiconductor storage can be implemented with much lower access latency. In another configuration, some or all of the servers 402 provide some or all of their own mass storage locally. For example, a server 402 may store frequently-accessed information, such as its operating system and a copy of a video game or application, on low-latency local Flash-based storage, but it may utilize the SAN to access rotating-media-based RAID arrays 405 with higher seek latency for less-frequent accesses of large databases of geometry or game-state information.
Additionally, in one embodiment, the hosting service 210 uses low-latency video compression logic 404, described in detail below. The video compression logic 404 may be implemented in software, hardware, or any combination thereof (specific embodiments of which are described below). The video compression logic 404 includes logic for compressing audio as well as visual material.
In operation, while playing a video game or using an application at user premises 211 through a keyboard, mouse, game controller, or other input device 421, control signal logic 413 on the client 415 transmits control signals 406a-b (typically in the form of UDP packets) to the hosting service 210, representing the button presses (and other types of user inputs) actuated by the user. The control signals from a given user are routed to the appropriate server (or servers, if multiple servers are responsive to the user's input device) 402. As illustrated in Fig. 4a, control signals 406a may be routed to the servers 402 through the SAN. Alternatively or in addition, control signals 406b may be routed directly to the servers 402 through the hosting service network (e.g., an Ethernet-based local area network). Regardless of how they are transmitted, the server or servers execute the game or application software in response to the control signals 406a-b. Although not illustrated in Fig. 4a, various networking components such as a firewall and/or gateway may process incoming and outgoing traffic at the edge of the hosting service 210 (e.g., between the hosting service 210 and the Internet 410) and/or at the edge of the user premises 211 (between the Internet 410 and the home or office client 415). The graphics and audio output of the executed game or application software, i.e., new sequences of video images, are provided to the low-latency video compression logic 404, which compresses the sequences of video images according to low-latency video compression techniques, such as those described herein, and transmits a compressed video stream, typically with compressed or uncompressed audio, back to the client 415 through the Internet 410 (or, as described below, through an optimized high-speed network service that bypasses the general Internet).
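Control signals of the kind described above are small, so sending one per input sample as a UDP datagram costs almost no bandwidth. The following sketch is illustrative only: the packet layout, field names, and sizes are our assumptions, not the format used by control signal logic 413.

```python
# Illustrative sketch of a client packing one controller-input sample into
# a small UDP datagram (cf. control signals 406a-b above). The layout
# (sequence number, button bitmask, two analog-stick axes) is assumed.

import socket
import struct

def encode_control_packet(sequence: int, button_mask: int,
                          stick_x: int, stick_y: int) -> bytes:
    """Pack one input sample: 32-bit sequence number, 16-bit button
    bitmask, two signed 16-bit analog-stick axes (big-endian)."""
    return struct.pack(">IHhh", sequence, button_mask, stick_x, stick_y)

def send_control(sock: socket.socket, host: str, port: int,
                 packet: bytes) -> None:
    """Fire-and-forget over UDP; a sequence number lets the server detect
    loss or reordering without the latency cost of retransmission."""
    sock.sendto(packet, (host, port))

# A packed sample is only 10 bytes, so input can be sent every frame
# (60 times per second) for well under 1 KB/s of upstream traffic.
pkt = encode_control_packet(sequence=1, button_mask=0b0001,
                            stick_x=-1200, stick_y=450)
```

UDP is the natural fit for this path: a stale input sample is worthless, so the retransmission and in-order delivery guarantees of TCP would only add latency.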
Then, low-latency video decompression logic 412 in the client 415 decompresses the video and audio streams and renders the decompressed video stream, and typically plays the decompressed audio stream, on display device 422. Alternatively, the audio can be played on speakers separate from the display device 422 or not played at all. Note that, although input device 421 and display device 422 are shown in figs. 2a and 2b as freestanding devices, they may be integrated within a client device, such as a portable computer or a mobile device.
The home or office client 415 (previously described in figs. 2a and 2b as the home or office client 205) may be a very inexpensive and low-capability device with very limited computing or graphics performance and possibly very limited or no local mass storage. In contrast, each server 402 coupled to SAN 403 and multiple RAIDs 405 may be an exceptionally high-performance computing system, and indeed, if multiple servers are used cooperatively in a parallel-processing configuration, there are few limits to the amount of computing and graphics processing power that can be brought to bear. And, because of the low-latency video compression 404 and low-latency video decompression 412, the computing power of the server 402 is, perceptually, being provided to the user. When the user presses a button on the input device 421, the image on the display 422 is updated in response to the button press without a perceptually meaningful delay, as if the game or application software were executing locally. Thus, even with a home or office client 415 that is a very low-performance computer, or merely an inexpensive chip implementing the low-latency video decompression and control signal logic 413, the user is effectively provided with arbitrary computing power from a remote location that appears to be available locally. This gives the user the ability to play the most advanced, processor-intensive (typically new) video games and the highest-performance applications.
Fig. 4c shows a very basic and inexpensive home or office client device 465. This device is one embodiment of the home or office client 415 of figs. 4a and 4b. It is approximately 2 inches long. It has an ethernet jack 462 that interfaces with an ethernet cable with Power over Ethernet (PoE), from which it derives its power and its connectivity to the internet. It is able to run NAT (Network Address Translation) within a network supporting NAT. In an office environment, many new ethernet switches have PoE and bring PoE directly to the ethernet jacks in the office. In such a situation, all that is required is an ethernet cable from the wall jack to the client 465. If the available ethernet connection does not carry power (e.g., in a home with a DSL or cable modem, but no PoE), there are inexpensive wall "bricks" (i.e., power supplies) available that accept an unpowered ethernet cable and output ethernet with PoE.
The client 465 contains control signal logic 413 (of fig. 4a) coupled to a bluetooth wireless interface, which interfaces with bluetooth input devices 479, such as a keyboard, mouse, game controller and/or microphone and/or headset. Also, one embodiment of the client 465 is able to output video at 120fps when coupled with a display device 468 that is able to support 120fps video and to signal (typically via infrared) a pair of shutter glasses 466 to alternately shutter one eye, then the other, with each successive frame. The effect perceived by the user is that of a stereoscopic 3D image that "pops out" of the display screen. One such display device 468 that supports this operation is the Samsung HL-T5076S. Because the video stream for each eye is separate, in one embodiment two independent video streams are compressed by the hosting service 210, the frames are interleaved in time, and the frames are decompressed as two independent decompression processes within the client 465.
The client 465 also contains low-latency video decompression logic 412, which decompresses the incoming video and audio and outputs them via an HDMI (High-Definition Multimedia Interface) connector 463, which plugs into an SDTV (Standard Definition Television) or HDTV (High Definition Television) 468, providing the TV with video and audio, or into a monitor 468 that supports HDMI. If the user's monitor 468 does not support HDMI, then an HDMI-to-DVI (Digital Visual Interface) adapter may be used, but the audio will be lost. Under the HDMI standard, the display capabilities 464 (e.g., supported resolutions, frame rates) are communicated from the display device 468, and this information is then passed back through the internet connection 462 to the hosting service 210, so the hosting service 210 can stream compressed video in a format suitable for the display device.
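The capability negotiation just described, in which display modes advertised over HDMI are relayed back to the hosting service so it can pick a suitable stream format, can be sketched as follows. This is a hypothetical illustration; the function name and the mode tuples are invented for the example and are not part of the patent.

```python
# Hypothetical sketch of the display-capability negotiation: the client
# relays the display's advertised modes (width, height, fps) to the hosting
# service, which selects the best stream format the display supports.

def choose_format(display_modes, preferred):
    """Pick the first preferred (width, height, fps) the display supports."""
    for fmt in preferred:
        if fmt in display_modes:
            return fmt
    return display_modes[0]  # hypothetical fallback: first advertised mode

# Example: the display supports 720p60 and 1080p60; the service would
# prefer 1080p120 but settles for 1080p60.
fmt = choose_format([(1280, 720, 60), (1920, 1080, 60)],
                    [(1920, 1080, 120), (1920, 1080, 60)])
```

A real implementation would read these modes from the display's EDID data over HDMI; the list form above is only for illustration.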
Fig. 4d shows a home or office client device 475 that is the same as the home or office client device 465 shown in fig. 4c, except that it has more external interfaces. Also, the client 475 can accept either PoE for power, or it can take in an external power supply adapter (not shown) that plugs into a wall. Using the client 475's USB input, video camera 477 provides compressed video to the client 475, which uploads the compressed video to the hosting service 210 for use as described below. A low-latency compressor utilizing the compression techniques described below is built into the camera 477.
In addition to having an ethernet connector for its internet connection, client 475 also has an 802.11g wireless interface to the internet. Both interfaces are capable of using NAT within a network that supports NAT.
Also, in addition to having an HDMI connector for outputting video and audio, the client 475 also has a dual-link DVI-I connector that includes an analog output (and has a standard adapter cable that will provide a VGA output). It also has analog outputs for composite video and S-video.
For audio, client 475 has left/right analog stereo RCA jacks, and for digital audio output it has TOSLINK (fiber optic) outputs.
In addition to the bluetooth wireless interface to the input device 479, it also has a USB jack for interfacing to the input device.
FIG. 4e shows one embodiment of the internal architecture of the client 465. All or some of the devices shown in the figure can be implemented in a Field Programmable Gate Array (FPGA), in a custom ASIC, or in several discrete devices, either custom designed or off-the-shelf.
Ethernet with PoE 497 attaches to ethernet interface 481. Power 499 is derived from ethernet with PoE 497 and connected to the rest of the devices in client 465. Bus 480 is a common bus used for communication between devices.
Control CPU 483 (almost any small CPU will suffice, such as a MIPS R4000-series CPU at 100MHz with embedded RAM), executing a small client control application from flash memory 476, implements the protocol stack for the network (i.e., the ethernet interface), communicates with the hosting service 210, and configures all of the devices in the client 465. It also handles the interface with the input devices 469 and sends packets back to the hosting service 210, with user controller data protected by forward error correction, if necessary. Also, control CPU 483 monitors the packet traffic (e.g., whether packets are lost or delayed, and the timestamps of their arrival). This information is sent back to the hosting service 210 so that it can constantly monitor the network connection and adjust what it sends accordingly. Flash memory 476 is initially loaded at the time of manufacture with the control program for control CPU 483 and with a serial number that is unique to the particular client 465 unit. This serial number allows the hosting service 210 to uniquely identify the client 465 unit.
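As a rough illustration of the packet monitoring the control CPU performs, the sketch below tracks sequence numbers and arrival timestamps, and summarizes losses and jitter for the feedback report sent back to the hosting service. The function name and the report fields are hypothetical; the patent does not specify a report format.

```python
# Hypothetical sketch of client-side packet monitoring: given the sequence
# numbers and arrival times of received packets, compute which packets were
# lost, the loss rate, and a simple inter-arrival jitter figure.

def summarize_traffic(received):
    """received: list of (sequence_number, arrival_time_ms) tuples."""
    seqs = [s for s, _ in received]
    expected = set(range(min(seqs), max(seqs) + 1))
    lost = sorted(expected - set(seqs))
    # Jitter: largest deviation of an inter-arrival gap from the mean gap.
    times = [t for _, t in received]
    gaps = [b - a for a, b in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    jitter = max(abs(g - mean_gap) for g in gaps)
    return {"lost": lost,
            "loss_rate": len(lost) / len(expected),
            "jitter_ms": jitter}

# Example: packets 10, 11, 13, 14 arrive; packet 12 was lost, and the gap
# around the loss shows up as jitter.
report = summarize_traffic([(10, 0.0), (11, 17.0), (13, 50.0), (14, 67.0)])
```

A report like this, sent periodically, would give the hosting service the loss and delay information it needs to adapt what it transmits.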
The bluetooth interface 484 wirelessly communicates via its antenna (internal to the client 465) to the input device 469.
Video decompressor 486 is a low-latency video decompressor configured to implement the video decompression described herein. A large number of video decompression devices exist, either off-the-shelf, or as Intellectual Property (IP) in the form of a design that can be integrated into an FPGA or a custom ASIC. One company that offers IP for an H.264 decoder is Ocean Logic of Manly, NSW, Australia. The advantage of using IP is that the compression techniques used herein do not have to conform to compression standards. Some standard decompressors are flexible enough to be configured to accommodate the compression techniques herein, but some may not be. With IP, however, there is complete flexibility in redesigning the decompressor as needed.
The output of the video decompressor is coupled to a video output subsystem 487, which couples the video to the video output of the HDMI interface 490.
The audio decompression subsystem 488 is implemented either using an available standard audio decompressor, or it can be implemented as IP, or the audio decompression can be implemented within the control CPU 483, which could, for example, implement the Vorbis audio decompressor (available at Vorbis.com).
The device implementing audio decompression is coupled to an audio output subsystem 489, the audio output subsystem 489 coupling audio to the audio output of the HDMI interface 490.
FIG. 4f shows one embodiment of the internal architecture of client 475. As can be seen, this architecture is the same as that of the client 465, except for the additional interface and optional external DC power from a power adapter plugged into the wall (and if so used, the optional external DC power replaces the power from the ethernet PoE 497). Common functionality with the client 465 will not be repeated below, but additional functionality will be described below.
CPU 483 communicates with and configures additional devices.
WiFi subsystem 482 provides wireless internet access, as an alternative to ethernet 497, via its antenna. WiFi subsystems are available from a wide range of manufacturers, including Atheros Communications of Santa Clara, CA.
The USB subsystem 485 provides an alternative to bluetooth communication for the wired USB input device 479. USB subsystems are fairly standard and readily available for FPGAs and ASICs, and are often built into off-the-shelf devices that perform other functions such as video decompression.
Video output subsystem 487 produces a wider range of video outputs than within the client 465. In addition to providing the HDMI 490 video output, it provides DVI-I 491, S-video 492, and composite video 493. Also, when the DVI-I 491 interface is used for digital video, display capabilities 464 are passed back from the display device to the control CPU 483, so that it can notify the hosting service 210 of the capabilities of the display device 478. All of the interfaces provided by the video output subsystem 487 are quite standard interfaces and are readily available in many forms.
The audio output subsystem 489 outputs audio digitally via digital interface 494 (S/PDIF and/or Toslink) and audio in analog form via stereo analog interface 495.
Round trip delay analysis
Of course, for the benefits of the previous paragraphs to be realized, the round-trip latency between a user's action using the input device 421 and seeing the consequence of that action on the display device 422 should be no more than 70-80 milliseconds. This latency must take into account all of the factors in the path from the input device 421 in the user premises 211 to the hosting service 210 and back again to the user premises 211 to the display device 422. Fig. 4b illustrates the various components and networks through which the signals must travel, and above these components and networks is a timeline that lists exemplary latencies that can be expected in a practical implementation. Note that fig. 4b is simplified so that only the critical path routing is shown. Other routing of data used for other features of the system is described below. Double-headed arrows (e.g., arrow 453) indicate round-trip latency, single-headed arrows (e.g., arrow 457) indicate one-way latency, and "~" denotes an approximate measure. It should be pointed out that there will be real-world situations where the latencies listed cannot be achieved, but in a large number of cases in the United States, using DSL and cable modem connections to the user premises 211, these latencies can be achieved under the circumstances described in the next paragraph. Also, note that, while cellular wireless connectivity to the internet will certainly work in the system shown, most current US cellular data systems (such as EVDO) incur very high latency and will not be able to achieve the latencies shown in fig. 4b. However, these underlying principles may be implemented on future cellular technologies that may be capable of implementing this level of latency. Further, there are game and application scenarios (e.g., games that do not require fast user reaction time, such as chess) where the latency incurred through current US cellular data systems, although apparent to the user, is acceptable.
Starting with the input device 421 at the user premises 211, once the user actuates the input device 421, a user control signal is sent to the client 415 (which may be a standalone device, such as a set-top box, or it may be software or hardware running in another device, such as a PC or a mobile device), and it is packetized (in UDP format, in one embodiment) and given a destination address for the packet to reach the hosting service 210. The packet will also contain information to indicate which user the control signals are coming from. The control signal packet(s) are then forwarded through firewall/router/NAT (Network Address Translation) device 443 to WAN interface 442. WAN interface 442 is the interface device provided to the user premises 211 by the user's ISP (Internet Service Provider). The WAN interface 442 may be a cable or DSL modem, a WiMax transceiver, a fiber transceiver, a cellular data interface, an Internet Protocol-over-powerline interface, or any other of many interfaces to the internet. Further, the firewall/router/NAT device 443 (and, potentially, the WAN interface 442) may be integrated into the client 415. One such example would be a mobile phone that includes software to implement the functionality of the home or office client 415, as well as the means to route and connect to the internet wirelessly via some standard (e.g., 802.11g).
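A minimal sketch of how such a control-signal packet might be laid out is shown below, assuming a made-up wire format of a user identifier, a sequence number, and a button code. The patent specifies only that the packets are UDP and identify the originating user; everything else here is illustrative.

```python
# Illustrative (hypothetical) wire format for a control-signal packet as
# described for control signals 406a-b: user id, sequence number, button
# code, packed big-endian into a small UDP payload.
import struct

FMT = "!IHB"  # user_id (u32), sequence (u16), button_code (u8)

def pack_control(user_id, seq, button):
    return struct.pack(FMT, user_id, seq, button)

def unpack_control(payload):
    return struct.unpack(FMT, payload)

pkt = pack_control(42, 7, 3)
# A real client would then send this over an unreliable datagram socket:
#   socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(pkt, addr)
```

The small fixed-size payload keeps per-press transmission time negligible, consistent with the sub-millisecond one-way latency 451 cited below.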
WAN interface 442 then routes the control signals to what is referred to herein as the "point of presence" 441 for the user's Internet Service Provider (ISP), which is the facility that provides the interface between the WAN transport connected to the user premises 211 and the general internet or private networks. The point of presence's characteristics will vary depending upon the nature of the internet service provided. For DSL, it typically will be a telephone company central office where a DSLAM is located. For cable modems, it typically will be a cable multi-system operator (MSO) head end. For cellular systems, it typically will be a control room associated with a cellular tower. But whatever the nature of the point of presence, it will then route the control signal packet(s) to the general internet 410. The control signal packet(s) will then be routed to the WAN interface 444 to the hosting service 210, through what will most likely be a fiber transceiver interface. The WAN interface 444 will then route the control signal packets to routing logic 409 (which may be implemented in many different ways, including ethernet switches and routing servers), which evaluates the user's address and routes the control signal(s) to the correct server 402 for the given user.
The server 402 then takes the control signals as input to the game or application software running on the server 402 and uses them to process the next frame of the game or application. Once the next frame is generated, the video and audio are output from the server 402 to the video compressor 404. The video and audio may be output from the server 402 to the compressor 404 through various means. To start with, the compressor 404 may be built into the server 402, so the compression may be implemented locally within the server 402. Or, the video and/or audio may be output in packetized form through a network connection (such as an ethernet connection) to a network, which is either a private network between the server 402 and the video compressor 404, or a shared network, such as SAN 403. Or, the video may be output through a video output connector from the server 402 (such as a DVI or VGA connector) and then captured by the video compressor 404. Also, the audio may be output from the server 402 as digital audio (e.g., via a TOSLINK or S/PDIF connector) or as analog audio, which is digitized and encoded by audio compression logic within the video compressor 404.
Once the video compressor 404 has captured a video frame from the server 402, along with the audio generated during that frame time, the video compressor compresses the video and audio using the techniques described below. Once the video and audio are compressed, they are packetized with an address in order to send them back to the user's client 415, and they are routed to the WAN interface 444, which then routes the video and audio packets through the general internet 410, which then routes the video and audio packets to the user's ISP point of presence 441, which routes the video and audio packets to the WAN interface 442 at the user's premises, which routes the video and audio packets to the firewall/router/NAT device 443, which then routes the video and audio packets to the client 415.
Client 415 decompresses the video and audio and then displays the video on display device 422 (or the client's built-in display device) and sends the audio to display device 422 or to a separate amplifier/speaker or to an amplifier/speaker created into the client.
For the user to have the perception that the entire process just described is perceptually without lag, the round-trip delay needs to be less than 70 or 80 milliseconds. Some of the latencies in the described round-trip path are under the control of the hosting service 210 and/or the user, while others are not. Nonetheless, based on analysis and testing of a large number of real-world scenarios, the following are approximate measurements.
The one-way transmission time 451 to send the control signals is typically less than 1 millisecond, and the round-trip routing 452 through the user premises is typically accomplished in about 1 millisecond using readily available consumer-grade firewall/router/NAT switches over ethernet. User ISPs vary widely in their round-trip delays 453, but with DSL and cable modem providers, it is typically seen to be between 10 and 25 milliseconds. The round-trip latency on the general internet 410 can vary a great deal depending on how traffic is routed and whether there are any failures on the route (these issues are discussed below), but typically the general internet provides fairly optimal routes, and the latency is largely determined by the speed of light through optical fiber, given the distance to the destination. As discussed further below, 1000 miles has been established as a rough farthest distance that the hosting service 210 is expected to be placed away from the user premises 211. At 1000 miles (2000 miles round trip), the practical transit time for a signal through the internet is approximately 22 milliseconds. The WAN interface 444 to the hosting service 210 is typically a commercial-grade fiber high-speed interface with negligible latency. Thus, the general internet latency 454 is typically between 1 and 10 milliseconds. The one-way routing 455 latency through the hosting service 210 can be achieved in less than 1 millisecond. The server 402 will typically compute a new frame for a game or an application in less than one frame time (which at 60fps is 16.7 milliseconds), so 16 milliseconds is a reasonable maximum one-way latency 456 to use. In an optimized hardware implementation of the video compression and audio compression algorithms described herein, the compression 457 can be completed in 1 millisecond.
In a less optimized version, compression may take as much as 6 milliseconds (of course, an even less optimized version could take longer, but such an implementation would affect the overall latency of the round trip and would require other latencies to be shorter (e.g., the allowable distance through the general internet could be reduced) to maintain the 70-80 millisecond latency target). The round-trip latencies of the internet 454, user ISP 453, and user premises routing 452 have already been considered, so what remains is the video decompression 458 latency, which varies depending on whether the video decompression 458 is implemented in dedicated hardware or in software on the client device 415 (such as a PC or mobile device), and depends on the size of the display and the performance of the decompressing CPU. Typically, decompression 458 takes between 1 and 8 milliseconds.
Thus, the worst-case round-trip latency that a user of the system shown in fig. 4a can expect to experience can be determined by adding up all of the worst-case latencies seen in practice: 1+1+25+22+1+16+6+8 = 80 milliseconds. And, indeed, in practice (with caveats discussed below), this is roughly the round-trip latency seen using prototype versions of the system shown in fig. 4a (using off-the-shelf Windows PCs as client devices and home DSL and cable modem connections in the United States). Of course, scenarios better than worst-case can result in much shorter latencies, but they cannot be relied upon in developing a widely used commercial service.
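The worst-case budget just summed can be written out explicitly. The segment names below mirror the reference numerals of fig. 4b; the figures are taken directly from the text.

```python
# Worst-case one-way/round-trip latency contributions, in milliseconds,
# as listed in the text for the path of fig. 4b.
worst_case_ms = {
    "control_signal_send_451": 1,
    "user_premises_routing_452": 1,
    "user_isp_453": 25,
    "general_internet_454": 22,
    "hosting_service_routing_455": 1,
    "frame_computation_456": 16,
    "video_compression_457": 6,   # the less optimized compressor
    "video_decompression_458": 8,
}
total = sum(worst_case_ms.values())  # 80 ms, at the edge of the 70-80 ms target
```

Note that the 25 ms ISP and 22 ms internet figures dominate the budget, which is why the text later bounds the hosting-service-to-user distance at roughly 1000 miles.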
To achieve the latencies listed in fig. 4b via the general internet, it is desirable that the video compressor 404 and the video decompressor 412 in the client 415 (from fig. 4a) generate a packet stream with very particular characteristics, such that the packet sequence generated through the entire path from the hosting service 210 to the display device 422 is not subject to delays or excessive packet loss and, in particular, consistently falls within the constraints of the bandwidth available to the user through the user's internet connection (via WAN interface 442 and firewall/router/NAT 443). In addition, the video compressor must create a packet stream that is sufficiently robust that it can tolerate the unavoidable packet loss and packet reordering that occur in normal internet and network transmissions.
Low latency video compression
To accomplish the goals listed above, one embodiment employs a new approach to video compression that reduces the latency and the peak bandwidth requirements for transmitting video. Prior to describing this embodiment, an analysis of current video compression techniques will be provided with respect to fig. 5 and figs. 6a-6b. Of course, these prior techniques may be used in accordance with the underlying principles if the user is provided with sufficient bandwidth to handle the data rates they require. Note that audio compression is not addressed herein, other than to state that audio compression is implemented simultaneously and in synchrony with the video compression. Prior-art audio compression techniques exist that satisfy the requirements of this system.
Fig. 5 illustrates one particular prior-art technique for compressing video in which each individual video frame 501-503 is compressed by compression logic 520 using a particular compression algorithm to produce a series of compressed frames 511-513. One embodiment of this technique is "motion JPEG", in which each frame is compressed according to a Joint Photographic Experts Group (JPEG) compression algorithm, based upon the Discrete Cosine Transform (DCT). Various different types of compression algorithms may be employed, however, while still complying with these underlying principles (e.g., wavelet-based compression algorithms such as JPEG-2000).
One problem with this type of compression is that it reduces the data rate of each frame, but it does not exploit similarities between successive frames to reduce the data rate of the overall video stream. For example, as illustrated in fig. 5, assuming a frame of 640×480×24 bits/pixel = 640×480×24/8/1024 = 900 Kilobytes/frame (KB/frame), motion JPEG may only be able to compress the stream by a factor of 10 for a given quality image, resulting in a 90 KB/frame data stream. At 60 frames/second, this would require a channel bandwidth of 90KB × 8 bits × 60 frames/second = 42.2Mbps, which would be far too high a bandwidth for almost all home internet connections in the United States today, and too high a bandwidth for many office internet connections. Indeed, given that it would require a constant data stream at such a high bandwidth, and it would serve only one user, even in an office LAN environment it would consume a large percentage of a 100Mbps ethernet LAN's bandwidth and would be a heavy burden on the ethernet switches supporting the LAN. Thus, compression of motion video is inefficient when compared to other compression techniques, such as those described below. Furthermore, single-frame compression algorithms using lossy compression algorithms, like JPEG and JPEG-2000, produce compression artifacts that may not be noticeable in still images (e.g., an artifact within dense foliage in a scene may not appear as an artifact, because the eye does not know exactly how the dense foliage should appear). But once the scene is in motion, an artifact may stand out, because the eye detects artifacts that change from frame to frame, despite the fact that the artifacts are in an area of the scene where they might not be noticeable in a still image. This results in the perception of "background noise" in the sequence of frames, similar in appearance to the "snow" noise visible during marginal analog TV reception.
Of course, this type of compression may still be used in certain embodiments described herein, but generally speaking, to avoid background noise in the scene, a high data rate (i.e., a low compression ratio) is required for a given perceptual quality.
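The motion-JPEG bandwidth arithmetic from the preceding discussion can be checked step by step:

```python
# The motion-JPEG figures quoted in the text: an uncompressed 640x480 frame
# at 24 bits/pixel, compressed 10:1, streamed at 60 frames/second.
width, height, bpp = 640, 480, 24
uncompressed_kb = width * height * bpp / 8 / 1024   # 900 KB/frame
compressed_kb = uncompressed_kb / 10                # 90 KB/frame at 10:1
bandwidth_mbps = compressed_kb * 8 * 60 / 1024      # ~42.2 Mbps channel needed
```

The ~42 Mbps result is what makes per-frame (intra-only) compression impractical over consumer connections, motivating the inter-frame techniques discussed next.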
Other types of compression, such as H.264, or Windows Media VC9, MPEG2 and MPEG4, are all more efficient in compressing a video stream because they exploit the similarities between successive frames. These techniques all rely upon the same general techniques to compress video. Thus, although the H.264 standard will be described, the same general principles apply to various other compression algorithms. A large number of H.264 compressors and decompressors are available, including the x264 open source software library for compressing H.264 and the FFmpeg open source software library for decompressing H.264.
FIGS. 6a and 6b illustrate an exemplary prior-art compression technique in which a series of uncompressed video frames 501-503, 559-561 is compressed by compression logic 620 into a series of "I frames" 611, 671; "P frames" 612-613; and "B frames" 670. The vertical axis in fig. 6a generally signifies the resulting size of each of the encoded frames (although the frames are not drawn to scale). As described above, video coding using I frames, B frames and P frames is well understood by those of skill in the art. Briefly, I frame 611 is a DCT-based compression of the complete uncompressed frame 501 (similar to a compressed JPEG image, as described above). P frames 612-613 generally are significantly smaller in size than I frame 611, because they take advantage of the data in the previous I frame or P frame; that is, they contain data indicating the changes relative to the previous I frame or P frame. B frame 670 is similar to a P frame, except that a B frame uses the following reference frame as well as, potentially, the preceding reference frame.
For the following discussion, it will be assumed that the desired frame rate is 60 frames/second, that each I frame is approximately 160Kb, that the average P frame and B frame is 16Kb, and that a new I frame is generated every second. With this set of parameters, the average data rate would be: 160Kb + 16Kb × 59 = 1.1Mbps. This data rate falls well within the maximum data rate of many current broadband internet connections to homes and offices. This technique also tends to avoid the background noise problem of intra-frame-only encoding, because the P and B frames track the differences between frames, so compression artifacts tend not to appear and disappear from frame to frame, thereby reducing the background noise problem described above.
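The average data rate calculation above, spelled out:

```python
# Average data rate for a 60 fps stream with one 160Kb I frame per second
# and 59 P/B frames averaging 16Kb each, per the text's assumptions.
i_frame_kb = 160
p_or_b_frame_kb = 16
kbits_per_second = i_frame_kb + p_or_b_frame_kb * 59  # 1104 Kbps
mbps = kbits_per_second / 1000                        # ~1.1 Mbps average
```

The average is modest, but as the next paragraph shows, the I frame alone is 160Kb, roughly ten times the average frame, which is the root of the latency problem that follows.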
One problem with the above types of compression is that, although the average data rate is relatively low (e.g., 1.1Mbps), a single I frame may take several frame times to transmit. For example, using prior art, a 2.2Mbps network connection (e.g., DSL or cable modem with a 2.2Mbps peak from the maximum available data rate 302 of fig. 3a) would typically be adequate to stream video at 1.1Mbps, with one 160Kb I frame every 60 frames. This would be accomplished by having the decompressor queue up 1 second of video before decompressing the video. In 1 second, 1.1Mb of data would be transmitted, which would be easily accommodated by the 2.2Mbps maximum available data rate, even assuming the available data rate might periodically dip by as much as 50%. Unfortunately, this prior-art approach would result in a 1-second latency for the video, because of the 1-second video buffering at the receiver. Such a delay is adequate for many prior-art applications (e.g., the playback of linear video), but it is far too long a latency for fast-action video games, which cannot tolerate latencies of more than 70-80 milliseconds.
If an attempt were made to eliminate the 1-second video buffer, it still would not result in an adequate latency reduction for fast-action video games. For example, as previously described, the use of B frames would require the reception of all of the B frames preceding an I frame as well as the I frame itself. If it is assumed that the 59 non-I frames are roughly dispersed between P and B frames, then there would be at least 29 B frames and an I frame received before any B frame could be displayed. Thus, regardless of the available bandwidth of the channel, a delay of 29+1 = 30 frames, each of 1/60-second duration, or 500 milliseconds of latency, is required. Clearly, that is far too long.
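The B-frame latency floor described above can be computed directly:

```python
# With 29 B frames plus the I frame buffered before display, 30 frame times
# at 60 fps must elapse regardless of channel bandwidth.
frames_buffered = 29 + 1
frame_time_ms = 1000 / 60          # ~16.7 ms per frame at 60 fps
latency_ms = frames_buffered * frame_time_ms  # 500 ms
```

Because this delay is structural (it comes from the frame dependencies, not the channel), no amount of extra bandwidth removes it, which is why the next paragraph drops B frames entirely.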
Thus, another approach would be to eliminate B frames and use only I and P frames. (One consequence of this is that the data rate will increase for a given level of quality, but for the sake of consistency in this example, continue to assume that each I frame is 160Kb in size and the average P frame is 16Kb in size, and thus the data rate is still 1.1Mbps.) This approach eliminates the unavoidable latency introduced by B frames, because the decoding of each P frame depends only on the previously received frame. A problem that remains with this approach is that I frames are so much larger than the average P frame that, on a low-bandwidth channel (as is typical in most homes and in many offices), the transmission of the I frames adds substantial latency. This is illustrated in fig. 6b. The video stream data rate 624 is below the available maximum data rate 622 except where an I frame occurs, where the peak data rate 623 required for the I frames far exceeds the available maximum data rate 622 (and even the rated maximum data rate 621). The data rate required by the P frames is less than the available maximum data rate. Even if the available maximum data rate steadily remains at its 2.2Mbps peak rate, it will take 160Kb/2.2Mb = 71 milliseconds to transmit the I frame, and if the available maximum data rate 622 dips by 50% (to 1.1Mbps), it will take 142 milliseconds to transmit the I frame. So, the latency in transmitting an I frame will fall somewhere between 71 and 142 milliseconds. This latency is additive to the latencies identified in fig. 4b, which in the worst case add up to 70 milliseconds, so this would result in a total round-trip latency of 141-222 milliseconds from the moment the user actuates the input device 421 until an image appears on the display device 422, which is far too high. And if the available maximum data rate dips below 2.2Mbps, the latency will increase further.
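The I-frame transmission times above follow from simple division. Note that a straightforward 160Kb/2.2Mbps division yields roughly 73 ms rather than the 71 ms quoted in the text; the small difference depends on which Kb/Mb conventions are assumed.

```python
# Time to push a 160Kb I frame through a bandwidth-limited channel,
# at the full 2.2Mbps rate and at the 50%-degraded 1.1Mbps rate.
I_FRAME_KBITS = 160

def tx_time_ms(channel_mbps):
    """Transmission time in ms, treating Kb = 1000 bits, Mb = 1000 Kb."""
    return I_FRAME_KBITS / (channel_mbps * 1000) * 1000

best = tx_time_ms(2.2)      # ~73 ms (the text cites 71 ms)
degraded = tx_time_ms(1.1)  # ~145 ms (the text cites 142 ms)
```

Halving the channel rate exactly doubles the I-frame transmission time, so the I-frame latency tracks the worst dips in available bandwidth rather than the average.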
It is also worth noting that there are often severe consequences to "jamming" data at the ISP at a peak data rate 623 that far exceeds the available data rate 622. The equipment in different ISPs will behave differently, but the following behaviors are quite common among DSL and cable modem ISPs when receiving packets at a data rate much higher than the available data rate 622: (a) delaying the packets by queueing them (introducing latency), (b) dropping some or all of the packets, or (c) disabling the connection for a period of time (most likely because the ISP is concerned that it is a malicious attack, such as a "denial of service" attack). Thus, transmitting a packet stream at full data rate with characteristics such as those shown in fig. 6b is not a viable option. The peaks 623 could be queued at the hosting service 210 and sent at a data rate below the available maximum data rate, but that would introduce the unacceptable latency described in the previous paragraph.
Additionally, the video stream data rate sequence 624 shown in fig. 6b is a very "tame" one, of the sort one would expect from compressing a video sequence that changes little and has very little motion (e.g., as would be common in a video teleconference where the camera is in a fixed position and the objects in the scene, such as a seated person talking, show little movement).
The video stream data rate sequence 634 shown in fig. 6c is typical of what one would expect from video with much more action, such as might be generated by a movie or a video game, or by some application software. Note that in addition to the I frame peak 633, there are also P frame peaks (such as 635 and 636) that are quite large and in many instances exceed the available maximum data rate. Although the P frame peaks are not quite as large as the I frame peaks, they are still far too large to be carried by the channel at full data rate, and, as with the I frame peaks, the P frame peaks must be transmitted slowly (thereby increasing latency).
On a high-bandwidth channel (e.g., a 100Mbps LAN, or a high-bandwidth 100Mbps private connection), the network would be able to tolerate large peaks such as the I frame peak 633 or the P frame peak 636, and in principle low latency could be maintained. However, such networks are frequently shared among many users (e.g., in an office environment), and such "peaky" data would impact the performance of the LAN, particularly if the network traffic is routed to a private shared connection (e.g., between a remote data center and an office). To begin with, bear in mind that this example is of a relatively low-resolution video stream of 640 × 480 pixels at 60fps. HDTV streams of 1920 × 1080 at 60fps are readily handled by modern computers and displays, and displays of 2560 × 1440 resolution at 60fps are increasingly available (e.g., Apple, Inc.'s 30" display). A high-action video sequence at 1920 × 1080 at 60fps may require 4.5Mbps using H.264 compression for a reasonable quality level. If I frame peaks are assumed at 10 times the nominal data rate, that would result in 45Mbps peaks, as well as smaller, but still considerable, P frame peaks. If several users are receiving video streams on the same 100Mbps network (e.g., a private network connection between an office and a data center), it is easy to see how the peaks from several users' video streams could happen to align, overwhelming the bandwidth of the network and potentially overwhelming the bandwidth of the backplanes of the switches supporting the users on the network. Even in the case of gigabit Ethernet, if enough users have enough peaks aligned at once, they could overwhelm the network or the network switches. And once 2560 × 1440 resolution video becomes more commonplace, the average video stream data rate may be 9.5Mbps, perhaps resulting in a 95Mbps peak data rate.
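The shared-link arithmetic in this paragraph can be checked in a few lines; the 10x peak factor and the per-stream rate are the text's example figures, not measurements:

```python
# Aggregate load when several users' I-frame peaks happen to align on a
# shared 100 Mbps link, per the 1920x1080/H.264 example in the text.
NOMINAL_MBPS = 4.5      # per-stream nominal rate (1920x1080 at 60fps)
PEAK_FACTOR = 10        # assumed I-frame peak relative to nominal
LINK_MBPS = 100

def aligned_peak_mbps(num_users: int) -> float:
    """Worst-case load if all users' peaks coincide."""
    return num_users * NOMINAL_MBPS * PEAK_FACTOR

for users in (1, 2, 3):
    load = aligned_peak_mbps(users)
    status = "exceeds link" if load > LINK_MBPS else "fits"
    print(f"{users} user(s): {load} Mbps ({status})")
```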
Needless to say, a 100Mbps connection between a data center and an office (which today is an exceptionally fast connection) would be completely swamped by the peak traffic from a single user. Thus, even though LANs and private network connections may be more tolerant of peaky streaming video, streaming video with high peaks is undesirable and may require special planning and accommodation by an office's IT department.
Of course, for standard linear video applications these issues do not matter, because the data rate is "smoothed" at the point of transmission, the data for each frame stays below the available maximum data rate 622, and a buffer in the client stores a sequence of I, P, and B frames before they are decompressed. Thus, the data rate on the network remains close to the average data rate of the video stream. Unfortunately, this introduces latency, even if B frames are not used, which is unacceptable for low-latency applications such as video games and applications that require fast response times.
One prior-art solution for mitigating video streams that have high peaks is to use a technique often referred to as "constant bit rate" (CBR) encoding. Although the term CBR would seem to imply that all frames are compressed to the same bit rate (i.e., size), what it usually refers to is a compression paradigm in which a maximum bit rate is allowed across a certain number of frames (in our case, 1 frame). For example, in the case of fig. 6c, if a CBR constraint were imposed on the encoding limiting the bit rate to, say, 70% of the nominal maximum data rate 621, the compression algorithm would limit the compression of each of the frames such that any frame that would normally be compressed using more than 70% of the nominal maximum data rate 621 would be compressed with fewer bits. The result is that frames that would normally require more bits to maintain a given quality level are "starved" of bits, and the image quality of those frames is worse than that of other frames that do not require more bits than 70% of the nominal maximum data rate 621. This approach can produce acceptable results for certain types of compressed video in which (a) little motion or scene change is expected and (b) the user can accept periodic quality degradation. A good example of an application well suited to CBR is video teleconferencing, since there are few peaks and, if quality degrades briefly (e.g., if the camera is panned, resulting in significant scene motion and large peaks, during which there may not be enough bits for high-quality image compression, leading to degraded image quality), most users find it acceptable. Unfortunately, CBR is poorly suited to many other applications, which have scenes of high complexity or a great deal of motion and/or require a reasonably constant level of quality.
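A minimal sketch of the single-frame CBR cap described above. The frame sizes and the 70% cap are illustrative; a real encoder would meet the budget by re-quantizing the frame rather than truncating it — the point here is only the per-frame bit budget:

```python
# Per-frame CBR cap: any compressed frame size above the budget is forced
# down to it (in a real encoder, via coarser quantization, at a quality
# cost -- these are the "starved" frames described in the text).
def cbr_clamp(frame_sizes_bits, nominal_max_bps, fps=60, cap_fraction=0.70):
    per_frame_budget = nominal_max_bps * cap_fraction / fps
    return [min(size, per_frame_budget) for size in frame_sizes_bits]

frames = [16e3, 16e3, 160e3, 16e3]   # one oversized I frame among P frames
print(cbr_clamp(frames, nominal_max_bps=2.2e6))
```

Only the third frame is clamped; the P frames already fit within the budget and pass through unchanged.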
The low-latency compression logic 404 employed in one embodiment uses several different techniques to address the range of problems with streaming low-latency compressed video while maintaining high quality. First, the low-latency compression logic 404 generates only I frames and P frames, thereby alleviating the need to wait several frame times to decode each B frame. In addition, as illustrated in FIG. 7a, in one embodiment the low-latency compression logic 404 subdivides each uncompressed frame 701-760 into a series of "tiles" and individually encodes each tile as either an I frame or a P frame. The group of compressed I frames and P frames is referred to herein as "R frames" 711-770. In the specific example shown in fig. 7a, each uncompressed frame is subdivided into 16 tiles in a 4 × 4 matrix. However, the underlying principles are not limited to any particular subdivision scheme.
In one embodiment, the low-latency compression logic 404 divides the video frames into a number of tiles and encodes (i.e., compresses) one tile from each frame as an I frame (i.e., the tile is compressed as if it were a separate video frame of 1/16th the size of the full image, and the compression used for this "mini" frame is I frame compression) and the remaining tiles as P frames (i.e., the compression used for each "mini" 1/16th frame is P frame compression). Tiles compressed as I frames and as P frames shall be referred to as "I tiles" and "P tiles," respectively. With each successive video frame, the tile to be encoded as an I tile is changed. Thus, in a given frame time, only one of the tiles in the video frame is an I tile, and the remainder of the tiles are P tiles. For example, in FIG. 7a, tile 0 of uncompressed frame 701 is encoded as I tile I0 and the remaining tiles 1-15 are encoded as P tiles (P1 through P15) to produce R frame 711. In the next uncompressed video frame 702, tile 1 is encoded as I tile I1 and the remaining tiles 0 and 2 through 15 are encoded as P tiles (P0, and P2 through P15) to produce R frame 712. Thus, the I tiles and P tiles for the tiles are progressively interleaved in time over successive frames. This process continues until an R frame 770 is generated with the last tile in the matrix encoded as an I tile (i.e., I15). The process then starts over, generating another R frame such as frame 711 (i.e., encoding an I tile for tile 0), and so on. Although not illustrated in fig. 7a, in one embodiment the first R frame of the video sequence of R frames contains only I tiles (i.e., so that subsequent P frames have reference image data from which to compute motion). Alternatively, in one embodiment, the startup sequence uses the same I tile pattern as normal, but does not include P tiles for those tiles that have not yet been encoded with an I tile.
In other words, certain tiles are not encoded with any data until the first I tile arrives, thereby avoiding startup peaks in the video stream data rate 934 of fig. 9a, which is explained in further detail below. Moreover, as described below, a variety of different sizes and shapes may be used for the tiles while still conforming to the underlying principles.
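A sketch of the cyclic I/P tile schedule of FIG. 7a, including the startup variant just described, might look like this (the function and its names are illustrative, not from the patent; "-" marks a tile not yet encoded at all):

```python
# Cyclic I/P tile schedule: with 16 tiles, tile (frame_index mod 16) is
# encoded as an I tile each frame and the rest as P tiles. In the startup
# variant, a tile whose first I tile has not yet arrived is not encoded,
# avoiding a startup data-rate peak.
def tile_types(frame_index: int, num_tiles: int = 16, startup: bool = True):
    i_tile = frame_index % num_tiles
    types = []
    for t in range(num_tiles):
        if t == i_tile:
            types.append("I")
        elif startup and frame_index < num_tiles and t > frame_index:
            types.append("-")        # no I tile yet: tile not encoded
        else:
            types.append("P")
    return types

print("".join(tile_types(0)))    # I---------------
print("".join(tile_types(1)))    # PI--------------
print("".join(tile_types(20)))   # PPPPIPPPPPPPPPPP
```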
The video decompression logic 412 running on the client 415 decompresses each tile as if it were a separate video sequence of small I and P frames, and then renders each tile to the frame buffer driving display device 422. For example, I0 and P0 from R frames 711 to 770 are used to decompress and render tile 0 of the video image. Similarly, I1 and P1 from R frames 711 to 770 are used to reconstruct tile 1, and so on. As described above, decompression of I frames and P frames is well known in the art, and decompression of I tiles and P tiles can be accomplished by having multiple instances of a video decompressor running on the client 415. Although multiplying the number of processes would seem to increase the computational burden on the client 415, it actually does not, because the tiles themselves are proportionally smaller relative to the number of additional processes, so the number of pixels displayed is the same as if there were one process using conventional full-size I and P frames.
This R frame technique significantly mitigates the bandwidth peaks normally associated with I frames (illustrated in figs. 6b and 6c), because any given frame is mostly made up of P tiles, which are typically smaller than I frames. For example, again assuming a typical I frame is 160Kb, the I tile for each of the frames illustrated in FIG. 7a would be roughly 1/16th of that amount, or 10Kb. Similarly, assuming a typical P frame is 16Kb, the P tiles for each of the tiles illustrated in FIG. 7a may be roughly 1Kb. The end result is an R frame of roughly 10Kb + 15 × 1Kb = 25Kb. So, each 60-frame sequence would be 25Kb × 60 = 1.5Mbps. Thus, at 60 frames/second, this would require a channel capable of sustaining a bandwidth of 1.5Mbps, but with far lower peaks, because the I tiles are distributed throughout the 60-frame interval.
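The R-frame data rate arithmetic above, restated as code (the tile sizes are the example's assumptions):

```python
# Size and data rate of an R frame: one I tile plus fifteen P tiles,
# per the example's 10Kb / 1Kb tile sizes.
def r_frame_bits(i_tile_bits=10e3, p_tile_bits=1e3, num_tiles=16):
    return i_tile_bits + (num_tiles - 1) * p_tile_bits

bits = r_frame_bits()                 # 10 Kb + 15 x 1 Kb = 25 Kb
print(bits)                           # 25000.0
print(bits * 60 / 1e6, "Mbps")        # 1.5 Mbps at 60 frames/second
```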
Note that in the prior example, with the same assumed data rates for I and P frames, the average data rate was 1.1Mbps. This is because in the prior example a new I frame was introduced only once every 60 frame times, whereas in this example the 16 tiles that make up an I frame cycle through in 16 frame times, and as such the equivalent of an I frame is introduced every 16 frame times, resulting in a slightly higher average data rate. In practice, though, introducing more frequent I frames does not increase the data rate linearly. This is due to the fact that a P frame (or a P tile) primarily encodes the difference from the prior frame to the next. So, if the prior frame is quite similar to the next frame, the P frame will be very small; if the prior frame is quite different from the next frame, the P frame will be very large. But because a P frame is largely derived from the prior frame, rather than from the actual frame, the resulting encoded frame may contain more errors (e.g., visual artifacts) than an I frame with an adequate number of bits. And, when one P frame follows another P frame, an accumulation of errors can occur, which gets worse with long sequences of P frames. A sophisticated video compressor will detect the fact that the quality of the image is degrading after a sequence of P frames and, if necessary, will allocate more bits to subsequent P frames to bring the quality back up or, if it is the most efficient course of action, will replace a P frame with an I frame. So, when long sequences of P frames are used (e.g., 59 P frames, as in the prior example above), particularly when the scene has a great deal of complexity and/or motion, typically more bits are needed for the P frames as they get further away from an I frame.
Viewed from the opposite perspective, P frames that closely follow an I frame tend to require fewer bits than P frames that are further away from an I frame. So, in the example shown in fig. 7a, no P tile is ever further than 15 frames from the preceding I tile, whereas in the prior example a P frame could be 59 frames away from an I frame. Thus, with more frequent I frames, the P frames are smaller. The exact relative sizes will of course vary with the nature of the video stream, but in the example of fig. 7a, if an I tile is 10Kb, the P tiles may average only 0.75Kb in size, resulting in 10Kb + 15 × 0.75Kb = 21.25Kb, or, at 60 frames/second, a data rate of 21.25Kb × 60 = 1.275Mbps, about 16% higher than the 1.1Mbps of a stream with an I frame followed by 59 P frames. Once again, the relative results between the two approaches to video compression will vary depending on the video sequence, but in general we have found empirically that, for a given level of quality, using R frames requires about 20% more bits than using I/P frame sequences. But, of course, R frames dramatically reduce the peaks, which makes the video sequence usable with far less latency than an I/P frame sequence.
R frames can be configured in a variety of different ways, depending upon the nature of the video sequence, the reliability of the channel, and the available data rate. In an alternative embodiment, a number of tiles other than 16 is used in a 4 × 4 configuration. For example, 2 tiles may be used in a 2 × 1 or 1 × 2 configuration, 4 tiles may be used in a 2 × 2, 4 × 1, or 1 × 4 configuration, 6 tiles may be used in a 3 × 2, 2 × 3, 6 × 1, or 1 × 6 configuration, or 8 tiles may be used in a 4 × 2 (as shown in fig. 7b), 2 × 4, 8 × 1, or 1 × 8 configuration. Note that the tiles need not be square, nor does the video frame have to be square, or even rectangular. The tiles can be broken up into whatever shapes best suit the video stream and the application used.
In another embodiment, the cycling of I tiles and P tiles is not locked to the number of tiles. For example, in an 8-tile 4 × 2 configuration, a 16-cycle sequence may still be used, as illustrated in fig. 7b. Sequential uncompressed frames 721, 722, 723 are each divided into 8 tiles 0-7, and each tile is compressed individually. For R frame 731, only tile 0 is compressed as an I tile, and the remaining tiles are compressed as P tiles. For the subsequent R frame 732, all 8 tiles are compressed as P tiles, and then for the subsequent R frame 733, tile 1 is compressed as an I tile while the other tiles are all compressed as P tiles. The sequencing continues in this manner for 16 frames, with an I tile generated only every other frame, so the last I tile (for tile 7) is generated during the 15th frame time, and during the 16th frame time R frame 780 is compressed using all P tiles (not shown in FIG. 7b). The sequence then begins again with tile 0 compressed as an I tile and the other tiles compressed as P tiles. As in the prior embodiment, the very first frame of the entire video sequence would typically be all I tiles, to provide a reference for the P tiles from that point forward. The cycling of I tiles and P tiles need not even be an even multiple of the number of tiles. For example, with 8 tiles, each frame with one I tile can be followed by 2 frames with all P tiles before another I tile is used. In yet another embodiment, if certain areas of the screen are known to have more motion (requiring more frequent I tiles) while other areas are more static (e.g., showing a game's score) and so require less frequent I tiles, then particular tiles may be sequenced with I tiles more often than other tiles. Further, although each frame is illustrated in figs. 7a-7b with a single I tile, multiple I tiles may be encoded in a single frame (depending on the bandwidth of the transmission channel).
Conversely, certain frames or frame sequences may be transmitted with no I tiles at all (i.e., only P tiles).
The reason the approach of the preceding paragraph works well is that, although not having I tiles distributed across every single frame would seem to result in large peaks, the behavior of the system is not that simple. Because each tile is compressed separately from the other tiles, as the tiles get smaller the encoding of each tile can become less efficient, since the compressor for a given tile is unable to exploit similar image features and similar motion from the other tiles. Thus, dividing the screen into 16 tiles will generally result in less efficient encoding than dividing the screen into 8 tiles. But if the screen is divided into 8 tiles and that causes the data of a full I frame to be introduced every 8 frames instead of every 16 frames, it results in a much higher overall data rate. So, by introducing a full I frame every 16 frames instead of every 8 frames, the overall data rate is reduced. Also, by using 8 larger tiles instead of 16 smaller tiles, the overall data rate is reduced, which also mitigates to some degree the data peaks caused by the larger tiles.
In another embodiment, the low-latency video compression logic 404 of figs. 7a and 7b controls the allocation of bits among the tiles in the R frames either by presets, based on known characteristics of the video sequence to be compressed, or automatically, based on an ongoing analysis of the image quality in each tile. For example, in some racing video games, the front of the player's car (which is relatively motionless in the scene) takes up a large portion of the lower half of the screen, while the upper half of the screen is entirely filled with the oncoming roadway, buildings, and scenery, which are almost always in motion. If the compression logic 404 allocates an equal number of bits to each tile, then the tiles in the bottom half of the screen in uncompressed frame 721 of FIG. 7b (tiles 4-7) will generally be compressed with higher quality than the tiles in the top half of the screen (tiles 0-3). If this particular game, or this particular scene of the game, is known to have such characteristics, the operator of the hosting service 210 can configure the compression logic 404 to allocate more bits to the tiles at the top of the screen than to the tiles at the bottom. Alternatively, the compression logic 404 can evaluate the quality of the compression of the tiles after the frames are compressed (using one or more of many compression quality metrics, such as peak signal-to-noise ratio (PSNR)), and if it determines that over a certain window of time certain tiles consistently produce better quality results, it gradually allocates more bits to the tiles producing lower quality results, until the various tiles reach a similar level of quality. In an alternative embodiment, the compressor logic 404 allocates bits to achieve higher quality in a particular tile or group of tiles.
For example, it may provide a better overall perceptual appearance to have higher quality at the center of the screen than at the edges.
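One way to sketch the quality-feedback bit allocation described above (the PSNR values, step size, and convergence policy are illustrative assumptions, not an algorithm specified by the text):

```python
# Gradual rebalancing: tiles whose measured quality (e.g., PSNR) lags the
# average receive more bits, tiles above the average give some back, so
# that over repeated frames the tiles converge to similar quality.
def rebalance_bits(bit_alloc, psnr_per_tile, step_bits=100.0):
    avg = sum(psnr_per_tile) / len(psnr_per_tile)
    new_alloc = []
    for bits, psnr in zip(bit_alloc, psnr_per_tile):
        if psnr < avg:
            new_alloc.append(bits + step_bits)   # starved tile: more bits
        elif psnr > avg:
            new_alloc.append(bits - step_bits)   # over-served tile: fewer bits
        else:
            new_alloc.append(bits)
    return new_alloc

alloc = [1000.0] * 4                      # equal allocation to start
print(rebalance_bits(alloc, [30.0, 42.0, 41.0, 43.0]))
```

Applied once per frame (or per window of frames), this nudges the low-quality tile's budget upward each iteration, which is the gradual behavior the paragraph describes.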
In one embodiment, to improve the resolution of certain regions of the video stream, the video compression logic 404 uses smaller tiles to encode areas of the video stream with relatively more scene complexity and/or motion than areas of the video stream with relatively less scene complexity and/or motion. For example, as illustrated in fig. 8, smaller tiles are employed around a moving character 805 in one area of an R frame 811, possibly followed by a series of R frames with the same tile sizing (not shown). Then, when the character 805 moves to a new area of the image, smaller tiles are used around this new area within another R frame 812, as illustrated. As mentioned above, a variety of different sizes and shapes can be employed as "tiles" while still conforming to the underlying principles.
Although the cyclic I/P tiles described above substantially reduce the peaks in the data rate of a video stream, they do not eliminate the peaks entirely, particularly in the case of rapidly changing or highly complex video imagery, such as occurs in movies, video games, and some application software. For example, during a sudden scene transition, a complex frame may be followed by another complex frame that is completely different. Even though several I tiles may have preceded the scene transition by only a few frame times, they do not help in this situation, because the new frame's material bears no relation to the prior I tiles. In this situation (and in others where much of the image, if not all of it, changes), the video compressor 404 will determine that many, if not all, of the P tiles are more efficiently encoded as I tiles, and the result is a very large peak in the data rate for that frame.
As previously discussed, it is simply not an option for most consumer-grade Internet connections (and many office connections) to "jam" through data that exceeds the available maximum data rate, shown as 622 in fig. 6c, along with the nominal maximum data rate 621. Note that the nominal maximum data rate 621 (e.g., "6Mbps DSL") is essentially a sales figure for users considering the purchase of an Internet connection; it generally does not guarantee a level of performance. For the purposes of this application it is irrelevant, since our only concern is the available maximum data rate 622 at the time the video is streamed through the connection. Consequently, in figs. 9a and 9c, when we describe a solution to the peaking problem, the nominal maximum data rate is omitted from the graphs, and only the available maximum data rate 922 is shown. The video stream data rate must not exceed the available maximum data rate 922.
To address this, the first thing the video compressor 404 does is determine a peak data rate 941, which is the data rate the channel is able to handle steadily. This rate can be determined by a number of techniques. One such technique is to gradually send an increasingly higher data rate test stream from the hosting service 210 to the client 415 (in figs. 4a and 4b), and have the client provide feedback to the hosting service as to the level of packet loss and latency. When the packet loss and/or latency begins to show a sharp increase, that is an indication that the available maximum data rate 922 is being reached. Afterwards, the hosting service 210 can gradually reduce the data rate of the test stream until the client 415 reports that, for a reasonable period of time, the test stream has been received with an acceptable level of packet loss and the latency is near minimal. This establishes a peak data rate 941, which will then be used as a peak data rate for streaming video. Over time, the peak data rate 941 will fluctuate (e.g., if another user in a household starts to use the Internet connection heavily), and the client 415 will need to constantly monitor it to see whether packet loss or latency increases, indicating that the available maximum data rate 922 has dropped below the previously established peak data rate 941, and if so, to reduce the peak data rate 941. Similarly, if over time the client 415 finds that the packet loss and latency remain at optimal levels, it can request that the video compressor slowly increase the data rate to see whether the available maximum data rate has increased (e.g., if another user in the household has stopped heavy use of the Internet connection), and again wait until packet loss and/or higher latency indicates that the available maximum data rate 922 has been exceeded, and again a lower level can be found for the peak data rate 941, but one that perhaps is higher than the level before testing for an increased data rate.
So, by using this technique (and techniques like it), a peak data rate 941 can be found, and adjusted periodically as needed. The peak data rate 941 establishes the maximum data rate that can be used by the video compressor 404 to stream video to the user. The logic for determining the peak data rate may be implemented at the user premises 211 and/or at the hosting service 210. At the user premises 211, the client device 415 performs the calculations to determine the peak data rate and transmits this information back to the hosting service 210; at the hosting service 210, a server 402 at the hosting service performs the calculations to determine the peak data rate based on statistics received from the client 415 (e.g., packet loss, latency, maximum data rate, etc.).
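The probe-and-back-off idea can be sketched as follows; here the channel is modeled as a simple capacity threshold standing in for the client's real packet-loss and latency feedback, and the step size is an illustrative assumption:

```python
# Peak-data-rate probe: ramp the test stream up until the (simulated)
# client would report loss, then back off until reception is clean.
# A real implementation would use client feedback, not a known capacity.
def find_peak_rate(channel_capacity_bps, start_bps=1e6, step_bps=0.5e6):
    rate = start_bps
    # Ramp up until the test stream exceeds what the channel can carry.
    while rate <= channel_capacity_bps:
        rate += step_bps
    # Back off until the stream is received cleanly again.
    while rate > channel_capacity_bps:
        rate -= step_bps
    return rate

print(find_peak_rate(channel_capacity_bps=5.2e6) / 1e6, "Mbps")
```

In practice this would run periodically, since the peak data rate 941 fluctuates with competing traffic on the user's connection.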
FIG. 9a shows an example video stream data rate 934 that has substantial scene complexity and/or motion and that has been generated using the cyclic I/P tile compression techniques previously described and illustrated in figs. 7a, 7b, and 8. The video compressor 404 has been configured to output compressed video at an average data rate below the peak data rate 941, and note that, most of the time, the video stream data rate remains below the peak data rate 941. A comparison of data rate 934 with the video stream data rate 634 shown in fig. 6c (generated using I/P/B or I/P frames) shows that the cyclic I/P tile compression produces a far smoother data rate. Still, at 2x peak 952 (which approaches 2x the peak data rate 942) and 4x peak 954 (which approaches 4x the peak data rate 944), the data rate exceeds the peak data rate 941, which is not acceptable. In practice, even with high-action video from rapidly changing video games, peaks in excess of the peak data rate 941 occur in less than 2% of the frames, peaks in excess of 2x the peak data rate 942 rarely occur, and peaks in excess of 3x the peak data rate 943 almost never occur. But when they do occur (e.g., during a scene transition), the data rate they require is necessary to produce a good-quality video image.
One way to solve this problem is simply to configure the video compressor 404 such that its maximum data rate output is the peak data rate 941. Unfortunately, the resulting video output quality during the peak frames is poor, because the compression algorithm is "starved" for bits. What results is the appearance of compression artifacts when there are sudden transitions or fast motion, and in time the user comes to realize that artifacts always suddenly appear whenever there is sudden change or rapid motion, and they can become quite annoying.
Although the human visual system is quite sensitive to visual artifacts that appear during sudden changes or rapid motion, it is not very sensitive to detecting a reduction in frame rate in such situations. In fact, when such sudden changes occur, it appears that the human visual system is preoccupied with tracking the changes, and it does not notice if the frame rate briefly drops from 60fps to 30fps and then immediately returns to 60fps. And, in the case of a very dramatic transition, such as a sudden scene change, the human visual system does not notice if the frame rate drops to 20fps or even 15fps and then immediately returns to 60fps. So long as the frame rate reduction occurs only infrequently, to a human observer it appears that the video is running continuously at 60fps.
This characteristic of the human visual system is exploited by the techniques illustrated in fig. 9b. The server 402 (from figs. 4a and 4b) produces an uncompressed video output stream at a steady frame rate (at 60fps in one embodiment). A timeline shows each frame 961-970 output each 1/60th second. Each uncompressed video frame, starting with frame 961, is output to the low-latency video compressor 404, which compresses the frame in less than a frame time, producing for the first frame compressed frame 1 (981). The data produced for compressed frame 1 (981) may be larger or smaller, depending upon many factors, as previously described. If the data is small enough that it can be transmitted to the client 415 in a frame time (1/60th second) or less at the peak data rate 941, then it is transmitted during transmit time (xmit time) 991 (the length of the arrow indicating the duration of the transmit time). In the next frame time, the server 402 produces uncompressed frame 2 (962), it is compressed to compressed frame 2 (982), and it is transmitted to the client 415 during transmit time 992, which is less than a frame time, at the peak data rate 941.
Then, in the next frame time, the server 402 produces uncompressed frame 3 (963). When it is compressed by the video compressor 404, the resulting compressed frame 3 (983) is more data than can be transmitted at the peak data rate 941 in a single frame time. So, it is transmitted during transmit time (2x peak) 993, which takes up all of that frame time and part of the next frame time. Now, during the next frame time, the server 402 produces another uncompressed frame 4 (964) and outputs it to the video compressor 404, but the data is ignored, as illustrated by 974. This is because the video compressor 404 is configured to ignore further uncompressed video frames that arrive while it is still transmitting a previously compressed frame. Of course, the client 415's video decompressor will fail to receive frame 4, but it simply continues to display frame 3 on the display device 422 for 2 frame times (i.e., briefly reducing the frame rate from 60fps to 30fps).
For the next frame 5, the server 402 outputs uncompressed frame 5 (965), it is compressed to compressed frame 5 (985) and transmitted within 1 frame time during transmit time 995. The client 415's video decompressor decompresses frame 5 and displays it on the display device 422. Next, the server 402 outputs uncompressed frame 6 (966), the video compressor 404 compresses it to compressed frame 6 (986), but this time the resulting data is very large. The compressed frame is transmitted during transmit time (4x peak) 996 at the peak data rate 941, but it takes almost 4 frame times to transmit the frame. During the next 3 frame times, the video compressor 404 ignores 3 frames from the server 402, and the client 415's decompressor holds frame 6 steadily on the display device 422 for 4 frame times (i.e., briefly reducing the frame rate from 60fps to 15fps). Then, finally, the server 402 outputs frame 10 (970), the video compressor 404 compresses it to compressed frame 10 (987), it is transmitted during transmit time 997, and the client 415's decompressor decompresses frame 10 and displays it on the display device 422, and once again the video resumes at 60fps.
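The transmit-and-drop behavior of FIG. 9b can be simulated in a few lines; frame sizes are expressed in multiples of the per-frame-time budget (peak data rate × frame time), and the 2x and 4x sizes below mirror the narrative above:

```python
# Frame-dropping simulation: each compressed frame is sent at the peak
# data rate; while a large frame is still in flight, newly arriving
# uncompressed frames are ignored, momentarily lowering the delivered
# frame rate (60fps -> 30fps for a 2x frame, -> 15fps for a 4x frame).
def simulate(frame_sizes_in_frame_times):
    busy_until = 0.0          # time (in frame times) the channel frees up
    sent, dropped = [], []
    for i, size in enumerate(frame_sizes_in_frame_times):
        if i < busy_until:
            dropped.append(i)            # compressor ignores this frame
        else:
            sent.append(i)
            busy_until = i + size        # transmit time at peak data rate
    return sent, dropped

# Frames at index 2 and 5 are the 2x and 4x peaks of the FIG. 9b narrative.
sent, dropped = simulate([1, 1, 2, 1, 1, 4, 1, 1, 1, 1])
print("sent:", sent, "dropped:", dropped)
```

As in the figure, the 2x frame causes the single following frame to be dropped, and the 4x frame causes the next three to be dropped before normal 60fps delivery resumes.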
Note that although video compressor 404 drops video frames from the video stream generated by server 402, it does not drop audio data, regardless of the form in which the audio arrives; when video frames are dropped, video compressor 404 continues to compress and transmit the audio data to client 415, which continues to decompress the audio data and provide the audio to whatever device the user is using for audio playback. Thus, the audio continues unabated during the period when frames are dropped. Compressed audio consumes a relatively small percentage of bandwidth compared to compressed video, and therefore does not have a large impact on the overall data rate. Although it is not illustrated in any of the data rate diagrams, there is always data rate capacity reserved for the compressed audio stream within the peak data rate 941.
The example just described in FIG. 9b was chosen to illustrate how the frame rate drops during data rate peaks, but what it does not illustrate is that when the previously described cyclic I/P tile techniques are used, such data rate peaks, and the consequent dropped frames, are rare, even during high scene complexity/high action sequences such as those that occur in video games, motion pictures, and some application software. Consequently, the reduced frame rates are infrequent and brief, and the human visual system does not detect them.
If the frame rate reduction mechanism just described is applied to the video stream data rate illustrated in FIG. 9a, the resulting video stream data rate is illustrated in FIG. 9c. In this example, the 2x peak 952 has been reduced to the flattened 2x peak 953, the 4x peak 955 has been reduced to the flattened 4x peak 956, and the entire video stream data rate 934 remains at or below the peak data rate 941.
Thus, using the techniques described above, a high-action video stream can be transmitted with low latency through the general Internet and through a consumer-grade Internet connection. Further, in an office environment on a LAN (e.g., 100 Mbps Ethernet or an 802.11g wireless network) or on a private network (e.g., a 100 Mbps connection between a data center and an office), a high-action video stream can be transmitted without peaks, such that multiple users (e.g., transmitting 1920 x 1080 at 60 fps at 4.5 Mbps) can use the LAN or a shared private data connection without overlapping peaks flooding the network or the network switch backplane.
Data rate adjustment
In one embodiment, the hosting service 210 initially assesses the available maximum data rate 622 and the latency of the channel to determine an appropriate data rate for the video stream, and in response dynamically adjusts the data rate. To adjust the data rate, the hosting service 210 may, for example, modify the image resolution and/or the number of frames per second of the video stream to be sent to the client 415. Also, the hosting service may adjust the quality level of the compressed video. When changing the resolution of the video stream (e.g., from 1280 x 720 resolution to 640 x 360), the video decompression logic 412 on the client 415 can scale up the image to maintain the same image size on the display screen.
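A minimal sketch of this kind of adjustment follows, assuming a ladder of resolution/frame-rate rungs. The 1280 x 720 and 640 x 360 resolutions come from the text above; the frame rates paired with them and the bits-per-pixel estimate are illustrative assumptions, not values from this disclosure.

```python
# Hypothetical stream-format ladder, highest quality first.
LADDER = [
    (1280, 720, 60),
    (1280, 720, 30),
    (640, 360, 60),
    (640, 360, 30),
]

def pick_stream_format(available_bps, bits_per_pixel=0.10):
    """Return the highest-quality rung whose estimated compressed data
    rate fits within the channel's available data rate."""
    for width, height, fps in LADDER:
        estimated_bps = width * height * fps * bits_per_pixel
        if estimated_bps <= available_bps:
            return (width, height, fps)
    # Nothing fits: fall back to the lowest rung (the service would then
    # reduce the compression quality level instead).
    return LADDER[-1]
```

For example, a 6 Mbps channel keeps 1280 x 720 at 60 fps, while a 1 Mbps channel drops to 640 x 360 at 30 fps, after which the client's decompression logic 412 scales the image back up for display.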
In one embodiment, in the case of a channel complete drop (drop out), the hosting service 210 suspends the game. In the case of multiplayer games, the hosting service reports to other users that the user's game has been dropped and/or suspends the game for other users.
Dropped or delayed packets
In one embodiment, the video decompression logic 412 is able to mitigate visual artifacts if data is lost due to packet loss between the video compressor 404 and the client 415 in FIG. 4a or 4b, or due to packets being received out of order that arrive too late to be decompressed in time to meet the latency requirements of the decompressed frame. In a streaming I/P frame implementation, if there is a lost/delayed packet, the entire screen is affected, potentially causing the screen to freeze completely for a period of time or to show other screen-wide visual artifacts. For example, if a lost/delayed packet causes the loss of an I frame, then the decompressor will lack a reference for all of the P frames that follow, until a new I frame is received. If a P frame is lost, it will affect the P frames that follow it for the entire screen. Depending on how long it is before an I frame occurs, this will have a longer or shorter visual impact. With interleaved I/P tiles as shown in FIGS. 7a and 7b, a lost/delayed packet is much less likely to affect the entire screen, since it only affects the tiles contained in the affected packet. If each tile's data is sent within an individual packet, then if a packet is lost, it will only affect one tile. Of course, the duration of the visual artifact will depend on whether an I tile packet is lost and, if a P tile is lost, how many frames it will take until an I tile occurs. But, given that different tiles on the screen are updated with I tiles very frequently (potentially every frame), even if one tile on the screen is affected, other tiles may not be. Further, if some event causes a loss of several packets at once (e.g., a spike in power adjacent to a DSL line that briefly interrupts the data flow), then some of the tiles will be affected more than others, but because some tiles will quickly be refreshed with a new I tile, they will only be briefly affected.
Also, in the case of a streaming I/P frame implementation, not only are the I frames the most critical frames, they are also extremely large, so if there is an event that causes a dropped/delayed packet, there is a higher probability that an I frame will be affected than a much smaller I tile (i.e., if any portion of an I frame is lost, it is impossible to decompress the I frame at all). For all of these reasons, using I/P tiles results in far smaller visual artifacts when packets are dropped/delayed than with I/P frames.
One embodiment attempts to reduce the impact of lost packets by intelligently packaging the compressed tiles within TCP (Transmission Control Protocol) packets or UDP (User Datagram Protocol) packets. For example, in one embodiment, tiles are aligned with packet boundaries whenever possible. FIG. 10a illustrates how tiles might be packed within a series of packets 1001-1005 without implementing this feature. Specifically, in FIG. 10a, tiles cross packet boundaries and are packed inefficiently, so that the loss of a single packet results in the loss of multiple tiles. For example, if packet 1003 or 1004 is lost, three tiles are lost, resulting in visual artifacts.
In contrast, FIG. 10b illustrates tile packing logic 1010 for intelligently packing tiles within packets to reduce the impact of packet loss. First, the tile packing logic 1010 aligns tiles with packet boundaries. Thus, tiles T1, T3, T4, T7, and T2 are aligned with the boundaries of packets 1001-1005, respectively. The tile packing logic also attempts to fit the tiles within the packets in the most efficient manner possible, without crossing packet boundaries. Based on the sizes of the tiles, tiles T1 and T6 are combined in one packet 1001; T3 and T5 are combined in one packet 1002; tiles T4 and T8 are combined in one packet 1003; tile T7 is added to packet 1004; and tile T2 is added to packet 1005. Thus, under this scheme, the loss of a single packet will result in the loss of no more than 2 tiles (rather than 3 tiles, as illustrated in FIG. 10a).
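A simple greedy first-fit packer gives the flavor of this idea; it is a sketch only, and the patent's tile packing logic 1010 may combine tiles differently. The invariants it preserves are the ones that matter here: every tile starts at a packet boundary, and no tile is ever split across packets, so one lost packet costs at most the tiles it carries.

```python
def pack_tiles(tiles, mtu):
    """Greedy first-fit bin packing of (name, size) tiles into packets of
    capacity `mtu`, never splitting a tile across a packet boundary."""
    packets = []                        # each packet is a list of (name, size)
    for name, size in tiles:
        for pkt in packets:             # first packet with enough room wins
            if sum(s for _, s in pkt) + size <= mtu:
                pkt.append((name, size))
                break
        else:                           # no existing packet fits: start a new one
            packets.append([(name, size)])
    return packets
```

Usage: `pack_tiles([("T1", 600), ("T2", 900), ...], mtu=1000)` returns packets in which every tile is whole, so the blast radius of a single packet loss is bounded by that packet's tile list rather than by tiles straddling its boundaries.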
One additional benefit to the embodiment shown in FIG. 10b is that the tiles are transmitted in a different order from that in which they are displayed within the image. This way, if adjacent packets are lost due to the same event interfering with the transmission, the effect will be seen in areas that are not near one another on the screen, creating a less noticeable artifact on the display.
One embodiment uses forward error correction (FEC) techniques to protect certain portions of the video stream from channel errors. As is known in the art, FEC techniques such as Reed-Solomon and Viterbi generate error correction information and attach it to the data transmitted over a communication channel. If an error occurs in the underlying data (e.g., an I frame), then FEC can be used to correct the error.
FEC codes increase the data rate of the transmission, so ideally they are only used where they are most needed. If data is being sent whose loss would not result in a very noticeable visual artifact, it may be preferable not to use FEC codes to protect that data. For example, a P tile lost immediately before an I tile will only produce a visual artifact of 1/60th of a second on the screen (i.e., the tile on the screen will not be updated for one frame). Such a visual artifact is barely detectable by the human eye. As P tiles fall further back from an I tile, losing a P tile becomes increasingly noticeable. For example, if the tile cycle pattern is an I tile followed by 15 P tiles before an I tile is available again, then if the P tile immediately following an I tile is lost, that tile will show an incorrect image for 15 frame times (at 60 fps, that would be 250 milliseconds). The human eye will readily detect a 250 ms disruption in a stream. So, the further back from a new I tile that a P tile is (i.e., the more closely a P tile follows an I tile), the more noticeable the artifact. As previously discussed, though, in general, the more closely a P tile follows an I tile, the smaller the data for that P tile. Thus, the P tiles following I tiles are not only more critical to protect from loss, they are also smaller in size. And, in general, the smaller the data that needs to be protected, the smaller the FEC code needed to protect it.
Thus, as illustrated in fig. 11a, in one embodiment, only the I tiles are provided with FEC codes due to the importance of the I tiles in the video stream. Thus, FEC 1101 contains an error correction code for I tile 1100 and FEC 1104 contains an error correction code for I tile 1103. In this embodiment, no FEC is generated for P tiles.
In one embodiment illustrated in FIG. 11b, FEC codes are also generated for the P tiles whose loss would be most likely to cause visual artifacts. In this embodiment, FEC 1105 provides error correction codes for the first 3 P tiles, but not for the P tiles that follow. In another embodiment, FEC codes are generated for the P tiles with the smallest data size (which will tend to self-select the P tiles occurring soonest after an I tile, which are the most critical to protect).
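The selection rule of FIGS. 11a and 11b, and the artifact-duration arithmetic behind it, can be sketched as follows. The cycle length of 16 (one I tile plus 15 P tiles) and the "first 3 P tiles" cutoff come from the text; treating them as parameters is our generalization.

```python
def artifact_ms(frames_since_i, cycle_len=16, fps=60):
    """How long (ms) a lost P tile remains visibly wrong before the next
    I tile refreshes it: the remainder of the I/P cycle."""
    return (cycle_len - frames_since_i) * 1000.0 / fps

def needs_fec(tile_type, frames_since_i, protect_first_n=3):
    """FIG. 11a: I tiles always get FEC. FIG. 11b: additionally protect
    the first few P tiles after an I tile, whose loss is most noticeable."""
    if tile_type == "I":
        return True
    return frames_since_i <= protect_first_n
```

This reproduces the figures in the text: a P tile lost immediately after an I tile (`frames_since_i=1`) stays wrong for 250 ms, while one lost immediately before the next I tile (`frames_since_i=15`) is wrong for only 1/60th of a second.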
In another embodiment, rather than sending the FEC code along with the tile, the tile is transmitted twice, each time in a different packet. If a packet is lost/delayed, another packet is used.
In one embodiment shown in fig. 11c, FEC codes 1111 and 1113 are generated for audio packets 1110 and 1112, respectively, transmitted from the hosting service concurrently with the video. Maintaining the integrity of the audio in the video stream is particularly important because distorted audio (e.g., clicks or hisses) would result in a particularly undesirable user experience. The FEC code helps ensure that the audio content is rendered at the client computer 415 without distortion.
In another embodiment, rather than sending the FEC code along with the audio data, the audio data is transmitted twice, each time in a different packet. If one packet is lost/delayed, another packet is used.
Additionally, in one embodiment illustrated in fig. 11d, FEC codes 1121 and 1123 are used for user input commands (e.g., button presses) 1120 and 1122, respectively, that are transmitted upstream from the client 415 to the hosting service 210. This is important because missing button presses or mouse movements in a video game or application may result in an undesirable user experience.
In another embodiment, rather than sending the FEC code along with the user input command data, the user input command data is transmitted twice, each time in a different packet. If one packet is lost/delayed, another packet is used.
In one embodiment, the hosting service 210 assesses the quality of the communication channel with the client 415 to determine whether to use FEC and, if so, to which portions of the video, audio, and user commands FEC should be applied. Assessing the "quality" of the channel may include estimating packet loss, latency, etc., as described above. If the channel is particularly unreliable, then the hosting service 210 may apply FEC to all of the I tiles, P tiles, audio, and user commands. By contrast, if the channel is reliable, then the hosting service 210 may apply FEC only to audio and user commands, or may not apply FEC to audio or video, or may not use FEC at all. Various other permutations of the application of FEC may be employed while still complying with these underlying principles. In one embodiment, the hosting service 210 continually monitors the condition of the channel and changes the FEC policy accordingly.
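One way such a policy could be expressed is a simple tiered mapping from measured packet loss to the set of protected streams. The tiers mirror the extremes described above (protect everything on an unreliable channel, nothing on a reliable one); the specific loss-rate thresholds are illustrative assumptions only.

```python
def fec_policy(packet_loss_rate):
    """Map a measured packet loss rate to the set of streams that should
    be protected with FEC. Thresholds are hypothetical."""
    if packet_loss_rate > 0.05:          # very unreliable: protect everything
        return {"I tiles", "P tiles", "audio", "user input"}
    if packet_loss_rate > 0.01:          # shaky: protect the critical streams
        return {"I tiles", "audio", "user input"}
    if packet_loss_rate > 0.001:         # mostly fine: audio and input only
        return {"audio", "user input"}
    return set()                         # reliable channel: skip FEC overhead
```

In a running system this function would be re-evaluated continually as channel monitoring updates the loss estimate, matching the last sentence above.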
In another embodiment, referring to FIGS. 4a and 4b, when a packet is lost/delayed, resulting in the loss of tile data, or if FEC is unable to correct the lost tile data, perhaps due to particularly bad packet loss, the client 415 assesses how many frames remain before a new I tile will be received and compares that to the round-trip latency from the client 415 to the hosting service 210. If the round-trip latency is less than the number of frames before a new I tile is due to arrive, then the client 415 sends a message to the hosting service 210 requesting a new I tile. This message is routed to the video compressor 404, which, rather than generating a P tile for the tile whose data had been lost, generates an I tile. Given that the system shown in FIGS. 4a and 4b is designed to provide a round-trip latency that is typically less than 80 milliseconds, this results in the tile being corrected within 80 milliseconds (at 60 fps, frames are 16.67 milliseconds in duration, so in full frame times, an 80-millisecond latency would result in the tile being corrected within 83.33 milliseconds, which is 5 frame times: a noticeable disruption, but far less noticeable than, for example, a 250-millisecond disruption lasting 15 frames). When the compressor 404 generates such an I tile out of its usual cyclic order, if the I tile would cause the bandwidth of that frame to exceed the available bandwidth, then the compressor 404 will delay the cycles of the other tiles so that the other tiles receive P tiles during that frame time (even if one tile would normally be due an I tile during that frame), and then, starting with the next frame, the usual cycling will continue, and the tile that normally would have received an I tile in the preceding frame will receive one. Although this action briefly delays the phase of the R frame cycling, it will normally not be visually noticeable.
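The client's decision rule here reduces to a single comparison, sketched below under the assumption (from the text) that the round-trip latency, converted into frame times, is weighed against the number of frames remaining until the next scheduled I tile.

```python
def should_request_i_tile(frames_until_next_i, rtt_ms, fps=60):
    """Request a fresh I tile from the hosting service only if it can
    arrive sooner than the regularly scheduled I tile would."""
    rtt_in_frames = rtt_ms * fps / 1000.0
    return rtt_in_frames < frames_until_next_i
```

With the 80 ms round trip from the text (4.8 frame times at 60 fps), a tile lost 15 frames before its next scheduled I tile triggers a request, while a tile only 4 frames away from its I tile simply waits.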
Video and audio compressor/decompressor implementations
FIG. 12 illustrates one particular embodiment in which a multi-core and/or multi-processor 1200 is used to compress 8 tiles in parallel. In one embodiment, a dual-processor, quad-core Xeon CPU computer system running at 2.66 GHz or higher is used, with each core implementing the open source x264 H.264 compressor as an independent process. However, a variety of other hardware/software configurations may be used while still complying with these underlying principles. For example, each of the CPU cores can be replaced with an H.264 compressor implemented in an FPGA. In the example shown in FIG. 12, cores 1201-1208 are used to process the I tiles and P tiles concurrently as eight independent threads. As is well known in the art, current multi-core and multi-processor computer systems are inherently capable of multi-threading when integrated with multi-threading operating systems such as Microsoft Windows XP Professional Edition (64-bit or 32-bit) and Linux.
In the embodiment illustrated in FIG. 12, since each of the 8 cores is responsible for just one tile, it operates largely independently of the other cores, each running a separate instantiation of x264. A PCI Express x1-based DVI capture card, such as the Sendero Video Imaging IP Development Board from Microtronix of Oosterhout, the Netherlands, is used to capture uncompressed video at 640 x 480, 800 x 600, or 1280 x 720 resolution, and the FPGA on the card uses Direct Memory Access (DMA) to transfer the captured video into system RAM. The tiles are arranged in a 4 x 2 arrangement 1205 (although they are illustrated as square tiles, in this embodiment they are of 160 x 240 resolution). Each instantiation of x264 is configured to compress one of the 8 160 x 240 tiles, and they are synchronized such that, after an initial I tile compression, each core enters into a cycle, each one frame out of phase with the others, to compress one I tile followed by seven P tiles, as illustrated in FIG. 12.
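The staggered schedule described above (8 cores, each one frame out of phase, one I tile followed by seven P tiles) can be sketched as a pure function of core index and frame number. This is an illustrative model of the cadence, not code from the disclosure.

```python
def tile_type(core, frame, n_cores=8):
    """Tile type produced by `core` (0 to n_cores-1) at frame `frame`:
    cores are offset by one frame each, so exactly one core emits an
    I tile in any given frame time while the rest emit P tiles."""
    return "I" if frame % n_cores == core else "P"
```

Because the I tiles are spread evenly across frame times, the per-frame aggregate data rate stays far flatter than if all 8 tiles produced their I tiles in the same frame.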
At each frame time, the resulting compressed tiles are combined into a packet stream using the techniques previously described, and then the compressed tiles are transmitted to destination client 415.
Although not illustrated in fig. 12, if the data rate of the combined 8 tiles exceeds the specified peak data rate 941, then all 8 x264 processes will be suspended for the necessary frame time until the data for the combined 8 tiles has been transmitted.
In one embodiment, client 415 is implemented as software on a PC that executes 8 instantiations of FFmpeg. The receiving process receives the 8 tiles and routes each tile to the FFmpeg instantiation, which decompresses the tile and renders it to the appropriate tile location on the display device 422.
The client 415 receives keyboard, mouse, or game controller input from the PC's input device drivers and transmits it to the server 402. The server 402 then applies the received input device data to the game or application running on the server 402, which is a PC running Windows on an Intel 2.16 GHz dual-core CPU. The server 402 then produces a new frame and outputs it via its DVI output, either from a motherboard-based graphics system or via the DVI output of an NVIDIA 8800GTX PCI Express card.
At the same time, server 402 outputs audio generated by the game or application via its digital audio output (e.g., S/PDIF) coupled to a digital audio input on a dual quad-core Xeon-based PC that implements video compression. The Vorbis open source audio compressor is used to compress audio simultaneously with video using whatever core is available to handle the threads. In one embodiment, the core that completes compressing its tile first performs audio compression. The compressed audio is then transmitted along with the compressed video and decompressed at the client 415 using a Vorbis audio decompressor.
Hosting service server central distribution
Light travels through glass, such as optical fiber, at some fraction of the speed of light in a vacuum, and so the exact propagation speed of light in optical fiber can be determined. But, in practice, allowing time for routing delays, transmission inefficiencies, and other overhead, we have observed that optimal latencies on the Internet reflect transmission speeds closer to 50% of the speed of light. Thus, an optimal 1000-mile round-trip latency is roughly 22 milliseconds, and an optimal 3000-mile round-trip latency is roughly 64 milliseconds. Thus, a single server on one U.S. coast will be too far away to serve clients on the other coast (which may be as far as 3000 miles away) with the desired latency. However, as illustrated in FIG. 13a, if the hosting service 210 server center 1300 is located in the center of the United States (e.g., Kansas, Nebraska, etc.), such that the distance to any point in the continental United States is approximately 1500 miles or less, the round-trip Internet latency can be as low as 32 milliseconds. Referring to FIG. 4b, note that although the worst-case latency allowed for the user ISP 453 is 25 milliseconds, typically we have observed latencies closer to 10-15 milliseconds with DSL and cable modem systems. Also, FIG. 4b assumes a maximum distance of 1000 miles from the user premises 211 to the hosting center 210. Thus, with a typical user ISP round-trip latency of 15 milliseconds used, and with a maximum Internet distance of 1500 miles with a round-trip latency of 32 milliseconds, the total round-trip latency from the moment a user actuates input device 421 to seeing a response on display device 422 is 1+1+15+32+1+16+6+8 = 80 milliseconds. Thus, an 80-millisecond response time can typically be achieved over an Internet distance of 1500 miles. This would allow any user premises in the continental United States with a sufficiently short user ISP latency 453 to access a single, centrally located server center.
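The two calculations in this paragraph, the propagation-limited round trip and the end-to-end budget, can be sketched as follows. The 50% of light speed figure and the eight budget terms summing to 80 ms come from the text; the labeling of the individual 1/16/6/8 ms terms would correspond to the FIG. 4b latency components, which are not enumerated in this section.

```python
C_MILES_PER_MS = 186.282   # speed of light in vacuum, miles per millisecond

def optimal_rtt_ms(miles, speed_fraction=0.5):
    """Best-case Internet round trip for a one-way distance `miles`,
    assuming propagation at ~50% of the speed of light as observed."""
    return 2 * miles / (speed_fraction * C_MILES_PER_MS)

# The eight terms of the 80 ms round-trip budget summed in the text.
BUDGET_MS = [1, 1, 15, 32, 1, 16, 6, 8]
```

Evaluating `optimal_rtt_ms(1500)` gives roughly 32 ms and `optimal_rtt_ms(3000)` roughly 64 ms, and the budget terms sum to the stated 80 ms total.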
In another embodiment illustrated in FIG. 13b, the hosting service 210 server centers HS1-HS6 are strategically located around the United states (or other geographic area), with certain larger hosting service server centers located near high population centers (e.g., HS2 and HS 5). In one embodiment, server centers HS1-HS6 exchange information via network 1301, which network 1301 can be the Internet or a private network or a combination of both. In the case of multiple server centers, users with high user ISP latency 453 can be served with lower latency.
Although distance through the Internet is certainly a factor contributing to round-trip latency through the Internet, other factors that are largely unrelated to distance sometimes come into play. Sometimes a packet stream is routed through the Internet to a far-away location and back again, resulting in latency from the long loop. Sometimes there is routing equipment on the path that is not operating properly, resulting in delayed transmission. Sometimes there is traffic overloading a path, which introduces delay. And sometimes there is an outright failure that prevents the user's ISP from routing to a given destination at all. Thus, while the general Internet usually provides connections from one point to another with a fairly reliable and optimal route and with latency largely determined by distance (especially with long-distance connections that result in routing outside of the user's local area), such reliability and latency are by no means guaranteed, and often cannot be achieved, from a user's premises to a given destination on the general Internet.
In one embodiment, when a user client 415 initially connects to the hosting service 210 to play a video game or use an application, the client communicates with each of the available hosting service server centers HS1-HS6 upon startup (e.g., using the techniques described above). If the latency is low enough for a particular connection, then that connection is used. In one embodiment, the client communicates with all, or a subset, of the hosting service server centers, and the one with the lowest-latency connection is selected. The client may select the service center with the lowest-latency connection, or the server centers may identify the one with the lowest-latency connection and provide that information (e.g., in the form of an Internet address) to the client.
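The selection step can be sketched as below, assuming the startup probes yield a measured round-trip latency per server center; the 80 ms acceptability cutoff is borrowed from the response-time target discussed earlier and is an assumption here, not a stated rule of this embodiment.

```python
def choose_server_center(measured_rtts, max_acceptable_ms=80):
    """measured_rtts: {center_name: round-trip latency in ms, from the
    startup probes}. Returns the lowest-latency center, or None if even
    the best center exceeds the acceptable latency."""
    center, rtt = min(measured_rtts.items(), key=lambda kv: kv[1])
    return center if rtt <= max_acceptable_ms else None
```

Either side of the connection could run this logic: the client over its own probe results, or the service over measurements reported back, returning the chosen center's address to the client.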
If a particular hosting service server center is overloaded and/or the user's games or applications can tolerate a delay to another, less loaded hosting service server center, the client 415 may be redirected to another hosting service server center. In this case, the game or application being executed by the user would be paused on the server 402 at the user's overloaded server center and the game or application state data would be transferred to the server 402 at another hosting server center. The game or application will then be restarted. In one embodiment, the hosting service 210 will wait until the game or application reaches a natural point of pause (e.g., between levels in the game, or after the user initiates a "save" operation in the application) before transferring. In yet another embodiment, the hosting service 210 will wait until user activity ceases for a specified period of time (e.g., 1 minute) and then will initiate a transfer at this time.
As described above, in one embodiment, the hosting service 210 subscribes to an Internet bypass service 440 of FIG. 14 in an attempt to provide guaranteed latency to its clients. Internet bypass services, as used herein, are services that provide private network routes from one point to another on the Internet with guaranteed characteristics (e.g., latency, data rate, etc.). For example, if the hosting service 210 were receiving a large amount of traffic from users using AT&T's DSL service offered in San Francisco, then rather than routing to AT&T's San Francisco-based central offices, the hosting service 210 could lease a high-capacity private data connection from a service provider (perhaps AT&T itself or another provider) between the San Francisco-based central offices and one or more of the server centers for hosting service 210. Then, if the routes from all of the hosting service server centers HS1-HS6 through the general Internet to a San Francisco user using AT&T DSL were to result in too-high latency, the private data connection could be used instead. Although private data connections are generally more expensive than routes through the general Internet, as long as they remain a small percentage of the hosting service 210's connections to users, the overall cost impact will be low, and users will experience a more consistent service experience.
Server centers often have two layers of backup power in the event of a power failure. The first layer is typically backup power from batteries (or from an alternative immediately available energy source, such as a flywheel that is kept running and is attached to a generator), which provides power immediately when the power mains fail and keeps the server center running. If the power failure is brief and the power mains return quickly (e.g., within a minute), then the batteries are all that is needed to keep the server center running. But if the power failure lasts for a longer period of time, then generators (e.g., diesel-powered) are typically started, taking over from the batteries, and they can run for as long as they have fuel. Such generators are extremely expensive, since they must be capable of producing as much power as the server center normally draws from the power mains.
In one embodiment, each of the hosting services HS1-HS6 shares user data with the others, so that if one server center has a power failure, it can pause the games and applications in progress, then transfer the game or application state data from each server 402 to servers 402 at the other server centers, and then notify the client 415 of each user, directing it to communicate with its new server 402. Given that such a situation will occur infrequently, it may be acceptable to transfer a user to a hosting service server center that cannot provide optimal latency (i.e., the user will simply have to tolerate higher latency for the duration of the power failure), which allows for a much wider range of options for transferring users. For example, given the time zone differences across the United States, users on the East Coast may be going to sleep at 11:30 PM, while users on the West Coast at 8:30 PM are starting to reach peak video game usage. If there is a power failure in a hosting service server center on the West Coast at that time, there may not be enough West Coast servers 402 at the other hosting service server centers to handle all of the users. In such a situation, some of the users can be transferred to hosting service server centers on the East Coast that have available servers 402, and the only consequence to the users would be higher latency. Once the users have been transferred from the server center that has lost power, the server center can then commence an orderly shutdown of its servers and equipment, so that all of the equipment is shut down before the batteries (or other immediate power backup) are exhausted. In this way, the cost of generators for the server center can be avoided.
In one embodiment, during times of heavy loading of the hosting service 210 (either due to peak user loading or because one or more server centers have failed), users are transferred to other server centers based on latency requirements of the games or applications being used by the users. Thus, users using games or applications that require low latency will be given a preference for available low latency server connections with limited provisioning.
Host service features
FIG. 15 illustrates an embodiment of the components of a server center for hosting service 210 utilized in the feature descriptions that follow. As with the hosting service 210 illustrated in FIG. 2a, the components of this server center are controlled and coordinated by a hosting service 210 control system 401 unless otherwise qualified.
Inbound Internet traffic 1501 from user clients 415 is directed to inbound routing 1502. Typically, inbound Internet traffic 1501 will enter the server center via a high-speed fiber connection to the Internet, but any network connection means of adequate bandwidth, reliability, and low latency will suffice. Inbound routing 1502 is a system of network switches and routing servers supporting the switches (the network may be implemented as Ethernet, Fibre Channel, or via any other transport means) that takes the arriving packets and routes each packet to the appropriate app/game server 1521-1525. In one embodiment, a packet delivered to a particular app/game server represents a subset of the data received from the client and/or may be translated/changed by other components within the data center (e.g., network connection components such as gateways and routers). In some cases, packets will be routed to more than one server 1521-1525 at a time, for example, if a game or application is running on multiple servers concurrently in parallel. RAID arrays 1511-1512 are connected to the inbound routing network 1502, such that the app/game servers 1521-1525 can read from and write to the RAID arrays 1511-1512. Further, a RAID array 1515 (which may be implemented as multiple RAID arrays) is also connected to the inbound routing 1502, and data from RAID array 1515 can be read from the app/game servers 1521-1525. The inbound routing 1502 may be implemented in a wide range of prior art network architectures, including a tree structure of switches with the inbound Internet traffic 1501 at its root; in a mesh structure interconnecting all of the various devices; or as a series of interconnected subnets, with concentrated traffic among intercommunicating devices segregated from concentrated traffic among other devices.
One type of network is configured as a SAN (storage area network), which, although commonly used for storage devices, can also be used for general high-speed data transfer between devices. Also, the app/game servers 1521-1525 may each have multiple network connections to the inbound routing 1502. For example, servers 1521-1525 may have a network connection to a sub-network attached to RAID arrays 1511-1512 and another network connection to a sub-network attached to other devices.
The app/game servers 1521-1525 may all be configured the same, some differently, or all differently, as previously described in relation to the server 402 in the embodiment illustrated in FIG. 4a. In one embodiment, each user, when using the hosting service, typically uses at least one app/game server 1521-1525. For the sake of simplicity of explanation, we shall assume a given user is using app/game server 1521, but multiple servers could be used by one user, and multiple users could share a single app/game server 1521-1525. The user's control input, sent from client 415 as previously described, is received as inbound Internet traffic 1501 and is routed through inbound routing 1502 to app/game server 1521. App/game server 1521 uses the user's control input as control input to the game or application running on the server, and computes the next frame of video and the audio associated with it. App/game server 1521 then outputs the uncompressed video/audio 1529 to common video compression 1530. The app/game server may output the uncompressed video via any means, including one or more Gigabit Ethernet connections, but in one embodiment the video is output via a DVI (Digital Visual Interface) connection, and the audio and other compression and communication channel state information is output via a Universal Serial Bus (USB) connection.
The common video compression 1530 compresses the uncompressed video and audio from the app/game servers 1521-1525. The compression may be implemented entirely in hardware, or in hardware running software. There may be a dedicated compressor for each app/game server 1521-1525, or, if the compressors are fast enough, a given compressor can be used to compress the video/audio from more than one app/game server 1521-1525. For example, at 60 fps a video frame time is 16.67 ms. If a compressor is able to compress a frame in 1 ms, then that compressor could be used to compress the video/audio from as many as 16 app/game servers 1521-1525 by taking input from one server after another, with the compressor saving the state of each video/audio compression process and switching context as it cycles among the video/audio streams from the servers. This results in substantial cost savings in compression hardware. Because different servers will be completing frames at different times, in one embodiment the compressor resources are in a shared pool 1530 with shared storage means (e.g., RAM, flash) for storing the state of each compression process, and when a server's 1521-1525 frame is complete and ready to be compressed, a control means determines which compression resource is available at that time, and provides that compression resource with the state of the server's compression process and the frame of uncompressed video/audio to compress.
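The sizing arithmetic in the example above (16.67 ms frame time, 1 ms per compressed frame, hence up to 16 servers per time-sliced compressor) can be stated directly. The constants and function name below are illustrative, not from the patent:

```python
FRAME_RATE = 60
FRAME_TIME_MS = 1000 / FRAME_RATE   # ~16.67 ms per frame at 60 fps
COMPRESS_TIME_MS = 1.0              # assumed per-frame compression cost

def max_servers_per_compressor(frame_time_ms: float, compress_time_ms: float) -> int:
    """How many servers one compressor can serve by context-switching
    between per-stream compression states within a single frame time."""
    return int(frame_time_ms // compress_time_ms)
```

Under these assumptions, `max_servers_per_compressor(FRAME_TIME_MS, COMPRESS_TIME_MS)` yields 16, matching the text.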
Note that part of the state for each server's compression process includes information about the compression itself, such as the previous frame's decompressed frame buffer data (which may be used as a reference for P tiles), the resolution of the video output, the quality of the compression, the tiling structure, the allocation of bits per tile, the compression quality, and the audio format (e.g., stereo, surround sound, Dolby AC-3). But the compression process state also includes communication channel state information regarding the peak data rate 941 and whether a previous frame (as illustrated in fig. 9b) is currently being output (and as a result the current frame should be ignored), and potentially whether there are channel characteristics that should be considered in the compression, such as excessive packet loss, which affect decisions for the compression (e.g., in terms of the frequency of I tiles, etc.). As the peak data rate 941 or other channel characteristics change over time, as determined by an app/game server 1521-1525 supporting each user monitoring data sent back from the client 415, the app/game server 1521-1525 sends the relevant information to the common hardware compression 1530.
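The per-stream state the paragraph enumerates could be gathered into a single structure that the shared pool saves and restores on each context switch. All field names here are assumptions chosen for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class CompressionState:
    # compression-side state
    prev_frame_buffer: bytes = b""        # decompressed previous frame (P-tile reference)
    resolution: tuple = (1280, 720)
    tile_structure: str = "8x8"
    bits_per_tile: dict = field(default_factory=dict)
    audio_format: str = "stereo"          # e.g. stereo, surround, Dolby AC-3
    # communication-channel state
    peak_data_rate_bps: int = 0           # peak data rate 941
    prev_frame_still_transmitting: bool = False  # if True, skip the current frame
    packet_loss_rate: float = 0.0         # high loss may raise I-tile frequency
```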
The common hardware compression 1530 also packetizes the compressed video/audio using means such as those previously described, and if appropriate, applies FEC codes, duplicates certain data, or takes other steps to adequately ensure that the video/audio data stream can be received by the client 415 and decompressed with as high a quality and reliability as feasible.
Some applications, such as those described below, require the video/audio output of a given app/game server 1521-1525 to be available at multiple resolutions (or in other multiple formats) simultaneously. If the app/game server 1521-1525 so notifies the common hardware compression 1530 resource, then the uncompressed video/audio 1529 of that app/game server 1521-1525 will be simultaneously compressed in different formats, different resolutions, and/or in different packet/error-correction structures. In some cases, some compression resources can be shared among multiple compression processes compressing the same video/audio (e.g., in many compression algorithms there is a step whereby the image is scaled to multiple sizes before applying compression; if differently-sized images are required to be output, then this step can serve several compression processes at once). In other cases, separate compression resources will be required for each format. In any case, the compressed video/audio 1539 of all of the various resolutions and formats required for a given app/game server 1521-1525 (be it one or many) will be output at once to outbound routing 1540. In one embodiment, the output of the compressed video/audio 1539 is in UDP format, so it is a unidirectional stream of packets.
The outbound routing network 1540 comprises a series of routing servers and switches which direct each compressed video/audio stream to the intended user(s) or other destinations through the outbound internet traffic 1599 interface (which typically would connect to a fiber interface to the internet) and/or back to the delay buffer 1515, and/or back to the inbound routing 1502, and/or out through a private network (not shown) for video distribution. Note (as described below) that the outbound routing 1540 may output a given video/audio stream to multiple destinations at once. In one embodiment this is implemented using Internet Protocol (IP) multicast, in which a given UDP stream intended to be streamed to multiple destinations at once is broadcast, and the broadcast is repeated by the routing servers and switches in the outbound routing 1540. The multiple destinations of the broadcast may be to multiple users' clients 415 via the internet, to multiple app/game servers 1521-1525 via inbound routing 1502, and/or to one or more delay buffers 1515. Thus, the output of a given server 1521-1525 is compressed into one or multiple formats, and each compressed stream is directed to one or multiple destinations.
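The fan-out behavior can be modeled as delivering a copy of the same packet to every destination; a real deployment would rely on IP multicast so the duplication happens in the network, but this sketch (with hypothetical destination names) captures the one-stream, many-destinations idea:

```python
def fan_out(stream_packet: bytes, destinations: list) -> dict:
    """Deliver a copy of the same UDP packet to every destination
    (a client 415, another app/game server, or a delay buffer)."""
    return {dest: stream_packet for dest in destinations}

copies = fan_out(
    b"compressed-frame",
    ["client_415", "delay_buffer_1515", "app_game_server_1522"],
)
```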
Additionally, in another embodiment, if multiple app/game servers 1521-1525 are being used simultaneously by one user (e.g., in a parallel processing configuration to generate 3D output of a complex scene), with each server producing a portion of the resulting image, the video output of the multiple servers 1521-1525 can be combined by the common hardware compression 1530 into a combined frame, and from that point forward the combined frame is handled as described above, as if it came from a single app/game server 1521-1525.
Note that in one embodiment, a copy of all video generated by the app/game servers 1521-1525 (at least at the resolution of the video viewed by the user, or higher) is recorded in delay buffer 1515 for at least some number of minutes (15 minutes in one embodiment). This allows each user to "play back" the video from each session, in order to review previous work or exploits (in the case of a game). Thus, in one embodiment, each compressed video/audio output 1539 stream being routed to a user client 415 is also being multicast to a delay buffer 1515. When the video/audio is stored on a delay buffer 1515, a directory on the delay buffer 1515 provides a cross-reference between the network address of the app/game server 1521-1525 that is the source of the delayed video/audio and the location on the delay buffer 1515 where the delayed video/audio can be found.
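The cross-reference directory just described is essentially a map from a source server's network address to a storage location on the delay buffer. A minimal sketch, with the addresses and location strings being hypothetical examples:

```python
# Directory on delay buffer 1515: source server address -> recording location.
delay_directory = {}

def record(source_addr, buffer_location):
    """Register where the delayed video/audio from a source server is stored."""
    delay_directory[source_addr] = buffer_location

def lookup(source_addr):
    """Find the delayed video/audio for a given app/game server, if any."""
    return delay_directory.get(source_addr)
```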
Live, instantly viewable, instantly playable game
The app/game servers 1521-1525 may be used not only for running a given application or video game for a user, but they may also be used for creating the user interface applications for the hosting service 210 that support navigation through the hosting service 210 and other features. A screenshot of one such user interface application is shown in FIG. 16, a "Game Finder" screen. This particular user interface screen allows a user to watch 15 games that are being played live (or delayed) by other users. Each of the "thumbnail" video windows, such as 1600, is a live video window in motion showing the video from one user's game. The view shown in the thumbnail may be the same view that the user is seeing, or it may be a delayed view (e.g., if a user is playing a fighting game, the user may not want other users to see where he is hiding, and he may choose to delay any view of his game play by a period of time, say 10 minutes). The view may also be a camera view of a game that is different from any user's view. Through menu selections (not shown in this illustration), a user may choose a selection of games to view at once, based on a variety of criteria. As a small sampling of exemplary choices, the user may select a random selection of games (such as that shown in fig. 16), all of one kind of game (all being played by different players), only the top-ranked players of a game, players at a given level in a game, or lower-ranked players (e.g., if the player is learning the basics), players who are "buddies" (or rivals), games that have the most viewers, etc.
Note that generally, each user will decide whether the video from his game or application can be viewed by others and, if so, by which others, when it may be viewed, and whether it may only be viewed with a delay.
The app/game server 1521-1525 that is generating the user interface screen shown in fig. 16 acquires the 15 video/audio feeds by sending a message to the app/game server 1521-1525 of each user whose game it is requesting. The message is sent through the inbound routing 1502 or another network. The message will include the size and format of the video/audio requested, and will identify the user viewing the user interface screen. A given user may choose to select "privacy" mode and not permit any other users to view the video/audio of his game (either from his point of view or from another point of view), or, as described in the previous paragraph, a user may choose to allow the video/audio from his game to be viewed, but with the viewed video/audio delayed. A user app/game server 1521-1525 receiving and accepting a request to allow its video/audio to be viewed will acknowledge as such to the requesting server, and it will also notify the common hardware compression 1530 of the need to generate an additional compressed video stream in the requested format or screen size (assuming the format and screen size are different from those already being generated), and it will also indicate the destination for the compressed video (i.e., the requesting server). If the requested video/audio is only delayed, then the requesting app/game server 1521-1525 will be so notified, and it will acquire the delayed video/audio from a delay buffer 1515 by looking up the location of the video/audio in the directory on the delay buffer 1515, along with the network address of the app/game server 1521-1525 that was the source of the delayed video/audio. Once all of these requests have been generated and handled, up to 15 live thumbnail-sized video streams will be routed from the outbound routing 1540 to the inbound routing 1502 to the app/game server 1521-1525 generating the user interface screen, and will be decompressed and displayed by that server.
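The request/response exchange above has three outcomes for each feed: refused (privacy mode), granted live, or granted with delay. A hedged sketch of that decision, where the message fields and policy strings are illustrative assumptions rather than the patent's protocol:

```python
def handle_feed_request(viewer_id, size, fmt, source_policy):
    """Decide how a source server answers a Game Finder feed request.

    source_policy: "private"  -> refuse entirely
                   "delayed"  -> viewer must fetch from the delay buffer
                   "live"     -> request an extra compressed stream in
                                 the given size/format for the viewer
    """
    if source_policy == "private":
        return {"status": "denied"}
    if source_policy == "delayed":
        return {"status": "delayed", "lookup": "delay_buffer_1515"}
    return {"status": "live", "size": size, "format": fmt, "dest": viewer_id}
```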
Delayed video/audio streams may be at too large a screen size, and if so, the app/game server 1521-1525 will decompress the streams and scale the video streams down to thumbnail size. In one embodiment, requests for audio/video are sent to (and managed by) a central "management" service similar to the hosting service control system of fig. 4a (not shown in fig. 15), which then redirects the requests to the appropriate app/game server 1521-1525. Moreover, in one embodiment, no request may be required at all, because the thumbnails are "pushed" to the clients of those users that have permitted it.
The audio from 15 games all mixed simultaneously might create a cacophony of sound. A user may choose to mix all of the sounds together in this way (perhaps just to get a sense of the "din" created by all the action being viewed), or the user may choose to listen to the audio from just one game at a time. The selection of a single game is accomplished by moving the yellow selection box 1601 (appearing as a black rectangular outline in the black-and-white rendering of fig. 16) to a given game (the yellow box movement can be accomplished by using the arrow keys on a keyboard, by moving a mouse, by moving a joystick, or by pushing directional buttons on another device such as a mobile phone). Once a single game is selected, just the audio from that game plays. Also, game information 1602 is shown. In the case of this game, for example, the publisher logo (e.g., "EA" for "Electronic Arts") and the game logo (e.g., "Need for Speed Carbon", with an orange horizontal bar, reproduced in fig. 16 as a bar with vertical stripes) indicate in relative terms the number of people playing or watching the game at that particular moment (many, in this case, so the game is "Hot"). Further, "Stats" (i.e., statistics) are provided, indicating that there are 145 players actively playing 80 different instantiations of the game (i.e., it can be played either as an individual-player game or as a multiplayer game), and that there are 680 viewers (of which this user is one). Note that these statistics (and other statistics) are collected by the hosting service control system 401 and are stored on RAID arrays 1511-1512, for keeping logs of the hosting service 210 operation and for appropriately billing users and paying publishers who provide content. Some of the statistics are recorded as a result of actions by the service control system 401, and some are reported to the service control system 401 by the individual app/game servers 1521-1525.
For example, when a game is being viewed (and when the viewing of a game is stopped), the app/game server 1521-1525 running this Game Finder application sends a message to the hosting service control system 401 so that the hosting service control system 401 can update its statistics of how many games are in view. Some of the statistics are made available to user interface applications, such as this Game Finder application.
If the user clicks an activation button on their input device, they will see the thumbnail video in the yellow box zoom up, while continuing to play live video, to full screen size. This effect is shown in process in fig. 17. Note that video window 1700 has grown in size. To implement this effect, the app/game server 1521-1525 requests from the app/game server 1521-1525 running the selected game that a copy of the video stream for the game, at full screen size (at the resolution of the user's display device 422), be routed to it. The app/game server 1521-1525 running the game notifies the common hardware compressor 1530 that a thumbnail-sized copy of the game is no longer needed (unless another app/game server 1521-1525 requires such a thumbnail), and then it directs the common hardware compressor 1530 to send a full-screen-size copy of the video to the app/game server 1521-1525 zooming the video. The user playing the game may or may not have a display device 422 that is the same resolution as that of the user zooming up the game. Further, other viewers of the game may or may not have display devices 422 of the same resolution as the user zooming up the game (and may have different audio playback means, e.g., stereo or surround sound).
Thus, the common hardware compressor 1530 determines whether a suitable compressed video/audio stream that meets the requirements of the user requesting the video/audio stream is already being generated. If one exists, the common hardware compressor 1530 notifies the outbound routing 1540 to route a copy of the stream to the app/game server 1521-1525 zooming the video. If no suitable compressed stream exists, it compresses another copy of the video that is suitable for that user and instructs the outbound routing to send the stream back to the inbound routing 1502 and to the app/game server 1521-1525 zooming the video. This server, now receiving a full-screen version of the selected video, will decompress it and gradually scale it up to full size.
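The reuse-or-create decision above amounts to checking a set of streams already being generated. A minimal sketch under the assumption that a stream is identified by a (width, height, format) tuple, which is an illustrative simplification:

```python
def get_or_create_stream(existing: set, requested: tuple):
    """requested = (width, height, fmt).
    Returns (stream, created_new): route a copy of an existing stream
    when one matches, otherwise start a new compression process."""
    if requested in existing:
        return requested, False   # route a copy of the existing stream
    existing.add(requested)       # compress another copy for this user
    return requested, True
```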
Fig. 18 illustrates how the screen looks after the game has fully zoomed up to full screen and the game is shown at the full resolution of the user's display device 422, as indicated by the image pointed to by arrow 1800. The app/game server 1521-1525 running the Game Finder application sends messages to the other app/game servers 1521-1525 that had been providing thumbnails, indicating that the thumbnails are no longer needed, and a message to the hosting service control server 401 indicating that the other games are no longer being viewed. At this point, the only display it generates is an overlay 1801 at the top of the screen, which provides information and menu controls to the user. Note that as this game has progressed, the audience has grown to 2,503 viewers. With so many viewers, there are bound to be many viewers with display devices 422 of the same or similar resolution (each app/game server 1521-1525 has the ability to scale the video to adjust the fit).
Because the game shown is a multiplayer game, the user may decide to join the game at some point. The hosting service 210 may or may not allow the user to join the game, for a variety of reasons. For example, the user may have to pay to play the game and choose not to, the user may not have sufficient ranking to join that particular game (e.g., it would not be competitive for the other players), or the user's internet connection may not have low enough latency to allow the user to play (e.g., there is no latency constraint for viewing games, so a game being played far away (indeed, on another continent) can be watched without latency concerns, but for a game to be played, the latency must be low enough for the user to (a) enjoy the game and (b) be on equal footing with other players who may have lower-latency connections). If the user is permitted to play, then the app/game server 1521-1525 that had been providing the Game Finder user interface for the user will request that the hosting service control server 401 initiate (i.e., locate and launch) an app/game server 1521-1525 that is suitably configured for playing the particular game, to load the game from a RAID array 1511-1512. Then the hosting service control server 401 will instruct the inbound routing 1502 to transfer the control signals from the user to the app/game server now hosting the game, and it will instruct the common hardware compression 1530 to switch from compressing the video/audio from the app/game server that had been hosting the Game Finder application to compressing the video/audio from the app/game server now hosting the game. The vertical sync of the Game Finder app/game server and the new app/game server hosting the game are not synchronized, and as a result there is likely to be a time difference between the two syncs.
Because the common video compression hardware 1530 will begin compressing video as soon as an app/game server 1521-1525 completes a video frame, the first frame from the new server may be completed sooner than a full frame time of the old server, which may be before the prior compressed frame has completed its transmission (e.g., consider transmission time 992 of fig. 9b: if uncompressed frame 3 963 had been completed half a frame time early, it would impinge upon transmission time 992). In such a situation, the common video compression hardware 1530 will ignore the first frame from the new server (e.g., as frame 4 964 is ignored 974), the client 415 will hold the last frame from the old server for an extra frame time, and the common video compression hardware 1530 will begin compressing the next frame-time video from the new app/game server hosting the game. Visually, to the user, the transition from one app/game server to the other will be seamless. The hosting service control server 401 will then notify the app/game server 1521-1525 that had been hosting the Game Finder to switch to an idle state, until it is needed again.
The user is then able to play the game. And, what is exceptional is that the game will play perceptually instantly (since it was loaded onto the app/game server 1521-1525 from a RAID array 1511-1512 at gigabit/second speed), and the game will be loaded onto a server exactly suited for the game, with an ideal driver and registry configuration (in the case of Windows), together with an operating system configured exactly right for the game, and with no other applications running on the server that might compete with the game's operation.
Also, as the user progresses through the game, each of the segments of the game will load into the server from the RAID arrays 1511-1512 at gigabit/second speed (i.e., a gigabyte loads in 8 seconds), and because of the vast storage capacity of the RAID arrays 1511-1512 (since they are a shared resource among many users, they can be very large, yet still cost-effective), geometry setup or other game segment setup can be pre-computed and stored on the RAID arrays 1511-1512 and loaded extremely rapidly. Moreover, because the hardware configuration and computational capability of each app/game server 1521-1525 is known, pixel and vertex shaders can be pre-computed.
Thus, the game may start almost instantaneously, it will execute in an ideal environment, and subsequent segments will load almost instantaneously.
But beyond these advantages, the user will be able to watch others playing the game (via the Game Finder, described previously, and other means), both to decide whether the game is interesting, and if so, to learn tips from watching others. Further, the user will be able to demo the game instantly, without having to wait for a large download and/or installation, and the user will be able to play the game instantly (perhaps on a trial basis for a smaller fee, or on a longer-term basis). And the user will be able to play the game on a Windows PC, a Macintosh, on a television set, at home, while traveling, and even on a mobile phone with a wireless connection of sufficiently low latency (although latency will not be an issue just for spectating). And this can all be accomplished without ever physically owning a copy of the game.
As stated previously, a user may decide not to allow his game play to be viewable by others, to allow his game to be viewable after a delay, to allow his game to be viewable by selected users, or to allow his game to be viewable by all users. Regardless, in one embodiment the video/audio will be stored for 15 minutes in a delay buffer 1515, and the user will be able to "rewind" and view his prior game play, and pause, play it back slowly, fast forward, etc., just as he would be able to do watching TV with a Digital Video Recorder (DVR). Although in this example the user is playing a game, the same "DVR" capability is available if the user is using an application. This can be helpful in reviewing prior work and in other applications, as detailed below. Further, if the game was designed with the capability of rewinding based on utilizing game state information, such that the camera view can be changed, etc., then this "3D DVR" capability will also be supported, but it will require the game to be designed to support it. The "DVR" capability using a delay buffer 1515 will work with any game or application (limited, of course, to the video that was generated when the game or application was used), but in the case of games with 3D DVR capability, the user can control a "fly-through" in 3D of a previously played segment and have the delay buffer 1515 record the resulting video, as well as have the game state of the game segment recorded. Thus, a particular "fly-through" will be recorded as compressed video, but since the game state will also be recorded, a different fly-through of the same segment of the game will be possible at a later date.
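The 15-minute delay buffer behaves like a bounded ring buffer of recent frames, from which "rewind" retrieves a span of past video. A minimal sketch, with the class name and the fixed 60 fps rate being illustrative assumptions:

```python
from collections import deque

BUFFER_SECONDS = 15 * 60   # 15 minutes, per one embodiment
FPS = 60                   # assumed frame rate

class DelayBuffer:
    """Bounded buffer of recent compressed frames; old frames fall off."""
    def __init__(self):
        self.frames = deque(maxlen=BUFFER_SECONDS * FPS)

    def push(self, frame):
        self.frames.append(frame)

    def rewind(self, seconds_back: int):
        """Return the frames from the last `seconds_back` seconds."""
        n = seconds_back * FPS
        return list(self.frames)[-n:]
```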
As described below, users on the hosting service 210 will each have a User Page, where they can post information about themselves and other data. Among the things that users will be able to post are video segments from game play that they have saved. For example, if a user has overcome a particularly difficult challenge in a game, the user can "rewind" to just before the spot where he achieved his great accomplishment in the game, and then instruct the hosting service 210 to save a video segment of some duration (e.g., 30 seconds) on his User Page for other users to watch. To implement this, it is simply a matter of the app/game server 1521-1525 that the user is using playing back the video stored in a delay buffer 1515 to a RAID array 1511-1512 and then indexing that video segment on the user's User Page.
If the game has 3D DVR capabilities, as described above, the game state information needed for 3D DVR can also be recorded by the user and made available to the user's user page.
In the case where a game has the capability of "spectators" (i.e., users that are able to travel through the 3D world and observe the action without participating in it) in addition to active players, then the Game Finder application will enable users to join games as spectators as well as players. From an implementation point of view, there is no difference to the hosting service 210 whether a user is a spectator or an active player. The game will be loaded onto an app/game server 1521-1525 and the user will be controlling the game (e.g., controlling a virtual camera that views into the world). The only difference will be the game experience of the user.
Multiple user collaboration
Another feature of the hosting service 210 is the ability for multiple users to collaborate while watching live video (even if viewed using widely different devices). This is useful both when playing games and when using applications.
Many PCs and mobile phones are equipped with video cameras and have the ability to do real-time video compression, especially when the images are small. Also, small cameras are available, can be attached to televisions, and it is not difficult to implement real-time compression in software or using one of many hardware compression devices for compressing video. Also, many PCs and all mobile phones have microphones, and headsets are available with microphones.
A camera and/or microphone, combined with local video/audio compression capability (particularly employing the low-latency video compression techniques described herein), will enable a user to transmit video and/or audio from the user premises 211 to the hosting service 210, together with the input device control data. When such techniques are employed, then a capability illustrated in FIG. 19 is achievable: a user can have his video and audio 1900 appear on the screen within another user's game or application. This example is a multiplayer game, where teammates collaborate in a car race. A user's video/audio could be selectively viewable/hearable only by their teammates. And, since there would be effectively no latency, using the techniques described above, the players would be able to talk or gesture to each other in real time without perceptible delay.
The video/audio integration is accomplished by having the compressed video and/or audio from the user's camera/microphone arrive as inbound internet traffic 1501. Then the inbound routing 1502 routes the video and/or audio to the app/game servers 1521-1525 that are permitted to view/hear it. Then, the users of the respective app/game servers 1521-1525 that elect to use the video and/or audio decompress it and integrate it as desired to appear within the game or application, such as illustrated by 1900.
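The permission-gated routing step can be sketched as filtering the destination servers against the sender's viewing permissions. The function and the permissions table are illustrative assumptions:

```python
def route_camera_stream(sender, all_servers, permissions):
    """Return only those app/game servers permitted to view/hear
    the sender's camera/microphone stream.

    permissions: dict mapping sender -> set of permitted servers
                 (e.g., the sender's teammates' servers)."""
    allowed = permissions.get(sender, set())
    return [server for server in all_servers if server in allowed]
```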
The example of FIG. 19 shows how a collaboration is used in a game, but a collaboration can be an immensely powerful tool for applications. Consider a situation where a large building is being designed for New York City by architects in Chicago for a real estate developer based in New York, but the decision involves a financial investor who is traveling and happens to be at an airport in Miami, and a decision needs to be made about certain design elements of the building, in terms of how they fit in with the buildings near it, to satisfy both the investor and the real estate developer. Assume the architectural firm has a high-resolution monitor with a camera attached to a PC in Chicago, the real estate developer has a laptop with a camera in New York, and the investor has a mobile phone with a camera in Miami. The architectural firm can use the hosting service 210 to host a powerful architectural design application that is capable of highly realistic 3D rendering, and it can make use of a large database of the buildings in New York City, as well as a database of the building under design. The architectural design application will run on one of the app/game servers 1521-1525 (or on several of them, if it requires a great deal of computational power). Each of the 3 users at disparate locations will connect to the hosting service 210, and each will have a simultaneous view of the video output of the architectural design application, but it will be appropriately sized by the common hardware compression 1530 for the given device and network connection characteristics that each user has (e.g., the architectural firm may see a 2560x1440 60 fps display through a 20 Mbps commercial internet connection, the real estate developer in New York may see a 1280x720 60 fps image over a 6 Mbps DSL connection on his laptop, and the investor may see a 320x180 60 fps image over a 250 Kbps cellular data connection on her mobile phone).
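The per-device sizing in the example can be read as picking the largest stream tier a link's bandwidth supports. A hedged sketch, using the three bandwidth/resolution pairings from the text (the tier table and function name are illustrative, not the patent's mechanism):

```python
def choose_stream(bandwidth_kbps: int):
    """Pick the largest resolution tier the connection can sustain,
    mirroring the three example connections in the text."""
    tiers = [
        (20000, (2560, 1440)),  # e.g., 20 Mbps commercial internet connection
        (6000, (1280, 720)),    # e.g., 6 Mbps DSL connection
        (250, (320, 180)),      # e.g., 250 Kbps cellular data connection
    ]
    for required_kbps, resolution in tiers:
        if bandwidth_kbps >= required_kbps:
            return resolution
    return None  # connection too slow for any tier
```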
Each party will hear the voice of the other parties (the conference calling will be handled by any of many widely available conference calling software packages on the app/game server(s) 1521-1525), and, through actuation of a button on a user input device, a user will be able to make video of themselves appear, using their local camera. As the meeting proceeds, the architects will be able to show what the building looks like as they rotate it and fly by it next to the other buildings in the area, with extremely photorealistic 3D rendering, and the same video will be visible to all parties, at the resolution of each party's display device. It won't matter that none of the local devices used by any party is remotely capable of handling 3D animation with such realism, let alone downloading or even storing the vast database required to render the surrounding buildings in New York City. From the point of view of each of the users, despite the distance apart, and despite the disparate local devices, they will simply have a seamless experience with an incredible degree of realism. And, when one party wants their face to be seen to better convey their emotional state, they can do so. Further, if either the real estate developer or the investor wants to take control of the architectural program and use their own input device (be it a keyboard, mouse, keypad, or touch screen), they can, and it will respond with no perceptual latency (assuming their network connection does not have unreasonable latency). For example, in the case of the mobile phone, if the mobile phone is connected to a WiFi network at the airport, it will have very low latency. But if it is using the cellular data networks available today in the US, it will probably suffer from a noticeable lag. Still, for most of the purposes of the meeting, where the investor is watching the architects control the building fly-through or is talking on the video teleconference, even cellular latency should be acceptable.
Finally, at the end of the collaborative conference call, the real estate developer and the investor will have made their comments and signed off from the hosting service, and the architectural firm will be able to "rewind" the video of the conference that has been recorded on a delay buffer 1515 and review the comments, facial expressions, and/or actions applied to the 3D model of the building made during the meeting. If there are particular segments it wants to save, those segments of video/audio can be moved from the delay buffer 1515 to a RAID array 1511-1512 for archival storage and later playback.
Also, from a cost perspective, if the architects only need to use the computational power and the large database of New York City for a 15-minute conference call, they need only pay for the time the resources are in use, rather than having to own high-powered workstations and an expensive copy of the large database.
Video rich community service
The hosting service 210 gives rise to an opportunity to build video-rich community services on the Internet. FIG. 20 shows an exemplary user page for a game player on the hosting service 210. Like the Game Finder application, the user page is an application that runs on one of the app/game servers 1521-1525. All of the thumbnails and video windows on this page show constantly moving video (if the segments are short, they loop).
Using a video camera or by uploading video, a user (whose username is "KILLHAZARD") is able to publish videos 2000 of himself that other users can view. The videos are stored on the RAID arrays 1511-1512. Also, when other users come to KILLHAZARD's user page, if KILLHAZARD is using the hosting service 210 at the time, live video 2001 of whatever he is doing will be shown (assuming he permits users viewing his user page to watch him). This is accomplished by the app/game server 1521-1525 hosting the user page application requesting from the service control system 401 whether KILLHAZARD is active and, if so, which app/game server 1521-1525 he is using. Then, using the same methods used by the Game Finder application, a compressed video stream of suitable resolution and format is sent to the app/game server 1521-1525 running the user page application, and it is displayed. If a user selects the window with KILLHAZARD's live game play and then clicks appropriately on their input device, the window will zoom in (again using the same methods as the Game Finder application), and the live video will fill the screen at the resolution of the watching user's display device 422, appropriate for the characteristics of the watching user's Internet connection.
A key advantage of this over prior art approaches is that a user viewing a user page is able to see a game played live that the user does not own, and may well not have a local computer or game console capable of playing. It offers the user a great opportunity to see the user shown on the user page as "in action" playing a game, and it is an opportunity to learn about a game that the watching user might want to try or get better at.
Camera-recorded or uploaded video clips from KILLHAZARD's buddies 2002 are also shown on the user page, and underneath each video clip is text indicating whether the buddy is online playing a game (e.g., six_shot is playing the game "Eragon" (shown here as Game4) and MrSnuggles99 is offline, etc.). By clicking on a menu item (not shown), the buddy video clips switch from showing recorded or uploaded videos to live video of whatever the buddies who are currently playing games on the hosting service 210 are doing at that moment in their games. So, it becomes a Game Finder grouping of buddies. If a buddy's game is selected and the user clicks on it, the game will zoom up to full screen, and the user will be able to watch the game played live, full screen.
Again, the user viewing the partner's game does not own a copy of the game, nor does it own the local computing/gaming console resources for playing the game. The game viewing is effectively instantaneous.
As previously described above, when a user plays a game on the hosting service 210, the user is able to "rewind" the game, find a video segment he wants to save, and then save that video segment to his user page. These are called "Brag Clips". The video segments 2003 are Brag Clips 2003 saved by KILLHAZARD from previous games that he has played. Number 2004 shows how many times a Brag Clip has been viewed, and when the Brag Clip is viewed, users have an opportunity to rate it; the number of orange keyhole-shaped icons 2005 (shown here in black outline) indicates how high the rating is. The Brag Clips 2003 loop constantly, along with the rest of the video on the page, as a user views the user page. If the user selects and clicks on one of the Brag Clips 2003, it zooms up to present the Brag Clip 2003, along with DVR controls to allow the clip to be played, paused, rewound, fast-forwarded, stepped through, etc.
Brag clip 2003 playback is implemented by the app/game server 1521-1525 loading the compressed video segments stored on the RAID arrays 1511-1512 when the user recorded the brag clip and decompressing and playing them back.
The Brag Clips 2003 can also be "3D DVR" video segments (i.e., a sequence of game states from a game that can be replayed and that allows the user to change the camera viewpoint) from games that support 3D DVR capability. In this case, the game state information is stored in addition to a compressed video recording of the particular "fly-through" the user made when the game segment was recorded. When the user page is being viewed and all of the thumbnails and video windows are constantly looping, a 3D DVR Brag Clip 2003 will constantly loop the Brag Clip 2003 that was recorded as compressed video when the user recorded the "fly-through" of the game segment. But, when a user selects a 3D DVR Brag Clip 2003 and clicks on it, in addition to the DVR controls to allow the compressed video Brag Clip to be played, the user will be able to click on a button that gives them 3D DVR capability for the game segment. They will be able to control a camera "fly-through" during the game segment independently and, if they wish (and the user who owns the user page allows it), they will be able to record an alternative Brag Clip "fly-through" in compressed video form, which will then be available to other viewers of the user page (either immediately, or after the owner of the user page has had a chance to review the Brag Clip).
This 3D DVR Brag Clip 2003 capability is enabled by activating, on another app/game server 1521-1525, a game that is about to replay the recorded game state information. Since the game can be activated almost instantaneously (as previously described), it is not difficult to activate it, with its play limited to the game state recorded by the Brag Clip, and then allow the user to do a "fly-through" with a camera while recording compressed video to a delay buffer 1515. Once the user has completed the "fly-through", the game is deactivated.
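The delay buffer 1515 behavior described here, continuous recording into a bounded window that a user can "rewind", with segments extractable for archival, can be sketched as a fixed-capacity ring buffer. This is a minimal illustration; the capacity, the frame representation, and all names are assumptions, not the patent's implementation.

```python
# Minimal sketch of a delay buffer like 1515: a ring of the most
# recent N seconds of compressed frames. Older frames fall off the
# back; a segment can be "rewound" and copied out for archival
# (e.g., moved to longer-term storage like the RAID arrays 1511-1512).

from collections import deque

class DelayBuffer:
    def __init__(self, seconds, fps=60):
        # deque with maxlen silently discards the oldest entry on append
        self.frames = deque(maxlen=seconds * fps)

    def record(self, frame):
        """Append the latest compressed frame; oldest drops off if full."""
        self.frames.append(frame)

    def rewind(self, start, end):
        """Extract frames [start:end] of the buffered window, e.g. to
        save a clip for later playback."""
        return list(self.frames)[start:end]

buf = DelayBuffer(seconds=2, fps=4)      # tiny illustrative buffer: 8 frames
for i in range(10):
    buf.record(f"frame-{i}")             # frames 0 and 1 fall off the back
print(buf.rewind(0, 3))                  # ['frame-2', 'frame-3', 'frame-4']
```

A real delay buffer would hold compressed video packets and be indexed by time rather than frame count, but the bounded-window semantics are the same.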
From the user's point of view, activating a "fly-through" with a 3D DVR Brag Clip 2003 is no more effort than controlling the DVR controls of a linear Brag Clip 2003. They may know nothing about the game, or even how to play the game. They are just a virtual camera operator peering into a 3D world during a game segment recorded by another user.
Users will also be able to overdub their own audio onto Brag Clips, either recorded from a microphone or uploaded. In this way, Brag Clips can be used to create custom animations using characters and actions from games. This animation technique is commonly known as "machinima".
As the user progresses through the game, they will reach different skill levels. The game played will report the outcome to the service control system 401 and the skill level will also be displayed on the user page.
Interactive animated advertising
Online advertising has transitioned from text, to still images, to video, and now to interactive segments, typically implemented using animation thin clients such as Adobe Flash. The reason animation thin clients are used is that users typically have little patience for being delayed for the privilege of having a product or service pitched to them. Also, thin clients run on very low-performance PCs, and as such, advertisers can have a high degree of confidence that the interactive ad will work properly. Unfortunately, animation thin clients such as Adobe Flash are limited in the degree of interactivity and the duration of the experience (to mitigate download time and to be operable on almost all user devices, including low-performance PCs and Macs without GPUs or high-performance CPUs).
FIG. 21 illustrates an interactive advertisement in which the user is to select the exterior and interior colors of a car while the car rotates in a showroom, as real-time ray tracing shows how the car looks. Then the user chooses an avatar to drive the car, and then the user can take the car for a drive, either on a race track or through an exotic locale such as Monaco. The user can select a larger engine or better tires, and then can see how the changed configuration affects the ability of the car to accelerate or hold the road.
Of course, the advertisement is effectively a sophisticated 3D video game. But for such an advertisement to be playable on a PC or a video game console, it would require perhaps a 100MB download and, in the case of the PC, it might require the installation of special drivers, and it might not run at all if the PC lacks adequate CPU or GPU compute capability. Thus, such advertisements are impractical in prior art configurations.
In the hosting service 210, such advertisements launch almost instantly and run perfectly, no matter what the user's client 415 capabilities are. So, they launch more quickly than thin-client interactive ads, are vastly richer in the experience, and are highly reliable.
Streaming geometry during real-time animation
The RAID arrays 1511-1512 and the inbound routing 1502 can provide data rates that are so fast, and with latencies so low, that it is possible to design video games and applications that rely upon the RAID arrays 1511-1512 and the inbound routing 1502 to reliably deliver geometry on-the-fly in the midst of game play, or in an application during real-time animation (e.g., a fly-through with a complex database).
With prior art systems, such as the video game system shown in FIG. 1, the mass storage devices available, particularly in practical home devices, are far too slow to stream geometry in during game play, except in situations where the required geometry is somewhat predictable. For example, in a driving game where there is a specified roadway, geometry for the buildings coming into view can be reasonably well predicted, and the mass storage devices can seek in advance to where the upcoming geometry is located.
But in a complex scene with unpredictable changes (e.g., in a battle scene with complex characters all around), if the RAM on the PC or video game system is completely filled with geometry for the objects currently in view, and then the user suddenly turns their character around to view what is behind them, there may be a delay before the geometry can be displayed if it has not been pre-loaded into RAM.
In the hosting service 210, the RAID arrays 1511-1512 can stream data at Gigabit Ethernet speeds, and over a SAN network, 10 Gigabit Ethernet or other network technologies at 100 gigabit/second speeds are possible. At 10 gigabits/second, a gigabyte of data can be loaded in less than a second. In a 60fps frame time (16.67ms), approximately 170 megabits (21MB) of data can be loaded. Rotating media, of course, even in a RAID configuration, will still incur latencies greater than a frame time, but Flash-based RAID storage will eventually be as large as rotating-media RAID arrays and will not incur such high latency. In one embodiment, massive RAM write-through caching is used to provide very low-latency access.
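The bandwidth arithmetic above is straightforward to verify; the sketch below uses the 10 gigabit/second link speed and 60fps frame time quoted in the text (the function names are illustrative only).

```python
# Streaming-geometry bandwidth budget: how much data can be delivered
# within a single 60fps frame time at a given link speed, matching the
# 10 Gb/s SAN example in the text.

def frame_time_ms(fps=60.0):
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps

def bits_per_frame(link_bps, fps=60.0):
    """Bits deliverable within a single frame time."""
    return link_bps / fps

ten_gbe = 10e9                          # 10 Gigabit Ethernet, bits/second

frame_bits = bits_per_frame(ten_gbe)    # ~166.7 megabits per frame
frame_bytes = frame_bits / 8            # ~21 MB per frame
gigabyte_load_s = (1e9 * 8) / ten_gbe   # 0.8 s to load 1 GB (8 gigabits)

print(f"frame time: {frame_time_ms():.2f} ms")
print(f"per-frame budget: {frame_bits/1e6:.0f} Mb ({frame_bytes/1e6:.0f} MB)")
print(f"1 GB load time: {gigabyte_load_s:.1f} s")
```

This confirms the figures in the text: roughly 170 megabits (about 21MB) per 16.67ms frame, and under one second to load a gigabyte.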
Thus, with a network speed that is fast enough, and with mass storage that has low enough latency, geometry can be streamed into the app/game servers 1521-1525 as fast as the CPUs and/or GPUs can process the 3D data. So, in the example previously given, where the user turns their character around suddenly and looks back, the geometry for all of the characters behind can be loaded before the character completes the rotation, and thus, to the user, it will seem as if they are in a photorealistic world that is as real as live action.
As discussed previously, one of the last frontiers in photorealistic computer animation is the human face, and because of the sensitivity of the human eye to imperfections, the slightest error from a photoreal face can result in a negative reaction from the viewer. FIG. 22 shows how the use of Contour™ Reality Capture technology (subject of the following co-pending applications, each of which is assigned to the assignee of the present CIP application: Ser. No. 10/942,609, filed 9/15/2004, "Apparatus and method for capturing the motion of a performer"; Ser. No. 10/942,413, filed 9/15/2004, "Apparatus and method for capturing the expression of a performer"; Ser. No. 11/066,954, filed 2/25/2005, "Apparatus and method for improving marker identification within a motion capture system"; Ser. No. 11/077,628, filed 3/10/2005, "Apparatus and method for performing motion capture using shutter synchronization"; "Apparatus and method for performing motion capture using a random pattern on capture surfaces"; "System and method for performing motion capture using phosphor application techniques"; Ser. No. 11/449,043, filed 6/7/2006, "System and method for performing motion capture by strobing a fluorescent lamp"; and Ser. No. 11/449,127, filed 6/7/2006, "System and method for three dimensional capture of stop-motion animated characters") results in a very smooth captured surface, and thereby in a high-polygon-count tracked surface (i.e., the polygon motion follows the motion of the face precisely). Finally, when the video of the live performance is mapped onto the tracked surface to produce a textured surface, a photoreal result is produced.
While current GPU technology is capable of rendering many polygons in a tracked surface and texture and illuminating that surface in real time, if the polygons and textures change every frame time (which would produce the most photorealistic results), it would quickly consume all of the available RAM of a modern PC or video game console.
Using the streaming geometry techniques described above, it becomes practical to constantly feed geometry into the app/game servers 1521-1525 so that they can constantly animate photo-realistic faces, allowing video games to be generated with faces that are almost indistinguishable from live action faces.
Integration of linear content with interactive features
Movies, television programs, and audio material (collectively, "linear content") are widely available in many forms to home and office users. Linear content can be acquired on physical media such as CD, DVD, and blu-ray media. It can also be recorded by DVR from satellite and cable TV broadcasts. In addition, it may be available via satellite and cable TV pay-per-view (PPV) content and with video-on-demand (VOD) over cable TV.
Increasingly, linear content is available over the Internet in both downloaded and streamed form. Today, there simply is no one place to go to experience all of the features associated with linear media. For example, DVDs and other video optical media typically have interactive features not available elsewhere, such as director's commentaries and "making of" featurettes. Online music sites have cover art and song information generally not available on CDs, but not all CDs are available online. And web sites associated with television shows often have extra features, blogs, and sometimes comments from the actors or creative staff.
Further, with many motion pictures or sports events, there are often video games that are released (in the case of motion pictures) along with the linear media, or (in the case of sports) that can be closely tied to real-world events (e.g., player trades).
The hosting service 210 is well suited to delivering linear content when joining disparate forms of related content together. Indeed, delivering movies is less challenging than delivering highly interactive video games, and the hosting service 210 is able to deliver linear content to a variety of devices in the home or office, or to mobile devices. FIG. 23 shows an exemplary user interface page for the hosting service 210 that displays the selection of linear content.
But, unlike most linear content delivery systems, the hosting service 210 is also able to deliver the related interactive components (e.g., the menus and features on DVDs, the interactive overlays on HD-DVDs, and the Adobe Flash animation on web sites (as explained below)). Thus, client device 415 limitations no longer introduce limits as to which features are available.
In addition, the hosting system 210 is capable of linking linear content with video game content dynamically and in real time. For example, if a user is watching a Quidditch game in a Harry potter movie and decides that they would like to try to play Quidditch, they may simply click a button and the movie will pause and it will be immediately delivered to the Quidditch segment of the Harry potter video game. After playing the quididitch game, another click of the button will resume and the movie will immediately begin again.
In the case of photo-realistic graphics and production techniques, where the video captured by the camera is indistinguishable from live action characters, the two scenes are virtually indistinguishable when the user makes a transition from the Quidditch game in live action movies to the Quidditch game in video games on a hosted service (as described herein). This provides entirely new authoring options for directors of both linear and interactive (e.g., video game) content, as the lines between the two worlds become indistinguishable.
With the hosting service architecture shown in fig. 14, control of the virtual camera in a 3D movie can be provided to the viewer. For example, in a scene occurring within a train, it would be possible to allow a viewer to control the virtual camera and look around the train as the story progresses. This assumes that all 3D objects ("assets") in the train are available, as well as a sufficient level of computing power to be able to render the scene and the original movie in real time.
And, even for non-computer-generated entertainment, there are very exciting interactive features that can be offered. For example, the 2005 motion picture "Pride and Prejudice" had many scenes in ornate old English mansions. For certain mansion scenes, the user can pause the video and then control the camera to take a tour of the mansion, or perhaps the surrounding area. To implement this, a camera with a fisheye lens can be carried through the mansion while its position is tracked, much like prior-art Apple QuickTime VR. The various frames would then be transformed so the images are not distorted, and then stored on the RAID arrays 1511-1512 along with the movie, and played back when the user chooses to go on a virtual tour.
With sports events, a live sports event, such as a basketball game, can be streamed through the hosting service 210 for users to watch, as they would for regular TV. After users watch a particular play, a video game of the game (eventually with the basketball players looking as photoreal as the real players) can come up with the players starting in the same position, and the users (perhaps each taking control of one player) can redo the play to see if they can do better than the players.
The hosting service 210 described herein is extremely well suited to support this futuristic world because it is able to bring to bear computing power and mass storage resources that are impractical to install in a home or in most office settings, and its computing resources are always up to date, with the latest computing hardware available, whereas in a home setting, there will always be homes with older-generation PCs and video games. And, in the hosting service 210, all of this computing complexity is hidden from the user, so even though they may be using very sophisticated systems, from the user's point of view, it is as simple as changing channels on a television. Further, the users will be able to access all of this computing power, and the experiences the computing power brings, from any client 415.
Multi-player game
To the extent a game is a multiplayer game, it will be able to communicate both to app/game servers 1521-1525 through the inbound routing 1502 network and, with a network bridge to the Internet (not shown), with servers or game machines that are not running in the hosting service 210. When playing a multiplayer game with computers on the general Internet, the app/game servers 1521-1525 will have the benefit of extremely fast access to the Internet (compared to if the game were running on a server at home), but they will be limited by the capabilities of the other computers playing the game on slower connections, and also potentially limited by the fact that the game servers on the Internet were designed to accommodate the least common denominator, which would be relatively slow consumer home computers on relatively slow Internet connections.
But a world of difference is achievable when a multiplayer game is played entirely within a hosting service 210 server center. Each app/game server 1521-1525 hosting a game for a user will be interconnected with the other app/game servers 1521-1525, as well as with any servers hosting the central control for the multiplayer game, with extremely high speed, extremely low latency connectivity and vast, very fast storage arrays. For example, if Gigabit Ethernet is used for the inbound routing 1502 network, the app/game servers 1521-1525 will be communicating among each other, and communicating to any servers hosting the central control for the multiplayer game, at gigabit/second speeds with potentially only 1 millisecond or less latency. Further, the RAID arrays 1511-1512 will be able to respond very rapidly and then transfer data at gigabit/second speeds. As an example, if a user customizes a character in terms of look and accoutrements such that the character has a large amount of geometry and behaviors that are unique to the character, with prior art systems limited to the game client running in the home on a PC or game console, if that character were to come into view of another user, the user would have to wait until a long, slow download completed so that all of the geometry and behavior data loaded into their computer. Within the hosting service 210, that same download could be served over Gigabit Ethernet from the RAID arrays 1511-1512 at gigabit/second speed. Even if the home user had an 8Mbps Internet connection (which is extremely fast by today's standards), Gigabit Ethernet is 100 times faster. So, what would take a minute over a fast Internet connection would take less than a second over Gigabit Ethernet.
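The transfer-time comparison above can be checked with simple arithmetic. The 60MB character size below is an illustrative assumption (the text gives no figure); the 8Mbps and gigabit link speeds are the ones quoted.

```python
# Compare download time for a custom character's geometry/behavior
# data over a home 8Mbps connection vs. Gigabit Ethernet inside the
# server center. The 60 MB payload size is an assumption for
# illustration only.

def transfer_seconds(size_bytes, link_bps):
    """Idealized transfer time: payload bits over link speed."""
    return size_bytes * 8 / link_bps

character_bytes = 60e6                                 # assumed 60 MB payload

home = transfer_seconds(character_bytes, 8e6)          # 8 Mbps home link
datacenter = transfer_seconds(character_bytes, 1e9)    # Gigabit Ethernet

print(f"home 8Mbps:       {home:.0f} s")               # about a minute
print(f"Gigabit Ethernet: {datacenter:.2f} s")         # well under a second
print(f"speedup:          {home / datacenter:.0f}x")   # 125x
```

The 125x ratio is simply 1 Gbps / 8 Mbps, consistent with the text's rounded "100 times faster" and its minute-versus-under-a-second comparison.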
Top player grouping and tournament
The hosting service 210 is extremely well suited for tournaments. Because no game is running on a local client, there is no opportunity for users to cheat (e.g., in prior art tournaments, users could gain an unfair advantage by modifying the copy of the game running on their local PC). Also, because of the ability of the output routing 1540 to multicast the UDP streams, the hosting service 210 is able to broadcast a major tournament to thousands or more people in the audience at once.
In fact, when certain video streams are so popular that thousands of users are receiving the same stream (e.g., showing the view of a major tournament), it may be more efficient to send the video stream to a Content Delivery Network (CDN), such as Akamai or Limelight, for mass distribution to many client devices 415.
A similar level of efficiency may be obtained when using a CDN to display a game finder page for a top level player grouping.
For major tournaments, a live celebrity announcer can be used to provide commentary during certain matches. Although a large number of users will be watching a major tournament, relatively few will be playing in it. Audio from the celebrity announcer can be routed to the app/game servers 1521-1525 hosting the users playing in the tournament and hosting any spectator-mode copies of the game in the tournament, and the audio can be overdubbed on top of the game audio. Video of the celebrity announcer can be overlaid on the game, perhaps just on the spectator views as well.
Acceleration of web page loading
The World Wide Web and its primary transport protocol, Hypertext Transfer Protocol (HTTP), were conceived and defined in an era when only businesses had high-speed Internet connections, and consumers who were online used dial-up modems or ISDN. At the time, the "gold standard" for a fast connection was a T1 line, which provided a 1.5Mbps data rate symmetrically (i.e., with equal data rates in both directions).
Today, the situation is completely different. The average home connection speed via DSL or cable modem connections in a large number of developed worlds has a much higher downstream data rate than the T1 line. In fact, in some parts of the world, fiber-to-the-curb (fiber-to-the-curb) is bringing data rates up to 50Mbps to 100Mbps into the home.
Unfortunately, HTTP was not architected (nor was it implemented) to effectively take advantage of these dramatic speed improvements. A web site is a collection of files on a remote server. In very simple terms, HTTP requests the first file, waits for the file to be downloaded, then requests the second file, waits for that file to be downloaded, etc. In fact, HTTP allows for more than one "open connection" (i.e., more than one file to be requested at a time), but because of agreed-upon standards (and a desire to keep web servers from being overloaded), only very few open connections are permitted. On top of that, because of the way web pages are constructed, browsers often are not aware of multiple simultaneous files that could be available for immediate download (i.e., only after parsing a page does it become apparent that a new file, such as an image, needs to be downloaded). So, the files on a web site are essentially loaded one by one. And, because of the request-and-response protocol used by HTTP, there is roughly a 100-millisecond latency (when accessing typical web servers in the US) associated with each file that is loaded.
In the case of a relatively slow connection, this does not introduce much of a problem, since the download time for the profile itself determines the latency of the web page. However, as the connection speed increases (especially in the case of complex web pages), problems begin to arise.
In the example shown in FIG. 24, a fairly typical commercial web site is shown (this particular site is from a major athletic shoe brand). The web site has 54 files on it. The files include HTML, CSS, JPEG, PHP, JavaScript and Flash files, and include video content. A total of 1.5 MBytes must be loaded before the page is live (i.e., the user can click on it and begin to use it). There are a number of reasons for the large number of files. For one thing, it is a complex and sophisticated web page, and for another, it is a web page that is assembled dynamically based on information about the user accessing the page (e.g., what country the user is from, what language, whether the user has made purchases before, etc.), and depending on all of these factors, different files are downloaded. Still, it is a very typical commercial web page.
FIG. 24 shows the amount of time that elapses before the web page is live as the connection speed grows. At a 1.5Mbps connection speed 2401, using a conventional web server with a conventional web browser, it takes 13.5 seconds until the web page is live. At a 12Mbps connection speed 2402, the load time is reduced to 6.5 seconds, or about twice as fast. But at a 96Mbps connection speed 2403, the load time is only reduced to roughly 5.5 seconds. The reason for this is that at such a high download speed, the time to download the files themselves is minimal, but the latency of roughly 100 milliseconds per file remains, resulting in 54 files × 100ms = 5.4 seconds of latency. So, no matter how fast the connection to the home is, this web site will always take at least 5.4 seconds until it is live. Another factor is server-side queuing: every HTTP request is added to the back of a queue, so on a busy server this will have a significant impact because, for every small item to be fetched from the web server, the HTTP request needs to wait its turn.
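A first-order model of these sequential fetches, total bytes over the link plus a fixed per-file round trip, approximately reproduces the FIG. 24 figures. It deliberately ignores parallel connections and server-side queuing, so it slightly understates real load times; the function name is illustrative.

```python
# Sequential-HTTP page load model: transfer time for all bytes plus a
# ~100 ms round-trip latency per file. Roughly matches the 13.5 s /
# 6.5 s / 5.5 s figures quoted for FIG. 24.

def page_load_seconds(total_bytes, num_files, link_bps, rtt_s=0.100):
    transfer = total_bytes * 8 / link_bps   # raw download time
    latency = num_files * rtt_s             # fixed per-file round trips
    return transfer + latency

SITE_BYTES, SITE_FILES = 1.5e6, 54          # the FIG. 24 example site

for mbps in (1.5, 12, 96):
    t = page_load_seconds(SITE_BYTES, SITE_FILES, mbps * 1e6)
    print(f"{mbps:>5} Mbps -> {t:.1f} s")
```

Note how the 5.4-second latency floor (54 × 100ms) dominates at high link speeds: going from 12Mbps to 96Mbps saves less than a second, exactly the effect the text describes.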
One way to solve these problems is to discard or redefine HTTP. Or, perhaps, the web site owner could consolidate its files into a single file (e.g., in Adobe Flash format). But, as a practical matter, this company, and many others, has a large investment in its web site architecture. Further, while some homes have 12-100Mbps connections, the majority of homes still have slower speeds, and HTTP does work well at slow speeds.
One alternative is to host web browsers on the app/game servers 1521-1525, and host the files for the web servers on the RAID arrays 1511-1512 (or, potentially, in RAM or on local storage on the app/game servers 1521-1525 hosting the web browsers). Because of the very fast interconnect through the inbound routing 1502 (or to local storage), rather than having 100ms of latency per file using HTTP, there will be de minimis latency per file using HTTP. Then, instead of the user in their home accessing web pages through HTTP, the user can access web pages through the client 415. Then, even with a 1.5Mbps connection (because the video of this web page does not require much bandwidth), the web page will be live, per line 2400, in less than 1 second. Essentially, there will be no latency until the web browser running on an app/game server 1521-1525 displays a live page, and then no detectable latency until the client 415 displays the video output from the web browser. As the user mouses around and/or types on the web page, the user's input information will be sent to the web browser running on the app/game server 1521-1525, and the web browser will respond accordingly.
One disadvantage to this approach is that, if the compressor is constantly transmitting video data, then bandwidth is used even if the web page becomes static. This can be remedied by configuring the compressor to transmit data only when (and if) the web page changes, and then, only transmit data for the parts of the page that changed. While there are some web pages with flashing banners, etc. that are constantly changing, such web pages tend to be annoying, and usually web pages are static unless there is a reason for something to be moving (e.g., a video clip). For such web pages, it is likely the case that less data will be transmitted using the hosting service 210 than with a conventional web server, because only the actual displayed images will be transmitted, no thin-client executable code, and no large objects that may never be viewed, such as rollover images.
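The change-only transmission described above can be sketched as a dirty-tile comparison: the current frame is compared against the previous one, and only tiles that differ need to be encoded and sent. The tile size and the flat-list frame representation below are illustrative assumptions, not the patent's compressor.

```python
# Sketch of change-only transmission: compare the current frame to the
# previous one, tile by tile, and report only the tiles whose pixels
# changed (those are the only regions that need re-encoding/sending).

TILE = 16  # tile edge in pixels (illustrative choice)

def dirty_tiles(prev, curr, width, height):
    """Yield (x, y) origins of TILE x TILE tiles whose pixels changed.
    Frames are flat, row-major lists of pixel values."""
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            for y in range(ty, min(ty + TILE, height)):
                row = y * width
                lo, hi = row + tx, row + min(tx + TILE, width)
                if prev[lo:hi] != curr[lo:hi]:
                    yield (tx, ty)   # tile differs; move on to next tile
                    break

w = h = 32
prev = [0] * (w * h)
curr = list(prev)
curr[5 * w + 20] = 1                          # one changed pixel at (20, 5)
print(list(dirty_tiles(prev, curr, w, h)))    # [(16, 0)]
```

A static page yields no dirty tiles at all, so nothing is transmitted, which is the bandwidth saving the text describes.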
Thus, using the hosting service 210 to host legacy web pages, web page load times can be reduced to the point where opening a web page is like changing channels on a television: the web page is effectively live instantly.
Facilitating debugging of games and applications
As mentioned previously, video games and applications with real-time graphics are very complex applications, and typically, when they are released into the field, they contain bugs. Although software developers do get feedback from users about bugs, and they may have some means to pass back machine state after crashes, it is very difficult to identify exactly what caused a game or real-time application to crash or to perform improperly.
When a game or application runs in the hosting service 210, the video/audio output of the game or application is constantly recorded on the delay buffer 1515. Further, a watchdog process runs on each app/game server 1521-1525, reporting regularly to the hosting service control system 401 that the app/game server 1521-1525 is running smoothly. If the watchdog process fails to report in, then the server control system 401 will attempt to communicate with the app/game server 1521-1525, and if successful, will collect whatever machine state is available. Whatever information is available, along with the video/audio recorded by the delay buffer 1515, will be sent to the software developer.
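The watchdog reporting scheme above can be sketched as a heartbeat monitor: each server posts a periodic heartbeat, and the control side flags any server whose last heartbeat is older than a timeout so machine state can be collected. All names and the 5-second timeout are illustrative assumptions, not values from the patent.

```python
# Sketch of the watchdog pattern: servers heartbeat periodically; the
# monitor lists servers whose last report is older than the timeout,
# which would trigger state collection and developer notification.

import time

class HeartbeatMonitor:
    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_seen = {}          # server id -> last heartbeat time

    def heartbeat(self, server_id, now=None):
        """Record that a server just reported in (now overrides the clock
        for testing)."""
        self.last_seen[server_id] = time.time() if now is None else now

    def stalled(self, now=None):
        """Servers whose last heartbeat is older than the timeout."""
        t = time.time() if now is None else now
        return [sid for sid, seen in self.last_seen.items()
                if t - seen > self.timeout_s]

mon = HeartbeatMonitor(timeout_s=5.0)
mon.heartbeat("app-game-1521", now=100.0)
mon.heartbeat("app-game-1522", now=103.0)
print(mon.stalled(now=106.5))    # ['app-game-1521'] — 6.5 s since last report
```

In a production system the heartbeat would travel over the inbound network and the stalled-server handler would attempt to pull machine state and the delay-buffer recording, as the text describes.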
Thus, when the game or application software developer gets notified of a crash by the hosting service 210, it gets a frame-by-frame record of what led up to the crash. This information can be very valuable in tracking down and repairing defects.
It should also be noted that when an app/game server 1521-1525 crashes, the server is restarted at the most recent restartable point, and a message is provided to the user apologizing for the technical difficulty.
Resource sharing and cost savings
The systems shown in FIGS. 4a and 4b provide a number of benefits to both end users and to game and application developers. For example, home and office client systems (e.g., PCs or game consoles) are typically in use for only a small percentage of the hours of the week. According to an October 5, 2006 press release by Nielsen Entertainment, "Active Gamer Benchmark Study" (http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/10-05-2006/0004446115&EDATE=), active gamers spend an average of 14 hours per week playing on video game consoles and about 17 hours per week playing on handhelds. The report also states that active gamers average 13 hours per week across all gameplay activity, including console, handheld, and PC gameplay. Taking the higher figure of console video game play time, there are 24 × 7 = 168 hours in a week, which implies that in an active gamer's home, a video game console is in use for only 17/168 = 10% of the hours of the week. Put another way, the video game console is idle 90% of the time. Given the high cost of video game consoles, and the fact that manufacturers subsidize such devices, this is a very inefficient use of an expensive resource. PCs within businesses are also typically used during only a fraction of the hours of the week, especially non-portable desktop PCs, which are often required for high-end applications such as Autodesk Maya. Although some businesses operate at all hours and on holidays, and some PCs (e.g., portables brought home for evening work) are used at all hours and on holidays, most business activity tends to be concentrated around 9AM to 5PM, Monday through Friday, in a given business's time zone, excluding holidays and break times (such as lunch); and because most PC usage occurs while a user is actively engaged with the PC, it follows that desktop PC utilization tends to follow these hours of operation.
If we assume that a PC is in continuous use from 9AM to 5PM, five days a week, that would imply a PC is in use for 40/168 = 24% of the hours of the week. High-performance desktop PCs are a very expensive investment for businesses, and this reflects a very low level of utilization. Schools that teach on desktop computers may use the computers for an even smaller fraction of the week, and although this varies with teaching hours, most teaching occurs during daytime hours Monday through Friday. So, in general, PCs and video game consoles are utilized for only a small fraction of the hours of the week.
Notably, because many people work at businesses or attend school during daytime hours, Monday through Friday, excluding holidays, these people generally do not play video games during those hours; when they do play video games, it is typically during other hours, such as evenings, weekends, and holidays.
Given the hosting service configuration shown in FIG. 4a, the usage patterns described in the preceding two paragraphs result in very efficient utilization of resources. Obviously, there is a limit to the number of users that can be served by the hosting service 210 at a given time, particularly if the users require real-time responsiveness for complex applications, such as sophisticated 3D video games. But, unlike a video game console in a home, or a PC used by a business, which is typically idle most of the time, a server 402 can be reused by different users at different times. For example, a high-performance server 402 with high-performance dual CPUs, dual GPUs, and a large amount of RAM can be utilized by businesses and schools from 9AM to 5PM on non-holidays, and by gamers playing sophisticated video games in the evenings, on weekends, and on holidays. Similarly, low-performance applications can be utilized by businesses and schools during business hours on low-performance servers 402 with a Celeron CPU, no GPU (or a very low-end GPU), and limited RAM, and low-performance games can utilize those low-performance servers 402 during non-business hours.
In addition, with the hosting service arrangement described herein, resources are effectively shared among thousands, if not millions, of users. In general, online services have only a small percentage of their total user base using the service at a given time. Considering the Nielsen video game usage statistics cited previously, it is easy to see why. If active gamers play console games only 17 hours a week, and if we assume that the peak usage times for games are during the typical non-work, non-school evening (5PM-12AM, 7 × 5 days = 35 hours/week) and weekend (8AM-12AM, 16 × 2 = 32 hours/week) hours, then there are 35 + 32 = 67 peak hours in a week during which those 17 hours of gameplay occur. The exact peak user load on the system is difficult to estimate for many reasons: some users will play during off-peak times, there may be clustering peaks of users at particular times of day, peak times may be affected by the type of game being played (e.g., children's games will likely be played earlier in the evening), and so on. But, given that the average number of hours a gamer plays is far less than the number of hours of the day when a gamer is likely to play a game, only a fraction of the users of the hosting service 210 will be using it at a given time. For the sake of this analysis, we shall assume a 12.5% peak load. Thus, only 12.5% of the computing, compression, and bandwidth resources are in use at a given time, and, because of the reuse of resources, only 12.5% of the hardware cost is needed to support a given user playing a game at a given level of performance.
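The utilization arithmetic in the preceding paragraphs can be checked in a few lines:

```python
HOURS_PER_WEEK = 24 * 7                 # 168 hours in a week

console_play = 17                       # hours/week, per the Nielsen figures cited above
console_utilization = console_play / HOURS_PER_WEEK
print(f"{console_utilization:.0%}")     # ~10%: the console is idle ~90% of the time

pc_use = 8 * 5                          # 9AM-5PM, five days a week
pc_utilization = pc_use / HOURS_PER_WEEK
print(f"{pc_utilization:.0%}")          # ~24%

# Peak gaming window: weekday evenings plus weekend days
evening = 7 * 5                         # 5PM-12AM, Mon-Fri = 35 hours
weekend = 16 * 2                        # 8AM-12AM, Sat-Sun = 32 hours
peak_hours = evening + weekend          # 67 peak hours containing 17 hours of play
print(peak_hours)
```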
Further, given that some games and applications require more computing power than others, resources can be allocated dynamically based on the game being played or the application being executed by the user. So, a user selecting a low-performance game or application will be allocated a low-performance (less expensive) server 402, and a user selecting a high-performance game or application will be allocated a high-performance (more expensive) server 402. Indeed, a given game or application may have lower-performance and higher-performance sections, and the user can be switched from one server 402 to another between sections of the game or application, so as to keep the user running on the lowest-cost server 402 that meets the needs of the game or application. Note that the RAID arrays 405, which are far faster than a single disk, are available even to the low-performance servers 402, which thereby gain the benefit of faster disk transfer rates. So, the average cost per server 402, across all games being played and applications being used, is much less than the cost of the most expensive server 402 that plays the most demanding game or application, and yet even the low-performance servers 402 derive disk-performance benefits from the RAID arrays 405.
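The tier-matching allocation described above might be sketched as follows. The tier table, costs, and requirement keys are illustrative assumptions only, not details from the specification:

```python
# Hypothetical server tiers; the text describes low- and high-performance
# servers 402 sharing a common RAID array 405.
SERVERS = {
    "low":  {"cpu": "Celeron", "gpu": None, "cost_per_hour": 0.05},
    "high": {"cpu": "dual high-performance", "gpu": "dual", "cost_per_hour": 0.50},
}

def assign_server(app_requirements):
    """Pick the cheapest tier that satisfies the game's/application's needs."""
    needs_gpu = app_requirements.get("gpu", False)
    for tier in ("low", "high"):                  # cheapest tier first
        if needs_gpu and SERVERS[tier]["gpu"] is None:
            continue
        return tier
    raise ValueError("no suitable server tier")

print(assign_server({"gpu": False}))   # office application -> "low"
print(assign_server({"gpu": True}))    # sophisticated 3D game -> "high"
```

A user moving between sections of a game would simply trigger a fresh call to `assign_server` with that section's requirements.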
Additionally, a server 402 in the hosting service 210 may be nothing more than a PC motherboard without a disk or peripheral interfaces (other than a network interface), and, in time, may be integrated down to a single chip with just a fast network interface to the SAN 403. Also, the RAID arrays 405 will likely be shared among far more users than there are disks, so the disk cost per active user will be far less than that of one disk drive. All of this equipment will likely reside in racks in an environmentally-controlled server-room environment. If a server 402 fails, it can be readily repaired or replaced at the hosting service 210. By contrast, a PC or game console in a home or office must be a rugged, standalone appliance that can survive reasonable wear and tear from being banged or dropped; it requires a housing, has at least one disk drive, must survive adverse environmental conditions (e.g., being crammed into an overheated AV cabinet with other gear), requires a service warranty, must be packaged and shipped, and is sold by a retailer who will likely collect a retail margin. In addition, a PC or game console must be configured to meet the peak performance of the most computationally-intensive game or application anticipated to be used at some point in the future, even though lower-performance games or applications (or sections of games or applications) may be played most of the time. And, if a PC or console fails, getting it repaired is an expensive and time-consuming process (adversely affecting the manufacturer, the user, and the software developer).
Thus, given that the system shown in FIG. 4a provides the user with an experience comparable to that of a local computing resource, it is far less expensive to provide a given level of computing power to a user in a home, office, or school through the architecture shown in FIG. 4a.
Eliminating the need for upgrades
In addition, users no longer need to worry about upgrading PCs and/or consoles to play new games or to handle higher-performance new applications. Any game or application on the hosting service 210, regardless of the type of server 402 it requires, is available to the user, and all games and applications execute nearly instantly (i.e., loading rapidly from the RAID arrays 405 or local storage on a server 402) and with the latest updates and bug fixes in place (i.e., software developers will be able to choose an ideal server configuration for the server 402 executing a given game or application, then configure the server 402 with optimal drivers, and then, over time, provide updates, bug fixes, etc. to all copies of the game or application in the hosting service 210 at once). Indeed, after the user begins using the hosting service 210, the user is likely to find that games and applications continue to provide a better experience (e.g., through updates and/or bug fixes), and it may be the case that a user discovers, a year later, a new game or application available on the service 210 that utilizes computing technology (e.g., a higher-performance GPU) that did not even exist a year before, so it would have been impossible for the user to have bought, a year earlier, technology that would play that game or execute that application a year later. Because the computing resource that plays the game or executes the application is invisible to the user (i.e., from the user's perspective, the user simply selects a game or application, which begins executing nearly instantly, much as if the user had changed channels on a television), the user's hardware will have been "upgraded" without the user ever being aware of the upgrade.
Eliminating the need for backup
Another major problem for users in businesses, schools, and homes is backup. Information stored on a local PC or video game console (e.g., in the case of a console, a user's game achievements and rankings) can be lost if a disk fails or if there is an inadvertent erasure. There are many applications available that provide manual or automatic backup for PCs, and game console state can be uploaded to an online server for backup, but local backups are typically copied to another local disk (or other non-volatile storage device) that has to be stored somewhere safe and kept organized, and backups to online services are often limited because of the slow upstream speeds available through typical low-cost Internet connections. With the hosting service 210 of FIG. 4a, the data stored in the RAID arrays 405 can be configured using prior-art RAID configuration techniques well known to those skilled in the art, such that if a disk fails, no data is lost; a technician at the server center housing the failed disk is notified, replaces the disk, and the RAID array is then automatically rebuilt so that it is once again fault-tolerant. Further, because all of the disk drives are near one another, with fast local networks between them through the SAN 403, it is not difficult to arrange for all of the disk systems in a server center to be backed up periodically to secondary storage, which can be kept at the server center or relocated offsite. From the point of view of the users of the hosting service 210, their data is always completely secure, and they never need to think about backups.
Access to presentations
Users often wish to try out games or applications before purchasing them. As previously described, there are prior-art means by which to demo (the verb form of "demo" means to try out a demonstration version, which is also called a "demo", but as a noun) games and applications, but each of them suffers from limitations and/or inconveniences. Using the hosting service 210, it is easy and convenient for users to try out demos. Indeed, all the user does is select the demo through a user interface (such as the one described below) and try it out. The demo will load almost instantly onto a server 402 appropriate for the demo, and it will execute just like any other game or application. Whether the demo requires a very high-performance server 402 or a low-performance server 402, and no matter what type of home or office client 415 the user is using, from the user's point of view the demo will just work. The software publisher of the game demo or application demo will be able to control exactly which demo the user is permitted to try and for how long, and, of course, the demo can include user interface elements that offer the user an opportunity to gain access to the full version of the game or application demonstrated.
Because demos may be low-cost or free to provide, some users may try to use demos repeatedly (particularly game demos that may be fun to play over and over). The hosting service 210 can employ various techniques to limit demo use for a given user. The most straightforward approach is to establish a user ID for each user and limit the number of times a given user ID is allowed to play a demo. A user, however, may set up multiple user IDs, especially if they are free. One technique for addressing this problem is to limit the number of times a given client 415 is allowed to play a demo. If the client is a standalone device, then the device will have a serial number, and the hosting service 210 can limit the number of times the demo can be accessed by a client with that serial number. If the client 415 runs as software on a PC or other device, then a serial number can be assigned by the hosting service 210, stored on the PC, and used to limit demo use; but, given that a PC can be reprogrammed by the user and the serial number erased or changed, another option is for the hosting service 210 to keep a record of the PC's network adapter Media Access Control (MAC) address (and/or other machine-specific identifiers, such as hard-drive serial numbers, etc.) and limit demo use to that MAC address. Granted, the MAC address of a network adapter can be changed, but this is not a trivial matter. Yet another approach is to limit the number of times a demo can be played from a given IP address. Although IP addresses may be reassigned periodically by cable modem and DSL providers, in practice this does not happen very frequently, and if it can be determined (e.g., by contacting the ISP) that an IP address is within a block of IP addresses for residential DSL or cable-modem access, then a small number of demo uses can typically be established for a given household.
Also, there may be multiple devices in a home behind a NAT router sharing the same IP address, but typically, in a residential setting, there will be a limited number of such devices. If the IP address is in a block serving businesses, then a larger number of demo uses can be established for a business. In the end, however, a combination of all of the previously described approaches is the best way to limit the number of demo uses on PCs. Although there may be no foolproof way to limit the number of times a determined and technically adept user can replay a demo, creating a large number of hurdles can form a sufficient deterrent that it is not worth the trouble for most PC users to abuse the demo system; instead, they will use demos as intended: to try out new games and applications.
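A combined limiter of the kind suggested above, counting demo plays against several identifiers at once (user ID, client serial number, MAC address, IP address), might be sketched as follows. The class name and the play limit are hypothetical:

```python
from collections import defaultdict

class DemoLimiter:
    """Counts demo plays against every supplied identifier, so that
    setting up a fresh user ID does not evade a per-MAC or per-IP limit."""
    def __init__(self, max_plays):
        self.max_plays = max_plays
        self.counts = defaultdict(int)   # (id_kind, id_value) -> play count

    def allow(self, identifiers):
        """identifiers: e.g. {"user": "alice", "mac": "...", "ip": "..."}.
        Denies the play if ANY identifier has hit the limit."""
        if any(self.counts[(k, v)] >= self.max_plays for k, v in identifiers.items()):
            return False
        for k, v in identifiers.items():
            self.counts[(k, v)] += 1
        return True

limiter = DemoLimiter(max_plays=2)
ids = {"user": "alice", "mac": "aa:bb:cc:dd:ee:ff", "ip": "76.21.5.9"}
print(limiter.allow(ids), limiter.allow(ids), limiter.allow(ids))
# A new user ID from the same machine is still blocked by the MAC count:
print(limiter.allow({"user": "alice2", "mac": "aa:bb:cc:dd:ee:ff"}))
```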
Benefits to schools, businesses and other institutions
Significant benefits arise especially for businesses, schools, and other institutions that utilize the system shown in FIG. 4a. Businesses and schools have substantial costs involved with installing, maintaining, and upgrading PCs, particularly when it comes to PCs that run high-performance applications, such as Maya. As stated previously, PCs are generally utilized during only a fraction of the hours of the week, and, as in the home, the cost of a PC with a given level of performance capability is far higher in an office or school environment than in a server-center environment.
In the case of larger businesses or schools (e.g., large universities), it may be practical for the entity's IT department to set up a server center and maintain computers that are accessed remotely through LAN-grade connections. A number of solutions exist for remote access to computers through a LAN, or through a private high-bandwidth connection between offices. For example, with Microsoft's Windows Terminal Server, or with virtual network computing applications (such as VNC from RealVNC, Ltd.), or with thin-client approaches from Sun Microsystems, users can gain remote access to PCs or servers, with a range of quality in graphics response time and user experience. Further, such self-managed server centers are typically dedicated to a single business or school, and, as such, are unable to take advantage of the overlap in usage that is possible when disparate applications (e.g., entertainment and business applications) utilize the same computing resources at different times of the week. Moreover, many businesses and schools lack the scale, resources, or expertise to set up on their own a server center with a LAN-speed network connection to each user. Indeed, a large percentage of schools and businesses have the same Internet connections (e.g., DSL, cable modem) as homes.
Yet such organizations may still have a need for very high-performance computing, either regularly or periodically. For example, a small architectural firm may have only a small number of architects, with relatively modest computing needs when doing design work, but it may require very high-performance 3D computing periodically (e.g., when creating a 3D fly-through of a new architectural design for a client). The system shown in FIG. 4a is extremely well suited to such organizations. All the organizations need is the same kind of network connection that is offered to homes (e.g., DSL, cable modem), which is typically quite inexpensive. They can either utilize inexpensive PCs as the clients 415, or dispense with PCs altogether and utilize inexpensive dedicated devices that simply implement the control signal logic 413 and the low-latency video decompression 412. These features are particularly attractive to schools that may have problems with theft of PCs or with damage to the delicate components within PCs.
This arrangement solves a number of problems for such organizations (and many of these advantages are also shared by home users doing general-purpose computing). For one, the operating cost (which, ultimately, must be passed back to the users in some form in order to have a viable business) can be much lower because (a) the computing resources are shared with other applications that have different peak usage times during the week, (b) the organizations gain access to (and incur the cost of) high-performance computing resources only when needed, and (c) the organizations do not have to provide resources for backing up or otherwise maintaining the high-performance computing resources.
Removal of piracy
Additionally, games, applications, interactive movies, etc. can no longer be pirated the way they are today. Because each game is stored at and executed by the hosting service 210, users do not have access to the underlying program code, so there is nothing to pirate. Even if a user were to copy the source code, the user would not be able to execute the code on a standard game console or home computer. This opens up markets in parts of the world, such as China, where standard video games have not been made available. The resale of used games is likewise not possible, because no copies of games are distributed to users.
For game developers, there will be fewer market discontinuities than there are today, as new generations of game consoles or PCs are introduced. In contrast to the current situation, in which an entirely new generation of console or PC technology forces users and developers to upgrade, and in which game developers are dependent on the timely delivery of the hardware platform to users (e.g., in the case of the PlayStation 3, its introduction was delayed by more than a year, and developers had to wait until it was available and substantial numbers of units had been purchased), the hosting service 210 can be gradually updated over time with more advanced computing technology as gaming requirements change.
Streaming interactive video
The above description provides a wide range of applications enabled by the novel fundamental concept of general-purpose, Internet-based, low-latency streaming interactive video (which, as used herein, implicitly also includes audio along with the video). Prior-art systems that have provided streaming video over the Internet have enabled only applications that can be implemented with high-latency interaction. For example, basic playback controls for linear video (e.g., pause, rewind, fast-forward) work adequately at high latency, and it is possible to select among linear video feeds. And, as stated previously, the nature of some video games allows them to be played with high latency. But the high latency (or low compression ratio) of prior-art approaches to streaming video has severely limited the potential applications of streaming video, or has narrowed its deployment to specialized network environments, and even in such environments the prior art introduces substantial burdens on the network. The technology described herein opens the door to a wide range of applications possible with low-latency streaming interactive video over the Internet, particularly those enabled through consumer-grade Internet connections.
Indeed, with client devices as simple as the client 465 of FIG. 4c, combined with effectively arbitrary amounts of computing power, arbitrary amounts of fast storage, and extremely fast networking among powerful servers, streaming interactive video enables a new computing era, sufficient to provide an enhanced user experience. Moreover, because the bandwidth requirements do not grow as the computing power of the system grows (i.e., the bandwidth requirements are tied only to display resolution, quality, and frame rate), once broadband Internet connectivity is ubiquitous (e.g., through widespread low-latency wireless coverage), reliable, and of sufficiently high bandwidth to meet the needs of all users' display devices 422, the question will be whether thick clients (such as PCs or mobile phones running Windows, Linux, OSX, etc.), or even thin clients (such as Adobe Flash or Java), are necessary for typical consumer and business applications.
The advent of streaming interactive video leads to a rethinking of assumptions about the structure of computing architectures. One example of this is the hosting service 210 server-center embodiment shown in FIG. 15. The video path for the delay buffer and/or the packetized video 1550 is a feedback loop in which the multicast streaming interactive video output of the app/game servers 1521-1525 is fed back into the app/game servers 1521-1525, either in real time via path 1552 or after a selectable delay via path 1551. This enables a variety of practical applications (e.g., such as those illustrated in FIGS. 16, 17, and 20) that would be either impossible or infeasible with prior-art server or local computing architectures. More generally, as an architectural feature, what feedback loop 1550 provides is recursion at the streaming-interactive-video level, since video can be looped back indefinitely as an application requires it. This enables a wide range of application possibilities never available before.
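The selectable-delay feedback path might be modeled as a simple frame queue. This is an illustrative model only; the actual delay buffer 1515 and feedback loop 1550 operate on compressed video streams, and the names here are hypothetical:

```python
from collections import deque

class DelayBuffer:
    """Feeds each frame back either live (cf. path 1552) or after a
    selectable delay of N frames (cf. path 1551)."""
    def __init__(self, delay_frames):
        self.delay = delay_frames
        self.buf = deque()

    def push(self, frame):
        """Returns (live_frame, delayed_frame_or_None)."""
        self.buf.append(frame)
        delayed = self.buf.popleft() if len(self.buf) > self.delay else None
        return frame, delayed

db = DelayBuffer(delay_frames=2)
outputs = [db.push(f) for f in ("f1", "f2", "f3", "f4")]
print(outputs)
# [('f1', None), ('f2', None), ('f3', 'f1'), ('f4', 'f2')]
```

Because the delayed output can itself be fed back as input, the loop supports the recursion described above: video can be cycled through the buffer indefinitely.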
Another key architectural feature is that the video streams are unidirectional UDP streams. This enables effectively an arbitrary degree of multicasting of streaming interactive video (by contrast, two-way streams, such as TCP/IP streams, would create ever-increasing traffic congestion on the network from the back-and-forth communication as the number of users increased). Multicasting is an important capability within the server center, because it allows the system to be responsive to the growing needs of Internet users (and, indeed, of the world's population) to communicate on a one-to-many, or even many-to-many, basis. Again, the examples discussed herein that illustrate both streaming interactive video recursion and multicasting (such as FIG. 16) are only the tip of a very large iceberg of possibility.
Non-transit peering
In one embodiment, the hosting service 210 has one or more peering connections to one or more Internet Service Providers (ISPs) that also provide Internet service to users, such that the hosting service 210 may be able to communicate with those users through a non-transit route that stays within the ISP's network. For example, if the hosting service 210 WAN interface 441 were connected directly to Comcast Cable Communications' network, and the user premises 211 were provided with broadband service through a Comcast cable modem, then a route between the hosting service 210 and the client 415 could be established entirely within Comcast's network. Potential advantages of this include lower communication cost (since the IP transit costs between two or more ISP networks could be avoided), a potentially more reliable connection (in case there were congestion or other transit disruptions between the ISP networks), and lower latency (in case there were congestion, inefficient routes, or other delays between the ISP networks).
In this embodiment, when the client 415 initially contacts the hosting service 210 at the start of a session, the hosting service 210 receives the IP address of the user premises 211. It then checks, using, for example, a table of valid IP addresses from ARIN (the American Registry for Internet Numbers), whether the IP address is one allocated to a particular ISP connected to the hosting service 210, such that the hosting service 210 can route to the user premises 211 without IP transit through another ISP. For example, if the IP address is between 76.21.0.0 and 76.21.127.255, then the IP address is assigned to Comcast Cable Communications, Inc. In this example, if the hosting service 210 maintains connections to the Comcast, AT&T, and Cox ISPs, then it selects Comcast as the ISP most likely to provide an optimal route to that particular user.
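The ISP-selection check described above can be sketched with a table of peered address blocks. The Comcast range is the one cited above; the Cox block and the table structure are hypothetical placeholders, and a real table would be built from ARIN allocation data:

```python
import ipaddress

# Hypothetical table mapping directly-peered ISPs to their allocated blocks.
# 76.21.0.0/17 covers 76.21.0.0 through 76.21.127.255, as in the example above.
PEERED_ISP_BLOCKS = {
    "Comcast": [ipaddress.ip_network("76.21.0.0/17")],
    "Cox":     [ipaddress.ip_network("68.1.0.0/16")],   # illustrative block
}

def select_non_transit_isp(client_ip):
    """Return the directly-connected ISP that can reach this client without
    IP transit through another ISP, or None if transit is required."""
    addr = ipaddress.ip_address(client_ip)
    for isp, blocks in PEERED_ISP_BLOCKS.items():
        if any(addr in block for block in blocks):
            return isp
    return None

print(select_non_transit_isp("76.21.5.100"))   # within the Comcast block
print(select_non_transit_isp("8.8.8.8"))       # no peered block: route via transit
```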
Video compression using feedback
In an embodiment, feedback is provided from the client device to the hosting service to indicate successful (or unsuccessful) tile and/or frame transfers. The feedback information provided from the client is then used to adjust the video compression operation at the hosting service.
For example, FIGS. 25a-b illustrate an embodiment of the invention in which a feedback channel 2501 is established between the client device 205 and the hosting service 210. The client device 205 sends a packetized acknowledgement of successfully received image blocks/frames and/or an indication of unsuccessfully received image blocks/frames using the feedback channel 2501.
In one embodiment, the client sends an acknowledgement to the hosting service 210 after each tile/frame is successfully received. In this embodiment, the hosting service 210 detects a packet loss if no acknowledgement is received after a specified period of time, and/or if it receives an acknowledgement for a tile/frame that was sent to the client device 205 after a tile/frame for which no acknowledgement has been received. Alternatively, or in addition, the client device 205 may detect the packet loss and send an indication of the packet loss to the hosting service 210, along with an indication of the tiles/frames affected by the packet loss. In this embodiment, continuous acknowledgement of successfully delivered tiles/frames is not required.
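The timeout-based detection rule on the hosting-service side might be sketched as follows. The timeout value and the names are illustrative, and the second detection rule in the text (an out-of-order acknowledgement implying a loss) is omitted for brevity:

```python
ACK_TIMEOUT = 0.25   # seconds; illustrative threshold

class LossDetector:
    """Host-side bookkeeping: a tile/frame with no acknowledgement inside
    the timeout window is presumed lost."""
    def __init__(self):
        self.pending = {}   # frame_id -> timestamp at which it was sent

    def sent(self, frame_id, now):
        self.pending[frame_id] = now

    def acked(self, frame_id):
        self.pending.pop(frame_id, None)

    def lost(self, now):
        """Return and forget every tile/frame whose ACK window has expired."""
        expired = [f for f, t in self.pending.items() if now - t > ACK_TIMEOUT]
        for f in expired:
            del self.pending[f]
        return expired

det = LossDetector()
det.sent(4, now=0.00)
det.sent(5, now=0.05)
det.acked(5)                    # client confirmed tile/frame 5
lost_frames = det.lost(now=0.30)
print(lost_frames)              # tile/frame 4 timed out
```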
Regardless of how the packet loss is detected, in the embodiment illustrated in FIGS. 25a-b, after generating an initial set of I-tiles for a picture (not shown in FIG. 25a), the encoder generates only P-tiles until a packet loss is detected. Note that in FIG. 25a, each frame, such as 2510, is illustrated as 4 vertical tiles. The frames may be tiled in different configurations, such as 2×2, 2×4, 4×4, etc., or the frames may be encoded as a whole, without any tiles (i.e., as one large tile). The foregoing examples of tile configurations are provided for purposes of illustrating this embodiment of the invention; the underlying principles of the invention are not limited to any particular tile configuration.
Transmitting only P-tiles reduces the channel bandwidth requirement, for all of the reasons set forth above (i.e., P-tiles are generally smaller than I-tiles). When a packet loss is detected via the feedback channel 2501, new I-tiles are generated by the encoder 2500, as illustrated in FIG. 25b, to reinitialize the state of the decoder 2502 on the client device 205. As illustrated, in one embodiment, the I-tiles are distributed across multiple encoded frames to limit the bandwidth consumed by any single encoded frame. For example, in FIG. 25b, in which each frame includes 4 tiles, a single I-tile is transmitted at a different position within each of 4 successive encoded frames.
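The rolling distribution of I-tiles across successive frames can be sketched as a small scheduling function. This is an illustration assuming the 4-tile layout of FIGS. 25a-b; the function and parameter names are hypothetical:

```python
def tile_types(frame_index, tiles_per_frame, reinit_at):
    """After a loss detected before frame `reinit_at`, place one I-tile per
    frame, rotating its position, so the decoder state is fully
    reinitialized after `tiles_per_frame` frames without the bandwidth
    spike of a single all-I frame."""
    types = ["P"] * tiles_per_frame
    if 0 <= frame_index - reinit_at < tiles_per_frame:
        types[(frame_index - reinit_at) % tiles_per_frame] = "I"
    return types

# Loss detected before frame 10; frames 10-13 each carry one I-tile:
for f in range(10, 14):
    print(f, tile_types(f, tiles_per_frame=4, reinit_at=10))
# 10 ['I', 'P', 'P', 'P']
# 11 ['P', 'I', 'P', 'P']
# 12 ['P', 'P', 'I', 'P']
# 13 ['P', 'P', 'P', 'I']
```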
The encoder 2500 may combine the techniques described in connection with this embodiment with the other encoding techniques described herein. For example, in addition to generating I-tiles in response to a detected packet loss, the encoder 2500 may generate I-tiles in other circumstances in which I-tiles are likely to be beneficial for correctly rendering the image sequence (such as in response to sudden scene transitions).
FIG. 26a illustrates another embodiment of the invention, which relies on a feedback channel 2601 between the client device 205 and the hosting service 210. Rather than generating new I-tiles/frames in response to a detected packet loss, the encoder 2600 of this embodiment adjusts the dependencies of the P-tiles/frames. As an initial matter, it should be noted that the specific details set forth in this example are not required for complying with the underlying principles of the invention. For example, although this example is described using P-tiles/frames, the underlying principles of the invention are not limited to any particular encoding format.
In FIG. 26a, the encoder 2600 encodes a plurality of uncompressed tiles/frames 2605 into a plurality of P-tiles/frames 2606 and transmits the P-tiles/frames over a communication channel (e.g., the Internet) to the client device 205. A decoder 2602 on the client device 205 decodes the P-tiles/frames 2606 to generate a plurality of decompressed tiles/frames 2607. The past state(s) 2611 of the encoder 2600 are stored in a memory device 2610 on the hosting service 210, and the past state(s) 2621 of the decoder 2602 are stored in a memory device 2620 on the client device 205. The "state" of a decoder is a well-known term in video coding systems, such as MPEG-2 and MPEG-4. In one embodiment, the past "states" stored in the memories comprise the combined data from prior P-tiles/frames. The memories 2610 and 2620 may be integrated within the encoder 2600 and the decoder 2602, respectively, rather than being separate from them as shown in FIG. 26a. Moreover, various types of memory may be used, including, by way of example and not limitation, random access memory.
In one embodiment, when no packet loss occurs, the encoder 2600 encodes each P-tile/frame to be dependent on the previous P-tile/frame. Thus, using the notation employed in FIG. 26a, P-tile/frame 4 depends on P-tile/frame 3 (denoted 4₃); P-tile/frame 5 depends on P-tile/frame 4 (denoted 5₄); and P-tile/frame 6 depends on P-tile/frame 5 (denoted 6₅). In this example, P-tile/frame 4₃ has been lost in transmission between the encoder 2600 and the decoder 2602. The loss may be communicated to the encoder 2600 in various ways, including but not limited to those described above. For example, each time the decoder 2602 successfully receives and/or decodes a tile/frame, this information may be communicated from the decoder 2602 to the encoder 2600. If the encoder 2600 does not receive, after a period of time, an indication that a particular tile/frame has been received and/or decoded, the encoder 2600 will assume that the tile/frame was not successfully received. Alternatively, or in addition, the decoder 2602 may notify the encoder 2600 when a particular tile/frame is not successfully received.
In one embodiment, regardless of how the lost tile/frame is detected, once the loss is detected, the encoder 2600 encodes the next tile/frame to be dependent on the last tile/frame known to have been successfully received by the decoder 2602. In the example shown in FIG. 26a, tiles/frames 5 and 6 are not considered "successfully received" because, due to the loss of tile/frame 4, they cannot be correctly decoded by the decoder 2602 (i.e., the decoding of tile/frame 5 depends on tile/frame 4, and the decoding of tile/frame 6 depends on tile/frame 5). Thus, in the example shown in FIG. 26a, the encoder 2600 encodes tile/frame 7 to be dependent on tile/frame 3 (the last successfully received tile/frame) rather than on tile/frame 6, which the decoder 2602 cannot decode correctly. Although not shown in FIG. 26a, tile/frame 8 is subsequently encoded to be dependent on tile/frame 7, and tile/frame 9 is encoded to be dependent on tile/frame 8, assuming no additional packet losses are detected.
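The reference-selection rule described above can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the class and method names are invented, and real encoders track references in codec-specific state rather than as bare frame numbers.

```python
# Hypothetical sketch of feedback-driven reference selection: the encoder
# normally codes each P tile/frame against the immediately preceding one,
# but after a reported loss it falls back to the newest tile/frame the
# decoder is known to have received and/or decoded.

class FeedbackEncoder:
    def __init__(self):
        self.last_acked = None      # newest frame acknowledged by the decoder
        self.loss_detected = False  # set when feedback (or a timeout) reports a loss

    def on_ack(self, frame_num):
        """Decoder reported successful receipt/decode of frame_num."""
        if self.last_acked is None or frame_num > self.last_acked:
            self.last_acked = frame_num

    def on_loss(self, frame_num):
        """Decoder (or an ack timeout) indicated frame_num was lost."""
        self.loss_detected = True

    def reference_for(self, current_frame):
        """Frame number the current P tile/frame should depend on."""
        if self.loss_detected and self.last_acked is not None:
            self.loss_detected = False   # subsequent frames chain normally again
            return self.last_acked       # e.g., frame 7 depends on frame 3
        return current_frame - 1         # normal case: the previous frame

enc = FeedbackEncoder()
enc.on_ack(3)                 # frame 3 acknowledged
enc.on_loss(4)                # frame 4 lost; frames 5 and 6 become undecodable
ref7 = enc.reference_for(7)   # frame 7 is coded against frame 3
ref8 = enc.reference_for(8)   # frame 8 chains normally against frame 7
```

The single boolean flag mirrors the example in FIG. 26a, where one loss redirects one dependency and the chain then resumes.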
As mentioned above, the encoder 2600 and the decoder 2602 maintain past encoder and decoder states 2611 and 2621 in memories 2610 and 2620, respectively. Thus, when encoding tile/frame 7, the encoder 2600 retrieves from the memory 2610 the prior encoder state associated with tile/frame 3. Similarly, the memory 2620 associated with the decoder 2602 stores at least the last known good decoder state (in this example, the state associated with tile/frame 3). Thus, the decoder 2602 retrieves the past state information related to tile/frame 3 so that tile/frame 7 can be decoded.
As a result of the techniques described above, real-time, low-latency, interactive video can be encoded and streamed using relatively little bandwidth, because I-tiles/frames are never needed (except to initialize the decoder and encoder at the start of the stream). Moreover, although the video image produced by the decoder may temporarily include unwanted distortion caused by the lost tile/frame 4 and by tiles/frames 5 and 6 (which cannot be decoded correctly because of the loss of tile/frame 4), the distortion is visible for only a very brief duration. In addition, if tiles (rather than full video frames) are used, the distortion will be limited to a particular region of the rendered video image.
FIG. 26b illustrates a method according to one embodiment of the invention. At 2650, a tile/frame is generated based on a previously generated tile/frame. At 2651, a lost tile/frame is detected. In one embodiment, the lost tile/frame is detected based on information transmitted from the decoder to the encoder, as described above. At 2652, the next tile/frame is generated based on a tile/frame known to have been successfully received and/or decoded at the decoder. In one embodiment, the encoder generates the next tile/frame by loading from memory the state associated with the successfully received and/or decoded tile/frame. Likewise, when the decoder receives the new tile/frame, it decodes the tile/frame by loading from memory the state associated with the successfully received and/or decoded tile/frame.
In one embodiment, the next tile/frame is generated based on the last tile/frame successfully received and/or decoded. In another embodiment, the next tile/frame generated is an I-tile/frame. In yet another embodiment, the choice of whether to generate the next tile/frame based on a previously successfully received tile/frame or as an I-tile/frame depends on how many tiles/frames were lost and/or on the latency of the channel. In a situation where a relatively small number (e.g., 1 or 2) of tiles/frames is lost and the round-trip latency is relatively low (e.g., 1 or 2 frame times), generating a P-tile/frame may be optimal, because the difference between the last successfully received tile/frame and the newly generated one may be relatively small. If several tiles/frames are lost or the round-trip latency is high, generating an I-tile/frame may be optimal, because the difference between the last successfully received tile/frame and the newly generated one may be large. In one embodiment, a tile/frame loss threshold and/or a latency threshold is set to determine whether to transmit an I-tile/frame or a P-tile/frame. If the number of lost tiles/frames is below the tile/frame loss threshold and/or the round-trip latency is below the latency threshold, a new P-tile/frame is generated; otherwise, a new I-tile/frame is generated.
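The threshold test just described can be sketched as below. Note that the patent only states that such thresholds exist; the specific numeric values here are assumptions chosen for illustration.

```python
# Minimal sketch of the P-vs-I decision: small loss plus a fast channel
# favors a P tile/frame (small difference from the last good frame);
# otherwise an I tile/frame is generated. Threshold values are hypothetical.

LOSS_THRESHOLD = 2        # tiles/frames lost (assumed value)
LATENCY_THRESHOLD = 2.0   # round-trip delay in frame times (assumed value)

def choose_frame_type(frames_lost, round_trip_frames):
    """Return 'P' when the loss is small and the channel fast, else 'I'."""
    if frames_lost <= LOSS_THRESHOLD and round_trip_frames <= LATENCY_THRESHOLD:
        return "P"   # difference from the last good frame is likely small
    return "I"       # difference is likely large; a fresh I tile/frame wins

few_lost = choose_frame_type(1, 1.5)    # few losses, low latency
many_lost = choose_frame_type(6, 5.0)   # many losses, high latency
```

Under these assumed thresholds, the first call selects a P-tile/frame and the second an I-tile/frame, matching the reasoning in the paragraph above.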
In one embodiment, the encoder always attempts to generate a P-tile/frame relative to the last successfully received tile/frame, and if during encoding the encoder determines that the P-tile/frame will likely be larger than an I-tile/frame (e.g., if it has compressed 1/8 of the tile/frame and the compressed size is larger than 1/8 of the average size of previously compressed I-tiles/frames), the encoder will abandon compressing the P-tile/frame and will instead compress an I-tile/frame.
If lost packets occur infrequently, the above-described system of using feedback to report dropped tiles/frames typically results in only a very slight interruption of the video stream to the user, because a tile/frame lost through a dropped packet is replaced in roughly the round-trip time between the client device 205 and the hosting service 210, assuming the encoder 2600 compresses the tile/frame in a short amount of time. And because the newly compressed tile/frame is based on a later frame in the uncompressed video stream, the video stream does not fall behind the uncompressed video stream. But if the packet containing the new tile/frame is also lost, this will result in a delay of at least two round trips to request and send yet another new tile/frame, which in many practical situations will produce a noticeable interruption of the video stream. For this reason, it is important that the newly encoded tile/frame sent after a lost tile/frame be successfully sent from the hosting service 210 to the client device 205.
In one embodiment, Forward Error Correction (FEC) techniques, such as those previously described and illustrated in FIGS. 11a, 11b, 11c, and 11d, are used to mitigate the possibility of losing the newly encoded tile/frame. If FEC coding was already being used when transmitting the tiles/frames, a stronger FEC code is used for the newly encoded tile/frame.
One potential cause of lost packets is a sudden loss of channel bandwidth, for example, if some other user of the broadband connection at the user premises 211 begins using a large amount of bandwidth. If a newly generated tile/frame is also lost due to dropped packets (even if FEC is used), then in one embodiment, when the client 415 notifies the hosting service 210 that a second newly encoded tile/frame has been lost, the video compressor 404 reduces the data rate when it encodes subsequent newly encoded tiles/frames. Different embodiments reduce the data rate using different techniques. For example, in one embodiment the data-rate reduction is achieved by lowering the quality of the encoded tiles/frames, i.e., by increasing the compression ratio. In another embodiment, the data rate is reduced by lowering the frame rate of the video (e.g., from 60 fps to 30 fps) and slowing the rate of data transmission accordingly. In one embodiment, both techniques for reducing the data rate are used (i.e., both reducing the frame rate and increasing the compression ratio). If this lower rate of data transmission succeeds in mitigating the dropped packets, then, in accordance with the channel data-rate detection and adjustment methods previously described, the hosting service 210 will continue encoding at the lower data rate and then gradually adjust the data rate upward or downward as the channel allows. The continuous receipt of feedback data related to dropped packets and/or latency allows the hosting service 210 to dynamically adjust the data rate based on current channel conditions.
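The data-rate back-off described above can be sketched as follows. This is an illustrative sketch under assumptions: the halving factor, the quality floor, and the recovery step are invented; the patent specifies only the general mechanisms (raise compression ratio, reduce frame rate, then ratchet back up as the channel allows).

```python
# Hypothetical rate controller: on a repeated loss report it increases the
# compression ratio and halves the frame rate; once the channel recovers
# it gradually adjusts quality back up, as the text describes.

class RateController:
    def __init__(self, fps=60, quality=1.0):
        self.fps = fps            # e.g., 60 -> 30 fps under congestion
        self.quality = quality    # 1.0 = full quality; lower = more compression

    def on_repeated_loss(self):
        """A second newly encoded tile/frame was lost: cut the data rate."""
        self.quality = max(0.25, self.quality * 0.5)  # raise compression ratio
        if self.fps > 30:
            self.fps = 30                             # reduce frame rate

    def on_channel_recovered(self):
        """Channel is clean again: gradually adjust the rate back up."""
        self.quality = min(1.0, self.quality + 0.1)

rc = RateController()
rc.on_repeated_loss()   # quality halved and frame rate reduced
```

A real controller would drive these knobs from the continuous loss/latency feedback stream rather than from single events.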
State management in an online gaming system
One embodiment of the invention employs techniques for efficiently storing and porting the current state of an active game between servers. Although the embodiments described herein relate to online gaming, the underlying principles of the invention may be used for various other types of applications (e.g., design applications, word processors, communication software such as e-mail or instant messaging, etc.). FIG. 27a illustrates an exemplary system architecture for implementing this embodiment, and FIG. 27b illustrates an exemplary method. While the method and the system architecture will be described concurrently, the method illustrated in FIG. 27b is not limited to any particular system architecture.
At 2751 in FIG. 27b, a user initiates a new online game on the hosting service 210a from a client device 205. In response, at 2752, a "clean" image 2702a of the game is loaded from storage (e.g., a hard drive, whether connected directly to a server executing the game or connected to the server over a network) into memory (e.g., RAM) on the hosting service 210a. The "clean" image comprises the runtime program code and data for the game prior to the initiation of any game play (e.g., as when the game is executed for the first time). The user then plays the game at 2753, causing the "clean" image to change to a non-clean image (e.g., an executing game represented by "State A" in FIG. 27a). At 2754, the game is paused or terminated, either by the user or by the hosting service 210a. At 2755, state management logic 2700a on the hosting service 210a determines the differences between the "clean" image of the game and the current game state ("State A"). Various known techniques may be used to compute the differences between two binary images, including, for example, those used in the well-known "diff" utility available on the UNIX operating system. Of course, the underlying principles of the invention are not limited to any particular technique for difference calculation.
Regardless of how the differences are calculated, once they have been calculated, the difference data may be stored locally in a storage device 2705a and/or transmitted to a different hosting service 210b. If transmitted to a different hosting service 210b, the difference data may be stored on a storage device (not shown) at the new hosting service 210b. In either case, the difference data is associated with the user's account on the hosting service so that it can be identified the next time the user logs in to the hosting service and initiates the game. In one embodiment, rather than being transmitted immediately, the difference data is not transmitted to a new hosting service until the next time the user attempts to play the game (and a different hosting service is determined to be the best choice for hosting the game).
Returning to the method shown in FIG. 27b, at 2757, the user restarts the game from a client device, which may be the same client device 205 from which the user initially played the game or a different client device (not shown). In response, at 2758, state management logic 2700b on the hosting service 210b retrieves the "clean" image of the game and the difference data from storage. At 2759, the state management logic 2700b combines the clean image and the difference data to reconstruct the state that the game was in on the original hosting service 210a ("State A"). Various known techniques may be used to recreate the state of a binary image using the difference data, including, for example, those used in the well-known "patch" utility available on the UNIX operating system. The difference calculation techniques used in well-known backup programs, such as PC Backup, may also be used. The underlying principles of the invention are not limited to any particular technique for using difference data to reconstruct a binary image.
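The "clean image plus difference data" scheme can be sketched with a toy byte-level diff and patch. This is illustration only: it assumes equal-length images and records differing runs as (offset, bytes) pairs, whereas a production system would use a proper binary-diff tool (the text cites UNIX diff/patch as analogues).

```python
# Toy binary diff/patch: compute_diff records only the runs where the
# played game image ("State A") departs from the clean image; apply_diff
# recombines the clean image with the difference data to rebuild State A.

def compute_diff(clean, dirty):
    """Return [(offset, bytes)] runs where dirty differs from clean."""
    assert len(clean) == len(dirty)   # simplifying assumption for the sketch
    runs, i = [], 0
    while i < len(clean):
        if clean[i] != dirty[i]:
            j = i
            while j < len(clean) and clean[j] != dirty[j]:
                j += 1
            runs.append((i, dirty[i:j]))
            i = j
        else:
            i += 1
    return runs

def apply_diff(clean, runs):
    """Recombine the clean image with the difference data."""
    img = bytearray(clean)
    for offset, data in runs:
        img[offset:offset + len(data)] = data
    return bytes(img)

clean = b"CLEAN-GAME-IMAGE"
state_a = b"CLEAN-game-IMAGE"          # the image after some game play
diff = compute_diff(clean, state_a)    # much smaller than the full image
restored = apply_diff(clean, diff)     # reconstruction at the second hosting service
```

The point FIG. 28 makes carries over directly: the difference data (one short run here) is far smaller than the full game image, which is what saves storage and bandwidth.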
In addition, at 2760, platform-dependent data 2710 is incorporated into the final game image 2701b. The platform-dependent data 2710 may include any data that is unique to the destination server platform. By way of example and not limitation, the platform-dependent data 2710 may include the media access control (MAC) address of the new platform, its TCP/IP address, its time of day, its hardware serial numbers (e.g., those of the hard drive and CPU), its network server addresses (e.g., DHCP/WINS servers), and its software serial number(s)/activation code(s) (including operating-system serial number(s)/activation code(s)).
Other client/user-related platform-dependent data may include (but is not limited to) the following:
1. The user's screen resolution. When the user resumes the game, the user may be using a different device with a different resolution.
2. The user's controller configuration. When the game resumes, the user may have switched from a game controller to a keyboard/mouse.
3. User privileges, such as whether a discounted rate has expired (e.g., if the user was playing the game during a promotional period and is now playing during a normal, more expensive period), or whether the user or the device is subject to certain age restrictions (e.g., the user's parents may have changed the settings for a child so that the child is not permitted to view adult material, or the device on which the game is being played (e.g., a computer in a public library) may have certain restrictions on whether adult material can be displayed).
4. The user's rank. The user may have been permitted to play a multiplayer game in a particular league, but because certain other users surpassed the user's level, the user may have been demoted to a lesser league.
The foregoing examples of platform-dependent data 2710 are provided to illustrate this embodiment of the invention. The underlying principles of the invention are not limited to any particular set of platform-dependent data.
FIG. 28 graphically illustrates how the state management logic 2700a at the first hosting service extracts difference data 2800 from the executing game 2701a. The state management logic 2700b at the second hosting service then combines the clean image 2702b with the difference data 2800 and the platform-dependent data 2710 to regenerate the state of the executing game 2701b. As shown generally in FIG. 28, the size of the difference data is significantly smaller than the size of the entire game image 2701a, so a substantial amount of storage space and bandwidth is saved by storing/transmitting only the difference data. Although not shown in FIG. 28, the platform-dependent data 2710 may overwrite some of the difference data when it is incorporated into the final game image 2701b.
Although an online video-game implementation is described above, the underlying principles of the invention are not limited to video games. For example, the foregoing state management techniques may be implemented in the context of any type of online-hosted application.
Techniques for maintaining client decoders
In one embodiment of the invention, whenever a user requests a connection to the hosting service 210, the hosting service 210 transmits a new decoder to the client device 205. Thus, in this embodiment, the decoder used by the client device is always up-to-date and is specifically tailored to the software/hardware implemented on the client device.
As illustrated in FIG. 29, in this embodiment the application permanently installed on the client device 205 does not include a decoder. Instead, a client downloader application 2903 manages the download and installation of a temporary decoder 2900 each time the client device 205 connects to the hosting service 210. The downloader application 2903 may be implemented in hardware, software, firmware, or any combination thereof. In response to a user request for a new online session, the downloader application 2903 transmits information related to the client device 205 over a network (e.g., the Internet). This information may include identification data identifying the client device and/or the client device's hardware/software configuration (e.g., processor, operating system, etc.).
Based on this information, a downloader application 2901 on the hosting service 210 selects an appropriate temporary decoder 2900 to be used on the client device 205. The downloader application 2901 on the hosting service then transmits the temporary decoder 2900, and the downloader application 2903 on the client device verifies and/or installs the decoder on the client device 205. The encoder 2902 then encodes the audio/video content using any of the techniques described herein and transmits the content 2910 to the decoder 2900. Once the new decoder 2900 is installed, it decodes the content for the current online session (i.e., using one or more of the audio/video decompression techniques described herein). In one embodiment, when the session is terminated, the decoder 2900 is removed (e.g., uninstalled) from the client device 205.
In one embodiment, while the temporary decoder 2900 is being downloaded, the downloader application 2903 characterizes the channel by making channel assessments such as the data rate achievable on the channel (e.g., by determining how long the data takes to download), the packet-loss rate on the channel, and the latency of the channel. The downloader application 2903 generates channel characterization data describing these channel assessments. This channel characterization data is then transmitted from the client device 205 to the hosting-service downloader 2901, which uses it to determine how best to utilize the channel to transmit media to the client device 205.
During the download of the temporary decoder 2900, the client device 205 will typically send messages back to the hosting service 210. These messages may include acknowledgment messages indicating whether packets were received without error or with errors. In addition, the messages provide feedback to the downloader 2901 regarding the data rate (calculated based on the rate at which packets are received), the packet error rate (based on the percentage of packets reported as received with errors), and the round-trip latency of the channel (based on the amount of time that elapses before the downloader 2901 receives feedback about a given packet it has transmitted).
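The channel characterization data described above can be computed from bookkeeping the downloader already performs during the transfer. The following is an illustrative sketch; the field names and the summary statistics chosen (a mean for the round-trip samples) are assumptions, not taken from the patent.

```python
# Hypothetical summary of the three channel measurements named in the text:
# achievable data rate, packet error rate, and round-trip latency.

def characterize_channel(bytes_received, seconds, pkts_ok, pkts_error,
                         rtt_samples_ms):
    """Summarize channel assessments made while downloading the decoder."""
    total_pkts = pkts_ok + pkts_error
    return {
        "data_rate_bps": bytes_received * 8 / seconds,
        "packet_error_rate": pkts_error / total_pkts if total_pkts else 0.0,
        "round_trip_ms": sum(rtt_samples_ms) / len(rtt_samples_ms),
    }

# e.g., 5 MB downloaded in 20 s, 2 errored packets out of 1000, three RTT probes
stats = characterize_channel(5_000_000, 20.0, 998, 2, [40.0, 60.0, 50.0])
```

The resulting dictionary stands in for the "channel characterization data" transmitted back to the hosting-service downloader 2901.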
By way of example, if the data rate is determined to be 2 Mbps, the downloader selects a smaller video window resolution for the encoder 2902 (e.g., 640 × 480 at 60 fps) than if the data rate were determined to be 5 Mbps (e.g., 1280 × 720 at 60 fps). Depending on the packet-loss rate, different Forward Error Correction (FEC) schemes or packet structures may be selected.
If the packet loss is very low, the compressed audio and video may be transmitted without any error correction. If the packet loss is moderate, the compressed audio and video may be transmitted with error-correction coding techniques (e.g., such as those previously described and illustrated in FIGS. 11a, 11b, 11c, and 11d). If the packet loss is very high, it may be determined that an audiovisual stream of adequate quality cannot be delivered, in which case the client device 205 may either notify the user that the hosting service is not available via the communication channel (i.e., the "link"), or it may attempt to establish a different route to the hosting service that has lower packet loss (as described below).
If the latency is low, the compressed audio and video may be transmitted with low latency and a session can be established. If the latency is too high (e.g., greater than 80 ms), then, for games requiring low latency, the client device 205 may notify the user that the hosting service is not available via the link; that it is available, but the response time to user inputs will be sluggish or "laggy"; or that the user may attempt to establish a different route to the hosting service with lower latency (as described below).
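The mapping from measured channel to stream parameters in the three preceding paragraphs can be sketched as one decision function. The 2 Mbps/5 Mbps resolutions and the 80 ms latency cutoff come from the text; the packet-loss cutoffs (and the function and field names) are invented for illustration.

```python
# Hypothetical parameter selection from channel measurements: resolution
# from data rate, FEC from packet loss, and a low-latency suitability flag
# from round-trip time.

def select_stream_params(data_rate_bps, packet_loss, round_trip_ms):
    if packet_loss > 0.10:       # "very high" loss (assumed cutoff): give up
        return {"status": "unavailable"}
    params = {"status": "ok", "fps": 60}
    # Resolutions quoted in the text: 5 Mbps -> 1280x720, 2 Mbps -> 640x480.
    params["resolution"] = (1280, 720) if data_rate_bps >= 5_000_000 else (640, 480)
    # "Moderate" loss (assumed cutoff) triggers error-correction coding.
    params["fec"] = "strong" if packet_loss > 0.001 else "none"
    # 80 ms threshold for latency-sensitive games, per the text.
    params["low_latency_ok"] = round_trip_ms <= 80
    return params

p = select_stream_params(5_000_000, 0.0005, 40)
```

With these inputs the sketch selects 1280 × 720 with no FEC and flags the link as suitable for low-latency games.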
The client device 205 may attempt to connect to the hosting service 210 via another route through the network (e.g., the Internet) to see whether the impairments are reduced (e.g., whether the packet loss is lower, the latency is lower, or even whether the data rate is higher). For example, the hosting service 210 may connect to the Internet from multiple geographic locations (e.g., a hosting center in Los Angeles and one in Denver), and there may be high packet loss due to congestion in Los Angeles but no congestion in Denver. In addition, the hosting service 210 may connect to the Internet through multiple Internet service providers (e.g., AT&T and Comcast).
Congestion or other problems between the client device 205 and one of the service providers (e.g., AT&T) may result in packet loss and/or high latency and/or a constrained data rate. However, if the client device 205 connects to the hosting service 210 through another service provider (e.g., Comcast), it may be able to connect without congestion problems and/or with lower packet loss and/or with lower latency and/or with a higher data rate. Thus, if, while downloading the temporary decoder 2900, the client device 205 experiences packet loss above a specified threshold (e.g., a specified number of dropped packets over a specified duration), latency above a specified threshold, and/or a data rate below a specified threshold, then in one embodiment it attempts to reconnect to the hosting service 210 via an alternative route (typically by connecting to a different IP address or a different domain name) to determine whether a better connection can be obtained.
If, after the alternative connection options are exhausted, the connection still experiences unacceptable impairments, it may be that the client device 205's local connection to the Internet is impaired, or that the client device is too far from the hosting service 210 to achieve adequate latency. In such a case, the client device 205 may notify the user that the hosting service is not available via the link, or that it is available only with impairments and/or that only certain types of low-latency games/applications are usable.
Because this assessment and potential improvement of the link characteristics between the hosting service 210 and the client device 205 occurs while the temporary decoder is being downloaded, it reduces the amount of time the client device 205 would otherwise need to spend separately downloading the temporary decoder 2900 and assessing the link characteristics. Nonetheless, in another embodiment, the client device 205 performs the assessment and potential improvement of the link characteristics separately from the download of the temporary decoder 2900 (e.g., by using dummy test data rather than the decoder program code). There are many reasons why this may be a preferable implementation. For example, in some embodiments the client device 205 is implemented partially or entirely in hardware. For these embodiments, there is no software decoder per se to download.
Compression using standard-based image block sizes
As mentioned above, when tile-based compression is used, the underlying principles of the invention are not limited to any particular tile size, shape, or orientation. For example, in a DCT-based compression system such as MPEG-2 or MPEG-4, the tiles may be the size of a macroblock (a component used in data compression that generally represents a block of 16 × 16 pixels). This embodiment provides a very fine level of granularity for working with tiles.
Moreover, regardless of the tile size, various types of tiling patterns may be used. For example, FIG. 30 illustrates an embodiment in which multiple I-tiles are used in each R-frame 3001-3004. A rotating pattern is used, in which I-tiles are dispersed throughout each R-frame such that a complete I-frame is generated every four R-frames. Dispersing the I-tiles in this manner lessens the impact of a packet loss (limiting the loss to a small region of the display).
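The rotating I-tile placement of FIG. 30 can be sketched as follows, assuming four tiles per frame as in the figure. The function name is illustrative.

```python
# Sketch of the rotating I-tile pattern: a different tile position is coded
# as an I tile in each successive R-frame, so any four consecutive R-frames
# together contain the equivalent of one complete I-frame.

TILES_PER_FRAME = 4

def tile_types(r_frame_index):
    """Return the tile types ('I' or 'P') for one R-frame in the rotation."""
    i_tile = r_frame_index % TILES_PER_FRAME
    return ["I" if t == i_tile else "P" for t in range(TILES_PER_FRAME)]

pattern = [tile_types(n) for n in range(4)]
# frame 0: I P P P; frame 1: P I P P; frame 2: P P I P; frame 3: P P P I
```

Because the I-tiles cycle through every tile position, a lost packet corrupts at most one tile position until that position's next I-tile arrives, which is the loss-containment property described above.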
The tile size may also conform to a native structure of the underlying compression algorithm. For example, if the H.264 compression algorithm is used, in one embodiment the tiles are set to the size of an H.264 "slice." This allows the techniques described herein to be easily integrated with a variety of prevailing standard compression algorithms, such as H.264 and MPEG-4. Once the tile size is set to a native compression structure, the same techniques as those described above may be implemented.
Techniques for stream rewind and playback operations
As previously described in connection with FIG. 15, the uncompressed video/audio stream 1529 generated by the app/game servers 1521-1525 may be compressed by the shared hardware compression 1530 at multiple resolutions simultaneously, producing multiple compressed video/audio streams 1539. For example, a video/audio stream generated by app/game server 1521 may be compressed by the shared hardware compression 1530 at 1280 × 720 × 60 fps and transmitted to a user as outbound Internet traffic 1599 via outbound routing 1540. That same video/audio stream may simultaneously be scaled down to thumbnail size (e.g., 200 × 113) by the shared hardware compression 1530 and sent to app/game server 1522 via path 1552 (or through delay buffer 1515) to be displayed as one thumbnail 1600 of the collection of thumbnails in FIG. 16. When the thumbnail 1600 is zoomed through intermediate size 1700 in FIG. 17 up to size 1800 (1280 × 720 × 60 fps) in FIG. 18, rather than decompressing the thumbnail stream, app/game server 1522 may decompress a copy of the 1280 × 720 × 60 fps stream being sent to the user of app/game server 1521 and scale the higher-resolution video as it is zoomed from thumbnail size up to 1280 × 720. This approach has the advantage of reusing the 1280 × 720 compressed stream twice.
But the approach has several disadvantages: (a) if the data throughput of the user's Internet connection changes, the compressed video stream sent to the user may vary in image quality, resulting in varying image quality seen by the "watching" user of app/game server 1522, even though that user's own Internet connection does not change; (b) app/game server 1522 will have to use processing resources to decompress the entire 1280 × 720 image and then scale that image (likely applying a resampling filter) to display a much smaller size (e.g., 640 × 360) during the zoom; (c) if frames are dropped because of limited Internet-connection bandwidth and/or lost/corrupted packets, and the watching user "rewinds" and "pauses" the video recorded in the delay buffer 1515, the watching user will find that the dropped frames are missing from the delay buffer (which would be particularly apparent if the user "steps" through frame-by-frame); and (d) if the watching user rewinds to find a particular frame in the video recorded in the delay buffer, app/game server 1522 will have to find an I-frame or I-tiles in the video stream recorded in the delay buffer prior to the sought frame and then decompress all of the P-frames/tiles until the desired frame is reached. These same limitations apply not only to users "watching" the video/audio stream live, but also to users (including the user who generated the video/audio stream) viewing an archived copy of the video/audio stream (e.g., a "highlight clip").
An alternative embodiment of the invention addresses these issues by compressing the video stream in more than one size and/or structure. One stream (the "live" stream) is compressed optimally, based on the characteristics of the network connection (e.g., data bandwidth, packet reliability) and the user's local client capabilities (e.g., decompression capability, display resolution), and streamed to the end user, as described herein. Other streams (referred to herein as "HQ" streams) are compressed at high quality, at one or more resolutions, and in a structure amenable to video playback operations, and such HQ streams are routed and stored within the server center 210. For example, in one embodiment the HQ compressed streams are stored on RAID disk array 1515 and used to provide functions such as pause, rewind, and other playback operations (e.g., "highlight clips," which may be distributed to other users for viewing).
As illustrated in FIG. 31a, one embodiment of the invention comprises an encoder 3100 capable of compressing a video stream in at least two formats: one format 3110 that periodically includes I-tiles or I-frames, and one format 3111 that does not include I-tiles or I-frames unless they are necessary due to an interruption of the stream or because an I-tile or I-frame is determined to be likely smaller than a P-tile or P-frame (as described above). For example, the "live" stream 3111 transmitted to the user while playing a video game may be compressed using only P-tiles/frames (unless an I-tile or I-frame is necessary or smaller, as described above). In addition, the encoder 3100 of this embodiment simultaneously compresses the live video stream 3111 in a second format 3110 that, in one embodiment, periodically includes I-tiles or I-frames (or a similar type of image format).
Although the embodiments described above use I-tiles, I-frames, P-tiles, and P-frames, the underlying principles of the invention are not limited to any particular compression algorithm. For example, any type of image format in which frames depend on previous or subsequent frames may be used in place of the P-tiles or P-frames. Likewise, any type of image format that does not depend on previous or subsequent frames may be used in place of the I-tiles or I-frames described above.
As mentioned above, the HQ stream 3110 includes periodic I-frames (e.g., in one embodiment, roughly every 12 frames). This matters because, if the user ever wants to rapidly rewind the stored video stream to a particular point, an I-tile or I-frame is needed. With a compressed stream of only P-frames (i.e., without an I-frame beginning the sequence), the decoder would have to go back to the first frame of the sequence (which might be hours long) and decompress P-frames all the way up to the point to which the user wants to rewind. With an I-frame stored every 12 frames in the HQ stream 3110, the user can decide to rewind to any particular point, and the nearest preceding I-frame of the HQ stream is no more than 12 frames before the desired frame. Even if the decoder's maximum decode rate is only real-time (e.g., one-sixtieth of a second per frame for a 60-frame/second stream), decoding 12 frames takes 12/60 = 1/5 of a second. And in many cases the decoder can operate faster than real-time, so, for example, at 2× real-time the decoder can decode the 12 frames in 6 frame times, a "rewind" delay of merely 1/10 of a second. Needless to say, if the nearest preceding I-frame were a very large number of frames before the rewind point, even a fast decoder (e.g., 10× real-time) would impose an unacceptable delay (e.g., rewinding through 1 hour of video at 10× speed would take 6 minutes). In another embodiment, periodic I-tiles are used; in that case, when the user seeks to rewind, the decoder finds the nearest preceding I-tile before the rewind point and then begins decoding each tile from that point forward until all tiles have been decoded up to the rewind point. Although periodic I-tiles or I-frames result in less efficient compression than eliminating I-frames altogether, the hosting service 210 generally has more than enough locally available bandwidth and storage capacity to manage the HQ streams.
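The rewind arithmetic in the paragraph above can be made concrete with a short sketch. The 12-frame I-frame interval and the 60 fps rate come from the text; the function names are illustrative.

```python
# Sketch of the rewind calculation: seek to the nearest preceding I-frame
# and decode forward to the target. Worst case, 12 frames must be decoded.

I_FRAME_INTERVAL = 12
FPS = 60

def frames_to_decode(target_frame):
    """Frames that must be decoded (I-frame through target, inclusive)
    to reach target_frame from the nearest preceding I-frame."""
    i_frame = (target_frame // I_FRAME_INTERVAL) * I_FRAME_INTERVAL
    return target_frame - i_frame + 1

def rewind_delay_seconds(target_frame, decoder_speed=1.0):
    """Delay before playback can resume, at a decoder running at
    decoder_speed times real-time."""
    return frames_to_decode(target_frame) / (FPS * decoder_speed)

# Worst case (target is 12 frames past the I-frame): real-time decoder
# needs 12/60 = 1/5 s; a 2x real-time decoder needs only 1/10 s.
worst = rewind_delay_seconds(23, decoder_speed=1.0)
fast = rewind_delay_seconds(23, decoder_speed=2.0)
```

The same seek-to-nearest-I logic applies per tile when periodic I-tiles are used instead of full I-frames.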
In another embodiment, the encoder 3100 encodes the HQ stream with periodic I-tiles or I-frames followed by P-tiles or P-frames, as previously described, but also preceded by B-tiles or B-frames. B-frames are frames that precede an I-frame and are based on the frame differences from that I-frame working backward in time. B-tiles are the tile counterpart: tiles that precede an I-tile and are based on frame differences from that I-tile working backward in time. In this embodiment, if the desired rewind point is a B-frame (or contains B-tiles), the decoder finds the nearest subsequent I-frame or I-tile and decodes backward in time until the desired rewind point is decoded; then, as video playback proceeds forward from that point, the decoder decodes the B-frames, I-frames, and P-frames (or their tile counterparts) in successive frames going forward. The advantage of using B-frames or B-tiles in addition to the I and P types is that higher quality can often be achieved at a given compression ratio.
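The backward decode for a B-frame rewind point can be sketched as below. The frame indexing (I-frames at multiples of the period) and the function name are illustrative assumptions, not from the patent text.

```python
# Hypothetical sketch: if the rewind point falls on a B-frame, locate the nearest
# I-frame AT OR AFTER it, then decode backward in time down to the rewind point,
# after which playback proceeds forward frame by frame.
def b_rewind_decode_order(rewind_frame, gop):
    """Frame indices decoded, in order, to reach a B-frame rewind point."""
    next_i = -(-rewind_frame // gop) * gop            # nearest I-frame at or after
    return list(range(next_i, rewind_frame - 1, -1))  # decode backward to the point
```

For example, with an 8-frame period and a rewind point at frame 5, the decoder decodes frames 8, 7, 6, and 5, then resumes forward playback from frame 5.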
In yet another embodiment, the encoder 3100 encodes the HQ stream entirely as I-frames. This approach has the advantage that every rewind point is an I-frame, so no other frames need to be decoded to reach the rewind point. The disadvantage is that the compressed data rate will be very high compared to I,P or I,P,B stream encoding.
Other video stream playback operations (e.g., fast or slow rewind, fast or slow forward, etc.) are generally far more practical with periodic I-frames or I-tiles (alone or combined with P and/or B counterparts), because in each case the stream is played back in a frame order different from frame-by-frame forward in time, and the decoder therefore needs to find and decode a particular (often arbitrary) frame in the sequence. For example, in the case of very fast forward (e.g., 100x speed), each successive frame displayed is 100 frames after the previous one. Even with a decoder that operates at 10x real-time and decodes 10 frames in 1 frame time, it would still be 10x too slow to achieve 100x fast forward. With periodic I-frames or I-tiles as described above, however, the decoder can find the nearest applicable I-frame or I-tile before the frame required next and decode only the intervening frames or tiles up to the target frame.
In another embodiment, in which I-frames are encoded into the HQ stream at a consistent period (e.g., always every 8 frames), the speeds made available to the user for fast forward and rewind that are faster than the I-frame period are exact multiples of the I-frame period. For example, with an I-frame period of 8 frames, the fast forward or rewind speeds made available to the user might be 1x, 2x, 3x, 4x, 8x, 16x, 64x, 128x, and 256x. For speeds faster than the I-frame period, the decoder first skips forward at that speed to the nearest frame aligned with an I-frame (e.g., if the currently displayed frame is 3 frames before an I-frame, then at 128x the decoder skips to the frame 128+3 frames ahead), and for each subsequent frame the decoder skips exactly the selected speed (e.g., at 128x, it skips 128 frames), landing exactly on an I-frame each time. Thus, provided all speeds faster than the I-frame period are exact multiples of the I-frame period, the decoder never needs to decode any preceding or following frames to find the desired frame; it decodes exactly one I-frame per displayed frame. For speeds slower than the I-frame period (e.g., 1x, 2x, 3x, 4x), or for faster speeds that are not multiples of the I-frame period, for each displayed frame the decoder seeks the starting frame that minimizes the number of additional frames that must be newly decoded to display the desired frame (either an undecoded I-frame, or an already-decoded frame still available in decoded form in RAM or other fast storage), and then decodes intervening frames, as needed, until the desired frame is decoded and displayed.
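The I-frame-aligned skipping can be sketched as follows. This is an illustrative sketch only; the function name and the convention that I-frames sit at frame indices that are multiples of the period are assumptions made for the example.

```python
# Hypothetical sketch: when the fast-forward speed is an exact multiple of the
# I-frame period, the first skip also absorbs the offset to the next I-frame, so
# every displayed frame thereafter lands exactly on an I-frame.
def fast_forward_schedule(current, speed, gop, count):
    """Frame indices displayed during fast forward; each lands on an I-frame."""
    assert speed % gop == 0, "speed must be an exact multiple of the I-frame period"
    to_next_i = (-current) % gop                  # frames until the next I-frame
    frames = [current + speed + to_next_i]        # first skip aligns to an I-frame
    for _ in range(count - 1):
        frames.append(frames[-1] + speed)         # later skips stay I-frame aligned
    return frames
```

With an 8-frame period at 128x, a current frame 3 frames before an I-frame produces a first skip of 128+3 frames, matching the example in the text, and every scheduled frame is an I-frame.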
For example, consider fast forward at 4x in an I,P-encoded sequence with an 8-frame I-frame period, where the current frame is a P-frame 1 frame after an I-frame. The desired frame to display is 4 frames later, i.e., the 5th P-frame after the previous I-frame. If the currently displayed frame (which was just decoded) is used as the starting point, the decoder needs to decode 4 more P-frames to display the desired frame. If the previous I-frame is used instead, the decoder needs to decode 6 frames (the I-frame and the following 5 P-frames). (Clearly, in this case it is advantageous to use the currently displayed frame, to minimize the number of additionally decoded frames.) The next frame to display, 4 frames later, is the first P-frame after the next I-frame. In this case, if the currently decoded frame is used as the starting point, the decoder must decode 4 frames (2 P-frames, an I-frame, and a P-frame). If the next I-frame is used instead, the decoder need only decode that I-frame and the following P-frame. (Clearly, in this case it is advantageous to use the next I-frame as the starting point, to minimize the number of additionally decoded frames.) Thus, in this example, the decoder alternates between using the currently decoded frame as the starting point and using a subsequent I-frame as the starting point. As a general principle, regardless of the HQ video stream playback mode (fast forward, rewind, or step) and speed, for each successive frame displayed in that mode and at that speed, the decoder starts from whichever frame, an I-frame or a previously decoded frame, minimizes the number of frames that must be newly decoded to display the desired frame.
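The starting-point selection in the worked example above can be sketched as a small cost function. This is a simplified illustration for an I,P sequence only (no B-frames); the function name and the set-of-decoded-frames representation are assumptions made for the example.

```python
# Hypothetical sketch of the guiding principle: to display frame `target`, compare
# decoding from the nearest I-frame at or before the target against continuing
# from a still-cached decoded frame, and take whichever needs fewer new decodes.
def frames_to_decode(target, gop, decoded):
    """Fewest newly decoded frames needed to display `target` (I,P sequence)."""
    i_frame = (target // gop) * gop
    cost_from_i = target - i_frame + 1      # decode the I-frame plus P-frames
    # A cached decoded frame helps only if it lies between that I-frame and target.
    in_range = [target - f for f in decoded if i_frame <= f <= target]
    return min([cost_from_i] + in_range)
```

Running the two steps of the 4x example: displaying frame 5 with frame 1 already decoded costs 4 new decodes (continue forward), while displaying frame 9 with frame 5 decoded costs only 2 (start from the I-frame at 8), matching the alternation described above.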
As shown in FIG. 31b, one embodiment of the hosting service 210 includes stream replay logic 3112 for managing user requests to replay HQ streams 3110. The stream replay logic 3112 receives client requests containing video playback commands (e.g., pause, rewind, play from a specified point, etc.), interprets the commands, and decodes the HQ stream 3110 from the specified point (starting from an I-frame or a previously decoded frame, as appropriate, and proceeding forward or backward to the specified point). In one embodiment, the decoded HQ stream is provided to an encoder 3100 (potentially the very same encoder 3100, if it is capable of encoding more than one stream at a time, or a separate encoder 3100) so that it can be recompressed (using the techniques described herein) and transmitted to the client device 205. The decoder 3102 on the client device then decodes and renders the stream, as described above.
In one embodiment, the stream replay logic 3112 does not decode the HQ stream and then have the encoder 3100 re-encode it. Instead, it simply streams the HQ stream 3110 from the specified point directly to the client device 205. The decoder 3102 on the client device 205 then decodes the HQ stream. Because the playback functions described here generally do not have the same low-latency requirements as playing a real-time video game (e.g., if a player is merely viewing prior gameplay, not actively playing), the added latency typically inherent in the otherwise higher-quality HQ stream can still result in an acceptable end-user experience (e.g., video with higher latency but higher quality).
By way of example and not limitation, if a user is playing a video game, the encoder 3100 provides a live stream of essentially all P-frames, optimized for the user's connection and local client (e.g., approximately 1.4 Mbps at 640 x 360 resolution). At the same time, the encoder 3100 also compresses the video stream within the hosting service 210 as an HQ stream 3110 and stores the HQ stream on a local digital video recorder (DVR) RAID array at, e.g., 1280 x 720 at 10 Mbps with an I-frame every 12 frames. If the user hits a "pause" button, the game pauses on the client's last decoded frame and the screen freezes. If the user then hits a "rewind" button, the stream replay logic 3112 reads the HQ stream 3110 from the DVR RAID, starting from the nearest I-frame or available already-decoded frame, as described above. The stream replay logic 3112 decompresses intervening P- or B-frames as needed, reorders the frames as needed so that the playback sequence runs backward at the desired rewind speed, and then resizes (using image scaling techniques well known in the art) the decoded stream intended for display from 1280 x 720 down to 640 x 360; the live stream encoder 3100 then recompresses the reordered stream at 640 x 360 resolution and transmits it to the user. If the user pauses again and then single-steps through the video to view a sequence closely, the HQ stream 3110 on the DVR RAID has every frame available for single-stepping (even though the original live stream may have dropped frames for any of the many reasons described herein). Moreover, the quality of the video playback will be quite high at every point in the HQ stream, whereas there may be points in the live stream where, e.g., impaired bandwidth caused a temporary reduction in compressed image quality.
Whereas impaired image quality may be acceptable to the user when it lasts a brief period or appears in a moving image, it may not be acceptable if the user stops at a particular frame (or slowly steps through frames) and studies that frame closely. The user is also given the ability to fast forward, or to jump to a particular point, by specifying a point within the HQ stream (e.g., 2 minutes earlier). None of these operations would be possible in their full generality and at high quality with a live video stream that consists only of P-frames, or that has infrequent (or unpredictably placed) I-frames.
In one embodiment, the user is provided with a video window (not shown), such as an Apple QuickTime or Adobe Flash video window, with a "scrubber" (i.e., a left-right slider control) that allows the user to sweep forward and backward through the video stream as far back as the HQ stream has stored video. Although it appears to the user that he or she is "scrubbing" through the live stream, in fact he or she is "scrubbing" through the stored HQ stream 3110, which is then resized and recompressed as a live stream. In addition, as mentioned previously, if the HQ stream is watched simultaneously by anyone else, or by the same user at a different time, it can be viewed at a higher (or lower) resolution than the live stream's resolution, and the quality will be as high as the quality of the viewer's live stream, potentially up to the quality of the HQ stream.
Thus, by encoding both the live stream (in a manner suitable for its low-latency, bandwidth, and packet-tolerance requirements, as described herein) and the HQ stream (with its high-quality, stream-playback operational requirements), the user is provided with the desired characteristics of both configurations. And, in fact, the two differently-encoded streams are effectively transparent to the user. From the user's perspective, the experience is highly responsive with very low latency, despite running over a highly variable and relatively low-bandwidth Internet connection, while the digital video recording (DVR) functionality is of very high quality, with flexible operations and flexible speeds.
As a result of the techniques described above, the user receives the benefits of both the live and HQ video streams during online game play or other online interaction, without suffering the limitations of either the live stream or the HQ stream.
FIG. 31c illustrates one embodiment of a system architecture for performing the above operations. As illustrated, in this embodiment the encoder 3100 encodes a series of "live" streams 3121L-3125L and a corresponding series of "HQ" streams 3121H1-H3 through 3125H1-H3, respectively. Each HQ stream H1 is encoded at full resolution, while the encoders for each H2 and H3 stream scale the video stream to a smaller size before encoding. For example, if the video stream is at 1280 x 720 resolution, H1 is encoded at 1280 x 720, while H2 might be scaled to 640 x 320 and encoded at that resolution, and H3 might be scaled to 320 x 180 and encoded at that resolution. Any number of simultaneous Hn scalers/encoders may be used, providing multiple simultaneous HQ encodings at a variety of resolutions.
Each of the live streams operates in response to channel feedback signals 3161-3165 received via the inbound Internet connection 3101, as described above (see, e.g., the discussion of feedback signals 2501 and 2601 in FIGS. 25-26). The live streams are transmitted out over the Internet (or another network) via the outbound routing logic 3140. The live compressors 3121L-3125L include logic for adapting the compressed video streams based on the channel feedback, including scaling frames, dropping frames, and the like.
The inbound routing logic 3141 and 1502 routes the HQ streams via signal path 3151 to an internal delay buffer (e.g., the RAID array 3115) or other data storage device, and/or feeds the HQ streams back via signal path 3152 to the application/game servers and encoder 3100 for additional processing. As described above, the HQ streams 3121Hn-3125Hn are subsequently streamed to end users upon request (see, e.g., FIG. 31b and its associated discussion).
In one embodiment, the encoder 3100 is implemented as the shared hardware compression logic 1530 shown in FIG. 15. In another embodiment, some or all of the encoders and scalers are separate subsystems. The underlying principles of the invention are not limited to any particular sharing arrangement or hardware/software configuration of the scaling or compression resources.
An advantage of the configuration of FIG. 31c is that application/game servers 3121-3125 that need a video window smaller than full size do not have to process and decompress a full-size stream. In addition, application/game servers 3121-3125 that need an intermediate window size can receive a compressed stream close to the desired window size and then scale it up or down to the desired size. Furthermore, if multiple application/game servers 3121-3125 request the same video stream at the same size, the stream can be broadcast (multicast) to all of them at once rather than sent as a separate stream to each. If an application/game server receiving such a broadcast changes the size of its video window, it can switch to a broadcast of a different video size. Thus, an arbitrarily large number of users can simultaneously view an application/game server video stream, each with the flexibility to scale his or her video window, and each always gaining the benefit of a video stream scaled close to the desired window size.
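The selection of an appropriate broadcast for a given window size can be sketched as below. This is an illustrative sketch only; the policy of preferring the smallest stream that still covers the window, along with the function name, is an assumption for the example (the text says only "close to the desired window size").

```python
# Hypothetical sketch: a server with a given window size picks the HQ broadcast
# whose encoded resolution is nearest to the window, then scales up or down,
# rather than decompressing the full-size stream.
HQ_SIZES = [(1280, 720), (640, 320), (320, 180)]  # H1, H2, H3 from FIG. 31c

def pick_broadcast(window_w, window_h):
    """Smallest encoded size that still covers the window (prefer scaling down)."""
    covering = [s for s in HQ_SIZES if s[0] >= window_w and s[1] >= window_h]
    if covering:
        return min(covering, key=lambda s: s[0] * s[1])
    return max(HQ_SIZES, key=lambda s: s[0] * s[1])  # window larger than any stream
```

For a 400 x 225 window this picks the 640 x 320 broadcast, avoiding both the cost of decompressing 1280 x 720 and the quality loss of upscaling 320 x 180.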
One disadvantage of the approach shown in FIG. 31c is that in many practical implementations of the hosting service 210, there is never a time when all of the compressed HQ streams, let alone all sizes of all compressed HQ streams, are being viewed at once. This waste is mitigated when the encoder 3100 is implemented as a shared resource (e.g., a pool of scalers/compressors, implemented in software or in hardware). However, because of the bandwidth involved, there may be practical problems in connecting a large number of uncompressed streams to a common shared resource. For example, each 1080p60 stream is almost 3 Gbps, far in excess of even gigabit Ethernet. The following alternative embodiments address this problem.
FIG. 31d shows an alternative embodiment of the hosting service 210 in which each application/game server 3121-3125 has two compressors allocated to it: (1) a live stream compressor 3121L-3125L, which adapts the compressed video stream based on channel feedback 3161-3165, and (2) an HQ stream compressor, which outputs a full-resolution HQ stream, as described above. Notably, the live compressor is dynamic and adaptive, utilizing two-way communication with the client 205, while the HQ stream is non-adaptive and one-way. Among the other differences between the streams, the live stream quality may vary dramatically depending on channel conditions and the nature of the video material. Some frames may be of low quality, and frames may be dropped. Also, the live stream may be almost entirely P-frames or P-tiles, with I-frames or I-tiles appearing infrequently. The HQ stream generally has a higher data rate than the live stream, and it delivers consistently high quality without dropping any frames. The HQ stream may be all I-frames, or may have frequent and/or regular I-frames or I-tiles. The HQ stream may also include B-frames or B-tiles.
In one embodiment, the shared video scaling and recompression 3142 (described in detail below) selects only certain of the HQ video streams 3121H1-3125H1 to be scaled and recompressed at one or more different resolutions before being sent to the inbound routing 3141 for routing as previously described. The other HQ video streams either pass through at full size to the inbound routing 3141 for routing as previously described, or are not passed through at all. In one embodiment, the decision as to which HQ streams are scaled and recompressed and/or which HQ streams pass through at full size is made based on whether some application/game server 3121-3125 has requested that particular HQ stream at that particular resolution (or at a resolution close to the scaled or full resolution). Through this means, the only HQ streams that are scaled and recompressed (or potentially passed through at full size) are the HQ streams that are actually needed. In many applications of the hosting service 210, this results in a dramatic reduction in scaling and compression resources. Further, given that every HQ stream is at least compressed at full resolution by the compressors 3121H1-3125H1, the bandwidth that must be routed into and within the shared video scaling and recompression 3142 is dramatically reduced compared to accepting uncompressed video. For example, a 3 Gbps uncompressed 1080p60 stream might be compressed to 10 Mbps and still retain very high quality. Thus, with gigabit Ethernet connectivity, rather than being unable to carry even one uncompressed 3 Gbps video stream, it is possible to carry many 10 Mbps video streams with no apparent reduction in quality.
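The bandwidth figures above can be checked with back-of-the-envelope arithmetic (the variable names are illustrative; the rates come directly from the text):

```python
# Back-of-the-envelope check: an uncompressed 1080p60 stream (~3 Gbps) exceeds a
# gigabit Ethernet link, while the ~10 Mbps compressed HQ streams fit many times over.
GIGABIT_MBPS = 1_000                 # gigabit Ethernet capacity
UNCOMPRESSED_1080P60_MBPS = 3_000    # ~3 Gbps uncompressed, per the text
HQ_STREAM_MBPS = 10                  # compressed HQ stream, per the text

fits_uncompressed = UNCOMPRESSED_1080P60_MBPS <= GIGABIT_MBPS  # does not fit
hq_streams_per_link = GIGABIT_MBPS // HQ_STREAM_MBPS           # many streams fit
```

One gigabit link cannot carry even a single uncompressed stream, but carries on the order of a hundred compressed HQ streams, which is the rationale for compressing at full resolution before routing.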
FIG. 31f shows details of the shared video scaling and recompression 3142, together with the many HQ video compressors HQ 3121H1-3131H1. Per requests from the application/game servers 3121-3125 for particular video streams at particular sizes, an internal routing 3192 generally selects a subset of the compressed HQ streams from the HQ video compressors HQ 3121H1-3131H1. A stream within this selected subset is routed through the decompressors 3161-3164 if the requested stream is to be scaled, or routed via the non-scaled video path 3196 if the requested stream is at full resolution. The streams to be scaled are decompressed to uncompressed video by the decompressors 3161-3164, then each scaled to the requested size by the scalers 3171-3174, then each compressed by the compressors 3181-3184. Note that if a particular HQ stream is requested at more than one resolution, the internal routing 3192 multicasts the stream (using IP multicast techniques well known to practitioners in the art) to one or more of the decompressors 3161-3164 and (if one of the requested sizes is full resolution) to the outbound routing 3193. All of the requested streams, whether scaled (from the compressors 3181-3184) or not (from the internal routing 3192), are then sent to the outbound routing 3193. The outbound routing 3193 then sends each requested stream to the application/game server 3121-3125 that requested it. In one embodiment, if more than one application/game server requests the same stream at the same resolution, the outbound routing 3193 multicasts the stream to all of the application/game servers 3121-3125 that made the request.
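The routing decision described above can be sketched as a partition of the requests. This is an illustrative sketch only; the function name and the data representation (stream identifiers paired with requested sizes) are assumptions made for the example, not the patent's implementation.

```python
# Hypothetical sketch: split (stream, size) requests into full-size pass-throughs
# (the non-scaled video path 3196) and per-stream sets of sizes that need the
# decompress -> scale -> recompress pipeline. A stream requested at several sizes
# appears once per needed path, mirroring the multicast described in the text.
from collections import defaultdict

def route_requests(requests, full_res):
    """Partition requests into full-size pass-throughs and scaled-size sets."""
    unscaled, scaled = set(), defaultdict(set)
    for stream_id, size in requests:
        if size == full_res:
            unscaled.add(stream_id)       # non-scaled video path
        else:
            scaled[stream_id].add(size)   # one scaler/compressor per needed size
    return unscaled, dict(scaled)
```

A stream requested at both full resolution and a scaled resolution ends up on both paths, which is exactly the case where the internal routing multicasts it.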
In the presently preferred embodiment of the shared video scaling and recompression 3142, the routing is implemented using gigabit Ethernet switches, and the decompression, scaling, and compression are implemented by discrete special-purpose semiconductor devices that perform each function. The same functionality could be implemented in hardware with a higher degree of integration, or by very fast processors.
FIG. 31e shows another embodiment of the hosting service 210 in which the functions of the delay buffer 3115, described previously, are implemented in a shared video delay buffering, scaling, and decompression subsystem 3143. FIG. 31g shows the details of the subsystem 3143. The operation of the subsystem 3143 is similar to that of the subsystem 3142 shown in FIG. 31f, except that 3191 first selects which HQ video streams will be routed, per requests from the application/game servers 3121-3125; then, the HQ streams that are requested to be delayed are routed through a delay buffer 3194, implemented in this embodiment as a RAID array (but implementable as any storage medium of sufficient bandwidth and capacity), and the streams that are not requested to be delayed are routed through the non-delayed video path 3195. The outputs of the delay buffer 3194 and the non-delayed video 3195 are then routed by the internal routing 3192 based on whether each requested stream is to be scaled or not. Scaled streams are routed through the decompressors 3161-3164, the scalers 3171-3174, and the compressors 3181-3184 to the outbound routing 3193. Non-scaled video 3196 is also sent to the outbound routing 3193, after which the outbound routing 3193 sends the video to the application/game servers in unicast or multicast mode, in the same manner described previously for the subsystem 3142 of FIG. 31f.
Another embodiment of the video delay buffering, scaling, and decompression subsystem 3143 is shown in FIG. 31h. In this embodiment, an individual delay buffer HQ 3121D-HQ 3131D is provided for each HQ stream. Given the rapidly declining cost of RAM and flash ROM, which can be used to delay an individual compressed video stream, this may end up being less expensive and/or more flexible than a shared delay buffer 3194. Alternatively, in yet another embodiment, a single delay buffer 3197 (shown in dashed outline) can provide delay for all of the HQ streams individually in a high-performance collective resource (e.g., very fast RAM, flash, or disk). In any scenario, each delay buffer HQ 3121D-3131D is able to variably delay a stream from the HQ video source, or pass the stream through undelayed. In another embodiment, each delay buffer is able to provide multiple streams with different amounts of delay. All delays (or no delay) are requested by the application/game servers 3121-3125. In all of these cases, the delayed and non-delayed video streams 3198 are sent to the internal routing 3192 and proceed through the rest of the subsystem 3143 as previously described in connection with FIG. 31g.
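A per-stream delay buffer of the kind described can be sketched as a simple FIFO over compressed frames. This is a minimal sketch only; the class name, the frame-count unit of delay, and the pass-through behavior for zero delay are illustrative assumptions.

```python
# Hypothetical sketch: each compressed frame pushed in emerges `delay` frames
# later, or passes straight through when no delay is requested.
from collections import deque

class DelayBuffer:
    """Per-stream delay buffer holding compressed frames (delay in frames)."""
    def __init__(self, delay):
        self.delay = delay
        self.buf = deque()

    def push(self, frame):
        """Feed one compressed frame; return the frame due for output, if any."""
        if self.delay == 0:
            return frame                 # pass-through, no delay requested
        self.buf.append(frame)
        if len(self.buf) > self.delay:
            return self.buf.popleft()    # this frame is now `delay` frames old
        return None                      # still filling the delay window
```

Because the buffered data is already compressed (e.g., ~10 Mbps rather than ~3 Gbps), even inexpensive RAM or flash can hold a substantial delay window per stream, which is the economic point made above.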
Note that in the foregoing embodiments relating to the several FIG. 31 diagrams, the live streams utilize a two-way connection and are tailored to a particular user with minimal latency, whereas the HQ streams utilize one-way connections and are both unicast and multicast. Note also that although the multicast function is illustrated in these figures as a single unit, such as might be implemented in a gigabit Ethernet switch, in a large-scale system the multicast function would likely be implemented through a tree of multiple switches. Indeed, in the case of a video stream from a top-ranked video game player, it may well be that the player's HQ stream is watched by millions of users simultaneously. In such a case, there may be a great many individual switches in successive stages broadcasting the multicast HQ stream.
For diagnostic purposes, as well as to provide feedback to the user (e.g., to let the user know how popular his or her gameplay performance is), in one embodiment the hosting service 210 keeps track of how many viewers there are simultaneously for the video stream of each application/game server 3121-3125. This can be accomplished by keeping a running count of the number of active requests for a particular video stream. Thus, a player with 100,000 simultaneous viewers will know that his or her game play is very popular, and this will create an incentive for game players to perform better and attract viewers. When a video stream (e.g., of a video game championship match) has a very large audience, it may be desirable for commentators to speak during the match so that some or all of the users watching the multicast can hear their commentary.
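The active-request count described above can be sketched as a tiny counter object. The class and method names are illustrative assumptions; the text specifies only that a count of active requests is maintained.

```python
# Hypothetical sketch: the simultaneous-viewer figure is just a running count of
# open requests for a stream, incremented when a spectator connects and
# decremented when the spectator disconnects.
class StreamPopularity:
    """Track simultaneous viewers by counting active requests for a stream."""
    def __init__(self):
        self.active = 0
        self.peak = 0

    def open_request(self):      # a spectator starts viewing
        self.active += 1
        self.peak = max(self.peak, self.active)

    def close_request(self):     # a spectator stops viewing
        self.active -= 1
```

The current count drives the popularity feedback shown to the player, and a peak figure could additionally be reported after a match.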
The applications and games running on the application/game servers are provided with an application programming interface (API) through which an application and/or game can submit a request for a particular video stream with particular characteristics (e.g., resolution and amount of delay). Further, these APIs, subject to an operating environment running on the application/game server, or to the hosting service control system 401 of FIG. 4a, may deny such requests for a variety of reasons. For example, the requested video stream may have certain licensing restrictions (e.g., it may only be viewed by a single viewer, and may not be broadcast to others), may have subscription restrictions (e.g., the viewer may have to pay for the right to view the stream), may have age restrictions (e.g., the viewer may have to be 18 to view the stream), may have privacy restrictions (e.g., the person using the application or playing the game may permit viewing by only a selected number or class of viewers (e.g., his or her "friends"), or may not permit viewing at all), or may have restrictions requiring that the material be delayed (e.g., if the user is playing a stealth game in which his or her position might be revealed). There are any number of other restrictions that could limit viewing of a stream. In any of these cases, the application/game server's request would be denied with a statement of the reason for the denial and, in one embodiment, with alternatives under which the request would be accepted (e.g., stating what fee must be paid for a subscription).
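The restriction checks behind such an API can be sketched as below. This is an illustrative sketch only: the function signature, the dictionary field names (`single_viewer`, `subscription_fee`, `min_age`, `friends_only`, `required_delay`, etc.), and the (granted, reason) return shape are all hypothetical, not the patent's API.

```python
# Hypothetical sketch: each restriction named in the text is checked in turn;
# a denial carries the reason, and where applicable a remedy (e.g., the fee).
def request_stream(stream, viewer, delay):
    """Grant or deny a stream request; returns (granted, reason)."""
    if stream.get("single_viewer") and stream.get("viewer_count", 0) >= 1:
        return (False, "license: single-viewer stream already in use")
    fee = stream.get("subscription_fee", 0)
    if fee > viewer.get("paid", 0):
        return (False, f"subscription: pay {fee} to view")   # remedy included
    if viewer.get("age", 0) < stream.get("min_age", 0):
        return (False, "age restriction")
    if stream.get("friends_only") and viewer.get("id") not in stream.get("friends", ()):
        return (False, "privacy: friends only")
    if delay < stream.get("required_delay", 0):
        return (False, "stream must be viewed with a delay")
    return (True, "granted")
```

Note how the subscription denial carries the alternative under which the request would be accepted, matching the behavior described above.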
The HQ video streams stored in the delay buffers in any of the preceding embodiments can be exported for other purposes outside of the hosting service 210. For example, a particularly interesting video stream can be exported to YouTube upon request by an application/game server (typically at a user's request). In such a case, the video stream would be transmitted over the Internet in a format consistent with the YouTube format, together with appropriate descriptive information (e.g., the name of the player, the game, the time, the score, etc.). Returning to the commentary described above, commentary audio can be multicast in a separate stream to all of the game/application servers 3121-3125 requesting such commentary. The game/application servers merge the commentary audio into the audio stream sent to the user premises 211, using audio mixing techniques well known to practitioners in the art. There may well be multiple commentators (e.g., with different points of view, or in different languages), and users can select among them.
In a similar manner, separate audio streams can be mixed into, or substituted for, the audio tracks of particular video streams (or individual streams) within the hosting service 210, mixing or replacing the audio of video streamed in real time or from a delay buffer. Such audio could be commentary or narration, or it could provide voices for the characters in the video stream. This would make it easy for users to create machinima (user-generated animations made from video game video streams).
The video streams described throughout this document have been shown as video streams captured from the video output of an application/game server and then streamed and/or delayed and reused or distributed in a variety of ways. The same delay buffers can hold video material from non-application/game-server sources and provide the same degree of flexibility for playback and distribution, subject to appropriate restrictions. Such sources include live feeds from television stations (either over-the-air or non-over-the-air, such as CNN, and either pay, such as HBO, or free). Such sources also include pre-recorded movies or television shows, home movies, advertisements, and live video teleconference feeds. Live feeds may be handled like the live output of a game/application server. Pre-recorded material may be handled like the output of a delay buffer.
In one embodiment, the various functional modules and associated steps described herein may be performed by specific hardware components that contain hardwired logic for performing the steps, such as an application specific integrated circuit ("ASIC"), or by any combination of programmed computer components and custom hardware components.
In one embodiment, the modules may be implemented on a programmable digital signal processor ("DSP"), such as a Texas Instruments TMS320x-architecture DSP (e.g., a TMS320C6000, TMS320C5000, or the like). Various different DSPs may be used while still complying with these underlying principles.
Embodiments may include various steps as set forth above. The steps may be embodied in machine-executable instructions, which cause a general-purpose or special-purpose processor to perform certain steps. Various components not related to these basic principles (e.g., computer memory, hard disk drive, input device) have been omitted from the figures to avoid obscuring the relevant aspects.
Elements of the disclosed subject matter may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, flash memory, optical disks, CD-ROMs, DVD-ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media, or other types of machine-readable media suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program that may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
It should also be understood that elements of the disclosed subject matter may also be provided as a computer program product which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (e.g., a processor or other electronic device) to perform a sequence of operations. Alternatively, the operations may be performed by a combination of hardware and software. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, elements of the disclosed subject matter may be downloaded as a computer program product, wherein the program may be transferred from a remote computer or electronic device to a requesting process by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In addition, although the disclosed subject matter has been described in connection with specific embodiments, numerous modifications and variations are well within the scope of the present disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (24)
1. A computer-implemented method for performing video compression, the method comprising:
encoding a first plurality of video frames or portions thereof, wherein each encoded video frame or portion thereof depends on a previously encoded video frame or portion thereof, respectively;
transmitting the first plurality of encoded video frames or portions to a client device;
receiving feedback information from the client device, the feedback information being usable to determine whether data contained in the video frame or portion was not successfully received and/or decoded;
in response to detecting that a video frame or portion thereof has not been successfully received and/or decoded, encoding a current video frame or portion thereof as dependent on a previously encoded video frame or portion thereof that is known to have been successfully received and/or decoded on the client device; and
transmitting the current video frame or a portion thereof to the client device.
2. The method of claim 1, wherein the previously encoded video frame or portion thereof known to have been successfully received and/or decoded comprises a last known video frame or portion thereof known to have been successfully received and/or decoded.
3. The method of claim 1, wherein the video frame or portion thereof comprises a P-frame or P-tile, respectively.
4. The method of claim 1, wherein the feedback information comprises an indication that the frame or portion thereof has been successfully received and/or decoded at the client device.
5. The method of claim 1, wherein the feedback information comprises an indication that the frame or portion thereof was not successfully received and/or decoded at the client device.
6. The method of claim 1, wherein encoding a current video frame or a portion thereof as dependent on a previously encoded video frame or a portion thereof known to have been successfully received and/or decoded on the client device further comprises:
retrieving a previous state from memory, the previous state comprising a state of the encoder after encoding the previously encoded video frame or portion.
7. The method of claim 6, comprising:
decoding the current video frame or a portion thereof, wherein decoding comprises retrieving from memory a previous state comprising a state of a decoder after decoding the previously encoded video frame or portion.
8. The method of claim 1, further comprising:
decoding, at the client device, the encoded video frame or a portion thereof; and
displaying an image associated with each of the video frames or portions thereof on a display on the client device.
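The method of claims 1–8 can be illustrated with a minimal sketch. This is a hypothetical toy model (not from the patent text): "encoding" is reduced to recording which frame each P-frame depends on, so the reference-selection logic is visible. The class name `FeedbackEncoder` and its methods are illustrative assumptions; after the client reports a loss, the next frame is encoded against the last frame known to have been successfully received and/or decoded, rather than against the immediately preceding (possibly lost) frame.

```python
# Toy model of feedback-driven reference selection: each frame normally
# depends on the previous frame, but after a reported loss the encoder
# re-anchors on the last frame the client acknowledged.

class FeedbackEncoder:
    def __init__(self):
        self.last_acked = None     # newest frame known received/decoded by the client
        self.loss_reported = False
        self.frame_id = 0

    def encode_next(self):
        """Return (frame_id, reference_id) for the next P-frame."""
        if self.loss_reported and self.last_acked is not None:
            ref = self.last_acked  # re-anchor on a known-good frame
            self.loss_reported = False
        else:
            # normal case: depend on the immediately preceding frame
            ref = self.frame_id - 1 if self.frame_id > 0 else None
        fid = self.frame_id
        self.frame_id += 1
        return fid, ref

    def on_feedback(self, frame_id, ok):
        """Client feedback: an ACK advances last_acked; a NACK triggers re-anchoring."""
        if ok:
            self.last_acked = max(self.last_acked if self.last_acked is not None else -1, frame_id)
        else:
            self.loss_reported = True


enc = FeedbackEncoder()
enc.encode_next()             # frame 0, no reference (initial frame)
enc.on_feedback(0, ok=True)   # client confirms frame 0
enc.encode_next()             # frame 1 references frame 0
enc.encode_next()             # frame 2 references frame 1
enc.on_feedback(2, ok=False)  # frame 2 was lost
fid, ref = enc.encode_next()  # frame 3 re-anchors on frame 0, the last acked frame
```

Because frame 3 depends on a frame the client is known to hold, the client can decode it immediately, without waiting for a full I-frame refresh.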
9. A system comprising a memory for storing program code and a processor for processing the program code to perform the operations of:
encoding a first plurality of video frames or portions thereof, wherein each encoded video frame or portion thereof depends on a previously encoded video frame or portion thereof, respectively;
transmitting the first plurality of encoded video frames or portions to a client device;
receiving feedback information from the client device, the feedback information being usable to determine whether data contained in the video frame or portion was not successfully received and/or decoded;
in response to detecting that a video frame or portion thereof has not been successfully received and/or decoded, encoding a current video frame or portion thereof as dependent on a previously encoded video frame or portion thereof that is known to have been successfully received and/or decoded on the client device; and
transmitting the current video frame or a portion thereof to the client device.
10. The system of claim 9, wherein the previously encoded video frame or portion thereof known to have been successfully received and/or decoded comprises a last known video frame or portion thereof known to have been successfully received and/or decoded.
11. The system of claim 9, wherein the video frame or portion thereof comprises a P-frame or P-tile, respectively.
12. The system of claim 9, wherein the feedback information comprises an indication that the frame or portion thereof has been successfully received and/or decoded at the client device.
13. The system of claim 9, wherein the feedback information comprises an indication that the frame or portion thereof was not successfully received and/or decoded at the client device.
14. The system of claim 9, wherein encoding a current video frame or a portion thereof as dependent on a previously encoded video frame or a portion thereof known to have been successfully received and/or decoded on the client device further comprises:
retrieving a previous state from memory, the previous state comprising a state of the encoder after encoding the previously encoded video frame or portion.
15. The system of claim 14, comprising additional program code to cause the processor to:
decoding the current video frame or a portion thereof, wherein decoding comprises retrieving from memory a previous state comprising a state of a decoder after decoding the previously encoded video frame or portion.
16. The system of claim 9, comprising additional program code to cause the processor to:
decoding, at the client device, the encoded video frame or a portion thereof; and
displaying an image associated with each of the video frames or portions thereof on a display on the client device.
17. A machine-readable medium having program code stored thereon, which when executed by a machine, causes the machine to perform operations comprising:
encoding a first plurality of video frames or portions thereof, wherein each encoded video frame or portion thereof depends on a previously encoded video frame or portion thereof, respectively;
transmitting the first plurality of encoded video frames or portions to a client device;
receiving feedback information from the client device, the feedback information being usable to determine whether data contained in the video frame or portion was not successfully received and/or decoded;
in response to detecting that a video frame or portion thereof has not been successfully received and/or decoded, encoding a current video frame or portion thereof as dependent on a previously encoded video frame or portion thereof that is known to have been successfully received and/or decoded on the client device; and
transmitting the current video frame or a portion thereof to the client device.
18. The machine-readable medium of claim 17, wherein the previously encoded video frame or portion thereof known to have been successfully received and/or decoded comprises a last known video frame or portion thereof known to have been successfully received and/or decoded.
19. The machine-readable medium of claim 17, wherein the video frame or portion thereof comprises a P-frame or P-tile, respectively.
20. The machine-readable medium of claim 17, wherein the feedback information comprises an indication that the frame or portion thereof has been successfully received and/or decoded at the client device.
21. The machine-readable medium of claim 17, wherein the feedback information comprises an indication that the frame or portion thereof was not successfully received and/or decoded at the client device.
22. The machine-readable medium of claim 17, wherein encoding a current video frame or a portion thereof as dependent on a previously encoded video frame or a portion thereof known to have been successfully received and/or decoded on the client device further comprises:
retrieving a previous state from memory, the previous state comprising a state of the encoder after encoding the previously encoded video frame or portion.
23. The machine-readable medium of claim 22, comprising additional program code to cause the machine to:
decoding the current video frame or a portion thereof, wherein decoding comprises retrieving from memory a previous state comprising a state of a decoder after decoding the previously encoded video frame or portion.
24. The machine-readable medium of claim 17, comprising additional program code to cause the machine to:
decoding, at the client device, the encoded video frame or a portion thereof; and
displaying an image associated with each of the video frames or portions thereof on a display on the client device.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/210,888 | 2009-03-23 | ||
| US12/538,096 | 2009-08-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1169897A true HK1169897A (en) | 2013-02-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN102428656B (en) | System and method for encoding video using a selected tile and tile rotation pattern | |
| CN102428703B (en) | System and method for selecting video compression format based on feedback data | |
| CN102428697B (en) | System and method for utilizing forward error correction with video compression | |
| CN101918937B (en) | System for collaborative conferencing using streaming interactive video | |
| RU2524845C2 (en) | System and method for multi-stream video compression using multiple encoding formats | |
| CN101889442B (en) | System for combining a plurality of views of real-time streaming interactive video | |
| CN101897183B (en) | Method of combining linear content and interactive content compressed together as streaming interactive video | |
| CN101918934B (en) | System for accelerating web page delivery | |
| CN101918943B (en) | System and method for compressing video by allocating bits to image tiles based on detected intraframe motion or scene complexity | |
| CN102428483A (en) | System and Method for Multi-Stream Video Compression | |
| CN102428704A (en) | System and Method for Compressing Video Based on Latency Measurements and Other Feedback | |
| CN102428699A (en) | System and method for video compression using feedback including data related to the successful receipt of video content | |
| CN102428698A (en) | System and method for compressing video frames or portions thereof based on feedback information from a client device | |
| CN101918936A (en) | Mobile Interactive Video Client Device | |
| CN101889274A (en) | Host and broadcast virtual events using streaming interactive video | |
| CN101896236A (en) | System for reporting recorded video prior to system failure | |
| CN101889437A (en) | System for combining recorded application state with application streaming interactive video output | |
| CN101888884A (en) | Method for switching user sessions between mobile interactive video servers | |
| CN101918957A (en) | System and method for protecting certain types of multimedia data transmitted over a communication channel | |
| CN101918933A (en) | System and method for intelligently allocating client requests to server hubs | |
| CN101918955A (en) | Systems and methods for compressing video based on detected data rates of communication channels | |
| CN101918958A (en) | System and method for compressing video based on detected intra-frame motion | |
| CN101919242A (en) | Video compression system and method for compensating for bandwidth limitations of a communication channel | |
| HK1169897A (en) | System and method for video compression using feedback including data related to the successful receipt of video content | |
| HK1169817A (en) | System and method for accelerated machine switching |