
HK1173801A - Multi-user terminal services accelerator - Google Patents

Multi-user terminal services accelerator

Info

Publication number
HK1173801A
Authority
HK
Hong Kong
Prior art keywords
display
graphics
host
network
data
Prior art date
Application number
HK13100853.1A
Other languages
Chinese (zh)
Inventor
N. Margulis
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1173801A publication Critical patent/HK1173801A/en


Description

Multi-user terminal services accelerator
The present application is a divisional application of the invention patent application with international application number PCT/US2006/040755, international filing date of October 19, 2006, and Chinese national-phase application number 200680050231.7, entitled "Multi-user terminal services accelerator".
Background
Technical Field
The present invention relates generally to multi-user host computer systems, and more particularly to efficient terminal services support for remote clients by means of multi-user terminal services accelerators.
Discussion of the Background Art
Developing efficient multi-user host computer systems is an important goal of contemporary system designers and manufacturers.
Conventional computer systems may utilize a local display device to display output directly to a user. The local display device is typically positioned near the computer system due to limitations imposed by the various physical connections that electrically couple the display device to the output of the computer system. Some computer systems may support a second display device with similar proximity restrictions due to physical connections.
Remote users require additional flexibility in selecting an appropriate viewing location and network connection to the host system. For example, in an enterprise environment, the enterprise may wish to place all of its host computers into a "computer room": a secure central location with physical security and environmental management, such as air conditioning and power backup systems. Users, however, must be able to utilize the host systems from offices and desks located outside the "computer room".
Today's typical office environment includes personal computers and an increasing number of thin clients physically located at the user's location. These personal computers and thin clients operate on a network having a centralized system for storage, file serving, file sharing, network management, and various administrative services. Initially, such systems centralized all disk storage associated with the computer systems, while users ran applications on their local desktops. More recently, recognizing the security benefits, reduced operating costs, and general desire for centralized control, personal computers and thin clients may operate as Remote Terminals (RTs) in a server-based computing (SBC) solution that runs applications on a server.
The traditional approach to RTs in SBC environments is for the host system to use some form of server-client communication exchange, such as Microsoft's Remote Desktop Protocol (RDP). RDP uses its own video driver at the server and uses the RDP protocol to construct the presentation information into network packets and send them over the network to the client. The client receives the presentation data and interprets the packets into corresponding Microsoft Win32 Graphics Device Interface (GDI) API calls. Support for redirecting client keyboard and mouse commands to the server and for managing local audio and local client drives is also included.
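The round trip RDP performs — a server-side driver packing presentation updates into network packets, and the client unpacking them into GDI-style drawing calls — can be sketched as follows. This is a minimal illustration in Python; the opcode table, field layout, and length-prefixed framing are assumptions for clarity, not the actual RDP wire format.

```python
import struct

# Hypothetical opcode table mapping server-side drawing orders to
# client-side GDI-like calls (illustrative, not the real RDP protocol).
OPCODES = {1: "BitBlt", 2: "PatBlt", 3: "TextOut"}

def encapsulate(opcode: int, x: int, y: int, w: int, h: int) -> bytes:
    """Server side: pack a drawing order into a length-prefixed packet."""
    body = struct.pack("!BHHHH", opcode, x, y, w, h)
    return struct.pack("!H", len(body)) + body

def interpret(packet: bytes) -> tuple[str, tuple[int, int, int, int]]:
    """Client side: unpack the packet and map it to a GDI-like call."""
    (length,) = struct.unpack("!H", packet[:2])
    opcode, x, y, w, h = struct.unpack("!BHHHH", packet[2:2 + length])
    return OPCODES[opcode], (x, y, w, h)
```

A drawing order survives the round trip intact: `interpret(encapsulate(1, 10, 20, 100, 50))` yields the call name and its rectangle arguments on the client side.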
To enhance communication between the host system and the client, other systems have used the host system's main CPU to improve the performance of the RT. This has been done both for thin clients and for traditional PCs acting as remote clients. This approach is effective for host systems that support only one user at a time. For multi-user systems, however, using the host's main CPU to improve performance for any one user has significant limitations: computational resources such as main memory and CPU cycles spent optimizing for one user reduce the ability to support the workloads of other users.
Cost can be reduced by efficiently supporting multiple users from a single host. In a typical office environment, it is rare that everyone uses their computer at the same time, and similarly rare that any user uses all of the computing resources on their computer. Thus, for example, a company with 100 offices may only need a system capable of supporting 60 users at any one time. Even so, such a system may be designed to support all 100 users, as long as it provides enough computational throughput to make it appear that each user has his or her own host computer.
As host computers continue to enhance their performance by including multiple CPUs and CPUs with multiple processor cores, the one-user-per-machine limitation makes less and less economic sense. While some RTs may be connected locally to the multi-user host system through a Local Area Network (LAN), other RTs will be connected through a Wide Area Network (WAN), where they have a lower-performance network connection to the host system.
In a distributed office environment where RTs are located in different parts of the world, a centralized multi-user system can support different parts of the world at different working hours depending on the respective time zones.
Server-based computing, in which users' applications run on servers and only RT services run on the user terminals, is another way to allocate computing resources more efficiently among multiple users. SBC allows the host system to dynamically allocate shared resources, such as memory and CPU cycles, to the user with the highest priority. SBC systems may employ Virtual Machines (VMs), load balancing, and other means to grant different users access to different levels of performance and resources based on multiple criteria, and different priority schemes may be used to allocate SBC resources. SBC can also be used as a means to achieve higher data security, centralize support for an organization, enhance disaster recovery and service continuity, and reduce data storage requirements for the entire organization.
However, increased complexity may be required to enable a multi-user host to efficiently manage, control, and provide rich application capabilities for the various RT devices that an organization may have. There is a need for a solution that enables a multi-user host system to more efficiently support numerous remote users with outstanding computing and display performance.
Summary of the Invention
The present invention provides an efficient architecture for a mainframe or server system within a multi-user computer system that includes one or more remote terminals with interactive graphics and video capabilities. The host system typically manages applications and performs server-based computations. Each RT has its own keyboard, mouse and display, and possibly other peripherals. The RT provides individual users with access to applications on the server and a rich graphical user interface.
In a first preferred embodiment, the host system includes an auxiliary processor called a Terminal Services Accelerator (TSA) that offloads the computing tasks of managing the remote graphics protocol for each RT. The TSA allows multi-user host computers to be economically scaled to adaptively support the many different RTs that can be networked through a variety of different-bandwidth connections. The TSA may include processing elements in the form of configurable processors, Digital Signal Processors (DSPs), or hardware blocks to best perform the offloading from the host and further improve support for multiple terminals. Offloading may include encapsulating graphics commands into network packets, encoding different data blocks so that the communication channel may be used more efficiently, and tracking the data cached at each RT. There may be a local graphics processor in the host system for supporting a local terminal, but it is not part of the RT support system. The TSA may process local graphics to provide remote KVM management capability.
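One of the offload tasks named above, tracking the data cached at each RT, amounts to per-terminal bookkeeping: before re-encoding and re-sending a bitmap, the TSA checks whether that terminal already holds it. A minimal sketch, assuming content-hash cache keys (the class and method names are hypothetical):

```python
import hashlib

class RTCacheTracker:
    """Tracks which bitmaps each remote terminal has cached, so the
    host can send a short cache reference instead of re-sending data.
    Illustrative sketch; real cache keys and eviction are protocol-specific."""

    def __init__(self):
        self._caches: dict[str, set[str]] = {}  # rt_id -> bitmap digests

    def update_for(self, rt_id: str, bitmap: bytes) -> tuple[str, object]:
        key = hashlib.sha1(bitmap).hexdigest()
        cached = self._caches.setdefault(rt_id, set())
        if key in cached:
            return ("cache_ref", key)    # RT already holds this bitmap
        cached.add(key)
        return ("bitmap", bitmap)        # must transmit the full data
```

Only the first transmission of a given bitmap to a given RT carries the full data; repeats become short cache references, and each terminal's cache is tracked independently.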
In a second preferred embodiment, the host computer utilizes a combination of software, a graphics processor, and data encoding to support multiple RTs by creating a virtual display environment for each RT, such that only a minimal amount of commands or data need be communicated to it. The most common methods for communicating with the RT include sending encapsulated graphics commands or sending encoded sub-frame data. The software that manages the RT may run on the host CPU, on the TSA as in the first embodiment, or on a combination of both. The selective updating of each RT may be coordinated in software or by means of hardware in the graphics processor. The graphics processor may follow the proposed VESA Digital Packet Video Link (DPVL) standard or an improved method using status bits or signatures for sub-frames. In other enhancements, PCI Express or another bus is used instead of DVI for output data, additional data encoding is performed within the graphics processor or with an encoder attached to it, and software uses a single virtual graphics processor for multi-user support.
Each embodiment can further offload the host CPU by intercepting functions such as video playback using tracing software with the TSA. Instead of having the host CPU decode video locally and provide bitmaps for transmission to the RT, the TSA may intercept the native video stream before the CPU decodes it, and may pass the native video stream, or a modified version such as a transcoded version, to the target RT. Communications with the RT may use other dedicated channels in addition to the standard RDP channels, while still being managed within the RDP protocol.
In the host system of each embodiment, after the data is encapsulated or encoded, a network processor, or the CPU working in conjunction with a simpler network controller, sends graphics packets to the RT over a wired and/or wireless network. Each RT system decodes the graphics packets destined for its display, manages frame updates, and performs the necessary processing for the display screen. Other features, such as masking of lost packets in network transmissions, are managed by the remote display system. When there are no new frame updates, the remote display controller refreshes the display screen with data from the previous frame.
The system can feed back network information from the various wired and wireless network connections to the host system CPU, the TSA, and the data encoding system. The host system uses the network information to adjust the various processing steps that generate the RT updates; based on the network feedback, the frame rate and data encoding can be changed for different RTs. In addition, for systems that include a noisy transmission channel as part of the network, the encoding step may be combined with forward error correction to adapt the transmitted data to the characteristics of the transmission channel. The combination of these steps maintains an optimal frame rate with low latency for each RT. The TSA may be implemented as a stand-alone subsystem or in conjunction with other offload and acceleration processors such as a network processor, a security processor, an XML accelerator, an iSCSI processor, or any combination of the above.
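The per-RT adaptation described above can be sketched as a simple policy function mapping measured bandwidth and packet loss to a frame rate, encoding quality, and forward-error-correction overhead. The thresholds and parameter names here are illustrative assumptions, not values from the specification:

```python
def choose_encoding(bandwidth_kbps: float, loss_rate: float) -> dict:
    """Pick a per-RT frame rate, encoding quality, and FEC overhead
    from network feedback. Thresholds are illustrative assumptions."""
    if bandwidth_kbps >= 10_000 and loss_rate < 0.001:
        params = {"fps": 60, "quality": "lossless", "fec": 0.0}
    elif bandwidth_kbps >= 2_000:
        params = {"fps": 30, "quality": "high", "fec": 0.05}
    else:
        params = {"fps": 15, "quality": "low", "fec": 0.10}
    if loss_rate >= 0.01:               # noisy channel: raise FEC overhead
        params["fec"] = max(params["fec"], 0.20)
    return params
```

A LAN-attached RT on a clean gigabit link would be driven at full frame rate with no FEC, while a lossy low-bandwidth WAN link drops to a low frame rate with heavy error-correction overhead.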
Thus, for at least the foregoing reasons, the present invention effectively enables a flexible, enhanced multi-user RT system that utilizes various heterogeneous components to facilitate system interoperability and functionality.
Brief Description of Drawings
FIG. 1 is a block diagram of a multi-user computer system including one or more host computers, a network, and a plurality of remote terminals;
FIG. 2 is a block diagram of a multi-user RT system host computer with a terminal services accelerator according to one embodiment of the invention;
FIG. 3 illustrates RTs cooperating with the host computer of FIG. 2;
FIG. 4 is a block diagram of a multi-user RT system host computer having a terminal services accelerator with a graphics processor unit, according to a second embodiment of the present invention;
FIG. 5 illustrates a memory organized into eight display regions, wherein one display region includes a display window and two display regions are used to support a large display;
FIG. 6A shows a more detailed view of the display map 536 of FIG. 5;
FIG. 6B shows the rectangle of FIG. 6A subdivided into tiles;
FIG. 7 is a block diagram illustrating details of one exemplary terminal services accelerator 224 of FIG. 2 or 424 of FIG. 4;
FIG. 8 is a block diagram of an offload subsystem for accelerating terminal services, networking, and other tasks;
FIG. 9 is a flowchart of steps in a method for performing terminal service acceleration, according to one embodiment of the present invention; and
FIG. 10 is a flow chart of steps in a method for performing a network reception and display procedure for a remote terminal, according to one embodiment of the present invention.
Detailed description of the preferred embodiments
The present invention relates to improvements in multi-user Remote Terminal (RT) computer systems. Although the described embodiments relate to a multi-user RT computer system, the same principles and features may be equally applicable to other types of single-user systems and other types of thin clients.
Referring to FIG. 1, the present invention provides an efficient architecture for a multi-user computer system 100. A multi-user server-based computer 120, referred to as the "host computer 120," handles applications for multiple users, each utilizing some form of remote terminal. A local terminal 110 may also be included, primarily for single-user or administrative tasks. The host 120 generates a display update network stream to each of the RTs 300, 302, 304, etc. over the wired network 290, or to the display 306 over the wireless network 290. Users of the RTs are able to time-share the host computer 120 as if it were their own local computer, with full support for all types of graphical, textual, and video content and the same type of user experience available on a local computer. The additional connection 292 may be to a WAN, a storage subsystem, another host, or various other data center connections, and may take the form of gigabit Ethernet, 10G Ethernet, iSCSI, Fibre Channel (FC), Fibre Channel over IP (FCIP), or another electrical or optical connection. Connection 242 may connect other data or video sources to host system 120.
Throughout this document, "host" may refer to host 120, host 200, or host 400, each of which may be configured in a variety of ways to support multi-user server-based computing. Multiple hosts 120 may be clustered together to form a dynamically sharable computing resource. Within each host 120, multiple computer hosts 200 may be assembled in the form of blades in a rack connected by a backplane, or in another multi-processor configuration. Various multi-user operating systems, or software that virtualizes a single-user Operating System (OS), may be deployed on one or more processor blades or motherboards. Operating systems such as Citrix or Windows Server are designed as multi-user OSes. Although not specifically designed for multiple users, Windows XP may be used in such configurations with the help of lower-level virtualization software such as VMware or XenSource, or other means of performing user switching as fast as a multi-user OS. Different administrative controls may allow RTs and programs to move between processors either statically or dynamically. Load balancing may be performed by the operating system for each processor, or the system may perform load balancing across multiple processors.
FIG. 2 is a block diagram of one blade 200 of a server system, in which each blade may itself be a host computer 120, or multiple blades may be mounted in a rack to create a more powerful host computer 120. According to one embodiment of the invention, a single-blade (motherboard) system 200 or multiple blades 200 may be used in the multi-user system 100. The more blades and CPUs the host system 120 has, the more users can be supported simultaneously. The basic elements of the host computer 200 preferably include, but are not limited to, a CPU subsystem 202, a bus bridge controller 204, a main system bus 206 such as PCI Express, local I/O 208, main RAM 210, and a graphics and display controller 212, possibly with its own memory 218. The graphics and display controller 212 may have an interface 220 that allows a local connection 222 to the local terminal 110. Program sources and multimedia bitstreams may enter the host computer 200 through one of the network interfaces 209 or via one of the program sources 240 over I/O path 246. The network controller 228 also processes the display update streams and provides network communications to the various RTs 300-306, etc., over one or more network connections 290. These network connections may be wired or wireless.
In other configurations, more than one CPU subsystem 202 may share one or more devices such as the graphics and video display controller 212 and the terminal services accelerator 224. Other systems may be partitioned such that the network controller 228 is shared by multiple host systems 200. The system bus 206 may be connected to a backplane bus to connect multiple blades in a system. Path 226 may share the backplane bus 206, or there may be additional inter-system buses. More than one network controller 228 may be included in the system: one for connections 290 to multiple remote terminals, and another (not shown) for infrastructure network connections to other blades, other server systems, or other data center equipment such as storage systems. Each CPU subsystem 202 may include multiple processor cores, where each core may execute more than one thread simultaneously.
The host computer 200 preferably includes a Terminal Services Accelerator (TSA) 224 connected to the main system bus 206, which may have an output path 226 to a network controller 228. The TSA 224 may include dedicated RAM 230, or may share the main system RAM 210, the graphics and display controller RAM 218, or the network controller RAM 232. The main RAM 210 may be more closely associated with the CPU subsystem 202, as shown at RAM 234. Additionally, when the graphics and display controller 212 can share the main RAM 210 of the host system 200, its associated RAM 218 may not be necessary.
The function of the TSA 224 is to offload from the main CPU 202 some of the management of each RT and to accelerate some of the offloaded processing, providing improved display support for each RT. The types of offload and acceleration support include encapsulating graphics operations into remote graphics commands, assisting in determining which capabilities and bitmaps are cached at each RT so as to determine which graphics commands are most appropriate, encoding and encapsulating bitmaps that need to be communicated to the RT, and optimally managing multimedia bitstreams.
Additional functionality, such as examining and encapsulating Extensible Markup Language (XML) traffic, Simple Object Access Protocol (SOAP) traffic, HTTP traffic, Java Virtual Machine (JVM) traffic, and other traffic associated with internet-based communications, may also be supported. The host system, in conjunction with the TSA 224, may enable the RT to efficiently access the entire internet while performing any required anti-spam, anti-virus, content-filtering, access-restriction, or other packet-filtering algorithms. This additional functionality may be particularly useful for supporting RT internet browsing where the host acts as a proxy for internet access. Although there may be some redundancy in the system, this approach may provide more granular user control than internet security appliances deployed between the host system and the WAN.
A special browser at the RT may use other enhancements for internet-based traffic, which may include reformatting or re-encoding internet-based content based on the RT display device and the execution capabilities within the RT. For example, if the RT device is a cell phone or Personal Digital Assistant (PDA) with limited screen resolution, the TSA 224 may filter high-resolution content into a low-resolution image for faster and more appropriate display. The TSA 224 may run other, more intelligent web page interpretation algorithms to perform functions such as removing banner advertisements and other extraneous information so that the core information may be sent to the cell phone. Other types of web content, such as those utilizing Active-X controls, Macromedia Flash, or other runtime programs, may not be compatible with devices such as phones or PDAs. The TSA 224 may mediate by executing the Active-X control and transmitting the resulting display data to the waiting PDA. Application-layer regular expression (RegEx) content processing may also be performed. Re-encoding may also be performed to improve client security: although XML and SOAP may be hijacked and used as vectors for viruses, the TSA 224 may re-encode XML and SOAP into a secure display format so that the RT client is not exposed to this risk.
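A toy version of the banner-stripping step described above, using application-layer regular expressions. The pattern and the `filter_for_small_screen` helper are hypothetical, and real filtering would need a robust HTML parser rather than a single regex:

```python
import re

# A toy content filter of the kind the TSA might run when proxying web
# pages for a small-screen RT: strip banner-ad markup and collapse
# whitespace. The pattern is illustrative, not a production filtering rule.
AD_PATTERN = re.compile(r'<div class="banner-ad">.*?</div>', re.DOTALL)

def filter_for_small_screen(html: str) -> str:
    html = AD_PATTERN.sub("", html)              # remove banner-ad blocks
    return re.sub(r"\s{2,}", " ", html).strip()  # compact the remainder
```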
The multimedia bitstream may comprise a video stream already in a compressed format that is stored locally at the host system 200 or received via one of the system network interfaces 290 or the programming source interface 246. In some configurations, the multimedia bitstream is already in a format compatible with the target RT. In this case, the TSA 224 encapsulates the bitstream into an appropriate packet format for transmission to the RT. Encapsulation may include adding header information, such as the origin of a video display window, or modifying the packet organization, such as converting a transport stream into a program stream having a different packet size.
In many cases, the multimedia bitstream will not be in a format that is easily handled by the target RT, or in a format suitable for the network connection. In these cases, the TSA 224 performs the more complex steps of decoding and re-encoding, or transcoding, the multimedia bitstream. For example, the incoming multimedia bitstream may be an encoded HDTV MPEG-2 stream. If the window size at the RT is set to a small 320x240 window, it makes sense to save network bandwidth by having the TSA 224 transcode and scale the video to a lower bit rate appropriate for the desired display window size. Similarly, if the incoming video is in a format that the RT is unable to decode, the TSA 224 may transcode the video into a compatible format. Even if the formats are compatible, there may be other incompatibilities, such as Digital Rights Management (DRM) encryption schemes. The TSA 224 may also convert from one DRM or encryption scheme to a scheme appropriate for the target RT.
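The decision logic sketched in these two paragraphs — repacketize a compatible stream, transcode when the codec is unsupported or the source resolution far exceeds the RT window — might look like the following. The codec names, the 4x area threshold, and the function name are illustrative assumptions:

```python
def plan_bitstream_handling(stream_codec: str, stream_res: tuple[int, int],
                            rt_codecs: set[str], rt_window: tuple[int, int]) -> str:
    """Decide how the TSA should handle an incoming multimedia bitstream
    for a given RT. Codec names and policy thresholds are illustrative."""
    sw, sh = stream_res
    ww, wh = rt_window
    if stream_codec not in rt_codecs:
        return "transcode"        # RT cannot decode this format
    if sw * sh > 4 * ww * wh:
        return "transcode"        # e.g. an HDTV source into a 320x240
                                  # window: downscale to save bandwidth
    return "encapsulate"          # compatible: repacketize and forward
```

Under this policy, an MPEG-2 HDTV stream destined for an RT showing only a 320x240 window is transcoded and scaled down, while the same stream shown near full size is simply encapsulated.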
Microsoft's Remote Desktop Protocol (RDP) handles compressed video bitstreams significantly less efficiently. With RDP, a driver within the host system detects the bitstream and decodes it into a Device Independent Bitmap (DIB). The DIB is then converted into RDP transfer commands, and the host attempts to transfer the DIB-formatted data to the RT over the network. In most cases, only a few frames of DIB data arrive at the RT for display. It is therefore inefficient for the host CPU to perform the decoding and to transmit the decoded data in an inefficient format over the network. Other RDP-based graphics operations also use DIBs.
Conventional graphics bitmaps, such as those from websites, also need to be transmitted from the host system 200 to the RT. The TSA 224 may perform various levels of encoding on conventional graphics bitmaps such as DIBs. The encoding of a graphics bitmap may be lossless or lossy, with the goal of providing a visually indistinguishable representation of the original graphics quality. A simple software interface for the TSA 224 may merely interface with the host CPU through the RDP API, while a more aggressive implementation may enable the TSA 224 to access the underlying DirectX driver framework. Encoded DIB transmission and compressed-video-domain transmission are not part of the standard RDP implementation. Thus, these transports may be piggybacked onto existing RDP transport formats, run as some type of proprietary RDP extension, or run outside the RDP framework.
Certain versions of host operating systems and RDP need to meet additional security requirements of the RDP protocol. The RDP client may be required to exchange keys with the host in order to use encrypted packets. Because the TSA 224 intercepts RDP client packets, the TSA 224 may include appropriate acceleration and offloading of key exchange and decryption when communicating with the host processor. Additionally, to maintain system security, the TSA 224 and network controller 228 ensure that all communications with the RT are properly encrypted.
FIG. 3 is a block diagram of a Remote Terminal (RT) 300 according to one embodiment of the present invention, which preferably includes, but is not limited to, a display screen 310, a local RAM 312, and a remote terminal system controller 314. The remote terminal system controller 314 includes a keyboard, mouse, and I/O control subsystem 316 with corresponding connections for a mouse 318, keyboard 320, and other miscellaneous devices 322, such as speakers for reproducing audio, or with a Universal Serial Bus (USB) connection that can support various devices. Other integrated or peripheral connections for supporting user authentication through secure means, including biometric readers or secure cards, may also be included. These connections may be single-purpose, such as a PS/2-type keyboard or mouse connection, or general-purpose, such as USB. In other embodiments, the I/O may include a game controller, a local wireless connection, an IR connection, or no connections at all. The remote terminal system 300 may also include other peripheral devices such as a DVD drive.
Some embodiments of the present invention do not require any input at the remote terminal system 300. An example of such a system is a retail store or electronic billboard, where different displays may be used in different locations and may present a variety of information and entertainment. Each display may function independently and may be updated based on various factors. Similar secure systems may also include displays that accept touch-screen input, such as Automated Teller Machines (ATMs) at kiosks or banks. Other secure systems, such as entertainment gaming machines, may also be based on such remote terminals.
The network controller 336 supports secure protocols over a network path 290; the supported network may be wired or wireless, and data communicated over the network may be encrypted via key exchange. The network supported by each remote display system 300 must be supported by the network controller 228 of FIG. 2, either directly or through some type of bridge. An example of a common network is Ethernet, such as a CAT5 cable running some form of Ethernet, preferably gigabit Ethernet, where the I/O control path may use an Ethernet-capable protocol such as the standard Transmission Control Protocol and Internet Protocol (TCP/IP), or some form of lightweight handshaking in conjunction with UDP transport. Industry efforts such as the Real Time Streaming Protocol (RTSP) and the Real-time Transport Protocol (RTP), along with the Real-time Control Protocol (RTCP), can be used to enhance data packet transport and can be further enhanced by the addition of a relay forwarding protocol. Other recent quality of service (QoS) efforts, such as layer 3 Differentiated Services Code Point (DSCP), the WMM protocol as part of the Digital Living Network Alliance (DLNA), Microsoft's qWave (Quality Windows Audio Video Experience), uPnP (Universal Plug and Play) QoS, and the 802.1p protocol, are also improved methods of using existing network standards.
In addition to the data packets for supporting the I/O devices, the network also carries the encapsulated, encoded display commands and data needed for display. The CPU 324 cooperates with the network controller 336, the 2D drawing engine 332, the 3D drawing engine 334, the data decoder 326, the video decoder 328, and the display controller 330 to support all types of visual data representations that may be presented on the host computer and to display them locally on the display screen 310. There is no requirement that the RT include any particular combination of display processing blocks. While an RT is likely to have at least one type of decoder or drawing engine, a very thin RT might include only a display controller 330, with the CPU performing the display processing.
The RT may first be initialized by booting from a local flash memory (not shown), with additional information provided by the host computer 200 over the network. During RT initialization, the connection between the RT system controller 314 and the display screen 310 may be used in a reverse or bi-directional mode, which may employ standards such as the Display Data Channel (DDC) interface, Extended Display Identification Data (EDID), and other extensions to identify the capabilities of the display monitor. The USB connections of the keyboard, mouse, and I/O control subsystem 316 may also be used for the connection to the display screen 310. Information such as the available resolutions and controls is then processed by the CPU 324. System 300 may implement a protocol such as uPnP or another discovery mechanism capable of communicating with the host 200. When initializing communication, the CPU 324 may provide RT information, including display monitor information, to the host 200 so that each RT may be instantiated on the host side.
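The capability report an RT might send during this discovery step can be sketched as a simple JSON message built from the EDID-derived display information and the locally supported decoders. The field names are hypothetical, not a real protocol:

```python
import json

def build_rt_hello(display_edid: dict, decoders: list[str], ram_kb: int) -> bytes:
    """Build the capability message an RT might send to the host at
    initialization, so the host can instantiate it. Field names are
    illustrative, not part of any real protocol."""
    msg = {
        "type": "rt_hello",
        "display": {
            "width": display_edid["width"],
            "height": display_edid["height"],
            "refresh_hz": display_edid.get("refresh_hz", 60),
        },
        "decoders": decoders,       # e.g. ["2d", "3d", "mpeg2"]
        "cache_ram_kb": ram_kb,     # RAM available for bitmap caching
    }
    return json.dumps(msg).encode()
```

The host would parse this message to record the display geometry, pick compatible encodings, and size its per-RT bitmap cache bookkeeping.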
The initial display screen may come from flash memory or from the host computer 200. After the first complete frame of display data, the host computer 200 need only send partial-frame information over the network 290 as part of the display update network stream. If the pixels of the display have not changed from the previous frame, the display controller 330 may refresh the display screen 310 with the previous frame contents from the local RAM 312.
The display updates are sent via a network stream and may consist of encapsulated 2D drawing commands, 3D drawing commands, encoded display data, or encoded video data. The network controller 336 receives the network display stream, and the CPU 324 determines from the encapsulation header which of the functional units 332, 334, 326, and 328 is required for each packet. The functional unit performs the necessary processing steps to render or decode the image data and update the appropriate area of the RAM 312 with the new image. In the next refresh cycle, the display controller 330 uses the updated frame for the display screen 310.
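The header-based routing described here can be sketched as a small dispatch table keyed on a packet-type byte. The type codes and handler names are assumptions for illustration:

```python
# Hypothetical packet-type tags mapped to the RT's functional units
# (2D engine 332, 3D engine 334, data decoder 326, video decoder 328).
HANDLERS = {
    0x01: "draw_2d",       # encapsulated 2D drawing command
    0x02: "draw_3d",       # encapsulated 3D drawing command
    0x03: "decode_data",   # encoded display (bitmap) data
    0x04: "decode_video",  # encoded video data
}

def dispatch(packet: bytes) -> tuple[str, bytes]:
    """Read the encapsulation header's type byte and route the payload
    to the appropriate functional unit. Header layout is an assumption."""
    ptype = packet[0]
    try:
        return HANDLERS[ptype], packet[1:]
    except KeyError:
        raise ValueError(f"unknown packet type 0x{ptype:02x}")
```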
The display controller 330 transfers a representation of the current image frame from the RAM 312 to the display screen 310. Typically, the image is stored in the RAM 312 in a format ready for display, but in systems where RAM cost is an issue, the image or portions of it may be stored in an encoded format. The external RAM 312 may be replaced by a large buffer within the remote terminal system controller 314. The display controller 330 may also combine two or more display surfaces stored in the RAM 312 to composite an output image for display. Different blending operations may be performed in conjunction with the compositing operation.
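The compositing step can be illustrated with a per-pixel alpha blend of two surfaces (single-channel pixel values for brevity; a real display controller would operate on full RGB surfaces in hardware):

```python
def composite(base: list[int], overlay: list[int], alpha: float) -> list[int]:
    """Blend an overlay surface onto a base surface per pixel
    (values 0-255), a simple sketch of one blending operation the
    display controller 330 might apply when compositing surfaces."""
    return [round(alpha * o + (1 - alpha) * b)
            for b, o in zip(base, overlay)]
```

For example, blending with `alpha=0.5` averages the two surfaces pixel by pixel: `composite([0, 100], [255, 200], 0.5)` gives `[128, 150]`.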
The CPU324 communicates with the TSA224 to best set up and manage the overall display operation of the RT. Initial setup may include enumerating the types of functions supported by the RT system controller 314, the specifications of the display screen 310, the amount of RAM312 available to buffer and cache data, the instruction set supported by the 2D drawing engine 332, the instruction set supported by the 3D drawing engine 334, the format supported by the data decoder 326, the format supported by the video decoder 328, and the capabilities of the display controller 330. Other management optimizations during runtime include managing and caching display bitmaps in RAM312 so that they do not need to be resent.
Figure 4 illustrates a second preferred embodiment of a multi-user host system 400 that makes several changes to the host system 200. First, graphics and video display controller 212 is replaced with a more powerful graphics processing unit (GPU-P) 412 that includes support for selective display updating via grouping and may comply with some or all of the proposed VESA (Video Electronics Standards Association) Digital Packet Video Link (DPVL) standard. Second, TSA 224 is replaced with TSA-G 424, which is modified to more directly support packet display updates from GPU-P 412 via system bus 206, or preferably via input paths 414 and 416, which may be serial digital video outputs SDVO1 and SDVO2 or generic ports with different bus bandwidths, signaling protocols, and frequencies. Examples include: Digital Video Output (DVO), Digital Visual Interface (DVI), High Definition Multimedia Interface (HDMI), DisplayPort or other Low Voltage Differential Signaling (LVDS), Transition Minimized Differential Signaling (TMDS), PCI Express, or another scheme. The display output path may operate at a speed sufficient to output multiple frames of video at a high refresh rate, where the frames may be selectively updated rectangles corresponding to more than one target RT. Similar to TSA 224, TSA-G 424 may be connected to network controller 228 via main system bus 206, through a dedicated link 426, or more tightly integrated via a system-on-a-chip (SOC) implementation.
In addition to performing conventional graphics processing, GPU-P 412 generates selective updates that indicate which portion of the display has changed. The selective update may take the form of a rectangle or a tile sent through video output path 414 or 416 or through the output of main system bus 206. The rectangle update includes a packet header to indicate the origin, size, and format of the window. The origin may be used to indicate which RT is the destination. Tiles may also be used and may be normalized to one or more fixed sizes so that the header may require less information to describe the tiles. Other information, such as whether and how to scale the rectangle or tile at the RT, may also be included in the header. Other forms of selective updating include support for BitBlt, Area Fill, and Pattern Fill, where rather than sending large blocks of data, a minimal amount of data is sent along with command parameters for operations performed at the RT. Other headers support updates in the form of video streams, genlock, scaled video streams, Gamma tables, and Frame Buffer Control. The proposed DPVL specification details one possible implementation of selective updating along with its headers.
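The rectangle-update header described above can be sketched as follows. This is a hedged illustration only: the byte layout, field widths, and the assumption that the destination RT can be derived from the rectangle's vertical origin in a shared virtual display map are invented for the example and are not the DPVL wire format.

```python
import struct

# x_origin, y_origin, width, height as unsigned shorts, pixel format as one byte
# (an assumed layout, not the DPVL specification's).
HEADER_FMT = ">HHHHB"

def pack_rect_header(x, y, w, h, fmt):
    """Pack a rectangle-update header for transmission."""
    return struct.pack(HEADER_FMT, x, y, w, h, fmt)

def unpack_rect_header(buf):
    """Recover the header fields from the front of a packet."""
    x, y, w, h, fmt = struct.unpack(HEADER_FMT, buf[:struct.calcsize(HEADER_FMT)])
    return {"x": x, "y": y, "w": w, "h": h, "format": fmt}

def target_rt(header, display_height=1200):
    """Derive the destination RT from the origin, assuming virtual displays
    are stacked vertically at display_height intervals in one shared map."""
    return header["y"] // display_height
```

A packed header of this kind would precede each rectangle's pixel data, letting the receiving TSA route the update without inspecting the payload.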
By organizing RAM 418 into various surfaces that each contain display data for multiple RTs, one GPU-P 412 can be effectively virtualized for use by the system for all RTs 300. The 2D, 3D, and video graphics processors (not shown) of GPU-P 412 are preferably used to achieve high graphics and video performance. The graphics processing unit may include 2D graphics, 3D graphics, video encoding, video decoding, scaling, video processing, and other advanced pixel processing. The display controller of GPU-P 412 may also perform functions such as blending and keying of video and graphics data, as well as overall screen refresh operations. In addition to RAM 418 for the primary and secondary display surfaces, there is sufficient off-screen memory to support various 3D and video operations. As an alternative to the DPVL approach of managing selective updates, a selective update buffer memory (S-buffer) 404 may be maintained within RAM 418. In one embodiment, the S-buffer 404 stores a status bit, a signature, or both a status bit and a signature corresponding to each tile of the respective virtual display. In another embodiment, the S-buffer 404 stores the tiles themselves, with or without header, status bit, and signature information, where the tiles are arranged to be output for selective updating.
The graphics engine and display controller typically combine to produce a complete display image corresponding to each RT display primary surface. RAM 418 will effectively contain an array of display frames for all RTs. DPVL allows virtual displays up to 64Kx64K, with applications primarily in multi-monitor support. In this application, the RT displays may be mapped into the 64Kx64K array. Because this application involves multiple independent RTs, GPU-P 412 may add different security features to protect different display areas and prevent a user from being able to access another user's frame buffer. For security and reliability reasons, the system preferably includes hardware locks that prevent unauthorized access to protected portions of the display memory.
FIG. 5 illustrates an example configuration of memory 418 of FIG. 4 in which the virtual display space is set to 3200 pixels horizontally and 4800 pixels vertically. Memory 418 is divided into eight 1600x1200 display areas, labeled 520, 522, 524, 526, 528, 530, 532, and 534, respectively. A typical high-quality display mode may be configured with a bit depth of 24 bits per pixel, although this configuration tends to use 32 bits per pixel as organized in RAM 418, both to simplify the memory arrangement and to make the additional 8 bits available for other purposes when the display is accessed by the graphics and video processor. The description of the tiled memory is conceptual in nature with respect to GPU-P 412. The actual RAM addressing may also involve memory page size and other factors.
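The FIG. 5 layout above can be sketched arithmetically. This assumes, for illustration, that the eight 1600x1200 areas are packed two per row in the 3200x4800 space; the helper names are invented and the real mapping would also account for memory page size as noted above.

```python
AREA_W, AREA_H = 1600, 1200   # size of one RT display area
COLS = 3200 // AREA_W         # two areas per row in the 3200-wide space

def area_origin(rt_index):
    """Return the (x, y) origin of the display area for a given RT index."""
    row, col = divmod(rt_index, COLS)
    return (col * AREA_W, row * AREA_H)

def owning_rt(x, y):
    """Return which RT's display area contains the virtual-space pixel (x, y)."""
    return (y // AREA_H) * COLS + (x // AREA_W)
```

With this packing, RT 0 occupies the top-left area and RT 7 the bottom-right, so a selective update's origin alone identifies its destination.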
Fig. 5 further illustrates a display update rectangle 550 in the display area 528. Dashed lines 540 correspond to coarser block boundaries of 256x256 pixels, called partitions, within the 1600x1200 display. As is apparent from display window 550, the display window boundaries do not necessarily align with the partition boundaries. This is typically the case because the user will resize and position windows at will on the display screen. To support remote screen updates that do not require updating the entire frame, each partition affected by the display window 550 needs to be updated. Moreover, the type of data within display window 550 and the surrounding display pixels may be of disparate and unrelated types. Thus, a partition-based encoding algorithm, if lossy, needs to ensure that there are no visual artifacts associated with the edges of the partitions or with the boundaries of the display window 550. The actual encoding process may occur on blocks smaller than partitions, such as 8x8 or 16x16. Thus, the preferred embodiment uses a deterministic encoding algorithm, where the same result is obtained for a group of pixels regardless of the surrounding pixels, so that no arrangement of windows produces artifacts.
The block boundaries of the coding scheme are also a consideration for the tiles. For example, an encoding scheme may require block boundaries at multiples of 8 pixels. If a source tile is not a multiple of 8 pixels, it needs to be padded with surrounding data. Also, it is often preferable to orient the block boundaries to the screen rather than to a particular user-placed rectangle or tile. If the user operates a window of 80x80 pixels, even though it could theoretically span the minimum of 10 blocks of 8x8 in both the horizontal and vertical directions (100 blocks in total), it is more likely to span 11 blocks in each direction (121 blocks). The rectangle update, and any encoding of the rectangle, will therefore encode 88x88 pixels (121 blocks), where some of the surrounding pixels need to be padded. Although the DPVL specification does not treat rectangle encoding as part of the selective update scheme, there may be other granularity limitations in DPVL, such as the modulo-8-pixel behavior of the DPVL CRTC output mechanism, that would result in the use of appropriately sized rectangle boundaries.
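The block-alignment arithmetic above can be sketched directly: given a window rectangle and block boundaries fixed to screen positions, expand the rectangle outward to the enclosing block grid. For an 80x80 window at an arbitrary offset this yields the 11-blocks-per-direction (88x88) result described in the text. The function name is illustrative.

```python
BLOCK = 8  # the coding scheme's block size in pixels, per the example above

def aligned_rect(x, y, w, h, block=BLOCK):
    """Expand (x, y, w, h) outward to screen-fixed block boundaries,
    returning the padded rectangle that must actually be encoded."""
    x0 = (x // block) * block            # floor left/top edges to the grid
    y0 = (y // block) * block
    x1 = -((-(x + w)) // block) * block  # ceil right/bottom edges to the grid
    y1 = -((-(y + h)) // block) * block
    return (x0, y0, x1 - x0, y1 - y0)
```

Only a window that happens to start exactly on a block boundary avoids the extra row and column of padded blocks.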
It is also possible to support RTs with different size displays. In one example, GPU-P 412 may support any number of displays of any size. In another example, a smaller display may be supported relatively simply as a sub-window, or a larger display as an overlay window spanning more than one display area. A 1920x1080 window would require the use of both the 532 and 534 regions simultaneously, as delineated by rectangle 536. While this wastes area, it is simpler to implement than creating a custom size for each display. Because of the selective rectangle update mechanism of GPU-P 412, only the relevant area of the screen is transferred. Other more flexible mechanisms, such as S-buffers, may be implemented that require less processor intervention than having DPVL dynamically control CRTC registers to manage selective updates.
A more flexible system may also split the concept of DPVL rectangles into more regularly sized entities such as tiles. There is a tradeoff between the efficiency of header information for an arbitrary rectangle size and the potentially simpler headers of less flexible tile sizes that carry more screen data. In a preferred embodiment, the tiles may be dynamically set to any multiple of the block size, where the block size is the smallest entity used by the data encoding algorithm. The blocks may be oriented to the source image or to fixed block locations on the screen. The size of the tiles may be included in the header information.
A memory region such as 530 may be designated as an S-buffer 404 to manage selective updates. In one embodiment, the S-buffer includes status bits corresponding to the tiles of display frames 520, 522, 524, and 526, where the status bits indicate whether the tiles need to be selectively updated. The S-buffer 404 may also store a signature for each tile, which is then used to determine the need for selective updates. In another embodiment, tiles from frames 520, 522, 524, and 526 that require selective updating are copied to memory area 530 and queued for selective update output. The queued tiles may include various header, status, and signature information.
Fig. 6A shows a more detailed view of the Fig. 5 display map 536 having a 1920x1080 High Definition Television (HDTV) resolution referred to as 1080P. The fixed-size rectangles 614 in Fig. 6A are oriented according to screen position boundaries. Each rectangle is 160 pixels across and 120 pixels down. There are 12 rectangles (12x160 = 1920) in each row and 9 rectangles (9x120 = 1080) in each column. The system may use these rectangles as tiles that form the basis for selective updates. Another system further divides rectangle 614 into tiles 620 of 80x40 pixels in Fig. 6B, and the system can select these smaller tiles as the basis for selective updating. A more flexible system may utilize both the larger rectangle 614 of 6 tiles 620 and the tiles themselves, and use the header information to delineate which type is output at any given time.
In both cases, the blocks that form the basis of the coding algorithm fit within the tiles or rectangles. Assuming 8x8 blocks, each tile contains blocks in a 10x5 configuration, and each rectangle contains blocks in a 20x15 configuration. A system that utilizes both the larger rectangles and the smaller tiles may use different mechanisms for the two when determining selective update requirements. In a preferred embodiment, the larger rectangles may have associated status bits indicating whether they have changed, and the smaller tiles may make this determination using signatures. These status bits and signatures may be managed with an S-buffer as described below.
GPU-P 412 may integrate this processing to directly perform selective encoding of the tiles, or each tile may be examined and output to TSA-G 424 using a selective update process that includes the appropriate headers. The headers are processed by TSA-G 424, and based on the fields in each header, TSA-G 424 knows the RT to which the tiles are directed and their location on the display screen. TSA-G 424 will, where appropriate, encode the tiles into a compressed format, adjust any required header information, and provide the tiles and headers for further network processing.
GPU-P412 and TSA-G424 may partition the selective update process differently. In some cases, GPU-P412 may perform full management and may send only the tiles that need updating to TSA-G424. In other cases, TSA-G424 is required to perform further filtering of the slices in order to determine which slices actually need updating. Within GPU-P412, the selective update mechanism may be hardwired or require CPU intervention, which may be implemented across a drawing engine and a selective update refresh engine. Encoding of the tiles may also be performed in GPU-P412 or TSA-G424. GPU-P412 may also output graphics drawing commands for RT to TSA-G424 over a digital video bus, or a software driver may provide these commands directly to TSA-G424.
For selective tile updates, in a first embodiment, an S-buffer is used, where GPU-P412 has a drawing engine that manages each tile status bit and a selective update refresh engine that monitors each tile status bit as it is managed for selective display updates. As with the Z-buffer used in 3D graphics, the S-buffer may be implemented as a separate data storage surface. As with the Z-buffer, hardware drawing operations of the enhanced GPU-P412 may update the S-buffer status bits without additional commands. The selective update hardware then uses these status bits to determine which tiles need updating at the RT. Similar to the refresh cycle of the display controller, the selective update hardware may periodically traverse the S-buffer and read the status bits. Based on the state of the status bit, the selective update hardware either ignores the tiles that do not need to be updated, or reads the tiles for selective update, outputs the tiles along with header information, and updates the status bit accordingly. In one less efficient implementation, the GPU-P may use more traditional graphics rendering operations to generate the S-buffer.
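The periodic S-buffer traversal described above can be sketched as a refresh pass: walk a per-tile status-bit plane, emit each dirty tile's coordinates, and clear the bit as the tile is queued. The flat list-of-rows representation and the callback interface are illustrative assumptions about what the hardware does in parallel.

```python
def refresh_pass(s_buffer, emit):
    """Walk the status-bit plane; for each set bit, emit (row, col) for
    selective update and clear the bit. Returns the number of tiles sent."""
    sent = 0
    for r, row in enumerate(s_buffer):
        for c, dirty in enumerate(row):
            if dirty:
                emit(r, c)   # hand the tile coordinates to the update path
                row[c] = 0   # mark the tile clean, as the hardware would
                sent += 1
    return sent
```

In the hardware analogy, `emit` corresponds to reading the tile from display memory and outputting it with its header, and the cleared bit prevents the same tile from being resent on the next cycle.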
In another preferred embodiment, which does not require specific S-buffer hardware, GPU-P412 may manage a selective update buffer of concatenated tiles that need to be updated. The selective update buffer may be constructed in a separate memory area. Each time the GPU-P performs an operation that changes the tile, it will then copy the tile to the selective update buffer. Header information may be stored at the beginning of each tile and the tiles may be packaged together. The display controller is arranged to use the selective update buffer and output it through the refresh port using standard display controller output operations. GPU-P412 may manage one or more buffers such as ring buffers or a linked buffer list of concatenated tiles and provide a continuous output through the SDVO output that is treated by TSA-G424 as a tile list. For GPU-P, various schemes may be used to determine the priority of placement in the list. This approach may be most efficient for utilizing a GPU-P with less specific hardware for supporting multiple RTs and with little or no specific selective update hardware.
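The software-managed selective update buffer above can be sketched as a bounded ring of concatenated tile updates: each drawing operation that changes a tile appends the tile (header plus pixels) to the ring, and the display controller drains the ring in order through the refresh port. The class, field names, and capacity are illustrative assumptions.

```python
from collections import deque

class SelectiveUpdateBuffer:
    """Ring of queued tile updates, oldest entries dropped on overrun."""
    def __init__(self, capacity=256):
        self.queue = deque(maxlen=capacity)

    def push(self, rt_id, x, y, pixels):
        """Append one changed tile, header information stored alongside it."""
        self.queue.append({"rt": rt_id, "x": x, "y": y, "pixels": pixels})

    def drain(self):
        """Stream queued tile updates in order, as the refresh output would."""
        while self.queue:
            yield self.queue.popleft()
```

A real implementation would prioritize placement in the ring (as the text notes) rather than simply dropping the oldest entries, but the append-then-drain flow is the essential shape.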
In another preferred embodiment, TSA-G 424 operates with GPU-P 412 to decide which tiles at RT 300 need updating. It may be very difficult for GPU-P 412 to manage the status bits on a tile-by-tile basis; it may instead combine tiles into a large tile or a full virtual RT display and maintain only limited granularity for the status bits. Reducing large-tile updates to small-tile updates may be performed by tracking the signature of each tile. The signature is typically generated when the tile is first processed and checked against subsequent signatures. The signature may be generated and processed by TSA-G 424 operating on the incoming data, or in conjunction with the selective update hardware of GPU-P 412. If TSA-G 424 performs signature checking for each tile, the network bandwidth to each RT 300 may be conserved. If GPU-P 412 performs signature checking, the bandwidth through the video path to TSA-G 424 is also conserved. GPU-P 412 may generate and manage signature storage planes corresponding to the tiles, where the status bits may be part of a signature plane or a separate plane. Alternatively, the status bits and signature bits may be managed by GPU-P 412 in a RAM cache with a linked list.
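The per-tile signature filtering described above can be sketched as follows. CRC-32 stands in for whatever signature the hardware actually computes; the keying scheme and class name are illustrative assumptions.

```python
import zlib

class SignaturePlane:
    """Remembers the last signature seen for each tile position."""
    def __init__(self):
        self.sigs = {}  # (rt, row, col) -> last signature

    def changed(self, key, tile_bytes):
        """Return True if the tile's content differs from the last pass
        (and record the new signature); False means the tile can be skipped."""
        sig = zlib.crc32(tile_bytes)
        if self.sigs.get(key) == sig:
            return False
        self.sigs[key] = sig
        return True
```

Running this filter in TSA-G saves network bandwidth to each RT; running it in the GPU additionally saves the video-path bandwidth to the TSA, matching the tradeoff described above. A hash collision would wrongly skip a changed tile, so a real signature would be chosen with that risk in mind.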
Depending on the type of graphics commands generated by the graphics operations on host 400 and the capabilities of RT300, the commands may be encapsulated and sent for execution at the RT, or the commands may be executed locally by GPU-P412. In many cases, although the command is sent to execute at the RT, the command is also executed locally by GPU-P412 in order to save a local copy of the virtual display. Ideally, any tile that changes as a result of redundant local graphics commands is filtered out with status bits to prevent unnecessary tile update packets from being sent to the RT. Sending the command instead of the coded splice typically requires less bandwidth, but this may not always be the case. A system that manually manages the selective update buffer would also consider the commands being sent to the RT. Tiles that are to be updated by commands executed at the RT are ideally not placed into the selective update buffer by GPU-P412.
In another example, graphics commands intended for the RT are processed by TSA-G 424 and split into encoded data transfers and modified graphics commands. For example, the host system may wish to perform a BitBlt operation from an off-screen memory or pattern to on-screen memory. This can be easily performed at the GPU-P 412 subsystem. However, at the RT, the source data of the BitBlt request is not cached. Thus, to be able to send the graphics command, it may be necessary to first encode, encapsulate, and send the source data or pattern to the RT, and then encapsulate and send a modified graphics command to the RT. This processing may be offloaded to TSA-G 424. Rather than having the DirectX driver pass commands through GPU-P 412, which would then output those commands to TSA-G 424, it is often more efficient for the DirectX driver to pass commands directly to TSA-G 424.
Fig. 7 shows the functional blocks of a subsystem 700 containing TSA 724, a preferred embodiment of TSA 224 or TSA-G 424. This subsystem communicates with tracking software running on host 200 or host 400, includes a connection to host system bus 206, and possibly a direct connection to the network subsystem via path 226. In the case of TSA-G 424, TSA 724 may also include a direct connection to the graphics controller GPU-P 412 via paths SDVO1 414 and SDVO2 416. Path 416 may be a second SDVO port or a connection to another subsystem. Memory 730 is included in the subsystem and may be embedded in TSA 724 or be an external memory subsystem. Each functional block may also include its own internal memory.
The system controller 708 manages the interface to the host system and other subsystems and performs certain settings and management for the TSA 724. DirectX interpreter 704 offloads the DirectX software driver running on the host system to manage 2D graphics commands, 3D graphics commands, video streams, and other window functions. In combination with the RDP interpreter 702 and the data/video encoder and transcoder 706, the TSA724 frees the host processor from performing many of the computationally intensive aspects of managing RTs and can also optimize the command, data, and video streams to be sent from the host system to the various RTs.
In the case where the host-based GPU212 is not used for RT display support in the system 200, the TSA subsystem 700 may perform a variety of graphics-based optimizations. Various modes of BitBlt, source-to-screen destination BitBlt, and other bitmap transfers may be enhanced by RDP interpreter 702. RDP interpreter 702 may intercept calls from the host, encode the source data, patterns or bitmaps into a more efficient format via data/video encoder and transcoder 706, transmit these encoded data, patterns or bitmaps via system controller 708, and finally issue modified graphics commands to RT 300. The destination RT receives these encoded source data, patterns or bitmaps, decodes them as needed, and performs the desired operations upon receiving the modified graphics commands. The transmission of encoded data and modified commands may employ RDP transmissions or RDP-like transmissions supported by the TSA subsystem 700 and RT 300.
For video streams in system 200, DirectX interpreter 704 may intercept and offload video stream processing and provide the best stream to the target RT. The first step in the offloading is to ensure that the host processor does not perform video decoding on the host CPU. Host-based decoding has some drawbacks, the two most important of which are, first, that it requires a large number of CPU cycles to perform the actual decoding, and second, that decoding a video frame at the host is not necessarily the best way to have the frame displayed at the target RT. Instead, DirectX interpreter 704 intercepts the DirectX calls, which in certain versions of Microsoft Windows may use DirectShow, so that the video stream can be accessed while it is still in compressed form. In order for RDP to continue normal operation, DirectX interpreter 704 may need to provide simulated frames to the RDP interface.
At the same time, the system controller 708 knows what video stream format the RT is capable of decoding, what is the nominal network throughput from the host system to the RT, and what resolution and display characteristics the video stream is for. With this information, the system controller 708 configures the data/video encoder and transcoder 706 to process the incoming video stream to generate the desired stream for network, RT, and display output requirements. This may require transcoding from one encoding format to another, conversion from one bit rate to another, changing frame rates, changing display formats, changing resolution, or some combination of the above. RDP interpreter 702 and system controller 708 then encapsulate the processed bitstream and send it for network processing over main system bus 206 or direct connection 226.
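The stream-shaping decision described above can be sketched as a small planner: given the RT's decode formats, the nominal link throughput, and the target display, decide whether to pass the stream through, transcode the codec, reduce the bit rate, rescale, or some combination. All codec names, thresholds, and plan-step strings are invented for illustration; a real transcoder configuration involves many more parameters.

```python
def plan_stream(src_codec, src_kbps, src_res, rt_codecs, link_kbps, rt_res):
    """Return the list of processing steps needed to fit the incoming stream
    to the RT's decoder, the network link, and the display output."""
    plan = []
    if src_codec not in rt_codecs:
        # RT cannot decode the source format: transcode to one it supports.
        plan.append(f"transcode:{src_codec}->{rt_codecs[0]}")
    if src_kbps > link_kbps:
        # Stream exceeds nominal network throughput: reduce the bit rate.
        plan.append(f"rate:{src_kbps}->{link_kbps}")
    if src_res != rt_res:
        # Resolution mismatch: rescale for the target display.
        plan.append(f"scale:{src_res[0]}x{src_res[1]}->{rt_res[0]}x{rt_res[1]}")
    return plan or ["passthrough"]
```

When no step is needed, passing the compressed stream through untouched is the cheapest option for both the host and the network.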
In the case of system 400, the TSA subsystem 700 may include the functionality described with respect to system 200, but also includes additional support for operating in conjunction with GPU-P 412. There are several ways that the RDP interpreter 702 and GPU-P 412 can interact, and the operation of the TSA subsystem 700 changes accordingly. Two embodiments are considered in detail herein, the first being "terminate and regenerate" and the second being "offload and enhancement". Variations of these embodiments are also possible, and these variations may utilize aspects of each embodiment.
In the case of "terminate and regenerate," an RDP client runs on the host system. As far as the host is concerned, the RDP operation terminates there, and the RDP client creates a virtual display using GPU-P 412. As previously described, GPU-P 412 uses the virtual display space to support multiple virtual RTs, either by creating a single large display map in which each user is offset within the map, or by treating each virtual display as a separate display with its own map. The RDP client software may need to utilize the key exchange and security processes within the TSA subsystem 700 when the RDP host requires secure client communication. When the RDP client receives a command from the RDP host, the client renders the display frame to the display subsystem using GPU-P 412. GPU-P 412 then generates the appropriate selective updates, which are sent out via path 414.
The selective update packets, including the rectangular tiles, are then encoded, encapsulated, and forwarded for network transmission. The main reason for using "terminate and regenerate" instead of merely transmitting drawing commands to the RT 300 is that the required commands are not supported at the RT. Other more subtle reasons also apply, based on bandwidth, the type or sequence of commands, and the relative performance of the RT.
"offload and enhancement" may be performed with a tracking software layer that redirects the DirectX video and data streams. DirectX interpreter 704 intercepts host DirectX calls. The intercepted call is offloaded to the data/video encoder and transcoder 706 which performs the function of a DirectX call. Offloading this function makes the host CPU202 available to other users of the multi-user system. Encoding and decoding may be done with an understanding of the display environment and network bandwidth that allows for optimal processing.
RDP interpreter 702 may also be used to manage the status bits when graphics commands are executed locally and forwarded to the RT for execution. The reason the host graphics also executes the command is so that a current copy of the frame buffer can be maintained for subsequent use. Because the graphics commands are executed at the RT, tiles that change on the host as a result of those commands need not have the selective update hardware send encoded tiles. To prevent this, the RDP interpreter 702 needs to calculate which tiles are affected by the graphics commands. The status bits in the S-buffer corresponding to these tiles may be managed so that selective tile-based updates are not performed.
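The suppression step above can be sketched as follows: given the rectangle a forwarded graphics command touches, clear the status bits of every overlapped tile so the selective update path does not resend data the RT will draw itself. The 80x40 tile size follows the Fig. 6B example; the status-bit plane layout and function name are illustrative assumptions.

```python
TILE_W, TILE_H = 80, 40  # tile size from the Fig. 6B example above

def suppress_tiles(s_buffer, x, y, w, h):
    """Clear status bits for every tile overlapped by the rectangle that a
    command forwarded to the RT will update, preventing redundant tile sends."""
    for r in range(y // TILE_H, (y + h - 1) // TILE_H + 1):
        for c in range(x // TILE_W, (x + w - 1) // TILE_W + 1):
            s_buffer[r][c] = 0
```

Note the overlap test deliberately includes partially covered tiles: the RT's command execution updates those pixels too, so their encoded-tile updates would be just as redundant.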
The tracking software layer may also be used to assist in the encoding selection of changed display frames and to request the generation of a display update stream. The encoding is performed to reduce the data required by the remote display system 300 to reproduce the display data generated by the graphics and display controller 412 of the host computer. The tracking software layer may help identify the type of data within the tiles so that the best type of encoding may be performed. Some RTs may not have sufficient graphics processing power to execute graphics commands, but may send encoded data to them that has been processed by GPU-P412.
For example, if the tracking software layer identifies that the surface of a tile is real-time video, a more efficient video coding scheme suited to smooth spatial transitions and temporal locality can be used for these tiles. If the tracking software layer identifies that the surface of a tile is mostly text, a more efficient encoding scheme suited to the sharp edges and substantial empty space of text can be employed. Identifying which type of data is in which area is a complex problem. However, this implementation of the tracking software layer allows for an interface to the graphics driver architecture of the host operating system and the host display system that facilitates the identification. For example, in Microsoft Windows, a surface utilizing certain DirectShow commands is likely to be video data, whereas a surface that uses the color-expanded bit block transfers (BitBlts) typically associated with text is likely to be text. Each operating system and graphics driver architecture will have its own characteristic indicators. Other implementations may perform multiple types of data encoding in parallel and then select the encoding scheme that produces the best results based on encoder feedback.
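One crude, purely illustrative stand-in for the content-type hint above is to count distinct colors in a tile: text and UI surfaces typically use few colors and favor a lossless, edge-preserving coder, while natural video favors a transform coder. The threshold, labels, and encoder names here are invented assumptions, not the classification the tracking software layer actually performs.

```python
def classify_tile(pixels, text_color_limit=16):
    """Guess a tile's content type from its color diversity.
    pixels: iterable of packed color values for one tile."""
    distinct = len(set(pixels))
    return "text" if distinct <= text_color_limit else "video"

def pick_encoder(kind):
    """Map the guessed content type to an encoding scheme (names illustrative)."""
    return {"text": "lossless-RLE", "video": "wavelet-lossy"}[kind]
```

As the text notes, driver-level hints (which API painted the surface) are far more reliable than pixel statistics; a statistic like this would at best supplement them or arbitrate parallel-encoding feedback.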
Certain types of coding schemes are particularly effective for certain types of data, while other coding schemes are less suited to such data. For example, RLE works reasonably well for text and poorly for video, DCT-based schemes work reasonably well for video and poorly for text, and wavelet-transform-based schemes may work well for both video and text. Although any type of lossless or lossy coding can be used in the system, wavelet transform coding (which may also be of the lossless or lossy type) is particularly suitable for this application, especially a progressive wavelet transform with a deterministic arithmetic encoder that can encode each tile without regard to the surrounding tiles. A derivative of the JPEG2000 wavelet encoder with a process improved for better real-time execution is one possible implementation.
Fig. 8 is a block diagram of a preferred embodiment subsystem 800 (820 of Fig. 2 and 840 of Fig. 4) for offloading and accelerating networking, security, terminal services, storage, and other tasks, such as internet access, from a host processor. The offload subsystem 800 communicates with the host system 200 or 400 primarily over the system bus 206. Connections SDVO1 414 and SDVO2 416 are optional and are incorporated for host systems that include graphics processing for RTs, or for simpler graphics systems that provide redirection of a single remote keyboard, video, and mouse (KVM) over a network for system management. These connections may be direct or through the interface chip 850. Interface control 810 manages the various I/O connections. The network interfaces may include access to the WAN and the RTs. High speed networks such as gigabit Ethernet are preferred but not always practical. Lower speed networks such as 10/100 Ethernet, power line Ethernet, Ethernet over coaxial cable, Ethernet over telephone lines, or wireless Ethernet standards such as 802.11a, b, g, n, and s, future derivatives thereof, and Ultra Wide Band (UWB) versions may also be supported.
KVM may be implemented "in-band," conveniently controlling the host remotely by using the main network connection and software running on the host CPU. Alternatively, the KVM may operate "out-of-band" by using as few host system resources as possible. When used "out-of-band," video monitoring may occur over a network interface other than the "in-band" primary network connection. Additionally, instead of software running on the main processor for the remote KVM function, a special stand-alone Baseboard Management Controller (BMC) is often included. The BMC may run protocols such as the Intelligent Platform Management Interface (IPMI). The BMC may provide its own network interface or may support a side-port connection to the master network controller.
To support dynamic processing of different offload tasks, offload subsystem 800 uses processing blocks that are programmable and configurable and can quickly perform task switches and reconfigurations as workloads change. Various memory blocks may be included in each processing block, and a larger memory 830 may also be included. The CPU 808 is a general programmable processor that includes its own cache memory and may perform housekeeping and management as well as some higher-layer protocol and interface processing for the offload subsystem 800. Network processor and MAC controller 806 manages the Network Interface Controller (NIC) functions of the offload subsystem and may manage multiple bi-directional communication channels. Specific internal memory, such as Content Addressable Memory (CAM), as well as conventional memory, may also be included within the NIC 806. Full NIC 806 functionality may require additional processing from the Secure Processor (SP) 804 and the Configurable Data Processor (CDP) 802.
The configurable data processor 802 may be designed to be easily reconfigured to perform different processes at the throughput typically associated with dedicated hardware blocks. By utilizing the CDP 802 rather than dedicated hardware, different offload tasks may be performed by the same hardware. Prior art methods for designing CDPs, such as reconfigurable datapaths, dynamic instruction sets, Very Long Instruction Word (VLIW), Single Instruction Multiple Data (SIMD), Multiple Instruction Multiple Data (MIMD), Digital Signal Processing (DSP), and other forms of reconfigurable computing, can be combined to perform very high performance computations. The security processor 804 may be implemented by some form of CDP 802, by more specialized hardware, or by a combination of CDP 802 and additional specialized hardware blocks for encryption and key-related functions.
For terminal service acceleration, CDP 802 may be configured to perform data encoding of tiles and rectangles, various forms of transcoding or transrating of video or data, generation and comparison of tile signatures, and other tasks described in part for TSA 224 or 424. For storage acceleration, the CDP 802 may be configured for iSCSI, Fibre Channel (FC), Fibre Channel over Internet Protocol (FCIP), and different aspects of Internet-protocol-related tasks. Connection 416 may be configured to connect to FC instead of SDVO2. For Internet content acceleration, CDPs may be configured to handle Extensible Markup Language (XML) traffic, Simple Object Access Protocol (SOAP), HTTP traffic, Java Virtual Machine (JVM) traffic, and other traffic associated with Internet-based communications.
To manage the data incoming through the SDVO1 214 and SDVO2 216 paths, specific buffering and processing may be provided, or the CDP 802 may be configured to perform specific tasks, which may include deconstructing larger rectangles into tiles, processing the tiles (including signature generation and comparison), and managing the various packets as they relate to the target RT. Previous tile signatures may be stored within subsystem 800 so that signatures may be compared when a new tile is received.
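The signature-based selective update described above can be sketched in software. The sketch below is illustrative only — the class and function names are invented for this example, and the patent does not specify a hash algorithm; SHA-1 stands in for whatever signature function the hardware uses.

```python
import hashlib

def tile_signature(tile_bytes: bytes) -> bytes:
    """Deterministic signature of one tile's pixel data."""
    return hashlib.sha1(tile_bytes).digest()

class TileTracker:
    """Stores the last-sent signature per tile position for one RT."""
    def __init__(self):
        self.signatures = {}  # (row, col) -> signature of last tile sent

    def changed_tiles(self, tiles):
        """Yield only tiles whose content differs from what the RT has.

        `tiles` maps (row, col) -> raw tile bytes for the new frame.
        """
        for pos, data in tiles.items():
            sig = tile_signature(data)
            if self.signatures.get(pos) != sig:
                self.signatures[pos] = sig
                yield pos, data

tracker = TileTracker()
frame1 = {(0, 0): b"\x00" * 768, (0, 1): b"\x10" * 768}
frame2 = {(0, 0): b"\x00" * 768, (0, 1): b"\x20" * 768}  # only (0,1) changed
first = dict(tracker.changed_tiles(frame1))   # both tiles are new
second = dict(tracker.changed_tiles(frame2))  # only the changed tile
```

Only tiles whose signature differs from the stored one are forwarded, which is what makes the per-tile update "selective."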
The GPU may have any number of physical and logical connections for a display output port, including VGA, DVO, DVI, SDVO, DisplayPort, or any number of higher- or lower-speed ports. As such, an interface chip 850 may be required between the GPU display output port and the offload subsystem 800. The connection 816 from the offload subsystem may be implemented as a PCI Express port of arbitrary bandwidth. In a preferred embodiment, offload subsystem 800 acts as a PCI Express root controller, and interface control 810 manages the PCI Express ports. The interface chip may perform some buffering and some required pre-processing. For example, the interface chip may buffer multiple display data lines and perform data packing, format conversion, color space conversion, subband decomposition, or any number of other functions. In a preferred embodiment, the output from the graphics chip via DVO connection 416 is 24-bit RGB data. The interface chip 850 buffers the RGB data, converts it to YUV 4:4:4 data, and splits the pixels into separate Y, U, and V data packets. Using interface control 810, the offload subsystem 800 performs PCI Express root control, and the Y, U, and V data packets are sent to different regions of memory 830 via path 816.
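The RGB-to-YUV 4:4:4 conversion and plane split performed by interface chip 850 can be sketched as follows. Note the conversion matrix is an assumption: the patent does not specify coefficients, so full-range BT.601 values are used here for illustration.

```python
def rgb_to_yuv444(pixels):
    """Convert (R, G, B) 8-bit pixels to separate Y, U, and V planes.

    Uses full-range BT.601 coefficients (an assumption; the patent
    does not specify the conversion matrix).
    """
    y_plane, u_plane, v_plane = [], [], []
    for r, g, b in pixels:
        y = 0.299 * r + 0.587 * g + 0.114 * b
        u = -0.169 * r - 0.331 * g + 0.5 * b + 128   # chroma centered at 128
        v = 0.5 * r - 0.419 * g - 0.081 * b + 128
        y_plane.append(round(max(0, min(255, y))))
        u_plane.append(round(max(0, min(255, u))))
        v_plane.append(round(max(0, min(255, v))))
    return bytes(y_plane), bytes(u_plane), bytes(v_plane)

# A mid-gray pixel has no chroma: Y tracks luminance, U and V sit at 128.
y, u, v = rgb_to_yuv444([(128, 128, 128)])
```

Returning three separate byte strings mirrors the text's point that Y, U, and V are packetized and stored in different regions of memory 830.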
Offload subsystem 800 may be implemented by a programmable solution that also addresses the general offload tasks of several unrelated operations. Servers may benefit from offloading of networking, storage, security, and other tasks. The offload processor may be designed to statically or dynamically balance the various offload tasks and speed up the overall throughput of the system for any given workload. For example, a server may perform server-based computing for thin clients during the day while running large database operations at night. During the day, the offload engine runs the operations described for the TSA; at night, it runs iSCSI acceleration to access large databases from disk storage systems. This flexibility may be managed by an on-board or on-chip management processor that tracks the various workloads. The granularity of switching between offload tasks may be very fine. The offload engine may be designed to perform extremely fast context switching, enabling it to perform network, terminal services, storage, security, and other offload tasks within a single system.
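The workload-driven reconfiguration above can be modeled minimally as a management processor choosing an offload "personality" for the CDP. The personality names and configuration lists here are hypothetical, chosen to match the day/night example in the text.

```python
# Hypothetical offload personalities and the CDP configuration each needs.
PERSONALITIES = {
    "terminal_services": ["tile_encode", "video_transcode", "signature"],
    "storage": ["iscsi", "fc", "fcip"],
    "internet_content": ["xml", "soap", "http", "jvm"],
}

class ManagementProcessor:
    """Tracks workload demand and reconfigures the CDP accordingly."""
    def __init__(self):
        self.active = None
        self.switches = 0

    def schedule(self, workload: str) -> list:
        """Select the configuration for `workload`, counting reconfigurations."""
        if workload not in PERSONALITIES:
            raise ValueError(f"unknown workload: {workload}")
        if workload != self.active:
            self.active = workload
            self.switches += 1      # a context switch of the offload engine
        return PERSONALITIES[workload]

mp = ManagementProcessor()
mp.schedule("terminal_services")   # daytime: thin-client acceleration
mp.schedule("terminal_services")   # same workload -> no reconfiguration
mp.schedule("storage")             # nighttime: iSCSI acceleration
```

Counting switches separately from scheduling reflects the text's point that reconfiguration has a cost, which fast context switching minimizes.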
Fig. 9 is a flowchart of method steps for performing a terminal service acceleration procedure, according to one embodiment of the invention. For clarity, the process is discussed with reference to display data including video. However, it is contemplated that processes involving audio, keyboard, mouse, and other data are equally applicable to the present invention. Initially, in step 910, the multi-user server-based computer 200 or 400 and the remote terminal system 300 follow different procedures to initialize and set up the host side and terminal side of the various subsystems in order to start each RT. In step 912, the tracking software layer on host 200 or host 400 operates in conjunction with TSA 224 or TSA-G 424 to process the various graphics and video calls in order to determine which operations need to be performed and where. Note that host system 200 does not utilize a host-resident GPU or virtual frame buffer to perform RT graphics operations.
If the graphics operations comprise 2D graphics, then in step 924 the 2D graphics engine of GPU-P 412 preferably processes these operations into the appropriate virtual display in RAM 430. Similarly, 3D drawing is performed by GPU 412 on the appropriate virtual display in RAM in step 926. In step 928, TSA 224 or TSA-G 424 may determine that the video or graphics command is to be forwarded to the appropriate RT. Flow proceeds to step 940 whether or not the bypass of step 928 is taken. In step 940, GPU-P 412 composes each virtual display into a frame suitable for display. This compositing may be performed with any combination of the operations of CPU subsystem 202, the 2D engine, the 3D engine, and any video processing element within GPU 412. As part of the compositing step, since GPU-P 412 includes S-buffer management in the graphics processing hardware, the drawing engine updates the S-buffers for the corresponding tiles.
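The compositing of step 940 — combining virtual display surfaces into one displayable frame — can be illustrated with a minimal software sketch. In the patent this work is done by the 2D/3D engines of GPU-P 412; the surface representation below is invented for the example.

```python
def composite(frame_w, frame_h, surfaces):
    """Compose virtual display surfaces into one displayable frame.

    `surfaces` is a list of (x, y, w, h, pixels) tuples, where `pixels`
    is a row-major list of length w*h; later surfaces draw over earlier
    ones, as in back-to-front composition.
    """
    frame = [0] * (frame_w * frame_h)  # cleared background
    for x, y, w, h, pixels in surfaces:
        for row in range(h):
            for col in range(w):
                frame[(y + row) * frame_w + (x + col)] = pixels[row * w + col]
    return frame

# Two 2x2 virtual displays composed into a 4x4 frame, the second
# overlapping the first by one pixel column.
frame = composite(4, 4, [
    (0, 0, 2, 2, [1, 1, 1, 1]),
    (1, 0, 2, 2, [2, 2, 2, 2]),
])
```

The composed frame is what would then be divided into tiles for the S-buffer and signature processing of the following steps.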
GPU-P 412 may return to processing the next frame of the same RT or a different RT as needed, as shown by return path 944. Once the composition operation is performed, step 946 manages the tiles and the associated S-buffer status bits and signature bits at the appropriate locations. Step 946 considers any graphics and video operations processed through the video and graphics bypass step 928 that may affect the S-buffer status bits. For example, if a drawing operation performed at step 924 was bypassed to the remote terminal via step 928, then, since the drawing occurs at the RT, selective updates need not be performed on the tiles affected by that drawing operation.
As the status bits and signatures of the tiles are processed in step 946 (which may occur within GPU-P 412 or in conjunction with TSA-G 424), step 950 may perform selective updates of the tiles. The size of these tiles may be fixed or variable. Header information included within each tile indicates the format and the intended remote terminal destination. In step 954, the TSA-G 424 performs the necessary encoding on the tiles received from step 950. The encoding is preferably a deterministic scheme, in which the orientation of the data within the tile and the surrounding tiles need not be considered in the encoding step. Also in step 954, the video data and graphics commands from step 928 are processed. Video data can be transrated with changes in bit rate or frame rate, scaled in the frequency domain or spatial domain, and, if necessary, transcoded to a different encoding standard. Network feedback from the RT via return path 968 may help determine the encoding of step 954.
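One simple encoding that is deterministic in the sense described — each tile is encoded from its own bytes alone, with no dependence on neighboring tiles — is run-length encoding. The patent does not name a specific codec; RLE is used below purely as an illustration of the property.

```python
def rle_encode(tile: bytes) -> bytes:
    """Run-length encode one tile independently of its neighbors."""
    out = bytearray()
    i = 0
    while i < len(tile):
        run = 1
        while i + run < len(tile) and tile[i + run] == tile[i] and run < 255:
            run += 1
        out += bytes([run, tile[i]])  # (count, value) pair
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Invert rle_encode: expand each (count, value) pair."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)

tile = b"\x00" * 10 + b"\x07\x07\x09"
encoded = rle_encode(tile)
```

Because the encoder sees only the tile's own bytes, identical tiles always produce identical encodings, which is also what makes signature-based comparison of encoded tiles meaningful.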
Step 954 also performs any graphics operations that require additional processing, where graphics data may need to be encoded. In step 958, the TSA-G 424 performs further encapsulation of the graphics commands, data transfers, or video transfers processed in the previous steps. Network feedback is also considered in this step with respect to network characteristics such as bandwidth, latency, specific packet sizes, and transmission issues. In step 962, the encapsulated packets are processed via the network controller 228 and transmitted over the network to the appropriate RT 300.
Network processing step 962 uses information from the system control. That information may include which remote displays require which frame update data streams, what type of network transport protocol is used for each frame update stream, and the priority and retry characteristics of each portion of each frame update stream. Network processing step 962 utilizes network controller 228 to manage any number of network connections. The various networks may include gigabit Ethernet, 10/100 Ethernet, powerline Ethernet, Ethernet over coaxial cable, Ethernet over telephone line, or wireless Ethernet standards such as 802.11a, b, g, n, and s, as well as future derivatives. Other non-Ethernet connections are possible and may include USB, 1394a, 1394b, 1394c, or other wireless protocols such as Ultra Wideband (UWB) or WiMAX.
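The encapsulation of steps 958–962 can be sketched with a small packing routine. The patent states that headers carry the format and the intended remote terminal destination but does not define a wire layout, so the header fields and their sizes below are hypothetical.

```python
import struct

# Hypothetical header layout: destination RT id, payload type, flags,
# and payload length, in network byte order.
HEADER = struct.Struct("!HBBI")  # rt_id(u16), type(u8), flags(u8), len(u32)

TYPE_TILE, TYPE_VIDEO, TYPE_GFX_CMD = 1, 2, 3

def encapsulate(rt_id: int, payload_type: int, payload: bytes,
                flags: int = 0) -> bytes:
    """Wrap an encoded tile / video / command payload for transmission."""
    return HEADER.pack(rt_id, payload_type, flags, len(payload)) + payload

def decapsulate(packet: bytes):
    """Recover the destination, type, and payload from a packet."""
    rt_id, ptype, flags, length = HEADER.unpack_from(packet)
    return rt_id, ptype, packet[HEADER.size:HEADER.size + length]

pkt = encapsulate(7, TYPE_TILE, b"tile-bytes")
rt, ptype, body = decapsulate(pkt)
```

A real implementation would additionally fold in the network-feedback parameters the text mentions (packet size, priority, retry behavior) when sizing and scheduling packets.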
Fig. 10 is a flowchart of method steps for performing a network reception and display procedure, in accordance with one embodiment of the present invention. For clarity, the process is discussed with reference to display data including video. However, it is contemplated that processes involving audio and other data are equally applicable to the present invention.
In the embodiment of FIG. 10, initially, in step 1012, remote terminal 300 preferably receives a network transmission from host computer 200 via path 290. Subsequently, in step 1014, the network controller 336 preferably performs network processing to execute a network protocol to receive the transmitted data, whether the transmission is wired or wireless.
In step 1020, the CPU 324 interprets the incoming transmission to determine which functional unit the transmission is directed to. If the incoming transmission is a 2D graphics command, CPU 324 initiates an operation via 2D drawing engine 332; if a 3D command, via 3D drawing engine 334; if a video data stream, via video decoder 328; and if an encoded data tile, via the data decoder 326. Some drawing commands may use both a drawing engine and the data decoder 326.
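The dispatch of step 1020 is essentially a lookup from transmission type to functional unit. The sketch below encodes that routing table directly; the type strings are invented for the example, while the unit names follow the reference numerals in the text.

```python
# Illustrative transmission types mapped to the functional unit that
# handles each, per the dispatch described for step 1020.
DISPATCH = {
    "2d_command":   "2d_drawing_engine_332",
    "3d_command":   "3d_drawing_engine_334",
    "video_stream": "video_decoder_328",
    "encoded_tile": "data_decoder_326",
}

def route(transmission_type: str) -> str:
    """Return the functional unit that should handle a transmission."""
    try:
        return DISPATCH[transmission_type]
    except KeyError:
        raise ValueError(f"unknown transmission type: {transmission_type}")

unit = route("video_stream")
```

A command needing both a drawing engine and the data decoder, as the text allows, would simply be routed to both units in sequence.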
A varying number of commands and data transfers may occur, with the various functional units operating on, and preferably manipulating, the data into an appropriate displayable format. In step 1030, the manipulated data from each functional unit is combined via the frame manager 330, and an updated display frame may be generated in RAM 312. The updated display frame may include display frame data from a previous frame, new frame data that has been manipulated and decoded, and any processing required to conceal display data errors that occurred during transmission of the new frame data.
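The frame assembly of step 1030 — carrying forward prior-frame data wherever no update arrived — can be sketched as follows. The pixel-index update map is a stand-in for decoded tile or region data; the function name is illustrative.

```python
def build_frame(previous: list, updates: dict) -> list:
    """Produce the next display frame from the prior frame plus updates.

    Positions without an update carry forward from the previous frame,
    which is also a simple form of concealment for data lost in transit.
    `updates` maps pixel index -> new value (decoded update data).
    """
    frame = list(previous)          # start from last completed frame
    for index, value in updates.items():
        frame[index] = value        # apply newly decoded data
    return frame

prev = [5, 5, 5, 5]
nxt = build_frame(prev, {1: 9, 3: 9})
```

Copying rather than mutating `previous` matches the text's model: the display controller keeps refreshing from the most recently *completed* frame while the next one is assembled.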
Finally, in step 1040, the display controller 330 provides the most recently completed display frame data to the remote terminal display screen 310 for viewing by the user of the remote terminal system 300. Display refresh is an asynchronous operation that typically runs between the remote terminal controller 314 and the display 310 at 60 to 72 times per second to avoid flicker. The generation of new display frames in step 1030 typically occurs less frequently, although it may occur at 30 frames per second or more if desired. Absent a screen saver or power-down mode, the display processor continues to update the remote display screen 310 with the most recently completed display frame during the display refresh process, as indicated by feedback path 1050.
The present invention thus implements a multi-user server-based computer system that supports remote terminals that users can effectively utilize in a variety of applications. For example, an enterprise may deploy a rack of computer systems at one location while providing users at remote locations with simple, inexpensive remote terminal systems 300 on their desks. Different remote locations may be supported through a LAN, a WAN, or another connection. The RT may be a desktop or notebook personal computer; in other systems it may be a specialized device such as a cell phone or personal digital assistant, or it may be combined with other consumer products such as portable video players, game consoles, or remote control systems. A user may flexibly utilize a host computer of the multi-user system 100 to obtain the same level of software compatibility and a similar level of performance that the host computer system can provide to local users. Thus, the present invention effectively implements a flexible multi-user system that utilizes a variety of different components to facilitate optimal system interoperability and functionality.
The invention has been described above with reference to preferred embodiments. Other embodiments will be apparent to those skilled in the art in light of this disclosure. For example, the present invention can be readily implemented using other configurations in addition to those described in the above preferred embodiments. In addition, the present invention may also be usefully employed in conjunction with systems other than those described above as preferred embodiments. Accordingly, the present invention is limited only by the following claims, which are intended to cover these and other variations of the preferred embodiments.

Claims (12)

1. A host computer system capable of supporting a plurality of remote terminals, comprising:
a host CPU connected to other subsystems within the host computer system via a system bus;
an offload subsystem for managing the remote terminal, having:
means for intercepting graphics commands or video data on said bus from processing by said host CPU; and
means for processing the intercepted graphics commands or video data, the processing comprising at least one of: transcoding, changing frame rate, changing display format, changing resolution, the processing being performed according to video decoding performance at the plurality of remote terminals; and
means for managing the intercepted and processed graphics commands or video data for transmission to the corresponding remote terminal via a network subsystem.
2. The system of claim 1, wherein the graphics commands are intercepted and data blocks associated with the graphics commands are encoded by the offload subsystem to reduce bandwidth required for transmission by the network.
3. The system of claim 1, wherein the video data is intercepted by the offload subsystem and processed to match network performance and decoding capabilities of individual ones of the remote terminals, wherein the processing may include changing bit rates (transrating), frame rates, resolutions, or encoding algorithms (transcoding).
4. The system of claim 1, wherein the intercepting means comprises a tracing software layer running on the host CPU.
5. The system of claim 1, wherein the offload subsystem comprises means for connecting from a graphics and display controller to one or more display output paths.
6. The system of claim 5, wherein the graphics and display controller is configured to generate output for a local display, and remote management is performed via the remote terminal.
7. The system of claim 5, wherein the offload subsystem is configured to perform encoding of the host system graphics and display output prior to its network transmission to one of the remote terminals.
8. The system of claim 5, wherein the graphics and display controller supports multiple RTs and the connection provides selective subframe updates to the offload system corresponding to different subframes in multiple remote terminals.
9. In a system including a host computer, a method for operating a multi-user host system having a plurality of remote terminals, the host including software, a host CPU and an offload engine, comprising:
using the offload engine to assist the host CPU in processing graphics commands and video data, wherein processing the video data includes at least one of: changing a bit rate, changing a resolution, changing a frame rate, and changing an encoding algorithm, and processing the graphics commands comprises encapsulating and encoding data associated with the graphics commands;
determining which of the remote terminals are destinations of the processed graphics commands and the processed video data; and
the processed graphics commands and processed video data are propagated through a network interface according to network protocol techniques.
10. The method of claim 9, wherein the multi-user host system comprises a local graphics processor with frame memory corresponding to one or more of the remote terminals, wherein the local graphics processor is configured to perform the steps of:
representing the graphics commands in a display frame;
determining on a sub-frame basis which sub-frames need to be selectively updated on each of the remote terminals; and
sending the selective update to the offload engine.
11. The method of claim 10, wherein the offload engine is configured to perform encoding on the selectively updated subframes.
12. The method of claim 9, wherein the offload engine is further to offload one or more other processing tasks from the host CPU.
HK13100853.1A 2005-11-01 2013-01-18 Multi-user terminal services accelerator HK1173801A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/264,269 2005-11-01

Publications (1)

Publication Number Publication Date
HK1173801A true HK1173801A (en) 2013-05-24

Similar Documents

Publication Publication Date Title
JP5060489B2 (en) Multi-user terminal service promotion device
CN101553795B (en) Multi-user display proxy server
US8200796B1 (en) Graphics display system for multiple remote terminals
US10721282B2 (en) Media acceleration for virtual computing services
US8638336B2 (en) Methods and systems for remoting three dimensional graphical data
US7916956B1 (en) Methods and apparatus for encoding a shared drawing memory
US7730157B2 (en) Methods, media, and systems for displaying information on a thin-client in communication with a network
US20200084480A1 (en) Network-enabled graphics processing module
US20140285502A1 (en) Gpu and encoding apparatus for virtual machine environments
US20050021656A1 (en) System and method for network transmission of graphical data through a distributed application
US9235452B2 (en) Graphics remoting using augmentation data
US20090328037A1 (en) 3d graphics acceleration in remote multi-user environment
HK1173801A (en) Multi-user terminal services accelerator
TWI598817B (en) Multi-user computer system