US20140368739A1
2014-12-18
14/476,530
2014-09-03
As information to be processed at an object-based video or audio-visual (AV) terminal, an object-oriented bitstream includes objects, composition information, and scene demarcation information. Such bitstream structure allows on-line editing, e.g. cut and paste, insertion/deletion, grouping, and special effects. In the interest of ease of editing, AV objects and their composition information are transmitted or accessed on separate logical channels (LCs). Objects which have a lifetime in the decoder beyond their initial presentation time are cached for reuse until a selected expiration time.
H04N5/265 » CPC main
Details of television systems; Studio circuitry; Studio devices; Studio equipment ; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, TV cameras, video cameras, camcorders, webcams, camera modules for embedding in other devices, e.g. mobile phones, computers or vehicles; Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects Mixing
G06T1/60 » CPC further
General purpose image data processing Memory management
H04N5/44 IPC
Details of television systems Receiver circuitry for the reception of television signals according to analogue transmission standards
This application is a continuation of U.S. patent application Ser. No. 13/345,208, filed Jan. 6, 2012, which is a continuation of U.S. patent application Ser. No. 12/482,292, filed Jun. 10, 2009, which is a continuation of U.S. patent application Ser. No. 11/688,368, filed Mar. 20, 2007 which is a divisional of U.S. patent application Ser. No. 09/367,433, filed Jan. 13, 2000, which is a national stage of International Application PCT/US98/02668, filed Feb. 13, 1998, which claims the benefit of U.S. Provisional Application Ser. No. 60/037,779, filed Feb. 14, 1997, each of which is incorporated by reference in its entirety herein, and from which priority is claimed.
This invention relates to the representation, transmission, processing and display of video and audio-visual information, more particularly of object-based information.
Image and video compression techniques have been developed which, unlike traditional waveform coding, attempt to capture high-level structure of visual content. Such structure is described in terms of constituent “objects” which have immediate visual relevancy, representing familiar physical objects, e.g. a ball, a table, a person, a tune or a spoken phrase. Objects are independently encoded using a compression technique that gives best quality for each object. The compressed objects are sent to a terminal along with composition information which tells the terminal where to position the objects in a scene. The terminal decodes the objects and positions them in the scene as specified by the composition information. In addition to yielding coding gains, object-based representations are beneficial with respect to modularity, reuse of content, ease of manipulation, ease of interaction with individual image components, and integration of natural, camera-captured content with synthetic, computer-generated content.
In a preferred architecture, structure or format for information to be processed at an object-based video or audio-visual (AV) terminal, an object-oriented bitstream includes objects, composition information, and scene demarcation information. The bitstream structure allows on-line editing, e.g. cut and paste, insertion/deletion, grouping, and special effects.
In the preferred architecture, in the interest of ease of editing, AV objects and their composition information are transmitted or accessed on separate logical channels (LCs). The architecture also makes use of “object persistence”, taking advantage of some objects having a lifetime in the decoder beyond their initial presentation time, until a selected expiration time.
FIG. 1 is a functional schematic of an exemplary object-based audio-visual terminal.
FIG. 2a is a schematic of an exemplary object-based audio-visual composition packet.
FIG. 2b is a schematic of an exemplary object-based audio-visual data packet.
FIG. 2c is a schematic of an exemplary compound composition packet.
FIG. 3 is a schematic of exemplary node and scene description information using composition.
FIG. 4 is a schematic of exemplary stream-node association information.
FIG. 5 is a schematic of exemplary node/graph update information using a scene.
FIG. 6 is a schematic of an exemplary audio-visual terminal design.
FIG. 7 is a schematic of an exemplary audio-visual system controller in the terminal according to FIG. 6.
FIG. 8 is a schematic of exemplary information flow in the controller according to FIG. 7.
An audio-visual (AV) terminal is a systems component which is instrumental in forming, presenting or displaying audio-visual content. This includes (but is not limited to) end-user terminals with a monitor screen and loudspeakers, as well as server and mainframe computer facilities in which audio-visual information is processed. In an AV terminal, desired functionality can be hardware-, firmware- or software-implemented. Information to be processed may be furnished to the terminal from a remote information source via a telecommunications channel, or it may be retrieved from a local archive, for example. An object-oriented audio-visual terminal more specifically receives information in the form of individual objects, to be combined into scenes according to composition information supplied to the terminal.
FIG. 1 illustrates such a terminal, including a de-multiplexer (DMUX) 1 connected via a logical channel LC0 to a system controller or “executive” 2 and via logical channels LC1 through LCn to a buffer 3. The executive 2 and the buffer 3 are connected to decoders 4 which in turn are connected to a composer unit 5. Also, the executive 2 is connected to the composer unit 5 directly, and has an external input for user interaction, for example.
In the preferred AV architecture, the AV objects and their composition information are transmitted or accessed on separate logical channels. The DMUX receives the Mux2 layer from the lower layers and de-multiplexes it into logical channels. LC0 carries composition information which is passed on to the executive. The AV objects received on other logical channels are stored in the buffer to be acted upon by the decoders. The executive receives the composition information, which includes the decoding and presentation time stamps, and instructs the decoders and composer accordingly.
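The channel routing described above can be sketched as follows. This is a minimal illustration, not the terminal's actual implementation: the class name `Terminal`, its attributes, and the payload values are hypothetical, and each packet is assumed to arrive already tagged with its logical channel number.

```python
from collections import defaultdict, deque


class Terminal:
    """Toy front end: LC0 feeds the executive, LC1..LCn feed the buffer."""

    def __init__(self):
        # Composition information received on LC0 (hypothetical name).
        self.executive_queue = deque()
        # Object data keyed by logical channel number (hypothetical name).
        self.object_buffer = defaultdict(deque)

    def demux(self, channel, payload):
        """Route one payload according to its logical channel."""
        if channel < 0:
            raise ValueError("logical channel numbers are non-negative")
        if channel == 0:
            self.executive_queue.append(payload)
        else:
            self.object_buffer[channel].append(payload)


terminal = Terminal()
for channel, payload in [(0, "OCP for object 7"), (1, b"video data"), (2, b"audio data")]:
    terminal.demux(channel, payload)
```

Keeping composition information on its own channel means an editor can rewrite scene composition without touching the object data streams.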
The system handles object composition packets (OCP) and object data packets (ODP). A composition packet contains an object's ID, time stamps and the “composition parameters” for rendering the object. An object data packet contains an object ID, an expiration time stamp in case of persistent objects, and object data.
Preferably, any external input such as user interaction is converted to OCP and/or ODP before it is presented to the executive. There is no need for headers in a bitstream delivered over a network. However, headers are required when storing an MPEG-4 presentation in a file.
FIGS. 2a and 2b illustrate the structure of composition and data packets in further detail. Relevant features are as follows:
| Composition Objects (16-bit object IDs) |
| 0X0000 | scene configuration object | |
| 0X0001 | node hierarchy specification | |
| 0X0002 | stream-node association | |
| 0X0003 | node/scene update | |
| 0X0004 | compound object | |
| Object Data (object type, 6 most significant bits) |
| 0b00.0010 | text | |
| 0b00.0011 | MPEG2 VOP (rectangular VOP) | |
| Object_data_packet{ | ||
| ObjectID | 16 bits + any extensions; | |
| CIPI | 2 bits | |
| if (CIPI <= 1){ | ||
| Priority | 5 bits |
| if (object type is MPEG VOP) | |
| (any prediction based compression) |
| VOP_type | 2 bits | |
| } | ||
| if (CIPI == 1) | ||
| ETS | 28 bits | |
| ObjectData | ||
| } | ||
| Object_composition_packet{ | ||
| ObjectID | 16 bits + any extensions | |
| OCR_Flag | 1 bit | |
| Display_Timers_Flag | 1 bit | |
| DTS | 30 bits | |
| if (OCR_Flag) | ||
| OCR | 30 bits |
| if (Display_Timers_Flag) { |
| PTS | 30 bits | |
| LTS | 28 bits | |
| } | ||
| Composition parameters; | ||
| } | ||
| Composition_parameters{ |
| visibility | 1 bit | |
| composition_order | 5 bits | |
| number_of_motion_sets | 2 bits | |
| x_delta_0 | 12 bits | |
| y_delta_0 | 12 bits |
| for (i = 1; i <= number_of_motion_sets; i++){ |
| x_delta_i | 12 bits | |
| y_delta_i | 12 bits | |
| } | ||
| } | ||
| Compound_composition_packet{ | ||
| ObjectID | 16 bits | |
| PTS | 30 bits | |
| LTS | 28 bits |
| Composition_parameters |
| ObjectCount | 8 bits | |
| for (i = 0; i < ObjectCount; i++){ |
| Object_composition_packet; |
| } | |
| } | |
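The object composition packet layout listed above can be read with a small bit-level parser. The sketch below is illustrative only and rests on assumptions the text does not state: fields are packed most significant bit first, the ObjectID carries no extensions, the delta fields are treated as unsigned, and the packet is zero-padded to a whole byte. `BitReader` and `pack_bits` are hypothetical helpers introduced for the example.

```python
class BitReader:
    """Reads unsigned fields MSB first from a bytes object (assumed bit order)."""

    def __init__(self, data: bytes):
        self.value = int.from_bytes(data, "big")
        self.remaining = len(data) * 8

    def read(self, width: int) -> int:
        if width > self.remaining:
            raise EOFError("not enough bits left in packet")
        self.remaining -= width
        return (self.value >> self.remaining) & ((1 << width) - 1)


def pack_bits(fields):
    """Pack (value, width) pairs MSB first, zero-padding to a whole byte."""
    acc, total = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        acc = (acc << width) | value
        total += width
    pad = (-total) % 8
    return (acc << pad).to_bytes((total + pad) // 8, "big")


def parse_composition_parameters(r: BitReader) -> dict:
    params = {"visibility": r.read(1), "composition_order": r.read(5)}
    motion_sets = r.read(2)
    # x_delta_0/y_delta_0 are always present, followed by one pair per motion set.
    params["deltas"] = [(r.read(12), r.read(12)) for _ in range(motion_sets + 1)]
    return params


def parse_object_composition_packet(r: BitReader) -> dict:
    pkt = {"object_id": r.read(16)}  # extensions not handled in this sketch
    ocr_flag = r.read(1)
    display_timers_flag = r.read(1)
    pkt["dts"] = r.read(30)
    if ocr_flag:
        pkt["ocr"] = r.read(30)
    if display_timers_flag:
        pkt["pts"] = r.read(30)
        pkt["lts"] = r.read(28)
    pkt["composition"] = parse_composition_parameters(r)
    return pkt


# Round trip: object 7, no OCR, display timers present, one motion set.
packet = pack_bits([
    (7, 16), (0, 1), (1, 1), (1000, 30),   # ObjectID, OCR_Flag, Display_Timers_Flag, DTS
    (1500, 30), (9000, 28),                # PTS, LTS
    (1, 1), (3, 5), (1, 2),                # visibility, composition_order, number_of_motion_sets
    (10, 12), (20, 12), (5, 12), (6, 12),  # x/y_delta_0, x/y_delta_1
])
parsed = parse_object_composition_packet(BitReader(packet))
```

A compound composition packet could be handled the same way by reading its header fields and then calling `parse_object_composition_packet` ObjectCount times.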
AV terminal buffers are flushed using the Flush_Cache_Flag and the Scene_Update_Flag. When using hierarchical scene structure, the current scene graph is flushed and the terminal loads the new scene from the bitstream. Use of flags allows for saving the current scene structure instead of flushing it. These flags are used to update the reference scene width and height whenever a new scene begins. If the Flush_Cache_Flag is set, the cache is flushed, removing the objects (if any). If the Scene_Update_Flag is set, there are two possibilities: (i) the Flush_Cache_Flag is set, implying that the objects in the cache will no longer be used; (ii) the Flush_Cache_Flag is not set, in which case the new scene being introduced (an editing action on the bitstream) splices the current scene, and the objects in the current scene will be used after the end of the new scene. The ETS of the objects, if any, will be frozen for the duration of the new scene introduced. The beginning of the next scene is indicated by another scene configuration packet.
| Scene_configuration_packet{ | ||
| ObjectID | 16 bits (0X0000) | |
| Flush_Cache_Flag | 1 bit | |
| Scene_Update_Flag | 1 bit | |
| if (Scene_Update_Flag){ | ||
| ref_scene_width | 12 bits | |
| ref_scene_height | 12 bits | |
| } | ||
| } | ||
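The flag handling for the scene configuration packet can be sketched as below. This is a hedged illustration: MSB-first bit packing and zero padding to a whole byte are assumptions, the `state` dictionary and its keys (`cache`, `ets_frozen`, `ref_size`) are hypothetical, and the behavior when neither flag is set is not specified in the text, so the sketch leaves the state unchanged in that case.

```python
def parse_scene_configuration(data: bytes) -> dict:
    """Parse a scene configuration packet (object ID 0x0000), MSB first."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    pos = 0

    def take(width: int) -> int:
        nonlocal pos
        if pos + width > total:
            raise EOFError("packet truncated")
        pos += width
        return (bits >> (total - pos)) & ((1 << width) - 1)

    if take(16) != 0x0000:
        raise ValueError("not a scene configuration packet")
    cfg = {"flush_cache": bool(take(1)), "scene_update": bool(take(1))}
    if cfg["scene_update"]:
        cfg["ref_scene_width"] = take(12)
        cfg["ref_scene_height"] = take(12)
    return cfg


def apply_scene_configuration(cfg: dict, state: dict) -> None:
    """Update hypothetical terminal state according to the two flags."""
    if cfg["flush_cache"]:
        state["cache"].clear()        # cached objects will no longer be used
        state["ets_frozen"] = False
    elif cfg["scene_update"]:
        state["ets_frozen"] = True    # spliced scene: keep objects, freeze their ETS
    if cfg["scene_update"]:
        state["ref_size"] = (cfg["ref_scene_width"], cfg["ref_scene_height"])


# Splice in a 352x288 scene without flushing: 16-bit ID 0, flags 0/1, 12+12 bits, 6 pad bits.
splice_packet = (((0b01 << 12 | 352) << 12 | 288) << 6).to_bytes(6, "big")
# Flush only: 16-bit ID 0, flags 1/0, 6 pad bits.
flush_packet = (0b10 << 6).to_bytes(3, "big")

state = {"cache": {"logo": b"..."}, "ets_frozen": False, "ref_size": None}
splice_cfg = parse_scene_configuration(splice_packet)
apply_scene_configuration(splice_cfg, state)
state_after_splice = {"cache": dict(state["cache"]), "ets_frozen": state["ets_frozen"], "ref_size": state["ref_size"]}
apply_scene_configuration(parse_scene_configuration(flush_packet), state)
```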
A hierarchy of nodes is defined, describing a scene. The scene configuration packets can also be used to define a scene hierarchy that allows for a description of scenes as a hierarchy of AV objects. Each node in such a graph groups leaves and/or other nodes of the graph into a compound AV object. Each node (leaf) has a unique ID followed by its parameters as shown in FIG. 3.
As illustrated by FIG. 4, table entries associate the elementary object streams in the logical channels to the nodes in a hierarchical scene. The stream IDs are unique, but not the node IDs. This implies that more than one stream can be associated with the same node.
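A stream-node association table with these properties might look like the following sketch. The class `StreamNodeTable` and its method names are hypothetical; the only behavior taken from the text is that a stream ID may appear once, while several streams may map to the same node.

```python
from collections import defaultdict


class StreamNodeTable:
    """Maps unique stream IDs to (possibly shared) scene node IDs."""

    def __init__(self):
        self.node_of_stream = {}
        self.streams_of_node = defaultdict(list)

    def associate(self, stream_id: int, node_id: int) -> None:
        if stream_id in self.node_of_stream:
            raise ValueError(f"stream {stream_id} is already associated")
        self.node_of_stream[stream_id] = node_id
        self.streams_of_node[node_id].append(stream_id)


table = StreamNodeTable()
table.associate(stream_id=10, node_id=3)   # e.g. a video stream
table.associate(stream_id=11, node_id=3)   # e.g. its audio, on the same node
table.associate(stream_id=12, node_id=4)
```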
FIG. 5 illustrates updating of the nodes in the scene hierarchy, by modifying the specific parameters of the node. The graph itself can be updated by adding/deleting the nodes in the graph. The update type in the packet indicates the type of update to be performed on the graph.
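The three kinds of update mentioned here (modifying node parameters, adding a node, deleting a node) can be sketched on a simple tree. Everything in the example is an assumption made for illustration: the update type names `"add"`, `"modify"` and `"delete"`, the dictionary layout of an update, the use of node 0 as the root, and the choice to delete a node's descendants along with it.

```python
class SceneGraph:
    """Toy scene hierarchy keyed by node ID; node 0 is an assumed root."""

    def __init__(self):
        self.params = {0: {}}
        self.children = {0: []}
        self.parent = {}

    def apply(self, update: dict) -> None:
        kind, node = update["type"], update["node_id"]
        if kind == "add":
            parent = update["parent_id"]
            if node in self.params or parent not in self.params:
                raise ValueError("invalid add")
            self.params[node] = dict(update.get("params", {}))
            self.children[node] = []
            self.children[parent].append(node)
            self.parent[node] = parent
        elif kind == "modify":
            self.params[node].update(update["params"])
        elif kind == "delete":
            self._delete(node)
        else:
            raise ValueError(f"unknown update type: {kind}")

    def _delete(self, node: int) -> None:
        for child in list(self.children[node]):
            self._delete(child)            # assumed: subtree goes with the node
        parent = self.parent.pop(node, None)
        if parent is not None:
            self.children[parent].remove(node)
        del self.children[node]
        del self.params[node]


graph = SceneGraph()
graph.apply({"type": "add", "node_id": 1, "parent_id": 0, "params": {"x": 0}})
graph.apply({"type": "add", "node_id": 2, "parent_id": 1})
graph.apply({"type": "modify", "node_id": 1, "params": {"x": 16}})
params_before_delete = dict(graph.params[1])
graph.apply({"type": "delete", "node_id": 1})
```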
The embodiment described below includes an object-based AV bitstream and a terminal architecture. The bitstream design specifies, in a binary format, how AV objects are represented and how they are to be composed. The AV terminal structure specifies how to decode and display the objects in the binary bitstream.
Further to FIG. 1 and with specific reference to FIG. 6, the input to the de-multiplexer 1 is an object-based bitstream such as an MPEG-4 bitstream, consisting of AV objects and their composition information multiplexed into logical channels (LC). The composition of objects in a scene can be specified as a collection of objects with independent composition specification, or as a hierarchical scene graph. The composition and control information is included in LC0. The control information includes control commands for updating scene graphs, resetting decoder buffers, etc. Logical channels 1 and above contain object data. The system includes a controller (or “executive”) 2 which controls the operation of the AV terminal.
In the object cache 7, objects are stored for use beyond their initial presentation time. Such objects remain in the cache even if the associated node is deleted from the scene graph, and are removed only upon the expiration of an associated time interval called the expiration time stamp. This feature can be used in presentations where an object is used repeatedly over a session. The composition associated with such objects can be updated with appropriate update messages. For example, the logo of the broadcasting station can be downloaded at the beginning of the presentation and the same copy can be used for repeated display throughout a session. Subsequent composition updates can change the position of the logo on the display. Objects that are reused beyond their first presentation time may be called persistent objects.
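Object persistence can be illustrated with a small cache keyed by object ID. This is a sketch under assumptions: the class `ObjectCache` and its methods are hypothetical, the expiration time stamp is assumed to be already expressed in system time, and an object is treated as expired once the current time reaches its ETS.

```python
class ObjectCache:
    """Keeps persistent objects until their expiration time stamp (ETS)."""

    def __init__(self):
        self._entries = {}  # object_id -> (data, ets)

    def store(self, object_id: int, data: bytes, ets: int) -> None:
        self._entries[object_id] = (data, ets)

    def expire(self, now: int) -> list:
        """Remove and return the IDs of objects whose ETS has passed."""
        expired = [oid for oid, (_, ets) in self._entries.items() if ets <= now]
        for oid in expired:
            del self._entries[oid]
        return expired

    def get(self, object_id: int, now: int):
        """Return cached data, or None if absent or expired."""
        self.expire(now)
        entry = self._entries.get(object_id)
        return entry[0] if entry else None


cache = ObjectCache()
cache.store(object_id=42, data=b"station logo", ets=100)
early = cache.get(42, now=50)    # still cached, reusable without retransmission
late = cache.get(42, now=100)    # ETS reached, object removed
```

Note that the cache has no link to the scene graph, which mirrors the statement that deleting a node does not by itself evict the cached object.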
The system controller (SC) controls decoding and playback of bitstreams on the AV terminal. At startup, from user interaction or by looking for a session at a default network address, the SC first initializes the DMUX to read from a local storage device or a network port. The control logic is loaded into the program RAM at the time of initialization. The instruction decoder reads the instructions from the program and executes them. Execution may involve reading the data from the input buffers (composition or external data), initializing the object timers, loading or updating the object tables to the data RAM, loading object timers, or control signaling.
FIG. 7 shows the system controller in further detail. The DMUX reads the input bitstream and feeds the composition data on LC0 to the controller. The composition data begins with the description of the first scene in the AV presentation. This scene can be described as a hierarchical collection of objects using compound composition packets, or as a collection of independent object composition packets. A table that associates the elementary streams with the nodes in the scene description immediately follows the scene description. The controller loads the object IDs (stream IDs) into the object list and the render list, which are maintained in the data RAM. The render list contains the list of objects that are to be rendered on the display device. An object that is disabled by user interaction is removed from the render list. A node delete command that is sent via a composition control packet causes the deletion of the corresponding object IDs from the object list. The node hierarchy is also maintained in the data RAM and updated whenever a composition update is received.
The composition decoder reads data from the composition and external data buffer and converts them into a format understood by the instruction decoder. The external input includes user interaction to select objects, to disable and enable objects, and to perform certain predefined operations on the objects. During the execution of the program, two lists are formed in the data RAM: the object list, containing the objects (object IDs) currently handled by the decoders, and the render list, containing the active objects in the scene. These lists are updated dynamically as the composition information is received. For example, if a user chooses to hide an object by passing a command via the external input, the object is removed from the render list until the user specifies otherwise. This is also how external input is handled by the system. Whenever there is some external interaction, the composition decoder reads the external data buffer and performs the requested operation.
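The interplay of the two lists can be sketched as follows. The class `ObjectTables` and its method names are hypothetical. One point is an assumption beyond the text: a node delete is described as removing IDs from the object list, and the sketch also drops them from the render list so that a deleted object is never rendered.

```python
class ObjectTables:
    """Object list (handled by decoders) and render list (active in the scene)."""

    def __init__(self):
        self.object_list = []
        self.render_list = []

    def add_object(self, object_id: int) -> None:
        if object_id not in self.object_list:
            self.object_list.append(object_id)
            self.render_list.append(object_id)

    def disable(self, object_id: int) -> None:
        """User hides an object: it stays decoded but is not rendered."""
        if object_id in self.render_list:
            self.render_list.remove(object_id)

    def enable(self, object_id: int) -> None:
        if object_id in self.object_list and object_id not in self.render_list:
            self.render_list.append(object_id)

    def delete_node(self, object_id: int) -> None:
        """Node delete command: remove the object from both lists (assumed)."""
        if object_id in self.object_list:
            self.object_list.remove(object_id)
        self.disable(object_id)


tables = ObjectTables()
for oid in (1, 2, 3):
    tables.add_object(oid)
tables.disable(2)          # hidden by user interaction
hidden_view = list(tables.render_list)
tables.enable(2)           # shown again
tables.delete_node(1)      # removed by a composition control packet
```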
The SC also maintains timing for each AV object in order to signal the decoders and decoder buffers at the decoding and presentation times. The timing information for an AV object is specified in terms of its time base. The terminal uses the system clock to convert an object's time base into system time. For objects that do not need decoding, only presentation timers are necessary. These timers are loaded with the decoding and presentation timestamps for that AV object. The controller obtains the timestamps from the DMUX for each object. When a decoding timer for an object runs out, the appropriate decoder is signaled to read data from the input buffers and to start the decoding process. When a presentation timer runs out, the decoded data for that object is transferred to the frame buffer for display. A dual buffer approach could be used to allow writing to a frame buffer while the contents of the second buffer are displayed on the monitor. The instruction decoder can also reset the DMUX or input buffers by signaling a reset, which initializes them to the default state.
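The timer behavior can be sketched with a priority queue of pending events. This is an illustration under assumptions, not the controller's design: the conversion from object time base to system time is modeled as a plain offset with equal tick rates, and the names `TimerQueue`, `load`, `advance` and `base_offset` are hypothetical.

```python
import heapq


class TimerQueue:
    """Fires 'decode' and 'present' events when their system time is reached."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal times fire in load order

    def _push(self, when: int, action: str, object_id: int) -> None:
        heapq.heappush(self._heap, (when, self._seq, action, object_id))
        self._seq += 1

    def load(self, object_id: int, dts, pts: int, base_offset: int = 0) -> None:
        """Load timers; dts is None for objects that need no decoding."""
        if dts is not None:
            self._push(base_offset + dts, "decode", object_id)
        self._push(base_offset + pts, "present", object_id)

    def advance(self, now: int) -> list:
        """Return (time, action, object_id) for every timer that has run out."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            when, _, action, object_id = heapq.heappop(self._heap)
            fired.append((when, action, object_id))
        return fired


timers = TimerQueue()
timers.load(object_id=7, dts=100, pts=140, base_offset=1000)
timers.load(object_id=8, dts=None, pts=120, base_offset=1000)  # no decoding needed
first = timers.advance(now=1100)
second = timers.advance(now=1200)
```

In a terminal, a fired "decode" event would correspond to signaling the decoder, and a fired "present" event to transferring decoded data to the frame buffer.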
FIG. 8 shows the flow of information in the controller. To keep the figure simple, the operations performed by the instruction decoder are shown in groups. The three groups respectively concern object property modifications, object timing, and signaling.
These operations manipulate the object IDs, also called elementary stream IDs. When a scene is initially loaded, a scene graph is formed with the object IDs of the objects in the scene. The controller also forms and maintains a list of objects in the scene (the object list) and a list of active objects (the render list). Other operations set and update object properties such as composition parameters when the terminal receives a composition packet.
This group of operations deals with managing object timers for synchronization, presentation, and decoding. An object's timestamp specified in terms of its object time base is converted into system time and the presentation and decoding time of that object are set. These operations also set and reset expiration timestamps for persistent objects.
Signaling operations control the overall operation of the terminal. Various components of the terminal are set, reset and operated by controller signaling. The controller checks the decoding and presentation times of the objects in the render list and signals the decoders and presentation frame buffers accordingly. It also initializes the DMUX for reading from a network or a local storage device. At the instigation of the controller, decoders read the data from the input buffers and pass the decoded data to decoder output buffers. The decoded data is moved to the presentation device when signaled by the controller.
1. A method for processing object-based data at a receiver to generate an arrangement of the data, comprising:
(a) receiving in a data bit stream at least one object and composition information for the object;
(b) locally storing the at least one object;
(c) processing the received composition information to compose an arrangement using the at least one stored object; and
(d) generating the composed arrangement.
2. The method of claim 1, wherein the at least one object includes at least one audio object.
3. The method of claim 1, wherein the at least one object includes at least one audiovisual/video object.
4. The method of claim 1, wherein the at least one object includes at least two objects.
5. The method of claim 4, wherein the composition information is for at least one of the at least two objects.
6. The method of claim 1, further comprising the step of presenting the composed arrangement, wherein presenting the composed arrangement includes playing audio.
7. The method of claim 1, wherein the composition information is relative to a single object.
8. The method of claim 1, wherein the composition information includes information for composing audio objects.
9. The method of claim 8, wherein the composition information includes information for composing audio object within a set of speakers.
10. The method of claim 1, wherein locally storing the at least one object includes storing the at least one object in a memory.
11. The method of claim 10, wherein the memory is selected from the group consisting of a cache memory, decoder buffer, and input buffer.
12. The method of claim 1, further comprising:
removing the at least one audiovisual/video object from local storage prior to its initial presentation time.
13. An apparatus for processing object-based data, comprising:
(a) a receiver circuit configured to receive in a data bit stream at least one object and composition information for the object;
(b) local storage configured to store the at least one object;
(c) a composer circuit configured to process the received composition information to compose an arrangement using the at least one object; and
(d) an arrangement generating device configured to generate the composed arrangement.
14. The apparatus of claim 13, further comprising:
an output device configured to output the composed arrangement.
15. The apparatus of claim 13, wherein the output device is configured to present the arrangement.
16. The apparatus of claim 15, wherein the output device is further configured to present the arrangement by playing audio.
17. The apparatus of claim 13, wherein the composer circuit is configured to process composition information related to an audio object.
18. The apparatus of claim 13, wherein the composer circuit is configured to process composition information including information for composing audio object within a set of speakers.