- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 17 Mar 2005 14:46:45 -0500
- To: Chris Lilley <chris@w3.org>
- Cc: Robin Berjon <robin.berjon@expway.fr>, www-tag@w3.org
Robin Berjon writes:
> Some of the things people were spending time on
> were XML-related. For example UTF-8 to UTF-16
> conversion
I don't think I buy this as a rationale for a binary XML standard. The
line of reasoning I see in the above is:
XML is text, often UTF-8. As an industry we went and cooked up APIs that
pass around all the strings as UTF-16, which to be fair is common on many
platforms. Not surprisingly, there are conversion overheads, and I agree
they are very significant.
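To make the cost concrete, here is a minimal Java sketch (purely
illustrative; the element name is made up) of where the transcoding
happens when UTF-8 octets arrive and a UTF-16 String is built:

    import java.nio.charset.StandardCharsets;

    public class Utf8ToUtf16 {
        public static void main(String[] args) {
            // XML typically arrives on the wire as UTF-8 octets.
            byte[] wire = "<quantity>123</quantity>"
                    .getBytes(StandardCharsets.UTF_8);

            // Building a java.lang.String decodes every octet into UTF-16
            // code units. This transcode-and-copy, repeated for every
            // element and attribute value, is the overhead at issue.
            String decoded = new String(wire, StandardCharsets.UTF_8);

            System.out.println(decoded.length() + " UTF-16 code units");
        }
    }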
Why does this problem justify a binary XML standard? Instead of making
the platform or the API more efficient at dealing with UTF-8, which seems
like a good investment on that platform, we're going to force the whole
industry to accept interchange of a new form of XML? That binary form's
string representations may or may not go into your API with lower
conversion overhead, but I do note that Java in particular uses UTF-16
under the covers, and you can, if you wish, use UTF-16 for XML today.
We've done some work in this area at IBM. I am not at all convinced that
the answer to platforms and APIs that are bad at manipulating UTF-8 is to
define a binary XML. There's a lot you can do to avoid character
conversions if you're careful and your API is suitably designed. Indeed,
it seems to me that things are just dandy in XML for use with platforms
that do UTF-8 efficiently. Will the binary form be faster or slower for
them?
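For instance (a hypothetical sketch, not drawn from our actual code), a
parser whose API exposes the underlying octets can match names against
pre-encoded UTF-8 constants without ever building a UTF-16 String:

    import java.nio.charset.StandardCharsets;

    public class ByteLevelMatch {
        // Encode the names of interest once, at setup time.
        private static final byte[] QUANTITY =
                "quantity".getBytes(StandardCharsets.UTF_8);

        // Compare raw UTF-8 input against the constant; no character
        // conversion takes place on this path.
        static boolean isQuantity(byte[] input, int off, int len) {
            if (len != QUANTITY.length) return false;
            for (int i = 0; i < len; i++) {
                if (input[off + i] != QUANTITY[i]) return false;
            }
            return true;
        }
    }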
> or assigning data types with schema to make a
> PSVI. If a binary format already has the PSVI
> information
I think you need to be very careful heading down this path, depending on
your use case. The term PSVI in particular relates to schema validation.
In many cases the reason you are doing schema validation is because you
don't entirely trust the source of the data. Once you're doing other
aspects of validation to check the data, I would claim (having built such
systems) that type assignment is nearly free in many cases. The same is
true for many deserialization use cases, even where you don't use XML
Schema for validation: if you know you're deserializing a "quantity"
field then the deserializer very often has static knowledge that it's an
int. I don't see why there's overhead for that in the common use cases.
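A hypothetical sketch of what I mean: when the deserializer statically
knows the "quantity" field is an int, the "type assignment" is nothing
more than the conversion it had to do anyway:

    public class QuantityDeserializer {
        // The field's type is known at compile time; no schema machinery
        // runs here. Integer.parseInt is the entire cost of "typing".
        static int readQuantity(String lexical) {
            return Integer.parseInt(lexical.trim());
        }

        public static void main(String[] args) {
            System.out.println(readQuantity("123")); // prints 123
        }
    }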
Maybe what you're hinting is that for an integer you're going to send the
binary "int" and not the character string. If so, then that's not XML in
a deeper sense, and the fact that you know the "PSVI" type is incidental
to the fact that you've moved from characters to abstract numbers. With
the binary "int", you can't distinguish "123" from "00123", and that's a
huge difference. For example, an XML DSIG over the two would be
different. In any case, now you're into sending something closer to a
subset of the XPath 2.0/XQuery data model than optimized XML. An
interesting thing to consider, but it has all sorts of deep implications.
SOAP, in particular, uses infosets. In a SOAP message, "123" is different
from "00123", even if the schema or xsi:type claims you've got an integer.
DSIGs on the two will be different.
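To illustrate (a sketch only; a bare SHA-1 over the octets stands in for
the digest step of a real XML DSIG): the typed values compare equal while
the character forms do not, so the signatures diverge:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class LexicalVsTyped {
        public static void main(String[] args) throws Exception {
            String a = "123", b = "00123";

            // Typed view: the same abstract integer.
            System.out.println(Integer.parseInt(a) == Integer.parseInt(b)); // true

            // Character view: different octet sequences, so any digest
            // computed over them differs. digest() resets after each call.
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] da = md.digest(a.getBytes(StandardCharsets.UTF_8));
            byte[] db = md.digest(b.getBytes(StandardCharsets.UTF_8));
            System.out.println(Arrays.equals(da, db)); // false
        }
    }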
--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Chris Lilley <chris@w3.org>
03/17/05 12:31 PM
Please respond to Chris Lilley
To: Robin Berjon <robin.berjon@expway.fr>
cc: noah_mendelsohn@us.ibm.com, www-tag@w3.org
Subject: Re: Draft minutes of 15 March 2005 Telcon
On Wednesday, March 16, 2005, 7:54:21 PM, Robin wrote:
RB> noah_mendelsohn@us.ibm.com wrote:
>> DO: I thought that one of the interesting presentations at the workshop
>> from Sun analyzed not just message size (and thus network overhead) but
>> also what was happening in the processor.
>> ... A lot of time was spent in the binding frameworks.
>> ... Even if you came along and doubled the network performance by
>> halving the size, you might get only 1/3 of the improvement
RB> Yes, if you're doing a lot of other things that aren't XML, then
RB> speeding up XML won't help. But when you're rendering an SVG document
RB> and the vast majority of your time is spent waiting for the network
RB> and parsing the XML, then you know there's going to be speedup.
Some of the things people were spending time on were XML-related. For
example UTF-8 to UTF-16 conversion (to create a DOM) or assigning data
types with schema to make a PSVI.
If a binary format already has the PSVI information and speeds up the
production of a DOM (or, to put it better, obviates the need to construct
a separate data structure to implement the DOM APIs efficiently), that
would result in a significant speedup.
It might not be measured as x times smaller or x times faster to parse,
but it would show up in transactions-per-second measurements.
--
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
W3C Graphics Activity Lead