Do you want to learn how read binary? (Me either). Stéphane Rodriguez dissects the new Excel 2007 binary format.
23 thoughts on “Reading Binary”
Posting code? Use <pre> tags for VBA and <code> tags for inline.
Do you want to learn how read binary? (Me either). Stéphane Rodriguez dissects the new Excel 2007 binary format.
Posting code? Use <pre> tags for VBA and <code> tags for inline.
Thanks for the link. I am actually writing en entire article at the moment, I have found more underlying .BIN file formats to document. I can’t believe my eyes : they have created BIFF12. Will try to get the article posted somewhere : either on openxmldeveloper.org or on devsites like codeproject.
When you do tell us about it.
Intresting stuff.
Here’s an intresting thing. We (at JMT) where talking about Open Office .org. One of the things that came up here and also at OOO, was the need for VBA support.
One big thing is that with VBA Excel is basically useless to a org. Well if yon didnt need VBA, you use OOO for a faction of the cost. So i think that most of the 2007 file will still be in .xls. But for sure XML is still really useful.
Thanks for the info Stephane (and Dick!)
@Stephane: I suggest you ask Brian Jones before jumping to conclusions
@ross: did you mean to say “without VBA”?
Just taking some very cursory looks at XML over the last couple days. Seems like a lot of work to me. IE. Create the document, then a transform and then a definition – YUK! Of course there is the offchance I am missing something ;-)
Jumping to which conclusions? I have already asked B.J. and his answer is the official one : Office 2007 is about XML, period.
The article I’ll publish will prove that a substantial time was spent on binary file formats rather than XML, and that the impedance mismatch between XML and BIN is clearly showing some worrying foundational problems for the bright “Office 2007 XML” future (like having to guess the default values of just about every XML tag in the document, in essence reconstructing the actual implementation of the software).
Stephane,
Your argument that MS is not fully commited to XML seems a little off. One has only to look at the work that went into creating the full fidelity XML formats (not to mention the fact the work that went into submitting the Open XML schemas to the ECMA) to see that MS has firmly put all of its eggs into the future of XML. Anyone who has worked with the new Open XML file formats for 10 minutes can make a solid arguement that MS has gone past the point of no return on XML. Also, I believe that MS has got its eye on the next frontier, the Internet. An internet-based Office will require a foundation in XML.
Another thing. I don’t necessarily think that you can put an XML construct next to its binary counterpart and pick out holes. To say that a new, binary format puts the future of XML in Office in danger just sound a little nutty.
Good luck getting your full article on the Brian Jone’s blog.
Mike, your post reads like a marketing brochure. May I go back to reality?
The XML route : it seems you are also victim of that misleading XML naming thing. The Microsoft Office team is putting its weight on XML as a data source, it serves the L.O.B. model and as a foundation to their integrated client-server suite agenda. In other words, business, business, business. Whether the .docx/… files contain some whatever XML inside is not only a mischaracterization of what it is (it’s a lot more complex to work with than the XML in Office 2003), it is irrelevant and it does not matter. It is not possible to use it as is, except for the trivial scenarios that have already been shown over at B.J.’s blog (such as replacing an image, big deal!).
Full fidelity : nope. From a 10 mile view, it’s true. From a hardcore 10-centimer-to-the-metal view, it’s not true. A number of options have been removed. And there is already ample comments about that (read the comments in the Excel 12 blog).
My comment on the impedance mismatch between XML and BIN shall not be underestimated. It’s not that you should care about BIN, nor that BIN will take over XML (I said that?) it’s that the same impedance mismatch is prevalent in the serialized format’s missing tags (default values). But I’ll get more to that in the article, I guess.
What I said is that a lot, a big fat lot, of work went into working the binary parts. That’s a big surprise for me, since I read B.J.’s blog, and according to me it’s XML and only XML.
As for getting the article in B.J.’s blog, yes I don’t hope that it will be there. But there is hope that it gets over at openxmldeveloper.org. Will see. ;-)
I must admit that your post is a bit nebulus. Still not sure what you’re trying to say.
Are you saying that MS should not spend time on anything but XML?
As for full fidelity…it’s damned close to it. I create an Excel file with every functionality I can think of, and it’s there. I can open the Excel package and I can see XML part after XML part representing all of the components that make up my file. I don’t think it’s a “naming thing” or a marketing ploy. I can open the parts in my Excel file and….yep…I see XML.
If you’re saying that there are still holes in the Open XML file formats, then OK. I’m with you on that one. But I look at this first attempt at an XML-based Office as a stepping stone to better versions (Web-enabled Office maybe?).
BTW, here’s an excellent post by Simon St. Laurent that details the limations of XML functionality in Office. Although some of these issues are cleared up with Office 2007, many are still present. http://www.oreillynet.com/xml/blog/2005/03/holes_in_microsoft_office_xml.html
Oh and I know I spelled nebulous wrong in my last post. If you use a big fancy word, you should probably spell it right!
>>I must admit that your post is a bit nebulus. Still not sure what you’re trying to say.
>>Are you saying that MS should not spend time on anything but XML?
I think it’s obvious. I am talking from the trenches, not just repeating what’s said in the Office team blogs (a lot of which is marketing, that’s understandable though).
MS do what they want. Case in point, I think that when they say XML, it’s not the XML you think it is. So let me break this down for you :
– XML as a data source : see their blog posts about “Custom XML”. It’s a very convenient and efficient way to bind an arbitrary data source to a template document. This rocks. L.O.B. will fly with this! :-) By the way, it’s an Office 2003 feature (some would say it’s THE Office 2003 feature…)
– XML as a serialized file format : totally irrelevant for 99.9999% of Office developers. It’s a lot more complex than the theory says. Just take a look at the code snippets they have posted, you don’t have to take my words. In my blog post the other day, I have been a bit further and found something that is putting even more concern to this.
I hope this clarifies a bit for now.
>>I can open the Excel package and I can see XML part after XML part representing all of the components that make up my file. I don’t think it’s a “naming thing” or a marketing ploy.
Sure, you can open those files by hand. That’s fun, isn’t it? Now, try to do some REAL processing with source code. One simple example should get an idea of the mountain to climb : take one of those 2 lines of automation code (I’m sure you have tons of that), and replace it with code that does direct changes to the underlying part. Not only that, the changes should be consistent and coherent, just like when you are execution automation code, i.e. if you change the value of a cell, all objects referencing the cell should be resynched (formulas and other objects).
In short, the automation object model exposes…objects. You can deal with objects and be done with very short and concise code, and never have to deal with anything else than those objects, for instance chart properties and so on.
Now, just compare this to the XML parts (this name is actually somewhat misleading, this is however easy to understand). What is exposed? guts. Only guts. Good luck!
Ahh, but there are some very cool things you can do with the XML parts that prove to be much easier than going through the Excel Object Model. A little XML knowledge goes a long way with the Open XML files. I have actually played with automating the process of opening an Excel file, changing a part and zipping the file. It’s admittedly not as clean as I would like it to be, but it’s very doable.
>if you change the value of a cell, all objects referencing the cell should be resynched
(formulas and other objects)
This is where Sharedstrings and Sharedformula XML parts shine. Real content/data are stored here and linked throughout the spreadsheet using indexes. This means that content changes are applied automatically. That is to say the built-in indexing and relationahip mechanisms ensure that your spreadsheet stays synchronized.
Stephane –
You’re quite right when you discuss using the object model to modify workbooks. I’m sure for the vast majority of my manipulation of workbooks/documents/presentations in Office 2007, I will rely on the object model and VBA. It’s what I know, it’s not usually too tediously slow, and it can do 99% of what I need.
The point is, in principle the XML parts enable much functionality with the ease of editing text files. I say “in principle” because there are not many tools yet available for manipulation of Office 2007 XML files, and I’m not yet really too fluent in it. I’ll probably use it at first on a file by file basis to hack off a part I don’t want or drag in some parts from another file. I’ll have to use it for UI customization; that’s my biggest gripe about Office 2007, and XML is the way past it.
@Mike : assume I have a “str” in cell B2, and a formula in B3=B2&B2. Now let’s say I want to change the value of B2. I can crack-open the file with code, traverse the row/c/v XML tree, find that the string is stored in the shared strings part with some index (note that strings can also be stored inline in the general case, according to TC45 specs), and then I go open the shared strings part, do some prone error iteration to traverse the list of strings, and finally get to the string being referred to where I make the change. Ok, fine. Now what about B3? B3 is out-of-sync.
If you are using automation code however, B3 will get automatically updated for you.
If you are doing direct access to the “guts”, you are on your own. This is what I mean. I am pretty sure creative minds will find some useful scenarios other than make a trivial image file change (which is the scenario being often referred to in B.J.’s blog) but you know for the most part I don’t think exposing the guts is game changing for most Office developers.
@Jon : I agree. a couple major third-parties that currently provide read/write/calc to BIFF8, making it currently an alternative to Excel’s automation on the server, will soon provide read/write/calc for the new .xlsx/.xlsm/.xlsb file formats and life will be good. But to expect Office developers themselves to work with the guts is…an interesting statement IMHO. This statement would imply each and every Office developer basically figures out and reimplements a fraction of the Excel run-time. Ouch!
Stephane: Interesting discussion.
I disagree that the xml guts aren’t going to be useful to Office devs. I for one am currently working on a project in which exactly that format would have been extremely useful. Beats the hell out of parsing the BIFF files for the info I need to extract/change.
And no, unfortunately an NDA prevents me from giving more details than this.
I suppose in the end, it will be up to the developer to decide whether to use the Excel Object Model or go hand-to-hand combat with the targeted XML part. I for one am grateful for the choice. I would encourage all Office developers to look past the complexity of XML and think about the possiblities that this new file format presents.
FWIW, based on the my back and forth communications (via the blogs by Kevin Boske/Brian Jones). Microsoft’s position seems to be that if you’re working in Excel, you should use the Excel Object Model…period. Open XML (in their eyes) is a interoperability tool that allows programmers on other platforms to create and manipulate Excel Spreadsheets. I suspect that this is why the really useful Packaging API is not exposed to COM. I don’t think they get it. Bummer.
Ron de bruin has written some code samples to use WinXP’s ‘zip’ feature:
http://www.rondebruin.nl/windowsxpzip.htm
>>I disagree that the xml guts aren’t going to be useful to Office devs. I for one am currently working on a project in which exactly that format would have been extremely useful.
As I said, creative minds will find interesting scenarios, but I have not a lot of hope otherwise. Office developers solve business scenarios, they don’t care about the plumbing.
>> Beats the hell out of parsing the BIFF files for the info I need to extract/change.
Oh, you don’t have to, never ever parse BIFF. The guts compete with Excel’s automation (which happens to provide you the run-time along with it, i.e. everything stays coherent and consistant). You’ll never have to work with the guts directly between third parties (or possibly Microsoft own tools, who knows?) will provide you the library to abstract this away. By abstracting it, I don’t mean reducing freedom, I mean exposing objects to work with. Just like today’s third parties to read/write/calc Excel files without using Excel.
Sorry if this thread is a big confusing with the blog post itself above where Dick was linking to my BIFF12 draft document.
>> I suspect that this is why the really useful Packaging API is not exposed to COM.
Working on it. Sshhhhhh!!
I have finally managed to post an article about my journey in Office 2007 .bin file formats. You get a lot of undocumented pieces, dirty secrets, plus a BIFF12 reader which can read any cell of an arbitrary Excel 2007 binary workbook.
Stephane,
Excellent work and many thanks for sharing it with us :)
Kind regards,
Dennis
Is it possible to do zipping/unzipping files from inside VBA?
I am using MS Excel 2003 SP2 on MS windows XP Prof.
Searching the VBA Help did not get me too far.
Any ideas?
Thanks and Regards,
Ilan.
Hi to all here!
Does anybody of you know where to get the description about OOXML, or is Microsoft not giving it away?
Because I want to work directly with the ExcelPackage, but then I would have to know the different tag names and where to set values.
Can anybody help me with this?
Greetings
Roman