Bugzilla – Bug 402
Bytecode Format Enhancements Needed
Last modified: 2004-07-25 16:23:40
This bug captures the enhancements to the bytecode format that we're planning for 1.3, so that we cover as many changes as possible in release 1.3 and do not disrupt users again in 1.4.

Encode Types As 24-bit Quantities
=================================
We need a new primitive, uint24_vbr, to encode types. Because of the use of bit fields for global variables and elsewhere, types are currently not fully 32-bit quantities (see bug 392). The recommended plan is to always encode types into 24-bit fields, but provide for extension by using the value (2^24)-1 as an indicator that what follows is a uint64_vbr that contains the type. It is unlikely that many, if any, bytecode files will need more than 16 million distinct types.

VBRize the Block Headers
========================
While block headers are only 8 bytes currently, in very small files (say, containing a few types), their overhead becomes quite large. We can skip the alignment of these fields (possibly saving a few bytes) and pack both the block type and block length into a single uint32_vbr. This will provide 8 bits for the block type (it's doubtful that we'll ever need more than 256 block types) and 24 bits for the block length. Similarly, it's doubtful that we'd ever need a single block longer than 16 MB.

CDRize Binary Data Content
==========================
We should use a standard for representing various binary quantities in the bytecode file. Integers are pretty much handled by VBR. However, float and double types should be regularized to IEEE format and written according to a cross-platform standard such as CDR (CORBA), NDR (DCE RPC), or XDR (Sun RPC). CDR is the most modern but has its shortcomings. There might be other applicable standards too. Strings should be regularized to a standard format as well.
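The type-encoding proposal above can be sketched roughly as follows. This is only an illustration, not the bytecode writer's actual code: the function names and the `kTypeEscape` constant are hypothetical, and the VBR layout (7 payload bits per byte, low-order group first, high bit meaning "more bytes follow") is an assumption about how the VBR primitives work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant: (2^24)-1 signals that an extended type ID follows.
constexpr std::uint64_t kTypeEscape = (1u << 24) - 1;

// Append v as a VBR integer (assumed layout: 7 payload bits per byte,
// low-order group first, high bit set on every byte except the last).
void writeVBR(std::vector<std::uint8_t> &out, std::uint64_t v) {
  while (v > 0x7F) {
    out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x7F)));
    v >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(v));
}

// Read one VBR integer starting at pos and advance pos past it.
std::uint64_t readVBR(const std::vector<std::uint8_t> &in, std::size_t &pos) {
  std::uint64_t v = 0;
  for (unsigned shift = 0; pos < in.size() && shift < 64; shift += 7) {
    std::uint8_t b = in[pos++];
    v |= static_cast<std::uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0)
      break; // last byte of this value
  }
  return v;
}

// Type IDs below the escape value are written directly; anything else
// (including the escape value itself) is written as escape + full value.
void writeTypeID(std::vector<std::uint8_t> &out, std::uint64_t id) {
  if (id < kTypeEscape) {
    writeVBR(out, id);
  } else {
    writeVBR(out, kTypeEscape);
    writeVBR(out, id);
  }
}

std::uint64_t readTypeID(const std::vector<std::uint8_t> &in, std::size_t &pos) {
  std::uint64_t v = readVBR(in, pos);
  return v == kTypeEscape ? readVBR(in, pos) : v;
}
```

With this scheme small type IDs still cost a single byte, and only files with more than about 16 million types pay for the escape prefix.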
I'm not sure that I understand the final item. We already standardize on IEEE floating point, in little endian mode. Things may get sticky when and if we ever support a target that uses non-IEEE floating point, but this may never happen. Also, I don't understand your point about strings: LLVM has no support for strings (and doesn't want it). -Chris
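For reference, writing an IEEE double in little-endian byte order, as described in the comment above, might look like the sketch below. The helper names are hypothetical, and the code assumes the host `double` is IEEE 754 binary64 with the same byte order as the host's integers; it is not taken from the actual bytecode writer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "sketch assumes the host double is IEEE 754 binary64");

// Hypothetical helper: append d as 8 bytes, least significant byte first.
void writeDoubleLE(std::vector<std::uint8_t> &out, double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits); // copy the bit pattern into an integer
  for (int i = 0; i < 8; ++i)
    out.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
}

// Hypothetical helper: rebuild a double from 8 little-endian bytes at pos.
double readDoubleLE(const std::vector<std::uint8_t> &in, std::size_t pos) {
  std::uint64_t bits = 0;
  for (int i = 0; i < 8; ++i)
    bits |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}
```

Because the bytes are produced by shifting an integer rather than by dumping memory, the file layout does not depend on the byte order of the machine doing the writing.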
My point in suggesting CDR was exactly that: the "what ifs" of the future. If we were to just say "we encode with CDR" then that settles all questions for now and the future. I would much rather say "we encode using CDR rules" than provide a long list describing how we encode each fundamental data type. I think the users would appreciate it too. The only question is whether CDR is the right choice for a rule set. It's not particularly compressed. As for strings, we most certainly do have them and we encode them little endian. Symbol tables have strings and we handle the global string constants very specially in the bytecode format.
Okay, let me restate this. If we end up supporting other FP formats, a LOT of other stuff will have to change as well. I would much rather make this change lazily, rather than build in something up-front that we don't have any experience with and we don't know that we need. w.r.t. strings, we most certainly do not have them. :) What we have are arrays of bytes, not strings. They happen to commonly be used as strings by certain front-ends (like the C front-end), but they aren't special in any way. Likewise with SymbolTable entries, they are just arrays of bytes, not "strings". -Chris
Some more rebuttal: Choosing CDR as a format says nothing about how it gets implemented. I agree that we should do things incrementally. Right now, the only part of CDR that we'd implement is the way IEEE floats and doubles are encoded. If we end up supporting a platform that doesn't have IEEE fp then we'd have to deal with that at the time, but the end result would be to encode it as IEEE fp using CDR. Making the decision to use CDR gives some stability to our specification of bytecode files without saying anything about how we go about implementing it. As for experience with it, I have plenty. I was a CORBA architect for AT&T Wireless for two years. Hence, my slant towards CDR. If you put aside the implementation aspect of this change, what do you have against CDR? What other alternatives do you suggest? Nothing? A one-off, piecemeal implementation that isn't compatible with anything else out there? Why do we need to re-invent this wheel? As for strings, I think we're into parsing semantics here. I was suggesting string in the classical notion as an ordered list of characters, not suggesting that we store std::string. In that sense the "array of char" is the string I'm talking about. Would it be okay with you if I stored strings in the bytecode with all the even indexed characters first and then all the odd indexed ones? I think not. Little-endian "strings" are what we store, what is natural, and what is common in various specs like CDR. Let's stop having this silly discussion.
Disclaimer (should have said this before): I know nothing about CDR.

> As for experience with it, I have plenty.

I'm sorry, I didn't mean experience with CDR, I meant experience with the future problems we will run into with the .bc file format. :)

> What other alternatives do you suggest? Nothing?

I suggest we stay with what we have until there is a reason to change it.

> In that sense the "array of char" is the string I'm talking about.

Okay. I just want to make sure that it is absolutely clear that what we are storing in the .bc files is not sufficient for, say, Java or MSIL strings. Also, it is not even true that we are storing C strings either. We are literally special casing random arrays of bytes, that's all. They just happen to be commonly used as strings.

In any case, I don't think there is much point in continuing this discussion in this bug!

-Chris
Agreed. I'll drop the CDR discussion for now but should we start extending to esoteric platforms or getting interest from users about using common data representations, I will once again become a strong advocate of CDR :) No further comments on the string or CDR topics needed. However, there are possibly other small incremental bc enhancements that could be made before 1.3. We should document those here.
> I'll drop the CDR discussion for now but should we start extending to esoteric
> platforms or getting interest from users about using common data
> representations, I will once again become a strong advocate of CDR :)

Please do!

> However, there are possibly other small incremental bc enhancements that could
> be made before 1.3. We should document those here.

Sounds great. Bug 263 is one of them :)

-Chris
Mine
Types have been made 24-bit quantities that overflow to 32-bit if necessary. Block headers could not be made vbr because the size field has to be constant size in order for the fixup logic to work. However, it has been reduced 50% to a single 32-bit quantity. CDRizing various types won't be done until necessary so this bug is done.
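The fixup constraint mentioned above can be illustrated with a small sketch. The header is written before the block body, when the length is not yet known, so the writer reserves a fixed-size slot and patches it afterwards; a VBR field would change size depending on the final length and shift the body. All names here are hypothetical. The 8-bit type / 24-bit length split is borrowed from the original proposal in this bug, and both the split and the bit positions actually committed may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pack an 8-bit block type and a 24-bit length into one 32-bit word.
// Assumption: type in the high byte, length in the low 24 bits.
std::uint32_t packHeader(std::uint8_t type, std::uint32_t length) {
  if (length > 0xFFFFFF)
    throw std::length_error("block body exceeds 24-bit length field");
  return (static_cast<std::uint32_t>(type) << 24) | length;
}

// Reserve a fixed 4-byte header slot and return its offset.
std::size_t beginBlock(std::vector<std::uint8_t> &out) {
  std::size_t headerAt = out.size();
  out.resize(headerAt + 4); // zero-filled placeholder, patched in endBlock
  return headerAt;
}

// Patch the reserved slot now that the body length is known.
// The word is stored little endian.
void endBlock(std::vector<std::uint8_t> &out, std::size_t headerAt,
              std::uint8_t type) {
  std::uint32_t length = static_cast<std::uint32_t>(out.size() - headerAt - 4);
  std::uint32_t word = packHeader(type, length);
  for (int i = 0; i < 4; ++i)
    out[headerAt + i] = static_cast<std::uint8_t>(word >> (8 * i));
}
```

A writer would call `beginBlock`, emit the body, then call `endBlock` with the offset it got back; nothing after the header moves when the length is filled in.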