File format suggestions

quote:
Originally posted by CookieRevised
File Chunks : Unlimited packs of data which each have 7 parts:

Name (limited to 255 bytes) - Chunk name, should be a unique identifier, not forced tho.
Comments: Make the Name shorter. 255 bytes is too much of a waste of space. Make it like the "File Type", so 5 bytes. That's more than enough to make all kinds of different chunks... Also, make it forced, I mean it should be required IMO (to be consistent with the overall global file format you're creating).

Version (1 byte) - Same as "File Version". Adds the same advantages but then for individual chunk-types...
Comments: Concerning the chunk checksum (see below): You could reserve bit 8 to indicate whether there is a checksum or not. In this way the checksum could be optional.

Comments Length (1 byte) - To identify the length of the comment, otherwise you won't know where the comment begins or ends. Unless you will always use 255 bytes. But for short comments this is a waste of space.

Comments (limited to 255 bytes) - Like the file comments, additional information for this chunk.

I agree with Cookie on his format, however I'd improve what is in the quote. Name and comments are human-readable strings, I mean: they have no weird symbols, only characters, so you can use null-terminated strings. The size required would be the same (because you add a null character at the end, but you no longer need the byte identifying the length) and this way you can have strings of any size. You may want to have a 500-byte comment on the chunk. In any case, the recommendation of making the name and comments as short as possible still applies.
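To show what I mean, here is a minimal sketch in Python of writing and reading null-terminated strings. The function names and the latin-1 encoding are just my assumptions for the example, they are not part of Cookie's format.

```python
import io

def write_cstring(stream, text):
    """Write text as single-byte characters followed by one null terminator."""
    data = text.encode("latin-1")  # assumption: an ANSI-like one-byte encoding
    if b"\x00" in data:
        raise ValueError("text must not contain a null character")
    stream.write(data + b"\x00")

def read_cstring(stream):
    """Read bytes up to (not including) the next null byte."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("stream ended before the null terminator")
        if b == b"\x00":
            return out.decode("latin-1")
        out += b

# A short name and a 500-character comment, which a 1-byte length could not describe.
buf = io.BytesIO()
write_cstring(buf, "HEADR")
write_cstring(buf, "x" * 500)
buf.seek(0)
name = read_cstring(buf)
comment = read_cstring(buf)
```

Each string costs its own length plus one byte for the terminator, the same overhead as a 1-byte length prefix, but without the 255-byte limit.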

A 2nd improvement would be the (optional) use of Unicode strings in name and comments. As you know, Unicode strings very often contain null bytes, so you'd need something to distinguish between the null bytes inside the characters of a Unicode string and the null byte in an ANSI string that means end-of-string.

That can be done by putting the bytes 255 and 254 at the beginning of the string. If those bytes are there, the string is Unicode (little endian) and ends when you find 2 null bytes (i.e. 1 null Unicode character). If the string begins with the bytes 254 and 255 (note the order), that means the string is Unicode but big endian. In any other case, the string is ANSI.
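A rough sketch of that detection in Python, assuming the Unicode strings are UTF-16 and the ANSI ones can be read as latin-1 (the function name and both codecs are my assumptions). Note that the Unicode branch steps 2 bytes at a time, so a null byte that is half of a character is not mistaken for the terminator.

```python
def decode_string(data, pos=0):
    """Return (text, next_pos) for the string that starts at data[pos]."""
    marker = data[pos:pos + 2]
    if marker in (b"\xff\xfe", b"\xfe\xff"):
        # 255,254 -> little endian; 254,255 -> big endian
        codec = "utf-16-le" if marker == b"\xff\xfe" else "utf-16-be"
        i = pos + 2
        # Walk in 2-byte units until one null Unicode character is found.
        while data[i:i + 2] != b"\x00\x00":
            if i + 2 > len(data):
                raise ValueError("unterminated Unicode string")
            i += 2
        return data[pos + 2:i].decode(codec), i + 2
    # No marker: ANSI string ending at the first null byte.
    end = data.find(b"\x00", pos)
    if end < 0:
        raise ValueError("unterminated ANSI string")
    return data[pos:end].decode("latin-1"), end + 1

# Three strings back to back: ANSI, little-endian Unicode, big-endian Unicode.
blob = (
    b"Name\x00"
    + b"\xff\xfe" + "caf\u00e9".encode("utf-16-le") + b"\x00\x00"
    + b"\xfe\xff" + "hi".encode("utf-16-be") + b"\x00\x00"
)
first, pos = decode_string(blob)
second, pos = decode_string(blob, pos)
third, pos = decode_string(blob, pos)
```

One limitation of the scheme itself: an ANSI string that really starts with the bytes 255 and 254 would be read as Unicode, so a writer would have to avoid producing one.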

quote:
Originally posted by CookieRevised
Checksum (x bytes; depends on what kind of checksum you use) - You could add a checksum to the chunk to make it possible to verify the integrity of your data.
Comments: But that would imply reading/saving/checking the data, which could mean slow processing. On the other hand, you can create your own type of checksum (only take the hash of byte 10 thru byte 100 or something). This has some advantages: since it only checks some bytes and not all, the speed wouldn't be as slow as if you would check the whole chunk data. And people who wanna "hack" your file format will have a hard time doing it, because they don't know how the checksum is calculated.

This should be optional, and whether it is present should be indicated by (for example) one bit in the version byte.
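For example, something like this in Python. I'm assuming the highest bit of the version byte is the flag (what Cookie calls bit 8), which leaves 7 bits, so versions 0 to 127. The names are made up for the example.

```python
CHECKSUM_FLAG = 0x80  # assumption: highest bit of the version byte is the flag
VERSION_MASK = 0x7F   # the remaining 7 bits hold the version number

def pack_version(version, has_checksum):
    """Combine a 7-bit version number and the checksum flag into one byte value."""
    if not 0 <= version <= VERSION_MASK:
        raise ValueError("version must fit in 7 bits (0-127)")
    return version | (CHECKSUM_FLAG if has_checksum else 0)

def unpack_version(byte):
    """Split a version byte into (version, has_checksum)."""
    return byte & VERSION_MASK, bool(byte & CHECKSUM_FLAG)

packed = pack_version(3, True)
version, has_checksum = unpack_version(packed)
```

A reader then only looks for the checksum field when the flag is set.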

In case of using the checksum, it should be applied to the whole chunk. Today there are algorithms that compute a CRC32 or an MD5 very quickly (for example, I once made a program in VB that takes a file and calculates its CRC32. It worked at about 600 Kb/s, and that's a good speed, because VB doesn't have the needed support for some kinds of operations and this slows it down. I'm sure that the same program (well) done in C would go at 2 Mb/s or faster).
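As an illustration of a checksum over the whole chunk, here is a small Python sketch using the standard zlib.crc32. Storing the CRC as 4 little-endian bytes at the end of the chunk is only my assumption for the example, the format above doesn't say where or how it is stored.

```python
import struct
import zlib

def append_crc32(chunk):
    """Return the chunk bytes followed by their CRC32 (4 bytes, little endian)."""
    crc = zlib.crc32(chunk) & 0xFFFFFFFF
    return chunk + struct.pack("<I", crc)

def verify_crc32(stored):
    """Check that the last 4 bytes match the CRC32 of everything before them."""
    if len(stored) < 4:
        raise ValueError("too short to contain a CRC32")
    body, tail = stored[:-4], stored[-4:]
    (expected,) = struct.unpack("<I", tail)
    return (zlib.crc32(body) & 0xFFFFFFFF) == expected

stored = append_crc32(b"chunk data goes here")
ok = verify_crc32(stored)

# Changing a single byte anywhere in the chunk makes the check fail.
damaged = b"C" + stored[1:]
still_ok = verify_crc32(damaged)
```

Because every byte goes into the CRC, damage anywhere in the chunk is caught, which a hash of only bytes 10 thru 100 would miss.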
