Shoutbox

File format suggestions

File format suggestions by Millenium_edition on 06-11-2004 at 02:24 PM

I'm developing a general file format for all my applications, to be used by every one of them. I was just wondering: what does it need? I mean features, like the ones in the list below.

What I have now:

  • File Type (fixed-length 5-byte string) - It's a kind of ID. Since different applications use this same format, there must be a way to recognize which file is yours and which isn't. Its length is fixed. E.g.: I have an application, a GIF editor; my ID will be "GIFED" or "EDITG", or anything similar.
  • File Author (limited to 255 bytes) - The person/company who created it. It can also be the name of the program which created it. Eg. "Millenium", "Microsoft Corp." or "Mozilla Firefox"
  • File Comments (limited to 255 bytes) - For storing general information about the file. Eg: I have an IM client. I want files for each account. The file author section will be the name of my application, and the comments section will cover all information about my account: e-mail, name, age, etc.
  • File Chunks: an unlimited number of packs of data (the number of chunks is unlimited, not the data inside a chunk), each of which has 3 parts:
    • Name (limited to 255 bytes) - Chunk name, should be a unique identifier, not forced tho.
    • Comments (limited to 255 bytes) - Like the file comments, additional information for this chunk.
    • Data (limited to 16777215 bytes, or &HFFFFFF, which is approx. 16 MB) - The actual content of the chunk; it can be a configuration file, anything, really.
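
A minimal Python sketch of the layout listed above, for illustration only. It assumes that every string is stored as one length byte followed by the text and that the 3-byte data length is little-endian; neither detail is stated in the list. The names pstr and pack_file are made up for this example.

```python
def pstr(text: str) -> bytes:
    """Length-prefixed string: one length byte, then up to 255 bytes of text."""
    raw = text.encode("latin-1")
    if len(raw) > 255:
        raise ValueError("string longer than 255 bytes")
    return bytes([len(raw)]) + raw


def pack_file(file_type: str, author: str, comments: str, chunks) -> bytes:
    """Serialize a file; chunks is an iterable of (name, comments, data)."""
    tag = file_type.encode("ascii")
    if len(tag) != 5:
        raise ValueError("file type must be exactly 5 bytes")
    out = bytearray(tag + pstr(author) + pstr(comments))
    for name, note, data in chunks:
        if len(data) > 0xFFFFFF:
            raise ValueError("chunk data exceeds 16777215 bytes")
        out += pstr(name) + pstr(note)
        out += len(data).to_bytes(3, "little")  # byte order is an assumption
        out += data
    return bytes(out)


blob = pack_file("GIFED", "Millenium", "", [("cfg", "", b"abc")])
```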

So what do you people think of it? What should I add more?
RE: File format suggestions by Concord Dawn on 06-11-2004 at 02:50 PM

Yes, icons. Programs and file formats usually have their own icons so that you can tell what is what; a .dll and a .bmp, for example, have different icons.


RE: File format suggestions by Millenium_edition on 06-11-2004 at 03:02 PM

don't take it personally, but do you even know what i'm talking about? i'm talking about the STRUCTURE of the file, this is for developers :S i don't care about icons :S


RE: File format suggestions by CookieRevised on 06-11-2004 at 11:51 PM

With the approach you're taking (chunks) you can include anything you want, so that seems OK...

Though, "File Author" and "Comments" can also be a chunk, one which is "required". Call it the "FileInfo" chunk or something :D Of course, you can decide not to put them in a special chunk and leave them in the main format part. But don't forget that the comments field needs a length identifier. Otherwise you'll waste valuable space on short comments:
[actual comment] = "my new file format" + 237-unused bytes = 255 bytes!
compared to:
[length-identifier][actual comment] = &H12 + "my new file format" = 19 bytes
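
The same comparison as a small Python sketch. Padding the fixed-width field with null bytes is an assumption; the post only says the remaining bytes are unused.

```python
comment = "my new file format".encode("ascii")

fixed = comment.ljust(255, b"\x00")          # fixed-width field, always 255 bytes
prefixed = bytes([len(comment)]) + comment   # one length byte (&H12) + the text

print(len(fixed), len(prefixed))  # 255 19
```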

Chunks need a small structure too. Otherwise you couldn't identify what kind of chunk you have. And you can add checksums or something as well if you like...

Also, the "File Type" should have a version number if you want to make it perfect. In the future you could improve a certain application and decide to use better compression or something (I dunno). The "File Type" will be the same, but the file's data will be different, so add a version number to it and that's solved too (example: GIF87a and GIF89a; both are GIF, but they have different file structures).

So:

  • File Type (fixed-length 5-byte string) - It's a kind of ID. Since different applications use this same format, there must be a way to recognize which file is yours and which isn't. Its length is fixed. E.g.: I have an application, a GIF editor; my ID will be "GIFED" or "EDITG", or anything similar.
  • File Type Version (1 byte) - one byte can identify up to 256 different versions (0 to 255).
    Comments: This can be used to identify what kind of chunks the file uses. For example, in a future version you might want to enhance your chunk layout. The "File Type" would be the same, but the actual chunk layout is different, so this "version" identifier can be checked to know whether the "file type" is the old version or the new enhanced version...
  • File Chunks : Unlimited packs of data which each have 7 parts:
    • Name (limited to 255 bytes) - Chunk name, should be a unique identifier, not forced tho.
      Comments: Make the Name shorter. 255 bytes is too much of a waste of space. Make it like the "File Type", so 5 bytes. That's more than enough to make all kinds of different chunks... Also, make it forced; I mean it should be required IMO (to be consistent with the overall global file format you're creating).
    • Version (1 byte) - Same as "File Version". Adds the same advantages but then for individual chunk-types...
      Comments: For example, future types of the chunk called "graph" can be compressed. So then you have an uncompressed "graph" version 1, and a compressed "graph" version 2 or something like that. Also, concerning the chunk checksum (see below): you could reserve bit 8 in this version field to indicate whether there is a checksum or not. This way the checksum could be optional.
    • Comments Length (1 byte) - To identify the length of the comment; otherwise you won't know where the comment begins or ends, unless you always use 255 bytes. But for short comments that is a waste of space.
    • Comments (limited to 255 bytes) - Like the file comments, additional information for this chunk.
    • Data Length (unsigned double word, =4 bytes) - The length of the actual chunk-data. 4 bytes means a maximum length of 4GB!
      Comments: This is needed as you have no other means of telling where the chunk starts and where it ends. You can use fewer bytes to store the length, but that of course means the maximum chunk data length will be lower. Reading a double word is much easier than reading 3 bytes, for example. You could choose to limit it to 2 bytes (the max chunk data length is then 65535 bytes, about 64 KB), but that would limit what you can have as chunk data. So 4 bytes, an unsigned double word, is the best choice (and, btw, also used in many existing file types)...
    • Data (limited to 16777215 bytes, or &HFFFFFF, which is approx. 16 MB) - The actual content of the chunk; it can be a configuration file, anything, really.
      Comments: 16777215 bytes is about 16 MB. But anyway, since I would use 4 bytes for the chunk's "data length" identifier, the limit would be about 4 GB...
    • Checksum (x bytes; depends on what kind of checksum you use) - You could add a checksum to the chunk to make it possible to verify the integrity of your data.
      Comments: But that would imply reading/saving/checking the data, which could mean slow processing. On the other hand, you can create your own type of checksum (only take the hash of byte 10 through byte 100 or something). This has some advantages: since it only checks some bytes and not all, it wouldn't be as slow as checking the whole chunk data. And people who wanna "hack" your file format will have a hard time doing it, because they don't know how the checksum is calculated.
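
A rough Python sketch of the 7-part chunk described above, not a reference implementation. Several details are assumptions made for the example: "bit 8" is read as the top bit (&H80) of the version byte, integers are little-endian, and the checksum is a CRC32 over everything in the chunk that precedes it. The names pack_chunk and unpack_chunk are hypothetical.

```python
import struct
import zlib

CHECKSUM_FLAG = 0x80  # assumed meaning of "bit 8" in the version byte


def pack_chunk(name: bytes, version: int, comment: bytes, data: bytes,
               with_checksum: bool = False) -> bytes:
    """Build name + version + comment length + comment + data length + data."""
    if len(name) != 5 or not 0 <= version < 0x80 or len(comment) > 255:
        raise ValueError("bad name, version or comment length")
    ver = version | (CHECKSUM_FLAG if with_checksum else 0)
    body = name + bytes([ver, len(comment)]) + comment
    body += struct.pack("<I", len(data)) + data
    if with_checksum:
        body += struct.pack("<I", zlib.crc32(body))  # CRC32 is an assumption
    return body


def unpack_chunk(buf: bytes, offset: int = 0):
    """Return (name, version, comment, data, offset of the next chunk)."""
    name = buf[offset:offset + 5]
    ver, comment_len = buf[offset + 5], buf[offset + 6]
    pos = offset + 7
    comment = buf[pos:pos + comment_len]
    pos += comment_len
    (data_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    data = buf[pos:pos + data_len]
    pos += data_len
    if ver & CHECKSUM_FLAG:
        (stored,) = struct.unpack_from("<I", buf, pos)
        if stored != zlib.crc32(buf[offset:pos]):
            raise ValueError("checksum mismatch")
        pos += 4
    return name, ver & 0x7F, comment, data, pos
```

Because the returned offset points at the next chunk, a reader can call unpack_chunk in a loop until it reaches the end of the buffer.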

A special (and required?) chunk would be the "FileInfo" chunk, which would hold:
  • Name: FINFO
  • Version: &H01
  • Comments Length: &H11
  • Comments: "General File Info"
  • Data Length: xxxx
  • Data: consists of some "mini-chunks":
    [Type (1 byte)] [Length (1 byte)] [actual string (max 255 bytes)]
    Type:
    &H01 = defines the "Author"-string
    &H02 = defines the "Application"-string
    &H03 = defines the ...
    etc...
    Length:
    Length of the actual string.
  • Checksum: xxxx
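
The "mini-chunk" data field above is a simple type/length/value list, which could be sketched in Python as follows. Only the type codes &H01 and &H02 come from the post; the latin-1 encoding and the names pack_info and unpack_info are assumptions for illustration.

```python
AUTHOR, APPLICATION = 0x01, 0x02  # type codes taken from the list above


def pack_info(fields) -> bytes:
    """fields is an iterable of (type code, string) pairs."""
    out = bytearray()
    for kind, text in fields:
        raw = text.encode("latin-1")
        if len(raw) > 255:
            raise ValueError("string longer than 255 bytes")
        out += bytes([kind, len(raw)]) + raw
    return bytes(out)


def unpack_info(data: bytes):
    """Walk the mini-chunks and return them as (type code, string) pairs."""
    fields, pos = [], 0
    while pos < len(data):
        kind, length = data[pos], data[pos + 1]
        fields.append((kind, data[pos + 2:pos + 2 + length].decode("latin-1")))
        pos += 2 + length
    return fields


info = pack_info([(AUTHOR, "Millenium"), (APPLICATION, "GIF editor")])
```

A reader that meets an unknown type code can still skip it, because the length byte says how far to jump.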

Maybe you noticed that this file format approach is very similar to PNG. Well, there is a reason for it. The format has a very large potential and you can do whatever you want with it.

Some will say: "Cookie, you're making it more difficult than it is again". Well, in fact, again, I'm not. It seems difficult, but it really isn't. This approach means you can do whatever you want with the format and store whatever you want in it. And, most important, you are "safe" for any future developments you want to make without reinventing/recreating a new file format. This means your old applications could even read files from your new applications without errors. (Whether they can interpret the data is something else; that depends on how "compatible" you make your new applications.)

In fact, this general format I just described is used by many, many existing companies because of its versatile use. (And I use just the same for some of my applications.)
RE: File format suggestions by Choli on 06-12-2004 at 12:22 AM

quote:
Originally posted by CookieRevised
File Chunks : Unlimited packs of data which each have 7 parts:


Name (limited to 255 bytes) - Chunk name, should be a unique identifier, not forced tho.
Comments: Make the Name shorter. 255 bytes is too much of a waste of space. Make it like the "File Type", so 5 bytes. That's more than enough to make all kinds of different chunks... Also, make it forced; I mean it should be required IMO (to be consistent with the overall global file format you're creating).

Version (1 byte) - Same as "File Version". Adds the same advantages but then for individual chunk-types...
Comments: Concerning the chunk checksum (see below): you could reserve bit 8 to indicate whether there is a checksum or not. This way the checksum could be optional.

Comments Length (1 byte) - To identify the length of the comment; otherwise you won't know where the comment begins or ends, unless you always use 255 bytes. But for short comments that is a waste of space.

Comments (limited to 255 bytes) - Like the file comments, additional information for this chunk.
I agree with cookie on his format, however I'd improve what is in the quote: Name and comments are human-readable strings, I mean: they have no weird symbols, only characters; so you can use null-terminated strings. The size required would be the same (because you add a null character at the end, but you no longer need the byte identifying the length) and this way you can have strings of any size. You may want to have a 500-byte comment on the chunk. In any case, the recommendation of making the name and comments as short as possible still applies.

A 2nd improvement would be the (optional) use of Unicode (UTF-16) strings in name and comments. As you know, Unicode strings very often contain null bytes, so you'd need something to distinguish between the null bytes inside the characters of a Unicode string and the null byte that means end-of-string in an ANSI string.

That can be done by putting the bytes 255 and 254 at the beginning of the string. If those bytes are there, the string is Unicode and ends when you find 2 null bytes (i.e. 1 null Unicode character). If the string begins with the bytes 254 and 255 (note the order), that means the string is Unicode but big endian. In any other case, the string is ANSI.
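
The marker scheme just described could be read back along these lines. This is a sketch under assumptions: "Unicode" is taken to mean UTF-16, "ANSI" is decoded as latin-1, and the function name read_string is invented for the example.

```python
def read_string(buf: bytes, pos: int = 0):
    """Return (text, offset just past the terminator)."""
    marker = buf[pos:pos + 2]
    if marker in (b"\xff\xfe", b"\xfe\xff"):
        # 255,254 marks little-endian; 254,255 marks big-endian.
        codec = "utf-16-le" if marker == b"\xff\xfe" else "utf-16-be"
        start = end = pos + 2
        # Step two bytes at a time so that zero bytes belonging to two
        # neighbouring characters are not mistaken for the terminator.
        while buf[end:end + 2] != b"\x00\x00":
            if end + 2 > len(buf):
                raise ValueError("unterminated string")
            end += 2
        return buf[start:end].decode(codec), end + 2
    end = buf.index(b"\x00", pos)  # raises ValueError if there is no terminator
    return buf[pos:end].decode("latin-1"), end + 1
```

The two-byte stepping matters: "A" followed by U+0100 is stored little-endian as 41 00 00 01, which contains two adjacent zero bytes that are not a terminator.
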
quote:
Originally posted by CookieRevised
Checksum (x bytes; depends on what kind of checksum you use) - You could add a checksum to the chunk to make it possible to verify the integrity of your data.
Comments: But that would imply reading/saving/checking the data, which could mean slow processing. On the other hand, you can create your own type of checksum (only take the hash of byte 10 through byte 100 or something). This has some advantages: since it only checks some bytes and not all, it wouldn't be as slow as checking the whole chunk data. And people who wanna "hack" your file format will have a hard time doing it, because they don't know how the checksum is calculated.
this should be optional, and whether it is present should be indicated by (for example) one bit in the version byte.

if the checksum is used, it should be applied to the whole chunk. today there are algorithms that compute a CRC32 or an MD5 very fast (for example, i once made a program in vb that takes a file and calculates its CRC32. It worked at about 600 KB/s, and that's a good speed, because VB doesn't have the needed support for some kinds of operations, which slows it down. I'm sure that the same program (well) done in C would go at 2 MB/s or faster)
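
For illustration, a CRC32 over a whole chunk can be computed block by block with Python's standard zlib module, so even a large chunk never has to be held in memory at once. The function name crc32_stream is made up here, and the speeds quoted in the post are not reproduced or measured.

```python
import zlib


def crc32_stream(blocks) -> int:
    """CRC32 of the concatenation of all blocks, computed incrementally."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)  # feed the running value back in
    return crc


whole = zlib.crc32(b"hello world")
pieces = crc32_stream([b"hello ", b"world"])
print(whole == pieces)  # True
```
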
RE: File format suggestions by CookieRevised on 06-12-2004 at 01:45 AM

quote:
Originally posted by Choli
agree with cookie on his format, however I'd improve what is in the quote: Name and comments are human-readable strings, I mean: they have no weird symbols, only characters; so you can use null-terminated strings. The size required would be the same (because you add a null character at the end, but you no longer need the byte identifying the length) and this way you can have strings of any size. You may want to have a 500-byte comment on the chunk. In any case, the recommendation of making the name and comments as short as possible still applies.
True, that's also a possibility. But personally, I don't like to use that because you need to read and check an unknown number of bytes until you encounter a null byte.

The advantage is indeed that you can have a string longer than 255 bytes, but the disadvantage is that you need to read an unknown number of bytes and check each one for a null byte.

This also implies that you can't easily jump to a certain part of the file/chunk without checking these null-terminated strings, and this is, IMO, a bit messy (but it will work though). For example, if I wanna read a certain chunk, I must process every previous chunk to check if there are comments and how long they are, before I know what the offset of the chunk I need is. If you use a length identifier, I can jump very easily and quickly through the chunks...
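
A sketch of that jumping, assuming the chunk layout proposed earlier in the thread without a checksum: 5-byte name, 1-byte version, 1-byte comment length, comment, then a 4-byte little-endian data length and the data. The name index_chunks is hypothetical. Only the two length fields are inspected; comment and data bytes are skipped unread.

```python
import struct


def index_chunks(buf: bytes, offset: int = 0):
    """Map each chunk name to the offset where that chunk starts."""
    index = {}
    while offset < len(buf):
        name = buf[offset:offset + 5]
        comment_len = buf[offset + 6]              # skip the version byte
        length_at = offset + 7 + comment_len       # jump over the comment
        (data_len,) = struct.unpack_from("<I", buf, length_at)
        index[name] = offset
        offset = length_at + 4 + data_len          # jump over the data
    return index
```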

RE: File format suggestions by Choli on 06-12-2004 at 10:09 AM

quote:
Originally posted by CookieRevised
This also implies that you can't easily jump to a certain part of the file/chunk without checking these null-terminated strings, and this is, IMO, a bit messy
:O yes yes yes, you're completely right.... I didn't think about that:$. I suggest putting the Data length member at the beginning of the chunk, and it should be the length of the whole chunk. Imo, null-terminated strings are still better than pascal-style strings. Most programming languages and OSes use that kind of string, and it's not very complicated to skip the name and comments fields by searching for null chars (even if you need to read all the bytes). Note that with the data length at the beginning, if you're reading the comments, that's because you need the data of the chunk, so you very probably need the whole chunk (with its name and comments). So you're reading the comments at the same time anyway.
RE: File format suggestions by Millenium_edition on 06-12-2004 at 12:08 PM

I've been doing it the way you suggested since the beginning.

(length of string in one byte)<string>

For the extra version info: just use MGIF1, MGIF2, MGIF3, etc.

I've tried using 4 bytes instead of three, but converting from hex to decimal returns -1, so I didn't bother making it like that.

About a required chunk: forget it, it wouldn't be "global" anymore, since this is for ALL kinds of documents: images, text documents, configuration, settings, etc.
+ you can add it if you want your program to support it.

version and checksum are not necessary, that can be done in the comments. They're not long after all.

quote:
Originally posted by Choli
I agree with cookie on his format, however I'd improve what is in the quote: Name and comments are human-readable strings, I mean: they have no weird symbols, only characters; so you can use null-terminated strings. The size required would be the same (because you add a null character at the end, but you no longer need the byte identifying the length) and this way you can have strings of any size. You may want to have a 500-byte comment on the chunk. In any case, the recommendation of making the name and comments as short as possible still applies.
no, sorry. it's a DLL, the developer can do what he wants with that, it's a choice to make.


I like your ideas, but you're just customizing it, it should be and remain global =/
RE: File format suggestions by Choli on 06-12-2004 at 01:30 PM

quote:
Originally posted by Millenium_edition
I've tried using 4 bytes instead of three, but converting from hex to decimal returns -1, so I didn't bother making it like that.
eh? you just have to store the integer itself in those 4 bytes, not the decimal representation of it :-/
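
The thread does not show the code that returned -1, so the following is only a guess at the cause: the four bytes FF FF FF FF read as a signed 32-bit integer give -1, while the same bytes read as unsigned give the expected maximum. A small Python illustration using the standard struct module:

```python
import struct

raw = struct.pack("<I", 0xFFFFFFFF)          # the integer stored in 4 bytes

as_unsigned = struct.unpack("<I", raw)[0]    # 4294967295
as_signed = struct.unpack("<i", raw)[0]      # -1

print(raw.hex(), as_unsigned, as_signed)     # ffffffff 4294967295 -1
```
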
quote:
Originally posted by Millenium_edition
no, sorry. it's a DLL, the developer can do what he wants with that, it's a choice to make.
I think I get you better now. So let's redefine the chunk format:
Name: this is an ID for the chunk. Fixed size or a string (null-terminated or with a byte-size identifier). Its purpose is to identify the contents of the chunk.
Length: Length of the data field of the chunk. 4 bytes (from 0 to 2^32 - 1 bytes)
Data: The actual data of the chunk. Internal structure defined by the program/developer that uses the DLL. From 0 to 2^32 - 1 bytes. There's no need for other fields in the chunk. Comments, checksums, etc... may (have to) be included inside this data field, and it's up to the developer to use/check/manage them.
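
A minimal Python sketch of this reduced chunk format. It picks the fixed-size option for the name (5 bytes, as suggested earlier in the thread) and a little-endian length; both choices, and the names write_chunk and iter_chunks, are assumptions made for the example.

```python
import struct


def write_chunk(name: bytes, data: bytes) -> bytes:
    """Name (5 bytes) + length (4 bytes, unsigned) + data."""
    if len(name) != 5:
        raise ValueError("chunk name must be exactly 5 bytes")
    return name + struct.pack("<I", len(data)) + data


def iter_chunks(buf: bytes):
    """Yield (name, data) for every chunk in the buffer."""
    pos = 0
    while pos < len(buf):
        name = buf[pos:pos + 5]
        (length,) = struct.unpack_from("<I", buf, pos + 5)
        start = pos + 9
        yield name, buf[start:start + length]
        pos = start + length


blob = write_chunk(b"IMAGE", b"\x89raw") + write_chunk(b"NOTES", b"")
print(list(iter_chunks(blob)))  # [(b'IMAGE', b'\x89raw'), (b'NOTES', b'')]
```

Anything else (comments, version, checksum) would live inside the data field, in whatever structure the application defines.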