Bytes | Software Development & Data Engineering Community
Storing variable length data in file -- Paging

How can I store variable-length data in a file?
I want to do it in pure C, without using an existing database.

I'm thinking of using pages to store the data.

Does anyone have an idea for the file format?

I want to store data like a database:
----------------------------------
Customer:
id - integer
name - varchar(50)
email - varchar(45)
----------------------------------

___________________________________________________________
Luiz Antonio Gomes Picanço
Software Developer
www.luizantonio.com
lu**@luizantonio.com
___________________________________________________________
Nov 14 '05 #1
In article <dd**************************@posting.google.com >,
Luiz Antonio Gomes Pican?o <lu**@luizantonio.com> wrote:
:How i can store a variable length data in file ?
:I want to do it using pure C, without existing databases.

There must be dozens of ways.

:I'm thinking to use pages to store data.

You could do that, but what happens if the variable length data
exceeds the page size?

:Anyone has idea for the file format ?

:I want to store data like a database:
:----------------------------------
:Custumer:
:id - integer
:name - varchar(50)
:email - varchar(45)

How would you like the program to operate in each of these cases?

1) froup - varchar(-42)
2) froup - varchar(0)
3) froup - varchar(4294967295)
4) froup - varchar(6.4)
5) froup - varchar(id)
6) ThisIsAVeryLongIdentifierThatIsLongerThanThe5CharacterPositionsYouImplyInYourExample - varchar(100)
7) - varchar(10)
8) varchar(20) - froup
9) .dot - varchar(17)
10) _foo - varchar(8)
11) - - varchar(21)
12) ------- varchar(83)
13) froup -
14) froup - varchar[71]
15) biff! - varchar(4)
16)
17) semi - varchar(23);
18) froup varchar(102)
19) froup - varchar(5:10)
20) froup - varchar(6,11)
21) froup - varchar(6)[11]

--
"Mathematics? I speak it like a native." -- Spike Milligan
Nov 14 '05 #2
Luiz Antonio Gomes Pican?o wrote:
How i can store a variable length data in file ?
I want to do it using pure C, without existing databases.

I'm thinking to use pages to store data.

Anyone has idea for the file format ?

I want to store data like a database:
----------------------------------
Custumer:
id - integer
name - varchar(50)
email - varchar(45)
----------------------------------

___________________________________________________________
Luiz Antonio Gomes Picanço
Software Developer
www.luizantonio.com
lu**@luizantonio.com
___________________________________________________________


To store a variable of any length in a file you do this:

You store the length of the data, then the data itself.
This supposes you have a maximum length, say what an
integer can hold (2^31 in most 32-bit systems, 2GB),
or 4GB if you go for an unsigned int to store the length.

To read such a file you read the length, by reading an integer
from the file, then skip that number of bytes to access the
next record.

Many popular file formats are done that way, for instance
the object file format under Windows (.obj).

For instance in your case suppose you have the record:
id 69988 (1 integer)
name Joe Smith (9 chars)
email jo*@joe.com (11 chars)

You would store, supposing sizeof(int)=4
4 // The length of the record
4 // id of record, always fixed
4 // Length of name
9 // chars of name
4 // length of email
11 // chars of email.

11+4+9+4+4 == 32

Then you store:
    fwrite(&len,1,4,file);
    fwrite(&id,1,4,file);
    fwrite(&namelen,1,4,file);
    fwrite(name,1,namelen,file);
    fwrite(&email_len,1,4,file);
    fwrite(email,1,email_len,file);

You read it using the opposite function, fread().

This is a very compact way of storing the information.
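A minimal read-back sketch of this layout (the struct and function names are mine, and it assumes the file is read on the same platform that wrote it, so int size and byte order match):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for the layout above; name and email
   are malloc'd and null-terminated after reading. */
struct customer {
    int id;
    char *name;
    char *email;
};

/* Read back one length-prefixed string: the length, then the bytes,
   plus one extra byte for the terminator that fwrite never stored. */
static char *read_string(FILE *f)
{
    int n;
    char *s;
    if (fread(&n, sizeof n, 1, f) != 1 || n < 0) return NULL;
    if ((s = malloc((size_t)n + 1)) == NULL) return NULL;
    if (fread(s, 1, (size_t)n, f) != (size_t)n) { free(s); return NULL; }
    s[n] = '\0';
    return s;
}

/* Read one record written with the fwrite calls above.
   Returns 0 on success, -1 on error or end of file. */
int read_customer(FILE *f, struct customer *c)
{
    int len;
    if (fread(&len, sizeof len, 1, f) != 1) return -1;  /* record length */
    if (fread(&c->id, sizeof c->id, 1, f) != 1) return -1;
    if ((c->name = read_string(f)) == NULL) return -1;
    if ((c->email = read_string(f)) == NULL) { free(c->name); return -1; }
    return 0;
}
```

The record length read first is redundant when reading sequentially, but it lets you fseek() past a record without parsing its fields.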

jacob

Nov 14 '05 #3
In article <42**********************@news.wanadoo.fr>,
jacob navia <ja***@jacob.remcomp.fr> wrote:
:Luiz Antonio Gomes Pican?o wrote:

:> How i can store a variable length data in file ?
:> I want to do it using pure C, without existing databases.

:To store variable of any length in a file you do this:

:You store the length of the data, then the data itself.
:This supposes you have a maximum length, say what an
:integer can hold (2^31 in most 32 bit systems, 2GB)
:or 4GB if you go for unsigned int to store the length.

The OP asked for "pure C". Assuming 4 byte ints is not completely
portable, and hence not entirely "pure". ;-)

:To read such a file you read the length, by reading an integer
:from the file, then skip that number of bytes to access the
:next record.

Well, you wouldn't read an integer from the file, you would read
4 bytes from the file and convert it to a sufficiently large integer
type, taking into account whatever endianness you had decreed for
the file type.
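For example, a sketch of such a conversion, assuming the file format decrees 4-byte little-endian lengths (the function name is mine):

```c
#include <stdio.h>

/* Read a 4-byte length stored little-endian in the file and assemble
   it into an unsigned long, regardless of the host's own byte order.
   Returns 1 on success, 0 on end of file. */
int read_u32_le(FILE *f, unsigned long *out)
{
    unsigned long v = 0;
    int i, c;
    for (i = 0; i < 4; i++) {
        if ((c = getc(f)) == EOF)
            return 0;
        v |= (unsigned long)c << (8 * i);  /* byte i holds bits 8i..8i+7 */
    }
    *out = v;
    return 1;
}
```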
:For instance in your case suppose you have the record:
:id 69988 (1 integer)
:name Joe Smith (9 chars)
:email jo*@joe.com (11 chars)

:Then you store:
: fwrite(&len,1,4,file);
: fwrite(&id,1,4,file);
: fwrite(&namelen,1,4,file);
: fwrite(name,1,namelen,file);
: fwrite(&email_len,1,4,file);
: fwrite(email,1,email_len,file);

:You read it using the opposite function, fread().

Though if you naively just do the opposite, then you run into the
small matter of null termination. Usually a 'varchar' size excludes
any required string terminator. There is no need to store the null
terminator when you store the length, and the code above would not
do so -- but when you read the string values back in, chances are that
you would want to null terminate them.
Of course a more generalized program would not assume a fixed-length
representation for the -size- of the strings. For example, using a
4 byte unsigned number as the size field artificially limits the
maximum field length to only (4 gigabytes minus 1), which is entirely
unsatisfactory if one wishes to store even moderately sized strings
such as an entire day's comp.lang.c flames and off-topic posts.
--
Positrons can be described as electrons traveling backwards in time.
Certainly many Usenet arguments about the past become clearer when they
are re-interpreted as uncertainty about the future.
-- Walter Roberson
Nov 14 '05 #4
Luiz Antonio Gomes Pican?o wrote on 25/02/05 :
How i can store a variable length data in file ?
I want to do it using pure C, without existing databases.

I'm thinking to use pages to store data.

Anyone has idea for the file format ?

I want to store data like a database:
----------------------------------
Custumer:
id - integer
name - varchar(50)
email - varchar(45)
----------------------------------


This is not a C question. It's a design issue.

<OT>
Use the Comma Separated Values (CSV) format, for example... Easy to
implement and to parse...
</OT>

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

"Mal nommer les choses c'est ajouter du malheur au
monde." -- Albert Camus.

Nov 14 '05 #5
Walter Roberson wrote:
In article <42**********************@news.wanadoo.fr>,
jacob navia <ja***@jacob.remcomp.fr> wrote:
:Luiz Antonio Gomes Pican?o wrote:

:> How i can store a variable length data in file ?
:> I want to do it using pure C, without existing databases.

:To store variable of any length in a file you do this:

:You store the length of the data, then the data itself.
:This supposes you have a maximum length, say what an
:integer can hold (2^31 in most 32 bit systems, 2GB)
:or 4GB if you go for unsigned int to store the length.

The OP asked for "pure C". Assuming 4 byte ints is not completely
portable, and hence not entirely "pure". ;-)

I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
example 4 byte ints, the most common case. Anyway you *have* to
assume some length. A 4GB limit for names, email-addresses looks
reasonable to me.

More interesting would be a limitation to 255 for both. This would
save 6 bytes/record.

More clever schemes can be used:
The length is 1-254 in one byte. If the length is more than 254
then a 255 is written, and the next two bytes are read. If they
are between 1-65534 that is the length. If the value read is
65535 then you read the next 4 bytes, and so on. This is a
virtually unlimited format.

This way you can store the data very efficiently, since most names
and emails will be under 254 characters.

This scheme is used, for instance, in the encoding of the numeric
constants in Microsoft's debug info.

It supposes a slightly more complex decoder.
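A decoder sketch for one reading of this scheme (I'm assuming the escaped multi-byte widths are stored little-endian; the names are mine):

```c
#include <stdio.h>

/* Helper: read nbytes bytes, little-endian, into *out. */
static int read_le(FILE *f, int nbytes, unsigned long *out)
{
    unsigned long v = 0;
    int i, c;
    for (i = 0; i < nbytes; i++) {
        if ((c = getc(f)) == EOF) return 0;
        v |= (unsigned long)c << (8 * i);
    }
    *out = v;
    return 1;
}

/* One reading of the scheme above: a length below 255 fits in one
   byte; a 255 byte escapes to a 2-byte length; the 2-byte value
   65535 escapes again to a full 4-byte length. */
int read_length(FILE *f, unsigned long *len)
{
    unsigned long v;
    if (!read_le(f, 1, &v)) return 0;
    if (v < 255)   { *len = v; return 1; }  /* common short case */
    if (!read_le(f, 2, &v)) return 0;
    if (v < 65535) { *len = v; return 1; }  /* medium case */
    return read_le(f, 4, len);              /* full 4-byte length */
}
```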
:To read such a file you read the length, by reading an integer
:from the file, then skip that number of bytes to access the
:next record.

Well, you wouldn't read an integer from the file, you would read
4 bytes from the file and convert it to sufficiently large integer type,
taking into account whatever ended-ness you had decreed for the file
type.

Correct.

:For instance in your case suppose you have the record:
:id 69988 (1 integer)
:name Joe Smith (9 chars)
:email jo*@joe.com (11 chars)

:Then you store:
: fwrite(&len,1,4,file);
: fwrite(&id,1,4,file);
: fwrite(&namelen,1,4,file);
: fwrite(name,1,namelen,file);
: fwrite(&email_len,1,4,file);
: fwrite(email,1,email_len,file);

:You read it using the opposite function, fread().

Though if you naively just do the opposite, then you run into the
small matter of null termination. Usually a 'varchar' size excludes
any required string terminator. There is no need to store the null
terminator when you store the length, and the code above would not
do so -- but when you read the string values back in, chances are that
you would want to null terminate them.

Correct. You should allocate 1 byte more than you will read
and then zero terminate the string when reading.

Of course a more generalized program would not assume a fixed-length
representation for the -size- of the strings. For example, using a
4 byte unsigned number as the size field artificially limits the
maximum field length to only (4 gigabytes minus 1),
Well, this is a limitation of the format as described.
An easy way to overcome this and make it more efficient is the
algorithm outlined above, or you can go to long long
for the length, allowing you to store strings up to
18,446,744,073,709,551,615 bytes.

which is entirely
unsatisfactory if one wishes to store even moderately sized strings
such as an entire day's comp.lang.c flames and off-topic posts.


Well, my flames folder *is* overflowing...

flames >/dev/null
Nov 14 '05 #6
jacob navia wrote:
.... snip ...
I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
example 4 byte ints, the most common case. Anyway you *have* to
assume some length. A 4GB limit for names, email-addresses looks
reasonable to me.

More interesting would be a limitation to 255 for both. This would
save 6 bytes/record.

More clever schemas can be used:
The length is 1-254 in one byte. If the length is more than 254
then a 255 is written, and the next two bytes are read. If they
are between 1-65534 that is the length. If the value read is
65535 then you read the next 4 bytes, and so on. This is a
virtually ilimited format.

This way you can store very efficiently the data since most names
and emails will be under 254.

This schema is used for instance in the encoding of the numeric
constants in Microsoft's debug info.


A simpler scheme is to assign the high bit of each byte to mean
'more', and extract 7 bits per byte. This would be something like:

#define HIBIT 128

ch = getc(f); value = 0;
while (ch & HIBIT) {
    value = HIBIT * value + (ch & (HIBIT - 1));
    ch = getc(f);
}
value = HIBIT * value + (ch & (HIBIT - 1));

which is independent of CHAR_BIT and works nicely with streams.
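A matching encoder sketch for this scheme (the function name is mine): emit the value in base 128, most significant group first, with the 'more' bit set on every byte except the last.

```c
#include <stdio.h>

#define HIBIT 128

/* Encode value so the decode loop above recovers it: base-128
   digits, most significant first, high bit set on all but the
   last byte. */
void put_value(unsigned long value, FILE *f)
{
    unsigned char buf[16];  /* ample for unsigned long in base 128 */
    int n = 0;
    do {
        buf[n++] = (unsigned char)(value % HIBIT);
        value /= HIBIT;
    } while (value != 0);
    while (n > 1)
        putc(buf[--n] | HIBIT, f);  /* continuation bytes */
    putc(buf[0], f);                /* final byte, high bit clear */
}
```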

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
Nov 14 '05 #7
In article <42***************@yahoo.com>,
CBFalconer <cb********@worldnet.att.net> wrote:
:A simpler scheme is to assign the high bit of each byte to mean
:'more', and extract 7 bits per byte. This would be something like:

: #define HIBIT 128

Minor phrasing nit. For this scheme, you do not want to assign
the "high" bit of each byte to mean 'more', you want to assign
a -fixed- platform-independent bit for that purpose. Which is
what your code does.
: ch = getc(f); value = 0;

This discussion leads me to ponder platform independence of the
database a bit more.

I do not have my copy of the C89 standard with me and cannot get
it today (building construction work), so perhaps someone could
answer to these points?

a) getc() is defined in terms of 'characters'. For this purpose,
is 'character' equivalent to 'byte' (that is, basic fetchable
units), or is 'character' potentially a multi-byte entity
(e.g., unicode). [I've probably been reading too much about
perl streams...]

b) Are fread() and fwrite() defined in terms of specific numbers
of bits per byte? If not, then if a file is written using
fwrite() on one platform, and is then taken to another platform,
the data that is read in may not have the same byte or character
boundaries. The result would then depend upon the nature of the
process of "taking it to another platform" (e.g., if I recall
correctly, ftp would normally do bytewise renormalization). This,
of course, in addition to the usual problem that
fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
write in a platform-dependent byte order.

I guess all this was why xdr was invented...
--
Admit it -- you peeked ahead to find out how this message ends!
Nov 14 '05 #8
In article <42**********************@news.wanadoo.fr>,
jacob navia <ja***@jacob.remcomp.fr> wrote:
|Walter Roberson wrote:
|> In article <42**********************@news.wanadoo.fr>,
|> jacob navia <ja***@jacob.remcomp.fr> wrote:

|> :You store the length of the data, then the data itself.
|> :This supposes you have a maximum length, say what an
|> :integer can hold (2^31 in most 32 bit systems, 2GB)
|> :or 4GB if you go for unsigned int to store the length.

|I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
|example 4 byte ints, the most common case.

You closed the () a bit early. If you remove the part in () from
your sentence, your sentence implies that if you use an unsigned
int to hold the length then the limit will be 4 GB no matter what
the platform.

In any case, the point is that you should be choosing the limit
first and -then- choosing the representation. Choosing a platform-
dependent limit is not the best of coding practices. There is
more excuse for choosing the limit of 2^32-1, as one can be
sure that one's unsigned long will hold at least that much. That is
a different choice than choosing "what an integer can hold".
:An easy way to overcome this and make it more efficient is the
:algorithm outlined above, or you can go to long long
:for the length,

long long is not part of C89.
--
So you found your solution
What will be your last contribution?
-- Supertramp (Fool's Overture)
Nov 14 '05 #9
In article <42**********************@news.wanadoo.fr>,
jacob navia <ja***@jacob.remcomp.fr> wrote:
:Then you store:
: fwrite(&len,1,4,file);

The byte order that results is not fixed in the standard.
You would not use this method of writing the file if you
are trying to write portable files.
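A sketch of the portable alternative (the function name is mine): write the length byte by byte in an order the file format decrees, little-endian here, instead of dumping the int's native representation.

```c
#include <stdio.h>

/* Write a length as 4 bytes in a fixed order (little-endian here),
   so the file layout does not depend on the host's byte order the
   way fwrite(&len,1,4,file) does. Returns 1 on success, 0 on error. */
int write_u32_le(unsigned long v, FILE *f)
{
    int i;
    for (i = 0; i < 4; i++)
        if (putc((int)((v >> (8 * i)) & 0xFF), f) == EOF)
            return 0;
    return 1;
}
```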
--
I predict that you will not trust this prediction.
Nov 14 '05 #10
CBFalconer wrote:
jacob navia wrote:

... snip ...
I said: "WHAT AN INTEGER CAN HOLD", and *then* I explained as an
example 4 byte ints, the most common case. Anyway you *have* to
assume some length. A 4GB limit for names, email-addresses looks
reasonable to me.

More interesting would be a limitation to 255 for both. This would
save 6 bytes/record.

More clever schemas can be used:
The length is 1-254 in one byte. If the length is more than 254
then a 255 is written, and the next two bytes are read. If they
are between 1-65534 that is the length. If the value read is
65535 then you read the next 4 bytes, and so on. This is a
virtually ilimited format.

This way you can store very efficiently the data since most names
and emails will be under 254.

This schema is used for instance in the encoding of the numeric
constants in Microsoft's debug info.

A simpler scheme is to assign the high bit of each byte to mean
'more', and extract 7 bits per byte. This would be something like:

#define HIBIT 128

ch = getc(f); value = 0;
while (ch & HIBIT) {
value = HIBIT * value + (ch & (HIBIT - 1));
ch = getc(f);
}
value = HIBIT * value + (ch & (HIBIT - 1));

which is independant of CHAR_BIT and works nicely with streams.


I don't think the 7-bit thing works today. Everybody 'extends' the set
to eight bits. Many years ago I learned that Pascal 'strings' were
arrays of characters, the first of which was the length of the string.
As above, this allowed short strings of 0..255 characters. Then someone
(I don't remember who) 'extended' the system. The short string was
limited to 254 characters. If the 'length' byte (s[0]) was 255 it
indicated that the string continued and that the length of the second
string could be found at s[255] and the first character of the second
string is at s[256]. If s[255] == 255 we just keep going (and going).

--
Joe Wright mailto:jo********@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---
Nov 14 '05 #11
jacob navia <ja***@jacob.remcomp.fr> wrote in message news:<42**********************@news.wanadoo.fr>...
Luiz Antonio Gomes Pican?o wrote:
... snip ...
But when the record grows, I would have to rewrite all the data in
the file. I'm thinking of doing this:

struct page
{
    unsigned int nextpage;
    unsigned int priorpage;
    unsigned int length;
    char pageoverflow;
};

If the record becomes larger than a page, an overflow page will be used.
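A toy sketch of writing a record as a chain of such pages (the names and sizes are mine; one byte per header field keeps it short, which caps page numbers at 255 -- a real format would serialize full integers in a fixed byte order):

```c
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 32                    /* toy size, for illustration */
#define PAGE_HDR   4                    /* next, prior, length, overflow */
#define PAGE_DATA (PAGE_SIZE - PAGE_HDR)

/* Append a record to f as a chain of fixed-size pages.
   Pages are numbered from *next_pageno (0 means "no page").
   Returns the number of pages written. */
int write_record(FILE *f, const char *data, int len, int *next_pageno)
{
    int pages = 0, prior = 0;
    while (len > 0 || pages == 0) {
        int chunk = len > PAGE_DATA ? PAGE_DATA : len;
        int more = len > PAGE_DATA;            /* overflow page needed? */
        int i;
        putc(more ? *next_pageno + 1 : 0, f);  /* nextpage (0 = none) */
        putc(prior, f);                        /* priorpage (0 = none) */
        putc(chunk, f);                        /* bytes used in this page */
        putc(more, f);                         /* pageoverflow flag */
        fwrite(data, 1, (size_t)chunk, f);
        for (i = chunk; i < PAGE_DATA; i++)    /* pad to fixed page size */
            putc(0, f);
        prior = (*next_pageno)++;
        data += chunk;
        len -= chunk;
        pages++;
    }
    return pages;
}
```

Fixed-size pages make in-place updates and fseek() addressing easy (page n starts at byte n * PAGE_SIZE), at the cost of padding in partially filled pages.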
Nov 14 '05 #12
Luiz Antonio Gomes Pican?o wrote:
But when the record grows, I would have to rewrite all the data in the file.


True, but I thought that people's names/email addresses do not change.
For growing a record the best strategy within my scheme is to leave
the old record marked as empty and then add the bigger record at the
end of the file.

This presumes periodic "garbage collection" where all the empty
records are erased in a copy operation. Some years ago, when
I used Outlook, I had to periodically tell it to "compress folder",
probably because a similar scheme was used.

Your page scheme is better when records grow by moderate
amounts, but it wastes a lot of space in pages that are not 100%
filled. Besides, your scheme doesn't organize the records within a
page.

Each possible scheme has drawbacks. It depends on the data and
whether it changes often or not, whether space is at a premium
or not, whether reading speed is important, whether software
complexity is an issue, etc.

jacob

Nov 14 '05 #13
On Sat, 26 Feb 2005 17:13:40 +0000, Walter Roberson wrote:

... snip ...
: ch = getc(f); value = 0;

This discussion leads me to ponder platform independance of the
database a bit more.

I do not have my copy of the C89 standard with me and cannot get
it today (building construction work), so perhaps someone could
answer to these points?

a) getc() is defined in terms of 'characters'. For this purpose,
is 'character' equivilent to 'byte' (that is, basic fetchable
units), or is 'character' potentially a multi-byte entity
(e.g., unicode). [I've probably been reading too much about
perl streams...]
Yes, it reads the next byte from the file. It reads a value as an unsigned
char and converts that to int for the return value. It is the wide
character functions that interpret multibyte sequences.
b) Are fread() and fwrite() defined in terms of specific numbers
of bits per byte?
No.
If not, then if a file is written using
fwrite() on one platform, and is then taken to another platform,
the data that is read in may not have the same byte or character
boundaries.
That is a possibility irrespective of what the standard C library does.
The result would then depend upon the nature of the
process of "taking it to another platform" (e.g., if I recall
correctly, ftp would normally do bytewise renormalization).
E.g. should a 16 bit platform pack 2 octets per byte or just one?
This,
of course, in addition to the usual problem that
fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
write in a platform-dependant byte order.


But that problem is easily avoided.

Lawrence
Nov 14 '05 #14
On 26 Feb 2005 17:13:40 GMT, ro******@ibd.nrc-cnrc.gc.ca (Walter
Roberson) wrote:
<snip>
a) getc() is defined in terms of 'characters'. For this purpose,
is 'character' equivilent to 'byte' (that is, basic fetchable
units), or is 'character' potentially a multi-byte entity
(e.g., unicode). [I've probably been reading too much about
perl streams...]
getc, fgetc, getchar do a byte, as do putc, fputc, putchar.
getwc, fgetwc, getwchar and putwc, fputwc, putwchar do a wide
character to/from multibyte in a 'wide-oriented' file (if supported).
b) Are fread() and fwrite() defined in terms of specific numbers
of bits per byte? If not, then if a file is written using
fwrite() on one platform, and is then taken to another platform,
the data that is read in may not have the same byte or character
boundaries. The result would then depend upon the nature of the
process of "taking it to another platform" (e.g., if I recall
correctly, ftp would normally do bytewise renormalization). This,
Exactly. Actually, FTP has several options that mattered in the days
when there were significant 36-bitters on the 'net like PDP-10's.
Nowadays people forget to do even ASCII/BINARY, or with (typical?)
browsers don't even have the choice.
of course, in addition to the usual problem that
fwrite( (char *)&value, sizeof(value), 1, STREAM) is going to
write in a platform-dependant byte order.
The cast isn't needed; fwrite is declared to take const void*.
(Well, unless you need to cast away volatile from your buffer, and
even then it's not really the right way.)
I guess all this was why xdr was invented...


And ASN.1. And for simple cases network byte order.
And one reason many Internet protocols use text.

- David.Thompson1 at worldnet.att.net
Nov 14 '05 #15

This discussion thread is closed

Replies have been disabled for this discussion.
