
Decoding a 30 MB formatted text file

My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:

AuctioneerSnapshotDB = {
    ["nordrassil-neutral"] = {
        ["nextAuctionId"] = 20,
        ["version"] = 1,
        ["updates"] = {
            [1] = "15416.012;;0;0;0;0;0;0",
        },
        ["auctions"] = {
            [1] =
            "16717;0;0;0;1;1650000;1650000;Boneglay;0;0;3;1159391569;1159420369",
            [2] =
            "6661;0;0;0;1;399900;599900;Krius;0;0;2;1159391569;1159398769",
            [3] =
            "6657;0;0;1289192110;1;7300;7900;Bootyboy;0;0;4;1159391569;1159477969",
            [19] =
            "9865;1191;0;680935487;1;5013;8000;Warmist;0;0;1;1159391569;1159393369",
        },
        ["ahKey"] = "nordrassil-neutral",

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.

A pseudocode version might read:
do {
    copy the next line into a temporary string
    if we have found the keyword "nordrassil-neutral" and have found the
    keyword "auctions"
        if the line contains 10 ";"
            populate the struct
} while we have not found the keyword "ahKey"

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document, or is there a better way? I can see that there is a rational
structure but can't see how to use the formatted text better than my
brute-force counting approach.

BTW, I have asked for guidance on the format - decoding it myself is
the only option.
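
For the "populate the struct" step, a minimal C sketch of splitting one record string on ';' might look like the following. The names split_fields and MAX_FIELDS are hypothetical, and because the format is undocumented the sketch does not assign a meaning to any field. It reports how many fields it found instead of assuming a fixed count.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 32   /* assumed upper bound, not taken from the file format */

/* Split `record` in place on ';' and store a pointer to each field in
 * `fields`. Empty fields (as in ";;") are kept, which strtok() would not do.
 * Returns the number of fields, or 0 if there are more than `max`. */
size_t split_fields(char *record, char *fields[], size_t max)
{
    size_t n = 0;
    char *p = record;

    for (;;) {
        if (n == max)
            return 0;               /* more fields than the caller allows */
        fields[n++] = p;
        p = strchr(p, ';');
        if (p == NULL)
            break;                  /* last field reached */
        *p++ = '\0';                /* terminate this field, move to the next */
    }
    return n;
}
```

Applied to the second sample record above, this yields 13 fields, with fields[7] holding "Krius". Comparing the returned count with what the struct expects gives a cheap sanity check if the generating program ever changes the record layout.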

Sep 29 '06 #1
7 Replies


pkirk25 wrote:
> I think I will be able to find what I want and populate my structs by
> looking for keywords like "nordrassil-neutral" and "ahKey". The code
> is not pretty. In fact, it seems to have words like "Fragile - Handle
> with Care" stamped all over it.

If you want the entire document contents, I would parse this based on
the structure, not the content. Strip the whitespace and build a tree
based on the objects (delimited by {}) and elements (delimited by []).

A first pass to build the tree can also be used to validate the
structure (matching [] and {}). Once you have the tree you can scan for
the required elements, ignoring the layout.
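
A minimal C sketch of the validation part of that first pass, assuming the text has already been read into a NUL-terminated buffer. The names brackets_balanced and MAX_DEPTH are hypothetical, and the handling of backslash escapes inside quoted strings is an assumption about the format, not something the sample confirms. It only checks that {} and [] match; it does not build the tree.

```c
#define MAX_DEPTH 64    /* assumed nesting limit, not taken from the file format */

/* Returns 1 if every '{' and '[' in `text` is closed by the matching
 * '}' or ']' in the right order, 0 otherwise. Text inside double quotes
 * is skipped so that brackets inside string values are not counted. */
int brackets_balanced(const char *text)
{
    char stack[MAX_DEPTH];
    int depth = 0;
    int in_string = 0;
    const char *p;

    for (p = text; *p != '\0'; p++) {
        if (in_string) {
            if (*p == '\\' && p[1] != '\0')
                p++;                    /* skip the escaped character */
            else if (*p == '"')
                in_string = 0;
            continue;
        }
        switch (*p) {
        case '"':
            in_string = 1;
            break;
        case '{':
        case '[':
            if (depth == MAX_DEPTH)
                return 0;               /* nested deeper than we allow */
            stack[depth++] = *p;
            break;
        case '}':
        case ']':
            if (depth == 0)
                return 0;               /* closer without an opener */
            depth--;
            if ((*p == '}' && stack[depth] != '{') ||
                (*p == ']' && stack[depth] != '['))
                return 0;               /* wrong kind of closer */
            break;
        default:
            break;
        }
    }
    return depth == 0 && !in_string;
}
```

The same loop could be fed one character at a time from getc() if reading a 30 MB file into memory is not wanted; the stack and the in_string flag are the only state it needs.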

--
Ian Collins.
Sep 29 '06 #2

"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...
> But my question is if this is the right approach to a structured
> document or is there a better way?

"Better way?" Maybe, maybe not. Although C is my preferred method, there
are methods other than C. I've used AWK or VI with great success in the
past solving problems similar to yours. AWK is well suited to processing
formatted text data, and because it accepts input files it can be
automated. You can also use it to reformat the desired output so that it
can be read in easily by C (i.e., think text preprocessor...). If it's a
one- or two-time deal, I'd use VI and brute-force editing, _if_ the largest
unique piece of text that needs to be processed is a single line. Again,
once it's in an easier-to-use format, read it in via C.
Rod Pemberton
Sep 29 '06 #3

pkirk25 wrote:
> I can tell that there are 3 contiguous "\t" before each numbered line.
> But my question is whether this is the right approach to a structured
> document, or is there a better way? I can see that there is a rational
> structure but can't see how to use the formatted text better than my
> brute-force counting approach.

It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.

If you rely on counting tabs and semi-colons or the order of the
keys, you can be sure that the people maintaining the program
that generates the file will change those details without
bothering to tell you. The basic structure could change, too, of
course, but that is less likely.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #4

pkirk25 wrote:
> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:
> [...]
> BTW, I have asked for guidance on the format - decoding it myself is
> the only option.

This might be a time to learn a parser generator such as yacc, which
can make writing a parser much easier than doing it by hand.
Sep 30 '06 #5

> It looks like the file is a collection of key-value pairs. The
> keys are either identifiers (the top level only, apparently), or
> strings or numbers enclosed in square brackets. The values are
> either numbers, strings, or lists of key-value pairs enclosed in
> curly braces and separated by commas. Your best bet would be to
> parse the file based on that structure.

Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of useful data
key3 = "ahKey" means you are now out of the region

bool to put data into structs

read the file line by line
when key0, key1 and key2 have been found, bool to put data into structs = true
work with the data until key3 is found, at which point we exit the function

Or were you thinking of searching for curly braces?

Sep 30 '06 #6

pkirk25 wrote:
> Or were you thinking of searching for curly braces?

Not just curly braces, but all of the syntax of the file. Read
the file token by token, not line by line, since for all you know
there may be some line breaks in the middle of some element (I am
assuming from what you said in the initial post that the format
is undocumented, which means that you should make as few
assumptions as possible about it).
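
A minimal C sketch of reading token by token, assuming the text is in a NUL-terminated buffer. The token categories, the 256-byte token limit, the backslash escape handling, and the names next_token and struct token are all assumptions made for illustration; the real format may need more cases. Line breaks are treated like any other whitespace, so a value that starts on the line after its "=" is handled the same as one on the same line.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum tok_type { TOK_END, TOK_IDENT, TOK_NUMBER, TOK_STRING, TOK_PUNCT, TOK_ERROR };

struct token {
    enum tok_type type;
    char text[256];         /* assumed maximum token length */
};

/* Read the next token starting at *pp and advance *pp past it.
 * On TOK_ERROR the position is not guaranteed to advance, so the
 * caller should stop on TOK_END or TOK_ERROR. */
enum tok_type next_token(const char **pp, struct token *tok)
{
    const char *p = *pp;
    size_t n = 0;
    const size_t limit = sizeof tok->text - 1;

    while (isspace((unsigned char)*p))      /* newlines are just whitespace */
        p++;

    if (*p == '\0') {
        tok->type = TOK_END;
    } else if (*p == '"') {
        p++;                                /* opening quote */
        while (*p != '\0' && *p != '"' && n < limit) {
            if (*p == '\\' && p[1] != '\0')
                p++;                        /* keep the escaped character */
            tok->text[n++] = *p++;
        }
        if (*p == '"') {
            p++;                            /* closing quote */
            tok->type = TOK_STRING;
        } else {
            tok->type = TOK_ERROR;          /* unterminated or too long */
        }
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while ((isalnum((unsigned char)*p) || *p == '_') && n < limit)
            tok->text[n++] = *p++;
        tok->type = TOK_IDENT;
    } else if (isdigit((unsigned char)*p) ||
               (*p == '-' && isdigit((unsigned char)p[1]))) {
        do {
            tok->text[n++] = *p++;
        } while ((isdigit((unsigned char)*p) || *p == '.') && n < limit);
        tok->type = TOK_NUMBER;
    } else if (strchr("{}[]=,", *p) != NULL) {
        tok->text[n++] = *p++;              /* single-character punctuation */
        tok->type = TOK_PUNCT;
    } else {
        tok->type = TOK_ERROR;              /* character not covered by the sketch */
    }
    tok->text[n] = '\0';
    *pp = p;
    return tok->type;
}
```

A caller would loop on next_token() until it returns TOK_END or TOK_ERROR and feed the tokens to whatever builds or searches the key-value structure.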

You might want to use data structures along these lines:

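/* A key is either a number ([1]) or a string (["auctions"]). */
struct key {
    int k_type;                     /* says which member of u is valid */
    union {
        int k_int;
        char *k_string;
    } u;
};

struct key_value;

/* A value is a number, a string, or nested key-value pairs. */
struct value {
    int v_type;                     /* says which member of u is valid */
    union {
        int v_int;
        char *v_string;
        struct key_value *v_keyvalue;
    } u;
};

struct key_value {
    struct key *kv_key;
    struct value *kv_value;
};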

If there is only one of them, you might be able to treat the top
level (AuctioneerSnapshotDB) specially. If so, write a function
to search for a particular key and return its associated value.
This function will call a recursive function to read in a
key-value pair (it has to be recursive because the value may be
another key-value pair). Once you find the right key and get its
value, then worry about getting the subkeys you want.
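
A simplified C sketch of such a lookup, under several assumptions: the whole text is in a NUL-terminated buffer, every entry inside a table has the form [key] = value, keys contain no ']', and strings may use backslash escapes. It searches the text directly and skips nested tables recursively instead of building the structs above. The names find_key, scan_table, skip_value, skip_string, skip_ws and key_matches are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *skip_value(const char *p);

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

/* p points at an opening quote; returns the position after the closing
 * quote, or NULL if the string is unterminated. */
static const char *skip_string(const char *p)
{
    for (p++; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;
        else if (*p == '"')
            return p + 1;
    }
    return NULL;
}

/* Compare the key text in [key, end) with `want`, ignoring trailing
 * blanks and one pair of surrounding double quotes. */
static int key_matches(const char *key, const char *end, const char *want)
{
    size_t len;

    while (end > key && isspace((unsigned char)end[-1]))
        end--;
    if (end - key >= 2 && *key == '"' && end[-1] == '"') {
        key++;
        end--;
    }
    len = (size_t)(end - key);
    return strlen(want) == len && strncmp(key, want, len) == 0;
}

/* p points at '{'. Walks the entries of that table. If `want` is not NULL
 * and a key matches, sets *hit and returns the start of its value.
 * Otherwise returns the position after the closing '}', or NULL if the
 * table is malformed. */
static const char *scan_table(const char *p, const char *want, int *hit)
{
    p = skip_ws(p + 1);
    while (*p != '\0' && *p != '}') {
        const char *key, *end;

        if (*p != '[')
            return NULL;
        key = skip_ws(p + 1);
        end = strchr(key, ']');
        if (end == NULL)
            return NULL;
        p = skip_ws(end + 1);
        if (*p != '=')
            return NULL;
        p = skip_ws(p + 1);
        if (want != NULL && key_matches(key, end, want)) {
            *hit = 1;
            return p;
        }
        p = skip_value(p);              /* recursion happens here */
        if (p == NULL)
            return NULL;
        p = skip_ws(p);
        if (*p == ',')
            p = skip_ws(p + 1);
    }
    return *p == '}' ? p + 1 : NULL;
}

/* Skip one value: a string, a nested table, or a bare number. */
static const char *skip_value(const char *p)
{
    int hit = 0;

    if (*p == '"')
        return skip_string(p);
    if (*p == '{')
        return scan_table(p, NULL, &hit);
    while (*p != '\0' && *p != ',' && *p != '}' && !isspace((unsigned char)*p))
        p++;
    return p;
}

/* Returns a pointer to the value stored under `want` in the table that
 * starts at `table` (its '{'), or NULL if the key is not found. */
const char *find_key(const char *table, const char *want)
{
    int hit = 0;
    const char *p;

    table = skip_ws(table);
    if (*table != '{')
        return NULL;
    p = scan_table(table, want, &hit);
    return hit ? p : NULL;
}
```

To treat the top level specially, a caller could locate the first '{' with strchr(), call find_key() on it with the realm name, and then call find_key() again on the returned table with "auctions". Each call returns a pointer to the start of the value, so nested lookups chain naturally.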

This is all off the top of my head, so there are probably things
I haven't thought of or otherwise got wrong, but you should get
the idea.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #7

"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...
> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:

<irony>If only there were a Practical Extraction and Report Language
(Perl) you could use...</irony>

--
Mabden
Oct 1 '06 #8
