Decoding a 30 MB formatted text file

My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:

AuctioneerSnapshotDB = {
	["nordrassil-neutral"] = {
		["nextAuctionId"] = 20,
		["version"] = 1,
		["updates"] = {
			[1] = "15416.012;;0;0;0;0;0;0",
		},
		["auctions"] = {
			[1] = "16717;0;0;0;1;1650000;1650000;Boneglay;0;0;3;1159391569;1159420369",
			[2] = "6661;0;0;0;1;399900;599900;Krius;0;0;2;1159391569;1159398769",
			[3] = "6657;0;0;1289192110;1;7300;7900;Bootyboy;0;0;4;1159391569;1159477969",
			[19] = "9865;1191;0;680935487;1;5013;8000;Warmist;0;0;1;1159391569;1159393369",
		},
		["ahKey"] = "nordrassil-neutral",

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.

A pseudocode version might read:
do
    copy the next line into a temporary string
    if we have found the keyword "nordrassil-neutral" and have found
    the keyword "auctions"
        if the line contains 10 ";"
            populate the struct
while we have not found the keyword "ahKey"

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document or whether there is a better way. I can see that there is a
rational structure but can't see how to use the formatted text better
than my brute-force counting approach.
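
For what it's worth, the brute-force version I have in mind would look
roughly like this in C. It is untested, and the auction struct and its
fields are just placeholders until I understand the record format:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder record -- the real field layout is still a guess. */
struct auction {
    long item_id;
    char raw[512];              /* keep the raw record for later parsing */
};

/* Count occurrences of c in s. */
static int count_char(const char *s, int c)
{
    int n = 0;
    while ((s = strchr(s, c)) != NULL) {
        n++;
        s++;
    }
    return n;
}

int scan_snapshot(FILE *fp)
{
    char line[1024];
    int in_realm = 0, in_auctions = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, "\"nordrassil-neutral\"") != NULL)
            in_realm = 1;
        if (in_realm && strstr(line, "\"auctions\"") != NULL)
            in_auctions = 1;
        if (in_realm && strstr(line, "\"ahKey\"") != NULL)
            return 0;                       /* end of the region of interest */

        if (in_auctions && count_char(line, ';') >= 10) {
            const char *start = strchr(line, '"');
            if (start != NULL) {
                struct auction a;
                a.item_id = strtol(start + 1, NULL, 10);
                strncpy(a.raw, start + 1, sizeof a.raw - 1);
                a.raw[sizeof a.raw - 1] = '\0';
                /* populate / store the struct here */
            }
        }
    }
    return 0;
}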

BTW, I have asked for guidance on the format - decoding it myself is
the only option.

Sep 29 '06 #1
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

[example data snipped]

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.
If you want the entire document contents, I would parse this based on
the structure, not the content. Strip the whitespace and build a tree
based on the objects (delimited by {}) and elements (delimited by []).

A first pass to build the tree can also be used to validate the
structure (matching [] and {}). Once you have the tree you can scan for
the required elements, ignoring the layout.
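
A minimal, untested first pass in C might look like this. It only checks
that the nesting matches (and skips over quoted strings); recording where
each object starts so the tree can be built is left out:

#include <stdio.h>

/*
 * First pass: verify that every '{' and '[' outside a quoted string has a
 * matching closer.  String escapes are ignored because the sample data does
 * not appear to use any.
 */
int check_structure(FILE *fp)
{
    int c, braces = 0, brackets = 0, in_string = 0;

    while ((c = fgetc(fp)) != EOF) {
        if (in_string) {                  /* ignore delimiters inside "..." */
            if (c == '"')
                in_string = 0;
            continue;
        }
        switch (c) {
        case '"': in_string = 1; break;
        case '{': braces++;      break;
        case '[': brackets++;    break;
        case '}': if (--braces   < 0) return -1; break;
        case ']': if (--brackets < 0) return -1; break;
        }
    }
    return (braces == 0 && brackets == 0 && !in_string) ? 0 : -1;
}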

--
Ian Collins.
Sep 29 '06 #2

"pkirk25" <pa*****@kirks.netwrote in message
news:11*********************@i3g2000cwc.googlegrou ps.com...
But my question is if this is the right approach to a structured
document or is there a better way?
"better way?" Maybe, maybe not. Although C is my preferred method, there
are methods other than C. I've used AWK or VI with great success in the
past solving problems similar to yours. AWK is well suited to processing
formatted text data and because it can accept input files it can be
automated. Also, you can use it to readily reformat the desired output so
it can be read in easily by C (i.e., think text preprocessor...). If it's a
one- or two-time deal, I'd use VI and brute-force editing, _if_ the largest
unique piece of text that needs to be processed is a single line. Again,
once it's in an easier to use format, then read it in via C.
Rod Pemberton
Sep 29 '06 #3
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

[example and pseudocode snipped]

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document or whether there is a better way. I can see that there is a
rational structure but can't see how to use the formatted text better
than my brute-force counting approach.
It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.

If you rely on counting tabs and semi-colons or the order of the
keys, you can be sure that the people maintaining the program
that generates the file will change those details without
bothering to tell you. The basic structure could change, too, of
course, but that is less likely.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #4
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:
[...]
BTW, I have asked for guidance on the format - decoding it myself is
the only option.
This might be a time to learn parser generators such as yacc; they
can make writing a parser much easier than doing it by hand.
Sep 30 '06 #5
It looks like the file is a collection of key-value pairs. [...] Your
best bet would be to parse the file based on that structure.
Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of the useful data
key3 = "ahKey" means you are now out of the region

a bool that controls whether to put data into structs

read the file line by line
when key0, key1 and key2 have been found, set the bool to true
work with the data until key3 is found, at which point we exit the function

Or were you thinking of searching for curly braces?

Sep 30 '06 #6
pkirk25 wrote:
>>It looks like the file is a collection of key-value pairs. [...] Your
best bet would be to parse the file based on that structure.

Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of the useful data
key3 = "ahKey" means you are now out of the region

a bool that controls whether to put data into structs

read the file line by line
when key0, key1 and key2 have been found, set the bool to true
work with the data until key3 is found, at which point we exit the function

Or were you thinking of searching for curly braces?
Not just curly braces, but all of the syntax of the file. Read
the file token by token, not line by line, since for all you know
there may be some line breaks in the middle of some element (I am
assuming from what you said in the initial post that the format
is undocumented, which means that you should make as few
assumptions as possible about it).

You might want to use data structures along these lines:

struct key {
    int k_type;
    union {
        int k_int;
        char *k_string;
    } u;
};

struct key_value;

struct value {
    int v_type;
    union {
        int k_int;
        char *k_string;
        struct key_value *k_keyvalue;
    } u;
};

struct key_value {
    struct key *kv_key;
    struct value *kv_value;
};

If there is only one of them, you might be able to treat the top
level (AuctioneerSnapshotDB) specially. If so, write a function
to search for a particular key and return its associated value.
This function will call a recursive function to read in a
key-value pair (it has to be recursive because the value may be
another key-value pair). Once you find the right key and get its
value, then worry about getting the subkeys you want.

This is all off the top of my head, so there are probably things
I haven't thought of or otherwise got wrong, but you should get
the idea.
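
A rough, untested sketch of that recursive reader is below. To keep it
short I have embedded the key and value directly in key_value instead of
pointing to them, and added a kv_next link so a table can hold a list of
pairs; most error handling and cleanup is omitted:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { T_INT, T_STRING, T_TABLE };      /* tag values for k_type / v_type */

struct key_value;

struct key {
    int k_type;
    union { int k_int; char *k_string; } u;
};

struct value {
    int v_type;
    union { int k_int; char *k_string; struct key_value *k_keyvalue; } u;
};

struct key_value {
    struct key kv_key;                  /* embedded rather than pointed to */
    struct value kv_value;
    struct key_value *kv_next;          /* next pair in the same table */
};

static int nextc(FILE *fp)              /* next non-whitespace character */
{
    int c;
    do { c = fgetc(fp); } while (c != EOF && isspace(c));
    return c;
}

static char *read_quoted(FILE *fp)      /* opening '"' has been consumed */
{
    char buf[256], *s;
    size_t n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF && c != '"' && n < sizeof buf - 1)
        buf[n++] = (char)c;
    buf[n] = '\0';
    s = malloc(n + 1);
    if (s != NULL)
        memcpy(s, buf, n + 1);
    return s;
}

static struct key_value *parse_table(FILE *fp);

static int parse_value(FILE *fp, struct value *v)
{
    int c = nextc(fp);

    if (c == '{') {                     /* nested table: recurse */
        v->v_type = T_TABLE;
        v->u.k_keyvalue = parse_table(fp);
    } else if (c == '"') {
        v->v_type = T_STRING;
        v->u.k_string = read_quoted(fp);
    } else if (c == '-' || isdigit(c)) {
        ungetc(c, fp);
        v->v_type = T_INT;
        if (fscanf(fp, "%d", &v->u.k_int) != 1)
            return -1;
    } else {
        return -1;                      /* unexpected character */
    }
    return 0;
}

/* Parse "[key] = value," pairs until the '}' that closes this table. */
static struct key_value *parse_table(FILE *fp)
{
    struct key_value *head = NULL, **tail = &head;
    int c;

    while ((c = nextc(fp)) == '[') {    /* anything else ends the table */
        struct key_value *kv = calloc(1, sizeof *kv);
        if (kv == NULL)
            break;
        c = nextc(fp);
        if (c == '"') {
            kv->kv_key.k_type = T_STRING;
            kv->kv_key.u.k_string = read_quoted(fp);
        } else {
            ungetc(c, fp);
            kv->kv_key.k_type = T_INT;
            if (fscanf(fp, "%d", &kv->kv_key.u.k_int) != 1) { free(kv); break; }
        }
        nextc(fp);                      /* ']' */
        nextc(fp);                      /* '=' */
        if (parse_value(fp, &kv->kv_value) != 0) { free(kv); break; }
        *tail = kv;
        tail = &kv->kv_next;
        if (nextc(fp) == '}')           /* ',' between pairs, or final '}' */
            break;
    }
    return head;
}

The top-level "AuctioneerSnapshotDB = {" line would be handled separately,
as described above: skip ahead to the first '{', call parse_table(), and
then walk the kv_next lists looking for keys such as "auctions".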

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #7
"pkirk25" <pa*****@kirks.netwrote in message
news:11*********************@i3g2000cwc.googlegrou ps.com...
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:
<irony>If only there were a Practical Extraction and Reporting Language
(Perl) you could use...</irony>

--
Mabden
Oct 1 '06 #8
