Decoding a 30 MB formatted text file

My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:

AuctioneerSnapshotDB = {
	["nordrassil-neutral"] = {
		["nextAuctionId"] = 20,
		["version"] = 1,
		["updates"] = {
			[1] = "15416.012;;0;0;0;0;0;0",
		},
		["auctions"] = {
			[1] = "16717;0;0;0;1;1650000;1650000;Boneglay;0;0;3;1159391569;1159420369",
			[2] = "6661;0;0;0;1;399900;599900;Krius;0;0;2;1159391569;1159398769",
			[3] = "6657;0;0;1289192110;1;7300;7900;Bootyboy;0;0;4;1159391569;1159477969",
			[19] = "9865;1191;0;680935487;1;5013;8000;Warmist;0;0;1;1159391569;1159393369",
		},
		["ahKey"] = "nordrassil-neutral",

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.

A pseudocode version might read:
do
    copy the next line into a temporary string
    if we have found the keyword "nordrassil-neutral" and have found
    the keyword "auctions"
        if the line contains 10 ";"
            populate the struct
while we have not found the keyword "ahKey"

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document or whether there is a better way. I can see that there is a
rational structure but can't see how to use the formatted text better
than my brute-force counting approach.
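
For what it's worth, the brute-force version I have in mind would look
roughly like this in C. It is untested, and the auction struct and its
fields are just placeholders until I understand the record format:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder record -- the real field layout is still a guess. */
struct auction {
    long item_id;
    char raw[512];              /* keep the raw record for later parsing */
};

/* Count occurrences of c in s. */
static int count_char(const char *s, int c)
{
    int n = 0;
    while ((s = strchr(s, c)) != NULL) {
        n++;
        s++;
    }
    return n;
}

int scan_snapshot(FILE *fp)
{
    char line[1024];
    int in_realm = 0, in_auctions = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, "\"nordrassil-neutral\"") != NULL)
            in_realm = 1;
        if (in_realm && strstr(line, "\"auctions\"") != NULL)
            in_auctions = 1;
        if (in_realm && strstr(line, "\"ahKey\"") != NULL)
            return 0;                       /* end of the region of interest */

        if (in_auctions && count_char(line, ';') >= 10) {
            const char *start = strchr(line, '"');
            if (start != NULL) {
                struct auction a;
                a.item_id = strtol(start + 1, NULL, 10);
                strncpy(a.raw, start + 1, sizeof a.raw - 1);
                a.raw[sizeof a.raw - 1] = '\0';
                /* populate / store the struct here */
            }
        }
    }
    return 0;
}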

BTW, I have asked for guidance on the format - decoding it myself is
the only option.

Sep 29 '06 #1
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

[example data snipped]

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.
If you want the entire document contents, I would parse this based on
the structure, not the content. Strip the whitespace and build a tree
based on the objects (delimited by {}) and elements (delimited by []).

A first pass to build the tree can also be used to validate the
structure (matching [] and {}). Once you have the tree you can scan for
the required elements, ignoring the layout.
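
A minimal, untested first pass in C might look like this. It only checks
that the nesting matches (and skips over quoted strings); recording where
each object starts so the tree can be built is left out:

#include <stdio.h>

/*
 * First pass: verify that every '{' and '[' outside a quoted string has a
 * matching closer.  String escapes are ignored because the sample data does
 * not appear to use any.
 */
int check_structure(FILE *fp)
{
    int c, braces = 0, brackets = 0, in_string = 0;

    while ((c = fgetc(fp)) != EOF) {
        if (in_string) {                  /* ignore delimiters inside "..." */
            if (c == '"')
                in_string = 0;
            continue;
        }
        switch (c) {
        case '"': in_string = 1; break;
        case '{': braces++;      break;
        case '[': brackets++;    break;
        case '}': if (--braces   < 0) return -1; break;
        case ']': if (--brackets < 0) return -1; break;
        }
    }
    return (braces == 0 && brackets == 0 && !in_string) ? 0 : -1;
}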

--
Ian Collins.
Sep 29 '06 #2

"pkirk25" <pa*****@kirks.netwrote in message
news:11*********************@i3g2000cwc.googlegrou ps.com...
But my question is if this is the right approach to a structured
document or is there a better way?
"better way?" Maybe, maybe not. Although C is my preferred method, there
are methods other than C. I've used AWK or VI with great success in the
past solving problems similar to yours. AWK is well suited to processing
formatted text data and because it can accept input files it can be
automated. Also, you can use it to readily reformat the desired output so
it can be read in easily by C (i.e., think text preprocessor...). If it's a
one- or two-time deal, I'd use VI and brute-force editing, _if_ the largest
unique piece of text that needs to be processed is a single line. Again,
once it's in an easier to use format, then read it in via C.
Rod Pemberton
Sep 29 '06 #3
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

[example and pseudocode snipped]

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document or whether there is a better way. I can see that there is a
rational structure but can't see how to use the formatted text better
than my brute-force counting approach.
It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.

If you rely on counting tabs and semi-colons or the order of the
keys, you can be sure that the people maintaining the program
that generates the file will change those details without
bothering to tell you. The basic structure could change, too, of
course, but that is less likely.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #4
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:
[...]
BTW, I have asked for guidance on the format - decoding it myself is
the only option.
This might be a time to learn parser generators such as yacc; they
can make writing a parser much easier than doing it by hand.
Sep 30 '06 #5
It looks like the file is a collection of key-value pairs. [...] Your
best bet would be to parse the file based on that structure.
Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of the useful data
key3 = "ahKey" means you are now out of the region

a bool that controls whether to put data into structs

read the file line by line
when key0, key1 and key2 have been found, set the bool to true
work with the data until key3 is found, at which point we exit the function

Or were you thinking of searching for curly braces?

Sep 30 '06 #6
pkirk25 wrote:
>>It looks like the file is a collection of key-value pairs. [...] Your
best bet would be to parse the file based on that structure.

Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of the useful data
key3 = "ahKey" means you are now out of the region

a bool that controls whether to put data into structs

read the file line by line
when key0, key1 and key2 have been found, set the bool to true
work with the data until key3 is found, at which point we exit the function

Or were you thinking of searching for curly braces?
Not just curly braces, but all of the syntax of the file. Read
the file token by token, not line by line, since for all you know
there may be some line breaks in the middle of some element (I am
assuming from what you said in the initial post that the format
is undocumented, which means that you should make as few
assumptions as possible about it).

You might want to use data structures along these lines:

struct key {
    int k_type;
    union {
        int k_int;
        char *k_string;
    } u;
};

struct key_value;

struct value {
    int v_type;
    union {
        int k_int;
        char *k_string;
        struct key_value *k_keyvalue;
    } u;
};

struct key_value {
    struct key *kv_key;
    struct value *kv_value;
};

If there is only one of them, you might be able to treat the top
level (AuctioneerSnapshotDB) specially. If so, write a function
to search for a particular key and return its associated value.
This function will call a recursive function to read in a
key-value pair (it has to be recursive because the value may be
another key-value pair). Once you find the right key and get its
value, then worry about getting the subkeys you want.

This is all off the top of my head, so there are probably things
I haven't thought of or otherwise got wrong, but you should get
the idea.
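
A rough, untested sketch of that recursive reader is below. To keep it
short I have embedded the key and value directly in key_value instead of
pointing to them, and added a kv_next link so a table can hold a list of
pairs; most error handling and cleanup is omitted:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { T_INT, T_STRING, T_TABLE };      /* tag values for k_type / v_type */

struct key_value;

struct key {
    int k_type;
    union { int k_int; char *k_string; } u;
};

struct value {
    int v_type;
    union { int k_int; char *k_string; struct key_value *k_keyvalue; } u;
};

struct key_value {
    struct key kv_key;                  /* embedded rather than pointed to */
    struct value kv_value;
    struct key_value *kv_next;          /* next pair in the same table */
};

static int nextc(FILE *fp)              /* next non-whitespace character */
{
    int c;
    do { c = fgetc(fp); } while (c != EOF && isspace(c));
    return c;
}

static char *read_quoted(FILE *fp)      /* opening '"' has been consumed */
{
    char buf[256], *s;
    size_t n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF && c != '"' && n < sizeof buf - 1)
        buf[n++] = (char)c;
    buf[n] = '\0';
    s = malloc(n + 1);
    if (s != NULL)
        memcpy(s, buf, n + 1);
    return s;
}

static struct key_value *parse_table(FILE *fp);

static int parse_value(FILE *fp, struct value *v)
{
    int c = nextc(fp);

    if (c == '{') {                     /* nested table: recurse */
        v->v_type = T_TABLE;
        v->u.k_keyvalue = parse_table(fp);
    } else if (c == '"') {
        v->v_type = T_STRING;
        v->u.k_string = read_quoted(fp);
    } else if (c == '-' || isdigit(c)) {
        ungetc(c, fp);
        v->v_type = T_INT;
        if (fscanf(fp, "%d", &v->u.k_int) != 1)
            return -1;
    } else {
        return -1;                      /* unexpected character */
    }
    return 0;
}

/* Parse "[key] = value," pairs until the '}' that closes this table. */
static struct key_value *parse_table(FILE *fp)
{
    struct key_value *head = NULL, **tail = &head;
    int c;

    while ((c = nextc(fp)) == '[') {    /* anything else ends the table */
        struct key_value *kv = calloc(1, sizeof *kv);
        if (kv == NULL)
            break;
        c = nextc(fp);
        if (c == '"') {
            kv->kv_key.k_type = T_STRING;
            kv->kv_key.u.k_string = read_quoted(fp);
        } else {
            ungetc(c, fp);
            kv->kv_key.k_type = T_INT;
            if (fscanf(fp, "%d", &kv->kv_key.u.k_int) != 1) { free(kv); break; }
        }
        nextc(fp);                      /* ']' */
        nextc(fp);                      /* '=' */
        if (parse_value(fp, &kv->kv_value) != 0) { free(kv); break; }
        *tail = kv;
        tail = &kv->kv_next;
        if (nextc(fp) == '}')           /* ',' between pairs, or final '}' */
            break;
    }
    return head;
}

The top-level "AuctioneerSnapshotDB = {" line would be handled separately,
as described above: skip ahead to the first '{', call parse_table(), and
then walk the kv_next lists looking for keys such as "auctions".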

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #7
"pkirk25" <pa*****@kirks.netwrote in message
news:11*********************@i3g2000cwc.googlegrou ps.com...
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:
<irony>If only there were a Practical Extraction and Reporting Language
(Perl) you could use...</irony>

--
Mabden
Oct 1 '06 #8
