Decoding a 30 MB formatted text file

pkirk25

My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.

It is machine generated and is nicely formatted. Example text follows:

AuctioneerSnapshotDB = {
    ["nordrassil-neutral"] = {
        ["nextAuctionId"] = 20,
        ["version"] = 1,
        ["updates"] = {
            [1] = "15416.012;;0;0;0;0;0;0",
        },
        ["auctions"] = {
            [1] =
                "16717;0;0;0;1;1650000;1650000;Boneglay;0;0;3;1159391569;1159420369",
            [2] =
                "6661;0;0;0;1;399900;599900;Krius;0;0;2;1159391569;1159398769",
            [3] =
                "6657;0;0;1289192110;1;7300;7900;Bootyboy;0;0;4;1159391569;1159477969",
            [19] =
                "9865;1191;0;680935487;1;5013;8000;Warmist;0;0;1;1159391569;1159393369",
        },
        ["ahKey"] = "nordrassil-neutral",

I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.

A pseudocode version might read:

do {
    Copy each line into a temporary string
    If we have found the keyword "nordrassil-neutral" and have found the
    keyword "auctions"
        if the line contains 10 ";"
            populate the struct
} while we have not found the keyword "ahKey"

I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document, or is there a better way? I can see that there is a rational
structure but can't see how to use the formatted text better than my
brute-force counting approach.

BTW, I have asked for guidance on the format - decoding it myself is
the only option.
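
For the "populate the struct" step, a minimal sketch of splitting one record on ";" might look like the following. The function name split_fields is hypothetical and the meaning of each field is not known from the sample, so this only separates the fields. It keeps empty fields such as the one in "15416.012;;0", which strtok() would skip, and it returns the field count so the caller can check it rather than rely on a fixed number of separators.

```c
#include <stddef.h>
#include <string.h>

/* Split `record` in place on ';' and store a pointer to each field in
 * `fields`. Empty fields are kept. Returns the total number of fields
 * found; at most `max_fields` pointers are stored.
 * (split_fields is a hypothetical helper, not part of any library.) */
size_t split_fields(char *record, char *fields[], size_t max_fields)
{
    size_t count = 0;
    char *start = record;

    for (;;) {
        char *sep = strchr(start, ';');

        if (count < max_fields)
            fields[count] = start;
        count++;
        if (sep == NULL)
            break;             /* last field: no separator follows */
        *sep = '\0';           /* terminate this field */
        start = sep + 1;       /* next field begins after the separator */
    }
    return count;
}
```

The record has to be a writable copy (for example the temporary string from the pseudocode), because the separators are overwritten with terminators.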

Sep 29 '06 #1

Ian Collins

pkirk25 wrote:

> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:
>
> [...]
>
> I think I will be able to find what I want and populate my structs by
> looking for keywords like "nordrassil-neutral" and "ahKey". The code
> is not pretty. In fact, it seems to have words like "Fragile - Handle
> with Care" stamped all over it.

If you want the entire document contents, I would parse this based on
the structure, not the content. Strip the whitespace and build a tree
based on the objects (delimited by {}) and elements (delimited by []).

A first pass to build the tree can also be used to validate the
structure (matching [] and {}). Once you have the tree you can scan for
the required elements, ignoring the layout.
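
As a rough illustration of the validation pass only (not the tree building), a sketch along these lines checks that every { and [ is closed in the right order. The name check_nesting and the MAX_DEPTH limit are made up for the example, and it assumes that strings are double-quoted with backslash escapes, which the sample does not confirm.

```c
#include <stddef.h>

#define MAX_DEPTH 256   /* arbitrary nesting limit for this sketch */

/* Returns 1 if every '{' and '[' in `text` is closed by the matching
 * '}' or ']' in the right order, 0 otherwise. Characters inside
 * double-quoted strings are ignored (assumed backslash escapes). */
int check_nesting(const char *text)
{
    char stack[MAX_DEPTH];
    int depth = 0;
    int in_string = 0;
    const char *p;

    for (p = text; *p != '\0'; p++) {
        if (in_string) {
            if (*p == '\\' && p[1] != '\0')
                p++;                    /* skip the escaped character */
            else if (*p == '"')
                in_string = 0;
            continue;
        }
        switch (*p) {
        case '"':
            in_string = 1;
            break;
        case '{':
        case '[':
            if (depth == MAX_DEPTH)
                return 0;               /* deeper than this sketch supports */
            stack[depth++] = *p;
            break;
        case '}':
        case ']':
            if (depth == 0)
                return 0;               /* closer with nothing open */
            depth--;
            if (stack[depth] != (*p == '}' ? '{' : '['))
                return 0;               /* wrong kind of closer */
            break;
        default:
            break;
        }
    }
    return depth == 0 && !in_string;
}
```

The same loop could read from a FILE with getc() instead of a string, so a 30 MB file would not have to be held in memory just for the check.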

--
Ian Collins.

Sep 29 '06 #2

Rod Pemberton

"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...

> But my question is if this is the right approach to a structured
> document or is there a better way?

"better way?" Maybe, maybe not. Although C is my preferred method, there
are methods other than C. I've used AWK or VI with great success in the
past solving problems similar to yours. AWK is well suited to processing
formatted text data and because it can accept input files it can be
automated. Also, you can use it to readily reformat the desired output so
it can be read in easily by C (i.e., think text preprocessor...). If it's a
one or two time deal, I'd use VI and brute force editing, _if_ the largest
unique piece of text that needs to be processed is a single line. Again,
once it's in an easier to use format, then read it in via C.
Rod Pemberton

Sep 29 '06 #3

T.M. Sommers

pkirk25 wrote:

> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:
>
> [...]
>
> I think I will be able to find what I want and populate my structs by
> looking for keywords like "nordrassil-neutral" and "ahKey". The code
> is not pretty. In fact, it seems to have words like "Fragile - Handle
> with Care" stamped all over it.
>
> A pseudocode version might read:
>
> do {
>     Copy each line into a temporary string
>     If we have found the keyword "nordrassil-neutral" and have found the
>     keyword "auctions"
>         if the line contains 10 ";"
>             populate the struct
> } while we have not found the keyword "ahKey"
>
> I can tell that there are 3 contiguous "\t" before each numbered line.
> But my question is whether this is the right approach to a structured
> document, or is there a better way? I can see that there is a rational
> structure but can't see how to use the formatted text better than my
> brute-force counting approach.

It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.
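
A minimal tokenizer for the structure described above could be sketched as follows. The names (tok_type, struct token, next_token) are invented for the example. It assumes double-quoted strings with backslash escapes and simple numbers (optional leading minus, digits, dots), neither of which the sample guarantees.

```c
#include <ctype.h>
#include <stddef.h>

enum tok_type {
    TOK_END, TOK_ERROR, TOK_IDENT, TOK_NUMBER, TOK_STRING,
    TOK_LBRACE, TOK_RBRACE, TOK_LBRACKET, TOK_RBRACKET,
    TOK_EQUALS, TOK_COMMA
};

struct token {
    enum tok_type type;
    const char *text;   /* start of the token; for strings, after the quote */
    size_t len;         /* length in characters; for strings, without quotes */
};

/* Read the next token starting at *cursor and advance *cursor past it.
 * Whitespace, including line breaks, is skipped between tokens. */
struct token next_token(const char **cursor)
{
    const char *p = *cursor;
    struct token t;

    while (isspace((unsigned char)*p))
        p++;
    t.text = p;

    if (*p == '\0') {
        t.type = TOK_END;
    } else if (*p == '"') {
        t.text = ++p;
        while (*p != '\0' && *p != '"') {
            if (*p == '\\' && p[1] != '\0')
                p++;                    /* skip the escaped character */
            p++;
        }
        t.type = (*p == '"') ? TOK_STRING : TOK_ERROR;  /* unterminated */
        t.len = (size_t)(p - t.text);
        if (*p == '"')
            p++;                        /* step over the closing quote */
        *cursor = p;
        return t;
    } else if (isdigit((unsigned char)*p) || *p == '-') {
        p++;
        while (isdigit((unsigned char)*p) || *p == '.')
            p++;
        t.type = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.type = TOK_IDENT;
    } else {
        switch (*p++) {
        case '{': t.type = TOK_LBRACE;   break;
        case '}': t.type = TOK_RBRACE;   break;
        case '[': t.type = TOK_LBRACKET; break;
        case ']': t.type = TOK_RBRACKET; break;
        case '=': t.type = TOK_EQUALS;   break;
        case ',': t.type = TOK_COMMA;    break;
        default:  t.type = TOK_ERROR;    break;
        }
    }
    t.len = (size_t)(p - t.text);
    *cursor = p;
    return t;
}
```

Because whitespace is skipped between tokens, a record that starts on the line after "[1] =" is handled the same way as one on the same line.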

If you rely on counting tabs and semi-colons or the order of the
keys, you can be sure that the people maintaining the program
that generates the file will change those details without
bothering to tell you. The basic structure could change, too, of
course, but that is less likely.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #4

Nils O. Selåsdal

pkirk25 wrote:

> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:
>
> [...]
>
> BTW, I have asked for guidance on the format - decoding it myself is
> the only option.

This might be a time to learn parser generators such as yacc; they
can make writing a parser much easier than doing it by hand.

Sep 30 '06 #5

pkirk25

> It looks like the file is a collection of key-value pairs. The
> keys are either identifiers (the top level only, apparently), or
> strings or numbers enclosed in square brackets. The values are
> either numbers, strings, or lists of key-value pairs enclosed in
> curly braces and separated by commas. Your best bet would be to
> parse the file based on that structure.

Does something along these lines seem like the correct approach?

key0 = "AuctioneerSnapshotDB" is start of region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is start of useful data
key3 = "ahKey" means you are now out of the region

bool to put data into structs

read file line by line
when found key0, key1 and key2, bool to put data into structs = true
work with data until key3 found, at which point we exit the function

Or were you thinking of searching for curly braces?

Sep 30 '06 #6

T.M. Sommers

pkirk25 wrote:

>> It looks like the file is a collection of key-value pairs. The
>> keys are either identifiers (the top level only, apparently), or
>> strings or numbers enclosed in square brackets. The values are
>> either numbers, strings, or lists of key-value pairs enclosed in
>> curly braces and separated by commas. Your best bet would be to
>> parse the file based on that structure.
>
> Does something along these lines seem like the correct approach?
>
> key0 = "AuctioneerSnapshotDB" is start of region
> key1 = random_input_string, for example "draenor-neutral"
> key2 = "auctions" is start of useful data
> key3 = "ahKey" means you are now out of the region
>
> bool to put data into structs
>
> read file line by line
> when found key0, key1 and key2, bool to put data into structs = true
> work with data until key3 found, at which point we exit the function
>
> Or were you thinking of searching for curly braces?

Not just curly braces, but all of the syntax of the file. Read
the file token by token, not line by line, since for all you know
there may be some line breaks in the middle of some element (I am
assuming from what you said in the initial post that the format
is undocumented, which means that you should make as few
assumptions as possible about it).

You might want to use data structures along these lines:

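/* A key is either a number or a string. */
struct key {
    int k_type;                 /* says which union member is valid */
    union {
        int k_int;
        char *k_string;
    } u;
};

struct key_value;

/* A value is a number, a string, or nested key-value pairs. */
struct value {
    int v_type;                 /* says which union member is valid */
    union {
        int v_int;
        char *v_string;
        struct key_value *v_keyvalue;
    } u;
};

struct key_value {
    struct key *kv_key;
    struct value *kv_value;
};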

If there is only one of them, you might be able to treat the top
level (AuctioneerSnapshotDB) specially. If so, write a function
to search for a particular key and return its associated value.
This function will call a recursive function to read in a
key-value pair (it has to be recursive because the value may be
another key-value pair). Once you find the right key and get its
value, then worry about getting the subkeys you want.
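
One way the search could be sketched, working directly on the text in memory rather than filling in the structs above, is shown below. All the names (find_key, walk_table and the skip_ helpers) are hypothetical. It assumes every key inside a table is written as [key], that strings are double-quoted with backslash escapes, and that pairs are separated by commas. The recursion happens when a value that is itself a table has to be skipped.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *walk_table(const char *p, const char *key);

static const char *skip_space(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

/* p points at an opening quote; returns the position just past the
 * closing quote, or NULL if the string is unterminated. */
static const char *skip_string(const char *p)
{
    for (p++; *p != '\0' && *p != '"'; p++)
        if (*p == '\\' && p[1] != '\0')
            p++;
    return *p == '"' ? p + 1 : NULL;
}

/* Skip one value: a string, a nested table (recursively), or a bare
 * number. Returns the position just past it, or NULL on error. */
static const char *skip_value(const char *p)
{
    const char *start;

    p = skip_space(p);
    if (*p == '"')
        return skip_string(p);
    if (*p == '{')
        return walk_table(p, NULL);
    start = p;
    while (*p != '\0' && *p != ',' && *p != '}' && !isspace((unsigned char)*p))
        p++;
    return p > start ? p : NULL;
}

/* Walk the pairs of the table that starts at '{'. With a non-NULL key,
 * return the start of the value stored under ["key"], or NULL if it is
 * absent. With a NULL key, return the position just past the '}'. */
static const char *walk_table(const char *p, const char *key)
{
    size_t keylen = key ? strlen(key) : 0;

    p = skip_space(p);
    if (*p != '{')
        return NULL;
    p = skip_space(p + 1);
    while (*p != '}') {
        int match = 0;

        if (*p != '[')
            return NULL;                /* also catches end of text */
        p = skip_space(p + 1);
        if (*p == '"') {
            const char *end = skip_string(p);
            if (end == NULL)
                return NULL;
            match = key != NULL && (size_t)(end - p) == keylen + 2
                    && strncmp(p + 1, key, keylen) == 0;
            p = end;
        } else {
            while (*p != '\0' && *p != ']')
                p++;                    /* numeric key: never matches */
        }
        p = skip_space(p);
        if (*p != ']')
            return NULL;
        p = skip_space(p + 1);
        if (*p != '=')
            return NULL;
        p = skip_space(p + 1);
        if (match)
            return p;
        p = skip_value(p);
        if (p == NULL)
            return NULL;
        p = skip_space(p);
        if (*p == ',')
            p = skip_space(p + 1);
    }
    return key ? NULL : p + 1;
}

/* Find a string key in the table starting at `table` (which must point
 * at its '{'); returns the start of its value or NULL. */
const char *find_key(const char *table, const char *key)
{
    return key ? walk_table(table, key) : NULL;
}
```

With the top level treated specially, the caller would locate the first '{' after the top-level name (with strchr, say), call find_key on it with the realm name, and then call find_key on the result with "auctions".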

This is all off the top of my head, so there are probably things
I haven't thought of or otherwise got wrong, but you should get
the idea.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Sep 30 '06 #7

Mabden

"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...

> My data is in a big file that I have no control over. Sometimes it's
> over 30 MB and often there are several of them.
>
> It is machine generated and is nicely formatted. Example text follows:

<irony>If only there were a Practical Extraction and Reporting Language
(Perl) you could use...</irony>

--
Mabden

Oct 1 '06 #8
