My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.
It is machine generated and is nicely formatted. Example text follows:
AuctioneerSnapshotDB = {
["nordrassil-neutral"] = {
["nextAuctionId"] = 20,
["version"] = 1,
["updates"] = {
[1] = "15416.012;;0;0;0;0;0;0",
},
["auctions"] = {
[1] =
"16717;0;0;0;1;1650000;1650000;Boneglay;0;0;3;1159391569;1159420369",
[2] =
"6661;0;0;0;1;399900;599900;Krius;0;0;2;1159391569;1159398769",
[3] =
"6657;0;0;1289192110;1;7300;7900;Bootyboy;0;0;4;1159391569;1159477969",
[19] =
"9865;1191;0;680935487;1;5013;8000;Warmist;0;0;1;1159391569;1159393369",
},
["ahKey"] = "nordrassil-neutral",
I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.
A pseudocode version might read:
do {
    Copy each line into a temporary string
    If we have found the keyword "nordrassil-neutral" and have found the
    keyword "auctions"
        if the line contains 10 ";"
            populate the struct
} while we have not found the keyword "ahKey"
I can tell that there are 3 contiguous "\t" before each numbered line.
But my question is whether this is the right approach to a structured
document or is there a better way? I can see that there is a rational
structure but can't see how to use the formatted text better than my
brute-force counting approach.
BTW, I have asked for guidance on the format - decoding it myself is
the only option.
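One way the "populate the struct" step could be made less fragile is to split each record on ";" and check the field count, rather than counting separators on a raw line. The sketch below is only an illustration: the names split_record and MAX_FIELDS are made up here, and nothing is assumed about what the individual fields mean. Note that the auction records in the sample above contain 13 fields (12 separators), and that the "updates" record has an empty field, which strtok() would silently drop.

```c
#include <stddef.h>

#define MAX_FIELDS 16   /* arbitrary limit chosen for this sketch */

/*
 * Split record in place on ';', storing a pointer to each field in
 * fields[]. Empty fields (as in "15416.012;;0") are kept.
 * Returns the number of fields, or -1 if there are more than max_fields.
 */
int split_record(char *record, char *fields[], int max_fields)
{
    int count = 0;
    char *p = record;

    for (;;) {
        if (count == max_fields)
            return -1;              /* too many fields for the caller's array */
        fields[count++] = p;
        while (*p != '\0' && *p != ';')
            p++;
        if (*p == '\0')
            break;                  /* last field reached */
        *p++ = '\0';                /* terminate this field, move to the next */
    }
    return count;
}
```

The caller passes a writable copy of the text between the quotes and can then reject any record whose field count is not what it expects, before converting individual fields with strtol() or similar.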
pkirk25 wrote:
[...]
I think I will be able to find what I want and populate my structs by
looking for keywords like "nordrassil-neutral" and "ahKey". The code
is not pretty. In fact, it seems to have words like "Fragile - Handle
with Care" stamped all over it.
If you want the entire document contents, I would parse this based on
the structure, not the content. Strip the whitespace and build a tree
based on the objects (delimited by {}) and elements (delimited by []).
A first pass to build the tree can also be used to validate the
structure (matching [] and {}). Once you have the tree you can scan for
the required elements, ignoring the layout.
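As a rough illustration of the validation pass described above, the following sketch checks that every { and [ has a matching } or ]. It is not a full tree builder. The name is_balanced and the MAX_DEPTH limit are invented for this example, and it assumes that strings are double-quoted and may use backslash escapes, which the undocumented format may or may not do.

```c
#define MAX_DEPTH 256   /* arbitrary nesting limit chosen for this sketch */

/*
 * Return 1 if every '{' and '[' in text has a matching '}' or ']',
 * 0 otherwise. Brackets inside double-quoted strings are ignored.
 */
int is_balanced(const char *text)
{
    char stack[MAX_DEPTH];
    int depth = 0;
    int in_string = 0;

    for (const char *p = text; *p != '\0'; p++) {
        if (in_string) {
            if (*p == '\\' && p[1] != '\0')
                p++;                    /* skip an escaped character (assumed syntax) */
            else if (*p == '"')
                in_string = 0;
            continue;
        }
        switch (*p) {
        case '"':
            in_string = 1;
            break;
        case '{':
        case '[':
            if (depth == MAX_DEPTH)
                return 0;               /* nested too deeply for this sketch */
            stack[depth++] = *p;
            break;
        case '}':
            if (depth == 0 || stack[--depth] != '{')
                return 0;
            break;
        case ']':
            if (depth == 0 || stack[--depth] != '[')
                return 0;
            break;
        default:
            break;
        }
    }
    return depth == 0 && !in_string;
}
```

Because the check ignores whitespace and line breaks entirely, it keeps working if the generator changes its indentation.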
--
Ian Collins.
"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...
But my question is if this is the right approach to a structured
document or is there a better way?
"better way?" Maybe, maybe not. Although C is my preferred method, there
are methods other than C. I've used AWK or VI with great success in the
past solving problems similar to yours. AWK is well suited to processing
formatted text data and because it can accept input files it can be
automated. Also, you can use it to readily reformat the desired output so
it can be read in easily by C (i.e., think text preprocessor...). If it's a
one or two time deal, I'd use VI and brute force editing, _if_ the largest
unique piece of text that needs to be processed is a single line. Again,
once it's in an easier to use format, then read it in via C.
Rod Pemberton
pkirk25 wrote:
[...]
But my question is whether this is the right approach to a structured
document or is there a better way? I can see that there is a rational
structure but can't see how to use the formatted text better than my
brute-force counting approach.
It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.
If you rely on counting tabs and semi-colons or the order of the
keys, you can be sure that the people maintaining the program
that generates the file will change those details without
bothering to tell you. The basic structure could change, too, of
course, but that is less likely.
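To make the structure described above concrete, here is a minimal tokenizer sketch that recognises identifiers, numbers, strings, and the punctuation { } [ ] = and ,. The type and function names (token_type, token, next_token) are invented for this example, and the handling of backslash escapes inside strings is an assumption, since the format is undocumented.

```c
#include <ctype.h>
#include <stddef.h>

enum token_type {
    TOK_END, TOK_ERROR, TOK_IDENT, TOK_NUMBER, TOK_STRING,
    TOK_LBRACE, TOK_RBRACE, TOK_LBRACKET, TOK_RBRACKET,
    TOK_EQUALS, TOK_COMMA
};

struct token {
    enum token_type type;
    const char *start;   /* first character of the token text */
    size_t length;       /* length of the text (quotes excluded for strings) */
};

/* Read the next token starting at *pp and advance *pp past it. */
struct token next_token(const char **pp)
{
    const char *p = *pp;
    struct token t = { TOK_END, NULL, 0 };

    while (isspace((unsigned char)*p))      /* whitespace and newlines are not significant */
        p++;
    t.start = p;

    if (*p == '\0') {
        t.type = TOK_END;
    } else if (*p == '"') {
        t.start = ++p;
        while (*p != '\0' && *p != '"') {
            if (*p == '\\' && p[1] != '\0')
                p++;                        /* skip an escaped character (assumed syntax) */
            p++;
        }
        if (*p == '"') {
            t.type = TOK_STRING;
            t.length = (size_t)(p - t.start);
            p++;
        } else {
            t.type = TOK_ERROR;             /* unterminated string */
        }
    } else if (isdigit((unsigned char)*p) || *p == '-') {
        p++;
        while (isdigit((unsigned char)*p) || *p == '.')
            p++;
        t.type = TOK_NUMBER;
        t.length = (size_t)(p - t.start);
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.type = TOK_IDENT;
        t.length = (size_t)(p - t.start);
    } else {
        switch (*p) {
        case '{': t.type = TOK_LBRACE;   break;
        case '}': t.type = TOK_RBRACE;   break;
        case '[': t.type = TOK_LBRACKET; break;
        case ']': t.type = TOK_RBRACKET; break;
        case '=': t.type = TOK_EQUALS;   break;
        case ',': t.type = TOK_COMMA;    break;
        default:  t.type = TOK_ERROR;    break;
        }
        t.length = 1;
        p++;
    }
    *pp = p;
    return t;
}
```

A parser built on top of this never sees tabs or line breaks, so changes to the file's layout do not affect it.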
--
Thomas M. Sommers -- tm*@nj.net -- AB2SB
pkirk25 wrote:
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.
It is machine generated and is nicely formatted. Example text follows:
[...]
BTW, I have asked for guidance on the format - decoding it myself is
the only option.
This might be a time to learn a parser generator such as yacc; it
can make writing a parser much easier than doing it by hand.
It looks like the file is a collection of key-value pairs. The
keys are either identifiers (the top level only, apparently), or
strings or numbers enclosed in square brackets. The values are
either numbers, strings, or lists of key-value pairs enclosed in
curly braces and separated by commas. Your best bet would be to
parse the file based on that structure.
Does something along these lines seem like the correct approach?
key0 = "AuctioneerSnapshotDB" is the start of the region
key1 = random_input_string, for example "draenor-neutral"
key2 = "auctions" is the start of the useful data
key3 = "ahKey" means you are now out of the region
a bool says whether to put data into structs
read the file line by line
when key0, key1 and key2 have been found, set the bool to true
work with the data until key3 is found, at which point we exit the function
Or were you thinking of searching for curly braces?
pkirk25 wrote:
[...]
Or were you thinking of searching for curly braces?
Not just curly braces, but all of the syntax of the file. Read
the file token by token, not line by line, since for all you know
there may be some line breaks in the middle of some element (I am
assuming from what you said in the initial post that the format
is undocumented, which means that you should make as few
assumptions as possible about it).
You might want to use data structures along these lines:
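struct key {
    int k_type;                 /* which member of u is in use */
    union {
        int k_int;              /* numeric key, e.g. [1] */
        char *k_string;         /* string key, e.g. ["auctions"] */
    } u;
};

struct key_value;

struct value {
    int v_type;                 /* which member of u is in use */
    union {
        int v_int;
        char *v_string;
        struct key_value *v_keyvalue;   /* nested key-value pairs */
    } u;
};

struct key_value {
    struct key *kv_key;
    struct value *kv_value;
};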
If there is only one of them, you might be able to treat the top
level (AuctioneerSnapshotDB) specially. If so, write a function
to search for a particular key and return its associated value.
This function will call a recursive function to read in a
key-value pair (it has to be recursive because the value may be
another key-value pair). Once you find the right key and get its
value, then worry about getting the subkeys you want.
This is all off the top of my head, so there are probably things
I haven't thought of or otherwise got wrong, but you should get
the idea.
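A minimal sketch of the search function described above might look like the following. Instead of building the structs, it walks the text directly: find_value looks for a key at one nesting level and returns a pointer to the start of its value, and skip_value recurses to step over nested tables. All of the function names are invented for this example, the key syntax is inferred from the posted sample, and backslash escapes in strings are an assumption.

```c
#include <ctype.h>
#include <string.h>

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

/* p points at an opening quote; return the character after the closing one. */
static const char *skip_string(const char *p)
{
    for (p++; *p != '\0' && *p != '"'; p++)
        if (*p == '\\' && p[1] != '\0')
            p++;                        /* escaped character (assumed syntax) */
    return *p == '"' ? p + 1 : NULL;
}

/* Read `key =`; set *name and *len to the key text; return the start of the value. */
static const char *read_key(const char *p, const char **name, size_t *len)
{
    p = skip_ws(p);
    if (*p == '[') {
        p = skip_ws(p + 1);
        if (*p == '"') {
            const char *end = skip_string(p);
            if (end == NULL)
                return NULL;
            *name = p + 1;
            *len = (size_t)(end - p - 2);
            p = end;
        } else {
            *name = p;
            while (*p != '\0' && *p != ']' && !isspace((unsigned char)*p))
                p++;
            *len = (size_t)(p - *name);
        }
        p = skip_ws(p);
        if (*p != ']')
            return NULL;
        p++;
    } else {
        *name = p;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        *len = (size_t)(p - *name);
        if (*len == 0)
            return NULL;
    }
    p = skip_ws(p);
    return *p == '=' ? skip_ws(p + 1) : NULL;
}

/* Skip one value (number, string, or nested table), recursing for tables. */
static const char *skip_value(const char *p)
{
    const char *name;
    size_t len;

    p = skip_ws(p);
    if (*p == '"')
        return skip_string(p);
    if (*p != '{') {                    /* number or other scalar */
        const char *start = p;
        while (*p != '\0' && *p != ',' && *p != '}' && !isspace((unsigned char)*p))
            p++;
        return p > start ? p : NULL;
    }
    p = skip_ws(p + 1);
    while (*p != '}') {
        if ((p = read_key(p, &name, &len)) == NULL)
            return NULL;
        if ((p = skip_value(p)) == NULL)
            return NULL;
        p = skip_ws(p);
        if (*p == ',')
            p = skip_ws(p + 1);
        else if (*p != '}')
            return NULL;
    }
    return p + 1;
}

/* table points at a '{'. Return the start of the value stored under key, or NULL. */
const char *find_value(const char *table, const char *key)
{
    const char *name;
    size_t len;
    const char *p = skip_ws(table);

    if (*p != '{')
        return NULL;
    p = skip_ws(p + 1);
    while (*p != '}') {
        if ((p = read_key(p, &name, &len)) == NULL)
            return NULL;
        if (len == strlen(key) && strncmp(name, key, len) == 0)
            return p;
        if ((p = skip_value(p)) == NULL)
            return NULL;
        p = skip_ws(p);
        if (*p == ',')
            p = skip_ws(p + 1);
        else if (*p != '}')
            return NULL;
    }
    return NULL;
}
```

Calls can be chained, one per nesting level: look up "nordrassil-neutral" in the top-level table, then "auctions" in the result, and so on. The order of the keys and the position of "ahKey" no longer matter.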
--
Thomas M. Sommers -- tm*@nj.net -- AB2SB
"pkirk25" <pa*****@kirks.net> wrote in message
news:11*********************@i3g2000cwc.googlegroups.com...
My data is in a big file that I have no control over. Sometimes it's
over 30 MB and often there are several of them.
It is machine generated and is nicely formatted. Example text follows:
<irony>If only there were a Practical Extraction and Reporting Language
(Perl) you could use...</irony>
--
Mabden