Indexing large data

Hi

I am having a problem indexing a large amount of textual data (in the
range of 200,000 to 1 million). I tried using a map container, but the
compiler just hung upon entering the data. After a bit of searching I
found a reference to a paper on "Signature files" by C. Faloutsos (1992).
I tried implementing it, but the results were far from satisfactory
(maybe I implemented it wrong). Anyway, I would appreciate it if anyone
could suggest a method to index large data (even using external
storage). Also, if anyone has successfully tested signature files, I
would welcome their comments.

Thanks in advance

Jul 23 '05 #1
ch************@gmail.com wrote:
I am having problem indexing large amount of textual data(in range of
200,000 to 1 million).
What kind of 'textual data'? Structured or unstructured? What does
'range of 200,000 to 1 million' mean?

For structured data, SQLite (http://www.sqlite.org/) is a widely used
library; for unstructured data, other free libraries probably exist.

Jul 23 '05 #2

I assume you mean creating a container like this:

word1 -> (doc1, pos1) -> (doc2, pos2) -> (doc3, pos3)
word2 -> (doc4, pos4)
etc

What size? A million words or a million documents?

Why do you believe the "hang" is in the container (or even the compiler
as you say)? Why not the parser? Do you mean "hang" or just "too slow"?

How long does it take to parse the documents? Why don't you try with 1
document, or 10 documents, to start with?

The STL should be able to cope with that, and it will be faster than
external storage. Unfortunately, you'd need to load and save your map
each time. Check http://tinyurl.com/77xax
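
One way to handle the load/save step is to write the map to a plain text file and read it back. The following is only a sketch under stated assumptions: the index is a std::map<std::string, std::list<int> >, words contain no whitespace, and the names Index, saveIndex, and loadIndex and the one-line-per-word format are hypothetical choices, not taken from the linked page.

```cpp
#include <istream>
#include <list>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Assumed index type: word -> ids of the documents containing it.
typedef std::map<std::string, std::list<int> > Index;

// Write one line per word: the word, then its document ids, space-separated.
// Assumes words contain no whitespace.
void saveIndex(const Index& index, std::ostream& out) {
    for (Index::const_iterator w = index.begin(); w != index.end(); ++w) {
        out << w->first;
        for (std::list<int>::const_iterator d = w->second.begin();
             d != w->second.end(); ++d)
            out << ' ' << *d;
        out << '\n';
    }
}

// Read the format written by saveIndex back into a fresh Index.
Index loadIndex(std::istream& in) {
    Index index;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string word;
        if (!(fields >> word)) continue;  // skip blank lines
        std::list<int>& docs = index[word];
        int doc;
        while (fields >> doc) docs.push_back(doc);
    }
    return index;
}
```

Both functions take generic streams, so a std::ofstream or std::ifstream opened on a file of your choosing can be passed in directly.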

As for a data structure, I would suggest

std::map<std::string, std::list<int> >

where "int" is your document id. To add a document:

index[word].push_back(doc);
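
Put together, that might look like the minimal sketch below. The names Index, addDocument, and lookup are hypothetical, and splitting on whitespace is an assumption about the parser; real text would also need punctuation and case handling.

```cpp
#include <list>
#include <map>
#include <sstream>
#include <string>

// word -> ids of the documents that contain it
typedef std::map<std::string, std::list<int> > Index;

// Split `text` on whitespace and record `doc` under every word.
// Assumes documents are added one at a time, so a word repeated within the
// same document finds `doc` already at the back of its list and is skipped.
void addDocument(Index& index, int doc, const std::string& text) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        std::list<int>& docs = index[word];
        if (docs.empty() || docs.back() != doc)
            docs.push_back(doc);
    }
}

// Return the documents containing `word`, or an empty list if it was never seen.
// Uses find() rather than operator[] so that a lookup never grows the map.
std::list<int> lookup(const Index& index, const std::string& word) {
    Index::const_iterator it = index.find(word);
    if (it == index.end())
        return std::list<int>();
    return it->second;
}
```

Starting with one or ten documents, as suggested above, and timing addDocument separately from the parser should show whether the container is really the bottleneck.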

Calum

Jul 23 '05 #3
On 4 Jun 2005 00:23:06 -0700, ch************@gmail.com wrote:
Anyways i would appreciate if anyone could
suggest method to index large data(even using external
storage).


Most commercial databases seem to use some kind of B-tree indexing.
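
For readers unfamiliar with the structure, here is a minimal in-memory sketch of a B-tree holding strings. It is not how any particular database implements its index: the class name BTree, the minimum degree T = 3, and the textbook split-on-the-way-down insertion are all assumptions chosen for brevity, and a real on-disk index would store nodes in pages rather than on the heap.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// B-tree of strings with minimum degree T: every node except the root holds
// between T-1 and 2T-1 sorted keys. T = 3 is arbitrary; on-disk trees pick T
// so that one node fills one disk page.
class BTree {
    static const std::size_t T = 3;

    struct Node {
        bool leaf;
        std::vector<std::string> keys;
        std::vector<std::unique_ptr<Node> > kids;  // keys.size() + 1 children unless leaf
        explicit Node(bool isLeaf) : leaf(isLeaf) {}
    };

    std::unique_ptr<Node> root_;
    std::size_t size_;

    // Index of the first key in `n` that is not less than `key`.
    static std::size_t position(const Node& n, const std::string& key) {
        return std::lower_bound(n.keys.begin(), n.keys.end(), key) - n.keys.begin();
    }

    // Split the full child kids[i] of `parent` and lift its median key into `parent`.
    static void splitChild(Node& parent, std::size_t i) {
        Node& left = *parent.kids[i];
        std::unique_ptr<Node> right(new Node(left.leaf));
        right->keys.assign(left.keys.begin() + T, left.keys.end());
        std::string median = left.keys[T - 1];
        left.keys.resize(T - 1);
        if (!left.leaf) {
            for (std::size_t k = T; k < 2 * T; ++k)
                right->kids.push_back(std::move(left.kids[k]));
            left.kids.resize(T);
        }
        parent.keys.insert(parent.keys.begin() + i, median);
        parent.kids.insert(parent.kids.begin() + i + 1, std::move(right));
    }

    // Insert into a node known not to be full; returns false if `key` exists.
    static bool insertNonFull(Node& n, const std::string& key) {
        std::size_t i = position(n, key);
        if (i < n.keys.size() && n.keys[i] == key) return false;
        if (n.leaf) {
            n.keys.insert(n.keys.begin() + i, key);
            return true;
        }
        if (n.kids[i]->keys.size() == 2 * T - 1) {
            splitChild(n, i);                  // median now sits at n.keys[i]
            if (key == n.keys[i]) return false;
            if (key > n.keys[i]) ++i;
        }
        return insertNonFull(*n.kids[i], key);
    }

public:
    BTree() : root_(new Node(true)), size_(0) {}

    // Returns true if `key` was added, false if it was already present.
    bool insert(const std::string& key) {
        if (root_->keys.size() == 2 * T - 1) {  // grow the tree by one level
            std::unique_ptr<Node> newRoot(new Node(false));
            newRoot->kids.push_back(std::move(root_));
            root_ = std::move(newRoot);
            splitChild(*root_, 0);
        }
        bool added = insertNonFull(*root_, key);
        if (added) ++size_;
        return added;
    }

    bool contains(const std::string& key) const {
        const Node* n = root_.get();
        for (;;) {
            std::size_t i = position(*n, key);
            if (i < n->keys.size() && n->keys[i] == key) return true;
            if (n->leaf) return false;
            n = n->kids[i].get();
        }
    }

    std::size_t size() const { return size_; }
};
```

The wide nodes are the point: each node visited narrows the search to one of up to 2T subtrees, so a lookup touches few nodes, which matters when every node visit is a disk read.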

--
Bob Hairgrove
No**********@Home.com
Jul 23 '05 #4
