By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
449,034 Members | 1,012 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 449,034 IT Pros & Developers. It's quick & easy.

searching a value of a dict (each value is a list)

P: n/a
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?
Dec 9 '07 #1
Share this Question
Share on Google+
20 Replies


P: n/a
On 12월10일, 오전1시23분, Seongsu Lee <se...@senux.comwrote:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?
Hi,

I just let the dict work in bidirectional fashion so that
I can find out what I want by both key and value. A mark
or prefix was needed to distinguish between keys originated
from keys and keys originated from values. (value * (-1))

from pprint import pprint
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
for k, v in dict.items():
for x in v:
dict[x * -1] = k
pprint(dict)

{-105: 900000,
-104: 900000,
-103: 900000,
-102: 900000,
-101: 900000,
-100: 900000,
-22: 900001,
-21: 900001,
-20: 900001,
-19: 999999,
-18: 999999,
-17: 999999,
-16: 999999,
-15: 999999,
-12: 2,
-11: 2,
-10: 2,
-5: 1,
-4: 1,
-3: 1,
-2: 1,
-1: 1,
1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

What do you think of this? Ideas with less space complexity?
Dec 9 '07 #2

P: n/a
Seongsu Lee escribi:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
(...)

I want to find out the key value which has a specific
integer in the list of its value.
Sorry if this is unhelpful, but have you considered moving your data
model a proper database?
I ask because unless someone knows of a specific module, I think we are
in DB's authentic realm. Is the fastest solution, probably not just for
this particular operation you are trying to do.

Regards,
Pablo
Dec 9 '07 #3

P: n/a
On 12썡10씪, 삤*1떆53遺, Pablo Ziliani <pa...@decode.com.arwrote:
Seongsu Lee escribi처:
Hi,
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
(...)
I want to find out the key value which has a specific
integer in the list of its value.

Sorry if this is unhelpful, but have you considered moving your data
model a proper database?
I ask because unless someone knows of a specific module, I think we are
in DB's authentic realm. Is the fastest solution, probably not just for
this particular operation you are trying to do.

Regards,
Pablo
Hi Pablo,

Thank you for your posting! I wanted to solve the problem within
a given environment, python, and I think it is solved by
a dictionary with bidirectional key. I have posted it and want to
know if other knows more elegant way to do it.
Dec 9 '07 #4

P: n/a
Seongsu Lee:
What do you think of this? Ideas with less space complexity?
You can put the second group of keys in a second dictionary, so you
don't have to mangle them, and it may be a bit faster.

Regarding the space complexity, I don't know how you can reduce it
with Python. Probably you can create a list of long, sort it and use
bisect on it to find the keys. Such longs can be a combination of
shifted integer OR the other integer that is the key of the original
dict. But I don't how much you can gain like this.

Another solution to reduce space is to use a tiny external module
written in C, Pyrex or D. Here follows some simple D code you can
modify a bit to make it work with Pyd (http://pyd.dsource.org/):
import std.traits: ReturnType;
import std.stdio: writefln;

struct TyInt_int {
int el, n;
int opCmp(TyInt_int other) {
if (el == other.el)
return 0;
return (el < other.el) ? -1 : 1;
}
}

int bisect(TyElem, TyData, TyFun)(TyElem[] a, TyData x, TyFun key) {
int lo = 0;
int hi = a.length;
while (lo < hi) {
int mid = (lo + hi) / 2;
if (x < key(a[mid]))
hi = mid;
else
lo = mid + 1;
}
return lo;
}

void main() {
int[][int] int_arr;
int_arr[1] = [1, 2, 3, 4, 5];
int_arr[1] = [10, 11, 12],
int_arr[900000] = [100, 101, 102, 103, 104, 105],
int_arr[900001] = [20, 21, 22],
int_arr[999999] = [15, 16, 17, 18, 19];

int tot_len = 0;
foreach(arr; int_arr)
tot_len += arr.length;

auto data_arr = new TyInt_int[](tot_len);
int i = 0;
foreach(n, arr; int_arr)
foreach(el; arr)
data_arr[i++] = TyInt_int(el, n);

data_arr.sort;
writefln(bisect(data_arr, 100, (TyInt_int ii){return ii.el;}));
}

Bye,
bearophile
Dec 9 '07 #5

P: n/a
Seongsu Lee <se***@senux.comwrote:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.
Are the integers in the lists unique? I mean, could 104 occur in more than
one list? If it did, would it matter which key was returned?
How can I do this with Python? Ideas?
When I see something like this my natural response is to think that the data
structure is inappropriate for the use it's being put to.

The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.

Even if the dict in this form is needed for some other reason, couldn't the
code that created it also create a reverse index at the same time?

--
Jeremy C B Nicoll - my opinions are my own.
Dec 9 '07 #6

P: n/a
Jeremy C B Nicoll:
The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.
If you are talking about my D code then I know it, the creation of the
first dict has to be skipped, if possible... The code I have posted
must be adapted.

Bye,
bearophile
Dec 9 '07 #7

P: n/a
be************@lycos.com wrote:
Jeremy C B Nicoll:
The code someone else posted ...

If you are talking about my D code then I know it...
No I meant the code that used python to iterate over the dict and create
zillions of extra keys. I've deleted earlier posts in the thread and wasn't
sure who suggested that.

--
Jeremy C B Nicoll - my opinions are my own.
Dec 10 '07 #8

P: n/a
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?
def find_key(dict, num):
for k in dict:
if num in dict[k]:
return k
Dec 10 '07 #9

P: n/a
On 12월10일, 오전6시49분, Jeremy C B Nicoll <jer...@omba.demon.co.ukwrote:
Seongsu Lee <se...@senux.comwrote:
Hi,
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

Are the integers in the lists unique? I mean, could 104 occur in more than
one list? If it did, would it matter which key was returned?
Yes, the intergers in the lists are unique.
How can I do this with Python? Ideas?

When I see something like this my natural response is to think that the data
structure is inappropriate for the use it's being put to.

The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.

Even if the dict in this form is needed for some other reason, couldn't the
code that created it also create a reverse index at the same time?
The reason I use the dict for my data is to speed up the search by
key.

The code could create also a reverse index (a reverse dict) at the
time. But as you said, it waste the space and I wanted to ask someone
who may know some way to reduce the waste of space while searching
fast.
Dec 10 '07 #10

P: n/a
Seongsu Lee <se***@senux.comwrote:
The reason I use the dict for my data is to speed up the search by key.
Ok, I understand that once the overhead of creating the dict has been done,
getting access to values within it is quick. And taking the time to create
a set of reverse keys speeds up the reverse access.

Rather than scanning the whole dict and creating reverse keys for everything
in it first, there might be an advantage in making search logic test if
there is a reverse key for the negative integer to be searched for, and if
so use it, otherwise scan the dict creating and examining reverse keys until
the integer is found. That way if the integer you're looking for is early
in the dict you'd only create reverse keys for the integers from the start
of the dict until the required one.
We don't know what external(?) process created the data that you've stored
in the dict, nor what you use it for - or more to the point - how often. If
you're going to make just one search of that data then there's little point
in having a fast search after a slow dict creation. On the other hand if
you have many many searches to do the initial overhead might be acceptable.
(I don't know how slow creating the dict would be for a typical example of a
million keys each keying lists of 1-1000 integers.)
The code could create also a reverse index (a reverse dict) at the
time. But as you said, it waste the space and I wanted to ask someone
who may know some way to reduce the waste of space while searching
fast.
Is the dict used by anything else? If the data in it was held in some other
form would that cause your program (or other programs) lots of problems? If
the range of values of the integers being stored is suitable, you might
sensibly use several or many smaller dicts to store all the data (and thus
save time reverse-keying much less of it).

--
Jeremy C B Nicoll - my opinions are my own.
Dec 10 '07 #11

P: n/a
Seongsu Lee wrote:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?
You can try this:

items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value

findItem(104, items)

This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.

Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.

Hope this helps.

Adonis Vargas

Dec 10 '07 #12

P: n/a
Adonis Vargas:
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.
Adonis Vargas
After hearing this suggestion for the 300th time, I think it may be
the moment to fix this problem in Python3, and make the Python
compiler issue a syntax error if someone tries to reassign such kind
of words, like dict, set, etc.

Bye,
bearophile
Dec 10 '07 #13

P: n/a
On 12월10일, 오후12시18분, Adonis Vargas <adon...@REMOVETHISearthlink.net>
wrote:
Seongsu Lee wrote:
Hi,
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.
How can I do this with Python? Ideas?

You can try this:

items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value

findItem(104, items)

This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.
Hi,

Yes, it works. But I think it works in O(n * m), doesn't it?
(n is # of keys in the dictionary and m is # of items in the list.)
So, we need to create a reverse index. (a reverse dictionary) or
need something better at least, I think.
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.
Yep. :)
Hope this helps.

Adonis Vargas- 따온 텍스트 숨기기 -

- 따온 텍스트 보기 -
Dec 10 '07 #14

P: n/a
On Dec 10, 3:50 am, Seongsu Lee <se...@senux.comwrote:
On 12월10일, 오후12시18분, Adonis Vargas <adon...@REMOVETHISearthlink.net>
wrote:
Seongsu Lee wrote:
Hi,
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.
How can I do this with Python? Ideas?
You can try this:
items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value
findItem(104, items)
This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.

Hi,

Yes, it works. But I think it works in O(n * m), doesn't it?
(n is # of keys in the dictionary and m is # of items in the list.)
So, we need to create a reverse index. (a reverse dictionary) or
need something better at least, I think.
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.

Yep. :)
Hope this helps.
Adonis Vargas- 따온 텍스트 숨기기 -
- 따온 텍스트 보기 -
If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n). But if you only need it once, it is
a waste of time and space to create a reverse dict when your access
time is the same for the lookup as for building the reversed dict.

If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as you
execute the first search; that way you can combine the time spent on
building the reverse dict and the lookup, to get a total of O(n*m)
rather than O(n^2*m). The first search is "free" since you need the
reverse dict anyway.

Regards,
Jordan
Dec 10 '07 #15

P: n/a
On 2007-12-10, MonkeeSage <Mo********@gmail.comwrote:
If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n).
It also depends on if the dictionary shall be mutated between
reverse lookups.
But if you only need it once, it is a waste of time and space
to create a reverse dict when your access time is the same for
the lookup as for building the reversed dict.

If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as
you execute the first search; that way you can combine the time
spent on building the reverse dict and the lookup, to get a
total of O(n*m) rather than O(n^2*m). The first search is
"free" since you need the reverse dict anyway.
It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.

--
Neil Cerutti
You only get a once-in-a-lifetime opportunity so many times. --Ike Taylor
Dec 10 '07 #16

P: n/a
On Dec 10, 8:31 am, Neil Cerutti <horp...@yahoo.comwrote:
On 2007-12-10, MonkeeSage <MonkeeS...@gmail.comwrote:
If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n).

It also depends on if the dictionary shall be mutated between
reverse lookups.
But if you only need it once, it is a waste of time and space
to create a reverse dict when your access time is the same for
the lookup as for building the reversed dict.
If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as
you execute the first search; that way you can combine the time
spent on building the reverse dict and the lookup, to get a
total of O(n*m) rather than O(n^2*m). The first search is
"free" since you need the reverse dict anyway.

It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.

--
Neil Cerutti
You only get a once-in-a-lifetime opportunity so many times. --Ike Taylor
Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan
Dec 10 '07 #17

P: n/a
On Dec 10, 8:31 am, Neil Cerutti <horp...@yahoo.comwrote:
On 2007-12-10, MonkeeSage <MonkeeS...@gmail.comwrote:
If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n).

It also depends on if the dictionary shall be mutated between
reverse lookups.
But if you only need it once, it is a waste of time and space
to create a reverse dict when your access time is the same for
the lookup as for building the reversed dict.
If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as
you execute the first search; that way you can combine the time
spent on building the reverse dict and the lookup, to get a
total of O(n*m) rather than O(n^2*m). The first search is
"free" since you need the reverse dict anyway.

It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.

--
Neil Cerutti
You only get a once-in-a-lifetime opportunity so many times. --Ike Taylor
Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan
Dec 10 '07 #18

P: n/a
On Dec 10, 9:45 am, MonkeeSage <MonkeeS...@gmail.comwrote:
On Dec 10, 8:31 am, Neil Cerutti <horp...@yahoo.comwrote:
On 2007-12-10, MonkeeSage <MonkeeS...@gmail.comwrote:
If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n).
It also depends on if the dictionary shall be mutated between
reverse lookups.
But if you only need it once, it is a waste of time and space
to create a reverse dict when your access time is the same for
the lookup as for building the reversed dict.
If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as
you execute the first search; that way you can combine the time
spent on building the reverse dict and the lookup, to get a
total of O(n*m) rather than O(n^2*m). The first search is
"free" since you need the reverse dict anyway.
It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.
--
Neil Cerutti
You only get a once-in-a-lifetime opportunity so many times. --Ike Taylor

Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan
Sorry for the double-post...google is being beligerant right now.
Dec 10 '07 #19

P: n/a
Seongsu Lee:
>I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers.<
Let's say each integer can be represented with 32 bits (if there are
less numbers then a 3-byte representation may suffice, but this makes
things more complex), that is 2^2 bytes. Let's say there are 2^20 keys
each one associated to 2^10 values. So to represent the values you
need 2^32 bytes. It means 4 GB, so I don't think Python suffices to
store them in RAM, because a Python int object requires quite more
than 4 bytes (only represented inside an array.array it may need just
4 bytes).

So if you can use 128 MB RAM to store such data structure you need to
store data on HD too. You probably can use a lower-level language. On
disk you can keep the reverse index, represented as an array of
records/structs, each of such structures keep two 32-bit numbers (so
such array is 8 GB). Such index is sorted according to the first
element of the struct. The first number is the value of the original
dictionary and the second nuber is its key. Inside the RAM you can
keep another sorted array that "summarizes" your whole data. When you
need a number you can do a binary search on the array in RAM, such
array gives you the position where you can read (with a seek) a little
part of the file (512 bytes may suffice), to perform a little binary
search (if the block is very little a linear scan suffices) on it too
to find the number you need. Note that the summarizing data structure
in RAM may be represented with just a Python dict too, so in the end
you can use Python to solve this problem. You may need a lower-level
language to create the 8 GB file on disk (or create it with Python,
but it may take lot of time. You may sort it with the sort unix
command).

This isn't a complete solution, but I think it may work.

Bye,
bearophile
Dec 10 '07 #20

P: n/a
On Dec 10, 1:28 pm, bearophileH...@lycos.com wrote:
Seongsu Lee:
I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers.<

Let's say each integer can be represented with 32 bits (if there are
less numbers then a 3-byte representation may suffice, but this makes
things more complex), that is 2^2 bytes. Let's say there are 2^20 keys
each one associated to 2^10 values. So to represent the values you
need 2^32 bytes. It means 4 GB, so I don't think Python suffices to
store them in RAM, because a Python int object requires quite more
than 4 bytes (only represented inside an array.array it may need just
4 bytes).

So if you can use 128 MB RAM to store such data structure you need to
store data on HD too. You probably can use a lower-level language. On
disk you can keep the reverse index, represented as an array of
records/structs, each of such structures keep two 32-bit numbers (so
such array is 8 GB). Such index is sorted according to the first
element of the struct. The first number is the value of the original
dictionary and the second nuber is its key. Inside the RAM you can
keep another sorted array that "summarizes" your whole data. When you
need a number you can do a binary search on the array in RAM, such
array gives you the position where you can read (with a seek) a little
part of the file (512 bytes may suffice), to perform a little binary
search (if the block is very little a linear scan suffices) on it too
to find the number you need. Note that the summarizing data structure
in RAM may be represented with just a Python dict too, so in the end
you can use Python to solve this problem. You may need a lower-level
language to create the 8 GB file on disk (or create it with Python,
but it may take lot of time. You may sort it with the sort unix
command).

This isn't a complete solution, but I think it may work.

Bye,
bearophile
Nice. :)

Regards,
Jordan
Dec 14 '07 #21

This discussion thread is closed

Replies have been disabled for this discussion.