Bytes | Software Development & Data Engineering Community
Fastest way to subtract elements of datasets of HDF5 file?

10 Byte
Hey Everyone:

Here is one interesting problem.

Input: Two arrays (Nx4, sorted on column 2) stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the data comes from a 10 GB file, hence the HDF5 storage).

Output: For each column-2 element of dataset_1, subtract the column-2 elements of dataset_2 and keep only the pairs whose difference (delta) lies within +/-4000. The results are saved in a dset of a new HDF5 file. I need to refer to this new file back and forth, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but that crashed on the 10 GB input. So I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my earlier posts. The script now seems to work for large (10 GB) datasets, but it is quite slow! The subtraction (for/while) loop is possibly the culprit. Any suggestions on how I can make this faster? I am after the fastest approach (and ideally the simplest, since I am a beginner).

  1. f_r = h5py.File('input.h5', 'r+')
  2. dset1 = f_r.get('dataset_1')
  3. dset2 = f_r.get('dataset_2')
  4. r1,c1 = dset1.shape
  5. r2,c2 = dset2.shape
  6. left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1
  7. f_w = h5py.File('data.h5', 'w')
  8. d1 = np.zeros(shape=(0, 4))
  9. dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)
  10.  
  11. for j in range(r1):
  12.     e1 = dset1[j,1]
  13.  
  14.     # move left pointer so that is within -delta of e
  15.     while left < r2 and dset2[left,1] - e1 <= -W:
  16.         left += 1
  17.     # move right pointer so that is outside of +delta
  18.     while right < r2 and dset2[right,1] - e1 <= W:
  19.         right += 1
  20.  
  21.     for i in range(left, right):
  22.         delta = e1 - dset2[i,1]
  23.         dset.resize(dset.shape[0] + n, axis=0)
  24.         dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
  25.         count += 1
  26.  
  27. print("\nFinal shape of dataset created: " + str(dset.shape))
  28.  
  29. f_w.close()
Jul 30 '20 #1
SioSio
272 256MB
f_r is not closed.
If you don't close the file, its memory is not freed, which may slow the code down.
Jul 31 '20 #2
RockRoll
10 Byte
I closed it using "f_r.close()" at the end and it didn't change anything. Any other suggestion?
Jul 31 '20 #3
SioSio
272 256MB
How about closing f_r as soon as you have finished using it?
f_r is last used on line 3.
Jul 31 '20 #4
RockRoll
10 Byte
I tried that too. It gives the error "ValueError: Not a dataset (not a dataset)" at line 12, where e1 reads from dset1.

I can't load dataset_1 and dataset_2 directly into a list/NumPy array, as both are really large.

Any other thoughts?
Jul 31 '20 #5
SioSio
272 256MB
Line 9 creates the dataset in f_w. Is that intended?
It needs f_w.flush() before f_w.close().
Jul 31 '20 #6
RockRoll
10 Byte
Yeah, line 9 seems fine. f_w is the file object. Basically, I am creating a new "data.h5" and saving into its dset at line 24.

I can also change line 9 to: dset = f_r.create_dataset('dataset_3', data=d1, maxshape=(None, None), chunks=True)

This creates a new dataset_3 in input.h5 instead of a new HDF5 file, but the computation time is unaffected.

My suspicion is that the way line 24 (or the loop) saves the data can be improved, but I am not sure, as I am not an expert in programming.
Jul 31 '20 #7
SioSio
272 256MB
Where is the value of n set?
It looks like "n = 1" on line 6 is part of the comment.
  1. left, right, count = 0,0,0; W = 4000  # Window half-width ;n = 1
Please tell me how the number of rows in dset changes as the code runs.
In some cases, it may be possible to move line 23 outside the outer for loop.
Jul 31 '20 #8
RockRoll
10 Byte
So here is what is happening:
1. I choose a flexible shape dset on line 9; flexible as I am dealing with large arrays and that can vary with the input file size
2. I fill in some values of interest at line 24
3. At line 23, I am basically expanding the current size of dset by n (=1). The added row is filled in with values I create at line 24.

Simply put, I am generating some numbers (line 22) and filling in dset, growing it by one row each time.

Can you please elaborate on what you mean by "line 23 outside the outer for loop"?

One quick thing I checked: even when lines 23 and 24 are commented out (meaning I am just computing values in line 22, not storing them in dset), the computation is still very slow. So moving line 23 out of the loop may not change the execution speed.
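
The observation above (the loop is slow even without any writes) points at the per-element reads such as dset2[left,1]: each one is a separate HDF5 access. A minimal sketch of an alternative, assuming that column 2 of each dataset fits in memory even though the full Nx4 arrays do not, is to read each column once (for example col1 = dset1[:, 1]) and let np.searchsorted find the window bounds. The function name window_pairs and its first_row parameter are hypothetical, not from the thread; the window limits mirror the two while loops in the original listing.

```python
import numpy as np

def window_pairs(col1, col2, W=4000, first_row=0):
    """Return an (M, 4) array of rows [count, e1, e2, e1 - e2].

    col2 must be sorted ascending. A pair is kept when
    e1 - W < e2 <= e1 + W, matching the two while loops in the listing.
    """
    col1 = np.asarray(col1)
    col2 = np.asarray(col2)
    # First index in col2 whose value is > e1 - W (the "left" pointer).
    left = np.searchsorted(col2, col1 - W, side='right')
    # First index in col2 whose value is > e1 + W (the "right" pointer).
    right = np.searchsorted(col2, col1 + W, side='right')
    counts = right - left
    total = int(counts.sum())
    # Row index into col1 for every output pair.
    i_idx = np.repeat(np.arange(col1.shape[0]), counts)
    # Position of each pair inside its own window: 0, 1, ..., counts[i] - 1.
    offsets = np.arange(total) - np.repeat(np.cumsum(counts) - counts, counts)
    j_idx = np.repeat(left, counts) + offsets
    e1 = col1[i_idx]
    e2 = col2[j_idx]
    row_ids = np.arange(first_row, first_row + total)
    return np.column_stack([row_ids, e1, e2, e1 - e2])

# Small in-memory stand-ins for dset1[:, 1] and dset2[:, 1].
col1 = np.array([0.0, 5000.0, 10000.0])
col2 = np.array([-4000.0, 1000.0, 4000.0, 9000.0, 14000.0])
rows = window_pairs(col1, col2, W=4000)
```

Since the output can be much larger than the input, col1 could be passed in slices of a few thousand rows, with first_row carrying the running count, so that each result block is written out before the next one is computed.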
Jul 31 '20 #9
SioSio
272 256MB
To repeat my point:
In line 6, n = 1 has no effect.
Only the following part is executed:
left, right, count = 0,0,0; W = 4000
Everything from here on is a comment:
# Window half-width ;n = 1
Therefore, line 23 does not resize anything.
On the contrary, using an undefined variable may make execution unreliable.
If the code shown is only part of the whole, and n = 1 is set somewhere other than line 6, then each pass of the outer loop adds at most (right - left) rows at line 23.
If the dset you write to the file ends up only one row larger than its initial size, you need to resize it only once, outside the outer for loop.
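
A middle ground between resizing once per row and resizing once overall is to resize once per block of rows. The sketch below is only an illustration of that idea with h5py; the helper name append_block and the demo values are hypothetical, and the file is opened with the in-memory 'core' driver so nothing is written to disk.

```python
import h5py
import numpy as np

def append_block(dset, block):
    """Grow dset once for the whole block, then write it with one slice."""
    block = np.asarray(block, dtype=dset.dtype)
    k = block.shape[0]
    if k == 0:
        return  # nothing to add for an empty window
    start = dset.shape[0]
    dset.resize(start + k, axis=0)
    dset[start:start + k, :] = block

# Demo on an in-memory file; the file name is only a label here.
with h5py.File('batched_demo.h5', 'w', driver='core', backing_store=False) as f_w:
    dset = f_w.create_dataset('dataset_1', shape=(0, 4), maxshape=(None, 4),
                              dtype='f8', chunks=True)
    # Rows follow the thread's layout: [count, e1, e2, delta].
    append_block(dset, [[0, 100.0, 90.0, 10.0], [1, 100.0, 120.0, -20.0]])
    append_block(dset, np.empty((0, 4)))
    append_block(dset, [[2, 200.0, 150.0, 50.0]])
    final_shape = dset.shape
    deltas = dset[:, 3].tolist()
```

In the original loop, a block would be all rows produced for one value of j (or for many values of j collected in a list), so the number of resize and write calls drops from one per pair to one per block.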
Aug 1 '20 #10
RockRoll
10 Byte
Hey SioSio,

Thanks for your assistance. That "n" was just a typo here. However, I have solved the problem: it turns out that PyTables' "append" method was much faster than resizing the HDF5 file. Just wanted to mention it here in case anyone stops by in the future!

Thanks again for your time!
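
The thread does not show the final code, so the following is only a guess at what a PyTables version could look like: an extendable array (EArray) whose first dimension starts at 0 and grows with each append call. The dataset name and the sample rows are placeholders, and a temporary directory is used so the demo leaves no file behind.

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.h5')

    with tables.open_file(path, mode='w') as h5:
        # shape=(0, 4): the zero-length axis is the one that can grow.
        out = h5.create_earray(h5.root, 'dataset_1',
                               atom=tables.Float64Atom(), shape=(0, 4))
        # Each append adds a whole block of [count, e1, e2, delta] rows.
        out.append(np.array([[0, 100.0, 90.0, 10.0],
                             [1, 100.0, 120.0, -20.0]]))
        out.append(np.array([[2, 200.0, 150.0, 50.0]]))
        nrows = int(out.nrows)

    # Reopen read-only to check what was stored.
    with tables.open_file(path, mode='r') as h5:
        deltas = h5.root.dataset_1[:, 3].tolist()
```

Appending blocks of rows rather than single rows keeps the number of calls into the HDF5 library small, which is likely where most of the time goes in the row-by-row version.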
Aug 14 '20 #11
