How to Downsample Your Data Efficiently

Tired of spending countless minutes downsampling your data? Look no further!

In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million minute-level records in only 41 seconds in DolphinDB.

The basic configuration of the DolphinDB server is:

16 CPU cores
256 GB memory
4 SSDs

A DolphinDB cluster with 4 data nodes is deployed, and each node uses one SSD.

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*m7oACYfcwPFlhzMcTe8Tnw.png[/IMG]

The data we use is:

level 1 quotes from the New York Stock Exchange for August 2007
around 272 GB, with 6.48 billion records

Downsampling can be performed with a SQL statement in DolphinDB.

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*TnmOqOphYWrjO_tVwSXz4w.png[/IMG]

As the SQL query may involve multiple partitions, DolphinDB breaks down the job into several tasks and assigns the tasks to the corresponding data nodes for parallel execution. When all the tasks are completed, the system merges the intermediate results from the nodes to return the final result.
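The split-and-merge pattern described above can be sketched in plain Python. This is only an illustration of the general idea, not DolphinDB's internal implementation: the function names `partial_aggregate` and `merge_partials` and the sample rows are hypothetical. The sketch assumes each task returns a running sum and count per group, so that the final average is computed once after merging instead of averaging averages.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Per-partition step: (sum, count) of bid for each (symbol, minute) key."""
    acc = defaultdict(lambda: [0.0, 0])
    for symbol, minute, bid in rows:
        acc[(symbol, minute)][0] += bid
        acc[(symbol, minute)][1] += 1
    return acc

def merge_partials(partials):
    """Merge step: add up sums and counts, then divide once at the end."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}

# Two hypothetical partitions that both hold rows for EBAY at 09:30.
partition_a = [("EBAY", "09:30", 10.0), ("EBAY", "09:30", 20.0)]
partition_b = [("EBAY", "09:30", 60.0), ("EBAY", "09:31", 5.0)]

result = merge_partials([partial_aggregate(partition_a),
                         partial_aggregate(partition_b)])
print(result)
```

Whether a group actually spans more than one partition depends on the partitioning scheme. When it does, carrying sums and counts keeps the merged average exact.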

The script is as follows:
// Load the source table and count the records for August 2007
db = database("dfs://TAQ")
quotes = db.loadTable("quotes")
select count(*) from quotes where date between 2007.08.01 : 2007.08.31

// Use a one-row query result as the schema template for the minute-level table
model = select top 1 symbol, date, minute(time) as minute, bid, ofr from quotes where date=2007.08.01, symbol=`EBAY
// Drop any previous copy of the target table, then recreate it
if(existsTable("dfs://TAQ", "quotes_minute_sql"))
    db.dropTable("quotes_minute_sql")
db.createPartitionedTable(model, "quotes_minute_sql", `date`symbol)

// Time the downsampling: average bid and ofr per symbol, date and 1-minute bar
timer{
    minuteQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 60) as minute
    loadTable("dfs://TAQ", "quotes_minute_sql").append!(minuteQuotes)
}

// Count the minute-level records that were written
select count(*) from loadTable("dfs://TAQ", "quotes_minute_sql")
The frequency can be adjusted as needed just by modifying bar(time, 60). Here 60 means the data is downsampled to 1-minute intervals, because the timestamp values have second precision.
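To make the bucketing concrete, here is a small Python sketch of what a call like bar(time, 60) does to a second-precision time. It assumes bar rounds a value down to the nearest multiple of the interval, that is, value - value % interval. The helper names `to_seconds` and `fmt` are made up for this illustration and are not DolphinDB functions.

```python
def bar(value, interval):
    """Round value down to the nearest multiple of interval."""
    return value - value % interval

def to_seconds(hh, mm, ss):
    """Convert a clock time to seconds since midnight."""
    return hh * 3600 + mm * 60 + ss

def fmt(seconds):
    """Format seconds since midnight as HH:MM:SS."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"

t = to_seconds(9, 31, 47)

one_minute = fmt(bar(t, 60))    # 60-second buckets
five_minute = fmt(bar(t, 300))  # 300-second buckets
print(one_minute, five_minute)  # 09:31:00 09:30:00
```

Every quote stamped between 09:31:00 and 09:31:59 lands in the same 1-minute bucket, so grouping by the bucket value yields one output row per symbol, date and minute. Under the same assumption, changing the interval to 300 would produce 5-minute bars.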

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*iJDnTB5cZ0prCUIzL0Ew4A.png[/IMG]

The table "quotes_minute_sql" is created with createPartitionedTable, and the downsampled result is appended to this table.

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*2kM7NKookUH_LKWVYZ7k6Q.png[/IMG]

We can execute the script and visit the web-based user interface to check the resource usage. The interface shows that all CPU cores participate in the downsampling. On each data node, 15 tasks run concurrently while data is being read from disk.

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*gEzm3dCur7RbFcDe19f4eg.png[/IMG]

When we come back to VS Code and check the execution status, we find that it takes only 41 seconds to complete the data downsampling, which generates 61 million minute-level records.
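As a rough back-of-the-envelope check, the figures quoted in this article translate into the following rates. The variable names are made up for the calculation, and the GB-per-second figure assumes the 272 GB refers to the data volume the query has to cover, which the article does not specify.

```python
records_in = 6.48e9   # high-frequency records read
records_out = 61e6    # minute-level records written
elapsed_s = 41        # reported run time in seconds
data_gb = 272         # approximate data size in GB

records_per_second = records_in / elapsed_s  # about 158 million records/s
reduction = records_in / records_out         # about 106 input rows per output row
gb_per_second = data_gb / elapsed_s          # about 6.6 GB/s

print(f"{records_per_second / 1e6:.0f} million records/s")
print(f"{reduction:.0f}x fewer records")
print(f"{gb_per_second:.1f} GB/s")
```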

[IMG]https://miro.medium.com/v2/resize:fit:720/format:webp/1*38Cr8hLn5BiKLT898pQyWQ.png[/IMG]

DolphinDB performs well at data downsampling for the following reasons:
Jobs are executed in a distributed manner, so the resources of different nodes are used at the same time;
Compression reduces disk I/O;
Columnar storage and vectorized computation improve the efficiency of aggregation.

To learn detailed operations of data downsampling, take a look at this demo!
https://youtu.be/0vRuiz1Lf6Y

Thanks for reading! To keep up with our latest news, please follow us on Twitter and LinkedIn. You can also join our Slack to chat with the author!

Feel free to check our website for more information!
Feb 22 '24 #1