
Cron job to remove logically redundant entries in Postgres SQL

I have a requirement to delete records from a PostgreSQL table that has more than 200 million records. The table does not have a primary key.

Sample content of the table (named Bookmark) is as follows:

  systemId    filename         mindatetime              maxdatetime
  70277       monitor_1.dat    2019-04-21 08:00:00 AM   2019-04-21 03:10:00 PM
  10006       monitor_2.dat    2019-04-25 10:00:00 AM   2019-04-25 11:30:00 AM
  10006       monitor_3.dat    2019-04-28 08:00:00 AM   2019-04-28 10:00:00 AM
  10006       monitor_3.dat    2019-04-28 09:00:00 AM   2019-04-28 11:00:00 AM
  10006       monitor_3.dat    2019-04-28 07:00:00 AM   2019-04-28 04:00:00 PM
  8368        monitor_1.dat    2019-05-21 11:00:00 AM   2019-05-21 11:30:00 AM
  8368        monitor_7.dat    2019-05-21 06:00:00 AM   2019-05-21 11:00:00 AM
  8368        monitor_5.dat    2019-05-23 08:00:00 AM   2019-05-23 10:00:00 AM

The cron job should run on a given schedule to delete the records which are logically redundant.

To explain this, take the case of systemId '10006' with filename 'monitor_3.dat', which has three entries whose min and max timestamps fall on the same day.

Logically, we can delete the two entries with mindatetime 08:00:00 AM and 09:00:00 AM (and maxdatetime 10:00:00 AM and 11:00:00 AM respectively), because both intervals are covered by the remaining entry, which spans from 07:00:00 AM to 04:00:00 PM.

The job should identify all such covered entries across the entire table and delete them.

My resultant output table content in this case should be:

  systemId    filename         mindatetime              maxdatetime
  70277       monitor_1.dat    2019-04-21 08:00:00 AM   2019-04-21 03:10:00 PM
  10006       monitor_2.dat    2019-04-25 10:00:00 AM   2019-04-25 11:30:00 AM
  10006       monitor_3.dat    2019-04-28 07:00:00 AM   2019-04-28 04:00:00 PM
  8368        monitor_1.dat    2019-05-21 11:00:00 AM   2019-05-21 11:30:00 AM
  8368        monitor_7.dat    2019-05-21 06:00:00 AM   2019-05-21 11:00:00 AM
  8368        monitor_5.dat    2019-05-23 08:00:00 AM   2019-05-23 10:00:00 AM
The table is more than 20 GB on disk, so I was exploring writing a SQL procedure or job to achieve this, but I have not been able to make much progress. Any ideas or suggestions for handling this scenario?
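For reference, the redundancy rule described above can be expressed as an EXISTS test: a row is redundant when another row with the same systemId and filename covers its entire interval. The following is only a sketch, assuming mindatetime and maxdatetime are timestamp columns; the ctid comparison is a Postgres-specific tie-breaker so that two identical rows do not mark each other redundant.

```sql
-- Sketch: list rows whose [mindatetime, maxdatetime] interval is fully
-- covered by another row for the same systemId and filename.
SELECT b.*
FROM   Bookmark b
WHERE  EXISTS (
    SELECT 1
    FROM   Bookmark other
    WHERE  other.systemId    =  b.systemId
      AND  other.filename    =  b.filename
      AND  other.mindatetime <= b.mindatetime
      AND  other.maxdatetime >= b.maxdatetime
      -- Require a strictly larger interval, or use ctid as a
      -- tie-breaker for exact-duplicate rows (keeps exactly one):
      AND  (other.mindatetime <  b.mindatetime
         OR other.maxdatetime >  b.maxdatetime
         OR other.ctid        >  b.ctid)
);
```

Running this SELECT first lets you sanity-check which rows would be deleted before committing to a DELETE.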
Jun 14 '19 #1
1 Reply


Rabbit
Expert Mod 10K+
What happens if they only partially overlap? What happens if an entry is fully covered, but only by two other entries combined? What happens if two entries have identical start and end times? You need to clearly define the requirements; otherwise you're going to run into trouble down the road.

Whatever the case may be, the answer will probably be to join the table to itself to find overlapping entries. How you formulate that join will depend on what you need to happen in the scenarios above.
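A sketch of that self-join, written as a DELETE ... USING and limited to strict containment (partial overlaps and combined coverage are deliberately left alone). Since the table has no primary key, Postgres's ctid system column is used to distinguish physical rows; note that ctid values can change after VACUUM FULL or CLUSTER, so run this in a single statement rather than saving ctids for later.

```sql
-- Sketch: delete rows fully covered by another row with the same
-- systemId and filename. The final OR also prevents a row from
-- matching itself, and keeps exactly one copy of exact duplicates.
DELETE FROM Bookmark AS b
USING  Bookmark AS other
WHERE  other.systemId    =  b.systemId
  AND  other.filename    =  b.filename
  AND  other.mindatetime <= b.mindatetime
  AND  other.maxdatetime >= b.maxdatetime
  AND  (other.mindatetime <  b.mindatetime
     OR other.maxdatetime >  b.maxdatetime
     OR other.ctid        >  b.ctid);
```

At 200+ million rows, this self-join will be slow without an index on (systemId, filename, mindatetime, maxdatetime); you may also want to batch it (for example, per systemId range) and schedule the batches from cron via psql, rather than running one giant delete.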
Jun 14 '19 #2
