Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 8 of 8 total
Thread Need help with SQL to eliminate rows with overlapping date ranges
Thu, Mar 19 2015 12:19 PMPermanent Link

Barry

I am trying to write an SQL statement that will return only those rows whose date range encompasses all of the date ranges of the other rows. This is harder than it first looks.

I have a table called Files with a structure:

File_Id: GUID;
Location: VarChar(4);
File_Contents: BLOB;
Start_DateTime: TimeStamp;
End_DateTime: TimeStamp;

The File_Id is a GUID that uniquely identifies the row.
The Location is like a room number and the date ranges are for that specific location.
The columns Start_DateTime and End_DateTime covers a date range (see example). By definition, the Start_DateTime value is always less that the End_DateTime value for that row.

The table can have rows with a date range that overlaps the date range of another row for that location.
An overlapping date range is a row whose range either fully or partially falls within a date range of another row of the same location.  I need to ignore all of the rows whose date range fully falls WITHIN a date range of another row for the same LOCATION.

Example.
The table has rows like:
          File_Id, Location, File_Contents, Start_DateTime, End_DateTime
Row 1: "{...}"  "1"   "<text>"   2010-01-01 00:00:00     2010-01-02 23:59:59
Row 2: "{...}"  "1"   "<text>"   2010-01-02 00:00:00     2010-01-02 23:59:59
Row 3: "{...}"  "1"   "<text>"   2010-01-03 00:00:00     2010-01-03 23:59:59
Row 4: "{...}"  "1"   "<text>"   2010-01-04 00:00:00     2010-01-04 23:59:59
Row 5: "{...}"  "1"   "<text>"   2010-01-01 00:00:00     2010-01-19 23:59:59
Row 6: "{...}"  "1"   "<text>"   2010-01-01 00:00:00     2010-01-19 23:59:59
Row 7: "{...}"  "1"   "<text>"   2010-01-03 00:00:00     2010-01-21 23:59:59
Row 8: "{...}"  "2"   "<text>"   2010-01-01 00:00:00     2010-01-02 23:59:59

I need to run an SQL statement that returns rows 5, 7, and 8. Row 5 covers all of the date ranges for rows 1 to 4 of location "1", and row 6 is a duplicate of row 5, and row 7 is needed because although it starts in the range of row 5, it ends outside the range of row 5. Row 8 is for Location "2". (Note: It is also valid to return row 6 instead of row 5 because the date ranges are identical for Location "1").

I don't want to post my SQL statement just yet because it may influence how you approach the problem. It doesn't matter if the result is sensitive or not.  My SQL statement does eliminate rows with date ranges that are smaller than other rows, but will still return rows with the same date range. It takes 19sec-25sec to complete on a Files table with 4,000 rows so it is still a bit slow.

I'm wondering if there a simple (fast) SQL solution to this puzzle? Feel free to create any indexes needed to speed this up.

TIA
Barry
Thu, Mar 19 2015 3:44 PMPermanent Link

Barry

Never mind.
I solved the problem and the SQL statement runs in under 1/4 second.

Barry
Thu, Mar 19 2015 5:12 PMPermanent Link

Richard Harding

Wise Nutrition Coaching

Hi Barry,

What did you do to solve the problem?

It might be useful for someone and I think it is good to complete the puzzle.

Richard Harding
Mon, Mar 23 2015 4:21 PMPermanent Link

Barry

Richard,

Since you're looking for an answer, here goes. I hope it helps someone else out too.

The obvious approach would be to use a table join or a subselect to count the number of other rows that have a date range outside of the date range of the current row so the current row could be eliminated.

Although table joins and subselects are usually quite fast with the "=" operator, when the <= or >= operators are used it will slow down quite a bit and will take several minutes to complete even with the proper indexes.

After examining my data a little more, I discovered the overlapping date ranges always starts on the same date, but will end on different dates. In other words, the date range grows over time.

This makes it possible to use a much faster solution like:

select Location, Start_DateTime, max(End_DateTime) as End_DateTime, count(*) as Num from files
group by Location, Start_DateTime;

This will provide a list of all data rows with a unique Start_DateTime and the largest date range. This takes just a fraction of a second to execute compared to several minutes with a table join or SubSelect.

Now my solution does not eliminate the date ranges that start and end inside of another date range, but since this does not occur with my data, I decided not to spend any more time on it.

Barry
Tue, Mar 24 2015 4:47 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Barry


I have to say what a brilliant cheat - now watch a user come along and spoil it <VBG>

Roy Lambert
Tue, Mar 24 2015 11:38 AMPermanent Link

Barry

>I have to say what a brilliant cheat - now watch a user come along and spoil it <VBG><

Yeah I know what you mean. When I hit the Post button, my thoughts turned to you. <LOL>
I thought it would be 30 seconds before I'd see a your reply with even a simpler solution.

Barry
Tue, Mar 24 2015 6:12 PMPermanent Link

Richard Harding

Wise Nutrition Coaching

Hi Barry (& Roy)

Great - thank you.  Very neat.  I will add it to my list of useful things to know.

I had a similar problem years ago and it it took longer than expected to work out a solution.  My solution is below which appears to be similar to Barry's original solution.  As Barry says, the self-join makes it much slower than his alternative.

Richard

------------------------------------------------------------------------------------------------------------------------------------------------------------------

CREATE TABLE "RangeTable"
(
"ID" INTEGER GENERATED BY DEFAULT AS IDENTITY (START WITH 0, INCREMENT BY 1) NOT NULL,
"StartDate" DATE NOT NULL,
"EndDate" DATE NOT NULL,
CONSTRAINT "rtStartBeforeEnd" CHECK (StartDate < EndDate)
)

-- IF COUNT(*) > 0 then overlapping ranges exists
SELECT COUNT(*)
  FROM RangeTable AS LR1, RangeTable AS LR2
  WHERE
     (LR1.ID < LR2.ID) AND
     (LR1.StartDate <= LR2.EndDate) AND
     (LR1.EndDate >= LR2.StartDate)
Wed, Mar 25 2015 4:10 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Barry


>Yeah I know what you mean. When I hit the Post button, my thoughts turned to you. <LOL>
>I thought it would be 30 seconds before I'd see a your reply with even a simpler solution.

I wish. For that you'd need John Hay or Ole

Roy
Image