Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 10 of 15 total
Thread Looking for a clever sql
Fri, Apr 2 2010 12:23 PMPermanent Link

Hershcu Sorin

Hello

It's not a specific ebd question but since I have to do it with edb I try
here for a clever solution.

I have a table with Field1, Field2
Some records have a equal values on the first 2 fields.
Rec1.Field1 = Rec2.Field1;
Rec1.Field2 = Rec2.Field2;

I want to remain only with the first record and delete the others.
I know that I can iterate on the records, one by one, looking for equals
record and delete them
but since it's a very big table 500,000 records it take hours.

Any faster solution will be appreciated
Thanks Sorin

Sat, Apr 3 2010 4:44 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Sorin

>I have a table with Field1, Field2
>Some records have a equal values on the first 2 fields.
>Rec1.Field1 = Rec2.Field1;
>Rec1.Field2 = Rec2.Field2;
>
>I want to remain only with the first record and delete the others.
>I know that I can iterate on the records, one by one, looking for equals
>record and delete them
>but since it's a very big table 500,000 records it take hours.

It should be doable, the big problem is making sure you keep the first record and delete the second. The core SQL is something like

SELECT id FROM table
GROUP BY Field1, Field2
HAVING COUNT(*) > 1

That will give you a list if ids where there are duplicates (or triplicates etc) and it will probably be the last one. However, subject to correction by Tim, you can't guarantee this.

The SQL can then be extended to

DELETE FROM table WHERE id IN
(
SELECT id FROM table
GROUP BY Field1, Field2
HAVING COUNT(*) > 1
)

which will probably do what you want for duplicates, but if there are any triplicates or better you would need to run the code several times.

Roy Lambert [Team Elevate]
Sat, Apr 3 2010 5:14 AMPermanent Link

Malcolm Taylor

Roy Lambert wrote:

>
> DELETE FROM table WHERE id IN
> (
> SELECT id FROM table
> GROUP BY Field1, Field2
> HAVING COUNT(*) > 1
> )
>
> which will probably do what you want for duplicates, but if there
> are any triplicates or better you would need to run the code
> several times.
>
> Roy Lambert [Team Elevate]

Roy, are you sure that will not delete *all* the records with
matching f1 and f2 .. the original and any number of 'matching'
records?


--
Sat, Apr 3 2010 7:47 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Malcolm

>Roy, are you sure that will not delete *all* the records with
>matching f1 and f2 .. the original and any number of 'matching'
>records?

Pretty certain. The GROUP BY will mean that only one row is returned for each case where there are two or more rows that have the same f1 & f2.

The way to check is to see what the sub query returns and cross check with the actual data.

Roy Lambert [Team Elevate]
Sat, Apr 3 2010 8:55 AMPermanent Link

Michael Baytalsky


Just a quick note: the SQL below will not work on 99% of databases.
I know it will probably work on DBISAM, not tested with EDB.
This will return ALL pairs with one of their id - probably first one.

On 4/3/2010 11:44 AM, Roy Lambert wrote:
> SELECT id FROM table
> GROUP BY Field1, Field2
> HAVING COUNT(*)>  1

Instead you should write, which will work on most databases:

SELECT Min(id) FROM table
GROUP BY Field1, Field2
HAVING COUNT(*) >  1


The fastest way still is to sort this table by Field1, Field2 and iterate
it deleting duplicate ocurrences. You will not need to search for each one,
because you just remember your current pair and compare to next until
it does not match. If you do this in stored proc that will be the fastest
solution one could imagine for any database.


--
Regards,
Michael Baytalsky
Context Software LLC.
http://www.contextsoft.com
the makers of Context Database Designer
Sat, Apr 3 2010 11:25 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Michael


Or in this case since xx wants to delete the second of the matching pairs

DELETE FROM table WHERE id IN
(
SELECT Max(id) FROM table
GROUP BY GROUP BY Field1, Field2
HAVING COUNT(*)>1
)

Thanks for that little tip - very useful.

>The fastest way still is to sort this table by Field1, Field2 and iterate
>it deleting duplicate ocurrences. You will not need to search for each one,
>because you just remember your current pair and compare to next until
>it does not match. If you do this in stored proc that will be the fastest
>solution one could imagine for any database.

Without timing it I'll have to take your word about speed. However, if you're right, I wouldn't expect much difference just using Delphi with no controls attached to the table.

Roy Lambert
Sun, Apr 4 2010 3:26 AMPermanent Link

Hershcu Sorin

Thanks Roy

You are right.
It delete the last occurrence and and needs to run the procedure again for
cases with more than double occurrences.

Thanks Sorin

"Roy Lambert" <roy.lambert@skynet.co.uk> wrote in message

>>Roy, are you sure that will not delete *all* the records with
>>matching f1 and f2 .. the original and any number of 'matching'
>>records?
>
> Pretty certain. The GROUP BY will mean that only one row is returned for
> each case where there are two or more rows that have the same f1 & f2.
>
> The way to check is to see what the sub query returns and cross check with
> the actual data.

Sun, Apr 4 2010 4:33 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Sorin


Two points:

1. don't forget that useful tip from Michael Baytalsky using max(id)
2. You can determine the number of iterations you need by using another query

SELECT MAX(DupNos)-1 AS Iterations FROM
(SELECT COUNT(*) AS DupNosFROM table
GROUP BY GROUP BY Field1, Field2
HAVING COUNT(*)>1) AS T1

If you have many triplicates and quadruplicates etc it may be faster as Michael suggests just to run through the table in Delphi. Just make sure you have no controls apart from the table component connected.

Roy Lambert [Team Elevate]
Sun, Apr 4 2010 4:47 AMPermanent Link

Hershcu Sorin


"Roy Lambert" <roy.lambert@skynet.co.uk> wrote in message
> Two points:
>
> 1. don't forget that useful tip from Michael Baytalsky using max(id)
> 2. You can determine the number of iterations you need by using another
> query
>
> SELECT MAX(DupNos)-1 AS Iterations FROM
> (SELECT COUNT(*) AS DupNosFROM table
> GROUP BY GROUP BY Field1, Field2
> HAVING COUNT(*)>1) AS T1
>
> If you have many triplicates and quadruplicates etc it may be faster as
> Michael suggests just to run through the table in Delphi. Just make sure
> you have no controls apart from the table component connected.

Yes the iteration on Delphi table is faster.

Thanks All
Sorin

Mon, Apr 5 2010 3:25 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Roy,

<< Without timing it I'll have to take your word about speed. However, if
you're right, I wouldn't expect much difference just using Delphi with no
controls attached to the table. >>

There's still a lot more going on in a TDataSet than there is in a raw EDB
SQL cursor.  The EDB SQL cursors are bare-bones and are very, very fast.

--
Tim Young
Elevate Software
www.elevatesoft.com

Page 1 of 2Next Page »
Jump to Page:  1 2
Image