Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 10 of 12 total
Thread Spotting duplicates
Tue, Mar 9 2010 11:32 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

I do hope John Hay reads the ElevateDB ngs as well as the DBISAM ones.

I have four tables

Contacts, Career, Companies, Sites

Contacts is a one to many with Career
Companies is a one to many with Sites
Career is one to one with Sites and one to many with Companies

With some data loading from external DBs I now have multiple occurrences of the same person (sites as well but I've sussed that and companies can wait).

I can almost get the SQL to identify duplicate contacts BUT I can only get one of them. My logic is if two people with the same forename and same surname exist in the same site of a company its a good guess they are the same person so

SELECT
_fkContacts,
_fkCompanies,
_fkSites,
_Forename,
_Surname
FROM
Career
JOIN Contacts C ON _fkContacts = C._ID
GROUP BY _Surname,_Forename,_fkCompanies,_fkSites
HAVING COUNT(_Surname + _Forename + CAST(_fkCompanies AS VARCHAR(16)) + CAST(_fkSites AS VARCHAR(16))) > 1

The top one on the list is Anthony Allison - ID 1008199. What I need is the other ID (1000095) as well.

Any ideas?

Roy Lambert
Tue, Mar 9 2010 1:19 PMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

I do enjoy feeling like a moron

SELECT _ID FROM Contacts WHERE _ID IN (SELECT.....


Roy Lambert
Tue, Mar 9 2010 1:24 PMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

I now feel an even bigger buffoon. If the original query doesn't produce the full list adding the extra bit won't help.

Roy Lambert
Tue, Mar 9 2010 10:00 PMPermanent Link

Raul

Team Elevate Team Elevate

Roy,

I think it's the group by that's limiting your output.

How about starting with getting duplicate entries from one table first ?

Something like this :

select C1._fkContacts,C1._Surname,C1._Forename,C2._fkContacts as _fkContacts2, C2._Surname,C2._Forename

from Career C1,Career C2

where c1._fkContacts<c2._fkContacts and c1._Surname=c2._Surname and
c1._Forename=c2._Forename and C1._fkCompanies = C2._fkCompanies and
C1._fkSites=C2._fkSites

Then you should be able to use one of the _fk fields to join to other table.

Raul
Wed, Mar 10 2010 3:01 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Raul

>I think it's the group by that's limiting your output.

Me to.

>How about starting with getting duplicate entries from one table first ?

I'd love to, unfortunately 1) _Forename, _Surname are in Contacts, _fkCompanies, _fkSites are in Career so I have to use two tables and 2) Career contains c20k records (Contacts a bit less) and a cartesian join on that takes forever

Thanks for the suggestion though, The C1._fkContacts < C2._fkContacts is something I never thought of but feels as if it should help.

I did come up with a solution but it doesn't work (I suspect a bug)

Roy Lambert
Wed, Mar 10 2010 3:16 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Raul


Altering the SQL to

select _Surname, _Forename FROM Contacts C1, Contacts C2
where c1._ID<c2._ID and c1._Surname=c2._Surname and c1._Forename=c2._Forename

just to try and Forever = 1274.887 secs

Roy Lambert
Wed, Mar 10 2010 9:07 AMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Roy,

<< I did come up with a solution but it doesn't work (I suspect a bug) >>

What is it that you suspect is a bug ?

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Mar 10 2010 9:08 AMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Roy,

<< just to try and Forever = 1274.887 secs >>

http://www.elevatesoft.com/supportfaq?action=view&category=edb&question=joins_where_clause

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Mar 10 2010 10:12 AMPermanent Link

Raul

Team Elevate Team Elevate

Roy,

Hopefully you get your solution working (bug and all) but question : is this a one time operation or do you need to do this regularly.

If one time then creating a temp intermediary tables (do all joins so you're down to single table and then eliminate non-duplicates using your group by logic) might be faster as you'd have all the relevant IDs at the end that just need to be merged.

Raul
Wed, Mar 10 2010 11:11 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Tim


I started a new thread for it: Is this In comparison a bug

Roy Lambert
Page 1 of 2Next Page »
Jump to Page:  1 2
Image