Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 5 of 5 total
Thread Horrific SQL Merge prospect
Mon, Mar 19 2012 1:43 PMPermanent Link

Adam Brett

Orixa Systems

I have a client with 2 offices which have previously been separate. Each has a DBISAM DB, which will become an Elevate DB in the up-coming update process.

They would like to merge their databases, keeping each local version but sharing some data.

At present each office has a Customers table ... but the data in them does not match (each serves different customers, but there is _some_ overlap)

There are about 11,000 records in each Customers table, mostly UK retail businesses, the first 4,000 or so are shared, from earlier times. Each Customer has multiple attached People, Phones, Addresses & other data which will need to be reconciled.

UGH.

Step 1:

I want to try to get a measure of how many of the records overlap.

I can do

"WHERE Office1.Name = Office2.Name"
"WHERE UPPER(Office1.Name)=UPPER(Office2.Name)"
"WHERE UPPER(Office1.PostCode)=UPPER(Office2.PostCode)"

etc., etc., to see direct matches between names & other fields, I can also use LIKE etc., but this will of course miss out records which have different spellings.

--

I don't have much experience of this.

Does Elevate have functions to compare somewhat-similar inputs like the "SOUNDEX" functionality I remember using years ago?

If not are there things I can do in Delphi & what would people with more experience recommend?
Mon, Mar 19 2012 2:49 PMPermanent Link

Raul

Team Elevate Team Elevate

Adam,

Matching naems is always tricky - can you check things like phone number, postal codes or tax codes instead? Those are usually in known format so you can more easily clean them up and compare.

If do have to do name comparisons then DBISAM does not have a soundex and AFAIK neither does EDB.

You could create a custom function to add the soundex to either or just use soundex function in delphi code.

Delphi has a built in Soundex function which is somewhat limited as well as soundex based ResemblesText/AnsiResemblesText in StrUtils.

Google search also turned out some other ones (i have no idea how good they are) like this http://stevepeacocke.blogspot.ca/2007/08/whats-this-soundex-in-delphi.html so i'm sure you can find some reasonable implementations.

Good luck

Raul

<<
Adam Brett wrote:

Does Elevate have functions to compare somewhat-similar inputs like the "SOUNDEX" functionality I remember using years ago?

If not are there things I can do in Delphi & what would people with more experience recommend?


>>
Tue, Mar 20 2012 4:51 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Adam


I was part of the team at Touche Ross deduplicating applications for British Gas shares. I've also imported other databases into my own. You have my sympathy.

A lot of what you can do will be dictated by what data you have, and how good it is, and how good the original data entry/validation was (I know I'm teaching granny to suck eggs). As Raul indicates names are, generally, the worst data to try and dedup on.

You may consider this nosey or impertinent. How are you going to be paid for this? I ask because its incredibly easy to underestimate the amount of effort needed. If you're being paid a day rate - fine - if its a fixed fee - double it and you may get near the mark Smiley

Another question is are invoices / orders involved and do you have to merge those as well? What are the consequences of getting it wrong? With the British Gas floatation the consequences were some people would be upset because they didn't get any and some delighted because they were allowed more than the rules said. Not a big deal either way.

As you say get a rough cut idea of what and how many first. You've spotted that you can't trust the case - good. Unfortunately you've missed that you can't trust spacing. eg

WHERE Office1.Name = Office2.Name - forget this

WHERE UPPER(Office1.PostCode)=UPPER(Office2.PostCode)

should be

WHERE UPPER(REPLACE(' ','',Office1.PostCode))=UPPER(REPLACE(' ','',Office2.PostCode))


WHERE UPPER(REPLACE(' ','',Office1.Name))=UPPER(REPLACE(' ','',Office2.Name))

is better than

WHERE UPPER(Office1.Name)=UPPER(Office2.Name)

but ignores the fact that Office1 may have entered "Adam Limited" and Office2 "Adam Ltd" so you need to process them to remove any unwanted words.

Phone number (with spaces stripped) is usually pretty good, but Office1 may always enter with std code and Office2 without if its local, or one may enter a switchboard number and the other a direct dial.

If I was doing it I would probably use SQL to get an idea and then Delphi to do the job. One thing that crosses my mind is that if invoices are involved the address on those is generally of good quality so backflush that to the contacts tables and then use that address. Concentrate firstly on deduping / merging companies then look at contacts within companies. Its a lot less likely there'll be two John Smiths in a company than in the contact list as a whole.

If you want to provide a bit more info - tables involved and their structure I'll give it some more thought.


Roy Lambert [Team Elevate]
Tue, Mar 20 2012 11:26 AMPermanent Link

Adam Brett

Orixa Systems

>>How are you going to be paid for this? I ask because its incredibly easy to underestimate the amount of effort >>needed. If you're being paid a day rate - fine - if its a fixed fee - double it and you may get near the mark Smiley

Luckily I am on a day-rate Smile... but first I have to suggest a strategy for the process (it doesn't all have to be done with SQL, some of it could involve getting staff to clean up data) ... so the strategy has to be as cost effective as possible!

>>Another question is are invoices / orders involved and do you have to merge those as well?

We have master records for "People" (who can be customers, i.e. individuals) and "Organisations" which can be customers as well i.e. retailers. All have linked Phone & Address data etc.

I fully agree with Your and Raul's post, that I should search on items such as PostCode and Phone Number.

The tip of excluding space, hyphen & other non-number characters from Phone Numbers is a really good one. That should generate a good duplicates list.

... A really lovely fact, I ran a test SQL on the 2 DBs yesterday & there are many customers with 10 or more duplicate entries (!!!) ... partly because the same person has been re-entered, but also sometimes because 1 retailer has 2 shops with the same name in different locations Frown

It will take a lot of work to set straight.

>>If I was doing it I would probably use SQL to get an idea
>> and then Delphi to do the job.

Really good idea ... kind of where I can see I was going, but good to actually consider structuring it in that way!

>>One thing that crosses my mind is that if invoices are
>>involved the address on those is generally of good quality
>> so backflush that to the contacts tables and then use that address.

Also a good idea.

... Thanks for these pointers ... I'll try to work with this.

--

I am still curious about the existence of a "LIKE" method somewhere which could match data:

+44 207 458 1221 with 0207 4581221

i.e. ignore the spaces, ignore the +, ignore the additional "0" & "44" ...

I guess for the special case of the phone number (which might be a good proxy Primary Key!) I might be able to make this work.

For example if I reverse the order of the phone numbers, strip out the spaces and non number char, and take only the first 10 char I get an exact match ...

That might be a good way to start!
Tue, Mar 20 2012 11:58 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

Adam

>Luckily I am on a day-rate Smile... but first I have to suggest a strategy for the process (it doesn't all have to be done with SQL, some of it could involve getting staff to clean up data) ... so the strategy has to be as cost effective as possible!

Strategies are always easy - its the tactics that generally kick you in the soft dangly bits Smiley

1. Flag potential duplicated organisations
2. Merge agreed duplicates (invoices, orders, people etc)
3. Flag potential duplicate people
4. Eyeball and delete

>>>Another question is are invoices / orders involved and do you have to merge those as well?
>
>We have master records for "People" (who can be customers, i.e. individuals) and "Organisations" which can be customers as well i.e. retailers. All have linked Phone & Address data etc.

I am reading this as one table - bad news - see "strategy" above

Another thought - since you have order / invoice history - ignore anyone individual or organisation where there has been no movement for a period - say a year. That means you're focussing on useful relevant information to get things underway quickly. The remainder can be dealt with at leisure, and for individuals can possibly be ignored reloading the data when it becomes necessary.

>I fully agree with Your and Raul's post, that I should search on items such as PostCode and Phone Number.

It wouldn't hurt to process the phone numbers to ensure validity and format:

1. check first digit - if not zero or + pass to the eyeball team
2. if +44 alter to 0
3. if 00 alter to +
4. Format to a standard (you'll strip spaces to test later but this makes it easier for reading)
5. Check the std code - if not valid pass to eyeball team
6. Check if number is valid format eg 020 xxxx xxxx or 01933 xxx xxx or 07 whatever (can't remember) if not pass to eyeball team

My guess is this will cause 10% - 15% to need eyeballing

I can let you have a reasonable std code table and, if you want it, phone number formatting. I can also give you a function which will be a good starter in stripping company names back to a comparable level.

>I am still curious about the existence of a "LIKE" method somewhere which could match data:
>
> +44 207 458 1221 with 0207 4581221
>
>i.e. ignore the spaces, ignore the +, ignore the additional "0" & "44" ...
>
>I guess for the special case of the phone number (which might be a good proxy Primary Key!) I might be able to make this work.
>
>For example if I reverse the order of the phone numbers, strip out the spaces and non number char, and take only the first 10 char I get an exact match ...

That won't work for all phone numbers

Roy Lambert [Team Elevate]
Image