Thread Extracting information from cvs
Tue, Jul 8 2008 7:51 AM

Roy Lambert

NLH Associates

Team Elevate

I was reminded yet again about this since some Russians are apparently extracting information from CVs held on Monster.xx and selling it. It's something I keep thinking I need to build into my app and keep failing to get my head round.

The problem I have is that CVs are essentially unstructured documents, and the text I want (name, phone numbers, email address, employer, dates employed and skills) in the main isn't susceptible to something like pattern analysis, so where do I start?

Anyone have any experience with it and willing to give a poor old codger a guide?


Roy Lambert

Tue, Jul 8 2008 1:06 PM

"Raul"
Hi Roy,

I'm sure there are better techniques, but the last time I had to do
something similar it involved text parsing and a lot of internal rules.

In our case it was US/Canada info so we were able to make some assumptions
when parsing:
- basically anything with @ in it was a candidate for an email address (we
did do a simple sanity check to make sure the domain, for example, looked
OK, e.g. <anything>.<valid 3-4 letter domain>)
- phone numbers in North America are usually [1+]3+7 digits in a number of
formats (e.g. (123)123-4567, 11231234567, etc.)
- an address usually contains a province or state (which we know all of), so
we assumed anything before that was the address

There were a number of other rules and the code was getting pretty complex.
We did use some regex stuff for a couple of known areas, but doing our own
parsing worked a lot better.

Raul

Tue, Jul 8 2008 2:19 PM

"Raul"
Re-posting. I have a new laptop and my typo count is way up. Sorry.

Hi Roy,

I'm sure there are better techniques, but the last time I had to do
something similar it involved text parsing and a lot of internal rules.

In our case it was US/Canada info so we were able to make some assumptions
when parsing:
- basically anything with @ in it was a candidate for an email address (we
did do a simple sanity check to make sure the domain, for example, looked
OK, e.g. <anything>.<valid 3-4 letter domain>)
- phone numbers in North America are usually [1+]3+7 digits and can be in a
number of (known) formats (e.g. (123)123-4567, 123.123.1234, 11231234567,
etc.)
- an address usually contains a province or state (which we know all of), so
we assumed anything before that was the address
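
To make the email and phone rules above concrete, here is a minimal sketch
in Python. It is not the original code: the names find_emails, find_phones,
KNOWN_TLDS and PHONE_RE are made up for illustration, and the TLD list is
deliberately tiny (a real check would need a much fuller one).

```python
import re

# Assumption: a short illustrative whitelist, not a complete list of TLDs.
KNOWN_TLDS = {"com", "net", "org", "edu", "gov", "info", "ca", "us", "uk"}


def find_emails(text):
    """Return tokens with exactly one '@' whose domain ends in a known TLD."""
    emails = []
    for token in text.split():
        # Drop punctuation that commonly wraps an address in running text.
        token = token.strip("<>()[],;:.\"'")
        if token.count("@") != 1:
            continue
        local, domain = token.split("@")
        parts = domain.lower().split(".")
        # Sanity check: <anything>.<known TLD>, with no empty labels.
        if local and len(parts) >= 2 and all(parts) and parts[-1] in KNOWN_TLDS:
            emails.append(token)
    return emails


# Optional leading 1, then a 3-digit area code and a 7-digit number, with
# optional brackets, spaces, dots or dashes between the groups.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def find_phones(text):
    """Return North American numbers normalised to their 10 digits."""
    return ["".join(m.groups()) for m in PHONE_RE.finditer(text)]
```

Normalising every match to ten digits means the different written formats
can be compared or de-duplicated afterwards. Anything the pattern misses
(extensions, international numbers) would need further rules.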

There were a number of other rules and the code was getting pretty complex.
We did use some regex stuff for a couple of known areas, but doing our own
parsing worked a lot better.
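
As an illustration of the hand-written style of rule, the address
assumption (everything before a known state or province is the address)
could be sketched like this in Python. Again this is only a guess at the
shape of such a rule: address_before_region and REGION_CODES are
hypothetical names, and the code list is a small sample, not the full set.

```python
# Assumption: a partial sample of US state / Canadian province codes.
REGION_CODES = {"NY", "CA", "TX", "FL", "WA", "ON", "QC", "BC", "AB"}


def address_before_region(line):
    """Return (address, region) if a known region code appears, else None."""
    tokens = line.replace(",", " ").split()
    for i, token in enumerate(tokens):
        code = token.strip(".")
        # Match upper-case codes only, so ordinary words such as "on"
        # are not mistaken for Ontario; require some text before the code.
        if i > 0 and code in REGION_CODES:
            return " ".join(tokens[:i]), code
    return None
```

For example, "123 Main Street, Toronto, ON M5V 2T6" yields the address
"123 Main Street Toronto" with region "ON", while a sentence such as
"Worked on database projects" yields nothing. Small guards like the
upper-case check are where the rule count starts to grow.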

Raul