Extracting information from CVs |
Tue, Jul 8 2008 7:51 AM
Roy Lambert NLH Associates Team Elevate | I was reminded yet again about this since some Russians are apparently extracting information from CVs held on Monster.xx and selling it. It's something I keep thinking I need to build into my app and keep failing to get my head round.
The problem I have is that CVs are essentially unstructured documents, and the text I want (name, phone numbers, email address, employer, dates employed and skills) in the main isn't susceptible to something like pattern analysis, so where do I start? Anyone have any experience with it and willing to give a poor old codger a guide?
Roy Lambert
Tue, Jul 8 2008 1:06 PM
"Raul" | Hi Roy,
I'm sure there are better techniques, but the last time I had to do something similar it involved text parsing and a lot of internal rules. In our case it was US/Canada info, so we were able to make some assumptions when parsing:
- basically anything with @ in it was a candidate for an email address (we did do a simple sanity check to make sure the domain, for example, looked OK, e.g. <anything>.<valid 3-4 letter domain>)
- phone numbers in North America are usually [1+]3+7 numbers in a number of formats (e.g. (123)123-4567, 11231234567, etc.)
- an address usually contains a province or state (which we know all of), so we assumed anything before that was the address
There were a number of other rules and the code was getting pretty complex. We did use some regex stuff for a couple of known areas, but doing our own parsing worked a lot better.
Raul
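For anyone wanting to try the "anything with @ in it" rule, here is a minimal sketch in Python. It is not the code from the project described above. The function name `email_candidates` and the `KNOWN_TLDS` list are assumptions made for illustration, and the list deliberately includes a couple of two-letter country codes alongside the 3-4 letter domains mentioned.

```python
# Hypothetical sample of accepted top-level domains; a real list would be longer.
KNOWN_TLDS = {"com", "net", "org", "edu", "gov", "info", "ca", "us"}


def email_candidates(text):
    """Return tokens that contain exactly one '@' and pass a basic domain check."""
    found = []
    for token in text.split():
        # Drop punctuation that commonly surrounds an address in running text.
        token = token.strip("<>()[]{},;:.\"'")
        if token.count("@") != 1:
            continue
        local, domain = token.split("@")
        labels = domain.lower().split(".")
        # Sanity check: non-empty local part, at least "name.tld", known TLD.
        if local and len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS:
            found.append(token)
    return found
```

Something like `email_candidates(cv_text)` would then give a list of candidates to check by hand or feed into further rules; it makes no attempt at full address validation.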
Tue, Jul 8 2008 2:19 PM
"Raul" | Re-posting. I have a new laptop and typo count is way up - Sorry.
Hi Roy,
I'm sure there are better techniques, but the last time I had to do something similar it involved text parsing and a lot of internal rules. In our case it was US/Canada info, so we were able to make some assumptions when parsing:
- basically anything with @ in it was a candidate for an email address (we did do a simple sanity check to make sure the domain, for example, looked OK, e.g. <anything>.<valid 3-4 letter domain>)
- phone numbers in North America are usually [1+]3+7 numbers (an optional leading 1, a 3-digit area code, then 7 digits) and can be in a number of (known) formats (e.g. (123)123-4567, 123.123.1234, 11231234567, etc.)
- an address usually contains a province or state (which we know all of), so we assumed anything before that was the address
There were a number of other rules and the code was getting pretty complex. We did use some regex stuff for a couple of known areas, but doing our own parsing worked a lot better.
Raul
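The phone and address rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the original code: it uses regular expressions for brevity even though hand-written parsing reportedly worked better, the names `find_phones` and `address_before_region` are made up, and `REGION_CODES` is a small hypothetical sample rather than the full list of states and provinces.

```python
import re

# Optional leading 1, 3-digit area code, then 3 + 4 digits, with common separators.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)

# Hypothetical sample only; a real list would hold every state and province code.
REGION_CODES = {"NY", "CA", "TX", "WA", "ON", "BC", "QC", "AB"}
REGION_RE = re.compile(r"\b(?:" + "|".join(sorted(REGION_CODES)) + r")\b")


def find_phones(text):
    """Return North American numbers normalised to 'AAA-EEE-NNNN'."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]


def address_before_region(line):
    """Return the text before the first state/province code, or None."""
    m = REGION_RE.search(line)
    if m is None:
        return None
    # Everything before the region code is assumed to be the address.
    return line[:m.start()].rstrip(" ,") or None
```

The region match is case-sensitive on purpose, so that ordinary words such as "on" are not mistaken for Ontario. Both functions are heuristics and will still need extra rules for real CVs.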