|Extracting information from a text document|
|Tue, Sep 25 2007 7:29 AM|
I want to start extracting information from people's CVs. Initially I'll be happy (ecstatic) if I can get the employment history. The problem is that, people being people, the CVs are not all formatted in the same way.
I've posted a .doc to binaries with a slug of samples from a random selection of CVs. To my eyes there are too many different formats to easily use pattern recognition, e.g. date, separator, date, company, newline, job title. Is anyone able to give some guidance or suggest some reading? (I've been googling till my eyes hurt.)
|Wed, Sep 26 2007 6:08 AM|
Based on what I read of agencies, you simply search for keywords, then
apply the candidate. Thus when the CV says "I hate C++, I'm really bad at
Delphi, and Java is the devil", they get offered jobs for all of them.
|Wed, Sep 26 2007 9:01 AM|
Getting the "skills" out is a doddle: tokenise the CV and extract ANY words that match skills (treating C# as a word). Without a pair of eyeballs in the process, I think you have indicated how useful that is.
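A minimal sketch of that tokenise-and-match step, in Python purely for illustration. The skill list, the `extract_skills` name and the token rule (letters or digits optionally followed by `#` or `+`) are all assumptions, not anything taken from a real system:

```python
import re

# Hypothetical skill list; a real one would presumably live in a database table.
SKILLS = {"c#", "c++", "delphi", "java", "sql"}

# A token is a run of letters/digits optionally followed by '#' or '+' characters,
# so "C#" and "C++" survive as single words instead of collapsing to "C".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+[#+]*")

def extract_skills(text, skills=SKILLS):
    """Return the sorted list of known skills mentioned anywhere in text."""
    tokens = {t.lower() for t in TOKEN_RE.findall(text)}
    return sorted(tokens & skills)

print(extract_skills("I hate C++, I'm really bad at Delphi, and Java is the devil"))
```

As the example sentence shows, a pure keyword match has no idea whether the skill is being claimed or disowned.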
What I want is to get the employment history, which may come in several forms in the CV, e.g.:
January 2007 ~ Axis Electronics
Present (Contract Electronics Manufacturer)
Responsibilities: Materials Manager – Purchasing, Material Control & Logistics.

Feb 03’ to present date – Project Manager (Sensors)
GE Druck Ltd. Fir Tree Lane Leicester.

Managing Director Inditex UK & Ireland 2003- Date
(Zara UK Ltd, Massimo Dutti UK Ltd, ZA Clothing Ireland Ltd, Zara Home Ltd, Bershka UK Ltd, Bershka Ireland Ltd, Massimo Dutti Ireland Ltd, Pull & Bear Ireland, Stradivarius Ireland)
|Wed, Sep 26 2007 12:43 PM|
Hmm, I doubt that could be well automated, but I'm reminded of my code
that takes lists and tries to work out the titles etc. What I do is look
for blank lines, and assume that a blank line indicates a separator, and
therefore the next line must be the title to be followed by options (in
your case the details of the employment). A lot of heuristics come into
play, but you might assume that if the first or second line has a year in it,
then that line must contain the date. Otherwise check from the bottom; perhaps
the date is there. But the key here is that, having worked out what might make
sense, the details are shown in a list (Raize as it happens) which allows
you to colour the lines to indicate the evaluation. The end user can then
verify the analysis is correct, or modify the results by selecting a line
and clicking the "this is the date" button.
Essentially, I do the best the code can do, and then let the user make it
proper. After that, I code it up and use it in my internal format.
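The blank-line and year heuristics described above could be sketched roughly as follows. This is Python for illustration only, not the code referred to in the post; the function names, the result layout and the four-digit-year rule are assumptions. A two-digit year such as the one in "Feb 03’" would not be caught by this rule, which is exactly the kind of case the user review step is for.

```python
import re

# Assumption: a "year" is a four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def split_blocks(text):
    """Split text into blocks of non-blank lines, using blank lines as separators."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def guess_date_line(block):
    """Return the index of the line guessed to hold the date, or None."""
    # First try the first or second line of the block.
    for i in range(min(2, len(block))):
        if YEAR_RE.search(block[i]):
            return i
    # Otherwise check the remaining lines from the bottom up.
    for i in range(len(block) - 1, 1, -1):
        if YEAR_RE.search(block[i]):
            return i
    return None

def analyse(text):
    """Propose a title line and a date line for every block in text."""
    return [
        {"title": block[0], "date_line": guess_date_line(block), "lines": block}
        for block in split_blocks(text)
    ]
```

The result is only a proposal: each block's guessed title and date line would be shown to the user, who confirms or corrects them before anything is stored.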
|Wed, Sep 26 2007 1:16 PM|
I've been going through a few samples and defining the patterns. So far I get
Really "simple" to code for
X=punctuation or other non-space separator
Uppercase means only alphanumerics; lowercase means a non-terminating separator other than a space is embedded.