Thread: Extracting information from a text document
Tue, Sep 25 2007 7:29 AM

Roy Lambert

NLH Associates

Team Elevate

I want to start extracting information from people's CVs. Initially I'll be happy (ecstatic) if I can get the employment history. The problem is that, people being people, the CVs are not formatted in the same way.

I've posted a .doc to binaries with a slug of samples from a random selection of CVs. To my eyes there are too many different formats to easily use pattern recognition, e.g. date, separator, date, company, newline, job title. Is anyone able to give some guidance or suggest some reading? (I've been googling till my eyes hurt.)

Roy Lambert
Wed, Sep 26 2007 6:08 AM

Based on what I read of agencies, you simply search for keywords, then
apply the candidate. Thus when the CV says "I hate C++, I'm really bad at
Delphi, and Java is the devil", they get offered jobs for all of them.

/Matthew Jones/
Wed, Sep 26 2007 9:01 AM

Roy Lambert

NLH Associates

Team Elevate

Matthew


Getting the "skills" out is a doddle: tokenise the CV and extract ANY words that match skills (taking C# as a word). Without a pair of eyeballs in the process, I think you have indicated its usefulness.

What I want is to get the employment history, which may come in several forms in the CV, e.g.
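
As a rough illustration of the tokenise-and-match idea above, here is a minimal Python sketch. The skill list, the token pattern and the function name `extract_skills` are all assumptions made for the example, not anything from the original post; a real list would presumably come from the agency's own database.

```python
import re

# Hypothetical skill list, lower-cased for case-insensitive matching.
SKILLS = {"c#", "c++", "delphi", "java", "sql"}

# A token is a run of letters/digits optionally followed by '#' or '+',
# so "C#" and "C++" survive as single words.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+[#+]*")

def extract_skills(text):
    """Return the set of known skills mentioned anywhere in the text."""
    tokens = {t.lower() for t in TOKEN_RE.findall(text)}
    return tokens & SKILLS
```

Run against Matthew's example sentence, this returns C++, Delphi and Java regardless of what the candidate said about them, which is exactly the weakness he describes.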



January 2007 ~ Axis Electronics
Present (Contract Electronics Manufacturer)

Responsibilities: Materials Manager Purchasing, Material Control & Logistics.

----

Feb 03 to present date Project Manager (Sensors)
GE Druck Ltd. Fir Tree Lane Leicester.

----

Managing Director Inditex UK & Ireland 2003- Date
(Zara UK Ltd, Massimo Dutti UK Ltd, ZA Clothing Ireland Ltd, Zara Home Ltd, Bershka UK Ltd, Bershka Ireland Ltd, Massimo Dutti Ireland Ltd, Pull & Bear Ireland, Stradivarius Ireland)


Roy Lambert
Wed, Sep 26 2007 12:43 PM

Hmm, I doubt that could be well automated, but I'm reminded of my code
that takes lists and tries to work out the titles etc. What I do is look
for blank lines, and assume that a blank line indicates a separator, and
therefore the next line must be the title to be followed by options (in
your case the details of the employment). A lot of heuristics come into
play, but you might assume that if the first or second line has a year in,
then that must contain the date. Otherwise check from the bottom - perhaps
that is there. But the key here is that having worked out what might make
sense, the details are shown in a list (Raize as it happens) which allows
you to colour the lines to indicate the evaluation. The end user can then
verify the analysis is correct, or modify the results by selecting a line
and clicking the "this is the date" button.

Essentially, I do the best the code can do, and then let the user make it
proper. After that, I code it up and use it in my internal format.

/Matthew Jones/
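
The blank-line and year heuristics described in the post above might be sketched roughly as follows in Python. This is only one reading of the description, not Matthew's actual code: the function names are invented for the example, and the year test assumes four-digit years starting with 19 or 20, so a two-digit date such as "Feb 03" is deliberately left for the user to mark by hand.

```python
import re

# Assumption: a "year" is a standalone four-digit number starting 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def split_blocks(text):
    """Split text into blocks of consecutive non-blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def guess_date_line(block):
    """Return the index of the line most likely to hold the date, or None."""
    # First try the first and second lines of the block.
    for i in range(min(2, len(block))):
        if YEAR_RE.search(block[i]):
            return i
    # Otherwise check the remaining lines from the bottom up.
    for i in range(len(block) - 1, 1, -1):
        if YEAR_RE.search(block[i]):
            return i
    return None
```

The returned index is only a guess. In the workflow described above it would be used to colour the line in the list so the end user can confirm or correct it.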
Wed, Sep 26 2007 1:16 PM

Roy Lambert

NLH Associates

Team Elevate

Matthew


I've been going through a few samples and defining the patterns. So far I get

CNJNrNDXD
CXDNRSCXDXD
CXDXDNJNRXJD
DSDNCXr
DXCNDXJNJXRXJ
DXCXJNDXJNRX
DXDNCXJNR
DXDNRJC
DXDNRXC
DXDNrXCXJ
DXDNRXSCSJ
DXDXCNDXDXR
DXDXCNDXRXJ
DXDXcNR
DXDXCXJ
DXDXCXJNDXJRNDXJRNDXJR
DXDXCXJNXR
DXDXCXr
DXDXRXC
DXDXSRNCXJ
DXSRXSCXJNDJ
DXSSRXSC
RNDXDXSCCXJXDXDNJNR
RXCXDX
RXCXDXD

Really "simple" to code for :(


Roy Lambert

D=date
S=space
X=punctuation or other non-space separator
C=company
R=role
J=junk
N=newline
Uppercase means the field contains only alphanumerics; lowercase means it has an embedded non-terminating separator other than a space.
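
One possible way to make a list of signatures like this more manageable is to reduce each one to the order in which the real fields (D, C, R) appear, ignoring separators, newlines and junk, and then group signatures that share an order. The Python sketch below is an assumption about how that could be done, not part of the original post. It throws away the upper/lowercase distinction and collapses repeated letters such as a start and end date into one.

```python
from collections import defaultdict
from itertools import groupby

# Only date, company and role count as fields; S, X, N and J are dropped.
FIELDS = set("DCR")

def field_order(signature):
    """Reduce a signature such as 'DXDNCXJNR' to its field order, 'DCR'."""
    letters = [ch for ch in signature.upper() if ch in FIELDS]
    # Collapse runs of the same field (e.g. start date, end date) to one letter.
    return "".join(key for key, _ in groupby(letters))

def group_by_order(signatures):
    """Map each field order to the list of signatures that reduce to it."""
    groups = defaultdict(list)
    for sig in signatures:
        groups[field_order(sig)].append(sig)
    return dict(groups)
```

For example, `field_order("DXDNCXJNR")` and `field_order("DXDXcNR")` both give `"DCR"`, while `field_order("RXCXDXD")` gives `"RCD"`, so a parser might only need one rule per field order rather than one per full signature.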
