Support Forums - View Thread

Extracting information from a text document

Tue, Sep 25 2007 7:29 AM
Roy Lambert NLH Associates Team Elevate	I want to start extracting information from people's CVs. Initially I'll be happy (ecstatic) if I can get the employment history. The problem is that, people being people, the CVs are not formatted in the same way. I've posted a .doc to binaries with a slug of samples from a random selection of CVs. To my eyes there are too many different formats to easily use pattern recognition, e.g. date, separator, date, company, newline, job title. Is anyone able to give some guidance or suggest some reading? (I've been googling till my eyes hurt.)
Roy Lambert
Wed, Sep 26 2007 6:08 AM
	Based on what I read of agencies, you simply search for keywords, then apply the candidate. Thus when the CV says "I hate C++, I'm really bad at Delphi, and Java is the devil", they get offered jobs for all of them. /Matthew Jones/
Wed, Sep 26 2007 9:01 AM
Roy Lambert NLH Associates Team Elevate	Matthew
Getting the "skills" out is a doddle: tokenise the CV and extract ANY words that match skills (taking C# as a word). Without a pair of eyeballs in the process, I think you have indicated its usefulness. What I want is to get the employment history, which may come in several forms in the CV, e.g.
January 2007 ~ Axis Electronics Present (Contract Electronics Manufacturer) Responsibilities: Materials Manager – Purchasing, Material Control & Logistics.
----
Feb 03’ to present date – Project Manager (Sensors) GE Druck Ltd. Fir Tree Lane Leicester.
----
Managing Director Inditex UK & Ireland 2003- Date (Zara UK Ltd, Massimo Dutti UK Ltd, ZA Clothing Ireland Ltd, Zara Home Ltd, Bershka UK Ltd, Bershka Ireland Ltd, Massimo Dutti Ireland Ltd, Pull & Bear Ireland, Stradivarius Ireland)
Roy Lambert
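For readers following along, the tokenise-and-match step described above might look something like the Python sketch below. It is only an illustration: the `SKILLS` set and the `extract_skills` name are made up here, and the token pattern is one possible way of keeping "C#" and "C++" as single words.

```python
import re

# Hypothetical skill list; a real one would come from your own data.
SKILLS = {"c#", "c++", "delphi", "java", "sql"}

# A token is a run of letters/digits, optionally followed by "#" or "++",
# so that "C#" and "C++" survive tokenising as single words.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:\+\+|#)?")

def extract_skills(cv_text):
    """Return the sorted list of known skills mentioned anywhere in the text."""
    tokens = {t.lower() for t in TOKEN_RE.findall(cv_text)}
    return sorted(tokens & SKILLS)

print(extract_skills("I hate C++, I'm really bad at Delphi, and Java is the devil"))
```

As the example sentence shows, matching words alone says nothing about whether the candidate is any good at (or even likes) the skill, which is the limitation already pointed out in the thread.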
Wed, Sep 26 2007 12:43 PM
	Hmm, I doubt that could be well automated, but I'm reminded of my code that takes lists and tries to work out the titles etc. What I do is look for blank lines, and assume that a blank line indicates a separator, and therefore that the next line must be the title, to be followed by options (in your case the details of the employment). A lot of heuristics come into play, but you might assume that if the first or second line has a year in it, then that line must contain the date. Otherwise check from the bottom; perhaps it is there. But the key here is that, having worked out what might make sense, the details are shown in a list (Raize as it happens) which allows you to colour the lines to indicate the evaluation. The end user can then verify that the analysis is correct, or modify the results by selecting a line and clicking the "this is the date" button. Essentially, I do the best the code can do, and then let the user make it proper. After that, I code it up and use it in my internal format.
/Matthew Jones/
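A rough Python sketch of the blank-line and year heuristic described above, for illustration only. The function names and the year pattern are assumptions made here, not the original code; the pattern only recognises four-digit years, so a date written like "Feb 03" would be missed and left for the user to correct.

```python
import re

# Four-digit years from 1900 to 2099; two-digit years are deliberately not handled.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def split_blocks(text):
    """Split text into blocks of non-empty lines, using blank lines as separators."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def guess_date_line(block):
    """Return the index of the line most likely to hold the date, or None."""
    # First guess: a year in the first or second line.
    for i in range(min(2, len(block))):
        if YEAR_RE.search(block[i]):
            return i
    # Otherwise work upwards from the bottom of the block.
    for i in range(len(block) - 1, 1, -1):
        if YEAR_RE.search(block[i]):
            return i
    return None

sample = ("Managing Director\nInditex UK & Ireland 2003- Date\n\n"
          "Project Manager\nGE Druck Ltd\nLeicester\nFeb 2003 to present")
for block in split_blocks(sample):
    print(guess_date_line(block), block)
```

The guessed index is only a starting point; the idea in the post is that the user then confirms or overrides it.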
Wed, Sep 26 2007 1:16 PM
Roy Lambert NLH Associates Team Elevate	Matthew
I've been going through a few samples and defining the patterns. So far I get
CNJNrNDXD
CXDNRSCXDXD
CXDXDNJNRXJD
DSDNCXr
DXCNDXJNJXRXJ
DXCXJNDXJNRX
DXDNCXJNR
DXDNRJC
DXDNRXC
DXDNrXCXJ
DXDNRXSCSJ
DXDXCNDXDXR
DXDXCNDXRXJ
DXDXcNR
DXDXCXJ
DXDXCXJNDXJRNDXJRNDXJR
DXDXCXJNXR
DXDXCXr
DXDXRXC
DXDXSRNCXJ
DXSRXSCXJNDJ
DXSSRXSC
RNDXDXSCCXJXDXDNJNR
RXCXDX
RXCXDXD
Really "simple" to code for
Roy Lambert
D=date
S=space
X=punctuation or other non-space separator
C=company
R=role
J=junk
N=newline
Uppercase means only alphanumerics; lowercase means a non-terminating separator other than space is embedded.
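To make the notation concrete, here is a small Python sketch that builds one of these pattern strings from a list of already-classified tokens. It assumes the hard part (deciding which token is a date, company, role and so on) has been done elsewhere, and it reads the lowercase rule as "the company or role text contains some non-space, non-alphanumeric character"; both the reading and the names `signature` and `KNOWN_PATTERNS` are assumptions made for this example.

```python
# A few of the patterns listed above, used here only for the lookup demo.
KNOWN_PATTERNS = {"DXDXCXJ", "DXDNRXC", "DXDXcNR", "RXCXDXD"}

def has_embedded_separator(text):
    """True if the text contains a character that is neither alphanumeric nor whitespace."""
    return any(not (ch.isalnum() or ch.isspace()) for ch in text)

def signature(tokens):
    """Build a pattern string from (kind, text) pairs, kind being one of D S X C R J N."""
    letters = []
    for kind, text in tokens:
        # Assumed reading of the lowercase rule, applied to company and role only.
        if kind in ("C", "R") and has_embedded_separator(text):
            letters.append(kind.lower())
        else:
            letters.append(kind)
    return "".join(letters)

tokens = [
    ("D", "Jan 2007"), ("X", "~"), ("D", "Present"), ("X", ":"),
    ("C", "Smith & Jones"), ("N", "\n"), ("R", "Materials Manager"),
]
sig = signature(tokens)
print(sig, sig in KNOWN_PATTERNS)
```

Looking the signature up in a set of known patterns tells you which layout you are dealing with; anything not in the set would presumably need either a new pattern or a human to look at it.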