Thread: Full text indexing strategies
Tue, Mar 4 2014 11:37 AM

Roy Lambert

NLH Associates

Team Elevate

My full text indexing is, generally, working fine. However, one CV service I use has started embedding plain text versions of the CV in the body of the email. That would present no problem BUT they have multiple (I stopped counting at 45) copies in there. The result is that an email which should be (guess) 30KB is now over 2MB and has over 150,000 words to be indexed. Strangely enough this takes a while (over 10 minutes in the worst case) and I need to think of how to handle it. I have complained to the site but don't have much hope of action, and even if they rectify this one I'm going to prepare for the next morons.

My current flow is:
1. download email, post and leave in table until someone looks at it
2. on display decode, check for images, store the message part (either html or plain text) in a column and post.

It's part 2 that causes the grief. When the code hits the .Post it wanders off to do the full text indexing. The user now has time to go off and grow his/her own coffee bush before making a cup of coffee!

Until such time as someone buys Tim a magic wand to wave and he lets us do full text indexing in the background, does anyone have a strategy to suggest?

A major criterion is that this is a file-server setup.

Roy Lambert
Tue, Mar 4 2014 3:07 PM

Barry

Roy,

Can I assume "CV" means resume?

Two solutions come to mind.

1) If the email has the CV embedded as plain text, why can't you have a text filter in a module that strips it out prior to indexing?

2) You need to remove full text indexing from real-time. So why not have two tables, a PreEdit table and a Search table? Steps 1 & 2 work on the PreEdit table, which does not have a full text index. When the PreEdit row gets saved in #2 by the user, it sets the column Status from "PE" (Pre-edit) to "Q" (Queued). A background task running on another computer takes the queued PreEdit record and adds it to the Search table, which has the full text index, and after the record has been saved to the Search table, it removes it from the PreEdit table. If this background task is the only one writing to the Search table, I assume it won't block readers.

3?) Use a different full text search engine with EDB. There is Lucene (requires .net) http://sourceforge.net/projects/mutis/ or Sphinx http://sourceforge.net/projects/delphisphinxcli/?source=recommended. Sphinx works independently of the database and returns the record numbers (like AutoInc column value) that match the search criteria. They are both extremely fast but require a bit of work to get them installed and running with EDB. (Like buying a house that is a "fixer-upper"). And of course you'll need to check the license agreements when distributing them with commercial products.

Barry
Wed, Mar 5 2014 2:12 AM

Roy Lambert

NLH Associates

Team Elevate

Barry

>Can I assume "CV" means resume?

A bit more than the US resume generally but that's the right idea.

>Two solutions come to mind.
>
>1) If the email has the CV embedded as plain text, why can't you have a text filter in a module that strips it out prior to indexing?

Because I may search the email but not the attachments for keywords. If I could figure out a sensible way of stripping it down to one embedded copy of the CV that would do the job, but from visually inspecting a few examples there's no way of distinguishing start / end points.
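
One possible way around the missing start / end markers, sketched in Python purely as an illustration: drop lines that have already been seen rather than trying to find whole copies. This assumes the repeated copies are identical line for line (apart from whitespace and case), which may not hold for every sender, and the function name drop_repeated_lines is made up for this example. For keyword searching no distinct word is lost, but anything that relies on word counts or on phrases spanning lines could be affected.

```python
def drop_repeated_lines(text):
    """Return text with every repeated non-blank line removed.

    The first occurrence of each line is kept in its original position.
    Lines are compared after collapsing whitespace and lowercasing.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        key = " ".join(line.split()).lower()
        if not key:
            # Collapse runs of blank lines left behind by removed copies.
            if kept and kept[-1] == "":
                continue
            kept.append("")
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept).strip()
```

The reduced text would be what goes into the indexed column; the original message could still be stored untouched elsewhere.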

>2) You need to remove full text indexing from real-time. So why not have two tables, a PreEdit table and a Search table? Steps 1 & 2 work on the PreEdit table, which does not have a full text index. When the PreEdit row gets saved in #2 by the user, it sets the column Status from "PE" (Pre-edit) to "Q" (Queued). A background task running on another computer takes the queued PreEdit record and adds it to the Search table, which has the full text index, and after the record has been saved to the Search table, it removes it from the PreEdit table. If this background task is the only one writing to the Search table, I assume it won't block readers.

This is the sort of solution that's been going through my mind. I'll add your ideas to mine and see if I can come up with something feasible.

>3?) Use a different full text search engine with EDB. There is Lucene (requires .net) http://sourceforge.net/projects/mutis/ or Sphinx http://sourceforge.net/projects/delphisphinxcli/?source=recommended. Sphinx works independently of the database and returns the record numbers (like AutoInc column value) that match the search criteria. They are both extremely fast but require a bit of work to get them installed and running with EDB. (Like buying a house that is a "fixer-upper"). And of course you'll need to check the license agreements when distributing them with commercial products.

Interesting. Although I've known there are third party tools out there, I've never really thought of using them as well. I'll work on option 2 for a while and keep this in mind in case I fail.

Thanks

Roy
Tue, Mar 11 2014 8:12 AM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< My full text indexing is, generally, working fine. However, one CV
service I use has started embedding plain text versions of the CV in the
body of the email. That would present no problem BUT they have multiple (I
stopped counting at 45) copies in there. The result is that an email which
should be (guess) 30KB is now over 2MB and has over 150,000 words to be
indexed. Strangely enough this takes a while (over 10 minutes in the worst
case) and I need to think of how to handle it. I have complained to the site
but don't have much hope of action, and even if they rectify this one I'm
going to prepare for the next morons. >>

So, *each* email has multiple copies of the CV in it? Or do you have
multiple emails with the same CV in each? Also, there are 150,000 words in
*one* email? I just looked this up, and the median number of words for
books is only 65k words.

Tim Young
Elevate Software
www.elevatesoft.com
Tue, Mar 11 2014 8:57 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


>So, *each* email has multiple copies of the CV in it?

Well, they seem to have solved that problem. The next iteration was 195 repeats of <<CV Text: CV:>>, then this afternoon it's followed by

195 repeats of

<<CV Text: - Exceptional business development / programme management executive with a contagiously positive driving force and highly effective team leadership skills
- Proven track record in complex, dynamic and highly challenging environments
- Extensive international experience across defence, civil aviation and aerospace industries
- Strong leadership skills in both business development and project execution domains
- Inventive, market savvy and creative whilst delivering return on investment
- Excellent collaborative skills and experienced in working with diverse teams, customers, suppliers and partners to deliver mutually successful outcomes.

See CV for further details>>

>Also, there are 150,000 words in
>*one* email? I just looked this up, and the median number of words for
>books is only 65k words.

Another measure is that most of my ebooks are well under 1/2MB and these emails were over 2MB. At least the changes show they're trying to do something - cocking it up but trying.

Roy