Full text indexing strategies |
Tue, Mar 4 2014 11:37 AM
Roy Lambert NLH Associates Team Elevate | My full text indexing is, generally, working fine. However, one CV service I use has started embedding plain text versions of the CV in the body of the email. That would present no problem BUT they have multiple (I stopped counting at 45) copies in there. The result is that an email which should be (guess) 30Kb is now over 2Mb and has over 150,000 words to be indexed. Strangely enough this takes a while (over 10 minutes in the worst case) and I need to think of how to handle it. I have complained to the site but don't have much hope of action, and even if they rectify this one I'm going to prepare for the next morons.
My current flow is:
1. download email, post and leave in table until someone looks at it
2. on display decode, check for images, store the message part (either html or plain text) in a column and post.

It's part 2 that causes the grief. When the code hits the .Post it wanders off to do the full text indexing. The user now has time to go off and grow his/her own coffee bush before making a cup of coffee!

Until such time as someone buys Tim a magic wand to wave and he lets us do full text indexing in the background, does anyone have a strategy to suggest? A major criterion is that this is file server.

Roy Lambert
Tue, Mar 4 2014 3:07 PM
Barry | Roy,
Can I assume "CV" means resume? Two solutions come to mind.

1) If the email has the CV embedded as plain text, why can't you have a text filter in a module that strips it out prior to indexing?

2) You need to remove full text indexing from real-time. So why not have two tables, a PreEdit table and a Search table? Steps 1 and 2 work on the PreEdit table, which does not have a full text index. When the user saves the PreEdit row in step 2, the column Status changes from "PE" (Pre-edit) to "Q" (Queued). A background task running on another computer takes the queued PreEdit record and adds it to the Search table, which has the full text index; after the record has been saved to the Search table, the task removes it from the PreEdit table. If this background task is the only one writing to the Search table, I assume it won't block readers.

3?) Use a different full text search engine with EDB. There is Lucene (requires .net) http://sourceforge.net/projects/mutis/ or Sphinx http://sourceforge.net/projects/delphisphinxcli/?source=recommended. Sphinx works independently of the database and returns the record numbers (like an AutoInc column value) that match the search criteria. They are both extremely fast but require a bit of work to get installed and running with EDB. (Like buying a house that is a "fixer-upper".) And of course you'll need to check the license agreements when distributing them with commercial products.

Barry
Wed, Mar 5 2014 2:12 AM
Roy Lambert NLH Associates Team Elevate | Barry
>Can I assume "CV" means resume?

A bit more than the US resume generally but that's the right idea.

>Two solutions come to mind.
>
>1) If the email has the CV embedded as plain text, why can't you have a text filter in a module that strips it out prior to indexing?

Because I may search the email but not the attachments for keywords. If I could figure out a sensible way of stripping it down to one copy of the embedded CV that would do the job, but from visually inspecting a few examples there's no way of distinguishing start / end points.

>2) You need to remove full text indexing from real-time. So why not have two tables, a PreEdit table and a Search table? Steps 1 and 2 work on the PreEdit table, which does not have a full text index. When the user saves the PreEdit row in step 2, the column Status changes from "PE" (Pre-edit) to "Q" (Queued). A background task running on another computer takes the queued PreEdit record and adds it to the Search table, which has the full text index; after the record has been saved to the Search table, the task removes it from the PreEdit table. If this background task is the only one writing to the Search table, I assume it won't block readers.

This is the sort of solution that's been going through my mind. I'll add your ideas to mine and see if I can come up with something feasible.

>3?) Use a different full text search engine with EDB. There is Lucene (requires .net) http://sourceforge.net/projects/mutis/ or Sphinx http://sourceforge.net/projects/delphisphinxcli/?source=recommended. Sphinx works independently of the database and returns the record numbers (like an AutoInc column value) that match the search criteria. They are both extremely fast but require a bit of work to get installed and running with EDB. (Like buying a house that is a "fixer-upper".) And of course you'll need to check the license agreements when distributing them with commercial products.

Interesting. Although I've known there are third-party tools out there, I've never really thought of using them alongside EDB. I'll work on option 2 for a while and keep this in mind in case I fail.

Thanks

Roy
Tue, Mar 11 2014 8:12 AM
Tim Young [Elevate Software] Elevate Software, Inc. timyoung@elevatesoft.com | Roy,
<< My full text indexing is, generally, working fine. However, one CV service I use has started embedding plain text versions of the CV in the body of the email. That would present no problem BUT they have multiple (I stopped counting at 45) copies in there. The result is that an email which should be (guess) 30Kb is now over 2Mb and has over 150,000 words to be indexed. Strangely enough this takes a while (over 10 minutes in the worst case) and I need to think of how to handle it. I have complained to the site but don't have much hope of action, and even if they rectify this one I'm going to prepare for the next morons. >>

So, *each* email has multiple copies of the CV in it? Or do you have multiple emails with the same CV in each?

Also, there are 150,000 words in *one* email? I just looked this up, and the median number of words for books is only 65k words.

Tim Young
Elevate Software
www.elevatesoft.com
Tue, Mar 11 2014 8:57 AM
Roy Lambert NLH Associates Team Elevate | Tim
>So, *each* email has multiple copies of the CV in it ?

Well, they seem to have solved that problem. The next iteration was 195 repeats of

<<CV Text: CV:>>

then this afternoon it's followed by 195 repeats of

<<CV Text:
- Exceptional business development / programme management executive with a contagiously positive driving force and highly effective team leadership skills
- Proven track record in complex, dynamic and highly challenging environments
- Extensive international experience across defence, civil aviation and aerospace industries
- Strong leadership skills in both business development and project execution domains
- Inventive, market savvy and creative whilst delivering return on investment
- Excellent collaborative skills and experienced in working with diverse teams, customers, suppliers and partners to deliver mutually successful outcomes.
See CV for further details>>

>Also, there are 150,000 words in *one* email ? I just looked this up, and the median number of words for books is only 65k words.

Another measure is that most of my ebooks are well under 1/2Mb and these emails were over 2Mb.

At least the changes show they're trying to do something - cocking it up but trying.

Roy
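When the repeats are identical, as in the 195 copies described above, one possible pre-indexing filter is to keep only the first occurrence of each line. The sketch below assumes the copies match line for line once whitespace and letter case are normalised, which the thread does not confirm for every message; the function name `collapse_repeated_lines` is hypothetical.

```python
def collapse_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-blank line, drop later repeats."""
    seen = set()
    kept = []
    for line in text.splitlines():
        # Normalise whitespace and case so near-identical copies compare equal.
        key = " ".join(line.split()).casefold()
        if key and key in seen:
            continue  # an identical line was already kept
        seen.add(key)
        kept.append(line)  # blank lines are always kept to preserve layout
    return "\n".join(kept)
```

Because one copy of every distinct line survives, every keyword that was in the original body is still present in the filtered text, so searching the email for keywords should still work. The filter also drops legitimately repeated lines, so it may be safer to apply it to a separate column used only for indexing than to the stored message itself.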