WordGenerator |
Thu, Nov 29 2007 8:41 AM
Roy Lambert, NLH Associates, Team Elevate

I think I'm beginning to get my head round what I'll need for my full text indexing. I'll want some specific text filters just to strip the unwanted stuff and then some word generators to make sure things are handled as I want (word length, stop chars etc). Before I can move on though I need more info about the word generator template

    procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
      Collation: Integer;           <<<<<<<<<< what is this? I thought collations were ANSI, ENG etc not an integer
      const Text: String;           <<<<<<<<<< this seems to be the full output from the filter yes/no
      var Position: Integer;        } output of some sort
      var Word: String;             } don't quite know what to do with them
      SearchWords: Boolean=False);  <<<<<<<<<<<<< no idea

In my test I have had Position up to 1500 on a 30 char string, and it just keeps going.

I know it's probably too late, but if this is being called once for each word in the string, and has to extract the next word each time, then it doesn't seem overly efficient.

That to one side, can we have a bit more info about how it's supposed to work and what the parameters are please?

Another concern. To test, all I've done is add this to the word generator code in your template

    showmessage(text+#13+word+#13+inttostr(position));
    position := position+1;
    Word := 'XXX';

When I alter the text in the CLOB to add multiple lines I only see line 1. This may be because it never passes the first stage of removing the old words. I did this to see if the CRLF pair is passed into the word generator.

Roy Lambert
Thu, Nov 29 2007 5:09 PM
Tim Young [Elevate Software], Elevate Software, Inc., timyoung@elevatesoft.com

Roy,

<< I think I'm beginning to get my head round what I'll need for my full text indexing. I'll want some specific text filters just to strip the unwanted stuff and then some word generators to make sure things are handled as I want (word length, stop chars etc). Before I can move on though I need more info about the word generator template

procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
Collation: Integer; <<<<<<<<<< what is this? I thought collations were ANSI, ENG etc not an integer >>

The value corresponds to the Windows locale, or 0 if it is the default ANSI/UNI collation. For example, English US under Windows is 1033.

<< const Text: String; <<<<<<<<<< this seems to be the full output from the filter yes/no >>

If the text was filtered, yes.

<< var Position: Integer;} output of some sort >>

The current position.

<< var Word: String; } don't quite know what to do with them
SearchWords: Boolean=False); <<<<<<<<<<<<< no idea >>

Read the template text for the event handler:

"Be sure to increment the Position variable parameter as appropriate while generating each word. The Position parameter (1-based) is used to indicate to ElevateDB when the word generation is complete. If the Position parameter is greater than the length of Text, then the word generation is considered complete. The SearchWords parameter is used to indicate when the word generation is occurring on search words. In such a case, you need to retain all asterisks (*) used in words in order to allow the partial-search functionality to work correctly."

<< In my test I have had Position up to 1500 on a 30 char string, and it just keeps going. >>

Are you still returning a word? It will keep going as long as you keep returning words.

<< I know it's probably too late, but if this is being called once for each word in the string, and has to extract the next word each time, then it doesn't seem overly efficient. >>

The overhead isn't bad, just slightly worse than the call overhead for the built-in functionality.

<< When I alter the text in the CLOB to add multiple lines I only see line 1. This may be because it never passes the first stage of removing the old words. I did this to see if the CRLF pair is passed into the word generator. >>

Anything that is in the CLOB column will get passed on to the word generator provided that any applicable text filter doesn't remove it first.

--
Tim Young
Elevate Software
www.elevatesoft.com
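The Position contract described in the template text can be sketched as a stand-alone console program. This is only an illustration under stated assumptions, not ElevateDB's own code: GenerateWord here is a plain procedure that borrows the template's parameter list, IsWordChar is a hypothetical helper that treats only ASCII letters and digits as word characters, and the repeat loop is a guess at how the engine drives the handler, based on the remark that it keeps calling for as long as a word is returned.

```pascal
program WordGenSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Assumption: a "word" is a run of ASCII letters and digits.
function IsWordChar(C: Char): Boolean;
begin
  Result := ((C >= 'a') and (C <= 'z')) or ((C >= 'A') and (C <= 'Z')) or
            ((C >= '0') and (C <= '9'));
end;

// Hypothetical stand-in for the event handler; same parameter list as the
// template. Collation and SearchWords are ignored in this sketch.
procedure GenerateWord(Collation: Integer; const Text: String;
  var Position: Integer; var Word: String; SearchWords: Boolean = False);
var
  StartPos: Integer;
begin
  // Skip delimiters (spaces, punctuation, CR/LF) in front of the next word.
  while (Position <= Length(Text)) and not IsWordChar(Text[Position]) do
    Inc(Position);
  StartPos := Position;
  // Advance Position to just past the end of the word.
  while (Position <= Length(Text)) and IsWordChar(Text[Position]) do
    Inc(Position);
  // Once Position is past the end of Text this yields an empty word.
  Word := Copy(Text, StartPos, Position - StartPos);
end;

var
  Txt, W: String;
  P: Integer;
begin
  Txt := 'line one'#13#10'line two';
  P := 1; // Position is 1-based
  // Assumed engine behaviour: keep calling until no word comes back.
  repeat
    GenerateWord(0, Txt, P, W);
    if W <> '' then
      WriteLn(P, ' ', W);
  until W = '';
end.
```

Because the sketch returns an empty word as soon as Position runs past the end of Text, the loop stops on its own. A handler that always assigns a non-empty Word, like the 'XXX' test code earlier in the thread, never gives the caller a reason to stop.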
Fri, Nov 30 2007 4:31 AM
Roy Lambert, NLH Associates, Team Elevate

Tim

>The value corresponds to the Windows locale, or 0 if it is the default
>ANSI/UNI collation. For example, English US under Windows is 1033.

I don't think I'll ever need it but is there a table somewhere?

><< const Text: String; <<<<<<<<<< this seems to be the full output from the
>filter yes/no >>
>
>If the text was filtered, yes.
>
><< var Position: Integer;} output of some sort >>
>
>The current position.
>
><< var Word: String; } don't quite know what to do with them
> SearchWords: Boolean=False); <<<<<<<<<<<<< no idea >>
>
>Read the template text for the event handler:

I did - I've just re-read it and it still means nothing. But a light bulb is dimly glowing. The only way I can see this meaning anything is for this

<< Field CONTAINS 'elevatesoft produces brilliant software'

Does it

a) just leave it in and hence the query is GUARANTEED to return false since elevatesoft isn't in the index? >>

Correct.

to be incorrect, and for the word generator to be pre-processing the search terms for a query. Yes, no, stick my head down the loo and flush?

>Are you still returning a word? It will keep going as long as you keep
>returning words.

So "If the Position parameter is greater than the length of Text, then the word generation is considered complete." isn't quite sufficient, or is only for internal consumption for the word generator?

Roy Lambert

ps is it a good idea to use Word as the variable name for the words when it's also a whatsit - you know, LargeNumber: word type of thingy?
Mon, Dec 3 2007 8:11 PM
Tim Young [Elevate Software], Elevate Software, Inc., timyoung@elevatesoft.com

Roy,

<< I don't think I'll ever need it but is there a table somewhere? >>

http://www.microsoft.com/globaldev/reference/winxp/xp-lcid.mspx

<< To be incorrect and the word generator is pre-processing the search terms for a query. Yes, no, stick my head down the loo and flush? >>

Yes, the word generator does pre-process the search terms, and when it does, the SearchWords parameter will be set to True.

<< So "If the Position parameter is greater than the length of Text, then the word generation is considered complete." isn't quite sufficient, or is only for internal consumption for the word generator? >>

It considers it complete at that point, but will keep calling it until you stop returning words. The reason for this is to allow for stemming whereby one word at a given position can return multiple words without actually moving the position any further.

<< ps is it a good idea to use Word as the variable name for the words when it's also a whatsit - you know, LargeNumber: word type of thingy? >>

I don't think Delphi has a problem with it since it can use the context in which it is used to determine which "Word" is being referred to.

--
Tim Young
Elevate Software
www.elevatesoft.com
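One way to honour the SearchWords flag is to strip unwanted characters from a token but let asterisks through when the words come from a search expression. The sketch below is an assumption about how a custom generator might do that; CleanToken is a hypothetical helper, not part of the ElevateDB template, and the rule that only ASCII letters and digits count as word characters is chosen purely for illustration.

```pascal
program SearchWordsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: keeps letters and digits, and keeps '*' only when
// the token comes from a search expression (SearchWords = True).
function CleanToken(const Raw: String; SearchWords: Boolean): String;
var
  I: Integer;
  C: Char;
begin
  Result := '';
  for I := 1 to Length(Raw) do
  begin
    C := Raw[I];
    if ((C >= 'a') and (C <= 'z')) or ((C >= 'A') and (C <= 'Z')) or
       ((C >= '0') and (C <= '9')) or (SearchWords and (C = '*')) then
      Result := Result + C;
  end;
end;

begin
  WriteLn(CleanToken('elevate*,', False)); // indexing pass: prints elevate
  WriteLn(CleanToken('elevate*,', True));  // search pass:   prints elevate*
end.
```

With the asterisk retained on the search pass, the partial-search functionality mentioned in the template text still has a wildcard to work with; on the indexing pass it is treated like any other stop character.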
Tue, Dec 4 2007 3:29 AM
Roy Lambert, NLH Associates, Team Elevate

Tim

>Yes, the word generator does pre-process the search terms, and when it does,
>the SearchWords parameter will be set to True.

Worth adding into the template?

><< So "If the Position parameter is greater than the length of Text, then
>the word generation is considered complete." isn't quite sufficient, or is
>only for internal consumption for the word generator? >>
>
>It considers it complete at that point, but will keep calling it until you
>stop returning words. The reason for this is to allow for stemming whereby
>one word at a given position can return multiple words without actually
>moving the position any further.

What? This I really don't understand, can you give me an example?

Also, if my guess as to how to drive the word generator (ie start at Position in the string and move to the next delimiter, which becomes Position for the next entry) is right, what am I meant to do when Position is 1500 and the string length is 30?

Roy Lambert
Tue, Dec 4 2007 4:59 PM
Tim Young [Elevate Software], Elevate Software, Inc., timyoung@elevatesoft.com

Roy,

<< Worth adding into the template? >>

Well, it already says:

"The SearchWords parameter is used to indicate when the word generation is occurring on search words."

I'm not sure how else to describe it. I did change the text to make a note about the following, however.

<< What? This I really don't understand, can you give me an example? >>

Let's say that you have the string "golfer is very bad" being parsed by EDB via the word generator. You can basically have EDB generate all variations of the word "golfer" by using an internal flag in the word generator that states that you are generating stemmed words. For example,

1) When you first hit the word, create a list of all stemmed words such as "golf", "golfs", "golfing", "golfed", etc. in a global list local to the word generation module instance.

2) Set a flag in the word generation module instance to indicate that you are generating from the stemmed words list and what the current position in the list is, and return the original word "golfer" while incrementing the position accordingly in the string.

3) On each subsequent call, check the stemmed words list flag and, if set, grab the next word off the stem list, increment the stemmed words list position, and return the word.

4) Once the stemmed words list position has reached the end of the stemmed words, set the stemmed words flag to False and continue with the word generation, if the position is still not past the length of the string.

Rinse and repeat as necessary. This will give you the ability to allow matches on variations of words without forcing the user to use wildcards when specifying the search words. Of course, this relies on having the list of stemmed words available somewhere to use for lookups when generating the list of stemmed words.

<< Also, if my guess as to how to drive the word generator (ie start at Position in the string and move to the next delimiter, which becomes Position for the next entry) is right, what am I meant to do when Position is 1500 and the string length is 30? >>

If you do things right, Position will never be that far past the string length.

--
Tim Young
Elevate Software
www.elevatesoft.com
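The four stemming steps above can be sketched as a small state machine. Everything here is illustrative rather than ElevateDB's implementation: the global variables stand in for fields of the word generator module instance, LookupStems is a hypothetical lookup hard-coded for the "golfer" example, the Collation and SearchWords parameters are left out for brevity, words are split on spaces only, and the driver loop assumes the engine stops calling once an empty word comes back.

```pascal
program StemSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  // Stand-ins for state held in the word generator module instance.
  Stems: array of String;      // stemmed variants still to be returned
  StemIndex: Integer = 0;      // current position in the stem list
  Stemming: Boolean = False;   // True while handing out stems

// Hypothetical lookup; a real one would consult a stem dictionary.
procedure LookupStems(const W: String);
begin
  SetLength(Stems, 0);
  if W = 'golfer' then
  begin
    SetLength(Stems, 4);
    Stems[0] := 'golf';    Stems[1] := 'golfs';
    Stems[2] := 'golfing'; Stems[3] := 'golfed';
  end;
end;

procedure GenerateWord(const Text: String; var Position: Integer;
  var Word: String);
var
  StartPos: Integer;
begin
  if Stemming then
  begin
    // Step 3: return the next stem without moving Position.
    Word := Stems[StemIndex];
    Inc(StemIndex);
    // Step 4: list exhausted, resume normal word generation next call.
    if StemIndex > High(Stems) then
      Stemming := False;
    Exit;
  end;
  // Normal pass: skip spaces, then take the next space-delimited word.
  while (Position <= Length(Text)) and (Text[Position] = ' ') do
    Inc(Position);
  StartPos := Position;
  while (Position <= Length(Text)) and (Text[Position] <> ' ') do
    Inc(Position);
  Word := Copy(Text, StartPos, Position - StartPos);
  // Steps 1 and 2: build the stem list, set the flag, return the original.
  LookupStems(Word);
  StemIndex := 0;
  Stemming := Length(Stems) > 0;
end;

var
  Txt, W: String;
  P: Integer;
begin
  Txt := 'golfer is very bad';
  P := 1;
  repeat
    GenerateWord(Txt, P, W);
    if W <> '' then
      WriteLn(W);
  until W = '';
end.
```

Run as written, this prints golfer, golf, golfs, golfing, golfed, is, very, bad, one per line. Position never moves while the stems are being handed out, which is why the engine cannot rely on Position alone to decide that generation is finished.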
Wed, Dec 5 2007 4:30 AM
Roy Lambert, NLH Associates, Team Elevate

Tim

>"The SearchWords parameter is used to indicate when the word generation is
>occurring on search words."

A bit of extra clarification like:

Prior to executing CONTAINS the list of words to be searched for is parsed by the word generator. SearchWords is set to True when this occurs and to False when the list of words is being prepared for insertion into the index.

>Let's say that you have the string "golfer is very bad" being parsed by EDB
>via the word generator. You can basically have EDB generate all variations
>of the word "golfer" by using an internal flag in the word generator that
>states that you are generating stemmed words.

Good example, and I'm sure some people will make use of the feature.

Roy Lambert
Wed, Dec 5 2007 4:57 PM
Tim Young [Elevate Software], Elevate Software, Inc., timyoung@elevatesoft.com

Roy,

<< Good example, and I'm sure some people will make use of the feature. >>

Well, I'm just glad that I thought of it while I was designing the text filtering. This type of stuff is hard to graft on after the fact.

--
Tim Young
Elevate Software
www.elevatesoft.com