Messages 1 to 8 of 8 total
Thread WordGenerator
Thu, Nov 29 2007 8:41 AM

Roy Lambert

NLH Associates

Team Elevate

I think I'm beginning to get my head round what I'll need for my full text indexing. I'll want some specific text filters just to strip the unwanted stuff and then some word generators to make sure things are handled as I want (word length, stop chars etc). Before I can move on though I need more info about the word generator template

procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
  Collation: Integer;           <<< what is this? I thought collations were ANSI, ENG etc., not an integer
  const Text: String;           <<< this seems to be the full output from the filter, yes/no?
  var Position: Integer;        } output of some sort,
  var Word: String;             } don't quite know what to do with them
  SearchWords: Boolean=False);  <<< no idea

In my test I have had Position up to 1500 on a 30 char string, and it just keeps going.

I know it's probably too late, but if this is being called once for each word in the string and has to extract the next word each time, then it doesn't seem overly efficient.

That to one side, can we have a bit more info about how it's supposed to work and what the parameters are, please?

Another concern. To test all I've done is add this to the word generator code in your template

showmessage(text+#13+word+#13+inttostr(position));
position := position+1;
Word := 'XXX';

When I alter the text in the CLOB to add multiple lines I only see line 1. This may be because it never passes the first stage of removing the old words. I did this to see if the CRLF pair is passed into the word generator.

Roy Lambert
Thu, Nov 29 2007 5:09 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< I think I'm beginning to get my head round what I'll need for my full
text indexing. I'll want some specific text filters just to strip the
unwanted stuff and then some word generators to make sure things are handled
as I want (word length, stop chars etc). Before I can move on though I need
more info about the word generator template

procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
Collation: Integer; <<<<<<<<<< what is this? I thought collations were ANSI,
ENG etc not an integer >>

The value corresponds to the Windows locale ID (LCID), or 0 if it is the
default ANSI/UNI collation.  For example, English (US) under Windows is 1033.

<< const Text: String; <<<<<<<<<< this seems to be the full output from the
filter yes/no >>

If the text was filtered, yes.

<< var Position: Integer;} output of some sort >>

The current position.

<< var Word: String; } don't quite know what to do with them
SearchWords: Boolean=False); <<<<<<<<<<<<< no idea >>

Read the template text for the event handler:

"Be sure to increment the Position variable parameter as appropriate while
generating each word.  The Position parameter (1-based) is used to indicate
to ElevateDB when the word generation is complete. If the Position parameter
is greater than the length of Text, then the word generation is considered
complete.  The SearchWords parameter is used to indicate when the word
generation is occurring on search words. In such a case, you need to retain
all asterisks (*) used in words in order to allow the partial-search
functionality to work correctly."

<< In my test I have had position up to 1500 on a 30 char string, and it
just keeps going. >>

Are you still returning a word?  It will keep going as long as you keep
returning words.
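
A minimal sketch of a handler body that follows the template text, offered as an illustration rather than as ElevateDB's own code. The stand-alone GenerateWord procedure copies the template's parameter list but is not attached to a module. IsWordBreak and the driver loop in the main block are hypothetical, and the loop only assumes that the engine stops calling once an empty Word comes back. Under that assumption, a test handler that always sets Word to 'XXX' would never finish, whatever it does to Position.

```pascal
program WordGenSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: which characters separate words (an assumption)
function IsWordBreak(C: Char): Boolean;
begin
  Result := (C <= ' ') or (C = ',') or (C = '.') or (C = ';');
end;

// Same parameter list as the template's event handler, minus the module class
procedure GenerateWord(Collation: Integer; const Text: String;
  var Position: Integer; var Word: String; SearchWords: Boolean = False);
var
  Start: Integer;
begin
  // Skip any delimiters in front of the next word
  while (Position <= Length(Text)) and IsWordBreak(Text[Position]) do
    Inc(Position);
  Start := Position;
  // Advance Position to the first character after the word
  while (Position <= Length(Text)) and not IsWordBreak(Text[Position]) do
    Inc(Position);
  // Past the end of Text this copies nothing, so Word comes back empty
  Word := Copy(Text, Start, Position - Start);
end;

var
  Position: Integer;
  Word: String;
begin
  Position := 1;  // the template text states that Position is 1-based
  // ASSUMED imitation of the engine: call until no word is returned
  repeat
    GenerateWord(0, 'golfer is very bad', Position, Word);
    if Word <> '' then
      WriteLn(Word, ' (Position is now ', Position, ')');
  until Word = '';
end.
```

Each call does only the work needed for one word, because Position carries the scan state from one call to the next.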

<< I know it's probably too late, but if this is being called once for each
word in the string, has to extract the next word each time then it doesn't
seem overly efficient. >>

The overhead isn't bad, just slightly worse than the call overhead for the
built-in functionality.

<< When I alter the text in the CLOB to add multiple lines I only see line 1.
This may be because it never passes the first stage of removing the old
words. I did this to see if the CRLF pair is passed into the word generator.
>>

Anything that is in the CLOB column will get passed on to the word generator
provided that any applicable text filter doesn't remove it first.

--
Tim Young
Elevate Software
www.elevatesoft.com

Fri, Nov 30 2007 4:31 AM

Roy Lambert

NLH Associates

Team Elevate

Tim

>The value corresponds to the Windows locale, or 0 if it is the default
>ANSI/UNI collation. For example, English (US) under Windows is 1033.

I don't think I'll ever need it but is there a table somewhere?

><< const Text: String; <<<<<<<<<< this seems to be the full output from the
>filter yes/no >>
>
>If the text was filtered, yes.
>
><< var Position: Integer;} output of some sort >>
>
>The current position.
>
><< var Word: String; } don't quite know what to do with them
> SearchWords: Boolean=False); <<<<<<<<<<<<< no idea >>
>
>Read the template text for the event handler:

I did - I've just re-read it and it still means nothing. But a light bulb is dimly glowing. The only way I can see this meaning anything is for this:

<<
Field CONTAINS 'elevatesoft produces brilliant software'

Does it
a) just leave it in and hence the query is GUARANTEED to return false since
elevatesoft isn't in the index? >>

Correct.
>>

to be incorrect, with the word generator pre-processing the search terms for a query. Yes, no, stick my head down the loo and flush?

>Are you still returning a word? It will keep going as long as you keep
>returning words.

So "If the Position parameter is greater than the length of Text, then the word generation is considered complete."  isn't quite sufficient, or is only for internal consumption for the word generator?

Roy Lambert

PS: is it a good idea to use Word as the variable name for the words when it's also a whatsit - you know, a LargeNumber: Word type of thingy?
Mon, Dec 3 2007 8:11 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< I don't think I'll ever need it but is there a table somewhere? >>

http://www.microsoft.com/globaldev/reference/winxp/xp-lcid.mspx

<< To be incorrect and the word generator is pre-processing the search terms
for a query. Yes, no, stick my head down the loo and flush? >>

Yes, the word generator does pre-process the search terms, and when it does,
the SearchWords parameter will be set to True.
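
One way to honour the asterisk rule from the template text, as a rough sketch only. IsWordChar and CleanWord are hypothetical helpers, not part of ElevateDB, and the choice of letters and digits as the only word characters is an assumption made for illustration.

```pascal
program SearchWordsSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: decides whether C may appear in a generated word.
// The asterisk survives only while search words are being generated.
function IsWordChar(C: Char; SearchWords: Boolean): Boolean;
begin
  case C of
    'A'..'Z', 'a'..'z', '0'..'9': Result := True;
    '*': Result := SearchWords;
  else
    Result := False;
  end;
end;

// Hypothetical helper: returns S with every non-word character removed
function CleanWord(const S: String; SearchWords: Boolean): String;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    if IsWordChar(S[I], SearchWords) then
      Result := Result + S[I];
end;

begin
  // SearchWords = False: words headed for the index, asterisk stripped
  WriteLn(CleanWord('golf*', False));
  // SearchWords = True: search terms, asterisk kept for partial matching
  WriteLn(CleanWord('golf*', True));
end.
```

Inside a real handler, the SearchWords value received from ElevateDB would simply be passed through to a helper of this kind.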

<< So "If the Position parameter is greater than the length of Text, then
the word generation is considered complete." isn't quite sufficient, or is
only for internal consumption for the word generator? >>

It considers it complete at that point, but will keep calling it until you
stop returning words.   The reason for this is to allow for stemming whereby
one word at a given position can return multiple words without actually
moving the position any further.

<< ps is it a good idea to use Word as the variable name for the words when
it's also a whatsit - you know LargeNumber: word type of thingy? >>

I don't think Delphi has a problem with it since it can use the context in
which it is used to determine which "Word" is being referred to.

--
Tim Young
Elevate Software
www.elevatesoft.com

Tue, Dec 4 2007 3:29 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


>Yes, the word generator does pre-process the search terms, and when it does,
>the SearchWords parameter will be set to True.

Worth adding into the template?

><< So "If the Position parameter is greater than the length of Text, then
>the word generation is considered complete." isn't quite sufficient, or is
>only for internal consumption for the word generator? >>
>
>It considers it complete at that point, but will keep calling it until you
>stop returning words. The reason for this is to allow for stemming whereby
>one word at a given position can return multiple words without actually
>moving the position any further.

What? This I really don't understand; can you give me an example? Also, if my guess as to how to drive the word generator is correct (i.e. start at Position in the string and move to the next delimiter, which becomes Position for the next entry), what am I meant to do when Position is 1500 and the string length is 30?

Roy Lambert

Tue, Dec 4 2007 4:59 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< Worth adding into the template? >>

Well, it already says:

"The SearchWords parameter is used to indicate when the word generation is
occurring on search words."

I'm not sure how else to describe it.   I did change the text to make a note
about the following, however.

<< What? This I really don't understand, can you give me an example. >>

Let's say that you have the string "golfer is very bad" being parsed by EDB
via the word generator.  You can basically have EDB generate all variations
of the word "golfer" by using an internal flag in the word generator that
states that you are generating stemmed words.

For example,

1) When you first hit the word, store all of its stemmed words, such as
"golf", "golfs", "golfing", "golfed", etc., in a list that persists between
calls and is local to the word generation module instance.

2) Set a flag in the word generation module instance to indicate that you
are generating from the stemmed words list and what the current position in
the list is, and return the original word "golfer" while incrementing the
position accordingly in the string.

3) On each subsequent call, check the stemmed words list flag and, if set,
grab the next word off the stem list, increment the stemmed words list
position, and return the word.

4) Once the stemmed words list position has reached the end of the stemmed
words, set the stemmed words flag to False and continue with the word
generation, if the position is still not past the length of the string.

Rinse and repeat as necessary.  This will give you the ability to allow
matches on variations of words without forcing the user to use wildcards
when specifying the search words.  Of course, this relies on having the list
of stemmed words available somewhere to use for lookups when generating the
list of stemmed words.
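
The four steps above can be sketched roughly as follows. Everything here is hypothetical: TStemmingGenerator stands in for the word generation module instance, its GenerateWord keeps only the Text, Position and Word parameters of the real handler, LookUpStems hard-codes the "golfer" variations in place of a real stem lookup, and the loop at the bottom only assumes that calls continue until an empty Word is returned.

```pascal
program StemSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  // Hypothetical stand-in for the word generation module instance
  TStemmingGenerator = class
  private
    FStems: TStringList;  // stemmed words waiting to be returned (step 1)
    FStemIndex: Integer;  // current position in FStems (step 2)
    FStemming: Boolean;   // True while returning words from FStems
    procedure LookUpStems(const AWord: String);
  public
    constructor Create;
    destructor Destroy; override;
    procedure GenerateWord(const Text: String; var Position: Integer;
      var Word: String);
  end;

constructor TStemmingGenerator.Create;
begin
  inherited Create;
  FStems := TStringList.Create;
end;

destructor TStemmingGenerator.Destroy;
begin
  FStems.Free;
  inherited Destroy;
end;

// Placeholder for a real stem lookup: only knows the thread's example word
procedure TStemmingGenerator.LookUpStems(const AWord: String);
begin
  FStems.Clear;
  if AWord = 'golfer' then
  begin
    FStems.Add('golf');
    FStems.Add('golfs');
    FStems.Add('golfing');
    FStems.Add('golfed');
  end;
end;

procedure TStemmingGenerator.GenerateWord(const Text: String;
  var Position: Integer; var Word: String);
var
  Start: Integer;
begin
  if FStemming then
  begin
    // Step 3: return the next stemmed word; Position is left untouched
    Word := FStems[FStemIndex];
    Inc(FStemIndex);
    // Step 4: clear the flag once the stem list is used up
    if FStemIndex >= FStems.Count then
      FStemming := False;
    Exit;
  end;
  // Normal parsing: skip spaces, then take the next space-delimited word
  while (Position <= Length(Text)) and (Text[Position] = ' ') do
    Inc(Position);
  Start := Position;
  while (Position <= Length(Text)) and (Text[Position] <> ' ') do
    Inc(Position);
  Word := Copy(Text, Start, Position - Start);
  // Steps 1 and 2: build the stem list and set the flag for later calls
  LookUpStems(Word);
  FStemIndex := 0;
  FStemming := FStems.Count > 0;
end;

var
  Gen: TStemmingGenerator;
  Position: Integer;
  Word: String;
begin
  Gen := TStemmingGenerator.Create;
  try
    Position := 1;
    // ASSUMED imitation of the engine: call until no word is returned
    repeat
      Gen.GenerateWord('golfer is very bad', Position, Word);
      if Word <> '' then
        WriteLn(Word);
    until Word = '';
  finally
    Gen.Free;
  end;
end.
```

If the stemmed word happened to be the last one in Text, Position would already be past the end of the string while stems were still being returned, which is the case the "keep calling until you stop returning words" behaviour allows for.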

<< Also if my guess as to how to drive the word generator is correct (i.e.
start at Position in the string and move to the next delimiter, which becomes
Position for the next entry), what am I meant to do when Position is
1500 and the string length is 30? >>

If you do things right, Position will never be that far past the string
length.

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Dec 5 2007 4:30 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


>"The SearchWords parameter is used to indicate when the word generation is
>occurring on search words."

A bit of extra clarification like:
Prior to executing CONTAINS the list of words to be searched for is parsed by the word generator. SearchWords is set to True when this occurs and to False when the list of words is being prepared for insertion into the index.

>Let's say that you have the string "golfer is very bad" being parsed by EDB
>via the word generator. You can basically have EDB generate all variations
>of the word "golfer" by using an internal flag in the word generator that
>states that you are generating stemmed words.

Good example and I'm sure some people will make use of the feature :-)

Roy Lambert
Wed, Dec 5 2007 4:57 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< Good example and I'm sure some people will make use of the feature :-) >>

Well, I'm just glad that I thought of it while I was designing the text
filtering.  This type of stuff is hard to graft on after the fact. :-)

--
Tim Young
Elevate Software
www.elevatesoft.com
