Messages 1 to 10 of 19 total
Thread Full text indexing
Sat, Aug 26 2006 2:48 AM

Roy Lambert

NLH Associates

Team Elevate

Since we now have the newsgroups, I'm going for it.

Are you going to add a minimum word length to the full text indexing as well as the maximum word length? Doing so will cut out vast amounts of rubbish, e.g. set it to 4 and "if", "and", "but", "the", etc. all go without having to write special handlers.

I remember this being asked ages ago, and I thought you were going to, but your last post showing code suggests not.

Roy Lambert
Sat, Aug 26 2006 4:49 AM

"Hannes Danzl[NDD]"
Roy Lambert wrote:

> Since we now have the newsgroups, I'm going for it.
>
> Are you going to add a minimum word length to the full text indexing as well
> as the maximum word length? Doing so will cut out vast amounts of rubbish,
> e.g. set it to 4 and "if", "and", "but", "the", etc. all go without having to
> write special handlers.
>
> I remember this being asked ages ago, and I thought you were going to, but
> your last post showing code suggests not.

Hm. I'd suggest implementing a hash-based stop word list instead, which
filters out the words you don't want. Restricting by length alone will get rid
of lots of words you will want to search for, especially acronyms like USA, DB,
EU, UNO, ... Also keep in mind that different languages call for different
filters ;-)

--

Hannes Danzl
Newsgroup archive at http://www.tamaracka.com/search.htm
Sun, Aug 27 2006 10:24 AM

Roy Lambert

NLH Associates

Team Elevate

Hannes


Stop words are great for some words, and I will sort of agree about some acronyms, but a user-settable minimum word length would allow people to do it either way. In my case I'm happy to dump any 3-letter word, i.e. mainly connectives and fillers, rather than maintain a massive stop word list.


Roy Lambert
Sun, Aug 27 2006 7:59 PM

"Hannes Danzl[NDD]"
Roy Lambert wrote:

> Hannes
>
>
> Stop words are great for some words, and I will sort of agree about some
> acronyms, but a user-settable minimum word length would allow people to do it
> either way. In my case I'm happy to dump any 3-letter word, i.e. mainly
> connectives and fillers, rather than maintain a massive stop word list.

Hm. It's quite easy to dynamically update the stop word list on the basis of
index stats. A full-text engine usually has a MaxHits property or similar that
tells the engine not to store any more records for a certain word. Once that
count is reached, the word is flagged as invalid (and you can get these flagged
words) or automatically added to the stop word list.
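
To make the idea concrete, here is a minimal sketch of such a counter as a standalone console program. It is not an ElevateDB feature or API: MaxHits, RecordHit, HitCounts and StopWords are hypothetical names, the threshold of 3 is only for the demo, and a word is flagged once its count would exceed MaxHits.

```pascal
program MaxHitsStopWords;

{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

const
  MaxHits = 3; // hypothetical threshold; a real index would use a far larger value
  Sample: array[0..6] of String =
    ('the', 'index', 'the', 'the', 'the', 'index', 'the');

var
  HitCounts: TStringList; // word=count pairs for words still being indexed
  StopWords: TStringList; // words that went over MaxHits

// Records one occurrence of AWord. Returns True if the occurrence should be
// stored in the index, False if the word is (or has just become) a stop word.
function RecordHit(const AWord: String): Boolean;
var
  Hits, Idx: Integer;
begin
  Result := False;
  if StopWords.IndexOf(AWord) >= 0 then
    Exit;
  Hits := StrToIntDef(HitCounts.Values[AWord], 0) + 1;
  if Hits > MaxHits then
  begin
    // Too common to be useful: flag the word and forget its counter.
    StopWords.Add(AWord);
    Idx := HitCounts.IndexOfName(AWord);
    if Idx >= 0 then
      HitCounts.Delete(Idx);
    Exit;
  end;
  HitCounts.Values[AWord] := IntToStr(Hits);
  Result := True;
end;

var
  I: Integer;
begin
  HitCounts := TStringList.Create;
  StopWords := TStringList.Create;
  try
    for I := Low(Sample) to High(Sample) do
      RecordHit(Sample[I]);
    WriteLn('Stop words: ', StopWords.CommaText);
    WriteLn('Still indexed: ', HitCounts.CommaText);
  finally
    StopWords.Free;
    HitCounts.Free;
  end;
end.
```

With the sample input, "the" goes over the threshold and lands in the stop word list, while "index" stays indexed. A real engine would also have to remove the entries already stored for a word once it is flagged, which this sketch leaves out.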

--

Hannes Danzl
Newsgroup archive at http://www.tamaracka.com/search.htm
Mon, Aug 28 2006 2:08 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< Are you going to add a minimum word length to the full text indexing as
well as the maximum word length? Doing so will cut out vast amounts of
rubbish, e.g. set it to 4 and "if", "and", "but", "the", etc. all go without
having to write special handlers. >>

The default minimum word size in the default word generator is 3, meaning
that any word shorter than 3 characters is excluded.  As with any of the
default word generation parameters (min word size, max word size, include
chars, space chars, stop words, etc.), if you want to modify them you can do
so by writing a custom word generation module, which is very simple to do.

--
Tim Young
Elevate Software
www.elevatesoft.com

Tue, Aug 29 2006 4:26 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


Interesting word, "default" - that indicates to my parser that it's configurable.......

Roy Lambert
Tue, Aug 29 2006 2:41 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< Interesting word, "default" - that indicates to my parser that it's
configurable.......  >>

Yes, it is - by writing a custom word generation module.  I think you're
getting too hung up on this - writing a custom word generation module is very
simple and entails:

1) Select New..Other..ElevateDB..Elevate DB Word Generator Module
2) Fill in this empty event handler in the project's only unit with your
code:

procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
  Collation: Integer; const Text: String; var Position: Integer;
  var Word: String);
begin
  { Fill in the word generation code here. You can have multiple
    word generators for multiple collations in a word generator module.
    Use the Collation parameter to determine which word generator is being
    called.  Be sure to increment the Position variable parameter as
    appropriate while generating each word.  The Position parameter (1-based)
    is used to indicate to ElevateDB when the word generation is complete.
    If the Position parameter is greater than the length of Text, then the
    word generation is considered complete.}
end;

3) Save and compile the word generator and copy it into the same directory
as the ElevateDB configuration
4) Execute this SQL:

CREATE WORD GENERATOR <Your Name>
MODULE <The Name of the Above Module Minus the .DLL Extension>

5) Then, just use it in any CREATE TEXT INDEX statements that you want or in
the ElevateDB Manager with the interactive text index creation.

Granted, the writing of the module may take some time, depending upon your
needs.  However, a word generator is also infinitely configurable, in
contrast to us trying to include configuration settings for every sort of
collation, stemming, stop word, min/max word size, etc. combination that is
possible or desired.  Some folks will want very complex word generation,
while others will always simply stick with the defaults provided by the
default word generator.  For example, even if we included all sorts of
configuration options in ElevateDB for what I just mentioned, it still
wouldn't properly cover word generation for Chinese or Japanese.  ElevateDB
is designed with these languages and Unicode in mind.

And again, once someone writes a word generator module for a given collation
or for a custom type of word generation, it can be sold or simply shared
with others here on the newsgroups in the Extensions newsgroup.
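
As an illustration only, here is a minimal sketch of what the handler body in step 2 might look like for a minimum word length of 4. It is written as a standalone console program with a plain procedure that has the same parameter list as the handler, so it compiles without ElevateDB. It assumes the engine calls the handler repeatedly until Position passes the end of Text and ignores an empty Word; neither is confirmed in this thread. MinWordLength, IsWordChar and the driver loop are hypothetical.

```pascal
program MinLengthWordGenerator;

{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

const
  MinWordLength = 4; // hypothetical setting: shorter words are dropped
  Sample = 'if and but the database index USA';

// Hypothetical helper: only ASCII letters and digits count as word characters.
function IsWordChar(C: Char): Boolean;
begin
  Result := ((C >= 'A') and (C <= 'Z')) or ((C >= 'a') and (C <= 'z')) or
    ((C >= '0') and (C <= '9'));
end;

// Same parameter list as the wizard-generated handler, as a plain procedure.
procedure GenerateWord(Collation: Integer; const Text: String;
  var Position: Integer; var Word: String);
var
  StartPos: Integer;
begin
  Word := '';
  while Position <= Length(Text) do
  begin
    // Skip separators in front of the next candidate word.
    while (Position <= Length(Text)) and not IsWordChar(Text[Position]) do
      Inc(Position);
    StartPos := Position;
    // Consume the candidate word; Position only ever moves forward.
    while (Position <= Length(Text)) and IsWordChar(Text[Position]) do
      Inc(Position);
    if Position - StartPos >= MinWordLength then
    begin
      Word := Copy(Text, StartPos, Position - StartPos);
      Exit;
    end;
  end;
  // End of text reached without a long enough word: Word stays empty.
end;

var
  P: Integer;
  W: String;
begin
  // Stand-in for the engine: call until Position passes the end of the text.
  P := 1;
  while P <= Length(Sample) do
  begin
    GenerateWord(0, Sample, P, W);
    if W <> '' then
      WriteLn(W);
  end;
end.
```

With the sample text only "database" and "index" are printed. "USA" is dropped along with the fillers, which is exactly the trade-off raised earlier in the thread; a stop word lookup could be added at the length check instead of, or in addition to, the minimum length.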

--
Tim Young
Elevate Software
www.elevatesoft.com

Tue, Aug 29 2006 4:13 PM

Michael Baytalsky
Tim,

I'm sure you've done this already, but just in case....

In the caller code, make sure to raise an error if the word generator
didn't increment Position by at least 1. Otherwise you might
end up with an infinite loop, just because someone has made a
silly mistake in their code ;-) I can't tell you how many times
I've done just that ;-)
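
A minimal sketch of such a guard, as a standalone console program. This is not ElevateDB's actual caller code: CollectWords, TGenerateWord and ForgetfulGenerator are hypothetical names, and the generator is deliberately broken to trigger the check.

```pascal
program PositionGuard;

{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

type
  // Same parameter list as the wizard-generated handler quoted in this thread.
  TGenerateWord = procedure(Collation: Integer; const Text: String;
    var Position: Integer; var Word: String);

// Deliberately buggy generator: returns a word but never advances Position.
procedure ForgetfulGenerator(Collation: Integer; const Text: String;
  var Position: Integer; var Word: String);
begin
  Word := Text;
end;

// Hypothetical caller loop that refuses to spin on a stuck generator.
procedure CollectWords(Generator: TGenerateWord; const Text: String;
  Words: TStrings);
var
  Position, Previous: Integer;
  Word: String;
begin
  Position := 1;
  while Position <= Length(Text) do
  begin
    Previous := Position;
    Generator(0, Text, Position, Word);
    // Without this check a stuck generator would loop forever.
    if Position <= Previous then
      raise Exception.CreateFmt(
        'Word generator did not advance Position (stuck at %d)', [Previous]);
    if Word <> '' then
      Words.Add(Word);
  end;
end;

var
  Words: TStringList;
begin
  Words := TStringList.Create;
  try
    try
      CollectWords(ForgetfulGenerator, 'full text indexing', Words);
    except
      on E: Exception do
        WriteLn('Error: ', E.Message);
    end;
  finally
    Words.Free;
  end;
end.
```

The check costs one integer comparison per call and turns a hang into an error that names the position where the generator got stuck.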


Michael

Wed, Aug 30 2006 3:14 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


In this case all I was objecting to was the word "default", not the concept or your non-supply of every possible variation I can think of in deciding which words to index.

However, since you've given more detail, and even though it's really too early to ask for details: is EDBWordGeneratorModuleGenerateWord going to be called once per word, as I'm guessing?

Finally, reading ".DLL" - shudder :-)

Roy Lambert
Wed, Aug 30 2006 3:14 AM

Roy Lambert

NLH Associates

Team Elevate

Michael


Me too - especially missing table.Next

Roy Lambert