Full text indexing |
Sat, Aug 26 2006 2:48 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | Since we now have the newsgroups I'm going for it
Are you going to add a minimum word length to the full text indexing as well as the maximum word length? Doing so would cut out vast amounts of rubbish, e.g. set it to 4 and "if", "and", "but", "the", etc. all go without having to write special handlers.

I remember this being asked ages ago and I thought you were going to, but your last post showing code looks as though not.

Roy Lambert |
Sat, Aug 26 2006 4:49 AM | Permanent Link |
"Hannes Danzl[NDD]" | Roy Lambert wrote:
> Since we now have the newsgroups I'm going for it
>
> Are you going to add a minimum word length to the full text indexing as well
> as the maximum word length? Doing so will cut out vast amounts of rubbish eg
> set to 4 and if and but the etc all go without having to write special
> handlers.
>
> I remember this being asked ages ago and I thought you were but your last
> post showing code looks as though not.

Hm. I'd suggest implementing a hash-based stop word list instead, which filters out the words you don't want. Restricting by length alone will get rid of lots of words you will want to search for, especially acronyms like USA, DB, EU, UNO, ...

Also keep in mind that different languages call for different filters.

--
Hannes Danzl
Newsgroup archive at http://www.tamaracka.com/search.htm |
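The trade-off Hannes describes can be illustrated with a minimal sketch. It is written in Python rather than Delphi to keep it short, and it is not ElevateDB code: `STOP_WORDS`, `split_words`, `filter_by_length` and `filter_by_stop_words` are hypothetical names, and the stop word set is a tiny made-up sample.

```python
import re

# Hypothetical stop word set; a real list would be per-language and much longer.
STOP_WORDS = frozenset({"if", "and", "but", "the", "a", "of", "to"})


def split_words(text):
    """Split text into lowercase alphabetic words."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def filter_by_length(words, min_length):
    """Keep only words with at least min_length characters."""
    return [w for w in words if len(w) >= min_length]


def filter_by_stop_words(words, stop_words=STOP_WORDS):
    """Keep only words that are not in the stop word set (one hash lookup each)."""
    return [w for w in words if w not in stop_words]


words = split_words("The USA and the EU signed the DB treaty")
by_length = filter_by_length(words, 4)      # drops "usa", "eu" and "db" too
by_stop_list = filter_by_stop_words(words)  # keeps the acronyms
```

With a minimum length of 4 the acronyms disappear along with the filler words, while the stop word set removes only the words it lists. The cost is that someone has to maintain the set.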
Sun, Aug 27 2006 10:24 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | Hannes
Stop words are great for some words, and I will sort of agree about some acronyms, but a user-settable minimum word length would allow people to do it either way. In my case I'm happy to dump any 3-letter word, i.e. mainly connectives and fillers, rather than maintain a massive stop word list.

Roy Lambert |
Sun, Aug 27 2006 7:59 PM | Permanent Link |
"Hannes Danzl[NDD]" | Roy Lambert wrote:
> Hannes
>
> Stop words are great for some words, and I will sort of agree about some
> acronyms but a user settable minimum word length would allow people to do it
> either way. In my case I'm happy to dump any 3 letter word ie mainly
> connectives and fillers rather than maintain a massive stop word list.

Hm. It's quite easy to dynamically update the stop word list on the basis of index stats. A full-text engine usually has a MaxHits property or similar that tells the engine not to store any more records for a certain word. Once that count is reached, the word is flagged as invalid (and you can get these flagged words) or automatically added to the stop word list.

--
Hannes Danzl
Newsgroup archive at http://www.tamaracka.com/search.htm |
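The MaxHits idea can be sketched as a toy inverted index. This is an illustration in Python under stated assumptions, not how any particular engine implements it: the `TextIndex` class, its `max_hits` parameter and the choice to retire a word once it appears in more than `max_hits` records are all hypothetical.

```python
from collections import defaultdict


class TextIndex:
    """Toy inverted index that retires words found in more than max_hits records."""

    def __init__(self, max_hits):
        self.max_hits = max_hits          # hypothetical MaxHits-style limit
        self.postings = defaultdict(set)  # word -> ids of records containing it
        self.stop_words = set()           # words retired automatically

    def add_record(self, record_id, words):
        for word in set(words):
            if word in self.stop_words:
                continue
            self.postings[word].add(record_id)
            if len(self.postings[word]) > self.max_hits:
                # Too common to be useful: drop its postings and stop indexing it.
                del self.postings[word]
                self.stop_words.add(word)

    def search(self, word):
        """Return the sorted record ids for a word, or an empty list."""
        return sorted(self.postings.get(word, ()))


index = TextIndex(max_hits=2)
index.add_record(1, ["the", "database"])
index.add_record(2, ["the", "index"])
index.add_record(3, ["the", "database"])
```

After the third record, "the" has exceeded the limit and sits in `index.stop_words`, while "database" is still searchable. The stop list therefore builds itself from the data rather than from a hand-maintained file.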
Mon, Aug 28 2006 2:08 PM | Permanent Link |
Tim Young [Elevate Software] Elevate Software, Inc. timyoung@elevatesoft.com | Roy,
<< Are you going to add a minimum word length to the full text indexing as well as the maximum word length? Doing so will cut out vast amounts of rubbish eg set to 4 and if and but the etc all go without having to write special handlers. >>

The default minimum word size in the default word generator is 3, meaning that any word shorter than 3 characters is excluded. As with any of the default word generation parameters (min word size, max word size, include chars, space chars, stop words, etc.), if you want to modify them you can do so by writing a custom word generation module, which is very simple to do.

--
Tim Young
Elevate Software
www.elevatesoft.com |
Tue, Aug 29 2006 4:26 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | Tim
Interesting word, "default" - that indicates to my parser that it's configurable.......

Roy Lambert |
Tue, Aug 29 2006 2:41 PM | Permanent Link |
Tim Young [Elevate Software] Elevate Software, Inc. timyoung@elevatesoft.com | Roy,
<< Interesting word default - that indicates to my parser that its configurable....... >>

Yes it is - by writing a custom word generation module. I think you're getting too hung up on this - writing a custom word generation module is very simple and entails:

1) Select New..Other..ElevateDB..Elevate DB Word Generator Module

2) Fill in this empty event handler in the project's only unit with your code:

    procedure TEDBWordGeneratorModule1.EDBWordGeneratorModuleGenerateWord(
      Collation: Integer; const Text: String; var Position: Integer;
      var Word: String);
    begin
      { Fill in the word generation code here. You can have multiple
        word generators for multiple collations in a word generator module.
        Use the Collation parameter to determine which word generator is being
        called. Be sure to increment the Position variable parameter as
        appropriate while generating each word. The Position parameter
        (1-based) is used to indicate to ElevateDB when the word generation
        is complete. If the Position parameter is greater than the length of
        Text, then the word generation is considered complete. }
    end;

3) Save and compile the word generator and copy it into the same directory as the ElevateDB configuration

4) Execute this SQL:

    CREATE WORD GENERATOR <Your Name>
    MODULE <The Name of the Above Module Minus the .DLL Extension>

5) Then, just use it in any CREATE TEXT INDEX statements that you want or in the ElevateDB Manager with the interactive text index creation.

Granted, the writing of the module may take some time, depending upon your needs. However, a word generator is also infinitely configurable, in contrast to us trying to include configuration settings for every sort of collation, stemming, stop word, min/max word size, etc. combination that is possible or desired. Some folks will want very complex word generation, while others will always simply stick with the defaults provided by the default word generator. For example, even if we included all sorts of configuration options in ElevateDB for what I just mentioned, it still wouldn't properly cover word generation for Chinese or Japanese. ElevateDB is designed with these languages and Unicode in mind.

And again, once someone writes a word generator module for a given collation or for a custom type of word generation, it can be sold or simply shared with others here on the newsgroups in the Extensions newsgroup.

--
Tim Young
Elevate Software
www.elevatesoft.com |
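To make the Position contract in the handler comment concrete, here is a minimal sketch of the same logic in Python. It is not the Delphi handler and not ElevateDB code. It assumes words are runs of letters and digits, that over-long words are skipped rather than truncated, and that a hypothetical `max_size` of 30 applies; only the minimum of 3 comes from this thread. Because Python has no `var` parameters, the function returns the word and the new position instead.

```python
def generate_word(text, position, min_size=3, max_size=30):
    """Return (word, new_position) for the next word at or after `position`.

    `position` is 1-based, mirroring the handler's Position parameter. An empty
    word together with new_position > len(text) means generation is complete.
    """
    i = position - 1  # convert to a 0-based index
    n = len(text)
    while i < n:
        # Skip separators (anything that is not a letter or digit).
        while i < n and not text[i].isalnum():
            i += 1
        start = i
        # Consume one run of word characters.
        while i < n and text[i].isalnum():
            i += 1
        word = text[start:i]
        if min_size <= len(word) <= max_size:
            return word.lower(), i + 1
        # Word rejected by the size limits (or nothing left): keep scanning.
    return "", n + 1


def all_words(text):
    """Call generate_word repeatedly until the position passes the end of text."""
    words, position = [], 1
    while position <= len(text):
        word, position = generate_word(text, position)
        if word:
            words.append(word)
    return words
```

For example, `all_words("If the DB index is big")` yields "the", "index" and "big", because the two-letter words fall under the minimum size. Every call moves the position forward, which is what lets the caller detect completion.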
Tue, Aug 29 2006 4:13 PM | Permanent Link |
Michael Baytalsky | Tim,
I'm sure you've done this already, but just in case.... In the caller code, make sure to raise an error if the word generator didn't increment Position by at least 1. Otherwise you might end up with an infinite loop, just because someone has made a silly mistake in their code. I can't tell you how many times I've done just that.

Michael |
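The guard Michael suggests might look like the following caller loop. This is a minimal Python sketch, not ElevateDB's actual caller: `collect_words`, `whitespace_generator` and `stuck_generator` are hypothetical names, and the callback is assumed to return a `(word, new_position)` pair with a 1-based position like the handler's Position parameter.

```python
def collect_words(text, generate_word):
    """Drive a generator callback; fail fast if it does not advance the position."""
    words, position = [], 1  # 1-based, like the handler's Position parameter
    while position <= len(text):
        word, new_position = generate_word(text, position)
        if new_position <= position:
            raise RuntimeError(
                f"word generator did not advance past position {position}"
            )
        if word:
            words.append(word)
        position = new_position
    return words


def whitespace_generator(text, position):
    """Well-behaved example: returns the next whitespace-delimited word."""
    i = position - 1
    while i < len(text) and text[i].isspace():
        i += 1
    start = i
    while i < len(text) and not text[i].isspace():
        i += 1
    return text[start:i], i + 1


def stuck_generator(text, position):
    """Buggy example: forgets to increment the position."""
    return "", position
```

Calling `collect_words("full text  index", whitespace_generator)` returns the three words. Passing `stuck_generator` raises `RuntimeError` on the first call instead of looping forever.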
Wed, Aug 30 2006 3:14 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | Tim
In this case all I was objecting to was the word "default", not the concept or your non-supply of every possible variation I can think of in deciding which words to index.

However, since you've given more detail, and even though it's really too early to ask for details: is EDBWordGeneratorModuleGenerateWord going to be called once per word, as I'm guessing?

Finally, reading ".DLL" - shudder.

Roy Lambert |
Wed, Aug 30 2006 3:14 AM | Permanent Link |
Roy Lambert NLH Associates Team Elevate | Michael
Me too - especially missing table.Next

Roy Lambert |