Messages 11 to 19 of 19 total
Thread: Full text indexing
Wed, Aug 30 2006 4:35 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Michael,

<< I'm sure you've done this already, but just in case....

In the caller code, make sure to raise an error if the word generator didn't
increment Position by at least 1. Otherwise you might end up with an infinite
loop, just because someone has made a silly mistake in their code ;-) I
can't tell you how many times I've done just that ;-). >>

Actually, no, I hadn't done that yet since it really falls under the general
category of "oops, I screwed up".  In addition, I'm not sure if I want to
necessarily dictate that every call to the word generator must increment the
text position.  If someone is doing stemming or something similar, they may
or may not increment the text position on every call.

--
Tim Young
Elevate Software
www.elevatesoft.com
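
For illustration, the guard Michael describes could look roughly like the sketch below. It is a minimal sketch under stated assumptions, not the ElevateDB interface: the callback shape `(text, position) -> (word, new_position)` and the names `collect_words` and `whitespace_generator` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callback shape: takes the text and the current position and
# returns (word or None, new position). This is NOT the ElevateDB API.
WordGenerator = Callable[[str, int], Tuple[Optional[str], int]]


def collect_words(text: str, generate_word: WordGenerator) -> List[str]:
    """Call generate_word until the text is consumed, failing fast on a stall."""
    words: List[str] = []
    position = 0
    while position < len(text):
        word, new_position = generate_word(text, position)
        if new_position <= position:
            # Without this check, a generator that forgets to advance the
            # position would keep the loop spinning forever.
            raise RuntimeError(
                f"word generator did not advance past position {position}"
            )
        if word:
            words.append(word)
        position = new_position
    return words


def whitespace_generator(text: str, position: int) -> Tuple[Optional[str], int]:
    """Return the next whitespace-delimited word and the position after it."""
    while position < len(text) and text[position].isspace():
        position += 1
    start = position
    while position < len(text) and not text[position].isspace():
        position += 1
    return (text[start:position] or None, position)
```

A strict check like this rules out the case Tim mentions, where a generator returns several words for the same position. A caller that wants to allow that could instead cap the number of consecutive calls that leave the position unchanged.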

Wed, Aug 30 2006 4:39 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Roy,

<< In this case all I was objecting to was the word "default" not the
concept or your non-supply of every possible variation I can think of in
deciding which words to index. >>

Well, default was used for lack of a better word.  It is the default
behavior, hence I picked "default". :-)

<< However, since you've given more detail, and even though it's really too
early to ask for details: is EDBWordGeneratorModuleGenerateWord going to be
called once per word, as I'm guessing? >>

Yes, but once per actual word that will be indexed, not once per word in the
text.

<< Finally reading ".DLL" - shudder :-) >>

Uggh, darn religious people.... :-)

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Aug 30 2006 6:49 PM

Michael Baytalsky
Tim,

> text position.  If someone is doing stemming or something similar, they may
> or may not increment the text position on every call.
I'm not convinced, although I'm not sure I understand the word "stemming".
If you need to perform two steps, you can certainly do both at once; there's
no reason to have it invoke the same procedure twice with the same
parameters. It's just going to be the most frequent error, like "oops,
I didn't realize I had to increment the counter". ;-)

Just my 2c.

Michael

Thu, Aug 31 2006 4:32 AM

Roy Lambert

NLH Associates

Team Elevate

Tim

><< However, since you've given more detail, and even though it's really too
>early to ask for details: is EDBWordGeneratorModuleGenerateWord going to be
>called once per word, as I'm guessing? >>
>
>Yes, but once per actual word that will be indexed, not once per word in the
>text.
>

This could present problems. I don't know about anyone else, but when I'm stripping codes from HTML I simply zap the whole of the style definition, i.e. everything between <style> and </style>.

I hope there is another module or something that allows the text as a whole to be processed, as at present?

Roy Lambert

Thu, Aug 31 2006 8:55 AM

Dan Rootham
Michael,

<< I'm not sure I understand the word "stemming" >>

If you have an inflected language, it's useful to isolate the stem of a noun
or verb and use that for the search. For example:
 remove
 removes
 removing
 removed
 removal
 remover

Here the stem would be "remov", and from this stem you can build any
inflected form of the word.

Stemming becomes really important with highly inflected languages
with different gender endings (masculine, feminine, neuter),
different case endings (nominative, accusative, genitive, dative) and
many different verb endings both for tense (present, future, past) and
for person (I, you, he, she, we, they).

And FWIW, stemming is the basis of all "real" machine translation systems.

HTH,
Dan

Lexicon Software Ltd, Bath, UK
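
As a rough illustration of the idea, the sketch below strips a few English suffixes so that all six forms above collapse to "remov". It is a deliberately naive sketch: the suffix list and the `naive_stem` name are assumptions made up for this example, and real stemmers use far more careful rules.

```python
# Suffixes are tried longest first; this list is an illustrative assumption
# chosen to cover the "remove" examples, not a real stemming rule set.
SUFFIXES = ("ing", "ed", "es", "er", "al", "e", "s")

# Refuse to strip a suffix if fewer than this many characters would remain.
MIN_STEM_LENGTH = 3


def naive_stem(word: str) -> str:
    """Lower-case the word and strip the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        remaining = len(lowered) - len(suffix)
        if lowered.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return lowered[:remaining]
    return lowered


forms = ["remove", "removes", "removing", "removed", "removal", "remover"]
stems = {naive_stem(form) for form in forms}  # every form maps to "remov"
```

Indexing the stem instead of the surface form means a search for any one of these words can match text containing any of the others.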

Thu, Aug 31 2006 11:42 AM

Michael Baytalsky
Dan,

> If you have an inflected language, it's useful to isolate the stem of a noun
Very informative, I didn't know that, thanks!

It's still not quite clear to me why one would need
a second pass over the same portion of text to do
stemming (Tim's original argument). But I guess it's not
really important - I just wanted to bring a possible
problem to Tim's attention, at which point it's totally
up to him to figure out what to do or not do about it ;-)

Cheers,
Michael

Thu, Aug 31 2006 6:04 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Roy,

<< This could present problems. I don't know about anyone else, but when I'm
stripping codes from HTML I simply zap the whole of the style definition, i.e.
everything between <style> and </style>. >>

You're getting word generation mixed up with text filtering.  HTML code
stripping would occur in a text filter, which is a different, but similar,
type of external module that is called once for the entire BLOB contents.

--
Tim Young
Elevate Software
www.elevatesoft.com
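
The split Tim describes can be sketched as two separate stages: a text filter that sees the whole content once, and a word generator that then yields only the words to be indexed. This is a minimal sketch under assumptions. The function names, the regular expressions, and the stop-word list are all hypothetical and say nothing about how the actual external modules are written.

```python
import re
from typing import Iterator

# Illustrative patterns only; real HTML stripping needs more care than this.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[A-Za-z0-9]+")

# Hypothetical stop-word list: words found in the text but never indexed.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of"})


def filter_text(raw: str) -> str:
    """Text filter stage: called once with the entire content."""
    without_style = STYLE_BLOCK.sub(" ", raw)  # drop <style>...</style> whole
    return TAG.sub(" ", without_style)         # then drop the remaining tags


def generate_words(filtered: str) -> Iterator[str]:
    """Word generation stage: yields once per word that will be indexed."""
    for match in WORD.finditer(filtered):
        word = match.group().lower()
        if word not in STOP_WORDS:
            yield word


html = "<html><style>p { color: red }</style><p>The quick index</p></html>"
indexed = list(generate_words(filter_text(html)))  # ["quick", "index"]
```

Because the style block is removed in the filter stage, the word generator never sees "color" or "red", and the stop word "the" is skipped rather than yielded, which matches "once per actual word that will be indexed".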

Fri, Sep 1 2006 3:41 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


Phew

Roy Lambert

ps I wasn't getting confused, just wanted to make sure the other approach (which I currently use) was still there - roll on ElevateDB (beta or whatever).

Fri, Sep 1 2006 5:12 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Roy,

<< ps I wasn't getting confused, just wanted to make sure the other approach
(which I currently use) was still there - roll on ElevateDB (beta or
whatever). >>

No problem. :-)

--
Tim Young
Elevate Software
www.elevatesoft.com
