Thread: Stop word lists
Thu, Nov 29 2007 3:48 AM

Roy Lambert

NLH Associates

Team Elevate

Tim

In your textfilter do you use a sorted stringlist for the stopwords or do you have a faster approach? If so can you share it please?

Roy Lambert
Thu, Nov 29 2007 4:56 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< In your textfilter do you use a sorted stringlist for the stopwords or do
you have a faster approach? If so can you share it please? >>

It uses a pre-sorted, constant array of strings:

  NUM_STOP_WORDS = 144;

  STOP_WORDS: array [0..NUM_STOP_WORDS-1] of TEDBString = (
     'ABOUT','ABOVE','AFAIK','ALL','ALONG','ALSO','ALTHOUGH','AND','ARE','ARENT',
     'BECAUSE','BEEN','BTW','BUT','CAN','CANNOT','CANT','COULD','COULDNT','DID',
     'DIDNT','DOES','DOESNT','DUH','EITHER','ETC','EVEN','EVER','FOR','FROM',
     'FURTHERMORE','FYI','GET','GETS','GOT','GOTTEN','HAD','HADNT','HARDLY',
     'HAS','HASNT','HAVING','HENCE','HER','HERE','HEREBY','HEREIN',
     'HEREOF','HEREON','HERETO','HEREWITH','HERS','HIM','HIS','HOW','HOWEVER','IMHO','IMO',
     'INTO','ISNT','ITS','LOL','MINE','NOR','NOT','ONTO','OTHER','OTOH','OUR',
     'OURS','OUT','OVER','REALLY','ROTFL','SAID','SAME','SHE','SHOULD','SHOULDNT',
     'SINCE','SOMEWHAT','SUCH','THAN','THAT','THATLL','THATS','THE','THEIR',
     'THEIRS','THEM','THEN','THERE','THEREBY','THEREFORE','THEREFROM',
     'THEREIN','THEREOF','THEREON','THERETO','THEREWITH','THESE','THEY',
     'THEYLL','THEYRE','THIS','THOSE','THROUGH','THROUGHOUT','THUS','TIA','TOO',
     'UNDER','UNTIL','UNTO','UPON','VERY','WAS','WASNT','WERE','WERENT','WHAT',
     'WHEN','WHERE','WHEREBY','WHEREIN','WHETHER','WHICH','WHILE','WHO','WHOM',
     'WHOS','WHOSE','WHY','WITH','WITHIN','WITHOUT','WONT','WOULD','WOULDNT',
     'YOU','YOULL','YOUR','YOURE','YOURS');

--
Tim Young
Elevate Software
www.elevatesoft.com
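
A binary search over a constant array like this only gives correct answers while every entry sorts after the one before it, so it can be worth checking the order once at startup or in a unit test. Here is a minimal sketch of such a check. It assumes plain ordinal comparison via SysUtils.CompareStr in place of the locale-aware CompareStrings, and the function name FirstUnsortedIndex and both sample arrays are made up for illustration.

```pascal
program CheckSorted;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // Hypothetical sample data, not the full list from the post.
  GOOD_WORDS: array [0..3] of string = ('ABOUT', 'ABOVE', 'ALL', 'AND');
  // Deliberately mis-ordered: 'HERS' sorts after 'HEREBY'.
  BAD_WORDS: array [0..3] of string = ('HER', 'HERE', 'HERS', 'HEREBY');

// Returns the index of the first entry that is not strictly greater than
// its predecessor, or -1 when the whole array is in ascending order.
// Assumption: ordinal, case-sensitive comparison (CompareStr).
function FirstUnsortedIndex(const Words: array of string): Integer;
var
  I: Integer;
begin
  Result := -1;
  for I := 1 to High(Words) do
    if CompareStr(Words[I - 1], Words[I]) >= 0 then
    begin
      Result := I;
      Exit;
    end;
end;

begin
  WriteLn(FirstUnsortedIndex(GOOD_WORDS)); // prints -1
  WriteLn(FirstUnsortedIndex(BAD_WORDS));  // prints 3
end.
```

A duplicate entry is reported as well, because the test is ">= 0" rather than "> 0".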

Fri, Nov 30 2007 4:11 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


Not sure I can translate that to a user defined list so I'll go for a sorted stringlist. It should never be that long anyway (famous last words).

Roy Lambert
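
For a user-defined list, a sorted stringlist could look something like the sketch below. It relies on the standard Classes.TStringList: with Sorted set to True, Add keeps the entries in order and Find does a binary search. The names BuildStopWords and IsUserStopWord are hypothetical, and upper-casing the words is an assumption carried over from the all-caps constant array.

```pascal
program StopWordList;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Builds a sorted, duplicate-free list from user-supplied words.
// The caller owns the returned list and must free it.
function BuildStopWords(const Words: array of string): TStringList;
var
  I: Integer;
begin
  Result := TStringList.Create;
  Result.Sorted := True;          // keeps entries ordered on every Add
  Result.Duplicates := dupIgnore; // silently skips repeated words
  for I := Low(Words) to High(Words) do
    Result.Add(UpperCase(Words[I]));
end;

// With Sorted = True, Find performs a binary search on the list.
function IsUserStopWord(List: TStringList; const Value: string): Boolean;
var
  Index: Integer;
begin
  Result := List.Find(UpperCase(Value), Index);
end;

var
  StopWords: TStringList;
begin
  StopWords := BuildStopWords(['the', 'and', 'btw', 'the']);
  try
    WriteLn(StopWords.Count);                       // 3: the repeat is dropped
    WriteLn(IsUserStopWord(StopWords, 'And'));      // found
    WriteLn(IsUserStopWord(StopWords, 'database')); // not found
  finally
    StopWords.Free;
  end;
end.
```

Set Sorted and Duplicates before adding anything, since dupIgnore only takes effect on a sorted list.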
Mon, Dec 3 2007 8:03 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< Not sure I can translate that to a user defined list so I'll go for a
sorted stringlist. It should never be that long anyway (famous last words).
>>

Isn't most text made up of a fairly small set of English words, somewhere
in the sub-20,000-word range? If so, then the stop word list should remain
pretty small over time.

--
Tim Young
Elevate Software
www.elevatesoft.com

Tue, Dec 4 2007 3:24 AM

Roy Lambert

NLH Associates

Team Elevate

Tim

>Isn't most text made up of a fairly small set of English words, somewhere
>in the sub-20,000-word range? If so, then the stop word list should remain
>pretty small over time.

See, I told you so - famous last words - what you're saying is I can grow the stringlist to 19,999 items :-)

Roy Lambert
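
Even at 19,999 items a sorted list stays cheap to search, because every probe of a binary search at least halves what is left to look at. The sketch below counts the worst-case number of probes for a given list size. MaxComparisons is a made-up helper for illustration, not something from the ElevateDB source.

```pascal
program MaxProbes;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Worst-case number of probes (string comparisons) a binary search needs
// on a sorted list of Count items. After each probe at most Count div 2
// candidates remain, so the loop below simply counts the halvings.
function MaxComparisons(Count: Integer): Integer;
begin
  Result := 0;
  while Count > 0 do
  begin
    Inc(Result);
    Count := Count div 2;
  end;
end;

begin
  WriteLn(MaxComparisons(144));   // prints 8
  WriteLn(MaxComparisons(19999)); // prints 15
end.
```

So growing from 144 words to 19,999 would add only seven comparisons per lookup in the worst case.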
Tue, Dec 4 2007 9:09 AM

Roy Lambert

NLH Associates

Team Elevate

Tim
OK, I'm thick, how do you test if a word is in the sorted list?


Roy Lambert
Tue, Dec 4 2007 4:42 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Roy,

<< OK, I'm thick, how do you test if a word is in the sorted list? >>

In edbstring.pas there's this function:

function IsStopWord(Locale: Integer; const Value: TEDBString): Boolean;
var
  CompareResult: Integer;
  I: Integer;
  Low: Integer;
  High: Integer;
begin
  Result:=False;
  Low:=0;
  High:=(NUM_STOP_WORDS-1);
  while (Low <= High) do
     begin
     I:=((Low+High) div 2);
     // CompareStrings is also in edbstring.pas
     CompareResult:=CompareStrings(Locale,STOP_WORDS[I],Value);
     case CompareResult of
        CMP_GREATER: High:=(I-1);
        CMP_LESS: Low:=(I+1);
        CMP_EQUAL:
           begin
           Result:=True;
           Break;
           end;
        end;
     end;
end;

--
Tim Young
Elevate Software
www.elevatesoft.com
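
To try the same search outside ElevateDB, where TEDBString, CompareStrings and the CMP_ constants from edbstring.pas are not available, something like the following sketch should compile on its own. It assumes ordinal, case-sensitive comparison through SysUtils.CompareStr instead of the locale-aware CompareStrings. SAMPLE_WORDS and IsSampleStopWord are hypothetical stand-ins for the real array and function.

```pascal
program StopWordSearch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // Hypothetical shortened list; it must be in ascending ordinal order.
  SAMPLE_WORDS: array [0..5] of string =
    ('ABOUT', 'AND', 'BUT', 'THE', 'WITH', 'YOU');

// Same loop shape as IsStopWord, with CompareStr standing in for
// CompareStrings (assumption: no locale handling is needed).
function IsSampleStopWord(const Value: string): Boolean;
var
  CompareResult, I, Lo, Hi: Integer;
begin
  Result := False;
  Lo := 0;
  Hi := High(SAMPLE_WORDS);
  while Lo <= Hi do
  begin
    I := (Lo + Hi) div 2;
    CompareResult := CompareStr(SAMPLE_WORDS[I], Value);
    if CompareResult > 0 then
      Hi := I - 1   // probe sorts after Value: keep the lower half
    else if CompareResult < 0 then
      Lo := I + 1   // probe sorts before Value: keep the upper half
    else
    begin
      Result := True;
      Break;
    end;
  end;
end;

begin
  WriteLn(IsSampleStopWord('THE'));             // found
  WriteLn(IsSampleStopWord(UpperCase('with'))); // found once upper-cased
  WriteLn(IsSampleStopWord('DATABASE'));        // not found
end.
```

The comparison is case-sensitive, so callers need to upper-case a word before looking it up, as the second call does.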

Wed, Dec 5 2007 4:20 AM

Roy Lambert

NLH Associates

Team Elevate

Tim


Nice - I may steal it for something.

Roy Lambert