Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 9 of 9 total
Thread Full-text indexing taking a long time
Tue, May 8 2007 9:37 PMPermanent Link

kk aw
I know this issue has been raised a few times.

I have a table of just over 135 MB with one memo field indexed.  It is
taking hours for the table to be restructured to include this index.

Is there anything I can do to speed up this process?  This is on a 2GB,
Quad Core machine.

I read about Roy suggestion to de-dupe the string before indexing.  That
would mean changing my 4 applications plus dbsys itself.  Also, the
OnTextIndexFilter event is not fired unless the index is removed and
reinstated again.  Seems to be a long work-around.

Would this be an issue with ElevateDB?  I have bought ElevateDB but has
not got round to using it yet.

Regards,
KK Aw
Wed, May 9 2007 2:51 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

kk


From a couple of simple tests I've carried out ElevateDB is a LOT faster than DBISAM for FTI.

BTW my suggestions aren't to dedup the string, but rather to set up appropriate STOP WORDS. I also dump any word less than x or greater than y characters, and clear out any HTML tags or numbers. Not only does this speed up the indexing it also reduces the size of the index by a large amount.


Roy Lambert
Wed, May 9 2007 6:13 AMPermanent Link

kk aw
Roy,

Thanks for the information about ElevateDB.

Where do you dump words less than x and greater than y and where do you
strip the HTML tags?

Regards,
KK Aw


Wed, May 9 2007 7:31 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

kk


Here you go - the complete wadge

Roy Lambert


procedure Tmnd.ftiMassage(Sender: TObject; const TableName, FieldName: string; var TextToIndex: string);
var
Cntr: integer;
ThisUn: string;
Skip: boolean;
WorkStr: string;
sl: TStringList;
slCnt: integer;
MinWordLength: integer;
MaxWordLength: integer;
ftiWebStuff: boolean;
ftiHelp: TDBISAMTable;
ftiSession: TDBISAMSession;
URLandEmail: string;
const
Delims = [' ', ',', '.', ';', ':', '!', '"', '?', '(', ')', '/', '\', '>', '<', '[', ']', '}', '{'];
Alphas = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', '-'];
function RemoveHTML(FormattedText: string): string;
var
 CharsToProcess: integer;
 CurrentChar: integer;
 TestChar: Char;
 function HTMLTag: string;
 var
  Tag: string;
  CRCheck: string;
  EqPos: integer;
  procedure FindClosingTag(WhatTag: string);
  var
   TagLength: integer;
   function NoTagMatch: boolean;
   var
    Cntr: integer;
   begin
    Result := False;
    for Cntr := 1 to TagLength do begin
     if WhatTag[Cntr] <> UpperCase(FormattedText[CurrentChar + Cntr]) then begin
      Result := True;
      Break;
     end;
    end;
   end;
  begin
   TagLength := Length(WhatTag);
   while (CurrentChar < CharsToProcess - TagLength) and (NoTagMatch) do inc(CurrentChar);
   CurrentChar := CurrentChar + TagLength + 1;
  end;
 begin
  Tag := '';
  while (FormattedText[CurrentChar] <> '>') and (CurrentChar <= CharsToProcess) do begin
   Tag := Tag + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  Tag := Tag + FormattedText[CurrentChar]; //This should get the >
  Tag := UpperCase(Tag);
  if Tag = '<HEAD>' then begin
   FindClosingTag('</HEAD>');
  end else if Tag = '<XML>' then begin
   FindClosingTag('</XML>');
  end else if Tag = '<TITLE>' then begin
// We need to dump everything until the closing tag
   FindClosingTag('</TITLE>');
   Result := '';
  end else if Copy(Tag, 1, 6) = '<STYLE' then begin
   FindClosingTag('</STYLE>');
// We need to dump everything until the closing tag - especially css stuff
   Result := '';
  end else if Tag = '<BR>' then begin
   Result := #13#10;
  end else if Copy(Tag, 1, 2) = '<P' then begin
   Result := #13#10;
  end else if Tag = '</DIV>' then begin
   if CurrentChar < CharsToProcess then begin
    if (FormattedText[CurrentChar - 6] <> '>') and (FormattedText[CurrentChar - 10] <> '<') then Result := #13#10 else begin
     CRCheck := FormattedText[CurrentChar - 10] +
      FormattedText[CurrentChar - 9] +
      FormattedText[CurrentChar - 8] +
      FormattedText[CurrentChar - 7] +
      FormattedText[CurrentChar - 6];
     if UpperCase(CRCheck) <> '<DIV>' then Result := #13#10;
    end
   end else Result := '';
  end else if (Copy(Tag, 1, 3) = '</H') and (Tag[4] in ['0'..'9']) then begin
   Result := #13#10;
  end else Result := '';
 end;
 function SpecialChar: string;
 var
  HTMLChar: string;
 begin
  HTMLChar := '';
  while (FormattedText[CurrentChar] <> ';') and (CurrentChar <= CharsToProcess) do begin
   HTMLChar := HTMLChar + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  HTMLChar := LowerCase(HTMLChar + FormattedText[CurrentChar]); //This should get the ;
  Result := '';
 end;
begin
 if 0 <> Pos('<html', LowerCase(FormattedText)) then begin
  Result := '';
  CharsToProcess := Length(FormattedText);
  CurrentChar := 1;
  while CurrentChar <= CharsToProcess do begin
   TestChar := FormattedText[CurrentChar];
   case TestChar of
    #0..#9, #11, #12, #14..#31: {do nothing};
    '<': Result := Result + HTMLTag;
    '&': Result := Result + SpecialChar;
   else Result := Result + TestChar;
   end;
   inc(CurrentChar);
  end;
 end else Result := FormattedText;
end;
function LineIsUUEncoded: boolean;
var
 uuCntr: integer;
const
 TableUU = '`!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_';
begin
 Result := False;
 if (Length(WorkStr) > MinWordLength) and (((((Length(WorkStr) - 1) * 3) / 4) = Pos(WorkStr[1], TableUU) - 1) and (WorkStr[1] < 'a')) then begin
// Only if it hits here is there a possibility that its UUEncoded, but we need to check to make sure
  Result := True;
  for uuCntr := 1 to Length(WorkStr) do begin
   if 0 = Pos(WorkStr[uuCntr], TableUU) then begin
    Result := False;
    Break;
   end;
  end;
 end;
end;
function GetWebStuff: string;
var
 wsl: TStringList;
 wCntr: integer;
 lCntr: integer;
 Line: string;
 DelimPos: integer;
const
 wDelims = [' ', ',', ';', ':', '!', '"', '?', '(', ')', '/', '/', '>', '<', '[', ']', '}', '{'];
 function SeekEnd: string;
 var
  Chop: integer;
 begin
  Chop := lCntr + 1;
  while Chop <= Length(Line) do begin
   if (Line[Chop] in wDelims) then Break else inc(Chop);
  end;
  Result := Copy(Line, DelimPos, Chop - DelimPos) + ' ';
  Delete(Line, DelimPos, Chop - DelimPos);
  lCntr := DelimPos;
 end;
begin
{
Rules
1. If the line is less than 10 characters just forget it
2. Only interested in URLs starting www.
3. I'm only interested in the base URL ie if any / then chop before it
4. I'm only interested in one line at a time
}
 wsl := TStringList.Create;
 wsl.Text := TextToIndex;
 TextToIndex := '';
 Result := '';
 for wCntr := 0 to wsl.Count - 1 do begin
  DelimPos := 1;
  if Length(wsl[wCntr]) > 10 then begin
   Line := wsl[wCntr];
   lCntr := 1;
   while lCntr <= Length(Line) do begin
    if Line[lCntr] in wDelims then DelimPos := lCntr + 1;
    if Line[lCntr] = '@' then Result := Result + SeekEnd;
    if LowerCase(Copy(Line, lCntr, 4)) = 'www.' then Result := Result + SeekEnd;
    inc(lCntr);
   end;
   if Line <> '' then TextToIndex := TextToIndex + #13#10 + Line;
  end else TextToIndex := TextToIndex + #13#10 + wsl[wCntr];
 end;
 wsl.Free;
end;
procedure AddWordToBeIndexed;
begin
 if ThisUn = '' then Exit;
 if ThisUn[Length(ThisUn)] = '-' then Delete(ThisUn, Length(ThisUn), 1);
 if (Length(ThisUn) > MinWordLength) and (Length(ThisUn) <= MaxWordLength) then TextToIndex := TextToIndex + ThisUn + ' ';
 ThisUn := '';
end;
begin
if (TextToIndex = '') or (FieldName = '_Flags') then Exit;
if FieldName = '_Headers' then begin
 TextToIndex := GetWebStuff;
 Exit;
end;
ftiSession := MakeDBISAMSession;
ftiHelp := MakeDBISAMTable('ftiHelper', 'Memory', ftiSession);
ftiHelp.Open;
MinWordLength := ftiHelp.FieldByName('_MinWordLength').AsInteger;
MaxWordLength := ftiHelp.FieldByName('_MaxWordLength').AsInteger;
ftiWebStuff := ftiHelp.FieldByName('_ftiWebStuff').AsBoolean;
ftiHelp.Close;
ftiHelp.Free;
ftiSession.Free;
sl := TStringList.Create;
URLandEmail := GetWebStuff;
if FieldName = '_Message' then sl.Text := RemoveHTML(TextToIndex) else sl.Text := TextToIndex;
TextToIndex := '';
for slCnt := 0 to sl.Count - 1 do begin
 WorkStr := sl[slCnt];
 Skip := False;
 if LineIsUUEncoded then Break; // assumption is the rest of the message is UU stuff
 if Length(WorkStr) > MinWordLength then begin
  for Cntr := 1 to length(WorkStr) do begin
   if (not Skip) and (WorkStr[Cntr] in Alphas) then begin
    ThisUn := ThisUn + WorkStr[Cntr];
    Skip := false;
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed;
   end else begin
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed else begin
     Skip := true;
     ThisUn := '';
    end;
   end;
   if Skip then Skip := (Cntr < Length(WorkStr)) and (WorkStr[Cntr + 1] in Delims);
  end;
 end;
end;
if ftiWebStuff and (URLandEmail <> '') then TextToIndex := TextToIndex + URLandEmail;
if TextToIndex <> '' then Delete(TextToIndex, Length(TextToIndex), 1); // get rid of the trailing space
sl.Free;
end;


procedure Tmnd.DataModuleCreate(Sender: TObject);
begin
with Engine do begin
 Active := False;
 MaxTableDataBufferSize := MaxTableDataBufferSize * 4;
 MaxTableDataBufferCount := MaxTableDataBufferCount * 4;
 MaxTableIndexBufferSize := MaxTableIndexBufferSize * 2;
 MaxTableIndexBufferCount := MaxTableIndexBufferCount * 2;
//MaxTableBlobBufferSize
//MaxTableBlobBufferCount
 LargeFileSupport := True;
 OnTextIndexFilter := ftiMassage;
 BeforeDeleteTrigger := BeforeDeleteTrigger;
 Active := True;
end;
end;
Wed, May 9 2007 10:25 AMPermanent Link

kk aw
Roy,

Thanks.  I will give it a try.

KK Aw
Wed, May 9 2007 10:35 AMPermanent Link

kk aw
Roy,

I am missing makeDbisamSession and makeDBISAMTable.

Regards,
KK Aw
Wed, May 9 2007 11:35 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

kk


Some people need spoon feeding Smiley

Roy Lambert


function MakeDBISAMSession: TDBISAMSession;
begin
Result := TDBISAMSession.Create(nil);
Result.PrivateDir := GetWindowsTempPath;
Result.AutoSessionName := True;
Result.LockProtocol := lpPessimistic;
end;

function MakeDBISAMDatabase(const iSession: TDBISAMSession; const Path: string): TDBISAMDatabase;
begin
Result := TDBISAMDatabase.Create(nil);
Result.DatabaseName := 'db' + iSession.SessionName;
Result.Directory := Path;
Result.SessionName := iSession.SessionName;
end;

function MakeDBISAMTable(const iName: string; const iDBPath: string; const iSession: TDBISAMSession): TDBISAMTable;
begin
Result := TDBISAMTable.Create(nil);
Result.TableName := iName;
Result.DatabaseName := iDBPath;
Result.SessionName := iSession.SessionName;
end;

function MakeDBISAMQuery(const iDBPath: string; const iSession: TDBISAMSession): TDBISAMQuery;
begin
Result := TDBISAMQuery.Create(nil);
Result.DatabaseName := iDBPath;
Result.SessionName := iSession.SessionName;
end;
Thu, May 10 2007 4:49 AMPermanent Link

kk aw
Roy,

Thanks.

I wasn't sure what these functions were supposed to do.

KK Aw
Thu, May 10 2007 5:21 AMPermanent Link

Roy Lambert

NLH Associates

Team Elevate Team Elevate

kk

>Thanks.
>
>I wasn't sure what these functions were supposed to do.

I made the names as descriptive as I could <vbg>

Roy Lambert
Image