Thread: Creating indexes - looking for a silver bullet
Sat, Mar 17 2007 3:23 AM

"Ralf Mimoun"
Hi all.

I stumbled over a problem with DBISAM 4. A few gigabytes of data must be
imported into a DBISAM table. The result: 2 million records, a 1 GB .DAT and a
1 GB .BLB file. The .BLB file contains the values of the only BLOB field, a memo.

Importing the data is no problem at all, and fast enough (meaning: shorter
than a summer night). But to achieve that, I have to drop all indexes I don't
need for the import and recreate them after the work is done. If I import with
all indexes in place, it takes forever.

Well, I started the index creation this morning at 1:30 am. Now it's 7:47 am -
and it's not finished yet. It looks as if it will need several more hours, and
the HD is seeking like mad. Oh, the test system, built especially for this
test: Core2Duo 1.8 GHz, 1 GB RAM (should be 2, but one module was not OK), 2
Samsung 160 GB SATA2 drives as a stripe set (for speed). Yes, more RAM might
help because of more cache, and there are 15,000 rpm SCSI drives around.

Here is the table structure:

/* SQL-92 Table Creation Script with DBISAM Extensions */

CREATE TABLE IF NOT EXISTS "ArtikelBib"
(
  "ID" AUTOINC,
  "ISBN" VARCHAR(15),
  "EAN" VARCHAR(15),
  "ONIXID" VARCHAR(15),
  "LibriNr" VARCHAR(15),
  "BestellNr" VARCHAR(15),
  "Bezeichnung" VARCHAR(80),
  "DetailBezeichnung" VARCHAR(80),
  "Autor" VARCHAR(30),
  "Verlag" VARCHAR(30),
  "VerlagVerkehrsNummer" VARCHAR(15),
  "Verkaufspreis" DECIMAL(0,4),
  "IstFestpreis" BOOLEAN,
  "MwStID" INTEGER,
  "Einband" VARCHAR(25),
  "Auflage" VARCHAR(20),
  "Beschreibung" MEMO,
  "Warengruppe" VARCHAR(5),
  "Meldenummer" INTEGER,
  "MeldenummerDatum" DATE,
  "Geaendert" BOOLEAN,
PRIMARY KEY ("ID") COMPRESS NONE
TEXT INDEX ("Bezeichnung","DetailBezeichnung","Autor","Beschreibung")
STOP WORDS 'A AN AND BE FOR HOW IN IS IT OF ON OR THAT THE THIS TO WAS WHAT
WHEN WHICH WHY WILL EIN EINE EINER UND IST FÜR WIE IN IST ER SIE ES WIR IHR
VON AUF ODER UND DER DIE DAS WAR ALS WIRD'
SPACE CHARS #1+#2+#3+#4+#5+#6+#7+#8+#9+#10+#11+#12+#13
+#14+#15+#16+#17+#18+#19+#20+#21+#22+#23
+#24+#25+#26+#27+#28+#29+#30+#31+#32+'*+'+
',-./:;<=>\`'
INCLUDE CHARS '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_ab'+
'cdefghijklmnopqrstuvwxyz€‚ƒ„… ‡ˆ‰ ‹ŒŽ'+
'‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·'+
'¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß'+
' áâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ'
LOCALE CODE 1031
USER MAJOR VERSION 1
USER MINOR VERSION 22
);

CREATE NOCASE INDEX IF NOT EXISTS "AutorIndex" ON "ArtikelBib" ("Autor")
COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "ISBNIndex" ON "ArtikelBib" ("ISBN")
COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "EANIndex" ON "ArtikelBib" ("EAN")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "LibriNrIndex" ON "ArtikelBib" ("LibriNr")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "BestellNrIndex" ON "ArtikelBib" ("BestellNr")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "ONIXIDIndex" ON "ArtikelBib" ("ONIXID") COMPRESS
FULL;
CREATE NOCASE INDEX IF NOT EXISTS "BezeichnungIndex" ON "ArtikelBib"
("Bezeichnung") COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "DetailBezeichnungIndex" ON "ArtikelBib"
("DetailBezeichnung") COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "GeaendertIndex" ON "ArtikelBib" ("Geaendert")
COMPRESS FULL;

Some more information: the import function is multi-staged. First, it drops
all indexes (including the text search index, via ALTER TABLE "ArtikelBib" TEXT
INDEX ()) and imports the records, leaving the memo field empty. That takes
about 45 minutes. After that, the indexes are restored (2.5-3 hours) if
there are no more steps. Otherwise, it only creates the EAN index (not
measured, but that should be relatively fast).

Then it drops all indexes except the EAN index for the next step and
imports the memo data. That leads to the 1 GB .BLB file; about half the
records get a value in that field, with an average size of 1000 bytes of
English and German text. After that, all indexes are recreated. And that
takes forever. Worse, the HD is so busy that I can't start the third step,
which only copies files from A to B without any DB activity.
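
In code, the pattern is roughly like this (a TDBISAMQuery executing the statements from the script above; the procedure name and database path are just placeholders, and only the EAN index is shown being recreated):

procedure ImportStageOne;
var
 Q: TDBISAMQuery;
begin
 Q := TDBISAMQuery.Create(nil);
 try
  Q.DatabaseName := 'd:\daten\artikelbib'; // placeholder data directory
  // 1. clear the full text index before the bulk load
  Q.SQL.Text := 'ALTER TABLE "ArtikelBib" TEXT INDEX ()';
  Q.ExecSQL;
  // (the secondary indexes are dropped here as well)

  // 2. ... bulk import of the records, memo field left empty ...

  // 3. recreate only the index needed for the next stage
  Q.SQL.Text := 'CREATE NOCASE INDEX IF NOT EXISTS "EANIndex" ' +
   'ON "ArtikelBib" ("EAN") COMPRESS FULL';
  Q.ExecSQL;
 finally
  Q.Free;
 end;
end;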

It seems that building the text search index is the culprit. What can I do to
kick its ass and speed it up? Would it make a huge difference to upgrade
the machine from 1 GB to 4 GB? Faster drives are out of the question because
that would double the price of the machine.

Ralf
Sat, Mar 17 2007 2:27 PM

"Ralf Mimoun"
Update...

I have thrown some more hardware at the problem: a third drive for the
private directory and another GB of RAM. All in all: a 2x160 GB stripe set,
another 80 GB 7200 rpm drive (which will soon be replaced by another 2x160 GB
stripe set), and 2 GB RAM.

It is _much_ faster. I can see it in the "I/O read bytes" column in the Task
Manager. Problem: even now it has been running for 5 hours, and DBSYS has read
250 GB and written 100 GB (not in one file, just the total amount of bytes
written to disk). And that's only for creating the TEXT INDEX, nothing else.

Tim, is there _anything_ I can do? I can't deliver something that needs a
whole day to index a gigabyte.

Ralf
Sat, Mar 17 2007 4:29 PM

"Ralf Mimoun"
More details...

It seems that reading and writing happen on the same disk, with no activity on
the second disk (for the private directory). That means that the speedup I
mentioned in my last post comes 100% from the additional 1 GB RAM - that is,
from Windows caching.

Now, DBSYS has read about 350 GB and written about 125 GB. In about 7 hours.
To text index a 1 GB BLOB file plus 2 million records with 3
VARCHAR fields. And I don't have a clue how much longer it will take.

Ralf Mimoun
Sun, Mar 18 2007 6:21 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf


I have about a 1.3 GB .blb with a custom filter for the full text indexing. I'm using DBISAM 4.23 and the machine is an HP Pavilion zd8000, 3.2 GHz, 1 GB RAM, and I leave it overnight if I want to do an optimise or repair. Virtually all of the time goes on the full text index.

Apparently ElevateDB is faster. I'm going to go and have a test.

Roy Lambert
Sun, Mar 18 2007 6:50 AM

"Ralf Mimoun"
Roy Lambert wrote:
> Ralf
>
>
> I have about a 1.3 GB .blb with a custom filter for the full text
> indexing. I'm using DBISAM 4.23 and the machine is an HP Pavilion
> zd8000, 3.2 GHz, 1 GB RAM, and I leave it overnight if I want to do an
> optimise or repair. Virtually all of the time goes on the full text
> index.

It has been running for 21 hours now. About 820 GB read, about 280 GB written. I get
the impression that I could do it faster...

> Apparently ElevateDB is faster. I'm going to go and have a test.

Unfortunately, that's not an option for me. I am working on an add-on for an
existing system that uses DBISAM 4, and there is no way that I can switch
that project to ElevateDB in the next few months. But I _must_ deliver by
Tuesday, because it _must_ be ready for a trade fair starting on Thursday. And I
have to give the customer at least one day to install and check everything.
He is already totally pissed off (sorry, but that's the only term that
describes the situation correctly). If anything goes wrong and I can't meet
the deadline, I will have a huge problem. I invested in a brand new
machine just to get the job done, but it does not seem to help. Before that,
I used a slow machine (Duron 1.9, DBISAM running in a VM with only 512 MB
RAM) and had no problems whatsoever. More precisely: I liked it that way,
because that slow machine showed bottlenecks instantly. But 21 hours and
counting is simply out of the question. I have used DBISAM since version 2, and
this is the very first time that I am disappointed with its performance. Let's
hope that Tim knows some tricks.

Ralf
Sun, Mar 18 2007 7:17 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf


I just did a migration of my mail and news client using ElevateDB. I didn't bother to time it because I didn't think it transferred the text indices, but it does, and it built them in the time between postings. I'm very impressed.

The only thing I can suggest apart from switching to ElevateDB is to put in a custom word filter to chop out stuff that's currently being indexed. A lot of words just aren't needed and the engine will be spending a vast amount of time on all the common stuff.

I pick up the min and max word lengths externally so the user can alter them, but the defaults are 3 and 40.

My filter is intended to strip out HTML etc. that comes with email and news messages, but it does significantly reduce the size of the index and possibly the time taken to generate it.

Roy Lambert


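{ DataModuleCreate: scales up the engine's data and index buffers and hooks the custom text index filter (ftiMassage, below) into OnTextIndexFilter before activating the engine. }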
procedure Tmnd.DataModuleCreate(Sender: TObject);
begin
with Engine do begin
 Active := False;
 MaxTableDataBufferSize := MaxTableDataBufferSize * 4;
 MaxTableDataBufferCount := MaxTableDataBufferCount * 4;
 MaxTableIndexBufferSize := MaxTableIndexBufferSize * 2;
 MaxTableIndexBufferCount := MaxTableIndexBufferCount * 2;
//MaxTableBlobBufferSize
//MaxTableBlobBufferCount
 LargeFileSupport := True;
 OnTextIndexFilter := ftiMassage;
 BeforeDeleteTrigger := BeforeDeleteTrigger;
 Active := True;
end;
end;



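{ ftiMassage: the OnTextIndexFilter handler. It strips HTML markup and UU-encoded lines, pulls URLs and e-mail addresses out separately, and drops words outside the configured min/max length from TextToIndex before DBISAM builds the text index. }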
procedure Tmnd.ftiMassage(Sender: TObject; const TableName, FieldName: string; var TextToIndex: string);
var
Cntr: integer;
ThisUn: string;
Skip: boolean;
WorkStr: string;
sl: TStringList;
slCnt: integer;
MinWordLength: integer;
MaxWordLength: integer;
ftiWebStuff: boolean;
ftiHelp: TDBISAMTable;
ftiSession: TDBISAMSession;
URLandEmail: string;
const
Delims = [' ', ',', '.', ';', ':', '!', '"', '?', '(', ')', '/', '\', '>', '<', '[', ']', '}', '{'];
Alphas = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', '-'];
function RemoveHTML(FormattedText: string): string;
var
 CharsToProcess: integer;
 CurrentChar: integer;
 TestChar: Char;
 function HTMLTag: string;
 var
  Tag: string;
  CRCheck: string;
  EqPos: integer;
  procedure FindClosingTag(WhatTag: string);
  var
   TagLength: integer;
   function NoTagMatch: boolean;
   var
    Cntr: integer;
   begin
    Result := False;
    for Cntr := 1 to TagLength do begin
     if WhatTag[Cntr] <> UpperCase(FormattedText[CurrentChar + Cntr]) then begin
      Result := True;
      Break;
     end;
    end;
   end;
  begin
   TagLength := Length(WhatTag);
   while (CurrentChar < CharsToProcess - TagLength) and (NoTagMatch) do inc(CurrentChar);
   CurrentChar := CurrentChar + TagLength + 1;
  end;
 begin
  Tag := '';
  while (FormattedText[CurrentChar] <> '>') and (CurrentChar <= CharsToProcess) do begin
   Tag := Tag + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  Tag := Tag + FormattedText[CurrentChar]; //This should get the >
  Tag := UpperCase(Tag);
  if Tag = '<HEAD>' then begin
   FindClosingTag('</HEAD>');
  end else if Tag = '<XML>' then begin
   FindClosingTag('</XML>');
  end else if Tag = '<TITLE>' then begin
// We need to dump everything until the closing tag
   FindClosingTag('</TITLE>');
   Result := '';
  end else if Copy(Tag, 1, 6) = '<STYLE' then begin
   FindClosingTag('</STYLE>');
// We need to dump everything until the closing tag - especially css stuff
   Result := '';
  end else if Tag = '<BR>' then begin
   Result := #13#10;
  end else if Copy(Tag, 1, 2) = '<P' then begin
   Result := #13#10;
  end else if Tag = '</DIV>' then begin
   if CurrentChar < CharsToProcess then begin
    if (FormattedText[CurrentChar - 6] <> '>') and (FormattedText[CurrentChar - 10] <> '<') then Result := #13#10 else begin
     CRCheck := FormattedText[CurrentChar - 10] +
      FormattedText[CurrentChar - 9] +
      FormattedText[CurrentChar - 8] +
      FormattedText[CurrentChar - 7] +
      FormattedText[CurrentChar - 6];
     if UpperCase(CRCheck) <> '<DIV>' then Result := #13#10;
    end
   end else Result := '';
  end else if (Copy(Tag, 1, 3) = '</H') and (Tag[4] in ['0'..'9']) then begin
   Result := #13#10;
  end else Result := '';
 end;
 function SpecialChar: string;
 var
  HTMLChar: string;
 begin
  HTMLChar := '';
  while (FormattedText[CurrentChar] <> ';') and (CurrentChar <= CharsToProcess) do begin
   HTMLChar := HTMLChar + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  HTMLChar := LowerCase(HTMLChar + FormattedText[CurrentChar]); //This should get the ;
  Result := '';
 end;
begin
 if 0 <> Pos('<html', LowerCase(FormattedText)) then begin
  Result := '';
  CharsToProcess := Length(FormattedText);
  CurrentChar := 1;
  while CurrentChar <= CharsToProcess do begin
   TestChar := FormattedText[CurrentChar];
   case TestChar of
    #0..#9, #11, #12, #14..#31: {do nothing};
    '<': Result := Result + HTMLTag;
    '&': Result := Result + SpecialChar;
   else Result := Result + TestChar;
   end;
   inc(CurrentChar);
  end;
 end else Result := FormattedText;
end;
function LineIsUUEncoded: boolean;
var
 uuCntr: integer;
const
 TableUU = '`!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_';
begin
 Result := False;
 if (Length(WorkStr) > MinWordLength) and (((((Length(WorkStr) - 1) * 3) / 4) = Pos(WorkStr[1], TableUU) - 1) and (WorkStr[1] < 'a')) then begin
// Only if it hits here is there a possibility that its UUEncoded, but we need to check to make sure
  Result := True;
  for uuCntr := 1 to Length(WorkStr) do begin
   if 0 = Pos(WorkStr[uuCntr], TableUU) then begin
    Result := False;
    Break;
   end;
  end;
 end;
end;
function GetWebStuff: string;
var
 wsl: TStringList;
 wCntr: integer;
 lCntr: integer;
 Line: string;
 DelimPos: integer;
const
 wDelims = [' ', ',', ';', ':', '!', '"', '?', '(', ')', '/', '/', '>', '<', '[', ']', '}', '{'];
 function SeekEnd: string;
 var
  Chop: integer;
 begin
  Chop := lCntr + 1;
  while Chop <= Length(Line) do begin
   if (Line[Chop] in wDelims) then Break else inc(Chop);
  end;
  Result := Copy(Line, DelimPos, Chop - DelimPos) + ' ';
  Delete(Line, DelimPos, Chop - DelimPos);
  lCntr := DelimPos;
 end;
begin
{
Rules
1. If the line is less than 10 characters just forget it
2. Only interested in URLs starting www.
3. I'm only interested in the base URL ie if any / then chop before it
4. I'm only interested in one line at a time
}
 wsl := TStringList.Create;
 wsl.Text := TextToIndex;
 TextToIndex := '';
 Result := '';
 for wCntr := 0 to wsl.Count - 1 do begin
  DelimPos := 1;
  if Length(wsl[wCntr]) > 10 then begin
   Line := wsl[wCntr];
   lCntr := 1;
   while lCntr <= Length(Line) do begin
    if Line[lCntr] in wDelims then DelimPos := lCntr + 1;
    if Line[lCntr] = '@' then Result := Result + SeekEnd;
    if LowerCase(Copy(Line, lCntr, 4)) = 'www.' then Result := Result + SeekEnd;
    inc(lCntr);
   end;
   if Line <> '' then TextToIndex := TextToIndex + #13#10 + Line;
  end else TextToIndex := TextToIndex + #13#10 + wsl[wCntr];
 end;
 wsl.Free;
end;
procedure AddWordToBeIndexed;
begin
 if ThisUn = '' then Exit;
 if ThisUn[Length(ThisUn)] = '-' then Delete(ThisUn, Length(ThisUn), 1);
 if (Length(ThisUn) > MinWordLength) and (Length(ThisUn) <= MaxWordLength) then TextToIndex := TextToIndex + ThisUn + ' ';
 ThisUn := '';
end;
begin
if (TextToIndex = '') or (FieldName = '_Flags') then Exit;
if FieldName = '_Headers' then begin
 TextToIndex := GetWebStuff;
 Exit;
end;
ftiSession := MakeDBISAMSession;
ftiHelp := MakeDBISAMTable('ftiHelper', 'Memory', ftiSession);
ftiHelp.Open;
MinWordLength := ftiHelp.FieldByName('_MinWordLength').AsInteger;
MaxWordLength := ftiHelp.FieldByName('_MaxWordLength').AsInteger;
ftiWebStuff := ftiHelp.FieldByName('_ftiWebStuff').AsBoolean;
ftiHelp.Close;
ftiHelp.Free;
ftiSession.Free;
sl := TStringList.Create;
URLandEmail := GetWebStuff;
if FieldName = '_Message' then sl.Text := RemoveHTML(TextToIndex) else sl.Text := TextToIndex;
TextToIndex := '';
for slCnt := 0 to sl.Count - 1 do begin
 WorkStr := sl[slCnt];
 Skip := False;
 if LineIsUUEncoded then Break; // assumption is the rest of the message is UU stuff
 if Length(WorkStr) > MinWordLength then begin
  for Cntr := 1 to length(WorkStr) do begin
   if (not Skip) and (WorkStr[Cntr] in Alphas) then begin
    ThisUn := ThisUn + WorkStr[Cntr];
    Skip := false;
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed;
   end else begin
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed else begin
     Skip := true;
     ThisUn := '';
    end;
   end;
   if Skip then Skip := (Cntr < Length(WorkStr)) and (WorkStr[Cntr + 1] in Delims);
  end;
 end;
end;
if ftiWebStuff and (URLandEmail <> '') then TextToIndex := TextToIndex + URLandEmail;
if TextToIndex <> '' then Delete(TextToIndex, Length(TextToIndex), 1); // get rid of the trailing space
sl.Free;
end;
Sun, Mar 18 2007 8:39 AM

"Ralf Mimoun"
Roy,

Roy Lambert wrote:
....
> The only thing I can suggest apart from switching to ElevateDB is to
> put in a custom word filter to chop out stuff that's currently being
> indexed. A lot of words just aren't needed and the engine will be
> spending a vast amount of time on all the common stuff.

Thank you for the suggestion, but I can't do that. I have 32 stop words, and
I can't add many more because the user might search for all the other words. A
filter is also not appropriate because all I ever have is free text (book
descriptions), without HTML or anything else.

Tim, for the next build of DBISAM, you should add a paragraph to the TEXT
INDEX() documentation: "Please be aware that the index build process needs up to two full
days to complete".

Another problem: I can't simply stop the index build and try some things.
First, because I have no more ideas about what I could optimize (except maybe more
RAM); second, because I can't throw away the 23 hours it has been running so far if
the optimization does not cut the time by a giant factor.

Ralf
Sun, Mar 18 2007 9:12 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf

>> The only thing I can suggest apart from switching to ElevateDB is to
>> put in a custom word filter to chop out stuff that's currently being
>> indexed. A lot of words just aren't needed and the engine will be
>> spending a vast amount of time on all the common stuff.
>
>Thank you for the suggestion, but I can't do that. I have 32 stop words, and
>I can't add many more because the user might search for all the other words. A
>filter is also not appropriate because all I ever have is free text (book
>descriptions), without HTML or anything else.

I don't know what it's like in German (I'm guessing that's the language you're using), but in English, if you analyse the text you'll find a LOT of filler words which no one is likely to search for, and no one will be surprised if they're ignored.

Just take the sentence above - my automated rules would reduce it to:

know what like German guessing that language using English analyse text find filler words which likely search surprised they ignored

Additional STOP WORDS would reduce it to:

know German guessing language using English analyse text find filler words likely search surprised ignored

The reason I came up with this strategy is that years ago I developed a PICK-based system and we wanted full text indexing. We had to roll our own using B-trees, and we found that this was the best trade-off between speed of indexing/deletion (slowwww) and searching with a high hit rate.

It may not be appropriate in your case, but it's worth thinking about.

You may also find that, unless you're concerned with how many times a word occurs, it's worth deduping the word list before allowing it to reach Tim's software.
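
A rough sketch of that deduping step (plain Delphi; a sorted TStringList with Duplicates set to dupIgnore does the work, and the split on single spaces is an assumption about how the word list arrives):

procedure DedupeWordList(var TextToIndex: string);
var
 Words: TStringList; // needs Classes in the uses clause
 Cntr, Start: integer;
 Rebuilt: string;
begin
 Words := TStringList.Create;
 try
  Words.Sorted := True;
  Words.Duplicates := dupIgnore; // repeated words are silently dropped
  // split the space separated word list
  Start := 1;
  for Cntr := 1 to Length(TextToIndex) do
   if TextToIndex[Cntr] = ' ' then begin
    if Cntr > Start then Words.Add(Copy(TextToIndex, Start, Cntr - Start));
    Start := Cntr + 1;
   end;
  if Start <= Length(TextToIndex) then
   Words.Add(Copy(TextToIndex, Start, Length(TextToIndex) - Start + 1));
  // rebuild the list without duplicates
  Rebuilt := '';
  for Cntr := 0 to Words.Count - 1 do begin
   if Rebuilt <> '' then Rebuilt := Rebuilt + ' ';
   Rebuilt := Rebuilt + Words[Cntr];
  end;
  TextToIndex := Rebuilt;
 finally
  Words.Free;
 end;
end;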

Roy Lambert
Sun, Mar 18 2007 2:10 PM

Dave Harrison
Roy Lambert wrote:
> Ralf
>
>
> I just did a migration of my mail and news client using ElevateDB. I didn't bother to time it because I didn't think it transferred the text indices, but it does, and it built them in the time between postings. I'm very impressed.
>

Roy,
    Have you determined how much faster, percentage-wise, ElevateDB is than
DBISAM at building the index (assuming you're using the same data)?
And were the full text searches just as fast with ElevateDB as with DBISAM?
If you have time, could you publish the results? This time saving
in building indexes could be important for users with large tables.

TIA
Dave
Sun, Mar 18 2007 4:32 PM

"Ralf Mimoun"
Roy,

I had to stop the index build because the worst case happened: I encountered a
bug. With turnaround times of days, bug hunting is no fun.

Now I have invested some hours in counting words; about 70 are now in the stop word
list. And I will offer baby food to the text indexing system: stop words
and dupes filtered out, no special chars, making it as easy as possible for
DBISAM. I _really_ hope that it will have some effect, because I have to
deliver the table and, of course, the whole function, in a few hours.
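
Such a pre-filter can hang off the same OnTextIndexFilter event that Roy's code uses; here is a minimal sketch (the data module, handler name, field check and the tiny stop word excerpt are all made up for illustration, and the whitespace split via DelimitedText is deliberately crude):

procedure TdmImport.ArtikelBibTextFilter(Sender: TObject;
 const TableName, FieldName: string; var TextToIndex: string);
var
 StopWords, Words: TStringList; // Classes and SysUtils in the uses clause
 Cntr: integer;
 Filtered: string;
begin
 if FieldName <> 'Beschreibung' then Exit; // only massage the memo field
 StopWords := TStringList.Create;
 Words := TStringList.Create;
 try
  StopWords.Sorted := True; // sorted -> fast, case insensitive lookups
  StopWords.CommaText := 'der,die,das,und,ein,the,and,for'; // excerpt only
  Words.Delimiter := ' ';
  Words.DelimitedText := TextToIndex; // crude split on whitespace
  Filtered := '';
  for Cntr := 0 to Words.Count - 1 do
   if StopWords.IndexOf(Words[Cntr]) = -1 then // drop stop words
    Filtered := Filtered + Words[Cntr] + ' ';
  TextToIndex := Trim(Filtered);
 finally
  Words.Free;
  StopWords.Free;
 end;
end;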

Ralf