Thread: Creating indexes - looking for a silver bullet
Sat, Mar 17 2007 3:23 AM

"Ralf Mimoun"
Hi all.

I stumbled over a problem with DBISAM 4. A few gigabytes of data must be
imported into a DBISAM table. The result: 2 million records, a 1 GB .DAT and a
1 GB .BLB file. The .BLB file contains the values of the only BLOB field, a memo.

Importing the data is no problem at all, and fast enough (meaning: shorter
than a summer night). But to achieve that, I have to drop all indexes I don't
need for the import and recreate them after the work is done. If I import with
all indexes in place, it takes forever.

Well, I started the index creation this morning at 1:30 am. Now it's 7:47 am -
and it's not finished yet. It looks as if it will need several more hours, and
the HD is seeking like mad. Oh, the test system, built especially for this
test: Core2Duo 1.8 GHz, 1 GB RAM (should be 2, but one module was not OK), 2
Samsung 160 GB SATA2 drives as a stripe set (for speed). Yes, more RAM might
help because of more cache, and there are 15,000 rpm SCSI drives around.

Here is the table structure:

/* SQL-92 Table Creation Script with DBISAM Extensions */

CREATE TABLE IF NOT EXISTS "ArtikelBib"
(
  "ID" AUTOINC,
  "ISBN" VARCHAR(15),
  "EAN" VARCHAR(15),
  "ONIXID" VARCHAR(15),
  "LibriNr" VARCHAR(15),
  "BestellNr" VARCHAR(15),
  "Bezeichnung" VARCHAR(80),
  "DetailBezeichnung" VARCHAR(80),
  "Autor" VARCHAR(30),
  "Verlag" VARCHAR(30),
  "VerlagVerkehrsNummer" VARCHAR(15),
  "Verkaufspreis" DECIMAL(0,4),
  "IstFestpreis" BOOLEAN,
  "MwStID" INTEGER,
  "Einband" VARCHAR(25),
  "Auflage" VARCHAR(20),
  "Beschreibung" MEMO,
  "Warengruppe" VARCHAR(5),
  "Meldenummer" INTEGER,
  "MeldenummerDatum" DATE,
  "Geaendert" BOOLEAN,
PRIMARY KEY ("ID") COMPRESS NONE
TEXT INDEX ("Bezeichnung","DetailBezeichnung","Autor","Beschreibung")
STOP WORDS 'A AN AND BE FOR HOW IN IS IT OF ON OR THAT THE THIS TO WAS WHAT
WHEN WHICH WHY WILL EIN EINE EINER UND IST FÜR WIE IN IST ER SIE ES WIR IHR
VON AUF ODER UND DER DIE DAS WAR ALS WIRD'
SPACE CHARS #1+#2+#3+#4+#5+#6+#7+#8+#9+#10+#11+#12+#13
+#14+#15+#16+#17+#18+#19+#20+#21+#22+#23
+#24+#25+#26+#27+#28+#29+#30+#31+#32+'*+'+
',-./:;<=>\`'
INCLUDE CHARS '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_ab'+
'cdefghijklmnopqrstuvwxyz€‚ƒ„… ‡ˆ‰ ‹ŒŽ'+
'‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·'+
'¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß'+
' áâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ'
LOCALE CODE 1031
USER MAJOR VERSION 1
USER MINOR VERSION 22
);

CREATE NOCASE INDEX IF NOT EXISTS "AutorIndex" ON "ArtikelBib" ("Autor")
COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "ISBNIndex" ON "ArtikelBib" ("ISBN")
COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "EANIndex" ON "ArtikelBib" ("EAN")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "LibriNrIndex" ON "ArtikelBib" ("LibriNr")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "BestellNrIndex" ON "ArtikelBib" ("BestellNr")
COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "ONIXIDIndex" ON "ArtikelBib" ("ONIXID") COMPRESS
FULL;
CREATE NOCASE INDEX IF NOT EXISTS "BezeichnungIndex" ON "ArtikelBib"
("Bezeichnung") COMPRESS FULL;
CREATE NOCASE INDEX IF NOT EXISTS "DetailBezeichnungIndex" ON "ArtikelBib"
("DetailBezeichnung") COMPRESS FULL;
CREATE INDEX IF NOT EXISTS "GeaendertIndex" ON "ArtikelBib" ("Geaendert")
COMPRESS FULL;

Some more information: the import function is multi-staged. First, it drops
all indexes (including the text search index, via ALTER TABLE "ArtikelBib" TEXT
INDEX ()) and imports the records, leaving the memo field empty. That takes
about 45 minutes. After that, the indexes are restored (2.5-3 hours) if
there are no more steps. Otherwise, it only creates the EAN index (not
measured, but that should be relatively fast).

Then it drops all indexes except the EAN index for the next step and
imports the memo data. That leads to the 1 GB .BLB file; about half the
records get a value in that field, with an average size of 1000 bytes of
English and German text. After that, all indexes are recreated. And that
takes forever. Worse, the HD is so busy that I can't start the third step,
which only copies files from A to B without any DB activity.
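
In code, the pattern is roughly like this (a TDBISAMQuery executing the statements from the script above; the procedure name and database path are just placeholders, and only the EAN index is shown being recreated):

procedure ImportStageOne;
var
 Q: TDBISAMQuery;
begin
 Q := TDBISAMQuery.Create(nil);
 try
  Q.DatabaseName := 'd:\daten\artikelbib'; // placeholder data directory
  // 1. clear the full text index before the bulk load
  Q.SQL.Text := 'ALTER TABLE "ArtikelBib" TEXT INDEX ()';
  Q.ExecSQL;
  // (the secondary indexes are dropped here as well)

  // 2. ... bulk import of the records, memo field left empty ...

  // 3. recreate only the index needed for the next stage
  Q.SQL.Text := 'CREATE NOCASE INDEX IF NOT EXISTS "EANIndex" ' +
   'ON "ArtikelBib" ("EAN") COMPRESS FULL';
  Q.ExecSQL;
 finally
  Q.Free;
 end;
end;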

It seems that building the text search index is the culprit. What can I do to
kick its ass and speed it up? Would it make a huge difference to upgrade
the machine from 1 GB to 4 GB? Faster drives are out of the question because
that would double the price of the machine.

Ralf
Sat, Mar 17 2007 2:27 PM

"Ralf Mimoun"
Update...

I have thrown some more hardware at the problem: a third drive for the
private directory and another GB of RAM. All in all: a 2x160 GB stripe set,
another 80 GB 7200 rpm drive (which will soon be replaced by another 2x160 GB
stripe set), and 2 GB RAM.

It is _much_ faster. I can see it in the "I/O read bytes" column in the Task
Manager. Problem: even now it has been running for 5 hours, and DBSYS has read
250 GB and written 100 GB (not in one file, just the total amount of bytes
written to disk). And that's only for creating the TEXT INDEX, nothing else.

Tim, is there _anything_ I can do? I can't deliver something that needs a
whole day to index a gigabyte.

Ralf
Sat, Mar 17 2007 4:29 PM

"Ralf Mimoun"
More details...

It seems that reading and writing happen on the same disk, with no activity on
the second disk (for the private directory). That means that the speedup I
mentioned in my last post comes 100% from the additional 1 GB RAM - that is,
from Windows caching.

Now, DBSYS has read about 350 GB and written about 125 GB. In about 7 hours.
To text index a 1 GB BLOB file plus 2 million records with 3
VARCHAR fields. And I don't have a clue how much longer it will take.

Ralf Mimoun
Sun, Mar 18 2007 6:21 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf


I have about a 1.3 GB .blb with a custom filter for the full text indexing. I'm using DBISAM 4.23 and the machine is an HP Pavilion zd8000, 3.2 GHz, 1 GB RAM, and I leave it overnight if I want to do an optimise or repair. Virtually all of the time goes on the full text index.

Apparently ElevateDB is faster. I'm going to go and have a test.

Roy Lambert
Sun, Mar 18 2007 6:50 AM

"Ralf Mimoun"
Roy Lambert wrote:
> Ralf
>
>
> I have about a 1.3 GB .blb with a custom filter for the full text
> indexing. I'm using DBISAM 4.23 and the machine is an HP Pavilion
> zd8000, 3.2 GHz, 1 GB RAM, and I leave it overnight if I want to do an
> optimise or repair. Virtually all of the time goes on the full text
> index.

It has been running for 21 hours now. About 820 GB read, about 280 GB written. I get
the impression that I could do it faster...

> Apparently ElevateDB is faster. I'm going to go and have a test.

Unfortunately, that's not an option for me. I am working on an add-on for an
existing system that uses DBISAM 4, and there is no way that I can switch
that project to ElevateDB in the next few months. But I _must_ deliver by
Tuesday, because it _must_ be ready for a trade fair starting on Thursday. And I
have to give the customer at least one day to install and check everything.
He is already totally pissed off (sorry, but that's the only term that
describes the situation correctly). If anything goes wrong and I can't meet
the deadline, I will have a huge problem. I invested in a brand new
machine just to get the job done, but it does not seem to help. Before that,
I used a slow machine (Duron 1.9, DBISAM running in a VM with only 512 MB
RAM) and had no problems whatsoever. More precisely: I liked it that way,
because that slow machine showed bottlenecks instantly. But 21 hours and
counting is simply out of the question. I have used DBISAM since version 2, and
this is the very first time that I am disappointed with its performance. Let's
hope that Tim knows some tricks.

Ralf
Sun, Mar 18 2007 7:17 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf


I just did a migration of my mail and news client using ElevateDB. I didn't bother to time it because I didn't think it transferred the text indices, but it does, and it built them in the time between postings. I'm very impressed.

The only thing I can suggest apart from switching to ElevateDB is to put in a custom word filter to chop out stuff that's currently being indexed. A lot of words just aren't needed and the engine will be spending a vast amount of time on all the common stuff.

I pick up the min and max word lengths externally so the user can alter them, but the defaults are 3 and 40.

My filter is intended to strip out HTML etc. that comes with email and news messages, but it does significantly reduce the size of the index and possibly the time taken to generate it.

Roy Lambert


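{ DataModuleCreate: scales up the engine's data and index buffers and hooks the custom text index filter (ftiMassage, below) into OnTextIndexFilter before activating the engine. }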
procedure Tmnd.DataModuleCreate(Sender: TObject);
begin
with Engine do begin
 Active := False;
 MaxTableDataBufferSize := MaxTableDataBufferSize * 4;
 MaxTableDataBufferCount := MaxTableDataBufferCount * 4;
 MaxTableIndexBufferSize := MaxTableIndexBufferSize * 2;
 MaxTableIndexBufferCount := MaxTableIndexBufferCount * 2;
//MaxTableBlobBufferSize
//MaxTableBlobBufferCount
 LargeFileSupport := True;
 OnTextIndexFilter := ftiMassage;
 BeforeDeleteTrigger := BeforeDeleteTrigger;
 Active := True;
end;
end;



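{ ftiMassage: the OnTextIndexFilter handler. It strips HTML markup and UU-encoded lines, pulls URLs and e-mail addresses out separately, and drops words outside the configured min/max length from TextToIndex before DBISAM builds the text index. }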
procedure Tmnd.ftiMassage(Sender: TObject; const TableName, FieldName: string; var TextToIndex: string);
var
Cntr: integer;
ThisUn: string;
Skip: boolean;
WorkStr: string;
sl: TStringList;
slCnt: integer;
MinWordLength: integer;
MaxWordLength: integer;
ftiWebStuff: boolean;
ftiHelp: TDBISAMTable;
ftiSession: TDBISAMSession;
URLandEmail: string;
const
Delims = [' ', ',', '.', ';', ':', '!', '"', '?', '(', ')', '/', '\', '>', '<', '[', ']', '}', '{'];
Alphas = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', '-'];
function RemoveHTML(FormattedText: string): string;
var
 CharsToProcess: integer;
 CurrentChar: integer;
 TestChar: Char;
 function HTMLTag: string;
 var
  Tag: string;
  CRCheck: string;
  EqPos: integer;
  procedure FindClosingTag(WhatTag: string);
  var
   TagLength: integer;
   function NoTagMatch: boolean;
   var
    Cntr: integer;
   begin
    Result := False;
    for Cntr := 1 to TagLength do begin
     if WhatTag[Cntr] <> UpperCase(FormattedText[CurrentChar + Cntr]) then begin
      Result := True;
      Break;
     end;
    end;
   end;
  begin
   TagLength := Length(WhatTag);
   while (CurrentChar < CharsToProcess - TagLength) and (NoTagMatch) do inc(CurrentChar);
   CurrentChar := CurrentChar + TagLength + 1;
  end;
 begin
  Tag := '';
  while (FormattedText[CurrentChar] <> '>') and (CurrentChar <= CharsToProcess) do begin
   Tag := Tag + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  Tag := Tag + FormattedText[CurrentChar]; //This should get the >
  Tag := UpperCase(Tag);
  if Tag = '<HEAD>' then begin
   FindClosingTag('</HEAD>');
  end else if Tag = '<XML>' then begin
   FindClosingTag('</XML>');
  end else if Tag = '<TITLE>' then begin
// We need to dump everything until the closing tag
   FindClosingTag('</TITLE>');
   Result := '';
  end else if Copy(Tag, 1, 6) = '<STYLE' then begin
   FindClosingTag('</STYLE>');
// We need to dump everything until the closing tag - especially css stuff
   Result := '';
  end else if Tag = '<BR>' then begin
   Result := #13#10;
  end else if Copy(Tag, 1, 2) = '<P' then begin
   Result := #13#10;
  end else if Tag = '</DIV>' then begin
   if CurrentChar < CharsToProcess then begin
    if (FormattedText[CurrentChar - 6] <> '>') and (FormattedText[CurrentChar - 10] <> '<') then Result := #13#10 else begin
     CRCheck := FormattedText[CurrentChar - 10] +
      FormattedText[CurrentChar - 9] +
      FormattedText[CurrentChar - 8] +
      FormattedText[CurrentChar - 7] +
      FormattedText[CurrentChar - 6];
     if UpperCase(CRCheck) <> '<DIV>' then Result := #13#10;
    end
   end else Result := '';
  end else if (Copy(Tag, 1, 3) = '</H') and (Tag[4] in ['0'..'9']) then begin
   Result := #13#10;
  end else Result := '';
 end;
 function SpecialChar: string;
 var
  HTMLChar: string;
 begin
  HTMLChar := '';
  while (FormattedText[CurrentChar] <> ';') and (CurrentChar <= CharsToProcess) do begin
   HTMLChar := HTMLChar + FormattedText[CurrentChar];
   inc(CurrentChar);
  end;
  HTMLChar := LowerCase(HTMLChar + FormattedText[CurrentChar]); //This should get the ;
  Result := '';
 end;
begin
 if 0 <> Pos('<html', LowerCase(FormattedText)) then begin
  Result := '';
  CharsToProcess := Length(FormattedText);
  CurrentChar := 1;
  while CurrentChar <= CharsToProcess do begin
   TestChar := FormattedText[CurrentChar];
   case TestChar of
    #0..#9, #11, #12, #14..#31: {do nothing};
    '<': Result := Result + HTMLTag;
    '&': Result := Result + SpecialChar;
   else Result := Result + TestChar;
   end;
   inc(CurrentChar);
  end;
 end else Result := FormattedText;
end;
function LineIsUUEncoded: boolean;
var
 uuCntr: integer;
const
 TableUU = '`!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_';
begin
 Result := False;
 if (Length(WorkStr) > MinWordLength) and (((((Length(WorkStr) - 1) * 3) / 4) = Pos(WorkStr[1], TableUU) - 1) and (WorkStr[1] < 'a')) then begin
// Only if it hits here is there a possibility that its UUEncoded, but we need to check to make sure
  Result := True;
  for uuCntr := 1 to Length(WorkStr) do begin
   if 0 = Pos(WorkStr[uuCntr], TableUU) then begin
    Result := False;
    Break;
   end;
  end;
 end;
end;
function GetWebStuff: string;
var
 wsl: TStringList;
 wCntr: integer;
 lCntr: integer;
 Line: string;
 DelimPos: integer;
const
 wDelims = [' ', ',', ';', ':', '!', '"', '?', '(', ')', '/', '/', '>', '<', '[', ']', '}', '{'];
 function SeekEnd: string;
 var
  Chop: integer;
 begin
  Chop := lCntr + 1;
  while Chop <= Length(Line) do begin
   if (Line[Chop] in wDelims) then Break else inc(Chop);
  end;
  Result := Copy(Line, DelimPos, Chop - DelimPos) + ' ';
  Delete(Line, DelimPos, Chop - DelimPos);
  lCntr := DelimPos;
 end;
begin
{
Rules
1. If the line is less than 10 characters just forget it
2. Only interested in URLs starting www.
3. I'm only interested in the base URL ie if any / then chop before it
4. I'm only interested in one line at a time
}
 wsl := TStringList.Create;
 wsl.Text := TextToIndex;
 TextToIndex := '';
 Result := '';
 for wCntr := 0 to wsl.Count - 1 do begin
  DelimPos := 1;
  if Length(wsl[wCntr]) > 10 then begin
   Line := wsl[wCntr];
   lCntr := 1;
   while lCntr <= Length(Line) do begin
    if Line[lCntr] in wDelims then DelimPos := lCntr + 1;
    if Line[lCntr] = '@' then Result := Result + SeekEnd;
    if LowerCase(Copy(Line, lCntr, 4)) = 'www.' then Result := Result + SeekEnd;
    inc(lCntr);
   end;
   if Line <> '' then TextToIndex := TextToIndex + #13#10 + Line;
  end else TextToIndex := TextToIndex + #13#10 + wsl[wCntr];
 end;
 wsl.Free;
end;
procedure AddWordToBeIndexed;
begin
 if ThisUn = '' then Exit;
 if ThisUn[Length(ThisUn)] = '-' then Delete(ThisUn, Length(ThisUn), 1);
 if (Length(ThisUn) > MinWordLength) and (Length(ThisUn) <= MaxWordLength) then TextToIndex := TextToIndex + ThisUn + ' ';
 ThisUn := '';
end;
begin
if (TextToIndex = '') or (FieldName = '_Flags') then Exit;
if FieldName = '_Headers' then begin
 TextToIndex := GetWebStuff;
 Exit;
end;
ftiSession := MakeDBISAMSession;
ftiHelp := MakeDBISAMTable('ftiHelper', 'Memory', ftiSession);
ftiHelp.Open;
MinWordLength := ftiHelp.FieldByName('_MinWordLength').AsInteger;
MaxWordLength := ftiHelp.FieldByName('_MaxWordLength').AsInteger;
ftiWebStuff := ftiHelp.FieldByName('_ftiWebStuff').AsBoolean;
ftiHelp.Close;
ftiHelp.Free;
ftiSession.Free;
sl := TStringList.Create;
URLandEmail := GetWebStuff;
if FieldName = '_Message' then sl.Text := RemoveHTML(TextToIndex) else sl.Text := TextToIndex;
TextToIndex := '';
for slCnt := 0 to sl.Count - 1 do begin
 WorkStr := sl[slCnt];
 Skip := False;
 if LineIsUUEncoded then Break; // assumption is the rest of the message is UU stuff
 if Length(WorkStr) > MinWordLength then begin
  for Cntr := 1 to length(WorkStr) do begin
   if (not Skip) and (WorkStr[Cntr] in Alphas) then begin
    ThisUn := ThisUn + WorkStr[Cntr];
    Skip := false;
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed;
   end else begin
    if (Cntr = length(WorkStr)) or (WorkStr[Cntr + 1] in Delims) then AddWordToBeIndexed else begin
     Skip := true;
     ThisUn := '';
    end;
   end;
   if Skip then Skip := (Cntr < Length(WorkStr)) and (WorkStr[Cntr + 1] in Delims);
  end;
 end;
end;
if ftiWebStuff and (URLandEmail <> '') then TextToIndex := TextToIndex + URLandEmail;
if TextToIndex <> '' then Delete(TextToIndex, Length(TextToIndex), 1); // get rid of the trailing space
sl.Free;
end;
Sun, Mar 18 2007 8:39 AM

"Ralf Mimoun"
Roy,

Roy Lambert wrote:
....
> The only thing I can suggest apart from switching to ElevateDB is to
> put in a custom word filter to chop out stuff that's currently being
> indexed. A lot of words just aren't needed and the engine will be
> spending a vast amount of time on all the common stuff.

Thank you for the suggestion, but I can't do that. I have 32 stop words, and
I can't add many more because the user might search for all the other words. A
filter is also not appropriate because all I ever have is free text (book
descriptions), without HTML or anything else.

Tim, for the next build of DBISAM, you should add a paragraph to the TEXT
INDEX() documentation: "Please be aware that the index build process needs up to two full
days to complete".

Another problem: I can't simply stop the index build and try some things.
First, because I have no more ideas about what I could optimize (except maybe more
RAM); second, because I can't throw away the 23 hours it has been running so far if
the optimization does not cut the time by a giant factor.

Ralf
Sun, Mar 18 2007 9:12 AM

Roy Lambert

NLH Associates

Team Elevate

Ralf

>> The only thing I can suggest apart from switching to ElevateDB is to
>> put in a custom word filter to chop out stuff that's currently being
>> indexed. A lot of words just aren't needed and the engine will be
>> spending a vast amount of time on all the common stuff.
>
>Thank you for the suggestion, but I can't do that. I have 32 stop words, and
>I can't add many more because the user might search for all the other words. A
>filter is also not appropriate because all I ever have is free text (book
>descriptions), without HTML or anything else.

I don't know what it's like in German (I'm guessing that's the language you're using), but in English, if you analyse the text you'll find a LOT of filler words which no one is likely to search for, and no one will be surprised if they're ignored.

Just take the sentence above - my automated rules would reduce it to:

know what like German guessing that language using English analyse text find filler words which likely search surprised they ignored

Additional STOP WORDS would reduce it to:

know German guessing language using English analyse text find filler words likely search surprised ignored

The reason I came up with this strategy is that years ago I developed a PICK-based system and we wanted full text indexing. We had to roll our own using B-trees, and we found that this was the best trade-off between speed of indexing/deletion (slowwww) and searching with a high hit rate.

It may not be appropriate in your case, but it's worth thinking about.

You may also find that, unless you're concerned with how many times a word occurs, it's worth deduping the word list before allowing it to reach Tim's software.
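
A rough sketch of that deduping step (plain Delphi; a sorted TStringList with Duplicates set to dupIgnore does the work, and the split on single spaces is an assumption about how the word list arrives):

procedure DedupeWordList(var TextToIndex: string);
var
 Words: TStringList; // needs Classes in the uses clause
 Cntr, Start: integer;
 Rebuilt: string;
begin
 Words := TStringList.Create;
 try
  Words.Sorted := True;
  Words.Duplicates := dupIgnore; // repeated words are silently dropped
  // split the space separated word list
  Start := 1;
  for Cntr := 1 to Length(TextToIndex) do
   if TextToIndex[Cntr] = ' ' then begin
    if Cntr > Start then Words.Add(Copy(TextToIndex, Start, Cntr - Start));
    Start := Cntr + 1;
   end;
  if Start <= Length(TextToIndex) then
   Words.Add(Copy(TextToIndex, Start, Length(TextToIndex) - Start + 1));
  // rebuild the list without duplicates
  Rebuilt := '';
  for Cntr := 0 to Words.Count - 1 do begin
   if Rebuilt <> '' then Rebuilt := Rebuilt + ' ';
   Rebuilt := Rebuilt + Words[Cntr];
  end;
  TextToIndex := Rebuilt;
 finally
  Words.Free;
 end;
end;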

Roy Lambert
Sun, Mar 18 2007 2:10 PM

Dave Harrison
Roy Lambert wrote:
> Ralf
>
>
> I just did a migration of my mail and news client using ElevateDB. I didn't bother to time it because I didn't think it transferred the text indices, but it does, and it built them in the time between postings. I'm very impressed.
>

Roy,
    Have you determined how much faster, percentage-wise, ElevateDB is than
DBISAM at building the index (assuming you're using the same data)?
And were the full text searches just as fast with ElevateDB as with DBISAM?
If you have time, could you publish the results? This time saving
in building indexes could be important for users with large tables.

TIA
Dave
Sun, Mar 18 2007 4:32 PM

"Ralf Mimoun"
Roy,

I had to stop the index build because the worst case happened: I encountered a
bug. With turnaround times of days, bug hunting is no fun.

Now I have invested some hours in counting words; about 70 are now in the stop word
list. And I will offer baby food to the text indexing system: stop words
and dupes filtered out, no special chars, making it as easy as possible for
DBISAM. I _really_ hope that it will have some effect, because I have to
deliver the table and, of course, the whole function, in a few hours.
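
Such a pre-filter can hang off the same OnTextIndexFilter event that Roy's code uses; here is a minimal sketch (the data module, handler name, field check and the tiny stop word excerpt are all made up for illustration, and the whitespace split via DelimitedText is deliberately crude):

procedure TdmImport.ArtikelBibTextFilter(Sender: TObject;
 const TableName, FieldName: string; var TextToIndex: string);
var
 StopWords, Words: TStringList; // Classes and SysUtils in the uses clause
 Cntr: integer;
 Filtered: string;
begin
 if FieldName <> 'Beschreibung' then Exit; // only massage the memo field
 StopWords := TStringList.Create;
 Words := TStringList.Create;
 try
  StopWords.Sorted := True; // sorted -> fast, case insensitive lookups
  StopWords.CommaText := 'der,die,das,und,ein,the,and,for'; // excerpt only
  Words.Delimiter := ' ';
  Words.DelimitedText := TextToIndex; // crude split on whitespace
  Filtered := '';
  for Cntr := 0 to Words.Count - 1 do
   if StopWords.IndexOf(Words[Cntr]) = -1 then // drop stop words
    Filtered := Filtered + Words[Cntr] + ' ';
  TextToIndex := Trim(Filtered);
 finally
  Words.Free;
  StopWords.Free;
 end;
end;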

Ralf