Support Forums - View Thread

View Thread

The following is the text of the current message along with any replies.

Messages 1 to 5 of 5 total

CONTAINS filter condition with partial word match

Thu, Feb 26 2009 1:07 PM	Permanent Link
Peter Moser	Dear Tim, select * from searchtable where searchcolumn contains 'searchword1* searchword2' has erratic results: if "searchword1" is actually a complete word adding an asterisk to "searchword1" to perform a partial search still shows the same results of the Example: "house" finds all words with "house" but not "housekeeper". But "housek" finds all "housekeeper" "housekeeping" etc. This behavour only occurs when searchtable is reaching a certain size - which is why i cannot provide an example database so easily: table with 50000 records: search performs ok above 50000 records: search performs erratic Please find below two (slighty shortend) execution plans 1. with search term "hund freu" which results in 2 rows and 2. with search term "hunde freu" with 6 rows Which other information do you need to identiy the problem? Many thanks in advance! ================================================================================ SQL Query (Executed by ElevateDB 2.02 Build 8) ================================================================================ SELECT ALL "ubit_titelstamm"."titel_id" AS "titel_id", .... FROM "ubit_titelstamm" WHERE "st" CONTAINS 'hund freu' Source Tables ------------- ubit_titelstamm: 58770 rows Result Set ---------- The result set was insensitive and read-only The result set consisted of zero or more rows Filtering --------- The following filter condition was applied to the ubit_titelstamm table: "st" CONTAINS 'hund freu' [Index scan (ubit_titelstamm.Text_st):, 432 keys, 16384 bytes estimated cost] ================================================================================ 2 row(s) returned in 0,016 secs ================================================================================ ================================================================================ SQL Query (Executed by ElevateDB 2.02 Build 8) Note: The SQL shown here is generated by ElevateDB and may not be exactly the same as the SQL that was originally entered. However, none of the differences alter the execution results in any way. ================================================================================ SELECT ALL "ubit_titelstamm"."titel_id" AS "titel_id", FROM "ubit_titelstamm" WHERE "st" CONTAINS 'hunde freu' Source Tables ------------- ubit_titelstamm: 58770 rows Result Set ---------- The result set was insensitive and read-only The result set consisted of zero or more rows Filtering --------- The following filter condition was applied to the ubit_titelstamm table: "st" CONTAINS 'hunde freu*' [Index scan (ubit_titelstamm.Text_st):, 450 keys, 16384 bytes estimated cost] ================================================================================ 6 row(s) returned in 0,015 secs ================================================================================
Fri, Feb 27 2009 8:48 AM	Permanent Link
Tim Young [Elevate Software] Elevate Software, Inc. timyoung@elevatesoft.com	Peter, << This behavour only occurs when searchtable is reaching a certain size - which is why i cannot provide an example database so easily: table with 50000 records: search performs ok above 50000 records: search performs erratic Please find below two (slighty shortend) execution plans 1. with search term "hund* freu" which results in 2 rows and 2. with search term "hunde freu*" with 6 rows Which other information do you need to identiy the problem? >> Unfortunately, I'm going to probably need the table with more than 50,000 rows. I suspect that if I try what you're doing here with less rows, I'm not going to be able to reproduce it. -- Tim Young Elevate Software www.elevatesoft.com
Fri, Feb 27 2009 10:59 AM	Permanent Link
Peter Moser	Tim, just send you a mail to support@elevatesoft.com with the location where and how to retrieve the test data. Thank you very much for your efforts! Peter Moser sys.team software GmbH "Tim Young [Elevate Software]" wrote: >>> Unfortunately, I'm going to probably need the table with more than 50,000 rows. I suspect that if I try what you're doing here with less rows, I'm not going to be able to reproduce it. -- Tim Young Elevate Software www.elevatesoft.com <<<
Sat, Feb 28 2009 4:56 AM	Permanent Link
Tim Young [Elevate Software] Elevate Software, Inc. timyoung@elevatesoft.com	Peter, << just send you a mail to support@elevatesoft.com with the location where and how to retrieve the test data. >> Thanks. The reason that the results change when you add more rows is that you're adding rows with st column values that include the word: Hündchen You can see what the text index looks like with DEU_CI as the collation in this index dump: Hun Row Count: 2 First Row ID: 20275 First Row/Page Pointer: 20275 Huncke Row Count: 1 First Row ID: 52036 First Row/Page Pointer: 52036 Hund Row Count: 75 First Row ID: 549 First Row/Page Pointer: 549 Hündchen Row Count: 1 First Row ID: 56717 First Row/Page Pointer: 56717 Hunde Row Count: 2 First Row ID: 232 First Row/Page Pointer: 232 As you can see, the ü is sorted after the u in terms of the collation order. The problem is that a partial-length search on Hund* causes the search to stop when it reaches the Hündchen value because Hünd <> Hund, thus resulting in less rows selected than you would expect. There are two solutions to this: 1) Use the DEP (German Phone Book order) collation instead: ALTER TEXT INDEX "Text_st" ON "ubit_titelstamm" ("st" COLLATE "DEP_CI") INDEXED WORD LENGTH 20 WORD GENERATOR "Default" and use this SQL instead: SELECT st FROM "ubit_titelstamm" WHERE "st" COLLATE DEP_CI CONTAINS 'hund* freu' You can see the difference with an index dump when the collation is set to DEP_CI: Hund Row Count: 75 First Row ID: 549 First Row/Page Pointer: 549 Hunde Row Count: 39 First Row ID: 232 First Row/Page Pointer: 232 Hundealltag Row Count: 1 First Row ID: 20694 First Row/Page Pointer: 20694 Hundeausbildung Row Count: 1 First Row ID: 50934 First Row/Page Pointer: 50934 Hundebabys Row Count: 2 First Row ID: 24612 First Row/Page Pointer: 24612 Hundebuch Row Count: 2 First Row ID: 5750 First Row/Page Pointer: 5750 As you can see, umlauts are effectively ignored for collation purposes. 2) Use the accent-insensitive flag on the DEU collation, like this: ALTER TEXT INDEX "Text_st" ON "ubit_titelstamm" ("st" COLLATE "DEU_CI_AI") INDEXED WORD LENGTH 20 WORD GENERATOR "Default" and use this SQL instead: SELECT st FROM "ubit_titelstamm" WHERE "st" COLLATE DEU_CI_AI CONTAINS 'hund freu*' -- Tim Young Elevate Software www.elevatesoft.com
Mon, Mar 2 2009 2:37 AM	Permanent Link
Peter Moser	Tim, perfect! Thank you once again for the outstanding support! Peter Moser sys.team software GmbH "Tim Young [Elevate Software]" wrote: Peter, << just send you a mail to support@elevatesoft.com with the location where and how to retrieve the test data. >> Thanks. The reason that the results change when you add more rows is that you're adding rows with st column values that include the word: Hündchen You can see what the text index looks like with DEU_CI as the collation in this index dump: Hun Row Count: 2 First Row ID: 20275 First Row/Page Pointer: 20275 Huncke Row Count: 1 First Row ID: 52036 First Row/Page Pointer: 52036 Hund Row Count: 75 First Row ID: 549 First Row/Page Pointer: 549 Hündchen Row Count: 1 First Row ID: 56717 First Row/Page Pointer: 56717 Hunde Row Count: 2 First Row ID: 232 First Row/Page Pointer: 232 As you can see, the ü is sorted after the u in terms of the collation order. The problem is that a partial-length search on Hund* causes the search to stop when it reaches the Hündchen value because Hünd <> Hund, thus resulting in less rows selected than you would expect. There are two solutions to this: 1) Use the DEP (German Phone Book order) collation instead: ALTER TEXT INDEX "Text_st" ON "ubit_titelstamm" ("st" COLLATE "DEP_CI") INDEXED WORD LENGTH 20 WORD GENERATOR "Default" and use this SQL instead: SELECT st FROM "ubit_titelstamm" WHERE "st" COLLATE DEP_CI CONTAINS 'hund* freu' You can see the difference with an index dump when the collation is set to DEP_CI: Hund Row Count: 75 First Row ID: 549 First Row/Page Pointer: 549 Hunde Row Count: 39 First Row ID: 232 First Row/Page Pointer: 232 Hundealltag Row Count: 1 First Row ID: 20694 First Row/Page Pointer: 20694 Hundeausbildung Row Count: 1 First Row ID: 50934 First Row/Page Pointer: 50934 Hundebabys Row Count: 2 First Row ID: 24612 First Row/Page Pointer: 24612 Hundebuch Row Count: 2 First Row ID: 5750 First Row/Page Pointer: 5750 As you can see, umlauts are effectively ignored for collation purposes. 2) Use the accent-insensitive flag on the DEU collation, like this: ALTER TEXT INDEX "Text_st" ON "ubit_titelstamm" ("st" COLLATE "DEU_CI_AI") INDEXED WORD LENGTH 20 WORD GENERATOR "Default" and use this SQL instead: SELECT st FROM "ubit_titelstamm" WHERE "st" COLLATE DEU_CI_AI CONTAINS 'hund freu*' -- Tim Young Elevate Software www.elevatesoft.com