Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 5 of 5 total
Thread CONTAINS filter condition with partial word match
Thu, Feb 26 2009 1:07 PMPermanent Link

Peter Moser
Dear Tim,

select * from searchtable where searchcolumn contains 'searchword1* searchword2*' has
erratic results:
if "searchword1"  is actually a complete word  adding an asterisk to "searchword1"  to
perform a partial search still shows the same results of the
Example: "house*"  finds all words with "house" but not "housekeeper".
But "housek*" finds all "housekeeper" "housekeeping" etc.

This behavour only occurs when searchtable is reaching a certain size - which is why i
cannot provide an example database so easily:
table with 50000 records: search performs ok
above 50000 records: search performs erratic

Please find below two (slighty shortend) execution plans
1. with search term "hund* freu*" which results in 2 rows and
2. with search term "hunde* freu*" with 6 rows

Which other information do you need to identiy the problem?
Many thanks in advance!

================================================================================
SQL Query (Executed by ElevateDB 2.02 Build 8)
================================================================================

SELECT ALL "ubit_titelstamm"."titel_id" AS "titel_id",
....
FROM "ubit_titelstamm" WHERE "st" CONTAINS 'hund* freu*'

Source Tables
-------------

ubit_titelstamm: 58770 rows

Result Set
----------

The result set was insensitive and read-only
The result set consisted of zero or more rows

Filtering
---------

The following filter condition was applied to the ubit_titelstamm table:

"st" CONTAINS 'hund* freu*' [Index scan (ubit_titelstamm.Text_st):, 432 keys,
16384 bytes estimated cost]

================================================================================
2 row(s) returned in 0,016 secs
================================================================================


================================================================================
SQL Query (Executed by ElevateDB 2.02 Build 8)

Note: The SQL  shown here is generated by ElevateDB and may not be exactly the
same as the SQL that was originally entered.  However, none of the differences
alter the execution results in any way.
================================================================================

SELECT ALL "ubit_titelstamm"."titel_id" AS "titel_id",

FROM "ubit_titelstamm" WHERE "st" CONTAINS 'hunde* freu*'

Source Tables
-------------

ubit_titelstamm: 58770 rows

Result Set
----------

The result set was insensitive and read-only
The result set consisted of zero or more rows

Filtering
---------

The following filter condition was applied to the ubit_titelstamm table:

"st" CONTAINS 'hunde* freu*' [Index scan (ubit_titelstamm.Text_st):, 450 keys,
16384 bytes estimated cost]

================================================================================
6 row(s) returned in 0,015 secs
================================================================================
Fri, Feb 27 2009 8:48 AMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Peter,

<< This behavour only occurs when searchtable is reaching a certain size -
which is why i cannot provide an example database so easily:
table with 50000 records: search performs ok
above 50000 records: search performs erratic

Please find below two (slighty shortend) execution plans
1. with search term "hund* freu*" which results in 2 rows and
2. with search term "hunde* freu*" with 6 rows

Which other information do you need to identiy the problem? >>

Unfortunately, I'm going to probably need the table with more than 50,000
rows.  I suspect that if I try what you're doing here with less rows, I'm
not going to be able to reproduce it.

--
Tim Young
Elevate Software
www.elevatesoft.com

Fri, Feb 27 2009 10:59 AMPermanent Link

Peter Moser
Tim,

just send you a mail to support@elevatesoft.com with the location where and how to
retrieve the test data.
Thank you very much for your efforts!

Peter Moser
sys.team software GmbH

"Tim Young [Elevate Software]" wrote:
>>>
Unfortunately, I'm going to probably need the table with more than 50,000
rows.  I suspect that if I try what you're doing here with less rows, I'm
not going to be able to reproduce it.

--
Tim Young
Elevate Software
www.elevatesoft.com
<<<
Sat, Feb 28 2009 4:56 AMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Peter,

<< just send you a mail to support@elevatesoft.com with the location where
and how to retrieve the test data. >>

Thanks.

The reason that the results change when you add more rows is that you're
adding rows with st column values that include the word:

Hündchen

You can see what the text index looks like with DEU_CI as the collation in
this index dump:

Hun            Row Count: 2    First Row ID: 20275    First Row/Page
Pointer: 20275

Huncke        Row Count: 1    First Row ID: 52036    First Row/Page Pointer:
52036

Hund            Row Count: 75    First Row ID: 549    First Row/Page
Pointer: 549

Hündchen    Row Count: 1    First Row ID: 56717    First Row/Page Pointer:
56717

Hunde        Row Count: 2    First Row ID: 232    First Row/Page Pointer:
232

As you can see, the ü is sorted after the u in terms of the collation order.

The problem is that a partial-length search on Hund* causes the search to
stop when it reaches the Hündchen value because Hünd <> Hund, thus resulting
in less rows selected than you would expect.

There are two solutions to this:

1) Use the DEP (German Phone Book order) collation instead:

ALTER TEXT INDEX "Text_st"
ON "ubit_titelstamm"
("st" COLLATE "DEP_CI")
INDEXED WORD LENGTH 20
WORD GENERATOR "Default"

and use this SQL instead:

SELECT st
FROM "ubit_titelstamm" WHERE "st" COLLATE DEP_CI CONTAINS 'hund* freu*'

You can see the difference with an index dump when the collation is set to
DEP_CI:

Hund            Row Count: 75    First Row ID: 549    First Row/Page
Pointer: 549

Hunde            Row Count: 39    First Row ID: 232    First Row/Page
Pointer: 232

Hundealltag    Row Count: 1    First Row ID: 20694    First Row/Page
Pointer: 20694

Hundeausbildung    Row Count: 1    First Row ID: 50934    First Row/Page
Pointer: 50934

Hundebabys    Row Count: 2    First Row ID: 24612    First Row/Page Pointer:
24612

Hundebuch    Row Count: 2    First Row ID: 5750    First Row/Page Pointer:
5750

As you can see, umlauts are effectively ignored for collation purposes.

2) Use the accent-insensitive flag on the DEU collation, like this:

ALTER TEXT INDEX "Text_st"
ON "ubit_titelstamm"
("st" COLLATE "DEU_CI_AI")
INDEXED WORD LENGTH 20
WORD GENERATOR "Default"

and use this SQL instead:

SELECT st
FROM "ubit_titelstamm" WHERE "st" COLLATE DEU_CI_AI CONTAINS 'hund* freu*'

--
Tim Young
Elevate Software
www.elevatesoft.com

Mon, Mar 2 2009 2:37 AMPermanent Link

Peter Moser
Tim,
perfect!
Thank you once again for the outstanding support!

Peter Moser
sys.team software GmbH


"Tim Young [Elevate Software]" wrote:

Peter,

<< just send you a mail to support@elevatesoft.com with the location where
and how to retrieve the test data. >>

Thanks.

The reason that the results change when you add more rows is that you're
adding rows with st column values that include the word:

Hündchen

You can see what the text index looks like with DEU_CI as the collation in
this index dump:

Hun            Row Count: 2    First Row ID: 20275    First Row/Page
Pointer: 20275

Huncke        Row Count: 1    First Row ID: 52036    First Row/Page Pointer:
52036

Hund            Row Count: 75    First Row ID: 549    First Row/Page
Pointer: 549

Hündchen    Row Count: 1    First Row ID: 56717    First Row/Page Pointer:
56717

Hunde        Row Count: 2    First Row ID: 232    First Row/Page Pointer:
232

As you can see, the ü is sorted after the u in terms of the collation order.

The problem is that a partial-length search on Hund* causes the search to
stop when it reaches the Hündchen value because Hünd <> Hund, thus resulting
in less rows selected than you would expect.

There are two solutions to this:

1) Use the DEP (German Phone Book order) collation instead:

ALTER TEXT INDEX "Text_st"
ON "ubit_titelstamm"
("st" COLLATE "DEP_CI")
INDEXED WORD LENGTH 20
WORD GENERATOR "Default"

and use this SQL instead:

SELECT st
FROM "ubit_titelstamm" WHERE "st" COLLATE DEP_CI CONTAINS 'hund* freu*'

You can see the difference with an index dump when the collation is set to
DEP_CI:

Hund            Row Count: 75    First Row ID: 549    First Row/Page
Pointer: 549

Hunde            Row Count: 39    First Row ID: 232    First Row/Page
Pointer: 232

Hundealltag    Row Count: 1    First Row ID: 20694    First Row/Page
Pointer: 20694

Hundeausbildung    Row Count: 1    First Row ID: 50934    First Row/Page
Pointer: 50934

Hundebabys    Row Count: 2    First Row ID: 24612    First Row/Page Pointer:
24612

Hundebuch    Row Count: 2    First Row ID: 5750    First Row/Page Pointer:
5750

As you can see, umlauts are effectively ignored for collation purposes.

2) Use the accent-insensitive flag on the DEU collation, like this:

ALTER TEXT INDEX "Text_st"
ON "ubit_titelstamm"
("st" COLLATE "DEU_CI_AI")
INDEXED WORD LENGTH 20
WORD GENERATOR "Default"

and use this SQL instead:

SELECT st
FROM "ubit_titelstamm" WHERE "st" COLLATE DEU_CI_AI CONTAINS 'hund* freu*'

--
Tim Young
Elevate Software
www.elevatesoft.com
Image