Thread: TextIndex CLOB Problem
Wed, Oct 28 2009 9:56 AM

Igor Colovic
I have downloaded the trial version of EDB 2.03 (NoUnicode) and have some problems with TextIndex.

The situation is like this:
Table Documents:
ID  Integer
Document CLOB
DocType CHAR(4)

I want to create a TextIndex on this table. Document can be one of these types (HTML, RTF,
RVF - our internal format).

I have created a TextFilterModule for RVF conversion to plain text. The problem is that
TextToFilter does not contain all the data from the Document field. It is truncated at the
first #0 character. This was not a problem with DBISAM (which I would like to replace with EDB).

You can replicate this by setting the value of the CLOB field to something with #0 in the middle
of the text.
Document.Value :=  'BROWN'#0'FOX';
The content of the field will be displayed correctly, but the TextIndex will only have BROWN in the index.
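
The truncation can be reproduced outside ElevateDB. The following is a minimal sketch in Python, assuming the text passes through a C-style null-terminated string at some point (the variable names are illustrative and not part of EDB):

```python
import ctypes

# Text with an embedded #0, like the CLOB value above.
document = b"BROWN\x00FOX"

# Length-aware view: every byte is present, including the #0.
full_length = len(document)

# Null-terminated view: reading stops at the first #0.
as_c_string = ctypes.c_char_p(document).value

print(full_length)   # 9
print(as_c_string)   # b'BROWN'
```

Only the part before the first #0 survives the null-terminated view, which matches the behaviour seen in the index.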

Best regards
Igor Colovic
Wed, Oct 28 2009 10:49 AM

Roy Lambert

NLH Associates

Team Elevate

Igor


Whilst I don't know for certain, my bet is that with DBISAM, as parameters to the procedure call, you had

(Sender: TObject; const TableName, FieldName: string; var TextToIndex: string);

in ElevateDB you have

(const FilterType: string; const TextToFilter: string; var FilteredText: string)

#0 is the string delimiter so it's never even getting in there. I think you're going to have to replace #0 with something else.

Roy Lambert [Team Elevate]
Wed, Oct 28 2009 11:19 AM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Igor,

<< I have created a TextFilterModule for RVF conversion to plain text. The
problem is that TextToFilter does not contain all the data from the Document field.
It is truncated at the first #0 character. >>

You cannot embed #0 characters in CLOB columns and have the text indexing
work correctly with an external word generator or text filter DLL.  The text
filter DLL uses pAnsiChar and pWideChar types to transfer data back and
forth, which means that any #0 characters will serve to truncate the text.

You should use something like tab (#9) characters or some other control
character below #32 to delimit your text.

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Oct 28 2009 11:23 AM

Igor Colovic
I cannot replace #0 with anything. It comes from another component and we use it as our
internal format for documents.

If your statement is correct, why can I display data from the CLOB correctly in the application?

Can you create another version of TEDBTextFilterModule which will use
a buffer (TEDBBufferFilterModule)?
I think that would clear up this problem.

Roy Lambert wrote:

Igor

....
(const FilterType: string; const TextToFilter: string; var FilteredText: string)

#0 is the string delimiter so it's never even getting in there. I think you're going to
have to replace #0 with something else.

Roy Lambert [Team Elevate]
Wed, Oct 28 2009 11:53 AM

Roy Lambert

NLH Associates

Team Elevate

Igor


>I cannot replace #0 with anything. It comes from another component and we use it as our
>internal format for documents.

I have no idea if it will work but what about setting up a calculated field, use REPLACE(field,#0,#8) and index that?

>If your statement is correct, why can I display data from the CLOB correctly in the application?

Loading a CLOB into a DBMemo or some such is a different operation to passing strings to be edited.

>Can you create another version of TEDBTextFilterModule which will use
>a buffer (TEDBBufferFilterModule)?
>I think that would clear up this problem.

I can't (I'm just another user), Tim might be able to but that's his decision.

Roy Lambert [Team Elevate]
Wed, Oct 28 2009 12:35 PM

Roy Lambert

NLH Associates

Team Elevate

Igor


Following on from my suggestion I just tried

ALTER TABLE "EMails"
ADD COLUMN "DUMP" CLOB COLLATE "ANSI" COMPUTED ALWAYS AS REPLACE(#0,#8,_Message)


CREATE TEXT INDEX "dumping"
ON "EMails"
("DUMP" COLLATE "ANSI_CI")
INDEXED WORD LENGTH 30
WORD GENERATOR "Default"


and it works fine. I think I'm right in that COMPUTED columns aren't stored (GENERATED ones are), so there's no impact on disk space and, I'd guess, not too much on performance.
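
To illustrate why the replacement helps, here is a rough sketch in Python of what the computed column does to the text before it reaches a null-terminated interface. The sample value and variable names are made up for illustration; this only mimics the effect of the REPLACE expression and is not EDB code:

```python
import ctypes

# Hypothetical column value with an embedded #0.
message = b"BROWN\x00FOX"

# Same effect as the computed column: every #0 becomes #8.
dump = message.replace(b"\x00", b"\x08")

# With no #0 left, a null-terminated read sees the whole value.
seen_by_filter = ctypes.c_char_p(dump).value

print(seen_by_filter == dump)       # True
print(len(dump) == len(message))    # True
```

The length is unchanged and nothing is cut off, so words on both sides of the former #0 can reach the index.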


Roy Lambert
Wed, Oct 28 2009 1:53 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Igor,

<< If your statement is correct, why can I display data from the CLOB correctly
in the application? >>

Because we use normal strings (AnsiString/WideString/UnicodeString) in EDB,
and what you're talking about is an external DLL.  The only way to get
strings to and from a DLL with reasonable performance is to use
null-terminated pAnsiChar/pWideChar/pUnicodeChar strings.
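
The difference between the two ways of passing text can be sketched outside EDB. The following Python example uses ctypes and assumes a DLL-style boundary where the callee receives a raw pointer. The names `pointer_only` and `pointer_and_length` are illustrative, and this is not EDB's actual filter interface:

```python
import ctypes

data = b"BROWN\x00FOX"

# Copy the bytes into a raw buffer, as a caller would before crossing a DLL boundary.
buf = ctypes.create_string_buffer(data)
addr = ctypes.addressof(buf)

# Pointer only: the reader scans for a terminator and stops at the first #0.
pointer_only = ctypes.string_at(addr)

# Pointer plus explicit length: the reader gets every byte, embedded #0 included.
pointer_and_length = ctypes.string_at(addr, len(data))

print(pointer_only)         # b'BROWN'
print(pointer_and_length)   # b'BROWN\x00FOX'
```

A buffer-based filter module of the kind requested above would presumably follow the second pattern, passing a length along with the pointer.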

<< Can you create another version of TEDBTextFilterModule which will use
a buffer (TEDBBufferFilterModule)? >>

Maybe, but I'd have to look into it.  Another option for you might be for us
to simply surface an event handler in the TEDBSession component instead,
which will remove the necessity of using the external DLL.  However, both of
these are things that will have to wait until 2.04 because they will involve
configuration changes.

--
Tim Young
Elevate Software
www.elevatesoft.com

Thu, Oct 29 2009 4:40 AM

Igor Colovic
"Tim Young [Elevate Software]" wrote:

...

<<Maybe, but I'd have to look into it.  Another option for you might be for us
to simply surface an event handler in the TEDBSession component instead,
which will remove the necessity of using the external DLL.  However, both of
these are things that will have to wait until 2.04 because they will involve
configuration changes.>>

That would be very nice. Either solution would fit my needs.

P.S. Can you tell me the time frame for 2.04?

Best regards
Igor Colovic
Thu, Oct 29 2009 5:05 AM

Roy Lambert

NLH Associates

Team Elevate

Tim

>Maybe, but I'd have to look into it. Another option for you might be for us
>to simply surface an event handler in the TEDBSession component instead,
>which will remove the necessity of using the external DLL. However, both of
>these are things that will have to wait until 2.04 because they will involve
>configuration changes.

That sounds interesting but wouldn't we then be back to the DBSys days of having to recompile DBSys to get the code in?

What are your views on my idea of a COMPUTED column?

Roy Lambert
Thu, Oct 29 2009 6:14 AM

Igor Colovic
Roy Lambert wrote:

What are your views on my idea of a COMPUTED column?

Roy, a COMPUTED column is not an option.
These #0 chars are part of the internal document format (header).
Text in the document is saved as a MultiByte string.
The document can contain pictures.

Changing all #0 to #8 would allow the data to be passed to the TextFilterModule, but then
I would have to change all #8 back to #0 before I could get plain text from the
document. That is fine if documents are small. But my documents are 10-40MB and there
are ~30000 of them (and the number is growing). This would be a bottleneck in text indexing.
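
To put rough numbers on this, here is a small Python sketch of the round trip and of the data volume involved. The helper names and sample bytes are hypothetical, and the sizes come from the figures above. It also shows a further risk, stated here as an assumption about the format: if a document already contains #8 bytes (for example inside picture data), changing #8 back to #0 does not restore the original.

```python
def to_indexable(doc: bytes) -> bytes:
    """Replace every #0 with #8 so the text survives a null-terminated interface."""
    return doc.replace(b"\x00", b"\x08")

def from_indexable(doc: bytes) -> bytes:
    """Undo the replacement before converting the document to plain text."""
    return doc.replace(b"\x08", b"\x00")

# Made-up sample documents, not real RVF data.
header_only = b"RVF\x00text"
with_binary = b"RVF\x00\x08picture"   # already contains a #8 byte

round_trip_ok = from_indexable(to_indexable(header_only)) == header_only
round_trip_lossy = from_indexable(to_indexable(with_binary)) != with_binary

# Data volume for one full indexing pass, in GB (decimal), per replacement step.
doc_count = 30_000
min_total_gb = doc_count * 10 / 1000
max_total_gb = doc_count * 40 / 1000

print(round_trip_ok, round_trip_lossy)   # True True
print(min_total_gb, max_total_gb)        # 300.0 1200.0
```

Each of the two replacement steps would have to scan somewhere between 300 GB and 1.2 TB for the current set of documents.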

Best regards
Igor Colovic