Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 10 of 10 total
Thread UPPER() does not convert cyrillic text
Wed, Dec 20 2017 11:06 AMPermanent Link

Malcolm Taylor

I just noticed that when I use the SQL UPPER() function it appears to
bail out with Cyrillic text.
Googling sort of explains it, but in pure Delphi code I do get the
conversion.
I am not much bothered about it, now that I see it is not my bug, but
is this ever likely to be addressed in EDB?
Fri, Dec 22 2017 1:50 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Malcolm,

<< I just noticed that when I use the SQL UPPER() function it appears to bail out with Cyrillic text. >>

In what context is the UPPER() function being used ?  In other words, what is being upper-cased ?

UPPER() and LOWER() use the collation of the target expression for determining how the transformation should take place:

https://www.elevatesoft.com/articles?action=view&category=edb&article=collations_comparisons

So, you could force a specific collation by using something like this:

UPPER(MyColumn)

becomes

UPPER(MyColumn COLLATE RUS)

Or, even easier, if you're simply trying to do case-insensitive comparisons, you could just simply do this:

MyColumn COLLATE RUS_CI = 'Text'

and EDB will try to find an index on the MyColumn column with a collation of RUS_CI in order to optimize the condition.

Tim Young
Elevate Software
www.elevatesoft.com
Fri, Dec 22 2017 6:08 PMPermanent Link

Malcolm Taylor

Tim
I am not using it for comparisons or collations, but just for
presentation.  I am using UNICODE and the columns in question are
peoples' names.
In Europe at least, there are differing conventions for displaying
names so I store names with initial capitals (as best I can) then allow
users to choose one of the 6 common variations for display purposes.
So if they choose the "FAMILYNAME, Firstname" format I simply
..  SELECT UPPER(FamilyName) + ', ' + FirstName, ....
Mostly it works fine, but I just noticed that if there are any Cyrillic
names in the list they are not converted, so for this format are
presented like "Толстяков, Антон" instead of "ТОЛСТЯКОВ, Антон".
I suppose there will  be other languages where either there is no
comparable convention (no problem), or there may be a similar issue.
If I am missing a trick I would like to know, but if that is just the
way it works I am not too bothered.  No user has 'complained'.  Smile
Malcolm
Wed, Aug 1 2018 5:41 AMPermanent Link

Malcolm Taylor

Tim Young [Elevate Software] wrote:

...snip..
> So, you could force a specific collation by using something like this:
>
> UPPER(MyColumn)
>
> becomes
>
> UPPER(MyColumn COLLATE RUS)
>
...snip..

sorry, been doing other things  Frown

My data columns in question are defined as collation UNI_CI.

OK, so I tried including COLLATE RUS .. and it works, as you suggested.

So I wondered what if the query returns text in a mix of character sets?
Using COLLATE CHS also works for cyrillic.  Hmm.
Using COLLATE ENU ditto !
Using COLLATE UNI  does not return uppercase cyrillic but does work for
many other character sets.

So at least for now, is it enough for me to include COLLATE ENU (or
almost any other collation except UNI) in order to get the returned
text in upper case?

Is there some reason why COLLATE UNI does not work .. or is that just
because the collation is already known from the Column definition and
the problem is that the UPPER() function does not yet support UTF-16
completely?

This is not quite what I expected to discover so I must be missing
something (brains?).
I thought that collations were important for ordering/matching and
would not affect letter case.

Malcolm
Wed, Aug 1 2018 7:08 AMPermanent Link

Wolfgang Klein

I had a similar discusssion with Tim almost 3 years ago. Unfortunately I was not able to convince him that the current implementation of UNI_CI is wrong.

In the end I had to modify ElevateDB myself. It's just two lines in edbstring.pas:

UnicodeLowerCharMap[I]:=UnicodeChar(I);
->
UnicodeLowerCharMap[I]:=UnicodeChar(TEDBIntPtr(CharLower(pUnicodeChar(I))));

and

UnicodeUpperCharMap[I]:=UnicodeChar(I);
->
UnicodeUpperCharMap[I]:=UnicodeChar(TEDBIntPtr(CharUpper(pUnicodeChar(I))));


Wolfgang
Wed, Aug 1 2018 12:19 PMPermanent Link

Malcolm Taylor

Wolfgang Klein wrote:

> I had a similar discusssion with Tim almost 3 years ago.
> Unfortunately I was not able to convince him that the current
> implementation of UNI_CI is wrong.
>
> In the end I had to modify ElevateDB myself. It's just two lines in
> edbstring.pas:
>
> UnicodeLowerCharMap[I]:=UnicodeChar(I);
> ->
> UnicodeLowerCharMap[I]:=UnicodeChar(TEDBIntPtr(CharLower(pUnicodeChar(
> I))));
>
> and
>
> UnicodeUpperCharMap[I]:=UnicodeChar(I);
> ->
> UnicodeUpperCharMap[I]:=UnicodeChar(TEDBIntPtr(CharUpper(pUnicodeChar(
> I))));
>
>
> Wolfgang

Hi Wolfgang

Thanks, that is interesting.  

I don't have the source so I can't try that, but what puzzles me is why
the SQL UPPER/LOWER functions return UPPER or LOWER for most of
languages I encounter but not other ones.  I know that some languages
do not have directly equivalent upper and lower case characters or even
no concept of letter case, but for those that do the functions should
'work'.

I am with you in expecting the upper/lower equivalents in the Unicode
Specification to be returned and I wonder if Tim may be using one of
the earlier Delphi functions or maps, now deprecated, to serve up the
values.  But that's not like Tim.  Smile

Malcolm
Mon, Aug 6 2018 1:56 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Wolfgang,

<< I had a similar discusssion with Tim almost 3 years ago. Unfortunately I was not able to convince him that the
current implementation of UNI_CI is wrong. >>

It *doesn't matter* if it's wrong for certain linguistic conversions.  What matters is that changing it will break *every single index* that uses it, and that's no good.

The existing UNI collation is a simple basic character mapping that covers primarily English and Western European mappings.  It's the equivalent to similar simple mappings in SQL Server, etc. for straight binary code point comparisons, with a slight twist.

If you need a *linguistically-correct* upper-casing result, then you need to specify a collation that will provide it.

<< In the end I had to modify ElevateDB myself. It's just two lines in edbstring.pas: >>

For anyone reading, I would *not* recommend doing that.  If you moved the DB to another OS,  won't work anymore, and it is dependent upon the current system locale for ANSI versions of the function (non-Unicode Delphi versions).  Remember, EDB still has to work across multiple OS's, as well as with ANSI and Unicode strings, without changing behaviors.

Tim Young
Elevate Software
www.elevatesoft.com
Mon, Aug 6 2018 2:00 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Malcolm,

<< So I wondered what if the query returns text in a mix of character sets? >>

In that case, your results will be mixed, depending upon which collation you're using.  Some will cope, and others will not.

<< So at least for now, is it enough for me to include COLLATE ENU (or almost any other collation except UNI)
in order to get the returned text in upper case? >>

Yes.

<< Is there some reason why COLLATE UNI does not work .. >>

Yes, it's not designed to work with anything other than English and other western European languages.  It's a simple mapping that is similar to a basic binary code point comparison.

Tim Young
Elevate Software
www.elevatesoft.com
Mon, Aug 6 2018 6:50 PMPermanent Link

Malcolm Taylor

Tim Young [Elevate Software] wrote:

>
> << Is there some reason why COLLATE UNI does not work .. >>
>
> Yes, it's not designed to work with anything other than English and
> other western European languages.  It's a simple mapping that is
> similar to a basic binary code point comparison.
>

Thanks for the clarification, Tim.
Ah, yes, the good old ASCII 'case' bit, updated to ANSI-ish.
Well, it is what it is. Frown
COLLATE ENG seems to work for me - certainly better than before.  Smile

Malcolm

 
Wed, Aug 8 2018 1:51 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Malcolm,

<< Thanks for the clarification, Tim. Ah, yes, the good old ASCII 'case' bit, updated to ANSI-ish. >>

Yeah, it's very basic, but it's portable and works the same for both ANSI and Unicode code points.

Tim Young
Elevate Software
www.elevatesoft.com
Image