
The following is the text of the current message along with any replies.
Messages 1 to 10 of 20 total
Thread: Very slow grouping
Mon, Mar 19 2007 1:21 PM

"Ole Willy Tuv"
My big_table has 5 million rows. col10 is a TIMESTAMP column containing 5
groups of datetime values, 1 million rows each. There is an index on the
col10 column.

The following query:

select col10, count(*)
from big_table
group by col10

takes 150 seconds to process the result with 5 groups/rows.

I guess EDB doesn't use the index when grouping the table.

Ole Willy Tuv
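As a point of comparison, an engine that does implement this optimization can satisfy the whole GROUP BY with a covering index scan instead of reading the table. A scaled-down sketch of Ole's scenario using SQLite as a stand-in engine (table and column names mirror the post; row counts are reduced so it runs in a moment, and the timestamps are made up):

```python
# Scaled-down model of Ole's table: 5 distinct col10 values, equal-sized
# groups, with an index on col10. SQLite stands in for the engine here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (col1 INTEGER, col10 TEXT)")
conn.execute("CREATE INDEX idx_col10 ON big_table (col10)")

# 5 groups of 1,000 rows each (standing in for 1 million each)
rows = [(i, f"2007-03-{g + 1:02d} 00:00:00") for g in range(5) for i in range(1000)]
conn.executemany("INSERT INTO big_table VALUES (?, ?)", rows)

groups = conn.execute(
    "SELECT col10, COUNT(*) FROM big_table GROUP BY col10"
).fetchall()
print(groups)

# Inspect the plan: on recent SQLite versions this typically reports a
# covering-index scan, i.e. the base table is never touched and no
# separate sort step is needed for the GROUP BY.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT col10, COUNT(*) FROM big_table GROUP BY col10"
).fetchall()
print(plan)
```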

Mon, Mar 19 2007 6:39 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Ole,

<< I guess EDB doesn't use the index when grouping the table. >>

Not at this time, no.  Optimized ordering and grouping by index is scheduled
for round 2.

--
Tim Young
Elevate Software
www.elevatesoft.com

Mon, Mar 19 2007 7:14 PM

Michael Baytalsky
Ole,

I suppose that without optimization this query should be processed at a
speed of around 20-40 MB/sec, depending on the computer. What size of
big_table are we talking about?

Michael

Ole Willy Tuv wrote:
> My big_table has 5 million rows. col10 is a TIMESTAMP column containing 5
> groups of datetime values, 1 million rows each. There is an index on the
> col10 column.
>
> The following query:
>
> select col10, count(*)
> from big_table
> group by col10
>
> takes 150 seconds to process the result with 5 groups/rows.
>
> I guess EDB doesn't use the index when grouping the table.
>
> Ole Willy Tuv
>
>
Mon, Mar 19 2007 7:27 PM

"Ole Willy Tuv"
Michael,

<< I suppose that without optimization this query should be processed at a
speed of around 20-40 MB/sec, depending on the computer. What size of
big_table are we talking about? >>

The table has 10 columns and 5 million rows, total size 1450 MB. There
are 5 different col10 values, such that each group contains 1 million rows.

With index support, the grouping would take only a few seconds.

Ole Willy Tuv

Tue, Mar 20 2007 6:07 AM

Michael Baytalsky
Ole,

> The table has 10 columns and 5 million rows, total size 1450 MB. There
> are 5 different col10 values, such that each group contains 1 million rows.
Well, then the speed should be around 50 secs with no optimization.
150 secs sounds pretty slow even without any optimization or an index.

> With index support, the grouping would take only a few seconds.
This really depends on the type of index and whether the query can actually
be optimized in this case. I know that e.g. Firebird won't take advantage
of an index in this particular case. To take advantage of it, the engine
needs to know that nothing is requested beyond count(*). If, e.g., you
were to filter anything out in the where clause, or use sum instead of
count (or even count(another field)), the index would become completely
useless, and even worse than useless, because scanning a table using an
index is slower than doing so in natural order, and the selectivity is
zero in this case.
An index would be needed if you wrote:
select x, count(*)
from t
where x between a and b
group by x

In this case, if we are lucky, an index on x could be selective.
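The range-restricted case above can be sketched with SQLite as a stand-in engine (the names t and x follow the example; the data is made up). With a WHERE x BETWEEN a AND b clause, an index on x lets the engine seek straight to the matching key range instead of scanning every row:

```python
# Model of the selective range query: index on x, predicate on x only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("CREATE INDEX idx_x ON t (x)")
# 10,000 rows, values 0..9 repeating: 1,000 rows per value
conn.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(10_000)])

result = conn.execute(
    "SELECT x, COUNT(*) FROM t WHERE x BETWEEN 3 AND 5 GROUP BY x"
).fetchall()
print(result)  # three groups, 1000 rows each

# The plan typically shows a SEARCH using the index with the
# (x>? AND x<?) range, i.e. the index is actually selective here.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT x, COUNT(*) FROM t WHERE x BETWEEN 3 AND 5 GROUP BY x"
).fetchall()
print(plan)
```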

<wild_guess>
Try dropping the index and see if that helps EDB improve speed.
It could be that it still tries to scan using the index, which is how we
get from 50 secs to 150 secs.
</wild_guess>


Michael
Tue, Mar 20 2007 11:07 AM

"Ole Willy Tuv"
Michael,

<<< With index support, the grouping would take only a few seconds. >>>

<< This really depends on the type of index and whether the query can
actually be optimized in this case. I know that e.g. Firebird won't take
advantage of an index in this particular case. To take advantage of it, the
engine needs to know that nothing is requested beyond count(*). If, e.g.,
you were to filter anything out in the where clause, or use sum instead of
count (or even count(another field)), the index would become completely
useless, and even worse than useless, because scanning a table using an
index is slower than doing so in natural order, and the selectivity is zero
in this case. >>

I guess it depends on the implementation. SQL Server certainly uses index
optimization in this case.

SQL Server executes the following query:

select
 col10,
 count(*),
 count(col1),
 min(col1),
 max(col1)
from big_table
group by col10

in 100 seconds without an index on the col10 column and 3 seconds with an
index.

<< Try dropping the index and see if that helps EDB improve speed. It could
be that it still tries to scan using the index, which is how we get from 50
secs to 150 secs. >>

Nope, same timing with/without the index.

Hopefully, ElevateDB will execute the query a lot faster when Tim has
implemented index optimization of grouping and ordering.

Ole Willy Tuv

Tue, Mar 20 2007 2:18 PM

Michael Baytalsky
Ole,

> SQL Server executes the following query:
>
> select
>   col10,
>   count(*),
>   count(col1),
>   min(col1),
>   max(col1)
> from big_table
> group by col10
>
> in 100 seconds without an index on the col10 column and 3 seconds with an
> index.
3 secs for the above query is quite impressive... I don't see how an
index can help in optimizing min(col1)... unless it stores extensive
statistics and/or there are only a few distinct values stored in col1.
On the other hand, 100 secs with no index seems a bit slow.

How about:
select
  col10,
  count(*),
  count(col1),
  min(col1 + 2 * col1),
  max(col1)
from big_table
group by col10


Do you by any chance have the script that creates big_table?

Michael
Tue, Mar 20 2007 2:58 PM

"Ole Willy Tuv"
Michael,

<< 3 secs for the above query is quite impressive...  >>

Very impressive, indeed. I just executed the query after a reboot to
eliminate any caching and it only took 4 seconds, then 3 seconds on
subsequent runs.

<< On the other hand 100 secs for no index seem a bit slow. >>

Well, that might just be my system (a Centrino laptop with 1 GB of RAM and
a pretty full HD). Anyhow, it's not slow compared to other engines on the
same system. Interbase 7.5 ran the simple "select col10, count(*)" query in
197 seconds without the index and 90 seconds with the index.

<< How about:
select
 col10,
 count(*),
 count(col1),
 min(col1 + 2 * col1),
 max(col1)
from big_table
group by col10 >>

3.3 seconds :-)

<< Do you by any chance have the script that creates big_table? >>

I'll make a script and get back to you later.

Ole Willy Tuv

Tue, Mar 20 2007 4:29 PM

Michael Baytalsky
Ole,

> << 3 secs for the above query is quite impressive...  >>
> Very impressive, indeed. I just executed the query after a reboot to
> eliminate any caching and it only took 4 seconds, then 3 seconds on
> subsequent runs.
I don't really understand how they do it ;-) that's a miracle ;-)
I tried something similar on FB and it takes ~100-200 secs on such a
huge database. Actually, I was wrong when I estimated the throughput
at 30 MB/sec. On my system (also a 1 GB laptop) it degrades dramatically
once the data file reaches ~1 GB. FB processes a 1.1 GB file in ~60 secs
reliably, however a 1.7 GB file takes anywhere from 170 to 500 secs
and the system may become disk bound.

To research this further I wrote a simple Delphi program to write
and read a huge file with plain text strings. The results are
pretty strange:
1. 380 MB file: writing 20 secs, reading 1.7 secs = !!! 223 MB/sec
2. 1.4 GB file: writing 100 secs, reading 95 secs = about 15 MB/sec

It seems like reading slows down way too much for huge files.
I will try it tomorrow on a 2 GB system to see how it performs.

If my findings are correct, then multi-file database engines rule.
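For anyone who wants to repeat this experiment without Delphi, here is a rough Python equivalent of the write/read throughput test. The file size is kept small so it finishes quickly; raise size_mb toward 1400 to probe the large-file behavior described above (numbers will vary with machine, OS, and disk):

```python
# Sequential write/read throughput probe, modeled on the Delphi test above.
import os
import time

def measure_throughput(path: str, size_mb: int, chunk_mb: int = 4):
    chunk = b"x" * (chunk_mb * 1024 * 1024)
    # Sequential write
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
    write_secs = time.perf_counter() - t0
    # Sequential read. Note: for files that fit in RAM the OS page cache
    # can make the re-read look unrealistically fast, much like the
    # 223 MB/sec figure for the 380 MB file above.
    t0 = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while data := f.read(chunk_mb * 1024 * 1024):
            total += len(data)
    read_secs = time.perf_counter() - t0
    os.remove(path)
    return total, write_secs, read_secs

total, w, r = measure_throughput("bigfile.tmp", size_mb=16)
print(f"wrote/read {total / 2**20:.0f} MB: write {w:.2f}s, read {r:.2f}s")
```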

> << On the other hand 100 secs for no index seem a bit slow. >>
> Well, that might just be my system (Centrino laptop with 1 giga ram and a
> pretty full hd). Anyhow, it's not slow compared to other engines on the same
> system. Interbase 7.5 ran the simple "select col10, count(*)" query in 197
> seconds without the index and 90 seconds with the index.
Yes, this seems about right. I was extrapolating performance based
on the results from a 400 MB database - it seems I was wrong.

> << How about:
> select
>   col10,
>   count(*),
>   count(col1),
>   min(col1 + 2 * col1),
>   max(col1)
> from big_table
> group by col10 >>
>
> 3.3 seconds :-)
Unbelievable. How can it process col1 at such speed?
Are you sure you have different values in col1?

> << Do you by any chance have the script that creates big_table? >>
> I'll make a script and get back to you later.
Thanks.


Michael
Tue, Mar 20 2007 4:56 PM

Tim Young [Elevate Software]

Elevate Software, Inc.

Email timyoung@elevatesoft.com

Michael,

<< Unbelievable. How can it process col1 at such speed? >>

You would be amazed at how fast index leaf pages can be scanned when they
are a) highly-compressed due to a lot of repeating values and b) use a
pretty decent page size (8k or more).   A database engine can literally read
millions of index entries in seconds.
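The idea can be modeled in a few lines: if the index keeps its keys in sorted order with the repeats compressed away (modeled here as (value, run_length) pairs), then GROUP BY col10 with COUNT(*) collapses to summing run lengths per key, touching a handful of entries instead of 5 million rows. This is only a model of the idea, not EDB's actual page format:

```python
# Toy model of an index whose leaf entries are run-length compressed:
# 5 distinct timestamps, 1 million rows each, one run per distinct value.
from itertools import groupby

runs = [("2007-03-%02d" % d, 1_000_000) for d in (1, 2, 3, 4, 5)]

def grouped_counts(runs):
    # Index order == sorted order, so all entries for one key are adjacent;
    # each group's COUNT(*) is just the sum of its run lengths.
    return [
        (key, sum(length for _, length in group))
        for key, group in groupby(runs, key=lambda r: r[0])
    ]

print(grouped_counts(runs))
```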

You guys have me curious now - so perhaps I'll see what EDB could do with an
index optimization on the GROUP BY.

--
Tim Young
Elevate Software
www.elevatesoft.com
