Icon View Thread

The following is the text of the current message along with any replies.
Messages 1 to 10 of 10 total
Thread Strange But True
Tue, Oct 30 2007 11:10 PMPermanent Link

Dave Harrison
DBISAM and ElevateDb are the only databases I've seen where it make no
speed difference if the index is built before or after the data is
loaded. They both complete at nearly the same times. Every other
database I tested will load data faster if the table doesn't have any
indexes and the index is built after the data has been loaded.

Dave
Wed, Oct 31 2007 1:12 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Dave,

<< DBISAM and ElevateDb are the only databases I've seen where it make no
speed difference if the index is built before or after the data is loaded.
They both complete at nearly the same times. Every other database I tested
will load data faster if the table doesn't have any indexes and the index is
built after the data has been loaded. >>

It's due to the way that the indexes are built in both.  There are two ways
to build an index:

1) By looping through the rows and building the indexes row by row as if the
rows were "naturally" inserted.  This is what DBISAM and EDB do, and is why
you don't see a difference.

2) Building the index in bulk by looping through the rows, extracting the
keys, sorting them using a merge-sort, and then writing them out
page-by-page and building the index from the leaf pages of the tree upwards.
This is a bit faster and results in a very compact index initially, but
later on the index will develop pockets of free space as new keys are
inserted/deleted from the index and will eventually end up just like 1).

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Oct 31 2007 4:37 PMPermanent Link

Dave Harrison
Tim Young [Elevate Software] wrote:
> Dave,
>
> << DBISAM and ElevateDb are the only databases I've seen where it make no
> speed difference if the index is built before or after the data is loaded.
> They both complete at nearly the same times. Every other database I tested
> will load data faster if the table doesn't have any indexes and the index is
> built after the data has been loaded. >>
>
> It's due to the way that the indexes are built in both.  There are two ways
> to build an index:
>
> 1) By looping through the rows and building the indexes row by row as if the
> rows were "naturally" inserted.  This is what DBISAM and EDB do, and is why
> you don't see a difference.
>
> 2) Building the index in bulk by looping through the rows, extracting the
> keys, sorting them using a merge-sort, and then writing them out
> page-by-page and building the index from the leaf pages of the tree upwards.
> This is a bit faster and results in a very compact index initially, but
> later on the index will develop pockets of free space as new keys are
> inserted/deleted from the index and will eventually end up just like 1).

I would have thought an SQL "Create Index On Table ..." would have been
implemented by #2 then you'd at least get the index built twice as fast.
Just because later on, the index may eventually degrade into #1, is no
reason not to optimize index creation. I have a lot of tables that are
readonly. Building a couple of indexes on a large table is an overnight
process that I'd like to avoid.

Dave
Thu, Nov 1 2007 4:53 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Dave,

<< I would have thought an SQL "Create Index On Table ..." would have been
implemented by #2 then you'd at least get the index built twice as fast. >>

We've done tests here with DBISAM, and the difference in performance was
negligible until you got into using a *lot* of memory.

<< Just because later on, the index may eventually degrade into #1, is no
reason not to optimize index creation. >>

That's not what I said.  I was just pointing out that the compactness of the
index initially results in some performance issues when the index is first
used because of all of the additional page splits required for the initial
inserts.

<< I have a lot of tables that are readonly. Building a couple of indexes on
a large table is an overnight process that I'd like to avoid. >>

Did you try bumping up the memory settings in EDB or DBISAM ?  That should
help a bit, although the randomness of the data being input will always have
a bearing on how fast an index build executes, regardless of how it is
built.

It's something we'll revisit later on with the EDB enterprise server and the
ability to allocate larger blocks of memory for a given session without
potentially causing issues with other sessions.

--
Tim Young
Elevate Software
www.elevatesoft.com

Thu, Nov 1 2007 6:31 PMPermanent Link

Dave Harrison
Tim Young [Elevate Software] wrote:
> Dave,
>
> << I would have thought an SQL "Create Index On Table ..." would have been
> implemented by #2 then you'd at least get the index built twice as fast. >>
>
> We've done tests here with DBISAM, and the difference in performance was
> negligible until you got into using a *lot* of memory.

And do you remember how much the performance increased if you throw say
a gb of RAM at the problem? I don't mind loading the machine up with RAM
if it solves the problem.

>
> << Just because later on, the index may eventually degrade into #1, is no
> reason not to optimize index creation. >>
>
> That's not what I said.  I was just pointing out that the compactness of the
> index initially results in some performance issues when the index is first
> used because of all of the additional page splits required for the initial
> inserts.

Ok

>
> << I have a lot of tables that are readonly. Building a couple of indexes on
> a large table is an overnight process that I'd like to avoid. >>
>
> Did you try bumping up the memory settings in EDB or DBISAM ?

How do I do that with DBISAM?
I tried increasing the buffers in EDB but didn't notice any difference.

> That should
> help a bit, although the randomness of the data being input will always have
> a bearing on how fast an index build executes, regardless of how it is
> built.

In my benchmark demo the data was random. In my real word application
the data comes in with the first 2 key fields as constant for every 5k
rows and only the date changes for each row (successive days). So the
index values are pretty much non-random. It still takes 8 hours to load
the data.

> It's something we'll revisit later on with the EDB enterprise server and the
> ability to allocate larger blocks of memory for a given session without
> potentially causing issues with other sessions.

I look forward to testing it. Ent class databases usually throw gb of
RAM at the db server to solve this problem and it works quite well. I
just find it a shame that I have a spare gb (or two) and DBISAM can't
use it.

Dave
Sat, Nov 3 2007 6:21 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Dave,

<< And do you remember how much the performance increased if you throw say a
gb of RAM at the problem? I don't mind loading the machine up with RAM if it
solves the problem. >>

We never went that high, but it was performed with 2-100 megs of memory for
the sorting, with the best performance above 75 megs or so.


<< How do I do that with DBISAM? >>

See here:

http://www.elevatesoft.com/dbisam4d7_customizing_engine.htm

under "Memory Buffer Customizations".

<< I tried increasing the buffers in EDB but didn't notice any difference.
>>

As I said, it may not make any difference due to the randomness of the data
in the rows.   You have to start using memory buffers in the hundreds of
megs if you're trying to optimize the indexing/sorting of very large tables.

<< In my benchmark demo the data was random. In my real word application the
data comes in with the first 2 key fields as constant for every 5k rows and
only the date changes for each row (successive days). So the index values
are pretty much non-random. It still takes 8 hours to load the data. >>

I'd have to do some actual playing around with your data to comment any
further.  If you'd like to adapt the benchmark app that you wrote to use the
non-random data, I'd be happy to tell you what is going on.

<< I look forward to testing it. Ent class databases usually throw gb of RAM
at the db server to solve this problem and it works quite well. I just find
it a shame that I have a spare gb (or two) and DBISAM can't use it. >>

It just wasn't designed that way.  Large memory buffering in a database
engine requires a design that cannot handle multi-user (non-C/S) usage, and
vice-versa.

--
Tim Young
Elevate Software
www.elevatesoft.com

Wed, Nov 7 2007 8:31 AMPermanent Link

Can I ask whether these buffers have any impact on multi-threaded
operation? I ask because I have an app that may have up to 8 threads
analysing data and storing the results into a single database with
multiple tables. The end result is loads of data all being inserted
"feathered" into the tables.

I wonder if there is a change in the buffer sizes that would make this
more efficient - there are a lot of indexes to update and I pondered
dropping them before, storing the data, then indexing at the end. I didn't
do this as I couldn't easily work out when the last thread had finished
and do the indexing. But perhaps there are buffer or other configurations
that might make the operations faster as they'd have to throw away less as
presumably each thread has to keep re-reading the files to refresh its
copy.

I'm now at the stage where I can do some tweaking to test alternative
options - are there any you'd recommend looking at?

Thanks in advance,

/Matthew Jones/
Wed, Nov 7 2007 1:40 PMPermanent Link

Tim Young [Elevate Software]

Elevate Software, Inc.

Avatar

Email timyoung@elevatesoft.com

Matthew,

<< I'm now at the stage where I can do some tweaking to test alternative
options - are there any you'd recommend looking at? >>

I would start with increasing the buffer sizes.  Increasing them, especially
for the indexes, should help the updating quite a bit.   However, you will
reach a point of diminishing returns - the benefits don't increase forever
as you increase the buffer sizes.

--
Tim Young
Elevate Software
www.elevatesoft.com

Thu, Nov 8 2007 5:29 AMPermanent Link

Thanks. I'll report back when I get to experiment.

/Matthew Jones/
Thu, Nov 8 2007 3:57 PMPermanent Link

Dave Harrison
matthew@matthewdelme-jones.delme.com (Matthew Jones) wrote:
> Thanks. I'll report back when I get to experiment.
>
> /Matthew Jones/

Matthew,
    Great. Let us know what speed improvements you saw for your
different buffer sizes.

Dave
Image