
Messages 1 to 10 of 12 total
Thread: Cache Foreign Keys in memory when loading data
Sun, Jun 29 2014 6:14 PM

Barry

I have a foreign key set up on a Detail table with almost 1 million rows in it. The fk points to a small table with only 4 rows in it. I want to validate the Company_Id, Branch_Id to a table refCompany when the row is saved. Unfortunately this really slows down adding rows to the table. Maybe by a factor of 5x.

Suggestions:

1) Can't EDB cache the fk table in memory so it doesn't have to do the lookup to disk? If no one has updated the fk refCompany table, why go to disk every time a row is added to a table? I ran ProcMon and sure enough, there are lots of references to the refCompany.EDBIdx file when the rows are being added to the Detail table.

2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the refCompany table in memory and eliminate disk access to the fk table?

Barry
v2.13 B3 Unicode
Mon, Jun 30 2014 3:18 AM

Roy Lambert

NLH Associates

Team Elevate

Barry

>1) Can't EDB cache the fk table in memory so it doesn't have to do the lookup to disk? If no one has updated the fk refCompany table, why go to disk every time a row is added to a table? I ran ProcMon and sure enough, there are lots of references to the refCompany.EDBIdx file when the rows are being added to the Detail table.

I didn't know that ProcMon differentiated between cache and hard drive (it's a long time since I've used it). The fact that refCompany.EDBIdx is being accessed doesn't mean it's the hard drive being accessed. With the table as small as you say it is, I'd expect it to be in either ElevateDB's cache or the Windows cache.

Regardless of where it's actually being looked at, I can't see a way for ElevateDB to avoid looking, since you've told it to check that table when you add a new row into your detail table. I don't know how Tim has written it, but my approach would be:

a) see if the master table has been altered since I last used it. If it has, update it (this will probably mandate a disk check)
b) check to see if the key is there
c) continue

If an appropriate index exists then that would be used for b) since that should be faster than a brute force row scan.
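To make the a/b/c approach concrete, here's a rough sketch. It's a language-neutral illustration in Python with invented names (FKValidator, RefTable, the version counter) and is not how EDB is actually written; the point is that the per-row cost collapses to a cheap in-memory lookup, with step a) being the only part that might need disk access.

```python
class RefTable:
    """Minimal stand-in for the 4-row refCompany table (invented for illustration)."""
    def __init__(self, keys):
        self._keys = list(keys)
        self.version = 0          # bumped on every change to the table

    def keys(self):
        return self._keys

    def add(self, key):
        self._keys.append(key)
        self.version += 1

class FKValidator:
    """Validates FK values against a small reference table, rechecking it only when it changes."""
    def __init__(self, ref_table):
        self._ref = ref_table
        self._seen_version = None  # version we last built the index from
        self._index = set()        # in-memory key index

    def is_valid(self, key):
        # a) see if the master table has been altered since we last used it;
        #    checking the version counter is the part that may cost a disk check
        if self._ref.version != self._seen_version:
            self._index = set(self._ref.keys())      # rebuild the index
            self._seen_version = self._ref.version
        # b) check whether the key is there - a set lookup stands in for an
        #    index lookup, far cheaper than a brute-force row scan
        return key in self._index
        # c) the caller continues with the insert if this returns True
```

A bulk insert would then call is_valid once per row, but only the first call (or a call after the reference table changes) does any rebuild work.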

>2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the refCompany table in memory and eliminate disk access to the fk table?

I don't know but, on the basis that the tables can only be changed internally and in memory within the transaction, it might improve things - let us know when you've tried it :-)


Here's a freaky idea. Remove the current constraint and replace it with one that checks hard-coded values for the company ID. When the company table is altered, update the constraint. Unfortunately you'll need exclusive access to the detail table for this, and there will be a big hit as the million-plus rows are checked. As and when the company table exceeds a predetermined size, go back to checking a foreign key constraint.
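In sketch form the freaky idea looks like this (Python used as illustration only; the names and the check-as-function shape are invented, the real thing would be a constraint in EDB DDL). The current company IDs get baked into the check itself, so validating a row never touches the reference table:

```python
def make_company_check(company_ids):
    """Build a row check with the current company IDs hard-coded (frozen) into it."""
    frozen = frozenset(company_ids)   # snapshot taken once, no lookups afterwards
    return lambda row: row["Company_Id"] in frozen

# Constraint built from today's 4-row company table:
check = make_company_check(["01", "02", "03", "04"])
assert check({"Company_Id": "02"})
assert not check({"Company_Id": "99"})

# When the company table is altered, the check must be rebuilt - the analogue
# of re-creating the constraint, which in EDB needs exclusive access and a
# re-validation pass over the million-plus existing rows:
check = make_company_check(["01", "02", "03", "04", "05"])
assert check({"Company_Id": "05"})
```

The trade-off is exactly the one described above: inserts get a pure in-memory membership test, at the price of an expensive rebuild whenever the company table changes.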

Roy Lambert
Thu, Jul 3 2014 1:05 AM

Barry

Roy Lambert wrote:

>>1) Can't EDB cache the fk table in memory so it doesn't have to do the lookup to disk? If no one has updated the fk refCompany table, why go to disk every time a row is added to a table? I ran ProcMon and sure enough, there are lots of references to the refCompany.EDBIdx file when the rows are being added to the Detail table.<<

>I didn't know that ProcMon differentiated between cache and hard drive (long time since I've used it). The fact that refCompany.EDBIdx is being accessed doesn't mean that its the hard drive being accessed. With the table as small as you say it is then I'd expect it to either be in ElevateDB's cache or Windows cache.<

Correct. When I said "cache", I meant EDB's internal cache, not the Windows cache. If EDB has to go to disk (or the Windows cache), then it has to generate an I/O request to the OS, and that is very slow compared to accessing the data in, say, a TList. I was hoping EDB could reference its internal structure (a TList) when checking an FK constraint.

>Regardless of where its actually being looked at I can't see a way for ElevateDB to avoid looking since you've told it to check that table when you add a new row into your detail table. I don't know how Tim has written it but my approach would be:
>
>a) see if master table has been altered since I last used it. If it has update it (this will probably mandate a disk check)
>b) check to see if the key is there
>c) continue
>
>If an appropriate index exists then that would be used for b) since that should be faster than a brute force row scan.<

How about this: if FK value "MA" was valid 10ms ago, and the FK table hasn't been updated, and another "MA" is encountered on a subsequent record, why not just allow it without rechecking the FK table?

>>2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the refCompany table in memory and eliminate disk access to the fk table?<<

>I don't know but, on the basis the tables can only be changed internally and in memory within the transaction, it might improve things - let us know when you've tried it :-)

I tried it, no difference. :-(

>Here's a freaky idea. Remove the current constraint and replace it with one that checks hard coded values for the company ID. When the company table is altered then update the constraint. Unfortunately you'll need exclusive access to the detail table for this, and there will be a big hit as the million plus rows are checked. As and when the company table exceeds a predetermined size then go back to checking a foreign key constraint.<

Yes, I ended up removing the constraint. My app defines the FK value, so I trust it. I am just a belt-and-suspenders type of programmer. I decided to lose the suspenders for this table and my pants are still holding. :-)

Barry
Thu, Jul 3 2014 7:23 AM

Roy Lambert

NLH Associates

Team Elevate

Barry


>How about this: if FK value "MA" was valid 10ms ago, and the FK table hasn't been updated, and another "MA" is encountered on a subsequent record, why not just allow it without rechecking the FK table?

You'll still need some sort of table access to find out if the table has been altered, and I'm guessing that will require a request to Windows to ask it if it's got anything it hasn't told the hard drive about yet :-(

>>>2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the refCompany table in memory and eliminate disk access to the fk table?<<

>Yes, I ended up removing the constraint. My app defines the FK value, so I trust it. I am just a belt and suspenders type of programmer. I decided to lose the suspenders for this table and my pants are still holding. :-)

I'm the opposite - since my apps are free-standing (i.e. no third party is allowed to access the data via their own software) and I've been bitten several times by database-enforced RI, I rely on the app, not the database.

Roy
Thu, Jul 3 2014 9:40 AM

Eryk Bottomley

Barry

>>Correct. When I said "cache", I meant EDB's internal cache, not Windows cache. If EDB has to go to disk (or Windows cache), then it has to generate an I/O request to the OS and that is very very slow compared to accessing the data in say a TList.<<

No, it isn't. We did a lot of testing in that area back in the DBISAM era. You can prove it for yourself by knocking up a benchmark that compares an in-memory table with a disk-based equivalent. Accessing a TDataset vs. accessing a TList will generally be slower (depending upon what exactly you are doing), but that is not a consequence of Windows API file I/O overhead.

>>How about this: if FK value "MA" was valid 10ms ago, and the FK table hasn't been updated, and another "MA" is encountered on a subsequent record, why not just allow it without rechecking the FK table?<<

In order to know that the FK table hasn't been updated you either need to perform disk IO or else have a lock on the table such that an update cannot have occurred. Also, if you are going to keep a list of previously encountered keys in RAM then you need a TList or a THashList or something of that nature and you are on the road to reinventing your own index file.

>>2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the >>refCompany table in memory and eliminate disk access to the fk table?<<
>>I tried it, no difference. Frown

Load the whole thing into RAM yourself as a memory table. It still won't make much difference.

>>Yes, I ended up removing the constraint. My app defines the FK value, so I trust it. I am just a belt and suspenders type of programmer. I decided to lose the suspenders for this table and my pants are still holding. :-)<<

If it is a bulk load operation then it would probably be faster to suspend the FK, load all the data completely "blind" and then post-process to kick out the rows that shouldn't have been added in the first place. Depends on the probable ratio of good to bad data, of course.
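The load-blind-then-post-process strategy can be sketched like this (Python as a neutral illustration; the function name and row shape are invented, and in EDB the "kick out" step would be a DELETE against the loaded table rather than a list split):

```python
def bulk_load_then_validate(rows, fk_keys):
    """Load all rows 'blind' with the FK suspended, then split into kept and rejected."""
    valid = set(fk_keys)          # one pass over the tiny FK table, done once
    table = list(rows)            # the blind load: plain appends, no per-row check
    # Post-process: a single validation pass instead of a lookup per insert
    kept = [r for r in table if r["Company_Id"] in valid]
    rejected = [r for r in table if r["Company_Id"] not in valid]
    return kept, rejected
```

Whether this wins depends, as noted above, on the ratio of good to bad data: the blind load is fast, but every bad row costs a post-hoc removal.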

Eryk
Thu, Jul 3 2014 10:18 AM

Roy Lambert

NLH Associates

Team Elevate

Eryk

Nice to know you're still lurking.

>>>Correct. When I said "cache", I meant EDB's internal cache, not Windows cache. If EDB has to go to disk (or Windows cache), then it has to generate an I/O request to the OS and that is very very slow compared to accessing the data in say a TList.
>
>No it isn't. We did lots of testing in that area back in the DBISAM era. You can prove it for yourself by knocking up a benchmark that compares an in memory table with a disk based equivalent. Accessing a TDataset vs. accessing a TList will generally be slower (depending upon what exactly you are doing) but that is not a consequence of Windows API file IO overhead.

What I found when I was building my own string table based on linked stringlists was that in memory / on disk made little difference for some dataset operations. Stuffing things directly into the stringlists was a lot faster than inserting even with indices. I put it down to the overhead from Delphi's TDataset.

>If it is a bulk load operation then it would probably be faster to suspend the FK, load all the data completely "blind" and then post-process to kick out the rows that shouldn't have been added in the first place. Depends on the probable ratio to good to bad data of course.

That's given me a thought - I wonder if DISABLE TRIGGERS turns off the constraint checking. Barry, have a go and let us know please. I would, but I'd have to set up a database with RI to do it.

Roy Lambert
Thu, Jul 3 2014 11:16 AM

Eryk Bottomley

Roy

>>Stuffing things directly into the stringlists was a lot faster than inserting even with indices.

That is sort of back to front except in the special case of detecting primary key/unique constraint violations in big data sets (and even then a THashStringList would probably win). In all other cases the existence of indexes is pure overhead as far as insert operations are concerned.

It is sometimes faster in a bulk load scenario to drop all indexes as well as constraints and then rebuild them afterwards. It depends on the probable percentage of constraint violations one expects.
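The point about indexes being pure overhead for inserts can be shown with a toy model (Python, invented names; a sorted list stands in for an index). Maintaining the index per insert shifts elements on every row, while loading first and building the index once does the same total work as a single sort:

```python
import bisect

def insert_with_index(keys):
    """Keep the 'index' (a sorted list) up to date on every single insert."""
    index = []
    for k in keys:
        bisect.insort(index, k)   # O(n) element shifting per insert
    return index

def insert_then_build_index(keys):
    """Bulk load with no index work, then rebuild the index once at the end."""
    rows = list(keys)             # plain appends, the 'blind' load
    return sorted(rows)           # one O(n log n) rebuild
```

Both end with an identical index; only the second defers the cost to a single rebuild, which is why dropping and recreating indexes can pay off for large loads.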

Eryk
Fri, Jul 4 2014 2:44 AM

Roy Lambert

NLH Associates

Team Elevate

Eryk

>>>Stuffing things directly into the stringlists was a lot faster than inserting even with indices.
>
>That is sort of back to front except in the special case of detecting primary key/unique constraint violations in big data sets (and even then a THashStringList would probably win). In all other cases the existence of indexes is pure overhead as far as insert operations are concerned.
>
>It is sometimes faster in a bulk load scenario to drop all indexes as well as constraints and then rebuild them afterwards. It depends on the probable percentage of constraint violations one expects.

I think that was a typo - "with" should have read "without".

Roy Lambert
Fri, Jul 4 2014 5:18 PM

Barry

Eryk Bottomley wrote:

Hi Eryk. Nice to hear from you. Are you developing with EDB?

>>Correct. When I said "cache", I meant EDB's internal cache, not Windows cache. If EDB has to go to disk (or Windows cache), then it has to generate an I/O request to the OS and that is very very slow compared to accessing the data in say a TList.

>No it isn't. We did lots of testing in that area back in the DBISAM era. You can prove it for yourself by knocking up a benchmark that compares an in memory table with a disk based equivalent. Accessing a TDataset vs. accessing a TList will generally be slower (depending upon what exactly you are doing) but that is not a consequence of Windows API file IO overhead.

I agree with you if traversing the table sequentially, because of EDB's caching. If there were no caching and each Next statement required an I/O command to be issued to Windows, it would be a lot slower than a TList, which is stored entirely in memory. Any time you need to access the OS, there is overhead involved.

Traversing a TList and accessing all of its TObject variables is lightning fast compared to accessing a TDataset, whether from disk or memory. I've found that accessing an EDB memory table from the server shows no speed improvement over an EDB disk table from the same server, because most of the time is spent in TCP overhead. A local memory table compared to a local disk table should be a bit faster for a large table.

As Roy pointed out, the TDataset is to blame for the lack of speed. I've noticed that accessing TField.AsInteger etc. is extremely slow because it has to convert the TDataset field value into a Delphi Integer, whereas a TList's TObject already stores the value as an Integer. This is certainly no fault of EDB or DBISAM; that is the way TDataset is written.

A couple of years ago I loaded several million rows of data (about 1.6 GB) into a TList and traversed it, accessing all of its field values, in about 15 seconds. This included some pretty complicated code within the loop. It took about 20 minutes just to load the TList with the data from the MySQL database. Once loaded, it was extremely fast. So the TList was at least 80x faster than the disk-based table. I suspect an EDB local table would cut that difference down to maybe 20x. Still, there is a big speed difference that is noticeable on large tables.
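The AsInteger point can be mimicked in a few lines (a Python analogy with invented class names, not Delphi code): one row type stores its value the way a dataset buffer might, converting on every read, while the other holds it natively the way a TList of TObjects does.

```python
class FieldRow:
    """Dataset-style row: value kept as text, converted on every access (like TField.AsInteger)."""
    def __init__(self, v):
        self._raw = str(v)
    @property
    def as_integer(self):
        return int(self._raw)     # conversion cost paid on each and every read

class ObjectRow:
    """TList-style row: the TObject already stores the value as an integer."""
    def __init__(self, v):
        self.value = v            # no conversion needed at access time

def total(rows, get):
    """Traverse all rows, reading one field from each."""
    return sum(get(r) for r in rows)
```

Both traversals produce the same answer; the difference is purely the per-access conversion work, which is the overhead being blamed on TDataset above.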

>>How about this: if FK value "MA" was valid 10ms ago, and the FK table hasn't been updated, and another "MA" is encountered on a subsequent record, why not just allow it without rechecking the FK table?

>In order to know that the FK table hasn't been updated you either need to perform disk IO or else have a lock on the table such that an update cannot have occurred. Also, if you are going to keep a list of previously encountered keys in RAM then you need a TList or a THashList or something of that nature and you are on the road to reinventing your own index file.

I would have thought EDB would already have its own internal TList to cache the key values of the FK table.
If I have the FK table in the same transaction that is updating the other table, the FK table isn't being updated, so EDB should not have to go to disk to keep validating the FK value.

My FK table has only 4 rows in it. When loading a million rows of data into a table, I don't see the point of doing a million lookups to the FK table; there should be some way to optimize it. So, as you said, if the FK table is locked, it would be nice if EDB could rely more on its internal cache (if it can).

>>2) If I put refCompany in the same transaction that is updating the Detail table, will it help EDB to cache the >>refCompany table in memory and eliminate disk access to the fk table?<<
>>I tried it, no difference. Frown

>Load the whole thing into RAM yourself as a memory table. It still won't make much difference.<
I believe you. That's because it appears EDB is still accessing the table as a TDataset, and not as an internal EDB TList cache.

>>Yes, I ended up removing the constraint. My app defines the FK value, so I trust it. I am just a belt and >>suspenders type of programmer. I decided to lose the suspenders for this table and my pants are still holding.

>If it is a bulk load operation then it would probably be faster to suspend the FK, load all the data completely "blind" and then post-process to kick out the rows that shouldn't have been added in the first place. Depends on the probable ratio of good to bad data of course.

Yes. I agree that would be the best solution for today's EDB.
Though it would be great if a future version of EDB (v3?) could find a way to speed up FK constraints.
I can always hope, right?

Barry
(Fingers crossed.)
Sat, Jul 5 2014 6:40 AM

Roy Lambert

NLH Associates

Team Elevate

Barry


>As Roy pointed out, the TDataset is to blame for the lack of speed. I've noticed accessing a TField.AsInteger etc is extremely slow because it has to convert the TDataset field value into a Delphi Integer. But a TList's TObject already stores the value as an Integer. This is certainly no fault of EDB or DBISAM. That is the way the TDataset is written.

I had a good starter in Gerald Nunn's code, but I invented some new swearwords for the Borland team who came up with TDataset.

>Yes. I agree that would be the best solution for today's EDB.
>Though it would be great if a future version of EDB (v3?) could find a way to speed up fk constraints.
>I can always hope, right?

Well, if Tim moves to a totally ElevateDB-managed file system with a c/s-only engine, then it should be possible, since he can hold flags in engine-accessible memory to say whether data has been changed or not. There will still be a need for TDataset compliance at some level, or Tim will have to write a new component set to allow you to see your data.

Roy Lambert