🦦 What OtterTune hates the most in PostgreSQL🐘 is solved in YugabyteDB🚀

This content originally appeared on DEV Community and was authored by Franck Pachot

I was reading The Part of PostgreSQL We Hate the Most, by Bohan Zhang & Andy Pavlo. The article attributes the problems to a "relic of the 1980s". Let's compare with YugabyteDB which uses Postgres to process the SQL but with a distributed storage built for modern infrastructure.

Basically, the OtterTune article explains some cons of the MVCC implementation in PostgreSQL, with good explanations but little nuance. In this post, I'll also mention its pros.

Problems

There are 4 problems described in the article:

Problem #1: Version Copying

When you update a single byte in a PostgreSQL row, the whole row is copied rather than only the change. This is easy to verify:

drop table demo;
create table demo as 
 select generate_series(1,10000000) n, 'N' flag, lpad('x',1000,'x') filler;
vacuum demo;

select pg_size_pretty(pg_table_size('demo'));
explain (analyze, wal, buffers, costs off) update demo set flag='Y';

select pg_size_pretty(pg_table_size('demo'));

The interesting part is when I update the flag, check the WAL generated, and the increase of the table size:

postgresql=> select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 11 GB
(1 row)

postgresql=> explain (analyze, wal, buffers, costs off) 
             update demo set flag='Y';

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Update on demo (actual time=103872.095..103872.097 rows=0 loops=1)
   Buffers: shared hit=34285343 read=1428953 dirtied=2857543 written=1493088
   I/O Timings: read=8671.563 write=259.356
   WAL: records=20000354 fpi=1428940 bytes=11758071248
   ->  Seq Scan on demo (actual time=0.540..13265.632 rows=10000000 loops=1)
         Buffers: shared read=1428572 written=32324
         I/O Timings: read=8669.064 write=127.201
 Planning:
   Buffers: shared hit=7
 Planning Time: 0.061 ms
 Execution Time: 103872.487 ms
(11 rows)

select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 22 GB
(1 row)

The table size has doubled (from 11 GB to 22 GB) and the WAL generated covers the full additional size (11,758,071,248 bytes, about 11 GB). Basically, all rows have been copied to a newer version with one byte changed, and all blocks involved were logged in the WAL for recovery purposes.
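
To see where the extra 11 GB went, here is a quick check I'm adding (not part of the original experiment): the old row versions show up as dead tuples in pg_stat_user_tables, waiting for vacuum. Unless autovacuum has already processed the table, n_dead_tup should be on the order of 10 million (the statistics are approximate and can lag).

select n_live_tup, n_dead_tup, last_autovacuum
  from pg_stat_user_tables
 where relname = 'demo';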

I'm running the same in a YugabyteDB cluster. Note that the WAL size is included in pg_table_size there (to account for the real size, as the first level of the LSM-Tree is in memory):

drop table demo;
create table demo as 
 select generate_series(1,10000000) n
     , 'N' flag, lpad('x',1000,'x') filler;

select pg_size_pretty(pg_table_size('demo'));

explain (analyze, buffers, costs off) update demo set flag='Y';

Here is the result:

yugabyte=> select pg_size_pretty(pg_table_size('demo'));

 pg_size_pretty
----------------
 1572 MB
(1 row)

yugabyte=> explain (analyze, buffers, costs off)
           update demo set flag='Y';

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Update on demo (actual time=539216.462..539216.462 rows=0 loops=1)
   ->  Seq Scan on demo (actual time=19.515..514476.322 rows=10000000 loops=1)
 Planning Time: 0.697 ms
 Execution Time: 539264.892 ms
 Peak Memory Usage: 27 kB
(5 rows)

yugabyte=> select pg_size_pretty(pg_table_size('demo'));

 pg_size_pretty
----------------
 1762 MB
(1 row)

The table size didn't increase much, which is what you can expect when only one byte changes in each row.

Problem #2: Table Bloat

The PostgreSQL autovacuum doesn't reclaim the space from the filesystem. This needs a full reorg (VACUUM FULL) during a maintenance window.

postgres=> vacuum demo;
VACUUM

postgres=> select pg_size_pretty(pg_table_size('demo'));

 pg_size_pretty
----------------
 22 GB
(1 row)
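
To quantify how much of those 22 GB is reusable space, the pgstattuple extension can help (a sketch I'm adding here, assuming the extension is available on the instance). After the plain VACUUM, the old versions should show up as free space inside the data files rather than as dead tuples:

create extension if not exists pgstattuple;
select tuple_percent, dead_tuple_percent, free_percent
  from pgstattuple('demo');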

A VACUUM FULL, which rewrites the table under an exclusive lock, is needed to give the space back to the filesystem:

postgres=> vacuum full demo;
VACUUM

postgres=> select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 11 GB
(1 row)

In YugabyteDB, there are no holes in the files. To reclaim the space from intermediate versions older than the retention period (15 minutes by default), the current files are merged into new files by the background compaction. It has the same effect as a VACUUM FULL but, because SST files are immutable, it doesn't conflict with ongoing transactions (beyond the resources it consumes when it runs, or the space/read amplification when it doesn't) and it doesn't need to rebuild indexes.

I could reduce the table back to its original size with a full compaction, but that's not needed. Compaction happens at thresholds that I've explained in a previous post.

Problem #3: Secondary Index Maintenance

PostgreSQL adds index entries for the new row versions without removing the previous entries, until autovacuum is able to do it. This competes with the application activity, and is postponed when long transactions are running. Indexes do not carry the visibility information. A consequence, not mentioned in the OtterTune article, is that even an Index Only Scan may have to read the table to check MVCC visibility. This is fast on a freshly vacuumed table (because the visibility map is up to date) but slow otherwise. Note that index maintenance can be reduced by reserving enough free space (set with FILLFACTOR) so that the new version of a row fits in the same block (HOT optimization).
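
As a small sketch I'm adding here (not from the original article), the HOT behavior can be influenced and observed like this. The new FILLFACTOR applies only to blocks written after the change (a table rewrite is needed for existing blocks), and pg_stat_user_tables counts how many updates avoided index maintenance:

-- reserve 30% free space in new blocks so that updated rows can stay in place (HOT)
alter table demo set (fillfactor = 70);

-- compare total updates with HOT updates (those that didn't touch the indexes)
select n_tup_upd, n_tup_hot_upd
  from pg_stat_user_tables
 where relname = 'demo';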

PostgreSQL tables are Heap Tables and the primary key is a secondary index. Adding a primary key to my table creates that index, and an access by primary key requires several block reads:

postgres=> alter table demo add primary key (n);
ALTER TABLE

postgres=> select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 11 GB
(1 row)

postgres=> select pg_size_pretty(pg_table_size('demo_pkey'));
 pg_size_pretty
----------------
 214 MB
(1 row)

postgres=> explain (analyze, buffers, costs off) select * from demo where n=42;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo (actual time=0.861..0.862 rows=1 loops=1)
   Index Cond: (n = 42)
   Buffers: shared read=4
   I/O Timings: read=0.843
 Planning:
   Buffers: shared hit=13 read=1 dirtied=3
   I/O Timings: read=1.390
 Planning Time: 2.056 ms
 Execution Time: 0.889 ms
(9 rows)

For one row, 4 buffers have been read: the B-Tree levels of the index plus the table block.
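
As a side note (my addition, not in the original demo), selecting only the indexed column turns this into an Index Only Scan, but the plan still reports Heap Fetches whenever the visibility map doesn't mark the blocks as all-visible:

explain (analyze, buffers, costs off)
select n from demo where n = 42;
-- "Heap Fetches: 0" means the visibility map allowed skipping the table;
-- a non-zero value means heap blocks were still visited for MVCC visibility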

YugabyteDB stores the table in its primary key, like many other databases. Everything (table and indexes) is stored as an LSM-Tree, with the MVCC information included. There is no need for FILLFACTOR, and no duplication of rows on update. Even when a table row moves physically (during automatic re-sharding, for example), the index doesn't incur any maintenance overhead because it references the primary key and not a physical location.
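
As an illustration of rows being stored in the primary-key structure, here is the kind of declaration YugabyteDB accepts (my addition, with hypothetical table names; check the documentation for the exact syntax on your version), where the primary key also chooses the sharding method:

-- hash sharding distributes rows across tablets by a hash of n
create table demo_hash  (n bigint, flag text, primary key (n hash));
-- range sharding keeps rows ordered by n, split by ranges
create table demo_range (n bigint, flag text, primary key (n asc));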

An Index Scan from a secondary index has to go to the table through its primary key, which is fast in an LSM-Tree, and this can even be skipped with an Index Only Scan, which never has to read the table.

I've explained covering indexes here.
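
Here is a minimal covering-index sketch (my addition, with a hypothetical index name, to illustrate the idea rather than reproduce the original demo): the INCLUDE clause stores extra columns in the index entries so that the query doesn't have to go to the table at all:

create index demo_flag on demo (flag) include (n);
explain (costs off) select n from demo where flag = 'N';
-- can be an Index Only Scan using demo_flag: all needed columns are in the index entry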

When I add a primary key, the size is the same (there's no additional index) and the access by primary key is only one read:

yugabyte=> alter table demo add primary key (n);
ALTER TABLE

yugabyte=> select pg_size_pretty(pg_table_size('demo'));
 pg_size_pretty
----------------
 1314 MB
(1 row)

yugabyte=> select pg_size_pretty(pg_table_size('demo_pkey'));
 pg_size_pretty
----------------

(1 row)

yugabyte=> explain (analyze, dist, costs off) select * from demo where n=42;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo (actual time=1.206..1.208 rows=1 loops=1)
   Index Cond: (n = 42)
   Storage Index Read Requests: 1
   Storage Index Execution Time: 2.000 ms
 Planning Time: 0.054 ms
 Execution Time: 1.242 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 2.000 ms
 Peak Memory Usage: 0 kB
(10 rows)

The size is actually even smaller because compaction happened. The access is displayed as Index Scan but, on the primary key, this is the same as an Index Only Scan.

Problem #4: Vacuum Management

Because of bloat, and also because of the PostgreSQL transaction ID wraparound, vacuum must run frequently and must not be blocked by long transactions. This can be tricky with a high rate of DML, and the main point of the OtterTune article is to present their tool, which makes it easier to monitor.
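
A common way to keep an eye on the wraparound risk (a monitoring sketch I'm adding here, not from the OtterTune article) is to track the age of the oldest transaction ID per database:

select datname, age(datfrozenxid) as xid_age
  from pg_database
 order by xid_age desc;
-- autovacuum triggers an anti-wraparound freeze when this age reaches
-- autovacuum_freeze_max_age (200 million by default)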

YugabyteDB doesn't have this bloat problem. The space amplification of SST files is resolved by the background compaction, which doesn't conflict with ongoing transactions because they read and write only immutable SST files. There is also no transaction ID wraparound, as transactions are sequenced with the cluster's Hybrid Logical Clock, which is always increasing.

In short, VACUUM is a no-op in YugabyteDB, kept only for compatibility with scripts written for PostgreSQL:

yugabyte=# vacuum demo;
WARNING:  VACUUM will be ignored
VACUUM
yugabyte=#

Advantages

The OtterTune article doesn't mention any advantage of the PostgreSQL MVCC implementation, but there are always some tradeoffs.

fast rollback

I have been working a lot with Oracle Database, which I've always considered the best implementation of MVCC for heap tables and B-Trees. However, there's one case where PostgreSQL is better. A rollback in PostgreSQL takes no time:

postgres=> \timing on
Timing is on.

postgres=> begin transaction;
BEGIN
Time: 31.719 ms

postgres=*> delete from demo;
DELETE 10000000
Time: 88588.840 ms (01:28.589)

postgres=*> rollback;
ROLLBACK
Time: 31.856 ms

Even though YugabyteDB has a different implementation of MVCC, a transaction's provisional records go to the IntentsDB (another LSM-Tree) and are merged, in the background, into the RegularDB on commit.

YugabyteDB therefore benefits from the same behavior, a fast rollback:

yugabyte=> begin transaction;
BEGIN
Time: 31.178 ms
yugabyte=*> delete from demo;
DELETE 10000000
Time: 376758.463 ms (06:16.758)
yugabyte=*> rollback;
ROLLBACK
Time: 31.061 ms

This is not only useful for user rollbacks, but is also critical for the Recovery Time Objective: in case of recovery, the ongoing transactions have to be rolled back before the tables are available again.

To do the same, Oracle (or MySQL InnoDB) has to go through the chain of rollback segments and undo all changes one by one before the table is available again. I've also seen this with never-ending transactions that the user asks to kill: even if you kill the session, and even if you restart the database, the rollback still has to be done. All Oracle DBAs remember monitoring V$TRANSACTION.USED_UREC, watching it decrease to estimate when the table will be available again.
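
That monitoring query looks roughly like this (a sketch from memory, to be adapted to your environment):

-- undo records still to be applied for each active transaction being rolled back
select xidusn, xidslot, xidsqn, status, used_urec from v$transaction;
-- USED_UREC decreases as the rollback progresses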

DEMO@o21c_tp> set timing on

DEMO@o21c_tp> create table demo compress as select rownum n, 'N' flag, lpad('x',1000,'x') filler from xmltable('1 to 10000000');

Table DEMO created.

Elapsed: 00:00:24.971

DEMO@o21c_tp> alter table demo add primary key (n);

Table DEMO altered.

Elapsed: 00:00:07.714
DEMO@o21c_tp> delete from demo;

10,000,000 rows deleted.

Elapsed: 00:01:30.074
DEMO@o21c_tp> rollback;

Rollback complete.

Elapsed: 00:02:53.114
DEMO@o21c_tp>

The rollback was longer than the operation itself in this Oracle Autonomous Database.

index types

Another advantage of the PostgreSQL MVCC implementation is that it doesn't push the transaction-management complexity down to the indexes. This is good for extensibility: PostgreSQL ships with six index access methods optimized for different use cases, and more can be added by extensions.
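
Those index access methods can be listed from the catalog (a quick check I'm adding here):

select amname from pg_am where amtype = 'i';
-- btree, hash, gist, gin, spgist, brin on a standard PostgreSQL instance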

YugabyteDB, because of sharding and the many capabilities of LSM-Trees, like the Hybrid Scan, may not need as many index types. Modern hardware and the ability to scale out make Seq Scan efficient in more cases, especially with the pushed-down Remote Filter. There are currently two types of indexes (LSM and GIN), and GiST is on the roadmap to fully support PostGIS. LSM indexes also provide features that don't exist in PostgreSQL: loose index scan, scaling out, and more.

In summary

There are always trade-offs in IT. Some choices were made in a different context, and traditional databases may still use the same implementation on modern hardware. New databases have the opportunity to make different choices. I'm very enthusiastic about any attempt to improve PostgreSQL storage: the abandoned zHeap project, the modern Oriole engine, the branching Neon, and of course the distributed YugabyteDB.

It is also good to remember that PostgreSQL is good enough for many cases, and @ryanbooz has written a nice article to counterbalance the many negative ones about PostgreSQL MVCC.

I have presented YugabyteDB as solving everything about MVCC here, but there are also some tradeoffs. I ran PostgreSQL and YugabyteDB on similar instance sizes (4 vCPUs, 16 GiB RAM on AWS), and you have seen some operations being faster on PostgreSQL. That's because PostgreSQL is monolithic and works in shared memory: fast, but it stops in case of failure or maintenance. The YugabyteDB cluster has one instance in each Availability Zone, and the application continues if one AZ is down. This provides high availability but adds some latency. The most important thing, before comparing different implementations, is to understand how they work and to have the freedom of choice. Stay with open source, PostgreSQL or PostgreSQL-compatible, and read multiple sources with different points of view. And test everything.

