cstore-fdw is an open sourced Postgres extension that uses the Optimized Row Columnar (ORC) format for its data layout. Use Protobuf-c for serializing and deserializing table metadata. It's dedicated for data analyzing purpose.
Postgres works with it through Postgres foreign data wrapper APIs, so it's transparent to user end.
You can find other articles like the introduction of cstore-fdw and how to setup/config cstore-fdw, at here I'll just give you a glance at technical layer what and how it works in details from handson experience.
What is cstore-fdw
– CStore is an open source project actively in development:github.com/citusdata/cstore_fdw
– Improved statistics gathering
– Automatic management of table filenames
– Management of table file data
Columnar Store(cstore-fdw) Motivation
- Read subset of columns to reduce I/O – Less disk I/O
- Better compression – Less disk usage
There are two optional parameters you can set when creatin cstore foreign table
stripe_row_count (optional): Number of rows per stripe. The default is
150000. Reducing this decreases the amount memory used for loading data and querying, but also decreases the performance.
block_row_count (optional): Number of rows per column block. The default is
10000. cstore_fdw compresses, creates skip indexes, and reads from disk at the block granularity. Increasing this value helps with compression and results in fewer reads from disk. However, higher values also reduce the probability of skipping over unrelated row blocks.
Use Foreign wrapper APIs
- Supports all PostgreSQL data types
- Statistics collection for better query plans
- Load extension. Create Table.
Postgres foreign data wrapper APIs make sure SQLs work the same as it does on regular table.
cstore_fdw currently doesn't support PostgreSQL indexes. But it automatically stores some min/max statistics in skip indexes which makes execution of some queries much more efficient.
Currently FDWs are not able to cleanly push down aggregation to the foreign data wrapper, the actual aggregation is still done on the postgres side. This means that large aggregates are about the same speed on cstore_fdw as they are for PostgreSQL tables
cstore-fdw is a foreign data wrapper, physical data is not managed by Postgres, No data vacuum any more, but it does need vacuum analyze sometime for a while to let Postgres Optimizer to work better.
No problem, still work like before.
- Current compression method is PG_LZ from PostgreSQL core
- Easy to add new compression methods depending on the CPU / disk trade-off
- cstore_fdw enables using different compression methods at the column block level
Column projections: Only reads column data relevant to the query. Improves performance for I/O bound queries.
Skip indexes: Stores min/max statistics for row groups, and uses them to skip over unrelated rows.
Here is a good article described performance measurement, though it's done by monetdb.
ORC File Layout benefits
- Columnar layout – reads columns only related to the query
- Compression – groups column values (10K) together and compresses them
- Skip indexes – applies predicate filtering to skip over unrelated values
Drawbacks to ORC
- Support for limited data types. Each data type further needs to have a separate code path for min/max value collection and constraint exclusion.
- Gathering statistics from the data and table JOINs are an afterthought.
cstore_fdw depends on protobuf-c for serializing and deserializing table metadata
State of the Columnar Store
- Fork a popular database, swap in your storage engine, and never look back
- Develop an open columnar store format for the Hadoop Distributed Filesystem (HDFS)
- Use PostgreSQL extension machinery for in-memory stores / external databases
- Improve memory usage
- Native Delete / Insert / Update support
- Improve read query performance (vectorized execution)
- Different compression codecs