If you have >10 Million records or more in you traditional PostgreSQL db, doing some data analysis will be take painfully long, same for the disk space cost.
I have a dCache billing db that generates >1 Billion records / year, the origional purpose was to find a way to save space cost, so that I can have more years data keep on disk.
Throughly looked into the solutions https://wiki.postgresql.org/wiki/Foreign_data_wrappers, I found this column-oriented cstore_fdw is the best one to play with.
cstore-fdw is an open sourced Postgres extension that uses the Optimized Row Columnar (ORC) format for its data layout. Table metadata handled by protobuf-c for serializing and deserializing table metadata. It's dedicated for data analysis purpose.
ORC improves upon the RCFile format developed at Facebook, open source
protocal buffer is developed at Google, open source
It brings huge benefit on space cost, as well as performance.
Postgres works with it through Postgres foreign data wrapper APIs, so it's transparent to user end. Regular SQLs, stored procedures, triggers are all good with it. No table level vacuum any more, but it does need vacuum analyze sometime for a while to let PG planner to work better with it.
You can find more feathers and quick setup in https://citusdata.github.io/cstore_fdw/
Here is the copy of from the link:
cstore_fdw, the first column-oriented store available for PostgreSQL. Using it will let you:
- Leverage typical analytics benefits of columnar stores
- Deploy on stock PostgreSQL or scale-out PostgreSQL (CitusDB)
Download and get started at https://github.com/citusdata/cstore_fdw.
Key areas improved by this extension:
- Faster Analytics — Reduce analytics query disk and memory use by 10x
- Lower Storage — Compress data by 3x
- Easy Setup — Deploy as standard PostgreSQL extension
- Flexibility — Mix row- and column-based tables in the same DB
- Community — Benefit from PostgreSQL compatibility and open development
Learn more on our blog post.
cstore_fdw brings substantial performance benefits to analytics-heavy workloads:
- Column projections: only read columns relevant to the query
- Compressed data: higher data density reduces disk I/O
- Skip indexes: row group stats permit skipping irrelevant rows
- Stats collections: integrates with PostgreSQL’s own query optimizer
- PostgreSQL-native formats: no deserialization overhead at query time
Cleanly implements full-table compression:
- Uses PostgreSQL’s own LZ family compression technique
- Only decompresses columns needed by the query
- Extensible to support different codecs
If you know how to use PostgreSQL extensions, you know how to use
- Deploy as standard PostgreSQL extension
- Simply specify table type at creation time using FDW commands
- Copy data into your tables using standard PostgreSQL
Have the best of all worlds… mix row- and column-based tables in the same DB:
CREATE FOREIGN TABLE cstore_table (num integer, name text) SERVER cstore_server; CREATE TABLE plain_table (num integer, name text); COPY cstore_table FROM STDIN (FORMAT csv); -- 1, foo -- 2, bar -- 3, baz -- \. COPY plain_table FROM STDIN (FORMAT csv); -- 4, foo -- 5, bar -- 6, baz -- \. SELECT * FROM cstore_table c, plain_table p WHERE c.name=p.name; -- num | name | num | name -------+------+-----+------ -- 1 | foo | 4 | foo -- 2 | bar | 5 | bar -- 3 | baz | 6 | baz
Join the cstore users Google Group.
Full integration with rich PostgreSQL ecosystem:
- Compatible with all existing PostgreSQL data types
- Leverage semi-structured data using
- Quickly keep track of distinct values with HyperLogLog