This article is pretty much same as the one in github, a bit more easier than that because RHEL/CentOS already have their package in their repo.
This extension implements a columnar store for PostgreSQL. Columnar stores provide notable benefits for analytic use-cases where data is loaded in batches.
Join the Mailing List to stay on top of the latest developments.
This extension uses the Optimized Row Columnar (ORC) format for its data layout. ORC improves upon the RCFile format developed at Facebook, and brings the following benefits:
- Compression: Reduces in-memory and on-disk data size by 2-4x. Can be extended to support different codecs.
- Column projections: Only reads column data relevant to the query. Improves performance for I/O bound queries.
- Skip indexes: Stores min/max statistics for row groups, and uses them to skip over unrelated rows.
Further, we used the Postgres foreign data wrapper APIs and type representations with this extension. This brings:
- Support for 40+ Postgres data types. The user can also create new types and use them.
- Statistics collection. PostgreSQL's query optimizer uses these stats to evaluate different query plans and pick the best one.
- Simple setup. Create foreign table and copy data. Run SQL.
cstore_fdw depends on protobuf-c for serializing and deserializing table metadata. So we need to install these packages first:
# RHEL, CentOS, SL, Amazon Linux yum --enablerepo=EPEL install protobuf-c-develNote. In CentOS 5, 6, and 7, you may need to install or update EPEL 5, 6, or 7 repositories. See this page for instructions.
Note. In Amazon Linux, the EPEL repository is installed by default, but not enabled. See these instructions for how to enable it.
Once you have protobuf-c installed on your machine, you are ready to install cstore_fdw.
The cstore_fdw package is also available in RHEL repo.
yum install cstore_fdw
Note. cstore_fdw requires PostgreSQL 9.3, 9.4 or 9.5. It doesn't support earlier versions of PostgreSQL.
Before using cstore_fdw, you need to add it to
shared_preload_libraries in your
postgresql.conf and restart Postgres:
shared_preload_libraries = 'cstore_fdw' # (change requires restart)
The following parameters can be set on a cstore foreign table object.
- filename (optional): The absolute path to the location for storing table data. If you don't specify the filename option, cstore_fdw will automatically choose the $PGDATA/cstore_fdw directory to store the files. If specified the value of this parameter will be used as a prefix for all files created to store table data. For example, the value
/cstore_fdw/my_tablecould result in the files
/cstore_fdw/my_table.footerbeing used to manage table data.
- compression (optional): The compression used for compressing value streams. Valid options are
pglz. The default is
- stripe_row_count (optional): Number of rows per stripe. The default is
150000. Reducing this decreases the amount memory used for loading data and querying, but also decreases the performance.
- block_row_count (optional): Number of rows per column block. The default is
10000. cstore_fdw compresses, creates skip indexes, and reads from disk at the block granularity. Increasing this value helps with compression and results in fewer reads from disk. However, higher values also reduce the probability of skipping over unrelated row blocks.
To load or append data into a cstore table, you have two options:
- You can use the
COPYcommand to load or append data from a file, a program, or STDIN.
- You can use the
INSERT INTO cstore_table SELECT ...syntax to load or append data from another table.
You can use the
ANALYZE command to collect statistics about the table. These statistics help the query planner to help determine the most efficient execution plan for each query.
Note. We currently don't support updating table using DELETE, and UPDATE commands. We also don't support single row inserts.
To update an existing cstore_fdw installation from versions earlier than 1.4.1 you can take the following steps:
- Download and install cstore_fdw version 1.4.1 using instructions from the "Building" section,
- Restart the PostgreSQL server,
ALTER EXTENSION cstore_fdw UPDATE;
As an example, we demonstrate loading and querying data to/from a column store table from scratch here. Let's start with downloading and decompressing the data files.
wget http://examples.citusdata.com/customer_reviews_1998.csv.gz wget http://examples.citusdata.com/customer_reviews_1999.csv.gz gzip -d customer_reviews_1998.csv.gz gzip -d customer_reviews_1999.csv.gz
Then, let's log into Postgres, and run the following commands to create a column store foreign table:
-- load extension first time after install CREATE EXTENSION cstore_fdw; -- create server object CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw; -- create foreign table CREATE FOREIGN TABLE customer_reviews ( customer_id TEXT, review_date DATE, review_rating INTEGER, review_votes INTEGER, review_helpful_votes INTEGER, product_id CHAR(10), product_title TEXT, product_sales_rank BIGINT, product_group TEXT, product_category TEXT, product_subcategory TEXT, similar_product_ids CHAR(10) ) SERVER cstore_server OPTIONS(compression 'pglz');
Next, we load data into the table:
COPY customer_reviews FROM '/home/user/customer_reviews_1998.csv' WITH CSV; COPY customer_reviews FROM '/home/user/customer_reviews_1999.csv' WITH CSV;
Note. If you are getting
ERROR: cannot copy to foreign table "customer_reviews" when trying to run the COPY commands, double check that you have added cstore_fdw to
postgresql.conf and restarted Postgres.
Next, we collect data distribution statistics about the table. This is optional, but usually very helpful:
Finally, let's run some example SQL queries on the column store table.
-- Find all reviews a particular customer made on the Dune series in 1998. SELECT customer_id, review_date, review_rating, product_id, product_title FROM customer_reviews WHERE customer_id ='A27T7HVDXA3K2A' AND product_title LIKE '%Dune%' AND review_date >= '1998-01-01' AND review_date <= '1998-12-31'; -- Do we have a correlation between a book's title's length and its review ratings? SELECT width_bucket(length(product_title), 1, 50, 5) title_length_bucket, round(avg(review_rating), 2) AS review_average, count(*) FROM customer_reviews WHERE product_group = 'Book' GROUP BY title_length_bucket ORDER BY title_length_bucket;
The example above illustrated how to load data into a PostgreSQL database running on a single host. However, sometimes your data is too large to analyze effectively on a single host. Citus is a product built by Citus Data that allows you to run a distributed PostgreSQL database to analyze your data using the power of multiple hosts. You can easily install and run other PostgreSQL extensions and foreign data wrappers—including cstore_fdw—alongside Citus.
You can create a cstore_fdw table and distribute it using the
master_create_distributed_table() UDF just like any other table. You can load data using the
copy command as you would do in single node PostgreSQL.
cstore_fdw partitions each column into multiple blocks. Skip indexes store minimum and maximum values for each of these blocks. While scanning the table, if min/max values of the block contradict the WHERE clause, then the block is completely skipped. This way, the query processes less data and hence finishes faster.
To use skip indexes more efficiently, you should load the data after sorting it on a column that is commonly used in the WHERE clause. This ensures that there is a minimum overlap between blocks and the chance of them being skipped is higher.
In practice, the data generally has an inherent dimension (for example a time field) on which it is naturally sorted. Usually, the queries also have a filter clause on that column (for example you want to query only the last week's data), and hence you don't need to sort the data in such cases.
Before uninstalling the extension, first you need to drop all the cstore tables:
postgres=# DROP FOREIGN TABLE cstore_table_1; ... postgres=# DROP FOREIGN TABLE cstore_table_n;
Then, you should drop the cstore server and extension:
postgres=# DROP SERVER cstore_server; postgres=# DROP EXTENSION cstore_fdw;
cstore_fdw automatically creates some directories inside the PostgreSQL's data directory to store its files. To remove them, you can run:
$ rm -rf $PGDATA/cstore_fdw
Then, you should remove cstore_fdw from
shared_preload_libraries in your
shared_preload_libraries = '' # (change requires restart)
Finally, to uninstall the extension you can run the following command in the extension's source code directory. This will clean up all the files copied during the installation:
$ sudo PATH=/usr/local/pgsql/bin/:$PATH make uninstall