At this stage we don't even parse the details of the Redshift identity, such
as the seed and step values, and treat it the same as a MySQL
auto_increment extra description field.
Fixes #860 (again).
Redshift does not support array_agg(), unnest() and other very useful
PostgreSQL functions either. Redshift is from 8.0 times, so do things the old
way: parse the index definition that we get from calling pg_get_indexdef().
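As an illustration of parsing an index definition string of the `CREATE INDEX ...` form, here is a minimal sketch in Python rather than pgloader's own Common Lisp. The function name `parse_index_def` and the returned dictionary layout are hypothetical, and the regular expression only covers plain column lists, not expression or partial indexes.

```python
import re

# Hypothetical pattern: handles "CREATE [UNIQUE] INDEX name ON table USING method (cols)".
# Expression indexes and WHERE clauses are deliberately out of scope for this sketch.
INDEX_DEF = re.compile(
    r"CREATE\s+(?P<unique>UNIQUE\s+)?INDEX\s+(?P<name>\S+)\s+ON\s+(?P<table>\S+)"
    r"\s+USING\s+(?P<method>\w+)\s+\((?P<columns>[^)]*)\)",
    re.IGNORECASE,
)


def parse_index_def(definition):
    """Split an index definition string into its main parts."""
    match = INDEX_DEF.match(definition.strip())
    if match is None:
        raise ValueError(f"unsupported index definition: {definition!r}")
    return {
        "name": match["name"],
        "table": match["table"],
        "unique": match["unique"] is not None,
        "method": match["method"],
        "columns": [col.strip() for col in match["columns"].split(",")],
    }
```

Calling `parse_index_def("CREATE UNIQUE INDEX users_pkey ON public.users USING btree (id, email)")` yields the index name, the table, the uniqueness flag, the access method and the column list as separate fields.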
For that, this patch introduces the notion of SQL support that depends on
PostgreSQL major version. If no major-version specific query is found in the
pgloader source tree, then we use the generic one.
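The lookup described above could be sketched as follows, in Python rather than pgloader's own Common Lisp. The directory layout (a generic `<name>.sql` next to a `<major>/<name>.sql` override), the query name `list-all-indexes` and the function names are assumptions made for illustration, not pgloader's actual source tree.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_query(root: Path, name: str, major_version: Optional[str] = None) -> str:
    """Return the SQL text for `name`, preferring a version-specific file."""
    candidates = []
    if major_version is not None:
        # Hypothetical layout: <root>/<major>/<name>.sql overrides the generic query.
        candidates.append(root / major_version / f"{name}.sql")
    candidates.append(root / f"{name}.sql")
    for path in candidates:
        if path.is_file():
            return path.read_text()
    raise FileNotFoundError(f"no SQL query named {name!r} under {root}")


def demo() -> tuple:
    """Build a throwaway query tree and resolve one query for two versions."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "8.0").mkdir()
        (root / "list-all-indexes.sql").write_text("-- generic query")
        (root / "8.0" / "list-all-indexes.sql").write_text("-- 8.0 query")
        return (
            find_query(root, "list-all-indexes", "8.0"),
            find_query(root, "list-all-indexes", "10"),
        )
```

In `demo()`, the 8.0 lookup finds the override while the lookup for version 10 falls back to the generic file.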
Fixes #860.
It turns out that the rules about the names of users and databases are more
lax than pgloader would know, so it might be a good move for our DSN parsing
to accept more values and then let the source/target systems complain
when something goes wrong.
See #230, which got broken again somewhere.
With this patch, the following distribution rule:

    distribute companies using id

is equivalent to the following distribution rule set, given foreign keys in
the source schema:

    distribute companies using id
    distribute campaigns using company_id
    distribute ads using company_id from campaigns
    distribute clicks using company_id from ads, campaigns
    distribute impressions using company_id from ads, campaigns

In the current code (of this patch) pgloader walks the foreign-keys
dependency tree and knows how to automatically derive distribution rules
from a single rule and the foreign keys.
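The derivation can be sketched as a breadth-first walk over the foreign keys. This is a Python illustration, not pgloader's Common Lisp implementation: the `ForeignKey` and `Rule` types, the `derive_rules` function and the foreign key columns `campaign_id` and `ad_id` are all hypothetical.

```python
from collections import deque
from typing import NamedTuple


class ForeignKey(NamedTuple):
    table: str       # referencing table
    column: str      # referencing column
    ref_table: str   # referenced table
    ref_column: str  # referenced column


class Rule(NamedTuple):
    table: str
    key: str
    via: tuple = ()  # tables to go through to reach the distribution key

    def __str__(self):
        text = f"distribute {self.table} using {self.key}"
        if self.via:
            text += " from " + ", ".join(self.via)
        return text


def derive_rules(root, foreign_keys):
    """Walk the foreign keys breadth-first, starting from a single rule."""
    rules = {root.table: root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for fk in foreign_keys:
            if fk.ref_table != parent.table or fk.table in rules:
                continue
            if fk.ref_column == parent.key:
                # The referencing column already holds the distribution key.
                rule = Rule(fk.table, fk.column)
            else:
                # The key has to be fetched through the parent table chain.
                rule = Rule(fk.table, parent.key, (parent.table,) + parent.via)
            rules[fk.table] = rule
            queue.append(rule)
    return list(rules.values())


# Assumed foreign keys for the example schema (column names are made up).
FOREIGN_KEYS = [
    ForeignKey("campaigns", "company_id", "companies", "id"),
    ForeignKey("ads", "campaign_id", "campaigns", "id"),
    ForeignKey("clicks", "ad_id", "ads", "id"),
    ForeignKey("impressions", "ad_id", "ads", "id"),
]
```

With these assumed foreign keys, `derive_rules(Rule("companies", "id"), FOREIGN_KEYS)` produces rules whose string forms match the five-rule set listed above.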
With this patch it's now actually possible to backfill the data on the fly
when using the new "distribute" commands. The schema is modified to add the
distribution key where specified, and changes to the primary and foreign
keys happen automatically. Then a JOIN is generated to get the data directly
during the COPY streaming to the Citus cluster.
The idea is for pgloader to tweak the schema from a description of the
sharding model, the distribute clause. Here's an example of such a clause:

    distribute company using id
    distribute campaign using company_id
    distribute ads using company_id from campaign
    distribute clicks using company_id from ads, campaign

Given such commands, pgloader adds the distribution key to the table when
needed, to the primary key definition of the table, and also to the foreign
keys that are pointing to the changed primary key.
Then, when SELECTing the data from the source database, the idea is for
pgloader to automatically JOIN the base table with the source table that
holds the distribution key, in case it was just added to the schema.
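To make the generated JOIN concrete, here is a minimal Python sketch that builds such a SELECT from one distribute rule. The function `backfill_select`, the `join_columns` mapping and the columns `ad_id` and `campaign_id` are hypothetical, the sketch assumes the distribution key lives in the last table of the `from` list, and it does no identifier quoting.

```python
def backfill_select(table, key, via, join_columns):
    """Build a SELECT that fetches `key` through the tables listed in `via`.

    `join_columns` maps (child, parent) to (child_column, parent_column);
    this mapping is an assumption standing in for the source foreign keys.
    """
    if not via:
        # The distribution key is already a column of the table itself.
        return f"SELECT {table}.* FROM {table}"
    sql = [f"SELECT {table}.*, {via[-1]}.{key}", f"  FROM {table}"]
    child = table
    for parent in via:
        child_col, parent_col = join_columns[(child, parent)]
        sql.append(f"  JOIN {parent} ON {child}.{child_col} = {parent}.{parent_col}")
        child = parent
    return "\n".join(sql)


# Assumed join columns for "distribute clicks using company_id from ads, campaign".
JOIN_COLUMNS = {
    ("clicks", "ads"): ("ad_id", "id"),
    ("ads", "campaign"): ("campaign_id", "id"),
}
```

For the `clicks` rule, `backfill_select("clicks", "company_id", ["ads", "campaign"], JOIN_COLUMNS)` joins `clicks` to `ads` and then to `campaign`, and selects `campaign.company_id` alongside the original columns.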
Finally, pgloader also calls the following Citus commands:

    SELECT create_distributed_table('company', 'id');
    SELECT create_distributed_table('campaign', 'company_id');
    SELECT create_distributed_table('ads', 'company_id');
    SELECT create_distributed_table('clicks', 'company_id');

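For illustration, these commands follow mechanically from the (table, distribution key) pairs of the distribute clause. The Python helper below is a hypothetical sketch, not pgloader code; it only formats the statements and does not connect to a Citus cluster.

```python
def citus_distribution_commands(rules):
    """Format one create_distributed_table call per (table, key) pair."""

    def quote(value):
        # Minimal SQL string literal quoting: double any embedded single quote.
        return "'" + value.replace("'", "''") + "'"

    return [
        f"SELECT create_distributed_table({quote(table)}, {quote(key)});"
        for table, key in rules
    ]


RULES = [
    ("company", "id"),
    ("campaign", "company_id"),
    ("ads", "company_id"),
    ("clicks", "company_id"),
]
```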
The catalog queries used in pgloader have to be adjusted for Redshift
because it forked from PostgreSQL 8.0, which was a long time ago now.
Also, we had a couple bugs here and there that were not really related to
Redshift support but were shown in that context.
Fixes #813.
It turns out that when trying to debug "decoding as", the SQL type listing
support in sqltype-list was found to be broken, so this patch fixes it. It
then goes on to fix the DECODING AS filters support: we had switched to the
better regexp-or-string filter struct but forgot to update the matching
code accordingly.
Fixes #665.
When dealing with PostgreSQL protocol compatible databases, often enough
they don't support the same catalogs as PostgreSQL itself. Redshift for
instance lacks foreign key support.
We now accept the more general string and regex match rules, but the code to
generate including and excluding lists from the catalogs had not been updated.
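A minimal sketch of matching names against a mix of string and regex rules, written in Python rather than pgloader's Common Lisp. The function names `matches` and `filter_tables` are hypothetical, and the semantics chosen here (a string must match exactly, a regex may match anywhere, excluding is applied after including) are assumptions for the example.

```python
import re


def matches(rule, name):
    """A compiled regex matches anywhere in the name; a string must be equal."""
    if isinstance(rule, re.Pattern):
        return rule.search(name) is not None
    return rule == name


def filter_tables(names, including=(), excluding=()):
    """Keep names matching any including rule, then drop excluded ones."""
    kept = [
        name for name in names
        if not including or any(matches(rule, name) for rule in including)
    ]
    return [
        name for name in kept
        if not any(matches(rule, name) for rule in excluding)
    ]
```

For instance, including `re.compile(r"^user")` and the string `"orders"` while excluding `re.compile(r"_logs$")` keeps `users` and `orders` out of `users`, `user_logs`, `orders` and `tmp_orders`.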
It's now possible to use pgloader to migrate from PostgreSQL to PostgreSQL.
That might be useful for several reasons, including applying user defined
cast rules at COPY time, or just moving from one hosted solution to another.
* Add dockerfiles to .dockerignore
Otherwise changes in the dockerfiles would invalidate the cache
* Rewrite Dockerfile
- Fix deprecated MAINTAINER instruction
- Move maintainer label to the bottom (improving cache)
- Tidy up apt-get
- Use COPY instead of ADD
see https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#add-or-copy
- Remove WORKDIR instruction (we don't really need this)
- Combine remaining RUN layers to reduce layer count
- Move final binary instead of copying (reduce image size)
* Use -slim image and multistage build
Reduce size by using multistage builds and the -slim image.
Use debian:stable instead of a specific code name (future proof).
* [cosmetic] indent Dockerfile instructions
Make it easier to see where a new build stage begins
* Rewrite Dockerfile.ccl
Apply the same changes to Dockerfile.ccl as we did for Dockerfile
Given the variety of ways to set up default behavior for datetime and
timestamp data types in MySQL, we need yet more default casting rules. It
might be time to think about a more principled way to solve the problem, but
on the other hand, this ad-hoc one also comes with full overriding
flexibility for the end user.
Fixes #811.
Travis spotted a bug with CCL that I failed to see, and that happens with
Clozure-CL but not with SBCL apparently:
2018-07-03T21:04:11.053795Z FATAL The value "\\\"", derived from the initarg :DELIMITER, can not be used to set the value of the slot CL-CSV::DELIMITER in #<CL-CSV::READ-DISPATCH-TABLE-ENTRY #x30200143DDCD>, because it is not of type (VECTOR (OR (MEMBER T NIL) CHARACTER)).
To fix, prefer the syntax #(#\\ #\") rather than "\\\"".
It's easy to avoid the warning about an unused lexical variable with the
proper declaration, which I failed to install before because of a syntax
error when I tried. Let's fix it now that I realise what was wrong.