From 56d24de67a3a31a92f514ff1db5eb686ec2d0a63 Mon Sep 17 00:00:00 2001 From: Dimitri Fontaine Date: Tue, 11 Dec 2018 14:25:08 +0900 Subject: [PATCH] Update documentation with new features. We have a lot of new features to document. This is a first patch about that, some more work is to be done. That said, it's better than nothing already. --- docs/index.rst | 4 + docs/intro.rst | 8 + docs/pgloader.rst | 28 +++ docs/ref/mysql.rst | 15 +- docs/ref/pgsql-citus-target.rst | 77 ++++++ docs/ref/pgsql-redshift-source.rst | 12 + docs/ref/pgsql-redshift-target.rst | 10 + docs/ref/pgsql.rst | 371 +++++++++++++++++++++++++++++ 8 files changed, 521 insertions(+), 4 deletions(-) create mode 100644 docs/ref/pgsql-citus-target.rst create mode 100644 docs/ref/pgsql-redshift-source.rst create mode 100644 docs/ref/pgsql-redshift-target.rst create mode 100644 docs/ref/pgsql.rst diff --git a/docs/index.rst b/docs/index.rst index d69915e..3fb2f9a 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -22,6 +22,10 @@ Welcome to pgloader's documentation! ref/mysql ref/sqlite ref/mssql + ref/pgsql + ref/pgsql-citus-target + ref/pgsql-redshift-source + ref/pgsql-redshift-target ref/transforms bugreport diff --git a/docs/intro.rst b/docs/intro.rst index 0dc75e2..2a098d9 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -17,6 +17,14 @@ pgloader knows how to read data from different kind of sources: * SQLite * MySQL * MS SQL Server + * PostgreSQL + * Redshift + +pgloader knows how to target different products using the PostgresQL Protocol: + + * PostgreSQL + * `Citus `_ + * Redshift The level of automation provided by pgloader depends on the data source type. In the case of CSV and Fixed Format files, a full description of the diff --git a/docs/pgloader.rst b/docs/pgloader.rst index 4a8cade..00fa186 100644 --- a/docs/pgloader.rst +++ b/docs/pgloader.rst @@ -154,6 +154,18 @@ Those options are meant to tweak `pgloader` behavior when loading data. machine code) another version of itself, usually a newer one like a very recent git checkout. + * `--no-ssl-cert-verification` + + Uses the OpenSSL option to accept a locally issued server-side + certificate, avoiding the following error message:: + + SSL verify error: 20 X509_V_ERR_UNABLE_TO_GET_ISSUER_CERT_LOCALLY + + The right way to fix the SSL issue is to use a trusted certificate, of + course. Sometimes though it's useful to make progress with the pgloader + setup while the certificate chain of trust is being fixed, maybe by + another team. That's when this option is useful. + Command Line Only Operations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -552,6 +564,22 @@ queries from a SQL file. Implements support for PostgreSQL dollar-quoting and the `\i` and `\ir` include facilities as in `psql` batch mode (where they are the same thing). +AFTER CREATE SCHEMA DO +^^^^^^^^^^^^^^^^^^^^^^ + +Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in that +section are executed once the schema has been craeted by pgloader, and +before the data is loaded. It's the right time to ALTER TABLE or do some +custom implementation on-top of what pgloader does, like maybe partitioning. + +AFTER CREATE SCHEMA EXECUTE +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Same behaviour as in the *AFTER CREATE SCHEMA DO* clause. Allows you to read +the SQL queries from a SQL file. Implements support for PostgreSQL +dollar-quoting and the `\i` and `\ir` include facilities as in `psql` batch +mode (where they are the same thing). + Connection String ^^^^^^^^^^^^^^^^^ diff --git a/docs/ref/mysql.rst b/docs/ref/mysql.rst index dec33a9..a55f5d7 100644 --- a/docs/ref/mysql.rst +++ b/docs/ref/mysql.rst @@ -1,10 +1,9 @@ Migrating a MySQL Database to PostgreSQL ======================================== -This command instructs pgloader to load data from a database connection. The -only supported database source is currently *MySQL*, and pgloader supports -dynamically converting the schema of the source database and the indexes -building. +This command instructs pgloader to load data from a database connection. +pgloader supports dynamically converting the schema of the source database +and the indexes building. A default set of casting rules are provided and might be overloaded and appended to by the command. @@ -609,6 +608,14 @@ Date:: to timestamptz drop default using zero-dates-to-null + type datetime with extra on update current timestamp when not null + to timestamptz drop not null drop default + using zero-dates-to-null + + type datetime with extra on update current timestamp + to timestamptz drop default + using zero-dates-to-null + type timestamp when default "0000-00-00 00:00:00" and not null to timestamptz drop not null drop default using zero-dates-to-null diff --git a/docs/ref/pgsql-citus-target.rst b/docs/ref/pgsql-citus-target.rst new file mode 100644 index 0000000..257e081 --- /dev/null +++ b/docs/ref/pgsql-citus-target.rst @@ -0,0 +1,77 @@ +Migrating a PostgreSQL Database to Citus +======================================== + +This command instructs pgloader to load data from a database connection. +Automatic discovery of the schema is supported, including build of the +indexes, primary and foreign keys constraints. A default set of casting +rules are provided and might be overloaded and appended to by the command. + +Automatic distribution column backfilling is supported, either from commands +that specify what is the distribution column in every table, or only in the +main table, then relying on foreign key constraints to discover the other +distribution keys. + +Here's a short example of migrating a database from a PostgreSQL server to +another: + +:: + + load database + from pgsql:///hackathon + into pgsql://localhost:9700/dim + + with include drop, reset no sequences + + cast column impressions.seen_at to "timestamp with time zone" + + distribute companies using id + -- distribute campaigns using company_id + -- distribute ads using company_id from campaigns + -- distribute clicks using company_id from ads, campaigns + -- distribute impressions using company_id from ads, campaigns + ; + +Everything works exactly the same way as when doing a PostgreSQL to +PostgreSQL migration, with the added fonctionality of this new `distribute` +command. + +Distribute Command +^^^^^^^^^^^^^^^^^^ + +The distribute command syntax is as following:: + + distribute using + distribute
using from
[,
, ...] + distribute
as reference table + +When using the distribute command, the following steps are added to pgloader +operations when migrating the schema: + + - if the distribution column does not exist in the table, it is added as + the first column of the table + + - if the distribution column does not exists in the primary key of the + table, it is added as the first column of the primary of the table + + - all the foreign keys that point to the table are added the distribution + key automatically too, including the source tables of the foreign key + constraints + + - once the schema has been created on the target database, pgloader then + issues Citus specific command `create_reference_table() + `_ + and `create_distributed_table() + `_ + to make the tables distributed + +Those operations are done in the schema section of pgloader, before the data +is loaded. When the data is loaded, the newly added columns need to be +backfilled from referenced data. pgloader knows how to do that by generating +a query like the following and importing the result set of such a query +rather than the raw data from the source table. + +Citus Migration: Limitations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The way pgloader implements *reset sequence* does not work with Citus at +this point, so sequences need to be taken care of separately at this point. diff --git a/docs/ref/pgsql-redshift-source.rst b/docs/ref/pgsql-redshift-source.rst new file mode 100644 index 0000000..b69b6d9 --- /dev/null +++ b/docs/ref/pgsql-redshift-source.rst @@ -0,0 +1,12 @@ +Migrating a Redhift Database to PostgreSQL +========================================== + +This command instructs pgloader to load data from a database connection. +Automatic discovery of the schema is supported, including build of the +indexes, primary and foreign keys constraints. A default set of casting +rules are provided and might be overloaded and appended to by the command. + +The command and behavior are the same as when migration from a PostgreSQL +database source. pgloader automatically discovers that it's talking to a +Redshift database by parsing the output of the `SELECT version()` SQL query. + diff --git a/docs/ref/pgsql-redshift-target.rst b/docs/ref/pgsql-redshift-target.rst new file mode 100644 index 0000000..50cc356 --- /dev/null +++ b/docs/ref/pgsql-redshift-target.rst @@ -0,0 +1,10 @@ +Migrating a PostgreSQL Database to Redshift +=========================================== + +This command instructs pgloader to load data from a database connection. +Automatic discovery of the schema is supported, including build of the +indexes, primary and foreign keys constraints. A default set of casting +rules are provided and might be overloaded and appended to by the command. + + +TODO: add details about S3 credentials and bucket configuration. diff --git a/docs/ref/pgsql.rst b/docs/ref/pgsql.rst new file mode 100644 index 0000000..d233ffa --- /dev/null +++ b/docs/ref/pgsql.rst @@ -0,0 +1,371 @@ +Migrating a PostgreSQL Database to PostgreSQL +============================================= + +This command instructs pgloader to load data from a database connection. +Automatic discovery of the schema is supported, including build of the +indexes, primary and foreign keys constraints. A default set of casting +rules are provided and might be overloaded and appended to by the command. + +Here's a short example of migrating a database from a PostgreSQL server to +another: + +:: + + load database + from pgsql://localhost/pgloader + into pgsql://localhost/copy + + including only table names matching 'bits', ~/utilisateur/ in schema 'mysql' + including only table names matching ~/geolocations/ in schema 'public' + ; + +PostgreSQL Database Source Specification: FROM +---------------------------------------------- + +Must be a connection URL pointing to a PostgreSQL database. + +See the `SOURCE CONNECTION STRING` section above for details on how to write +the connection string. + +:: + + pgsql://[user[:password]@][netloc][:port][/dbname][?option=value&...] + + +PostgreSQL Database Migration Options: WITH +------------------------------------------- + +When loading from a `PostgreSQL` database, the following options are +supported, and the default *WITH* clause is: *no truncate*, *create schema*, +*create tables*, *include drop*, *create indexes*, *reset sequences*, +*foreign keys*, *downcase identifiers*, *uniquify index names*, *reindex*. + + - *include drop* + + When this option is listed, pgloader drops all the tables in the target + PostgreSQL database whose names appear in the MySQL database. This + option allows for using the same command several times in a row until + you figure out all the options, starting automatically from a clean + environment. Please note that `CASCADE` is used to ensure that tables + are dropped even if there are foreign keys pointing to them. This is + precisely what `include drop` is intended to do: drop all target tables + and recreate them. + + Great care needs to be taken when using `include drop`, as it will + cascade to *all* objects referencing the target tables, possibly + including other tables that are not being loaded from the source DB. + + - *include no drop* + + When this option is listed, pgloader will not include any `DROP` + statement when loading the data. + + - *truncate* + + When this option is listed, pgloader issue the `TRUNCATE` command + against each PostgreSQL table just before loading data into it. + + - *no truncate* + + When this option is listed, pgloader issues no `TRUNCATE` command. + + - *disable triggers* + + When this option is listed, pgloader issues an `ALTER TABLE ... DISABLE + TRIGGER ALL` command against the PostgreSQL target table before copying + the data, then the command `ALTER TABLE ... ENABLE TRIGGER ALL` once the + `COPY` is done. + + This option allows loading data into a pre-existing table ignoring the + *foreign key constraints* and user defined triggers and may result in + invalid *foreign key constraints* once the data is loaded. Use with + care. + + - *create tables* + + When this option is listed, pgloader creates the table using the meta + data found in the `MySQL` file, which must contain a list of fields with + their data type. A standard data type conversion from DBF to PostgreSQL + is done. + + - *create no tables* + + When this option is listed, pgloader skips the creation of table before + loading data, target tables must then already exist. + + Also, when using *create no tables* pgloader fetches the metadata from + the current target database and checks type casting, then will remove + constraints and indexes prior to loading the data and install them back + again once the loading is done. + + - *create indexes* + + When this option is listed, pgloader gets the definitions of all the + indexes found in the MySQL database and create the same set of index + definitions against the PostgreSQL database. + + - *create no indexes* + + When this option is listed, pgloader skips the creating indexes. + + - *drop indexes* + + When this option is listed, pgloader drops the indexes in the target + database before loading the data, and creates them again at the end + of the data copy. + + - *reindex* + + When this option is used, pgloader does both *drop indexes* before + loading the data and *create indexes* once data is loaded. + + - *drop schema* + + When this option is listed, pgloader drops the target schema in the + target PostgreSQL database before creating it again and all the objects + it contains. The default behavior doesn't drop the target schemas. + + - *foreign keys* + + When this option is listed, pgloader gets the definitions of all the + foreign keys found in the MySQL database and create the same set of + foreign key definitions against the PostgreSQL database. + + - *no foreign keys* + + When this option is listed, pgloader skips creating foreign keys. + + - *reset sequences* + + When this option is listed, at the end of the data loading and after the + indexes have all been created, pgloader resets all the PostgreSQL + sequences created to the current maximum value of the column they are + attached to. + + The options *schema only* and *data only* have no effects on this + option. + + - *reset no sequences* + + When this option is listed, pgloader skips resetting sequences after the + load. + + The options *schema only* and *data only* have no effects on this + option. + + - *downcase identifiers* + + When this option is listed, pgloader converts all MySQL identifiers + (table names, index names, column names) to *downcase*, except for + PostgreSQL *reserved* keywords. + + The PostgreSQL *reserved* keywords are determined dynamically by using + the system function `pg_get_keywords()`. + + - *quote identifiers* + + When this option is listed, pgloader quotes all MySQL identifiers so + that their case is respected. Note that you will then have to do the + same thing in your application code queries. + + - *schema only* + + When this option is listed pgloader refrains from migrating the data + over. Note that the schema in this context includes the indexes when the + option *create indexes* has been listed. + + - *data only* + + When this option is listed pgloader only issues the `COPY` statements, + without doing any other processing. + + - *rows per range* + + How many rows are fetched per `SELECT` query when using *multiple + readers per thread*, see above for details. + +PostgreSQL Database Casting Rules +--------------------------------- + +The command *CAST* introduces user-defined casting rules. + +The cast clause allows to specify custom casting rules, either to overload +the default casting rules or to amend them with special cases. + +A casting rule is expected to follow one of the forms:: + + type [ ... ] to [