diff --git a/docs/batches.rst b/docs/batches.rst
new file mode 100644
index 0000000..07227c3
--- /dev/null
+++ b/docs/batches.rst
@@ -0,0 +1,123 @@
+Batch Processing
+================
+
+To load data to PostgreSQL, pgloader uses the `COPY` streaming protocol.
+While this is the fastest way to load data, `COPY` has an important
+drawback: as soon as PostgreSQL emits an error with any bit of data sent to
+it, whatever the problem is, the whole data set is rejected by PostgreSQL.
+
+To work around that, pgloader cuts the data into *batches* of 25000 rows
+each, so that when a problem occurs it only impacts that many rows of
+data. Each batch is kept in memory while the `COPY` streaming happens, in
+order to be able to handle errors should some happen.
+
+When PostgreSQL rejects the whole batch, pgloader logs the error message
+then isolates the bad row(s) from the accepted ones by retrying the batched
+rows in smaller batches. To do that, pgloader parses the *CONTEXT* error
+message from the failed COPY, as the message contains the line number where
+the error was found in the batch, as in the following example::
+
+    CONTEXT: COPY errors, line 3, column b: "2006-13-11"
+
+Using that information, pgloader will reload all rows in the batch before
+the erroneous one, log the erroneous one as rejected, then try loading the
+remainder of the batch in a single attempt, which may or may not contain
+other erroneous data.
+
+At the end of a load containing rejected rows, you will find two files in
+the *root-dir* location, under a directory named the same as the target
+database of your setup. The filenames are the target table, and their
+extensions are `.dat` for the rejected data and `.log` for the file
+containing the full PostgreSQL client side logs about the rejected data.
+
+The `.dat` file is formatted in the PostgreSQL text COPY format as
+documented in
+`http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609`.
+
+It is possible to use the following WITH options to control pgloader batch
+behavior:
+
+  - *on error stop*, *on error resume next*
+
+    This option controls whether pgloader builds batches of data at all.
+    The batch implementation allows pgloader to recover from errors by
+    sending again the data that PostgreSQL accepts, and by keeping away
+    the data that PostgreSQL rejects.
+
+    To enable retrying the data and loading the good parts, use the option
+    *on error resume next*, which is the default for file based data loads
+    (such as CSV, IXF or DBF).
+
+    When migrating from another RDBMS technology, it's best to have a
+    reproducible loading process. In that case it's possible to use *on
+    error stop* and fix either the casting rules, the data transformation
+    functions or in some cases the input data until your migration runs
+    through completion. That's why *on error stop* is the default for
+    SQLite, MySQL and MS SQL source kinds.
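+
+For illustration, here is a sketch of a CSV load using that option; the
+file and table names are made up for the example::
+
+    LOAD CSV
+         FROM 'data/items.csv'
+         INTO postgresql:///inventory?tablename=items
+         WITH on error resume next;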
+
+A Note About Performance
+------------------------
+
+pgloader has been developed with performance in mind, to be able to cope
+with ever growing needs in loading large amounts of data into PostgreSQL.
+
+The basic architecture it uses is the old Unix pipe model, where a thread
+is responsible for loading the data (reading a CSV file, querying MySQL,
+etc) and fills pre-processed data into a queue. Another thread feeds from
+the queue, applies some more *transformations* to the input data, and
+streams the end result to PostgreSQL using the COPY protocol.
+
+When given a file that the PostgreSQL `COPY` command knows how to parse,
+and if the file contains no erroneous data, then pgloader will never be as
+fast as just using the PostgreSQL `COPY` command.
+
+Note that while the `COPY` command is restricted to reading either from
+its standard input or from a local file on the server's file system, the
+command line tool `psql` implements a `\copy` command that knows how to
+stream a file local to the client over the network and into the PostgreSQL
+server, using the same protocol as pgloader uses.
+
+A Note About Parallelism
+------------------------
+
+pgloader uses several concurrent tasks to process the data being loaded:
+
+  - a reader task reads the data in and pushes it to a queue,
+
+  - at least one writer task feeds from the queue and formats the raw data
+    into the PostgreSQL COPY format in batches (so that it's possible to
+    then retry a failed batch without reading the data from source again),
+    and then sends the data to PostgreSQL using the COPY protocol.
+
+The *workers* parameter controls how many worker threads are allowed to be
+active at any time (that's the parallelism level); the *concurrency*
+parameter controls how many tasks are started to handle the data (they may
+not all run at the same time, depending on the *workers* setting).
+
+We allow *workers* simultaneous workers to be active at the same time in
+the context of a single table. A single unit of work consists of several
+kinds of workers:
+
+  - a reader getting raw data from the source,
+  - N writers preparing and sending the data down to PostgreSQL.
+
+The N here is set by the *concurrency* parameter: with a *concurrency* of
+2, we start (+ 1 2) = 3 concurrent tasks, with a *concurrency* of 4 we
+start (+ 1 4) = 5 concurrent tasks, of which only *workers* may be active
+simultaneously.
+
+The defaults are `workers = 4, concurrency = 1` when loading from a
+database source, and `workers = 8, concurrency = 2` when loading from
+something else (currently, a file). Those defaults are arbitrary and
+awaiting feedback from users, so please consider providing feedback if you
+play with the settings.
+
+As the `CREATE INDEX` threads started by pgloader are only waiting until
+PostgreSQL is done with the real work, those threads are *NOT* counted
+into the concurrency levels as detailed here.
+
+By default, pgloader starts as many `CREATE INDEX` threads as the maximum
+number of indexes per table found in your source schema. It is possible to
+set the `max parallel create index` *WITH* option to another number in
+case there are just too many of them to create.
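+
+As an illustration, a database migration that raises the parallelism level
+might use the following options; the connection strings are placeholders::
+
+    LOAD DATABASE
+         FROM mysql://user@localhost/source_db
+         INTO postgresql:///target_db
+         WITH workers = 8,
+              concurrency = 2,
+              max parallel create index = 4;
+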
diff --git a/docs/command.rst b/docs/command.rst
new file mode 100644
index 0000000..beb4e00
--- /dev/null
+++ b/docs/command.rst
@@ -0,0 +1,380 @@
+Command Syntax
+==============
+
+pgloader implements a Domain Specific Language that allows setting up
+complex data loading scripts, handling computed columns and on-the-fly
+sanitization of the input data. For more complex data loading scenarios,
+you will be required to learn that DSL's syntax. It's meant to look
+familiar to DBAs by being inspired by SQL where it makes sense, which is
+not that much after all.
+
+The pgloader commands follow the same global grammar rules. Each of them
+might support only a subset of the general options and provide specific
+options.
+
+::
+
+    LOAD <source-type>
+         FROM <source-url>
+             [ HAVING FIELDS <source-level-options> ]
+         INTO <postgresql-url>
+             [ TARGET TABLE [ "<schema>" ]."<table name>" ]
+             [ TARGET COLUMNS <columns-and-options> ]
+
+        [ WITH <load-options> ]
+
+        [ SET <postgresql-settings> ]
+
+        [ BEFORE LOAD [ DO <sql statements> | EXECUTE <sql file> ] ... ]
+        [ AFTER  LOAD [ DO <sql statements> | EXECUTE <sql file> ] ... ]
+        ;
+
+The main clauses are the `LOAD`, `FROM`, `INTO` and `WITH` clauses that
+each command implements. Some commands then implement the `SET` clause, or
+some specific clauses such as the `CAST` clause.
+
+.. _common_clauses:
+
+Command Clauses
+---------------
+
+The pgloader command syntax allows composing clauses together. Some
+clauses are specific to the FROM source-type, most clauses are always
+available.
+
+FROM
+----
+
+The *FROM* clause specifies where to read the data from, and each command
+introduces its own variant of sources. For instance, the *CSV* source
+supports `inline`, `stdin`, a filename, a quoted filename, and a *FILENAME
+MATCHING* clause (see the CSV reference); whereas the *MySQL* source only
+supports a MySQL database URI specification.
+
+INTO
+----
+
+The PostgreSQL connection URI must contain the name of the target table
+where to load the data. That table must have already been created in
+PostgreSQL, and the name might be schema qualified.
+
+The *INTO* option also supports an optional comma separated list of target
+columns, which are either the name of an input *field* or the white space
+separated list of the target column name, its PostgreSQL data type and a
+*USING* expression.
+
+The *USING* expression can be any valid Common Lisp form and will be read
+with the current package set to `pgloader.transforms`, so that you can use
+functions defined in that package, such as functions loaded dynamically
+with the `--load` command line parameter.
+
+Each *USING* expression is compiled at runtime to native code.
+
+This feature allows pgloader to load any number of fields in a CSV file
+into a possibly different number of columns in the database, using custom
+code for that projection.
+
+WITH
+----
+
+Set of options to apply to the command, using a global syntax of either:
+
+  - *key = value*
+  - *use option*
+  - *do not use option*
+
+See each specific command for details.
+
+All data source specific commands support the following options:
+
+  - *on error stop*, *on error resume next*
+  - *batch rows = R*
+  - *batch size = ... MB*
+  - *prefetch rows = ...*
+
+See the section Batch behaviour options for more details.
+
+In addition, the following settings are available:
+
+  - *workers = W*
+  - *concurrency = C*
+  - *max parallel create index = I*
+
+See the section A Note About Parallelism for more details.
+
+SET
+---
+
+This clause allows specifying session parameters to be set for all the
+sessions opened by pgloader. It expects a comma separated list of entries,
+each made of a parameter name, an equal sign, then the single-quoted
+value.
+
+The names and values of the parameters are not validated by pgloader, they
+are given as-is to PostgreSQL.
+
+BEFORE LOAD DO
+--------------
+
+You can run SQL queries against the database before loading the data from
+the `CSV` file. The most common such SQL query is `CREATE TABLE IF NOT
+EXISTS`, so that the data can be loaded.
+
+Each command must be *dollar-quoted*: it must begin and end with a double
+dollar sign, `$$`. Dollar-quoted queries are then comma separated. No
+extra punctuation is expected after the last SQL query.
+
+BEFORE LOAD EXECUTE
+-------------------
+
+Same behaviour as in the *BEFORE LOAD DO* clause. Allows you to read the
+SQL queries from a SQL file. Implements support for PostgreSQL
+dollar-quoting and the `\i` and `\ir` include facilities as in `psql`
+batch mode (where they are the same thing).
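+
+Putting the clauses above together, here is a sketch of a complete
+command; the file, table and column names are only examples::
+
+    LOAD CSV
+         FROM 'data/population.csv' (region, population)
+         INTO postgresql:///census?tablename=population
+              (region, population)
+         WITH skip header = 1,
+              fields terminated by ','
+          SET work_mem to '128MB'
+       BEFORE LOAD DO
+        $$ create table if not exists population
+              (region text, population bigint); $$;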
+
+AFTER LOAD DO
+-------------
+
+Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in that
+section are executed once the load is done. That's the right time to
+create indexes and constraints, or re-enable triggers.
+
+AFTER LOAD EXECUTE
+------------------
+
+Same behaviour as in the *AFTER LOAD DO* clause. Allows you to read the
+SQL queries from a SQL file. Implements support for PostgreSQL
+dollar-quoting and the `\i` and `\ir` include facilities as in `psql`
+batch mode (where they are the same thing).
+
+AFTER CREATE SCHEMA DO
+----------------------
+
+Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in that
+section are executed once the schema has been created by pgloader, and
+before the data is loaded. It's the right time to ALTER TABLE or do some
+custom implementation on top of what pgloader does, such as partitioning.
+
+AFTER CREATE SCHEMA EXECUTE
+---------------------------
+
+Same behaviour as in the *AFTER CREATE SCHEMA DO* clause. Allows you to
+read the SQL queries from a SQL file. Implements support for PostgreSQL
+dollar-quoting and the `\i` and `\ir` include facilities as in `psql`
+batch mode (where they are the same thing).
+
+Connection String
+-----------------
+
+The `<postgresql-url>` parameter is expected to be given as a *Connection
+URI* as documented in the PostgreSQL documentation at
+http://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-CONNSTRING.
+
+::
+
+    postgresql://[user[:password]@][netloc][:port][/dbname][?option=value&...]
+
+Where:
+
+  - *user*
+
+    Can contain any character, including colon (`:`) which must then be
+    doubled (`::`) and at-sign (`@`) which must then be doubled (`@@`).
+
+    When omitted, the *user* name defaults to the value of the `PGUSER`
+    environment variable, and if it is unset, the value of the `USER`
+    environment variable.
+
+  - *password*
+
+    Can contain any character, including the at sign (`@`) which must then
+    be doubled (`@@`). To leave the password empty, when the *user* name
+    ends with an at sign, you then have to use the syntax user:@.
+
+    When omitted, the *password* defaults to the value of the `PGPASSWORD`
+    environment variable if it is set, otherwise the password is left
+    unset.
+
+    When no *password* is found either in the connection URI or in the
+    environment, then pgloader looks for a `.pgpass` file as documented at
+    https://www.postgresql.org/docs/current/static/libpq-pgpass.html. The
+    implementation is not that of `libpq` though. As with `libpq` you can
+    set the environment variable `PGPASSFILE` to point to a `.pgpass`
+    file, and pgloader defaults to `~/.pgpass` on unix like systems and
+    `%APPDATA%\postgresql\pgpass.conf` on windows. Matching rules and
+    syntax are the same as with `libpq`, refer to its documentation.
+
+  - *netloc*
+
+    Can be either a hostname in dotted notation, an IPv4 address, or a
+    Unix domain socket path. Empty is the default network location; on a
+    system providing *unix domain sockets* that method is preferred,
+    otherwise the *netloc* defaults to `localhost`.
+
+    It's possible to force the *unix domain socket* path by using the
+    syntax `unix:/path/to/where/the/socket/file/is`, so to force a non
+    default socket path and a non default port, you would have::
+
+      postgresql://unix:/tmp:54321/dbname
+
+    The *netloc* defaults to the value of the `PGHOST` environment
+    variable, and if it is unset, to the default `unix` socket path when
+    running on a Unix system, and to `localhost` otherwise.
+
+    Socket paths containing colons are supported by doubling the colons
+    within the path, as in the following example::
+
+      postgresql://unix:/tmp/project::region::instance:5432/dbname
+
+  - *dbname*
+
+    Should be a proper identifier (letter followed by a mix of letters,
+    digits and the punctuation signs comma (`,`), dash (`-`) and
+    underscore (`_`)).
+
+    When omitted, the *dbname* defaults to the value of the environment
+    variable `PGDATABASE`, and if that is unset, to the *user* value as
+    determined above.
+
+  - *options*
+
+    The optional parameters must be supplied in the form `name=value`, and
+    you may use several parameters by separating them with an ampersand
+    (`&`) character.
+
+    Only some options are supported here: *tablename* (which might be
+    qualified with a schema name), *sslmode*, *host*, *port*, *dbname*,
+    *user* and *password*.
+
+    The *sslmode* parameter values can be one of `disable`, `allow`,
+    `prefer` or `require`.
+
+    For backward compatibility reasons, it's possible to specify the
+    *tablename* option directly, without spelling out the `tablename=`
+    parts.
+
+    The options override the main URI components when both are given, and
+    using the percent-encoded option parameters allows using passwords
+    starting with a colon and bypassing other URI components parsing
+    limitations.
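+
+As an example, the following URI (the credentials and table name are made
+up) connects to `mydb` and targets a schema-qualified table, preferring an
+SSL connection::
+
+    postgresql://app_user:secret@localhost:5432/mydb?tablename=public.items&sslmode=prefer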
+
+Regular Expressions
+-------------------
+
+Several clauses listed below accept *regular expressions* with the
+following input rules:
+
+  - A regular expression begins with a tilde sign (`~`),
+
+  - is then followed by an opening sign,
+
+  - then any character is allowed and considered part of the regular
+    expression, except for the closing sign,
+
+  - then a closing sign is expected.
+
+The opening and closing signs are allowed in pairs; here's the complete
+list of allowed delimiters::
+
+    ~//
+    ~[]
+    ~{}
+    ~()
+    ~<>
+    ~""
+    ~''
+    ~||
+    ~##
+
+Pick the set of delimiters that doesn't collide with the *regular
+expression* you're trying to input. If your expression is such that none
+of the solutions allow you to enter it, the places where such expressions
+are allowed should allow for a list of expressions.
+
+Comments
+--------
+
+Any command may contain comments, following those input rules:
+
+  - the `--` delimiter begins a comment that ends with the end of the
+    current line,
+
+  - the delimiters `/*` and `*/` respectively start and end a comment,
+    which can be found in the middle of a command or span several lines.
+
+Any place where you could enter a *whitespace* will accept a comment too.
+
+Batch behaviour options
+-----------------------
+
+All pgloader commands have support for a *WITH* clause that allows for
+specifying options. Some options are generic and accepted by all commands,
+such as the *batch behaviour options*, and some options are specific to a
+data source kind, such as the CSV *skip header* option.
+
+The global batch behaviour options are:
+
+  - *batch rows*
+
+    Takes a numeric value as argument, used as the maximum number of rows
+    allowed in a batch. The default is `25000` and can be changed to try
+    having better performance characteristics or to control pgloader
+    memory usage;
+
+  - *batch size*
+
+    Takes a memory unit as argument, such as *20 MB*, its default value.
+    Accepted multipliers are *kB*, *MB*, *GB*, *TB* and *PB*. The case is
+    important so as not to be confused about bits versus bytes; we're only
+    talking bytes here.
+
+  - *prefetch rows*
+
+    Takes a numeric value as argument, defaults to `100000`. That's the
+    number of rows that pgloader is allowed to read in memory in each
+    reader thread. See the *workers* setting for how many reader threads
+    are allowed to run at the same time.
+
+Other options are specific to each input source; please refer to specific
+parts of the documentation for their listing and coverage.
+
+A batch is then closed as soon as either the *batch rows* or the *batch
+size* threshold is crossed, whichever comes first. In cases when a batch
+has to be closed because of the *batch size* setting, a *debug* level log
+message is printed with how many rows fit in the *oversized* batch.
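+
+For instance, a load could combine those options as follows; the values
+are chosen only for illustration::
+
+    LOAD CSV
+         FROM 'data/huge.csv'
+         INTO postgresql:///target?tablename=huge
+         WITH batch rows = 50000,
+              batch size = 40 MB,
+              prefetch rows = 200000;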
+
+Templating with Mustache
+------------------------
+
+pgloader implements the https://mustache.github.io/ templating system so
+that you may have dynamic parts of your commands. See the documentation
+for this template system online.
+
+A specific feature of pgloader is the ability to fetch a variable from the
+OS environment of the pgloader process, making it possible to run pgloader
+as in the following example::
+
+    $ DBPATH=sqlite/sqlite.db pgloader ./test/sqlite-env.load
+
+or in several steps::
+
+    $ export DBPATH=sqlite/sqlite.db
+    $ pgloader ./test/sqlite-env.load
+
+The variable can then be used in a typical mustache fashion::
+
+    load database
+         from '{{DBPATH}}'
+         into postgresql:///pgloader;
+
+It's also possible to prepare an INI file such as the following::
+
+    [pgloader]
+
+    DBPATH = sqlite/sqlite.db
+
+And run the following command, feeding the INI values as a *context* for
+the pgloader templating system::
+
+    $ pgloader --context ./test/sqlite.ini ./test/sqlite-ini.load
+
+The mustache templates implementation with OS environment support replaces
+the former `GETENV` implementation, which didn't work anyway.
diff --git a/docs/index.rst b/docs/index.rst
index c8a26a2..efd7c84 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -6,6 +6,14 @@
 Welcome to pgloader's documentation!
 ====================================
 
+The `pgloader`__ project is an Open Source Software project. The
+development happens at `https://github.com/dimitri/pgloader`__ and is
+public: everyone is welcome to participate by opening issues, pull
+requests, giving feedback, etc.
+
+__ https://github.com/dimitri/pgloader
+__ https://github.com/dimitri/pgloader
+
 pgloader loads data from various sources into PostgreSQL. It can transform
 the data it reads on the fly and submit raw SQL before and after the
 loading. It uses the `COPY` PostgreSQL protocol to stream the data into the
@@ -238,28 +246,47 @@
 In order to be able to follow this great methodology, you need tooling to
 implement the third step in a fully automated way. That's pgloader.
 
 .. toctree::
-   :maxdepth: 2
-   :caption: Table Of Contents:
+   :hidden:
+   :caption: Getting Started
 
    intro
    quickstart
    tutorial/tutorial
    install
+   bugreport
+
+.. toctree::
+   :hidden:
+   :caption: Reference Manual
+
   pgloader
+   command
+   batches
+   ref/transforms
+
+.. toctree::
+   :hidden:
+   :caption: Manual for file formats
+
   ref/csv
   ref/fixed
   ref/copy
   ref/dbf
  ref/ixf
  ref/archive
+
+.. 
toctree:: + :maxdepth: 2 + :hidden: + :caption: Manual for Database Servers + ref/mysql ref/sqlite ref/mssql ref/pgsql ref/pgsql-citus-target ref/pgsql-redshift - ref/transforms - bugreport + Indices and tables ================== diff --git a/docs/intro.rst b/docs/intro.rst index 7df8427..a21f5c7 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -13,7 +13,9 @@ pgloader knows how to read data from different kind of sources: * CSV * Fixed Format + * Postgres COPY text format * DBF + * IXF * Databases diff --git a/docs/pgloader.rst b/docs/pgloader.rst index 65cef82..f50c231 100644 --- a/docs/pgloader.rst +++ b/docs/pgloader.rst @@ -1,5 +1,5 @@ -PgLoader Reference Manual -========================= +Command Line +============ pgloader loads data from various sources into PostgreSQL. It can transform the data it reads on the fly and submit raw SQL before and @@ -230,535 +230,3 @@ to saying `--client-min-messages data`. Then the log messages will show the data being processed, in the cases where the code has explicit support for it. -Batches And Retry Behaviour ---------------------------- - -To load data to PostgreSQL, pgloader uses the `COPY` streaming protocol. -While this is the faster way to load data, `COPY` has an important drawback: -as soon as PostgreSQL emits an error with any bit of data sent to it, -whatever the problem is, the whole data set is rejected by PostgreSQL. - -To work around that, pgloader cuts the data into *batches* of 25000 rows -each, so that when a problem occurs it's only impacting that many rows of -data. Each batch is kept in memory while the `COPY` streaming happens, in -order to be able to handle errors should some happen. - -When PostgreSQL rejects the whole batch, pgloader logs the error message -then isolates the bad row(s) from the accepted ones by retrying the batched -rows in smaller batches. To do that, pgloader parses the *CONTEXT* error -message from the failed COPY, as the message contains the line number where -the error was found in the batch, as in the following example:: - - CONTEXT: COPY errors, line 3, column b: "2006-13-11" - -Using that information, pgloader will reload all rows in the batch before -the erroneous one, log the erroneous one as rejected, then try loading the -remaining of the batch in a single attempt, which may or may not contain -other erroneous data. - -At the end of a load containing rejected rows, you will find two files in -the *root-dir* location, under a directory named the same as the target -database of your setup. The filenames are the target table, and their -extensions are `.dat` for the rejected data and `.log` for the file -containing the full PostgreSQL client side logs about the rejected data. - -The `.dat` file is formatted in PostgreSQL the text COPY format as documented -in `http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609`. - -It is possible to use the following WITH options to control pgloader batch -behavior: - - - *on error stop*, *on error resume next* - - This option controls if pgloader is using building batches of data at - all. The batch implementation allows pgloader to recover errors by - sending the data that PostgreSQL accepts again, and by keeping away the - data that PostgreSQL rejects. - - To enable retrying the data and loading the good parts, use the option - *on error resume next*, which is the default to file based data loads - (such as CSV, IXF or DBF). - - When migrating from another RDMBS technology, it's best to have a - reproducible loading process. 
In that case it's possible to use *on - error stop* and fix either the casting rules, the data transformation - functions or in cases the input data until your migration runs through - completion. That's why *on error resume next* is the default for SQLite, - MySQL and MS SQL source kinds. - -A Note About Performance ------------------------- - -pgloader has been developed with performance in mind, to be able to cope -with ever growing needs in loading large amounts of data into PostgreSQL. - -The basic architecture it uses is the old Unix pipe model, where a thread is -responsible for loading the data (reading a CSV file, querying MySQL, etc) -and fills pre-processed data into a queue. Another threads feeds from the -queue, apply some more *transformations* to the input data and stream the -end result to PostgreSQL using the COPY protocol. - -When given a file that the PostgreSQL `COPY` command knows how to parse, and -if the file contains no erroneous data, then pgloader will never be as fast -as just using the PostgreSQL `COPY` command. - -Note that while the `COPY` command is restricted to read either from its -standard input or from a local file on the server's file system, the command -line tool `psql` implements a `\copy` command that knows how to stream a -file local to the client over the network and into the PostgreSQL server, -using the same protocol as pgloader uses. - -A Note About Parallelism ------------------------- - -pgloader uses several concurrent tasks to process the data being loaded: - - - a reader task reads the data in and pushes it to a queue, - - - at last one write task feeds from the queue and formats the raw into the - PostgreSQL COPY format in batches (so that it's possible to then retry a - failed batch without reading the data from source again), and then sends - the data to PostgreSQL using the COPY protocol. - -The parameter *workers* allows to control how many worker threads are -allowed to be active at any time (that's the parallelism level); and the -parameter *concurrency* allows to control how many tasks are started to -handle the data (they may not all run at the same time, depending on the -*workers* setting). - -We allow *workers* simultaneous workers to be active at the same time in the -context of a single table. A single unit of work consist of several kinds of -workers: - - - a reader getting raw data from the source, - - N writers preparing and sending the data down to PostgreSQL. - -The N here is setup to the *concurrency* parameter: with a *CONCURRENCY* of -2, we start (+ 1 2) = 3 concurrent tasks, with a *concurrency* of 4 we start -(+ 1 4) = 5 concurrent tasks, of which only *workers* may be active -simultaneously. - -The defaults are `workers = 4, concurrency = 1` when loading from a database -source, and `workers = 8, concurrency = 2` when loading from something else -(currently, a file). Those defaults are arbitrary and waiting for feedback -from users, so please consider providing feedback if you play with the -settings. - -As the `CREATE INDEX` threads started by pgloader are only waiting until -PostgreSQL is done with the real work, those threads are *NOT* counted into -the concurrency levels as detailed here. - -By default, as many `CREATE INDEX` threads as the maximum number of indexes -per table are found in your source schema. It is possible to set the `max -parallel create index` *WITH* option to another number in case there's just -too many of them to create. 
- -Source Formats --------------- - -pgloader supports the following input formats: - - - csv, which includes also tsv and other common variants where you can - change the *separator* and the *quoting* rules and how to *escape* the - *quotes* themselves; - - - fixed columns file, where pgloader is flexible enough to accomodate with - source files missing columns (*ragged fixed length column files* do - exist); - - - PostgreSLQ COPY formatted files, following the COPY TEXT documentation - of PostgreSQL, such as the reject files prepared by pgloader; - - - dbase files known as db3 or dbf file; - - - ixf formated files, ixf being a binary storage format from IBM; - - - sqlite databases with fully automated discovery of the schema and - advanced cast rules; - - - mysql databases with fully automated discovery of the schema and - advanced cast rules; - - - MS SQL databases with fully automated discovery of the schema and - advanced cast rules. - -Pgloader Commands Syntax ------------------------- - -pgloader implements a Domain Specific Language allowing to setup complex -data loading scripts handling computed columns and on-the-fly sanitization -of the input data. For more complex data loading scenarios, you will be -required to learn that DSL's syntax. It's meant to look familiar to DBA by -being inspired by SQL where it makes sense, which is not that much after -all. - -The pgloader commands follow the same global grammar rules. Each of them -might support only a subset of the general options and provide specific -options. - -:: - - LOAD - FROM - [ HAVING FIELDS ] - INTO - [ TARGET TABLE [ "" ]."
" ] - [ TARGET COLUMNS ] - - [ WITH ] - - [ SET ] - - [ BEFORE LOAD [ DO | EXECUTE ] ... ] - [ AFTER LOAD [ DO | EXECUTE ] ... ] - ; - -The main clauses are the `LOAD`, `FROM`, `INTO` and `WITH` clauses that each -command implements. Some command then implement the `SET` command, or some -specific clauses such as the `CAST` clause. - -Templating with Mustache ------------------------- - -pgloader implements the https://mustache.github.io/ templating system so -that you may have dynamic parts of your commands. See the documentation for -this template system online. - -A specific feature of pgloader is the ability to fetch a variable from the -OS environment of the pgloader process, making it possible to run pgloader -as in the following example:: - - $ DBPATH=sqlite/sqlite.db pgloader ./test/sqlite-env.load - -or in several steps:: - - $ export DBPATH=sqlite/sqlite.db - $ pgloader ./test/sqlite-env.load - -The variable can then be used in a typical mustache fashion:: - - load database - from '{{DBPATH}}' - into postgresql:///pgloader; - -It's also possible to prepare a INI file such as the following:: - - [pgloader] - - DBPATH = sqlite/sqlite.db - -And run the following command, feeding the INI values as a *context* for -pgloader templating system:: - - $ pgloader --context ./test/sqlite.ini ./test/sqlite-ini.load - -The mustache templates implementation with OS environment support replaces -former `GETENV` implementation, which didn't work anyway. - -.. _common_clauses: - -Common Clauses --------------- - -Some clauses are common to all commands: - -FROM -^^^^ - -The *FROM* clause specifies where to read the data from, and each command -introduces its own variant of sources. For instance, the *CSV* source -supports `inline`, `stdin`, a filename, a quoted filename, and a *FILENAME -MATCHING* clause (see above); whereas the *MySQL* source only supports a -MySQL database URI specification. - -INTO -^^^^ - -The PostgreSQL connection URI must contains the name of the target table -where to load the data into. That table must have already been created in -PostgreSQL, and the name might be schema qualified. - -Then *INTO* option also supports an optional comma separated list of target -columns, which are either the name of an input *field* or the white space -separated list of the target column name, its PostgreSQL data type and a -*USING* expression. - -The *USING* expression can be any valid Common Lisp form and will be read -with the current package set to `pgloader.transforms`, so that you can use -functions defined in that package, such as functions loaded dynamically with -the `--load` command line parameter. - -Each *USING* expression is compiled at runtime to native code. - -This feature allows pgloader to load any number of fields in a CSV file into -a possibly different number of columns in the database, using custom code -for that projection. - -WITH -^^^^ - -Set of options to apply to the command, using a global syntax of either: - - - *key = value* - - *use option* - - *do not use option* - -See each specific command for details. - -All data sources specific commands support the following options: - - - *on error stop*, *on error resume next* - - *batch rows = R* - - *batch size = ... MB* - - *prefetch rows = ...* - -See the section BATCH BEHAVIOUR OPTIONS for more details. - -In addition, the following settings are available: - - - *workers = W* - - *concurrency = C* - - *max parallel create index = I* - -See section A NOTE ABOUT PARALLELISM for more details. 
- -SET -^^^ - -This clause allows to specify session parameters to be set for all the -sessions opened by pgloader. It expects a list of parameter name, the equal -sign, then the single-quoted value as a comma separated list. - -The names and values of the parameters are not validated by pgloader, they -are given as-is to PostgreSQL. - -BEFORE LOAD DO -^^^^^^^^^^^^^^ - -You can run SQL queries against the database before loading the data from -the `CSV` file. Most common SQL queries are `CREATE TABLE IF NOT EXISTS` so -that the data can be loaded. - -Each command must be *dollar-quoted*: it must begin and end with a double -dollar sign, `$$`. Dollar-quoted queries are then comma separated. No extra -punctuation is expected after the last SQL query. - -BEFORE LOAD EXECUTE -^^^^^^^^^^^^^^^^^^^ - -Same behaviour as in the *BEFORE LOAD DO* clause. Allows you to read the SQL -queries from a SQL file. Implements support for PostgreSQL dollar-quoting -and the `\i` and `\ir` include facilities as in `psql` batch mode (where -they are the same thing). - -AFTER LOAD DO -^^^^^^^^^^^^^ - -Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in that -section are executed once the load is done. That's the right time to create -indexes and constraints, or re-enable triggers. - -AFTER LOAD EXECUTE -^^^^^^^^^^^^^^^^^^ - -Same behaviour as in the *AFTER LOAD DO* clause. Allows you to read the SQL -queries from a SQL file. Implements support for PostgreSQL dollar-quoting -and the `\i` and `\ir` include facilities as in `psql` batch mode (where -they are the same thing). - -AFTER CREATE SCHEMA DO -^^^^^^^^^^^^^^^^^^^^^^ - -Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in that -section are executed once the schema has been created by pgloader, and -before the data is loaded. It's the right time to ALTER TABLE or do some -custom implementation on-top of what pgloader does, like maybe partitioning. - -AFTER CREATE SCHEMA EXECUTE -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Same behaviour as in the *AFTER CREATE SCHEMA DO* clause. Allows you to read -the SQL queries from a SQL file. Implements support for PostgreSQL -dollar-quoting and the `\i` and `\ir` include facilities as in `psql` batch -mode (where they are the same thing). - -Connection String -^^^^^^^^^^^^^^^^^ - -The `` parameter is expected to be given as a *Connection URI* -as documented in the PostgreSQL documentation at -http://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-CONNSTRING. - -:: - - postgresql://[user[:password]@][netloc][:port][/dbname][?option=value&...] - -Where: - - - *user* - - Can contain any character, including colon (`:`) which must then be - doubled (`::`) and at-sign (`@`) which must then be doubled (`@@`). - - When omitted, the *user* name defaults to the value of the `PGUSER` - environment variable, and if it is unset, the value of the `USER` - environment variable. - - - *password* - - Can contain any character, including the at sign (`@`) which must then - be doubled (`@@`). To leave the password empty, when the *user* name - ends with at at sign, you then have to use the syntax user:@. - - When omitted, the *password* defaults to the value of the `PGPASSWORD` - environment variable if it is set, otherwise the password is left - unset. - - When no *password* is found either in the connection URI nor in the - environment, then pgloader looks for a `.pgpass` file as documented at - https://www.postgresql.org/docs/current/static/libpq-pgpass.html. 
The - implementation is not that of `libpq` though. As with `libpq` you can - set the environment variable `PGPASSFILE` to point to a `.pgpass` file, - and pgloader defaults to `~/.pgpass` on unix like systems and - `%APPDATA%\postgresql\pgpass.conf` on windows. Matching rules and syntax - are the same as with `libpq`, refer to its documentation. - - - *netloc* - - Can be either a hostname in dotted notation, or an ipv4, or an Unix - domain socket path. Empty is the default network location, under a - system providing *unix domain socket* that method is preferred, otherwise - the *netloc* default to `localhost`. - - It's possible to force the *unix domain socket* path by using the syntax - `unix:/path/to/where/the/socket/file/is`, so to force a non default - socket path and a non default port, you would have: - - postgresql://unix:/tmp:54321/dbname - - The *netloc* defaults to the value of the `PGHOST` environment - variable, and if it is unset, to either the default `unix` socket path - when running on a Unix system, and `localhost` otherwise. - - Socket path containing colons are supported by doubling the colons - within the path, as in the following example: - - postgresql://unix:/tmp/project::region::instance:5432/dbname - - - *dbname* - - Should be a proper identifier (letter followed by a mix of letters, - digits and the punctuation signs comma (`,`), dash (`-`) and underscore - (`_`). - - When omitted, the *dbname* defaults to the value of the environment - variable `PGDATABASE`, and if that is unset, to the *user* value as - determined above. - - - *options* - - The optional parameters must be supplied with the form `name=value`, and - you may use several parameters by separating them away using an - ampersand (`&`) character. - - Only some options are supported here, *tablename* (which might be - qualified with a schema name) *sslmode*, *host*, *port*, *dbname*, - *user* and *password*. - - The *sslmode* parameter values can be one of `disable`, `allow`, - `prefer` or `require`. - - For backward compatibility reasons, it's possible to specify the - *tablename* option directly, without spelling out the `tablename=` - parts. - - The options override the main URI components when both are given, and - using the percent-encoded option parameters allow using passwords - starting with a colon and bypassing other URI components parsing - limitations. - -Regular Expressions -^^^^^^^^^^^^^^^^^^^ - -Several clauses listed in the following accept *regular expressions* with -the following input rules: - - - A regular expression begins with a tilde sign (`~`), - - - is then followed with an opening sign, - - - then any character is allowed and considered part of the regular - expression, except for the closing sign, - - - then a closing sign is expected. - -The opening and closing sign are allowed by pair, here's the complete list -of allowed delimiters:: - - ~// - ~[] - ~{} - ~() - ~<> - ~"" - ~'' - ~|| - ~## - -Pick the set of delimiters that don't collide with the *regular expression* -you're trying to input. If your expression is such that none of the -solutions allow you to enter it, the places where such expressions are -allowed should allow for a list of expressions. - -Comments -^^^^^^^^ - -Any command may contain comments, following those input rules: - - - the `--` delimiter begins a comment that ends with the end of the - current line, - - - the delimiters `/*` and `*/` respectively start and end a comment, which - can be found in the middle of a command or span several lines. 
- -Any place where you could enter a *whitespace* will accept a comment too. - -Batch behaviour options -^^^^^^^^^^^^^^^^^^^^^^^ - -All pgloader commands have support for a *WITH* clause that allows for -specifying options. Some options are generic and accepted by all commands, -such as the *batch behaviour options*, and some options are specific to a -data source kind, such as the CSV *skip header* option. - -The global batch behaviour options are: - - - *batch rows* - - Takes a numeric value as argument, used as the maximum number of rows - allowed in a batch. The default is `25 000` and can be changed to try - having better performance characteristics or to control pgloader memory - usage; - - - *batch size* - - Takes a memory unit as argument, such as *20 MB*, its default value. - Accepted multipliers are *kB*, *MB*, *GB*, *TB* and *PB*. The case is - important so as not to be confused about bits versus bytes, we're only - talking bytes here. - - - *prefetch rows* - - Takes a numeric value as argument, defaults to `100000`. That's the - number of rows that pgloader is allowed to read in memory in each reader - thread. See the *workers* setting for how many reader threads are - allowed to run at the same time. - -Other options are specific to each input source, please refer to specific -parts of the documentation for their listing and covering. - -A batch is then closed as soon as either the *batch rows* or the *batch -size* threshold is crossed, whichever comes first. In cases when a batch has -to be closed because of the *batch size* setting, a *debug* level log -message is printed with how many rows did fit in the *oversized* batch. - diff --git a/docs/ref/archive.rst b/docs/ref/archive.rst index 8732d3a..85d59e4 100644 --- a/docs/ref/archive.rst +++ b/docs/ref/archive.rst @@ -1,5 +1,5 @@ -Loading From an Archive -======================= +Archive (http, zip) +=================== This command instructs pgloader to load data from one or more files contained in an archive. Currently the only supported archive format is *ZIP*, and the diff --git a/docs/ref/copy.rst b/docs/ref/copy.rst index 3409531..d6f9e7f 100644 --- a/docs/ref/copy.rst +++ b/docs/ref/copy.rst @@ -1,5 +1,5 @@ -Loading COPY Formatted Files -============================ +COPY +==== This commands instructs pgloader to load from a file containing COPY TEXT data as described in the PostgreSQL documentation. diff --git a/docs/ref/csv.rst b/docs/ref/csv.rst index f5f648e..50ff127 100644 --- a/docs/ref/csv.rst +++ b/docs/ref/csv.rst @@ -1,5 +1,5 @@ -Loading CSV data -================ +CSV +=== This command instructs pgloader to load data from a `CSV` file. Because of the complexity of guessing the parameters of a CSV file, it's simpler to diff --git a/docs/ref/dbf.rst b/docs/ref/dbf.rst index 1f894ab..34767c8 100644 --- a/docs/ref/dbf.rst +++ b/docs/ref/dbf.rst @@ -1,5 +1,5 @@ -Loading DBF data -================= +DBF +=== This command instructs pgloader to load data from a `DBF` file. A default set of casting rules are provided and might be overloaded and appended to by diff --git a/docs/ref/fixed.rst b/docs/ref/fixed.rst index e4f2a57..b793d36 100644 --- a/docs/ref/fixed.rst +++ b/docs/ref/fixed.rst @@ -1,5 +1,5 @@ -Loading Fixed Cols File Formats -=============================== +Fixed Columns +============= This command instructs pgloader to load data from a text file containing columns arranged in a *fixed size* manner. 
diff --git a/docs/ref/ixf.rst b/docs/ref/ixf.rst index 996b29f..e4fa7f8 100644 --- a/docs/ref/ixf.rst +++ b/docs/ref/ixf.rst @@ -1,5 +1,5 @@ -Loading IXF Data -================ +IXF +=== This command instructs pgloader to load data from an IBM `IXF` file. diff --git a/docs/ref/mssql.rst b/docs/ref/mssql.rst index 6a2252a..5e3a0b1 100644 --- a/docs/ref/mssql.rst +++ b/docs/ref/mssql.rst @@ -1,5 +1,5 @@ -Migrating a MS SQL Database to PostgreSQL -========================================= +MS SQL to Postgres +================== This command instructs pgloader to load data from a MS SQL database. Automatic discovery of the schema is supported, including build of the diff --git a/docs/ref/mysql.rst b/docs/ref/mysql.rst index 3adb79a..a067659 100644 --- a/docs/ref/mysql.rst +++ b/docs/ref/mysql.rst @@ -1,5 +1,5 @@ -Migrating a MySQL Database to PostgreSQL -======================================== +MySQL to Postgres +================= This command instructs pgloader to load data from a database connection. pgloader supports dynamically converting the schema of the source database diff --git a/docs/ref/pgsql-citus-target.rst b/docs/ref/pgsql-citus-target.rst index f6397d1..e248509 100644 --- a/docs/ref/pgsql-citus-target.rst +++ b/docs/ref/pgsql-citus-target.rst @@ -1,5 +1,5 @@ -Migrating a PostgreSQL Database to Citus -======================================== +PostgreSQL to Citus +=================== This command instructs pgloader to load data from a database connection. Automatic discovery of the schema is supported, including build of the diff --git a/docs/ref/pgsql-redshift.rst b/docs/ref/pgsql-redshift.rst index d3e76ab..09dda71 100644 --- a/docs/ref/pgsql-redshift.rst +++ b/docs/ref/pgsql-redshift.rst @@ -1,5 +1,5 @@ -Support for Redshift in pgloader -================================ +Redshift to Postgres +==================== The command and behavior are the same as when migration from a PostgreSQL database source, see :ref:`migrating_to_pgsql`. pgloader automatically diff --git a/docs/ref/pgsql.rst b/docs/ref/pgsql.rst index 791a1fb..729d86b 100644 --- a/docs/ref/pgsql.rst +++ b/docs/ref/pgsql.rst @@ -1,13 +1,18 @@ .. _migrating_to_pgsql: -Migrating a PostgreSQL Database to PostgreSQL -============================================= +Postgres to Postgres +==================== This command instructs pgloader to load data from a database connection. Automatic discovery of the schema is supported, including build of the indexes, primary and foreign keys constraints. A default set of casting rules are provided and might be overloaded and appended to by the command. +For a complete Postgres to Postgres solution including Change Data Capture +support with Logical Decoding, see `pgcopydb`__. + +__ https://pgcopydb.readthedocs.io/ + Using default settings ---------------------- diff --git a/docs/ref/sqlite.rst b/docs/ref/sqlite.rst index f1a9ab8..0a0699e 100644 --- a/docs/ref/sqlite.rst +++ b/docs/ref/sqlite.rst @@ -1,5 +1,5 @@ -Migrating a SQLite database to PostgreSQL -========================================= +SQLite to Postgres +================== This command instructs pgloader to load data from a SQLite file. Automatic discovery of the schema is supported, including build of the indexes.