# pgloader(1) -- PostgreSQL data loader

## SYNOPSIS

    pgloader [<options>] [<command-file>]...
    pgloader [<options>] SOURCE TARGET

## DESCRIPTION

pgloader loads data from various sources into PostgreSQL. It can
transform the data it reads on the fly and submit raw SQL before and
after the loading. It uses the `COPY` PostgreSQL protocol to stream
the data into the server, and manages errors by filling a pair of
*reject.dat* and *reject.log* files.

pgloader operates either using commands which are read from files:

    pgloader commands.load

or by using arguments and options all provided on the command line:

    pgloader SOURCE TARGET

## ARGUMENTS

The pgloader arguments can be as many load files as needed, or a couple of
connection strings to a specific input file.

### SOURCE CONNECTION STRING

The source connection string format is as follows:

    format:///absolute/path/to/file.ext
    format://./relative/path/to/file.ext

Where format might be one of `csv`, `fixed`, `copy`, `dbf`, `db3` or `ixf`.

    db://user:pass@host:port/dbname

Where db might be one of `sqlite`, `mysql` or `mssql`.

When using a file based source format, pgloader also supports natively
fetching the file from an http location and decompressing an archive if
needed. In that case it's necessary to use the `--type` option to specify
the expected format of the file. See the examples below.

Also note that some file formats require describing some implementation
details such as columns to be read and delimiters and quoting when loading
from csv.

For more complex loading scenarios, you will need to write a full-fledged
load command in the syntax described later in this document.

### TARGET CONNECTION STRING

The target connection string format is described in detail later in this
document, see Section Connection String.

## OPTIONS

### INQUIRY OPTIONS

Use these options when you want to know more about how to use `pgloader`, as
those options will cause `pgloader` not to load any data.

* `-h`, `--help`:
  Show command usage summary and exit.

* `-V`, `--version`:
  Show pgloader version string and exit.

* `-E`, `--list-encodings`:
  List known encodings in this version of pgloader.

* `-U`, `--upgrade-config`:
  Parse given files in the command line as `pgloader.conf` files with the
  `INI` syntax that was in use in pgloader versions 2.x, and output the
  new command syntax for pgloader on standard output.

### GENERAL OPTIONS

Those options are meant to tweak `pgloader` behavior when loading data.

* `-v`, `--verbose`:
  Be verbose.

* `-q`, `--quiet`:
  Be quiet.

* `-d`, `--debug`:
  Show debug level information messages.

* `-D`, `--root-dir`:
  Set the root working directory (defaults to "/tmp/pgloader").

* `-L`, `--logfile`:
  Set the pgloader log file (defaults to "/tmp/pgloader.log").

* `--log-min-messages`:
  Minimum level of verbosity needed for a log message to make it to the
  logfile. One of critical, log, error, warning, notice, info or debug.

* `--client-min-messages`:
  Minimum level of verbosity needed for a log message to make it to the
  console. One of critical, log, error, warning, notice, info or debug.

* `-S`, `--summary`:
  A filename where to copy the summary output. When relative, the filename
  is expanded into `*root-dir*`.

  The format of the filename defaults to being *human readable*. It is
  possible to have the output in machine friendly formats such as *CSV*,
  *COPY* (PostgreSQL's own COPY format) or *JSON* by specifying a filename
  with the extension resp. `.csv`, `.copy` or `.json`.
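
  For instance, the following invocations (illustrative only) would write
  a human readable summary and a JSON one:

      pgloader --summary summary.txt  ./test/csv-districts.load
      pgloader --summary summary.json ./test/csv-districts.load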

* `-l <file>`, `--load-lisp-file <file>`:
  Specify a lisp <file> to compile and load into the pgloader image before
  reading the commands, allowing you to define extra transformation
  functions. Those functions should be defined in the `pgloader.transforms`
  package. This option can appear more than once in the command line.

* `--self-upgrade <directory>`:

  Specify a <directory> where to find pgloader sources so that one of the
  very first things it does is dynamically loading-in (and compiling to
  machine code) another version of itself, usually a newer one like a very
  recent git checkout.

### COMMAND LINE ONLY OPERATIONS

Those options are meant to be used when using `pgloader` from the command
line only, rather than using a command file and the rich command clauses and
parser. In simple cases, it can be much easier to use the *SOURCE* and
*TARGET* directly on the command line, then tweak the loading with those
options:

* `--with "option"`:

  Allows setting options from the command line. You can use that option as
  many times as you want. The option arguments must follow the *WITH*
  clause for the source type of the `SOURCE` specification, as described
  later in this document.

* `--set "guc_name='value'"`

  Allows setting PostgreSQL configuration from the command line. Note that
  the option parsing is the same as when used from the *SET* command
  clause, in particular you must enclose the guc value with single-quotes.

* `--field "..."`

  Allows setting a source field definition. Fields are accumulated in the
  order given on the command line. It's possible to either use a `--field`
  option per field in the source file, or to separate field definitions by
  a comma, as you would do in the *HAVING FIELDS* clause.

* `--cast "..."`

  Allows setting a specific casting rule for loading the data.

* `--type csv|fixed|db3|ixf|sqlite|mysql|mssql`

  Allows forcing the source type, in case the *SOURCE* parsing isn't
  satisfying.

* `--encoding <encoding>`

  Set the encoding of the source file to load data from.

* `--before <filename>`

  Parse given filename for SQL queries and run them against the target
  database before loading the data from the source. The queries are parsed
  by pgloader itself: they need to be terminated by a semi-colon (;) and
  the file may include `\i` or `\ir` commands to *include* another file.

* `--after <filename>`

  Parse given filename for SQL queries and run them against the target
  database after having loaded the data from the source. The queries are
  parsed in the same way as with the `--before` option, see above.
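
To see how those switches combine, here's a hypothetical invocation where
the field names, file names and option values are illustrative only:

    pgloader --type csv                           \
             --field "id, name"                   \
             --with "skip header = 1"             \
             --with "fields terminated by ','"    \
             --set "work_mem='32MB'"              \
             --before ./setup-target-table.sql    \
             ./data.csv                           \
             postgresql:///pgloader?tablename=target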

### MORE DEBUG INFORMATION

To get the maximum amount of debug information, you can use both the
`--verbose` and the `--debug` switches at the same time, which is equivalent
to saying `--client-min-messages data`. Then the log messages will show the
data being processed, in the cases where the code has explicit support for
it.

## USAGE EXAMPLES

Review the command line options and pgloader's version:

    pgloader --help
    pgloader --version

### Loading from a complex command

Use the command file as the pgloader command argument, pgloader will parse
that file and execute the commands found in it:

    pgloader --verbose ./test/csv-districts.load

### CSV

Load data from a CSV file into a pre-existing table in your database:

    pgloader --type csv \
             --field id --field field \
             --with truncate \
             --with "fields terminated by ','" \
             ./test/data/matching-1.csv \
             postgres:///pgloader?tablename=matching

In that example the whole loading is driven from the command line, bypassing
the need for writing a command in the pgloader command syntax entirely. As
there's no command though, the extra information needed must be provided on
the command line using the `--type` and `--field` and `--with` switches.

For documentation about the available syntaxes for the `--field` and
`--with` switches, please refer to the CSV section later in the man page.

Note also that the PostgreSQL URI includes the target *tablename*.

### Reading from STDIN

File based pgloader sources can be loaded from the standard input, as in the
following example:

    pgloader --type csv \
             --field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \
             --with "skip header = 1" \
             --with "fields terminated by '\t'" \
             - \
             postgresql:///pgloader?districts_longlat \
             < test/data/2013_Gaz_113CDs_national.txt

The dash (`-`) character as a source is used to mean *standard input*, as
usual in Unix command lines. It's possible to stream compressed content to
pgloader with this technique, using the Unix pipe:

    gunzip -c source.gz | pgloader --type csv ... - pgsql:///target?foo

### Loading from CSV available through HTTP

The same command as just above can also be run if the CSV file happens to be
found on a remote HTTP location:

    pgloader --type csv \
             --field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \
             --with "skip header = 1" \
             --with "fields terminated by '\t'" \
             http://pgsql.tapoueh.org/temp/2013_Gaz_113CDs_national.txt \
             postgresql:///pgloader?districts_longlat

Some more options have to be used in that case, as the file contains a
one-line header (most commonly that's column names, could be a copyright
notice). Also, in that case, we specify all the fields right into a single
`--field` option argument.

Again, the PostgreSQL target connection string must contain the *tablename*
option and you have to ensure that the target table exists and may fit the
data. Here's the SQL command used in that example in case you want to try it
yourself:

    create table districts_longlat
    (
       usps        text,
       geoid       text,
       aland       bigint,
       awater      bigint,
       aland_sqmi  double precision,
       awater_sqmi double precision,
       intptlat    double precision,
       intptlong   double precision
    );

Also notice that the same command will work against an archived version of
the same data, e.g.
http://pgsql.tapoueh.org/temp/2013_Gaz_113CDs_national.txt.gz.

Finally, it's important to note that pgloader first fetches the content from
the HTTP URL into a local file, then expands the archive when it's
recognized to be one, and only then processes the locally expanded file.

In some cases, either because pgloader has no direct support for your
archive format or maybe because expanding the archive is not feasible in
your environment, you might want to *stream* the content straight from its
remote location into PostgreSQL. Here's how to do that, using the old battle
tested Unix Pipes trick:

    curl http://pgsql.tapoueh.org/temp/2013_Gaz_113CDs_national.txt.gz \
    | gunzip -c \
    | pgloader --type csv \
               --field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \
               --with "skip header = 1" \
               --with "fields terminated by '\t'" \
               - \
               postgresql:///pgloader?districts_longlat

Now the OS will take care of the streaming and buffering between the network
and the commands and pgloader will take care of streaming the data down to
PostgreSQL.

### Migrating from SQLite

The following command will open the SQLite database, discover its tables
definitions including indexes and foreign keys, migrate those definitions
while *casting* the data type specifications to their PostgreSQL equivalent
and then migrate the data over:

    createdb newdb
    pgloader ./test/sqlite/sqlite.db postgresql:///newdb

### Migrating from MySQL

Just create a database where to host the MySQL data and definitions and have
pgloader do the migration for you in a single command line:

    createdb pagila
    pgloader mysql://user@localhost/sakila postgresql:///pagila

### Fetching an archived DBF file from an HTTP remote location

It's possible for pgloader to download a file from HTTP, unarchive it, and
only then open it to discover the schema then load the data:

    createdb foo
    pgloader --type dbf http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013.zip postgresql:///foo

Here it's not possible for pgloader to guess the kind of data source it's
being given, so it's necessary to use the `--type` command line switch.

## BATCHES AND RETRY BEHAVIOUR

To load data to PostgreSQL, pgloader uses the `COPY` streaming protocol.
While this is the fastest way to load data, `COPY` has an important drawback:
as soon as PostgreSQL emits an error with any bit of data sent to it,
whatever the problem is, the whole data set is rejected by PostgreSQL.

To work around that, pgloader cuts the data into *batches* of 25000 rows
each, so that when a problem occurs it only impacts that many rows of
data. Each batch is kept in memory while the `COPY` streaming happens, in
order to be able to handle errors should some happen.

When PostgreSQL rejects the whole batch, pgloader logs the error message
then isolates the bad row(s) from the accepted ones by retrying the batched
rows in smaller batches. To do that, pgloader parses the *CONTEXT* error
message from the failed COPY, as the message contains the line number where
the error was found in the batch, as in the following example:

    CONTEXT: COPY errors, line 3, column b: "2006-13-11"

Using that information, pgloader will reload all rows in the batch before
the erroneous one, log the erroneous one as rejected, then try loading the
remainder of the batch in a single attempt, which may or may not contain
other erroneous data.

At the end of a load containing rejected rows, you will find two files in
the *root-dir* location, under a directory named the same as the target
database of your setup. The filenames are the target table, and their
extensions are `.dat` for the rejected data and `.log` for the file
containing the full PostgreSQL client side logs about the rejected data.
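
For instance, with the default *root-dir* and a load targeting the table
*districts_longlat* in a database named *pgloader*, the reject files would
show up at paths such as the following (illustrative):

    /tmp/pgloader/pgloader/districts_longlat.dat
    /tmp/pgloader/pgloader/districts_longlat.log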

The `.dat` file is formatted in the PostgreSQL text COPY format as documented
in [http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609]().

## A NOTE ABOUT PERFORMANCES

pgloader has been developed with performance in mind, to be able to cope
with ever growing needs in loading large amounts of data into PostgreSQL.

The basic architecture it uses is the old Unix pipe model, where a thread is
responsible for loading the data (reading a CSV file, querying MySQL, etc)
and fills pre-processed data into a queue. Another thread feeds from the
queue, applies some more *transformations* to the input data and streams the
end result to PostgreSQL using the COPY protocol.

When given a file that the PostgreSQL `COPY` command knows how to parse, and
if the file contains no erroneous data, then pgloader will never be as fast
as just using the PostgreSQL `COPY` command.

Note that while the `COPY` command is restricted to read either from its
standard input or from a local file on the server's file system, the command
line tool `psql` implements a `\copy` command that knows how to stream a
file local to the client over the network and into the PostgreSQL server,
using the same protocol as pgloader uses.

## SOURCE FORMATS

pgloader supports the following input formats:

  - csv, which includes also tsv and other common variants where you can
    change the *separator* and the *quoting* rules and how to *escape* the
    *quotes* themselves;

  - fixed columns file, where pgloader is flexible enough to accommodate
    source files missing columns (*ragged fixed length column files* do
    exist);

  - PostgreSQL COPY formatted files, following the COPY TEXT documentation
    of PostgreSQL, such as the reject files prepared by pgloader;

  - dbase files known as db3 or dbf files;

  - ixf formatted files, ixf being a binary storage format from IBM;

  - sqlite databases with fully automated discovery of the schema and
    advanced cast rules;

  - mysql databases with fully automated discovery of the schema and
    advanced cast rules;

  - MS SQL databases with fully automated discovery of the schema and
    advanced cast rules.

## PGLOADER COMMANDS SYNTAX

pgloader implements a Domain Specific Language allowing to set up complex
data loading scripts handling computed columns and on-the-fly sanitization
of the input data. For more complex data loading scenarios, you will be
required to learn that DSL's syntax. It's meant to look familiar to DBAs by
being inspired by SQL where it makes sense, which is not that much after
all.

The pgloader commands follow the same global grammar rules. Each of them
might support only a subset of the general options and provide specific
options.

    LOAD <source-type>
         FROM <source-url>     [ HAVING FIELDS <source-level-options> ]
         INTO <postgresql-url> [ TARGET COLUMNS <columns-and-options> ]

    [ WITH <load-options> ]

    [ SET <postgresql-settings> ]

    [ BEFORE LOAD [ DO <sql statements> | EXECUTE <sql file> ] ... ]
    [ AFTER  LOAD [ DO <sql statements> | EXECUTE <sql file> ] ... ]
    ;

The main clauses are the `LOAD`, `FROM`, `INTO` and `WITH` clauses that each
command implements. Some commands then implement the `SET` clause, or some
specific clauses such as the `CAST` clause.

## COMMON CLAUSES

Some clauses are common to all commands:

  - *FROM*

    The *FROM* clause specifies where to read the data from, and each
    command introduces its own variant of sources. For instance, the *CSV*
    source supports `inline`, `stdin`, a filename, a quoted filename, and a
    *FILENAME MATCHING* clause (see below); whereas the *MySQL* source only
    supports a MySQL database URI specification.

    In all cases, the *FROM* clause is able to read its value from an
    environment variable when using the form `GETENV 'varname'`.

  - *INTO*

    The PostgreSQL connection URI must contain the name of the target table
    where to load the data into. That table must have already been created
    in PostgreSQL, and the name might be schema qualified.

    The *INTO* target database connection URI can be parsed from the value
    of an environment variable when using the form `GETENV 'varname'`.

    The *INTO* option also supports an optional comma separated list of
    target columns, which are either the name of an input *field* or the
    white space separated list of the target column name, its PostgreSQL
    data type and a *USING* expression.

    The *USING* expression can be any valid Common Lisp form and will be
    read with the current package set to `pgloader.transforms`, so that you
    can use functions defined in that package, such as functions loaded
    dynamically with the `--load` command line parameter.

    Each *USING* expression is compiled at runtime to native code.

    This feature allows pgloader to load any number of fields in a CSV file
    into a possibly different number of columns in the database, using
    custom code for that projection, as shown in the sketch below.
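
    For instance, as a minimal sketch with hypothetical source fields
    `firstname` and `lastname`, two fields can be projected into a single
    target column this way:

        INTO postgresql:///pgloader?people
             (
                fullname text using (format nil "~a ~a" firstname lastname)
             )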

  - *WITH*

    Set of options to apply to the command, using a global syntax of either:

      - *key = value*
      - *use option*
      - *do not use option*

    See each specific command for details.

  - *SET*

    This clause allows specifying session parameters to be set for all the
    sessions opened by pgloader. It expects a list of parameter name, the
    equal sign, then the single-quoted value as a comma separated list.

    The names and values of the parameters are not validated by pgloader,
    they are given as-is to PostgreSQL.

  - *BEFORE LOAD DO*

    You can run SQL queries against the database before loading the data
    from the `CSV` file. Most common SQL queries are `CREATE TABLE IF NOT
    EXISTS` so that the data can be loaded.

    Each command must be *dollar-quoted*: it must begin and end with a
    double dollar sign, `$$`. Dollar-quoted queries are then comma
    separated. No extra punctuation is expected after the last SQL query.
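
    As an illustration only, with a hypothetical target table named *baz*:

        BEFORE LOAD DO
           $$ create table if not exists baz (a integer, b text); $$,
           $$ truncate table baz; $$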

  - *BEFORE LOAD EXECUTE*

    Same behaviour as in the *BEFORE LOAD DO* clause. Allows you to read
    the SQL queries from a SQL file. Implements support for PostgreSQL
    dollar-quoting and the `\i` and `\ir` include facilities as in `psql`
    batch mode (where they are the same thing).

  - *AFTER LOAD DO*

    Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in
    that section are executed once the load is done. That's the right time
    to create indexes and constraints, or re-enable triggers.

  - *AFTER LOAD EXECUTE*

    Same behaviour as in the *AFTER LOAD DO* clause. Allows you to read the
    SQL queries from a SQL file. Implements support for PostgreSQL
    dollar-quoting and the `\i` and `\ir` include facilities as in `psql`
    batch mode (where they are the same thing).

### Connection String

The `<postgresql-url>` parameter is expected to be given as a *Connection URI*
as documented in the PostgreSQL documentation at
http://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-CONNSTRING.

    postgresql://[user[:password]@][netloc][:port][/dbname][?option=value&...]

Where:

  - *user*

    Can contain any character, including colon (`:`) which must then be
    doubled (`::`) and at-sign (`@`) which must then be doubled (`@@`).

    When omitted, the *user* name defaults to the value of the `PGUSER`
    environment variable, and if it is unset, the value of the `USER`
    environment variable.

  - *password*

    Can contain any character, including the at sign (`@`) which must then
    be doubled (`@@`). To leave the password empty, when the *user* name
    ends with an at sign, you then have to use the syntax user:@.

    When omitted, the *password* defaults to the value of the `PGPASSWORD`
    environment variable if it is set, otherwise the password is left
    unset.

  - *netloc*

    Can be either a hostname in dotted notation, or an IPv4 address, or a
    Unix domain socket path. Empty is the default network location; on a
    system providing *unix domain sockets* that method is preferred,
    otherwise the *netloc* defaults to `localhost`.

    It's possible to force the *unix domain socket* path by using the
    syntax `unix:/path/to/where/the/socket/file/is`, so to force a non
    default socket path and a non default port, you would have:

        postgresql://unix:/tmp:54321/dbname

    The *netloc* defaults to the value of the `PGHOST` environment
    variable, and if it is unset, to either the default `unix` socket path
    when running on a Unix system, and `localhost` otherwise.

  - *dbname*

    Should be a proper identifier (letter followed by a mix of letters,
    digits and the punctuation signs comma (`,`), dash (`-`) and underscore
    (`_`)).

    When omitted, the *dbname* defaults to the value of the environment
    variable `PGDATABASE`, and if that is unset, to the *user* value as
    determined above.

  - *options*

    The optional parameters must be supplied with the form `name=value`,
    and you may use several parameters by separating them with an
    ampersand (`&`) character.

    Only some options are supported here, *tablename* (which might be
    qualified with a schema name), *sslmode*, *host*, *port*, *dbname*,
    *user* and *password*.

    The *sslmode* parameter values can be one of `disable`, `allow`,
    `prefer` or `require`.

    For backward compatibility reasons, it's possible to specify the
    *tablename* option directly, without spelling out the `tablename=`
    parts.

    The options override the main URI components when both are given, and
    using the percent-encoded option parameters allows using passwords
    starting with a colon and bypassing other URI components parsing
    limitations.
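
    For instance, the following target connection strings are all valid,
    the last one using the backward compatible *tablename* spelling (the
    values shown are illustrative only):

        postgresql://user:pass@localhost:5432/dbname?tablename=foo.bar
        postgresql:///dbname?host=localhost&port=5432&tablename=bar
        postgresql:///pgloader?districts_longlat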

### Regular Expressions

Several clauses listed in the following accept *regular expressions* with
the following input rules:

  - A regular expression begins with a tilde sign (`~`),

  - is then followed with an opening sign,

  - then any character is allowed and considered part of the regular
    expression, except for the closing sign,

  - then a closing sign is expected.

The opening and closing signs are allowed in pairs; here's the complete list
of allowed delimiters:

    ~//
    ~[]
    ~{}
    ~()
    ~<>
    ~""
    ~''
    ~||
    ~##

Pick the set of delimiters that don't collide with the *regular expression*
you're trying to input. If your expression is such that none of the
solutions allow you to enter it, the places where such expressions are
allowed should allow for a list of expressions.
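
For instance, the two following spellings of the same regular expression
are equivalent, the second pair of delimiters being handy when the pattern
itself contains a slash:

    ~/GeoLiteCity-Blocks.csv/
    ~<GeoLiteCity-Blocks.csv>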

### Comments

Any command may contain comments, following those input rules:

  - the `--` delimiter begins a comment that ends with the end of the
    current line,

  - the delimiters `/*` and `*/` respectively start and end a comment,
    which can be found in the middle of a command or span several lines.

Any place where you could enter a *whitespace* will accept a comment too.
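
For example, a command could mix both comment styles (illustrative only):

    LOAD CSV                  -- comment until the end of the line
         FROM 'data.csv'      /* a comment that could span
                                 several lines */
         INTO postgresql:///pgloader?target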

### Batch behaviour options

All pgloader commands have support for a *WITH* clause that allows for
specifying options. Some options are generic and accepted by all commands,
such as the *batch behaviour options*, and some options are specific to a
data source kind, such as the CSV *skip header* option.

The global batch behaviour options are:

  - *batch rows*

    Takes a numeric value as argument, used as the maximum number of rows
    allowed in a batch. The default is `25 000` and can be changed to try
    to obtain better performance characteristics or to control pgloader
    memory usage;

  - *batch size*

    Takes a memory unit as argument, such as *20 MB*, its default value.
    Accepted multipliers are *kB*, *MB*, *GB*, *TB* and *PB*. The case is
    important so as not to be confused about bits versus bytes, we're only
    talking bytes here.

  - *batch concurrency*

    Takes a numeric value as argument, defaults to `10`. That's the number
    of batches that pgloader is allowed to build in memory, even when only
    a single batch at a time might be sent to PostgreSQL.

    Supporting more than a single batch being sent at a time is on the TODO
    list of pgloader, but is not implemented yet. This option is about
    controlling the memory needs of pgloader as a trade-off to the
    performance characteristics, and not about parallel activity of
    pgloader.

Other options are specific to each input source; please refer to the
specific parts of the documentation for their listing and coverage.

A batch is then closed as soon as either the *batch rows* or the *batch
size* threshold is crossed, whichever comes first. In cases when a batch has
to be closed because of the *batch size* setting, a *debug* level log
message is printed with how many rows fit in the *oversized* batch.
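
As an illustration, the three batch options can be combined in any
command's *WITH* clause:

    WITH batch rows = 10000,
         batch size = 10 MB,
         batch concurrency = 3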

## LOAD CSV

This command instructs pgloader to load data from a `CSV` file. Here's an
example:

    LOAD CSV
       FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-646-us
            HAVING FIELDS
            (
               startIpNum, endIpNum, locId
            )
       INTO postgresql://user@localhost:54393/dbname?geolite.blocks
            TARGET COLUMNS
            (
               iprange ip4r using (ip-range startIpNum endIpNum),
               locId
            )
       WITH truncate,
            skip header = 2,
            fields optionally enclosed by '"',
            fields escaped by backslash-quote,
            fields terminated by '\t'

        SET work_mem to '32 MB', maintenance_work_mem to '64 MB';

The `csv` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. Accepts an *ENCODING* option.
    Use the `--list-encodings` option to know which encoding names are
    supported.

    The filename may be enclosed by single quotes, and could be one of the
    following special values:

      - *inline*

        The data is found after the end of the parsed commands. Any number
        of empty lines between the end of the commands and the beginning
        of the data is accepted.

      - *stdin*

        Reads the data from the standard input stream.

      - *FILENAMES MATCHING*

        The whole *matching* clause must follow the following rule:

            [ ALL FILENAMES | [ FIRST ] FILENAME ]
            MATCHING regexp
            [ IN DIRECTORY '...' ]

        The *matching* clause applies the given *regular expression* (see
        above for exact syntax, several options can be used here) to
        filenames. It's then possible to load data from only the first
        match of all of them.

        The optional *IN DIRECTORY* clause allows specifying which
        directory to walk for finding the data files, and can be either
        relative to where the command file is read from, or absolute. The
        given directory must exist.
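
        For instance, a sketch with an illustrative pattern and directory
        name:

            FROM ALL FILENAMES MATCHING ~/GeoLiteCity-Blocks.*[.]csv/
                 IN DIRECTORY 'data'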

    The *FROM* option also supports an optional comma separated list of
    *field* names describing what is expected in the `CSV` data file,
    optionally introduced by the clause `HAVING FIELDS`.

    Each field name can be either only one name or a name followed by
    specific reader options for that field, enclosed in square brackets
    and comma-separated. Supported per-field reader options are:

      - *terminated by*

        See the description of *fields terminated by* below.

        The processing of this option is not currently implemented.

      - *date format*

        When the field is expected of the date type, then this option
        allows specifying the date format used in the file.

        Date format strings are template strings modeled against the
        PostgreSQL `to_char` template strings support, limited to the
        following patterns:

          - YYYY, YYY, YY for the year part
          - MM for the numeric month part
          - DD for the numeric day part
          - HH, HH12, HH24 for the hour part
          - am, AM, a.m., A.M.
          - pm, PM, p.m., P.M.
          - MI for the minutes part
          - SS for the seconds part
          - MS for the milliseconds part (4 digits)
          - US for the microseconds part (6 digits)
          - unparsed punctuation signs: - . * # @ T / \ and space

        Here's an example of a *date format* specification:

            column-name [date format 'YYYY-MM-DD HH24-MI-SS.US']

      - *null if*

        This option takes an argument which is either the keyword *blanks*
        or a double-quoted string.

        When *blanks* is used and the field value that is read contains
        only space characters, then it's automatically converted to an SQL
        `NULL` value.

        When a double-quoted string is used and that string is read as
        the field value, then the field value is automatically converted
        to an SQL `NULL` value.
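
        For example, with hypothetical field names:

            postalCode [null if blanks],
            country    [null if "N/A"]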

      - *trim both whitespace*, *trim left whitespace*, *trim right whitespace*

        This option allows to trim whitespaces in the read data, either
        from both sides of the data, or only the whitespace characters
        found on the left of the string, or only those on the right of
        the string.

  - *WITH*

    When loading from a `CSV` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *drop indexes*

        When this option is listed, pgloader issues `DROP INDEX` commands
        against all the indexes defined on the target table before copying
        the data, then `CREATE INDEX` commands once the `COPY` is done.

        In order to get the best performance possible, all the indexes are
        created in parallel and when done the primary keys are built again
        from the unique indexes just created. This two step process allows
        creating the primary key index in parallel with the other indexes,
        as only the `ALTER TABLE` command needs an *access exclusive lock*
        on the target table.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *skip header*

        Takes a numeric value as argument. Instructs pgloader to skip that
        many lines at the beginning of the input file.

      - *csv header*

        Use the first line read after *skip header* as the list of csv
        field names to be found in the CSV file, using the same CSV
        parameters as for the CSV data.

      - *trim unquoted blanks*

        When reading unquoted values in the `CSV` file, remove the blanks
        found in between the separator and the value. That behaviour is
        the default.

      - *keep unquoted blanks*

        When reading unquoted values in the `CSV` file, keep blanks found
        in between the separator and the value.

      - *fields optionally enclosed by*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used as the quoting character in the `CSV` file,
        and defaults to double-quote.

      - *fields not enclosed*

        By default, pgloader will use the double-quote character as the
        enclosing character. If you have a CSV file where fields are not
        enclosed and are using double-quote as an expected ordinary
        character, then use the option *fields not enclosed* for the CSV
        parser to accept those values.

      - *fields escaped by*

        Takes either the special value *backslash-quote* or
        *double-quote*, or any value supported by the *fields terminated
        by* option (see below). This value is used to recognize escaped
        field separators when they are to be found within the data fields
        themselves. Defaults to *double-quote*.

      - *csv escape mode*

        Takes either the special value *quote* (the default) or
        *following* and allows the CSV parser to parse either only escaped
        field separators or any character (including CSV data) when using
        the *following* value.

      - *fields terminated by*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used as the *field separator* when reading the
        `CSV` data.
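
        For instance, a tabulation separator can be spelled either way,
        the second form being illustrative of the `0x` notation:

            fields terminated by '\t'
            fields terminated by '0x9'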

      - *lines terminated by*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used to recognize the *end-of-line* condition
        when reading the `CSV` data.

## LOAD FIXED COLS

This command instructs pgloader to load data from a text file containing
columns arranged in a *fixed size* manner. Here's an example:

    LOAD FIXED
         FROM inline
              (
               a from  0 for 10,
               b from 10 for  8,
               c from 18 for  8,
               d from 26 for 17 [null if blanks, trim right whitespace]
              )
         INTO postgresql:///pgloader?fixed
              (
                 a, b,
                 c time using (time-with-no-separator c),
                 d
              )

         WITH truncate

          SET client_encoding to 'latin1',
              work_mem to '14MB',
              standard_conforming_strings to 'on'

    BEFORE LOAD DO
         $$ drop table if exists fixed; $$,
         $$ create table fixed (
             a integer,
             b date,
             c time,
             d text
            );
         $$;

    01234567892008052011431250firstline
       01234562008052115182300left blank-padded
    12345678902008052208231560another line
     2345609872014092914371500
     2345678902014092914371520

The `fixed` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. Accepts an *ENCODING* option.
    Use the `--list-encodings` option to know which encoding names are
    supported.

    The filename may be enclosed by single quotes, and could be one of the
    following special values:

      - *inline*

        The data is found after the end of the parsed commands. Any number
        of empty lines between the end of the commands and the beginning
        of the data is accepted.

      - *stdin*

        Reads the data from the standard input stream.

    The *FROM* option also supports an optional comma separated list of
    *field* names describing what is expected in the `FIXED` data file.

    Each field name is composed of the field name followed by specific
    reader options for that field. Supported per-field reader options are
    the following, where only *start* and *length* are required.

      - *start*

        Position in the line where to start reading that field's value.
        Can be entered with decimal digits or `0x` then hexadecimal
        digits.

      - *length*

        How many bytes to read from the *start* position to read that
        field's value. Same format as *start*.

    Those optional parameters must be enclosed in square brackets and
    comma-separated:

      - *terminated by*

        See the description of *fields terminated by* below.

        The processing of this option is not currently implemented.

      - *date format*

        When the field is expected of the date type, then this option
        allows specifying the date format used in the file.

        Date format strings are template strings modeled against the
        PostgreSQL `to_char` template strings support, limited to the
        following patterns:

          - YYYY, YYY, YY for the year part
          - MM for the numeric month part
          - DD for the numeric day part
          - HH, HH12, HH24 for the hour part
          - am, AM, a.m., A.M.
          - pm, PM, p.m., P.M.
          - MI for the minutes part
          - SS for the seconds part
          - MS for the milliseconds part (4 digits)
          - US for the microseconds part (6 digits)
          - unparsed punctuation signs: - . * # @ T / \ and space

        Here's an example of a *date format* specification:

            column-name [date format 'YYYY-MM-DD HH24-MI-SS.US']

      - *null if*

        This option takes an argument which is either the keyword *blanks*
        or a double-quoted string.

        When *blanks* is used and the field value that is read contains
        only space characters, then it's automatically converted to an SQL
        `NULL` value.

        When a double-quoted string is used and that string is read as
        the field value, then the field value is automatically converted
        to an SQL `NULL` value.

      - *trim both whitespace*, *trim left whitespace*, *trim right whitespace*

        This option allows to trim whitespaces in the read data, either
        from both sides of the data, or only the whitespace characters
        found on the left of the string, or only those on the right of
        the string.

  - *WITH*

    When loading from a `FIXED` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *skip header*

        Takes a numeric value as argument. Instructs pgloader to skip that
        many lines at the beginning of the input file.

## LOAD COPY FORMATTED FILES

This command instructs pgloader to load from a file containing COPY TEXT
data as described in the PostgreSQL documentation. Here's an example:

    LOAD COPY
         FROM copy://./data/track.copy
              (
                trackid, track, album, media, genre, composer,
                milliseconds, bytes, unitprice
              )
         INTO postgresql:///pgloader?track_full

         WITH truncate

          SET client_encoding to 'latin1',
              work_mem to '14MB',
              standard_conforming_strings to 'on'

    BEFORE LOAD DO
         $$ drop table if exists track_full; $$,
         $$ create table track_full (
              trackid      bigserial,
              track        text,
              album        text,
              media        text,
              genre        text,
              composer     text,
              milliseconds bigint,
              bytes        bigint,
              unitprice    numeric
            );
         $$;

The `COPY` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. This supports local files, HTTP
    URLs and zip files containing a single data file of the same name.
    Fetching such a zip file from an HTTP address is of course supported.

  - *WITH*

    When loading from a `COPY` file, the following options are supported:

      - *delimiter*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used as the *delimiter* when reading the data,
        in a similar way to the PostgreSQL `COPY` option.

      - *null*

        Takes a quoted string as an argument (quotes can be either double
        quotes or single quotes) and uses that string as the `NULL`
        representation in the data.

        This is similar to the *null* `COPY` option in PostgreSQL.

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *skip header*

        Takes a numeric value as argument. Instructs pgloader to skip that
        many lines at the beginning of the input file.

## LOAD DBF

This command instructs pgloader to load data from a `DBF` file. Here's an
example:

    LOAD DBF
        FROM http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/reg2013.dbf
        INTO postgresql://user@localhost/dbname
        WITH truncate, create table;

The `dbf` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. This supports local files, HTTP
    URLs and zip files containing a single dbf file of the same name.
    Fetching such a zip file from an HTTP address is of course supported.

  - *WITH*

    When loading from a `DBF` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *create table*

        When this option is listed, pgloader creates the table using the
        meta data found in the `DBF` file, which must contain a list of
        fields with their data type. A standard data type conversion from
        DBF to PostgreSQL is done.

      - *table name*

        This option expects as its value the possibly qualified name of
        the table to create.

## LOAD IXF

This command instructs pgloader to load data from an IBM `IXF` file. Here's
an example:

    LOAD IXF
        FROM data/nsitra.test1.ixf
        INTO postgresql:///pgloader?nsitra.test1
        WITH truncate, create table, timezone UTC

      BEFORE LOAD DO
        $$ create schema if not exists nsitra; $$,
        $$ drop table if exists nsitra.test1; $$;

The `ixf` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. This supports local files, HTTP
    URLs and zip files containing a single ixf file of the same name.
    Fetching such a zip file from an HTTP address is of course supported.

  - *WITH*

    When loading from a `IXF` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *create table*

        When this option is listed, pgloader creates the table using the
        meta data found in the `IXF` file, which must contain a list of
        fields with their data type. A standard data type conversion from
        IXF to PostgreSQL is done.

      - *table name*

        This option expects as its value the possibly qualified name of
        the table to create.

      - *timezone*

        This option allows to specify which timezone is used when parsing
        timestamps from an IXF file, and defaults to *UTC*. Expected values
        are either `UTC`, `GMT` or a single quoted location name such as
        `'Universal'` or `'Europe/Paris'`.

## LOAD ARCHIVE

This command instructs pgloader to load data from one or more files contained
in an archive. Currently the only supported archive format is *ZIP*, and the
archive might be downloaded from an *HTTP* URL.

Here's an example:

    LOAD ARCHIVE
       FROM /Users/dim/Downloads/GeoLiteCity-latest.zip
       INTO postgresql:///ip4r

       BEFORE LOAD
         DO $$ create extension if not exists ip4r; $$,
            $$ create schema if not exists geolite; $$,

         EXECUTE 'geolite.sql'

       LOAD CSV
            FROM FILENAME MATCHING ~/GeoLiteCity-Location.csv/
                 WITH ENCODING iso-8859-1
                 (
                    locId,
                    country,
                    region     null if blanks,
                    city       null if blanks,
                    postalCode null if blanks,
                    latitude,
                    longitude,
                    metroCode  null if blanks,
                    areaCode   null if blanks
                 )
            INTO postgresql:///ip4r?geolite.location
                 (
                    locid,country,region,city,postalCode,
                    location point using (format nil "(~a,~a)" longitude latitude),
                    metroCode,areaCode
                 )
            WITH skip header = 2,
                 fields optionally enclosed by '"',
                 fields escaped by double-quote,
                 fields terminated by ','

      AND LOAD CSV
            FROM FILENAME MATCHING ~/GeoLiteCity-Blocks.csv/
                 WITH ENCODING iso-8859-1
                 (
                    startIpNum, endIpNum, locId
                 )
            INTO postgresql:///ip4r?geolite.blocks
                 (
                    iprange ip4r using (ip-range startIpNum endIpNum),
                    locId
                 )
            WITH skip header = 2,
                 fields optionally enclosed by '"',
                 fields escaped by double-quote,
                 fields terminated by ','

       FINALLY DO
         $$ create index blocks_ip4r_idx on geolite.blocks using gist(iprange); $$;

The `archive` command accepts the following clauses and options:

  - *FROM*

    Filename or HTTP URI where to load the data from. When given an HTTP
    URL the linked file will get downloaded locally before processing.

    If the file is a `zip` file, the command line utility `unzip` is used
    to expand the archive into files in `$TMPDIR`, or `/tmp` if `$TMPDIR`
    is unset or set to a non-existing directory.

    Then the following commands are used from the top level directory where
    the archive has been expanded.

  - command [ *AND* command ... ]

    A series of commands against the contents of the archive, at the moment
    only `CSV`, `FIXED` and `DBF` commands are supported.

    Note that commands support the clause *FROM FILENAME MATCHING* which
    allows the pgloader command not to depend on the exact names of the
    archive directories.

    The same clause can also be applied to several files by using the
    spelling *FROM ALL FILENAMES MATCHING* and a regular expression.

    The whole *matching* clause must follow the following rule:

        FROM [ ALL FILENAMES | [ FIRST ] FILENAME ] MATCHING

  - *FINALLY DO*

    SQL Queries to run once the data is loaded, such as `CREATE INDEX`.

## LOAD MYSQL DATABASE

This command instructs pgloader to load data from a database connection. The
only supported database source is currently *MySQL*, and pgloader supports
dynamically converting the schema of the source database and building the
indexes.

A default set of casting rules is provided and might be overloaded and
appended to by the command.

Here's an example:

    LOAD DATABASE
         FROM      mysql://root@localhost/sakila
         INTO postgresql://localhost:54393/sakila

     WITH include drop, create tables, create indexes, reset sequences

      SET maintenance_work_mem to '128MB',
          work_mem to '12MB',
          search_path to 'sakila'

     CAST type datetime to timestamptz drop default drop not null using zero-dates-to-null,
          type date drop not null drop default using zero-dates-to-null,
          -- type tinyint to boolean using tinyint-to-boolean,
          type year to integer

     MATERIALIZE VIEWS film_list, staff_list

     -- INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
     -- EXCLUDING TABLE NAMES MATCHING ~<ory>
     -- DECODING TABLE NAMES MATCHING ~/messed/, ~/encoding/ AS utf8

     BEFORE LOAD DO
     $$ create schema if not exists sakila; $$;

The `database` command accepts the following clauses and options:

  - *FROM*

    Must be a connection URL pointing to a MySQL database. At the moment
    only MySQL is supported as a pgloader source.

    If the connection URI contains a table name, then only this table is
    migrated from MySQL to PostgreSQL.

  - *WITH*

    When loading from a `MySQL` database, the following options are
    supported, and the default *WITH* clause is: *no truncate*, *create
    tables*, *include drop*, *create indexes*, *reset sequences*, *foreign
    keys*, *downcase identifiers*.

    *WITH* options:

      - *include drop*

        When this option is listed, pgloader drops all the tables in the
        target PostgreSQL database whose names appear in the MySQL
        database. This option allows for using the same command several
        times in a row until you figure out all the options, starting
        automatically from a clean environment. Please note that `CASCADE`
        is used to ensure that tables are dropped even if there are
        foreign keys pointing to them. This is precisely what `include
        drop` is intended to do: drop all target tables and recreate them.

        Great care needs to be taken when using `include drop`, as it will
        cascade to *all* objects referencing the target tables, possibly
        including other tables that are not being loaded from the source
        DB.

      - *include no drop*

        When this option is listed, pgloader will not include any `DROP`
        statement when loading the data.

      - *truncate*

        When this option is listed, pgloader issues the `TRUNCATE` command
        against each PostgreSQL table just before loading data into it.

      - *no truncate*

        When this option is listed, pgloader issues no `TRUNCATE` command.

      - *disable triggers*

        When this option is listed, pgloader issues an `ALTER TABLE ...
        DISABLE TRIGGER ALL` command against the PostgreSQL target table
        before copying the data, then the command `ALTER TABLE ... ENABLE
        TRIGGER ALL` once the `COPY` is done.

        This option allows loading data into a pre-existing table ignoring
        the *foreign key constraints* and user defined triggers and may
        result in invalid *foreign key constraints* once the data is
        loaded. Use with care.

      - *create tables*

        When this option is listed, pgloader creates the tables using the
        meta data found in the `MySQL` database, which must contain a list
        of fields with their data type. A standard data type conversion
        from MySQL to PostgreSQL is done.

      - *create no tables*

        When this option is listed, pgloader skips the creation of tables
        before loading data, target tables must then already exist.

      - *create indexes*

        When this option is listed, pgloader gets the definitions of all
        the indexes found in the MySQL database and creates the same set
        of index definitions against the PostgreSQL database.

      - *create no indexes*

        When this option is listed, pgloader skips creating indexes.

      - *uniquify index names*, *preserve index names*

        MySQL index names are unique per-table whereas in PostgreSQL index
        names have to be unique per-schema. The default for pgloader is to
        change the index name by prefixing it with `idx_OID` where `OID`
        is the internal numeric identifier of the table the index is built
        against.

        In some cases, such as when the DDL is entirely left to a
        framework, it might be sensible for pgloader to refrain from
        handling index unique names; that is achieved by using the
        *preserve index names* option.

        The default is to *uniquify index names*.

        Even when using the option *preserve index names*, MySQL primary
        key indexes named "PRIMARY" will get their names uniquified.
        Failing to do so would prevent the primary keys from being created
        again in PostgreSQL, where the index names must be unique per
        schema.

      - *foreign keys*

        When this option is listed, pgloader gets the definitions of all
        the foreign keys found in the MySQL database and creates the same
        set of foreign key definitions against the PostgreSQL database.

      - *no foreign keys*

        When this option is listed, pgloader skips creating foreign keys.

      - *reset sequences*

        When this option is listed, at the end of the data loading and
        after the indexes have all been created, pgloader resets all the
        PostgreSQL sequences created to the current maximum value of the
        column they are attached to.

        The options *schema only* and *data only* have no effect on this
        option.

      - *reset no sequences*

        When this option is listed, pgloader skips resetting sequences
        after the load.

        The options *schema only* and *data only* have no effect on this
        option.
- *downcase identifiers*
|
|
|
|
When this option is listed, pgloader converts all MySQL identifiers
|
|
(table names, index names, column names) to *downcase*, except for
|
|
PostgreSQL *reserved* keywords.
|
|
|
|
The PostgreSQL *reserved* keywords are determined dynamically by
|
|
using the system function `pg_get_keywords()`.
|
|
|
|
- *quote identifiers*
|
|
|
|
When this option is listed, pgloader quotes all MySQL identifiers so
|
|
that their case is respected. Note that you will then have to do the
|
|
same thing in your application code queries.
|
|
|
|
- *schema only*
|
|
|
|
When this option is listed pgloader refrains from migrating the data
|
|
over. Note that the schema in this context includes the indexes when
|
|
the option *create indexes* has been listed.
|
|
|
|
- *data only*
|
|
|
|
When this option is listed pgloader only issues the `COPY`
|
|
statements, without doing any other processing.
|
|
|
|
- *CAST*
|
|
|
|
The cast clause allows to specify custom casting rules, either to
|
|
overload the default casting rules or to amend them with special cases.
|
|
|
|
A casting rule is expected to follow one of the forms:
|
|
|
|
type <mysql-type-name> [ <guard> ... ] to <pgsql-type-name> [ <option> ... ]
|
|
column <table-name>.<column-name> [ <guards> ] to ...
|
|
|
|
It's possible for a *casting rule* to either match against a MySQL data
|
|
type or against a given *column name* in a given *table name*. That
|
|
flexibility allows to cope with cases where the type `tinyint` might
|
|
have been used as a `boolean` in some cases but as a `smallint` in
|
|
others.
|
|
|
|
The *casting rules* are applied in order, the first match prevents
|
|
following rules to be applied, and user defined rules are evaluated
|
|
first.
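
  For instance, a hypothetical pair of rules (assuming a table `events`
  with a tinyint `retries` column) where the more specific column rule,
  listed first, wins over the generic type rule:

      cast column events.retries to smallint drop typemod,
           type tinyint to boolean drop typemod using tinyint-to-boolean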

  The supported guards are:

  - *when default 'value'*

    The casting rule is only applied against MySQL columns of the
    source type that have the given *value*, which must be a
    single-quoted or a double-quoted string.
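
    For example, one of the default MySQL casting rules (reproduced
    from the list below) uses this guard to clean up zero dates:

        type date when default "0000-00-00" to date drop default
             using zero-dates-to-null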

  - *when typemod expression*

    The casting rule is only applied against MySQL columns of the
    source type that have a *typemod* value matching the given
    *typemod expression*. The *typemod* is separated into its
    *precision* and *scale* components.

    Example of a cast rule using a *typemod* guard:

        type char when (= precision 1) to char keep typemod

    This expression casts a MySQL `char(1)` column to a PostgreSQL
    column of type `char(1)`, while in the general case `char(N)` will
    be converted by the default cast rule into a PostgreSQL type
    `varchar(N)`.

  - *with extra auto_increment*

    The casting rule is only applied against MySQL columns having the
    *extra* column `auto_increment` option set, so that it's possible
    to target e.g. `serial` rather than `integer`.

    The default matching behavior, when this option isn't set, is to
    match both columns with the extra definition and without.

    This means that if you want to implement a casting rule that
    targets either `serial` or `integer` from a `smallint` definition
    depending on the *auto_increment* extra bit of information from
    MySQL, then you need to spell out two casting rules as follows:

        type smallint with extra auto_increment
          to serial drop typemod keep default keep not null,
        type smallint
          to integer drop typemod keep default keep not null

  The supported casting options are:

  - *drop default*, *keep default*

    When the option *drop default* is listed, pgloader drops any
    existing default expression in the MySQL database for columns of
    the source type from the `CREATE TABLE` statement it generates.

    The spelling *keep default* explicitly prevents that behaviour and
    can be used to overload the default casting rules.

  - *drop not null*, *keep not null*

    When the option *drop not null* is listed, pgloader drops any
    existing `NOT NULL` constraint associated with the given source
    MySQL datatype when it creates the tables in the PostgreSQL
    database.

    The spelling *keep not null* explicitly prevents that behaviour
    and can be used to overload the default casting rules.

  - *drop typemod*, *keep typemod*

    When the option *drop typemod* is listed, pgloader drops any
    existing *typemod* definition (e.g. *precision* and *scale*) from
    the datatype definition found in the MySQL columns of the source
    type when it creates the tables in the PostgreSQL database.

    The spelling *keep typemod* explicitly prevents that behaviour and
    can be used to overload the default casting rules.

  - *using*

    This option takes as its single argument the name of a function to
    be found in the `pgloader.transforms` Common Lisp package. See
    above for details.

    It's possible to augment a default cast rule (such as the one that
    applies against the `ENUM` data type, for example) with a
    *transformation function* by omitting entirely the `type` parts of
    the casting rule, as in the following example:

        column enumerate.foo using empty-string-to-null

- *MATERIALIZE VIEWS*

  This clause allows you to implement custom data processing at the
  data source by providing a *view definition* against which pgloader
  will query the data. It's not possible to just allow for plain `SQL`
  because we want to know a lot about the exact data types of each
  column involved in the query output.

  This clause expects a comma separated list of view definitions, each
  one being either the name of an existing view in your database or
  the following expression:

      *name* `AS` `$$` *sql query* `$$`

  The *name* and the *sql query* will be used in a `CREATE VIEW`
  statement at the beginning of the data loading, and the resulting
  view will then be dropped at the end of the data loading.
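
  For example, a hypothetical inline view definition (the view name,
  table, and query are placeholders):

      MATERIALIZE VIEWS v_albums
        AS $$ SELECT id, title, artist_id FROM albums WHERE deleted = 0 $$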

- *MATERIALIZE ALL VIEWS*

  Same behaviour as *MATERIALIZE VIEWS*, using the dynamic list of
  views as returned by MySQL rather than asking the user to specify
  the list.

- *INCLUDING ONLY TABLE NAMES MATCHING*

  Introduce a comma separated list of table names or *regular
  expressions* used to limit the tables to migrate to a sublist.

  Example:

      INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'

- *EXCLUDING TABLE NAMES MATCHING*

  Introduce a comma separated list of table names or *regular
  expressions* used to exclude table names from the migration. This
  filter only applies to the result of the *INCLUDING* filter.

      EXCLUDING TABLE NAMES MATCHING ~<ory>

- *DECODING TABLE NAMES MATCHING*

  Introduce a comma separated list of table names or *regular
  expressions* used to force the encoding to use when processing data
  from MySQL. If the data encoding known to you is different from
  MySQL's idea about it, this is the option to use.

      DECODING TABLE NAMES MATCHING ~/messed/, ~/encoding/ AS utf8

  You can use as many such rules as you need, all with possibly
  different encodings.

### LIMITATIONS

The `database` command currently only supports a MySQL source database
and has the following limitations:

- Views are not migrated.

  Supporting views might require implementing a full SQL parser for
  the MySQL dialect with a porting engine to rewrite the SQL against
  PostgreSQL, including renaming functions and changing some
  constructs.

  While it's not theoretically impossible, don't hold your breath.

- Triggers are not migrated.

  The difficulty of doing so has not yet been assessed.

- `ON UPDATE CURRENT_TIMESTAMP` is currently not migrated.

  It's simple enough to implement, just not on the priority list yet.

- Of the geometric datatypes, only the `POINT` datatype has been
  covered. The other ones should be easy enough to implement now, it's
  just not done yet.

### DEFAULT MySQL CASTING RULES

When migrating from MySQL the following Casting Rules are provided:

Numbers:

- type int with extra auto_increment to serial when (< precision 10)
- type int with extra auto_increment to bigserial when (<= 10 precision)
- type int to int when (< precision 10)
- type int to bigint when (<= 10 precision)
- type tinyint with extra auto_increment to serial
- type smallint with extra auto_increment to serial
- type mediumint with extra auto_increment to serial
- type bigint with extra auto_increment to bigserial

- type tinyint to boolean when (= 1 precision) using tinyint-to-boolean

- type tinyint to smallint drop typemod
- type smallint to smallint drop typemod
- type mediumint to integer drop typemod
- type integer to integer drop typemod
- type float to float drop typemod
- type bigint to bigint drop typemod
- type double to double precision drop typemod

- type numeric to numeric keep typemod
- type decimal to decimal keep typemod

Texts:

- type char to varchar keep typemod using remove-null-characters
- type varchar to text using remove-null-characters
- type tinytext to text using remove-null-characters
- type text to text using remove-null-characters
- type mediumtext to text using remove-null-characters
- type longtext to text using remove-null-characters

Binary:

- type binary to bytea
- type varbinary to bytea
- type tinyblob to bytea
- type blob to bytea
- type mediumblob to bytea
- type longblob to bytea

Date:

- type datetime when default "0000-00-00 00:00:00" and not null
  to timestamptz drop not null drop default
  using zero-dates-to-null

- type datetime when default "0000-00-00 00:00:00"
  to timestamptz drop default
  using zero-dates-to-null

- type timestamp when default "0000-00-00 00:00:00" and not null
  to timestamptz drop not null drop default
  using zero-dates-to-null

- type timestamp when default "0000-00-00 00:00:00"
  to timestamptz drop default
  using zero-dates-to-null

- type date when default "0000-00-00" to date drop default
  using zero-dates-to-null

- type date to date
- type datetime to timestamptz
- type timestamp to timestamptz
- type year to integer drop typemod

Geometric:

- type point to point using pgloader.transforms::convert-mysql-point

Enum types are declared inline in MySQL and separately with a `CREATE
TYPE` command in PostgreSQL, so each column of Enum type is converted
to a type named after the table and column names, defined with the same
labels in the same order.
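
As a sketch of what that means in practice, assume a hypothetical MySQL
column `film.rating` declared as `ENUM('G','PG','R')`; pgloader would
then create and use a PostgreSQL type along the lines of (the exact
type name follows the table and column naming described above):

    CREATE TYPE "film_rating" AS ENUM ('G', 'PG', 'R');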

When the source type definition is matched neither in the default
casting rules nor in the casting rules provided in the command, the
type name with its typemod is used instead.

## LOAD SQLite DATABASE

This command instructs pgloader to load data from a SQLite file.
Automatic discovery of the schema is supported, including build of the
indexes.

Here's an example:

    load database
         from sqlite:///Users/dim/Downloads/lastfm_tags.db
         into postgresql:///tags

     with include drop, create tables, create indexes, reset sequences

      set work_mem to '16MB', maintenance_work_mem to '512 MB';

The `sqlite` command accepts the following clauses and options:

- *FROM*

  Path or HTTP URL to a SQLite file, might be a `.zip` file.

- *WITH*

  When loading from a `SQLite` database, the following options are
  supported, and the default *WITH* clause is: *no truncate*, *create
  tables*, *include drop*, *create indexes*, *reset sequences*,
  *downcase identifiers*, *encoding 'utf-8'*.

- *include drop*

  When this option is listed, pgloader drops all the tables in the
  target PostgreSQL database whose names appear in the SQLite database.
  This option allows for using the same command several times in a row
  until you figure out all the options, starting automatically from a
  clean environment. Please note that `CASCADE` is used to ensure that
  tables are dropped even if there are foreign keys pointing to them.
  This is precisely what `include drop` is intended to do: drop all
  target tables and recreate them.

  Great care needs to be taken when using `include drop`, as it will
  cascade to *all* objects referencing the target tables, possibly
  including other tables that are not being loaded from the source DB.

- *include no drop*

  When this option is listed, pgloader will not include any `DROP`
  statement when loading the data.

- *truncate*

  When this option is listed, pgloader issues the `TRUNCATE` command
  against each PostgreSQL table just before loading data into it.

- *no truncate*

  When this option is listed, pgloader issues no `TRUNCATE` command.

- *disable triggers*

  When this option is listed, pgloader issues an `ALTER TABLE ...
  DISABLE TRIGGER ALL` command against the PostgreSQL target table
  before copying the data, then the command `ALTER TABLE ... ENABLE
  TRIGGER ALL` once the `COPY` is done.

  This option allows loading data into a pre-existing table while
  ignoring the *foreign key constraints* and user-defined triggers, and
  may result in invalid *foreign key constraints* once the data is
  loaded. Use with care.

- *create tables*

  When this option is listed, pgloader creates the tables using the
  metadata found in the `SQLite` file, which must contain a list of
  fields with their data type. A standard data type conversion from
  SQLite to PostgreSQL is done.

- *create no tables*

  When this option is listed, pgloader skips the creation of tables
  before loading data; the target tables must then already exist.

- *create indexes*

  When this option is listed, pgloader gets the definitions of all the
  indexes found in the SQLite database and creates the same set of
  index definitions against the PostgreSQL database.

- *create no indexes*

  When this option is listed, pgloader skips creating indexes.

- *reset sequences*

  When this option is listed, at the end of the data loading and after
  the indexes have all been created, pgloader resets all the PostgreSQL
  sequences created to the current maximum value of the column they are
  attached to.

- *reset no sequences*

  When this option is listed, pgloader skips resetting sequences after
  the load.

  The options *schema only* and *data only* have no effect on this
  option.

- *schema only*

  When this option is listed, pgloader will refrain from migrating the
  data over. Note that the schema in this context includes the indexes
  when the option *create indexes* has been listed.

- *data only*

  When this option is listed, pgloader only issues the `COPY`
  statements, without doing any other processing.

- *encoding*

  This option controls which encoding the SQLite text data is parsed
  with. Defaults to UTF-8.
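
  For example, to parse the SQLite text data as Latin-1 instead, one
  might write (a sketch; adapt the encoding name to your data):

      with create tables, create indexes, encoding 'latin1'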

- *CAST*

  The cast clause allows specifying custom casting rules, either to
  overload the default casting rules or to amend them with special
  cases.

  Please refer to the MySQL CAST clause for details.

- *INCLUDING ONLY TABLE NAMES LIKE*

  Introduce a comma separated list of table name patterns used to limit
  the tables to migrate to a sublist.

  Example:

      INCLUDING ONLY TABLE NAMES LIKE 'Invoice%'

- *EXCLUDING TABLE NAMES LIKE*

  Introduce a comma separated list of table name patterns used to
  exclude table names from the migration. This filter only applies to
  the result of the *INCLUDING* filter.

      EXCLUDING TABLE NAMES LIKE 'appointments'

### DEFAULT SQLite CASTING RULES

When migrating from SQLite the following Casting Rules are provided:

Numbers:

- type tinyint to smallint
- type integer to bigint

- type float to float using float-to-string
- type real to real using float-to-string
- type double to double precision using float-to-string
- type numeric to numeric using float-to-string

Texts:

- type character to text drop typemod
- type varchar to text drop typemod
- type nvarchar to text drop typemod
- type char to text drop typemod
- type nchar to text drop typemod
- type clob to text drop typemod

Binary:

- type blob to bytea

Date:

- type datetime to timestamptz using sqlite-timestamp-to-timestamp
- type timestamp to timestamptz using sqlite-timestamp-to-timestamp
- type timestamptz to timestamptz using sqlite-timestamp-to-timestamp

## LOAD MS SQL DATABASE

This command instructs pgloader to load data from a MS SQL database.
Automatic discovery of the schema is supported, including build of the
indexes, primary and foreign keys constraints.

Here's an example:

    load database
         from mssql://user@host/dbname
         into postgresql:///dbname

    including only table names like 'GlobalAccount' in schema 'dbo'

    set work_mem to '16MB', maintenance_work_mem to '512 MB'

    before load do $$ drop schema if exists dbo cascade; $$;

The `mssql` command accepts the following clauses and options:

- *FROM*

  Connection string to an existing MS SQL database server that listens
  for and welcomes external TCP/IP connections. As pgloader currently
  piggybacks on the FreeTDS driver, to change the port of the server
  please export the `TDSPORT` environment variable.
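
  For example, to point FreeTDS at the port your instance listens on
  (the port number and load file name are placeholders):

      export TDSPORT=1533
      pgloader mssql.load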

- *WITH*

  When loading from a `MS SQL` database, the same options as when
  loading a `MySQL` database are supported. Please refer to the MySQL
  section. The following options are added:

  - *create schemas*

    When this option is listed, pgloader creates the same schemas as
    found on the MS SQL instance. This is the default.

  - *create no schemas*

    When this option is listed, pgloader refrains from creating any
    schemas at all; you must then ensure that the target schemas do
    exist.

- *CAST*

  The cast clause allows specifying custom casting rules, either to
  overload the default casting rules or to amend them with special
  cases.

  Please refer to the MySQL CAST clause for details.

- *INCLUDING ONLY TABLE NAMES LIKE '...' [, '...'] IN SCHEMA '...'*

  Introduce a comma separated list of table name patterns used to limit
  the tables to migrate to a sublist. More than one such clause may be
  used; they will be accumulated together.

  Example:

      including only table names like 'GlobalAccount' in schema 'dbo'

- *EXCLUDING TABLE NAMES LIKE '...' [, '...'] IN SCHEMA '...'*

  Introduce a comma separated list of table name patterns used to
  exclude table names from the migration. This filter only applies to
  the result of the *INCLUDING* filter.

      excluding table names like 'LocalAccount' in schema 'dbo'

### DEFAULT MS SQL CASTING RULES

When migrating from MS SQL the following Casting Rules are provided:

Numbers:

- type tinyint to smallint

- type float to float using float-to-string
- type real to real using float-to-string
- type double to double precision using float-to-string
- type numeric to numeric using float-to-string
- type decimal to numeric using float-to-string
- type money to numeric using float-to-string
- type smallmoney to numeric using float-to-string

Texts:

- type char to text drop typemod
- type nchar to text drop typemod
- type varchar to text drop typemod
- type nvarchar to text drop typemod
- type xml to text drop typemod

Binary:

- type binary to bytea using byte-vector-to-bytea
- type varbinary to bytea using byte-vector-to-bytea

Date:

- type datetime to timestamptz
- type datetime2 to timestamptz

Others:

- type bit to boolean
- type hierarchyid to bytea
- type geography to bytea
- type uniqueidentifier to uuid using sql-server-uniqueidentifier-to-uuid

## TRANSFORMATION FUNCTIONS

Some data types are implemented in a different enough way that a
transformation function is necessary. This function must be written in
`Common Lisp` and is searched for in the `pgloader.transforms` package.

Some default transformation functions are provided with pgloader, and
you can use the `--load` command line option to load and compile your
own lisp file into pgloader at runtime. For your functions to be found,
remember to begin your lisp file with the following form:

    (in-package #:pgloader.transforms)
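
As an illustration, a minimal custom transformation function might look
like the following sketch (the function name and the mapping it applies
are hypothetical, not part of pgloader):

    (in-package #:pgloader.transforms)

    ;; Map the literal string "N/A" to nil, which pgloader loads as a
    ;; PostgreSQL NULL; any other value is passed through unchanged.
    (defun na-to-null (value)
      (if (and (stringp value) (string= value "N/A"))
          nil
          value))

Such a function could then be referenced from a cast rule with `using
na-to-null`.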

The provided transformation functions are:

- *zero-dates-to-null*

  When the input date is all zeroes, return `nil`, which gets loaded as
  a PostgreSQL `NULL` value.

- *date-with-no-separator*

  Applies *zero-dates-to-null*, then transforms the given date into a
  format that PostgreSQL will actually process:

      In: "20041002152952"
      Out: "2004-10-02 15:29:52"

- *time-with-no-separator*

  Transforms the given time into a format that PostgreSQL will actually
  process:

      In: "08231560"
      Out: "08:23:15.60"

- *tinyint-to-boolean*

  As MySQL lacks a proper boolean type, *tinyint* is often used to
  implement that. This function transforms `0` to `'false'` and
  anything else to `'true'`.
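
  For example:

      In: 0
      Out: "false"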

- *bits-to-boolean*

  As MySQL lacks a proper boolean type, *BIT* is often used to
  implement that. This function transforms 1-bit bit vectors from `0`
  to `f` and any other value to `t`.

- *int-to-ip*

  Converts an integer into a dotted representation of an IPv4 address.

      In: 18435761
      Out: "1.25.78.177"

- *ip-range*

  Converts a couple of integers given as strings into a range of IPv4
  addresses.

      In: "16825344" "16825599"
      Out: "1.0.188.0-1.0.188.255"

- *convert-mysql-point*

  Converts from the `astext` representation of points in MySQL to the
  PostgreSQL representation.

      In: "POINT(48.5513589 7.6926827)"
      Out: "(48.5513589,7.6926827)"

- *float-to-string*

  Converts a Common Lisp float into a string suitable for a PostgreSQL
  float:

      In: 100.0d0
      Out: "100.0"

- *set-to-enum-array*

  Converts a string representing a MySQL SET into a PostgreSQL array of
  Enum values from the set.

      In: "foo,bar"
      Out: "{foo,bar}"

- *empty-string-to-null*

  Converts an empty string to a null.

- *right-trim*

  Removes whitespace at the end of a string.

- *remove-null-characters*

  Removes `NUL` characters (`0x0`) from given strings.

- *byte-vector-to-bytea*

  Transforms a simple array of unsigned bytes to the PostgreSQL bytea
  Hex Format representation as documented at
  http://www.postgresql.org/docs/9.3/interactive/datatype-binary.html

- *sqlite-timestamp-to-timestamp*

  The SQLite type system is quite flexible, so this function copes with
  it to produce timestamp literals as expected by PostgreSQL. That
  covers years printed on 4 digits, zero dates converted to null, and
  proper date strings.

- *sql-server-uniqueidentifier-to-uuid*

  The SQL Server driver receives data of type uniqueidentifier as a
  byte vector that then needs to be converted to a UUID string for the
  PostgreSQL `COPY` input format to process.

- *unix-timestamp-to-timestamptz*

  Converts a unix timestamp (the number of seconds elapsed since the
  beginning of 1970) into a proper PostgreSQL timestamp format.
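
  For example, the Unix epoch itself (the exact output format shown is
  indicative):

      In: 0
      Out: "1970-01-01 00:00:00"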

## LOAD MESSAGES

This command is still experimental and allows receiving messages via
UDP using a syslog-like format and, depending on rule matching, loading
named portions of the data stream into a destination table.

    LOAD MESSAGES
        FROM syslog://localhost:10514/

       WHEN MATCHES rsyslog-msg IN apache
        REGISTERING timestamp, ip, rest
       INTO postgresql://localhost/db?logs.apache
        SET guc_1 = 'value', guc_2 = 'other value'

       WHEN MATCHES rsyslog-msg IN others
        REGISTERING timestamp, app-name, data
       INTO postgresql://localhost/db?logs.others
        SET guc_1 = 'value', guc_2 = 'other value'

       WITH apache = rsyslog
            DATA = IP REST
            IP = 1*3DIGIT "." 1*3DIGIT "."1*3DIGIT "."1*3DIGIT
            REST = ~/.*/

       WITH others = rsyslog;

As the command is still experimental, the options might be changed in
the future and the details are not documented.

## AUTHOR

Dimitri Fontaine <dimitri@2ndQuadrant.fr>

## SEE ALSO

PostgreSQL COPY documentation at <http://www.postgresql.org/docs/9.3/static/sql-copy.html>.

The pgloader source code, binary packages, documentation and examples
may be downloaded from <http://pgloader.io/>.