% PGLOADER(1) pgloader User Manuals
% Dimitri Fontaine
% September 8, 2013

# NAME

pgloader - PostgreSQL data loader

# SYNOPSIS

pgloader [*options*] [*command-file*]...

# DESCRIPTION

pgloader loads data from different sources into PostgreSQL. It can
transform the data it reads on the fly and send raw SQL before and after
the loading. It uses the `COPY` PostgreSQL protocol to stream the data
into the server, and manages errors by filling a pair of *reject.dat* and
*reject.log* files.

pgloader operates from commands which are read from files:

    pgloader commands.load

# OPTIONS

-h, \--help
:   Show command usage summary and exit.

-V, \--version
:   Show pgloader version string and exit.

-v, \--verbose
:   Be verbose.

-q, \--quiet
:   Be quiet.

-d, \--debug
:   Show debug level information messages.

\--log-min-messages
:   Minimum level of verbosity needed for a log message to make it to the
    logfile. One of critical, log, error, warning, notice, info or debug.

\--client-min-messages
:   Minimum level of verbosity needed for a log message to make it to the
    console. One of critical, log, error, warning, notice, info or debug.

-E, \--list-encodings
:   List known encodings in this version of pgloader.

-U, \--upgrade-config
:   Parse the files given on the command line as `pgloader.conf` files
    using the `INI` syntax that was in use in pgloader versions 2.x, and
    output the new command syntax for pgloader on standard output.

-l *FILE*, \--load *FILE*
:   Specify a lisp *FILE* to compile and load into the pgloader image
    before reading the commands, allowing the definition of extra
    transformation functions. Those functions should be defined in the
    `pgloader.transforms` package. This option can appear more than once
    on the command line.

# BATCHES AND RETRY BEHAVIOUR

TODO. Add CLI options for batch size maybe.

# COMMANDS

pgloader supports the following commands:

  - `LOAD CSV`
  - `LOAD FIXED`
  - `LOAD DBF`
  - `LOAD SQLite`
  - `LOAD MYSQL`
  - `LOAD ARCHIVE`
  - `LOAD DATABASE`
  - `LOAD MESSAGES`

The pgloader commands follow the same grammar rules. Each of them might
support only a subset of the general options and provide specific options.

    LOAD <something>
         FROM <source-url>     [ WITH <source-options> ]
         INTO <postgresql-url> [ WITH <target-options> ]
       [ SET <postgresql-settings> ]
    ;

The main clauses are the `LOAD`, `FROM`, `INTO` and `WITH` clauses that
each command implements. Some commands then implement the `SET` clause, or
some specific clauses such as the `CAST` clause.

### Connection String

The `<postgresql-url>` parameter is expected to be given as a *Connection
URI* as documented in the PostgreSQL documentation at
http://www.postgresql.org/docs/9.3/static/libpq-connect.html#LIBPQ-CONNSTRING.

    postgresql://[user[:password]@][netloc][:port][/dbname][?schema.table]

Where:

  - *user*

    Can contain any character, including the colon (`:`), which must then
    be doubled (`::`), and the at sign (`@`), which must then be doubled
    (`@@`).

  - *password*

    Can contain any character, including the at sign (`@`), which must
    then be doubled (`@@`). To leave the password empty when the *user*
    name ends with an at sign, you then have to use the syntax user:@.

  - *netloc*

    Can be either a hostname in dotted notation or an IPv4 address.

  - *dbname*

    Should be a proper identifier (a letter followed by a mix of letters,
    digits and the punctuation signs comma (`,`), dash (`-`) and
    underscore (`_`)).

  - The only optional parameter should be a possibly qualified table name.
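As an illustration of the quoting rules above, a user named `dim` with the
password `p@ss` loading into table `geolite.blocks` of database `ip4r`
would double the at sign found in the password (all names here are made up
for the example):

    postgresql://dim:p@@ss@localhost:5432/ip4r?geolite.blocks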
### Regular Expressions

Several clauses listed in the following accept *regular expressions* with
the following input rules:

  - A regular expression begins with a tilde sign (`~`),

  - is then followed with an opening sign,

  - then any character is allowed and considered part of the regular
    expression, except for the closing sign,

  - then a closing sign is expected.

The opening and closing signs are allowed in pairs; here's the complete
list of allowed delimiters:

    ~// ~[] ~{} ~() ~<> ~"" ~'' ~|| ~##

Pick the set of delimiters that doesn't collide with the *regular
expression* you're trying to input. If your expression is such that none
of the solutions allow you to enter it, the places where such expressions
are allowed should allow for a list of expressions.

### Comments

Any command may contain comments, following those input rules:

  - the `--` delimiter begins a comment that ends with the end of the
    current line,

  - the delimiters `/*` and `*/` respectively start and end a comment,
    which can be found in the middle of a command or span several lines.

Any place where you could enter a *whitespace* will accept a comment too.

## LOAD CSV

This command instructs pgloader to load data from a `CSV` file. Here's an
example:

    LOAD CSV
       FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-646-us
            (
               startIpNum, endIpNum, locId
            )
       INTO postgresql://user@localhost:54393/dbname?geolite.blocks
            (
               iprange ip4r using (ip-range startIpNum endIpNum),
               locId
            )
       WITH truncate,
            skip header = 2,
            fields optionally enclosed by '"',
            fields escaped by backslash-quote,
            fields terminated by '\t'
        SET work_mem to '32 MB', maintenance_work_mem to '64 MB';

The `csv` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. Accepts an *ENCODING* option.
    Use the `--list-encodings` option to know which encoding names are
    supported.

    The filename may be enclosed by single quotes, and could be one of the
    following special values:

      - *inline*

        The data is found after the end of the parsed commands. Any number
        of empty lines between the end of the commands and the beginning
        of the data is accepted.

      - *stdin*

        Reads the data from the standard input stream.

    The *FROM* option also supports an optional comma separated list of
    *field* names describing what is expected in the `CSV` data file.

    Each field name can be either only one name or a name followed with
    specific reader options for that field. Supported per-field reader
    options are:

      - *terminated by*

        See the description of *fields terminated by* below.

        The processing of this option is not currently implemented.

      - *date format*

        When the field is expected to be of the date type, then this
        option allows specifying the date format used in the file.

        The processing of this option is not currently implemented.

      - *null if*

        This option takes an argument which is either the keyword *blanks*
        or a double-quoted string.

        When *blanks* is used and the field value that is read contains
        only space characters, then it's automatically converted to an SQL
        `NULL` value.

        When a double-quoted string is used and that string is read as the
        field value, then the field value is automatically converted to an
        SQL `NULL` value.

  - *INTO*

    The PostgreSQL connection URI must contain the name of the target
    table where to load the data into. That table must have already been
    created in PostgreSQL, and the name might be schema qualified.

    The *INTO* option also supports an optional comma separated list of
    target columns, which are either the name of an input *field* or the
    whitespace separated list of the target column name, its PostgreSQL
    data type and a *USING* expression.

    The *USING* expression can be any valid Common Lisp form and will be
    read with the current package set to `pgloader.transforms`, so that
    you can use functions defined in that package, such as functions
    loaded dynamically with the `--load` command line parameter.

    Each *USING* expression is compiled at runtime to native code, and
    will be called in a context such as:

        (destructuring-bind (field-name-1 field-name-2 ...)
            row
          (list column-name-1
                column-name-2
                (expression column-name-1 column-name-2)))

    This feature allows pgloader to load any number of fields in a CSV
    file into a possibly different number of columns in the database,
    using custom code for that projection.

  - *WITH*

    When loading from a `CSV` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *skip header*

        Takes a numeric value as argument. Instructs pgloader to skip that
        many lines at the beginning of the input file.

      - *trim unquoted blanks*

        When reading unquoted values in the `CSV` file, remove the blanks
        found in between the separator and the value. That behaviour is
        the default.

      - *keep unquoted blanks*

        When reading unquoted values in the `CSV` file, keep blanks found
        in between the separator and the value.

      - *fields optionally enclosed by*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used as the quoting character in the `CSV` file,
        and defaults to double-quote.

      - *fields escaped by*

        Takes either the special value *backslash-quote* or
        *double-quote*. This value is used to recognize escaped field
        separators when they are to be found within the data fields
        themselves. Defaults to *double-quote*.

      - *fields terminated by*

        Takes a single character as argument, which must be found inside
        single quotes, and might be given as the printable character
        itself, the special value \t to denote a tabulation character, or
        `0x` then a hexadecimal value read as the ASCII code for the
        character.

        This character is used as the *field separator* when reading the
        `CSV` data.

  - *SET*

    This clause allows specifying session parameters to be set for all the
    sessions opened by pgloader. It expects a list of parameter name, the
    equal sign, then the single-quoted value as a comma separated list.

    The names and values of the parameters are not validated by pgloader,
    they are given as-is to PostgreSQL.

  - *BEFORE LOAD DO*

    You can run SQL queries against the database before loading the data
    from the `CSV` file. The most common SQL queries are `CREATE TABLE IF
    NOT EXISTS` statements, so that the data can be loaded.

    Each command must be *dollar-quoted*: it must begin and end with a
    double dollar sign, `$$`. Dollar-quoted queries are then comma
    separated. No extra punctuation is expected after the last SQL query.

  - *AFTER LOAD DO*

    Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in
    that section are executed once the load is done. That's the right time
    to create indexes and constraints, or re-enable triggers.
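To illustrate the *USING* machinery described above, here is a minimal
sketch of a transformation function that could live in a file loaded with
`--load`; the file name and the function `empty-string-to-null` are
hypothetical examples, not part of pgloader:

    ;; my-transforms.lisp (hypothetical file, loaded with:
    ;;   pgloader --load my-transforms.lisp commands.load)
    (in-package :pgloader.transforms)

    (defun empty-string-to-null (string)
      "Return NIL (an SQL NULL) when STRING is empty or all blanks,
       otherwise return STRING with surrounding spaces trimmed."
      (when string
        (let ((trimmed (string-trim " " string)))
          (unless (string= trimmed "") trimmed))))

A target column could then reference it from its *INTO* column list, for
instance as `city text using (empty-string-to-null city)`.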
## LOAD FIXED COLS

This command instructs pgloader to load data from a text file containing
columns arranged in a *fixed size* manner. Here's an example:

    LOAD FIXED
         FROM inline (a 0 10, b 10 8, c 18 8, d 26 17)
         INTO postgresql:///pgloader?fixed
              (
                 a, b,
                 c time using (time-with-no-separator c),
                 d
              )
         WITH truncate
          SET client_encoding to 'latin1',
              work_mem to '14MB',
              standard_conforming_strings to 'on'
       BEFORE LOAD DO
          $$ drop table if exists fixed; $$,
          $$ create table fixed (
              a integer,
              b date,
              c time,
              d text
             );
          $$;

     01234567892008052011431250firstline
        01234562008052115182300left blank-padded
     12345678902008052208231560another line

The `fixed` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. Accepts an *ENCODING* option.
    Use the `--list-encodings` option to know which encoding names are
    supported.

    The filename may be enclosed by single quotes, and could be one of the
    following special values:

      - *inline*

        The data is found after the end of the parsed commands. Any number
        of empty lines between the end of the commands and the beginning
        of the data is accepted.

      - *stdin*

        Reads the data from the standard input stream.

    The *FROM* option also supports an optional comma separated list of
    *field* names describing what is expected in the `FIXED` data file.

    Each field name is composed of the field name followed by specific
    reader options for that field. Supported per-field reader options are
    the following, where only *start* and *length* are required:

      - *start*

        Position in the line where to start reading that field's value.
        Can be entered with decimal digits or `0x` then hexadecimal
        digits.

      - *length*

        How many bytes to read from the *start* position to read that
        field's value. Same format as *start*.

      - *terminated by*

        See the description of *fields terminated by* above.

        The processing of this option is not currently implemented.

      - *date format*

        When the field is expected to be of the date type, then this
        option allows specifying the date format used in the file.

        The processing of this option is not currently implemented.

      - *null if*

        This option takes an argument which is either the keyword *blanks*
        or a double-quoted string.

        When *blanks* is used and the field value that is read contains
        only space characters, then it's automatically converted to an SQL
        `NULL` value.

        When a double-quoted string is used and that string is read as the
        field value, then the field value is automatically converted to an
        SQL `NULL` value.

  - *INTO*

    The PostgreSQL connection URI must contain the name of the target
    table where to load the data into. That table must have already been
    created in PostgreSQL, and the name might be schema qualified.

    The *INTO* option also supports an optional comma separated list of
    target columns, which are either the name of an input *field* or the
    whitespace separated list of the target column name, its PostgreSQL
    data type and a *USING* expression.

    The *USING* expression can be any valid Common Lisp form and will be
    read with the current package set to `pgloader.transforms`, so that
    you can use functions defined in that package, such as functions
    loaded dynamically with the `--load` command line parameter.

    Each *USING* expression is compiled at runtime to native code, and
    will be called in a context such as:

        (destructuring-bind (field-name-1 field-name-2 ...)
            row
          (list column-name-1
                column-name-2
                (expression column-name-1 column-name-2)))

    This feature allows pgloader to load any number of fields in a fixed
    format file into a possibly different number of columns in the
    database, using custom code for that projection.

  - *WITH*

    When loading from a `FIXED` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *skip header*

        Takes a numeric value as argument. Instructs pgloader to skip that
        many lines at the beginning of the input file.

  - *SET*

    This clause allows specifying session parameters to be set for all the
    sessions opened by pgloader. It expects a list of parameter name, the
    equal sign, then the single-quoted value as a comma separated list.

    The names and values of the parameters are not validated by pgloader,
    they are given as-is to PostgreSQL.

  - *BEFORE LOAD DO*

    You can run SQL queries against the database before loading the data
    from the `FIXED` file. The most common SQL queries are `CREATE TABLE
    IF NOT EXISTS` statements, so that the data can be loaded.

    Each command must be *dollar-quoted*: it must begin and end with a
    double dollar sign, `$$`. Dollar-quoted queries are then comma
    separated. No extra punctuation is expected after the last SQL query.

  - *AFTER LOAD DO*

    Same format as *BEFORE LOAD DO*, the dollar-quoted queries found in
    that section are executed once the load is done. That's the right time
    to create indexes and constraints, or re-enable triggers.

## LOAD DBF

This command instructs pgloader to load data from a `DBF` file. Here's an
example:

    LOAD DBF
        FROM http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/reg2013.dbf
        INTO postgresql://user@localhost/dbname
        WITH truncate, create table;

The `dbf` format command accepts the following clauses and options:

  - *FROM*

    Filename where to load the data from. This supports local files, HTTP
    URLs and zip files containing a single dbf file of the same name.
    Fetching such a zip file from an HTTP address is of course supported.

  - *INTO*

    The PostgreSQL connection URI. If it doesn't have a table name in the
    target, then the name part of the filename will be used as a table
    name.

  - *WITH*

    When loading from a `DBF` file, the following options are supported:

      - *truncate*

        When this option is listed, pgloader issues a `TRUNCATE` command
        against the PostgreSQL target table before reading the data file.

      - *create table*

        When this option is listed, pgloader creates the table using the
        meta data found in the `DBF` file, which must contain a list of
        fields with their data type. A standard data type conversion from
        DBF to PostgreSQL is done.

      - *table name*

        This option expects as its value the possibly qualified name of
        the table to create.

  - *SET*

    This clause allows specifying session parameters to be set for all the
    sessions opened by pgloader. It expects a list of parameter name, the
    equal sign, then the single-quoted value as a comma separated list.

    The names and values of the parameters are not validated by pgloader,
    they are given as-is to PostgreSQL.

## LOAD ARCHIVE

This command instructs pgloader to load data from one or more files
contained in an archive. Currently the only supported archive format is
*ZIP*, and the archive might be downloaded from an *HTTP* URL.
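Schematically, an archive load combines a target connection, optional
setup SQL, one or more load commands joined with `AND`, and a final SQL
block. The following shape is only a sketch of the clauses detailed after
the example, with angle brackets marking placeholders:

    LOAD ARCHIVE
         FROM <archive-file-or-http-url>
         INTO <postgresql-url>
       [ BEFORE LOAD DO $$ <sql> $$, ... ]
         <command> [ AND <command> ... ]
       [ FINALLY DO $$ <sql> $$, ... ]
    ;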
Here's an example:

    LOAD ARCHIVE
       FROM /Users/dim/Downloads/GeoLiteCity-latest.zip
       INTO postgresql:///ip4r

       BEFORE LOAD DO
         $$ create extension if not exists ip4r; $$,
         $$ create schema if not exists geolite; $$,
         $$ create table if not exists geolite.location
           (
              locid      integer primary key,
              country    text,
              region     text,
              city       text,
              postalcode text,
              location   point,
              metrocode  text,
              areacode   text
           );
         $$,
         $$ create table if not exists geolite.blocks
           (
              iprange ip4r,
              locid   integer
           );
         $$,
         $$ drop index if exists geolite.blocks_ip4r_idx; $$,
         $$ truncate table geolite.blocks, geolite.location cascade; $$

       LOAD CSV
            FROM FILENAME MATCHING ~/GeoLiteCity-Location.csv/
                 WITH ENCODING iso-8859-1
                 (
                    locId,
                    country,
                    region     null if blanks,
                    city       null if blanks,
                    postalCode null if blanks,
                    latitude,
                    longitude,
                    metroCode  null if blanks,
                    areaCode   null if blanks
                 )
            INTO postgresql:///ip4r?geolite.location
                 (
                    locid,country,region,city,postalCode,
                    location point using (format nil "(~a,~a)" longitude latitude),
                    metroCode,areaCode
                 )
            WITH skip header = 2,
                 fields optionally enclosed by '"',
                 fields escaped by double-quote,
                 fields terminated by ','

      AND LOAD CSV
            FROM FILENAME MATCHING ~/GeoLiteCity-Blocks.csv/
                 WITH ENCODING iso-8859-1
                 (
                    startIpNum, endIpNum, locId
                 )
            INTO postgresql:///ip4r?geolite.blocks
                 (
                    iprange ip4r using (ip-range startIpNum endIpNum),
                    locId
                 )
            WITH skip header = 2,
                 fields optionally enclosed by '"',
                 fields escaped by double-quote,
                 fields terminated by ','

       FINALLY DO
         $$ create index blocks_ip4r_idx on geolite.blocks using gist(iprange); $$;

The `archive` command accepts the following clauses and options:

  - *FROM*

    Filename or HTTP URI where to load the data from. When given an HTTP
    URL, the linked file will get downloaded locally before processing.

    If the file is a `zip` file, the command line utility `unzip` is used
    to expand the archive into files in `$TMPDIR`, or `/tmp` if `$TMPDIR`
    is unset or set to a non-existing directory.

    Then the following commands are used from the top level directory
    where the archive has been expanded.

  - *INTO*

    A PostgreSQL database connection URL is expected and will be used in
    the *BEFORE LOAD DO* and *FINALLY DO* clauses.

  - *BEFORE LOAD DO*

    You can run SQL queries against the database before loading from the
    data files found in the archive. The most common SQL queries are
    `CREATE TABLE IF NOT EXISTS` statements, so that the data can be
    loaded.

    Each command must be *dollar-quoted*: it must begin and end with a
    double dollar sign, `$$`. Queries are then comma separated. No extra
    punctuation is expected after the last SQL query.

  - command [ *AND* command ... ]

    A series of commands against the contents of the archive; at the
    moment only `CSV` and `DBF` commands are supported.

    Note that those commands support the *FROM FILENAME MATCHING* clause,
    which allows the pgloader command not to depend on the exact names of
    the archive directories.

  - *FINALLY DO*

    SQL queries to run once the data is loaded, such as `CREATE INDEX`.

## LOAD MYSQL DATABASE

This command instructs pgloader to load data from a database connection.
The only supported database source is currently *MySQL*, and pgloader
supports dynamically converting the schema of the source database and
building the indexes.

A default set of casting rules is provided and might be overloaded and
appended to by the command.
Here's an example:

    LOAD DATABASE
         FROM      mysql://root@localhost/sakila
         INTO postgresql://localhost:54393/sakila

         WITH include drop, create tables, create indexes, reset sequences

          SET maintenance_work_mem to '128MB',
              work_mem to '12MB',
              search_path to 'sakila'

         CAST type datetime to timestamptz drop default drop not null using zero-dates-to-null,
              type date drop not null drop default using zero-dates-to-null,
              -- type tinyint to boolean using tinyint-to-boolean,
              type year to integer

     -- INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
     -- EXCLUDING TABLE NAMES MATCHING ~<ory>

       BEFORE LOAD DO
       $$ create schema if not exists sakila; $$;

The `database` command accepts the following clauses and options:

  - *FROM*

    Must be a connection URL pointing to a MySQL database. At the moment
    only MySQL is supported as a pgloader source.

    If the connection URI contains a table name, then only this table is
    migrated from MySQL to PostgreSQL.

  - *INTO*

    The target PostgreSQL connection URI.

  - *WITH*

    When loading from a `MySQL` database, the following options are
    supported:

      - *include drop*

        When this option is listed, pgloader drops, in the target
        PostgreSQL database, all the tables whose names have been found in
        the MySQL database. This option allows using the same command
        several times in a row until you figure out all the options,
        starting automatically from a clean environment.

      - *include no drop*

        When this option is listed, pgloader will not include any `DROP`
        statement when loading the data.

      - *truncate*

        When this option is listed, pgloader issues the `TRUNCATE` command
        against each PostgreSQL table just before loading data into it.

      - *no truncate*

        When this option is listed, pgloader issues no `TRUNCATE` command.

      - *create tables*

        When this option is listed, pgloader creates the tables using the
        meta data found in the `MySQL` database, which must contain a list
        of fields with their data type. A standard data type conversion
        from MySQL to PostgreSQL is done.

      - *create no tables*

        When this option is listed, pgloader skips the creation of tables
        before loading data; target tables must then already exist.

      - *create indexes*

        When this option is listed, pgloader gets the definitions of all
        the indexes found in the MySQL database and creates the same set
        of index definitions against the PostgreSQL database.

      - *create no indexes*

        When this option is listed, pgloader skips creating indexes.

      - *foreign keys*

        When this option is listed, pgloader gets the definitions of all
        the foreign keys found in the MySQL database and creates the same
        set of foreign key definitions against the PostgreSQL database.

      - *no foreign keys*

        When this option is listed, pgloader skips creating foreign keys.

      - *reset sequences*

        When this option is listed, at the end of the data loading and
        after the indexes have all been created, pgloader resets all the
        PostgreSQL sequences created to the current maximum value of the
        column they are attached to.

        The options *schema only* and *data only* have no effect on this
        option.

      - *reset no sequences*

        When this option is listed, pgloader skips resetting sequences
        after the load.

        The options *schema only* and *data only* have no effect on this
        option.

      - *downcase identifiers*

        When this option is listed, pgloader converts all MySQL
        identifiers (table names, index names, column names) to
        *downcase*, except for PostgreSQL *reserved* keywords.

        The PostgreSQL *reserved* keywords are determined dynamically by
        using the system function `pg_get_keywords()`.

      - *quote identifiers*

        When this option is listed, pgloader quotes all MySQL identifiers
        so that their case is respected. Note that you will then have to
        do the same thing in your application code queries.

      - *schema only*

        When this option is listed, pgloader refrains from migrating the
        data over. Note that the schema in this context includes the
        indexes when the option *create indexes* has been listed.

      - *data only*

        When this option is listed, pgloader only issues the `COPY`
        statements, without doing any other processing.

  - *SET*

    This clause allows specifying session parameters to be set for all the
    sessions opened by pgloader. It expects a list of parameter name, the
    equal sign, then the single-quoted value as a comma separated list.

    The names and values of the parameters are not validated by pgloader,
    they are given as-is to PostgreSQL.

  - *CAST*

    The cast clause allows specifying custom casting rules, either to
    overload the default casting rules or to amend them with special
    cases.

    A casting rule is expected to follow the form:

        type <mysql-type-name> to <pgsql-type-name> [ <option> ... ]
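For instance, the example above overrides the default conversion for MySQL
`datetime` and `year` columns with the following rules, dropping the MySQL
default values and `NOT NULL` constraints and turning zero dates into SQL
`NULL` values through the `zero-dates-to-null` transformation:

    CAST type datetime to timestamptz
              drop default drop not null using zero-dates-to-null,
         type year to integer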