mirror of
https://github.com/dimitri/pgloader.git
synced 2025-08-12 01:07:00 +02:00
865 lines
28 KiB
Plaintext
865 lines
28 KiB
Plaintext
= pgloader(1) =
|
|
|
|
== NAME ==
|
|
|
|
pgloader - Import CSV data and Large Object to PostgreSQL
|
|
|
|
== SYNOPSIS ==
|
|
|
|
pgloader [--version] [-c configuration file]
|
|
[-p pedantic] [-d debug] [-v verbose] [-q quiet] [-s summary]
|
|
[-l loglevel] [-L logfile]
|
|
[-n dryrun] [-Cn count] [-Fn from] [-In from id]
|
|
[-E input files encoding] [-R reformat:path]
|
|
[Section|Filename ...]
|
|
|
|
== DESCRIPTION ==
|
|
|
|
+pgloader+ imports data from a flat file and insert it into a database
|
|
table. It uses a flat file per database table, and you can configure
|
|
as many Sections as you want, each one associating a table name and a
|
|
data file.
|
|
|
|
Data are parsed and rewritten, then given to PostgreSQL +COPY+
|
|
command. Parsing is necessary for dealing with end of lines and
|
|
eventual trailing separator characters, and for column reordering:
|
|
your flat data file may not have the same column order as the database
|
|
table has.
|
|
|
|
+pgloader+ is also able to load some large objects data into
|
|
PostgreSQL, as of now only Informix +UNLOAD+ data files are
|
|
supported. This command gives large objects data location information
|
|
into the main data file. +pgloader+ parse it add the +text+ or +bytea+
|
|
content properly escaped to the +COPY+ data.
|
|
|
|
+pgloader+ issue some timing statistics every +commit_every+ commits
|
|
(see Configuration for this setting). At the end of each section
|
|
processing, a summary of overall operations, numbers of rows copied
|
|
and commits, time it took in seconds, errors logged and database
|
|
errors is issued.
|
|
|
|
+pgloader+ is available from +pgfoundry+ at
|
|
http://pgfoundry.org/projects/pgloader/[], where you'll find a debian
|
|
package, a source package and an anonymous CVS.
|
|
|
|
== ARGUMENTS ==
|
|
|
|
+pgloader+ as of version +2.3.3+ accepts two kinds of arguments, either
|
|
section names of file names. If both a section and a file exist with the
|
|
same name, preference is given to the section, where you can edit your
|
|
settings rather than using default ones.
|
|
|
|
Section::
|
|
+
|
|
is the name of a configured Section describing some data to load
|
|
+
|
|
Section arguments are optional, if no section is given all configured
|
|
sections are processed.
|
|
|
|
Filename::
|
|
|
|
The name of a file containing the data to load. +pgloader+ will
|
|
internally setup a +Section+ for this filename, with the default field
|
|
separator or the given +--field-separator+ and the +columns+ parameter
|
|
set to '*', and more importantly the format set to +CSV+. It's the only
|
|
supported format with sane enough defaults to apply here.
|
|
|
|
== OPTIONS ==
|
|
|
|
In order for pgloader to run, you have to edit a configuration file
|
|
(see Configuration) consisting of Section definitions. Each section
|
|
refers to a PostgreSQL table into which some data is to be loaded.
|
|
|
|
--version::
|
|
|
|
print out pgloader version, then quit.
|
|
|
|
-c, --config::
|
|
|
|
specifies the configuration file to use. The default file name is
|
|
pgloader.conf, searched into current working directory.
|
|
|
|
-p, --pedantic::
|
|
|
|
activates the pedantic mode, where any warning is considered as a fatal
|
|
error, thus stopping the processing of the input file.
|
|
|
|
-d, --debug::
|
|
|
|
makes pgloader say it all about what it does. debug implies verbose.
|
|
|
|
-v, --verbose::
|
|
|
|
makes pgloader very verbose about what it does.
|
|
|
|
-q, --quiet::
|
|
|
|
makes pgloader very quiet about what it does: only output errors.
|
|
|
|
-l, --loglevel::
|
|
|
|
log level to use when reporting to the console, see +client_min_messages+.
|
|
|
|
-L, --logfile::
|
|
|
|
file where to log messages, see +log_min_messages+.
|
|
|
|
-r, --reject-log::
|
|
|
|
Filename, with a single "%s" placeholder, where to store the bad data
|
|
logs (that's the error messages given by PostgreSQL). If you want to put
|
|
a percent in the file name, write it '%%'.
|
|
|
|
-j, --reject-data::
|
|
|
|
Filename, with a single "%s" placeholder, where to store the bad data
|
|
(the exact lines that didn't make it from your input file). If you want
|
|
to put a percent in the file name, write it '%%'.
|
|
|
|
-s, --summary::
|
|
|
|
makes pgloader print a 'nice' summary at the end of operations.
|
|
|
|
-n, --dry-run::
|
|
|
|
makes pgloader simulate operations, that implies no database connection and
|
|
no data extraction from blob files.
|
|
|
|
-D, --disable-triggers::
|
|
|
|
makes +pgloader+ issue a +ALTER TABLE <table> DISABLE TRIGGER ALL+
|
|
before loading the data, and +ENABLE+ them again once data is
|
|
loaded.
|
|
|
|
-T, --truncate::
|
|
|
|
makes +pgloader+ issue a +TRUNCATE <table>+ SQL command before
|
|
importing data.
|
|
|
|
-V, --vacuum::
|
|
|
|
makes +pgloader+ issue a +VACUUM ANALYZE <table>+ SQL command
|
|
after data loading.
|
|
|
|
-C, --count::
|
|
|
|
Number of input lines to process, default is to process all the input
|
|
lines.
|
|
|
|
-F, --from::
|
|
+
|
|
Input line number from which we begin to process (and count). pgloader
|
|
will skip all preceding lines.
|
|
+
|
|
You can't use both -F and -I at the same time.
|
|
|
|
-I, --from-id::
|
|
+
|
|
From which id do we begin to process (and count) input lines.
|
|
+
|
|
When a composite key is used, you have to give each column of the key
|
|
separated by comma, on the form col_name=value.
|
|
+
|
|
Please notice using the --from-id option implies pgloader will try to
|
|
get row id of each row, it being on the interval processed or
|
|
not. This could have some performance impact, and you may end up
|
|
preferring to use --from instead.
|
|
+
|
|
Example: pgloader -I col1:val1,col2:val2
|
|
+
|
|
You can't use both -F and -I at the same time.
|
|
|
|
-f, --field-sep::
|
|
|
|
Default field separator to use, when not set +pgloader+ will use
|
|
'|'. Useful when using +filename+ arguments rather than +section+ ones.
|
|
|
|
-E, --encoding::
|
|
|
|
Input data files encoding. Defaults to 'latin9'.
|
|
|
|
-o, --pg-options::
|
|
+
|
|
Any option to give to the PostgreSQL server by mean of the +SET+
|
|
command. You can use this argument more than once to set more than one
|
|
option.
|
|
+
|
|
Example: -o standard_conforming_strings=on -o client_encoding=utf8
|
|
|
|
-t, --section-threads::
|
|
|
|
How many threads per section to use, defaults to 1. The command line
|
|
value override the configuration file one.
|
|
|
|
-m, --max-parallel-sections::
|
|
|
|
How many sections to load in parallel, defaults to 1. The command line
|
|
value override the configuration file one. That's a max value because
|
|
you will end up having less sections to load than this number.
|
|
|
|
-R, --reformat_path::
|
|
|
|
PATH where to find reformat python modules, defaults to
|
|
+/usr/share/pgloader/reformat+. See +reformat_path+ option for
|
|
syntax and default value.
|
|
|
|
-1, --psycopg1::
|
|
|
|
Force usage of psycopg version 1.
|
|
|
|
-2, --psycopg2::
|
|
|
|
Force usage of psycopg version 2.
|
|
|
|
--psycopg-version::
|
|
|
|
Force +pgloader+ to use given version of psycopg, either +1+ or
|
|
+2+.
|
|
|
|
== INTERNAL USAGE OPTIONS ==
|
|
|
|
Those have been developped for internal +pgloader+ usage only, but still
|
|
need to be documented. Also, they are maintained and you could find an usage
|
|
for them.
|
|
|
|
--load-from-stdin::
|
|
|
|
Consider standard input as the data file. When using this function,
|
|
either give a section name from which to apply all the setup except for
|
|
the +filename+ to load from, or use +--load-to-table+.
|
|
|
|
--load-to-table::
|
|
|
|
This option's argument must be the name of the PostgreSQL table you're
|
|
loading the data to, it's useful when you want to load from +stdin+ and
|
|
avoid editing a full configuration section.
|
|
|
|
--boundaries::
|
|
|
|
Allow for limiting the range of bytes to read and process, must be given
|
|
in the X..Y format, with X and Y integers.
|
|
|
|
== GLOBAL CONFIGURATION SECTION ==
|
|
|
|
The configuration file has a +.ini+ file syntax, its first section has
|
|
to be the pgsql one, defining how to access to the PostgreSQL database
|
|
server where to load data. Then you may define any number of sections,
|
|
each one describing a data loading task to be performed by pgloader.
|
|
|
|
The [pgsql] section has the following options, which all must be set.
|
|
|
|
host::
|
|
|
|
PostgreSQL database server name, for example +localhost+. For Unix
|
|
Domain connection, give the directory where to find the Unix
|
|
Socket, e.g. +/tmp+. The +port+ will then get used to locate the
|
|
Unix Socket filename.
|
|
|
|
port::
|
|
|
|
PostgreSQL database server listening port, 5432. You have to fill this
|
|
entry.
|
|
|
|
base::
|
|
|
|
The name of the database you want to load data into.
|
|
|
|
user::
|
|
|
|
Connecting PostgreSQL user name.
|
|
|
|
pass::
|
|
|
|
The password of the user. The better is to grant a trust access
|
|
privilege in PostgreSQL +pg_hba.conf+. Then you can set this entry
|
|
to whatever value you want to.
|
|
|
|
client_encoding::
|
|
+
|
|
Set this parameter to have +pgloader+ connects to PostgreSQL using this
|
|
encoding.
|
|
+
|
|
This parameter is optional and defaults to 'latin9'.
|
|
+
|
|
As of +pgloader 2.3.3+ you can also use +pg_option_client_encoding+ which is
|
|
the more general approach.
|
|
|
|
datestyle::
|
|
+
|
|
Set this parameter to have +pgloader+ connects to PostgreSQL using this
|
|
datestyle setting.
|
|
+
|
|
This parameter is optional and has no default value, thus pgloader will
|
|
use whatever your PostgreSQL is configured to as default.
|
|
+
|
|
As of +pgloader 2.3.3+ you can also use +pg_option_datestyle+ which is
|
|
the more general approach.
|
|
|
|
pg_option_<foo>::
|
|
|
|
Replace <foo> with any option you're allowed to setup for the session
|
|
only with the +SET+ command, and +pgloader+ will do just that for
|
|
you. Consider for example +pg_option_standard_conforming_strings = on+.
|
|
|
|
copy_every::
|
|
+
|
|
When issuing +COPY+ PostgreSQL commands, pgloader will not make a
|
|
single big +COPY+ attempt, but copy copy_every lines at a time.
|
|
+
|
|
This parameter is optional and defaults to 10000.
|
|
|
|
//////////////////////////////////////////
|
|
commit_every::
|
|
+
|
|
PostgreSQL +COMMIT+ frequency, in +UPDATE+ orders. A good value is
|
|
+1000+, that means committing the SQL transaction every 1000 input
|
|
lines.
|
|
+
|
|
+pgloader+ issues commit every +commit_every+ updates, on connection
|
|
closing and when a SQL error occurs.
|
|
+
|
|
This parameter is optional and defaults to +1000+.
|
|
//////////////////////////////////////////
|
|
|
|
copy_delimiter::
|
|
+
|
|
The field separator to use in +COPY FROM+ produced statements. If you
|
|
don't specify this, the same separator as the one given in +field_sep+
|
|
parameter will be used.
|
|
+
|
|
Please note PostgreSQL requires a single char properly encoded (see
|
|
your +client_encoding+ parameter), or it abort in error and even may
|
|
crash.
|
|
+
|
|
This parameter is optional and defaults to +field_sep+.
|
|
|
|
newline_escapes::
|
|
+
|
|
For parameter effect description, see below (same name, table local
|
|
setting).
|
|
+
|
|
You can setup here a global escape character, to be considered on each
|
|
and every column of each and every text-format table defined
|
|
thereafter.
|
|
|
|
null::
|
|
+
|
|
You can configure here how null value is represented into your flat
|
|
data file.
|
|
+
|
|
This parameter is optional and defaults to +''+ (that is +empty string+).
|
|
|
|
empty_string::
|
|
+
|
|
You can configure here how empty values are represented into your flat
|
|
data file.
|
|
+
|
|
This parameter is optional and defaults to +$$'\ '$$+ (that is
|
|
backslash followed by space).
|
|
|
|
reformat_path::
|
|
+
|
|
When using +reformat+ option, provide here a colon separated path list
|
|
where to look for reformatting module.
|
|
+
|
|
reformat_path = .:/home/dim/PostgreSQL/pgfoundry/pgloader/reformat
|
|
+
|
|
The directories given here should exist and contain a
|
|
+$$__init__.py$$+ file (for python to consider them as packages), the
|
|
only modules and functions used in the package will be the one you
|
|
configure with +reformat+ section specific option.
|
|
+
|
|
Default value is +/usr/share/pgloader/reformat+, which is where the
|
|
provided +debian+ package of +pgloader+ installs the +reformat+
|
|
modules.
|
|
+
|
|
If the +-R+ or +--reformat_path+ command line option is used, it will
|
|
have precedence over configuration file setting.
|
|
|
|
client_min_messages::
|
|
|
|
Minimum level of messages to print to the console while
|
|
running. Defined levels are +DEBUG+, +INFO+, +WARNING+, +ERROR+,
|
|
+CRITICAL+, from min to max. +
|
|
|
|
log_min_messages::
|
|
|
|
Minimum level of messages to print out to the log file, which
|
|
defaults to +/tmp/pgloader.log+. See +client_min_messages+ for
|
|
available levels.
|
|
|
|
log_file::
|
|
|
|
Relative or absolute path to the +log_file+ where to log messages
|
|
of level of at least +log_min_messages+ level. The 'dirname' of
|
|
the given +log_file+, if it doesn't exists, will be created by
|
|
+pgloader+. If any error prevents +pgloader+ to use the
|
|
+log_file+, it will default to using +/tmp/pgloader.log+ and say
|
|
so.
|
|
|
|
lc_messages::
|
|
|
|
The PostgreSQL session will use this +LC_MESSAGES+ setting if
|
|
given, defaults to server configuration by not issuing anything
|
|
with respect to this setting when not set.
|
|
|
|
max_parallel_sections::
|
|
|
|
Number of sections to load at the same time, each in its own
|
|
thread. Default to +1+, which is the legacy behaviour and the more
|
|
common wanted one.
|
|
|
|
== COMMON FORMAT CONFIGURATION PARAMETERS ==
|
|
|
|
You then can define any number of data section, and give them an
|
|
arbitrary name. Some options are required, some are actually optional,
|
|
in which case it is said so thereafter.
|
|
|
|
First, we'll go through common parameters, applicable whichever format
|
|
of data you're referring to. Then text-format only parameters will be
|
|
presented, followed by csv-only parameters.
|
|
|
|
template::
|
|
+
|
|
When this option is set, current section is to be considered a
|
|
template, that is only read from section(s) using it as so (see
|
|
+use_template+ below).
|
|
+
|
|
The value given to the option is not taken into account by +pgloader+,
|
|
only the fact that it exists has meaning. But +ConfigParser+ requires
|
|
a value to be affected to consider the option set. Use +True+ as a
|
|
value, for example.
|
|
|
|
use_template::
|
|
+
|
|
This option setting have to be the name of a template section, which
|
|
can define the exact same options as a normal section. If the actual
|
|
section and the +use_template+ template section both define the same
|
|
option, the former is used: actual setting overrides template's one.
|
|
|
|
table::
|
|
|
|
The table name of the database where to load data.
|
|
|
|
format::
|
|
+
|
|
The format data are to be found, either +text+, +csv+ or +fixed+.
|
|
+
|
|
See next sections for format specific options.
|
|
|
|
filename::
|
|
|
|
The absolute path to the input data file. The large object files
|
|
are to be found into the same directory. Their name can be in the
|
|
form +[bc]lob[0-9a-f]{4}.[0-9a-f]{3}+, but this information is not
|
|
used by +pgloader+.
|
|
|
|
input_encoding::
|
|
|
|
The encoding of the configured +filename+.
|
|
|
|
reject_log::
|
|
|
|
In case of errors processing input data, a human readable log per rejected
|
|
input data line is produced into the +reject_log+ file.
|
|
|
|
reject_data::
|
|
|
|
In case of errors processing input data, the rejected input line is
|
|
appended to the +reject_data+ file.
|
|
|
|
field_sep::
|
|
+
|
|
The field separator used into the data file. The same separator will
|
|
be used by the generated +COPY+ commands, thus +pgloader+ does not
|
|
have to deal with escaping the delimiter it uses (input data has to
|
|
have escaped it).
|
|
+
|
|
This parameter is optional and defaults to pipe char +$$'|'$$+.
|
|
|
|
client_encoding::
|
|
+
|
|
Set this parameter to have +pgloader+ connects to PostgreSQL using this
|
|
encoding.
|
|
+
|
|
This parameter is optional and defaults to 'latin9'.
|
|
+
|
|
As of +pgloader 2.3.3+ you can also use +pg_option_client_encoding+ which is
|
|
the more general approach.
|
|
|
|
datestyle::
|
|
+
|
|
Set this parameter to have +pgloader+ connects to PostgreSQL using this
|
|
datestyle setting.
|
|
+
|
|
This parameter is optional and has no default value, thus pgloader will
|
|
use whatever your PostgreSQL is configured to as default.
|
|
+
|
|
As of +pgloader 2.3.3+ you can also use +pg_option_datestyle+ which is
|
|
the more general approach.
|
|
|
|
pg_option_<foo>::
|
|
|
|
Replace <foo> with any option you're allowed to setup for the session
|
|
only with the +SET+ command, and +pgloader+ will do just that for
|
|
you. Consider for example +pg_option_standard_conforming_strings = on+.
|
|
|
|
null::
|
|
+
|
|
You can configure here how null value is represented into your flat
|
|
data file.
|
|
+
|
|
This parameter is optional and defaults to +''+ (that is empty
|
|
string). If defined on a table level, this local value will overwrite
|
|
the global one.
|
|
|
|
empty_string::
|
|
+
|
|
You can configure here how empty values are represented into your flat
|
|
data file.
|
|
+
|
|
This parameter is optional and defaults to '\ ' (that is backslash
|
|
followed by space). If defined on a table level, this local value will
|
|
overwrite the global one.
|
|
|
|
skip_head_lines::
|
|
|
|
Skip the +n+ first lines of the given files (headers)
|
|
|
|
|
|
//////////////////////////////////////////
|
|
index::
|
|
+
|
|
Table index definition, to be used in blob +UPDATE+'ing. You define an
|
|
index column by giving its name and its column number (as found into
|
|
your data file, and counting from 1) separated by a colon. If your
|
|
table has a composite key, then you can define multiple columns here,
|
|
separated by a comma.
|
|
+
|
|
index = colname:3, other_colname:5
|
|
//////////////////////////////////////////
|
|
|
|
columns::
|
|
+
|
|
You can define here table columns, by giving their names and
|
|
optionally column number (as found into your data file, and counting
|
|
from 1) separated by a colon.
|
|
+
|
|
columns = x, y, a, b, d:6, c:5
|
|
+
|
|
Note you'll have to define here all the columns to be found in data
|
|
file, whether you want to use them all or not. When not using them
|
|
all, use the +only_cols+ parameter to restrict.
|
|
+
|
|
As of +pgloader 2.2+ the column list used might not be the same as the
|
|
table columns definition.
|
|
+
|
|
As of +pgloader 2.2.1+ you can omit column numbering if you want to, a
|
|
counter is then maintained for you, starting from 1 and set to +$$last
|
|
value + 1$$+ on each column, where +last value+ was either computed or
|
|
given in the config. So you can even omit only 'some' columns in
|
|
there.
|
|
+
|
|
In case you have a lot a columns per table, you will want to use
|
|
multiple lines for this parameter value. Python ConfigParser module
|
|
knows how to read multi-line parameters, you don't have to escape
|
|
anything.
|
|
+
|
|
An easy way to get the list of attributes (columns) of your tables
|
|
(say +a+, +b+ and +c+) is by the following query:
|
|
+
|
|
BEGIN;
|
|
CREATE AGGREGATE array_acc(anyelement) (
|
|
SFUNC = array_append,
|
|
STYPE = anyarray,
|
|
INITCOND = '{}'
|
|
);
|
|
|
|
SELECT relname, array_acc(attname)
|
|
FROM pg_attribute a join pg_class c on a.attrelid = c.oid
|
|
WHERE relname in ('a', 'b', 'c')
|
|
and attname not in ('tableoid','cmax','xmax','cmin','xmin','ctid')
|
|
GROUP BY relname;
|
|
|
|
ROLLBACK;
|
|
+
|
|
As of +pgloader 2.3.0+ you can simply set +columns = *+ and +pgloader+
|
|
will issue the needed SQL for you. This only works if your data file
|
|
and your table definition both present the columns in the exact same
|
|
order, obviously.
|
|
+
|
|
Internally, +pgloader+ will issue a +COPY+ statement without the
|
|
column names if possible, meaning when +only_cols+ is not used at the
|
|
same time as +columns = *+ is used.
|
|
|
|
user_defined_columns::
|
|
+
|
|
Those are special columns not found in the data file but which you
|
|
want to load into the database. The configuration options beginning
|
|
with +udc_+ are taken as column names with constant values. The
|
|
following example define the column +c+ as having the value +constant
|
|
value+ for each and every row of the input data file.
|
|
+
|
|
udc_c = constant value
|
|
+
|
|
The option +copy_columns+ is used to define the exact +columnsList+
|
|
given to +COPY+.
|
|
+
|
|
A simple use case is the loading into the same database table of data
|
|
coming from more than one file. If you need to keep track of the data
|
|
origin, add a column to the table model and define a 'udc_' for
|
|
+pgloader+ to add a constant value in the database.
|
|
+
|
|
Using user-defined columns require defining +copy_columns+ and is not
|
|
compatible with +only_cols+ usage.
|
|
+
|
|
|
|
copy_columns::
|
|
+
|
|
This options defines the columns to load from the input data file and
|
|
the user defined columns, and in which order to do this. Place here
|
|
the column names separated by commas.
|
|
+
|
|
copy_columns = b, c, d
|
|
+
|
|
This option is required if any user column is defined, and conflicts
|
|
with the +only_cols+ option. It won't have any effect when used in a
|
|
section where no user column is defined.
|
|
|
|
only_cols::
|
|
+
|
|
If you want to only load a part of the columns you have into the data
|
|
file, this option let you define which columns you're interested
|
|
in. +only_col+ is a comma separated list of ranges or values, as in
|
|
following example.
|
|
+
|
|
only_cols = 1-3, 5
|
|
+
|
|
This parameter is optional and defaults to the list of all columns
|
|
given on the columns parameter list, in the colname order.
|
|
+
|
|
This option conflicts with user defined columns and +copy_columns+
|
|
option.
|
|
|
|
blob_columns::
|
|
+
|
|
The definition of the columns where to find some blob or clob
|
|
reference. This definition is composed by a table column name, a
|
|
column number (counting from one) reference into the Informix +UNLOAD+
|
|
data file, and a large object type, separated by a colon. You can have
|
|
several columns in this field, separated by a comma.
|
|
+
|
|
Supported large objects type are Informix blob and clob, the awaited
|
|
configuration string are respectively +ifx_blob+ for binary (bytea)
|
|
content type and +ifx_clob+ for text type values.
|
|
+
|
|
Here's an example:
|
|
+
|
|
blob_type = clob_column:3:ifx_blob, other_clob_column:5:ifx_clob
|
|
|
|
reformat::
|
|
+
|
|
Use this option when you need to preprocess some column data with
|
|
+pgloader+ reformatting modules, or your own. The value of this option is
|
|
a comma separated list of columns to rewrite, which are a colon
|
|
separated list of column name, reformat module name, reformat function
|
|
name. Here's an example to reformat column +dt_cx+ with the
|
|
+mysql.timestamp()+ reformatting function:
|
|
+
|
|
reformat = dt_cx:mysql:timestamp
|
|
+
|
|
See global setting option +reformat_path+ for configuring where
|
|
+pgloader+ will look for reformat packages and modules.
|
|
+
|
|
If you want to write a new formating function, provide a python
|
|
package called +reformat+ (a directory of this name containing an
|
|
empty +$$ __init__.py$$+ file will do) and place in there arbitrary named
|
|
modules (+foo.py+ files) containing functions with the following
|
|
signature:
|
|
+
|
|
def bar(reject, input)
|
|
+
|
|
The reject object has a +log(self, messages, data = None)+ method for
|
|
you to log errors into +section.rej.log+ and +section.rej+ files.
|
|
|
|
|
|
== PARALLEL LOADING ==
|
|
|
|
This section is about loading a single given section by multiple
|
|
threads. To load several sections at once in a parallel fashion,
|
|
please refer to +max_parallel_sections+ global option.
|
|
|
|
section_threads::
|
|
|
|
This option allows to configure how many threads +pgloader+ will
|
|
use to process current section. See +split_file_reading+ for more
|
|
information about how those threads will serve the
|
|
loading. Defaults to +1+, which is the legacy behaviour and the
|
|
more needed one too.
|
|
|
|
split_file_reading::
|
|
+
|
|
This option is only used by +pgloader+ when +section_threads+ is more
|
|
than +1+, and configures how the work will be spread to threads. It
|
|
defaults to +False+.
|
|
+
|
|
When +split_file_reading+ is +True+, +pgloader+ will have each section
|
|
thread process a part of the input file. The file splitting is done at
|
|
the byte level, not at the line count level: knowing how many lines
|
|
the input file has would require loading it first...
|
|
+
|
|
When +split_file_reading+ is +False+, +pgloader+ will have one thread
|
|
read the input file and give workers threads input lines to process in
|
|
a round-robin fashion. Please note the reader thread will have to
|
|
parse the lines (according to +format+ parameter).
|
|
|
|
rrqueue_size::
|
|
|
|
When +split_file_reading+ is +False+, this is the size of the
|
|
+pgloader+ queue used to balance input lines to workers
|
|
threads. Instead of giving them one line at a time in a
|
|
round-robin fashion, +pgloader+ will feed workers +rrqueue_size+
|
|
lines at a time. This defaults to +copy_every+.
|
|
|
|
|
|
== TEXT FORMAT CONFIGURATION PARAMETERS ==
|
|
|
|
field_count::
|
|
+
|
|
The +UNLOAD+ command does not escape newlines when they appear into
|
|
table data. Hence, you may obtain multi-line data files, where a
|
|
single database row (say tuple if you prefer to) can span multiple
|
|
physical lines into the unloaded file.
|
|
+
|
|
If this is your case, you may want to configure here the number of
|
|
columns per tuple. Then pgloader will count columns and buffer line
|
|
input in order to re-assemble several physical lines into one data row
|
|
when needed.
|
|
+
|
|
This parameter is optional.
|
|
|
|
trailing_sep::
|
|
+
|
|
If this option is set to True, the input data file is known to append
|
|
a +field_sep+ as the last character of each of its lines. With this
|
|
option set, this last character is then not considered as a field
|
|
separator.
|
|
+
|
|
This parameter is optional and defaults to +False+.
|
|
|
|
newline_escapes::
|
|
+
|
|
Sometimes the input data file has field values containing newlines,
|
|
and the export program used (as Informix +UNLOAD+ command) escape
|
|
in-field newlines. So you want +pgloader+ to keep those newlines,
|
|
while at the same time preserving them.
|
|
+
|
|
This option does the described work on specified fields and
|
|
considering the escaping character you configure, following this
|
|
syntax:
|
|
+
|
|
newline_escapes = colname:\, other_colname:§
|
|
+
|
|
This parameter is optional, and the extra work is only done when
|
|
set. You can configure +newline_escapes+ for as many fields as
|
|
necessary, and you may configure a different escaping character each
|
|
time.
|
|
+
|
|
Please note that at the moment, +pgloader+ does only support one
|
|
character length +newline_escapes+.
|
|
+
|
|
When both a global (see +[pgsql]+ section) +newline_escapes+ parameter
|
|
and a table local one are set, +pgloader+ issues a warning and only
|
|
consider the global setting.
|
|
|
|
== CSV FORMAT CONFIGURATION PARAMETERS ==
|
|
|
|
doublequote::
|
|
|
|
Controls how instances of +quotechar+ appearing inside a field
|
|
should be themselves be quoted. When +True+, the character is
|
|
doubled. When +False+, the +escapechar+ is used as a prefix to the
|
|
+quotechar+. It defaults to +True+.
|
|
|
|
escapechar::
|
|
|
|
A one-character string used by the writer to escape the delimiter
|
|
if quoting is set to +QUOTE_NONE+ and the +quotechar+ if
|
|
+doublequote+ is +False+. On reading, the +escapechar+ removes any
|
|
special meaning from the following character. It defaults to
|
|
+None+, which disables escaping.
|
|
|
|
quotechar::
|
|
|
|
A one-character string used to quote fields containing special
|
|
characters, such as the +delimiter+ or +quotechar+, or which
|
|
contain new-line characters. It defaults to '"'.
|
|
|
|
skipinitialspace::
|
|
|
|
When +True+, whitespace immediately following the +delimiter+ is
|
|
ignored. The default is +False+.
|
|
|
|
field_size_limit::
|
|
|
|
Sets the maximum field size allowed by the python +CSV+ parser. Accepts
|
|
an number of bytes (integer), or a string containing a number then one
|
|
of those units (case sensitive): +kB+, +MB+, +GB+, +TB+. Requires a at
|
|
least python 2.5.
|
|
|
|
== FIXED FORMAT CONFIGURATION PARAMETERS ==
|
|
|
|
fixed_specs::
|
|
+
|
|
This parameter allows to specify start position and byte length for
|
|
each columns to load. Syntax is +column_name:start:len+, separated by
|
|
comas.
|
|
+
|
|
fixed_specs = a:0:10, b:10:8, c:18:8, d:26:17
|
|
|
|
== CONFIGURATION EXAMPLE ==
|
|
|
|
Please see the given configuration example which should be distributed in
|
|
+/usr/share/doc/pgloader/examples/pgloader.conf+.
|
|
|
|
The example configuration file comes with example data and can be used
|
|
a unit test of +pgloader+.
|
|
|
|
== HISTORY ==
|
|
|
|
+pgloader+ has first been a +tcl+ tool written by Jan Wieck and
|
|
released by Christopher Kings-Lynne, who created the
|
|
http://pgfoundry.org[pgfoundry] project for it to be published. Later
|
|
on, Jean-Paul Argudo took over the maintenance. When it became clear
|
|
that it would be easier to rewrite it in another language than to
|
|
properly learn +tcl+ and develop some missing options, +pgloader+ was
|
|
rewritten in python by Dimitri Fontaine.
|
|
|
|
+pgloader+ was rewritten to act as an Informix to PostgreSQL migration
|
|
helper which imported Informix large objects directly into a
|
|
PostgreSQL database.
|
|
|
|
Then as we got some data we couldn't file tools to care about, we
|
|
decided +ifx_blob+ would become +pgloader+, as it had to be able to
|
|
import all Informix +UNLOAD+ data. Those data contains escaped
|
|
separator into unquoted data field and multi-lines fields (+\r+ and
|
|
+\n+ are not escaped).
|
|
|
|
+pgloader+ has since gained some more features allowing it to directly
|
|
import +mysqldump -T+ data, and is known to be used in production
|
|
environments where a +PostgreSQL+ database is used for reporting
|
|
against data from several servers running different RDBMS softwares.
|
|
|
|
== BUGS ==
|
|
|
|
Please report bugs to Dimitri Fontaine <dim@tapoueh.org>, and see
|
|
current list of known bugs in the BUGS.txt distributed file (debian
|
|
package includes it at +/usr/share/doc/pgloader/BUGS.txt+ or online at
|
|
following url:
|
|
http://pgloader.projects.postgresql.org/dev/BUGS.html[].
|
|
|
|
== AUTHORS ==
|
|
|
|
+pgloader+ is written by Dimitri Fontaine <dim@tapoueh.org>.
|
|
|