Merge pull request #98 from cbbrowne/master

Some wordsmithing on the docs
Dimitri Fontaine 2014-07-24 19:04:09 +02:00
commit 6e324a1f74
2 changed files with 56 additions and 54 deletions

@ -2,29 +2,30 @@
pgloader is a data loading tool for PostgreSQL, using the `COPY` command. pgloader is a data loading tool for PostgreSQL, using the `COPY` command.
Its main avantage over just using `COPY` or `\copy` and over using a Its main advantage over just using `COPY` or `\copy`, and over using a
*Foreign Data Wrapper* is the transaction behaviour, where *pgloader* will *Foreign Data Wrapper*, is its transaction behaviour, where *pgloader*
keep a separate file of rejected data and continue trying to `copy` good will keep a separate file of rejected data, but continue trying to
data in your database. `copy` good data in your database.
The default PostgreSQL behaviour is transactional, which means that any The default PostgreSQL behaviour is transactional, which means that
erroneous line in the input data (file or remote database) will stop the *any* erroneous line in the input data (file or remote database) will
bulk load for the whole table. stop the entire bulk load for the table.
pgloader also implements data reformating, the main example of that being a pgloader also implements data reformatting, a typical example of that
transformation from MySQL dates `0000-00-00` and `0000-00-00 00:00:00` to being the transformation of MySQL datestamps `0000-00-00` and
PostgreSQL `NULL` value (because our calendar never had a *year zero*). `0000-00-00 00:00:00` to PostgreSQL `NULL` value (because our calendar
never had a *year zero*).
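
As a rough illustration of the zero-date reformatting described above, here is a minimal Python sketch. It is not pgloader code (the 3.x series is written in Common Lisp), and the function name `zero_date_to_null` is a hypothetical one chosen for this example.

```python
from typing import Optional

# MySQL "zero" date values, which have no valid PostgreSQL counterpart.
ZERO_DATES = {"0000-00-00", "0000-00-00 00:00:00"}


def zero_date_to_null(value: Optional[str]) -> Optional[str]:
    """Return None (standing in for SQL NULL) for MySQL zero dates.

    Any other value is passed through unchanged.
    """
    if value is None or value in ZERO_DATES:
        return None
    return value
```

Real dates pass through untouched, so the function can be applied to every value of a date column.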
## Versioning ## Versioning
The pgloader version 1.x from a long time ago had been developped in `TCL`. The pgloader version 1.x from a long time ago was developed in `TCL`.
When faced with maintaining that code, the new emerging development team When faced with maintaining that code, the new emerging development
(hi!) picked `python` instead because that made sense at the time. So team (hi!) picked `python` instead because that made sense at the
pgloader version 2.x were in python. time. So pgloader version 2.x were in python.
The current version of pgloader is the 3.x series, which is written in The current version of pgloader is the 3.x series, which is written in
[Common Lisp](http://cliki.net/) for better development flexibility, run [Common Lisp](http://cliki.net/) for better development flexibility,
time performances, real threading. runtime performance, and support of real threading.
The versioning is now following the Emacs model, where any X.0 release The versioning is now following the Emacs model, where any X.0 release
number means you're using a development version (alpha, beta, or release number means you're using a development version (alpha, beta, or release

@ -6,12 +6,13 @@
## DESCRIPTION ## DESCRIPTION
pgloader loads data from different sources into PostgreSQL. It can tranform pgloader loads data from various sources into PostgreSQL. It can
the data it reads on the fly and send raw SQL before and after the loading. transform the data it reads on the fly and submit raw SQL before and
It uses the `COPY` PostgreSQL protocol to stream the data into the server, after the loading. It uses the `COPY` PostgreSQL protocol to stream
and manages errors by filling a pair fo *reject.dat* and *reject.log* files. the data into the server, and manages errors by filling a pair of
*reject.dat* and *reject.log* files.
pgloader operates from commands which are read from files: pgloader operates using commands which are read from files:
pgloader commands.load pgloader commands.load
@ -108,12 +109,12 @@ database of your setup. The filenames are the target table, and their
extensions are `.dat` for the rejected data and `.log` for the file extensions are `.dat` for the rejected data and `.log` for the file
containing the full PostgreSQL client side logs about the rejected data. containing the full PostgreSQL client side logs about the rejected data.
The `.dat` file is formated in PostgreSQL the text COPY format as documented The `.dat` file is formatted in PostgreSQL the text COPY format as documented
in [http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609](). in [http://www.postgresql.org/docs/9.2/static/sql-copy.html#AEN66609]().
## A NOTE ABOUT PERFORMANCES ## A NOTE ABOUT PERFORMANCES
pgloader has been developped with performances in mind, to be able to cope pgloader has been developed with performances in mind, to be able to cope
with ever growing needs in loading large amounts of data into PostgreSQL. with ever growing needs in loading large amounts of data into PostgreSQL.
The basic architecture it uses is the old Unix pipe model, where a thread is The basic architecture it uses is the old Unix pipe model, where a thread is
@ -265,7 +266,7 @@ Where:
Can contain any character, including colon (`:`) which must then be Can contain any character, including colon (`:`) which must then be
doubled (`::`) and at-sign (`@`) which must then be doubled (`@@`). doubled (`::`) and at-sign (`@`) which must then be doubled (`@@`).
When ommited, the *user* name defaults to the value of the `PGUSER` When omitted, the *user* name defaults to the value of the `PGUSER`
environment variable, and if it is unset, the value of the `USER` environment variable, and if it is unset, the value of the `USER`
environment variable. environment variable.
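
The doubling rule and the fallback order for the *user* name described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `escape_user` and `default_user` are hypothetical and are not part of pgloader.

```python
import os
from typing import Mapping, Optional


def escape_user(user: str) -> str:
    """Double ':' and '@' so they can appear inside the user name."""
    return user.replace(":", "::").replace("@", "@@")


def default_user(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick the user name when the connection string omits it.

    PGUSER is used when it is set; otherwise USER is used.
    Returns None when neither variable is set.
    """
    if env is None:
        env = os.environ
    if "PGUSER" in env:
        return env["PGUSER"]
    return env.get("USER")
```

Passing the environment as a parameter keeps the lookup order easy to check without touching the real process environment.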
@ -275,15 +276,15 @@ Where:
be doubled (`@@`). To leave the password empty, when the *user* name be doubled (`@@`). To leave the password empty, when the *user* name
ends with at at sign, you then have to use the syntax user:@. ends with at at sign, you then have to use the syntax user:@.
When ommited, the *password* defaults to the value of the `PGPASSWORD` When omitted, the *password* defaults to the value of the `PGPASSWORD`
environement variable if it is set, otherwise the password is left environment variable if it is set, otherwise the password is left
unset. unset.
- *netloc* - *netloc*
Can be either a hostname in dotted notation, or an ipv4, or an unix Can be either a hostname in dotted notation, or an ipv4, or an Unix
domain socket path. Empty is the default network location, under a domain socket path. Empty is the default network location, under a
system providing *unix domain socket* that method is prefered, otherwise system providing *unix domain socket* that method is preferred, otherwise
the *netloc* default to `localhost`. the *netloc* default to `localhost`.
It's possible to force the *unix domain socket* path by using the syntax It's possible to force the *unix domain socket* path by using the syntax
@ -292,7 +293,7 @@ Where:
postgresql://unix:/tmp:54321/dbname postgresql://unix:/tmp:54321/dbname
The *netloc* defaults to the value of the `PGHOST` environement The *netloc* defaults to the value of the `PGHOST` environment
variable, and if it is unset, to either the default `unix` socket path variable, and if it is unset, to either the default `unix` socket path
when running on a Unix system, and `localhost` otherwise. when running on a Unix system, and `localhost` otherwise.
@ -302,11 +303,11 @@ Where:
digits and the punctuation signs comma (`,`), dash (`-`) and underscore digits and the punctuation signs comma (`,`), dash (`-`) and underscore
(`_`). (`_`).
When ommited, the *dbname* defaults to the value of the environment When omitted, the *dbname* defaults to the value of the environment
variable `PGDATABASE`, and if that is unset, to the *user* value as variable `PGDATABASE`, and if that is unset, to the *user* value as
determined above. determined above.
- The only optionnal parameter should be a possibly qualified table name. - The only optional parameter should be a possibly qualified table name.
### Regular Expressions ### Regular Expressions
@ -383,7 +384,7 @@ The global batch behaviour options are:
Supporting more than a single batch being sent at a time is on the TODO Supporting more than a single batch being sent at a time is on the TODO
list of pgloader, but is not implemented yet. This option is about list of pgloader, but is not implemented yet. This option is about
controling the memory needs of pgloader as a trade-off to the controlling the memory needs of pgloader as a trade-off to the
performances characteristics, and not about parallel activity of performances characteristics, and not about parallel activity of
pgloader. pgloader.
@ -523,7 +524,7 @@ The `csv` format command accepts the following clauses and options:
Takes a single character as argument, which must be found inside Takes a single character as argument, which must be found inside
single quotes, and might be given as the printable character itself, single quotes, and might be given as the printable character itself,
the special value \t to denote a tabulation character, or `0x` then the special value \t to denote a tabulation character, or `0x` then
an hexadecimal value read as the ascii code for the character. an hexadecimal value read as the ASCII code for the character.
This character is used as the quoting character in the `CSV` file, This character is used as the quoting character in the `CSV` file,
and defaults to double-quote. and defaults to double-quote.
@ -548,7 +549,7 @@ The `csv` format command accepts the following clauses and options:
Takes a single character as argument, which must be found inside Takes a single character as argument, which must be found inside
single quotes, and might be given as the printable character itself, single quotes, and might be given as the printable character itself,
the special value \t to denote a tabulation character, or `0x` then the special value \t to denote a tabulation character, or `0x` then
an hexadecimal value read as the ascii code for the character. an hexadecimal value read as the ASCII code for the character.
This character is used as the *field separator* when reading the This character is used as the *field separator* when reading the
`CSV` data. `CSV` data.
@ -558,7 +559,7 @@ The `csv` format command accepts the following clauses and options:
Takes a single character as argument, which must be found inside Takes a single character as argument, which must be found inside
single quotes, and might be given as the printable character itself, single quotes, and might be given as the printable character itself,
the special value \t to denote a tabulation character, or `0x` then the special value \t to denote a tabulation character, or `0x` then
an hexadecimal value read as the ascii code for the character. an hexadecimal value read as the ASCII code for the character.
This character is used to recognize *end-of-line* condition when This character is used to recognize *end-of-line* condition when
reading the `CSV` data. reading the `CSV` data.
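
The three accepted spellings of a single-character argument (the printable character itself, `\t`, or `0x` followed by a hexadecimal ASCII code) can be illustrated with a small Python sketch. The function name `parse_char_argument` is hypothetical, and the sketch assumes that all three spellings appear between the single quotes; it does not reproduce pgloader's actual parser.

```python
def parse_char_argument(arg: str) -> str:
    """Turn a quoted single-character argument into the character itself."""
    if len(arg) < 2 or arg[0] != "'" or arg[-1] != "'":
        raise ValueError("argument must be enclosed in single quotes")
    body = arg[1:-1]
    if body == r"\t":
        # The special value \t denotes a tabulation character.
        return "\t"
    if body.startswith("0x") and len(body) > 2:
        # Hexadecimal value read as the ASCII code of the character.
        return chr(int(body[2:], 16))
    if len(body) == 1:
        # The printable character itself.
        return body
    raise ValueError(f"not a single character: {arg!r}")
```

For instance, `parse_char_argument("'0x3b'")` yields a semicolon, since `0x3b` is its ASCII code.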
@ -942,7 +943,7 @@ The `database` command accepts the following clauses and options:
- *no truncate* - *no truncate*
When this topion is listed, pgloader issues no `TRUNCATE` command. When this option is listed, pgloader issues no `TRUNCATE` command.
- *create tables* - *create tables*
@ -1072,8 +1073,8 @@ The `database` command accepts the following clauses and options:
existing default expression in the MySQL database for columns of the existing default expression in the MySQL database for columns of the
source type from the `CREATE TABLE` statement it generates. source type from the `CREATE TABLE` statement it generates.
The spelling *keep default* explicitely prevents that behavior and The spelling *keep default* explicitly prevents that behaviour and
can be used to overlad the default casting rules. can be used to overload the default casting rules.
- *drop not null*, *keep not null* - *drop not null*, *keep not null*
@ -1082,8 +1083,8 @@ The `database` command accepts the following clauses and options:
MySQL datatype when it creates the tables in the PostgreSQL MySQL datatype when it creates the tables in the PostgreSQL
database. database.
The spelling *keep not null* explicitely prevents that behavior and The spelling *keep not null* explicitly prevents that behaviour and
can be used to overlad the default casting rules. can be used to overload the default casting rules.
- *drop typemod*, *keep typemod* - *drop typemod*, *keep typemod*
@ -1092,13 +1093,13 @@ The `database` command accepts the following clauses and options:
the datatype definition found in the MySQL columns of the source the datatype definition found in the MySQL columns of the source
type when it created the tables in the PostgreSQL database. type when it created the tables in the PostgreSQL database.
The spelling *keep typemod* explicitely prevents that behavior and The spelling *keep typemod* explicitly prevents that behaviour and
can be used to overlad the default casting rules. can be used to overload the default casting rules.
- *using* - *using*
This option takes as its single argument the name of a function to This option takes as its single argument the name of a function to
be found un the `pgloader.transforms` Common Lisp package. See above be found in the `pgloader.transforms` Common Lisp package. See above
for details. for details.
It's possible to augment a default cast rule (such as one that It's possible to augment a default cast rule (such as one that
@ -1167,11 +1168,11 @@ the following limitations:
- Views are not migrated, - Views are not migrated,
Supporting views might require implemeting a full SQL parser for the Supporting views might require implementing a full SQL parser for the
MySQL dialect with a porting engine to rewrite the SQL against MySQL dialect with a porting engine to rewrite the SQL against
PostgreSQL, including renaming functions and changing some constructs. PostgreSQL, including renaming functions and changing some constructs.
While it's not theorically impossible, don't hold your breath. While it's not theoretically impossible, don't hold your breath.
- Triggers are not migrated - Triggers are not migrated
@ -1181,7 +1182,7 @@ the following limitations:
It's simple enough to implement, just not on the priority list yet. It's simple enough to implement, just not on the priority list yet.
- Of the geometric datatypes, onle the `POINT` database has been covered. - Of the geometric datatypes, only the `POINT` database has been covered.
The other ones should be easy enough to implement now, it's just not The other ones should be easy enough to implement now, it's just not
done yet. done yet.
@ -1209,7 +1210,7 @@ Numbers:
- type double to double precision drop typemod - type double to double precision drop typemod
- type numeric to numeric keep typemod - type numeric to numeric keep typemod
- type decimal to deciman keep typemod - type decimal to decimal keep typemod
Texts: Texts:
@ -1314,7 +1315,7 @@ The `sqlite` command accepts the following clauses and options:
- *no truncate* - *no truncate*
When this topion is listed, pgloader issues no `TRUNCATE` command. When this option is listed, pgloader issues no `TRUNCATE` command.
- *create tables* - *create tables*
@ -1375,7 +1376,7 @@ The `sqlite` command accepts the following clauses and options:
- *EXCLUDING TABLE NAMES MATCHING* - *EXCLUDING TABLE NAMES MATCHING*
Introduce a comma separated list of table names or *rugular expression* Introduce a comma separated list of table names or *regular expression*
used to exclude table names from the migration. This filter only applies used to exclude table names from the migration. This filter only applies
to the result of the *INCLUDING* filter. to the result of the *INCLUDING* filter.
@ -1454,7 +1455,7 @@ The provided transformation functions are:
- *right-trimg* - *right-trimg*
Remove whitespaces at end of string. Remove whitespace at end of string.
- *byte-vector-to-bytea* - *byte-vector-to-bytea*
@ -1464,9 +1465,9 @@ The provided transformation functions are:
## LOAD MESSAGES ## LOAD MESSAGES
This command is still experimental and allows to receive messages in UDP This command is still experimental and allows receiving messages via
with a syslod like format, and depending on matching rules load named parts UDP using a syslog like format, and, depending on rule matching, loads
them to a destination table. named portions of the data stream into a destination table.
LOAD MESSAGES LOAD MESSAGES
FROM syslog://localhost:10514/ FROM syslog://localhost:10514/