mirror of
https://github.com/dimitri/pgloader.git
synced 2026-05-04 10:31:02 +02:00
More docs improvements.
Explain the feature list of pgloader better, improving discoverability of what can be achieved with our nice little tool.
This commit is contained in:
parent
ec071af0ad
commit
eab1cbf326
215
docs/index.rst
@@ -18,11 +18,226 @@ mode of operations, pgloader handles both the schema and data parts of the
migration, in a single unmanned command, making it possible to implement
**Continuous Migration**.

Features Overview
=================

pgloader has two modes of operation: loading from files and migrating
databases. In both cases, pgloader uses the PostgreSQL COPY protocol, which
implements **streaming** to send data in a very efficient way.

Loading file content in PostgreSQL
----------------------------------

When loading from files, pgloader implements the following features:

Many source formats supported
    Support for a wide variety of file-based formats is included in
    pgloader: the CSV family, fixed-column formats, dBase files (``db3``),
    and IBM IXF files.

    The SQLite database engine is accounted for in the next section:
    pgloader considers SQLite a database source and implements schema
    discovery from SQLite catalogs.

On the fly data transformation
    Often enough, the data as read from a CSV file (or another format)
    needs some tweaking and clean-up before being sent to PostgreSQL.

    For instance, in the `geolite
    <https://github.com/dimitri/pgloader/blob/master/test/archive.load>`_
    example we can see that integer values are rewritten as IP address
    ranges, making it possible to target an ``ip4r`` column directly.
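
As a sketch of such a transformation (file, table, and column names here are hypothetical, not taken from the geolite example), a load command can apply a transformation function such as ``ip-range`` in the target column list, assuming a source CSV with two integer fields:

```
LOAD CSV
     FROM 'blocks.csv' (startIpNum, endIpNum, locId)
     INTO postgresql:///ip4r?blocks
          (
             -- combine the two integer fields into one ip4r range value
             iprange ip4r using (ip-range startIpNum endIpNum),
             locId
          )
     WITH truncate,
          skip header = 1,
          fields terminated by ',';
```

The transformation happens on the fly, between reading the source fields and sending the resulting row through COPY.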

Full Field projections
    pgloader supports loading data into fewer fields than found in the
    file, or more, computing new values from the fields read before sending
    the result to PostgreSQL.

Reading files from an archive
    The archive formats *zip*, *tar*, and *gzip* are supported by pgloader:
    the archive is extracted into a temporary directory and the expanded
    files are then loaded.

HTTP(S) support
    pgloader knows how to download a source file or a source archive using
    HTTP directly. It might be better to use ``curl -O- http://... |
    pgloader`` and read the data from *standard input*, thus allowing for
    streaming of the data from its source down to PostgreSQL.

Target schema discovery
    When loading into an existing table, pgloader takes the existing
    columns into account and may automatically guess the CSV format for
    you.

On error stop / On error resume next
    In some cases the source data is so damaged as to be impossible to
    migrate in full. When loading from a file, pgloader defaults to the
    ``on error resume next`` option, where the rows rejected by PostgreSQL
    are saved aside and the migration continues with the other rows.

    In other cases, loading only a part of the input data might not be a
    great idea; in such cases it's possible to use the ``on error stop``
    option.

Pre/Post SQL commands
    This feature allows pgloader commands to include SQL commands to run
    before and after loading a file. It might be about creating a table
    first, then loading the data into it, and then doing more processing
    on top of the data (implementing an ``ELT`` pipeline), or creating
    specific indexes as soon as the data has been made ready.
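
A minimal sketch of this feature, with hypothetical table and file names, where a table is created before the load and an index right after:

```
LOAD CSV
     FROM 'events.csv' (id, payload)
     INTO postgresql:///mydb?events (id, payload)
     WITH fields terminated by ','

BEFORE LOAD DO
  $$ create table if not exists events (id bigint, payload text); $$

 AFTER LOAD DO
  $$ create index on events (id); $$;
```

The ``BEFORE LOAD DO`` and ``AFTER LOAD DO`` clauses each accept a list of dollar-quoted SQL commands.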

One-command migration to PostgreSQL
-----------------------------------

When migrating a full database in a single command, pgloader implements the
following features:

One-command migration
    The whole migration is started with a single command line and then runs
    unattended. pgloader is meant to be integrated into fully automated
    tooling that you can repeat as many times as needed.

Schema discovery
    The source database is introspected using its SQL catalogs to get the
    list of tables, attributes (with data types, default values, not null
    constraints, etc), primary key constraints, foreign key constraints,
    indexes, comments, etc. This feeds an internal database catalog of all
    the objects to migrate from the source database to the target database.

User defined casting rules
    Some source databases have ideas about their data types that might not
    be compatible with the PostgreSQL implementation of equivalent data
    types.

    For instance, SQLite since version 3 has a `Dynamic Type System
    <https://www.sqlite.org/datatype3.html>`_, which of course isn't
    compatible with the idea of a `Relation
    <https://en.wikipedia.org/wiki/Relation_(database)>`_. Or MySQL accepts
    datetime values for year zero, which doesn't exist in our calendar, and
    doesn't have a boolean data type.

    When migrating from another source database technology to PostgreSQL,
    data type casting choices must be made. pgloader implements solid
    defaults that you can rely upon, and a facility for **user defined data
    type casting rules** for specific cases. The idea is to allow users to
    specify how the migration should be done, in order for it to be
    repeatable and included in a *Continuous Migration* process.
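
A casting rule is a one-liner in the load command. As a sketch, assuming a MySQL source and hypothetical connection strings:

```
LOAD DATABASE
     FROM mysql://user@localhost/appdb
     INTO postgresql:///appdb

CAST type tinyint to boolean using tinyint-to-boolean,
     type datetime to timestamptz
          drop default drop not null
          using zero-dates-to-null;
```

Each rule matches a source data type and says which PostgreSQL type to use, which properties (default, not null) to keep or drop, and which transformation function to apply to the values.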

On the fly data transformations
    The user defined casting rules come with on the fly rewriting of the
    data. For instance, zero dates (it's not just the year: MySQL accepts
    ``0000-00-00`` as a valid datetime) are rewritten to NULL values by
    default.

Partial Migrations
    It is possible to include only a partial list of the source database
    tables in the migration, or to exclude some of the tables of the
    source database.
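
As a sketch, with hypothetical table names, the inclusion and exclusion lists accept both quoted names and regular expressions:

```
LOAD DATABASE
     FROM mysql://user@localhost/appdb
     INTO postgresql:///appdb

INCLUDING ONLY TABLE NAMES MATCHING ~/^users/, 'orders'
EXCLUDING TABLE NAMES MATCHING 'audit_log';
```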

Schema only, Data only
    This is the **ORM compatibility** feature of pgloader: it is possible
    to create the schema using your ORM and then have pgloader migrate the
    data targeting this already created schema.

    When doing this, it is possible for pgloader to *reindex* the target
    schema: before loading the data from the source database into
    PostgreSQL using COPY, pgloader DROPs the indexes and constraints, and
    reinstalls the exact same definitions once the data has been loaded.

    The reason for operating that way is of course data load performance.
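
The two halves of the migration are selected with ``WITH`` options; a sketch, with hypothetical connection strings:

```
LOAD DATABASE
     FROM mysql://user@localhost/appdb
     INTO postgresql:///appdb

WITH data only;
```

Use ``WITH schema only`` for the opposite split: migrate the definitions and skip the data.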

Repeatable (DROP+CREATE)
    By default, pgloader issues DROP statements in the target PostgreSQL
    database before issuing any CREATE statement, so that you can repeat
    the migration as many times as necessary until the migration
    specifications and rules are bug free.

On error stop / On error resume next
    The default behavior of pgloader when migrating from a database is
    ``on error stop``. The idea is to let the user fix either the
    migration specifications or the source data, and run the process
    again, until it works.

    In some cases the source data is so damaged as to be impossible to
    migrate in full, and it might then be necessary to resort to the ``on
    error resume next`` option, where the rows rejected by PostgreSQL are
    saved aside and the migration continues with the other rows.

Pre/Post SQL commands, Post-Schema SQL commands
    While pgloader takes care of rewriting the schema to PostgreSQL
    expectations, and even provides *user-defined data type casting rules*
    support to that end, sometimes it is necessary to add some specific
    SQL commands around the migration. This is supported by pgloader
    itself, without having to script around it.

Online ALTER schema
    At times, migrating to PostgreSQL is also a good opportunity to review
    and fix bad decisions that were made in the past, or that are simply
    not relevant to PostgreSQL.

    The pgloader command syntax makes it possible to ALTER pgloader's
    internal representation of the target catalogs so that the target
    schema can be created a little differently from the source one.
    Supported changes include targeting a different *schema* or *table*
    name.
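
A sketch of such rewriting, assuming a MySQL source named ``appdb`` and hypothetical table names (the ``ALTER TABLE NAMES MATCHING`` form is one of the supported variants):

```
LOAD DATABASE
     FROM mysql://user@localhost/appdb
     INTO postgresql:///appdb

ALTER SCHEMA 'appdb' RENAME TO 'public'
ALTER TABLE NAMES MATCHING 'users' RENAME TO 'app_users';
```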

Materialized Views, or schema rewrite on-the-fly
    In some cases the schema rewriting goes deeper than just renaming SQL
    objects: it becomes a full normalization exercise. That's worth doing,
    because PostgreSQL is great at running a normalized schema in
    production under most workloads.

    pgloader implements full flexibility in on-the-fly schema rewriting by
    making it possible to migrate from a view definition. The view's
    attribute list becomes a table definition in PostgreSQL, and the data
    is fetched by querying the view on the source system.

    A SQL view makes it possible to implement content filtering at the
    column level, using the SELECT projection clause, and at the row
    level, using the WHERE restriction clause, as well as backfilling from
    reference tables thanks to JOINs.
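
A sketch of the feature, with a hypothetical view name ``v_orders`` assumed to exist on the source database:

```
LOAD DATABASE
     FROM mysql://user@localhost/appdb
     INTO postgresql:///appdb

MATERIALIZE VIEWS v_orders;
```

The migrated table ``v_orders`` in PostgreSQL then contains the result of querying the view on the source system.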

Distribute to Citus
    When migrating from PostgreSQL to Citus, an important part of the
    process consists of adjusting the schema to the distribution key. Read
    `Preparing Tables and Ingesting Data
    <https://docs.citusdata.com/en/v8.0/use_cases/multi_tenant.html>`_ in
    the Citus documentation for a complete example showing how to do that.

    When using pgloader, it's possible to specify the distribution keys
    and reference tables and let pgloader take care of adjusting the
    table, index, primary key, and foreign key definitions all by itself.
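
A sketch of the distribution clauses, with hypothetical table names and connection strings (the ``FROM`` clause in the last rule derives the distribution column through a foreign key):

```
LOAD DATABASE
     FROM pgsql://user@localhost/appdb
     INTO pgsql://user@coordinator/appdb

DISTRIBUTE companies USING id
DISTRIBUTE campaigns USING company_id
DISTRIBUTE ads USING company_id FROM campaigns;
```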

Encoding Overrides
    MySQL doesn't actually enforce that the encoding of the data in the
    database matches the encoding declared in the metadata, defined at the
    database, table, or attribute level. Sometimes it's necessary to
    override the metadata in order to make sense of the text, and pgloader
    makes it easy to do so.

Continuous Migration
--------------------

pgloader is meant to migrate a whole database in a single command line,
without any manual intervention. The goal is to be able to set up a
*Continuous Integration* environment as described in the `Project
Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.

1. Setup your target PostgreSQL Architecture
2. Fork a Continuous Integration environment that uses PostgreSQL
3. Migrate the data over and over again every night, from production
4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
5. Migrate without surprise and enjoy!

In order to be able to follow this great methodology, you need tooling to
implement the third step in a fully automated way. That's pgloader.

.. toctree::
   :maxdepth: 2
   :caption: Table Of Contents:

   intro
   quickstart
   tutorial/tutorial
   pgloader
   ref/csv

@@ -35,30 +35,14 @@ expected input properties must be given to pgloader. In the case of a
database, pgloader connects to the live service and knows how to fetch the
metadata it needs directly from it.

Continuous Migration
--------------------

pgloader is meant to migrate a whole database in a single command line and
without any manual intervention. The goal is to be able to setup a
*Continuous Integration* environment as described in the `Project
Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.

1. Setup your target PostgreSQL Architecture
2. Fork a Continuous Integration environment that uses PostgreSQL
3. Migrate the data over and over again every night, from production
4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
5. Migrate without surprise and enjoy!

In order to be able to follow this great methodology, you need tooling to
implement the third step in a fully automated way. That's pgloader.

Features Matrix
---------------

Here's a comparison of the features supported depending on the source
database engine. Most features that are not supported can be added to
pgloader, it's just that nobody had the need to do so yet.
database engine. Some features that are not supported can be added to
pgloader, it's just that nobody had the need to do so yet. Those features
are marked with ✗. Empty cells are used when the feature doesn't make sense
for the selected source database.

========================== ======= ====== ====== =========== =========
Feature                    SQLite  MySQL  MS SQL PostgreSQL  Redshift

@@ -71,14 +55,13 @@ Schema only                ✓       ✓      ✓      ✓
Data only                  ✓       ✓      ✓      ✓           ✓
Repeatable (DROP+CREATE)   ✓       ✓      ✓      ✓           ✓
User defined casting rules ✓       ✓      ✓      ✓           ✓
Encoding Overrides         ✗       ✓      ✗      ✗           ✗
Encoding Overrides                 ✓
On error stop              ✓       ✓      ✓      ✓           ✓
On error resume next       ✓       ✓      ✓      ✓           ✓
Pre/Post SQL commands      ✓       ✓      ✓      ✓           ✓
Post-Schema SQL commands   ✗       ✓      ✓      ✓           ✓
Primary key support        ✓       ✓      ✓      ✓           ✓
Foreign key support        ✓       ✓      ✓      ✓           ✗
Incremental data loading   ✓       ✓      ✓      ✓           ✓
Foreign key support        ✓       ✓      ✓      ✓
Online ALTER schema        ✓       ✓      ✓      ✓           ✓
Materialized views         ✗       ✓      ✓      ✓           ✓
Distribute to Citus        ✗       ✓      ✓      ✓           ✓

@@ -1,10 +1,10 @@
PgLoader Quick Start
--------------------
Pgloader Quick Start
====================

In simple cases, pgloader is very easy to use.

CSV
^^^
---

Load data from a CSV file into a pre-existing table in your database::

@@ -26,7 +26,7 @@ For documentation about the available syntaxes for the `--field` and
Note also that the PostgreSQL URI includes the target *tablename*.

Reading from STDIN
^^^^^^^^^^^^^^^^^^
------------------

File based pgloader sources can be loaded from the standard input, as in the
following example::

@@ -46,7 +46,7 @@ pgloader with this technique, using the Unix pipe::

    gunzip -c source.gz | pgloader --type csv ... - pgsql:///target?foo

Loading from CSV available through HTTP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------

The same command as just above can also be run if the CSV file happens to be
found on a remote HTTP location::

@@ -84,7 +84,7 @@ Also notice that the same command will work against an archived version of
the same data.

Streaming CSV data from an HTTP compressed file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-----------------------------------------------

Finally, it's important to note that pgloader first fetches the content from
the HTTP URL to a local file, then expands the archive when it's

@@ -110,7 +110,7 @@ and the commands and pgloader will take care of streaming the data down to
PostgreSQL.

Migrating from SQLite
^^^^^^^^^^^^^^^^^^^^^
---------------------

The following command will open the SQLite database, discover its table
definitions including indexes and foreign keys, migrate those definitions

@@ -121,7 +121,7 @@ and then migrate the data over::

    pgloader ./test/sqlite/sqlite.db postgresql:///newdb

Migrating from MySQL
^^^^^^^^^^^^^^^^^^^^
--------------------

Just create a database to host the MySQL data and definitions, and have
pgloader do the migration for you in a single command line::

@@ -130,7 +130,7 @@ pgloader do the migration for you in a single command line::

    pgloader mysql://user@localhost/sakila postgresql:///pagila

Fetching an archived DBF file from a HTTP remote location
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------------------------

It's possible for pgloader to download a file from HTTP, unarchive it, and
only then open it to discover the schema, then load the data::

@@ -1,7 +1,6 @@
PgLoader Tutorial
Pgloader Tutorial
=================

.. include:: quickstart.rst
.. include:: csv.rst
.. include:: fixed.rst
.. include:: geolite.rst