.. pgloader documentation master file, created by
   sphinx-quickstart on Tue Dec 5 19:23:32 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to pgloader's documentation!
====================================

pgloader loads data from various sources into PostgreSQL. It can transform
the data it reads on the fly and submit raw SQL before and after loading.
It uses the `COPY` PostgreSQL protocol to stream the data into the server,
and manages errors by filling a pair of *reject.dat* and *reject.log*
files.

Thanks to being able to load data directly from a database source, pgloader
also supports migrations from other products to PostgreSQL. In this mode of
operation, pgloader handles both the schema and data parts of the
migration, in a single unattended command, making it possible to implement
**Continuous Migration**.

Features Overview
=================

pgloader has two modes of operation: loading from files and migrating
databases. In both cases, pgloader uses the PostgreSQL COPY protocol, which
implements **streaming** to send data in a very efficient way.

Loading file content in PostgreSQL
----------------------------------

When loading from files, pgloader implements the following features:

Many source formats supported
    Support for a wide variety of file-based formats is included in
    pgloader: the CSV family, fixed-column formats, dBase files (``db3``),
    and IBM IXF files.

    The SQLite database engine is covered in the next section: pgloader
    considers SQLite a database source and implements schema discovery from
    SQLite catalogs.

On the fly data transformation
    Often enough the data as read from a CSV file (or another format) needs
    some tweaking and clean-up before being sent to PostgreSQL.

    For instance in the `geolite
    <https://github.com/dimitri/pgloader/blob/master/test/archive.load>`_
    example we can see that integer values are rewritten as IP address
    ranges, making it possible to target an ``ip4r`` column directly.

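    A minimal sketch of such a transformation in pgloader's command
    language, adapted from that example (the connection string and target
    table are illustrative)::

        LOAD CSV
             FROM 'GeoLiteCity-Blocks.csv'
                  (
                     startIpNum, endIpNum, locId
                  )
             INTO postgresql:///ip4r?geolite.blocks
                  (
                     -- build an ip4r range from the two integer fields
                     iprange ip4r using (ip-range startIpNum endIpNum),
                     locId
                  )
             WITH skip header = 2,
                  fields optionally enclosed by '"',
                  fields terminated by ',';
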
Full Field projections
    pgloader supports loading data into fewer fields than are found in the
    file, or more, computing new values from the data read before sending
    it to PostgreSQL.

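    As a sketch, modeled on the CSV examples in this documentation (file
    name and target are illustrative), six fields are read but only four
    columns are loaded, in a different order::

        LOAD CSV
             FROM 'path/to/file.csv' (x, y, a, b, c, d)
             INTO postgresql:///pgloader?csv (a, b, d, c);
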
Reading files from an archive
    Archive formats *zip*, *tar*, and *gzip* are supported by pgloader: the
    archive is extracted into a temporary directory and the expanded files
    are then loaded.

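    A sketch of an archive load, modeled on the geolite example (the
    archive location, matching pattern, and target are illustrative)::

        LOAD ARCHIVE
             FROM /tmp/GeoLiteCity-latest.zip
             INTO postgresql:///ip4r

             -- load one file found inside the expanded archive
             LOAD CSV
                  FROM FILENAME MATCHING ~/GeoLiteCity-Location.csv/
                       WITH ENCODING iso-8859-1
                       (
                          locId, country, region, city
                       )
                  INTO postgresql:///ip4r?geolite.location
                       (
                          locId, country, region, city
                       )
                  WITH skip header = 2,
                       fields optionally enclosed by '"',
                       fields terminated by ',';
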
HTTP(S) support
    pgloader knows how to download a source file or a source archive using
    HTTP directly. It might be better to use
    ``curl -o - http://... | pgloader`` and read the data from *standard
    input*, thus allowing streaming of the data from its source down to
    PostgreSQL.

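    A sketch of that pipeline, assuming a ``csv.load`` command file whose
    ``LOAD CSV`` clause reads ``FROM stdin`` (the URL and file names are
    illustrative)::

        curl -o - http://example.com/data.csv | pgloader csv.load
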
Target schema discovery
    When loading into an existing table, pgloader takes into account the
    existing columns and may automatically guess the CSV format for you.

On error stop / On error resume next
    In some cases the source data is so damaged as to be impossible to
    migrate in full. When loading from a file, pgloader defaults to the
    ``on error resume next`` option, where the rows rejected by PostgreSQL
    are saved away and the migration continues with the other rows.

    In other cases loading only a part of the input data might not be a
    great idea, and in such cases it's possible to use the ``on error
    stop`` option.

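    Both behaviors are selected in the ``WITH`` clause of a load command,
    as in this sketch (file and target names are illustrative)::

        LOAD CSV
             FROM 'data.csv' (a, b, c)
             INTO postgresql:///pgloader?csv (a, b, c)
             WITH on error stop;
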
Pre/Post SQL commands
    This feature allows pgloader commands to include SQL commands to run
    before and after loading a file. It might be about creating a table
    first, then loading the data into it, and then doing more processing
    on top of the data (implementing an *ELT* pipeline), or creating
    specific indexes as soon as the data has been made ready.

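    A minimal sketch using the ``BEFORE LOAD DO`` and ``AFTER LOAD DO``
    clauses (table, file, and index names are illustrative)::

        LOAD CSV
             FROM 'data.csv' (a, b, c)
             INTO postgresql:///pgloader?public.t (a, b, c)

          BEFORE LOAD DO
             -- create the target table before the data is loaded
             $$ create table if not exists t (a int, b text, c date); $$

           AFTER LOAD DO
             -- index the data as soon as it has been made ready
             $$ create index on t (a); $$;
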
One-command migration to PostgreSQL
-----------------------------------

When migrating a full database in a single command, pgloader implements the
following features:

One-command migration
    The whole migration is started with a single command line and then runs
    unattended. pgloader is meant to be integrated into fully automated
    tooling that you can repeat as many times as needed.

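    In its simplest form the whole migration is a single command line, as
    in this sketch (connection strings are illustrative)::

        pgloader mysql://user@localhost/sakila postgresql:///sakila
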
Schema discovery
    The source database is introspected using its SQL catalogs to get the
    list of tables, attributes (with data types, default values, not null
    constraints, etc), primary key constraints, foreign key constraints,
    indexes, comments, etc. This feeds an internal database catalog of all
    the objects to migrate from the source database to the target database.

User defined casting rules
    Some source databases have ideas about their data types that might not
    be compatible with PostgreSQL's implementation of equivalent data
    types.

    For instance, SQLite since version 3 has a `Dynamic Type System
    <https://www.sqlite.org/datatype3.html>`_ which of course isn't
    compatible with the idea of a `Relation
    <https://en.wikipedia.org/wiki/Relation_(database)>`_. MySQL, for its
    part, accepts datetime values for year zero, which doesn't exist in our
    calendar, and doesn't have a boolean data type.

    When migrating from another source database technology to PostgreSQL,
    data type casting choices must be made. pgloader implements solid
    defaults that you can rely upon, and a facility for **user defined data
    type casting rules** for specific cases. The idea is to allow users to
    specify how the migration should be done, in order for it to be
    repeatable and included in a *Continuous Migration* process.

On the fly data transformations
    The user defined casting rules come with on the fly rewriting of the
    data. For instance zero dates (it's not just the year: MySQL accepts
    ``0000-00-00`` as a valid datetime) are rewritten to NULL values by
    default.

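    A sketch of casting rules that drive this rewrite, using the
    ``zero-dates-to-null`` transformation as shown in this documentation's
    MySQL reference::

        CAST type datetime to timestamptz
                  drop default drop not null using zero-dates-to-null,
             type date drop not null drop default using zero-dates-to-null
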
Partial Migrations
    It is possible to include only a partial list of the source database
    tables in the migration, or to exclude some of the tables on the source
    database.

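    A sketch of the ``INCLUDING ONLY`` clause, as documented in the MySQL
    reference (patterns are illustrative); the converse clause is
    ``EXCLUDING TABLE NAMES MATCHING``::

        LOAD DATABASE
             FROM mysql://user@localhost/sakila
             INTO postgresql:///sakila
             INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor';
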
Schema only, Data only
    This is the **ORM compatibility** feature of pgloader, where it is
    possible to create the schema using your ORM and then have pgloader
    migrate the data targeting this already created schema.

    When doing this, it is possible for pgloader to *reindex* the target
    schema: before loading the data from the source database into
    PostgreSQL using COPY, pgloader DROPs the indexes and constraints, and
    reinstalls the exact same definitions once the data has been loaded.

    The reason for operating that way is, of course, data load performance.

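    A sketch of the relevant ``WITH`` options (connection strings are
    illustrative); the counterpart to ``data only`` is ``schema only``::

        LOAD DATABASE
             FROM mysql://user@localhost/sakila
             INTO postgresql:///sakila
             WITH data only, drop indexes;
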
Repeatable (DROP+CREATE)
    By default, pgloader issues DROP statements in the target PostgreSQL
    database before issuing any CREATE statement, so that you can repeat
    the migration as many times as necessary until migration specifications
    and rules are bug free.

    Then schedule the data migration to run every night (or even more
    often!) for the whole duration of the code migration project. See the
    `Continuous Migration <https://pgloader.io/blog/continuous-migration/>`_
    methodology for more details about the approach.

On error stop / On error resume next
    The default behavior of pgloader when migrating from a database is ``on
    error stop``. The idea is to let the user fix either the migration
    specifications or the source data, and run the process again, until it
    works.

    In some cases the source data is so damaged as to be impossible to
    migrate in full, and it might then be necessary to resort to the ``on
    error resume next`` option, where the rows rejected by PostgreSQL are
    saved away and the migration continues with the other rows.

Pre/Post SQL commands, Post-Schema SQL commands
    While pgloader takes care of rewriting the schema to PostgreSQL
    expectations, and even provides *user-defined data type casting rules*
    support to that end, sometimes it is necessary to add some specific SQL
    commands around the migration. This is supported right from pgloader
    itself, without having to script around it.

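    A sketch of a migration wrapped in extra SQL commands (connection
    strings and SQL snippets are illustrative)::

        LOAD DATABASE
             FROM mysql://user@localhost/sakila
             INTO postgresql:///sakila

          BEFORE LOAD DO
             -- prepare the target database before the migration
             $$ create extension if not exists unaccent; $$

           AFTER LOAD DO
             -- refresh planner statistics once the data is in
             $$ analyze; $$;
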
Online ALTER schema
    At times migrating to PostgreSQL is also a good opportunity to review
    and fix bad decisions that were made in the past, or decisions that are
    simply not relevant to PostgreSQL.

    The pgloader command syntax makes it possible to ALTER pgloader's
    internal representation of the target catalogs so that the target
    schema can be created a little differently from the source one. Changes
    supported include targeting a different *schema* or *table* name.

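    A sketch of such ``ALTER`` clauses as they appear inside a migration
    command (schema and table names are illustrative)::

        ALTER SCHEMA 'sakila' RENAME TO 'public'
        ALTER TABLE NAMES MATCHING 'film' RENAME TO 'films'
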
Materialized Views, or schema rewrite on-the-fly
    In some cases the schema rewriting goes deeper than just renaming the
    SQL objects: it becomes a full normalization exercise. That effort pays
    off, because PostgreSQL is great at running a normalized schema in
    production under most workloads.

    pgloader implements full flexibility in on-the-fly schema rewriting by
    making it possible to migrate from a view definition. The view
    attribute list becomes a table definition in PostgreSQL, and the data
    is fetched by querying the view on the source system.

    A SQL view makes it possible to implement content filtering both at the
    column level, using the SELECT projection clause, and at the row level,
    using the WHERE restriction clause, as well as backfilling from
    reference tables thanks to JOINs.

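    A sketch of the clause, naming two views that ship with the sakila
    sample database (the view names are illustrative)::

        LOAD DATABASE
             FROM mysql://user@localhost/sakila
             INTO postgresql:///sakila
             MATERIALIZE VIEWS film_list, staff_list;
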
Distribute to Citus
    When migrating from PostgreSQL to Citus, an important part of the
    process consists of adjusting the schema to the distribution key. Read
    `Preparing Tables and Ingesting Data
    <https://docs.citusdata.com/en/v8.0/use_cases/multi_tenant.html>`_ in
    the Citus documentation for a complete example showing how to do that.

    When using pgloader it's possible to specify the distribution keys and
    reference tables and let pgloader take care of adjusting the table,
    indexes, primary keys and foreign key definitions all by itself.

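    A sketch of the distribution clauses, following the multi-tenant
    example from the Citus documentation (table and column names are
    illustrative)::

        LOAD DATABASE
             FROM postgresql://user@source/app
             INTO postgresql://user@citus/app
             DISTRIBUTE companies USING id
             DISTRIBUTE campaigns USING company_id
             DISTRIBUTE ads USING company_id FROM campaigns;
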
Encoding Overrides
    MySQL doesn't actually enforce that the encoding of the data in the
    database matches the encoding declared in the metadata, defined at the
    database, table, or attribute level. Sometimes it's necessary to
    override the metadata in order to make sense of the text, and pgloader
    makes it easy to do so.

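    A sketch of the ``DECODING`` clause from this documentation's MySQL
    reference (the table name patterns are illustrative)::

        LOAD DATABASE
             FROM mysql://user@localhost/db
             INTO postgresql:///db
             DECODING TABLE NAMES MATCHING ~/messed/, ~/encoding/ AS utf8;
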
Continuous Migration
--------------------

pgloader is meant to migrate a whole database in a single command line and
without any manual intervention. The goal is to be able to set up a
*Continuous Integration* environment as described in the `Project
Methodology <http://mysqltopgsql.com/project/>`_ document of the `MySQL to
PostgreSQL <http://mysqltopgsql.com/project/>`_ webpage.

1. Set up your target PostgreSQL Architecture
2. Fork a Continuous Integration environment that uses PostgreSQL
3. Migrate the data over and over again every night, from production
4. As soon as the CI is all green using PostgreSQL, schedule the D-Day
5. Migrate without surprise and enjoy!

In order to be able to follow this great methodology, you need tooling to
implement the third step in a fully automated way. That's pgloader.

.. toctree::
   :maxdepth: 2
   :caption: Table Of Contents:

   intro
   quickstart
   tutorial/tutorial
   pgloader
   ref/csv
   ref/fixed
   ref/copy
   ref/dbf
   ref/ixf
   ref/archive
   ref/mysql
   ref/sqlite
   ref/mssql
   ref/pgsql
   ref/pgsql-citus-target
   ref/pgsql-redshift
   ref/transforms
   bugreport

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`