diff --git a/docs/index.rst b/docs/index.rst index ca5e672..1751e71 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -18,11 +18,226 @@ mode of operations, pgloader handles both the schema and data parts of the migration, in a single unmanned command, allowing to implement **Continuous Migration**. +Features Overview +================= + +pgloader has two modes of operation: loading from files and migrating +databases. In both cases, pgloader uses the PostgreSQL COPY protocol, which +implements **streaming** to send data in a very efficient way. + +Loading file content in PostgreSQL +---------------------------------- + +When loading from files, pgloader implements the following features: + +Many source formats supported + Support for a wide variety of file-based formats is included in + pgloader: the CSV family, fixed columns formats, dBase files (``db3``), + and IBM IXF files. + + The SQLite database engine is accounted for in the next section: + pgloader considers SQLite as a database source and implements schema + discovery from SQLite catalogs. + +On the fly data transformation + Often enough the data as read from a CSV file (or another format) needs + some tweaking and clean-up before being sent to PostgreSQL. + + For instance, in the `geolite + `_ + example we can see that integer values are rewritten as IP address + ranges, which allows targeting an ``ip4r`` column directly. + +Full Field projections + pgloader supports loading data into fewer fields than found in the file, + or more, doing some computation on the data read before sending it to + PostgreSQL. + +Reading files from an archive + Archive formats *zip*, *tar*, and *gzip* are supported by pgloader: the + archive is extracted into a temporary directory and the expanded files + are then loaded. + +HTTP(S) support + pgloader knows how to download a source file or a source archive using + HTTP directly. It might be better to use ``curl http://... | + pgloader`` and read the data from *standard input*, thus allowing the + data to be streamed from its source down to PostgreSQL. + +Target schema discovery + When loading into an existing table, pgloader takes into account the + existing columns and may automatically guess the CSV format for you. + +On error stop / On error resume next + In some cases the source data is so damaged as to be impossible to + migrate in full, and when loading from a file the default for pgloader + is the ``on error resume next`` option, where the rows rejected by + PostgreSQL are saved away and the migration continues with the other + rows. + + In other cases loading only part of the input data might not be + acceptable, and in such cases it's possible to use the ``on error stop`` + option. + +Pre/Post SQL commands + This feature allows pgloader commands to include SQL commands to run + before and after loading a file. It might be about creating a table + first, then loading the data into it, and then doing more processing + on top of the data (thus implementing an ``ELT`` pipeline), or creating + specific indexes as soon as the data has been made ready. + +One-command migration to PostgreSQL +----------------------------------- + +When migrating a full database in a single command, pgloader implements the +following features: + +One-command migration + The whole migration is started with a single command line and then runs + unattended. pgloader is meant to be integrated into fully automated + tooling so that you can repeat the migration as many times as needed.
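+
+    For instance, with a source MySQL database named ``sakila`` and a
+    target PostgreSQL database named ``pagila`` (both names are
+    placeholders, the same ones used in the Quick Start section), the
+    whole migration boils down to a single shell command::
+
+        createdb pagila
+        pgloader mysql://user@localhost/sakila postgresql:///pagila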
+ +Schema discovery + The source database is introspected using its SQL catalogs to get the + list of tables, attributes (with data types, default values, not null + constraints, etc), primary key constraints, foreign key constraints, + indexes, comments, etc. This feeds an internal database catalog of all + the objects to migrate from the source database to the target database. + +User defined casting rules + Some source databases have ideas about their data types that might not be + compatible with the PostgreSQL implementation of equivalent data types. + + For instance, SQLite since version 3 has a `Dynamic Type System + `_ which of course isn't + compatible with the idea of a `Relation + `_. Or MySQL accepts + datetime for year zero, which doesn't exist in our calendar, and + doesn't have a boolean data type. + + When migrating from another source database technology to PostgreSQL, + data type casting choices must be made. pgloader implements solid + defaults that you can rely upon, and a facility for **user defined data + type casting rules** for specific cases. The idea is to allow users to + specify how the migration should be done, in order for it to be + repeatable and included in a *Continuous Migration* process. + +On the fly data transformations + The user defined casting rules come with on-the-fly rewriting of the data. + For instance, zero dates (it's not just the year, MySQL accepts + ``0000-00-00`` as a valid datetime) are rewritten to NULL values by + default. + +Partial Migrations + It is possible to include only a partial list of the source database + tables in the migration, or to exclude some of the tables on the source + database. + +Schema only, Data only + This is the **ORM compatibility** feature of pgloader, where it is + possible to create the schema using your ORM and then have pgloader + migrate the data targeting this already created schema. + + When doing this, it is possible for pgloader to *reindex* the target + schema: before loading the data from the source database into PostgreSQL + using COPY, pgloader DROPs the indexes and constraints, and reinstalls + the exact same definitions of them once the data has been loaded. + + The reason for operating that way is of course data load performance. + +Repeatable (DROP+CREATE) + By default, pgloader issues DROP statements in the target PostgreSQL + database before issuing any CREATE statement, so that you can repeat the + migration as many times as necessary until migration specifications and + rules are bug free. + +On error stop / On error resume next + The default behavior of pgloader when migrating from a database is ``on + error stop``. The idea is to let the user fix either the migration + specifications or the source data, and run the process again, until it + works. + + In some cases the source data is so damaged as to be impossible to + migrate in full, and it might be necessary to resort to the ``on + error resume next`` option, where the rows rejected by PostgreSQL are + saved away and the migration continues with the other rows. + +Pre/Post SQL commands, Post-Schema SQL commands + While pgloader takes care of rewriting the schema to PostgreSQL + expectations, and even provides *user-defined data type casting rules* + support to that end, sometimes it is necessary to add some specific SQL + commands around the migration. It's of course supported right from + pgloader itself, without having to script around it.
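+
+    As a minimal sketch, using placeholder connection strings and SQL, a
+    load command file can carry that extra SQL in ``BEFORE LOAD DO`` and
+    ``AFTER LOAD DO`` blocks::
+
+        LOAD DATABASE
+             FROM mysql://user@localhost/sakila
+             INTO postgresql:///pagila
+
+        BEFORE LOAD DO
+         $$ create schema if not exists sakila; $$
+
+        AFTER LOAD DO
+         $$ comment on schema sakila is 'migrated with pgloader'; $$;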
+ +Online ALTER schema + At times migrating to PostgreSQL is also a good opportunity to review + and fix bad decisions that were made in the past, or simply that are not + relevant to PostgreSQL. + + The pgloader command syntax allows altering pgloader's internal + representation of the target catalogs so that the target schema can be + created slightly differently from the source one. Supported changes + include targeting a different *schema* or *table* name. + +Materialized Views, or schema rewrite on-the-fly + In some cases the schema rewriting goes deeper than just renaming the + SQL objects, becoming a full normalization exercise, because PostgreSQL + is great at running a normalized schema in production under most + workloads. + + pgloader implements full flexibility in on-the-fly schema rewriting, by + making it possible to migrate from a view definition. The view attribute + list becomes a table definition in PostgreSQL, and the data is fetched + by querying the view on the source system. + + A SQL view makes it possible to implement content filtering both at the + column level, using the SELECT projection clause, and at the row level, + using the WHERE restriction clause, as well as backfilling from + reference tables thanks to JOINs. + +Distribute to Citus + When migrating from PostgreSQL to Citus, an important part of the process + consists of adjusting the schema to the distribution key. Read + `Preparing Tables and Ingesting Data + `_ in + the Citus documentation for a complete example showing how to do that. + + When using pgloader it's possible to specify the distribution keys and + reference tables and let pgloader take care of adjusting the table, + index, primary key, and foreign key definitions all by itself. + +Encoding Overrides + MySQL doesn't actually enforce that the encoding of the data in the + database matches the encoding declared in the metadata, defined at the + database, table, or attribute level. Sometimes, it's necessary to + override the metadata in order to make sense of the text, and pgloader + makes it easy to do so. + + +Continuous Migration +-------------------- + +pgloader is meant to migrate a whole database in a single command line and +without any manual intervention. The goal is to be able to set up a +*Continuous Integration* environment as described in the `Project +Methodology `_ document of the `MySQL to +PostgreSQL `_ webpage. + + 1. Set up your target PostgreSQL architecture + 2. Fork a Continuous Integration environment that uses PostgreSQL + 3. Migrate the data over and over again every night, from production + 4. As soon as the CI is all green using PostgreSQL, schedule the D-Day + 5. Migrate without surprise and enjoy! + +In order to be able to follow this great methodology, you need tooling to +implement the third step in a fully automated way. That's pgloader. + .. toctree:: :maxdepth: 2 :caption: Table Of Contents: intro + quickstart tutorial/tutorial pgloader ref/csv diff --git a/docs/intro.rst b/docs/intro.rst index f73c64d..f733b72 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -35,30 +35,14 @@ expected input properties must be given to pgloader. In the case of a database, pgloader connects to the live service and knows how to fetch the metadata it needs directly from it. -Continuous Migration --------------------- - -pgloader is meant to migrate a whole database in a single command line and -without any manual intervention.
The goal is to be able to setup a -*Continuous Integration* environment as described in the `Project -Methodology `_ document of the `MySQL to -PostgreSQL `_ webpage. - - 1. Setup your target PostgreSQL Architecture - 2. Fork a Continuous Integration environment that uses PostgreSQL - 3. Migrate the data over and over again every night, from production - 4. As soon as the CI is all green using PostgreSQL, schedule the D-Day - 5. Migrate without suprise and enjoy! - -In order to be able to follow this great methodology, you need tooling to -implement the third step in a fully automated way. That's pgloader. - Features Matrix --------------- Here's a comparison of the features supported depending on the source -database engine. Most features that are not supported can be added to -pgloader, it's just that nobody had the need to do so yet. +database engine. Some features that are not supported can be added to +pgloader, it's just that nobody had the need to do so yet. Those features +are marked with ✗. Empty cells are used when the feature doesn't make sense +for the selected source database. ========================== ======= ====== ====== =========== ========= Feature SQLite MySQL MS SQL PostgreSQL Redshift @@ -71,14 +55,13 @@ Schema only ✓ ✓ ✓ ✓ Data only ✓ ✓ ✓ ✓ ✓ Repeatable (DROP+CREATE) ✓ ✓ ✓ ✓ ✓ User defined casting rules ✓ ✓ ✓ ✓ ✓ -Encoding Overrides ✗ ✓ ✗ ✗ ✗ +Encoding Overrides ✓ On error stop ✓ ✓ ✓ ✓ ✓ On error resume next ✓ ✓ ✓ ✓ ✓ Pre/Post SQL commands ✓ ✓ ✓ ✓ ✓ Post-Schema SQL commands ✗ ✓ ✓ ✓ ✓ Primary key support ✓ ✓ ✓ ✓ ✓ -Foreign key support ✓ ✓ ✓ ✓ ✗ -Incremental data loading ✓ ✓ ✓ ✓ ✓ +Foreign key support ✓ ✓ ✓ ✓ Online ALTER schema ✓ ✓ ✓ ✓ ✓ Materialized views ✗ ✓ ✓ ✓ ✓ Distribute to Citus ✗ ✓ ✓ ✓ ✓ diff --git a/docs/tutorial/quickstart.rst b/docs/quickstart.rst similarity index 96% rename from docs/tutorial/quickstart.rst rename to docs/quickstart.rst index abd303c..912a095 100644 --- a/docs/tutorial/quickstart.rst +++ b/docs/quickstart.rst @@ -1,10 +1,10 @@ -PgLoader Quick Start --------------------- +Pgloader Quick Start +==================== In simple cases, pgloader is very easy to use. CSV -^^^ +--- Load data from a CSV file into a pre-existing table in your database:: @@ -26,7 +26,7 @@ For documentation about the available syntaxes for the `--field` and Note also that the PostgreSQL URI includes the target *tablename*. Reading from STDIN -^^^^^^^^^^^^^^^^^^ +------------------ File based pgloader sources can be loaded from the standard input, as in the following example:: @@ -46,7 +46,7 @@ pgloader with this technique, using the Unix pipe:: gunzip -c source.gz | pgloader --type csv ... - pgsql:///target?foo Loading from CSV available through HTTP -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------------------------------------- The same command as just above can also be run if the CSV file happens to be found on a remote HTTP location:: @@ -84,7 +84,7 @@ Also notice that the same command will work against an archived version of the same data. Streaming CSV data from an HTTP compressed file -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +----------------------------------------------- Finally, it's important to note that pgloader first fetches the content from the HTTP URL it to a local file, then expand the archive when it's @@ -110,7 +110,7 @@ and the commands and pgloader will take care of streaming the data down to PostgreSQL. 
Migrating from SQLite -^^^^^^^^^^^^^^^^^^^^^ +--------------------- The following command will open the SQLite database, discover its tables definitions including indexes and foreign keys, migrate those definitions @@ -121,7 +121,7 @@ and then migrate the data over:: pgloader ./test/sqlite/sqlite.db postgresql:///newdb Migrating from MySQL -^^^^^^^^^^^^^^^^^^^^ +-------------------- Just create a database where to host the MySQL data and definitions and have pgloader do the migration for you in a single command line:: @@ -130,7 +130,7 @@ pgloader do the migration for you in a single command line:: pgloader mysql://user@localhost/sakila postgresql:///pagila Fetching an archived DBF file from a HTTP remote location -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +--------------------------------------------------------- It's possible for pgloader to download a file from HTTP, unarchive it, and only then open it to discover the schema then load the data:: diff --git a/docs/tutorial/tutorial.rst b/docs/tutorial/tutorial.rst index d542d12..8c6a4b2 100644 --- a/docs/tutorial/tutorial.rst +++ b/docs/tutorial/tutorial.rst @@ -1,7 +1,6 @@ -PgLoader Tutorial +Pgloader Tutorial ================= -.. include:: quickstart.rst .. include:: csv.rst .. include:: fixed.rst .. include:: geolite.rst