This pgloader command makes it possible to migrate tables while changing the
schema they belong to between their MySQL source database and their
PostgreSQL target database.
This changes the default behavior of pgloader with MySQL: rather than always
targeting the 'public' schema, it now targets by default a schema named
after the MySQL database. You can revert to the old behavior by
adding a rule:

    ALTER SCHEMA 'dbname' RENAME TO 'public'
We might want to add a way to restore the previous default behavior in a
later patch.
Also see #489, where it used not to be possible to rename the schema at
migration time, causing strange errors (you had to spot NIL as the
schema name in the "failed to find target table" messages).
Following up on the recent refactoring effort, the IXF and DB3 source
classes didn't get the memo that they could piggyback on the generic
copy-database implementation. This patch implements that.
In passing, also simplify the instanciate-table-copy-object method for
copy subclasses that need specialization here, by using change-class and
call-next-method so as to reuse the generic code as much as possible.
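As a minimal CLOS sketch of that pattern (toy classes and slots, not
pgloader's real definitions; only the instanciate-table-copy-object name
comes from the code):

    (defclass copy ()
      ((source :initarg :source :accessor copy-source)))

    (defclass copy-db3 (copy) ())   ; stand-in for the DB3 source class

    (defgeneric instanciate-table-copy-object (copy table))

    (defmethod instanciate-table-copy-object ((copy copy) table)
      ;; generic code shared by every source type
      (make-instance 'copy :source table))

    (defmethod instanciate-table-copy-object ((copy copy-db3) table)
      ;; specialization: run the generic code, then adjust the class
      (let ((new (call-next-method)))
        (change-class new 'copy-db3)))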
In order to share more code between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items such
as mapping tables from one schema to another.
Also, supporting database sources that have a notion of "schema" (between
"catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).
See #320, #316, #224 for references and a notion of progress being made.
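In rough sketch form, the structure this introduces looks like the
following (class and slot names are simplified here, not pgloader's exact
definitions):

    (defclass column ()
      ((name      :initarg :name      :reader column-name)
       (type-name :initarg :type-name :reader column-type-name)))

    (defclass table ()
      ((name    :initarg :name    :reader table-name)
       (columns :initarg :columns :initform nil :accessor table-columns)))

    (defclass schema ()
      ((name   :initarg :name   :reader schema-name)
       (tables :initarg :tables :initform nil :accessor schema-tables)))

    (defclass catalog ()
      ((schemas :initarg :schemas :initform nil :accessor catalog-schemas)))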
In passing, also clean up the copy-database methods for database source
types, so that they all use a fetch-metadata generic function along with
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.
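As a structural outline only (the generic function names above are real,
but their signatures and the two helper stand-ins below are assumptions of
this sketch):

    (defgeneric fetch-metadata (copy))            ; introspect, return a catalog
    (defgeneric prepare-pgsql-database (copy catalog))
    (defgeneric complete-pgsql-database (copy catalog))

    (defun apply-casting-rules (catalog) catalog) ; stand-in for the CAST step
    (defun stream-table-data (copy catalog)       ; stand-in for the data copy
      (declare (ignore copy catalog)))

    (defun copy-database-outline (copy)
      "Hypothetical outline of the single shared copy-database method."
      (let ((catalog (fetch-metadata copy)))      ; list-all-columns and friends
        (apply-casting-rules catalog)
        (prepare-pgsql-database copy catalog)
        (stream-table-data copy catalog)
        (complete-pgsql-database copy catalog)))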
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to the point where the data copy step
starts; from there the copy class instances are all that's used.
This might be refactored again in a follow-up patch.
When the "WITH no foreign keys" option is in use, there's no need to
read the foreign key bits of information_schema at all, so just don't
issue the query; the same goes for the "create no indexes" option.
In not-that-old versions of MySQL, the referential_constraints table of
information_schema doesn't exist, so this should make pgloader
compatible with MySQL 5.0-something and earlier.
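In sketch form, the guard is nothing more than a conditional around the
metadata query (the function below is illustrative, not pgloader's actual
code):

    (defun maybe-list-foreign-keys (run-query-fn &key (foreign-keys t))
      "Only issue the referential_constraints query when fkeys are wanted."
      (when foreign-keys                ; nil under "WITH no foreign keys"
        (funcall run-query-fn)))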
We used to wait for the wrong number of workers, meaning the rest of the
code began running before the indexes were all available. A user report
where one of the indexes takes a very long time to build made that
obvious.
In passing, also improve reporting of those rendez-vous sections.
It's been proven by Andres Freund's benchmarks that the best number of
parallel COPY threads concurrently active against a single table is 2, as
of the current PostgreSQL development version (up to 9.5 stable; it might
still apply to 9.6 depending on how we might solve the problem).
Henceforth, hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain
better throughput and not stall completely.
Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY workers,
a reader, and a transformer thread.
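The arithmetic behind that default, as a hedged sketch (the actual call
site and form in the source tree may differ):

    ;; 2 tables in flight x (1 reader + 1 transformer + 2 COPY workers) = 8
    (setf lparallel:*kernel* (lparallel:make-kernel (* 2 (+ 1 1 2))))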
It's all still experimental as far as performance goes; the next patches
should bring the ability to configure the desired parallelism from the
command line and the load command, though.
Also, other source types will want to benefit from the same capabilities;
it just happens to be easier to experiment with MySQL first here.
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the
data was then the responsibility of the reader thread.
Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.
In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed queue contains batches of properly
encoded strings for the COPY text protocol.
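Here is a self-contained toy version of that pipeline using lparallel
queues; the :done marker, queue capacities, and the naive comma formatting
are local to this sketch (rows is assumed to be a list of lists):

    (defun pipeline-sketch (rows)
      "Reader and transformer run as lparallel tasks; the writer runs here."
      (let* ((lparallel:*kernel* (lparallel:make-kernel 2))
             (raw-queue       (lparallel.queue:make-queue :fixed-capacity 128))
             (processed-queue (lparallel.queue:make-queue :fixed-capacity 128))
             (channel         (lparallel:make-channel)))
        (unwind-protect
             (progn
               ;; reader: only extracts data, pushing raw rows
               (lparallel:submit-task
                channel (lambda ()
                          (dolist (row rows)
                            (lparallel.queue:push-queue row raw-queue))
                          (lparallel.queue:push-queue :done raw-queue)))
               ;; transformer: pre-formats each raw row for the COPY protocol
               (lparallel:submit-task
                channel (lambda ()
                          (loop for row = (lparallel.queue:pop-queue raw-queue)
                                until (eq row :done)
                                do (lparallel.queue:push-queue
                                    (format nil "~{~a~^,~}" row) processed-queue)
                                finally (lparallel.queue:push-queue
                                         :done processed-queue))))
               ;; writer: this is where the real code would COPY into PostgreSQL
               (loop for line = (lparallel.queue:pop-queue processed-queue)
                     until (eq line :done)
                     collect line))
          (lparallel:end-kernel :wait t))))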
On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.
Next improvements are going to allow more features: a specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.
And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard-coded as it is today.
It's on the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with whatever
it is you find by doing so!
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.
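A minimal sketch of that bookkeeping, assuming each worker captures its
start with get-internal-real-time and publishes it to the main thread (the
variable names in the comment are hypothetical):

    (defun seconds-since (start-internal-time)
      "Elapsed wall-clock seconds since START-INTERNAL-TIME was captured."
      (/ (- (get-internal-real-time) start-internal-time)
         internal-time-units-per-second))

    ;; e.g. in the summary: (seconds-since reader-start-time) for :source,
    ;;                      (seconds-since writer-start-time) for :target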
We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...
Let's solve problems one at a time, though; I guess multi-threading and
accurate wall clock times aren't expected to mix and match that easily
anyway (multiple cores, a single RTC and all that).
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the worker setup rather than the workers themselves.
From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.
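In toy form the change amounts to submitting every table's copy as a task
up front and doing the rendez-vous once at the end (kernel size and
plumbing here are illustrative, not pgloader's code):

    (defun copy-tables-in-parallel (tables copy-one-table-fn)
      (let* ((lparallel:*kernel* (lparallel:make-kernel 8))
             (channel (lparallel:make-channel)))
        (unwind-protect
             (progn
               (dolist (table tables)            ; non-blocking submissions
                 (lparallel:submit-task channel copy-one-table-fn table))
               (loop repeat (length tables)      ; single rendez-vous at the end
                     collect (lparallel:receive-result channel)))
          (lparallel:end-kernel :wait t))))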
On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.
In passing, stop sharing the same connection object between parallel
workers that used to be kept active only in sequence; see the new
clone-connection API (which takes over from new-pgsql-connection).
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that in
some cases we don't need that extra level of flexibility: every change to
the API had to be systematically propagated to all the specific methods,
so just use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations
rather than having to copy/paste and maintain several versions of the
same code.
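Schematically (the class and generic function names below approximate the
idea rather than reproduce pgloader's hierarchy):

    (defclass copy () ())          ; root of all COPY source types
    (defclass db-copy (copy) ())   ; intermediate class for database sources

    (defgeneric prepare-target (copy))

    ;; one method on the intermediate class instead of a copy/pasted
    ;; method per database source type
    (defmethod prepare-target ((copy db-copy))
      :shared-implementation)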
It would be good to be able to centralize more code for the database
sources and the way they are organized around the fetch metadata / import
data / complete schema steps, but how to do that doesn't look obvious
just now.
Updating the stats used to be a quite simple incf, and doing it once per
row read was good enough; but now that it involves sending a message to
the monitor thread, let's only send one message per batch, reducing the
communication load here.
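A tiny sketch of the idea; the monitor messaging is abstracted as a
function argument here, and pgloader's real API differs:

    (defun copy-batch-sketch (batch send-to-monitor-fn)
      "Count rows locally, then send one stats message for the whole batch."
      (let ((rows-read 0))
        (dolist (row batch)
          (declare (ignorable row))
          (incf rows-read))                       ; the old per-row incf, kept local
        (funcall send-to-monitor-fn :read rows-read))) ; one message per batch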
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that the
output buffer is not subject to race conditions; extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring,
this fixes #283, wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
MySQL names its primary keys "PRIMARY", and we need to always uniquify
this name even when the user asked pgloader to preserve index names.
Also, the create-indexes-again function now needs to ask for index names
to be preserved specifically.
See test/parse/hans.goeuro.load for an example usage of the new option.
In passing, any error when creating indexes is now properly reported and
logged, which was missing previously. Oops.
This option is dangerous: it allows skipping ALL triggers when loading
data into PostgreSQL. This includes foreign key constraint
definitions, and it will allow loading data out of order.
When using both the "create no table" and "disable triggers" options, it
becomes possible to load data into a schema prepared by your favorite
external tool, at the cost of not validating FK constraints. Use with
care.
Fix #167.
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a uniform "connection"
concept (a class and generic function API) everywhere, so that the COPY-
derived objects just use that in their :source-db and :target-db slots.
Given that, we don't need any messing around with *pgconn* and *myconn-*
and other special variables anywhere in the tree.
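The shape of it, roughly (class names, slots, and generic functions here
are approximations of the uniform connection API, not its exact
definition):

    (defclass connection ()
      ((handle :initform nil :accessor connection-handle)))

    (defclass pgsql-connection (connection) ())
    (defclass mysql-connection (connection) ())

    (defgeneric open-connection (connection &key &allow-other-keys))
    (defgeneric close-connection (connection))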
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when a new API was put in place.
Third, fix any other oddity or missing part found while doing those
first two activities; it was long overdue anyway...