The copy format and batch facilities are no longer the meat of the
PostgreSQL support in the src/pgsql directory, so let them live in their
own space.
Experiment with the idea of splitting the read work across several
concurrent threads, where each reader reads a portion of the target table
using a WHERE id <= x AND id > y clause in its SELECT query.
For this to kick in, a number of conditions need to be met, as described in
the documentation. The main benefit might not be faster queries to fetch
the same overall data set, but better concurrency, with as many readers as
writers and each pair getting its own dedicated queue.
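Something like the following sketch, using lparallel, where fetch-rows is
a hypothetical stand-in for a reader running its ranged SELECT and
partition-bounds only splits the id range:

    (defun fetch-rows (low high)
      ;; Stub for illustration: a real reader would run a query such as
      ;; SELECT ... WHERE id > low AND id <= high against the source table
      ;; and push each row into its dedicated queue.  Here we only return
      ;; how many ids the range covers.
      (- high low))

    (defun partition-bounds (min-id max-id readers)
      ;; Return a list of (low . high) ranges covering (min-id, max-id].
      (let ((step (ceiling (- max-id min-id) readers)))
        (loop :for low := min-id :then high
              :for high := (min max-id (+ low step))
              :while (< low max-id)
              :collect (cons low high))))

    (defun read-concurrently (min-id max-id readers)
      ;; Submit one ranged SELECT per reader and wait for them all.
      (let* ((lparallel:*kernel* (lparallel:make-kernel readers))
             (channel            (lparallel:make-channel))
             (ranges             (partition-bounds min-id max-id readers)))
        (unwind-protect
             (progn
               (loop :for (low . high) :in ranges
                     :do (lparallel:submit-task channel #'fetch-rows low high))
               (loop :repeat (length ranges)
                     :sum (lparallel:receive-result channel)))
          (lparallel:end-kernel :wait t))))

With min-id 0, max-id 1000000 and 4 readers, this submits four SELECT
ranges of 250000 ids each.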
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
each chunk of data down to PostgreSQL. Runtime profiles show that the time
spent writing to PostgreSQL is a fraction of the time spent reading from
MySQL, so we consider that the writer thread has enough time to do the
data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before and easier for SBCL's GC to cope with.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches, and the count of items in the
reader queue is now a number of rows, not of batches of them.
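A rough sketch of that single-queue design, assuming lparallel and
bordeaux-threads are loaded; the queue's fixed capacity plays the part of
prefetch rows, and send-batch is a hypothetical stand-in for the COPY
formatting and the round trip to PostgreSQL:

    (defparameter *prefetch-rows* 100000
      "Upper bound on the number of raw rows waiting in the reader queue.")

    (defun send-batch (batch)
      ;; Stub for illustration: a real writer would cook BATCH into COPY
      ;; text format and send that chunk down to PostgreSQL.
      (format t "~&sending a batch of ~d rows~%" (length batch)))

    (defun reader-writer-pipeline (rows &key (batch-size 25000))
      ;; The reader thread pushes raw rows into a bounded queue; push-queue
      ;; blocks when the queue is full, which keeps memory usage stable.
      ;; The current thread plays the writer: pop rows, cook batches, send.
      (let ((queue (lparallel.queue:make-queue :fixed-capacity *prefetch-rows*))
            (eof   '#:eof))
        (bt:make-thread (lambda ()
                          (dolist (row rows)
                            (lparallel.queue:push-queue row queue))
                          (lparallel.queue:push-queue eof queue))
                        :name "reader")
        (loop :with batch := '()
              :for row := (lparallel.queue:pop-queue queue)
              :until (eq row eof)
              :do (push row batch)
              :when (= (length batch) batch-size)
                :do (send-batch (nreverse batch))
                    (setf batch '())
              :finally (when batch (send-batch (nreverse batch))))))

The fixed capacity is what makes the reader block instead of piling rows
up in memory, which is the prefetch rows behavior described above.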
Anyway, with this patch in place I can't reproduce the following issues:
Fixes #337, Fixes #420.
Many options are now available to pgloader users, including shortcuts that
were not defined clearly enough. That could result in stupid things being
done at times.
In particular, when picking the "data only" option, indexes are not to be
dropped before loading the data, but pgloader would still try to create
them again at the end of the load, because the option that controls that
behavior defaults to true and is not impacted by the "data only" choice.
In this patch we review the logic and ensure it's applied in the same
fashion in the different phases of the database migration: preparation,
copying, rebuilding of indexes and completion of the database model.
See also 96b2af6b2a where we began fixing
oddities but didn't go far enough.
Also known as the ORM case: it happens that other tools are used to
create the target schema. In that case pgloader's job is to fill the
existing target tables with the data from the source tables.
We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.
This behavior is activated when using the “create no tables” option as
in the following test-case setup:
with create no tables, include drop, truncate
Fixes #400, for which I got a test-case to play with!
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.
As this number might be so great as to exhaust the target PostgreSQL
server (e.g. maintenance_work_mem), we add an option to limit it to
something reasonable when the source schema isn't.
Fixes #386, in which 150 indexes are found on a single source table.
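A rough sketch of capping that parallelism with a dedicated lparallel
kernel, where run-sql is a hypothetical stand-in for sending a CREATE
INDEX statement to the target server and max-parallel plays the part of
the new option:

    (defun run-sql (sql)
      ;; Stub for illustration: a real implementation sends SQL to the
      ;; target PostgreSQL connection; here we only log the statement.
      (format t "~&~a~%" sql))

    (defun create-indexes (index-sql-list &key (max-parallel 4))
      ;; Issue every CREATE INDEX, at most MAX-PARALLEL of them at a time,
      ;; so we don't exhaust maintenance_work_mem on the target server.
      (let* ((lparallel:*kernel* (lparallel:make-kernel max-parallel))
             (channel            (lparallel:make-channel)))
        (unwind-protect
             (progn
               (dolist (sql index-sql-list)
                 (lparallel:submit-task channel #'run-sql sql))
               ;; wait for all of them before moving on to constraints
               (loop :repeat (length index-sql-list)
                     :do (lparallel:receive-result channel)))
          (lparallel:end-kernel :wait t))))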
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.
To that effect, tweak again the md-connection and md-copy
implementations.
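A rough sketch of that file-level parallelism, where load-one-file is a
hypothetical stand-in for running the usual single-file pipeline on one
member of the group:

    (defun load-one-file (filename)
      ;; Stub for illustration: a real implementation builds a copy object
      ;; for FILENAME and runs the usual copy-from pipeline on it.
      (format t "~&loading ~a~%" filename)
      filename)

    (defun load-file-group (filenames worker-count)
      ;; Load every file matched by e.g. ALL FILENAMES IN DIRECTORY in
      ;; parallel, bounded by the kernel's worker count.
      (let ((lparallel:*kernel* (lparallel:make-kernel worker-count)))
        (unwind-protect
             (lparallel:pmap 'list #'load-one-file filenames)
          (lparallel:end-kernel :wait t))))

pmap returns once every file in the group has been processed, with at
most worker-count of them in flight at any time.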
It was worker-count and it's now exposed as workers in the WITH
clause, but we can actually keep it as worker-count in the internal API,
and it feels better that way.
Add the workers and concurrency settings to the LOAD commands for
database sources so that users can tweak them now, and add mentions of
them in the documentation too.
From the documentation string of the copy-from method as found in
src/sources/common/methods.lisp:
We allow WORKER-COUNT simultaneous workers to be active at the same time
in the context of this COPY object. A single unit of work consists of
several kinds of workers:
- a reader getting raw data from the COPY source with `map-rows',
- N transformers preparing raw data for PostgreSQL COPY protocol,
- N writers sending the data down to PostgreSQL.
The N here is set to the CONCURRENCY parameter: with a CONCURRENCY of
2, we start (+ 1 2 2) = 5 concurrent tasks; with a CONCURRENCY of 4 we
start (+ 1 4 4) = 9 concurrent tasks, of which only WORKER-COUNT may be
active simultaneously.
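The following sketch only illustrates that arithmetic: the three task
functions are trivial stand-ins for the real map-rows, formatting and COPY
steps, and the kernel size is what bounds how many of the submitted tasks
may be active at once.

    (defun reader ()      (sleep 0.1) :reader-done)      ; stand-in for map-rows
    (defun transformer () (sleep 0.1) :transformer-done) ; stand-in for formatting
    (defun writer ()      (sleep 0.1) :writer-done)      ; stand-in for COPY

    (defun start-copy-tasks (worker-count concurrency)
      ;; Submit (+ 1 concurrency concurrency) tasks, but only WORKER-COUNT
      ;; of them may be active at the same time.
      (let* ((lparallel:*kernel* (lparallel:make-kernel worker-count))
             (channel            (lparallel:make-channel))
             (task-count         (+ 1 concurrency concurrency)))
        (unwind-protect
             (progn
               (lparallel:submit-task channel #'reader)
               (dotimes (i concurrency)
                 (lparallel:submit-task channel #'transformer)
                 (lparallel:submit-task channel #'writer))
               (loop :repeat task-count
                     :collect (lparallel:receive-result channel)))
          (lparallel:end-kernel :wait t))))

With a concurrency of 2 this submits (+ 1 2 2) = 5 tasks; with a
worker-count of 4, only four of them run at once until one finishes.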
Those options should find their way into the remaining sources; that's for
a follow-up patch though.
pgloader's parallel workload is still hardcoded, but at least the code now
uses clear parameters as input, so that it will be possible in a later
patch to expose them to the end user.
The notions of workers and concurrency are now handled as follows:
- concurrency is how many tasks are allowed to happen at once; by
default we have a reader thread, a transformer thread and a COPY
thread all active for each table being loaded,
- worker-count is how many parallel threads are allowed to run
simultaneously and defaults to 8 currently, which means that in a
typical migration from a database source, given the default
concurrency of 1 (3 threads), we might be loading up to 3 different
tables at any time.
The idea is to expose those settings to the user in the load file and as
command line options (such as --jobs) and see what it gives us. It might
help, for example, to use more cores when loading a single CSV file.
As of this patch, there can still be only one reader thread, and the
number of transformer threads must be the same as the number of COPY
threads.
Finally, the user-defined projections for CSV-like files are now handled
in the transformation threads rather than in the reader thread...
In order to share more code between the different source types, finally
have a go at the quite horrible mess of anonymous data structures
floating around.
Having catalog and schema instances not only allows for code cleanup,
but will also make it possible to implement some bug fixes and wishlist
items, such as mapping tables from one schema to another.
Also, supporting database sources that have a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-database methods for database source
types, so that they all use a fetch-metadata generic function and
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized list-all-columns and
friends implementations. Once the catalog has been fetched, an explicit
CAST call is then needed before we can continue.
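As a rough sketch of what that protocol looks like (the generic function
names are the ones mentioned here, but the lambda lists and docstrings
below are simplified illustrations, not pgloader's exact definitions):

    (defgeneric fetch-metadata (copy catalog &key)
      (:documentation
       "Introspect the source behind COPY and fill CATALOG with its schemas,
        tables, columns and indexes, using list-all-columns and friends."))

    (defgeneric prepare-pgsql-database (copy catalog &key)
      (:documentation
       "Create or truncate the target schema before the data copy."))

    (defgeneric complete-pgsql-database (copy catalog &key)
      (:documentation
       "Re-create indexes, constraints and the rest of the model afterwards."))

    (defun copy-database-sketch (copy catalog)
      ;; Simplified driver, in the spirit of the single copy-database
      ;; method mentioned above, for every database source type.
      (fetch-metadata copy catalog)           ; introspect the source
      ;; an explicit CAST step rewrites the catalog for PostgreSQL here
      (prepare-pgsql-database copy catalog)   ; build the target model
      ;; ... run the parallel data copy for each table ...
      (complete-pgsql-database copy catalog)) ; indexes, constraints, comments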
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the data
was then the responsibility of the reader thread.
Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.
In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.
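A rough sketch of that three-worker layout, assuming lparallel and
bordeaux-threads are loaded, with format-row as a trivial stand-in for
the transforms and the COPY text encoding:

    (defun format-row (row)
      ;; Stub for illustration: the real transformer applies the column
      ;; transforms and encodes the row for the COPY text protocol; here
      ;; we simply join the raw values with a space.
      (format nil "~{~a~^ ~}" (coerce row 'list)))

    (defun three-worker-pipeline (rows)
      ;; reader -> raw-queue -> transformer -> processed-queue -> writer
      (let ((raw-queue       (lparallel.queue:make-queue))
            (processed-queue (lparallel.queue:make-queue))
            (eof             '#:eof))
        ;; reader: only extracts raw vectors of data
        (bt:make-thread (lambda ()
                          (dolist (row rows)
                            (lparallel.queue:push-queue row raw-queue))
                          (lparallel.queue:push-queue eof raw-queue))
                        :name "reader")
        ;; transformer: pre-formats rows for the COPY protocol
        (bt:make-thread (lambda ()
                          (loop :for row := (lparallel.queue:pop-queue raw-queue)
                                :until (eq row eof)
                                :do (lparallel.queue:push-queue
                                     (format-row row) processed-queue)
                                :finally (lparallel.queue:push-queue
                                          eof processed-queue)))
                        :name "transformer")
        ;; writer: played by the current thread, would COPY batches out
        (loop :for payload := (lparallel.queue:pop-queue processed-queue)
              :until (eq payload eof)
              :count payload)))

The two queues decouple the three roles, so the reader never waits on the
COPY round trips.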
On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.
Next improvements are going to allow more features: a specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.
And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard-coded as it is today.
It's on the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with whatever
it is you find by doing so!
When devising the common API, the first step has been to implement
specific methods for each generic function of the protocol. It now
appears that in some cases we don't need the extra level of flexibility:
each change of the API had to be propagated to all the specific methods,
so just use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations
rather than having to copy/paste and maintain several versions of the
same code.
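As a rough CLOS sketch of the idea (md-copy is the intermediate class
name used elsewhere in this log, but the slots and the describe-source
method are hypothetical illustrations, not pgloader's actual code):

    (defclass copy ()
      ((source-db :initarg :source-db :accessor source-db)
       (target-db :initarg :target-db :accessor target-db))
      (:documentation "Base class shared by every source type."))

    (defclass md-copy (copy)
      ((encoding   :initarg :encoding   :accessor encoding :initform :utf-8)
       (skip-lines :initarg :skip-lines :accessor skip-lines :initform 0))
      (:documentation "Intermediate class for the file based sources."))

    (defgeneric describe-source (copy)
      (:documentation "Hypothetical example of a method shared via md-copy."))

    (defmethod describe-source ((copy md-copy))
      ;; One method on the intermediate class instead of one copy/pasted
      ;; version per file based subclass.
      (format t "~&loading ~a into ~a, encoding ~a, skipping ~d line~:p~%"
              (source-db copy) (target-db copy)
              (encoding copy) (skip-lines copy)))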
It would be good to be able to centralize more code for the database
sources and the way they are organized around the metadata / import data /
complete schema phases, but it doesn't look obvious how to do that just now.