Commit Graph

15 Commits

Author SHA1 Message Date
Dimitri Fontaine
b54ca576cb Raise some log messages.
We should be able to follow the progress more easily at the log level
NOTICE, so raise some log messages from INFO to NOTICE.
2017-01-28 17:44:18 +01:00
Dimitri Fontaine
70572a2ea7 Implement support for existing target databases.
Also known as the ORM case, it happens that other tools are used to
create the target schema. In that case pgloader's job is to fill in the
existing target tables with the data from the source tables.

We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.

This behavior is activated when using the “create no tables” option as
in the following test-case setup:

  with create no tables, include drop, truncate

Fixes #400, for which I got a test-case to play with!
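
To make the idea concrete, here's a minimal sketch of the
drop-then-reinstall dance using Postmodern (indexes only, made-up helper
names, and it assumes an open connection to the target database; the
real code also takes care of the constraints):

  ;; needs Postmodern: (ql:quickload "postmodern") and an open connection
  (defun load-into-existing-table (table-name copy-data-fn)
    "Drop the target table's indexes, run the COPY, then re-create them."
    (let ((index-defs
            (pomo:query "select indexname, indexdef from pg_indexes
                          where tablename = $1" table-name)))
      ;; drop the indexes so the COPY runs at full speed
      (loop :for (name nil) :in index-defs
            :do (pomo:execute (format nil "drop index ~a" name)))
      ;; fill in the existing table
      (funcall copy-data-fn table-name)
      ;; re-install what we found in the target database
      (loop :for (nil def) :in index-defs
            :do (pomo:execute def))))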
2016-08-06 20:19:15 +02:00
Dimitri Fontaine
1ed07057fd Implement --on-error-stop command line option.
The implementation uses the dynamic binding *on-error-stop* so it's also
available when pgloader is used as a Common Lisp library.
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!

Fix #85.
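
For readers coming from the Common Lisp side, here's a minimal sketch of
the pattern (standard CL only; everything except *on-error-stop* is a
made-up stand-in, not the pgloader code):

  (defvar *on-error-stop* nil
    "When non-nil, stop at the first error instead of continuing.")

  (defun load-one-row (row)
    "Stand-in for the real per-row loading code."
    (when (null row) (error "bad row")))

  (defun process-rows (rows)
    (dolist (row rows)
      (handler-case (load-one-row row)
        (error (e)
          (format *error-output* "~&error: ~a~%" e)
          (when *on-error-stop*
            (return-from process-rows :stopped))))))

  ;; the command line option (or a Lisp caller) only has to rebind it:
  (let ((*on-error-stop* t))
    (process-rows '((1) (2) nil (3))))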
2016-03-21 20:52:50 +01:00
Dimitri Fontaine
40c1581794 Review transaction and error handling in COPY.
The PostgreSQL COPY protocol requires an explicit initialization phase
that may fail, and in this case the Postmodern driver transaction is
already dead, so there's no way we can even send ABORT to it.

Review the error handling of our copy-batch function to cope with that
fact, and add some logging of non-retryable errors we may have.

Also improve the thread error reporting when using a binary image, where
it might be difficult to open an interactive debugger, while still
keeping the full-blown Common Lisp debugging experience for the project
developers.

Add a test case for a missing column as in issue #339.

Fix #339, see #337.
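
To illustrate the pattern (standard Common Lisp only; every name below
is a stand-in rather than pgloader's code): when the COPY initialization
itself fails, the transaction is already gone, so the handler only logs
and reports a non-retryable error instead of trying to send ABORT.

  (define-condition copy-init-error (error) ()
    (:documentation "The COPY initialization phase failed."))

  (defun copy-batch (batch &key fail-init)
    "Pretend to COPY a batch; signal at the initialization phase on demand."
    (when fail-init (error 'copy-init-error))
    (length batch))

  (defun copy-batch-with-error-handling (batch &rest args)
    (handler-case (apply #'copy-batch batch args)
      (copy-init-error (e)
        ;; the transaction died during COPY initialization: sending ABORT
        ;; is pointless, so log the non-retryable error and give up
        (format *error-output* "~&non-retryable: ~a~%" e)
        nil)
      (error (e)
        ;; other errors go through whatever retry logic applies
        (format *error-output* "~&retryable: ~a~%" e)
        :retry)))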
2016-02-21 15:56:06 +01:00
Dimitri Fontaine
4e36bd3c55 Improve threads error handling.
See #328 where we are lacking a useful stack trace in a --debug run
because of the previous task-handler-bind coding, which was there to
avoid drowning users in too many details. Let's try another approach
here.
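
The rough idea, as a sketch (made-up names, not the pgloader code): bind
a handler that logs and unwinds in a delivered binary, and stay out of
the way so developers keep the interactive debugger.

  (defvar *debugger-on-error* nil)

  (defun call-with-thread-error-reporting (thunk)
    (if *debugger-on-error*
        (funcall thunk)                 ; land in the interactive debugger
        (handler-bind
            ((error (lambda (e)
                      (format *error-output* "~&thread died: ~a~%" e)
                      ;; a library such as trivial-backtrace could print
                      ;; the stack trace here before unwinding
                      (return-from call-with-thread-error-reporting nil))))
          (funcall thunk))))

  ;; e.g. (call-with-thread-error-reporting (lambda () (error "oops")))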
2016-01-24 21:43:46 +01:00
Dimitri Fontaine
7dd69a11e1 Implement concurrency and workers for files sources.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files of the group being processed in parallel.

To that effect, tweak again the md-connection and md-copy
implementations.
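
A minimal sketch of loading the files of a group in parallel with
lparallel (made-up function names, not the md-copy code itself):

  ;; needs lparallel; load-one-file is whatever loads a single file
  (defun load-files-in-parallel (files worker-count load-one-file)
    (let* ((lparallel:*kernel* (lparallel:make-kernel worker-count))
           (channel (lparallel:make-channel)))
      (unwind-protect
           (progn
             (dolist (file files)
               (lparallel:submit-task channel load-one-file file))
             ;; wait until every file of the group is done
             (loop :repeat (length files)
                   :collect (lparallel:receive-result channel)))
        (lparallel:end-kernel :wait t))))

  ;; e.g. (load-files-in-parallel (directory "*.csv") 4 #'probe-file)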
2016-01-16 22:53:55 +01:00
Dimitri Fontaine
f256e12a4f Review load parallelism settings.
pgloader's parallel workload is still hardcoded, but at least the code
now uses clear parameters as input, so that it will be possible in a
later patch to expose them to the end-user.

The notions of workers and concurrency are now handled as follows:

  - concurrency is how many tasks are allowed to happen at once; by
    default we have a reader thread, a transformer thread and a COPY
    thread all active for each table being loaded,

  - worker-count is how many parallel threads are allowed to run
    simultaneously and defaults to 8 currently, which means that in a
    typical migration from a database source, given the default
    concurrency of 1 (3 threads), we might be loading up to 3 different
    tables at any time.

The idea is to expose those settings to the user in the load file and as
command line options (such as --jobs) and see what it gives us. It might
help e.g. to use more cores when loading a single CSV file.

As of this patch, there can still be only one reader thread, and the
number of transformer threads must be the same as the number of COPY
threads.

Finally, user-defined projections for CSV-like files are now handled in
the transformation threads rather than in the reader thread...
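
As a sketch of where those parameters plug in (hypothetical names and
shapes, not the actual code):

  ;; hypothetical names, only to make the two settings concrete
  (defparameter *worker-count* 8
    "How many parallel threads may run simultaneously.")
  (defparameter *concurrency* 1
    "How many reader/transformer/COPY pipelines per table.")

  (defun start-load-kernel ()
    (setf lparallel:*kernel* (lparallel:make-kernel *worker-count*)))

  (defun submit-table-load (channel reader transformer writer)
    ;; at the default concurrency of 1, each table consumes 3 tasks
    (lparallel:submit-task channel reader)
    (lparallel:submit-task channel transformer)
    (lparallel:submit-task channel writer))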
2016-01-11 01:43:38 +01:00
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.

Having catalog and schema instances not only allows for code cleanup,
but will also make it possible to implement some bug fixes and wishlist
items, such as mapping tables from one schema to another.

Also, supporting database sources having a notion of "schema" (in
between "catalog" and "table") should get easier, including getting the
MS SQL support on par with MySQL (materialized views have been asked for
already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function along with
prepare-pgsql-database and complete-pgsql-database generic functions.
Actually, a single method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized implementations of
list-all-columns and friends. Once the catalog has been fetched, an
explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.

This might be refactored again in a follow-up patch.
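
As a rough sketch of the shape of things (illustrative slots and names
only, the real catalog classes carry much more):

  (defclass catalog ()
    ((schemas :initarg :schemas :accessor catalog-schemas)))
  (defclass schema ()
    ((name   :initarg :name   :accessor schema-name)
     (tables :initarg :tables :accessor schema-tables)))
  (defclass table ()
    ((name    :initarg :name    :accessor table-name)
     (columns :initarg :columns :accessor table-columns)))

  (defgeneric fetch-metadata (copy-source)
    (:documentation
     "Introspect COPY-SOURCE and return a catalog describing it."))

  ;; each source type then specializes it, for instance
  ;; (defmethod fetch-metadata ((source copy-mysql)) ...) -- name invented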
2015-12-30 21:53:01 +01:00
Dimitri Fontaine
b4bfa18877 Fix more table name quoting, fix #163 again.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case
and, in passing, fix some call sites of apply-identifier-case.

Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
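
For illustration, a tiny sketch of the keyword-aware quoting (the real
*pgsql-reserved-keywords* list is much longer, and apply-identifier-case
does more than this):

  (defparameter *pgsql-reserved-keywords* '("user" "order" "group" "select"))

  (defun maybe-quote-identifier (name)
    "Double-quote NAME when it collides with a reserved keyword."
    (if (member (string-downcase name) *pgsql-reserved-keywords*
                :test #'string=)
        (format nil "~s" name)
        name))

  ;; (maybe-quote-identifier "order") => "\"order\""
  ;; (maybe-quote-identifier "title") => "title"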
2015-12-08 11:52:43 +01:00
Dimitri Fontaine
cca44c800f Simplify batch and transformation handling.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.

Also review the organisation of responsibilities in the code, which
allows moving queue.lisp into utils/batch.lisp, renaming it as its scope
has been reduced to only caring about preparing batches.

This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers, but
it failed for multiple reasons. All that's left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
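
The division of labour, as a sketch (stand-in names, no escaping,
standard Common Lisp only):

  (defun make-batch (rows)
    "Reader side: pack raw rows into a batch, no transformation yet."
    (coerce rows 'vector))

  (defun row-to-copy-string (row)
    "Very rough COPY text line (no escaping): tab-separated, newline-ended."
    (with-output-to-string (s)
      (loop :for (value . more) :on row
            :do (princ value s)
            :when more :do (write-char #\Tab s))
      (write-char #\Newline s)))

  (defun transform-batch (batch)
    "Worker side: turn a batch of raw rows into COPY-ready strings."
    (map 'vector #'row-to-copy-string batch))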
2015-11-29 17:35:25 +01:00
Dimitri Fontaine
bc870ac96c Use 2 copy threads per target table.
Andres Freund's benchmarks have shown that the best number of parallel
COPY threads concurrently active against a single table is 2 as of
PostgreSQL's current development version (up to 9.5 stable, and it might
still apply to 9.6 depending on how we solve the problem).

Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.

Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY workers,
a reader and a transformer thread.

It's all still experimental as far as performance goes; the next patches
should bring the capability to configure the wanted parallelism from the
command line and the load command, though.

Also, other source types will want to benefit from the same
capabilities; it just happens that it's easier to play with MySQL first
for a number of reasons.
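
Making the arithmetic concrete, with hypothetical parameter names:

  ;; reader + transformer + 2 COPY threads = 4 threads per table,
  ;; so a kernel of 8 workers loads two tables concurrently
  (defparameter *copy-threads-per-table* 2)
  (defparameter *threads-per-table* (+ 1 1 *copy-threads-per-table*))
  (defparameter *mysql-default-workers* 8)

  ;; (floor *mysql-default-workers* *threads-per-table*) => 2 tables at once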
2015-11-17 17:06:36 +01:00
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the data
was then the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.

The next improvements are going to allow more features: a specialized
batch retry thread and parallel table copy scheduling for database
sources. Let's also continue caring about performance and play with
having several worker and writer threads for each reader, in later
patches.

And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard-coded as it is today.
It's on the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with whatever
it is you find by doing so!
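
The overall shape of the pipeline, as a self-contained sketch with
lparallel (every name below is a stand-in, and prin1-to-string stands in
for the real COPY text encoding):

  (defun run-pipeline (raw-batches write-batch)
    "Reader -> raw queue -> transformer -> processed queue -> writer."
    (let* ((lparallel:*kernel* (lparallel:make-kernel 3))
           (channel            (lparallel:make-channel))
           (raw-queue          (lparallel.queue:make-queue))
           (processed-queue    (lparallel.queue:make-queue)))
      (unwind-protect
           (progn
             ;; reader: only extracts data, pushes batches of raw rows
             (lparallel:submit-task channel
               (lambda ()
                 (dolist (batch raw-batches)
                   (lparallel.queue:push-queue batch raw-queue))
                 (lparallel.queue:push-queue :done raw-queue)))
             ;; transformer: encodes raw rows for the COPY text protocol
             (lparallel:submit-task channel
               (lambda ()
                 (loop :for batch := (lparallel.queue:pop-queue raw-queue)
                       :until (eq batch :done)
                       :do (lparallel.queue:push-queue
                            (map 'vector #'prin1-to-string batch)
                            processed-queue))
                 (lparallel.queue:push-queue :done processed-queue)))
             ;; writer: sends the pre-formatted batches to the target
             (lparallel:submit-task channel
               (lambda ()
                 (loop :for batch := (lparallel.queue:pop-queue processed-queue)
                       :until (eq batch :done)
                       :do (funcall write-batch batch))))
             (loop :repeat 3 :do (lparallel:receive-result channel)))
        (lparallel:end-kernel :wait t))))

  ;; e.g. (run-pipeline (list #(1 2) #(3 4)) #'print)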
2015-10-23 00:17:58 +02:00
Dimitri Fontaine
4f3b3472a2 Cleanup and timing display improvements.
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.

We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...

Let's solve problems one at a time though; I guess multi-threading and
accurate wall clock times can't be expected to mix and match that easily
anyway (multiple cores, a single RTC and all that).
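
A sketch of the publish-your-own-start-time idea (hypothetical names; a
real version would also guard the table against concurrent writers):

  (defvar *start-times* (make-hash-table))

  (defun note-start (label)
    (setf (gethash label *start-times*) (get-internal-real-time)))

  (defun seconds-since-start (label)
    (let ((start (gethash label *start-times*)))
      (when start
        (/ (- (get-internal-real-time) start)
           internal-time-units-per-second))))

  ;; e.g. (note-start :source) in the reader, (note-start :target) in the
  ;; writer, then the summary uses (seconds-since-start :target) and friends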
2015-10-22 22:36:40 +02:00
Dimitri Fontaine
633067a0fd Allow more parallelism in database migrations.
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers setup rather than the workers themselves.

From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.

On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.

In passing, stop sharing the same connection object between parallel
workers that used to be kept active only in sequence; see the new API
clone-connection (which takes over from new-pgsql-connection).
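
The scheduling change, as a sketch with lparallel (made-up names; it
assumes lparallel:*kernel* is already set up):

  (defun copy-all-tables (tables copy-one-table)
    (let ((channel (lparallel:make-channel)))
      (dolist (table tables)
        ;; each task should use its own connection (clone-connection
        ;; territory) rather than a shared one
        (lparallel:submit-task channel copy-one-table table))
      ;; the blocking receive-result calls only happen here, at the end
      (loop :repeat (length tables)
            :collect (lparallel:receive-result channel))))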
2015-10-20 22:15:55 +02:00
Dimitri Fontaine
41e9eebd54 Rationalize common generic API implementation.
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that
in some cases we don't need the extra level of flexibility: each change
of the API had to be systematically propagated to all the specific
methods, so just use a single generic definition where possible.

In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations,
rather than having to copy/paste and maintain several versions of the
same code.

It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
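
The shape of it, as a sketch (class and method names here are only
illustrative of the idea, not pgloader's actual hierarchy):

  (defclass copy () ())              ; root of the common COPY protocol
  (defclass db-copy (copy) ())       ; database sources (MySQL, MS SQL, ...)
  (defclass md-copy (copy) ())       ; multi-file sources (CSV, fixed, ...)

  (defgeneric copy-database (copy)
    (:documentation "Run a complete load from COPY's source."))

  ;; one method on the intermediate class replaces copy/pasted ones
  (defmethod copy-database ((copy db-copy))
    (format t "~&fetch metadata, prepare target, load data, complete schema~%"))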
2015-10-05 21:25:21 +02:00