Experiment with the idea of splitting the read work in several concurrent
threads, where each reader is reading portions of the target table, using a
WHERE id <= x and id > y clause in its SELECT query.
For this to kick-in a number of conditions needs to be met, as described in
the documentation. The main interest might not be faster queries to overall
fetch the same data set, but better concurrency with as many readers as
writters and each couple its own dedicated queue.
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data mungling without slowing us down.
The most interesting factor here is the memory behavor of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer build batches and the count of items in the reader
queue is now a number a rows, not of batches of them.
Anyway, with this patch in I can't reproduce the following issues:
Fixes#337, Fixes#420.
We added some confution about who's responsible to quote the SQL obejct
names in between src/utils/quoting.lisp and src/pgsql/pgsql-ddl.lisp and
as a result some migrations from MySQL with identifier case set to quote
where broken, as in #439.
To fix, remove any use of the format directive ~s in the PostgreSQL ddl
output methods: we consider that the quoting of ~s is to be decided in
apply-identifier-case. We then use ~a instead of ~s.
Fix#439.
This fixes#308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:
http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE
This uses the Postmodern special support for it. Unfortunately couldn't
test it locally other than it doesn't break non-ssl connections. Pushing
to have user feedback.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that get passed log messages so that the
output buffer is not subject to race conditions, extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring
this fixes#283 wherein the before/after sections of individual CSV
files within an ARCHIVE command where not counted in the reporting.
In some cases pgloader total time computing is quite off: in the archive
case because it fails to take into account per-file before and after
sections, and in the general case when there's parallel work done.
This patch helps by storing the start time explicitely and using it at
the moment when the summary is displayed: no guessing any more. This is
only used in the archive case for now because I want some feedback.
On my machine with the usual test cases I have, the difference with and
without this patch is line-noise, something more serious has to be done:
let's push testing to the user by committing this early version of the
work.
This allows users to benefit from the same flexible machinery when using
SQLite as when using MySQL, and also allows to add some more default
cast rules too.