Our PostgreSQL driver uses CFFI to load SSL support from OpenSSL, and
as a result the certificate and key file names should be strings
rather than pathnames. Should fix #308 again...
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:
http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE
This uses Postmodern's special support for it. Unfortunately I couldn't
test it locally, other than checking that it doesn't break non-SSL
connections. Pushing to get user feedback.
That allows the copy-column-list specific method for MySQL to be a
method of the common pgloader.sources::copy-column-list generic
function, and then to be called again when needed.
This fixes an oversight in #41e9eeb and fixes #132 again.
Trying to provide a new one fails with an error, that I missed because I
must have forgotten to `make clean` when adding the previous #+ccl
variant here...
This alone isn't enough to fix building for CCL, but it already improves
the situation reported at #303. The next failure is something I fail to
understand tonight:
Fatal SIMPLE-ERROR:
Compilation failed: In MAKE-DOUBLE-FLOAT: Type declarations violated in (THE FIXNUM 4294967295) in /Users/dim/dev/pgloader/build/quicklisp/local-projects/qmynd/src/common/utilities.lisp
When using Clozure Common Lisp, apparently an :absolute directory
component for make-pathname is supposed to contain a single path
component; fix by using parse-native-namestring instead.
In case it's needed, the following spelling seems portable enough:
CL-USER> (uiop:merge-pathnames*
          (uiop:make-pathname* :directory '(:relative "pgloader"))
          (uiop:make-pathname* :directory '(:absolute "tmp")))
#P"/tmp/pgloader/"
We used to have a reader and a writer cooperating concurrently to load
the data from the source into PostgreSQL. The transformation of the
data was then the responsibility of the reader thread.
Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.
In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.
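The reader/worker/writer split described above can be sketched as follows.
This is an illustration in Python, not pgloader's Common Lisp code: the
function names, the DONE sentinel and the batch size are assumptions made
for the example, queue.Queue stands in for the lparallel queues, and the
escaping of special characters required by the COPY text format is left
out.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream (assumption of this sketch)

def reader(rows, raw_queue):
    """Extract rows only; no transformation happens here."""
    for row in rows:
        raw_queue.put(row)
    raw_queue.put(DONE)

def worker(raw_queue, processed_queue, batch_size):
    """Turn raw rows into batches of tab-separated lines, with \\N for NULL."""
    batch = []
    while True:
        row = raw_queue.get()
        if row is DONE:
            break
        batch.append("\t".join(r"\N" if col is None else str(col) for col in row))
        if len(batch) == batch_size:
            processed_queue.put(batch)
            batch = []
    if batch:  # flush the last, possibly short, batch
        processed_queue.put(batch)
    processed_queue.put(DONE)

def writer(processed_queue, sink):
    """Stand-in for the COPY writer: collects the batches in a list."""
    while True:
        batch = processed_queue.get()
        if batch is DONE:
            break
        sink.append(batch)

def run_pipeline(rows, batch_size=2):
    raw_queue, processed_queue, sink = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=reader, args=(rows, raw_queue)),
        threading.Thread(target=worker, args=(raw_queue, processed_queue, batch_size)),
        threading.Thread(target=writer, args=(processed_queue, sink)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sink
```

With a single worker the batches reach the writer in source order; using
several workers per reader would require deciding whether that order matters.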
On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.
Next improvements are going to allow more features: specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader, in later patches.
And some day, too, we will need to make the number of workers a
user-defined variable rather than something hard-coded as today. It's on
the todo list; meanwhile, dear user, consider changing the (make-kernel 6)
into (make-kernel 12) or something else in src/sources/mysql/mysql.lisp,
and consider enlightening me with whatever it is you find by doing so!
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.
We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...
Let's solve problems one at a time tho, I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multi cores, single RTC and all that).
As seen in #302 it's possible to define a SQLite column of type "integer
auto_increment". In my testing tho, it doesn't mean a thing. Worse than
that, apparently when an integer column is created that is also used as
the primary key of the table, the notation "integer auto_increment
primary key" disables the rowid behavior that is certainly expected.
Let's not yet mark the bug as fixed as I suppose we will have to do
something about this rowid mess. Thanks again SQLite.
When calling lparallel:receive-result from the main thread we lose
the ability to measure proper per-table processing times, because it all
happens in parallel and the main thread seems to receive events in the
same millisecond as when the worker is started, meaning it's all 0.0s.
So when we don't have "secs" stats, pick the greatest of read or write
time, which we do have from the worker threads themselves.
The numbers are still wrong, but less so than the "0.0s" displayed before
this patch.
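The fallback amounts to something like the following Python sketch. The
function name is hypothetical, and treating both a missing value and 0.0
as "no secs stats" is an assumption of the example.

```python
def displayed_seconds(secs, read_time, write_time):
    """Pick the per-table time to show in the summary."""
    if secs:  # a usable wall-clock measurement exists
        return secs
    # Otherwise fall back to the slower of the two worker-side timings.
    return max(read_time, write_time)
```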
Now that we can set up many concurrent threads working against the
PostgreSQL database, and before we open the number of workers to our
users, install a heuristic to manage the PostgreSQL errors “too many
connections” and “configuration limit exceeded” so that pgloader
waits for some time (*retry-connect-delay*) then tries connecting again.
It's quite simplistic but should cover lots of borderline cases way
more nicely than just throwing the interactive debugger at the end user.
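The wait-then-retry heuristic looks roughly like this Python sketch. The
two exception classes, the connect callable and the optional max_attempts
cap are assumptions of the example; the commit only says that pgloader
waits for *retry-connect-delay* and then tries again.

```python
import time

class TooManyConnections(Exception):
    """Stand-in for the PostgreSQL "too many connections" error."""

class ConfigurationLimitExceeded(Exception):
    """Stand-in for the PostgreSQL "configuration limit exceeded" error."""

RETRY_CONNECT_DELAY = 0.5  # seconds; plays the role of *retry-connect-delay*

def connect_with_retry(connect, delay=RETRY_CONNECT_DELAY,
                       max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between attempts.

    Only the two resource errors trigger a retry; any other exception
    propagates immediately. max_attempts=None retries forever.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except (TooManyConnections, ConfigurationLimitExceeded):
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```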
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers setup rather than the workers themselves.
From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.
On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.
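The scheduling change can be pictured with this Python sketch, where
concurrent.futures stands in for lparallel and copy_table is a
hypothetical placeholder for the per-table work: every table is submitted
first, and results are only collected once, at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def copy_table(name):
    """Placeholder for the per-table COPY work."""
    return (name, "done")

def copy_database(tables, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every table before waiting on anything...
        futures = [pool.submit(copy_table, table) for table in tables]
        # ...and only collect the results at the very end.
        return [future.result() for future in futures]
```

Waiting on each future right after submitting it would serialize the
tables again, which is the cost that added up with many small tables.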
In passing, stop sharing the same connection object between parallel
workers, which used to be active only in sequence; see the new API
clone-connection (which takes over from new-pgsql-connection).
Add metrics to determine where the time is spent in the current pgloader
code, so that it's then possible to optimize the batch processing as we
do it today.
Given the following extract of the measures, it seems that doing the
data transformations in the reader thread isn't so bright an idea. More
to come.
       table name      total time       read      write
-----------------  --------------  ---------  ---------
          extract          2.014s
      before load          0.050s
            fetch          0.000s
-----------------  --------------  ---------  ---------
 geolite.location         16.090s    15.933s     5.732s
   geolite.blocks         28.896s    28.795s     5.312s
-----------------  --------------  ---------  ---------
       after load         37.772s
-----------------  --------------  ---------  ---------
Total import time       1m25.082s    44.728s    11.044s
When devising the common API, the first step was to implement
specific methods for each generic function of the protocol. It now
appears that in some cases we don't need the extra level of flexibility:
each change of the API has been systematically propagated to all the
specific methods, so just use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations
rather than having to copy/paste and maintain several versions of the
same code.
It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
Updating the stats used to be a quite simple incf, and doing it once per
row read was good enough; but now that it involves sending a message to
the monitor thread, let's only send one message per batch, reducing the
communication load here.
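In Python terms, the per-batch reporting could look like the sketch
below. The message shape ("update-stats", table, count), the function
names and the batch size are assumptions of the example; drain only
mimics what a monitor thread would do with the messages.

```python
import queue

def read_rows(rows, monitor, table, batch_size=3):
    """Count rows locally and report to the monitor once per batch."""
    pending = 0
    for _row in rows:
        pending += 1
        if pending == batch_size:
            monitor.put(("update-stats", table, pending))
            pending = 0
    if pending:  # last, possibly short, batch
        monitor.put(("update-stats", table, pending))

def drain(monitor):
    """Stand-in for the monitor thread: sum the row counts per table."""
    totals, messages = {}, 0
    while True:
        try:
            _kind, table, count = monitor.get_nowait()
        except queue.Empty:
            return totals, messages
        totals[table] = totals.get(table, 0) + count
        messages += 1
```

Seven rows with a batch size of three produce three messages instead of
seven, while the total row count seen by the monitor stays the same.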
The local-time:encode-timestamp function takes a default timezone and it
is necessary to have control over it when loading from pgloader. Hence,
add a timezone option to the IXF option list, that is now explicit and
local to the IXF parser rather than shared with the DBF option list.
After all, it's shared between the CSV command parsing and the Cast
Rules parsing. src/parsers/command-csv.lisp still contains lots of
facilities shared between the file based sources, will need another
series of splits.
Commit 598c860cf5 broke user-defined casting rules by interning
"precision" and "scale" in the pgloader.user-symbols package: those
symbols need to be found in the pgloader.transforms package instead.
Luckily enough the infrastructure to do that was already in place for
cl:nil.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that the
output buffer is not subject to race conditions, extend its use to also
deal with statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring,
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
In passing, fix the condition message to read the unknown IXF data types
as decimal numbers (rather than hexadecimal) as they are documented that
way in the IBM reference documentation.
To be able to use "t" (or "nil") as a column name, pgloader needs to be
able to generate lisp code where those symbols are available. It's
simple enough in that a Common Lisp package that doesn't :use :cl
fulfills the condition, so intern user symbols in a specially crafted
package that doesn't :use :cl.
Now, we still need to be able to run transformation code that is using
the :cl package symbols and the pgloader.transforms functions too. In
this commit we introduce a heuristic to pick symbols either as functions
from pgloader.transforms or anything else in pgloader.user-symbols.
And so that user code may use NIL too, we provide an override mechanism
to the intern-symbol heuristic and use it only when parsing user code,
not when producing Common Lisp code from the parsed load command.
When building from source you should really build from the current HEAD
of the git master branch...
In passing, comment out the --self-update paragraph as it's known to be
broken unless you still have all the source dependencies at the right
place for ASDF to find them... making the feature developer only.
The date format wouldn't allow using a colon (:) in the noise parts of
it, and would also insist on 4 digits for milliseconds and 6 digits for
microseconds. Allow for "ragged" input and take however many digits we
actually find in the input.
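To illustrate what accepting "ragged" fractional seconds means, here is a
Python sketch. It is not pgloader's date format parser: the regular
expression, the function names and the choice to drop digits beyond
microsecond precision are assumptions of the example.

```python
import re
from datetime import time

TIME_RE = re.compile(r"^(\d{1,2}):(\d{2}):(\d{2})(?:\.(\d+))?$")

def fraction_to_microseconds(digits):
    """Read however many fractional digits are present as microseconds."""
    # "7" -> 700000, "123" -> 123000; digits beyond the sixth are dropped.
    return int(digits[:6].ljust(6, "0")) if digits else 0

def parse_time(text):
    """Parse HH:MM:SS with an optional fraction of any length."""
    match = TIME_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    return time(int(hours), int(minutes), int(seconds),
                fraction_to_microseconds(fraction))
```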
When the list of columns of the PostgreSQL target table isn't given in
the load command, pgloader will happily query the system catalogs to get
that information. The list-columns query didn't get the memo about the
qualified table name format and the with-schema macro... fix #288.
The TimeZone parameter should be set both for input and for output in
order to match our expected result file. Let's try to set PGTZ in the
shell environment...
The csv-parse-date test is failing on Travis because the server up there
in the Cloud isn't using the same timezone as my local machine. Let's
just force the timezone in the SET clause...
A useful use case for date parsing at the input level is to parse a
time (hours, minutes, seconds) rather than a full date (timestamp).
Improve the code so that it's possible to use the date format facility
even when the data field lacks the year/month/day information.
Fix #288.
This is a blind fix that I hope is the one needed, wherein -1 gets sent
to unsigned-to-sign, which fails to handle it as expected. The protocol
and driver are then sending unsigned ints and that's what we now handle
in the CFFI layer.
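For reference, reinterpreting an unsigned integer as a two's-complement
signed one can be sketched in Python as below. The function name, the
32-bit default width and the range check are assumptions of the example,
not the driver's actual code; the check shows why a value that is already
negative, such as -1, is not valid input for this kind of conversion.

```python
def unsigned_to_signed(value, bits=32):
    """Reinterpret an unsigned integer as a two's-complement signed one."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    sign_bit = 1 << (bits - 1)
    # When the top bit is set, the value wraps around to the negative range.
    return value - (1 << bits) if value & sign_bit else value
```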
Pushing this makes the attempt easier to test, and also allows me to cut
a release without modified files hanging around.
The download section now explains why binaries are not to be found here
anymore. Also we add Redpill Linpro as a sponsor, and we add a pgloader
Moral License page in the spirit of the Varnish License. Let's see what
happens with that.
The cleanup introduced by eabfbb9cc8 broke the case where we lack a
target table name for the IXF load. Just default to using the source
name in that case.
It turns out that IXF format might embed particulars of the systems the
data comes from (Informix or DB2) and we need to process some strings so
that they are compatible with PostgreSQL ("CURRENT TIMESTAMP" here).
See #272 that should be fixed here, but being in the blind, don't close
the issue just yet.
Our IXF type matching appears to be incomplete, and pgloader would then
happily try to create a table column of type NIL in PostgreSQL: prevent
that from happening by raising an error condition early. Fix #289.
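Failing early on an unmatched type can be sketched in Python as follows.
The exception class, the function name and the three entries of the
lookup table are illustrative assumptions and do not reproduce pgloader's
actual IXF type mapping; the type code is printed in decimal.

```python
class UnknownIxfType(Exception):
    """Raised when an IXF column type has no known PostgreSQL match."""

# Illustrative subset only; a real mapping covers many more type codes.
IXF_TO_PG = {496: "integer", 448: "varchar", 392: "timestamp"}

def pg_type_for(ixf_type, column):
    """Return the PostgreSQL type name, or fail before any DDL is built."""
    try:
        return IXF_TO_PG[ixf_type]
    except KeyError:
        raise UnknownIxfType(
            f"column {column!r}: unknown IXF data type {ixf_type:d}"
        ) from None
```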