1288 Commits

Author SHA1 Message Date
Dimitri Fontaine
8355b2140e Implement before/after load support for SQLite, fix #321.
If there ever was a good reason not to implement before/after support
for SQLite, it's no longer valid: done.
2015-12-23 21:56:10 +01:00
Dimitri Fontaine
72e7c2af70 At long last, log cast rule choices, see #317.
To help debug the casting rule choices, output a line for each decision
that is made with the input and the output of the decision.
2015-12-08 21:27:33 +01:00
Dimitri Fontaine
735cdc8fdc Document the remove-null-characters transform.
Both as a new transformation function available, and as the default for
Text conversions when coming from MySQL. See #258, Fixes #219.
2015-12-08 21:04:47 +01:00
Aliaksei Urbanski
3a55d80411 Default cast rules for MySQL's text types fixed, see #219 2015-12-08 20:59:29 +01:00
Dimitri Fontaine
b4bfa18877 Fix more table name quoting, fix #163 again.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case
and in passing fix some call sites to apply-identifier-case.

Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
2015-12-08 11:52:43 +01:00
Dimitri Fontaine
dca3dacf4b Don't issue useless MySQL catalog queries...
When the option "WITH no foreign keys" is in use, it's not necessary to
go read the foreign key information_schema bits at all, so just don't
issue the query, and same thing with the "create no indexes" option.

In not-so-old versions of MySQL, the referential_constraints table of
information_schema doesn't exist, so this should make pgloader
compatible with MySQL 5.0 something and earlier.
2015-12-03 19:24:00 +01:00
Dimitri Fontaine
7c64a713d0 Fix PostgreSQL write times in the summary.
It turns out the summary write times included time spent waiting for
batches to be ready, which isn't fair to PostgreSQL COPY implementation,
and moreover doesn't help figuring out the bottlenecks...
2015-11-29 23:23:30 +01:00
Dimitri Fontaine
cca44c800f Simplify batch and transformation handling.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.

Also review the organisation of responsibilities in the code, allowing
to move queue.lisp into utils/batch.lisp, renaming it as its scope has
been reduced to only care about preparing batches.

This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that is left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
2015-11-29 17:35:25 +01:00
Dimitri Fontaine
2dd7f68a30 Fix index completion management in MySQL and SQLite.
We used to wait for the wrong number of workers, meaning the rest of the
code began running before the indexes were all available. A user report
where one of the indexes takes a very long time to compute made it
obvious.

In passing, also improve reporting of those rendez-vous sections.
2015-11-29 17:29:57 +01:00
Dimitri Fontaine
af9e423f0b Fix errors counting, see #313.
It became pretty obvious that the error counting was broken: it happens
that I forgot to pass the information down to the state handling parts
of the code.

In passing improve and fix CSV parse errors counting and fatal errors
reporting.
2015-11-27 11:51:48 +01:00
Dimitri Fontaine
533a49a261 Handle PostgreSQL notifications early, fix #311.
In some cases, like when client_min_messages is set to debug5,
PostgreSQL might send notification messages to the connecting client
even while opening a connection. Those are still considered WARNINGs by
the Postmodern driver...

Handle those warnings by just printing them out in the pgloader logs,
rather than considering those conditions as hard failures (signaling a
db-connection-error).
2015-11-24 10:52:25 +01:00
Dimitri Fontaine
93b6be43d4 Travis: adapt to PostgreSQL 9.1, again.
We didn't have CREATE SCHEMA IF NOT EXISTS at the time...
2015-11-23 22:09:08 +01:00
Dimitri Fontaine
f109c3fdb4 Travis: prepare an "err" schema.
The test/errors.load sets the search_path to include the 'err' schema,
which is to be created by the test itself. PostgreSQL 9.1 raises an
error where 9.4 and following just accept the setting, and Travis runs a
9.1 PostgreSQL.

Let's just create the schema before-hand so that we can still run tests
against SET search_path from the load file.
2015-11-23 15:26:18 +01:00
Dimitri Fontaine
4e23de1b2b Missing file from previous commit.
Somehow it still happens :/
2015-11-22 23:32:22 +01:00
Dimitri Fontaine
973339abc8 Add a SQLite test case from #310. 2015-11-22 22:16:05 +01:00
Dimitri Fontaine
e23de0ce9f Improve SQLite table names filtering.
Filter the list of tables we migrate directly from the SQLite query,
so as to avoid returning useless data. To do that, use the LIKE pattern
matching supported by SQLite, where the REGEXP operator is only available
when extra features are loaded apparently.

See #310 where filtering out the view still caused errors in the
loading.
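
A minimal sketch of filtering table names inside the SQLite catalog query with LIKE, written in Python for illustration rather than in pgloader's Common Lisp. The function names, the including/excluding parameters and the demo schema are assumptions made for this example; pgloader's actual query is not shown in this commit message.

```python
import sqlite3


def list_tables(conn, including=None, excluding=None):
    """Return table names, filtered inside the SQLite query with LIKE."""
    # type = 'table' already leaves views out of the result.
    sql = "SELECT name FROM sqlite_master WHERE type = 'table'"
    params = []
    if including:
        # Keep a table when it matches at least one "including" pattern.
        sql += " AND (" + " OR ".join("name LIKE ?" for _ in including) + ")"
        params.extend(including)
    for pattern in excluding or []:
        # Drop a table when it matches any "excluding" pattern.
        sql += " AND name NOT LIKE ?"
        params.append(pattern)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


def demo_connection():
    """Build a throwaway in-memory database with two tables and one view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE track (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE track_audit (id INTEGER PRIMARY KEY, note TEXT);
        CREATE VIEW track_titles AS SELECT title FROM track;
        """
    )
    return conn
```

With the demo database, list_tables(demo_connection(), excluding=["%audit"]) returns only the "track" table, and the view never shows up because the filtering happens in the query itself.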
2015-11-22 22:10:26 +01:00
Dimitri Fontaine
a81f017222 Review SQLite integration with recent changes.
The current way to do parallelism in pgloader was half baked in the
SQLite source implementation, get it up to speed again.
2015-11-22 21:30:20 +01:00
Dimitri Fontaine
5f60ce3d96 Travis: create the "expected" schema.
As we don't use the `make -C test prepare` target, reproduce a missing
necessary precondition to running the unit tests.
2015-11-21 21:19:37 +01:00
Dimitri Fontaine
bc870ac96c Use 2 copy threads per target table.
It's been proven by Andres Freund's benchmarks that the best number of
parallel COPY threads concurrently active against a single table is 2 as
of PostgreSQL current development version (up to 9.5 stable, might still
apply to 9.6 depending on how we might solve the problem).

Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.

Finally, also improve the MySQL setup to use 8 threads by default, that
is, be able to load two tables concurrently, each with 2 COPY workers, a
reader and a transformer thread.

It's all still experimental as far as performances go, next patches
should bring the capability to configure the parallelism wanted from the
command line and load command tho.

Also, other source types will want to benefit from the same
capabilities, it just happens that it's easier to play with MySQL first
for some reasons here.
2015-11-17 17:06:36 +01:00
Dimitri Fontaine
150d288d7a Improve our regression testing facility.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.

Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:

          select * from loaded.data
  except
          select * from expected.data

If such a query returns any row, we know we didn't load what was
expected and the regression test is failing.

This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).
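
The SQL diff described above can be tried out with a small Python sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and plain table names (loaded_data, expected_data) instead of the loaded and expected schemas; those names and the helper functions are assumptions for this example only.

```python
import sqlite3


def unexpected_rows(conn, loaded_table, expected_table):
    """Return rows present in loaded_table but absent from expected_table."""
    # Table names are interpolated directly: acceptable for a test helper
    # fed with trusted names, not for untrusted input.
    query = f"SELECT * FROM {loaded_table} EXCEPT SELECT * FROM {expected_table}"
    return conn.execute(query).fetchall()


def demo(expected_rows, loaded_rows):
    """Load both data sets into a scratch database and diff them in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expected_data (id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE loaded_data (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO expected_data VALUES (?, ?)", expected_rows)
    conn.executemany("INSERT INTO loaded_data VALUES (?, ?)", loaded_rows)
    return unexpected_rows(conn, "loaded_data", "expected_data")
```

Because EXCEPT is a set operation, the order in which rows were loaded does not matter, which is the property wanted once several COPY threads are involved. Note that the query as written only reports rows that were loaded but not expected; rows missing from the load would need the same query with the two tables swapped.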
2015-11-17 17:03:08 +01:00
Dimitri Fontaine
6ca376ef9b Simplify the main function (refactor).
Move some code away in its own function for easier review and
modifications of the main entry point.
2015-11-16 16:01:25 +01:00
Dimitri Fontaine
da05782002 Allow date formats to miss time parts.
In case the seconds field is not provided just use "00" rather than NIL
as currently...
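
As a rough illustration of the idea, here is a Python sketch that fills missing time parts with "00". The accepted input format and the function name are assumptions for this example; pgloader's own date format parsing is written in Common Lisp and is not reproduced here.

```python
import re

# Hypothetical format: "YYYY-MM-DD", optionally followed by "HH:MM" and ":SS".
DATE_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})(?:[ T](\d{2}):(\d{2})(?::(\d{2}))?)?$"
)


def normalize_timestamp(text):
    """Return 'YYYY-MM-DD HH:MM:SS', using "00" for any missing time part."""
    match = DATE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year, month, day, hour, minute, second = match.groups()
    # Groups that did not participate in the match are None.
    hour, minute, second = (part or "00" for part in (hour, minute, second))
    return f"{year}-{month}-{day} {hour}:{minute}:{second}"
```

For instance, an input without seconds such as "2015-11-14 21:14" comes back as "2015-11-14 21:14:00".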
2015-11-14 21:14:20 +01:00
Dimitri Fontaine
6cbec206af Turns out SSL key/crt file paths should be strings.
Our PostgreSQL driver uses CFFI to load the SSL support from OpenSSL
and as a result the certificate and key file names should be strings
rather than pathnames. Should fix #308 again...
2015-11-11 23:10:29 +01:00
Dimitri Fontaine
f8ae9f22b9 Implement support for SSL client certificates.
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:

  http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE

This uses the Postmodern special support for it. Unfortunately couldn't
test it locally other than it doesn't break non-ssl connections. Pushing
to have user feedback.
2015-11-09 11:32:17 +01:00
Dimitri Fontaine
e3cc76b2d4 Export copy-column-list from pgloader.sources.
That allows the copy-column-list specific method for MySQL to be a
method of the common pgloader.sources::copy-column-list generic
function, and then to be called again when needed.

This fixes an oversight in #41e9eeb and fixes #132 again.
2015-11-08 18:48:54 +01:00
Dimitri Fontaine
042045c0a6 Clozure already provides a getenv function.
Trying to provide a new one fails with an error, that I missed because I
must have forgotten to `make clean` when adding the previous #+ccl
variant here...

This alone isn't enough to fix building for CCL but already improves
the situation as reported at #303. Next failure is something I fail to
understand tonight:

  Fatal SIMPLE-ERROR:
  Compilation failed: In MAKE-DOUBLE-FLOAT: Type declarations violated in (THE FIXNUM 4294967295) in /Users/dim/dev/pgloader/build/quicklisp/local-projects/qmynd/src/common/utilities.lisp
2015-11-08 18:26:58 +01:00
Dimitri Fontaine
3673c5c341 Add a Travis CI Build badge. 2015-10-31 17:48:30 +01:00
Dimitri Fontaine
20ce095384 Merge pull request #305 from gitter-badger/gitter-badge
Add a Gitter chat badge to README.md
2015-10-31 17:46:20 +01:00
The Gitter Badger
56a60db146 Add Gitter badge 2015-10-31 16:40:52 +00:00
Dimitri Fontaine
478d24f865 Fix root-dir initialization for ccl, see #303.
When using Clozure Common Lisp apparently a :absolute directory
component for make-pathname is supposed to contain a single path
component, fix by using parse-native-namestring instead.

In case it's needed, the following spelling seems portable enough:

  CL-USER> (uiop:merge-pathnames*
            (uiop:make-pathname* :directory '(:relative "pgloader"))
            (uiop:make-pathname* :directory '(:absolute "tmp")))
  #P"/tmp/pgloader/"
2015-10-24 22:22:24 +02:00
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently into
loading the data from the source to PostgreSQL. The transformation of the
data was then the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing, the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet, it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performances on smaller data sets.

Next improvements are going to allow more features: specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performances and play with having
several worker and writer threads for each reader. In later patches.

And some day, too, we will need to make the number of workers a user
defined variable rather than something hard coded as today. It's on the
todo list, meanwhile, dear user, consider changing the (make-kernel 6)
into (make-kernel 12) or something else in src/sources/mysql/mysql.lisp,
and consider enlightening me with whatever it is you find by doing so!
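
The reader, transformer and writer arrangement described in this commit can be sketched as follows. This is a Python illustration, not pgloader's code: the standard queue.Queue stands in for the lparallel queues, a list stands in for the COPY target, and all function names are assumptions. The transformation is also simplified, as it only maps None to \N and joins columns with tabs, without escaping special characters.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def reader(rows, raw_queue, batch_size):
    """Extract only: group raw rows into batches and pass them on."""
    for start in range(0, len(rows), batch_size):
        raw_queue.put(rows[start:start + batch_size])
    raw_queue.put(DONE)


def transformer(raw_queue, processed_queue):
    """Turn each batch of raw rows into tab-separated text lines."""
    while True:
        batch = raw_queue.get()
        if batch is DONE:
            processed_queue.put(DONE)
            return
        lines = ["\t".join(r"\N" if col is None else str(col) for col in row)
                 for row in batch]
        processed_queue.put(lines)


def writer(processed_queue, sink):
    """Stand-in for the COPY step: append the prepared lines to a list."""
    while True:
        batch = processed_queue.get()
        if batch is DONE:
            return
        sink.extend(batch)


def run_pipeline(rows, batch_size=2):
    """Run the three workers concurrently and return what the writer got."""
    raw_queue, processed_queue, sink = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=reader, args=(rows, raw_queue, batch_size)),
        threading.Thread(target=transformer, args=(raw_queue, processed_queue)),
        threading.Thread(target=writer, args=(processed_queue, sink)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sink
```

With a single worker per stage and FIFO queues the row order is preserved; adding more transformer or writer threads, as later commits do for COPY, gives up that guarantee.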
2015-10-23 00:17:58 +02:00
Dimitri Fontaine
4f3b3472a2 Cleanup and timing display improvements.
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.

We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...

Let's solve problems one at a time tho, I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multi cores, single RTC and all that).
2015-10-22 22:36:40 +02:00
Dimitri Fontaine
933d1c8d6b Add test case for #302. 2015-10-22 22:35:32 +02:00
Dimitri Fontaine
88bb4e0b95 Register "auto_increment" as a SQLite noise word.
As seen in #302 it's possible to define a SQLite column of type "integer
auto_increment". In my testing tho, it doesn't mean a thing. Worse than
that, apparently when an integer column is created that is also used as
the primary key of the table, the notation "integer auto_increment
primary key" disables the rowid behavior that is certainly expected.

Let's not yet mark the bug as fixed as I suppose we will have to do
something about this rowid mess. Thanks again SQLite.
2015-10-22 21:55:34 +02:00
Dimitri Fontaine
f654f10d0d Fix CSV summary format string.
This got broken when adding read/write separate stats in the reporting.
2015-10-20 23:56:20 +02:00
Dimitri Fontaine
2ed14d595d Trick the reporting to show non-zero timings.
When calling lparallel:receive-results from the main thread we lose
the ability to measure proper per-table processing times, because it all
happens in parallel and the main thread seems to receive events in the
same millisecond as when the worker is started, meaning it's all 0.0s.

So when we don't have "secs" stats, pick the greatest of read or write
time, which we do have from the worker threads themselves.

The numbers are still wrong, but less so than the "0.0s" displayed before
this patch.
2015-10-20 23:39:15 +02:00
Dimitri Fontaine
69b8b0305d Fix reporting in case of missing values. 2015-10-20 23:24:03 +02:00
Dimitri Fontaine
1fb69b2039 Retry connecting to PostgreSQL in some cases.
Now that we can setup many concurrent threads working against the
PostgreSQL database, and before we open the number of workers to our
users, install a heuristic to manage the PostgreSQL error classes “too
many connections” and “configuration limit exceeded” so that pgloader
waits for some time (*retry-connect-delay*) then tries connecting again.

It's quite simplistic but should cover lots of border-line cases way
more nicely than just throwing the interactive debugger at the end user.
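
A minimal Python sketch of such a wait-then-retry heuristic is given below. The exception class, the function name and the max_attempts cap are assumptions introduced for this example; the commit itself only mentions waiting for *retry-connect-delay* before trying again.

```python
import time


class TooManyConnections(Exception):
    """Hypothetical stand-in for the retryable PostgreSQL error classes."""


def connect_with_retry(connect, retry_delay=0.5, max_attempts=5, sleep=time.sleep):
    """Call connect(); on a retryable error, wait retry_delay and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TooManyConnections:
            if attempt == max_attempts:
                raise  # give up: let the caller see the last error
            sleep(retry_delay)
```

Only the error class deemed retryable is caught, so any other connection failure still surfaces immediately. The sleep parameter exists so that the delay can be replaced in tests.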
2015-10-20 23:15:05 +02:00
Dimitri Fontaine
633067a0fd Allow more parallelism in database migrations.
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the workers setup rather than the workers themselves.

From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.

On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.

In passing, stop sharing the same connection object in between parallel
workers that used to be active one after the other, see the new API
clone-connection (which takes over new-pgsql-connection).
2015-10-20 22:15:55 +02:00
Dimitri Fontaine
187565b181 Add read/write separate stats.
Add metrics to work out where the time is spent in current pgloader code
so that it's possible to then optimize away the batch processing as we
do it today.

Given the following extract of the measures, it seems that doing the
data transformations in the reader thread isn't so bright an idea. More
to come.

          table name         total time       read     write
   -----------------     --------------  --------- ---------
             extract             2.014s
         before load             0.050s
               fetch             0.000s
   -----------------     --------------  --------- ---------
    geolite.location            16.090s    15.933s    5.732s
      geolite.blocks            28.896s    28.795s    5.312s
   -----------------     --------------  --------- ---------
          after load            37.772s
   -----------------     --------------  --------- ---------
   Total import time          1m25.082s    44.728s   11.044s
2015-10-11 21:35:19 +02:00
Dimitri Fontaine
c3726ce07a Refrain from starting the logger twice in load-data. 2015-10-05 21:27:48 +02:00
Dimitri Fontaine
41e9eebd54 Rationalize common generic API implementation.
When devising the common API, the first step has been to implement
specific methods for each generic function of the protocol. It now
appears that in some cases we don't need the extra level of flexibility:
each change of the API has been systematically reported to all the
specific methods, so just use a single generic definition where possible.

In particular, introduce a new intermediate class for COPY subclasses
allowing to share more common code in the methods implementation, rather
than having to copy/paste and maintain several versions of the same
code.

It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
2015-10-05 21:25:21 +02:00
Dimitri Fontaine
0d9c2119b1 Send one update-stats message per batch.
Updating the stats used to be a quite simple incf and doing it once per
read row was good enough, but now that it involves sending a message to
the monitor thread let's only send a message per batch, reducing the
communication load here.
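
To illustrate the saving, here is a Python sketch where the reader counts rows locally and only posts one message per batch to the monitor queue. The message shape ("update-stats", count) and the function names are assumptions for this example, not pgloader's actual monitor protocol.

```python
import queue


def read_rows(rows, monitor, batch_size):
    """Send one ("update-stats", count) message per batch, not per row."""
    pending = 0
    for _row in rows:
        pending += 1
        if pending == batch_size:
            monitor.put(("update-stats", pending))
            pending = 0
    if pending:
        # Flush the last, possibly partial, batch.
        monitor.put(("update-stats", pending))


def drain(monitor):
    """Consume all queued messages; return (message count, total rows)."""
    messages, total = 0, 0
    while not monitor.empty():
        _kind, count = monitor.get()
        messages += 1
        total += count
    return messages, total
```

Reading 10 rows with a batch size of 4 sends 3 messages instead of 10, while the monitor still ends up with the same row total.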
2015-10-05 18:04:08 +02:00
Dimitri Fontaine
6bf26c52ec Implement a TimeZone option for IXF loading.
The local-time:encode-timestamp function takes a default timezone and it
is necessary to have control over it when loading from pgloader. Hence,
add a timezone option to the IXF option list, that is now explicit and
local to the IXF parser rather than shared with the DBF option list.
2015-10-05 16:46:15 +02:00
Dimitri Fontaine
7b9b8a32e7 Move sexp parsing into its own file.
After all, it's shared between the CSV command parsing and the Cast
Rules parsing. src/parsers/command-csv.lisp still contains lots of
facilities shared between the file based sources, will need another
series of splits.
2015-10-05 11:39:44 +02:00
Dimitri Fontaine
f1df6ee89a Forgot a new file, thanks Travis.
Someday I will learn not to code that late at night.
2015-10-05 02:21:59 +02:00
Dimitri Fontaine
c880f86bb6 Fix user defined casting rules.
Commit 598c860cf5 broke user defined
casting rules by interning "precision" and "scale" in the
pgloader.user-symbols package: those symbols need to be found in the
pgloader.transforms package instead.

Luckily enough the infrastructure to do that was already in place for
cl:nil.
2015-10-05 02:11:23 +02:00
Dimitri Fontaine
96a33de084 Review the stats and reporting code organisation.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that the
output buffer is not subject to race conditions, extend its use to also
deal with statistics messages.

In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.

Also, as a nice side effect of the code simplification and refactoring
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
2015-10-05 01:46:29 +02:00
Dimitri Fontaine
bc9d2d8962 Monitor events are now structures.
This allows using typecase to dispatch events in the main loop and
avoid using destructuring-bind, as we now have properly typed events.
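
The same idea, typed events dispatched on their type instead of unpacked lists, can be sketched in Python with dataclasses. The event classes and their fields are hypothetical and only serve the illustration; pgloader's own structures are defined in Common Lisp.

```python
from dataclasses import dataclass


@dataclass
class LogMessage:
    level: str
    text: str


@dataclass
class UpdateStats:
    table: str
    rows: int


def handle(event, logs, stats):
    """Dispatch on the event's type, roughly as typecase would."""
    if isinstance(event, LogMessage):
        logs.append(f"{event.level}: {event.text}")
    elif isinstance(event, UpdateStats):
        stats[event.table] = stats.get(event.table, 0) + event.rows
    else:
        raise TypeError(f"unknown event: {event!r}")
```

Each branch reads named fields from the event, so adding a field to one event type does not disturb the handlers for the others.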
2015-10-04 18:55:10 +02:00
Dimitri Fontaine
38a725fe74 Add support for IXF blobs to bytea.
A quick test shows it should work, so push that too.
2015-09-24 17:47:57 +02:00