126 Commits

Author SHA1 Message Date
Dimitri Fontaine
d72c711b45 Implement support for on update CURRENT_TIMESTAMP.
That's the MySQL slang for a simple ON UPDATE trigger, and that's what
pgloader now translates the expression to. Fix #195.
2016-03-27 01:01:40 +01:00
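For reference, a sketch of the translation with illustrative names (the
table, trigger and function names aren't necessarily what pgloader emits):

    -- MySQL: updated_at timestamp ON UPDATE CURRENT_TIMESTAMP
    -- PostgreSQL equivalent, as a BEFORE UPDATE trigger:
    CREATE FUNCTION update_updated_at() RETURNS trigger AS $$
    BEGIN
      NEW.updated_at = CURRENT_TIMESTAMP;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER on_update_current_timestamp
      BEFORE UPDATE ON my_table
      FOR EACH ROW EXECUTE PROCEDURE update_updated_at();
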
Dimitri Fontaine
fcc6e8f813 Implement ALTER SCHEMA ... RENAME TO...
That's only available for MS SQL as of now, as it's the only source
database we have where the notion of a schema makes sense. Fix #224.
2016-03-26 20:25:03 +01:00
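A sketch of the new clause in a load command (connection strings are
placeholders):

    LOAD DATABASE
         FROM mssql://user@host/source_db
         INTO postgresql:///target_db

    ALTER SCHEMA 'dbo' RENAME TO 'public';
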
Dimitri Fontaine
6f078daeb9 Ensure logging of errors.
The first error of a batch was lost somewhere in the recent changes. My
current best guess is that the rewrite of the copy-batch function made
the handler-bind form set up by the handling-pgsql-notices macro
ineffective, but I can't see why that is.

See #85.
2016-03-26 17:51:38 +01:00
Dimitri Fontaine
e2fcd86868 Handle failure to convert index filters gracefully.
We should not block any processing just because we can't parse an index.
The best we can do for tonight is to try creating the index without the
filter; ideally we would skip building the index entirely. That's for a
later effort though, it's running late here.

See #365.
2016-03-22 00:29:25 +01:00
Dimitri Fontaine
5e18cfd7d4 Implement support for partial indexes.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.

The WHERE clause in both cases is limited to simple expressions over a
base table's row at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the
clause in PostgreSQL slang.

In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. a MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:

  MS SQL:     "([deleted]=(0))"  (that's from the catalogs)
  PostgreSQL: deleted = 'f'

Of course the parser is still barely tested; let's see what happens in
the wild now.

(Should) Fix #365.
2016-03-21 23:39:45 +01:00
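A sketch of the whole translation, with illustrative table and index
names:

    -- MS SQL filtered index, filter as found in the catalogs:
    CREATE INDEX idx_docs_live ON dbo.docs (doc_id)
     WHERE ([deleted]=(0));

    -- PostgreSQL partial index after rewriting the filter:
    CREATE INDEX idx_docs_live ON docs (doc_id)
     WHERE deleted = 'f';
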
Dimitri Fontaine
1ed07057fd Implement --on-error-stop command line option.
The implementation uses the dynamic binding *on-error-stop* so it's also
available when pgloader is used as a Common Lisp library.
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!

Fix #85.
2016-03-21 20:52:50 +01:00
Dimitri Fontaine
8476c1a359 Allow setting search_path with multiple schemas.
The PostgreSQL search_path accepts multiple schemas, and referencing
types and other tables might even require more than one. Allow setting
more than one schema by using the fact that PostgreSQL schema names
don't need to be individually quoted, and by passing the exact content
of the SET search_path value down to PostgreSQL.

Fix #359.
2016-03-20 20:54:08 +01:00
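For example, a load command SET clause can now carry several schemas in
one value, handed over verbatim (schema names are illustrative):

    SET search_path TO 'public, archive'
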
Dimitri Fontaine
3e8b7df0d3 Improve column formatting.
Have a pretty-print option where we try to be nice to the reader, and
don't use it in the CAST debug messages. Also allow working with the
real maximum length of column names rather than hardcoding 22 cols...
2016-03-16 21:46:41 +01:00
Dimitri Fontaine
f1fe9ab702 Assorted fixes to MS SQL support.
Having been given a test instance of a MS SQL database allowed quickly
fixing a series of assorted bugs related to schema handling of MS SQL
databases. As it's the only source with a proper notion of schema that
pgloader currently supports, it's no surprise we had them.

Fix #343. Fix #349. Fix #354.
2016-03-16 21:43:04 +01:00
Dimitri Fontaine
c724018840 Implement ALTER TABLE clause for MySQL migrations.
The new ALTER TABLE facility allows acting on tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes #224.

In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.

Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
2016-03-06 21:51:33 +01:00
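A sketch of the kind of clauses this enables in a MySQL load command
(the matching rules and names are illustrative; see the documentation
for the exact grammar):

    ALTER TABLE NAMES MATCHING 'sales' RENAME TO 'sales_legacy'
    ALTER TABLE NAMES MATCHING ~/^audit_/ SET SCHEMA 'audit'
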
Dimitri Fontaine
40c1581794 Review transaction and error handling in COPY.
The PostgreSQL COPY protocol requires an explicit initialization phase
that may fail, and in this case the Postmodern driver transaction is
already dead, so there's no way we can even send ABORT to it.

Review the error handling of our copy-batch function to cope with that
fact, and add some logging of non-retryable errors we may have.

Also improve the thread error reporting when using a binary image,
where it might be difficult to open an interactive debugger, while still
having the full blown Common Lisp debugging experience for the project
developers.

Add a test case for a missing column as in issue #339.

Fix #339, see #337.
2016-02-21 15:56:06 +01:00
Dimitri Fontaine
76668c2626 Review package dependencies.
The decision to use lots of different packages in pgloader has quite
strong downsides at times, and the manual management of dependencies is
one of them, in particular how to avoid circular ones.
2016-01-31 18:42:01 +01:00
Dimitri Fontaine
d9d9e06c0f Another attempt at fixing #323.
Rather than trying hard to have PostgreSQL fully qualify the index name
with tricks around the search_path setting at the time ::regclass is
executed, simply join on pg_namespace to retrieve that schema in a new
slot in our pgsql-index structure so that we can then reuse it when
needed.

Also add a test case for the scenario, including both a UNIQUE
constraint and a classic index, because the DROP and CREATE/ALTER
instructions differ.
2016-01-17 01:54:36 +01:00
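A minimal sketch of the kind of catalog join involved (not pgloader's
exact query; the table name is illustrative):

    SELECT n.nspname AS schema, i.relname AS index_name
      FROM pg_index x
      JOIN pg_class i ON i.oid = x.indexrelid
      JOIN pg_namespace n ON n.oid = i.relnamespace
     WHERE x.indrelid = 'public.some_table'::regclass;
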
Dimitri Fontaine
7dd69a11e1 Implement concurrency and workers for files sources.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files in the group in parallel.

To that effect, tweak again the md-connection and md-copy
implementations.
2016-01-16 22:53:55 +01:00
Dimitri Fontaine
bfdbb2145b Fix with drop index option, fix #323.
Have PostgreSQL always fully qualify the index related objects and SQL
definition statements when fetching the list of indexes of a table, by
playing with an empty search_path.

Also improve the whole index creation by passing the table object as the
context from which to derive the table-name, so that schema qualified
tables are taken into account properly.
2016-01-15 15:04:07 +01:00
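The trick relies on PostgreSQL schema-qualifying any name that isn't
reachable from the current search_path; a sketch, with an illustrative
table name:

    SELECT pg_catalog.set_config('search_path', '', false);

    -- every name now comes back fully qualified:
    SELECT indexrelid::regclass::text, pg_get_indexdef(indexrelid)
      FROM pg_index
     WHERE indrelid = 'public.some_table'::regclass;
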
Dimitri Fontaine
1ff204c172 Typo fix. 2016-01-15 14:45:19 +01:00
Dimitri Fontaine
133028f58d Desultory review of code indentation. 2016-01-12 14:52:44 +01:00
Dimitri Fontaine
94ef8674ec Typo fix (of sorts)
Some APIs didn't get the table-name-to-table memo...
2016-01-11 01:42:18 +01:00
Dimitri Fontaine
a3fd22acd3 Review pgloader encoding story.
Thanks to the Common Lisp character data type, it's easy for pgloader to
enforce always speaking to PostgreSQL in utf-8, and that's what has been
done from the beginning actually.

Now, without good reason, the first example of a SET clause added to
the docs was about how to set client_encoding, which should NOT be
done.

Fix that at the user level by removing the bad example from the docs
and adding a WARNING whenever the client_encoding is set to a known bad
value. It's a WARNING because we then simply force 'utf-8' anyway.

Also, completely review the format-vector-row function to avoid doing
double work with the Postmodern facilities we piggyback on. This was
done halfway through, and the utf-8 conversion was actually done twice.
2016-01-11 01:27:36 +01:00
Dimitri Fontaine
9e4938cea4 Implement PostgreSQL catalogs data structure.
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.

Having catalog and schema instances not only allows for code cleanup,
but will also allow implementing some bug fixes and wishlist items such
as mapping tables from one schema to another.

Also, supporting database sources having a notion of "schema" (in
between "catalog" and "table") should get easier, including getting
on par with MySQL in the MS SQL support (materialized views have been
asked for already).

See #320, #316, #224 for references and a notion of progress being made.

In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata, a prepare-pgsql-database
and a complete-pgsql-database generic function. Actually, a single
method does the job here.

The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn will call the specialized list-all-columns and
friends implementations. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.

Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to the start of the data copy step,
after which the copy class instances are all that's used.

This might be refactored again in a follow-up patch.
2015-12-30 21:53:01 +01:00
Dimitri Fontaine
b4bfa18877 Fix more table name quoting, fix #163 again.
Now that we can have several threads doing COPY, each of them needs to
know about the *pgsql-reserved-keywords* list. Make sure that's the case
and in passing fix some call sites to apply-identifier-case.

Also, more disturbingly, fix the code so that TRUNCATE is called from
the main thread before giving control to the COPY threads, rather than
having two concurrent threads doing the TRUNCATE twice. It's rather
strange that we got no complaint from the field on that part...
2015-12-08 11:52:43 +01:00
Dimitri Fontaine
7c64a713d0 Fix PostgreSQL write times in the summary.
It turns out the summary write times included time spent waiting for
batches to be ready, which isn't fair to the PostgreSQL COPY
implementation, and moreover doesn't help figure out the bottlenecks...
2015-11-29 23:23:30 +01:00
Dimitri Fontaine
cca44c800f Simplify batch and transformation handling.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.

Also review the organisation of responsibilities in the code, allowing
queue.lisp to move into utils/batch.lisp, renamed as its scope has been
reduced to only caring about preparing batches.

This came out of trying to have multiple workers concurrently processing
the batches from the reader and feeding the hardcoded 2 COPY workers,
but it failed for multiple reasons. All that's left as of now is this
cleanup, which seems to be on the faster side of things, which is always
good.
2015-11-29 17:35:25 +01:00
Dimitri Fontaine
bc870ac96c Use 2 copy threads per target table.
Andres Freund's benchmarks have shown that the best number of parallel
COPY threads concurrently active against a single table is 2, as of the
current PostgreSQL development version (up to 9.5 stable; it might
still apply to 9.6 depending on how we solve the problem).

Henceforth hardcode 2 COPY threads in pgloader. This also has the
advantage that in the presence of lots of bad rows, we should sustain a
better throughput and not stall completely.

Finally, also improve the MySQL setup to use 8 threads by default, that
is, to be able to load two tables concurrently, each with 2 COPY
workers, a reader and a transformer thread.

It's all still experimental as far as performance goes; next patches
should bring the capability to configure the wanted parallelism from
the command line and load command tho.

Also, other source types will want to benefit from the same
capabilities; it just happens that it's easier to play with MySQL first
for some reasons here.
2015-11-17 17:06:36 +01:00
Dimitri Fontaine
150d288d7a Improve our regression testing facility.
Next parallelism improvements will allow pgloader to use more than one
COPY thread to load data, with the impact of changing the order of rows
in the database.

Rather than doing a copy out and `diff` of the data just loaded, load
the reference data and do the diff in SQL:

          select * from loaded.data
  except
          select * from expected.data

If such a query returns any row, we know we didn't load what was
expected and the regression test is failing.

This regression testing facility should also allow us to finally add
support for multiple-table regression tests (sqlite, mysql, etc).
2015-11-17 17:03:08 +01:00
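Note that a single EXCEPT only catches rows in one direction; a
complete diff (a sketch, not necessarily what the test harness runs)
also checks the other way around:

    (select * from loaded.data
     except
     select * from expected.data)
    union all
    (select * from expected.data
     except
     select * from loaded.data);
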
Dimitri Fontaine
6cbec206af Turns out SSL key/crt file paths should be strings.
Our PostgreSQL driver uses CFFI to load the SSL support from OpenSSL,
and as a result the certificate and key file names should be strings
rather than pathnames. Should fix #308 again...
2015-11-11 23:10:29 +01:00
Dimitri Fontaine
f8ae9f22b9 Implement support for SSL client certificates.
This fixes #308 by automatically using the PostgreSQL Client Side SSL
files as documented in the following reference:

  http://www.postgresql.org/docs/current/static/libpq-ssl.html#LIBPQ-SSL-FILE-USAGE

This uses Postmodern's special support for it. Unfortunately I couldn't
test it locally other than checking that it doesn't break non-SSL
connections. Pushing to get user feedback.
2015-11-09 11:32:17 +01:00
Dimitri Fontaine
4df3167da1 Introduce another worker thread: transformers.
We used to have a reader and a writer cooperating concurrently to load
the data from the source to PostgreSQL. The transformation of the data
was then the responsibility of the reader thread.

Measurements showed that the PostgreSQL processes were mostly idle,
waiting for the reader to produce data fast enough.

In this patch we introduce a third worker thread that is responsible for
processing the raw data into pre-formatted batches, allowing the reader
to focus on extracting the data only. We now have two lparallel queues
involved in the processing: the raw queue contains the vectors of raw
data directly, and the processed-queue contains batches of properly
encoded strings for the COPY text protocol.

On the test laptop the performance gain isn't noticeable yet; it might
be that we need much larger data sets to see a gain here. At least the
setup isn't detrimental to performance on smaller data sets.

Next improvements are going to allow more features: specialized batch
retry thread and parallel table copy scheduling for database sources.
Let's also continue caring about performance and play with having
several worker and writer threads for each reader. In later patches.

And some day, too, we will need to make the number of workers a
user-defined variable rather than something hardcoded as today. It's on
the todo list; meanwhile, dear user, consider changing the
(make-kernel 6) into (make-kernel 12) or something else in
src/sources/mysql/mysql.lisp, and consider enlightening me with
whatever it is you find by doing so!
2015-10-23 00:17:58 +02:00
Dimitri Fontaine
4f3b3472a2 Cleanup and timing display improvements.
Have each thread publish its own start-time so that the main thread may
compute time spent in source and target processing, in order to fix the
crude hack of taking (max read-time write-time) in the total time column
of the summary.

We still have some strange artefacts here: we consider that the full
processing time is bound to the writer thread (:target), because it
needs to have the reader done already to be able to COPY the last
batch... but in testing I've seen some :source timings higher than the
:target ones...

Let's solve problems one at a time tho; I guess multi-threading and
accurate wall clock times aren't to be expected to mix and match that
easily anyway (multi cores, single RTC and all that).
2015-10-22 22:36:40 +02:00
Dimitri Fontaine
1fb69b2039 Retry connecting to PostgreSQL in some cases.
Now that we can set up many concurrent threads working against the
PostgreSQL database, and before we open the number of workers to our
users, install a heuristic to manage the PostgreSQL error classes “too
many connections” and “configuration limit exceeded” so that pgloader
waits for some time (*retry-connect-delay*) then tries connecting again.

It's quite simplistic but should cover lots of border-line cases way
more nicely than just throwing the interactive debugger at the end user.
2015-10-20 23:15:05 +02:00
Dimitri Fontaine
633067a0fd Allow more parallelism in database migrations.
The newly added statistics are showing that read+write times are not
enough to explain how long we wait for the data copying, so it must be
the worker setup rather than the workers themselves.

From there, let lparallel work its magic in scheduling the work we do in
parallel in pgloader: rather than doing blocking receive-result calls
for each table, only receive-result at the end of the whole
copy-database processing.

On test data here on the laptop we go from 6s to 3s to migrate the
sakila database from MySQL to PostgreSQL: that's because we have lots of
very small tables, so the cost of waiting after each COPY added up quite
quickly.

In passing, stop sharing the same connection object between parallel
workers that used to be active only in sequence; see the new API
clone-connection (which takes over from new-pgsql-connection).
2015-10-20 22:15:55 +02:00
Dimitri Fontaine
187565b181 Add read/write separate stats.
Add metrics to determine where the time is spent in the current
pgloader code, so that it's possible to then optimize away the batch
processing as we do it today.

Given the following extract of the measures, it seems that doing the
data transformations in the reader thread isn't so bright an idea. More
to come.

          table name         total time       read     write
   -----------------     --------------  --------- ---------
             extract             2.014s
         before load             0.050s
               fetch             0.000s
   -----------------     --------------  --------- ---------
    geolite.location            16.090s    15.933s    5.732s
      geolite.blocks            28.896s    28.795s    5.312s
   -----------------     --------------  --------- ---------
          after load            37.772s
   -----------------     --------------  --------- ---------
   Total import time          1m25.082s    44.728s   11.044s
2015-10-11 21:35:19 +02:00
Dimitri Fontaine
96a33de084 Review the stats and reporting code organisation.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads
too), have all the stats be managed centrally by a single thread. We
already have a "monitor" thread that gets passed log messages so that
the output buffer is not subject to race conditions; extend its use to
also deal with statistics messages.

In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.

Also, as a nice side effect of the code simplification and refactoring
this fixes #283 wherein the before/after sections of individual CSV
files within an ARCHIVE command were not counted in the reporting.
2015-10-05 01:46:29 +02:00
Dimitri Fontaine
a0dc59624c Fix schema qualified table names usage again.
When the list of columns of the PostgreSQL target table isn't given in
the load command, pgloader will happily query the system catalogs to get
that information. The list-columns query didn't get the memo about the
qualified table name format and the with-schema macro... fix #288.
2015-09-11 11:53:28 +02:00
Victor Kryukov
792db2fcf4 Better regexp for PG SQL identifiers 2015-09-04 01:15:45 +02:00
Victor Kryukov
54997be2dd Fix #182: properly quote tables with . in their names 2015-09-04 01:15:29 +02:00
Dimitri Fontaine
eabfbb9cc8 Fix schema qualified table names usage (more).
When parsing table names in the target URI, we are careful to split
the table and schema names and store them into a cons in that case. Not
all source methods got the memo; clean that up.

See #182 and #186, a pull request I am now going to be able to accept.
Also see #287 that should be helped by being able to apply #186.
2015-09-04 01:06:15 +02:00
Dimitri Fontaine
ea35eb575d Implement --dry-run option, fix #264.
The dry run option will currently only check database connections, but
as that happens after having correctly parsed the load file, it also
allows checking that the command file is correct for the parser.

Note that the Lisp load-data API isn't subject to the dry-run method.

In passing, we add some more API entry points to the connection objects
and we should actually clean the code base to use the new QUERY generic
all over the place. It's for another patch tho.
2015-08-22 16:23:47 +02:00
Dimitri Fontaine
04ddf940d9 Left pad COPY octal chars with 0, fix #275.
The COPY TEXT format accepts non-printable characters as an escape
sequence wherein pgloader can pass in the octal number for the
character in its encoding. When doing that with small numbers like \6,
if the non-printable character is then followed by other digits it
becomes e.g. \646, which might not be part of the target encoding...

To fix, always left pad the character's octal number with zeroes, so
that we now send \00646, which COPY knows how to read: the char at
\006, then 4, then 6.

Also copy the test case over to pgloader and run it in the test suite.
2015-08-20 18:17:18 +02:00
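To illustrate against a hypothetical one-column table:

    -- \646 would be read as a single character (octal 646), which may
    -- not exist in the target encoding; left padded, \00646 is
    -- unambiguous: the char at \006, then the digits 4 and 6.
    COPY t (c) FROM stdin;
    \00646
    \.
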
Dimitri Fontaine
3a6120b931 Improve logging again.
The user experience is greatly enhanced by this little change, where you
know from the logs that pgloader could actually connect rather than
thinking it might still be trying...
2015-08-20 12:38:19 +02:00
Dimitri Fontaine
5e7e5391ef Fix the drop indexes option again, fix #251.
The index and constraint names given by PostgreSQL catalogs should not
be second-guessed; we need to just quote them. The identifier
downcasing is interesting when we get identifiers from other systems
for a migration, but is wrong when dropping existing indexes in
PostgreSQL.

Also, the :null special value from Postmodern was routing the code
badly; just transform it to nil manually when fetching the index list.
2015-07-26 15:38:15 +02:00
Dimitri Fontaine
3af99051d2 Fix the preserve index names option.
MySQL names its primary keys "PRIMARY" and we need to always uniquify
this name even when the user asked pgloader to preserve index names.

Also, the create-indexes-again function now needs to ask for index names
to be preserved specifically.
2015-07-18 23:39:32 +02:00
Dimitri Fontaine
54e29773d7 Fix index creation reporting, see #251.
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but failed to properly account for that fact in the
summary report with timings.

While fixing this, also fix the SQL used to re-establish the indexes
and associated constraints to allow for parallel execution; the ALTER
TABLE statements would otherwise block in ACCESS EXCLUSIVE MODE and
make our efforts vain.
2015-07-18 23:06:15 +02:00
Dimitri Fontaine
8511294ac7 Generalize index support to handle constraints, fix #251.
PostgreSQL rightfully forbids DROP INDEX when the index is used to
enforce a constraint, the proper SQL to issue is then ALTER TABLE DROP
CONSTRAINT. Also, in such a case pg_dump issues a single ALTER TABLE ADD
CONSTRAINT statement to restore the situation.

Have pgloader do the same with indexes that are used to back a constraint.
2015-07-17 17:06:09 +02:00
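A sketch of the difference, with illustrative names:

    -- DROP INDEX fails when the index backs a constraint:
    --   ERROR: cannot drop index foo_pkey because constraint
    --          foo_pkey on table foo requires it
    ALTER TABLE foo DROP CONSTRAINT foo_pkey;

    -- and a single statement restores it, as pg_dump does:
    ALTER TABLE foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
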
Dimitri Fontaine
4f2385fa4c Refactor code from previous commit.
The goal is to make it easy to add support for the 'drop indexes' option
in other source types (fixed, ixf, db3, file-based sources).
2015-07-16 19:35:34 +02:00
Dimitri Fontaine
49bf7e56f2 Implement a "drop indexes" option in CSV mode, fix #251.
When loading against a table that already has index definitions, the
load can be quite slow. The previous commit introduced a warning in
such a case. This commit introduces the option "drop indexes", which is
not used by default.

When this option is used, pgloader drops the indexes before loading the
data, then creates the indexes again with the same definitions as
before. All the indexes are created again in parallel to optimize
performance.
Only primary key indexes can't be created in parallel, so those are
created in two steps (create unique index then alter table).
2015-07-16 12:22:58 +02:00
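A sketch of the two-step dance for a primary key (names illustrative):

    -- the unique index builds in parallel with the other indexes:
    CREATE UNIQUE INDEX data_pkey ON data (id);

    -- then a quick catalog-only step promotes it to a constraint:
    ALTER TABLE data ADD CONSTRAINT data_pkey
          PRIMARY KEY USING INDEX data_pkey;
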
Dimitri Fontaine
7c834db6e3 Warn against pre-existing indexes.
Pre-existing indexes will reduce data loading performance, and it's
generally better to DROP the indexes prior to the load and CREATE them
again once the load is done. See #251 for an example of that.

In this patch we just add a WARNING against the situation; the next
patch will add support for a new WITH clause option allowing pgloader
to take care of the DROP/CREATE dance around the data loading.
2015-07-16 12:22:58 +02:00
Dimitri Fontaine
88c801997e Default to a static list of PostgreSQL keywords.
In some cases (such as when using a very old PostgreSQL instance or an
Amazon Redshift service, as in #255), the function pg_get_keywords()
does not exist, but we assume that pgloader might still be able to
complete its job.

We're better off with a static list of keywords than with an unhandled
error here, so let's see what happens next with Redshift.
2015-07-04 20:16:50 +02:00
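On servers that do provide the function, the keywords can be fetched
with something like this (a sketch, not necessarily pgloader's exact
query):

    SELECT word
      FROM pg_get_keywords()
     WHERE catcode = 'R';  -- 'R' marks reserved keywords
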
Alex Baretta
49dcae8068 Fix DROP TABLE statements on tables with foreign keys 2015-05-28 14:24:28 -07:00
Alex Baretta
0626d56303 Fix identifier case in FOREIGN KEY constraints 2015-05-25 18:08:37 -07:00