170 Commits

Author SHA1 Message Date
Dimitri Fontaine
dfe5c38185 Fix quoting policy in PostgreSQL DDL formatting.
We already have apply-identifier-case and *identifier-case* to decide how
and when to quote our SQL object names, so don't force extra quotes in
format string: refrain from using ~s.
2017-07-06 09:47:48 +02:00
Dimitri Fontaine
9da012ca51 Fix identifiers quoting when reading PostgreSQL catalogs.
We sure can trust PostgreSQL to use names it knows how to handle. Still, it
will be happy to store in its catalogs names containing upper case, and in
that case we must quote them.
2017-07-06 03:16:06 +02:00
Dimitri Fontaine
e37cb3a9e7 Split SQL queries into their own files.
This change was long overdue. Ideally we would use something like the YeSQL
library for Clojure, but it seems like the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...

So this patch introduces a URL abstraction built on top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as

  (sql "pgsql/list-all-columns.sql")

in the source code directly.
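
As a rough illustration of the idea, here is a minimal sketch in Python (not pgloader's Common Lisp, and not its actual implementation). It fills a hash table from a directory of .sql files and looks queries up by key. The names load_sql_files and make_sql_lookup are hypothetical, and keying each file by its path relative to a root directory is a simplifying assumption.

```python
from pathlib import Path


def load_sql_files(root):
    """Read every .sql file under root into a dict keyed by its relative path."""
    root = Path(root)
    table = {}
    for path in sorted(root.rglob("*.sql")):
        # Use forward slashes so keys look like "pgsql/list-all-columns.sql".
        key = path.relative_to(root).as_posix()
        table[key] = path.read_text(encoding="utf-8")
    return table


def make_sql_lookup(table):
    """Return a function that fetches a query by key and fails loudly if missing."""
    def sql(key):
        try:
            return table[key]
        except KeyError:
            raise KeyError(f"no SQL file registered under {key!r}") from None
    return sql
```

Reading the files once and serving them from memory keeps the call site as short as the (sql "...") form shown above.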

So for now the templating system is CL's format language. It is still an
improvement over embedded strings. Again, one step at a time.
2017-07-06 03:16:05 +02:00
Dimitri Fontaine
d50ed64635 Defensive programming, as an afterthought.
It might be that a column-type-name is actually an sqltype instance, and
then #'string= won't be happy. Prevent that now by discarding any smarts
when the type name does not satisfy stringp.
2017-07-06 00:59:36 +02:00
Dimitri Fontaine
26d372bca3 Implement support for non-btree indexes (e.g. MySQL spatial keys).
When pgloader fetches the index list from a source database, it doesn't
fetch information about access methods for the indexes: I don't even know if
the overlap between index access methods from one RDBMS to another covers
more than just btree...

It could happen that MySQL indexes a "geometry" column tho. This datatype is
converted automatically to "point" by pgloader, which is good. But the index
creation would fail with the following error message:

  Database error 42704: data type point has no default operator class for access method "btree"

In this patch, when setting up the target schema we issue a PostgreSQL
catalog query to dynamically list those datatypes without btree support and
fetch their opclasses, with a hard-coded preference for GiST, then GIN, so
as to be able to automatically use the proper access method when btree isn't
available. And now pgloader transparently issues the proper statement:

  CREATE INDEX idx_168468_idx_location ON pagila.address USING gist(location);

Currently this exploration is limited to indexes with a single column. To
implement the general case we would need a more complex lookup: we would
have to find the intersection of all the supported access methods for all
involved columns.

Of course we might need to do that someday. One step at a time is plenty
good enough tho.
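
For illustration only, the lookup for the general multi-column case could look like the following Python sketch. It is not pgloader code: pick_access_method is a hypothetical name, and methods_by_type is assumed to be a hand-written mapping that covers every data type involved, whereas pgloader obtains the information from a catalog query.

```python
# Preference order when btree is unavailable, as described in the commit message.
PREFERRED_METHODS = ("gist", "gin")


def pick_access_method(column_types, methods_by_type):
    """Return an access method usable for every indexed column, or None.

    methods_by_type maps a data type name to the set of access methods
    that have an operator class for it (hypothetical data in this sketch).
    """
    if not column_types:
        return None
    # Intersect the supported methods across all indexed columns.
    common = set.intersection(
        *(set(methods_by_type.get(t, ())) for t in column_types)
    )
    if "btree" in common:
        return "btree"
    for method in PREFERRED_METHODS:
        if method in common:
            return method
    return None
```

With a single "point" column whose only known methods are GiST-like, the function returns "gist". With a mix of columns that share no access method, it returns None and the index would have to be skipped or reported.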
2017-07-06 00:42:43 +02:00
Dimitri Fontaine
8405c331a9 Error handling improvements for PostgreSQL schema.
In the complete PostgreSQL schema step, an error would be logged as you
expect but poorly handled: it would have the whole transaction rolled back,
meaning that a single Primary Key definition failure would cancel all the
others, plus the foreign keys, and also the triggers and comments.

It happens that other systems allow a primary key column to contain NULL
values, which is forbidden in the standard and enforced by PostgreSQL, so
that's not a theoretical concern here.
2017-07-05 17:53:33 +02:00
Dimitri Fontaine
bae40d40c3 Fix identifier quoting corner cases.
In cases when pgloader needs to build a new identifier from existing
ones (mainly for renaming indexes, because they are unique per-table in the
source database and unique per-schema in PostgreSQL), and we compose the new
name from already quoted strings, pgloader was doing the wrong thing.

Fix that by having a build-identifier function that may unquote parts then
re-quote properly (if needed) the new identifier.
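
The unquote-then-requote approach can be sketched as follows, in Python rather than Common Lisp. This is an illustration, not the actual build-identifier function: the signature is hypothetical, and the rule deciding when a name needs quotes (anything other than lower-case letters, digits and underscores, or a leading digit) is an assumption that ignores reserved words.

```python
import re

# Assumption for this sketch: a name can stay unquoted only if it is made of
# lower-case letters, digits and underscores and does not start with a digit.
_PLAIN = re.compile(r"[a-z_][a-z0-9_]*")


def unquote(name):
    """Strip one level of double quotes and undo doubled inner quotes."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1].replace('""', '"')
    return name


def quote_if_needed(name):
    """Double-quote name unless it is a plain lower-case identifier."""
    if _PLAIN.fullmatch(name):
        return name
    return '"' + name.replace('"', '""') + '"'


def build_identifier(separator, *parts):
    """Join possibly-quoted parts into one identifier, quoted once if needed."""
    return quote_if_needed(separator.join(unquote(p) for p in parts))
```

Joining the already quoted parts directly would give something like "Order"_idx, which is not a valid identifier. Unquoting first and quoting the result once avoids that.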
2017-07-05 15:37:21 +02:00
Dimitri Fontaine
3f7853491f Refactor PostgreSQL error handling.
The code was too complex and the transaction / connection handling wasn't
good enough, too many reconnections when a ROLLBACK; is all we need to be
able to continue our processing.

Also fix some stats counters about errors handled, and improve error messages
by mentioning PostgreSQL explicitly, along with the name of the table where
the error comes from.
2017-07-04 01:41:08 +02:00
Dimitri Fontaine
1e436555a8 Refactor PostgreSQL conditions.
Use a single deftype postgresql-unavailable rather than copy/pasting the
same list of conditions in several places.
2017-06-29 14:08:52 +02:00
Dimitri Fontaine
60c1146e18 Assorted fixes.
Refrain from killing the Common Lisp image when doing interactive regression
testing if we typo'ed the regression test file name...
2017-06-29 12:35:40 +02:00
Dimitri Fontaine
cea82a6aa8 Reconnect to PostgreSQL in case of connection lost.
It may happen that PostgreSQL is restarted while pgloader is running, or
that for some other reason we lose the connection to the server, and in most
cases we know how to gracefully reconnect and retry, so just do so.

Fixes #546 initial report.
2017-06-29 12:34:34 +02:00
Dimitri Fontaine
f0d1f4ef8c Fix reduce usage with max function.
The (reduce #'max ...) requires an initial value to be provided, as the max
function wants at least 1 argument, as we can see here:

CL-USER> (handler-case (reduce #'max nil) (condition (e) (format t "~a" e)))
Too few arguments in call to #<Compiled-function MAX #x300000113C2F>:
0 arguments provided, at least 1 required.
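
In Common Lisp the fix is to pass the :initial-value keyword argument to reduce. The same pitfall exists in other languages; as an analogy only, here is the Python equivalent, where functools.reduce raises TypeError on an empty sequence unless an initial value is given. The function name widest is hypothetical.

```python
from functools import reduce


def widest(lengths, default=0):
    """Return the largest value, or default when the sequence is empty."""
    # The third argument is the initial value; without it, reduce raises
    # TypeError on an empty sequence.
    return reduce(max, lengths, default)
```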
2017-06-28 16:37:27 +02:00
Dimitri Fontaine
6d66280fa5 Review parallelism and memory behavior.
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.

Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.

The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.

Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches and the count of items in the reader
queue is now a number of rows, not of batches of them.

Anyway, with this patch in I can't reproduce the following issues:

Fixes #337, Fixes #420.
2017-06-27 23:10:33 +02:00
Dimitri Fontaine
7f737a5f55 Reduce memory allocation in format-vector-row.
This function is used on every bit of data we send down to PostgreSQL, so I
have good hopes of reducing its memory allocation having an impact on
loading times. In particular for sizeable data sets.
2017-06-27 15:31:49 +02:00
Dimitri Fontaine
e11ccf7bb7 Fix on-error-stop signaling.
To properly handle on-error-stop condition, make it a specific pgloader
condition with a specific handling behavior. In passing add some more log
messages for surprising conditions.

Fix #546.
2017-06-17 19:02:05 +02:00
Dimitri Fontaine
5faf8605ce Fix corner cases and how we log them.
In the prepare-pgsql-database method we were logging too many details, such
as DDL warnings on if-not-exists for successful queries. And those logs are
to be found in PostgreSQL server logs anyway.

Also fix trying to create or drop a "nil" schema.
2017-06-17 18:16:18 +02:00
Dimitri Fontaine
25e5ea9ac3 Refactor error handling in complete-pgsql-database.
Given the new SQLite test case from issue #563 we see that pgloader doesn't
handle errors gracefully in the post-copy stage. That's because the API was
not properly defined: we should use pgsql-execute-with-timing rather than
other constructs here, because it allows the "on error resume next" behavior
we want with after load DDL statements.

See #563.
2017-06-08 12:09:11 +02:00
Dimitri Fontaine
8254d63453 Fix incorrect per-table total time metrics.
The concurrent nature of pgloader made it non-obvious where to implement
the timers properly, and as a result the tracking of how long it took to
actually transfer the data was... just wrong.

Rather than trying to measure the time spent in any particular piece of the
code, we now emit "start" and "stop" stats messages to the monitor thread at
the right places (which are way easier to find, in the worker threads) and
have the monitor figure out how long it took really.
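
A minimal Python sketch of that message-based design, for illustration only: workers post "start" and "stop" messages carrying a timestamp, and a single monitor computes the elapsed time. The message layout (kind, label, timestamp), the "quit" message and the name run_monitor are assumptions made for this sketch, not pgloader's actual protocol.

```python
import queue
import time


def run_monitor(events):
    """Consume (kind, label, timestamp) messages and return seconds per label."""
    started = {}
    totals = {}
    while True:
        kind, label, timestamp = events.get()
        if kind == "quit":
            return totals
        if kind == "start":
            started[label] = timestamp
        elif kind == "stop" and label in started:
            # Elapsed time is computed in one place, from the two messages.
            totals[label] = totals.get(label, 0.0) + timestamp - started.pop(label)


if __name__ == "__main__":
    events = queue.Queue()
    events.put(("start", "public.address", time.monotonic()))
    events.put(("stop", "public.address", time.monotonic()))
    events.put(("quit", None, 0.0))
    print(run_monitor(events))
```

Because only the monitor does arithmetic on timestamps, worker threads never need to share timer state with each other.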

Fix #506.
2017-04-30 18:09:50 +02:00
Dimitri Fontaine
538464f078 Avoid operator is not unique errors.
When the intarray extension is installed our PostgreSQL catalog query fails
because we now have more than one operator solving smallint[] <@ smallint[].
It is easy to avoid that problem by casting to integer[], smallint being an
implementation detail here anyway.

Fix #532.
2017-04-06 23:55:06 +02:00
Dimitri Fontaine
0219f55071 Review DROP INDEX objects quoting.
Force double-quoting of object names in DROP INDEX commands by using the
format directive ~s. The names of the objects we are dropping usually come
from a PostgreSQL catalog, but still might contain force-quote conditions
like starting with a number, as shown in #530.

This fix certainly means we will have to review all the DDL formatting we do
in pgloader and apply a single method of quoting all along. The simplest one
is of course to force quote every object name in "", but it might not be the
smartest one (what if some sources are sending already quoted object names?
That needs a check), and it's certainly not the prettiest way to go about it:
people usually like to avoid unnecessary quotes, calling them clutter.

Fix #530.
2017-04-01 22:37:26 +02:00
Dimitri Fontaine
1023577f50 Review internal database migration logic.
Many options are now available to pgloader users, including shortcuts that
were not defined clearly enough. That could result in stupid things being
done at times.

In particular, when picking the "data only" option then indexes are not to
be dropped before loading the data, but pgloader would still try and create
them again at the end of the load, because the option that controls that
behavior defaults to true and is not impacted by the "data only" choice.

In this patch we review the logic and ensure it's applied in the same
fashion in the different phases of the database migration: preparation,
copying, rebuilding of indexes and completion of the database model.

See also 96b2af6b2a2b163f7e9e3c0ba744da1733b23979 where we began fixing
oddities but didn't go far enough.
2017-02-26 14:48:36 +01:00
Dimitri Fontaine
9e2b95d9b7 Implement support for PostgreSQL storage parameters.
In PostgreSQL it is possible at CREATE TABLE time to set some extra storage
parameters, the most useful of them in the context of pgloader being the
FILLFACTOR. For the setting to be useful, it needs to be set at CREATE TABLE
time, before we load the data.

The BEFORE LOAD clause of the pgloader command allows running SQL scripts
that will be executed before the load, and even before the creation of the
target schema when pgloader does that, which is nice for other use cases.

Here we implement a new `ALTER TABLE` rule that one can set in the pgloader
command in order to change storage parameters at CREATE TABLE time:

  ALTER TABLE NAMES MATCHING ~/\./ SET (fillfactor='40')

Fix #516.
2017-02-25 21:49:06 +01:00
Dimitri Fontaine
57dd9fcf47 Add int as an alias for integer.
We cast MS SQL "int" type to "integer" in PostgreSQL, so add an entry in our
type name mapping where they are known equivalent to avoid WARNINGs about
the situation in DATA ONLY loads.
2017-02-25 17:54:57 +01:00
Dimitri Fontaine
5fd1e9f3aa Fix catalog merge hazards.
When reading table names from PostgreSQL, we might find some that need
systematic quoting (such as names that begin with a digit). In that case,
when later comparing the catalogs to match source database table names
against PostgreSQL catalog table names, we need to unquote the PostgreSQL
table name we are using.

In passing, force the *identifier-case* to :none when reading object names
from the PostgreSQL catalogs.
2017-02-25 17:53:08 +01:00
Dimitri Fontaine
dbf7d6e48f Don't double-quote identifiers in catalog queries.
Avoid double quoting the schema names when used in PostgreSQL catalog
queries, where the identifiers are used as literal values and need to be
single-quoted.

Fix #476, again.
2017-01-10 21:12:34 +01:00
Dimitri Fontaine
8da09d7bed Log PostgreSQL Catalog queries at SQL log level.
See #476 where it would have been helpful to see the PostgreSQL catalog
queries with `--log-min-messages sql` in the bug report. Also more
generally useful.
2017-01-10 21:12:34 +01:00
Dimitri Fontaine
381ba18b50 Add a new log level: SQL.
This sits between NOTICE and INFO, making it possible to have a complete log
of the SQL queries sent to the server while avoiding the very verbose
traffic of the DEBUG log level.

See #498.
2017-01-03 22:27:17 +01:00
Dimitri Fontaine
320a545533 Fix SQL types creation: consider views too.
When migrating views from e.g. MySQL it is necessary to consider the
user defined SQL types (ENUMs) those views might be using.
2016-12-18 19:31:21 +01:00
Dimitri Fontaine
ad56cf808b Fix PostgreSQL index naming.
A PostgreSQL index is always created in the same schema as the table it
is defined against, and the CREATE INDEX command doesn't accept schema
qualified index names.
2016-12-18 19:31:21 +01:00
Dimitri Fontaine
2dc733c4d6 Fix corner case in creating indexes again.
When the option "drop indexes" is in use in loading data from a file, we
collect the indexes from the PostgreSQL catalogs and then issue DROP
commands against them before the load, then CREATE commands when it's
done.

The CREATE is done in parallel, and we create an lparallel kernel for
that. The kernel must have a worker-count of at least 1, and we were
not considering the case of 0 indexes on the target table.
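
The same constraint shows up in other worker-pool APIs. As an analogy only (Python, not lparallel), concurrent.futures.ThreadPoolExecutor raises ValueError when max_workers is 0, so one way to guard the empty case is to return early. The create_indexes name and the run callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def create_indexes(statements, run):
    """Run each CREATE INDEX statement through run() in parallel."""
    if not statements:
        # Nothing to do: never build a pool with zero workers.
        return []
    with ThreadPoolExecutor(max_workers=len(statements)) as pool:
        return list(pool.map(run, statements))
```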

Fix #484.
2016-11-20 17:17:15 +01:00
Dimitri Fontaine
d7d36c5766 Review identifier case :quote.
We added some confusion about who's responsible for quoting the SQL object
names in between src/utils/quoting.lisp and src/pgsql/pgsql-ddl.lisp and
as a result some migrations from MySQL with identifier case set to quote
were broken, as in #439.

To fix, remove any use of the format directive ~s in the PostgreSQL ddl
output methods: we consider that the quoting of ~s is to be decided in
apply-identifier-case. We then use ~a instead of ~s.

Fix #439.
2016-09-17 22:45:45 +02:00
Dimitri Fontaine
5b6adb02b0 Implement and use DROP ... IF EXISTS.
In cases where we have a WITH include drop option, we are generating
lots of SQL DROP statements. We may be running against an empty target
database, or in other situations where the target object of the DROP
command might not exist. Add support for that case.
2016-09-10 18:01:04 +02:00
Dimitri Fontaine
f2dcf982d8 Fix stats collections in some cases.
Calling a -with-timing from within a with-stats-collection macro is
redundant and will have the numbers counted twice. Which in this case
didn't happen because the stats label was manually copied, but borked
with a typo in one copy.
2016-08-28 20:29:53 +02:00
Dimitri Fontaine
a86a606d55 Improve existing PostgreSQL database handling.
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better performance of the data loading. Some of the indexes
are UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on
them in the PostgreSQL dependency tracking of the catalog.

We used to use the CASCADE option when dropping the indexes, which hides
a bug: if we exclude from the load tables with foreign keys pointing to
tables we target, then we would DROP those foreign keys because of the
CASCADE option, but fail to install them again at the end of the load.

To prevent that from happening, pgloader now queries the PostgreSQL
pg_depend system catalog to list the “missing” foreign keys and add them
to our internal catalog representation, from which we know to DROP then
CREATE the SQL object at the proper times.

See #400 as this was an oversight in fixing this issue.
2016-08-10 22:02:06 +02:00
Dimitri Fontaine
43261e0016 Fix double-counting of fkeys in stats reports.
2016-08-08 21:09:15 +02:00
Dimitri Fontaine
53924fab01 Fix foreign key definition formatting.
When we do have a condef (constraint definition in the PostgreSQL
catalog slang), use it rather than trying to invent it again from the
bits and pieces. See #400, which it actually fixes now...
2016-08-08 01:18:36 +02:00
Dimitri Fontaine
70572a2ea7 Implement support for existing target databases.
Also known as the ORM case, it happens that other tools are used to
create the target schema. In that case pgloader's job is to fill in the
existing target tables with the data from the source tables.

We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.

This behavior is activated when using the “create no tables” option as
in the following test-case setup:

  with create no tables, include drop, truncate

Fixes #400, for which I got a test-case to play with!
2016-08-06 20:19:15 +02:00
Dimitri Fontaine
2d47c4f0f5 Use internal catalog when loading from files.
Replace the ad-hoc code that was used before in the load from file code
path to use our full internal catalog representation, and adjust APIs to
that end.

The goal is to use catalogs everywhere in the PostgreSQL target API,
allowing us to reason explicitly about source and target catalogs,
see #400 for the main use case.
2016-08-05 11:42:06 +02:00
Dimitri Fontaine
2aedac7037 Improve our internal catalog representation.
First, add indexes and foreign keys to the list of objects supported by
the shared catalog facility, where they were only found in the pgsql
schema specific package for historical reasons.

Then also add to our catalog internal structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.

Once we now have a proper and complete catalog, review the pgsql module
DDL output function in terms of the catalog and rewrite the schema
creation support so that it takes direct benefit of our internal
catalogs representation.

In passing, clean-up the code organisation of the pgsql target support
module to be easier to work with.

Next step consists of getting rid of src/pgsql/queries.lisp: this
facility should be replaced by the usage of a target catalog that we
fetch the usual way, thanks to the new src/pgsql/pgsql-schema.lisp file
and list-all-* functions.

That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by other tools than pgloader,
that is when migrating with the help of an ORM. See #400 for details.
2016-08-01 23:14:58 +02:00
Dimitri Fontaine
7daee9405f Fix column names quoting in reset-all-sequences.
The other user-provided names (schema and table) were already quoted
using the quote_ident() PostgreSQL function, but the column name (attname
in the catalogs) was not.

Blind attempt to fix #425.
2016-06-20 20:52:24 +02:00
Dimitri Fontaine
42e9e521e0 Add option "max parallel create index".
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.

As this number might be so great as to exhaust the target PostgreSQL
server (e.g. maintenance_work_mem), we add an option to limit that to
something reasonable when the source schema isn't.

Fix #386 in which 150 indexes are found on a single source table.
2016-04-11 17:40:52 +02:00
Dimitri Fontaine
31f8b5c5f0 Set application_name to 'pgloader' by default.
It's always been possible to set application_name to anything really,
making it easier to follow the PostgreSQL queries made by pgloader.
Force that setting to 'pgloader' by default.

Fix #387.
2016-04-11 17:14:38 +02:00
Dimitri Fontaine
fe3601b04c Fix SQLite index support, add foreign keys support.
It turns out recent changes broke the SQLite index support (from adding
support for MS SQL partial/filtered indexes), so fix it by using the
pgsql-index structure rather than the specific sqlite-idx one.

In passing, improve detection of PRIMARY KEY indexes, which was still
lacking. This work showed that the introspection done by pgloader was
wrong, it's way crazier than we thought, so adjust the code to loop
over PRAGMA calls for each object we inspect.

While adding PRAGMA calls, add support for foreign keys too, we have the
code infrastructure that makes it easy now.
2016-03-27 20:39:13 +02:00
Dimitri Fontaine
cdc5d2f06b Review on update CURRENT_TIMESTAMP support.
Make it work on the second run, when the triggers and functions have
already been deployed, by doing the DROP function and trigger before we
CREATE the table, then CREATE them again: we need to split the list
again.
2016-03-27 19:13:33 +02:00
Dimitri Fontaine
d72c711b45 Implement support for on update CURRENT_TIMESTAMP.
That's the MySQL slang for a simple ON UPDATE trigger, and that's what
pgloader now translates the expression to. Fix #195.
2016-03-27 01:01:40 +01:00
Dimitri Fontaine
fcc6e8f813 Implement ALTER SCHEMA ... RENAME TO...
That's only available for MS SQL as of now, as it's the only source
database we have where the notion of a schema makes sense. Fix #224.
2016-03-26 20:25:03 +01:00
Dimitri Fontaine
6f078daeb9 Ensure logging of errors.
The first error of a batch was lost somewhere in the recent changes. My
current best guess is that the rewrite of the copy-batch function made
the handler-bind form setup by the handling-pgsql-notices macro
ineffective, but I can't see why that is.

See #85.
2016-03-26 17:51:38 +01:00
Dimitri Fontaine
e2fcd86868 Handle failure to convert index filters gracefully.
We should not block any processing just because we can't parse an index.
The best we can do just tonight is to try creating the index without the
filter, ideally we would have to skip building the index entirely.
That's for a later effort though, it's running late here.

See #365.
2016-03-22 00:29:25 +01:00
Dimitri Fontaine
5e18cfd7d4 Implement support for partial indexes.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.

The WHERE clause in both cases is limited to simple expressions over a
base table's row at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the
clause in PostgreSQL slang.

In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. a MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:

  MS SQL:     "([deleted]=(0))"  (that's from the catalogs)
  PostgreSQL: deleted = 'f'

Of course the parser is still very badly tested, let's see what happens
in the wild now.

(Should) Fix #365.
2016-03-21 23:39:45 +01:00
Dimitri Fontaine
1ed07057fd Implement --on-error-stop command line option.
The implementation uses the dynamic binding *on-error-stop* so it's also
available when pgloader is used as a Common Lisp library.
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!

Fix #85.
2016-03-21 20:52:50 +01:00