722 Commits

Author SHA1 Message Date
Dimitri Fontaine
5a65da2147 Create new types in the proper schema.
Prior to this patch, pgloader didn't care about which schema it
creates extra types in. Extra types are mainly ENUM and SET support from
MySQL. Now, pgloader creates those extra PostgreSQL ENUM types in the same
schema as the table using them, which is a more sound default.
2017-08-10 18:57:09 +02:00
Dimitri Fontaine
981b801ce7 Fix user defined rules to cast ENUM to Text.
MySQL enums are cast to PostgreSQL enum types just fine, but sometimes
that's not what the user wants. When we have a CAST rule for an ENUM
column, recognize the fact and respect the user's choice.

Fixes #608.
2017-08-10 18:01:17 +02:00
Dimitri Fontaine
049a1199c2 Implement support for SQLite current_date default value.
The spelling in SQLite for the default value is "current_date", instruct
pgloader about that. This commit also adds a test case in our sqlite.db
unit tests database.

Fixes #607.
2017-08-08 21:55:15 +02:00
Luke Snape
ecd6a8e25c Ignore nulls in varbinary-to-string transform (#606) 2017-08-07 21:37:37 +02:00
Dimitri Fontaine
5c1c4bf3ff Fix MySQL Enum parsing.
We use a CSV parser for the MySQL enum values, but the quote escaping wasn't
properly set up: MySQL quotes ENUM values with a single-quote (') and uses
two of them ('') for escaping single-quotes when found in the ENUM value
itself.

Fixes #597.
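
The quoting rule described above can be illustrated with a minimal Python sketch. It is not pgloader's code (pgloader is written in Common Lisp); the function name parse_enum_values and the sample column type are hypothetical, and only the standard csv module is used.

```python
import csv


def parse_enum_values(column_type):
    """Extract the values from a MySQL column type such as enum('a','b').

    Hypothetical helper for illustration: values are quoted with a single
    quote and an embedded single quote is written as two of them.
    """
    start = column_type.index("(") + 1
    end = column_type.rindex(")")
    inner = column_type[start:end]
    # quotechar="'" with doublequote=True turns '' into a literal quote.
    reader = csv.reader([inner], delimiter=",", quotechar="'", doublequote=True)
    return next(reader)


values = parse_enum_values("enum('small','it''s big','x,y')")
```

With the default double-quote setting the same input would be split on the comma inside 'x,y' and the doubled quote would be left untouched.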
2017-08-01 18:40:27 +02:00
Dimitri Fontaine
3103b0dc72 Escape SQL identifiers in SQLite catalog queries.
SQLite supports the backtick escaping for SQL identifiers and we'd rather
use it. Fixes #600.
2017-07-31 23:11:29 +02:00
Dimitri Fontaine
d37ad27754 Handle empty tables in concurrency support for MySQL.
When the table is empty we get nil for min and max values of the id column.
In that case we don't compute a set of ranges and “cancel” concurrency
support for the empty table.

Fixes #596.
2017-07-18 13:35:01 +02:00
Dimitri Fontaine
b1fa3aec3c Implement a separate switch to drop the schemas.
The with option “include drop” used to also apply to schemas, which is not
that useful and problematic when trying to DROP SCHEMA public, because you
might not connect as the owner of that schema.

Even if we don't target the public schema by default, users can choose to do
so thanks to our ALTER SCHEMA ... RENAME TO ... command.

Fixes #594.
2017-07-18 13:13:36 +02:00
Dimitri Fontaine
ae0c6ed119 Add support for preserving index names in SQLite.
See #187.
2017-07-17 11:04:12 +02:00
Dimitri Fontaine
cf6182fafa Add a notice message with guessed parameters.
We might have to help users debug our decision, and I expect we will have to
improve our guess “engine” here.
2017-07-07 02:34:23 +02:00
Dimitri Fontaine
471f2b6d88 Implement automagic guessing of CSV parameters.
As we know how many columns we expect from the input file, it's possible to
read a sample (10 lines as of this patch) and try many different CSV reader
parameters combinations until we find one that works: it returns the right
number of fields.

It is still possible of course to specify parameters on the command line or
in a load file if necessary, but it makes the simple case even simpler. As
simple as:

  pgloader file.csv pgsql:///pgloader?tablename=target
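
A rough Python sketch of the guessing idea, under assumptions: the candidate separators and quote characters below are illustrative, the name guess_csv_params is hypothetical, and pgloader's actual list of combinations is not shown in this log.

```python
import csv
import itertools


def guess_csv_params(text, expected_columns, sample_size=10):
    """Try reader parameter combinations until every sampled line
    yields the expected number of fields; return None when none fits."""
    sample = text.splitlines()[:sample_size]
    separators = [",", ";", "\t", "|"]   # assumed candidates
    quotes = ['"', "'"]                  # assumed candidates
    for sep, quote in itertools.product(separators, quotes):
        rows = list(csv.reader(sample, delimiter=sep, quotechar=quote))
        if rows and all(len(row) == expected_columns for row in rows):
            return {"delimiter": sep, "quotechar": quote}
    return None


params = guess_csv_params('a;b;c\n1;"x;y";3\n', expected_columns=3)
```

The first combination that works wins, so the order of the candidate lists acts as a preference.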
2017-07-07 02:16:53 +02:00
Dimitri Fontaine
14e1830b77 Fix CLI insistence on --field.
From a load file, as soon as pgloader can retrieve the schema of the target
table the source field list defaults to the target column list. Let's apply
the same rules to the command line.
2017-07-07 01:00:55 +02:00
Dimitri Fontaine
64959595fc Back to development release in the master's branch. 2017-07-06 16:55:56 +02:00
Dimitri Fontaine
d71da6ba66 Release pgloader 3.4.1 2017-07-06 16:53:29 +02:00
Dimitri Fontaine
7a371529be Implement "drop indexes" option for MySQL and MSSQL too.
It was only offered for SQLite without good reason really, and tests show
that it works as well with MySQL of course. Offer the option there too.

See 3eab88b1440a8166786e90b95f563d153e2ba4dc for details.
2017-07-06 10:06:03 +02:00
Dimitri Fontaine
2363d8845f Fix create schema handling in data only scenarios.
In b301aa93948f05b5189382f641cf1e040fc655f2 the "create schema" default
changed to true, which is a good idea. As a consequence pgloader should
consider this operation only when "create tables" is set: we don't want to
start with creating target schemas in a target database that is said to be
ready to host the data.
2017-07-06 09:48:03 +02:00
Dimitri Fontaine
dfe5c38185 Fix quoting policy in PostgreSQL DDL formatting.
We already have apply-identifier-case and *identifier-case* to decide how
and when to quote our SQL object names, so don't force extra quotes in
format string: refrain from using ~s.
2017-07-06 09:47:48 +02:00
Dimitri Fontaine
9da012ca51 Fix identifiers quoting when reading PostgreSQL catalogs.
We sure can trust PostgreSQL to use names it knows how to handle. Still, it
will be happy to store in its catalogs names containing upper case, and in
that case we must quote them.
2017-07-06 03:16:06 +02:00
Dimitri Fontaine
e87477ed31 Restrict condition handling to relevant conditions.
In md-methods copy-database function, don't pretend we are able to handle
any condition when preparing the PostgreSQL schema, database-error is all we
are dealing with there really.
2017-07-06 03:16:05 +02:00
Dimitri Fontaine
e37cb3a9e7 Split SQL queries into their own files.
This change was long overdue. Ideally we would use something like the YeSQL
library for Clojure, but it seems like the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...

So this patch introduces an URL abstraction built on-top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as

  (sql "pgsql/list-all-columns.sql")

in the source code directly.

So for now the templating system is CL's format language. It is still an
improvement from embedded string. Again, one step at a time.
2017-07-06 03:16:05 +02:00
Dimitri Fontaine
d50ed64635 Defensive programming, afterthought.
It might be that a column-type-name is actually an sqltype instance, and
then #'string= won't be happy. Prevent that now by discarding any smarts
when the type name does not satisfy stringp.
2017-07-06 00:59:36 +02:00
Dimitri Fontaine
26d372bca3 Implement support for non-btree indexes (e.g. MySQL spatial keys).
When pgloader fetches the index list from a source database, it doesn't
fetch information about access methods for the indexes: I don't even know if
the overlap between index access methods from one RDBMS to another covers
more than just btree...

It could happen that MySQL indexes a "geometry" column tho. This datatype is
converted automatically to "point" by pgloader, which is good. But the index
creation would fail with the following error message:

  Database error 42704: data type point has no default operator class for access method "btree"

In this patch when setting up the target schema we issue a PostgreSQL
catalog query to dynamically list those datatypes without btree support and
fetch their opclasses, with a hard-coded preference for GiST, then GIN, so
as to be able to automatically use the proper access method when btree isn't
available. And now pgloader transparently issues the proper statement:

  CREATE INDEX idx_168468_idx_location ON pagila.address USING gist(location);

Currently this exploration is limited to indexes with a single column. To
implement the general case we would need a more complex lookup: we would
have to find the intersection of all the supported access methods for all
involved columns.

Of course we might need to do that someday. One step at a time is plenty
good enough tho.
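
The selection logic can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the sets of supported methods are made up for the example, and the multi-column variant is the general case that this commit explicitly does not implement.

```python
# Preference order: btree when available, then GiST, then GIN.
PREFERENCE = ("btree", "gist", "gin")


def pick_access_method(supported):
    """Return the preferred access method among those supported."""
    for method in PREFERENCE:
        if method in supported:
            return method
    return None


def pick_for_columns(supported_per_column):
    """Hypothetical general case: intersect the access methods
    supported by every column, then apply the same preference."""
    if not supported_per_column:
        return None
    common = set.intersection(*(set(s) for s in supported_per_column))
    return pick_access_method(common)


single = pick_access_method({"gist", "spgist"})
multi = pick_for_columns([{"btree", "gist"}, {"gist", "gin"}])
```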
2017-07-06 00:42:43 +02:00
Dimitri Fontaine
8405c331a9 Error handling improvements for PostgreSQL schema.
In the complete PostgreSQL schema step, an error would be logged as you
expect but poorly handled: it would have the whole transaction rolled back,
meaning that a single Primary Key definition failure would cancel all the
others, plus the foreign keys, and also the triggers and comments.

It happens that other systems allow a primary key column to contain NULL
values, which is forbidden in the standard and enforced by PostgreSQL, so
that's not a theoretical concern here.
2017-07-05 17:53:33 +02:00
Dimitri Fontaine
bae40d40c3 Fix identifier quoting corner cases.
In cases when pgloader needs to build a new identifier from existing
ones (mainly for renaming indexes, because they are unique per-table in the
source database and unique per-schema in PostgreSQL), and we compose the new
name from already quoted strings, pgloader was doing the wrong thing.

Fix that by having a build-identifier function that may unquote parts then
re-quote properly (if needed) the new identifier.
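
A minimal Python sketch of the unquote-then-requote idea. The helper names mirror the commit message but the code is an assumption, not pgloader's build-identifier; in particular, the "needs quoting" test below only looks at lower-case letters, digits and underscores and ignores reserved keywords.

```python
import re


def unquote(name):
    """Strip one level of double quotes, undoing "" escapes."""
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        return name[1:-1].replace('""', '"')
    return name


def quote_if_needed(name):
    """Quote unless the name is a plain lower-case identifier.
    Simplified rule: reserved keywords are not considered."""
    if re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'


def build_identifier(separator, *parts):
    """Join raw (unquoted) parts, then quote the result once."""
    return quote_if_needed(separator.join(unquote(p) for p in parts))


new_name = build_identifier("_", '"MyTable"', "idx")
```

Naive concatenation of the already quoted parts would give "MyTable"_idx, which is not a valid identifier.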
2017-07-05 15:37:21 +02:00
Dimitri Fontaine
f6cb428c6d Check empty strings in DB3 numeric fields.
Another blind attempt at fixing pgloader from a bug report on gitter, see
2017-07-04 23:15:47 +02:00
Dimitri Fontaine
652e435843 Only catch thread errors in pgloader-image.
In the REPL we're going to have all errors pop in the interactive debugger,
and that should be what we want...
2017-07-04 01:55:27 +02:00
Dimitri Fontaine
3f7853491f Refactor PostgreSQL error handling.
The code was too complex and the transaction / connection handling wasn't
good enough, too many reconnections when a ROLLBACK; is all we need to be
able to continue our processing.

Also fix some stats counters about errors handled, and improve error message
by adding PostgreSQL explicitly, and the name of the table where the error
comes from.
2017-07-04 01:41:08 +02:00
Dimitri Fontaine
3eab88b144 Add a new "drop indexes" option for databases.
This allows using a combination of "data only, drop indexes" so that when
the target database already exists, pgloader will use the existing schema
and still DROP INDEX before loading the data and do the CREATE INDEX dance
in parallel and all at the end of it.

Also, as I could reproduce neither #539 (which is good, it's supposed to
be fixed now) nor #550 (that was open due to a regression): fixes #550.
2017-07-04 00:15:58 +02:00
Dimitri Fontaine
fc01c7acc9 Fix DBF handling of "empty" date strings.
Blind code a fix for an error when parsing empty date strings in a DBF file.
The small amount of information is surprising, I can't quite figure out
which input string can produce " - - " with the previous coding of
db3-date-to-pgsql-date.

Anyway, it seems easy enough to add some checks to a very optimistic
function and return nil when our checks aren't met.

Fixes #589, hopefully.
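
The kind of check described above might look like this in Python. It is a sketch under assumptions: the YYYYMMDD input layout is the usual DBF date format, the Python function name only echoes db3-date-to-pgsql-date, and the exact checks pgloader performs are not shown in this log.

```python
def db3_date_to_pgsql_date(value):
    """Convert a DBF 'YYYYMMDD' string to 'YYYY-MM-DD'.

    Returns None (the counterpart of nil) when the input is empty,
    blank, or not made of exactly eight ASCII digits.
    """
    if value is None:
        return None
    text = value.strip()
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return None
    return f"{text[0:4]}-{text[4:6]}-{text[6:8]}"


converted = db3_date_to_pgsql_date("20170630")
blank = db3_date_to_pgsql_date("        ")
```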
2017-06-30 13:37:36 +02:00
Dimitri Fontaine
1e436555a8 Refactor PostgreSQL conditions.
Use a single deftype postgresql-unavailable rather than copy/pasting the
same list of conditions in several places.
2017-06-29 14:08:52 +02:00
Dimitri Fontaine
60c1146e18 Assorted fixes.
Refrain from killing the Common Lisp image when doing interactive regression
testing if we typo'ed the regression test file name...
2017-06-29 12:35:40 +02:00
Dimitri Fontaine
cea82a6aa8 Reconnect to PostgreSQL in case of connection lost.
It may happen that PostgreSQL is restarted while pgloader is running, or
that for some other reason we lose the connection to the server, and in most
cases we know how to gracefully reconnect and retry, so just do so.

Fixes #546 initial report.
2017-06-29 12:34:34 +02:00
Dimitri Fontaine
f0d1f4ef8c Fix reduce usage with max function.
The (reduce #'max ...) requires an initial value to be provided, as the max
function wants at least 1 argument, as we can see here:

CL-USER> (handler-case (reduce #'max nil) (condition (e) (format t "~a" e)))
Too few arguments in call to #<Compiled-function MAX #x300000113C2F>:
0 arguments provided, at least 1 required.
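
For comparison, Python's functools.reduce has the same pitfall: called on an empty iterable without an initial value it raises TypeError. A minimal sketch of the fix, with a hypothetical helper name:

```python
from functools import reduce


def longest(names):
    """Length of the longest name, or 0 when there are no names.

    The third argument to reduce is the initial value; without it an
    empty input raises TypeError instead of returning a default.
    """
    return reduce(max, (len(name) for name in names), 0)


empty_case = longest([])
usual_case = longest(["id", "location", "name"])
```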
2017-06-28 16:37:27 +02:00
Dimitri Fontaine
17a63e18ed Review "main" error handling.
The "main" function only gets used at the command line, and errors were not
cleanly reported to the users. Mainly because I almost never get to play
with pgloader that way, preferring a load command file and the REPL
environment, but that's not even acceptable as an excuse.

Now the binary program should be able to exit cleanly in all situations. In
testing, it may happen in unexpected erroneous situations that we quit
before printing all the messages in the monitoring queue, but at least now
we quit cleanly and with a non-zero exit status.

Fix #583.
2017-06-28 16:36:08 +02:00
Dimitri Fontaine
0549e74f6d Implement multiple reader per table for MySQL.
Experiment with the idea of splitting the read work in several concurrent
threads, where each reader is reading portions of the target table, using a
WHERE id <= x and id > y clause in its SELECT query.

For this to kick in, a number of conditions need to be met, as described in
the documentation. The main interest might not be faster queries to overall
fetch the same data set, but better concurrency with as many readers as
writers and each couple its own dedicated queue.
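
Splitting an id range among readers can be sketched in Python as below. The function name, the inclusive bounds and the even split are assumptions made for illustration; the actual conditions and range computation are described in the pgloader documentation, not in this log.

```python
def id_ranges(min_id, max_id, readers):
    """Split [min_id, max_id] into at most `readers` inclusive ranges.

    Returns an empty list when the bounds are missing, which is what
    a query for min/max on an empty table would give.
    """
    if min_id is None or max_id is None:
        return []
    span = max_id - min_id + 1
    step = -(-span // readers)  # ceiling division
    ranges = []
    low = min_id
    while low <= max_id:
        high = min(low + step - 1, max_id)
        ranges.append((low, high))
        low = high + 1
    return ranges


# Each reader would then run its own SELECT on one range.
clauses = [f"WHERE id >= {low} AND id <= {high}"
           for low, high in id_ranges(1, 10, 3)]
```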
2017-06-28 16:23:18 +02:00
Dimitri Fontaine
6d66280fa5 Review parallelism and memory behavior.
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.

Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do the
data munging without slowing us down.

The most interesting factor here is the memory behavior of pgloader, which
seems more stable than before, and easier to cope with for SBCL's GC.

Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches and the count of items in the reader
queue is now a number of rows, not of batches of them.

Anyway, with this patch in I can't reproduce the following issues:

Fixes #337, Fixes #420.
2017-06-27 23:10:33 +02:00
Dimitri Fontaine
7f737a5f55 Reduce memory allocation in format-vector-row.
This function is used on every bit of data we send down to PostgreSQL, so I
have good hopes of reducing its memory allocation having an impact on
loading times. In particular for sizeable data sets.
2017-06-27 15:31:49 +02:00
Dimitri Fontaine
46d6f339df Add a user friendly message about what's happening...
Still in the abnormal termination case. pgloader might get stuck and if the
user knows it's waiting for threads to complete, they might be less worried
about the situation and opportunity to kill pgloader...
2017-06-27 11:19:07 +02:00
Dimitri Fontaine
2341ef195d Review abnormal termination code path.
In case of an exceptional condition leading to termination of the pgloader
program we tried to use log-message after the monitor should have been
closed. Also the 0.3s delay to let the latest messages out looks like a poor
design.

This patch attempts to remedy both situations: refrain from using a
closed down monitoring thread, and properly wait until it's done before
returning to the shell.

See #583.
2017-06-27 10:45:54 +02:00
Dimitri Fontaine
352f4adc8d Implement support for MySQL SET parameters.
pgloader had support for PostgreSQL SET parameters (gucs) from the
beginning, and in the same vein it might be necessary to tweak MySQL
connection parameters, and allow pgloader users to control them.

See #337 and #420 where net_read_timeout and net_write_timeout might need to
be set in order to be able to complete the migration, due to high volumes of
data being processed.
2017-06-27 10:00:47 +02:00
Dimitri Fontaine
e11ccf7bb7 Fix on-error-stop signaling.
To properly handle on-error-stop condition, make it a specific pgloader
condition with a specific handling behavior. In passing add some more log
messages for surprising conditions.

Fix #546.
2017-06-17 19:02:05 +02:00
Dimitri Fontaine
5faf8605ce Fix corner cases and how we log them.
In the prepare-pgsql-database method we were logging too many details, such
as DDL warnings on if-not-exists for successful queries. And those logs are
to be found in PostgreSQL server logs anyway.

Also fix trying to create or drop a "nil" schema.
2017-06-17 18:16:18 +02:00
Dimitri Fontaine
6c931975de Refrain from pushing too much logging traffic.
In this patch we hard-code some cases when we know the log message won't be
displayed anywhere so as to avoid sending it to the monitor thread. It
certainly is a modularity violation, but given the performance impact...
2017-06-17 18:12:33 +02:00
Dimitri Fontaine
422fab646a Typo-level fix.
Using ~s with extra quotes is quite disturbing at this place (logs only, but
still).
2017-06-17 17:23:44 +02:00
Dimitri Fontaine
7f55b21044 Improve support for http(s) resources.
The code used to take into account content-length HTTP header to load that
number of bytes in memory from the remote server. Not only is it better to
use a fixed size allocated-once buffer for that (now 4k), but also doing so
allows downloading content that you don't know the content-length of.

In passing tell the HTTP-URI parser rule that we also accept https:// as a
prefix, not just http://.

This allows running pgloader in such cases:

  $ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook

And it just works!
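
The fixed-size buffer approach can be sketched in Python with in-memory streams standing in for the HTTP response and the local file. The function name copy_stream is hypothetical and this is not pgloader's implementation; only the 4k buffer size comes from the commit message.

```python
import io


def copy_stream(source, target, buffer_size=4096):
    """Copy source to target through one buffer allocated up front.

    No total length is needed: the loop stops when a read returns
    zero bytes, so unknown content lengths are handled too.
    """
    buffer = bytearray(buffer_size)
    view = memoryview(buffer)
    total = 0
    while True:
        count = source.readinto(buffer)
        if not count:
            break
        target.write(view[:count])
        total += count
    return total


source = io.BytesIO(b"x" * 10000)
target = io.BytesIO()
copied = copy_stream(source, target)
```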
2017-06-17 16:48:15 +02:00
Dimitri Fontaine
b301aa9394 Review create-schemas default behavior.
Get back in line with what the documentation says, and also fix the case for
default MySQL migrations now that we target a PostgreSQL schema with the
same name as the MySQL database name.

Open question yet: should we also register the new schema on the search_path
by default?

  ALTER DATABASE ... SET search_path TO public, newschema, ...;

Is it more of a POLA violation to alter the search_path or to not do it?

Fix #582.
2017-06-16 09:01:18 +02:00
Dimitri Fontaine
de9b43c332 Add support for the MS-SYBDATE datatype.
Fixes #568, thanks to a test case being provided!
2017-06-14 21:02:00 +02:00
Adrian Vondendriesch
90a33b4b4c MSSQL: Add ON UPDATE / DELETE support for fkeys (#580)
The former query to find foreign key constraints doesn't consider ON
UPDATE and ON DELETE rules.
2017-06-13 12:03:51 +02:00
Adrian Vondendriesch
d966d37579 MSSQL: Fix Default value translation for getdate() (#576) (#577)
Currently the default value getdate() is replaced by 'now' which creates
statements like:
  CREATE TABLE ... created timestamp 'now' ...
which leads to table definitions like:
  default '2017-06-12 17:54:04.890129'::timestamp without time zone

This is because 'now' is evaluated when creating the table.

This commit fixes the issue by using CURRENT_TIMESTAMP as default
instead of 'now'.
2017-06-12 21:42:35 +02:00
Adrian Vondendriesch
1f3659941e MSSQL: fix typmod conversion for "max" typemods (#573)
The previous commit makes it possible to convert typemods for various
text types. In MSSQL it's possible to create a column like varchar(max).
Internally this is reported as varchar(-1) which results in a CREATE
TABLE statement that contains e.g. varchar(-1).

This patch drops the typemod if it's -1 (max).

It's based on Dimitri's patch, slightly modified by myself.
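
The rule amounts to something like the following Python sketch. The function name is hypothetical and the real change lives in pgloader's Common Lisp code.

```python
def format_type(type_name, typmod=None):
    """Render a column type, dropping the typemod when MSSQL reports
    -1, which is how varchar(max) shows up internally."""
    if typmod is None or typmod == -1:
        return type_name
    return f"{type_name}({typmod})"


unbounded = format_type("varchar", -1)
bounded = format_type("varchar", 20)
```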
2017-06-09 16:12:17 +02:00