Prior to this patch, pgloader didn't care which schema it created extra
types in. Extra types are mainly the ENUM and SET support from MySQL. Now
pgloader creates those extra PostgreSQL ENUM types in the same schema as
the table using them, which is a sounder default.
This change was long overdue. Ideally we would use something like
Clojure's YeSQL library, but it seems the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...
So this patch introduces a URL abstraction built on top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as
(sql "pgsql/list-all-columns.sql")
in the source code directly.
So for now the templating system is CL's format language. It is still an
improvement over embedded strings. Again, one step at a time.
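A minimal sketch of the idea, using hypothetical names rather than
pgloader's actual internals: a hash table keyed by the URL-like path, a
registration function that reads each .sql file at build time, and a
lookup function that hands the query text to CL's format.

    (defvar *sql-queries* (make-hash-table :test 'equal)
      "Map from URL-like keys to SQL query text.")

    (defun register-sql-file (url pathname)
      "Read PATHNAME and store its contents under URL,
       e.g. \"pgsql/list-all-columns.sql\"."
      (setf (gethash url *sql-queries*) (uiop:read-file-string pathname)))

    (defun sql (url &rest args)
      "Return the query registered under URL, filled in with CL:FORMAT,
       which is the templating system for now."
      (apply #'format nil (gethash url *sql-queries*) args))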
When pgloader fetches the index list from a source database, it doesn't
fetch information about access methods for the indexes: I don't even know if
the overlap between index access methods from one RDBMS to another covers
more than just btree...
It could happen, though, that MySQL indexes a "geometry" column. This datatype is
converted automatically to "point" by pgloader, which is good. But the index
creation would fail with the following error message:
Database error 42704: data type point has no default operator class for access method "btree"
In this patch when setting up the target schema we issue a PostgreSQL
catalog query to dynamically list those datatypes without btree support and
fetch their opclasses, with a hard-coded preference for GiST, then GIN, so
as to be able to automatically use the proper access method when btree isn't
available. And now pgloader transparently issues the proper statement:
CREATE INDEX idx_168468_idx_location ON pagila.address USING gist(location);
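The catalog query looks something like the following sketch (not
pgloader's exact query): for every data type lacking a default btree
operator class, pick a default operator class from another access
method, preferring gist, then gin.

    (defparameter *sql-non-btree-opclasses* "
    select t.typname,
           (select am.amname
              from pg_opclass oc
              join pg_am am on am.oid = oc.opcmethod
             where oc.opcintype = t.oid and oc.opcdefault
             order by case am.amname when 'gist' then 0
                                     when 'gin'  then 1
                                     else 2 end
             limit 1) as amname
      from pg_type t
     where not exists (select 1
                         from pg_opclass oc
                         join pg_am am on am.oid = oc.opcmethod
                        where am.amname = 'btree'
                          and oc.opcintype = t.oid
                          and oc.opcdefault)"
      "Sketch of a catalog query listing, for each type without a default
       btree opclass, a preferred alternative access method.")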
Currently this exploration is limited to indexes with a single column. To
implement the general case we would need a more complex lookup: we would
have to find the intersection of all the supported access methods for all
involved columns.
Of course we might need to do that someday. One step at a time is plenty
good enough for now, though.
In cases where pgloader needs to build a new identifier from existing
ones (mainly for renaming indexes, because they are unique per-table in the
source database and unique per-schema in PostgreSQL), and we compose the new
name from already quoted strings, pgloader was doing the wrong thing.
Fix that with a build-identifier function that unquotes the parts and
then properly re-quotes (if needed) the new identifier.
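A minimal sketch of that behaviour, simplified compared to pgloader's
actual quoting machinery (it assumes double-quote quoting and a fixed
"_" separator):

    (defun unquote (name)
      "Strip the surrounding double-quotes from NAME, when present."
      (if (and (< 1 (length name))
               (char= #\" (char name 0))
               (char= #\" (char name (1- (length name)))))
          (subseq name 1 (1- (length name)))
          name))

    (defun build-identifier (&rest parts)
      "Join PARTS into a new identifier, unquoting each part first and
       re-quoting the result when any part was quoted."
      (let ((quoted (some (lambda (p) (and (plusp (length p))
                                           (char= #\" (char p 0))))
                          parts))
            (name   (format nil "~{~a~^_~}" (mapcar #'unquote parts))))
        (if quoted (format nil "\"~a\"" name) name)))

    ;; (build-identifier "idx" "\"MyTable\"" "\"Location\"")
    ;; => "\"idx_MyTable_Location\""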
The (reduce #'max ...) call requires an initial value to be provided,
because the max function wants at least one argument, as we can see here:
CL-USER> (handler-case (reduce #'max nil) (condition (e) (format t "~a" e)))
Too few arguments in call to #<Compiled-function MAX #x300000113C2F>:
0 arguments provided, at least 1 required.
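The fix is to provide the initial value explicitly:

    CL-USER> (reduce #'max nil :initial-value 0)
    0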
The "main" function only gets used at the command line, and errors were
not cleanly reported to users, mainly because I almost never get to play
with pgloader that way, preferring a load command file and the REPL
environment. But that's not even acceptable as an excuse.
Now the binary program should be able to exit cleanly in all situations. In
testing, it may happen in unexpected error situations that we quit
before printing all the messages in the monitoring queue, but at least
now we quit cleanly and with a non-zero exit status.
Fix#583.
Experiment with the idea of splitting the read work across several
concurrent threads, where each reader reads a portion of the target
table, using a WHERE id <= x AND id > y clause in its SELECT query.
For this to kick in, a number of conditions need to be met, as described
in the documentation. The main interest might not be faster queries to
fetch the same overall data set, but better concurrency, with as many
readers as writers and each couple having its own dedicated queue.
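A minimal sketch of how such a split could translate into WHERE clauses,
with hypothetical names and numbers (the actual conditions and syntax
live in the pgloader sources and documentation):

    (defun partition-where-clauses (column min-id max-id readers)
      "Split [MIN-ID, MAX-ID] into READERS ranges and return the WHERE
       clause each concurrent reader would use in its SELECT query."
      (loop :with step := (ceiling (- max-id min-id) readers)
            :for lo := (1- min-id) :then hi
            :for hi := (min max-id (+ lo step))
            :repeat readers
            :collect (format nil "~a > ~d and ~a <= ~d" column lo column hi)))

    ;; (partition-where-clauses "id" 1 1000000 4)
    ;; => ("id > 0 and id <= 250000"      "id > 250000 and id <= 500000"
    ;;     "id > 500000 and id <= 750000" "id > 750000 and id <= 1000000")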
The previous patch made format-vector-row allocate its memory in one go
rather than byte after byte with vector-push-extend. In this patch we review
our usage of batches and parallelism.
Now the reader pushes each row directly to the lparallel queue and writers
concurrently consume from it, cook batches in COPY format, and then send
that chunk of data down to PostgreSQL. When looking at runtime profiles, the
time spent writing in PostgreSQL is a fraction of the time spent reading
from MySQL, so we consider that the writing thread has enough time to do
the data munging without slowing us down.
The most interesting factor here is the memory behavior of pgloader,
which seems more stable than before, and easier to cope with for SBCL's
GC.
Note that batch concurrency is no more, replaced by prefetch rows: the
reader thread no longer builds batches, and the count of items in the
reader queue is now a number of rows, not of batches of them.
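A minimal sketch of the new hand-off, assuming the lparallel.queue and
bordeaux-threads APIs; read-row and send-batch are hypothetical
callbacks standing in for the source driver and the COPY machinery:

    (defvar *prefetch-rows* 25000
      "How many raw rows the reader may push ahead of the writer
       (arbitrary value for this sketch).")

    (defun copy-from-queue (read-row send-batch &key (batch-size 5000))
      "READ-ROW returns the next source row or nil; SEND-BATCH ships a
       list of rows to PostgreSQL in COPY format."
      (let ((queue (lparallel.queue:make-queue :fixed-capacity *prefetch-rows*)))
        ;; reader thread: push raw rows, then a sentinel
        (bt:make-thread
         (lambda ()
           (loop :for row := (funcall read-row)
                 :while row
                 :do (lparallel.queue:push-queue row queue)
                 :finally (lparallel.queue:push-queue :eof queue)))
         :name "reader")
        ;; writer (here, the current thread): cook batches and send them
        (loop :with batch := nil :with count := 0
              :for row := (lparallel.queue:pop-queue queue)
              :until (eq row :eof)
              :do (push row batch)
                  (when (= (incf count) batch-size)
                    (funcall send-batch (nreverse batch))
                    (setf batch nil count 0))
              :finally (when batch (funcall send-batch (nreverse batch))))))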
Anyway, with this patch in place I can't reproduce the following issues:
Fixes#337, Fixes#420.
In case of an exceptional condition leading to termination of the pgloader
program we tried to use log-message after the monitor should have been
closed. Also, the 0.3s delay to let the last messages out looks like
poor design.
This patch attempts to remedy both problems: refrain from using a
closed down monitoring thread, and properly wait until it's done before
returning to the shell.
See #583.
pgloader has had support for PostgreSQL SET parameters (GUCs) from the
beginning, and in the same vein it might be necessary to tweak MySQL
connection parameters and allow pgloader users to control them.
See #337 and #420 where net_read_timeout and net_write_timeout might need to
be set in order to be able to complete the migration, due to high volumes of
data being processed.
In this patch we hard-code some cases where we know the log message won't
be displayed anywhere, so as to avoid sending it to the monitor thread. It
certainly is a modularity violation, but given the performance impact...
The code used to take the content-length HTTP header into account to
load that number of bytes in memory from the remote server. Not only is
it better to use a fixed-size, allocated-once buffer for that (now 4k),
but doing so also allows downloading content whose content-length you
don't know.
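A minimal sketch of the buffering loop, assuming a binary input stream
handed over by the HTTP library (the function name is illustrative, not
pgloader's):

    (defun save-http-response (stream pathname &key (buffer-size 4096))
      "Copy STREAM to PATHNAME 4k at a time; works whether or not the
       server sent a content-length header."
      (with-open-file (out pathname :direction :output
                                    :element-type '(unsigned-byte 8)
                                    :if-exists :supersede)
        (loop :with buffer := (make-array buffer-size
                                          :element-type '(unsigned-byte 8))
              :for bytes := (read-sequence buffer stream)
              :while (plusp bytes)
              :do (write-sequence buffer out :end bytes))))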
In passing tell the HTTP-URI parser rule that we also accept https:// as a
prefix, not just http://.
This allows running pgloader in such cases:
$ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook
And it just works!
In order to support custom SQL files with several queries and psql-like
advanced features such as \i, we have our own internal SQL parser in
pgloader. The PostgreSQL variant of SQL is pretty complex and allows
dollar-quoting and other subtleties that we need to take care of.
Here we fix the case where a dollar sign ($) appears inside single-quoted
text (such as a regexp): it is then part of the quoted text being read,
not the start of a new dollar-quoting tag.
In passing we also fix the reading of double-quoted identifiers, even
when they contain a dollar sign. After all, the following is fully
supported by PostgreSQL:
create table dollar("$id" serial, foo text);
select "$id", foo from dollar;
Fix#561.
The concurrent nature of pgloader made it non-obvious where to implement
the timers properly, and as a result the tracking of how long it took to
actually transfer the data was... just wrong.
Rather than trying to measure the time spent in any particular piece of the
code, we now emit "start" and "stop" stats messages to the monitor thread at
the right places (which are way easier to find, in the worker threads) and
have the monitor figure out how long it really took.
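A minimal sketch of the message flow, with hypothetical message shapes
and function names (send-to-monitor and copy-table-data are
placeholders):

    (defun timed-copy-table (table)
      "Worker side: only report events, never compute durations here."
      (send-to-monitor (list :start table (get-internal-real-time)))
      (unwind-protect
           (copy-table-data table)
        (send-to-monitor (list :stop table (get-internal-real-time)))))

    (defun monitor-handle-timing (message start-times)
      "Monitor side: START-TIMES maps a table to its :start timestamp;
       on :stop, return the elapsed time in seconds."
      (destructuring-bind (kind table time) message
        (ecase kind
          (:start (setf (gethash table start-times) time))
          (:stop  (/ (- time (gethash table start-times))
                     internal-time-units-per-second)))))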
Fix#506.
Now that we have fixed the output of the per-table total timing, we can
show only that timing by default. With more verbosity pgloader will add
the extra columns, and in computer-oriented formats (json, csv, copy) all
the details are always provided, of course.
See #506.
We have to pay attention to the fact that column names in MS SQL don't
follow the same rules as in PostgreSQL and may, e.g., begin with numbers.
Apply the identifier case and naming rules to index column names too.
In PostgreSQL it is possible at CREATE TABLE time to set some extra storage
parameters, the most useful of them in the context of pgloader being the
FILLFACTOR. For the setting to be useful, it needs to be set at
CREATE TABLE time, before we load the data.
The BEFORE LOAD clause of the pgloader command allows running SQL
scripts that will be executed before the load, and even before the
creation of the target schema when pgloader does that, which is nice for
other use cases.
Here we implement a new `ALTER TABLE` rule that one can set in the pgloader
command in order to change storage parameters at CREATE TABLE time:
ALTER TABLE NAMES MATCHING ~/\./ SET (fillfactor='40')
Fix#516.
Now that we have a proper flush system for reporting the summary at the
proper time (see 7c5396f0975be405910d66f4b5aedc89acd75c1d), refrain from
also taking care of the reporting when stopping the monitor.
Adapt the regression driver code to flush the summary after loading the
expected data, which also provides better output.
Previously, when the summary output was sent to a file, stopping the
monitor would also create a backup file and replace our summary with an
empty new file...
Fixes#499.
This sits between NOTICE and INFO, allowing for a complete log of
the SQL queries sent to the server while avoiding the very verbose
traffic of the DEBUG log level.
See #498.
Make it so that fatal errors are printed only once, and when possible
included in the usual log format as handled by our monitoring thread.
Also, improve error and summary reporting when we load from several
sources on the same command line.
All this work was triggered by an edge case where the OS return value
of the pgloader command was 0 (zero, success) even though the file given
on the command line did not exist.
Fixes#486.
We added some confusion about who's responsible for quoting the SQL
object names between src/utils/quoting.lisp and src/pgsql/pgsql-ddl.lisp,
and as a result some migrations from MySQL with identifier case set to
quote were broken, as in #439.
To fix, remove any use of the format directive ~s in the PostgreSQL DDL
output methods: quoting is a decision that belongs to
apply-identifier-case. We then use ~a instead of ~s.
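The problem is easy to see at the REPL: once apply-identifier-case has
already returned a quoted name, ~s quotes it a second time, while ~a
leaves it alone:

    CL-USER> (format t "DROP TABLE ~s;~%DROP TABLE ~a;~%" "\"MyTable\"" "\"MyTable\"")
    DROP TABLE "\"MyTable\"";
    DROP TABLE "MyTable";
    NIL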
Fix#439.
In the MySQL source we have explicit support for both string equality
and regexps in the INCLUDING and EXCLUDING clauses. This got broken when
the code was moved to be shared with the ALTER TABLE implementation,
because we were no longer using the type system in the same way in all
places.
To fix, create new abstractions for strings and regexps and use those
new structs in the proper way (thanks to defstruct and CLOS).
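A minimal sketch of that abstraction, with hypothetical struct and
function names (pgloader's real ones differ), assuming cl-ppcre for the
regexp side:

    (defstruct string-match target)
    (defstruct regex-match  pattern)

    (defun filter-matches-p (filter name)
      "Dispatch on the struct type rather than guessing from a raw value."
      (etypecase filter
        (string-match (string-equal (string-match-target filter) name))
        (regex-match  (and (cl-ppcre:scan (regex-match-pattern filter) name) t))))

    ;; (filter-matches-p (make-string-match :target "film") "film")       => T
    ;; (filter-matches-p (make-regex-match :pattern "^film") "film_actor") => T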
Fixes#441.
In cases where we have a WITH include drop option, we are generating
lots of SQL DROP statements. We may be running against an empty target
database, or in other situations where the target object of the DROP
command might not exist. Add support for that case.
The internal catalog representation is deeply recursive in order to
make it easy to traverse the catalog both downwards (catalog to schema
to tables) and upwards (table to its schema to its catalog).
As a consequence we need to set *print-circle* to non-nil when we're
going to log the catalogs, so turn it on before generating the log
messages.
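Without it, printing a circular structure would loop forever; with it,
the printer falls back to the #n= / #n# reader syntax, which stays both
readable and re-readable:

    CL-USER> (let ((list (list 1 2 3)))
               (setf (cdr (last list)) list)    ; make the list circular
               (let ((*print-circle* t))
                 (format nil "~a" list)))
    "#1=(1 2 3 . #1#)"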
While at it, add logging of such catalogs in the :data log verbosity
mode. The catalog output is very verbose, but it's easy to copy/paste it
from a bug report into a live object we can inspect in the REPL, thanks
to Common Lisp's notion of a reader and readable printing!
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better performance of the data loading. Some of the indexes
are UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on
them in the PostgreSQL dependency tracking of the catalog.
We used to use the CASCADE option when dropping the indexes, which was
hiding a bug: if tables with foreign keys pointing to the tables we
target are excluded from the load, then we would DROP those foreign keys
because of the CASCADE option, but fail to install them again at the end
of the load.
To prevent that from happening, pgloader now queries the PostgreSQL
pg_depend system catalog to list the “missing” foreign keys and adds them
to our internal catalog representation, from which we know to DROP then
CREATE the SQL objects at the proper times.
See #400 as this was an oversight in fixing this issue.
First, add indexes and foreign keys to the list of objects supported by
the shared catalog facility, where they were previously only found in
the pgsql schema-specific package, for historical reasons.
Then also add to our catalog internal structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.
Now that we have a proper and complete catalog, review the pgsql module
DDL output function in terms of the catalog and rewrite the schema
creation support so that it takes direct benefit from our internal
catalog representation.
In passing, clean up the code organisation of the pgsql target support
module to be easier to work with.
The next step consists of getting rid of src/pgsql/queries.lisp: this
facility should be replaced by the usage of a target catalog that we
fetch the usual way, thanks to the new src/pgsql/pgsql-schema.lisp file
and list-all-* functions.
That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by other tools than pgloader,
that is when migrating with the help of an ORM. See #400 for details.
It's possible that in MySQL a foreign key constraint definition points
to a non-existing table. In such a case, issue an error message and
refrain from later trying to reinstall the faulty foreign key
definition.
The lack of error handling at this point apparently led to a frozen
instance of pgloader, I think because it could not display the
interactive debugger at the point where the error occurred.
See #328, also #337 that might be fixed here.
The max function requires at least 1 argument to be given, and in the
case where we have no table to load it then fails badly, as shown here:
CL-USER> (handler-case
             (reduce #'max nil)
           (condition (c)
             (format nil "~a" c)))
"invalid number of arguments: 0"
Of course Common Lisp comes with a very easy way around that problem:
CL-USER> (reduce #'max nil :initial-value 0)
0
Fix#381.
This was broken by a recent commit that forced the internal table
representation to always be an instance of the table structure, which
wasn't yet true for regression testing.
In passing, re-indent a large portion of the function, which accounts
for most of the diff.
The function needs to return a string to be added to the COPY stream,
and we still need to make sure that whatever is given here looks like an
integer. Given the very dynamic nature of data types in SQLite, the
integer-to-string function was already a default, but somehow its fixed
version had failed to be published before.
The newid() function seems to be equivalent to the newsequentialid() one
if I'm to believe issue #204, so let's just add that assumption in the
code.
Fix#204.
Once more we can't use an aggregate over a text column in MS SQL to
build the index definition from its catalog structure, so we have to do
that in the lisp part of the code.
Multi-column indexes are now supported, but filtered indexes still are a
problem: the WHERE clause in MS SQL is not compatible with the
PostgreSQL syntax (because of [names] and type casting).
For example we cast MS SQL bit to PostgreSQL boolean, so
WHERE ([deleted]=(0))
should be translated to
WHERE not deleted
And the code to do that is not included yet.
The following documentation page offers more examples of WHERE
expressions we might want to support:
https://technet.microsoft.com/en-us/library/cc280372.aspx
WHERE EndDate IS NOT NULL
    AND ComponentID = 5
    AND StartDate > '01/01/2008'

WHERE EndDate IN ('20000825', '20000908', '20000918')
It might be worth automating the translation to PostgreSQL syntax and
operators, but it's not done in this patch.
See #365, where the created index will now be as follows, which is a
problem because it is UNIQUE: some existing data won't reload fine.
CREATE UNIQUE INDEX idx_<oid>_foo_name_unique ON dbo.foo (name, type, deleted);
Have a pretty-print option where we try to be nice to the reader, and
don't use it in the CAST debug messages. Also allow working with the
real maximum length of column names rather than hard-coding 22 cols...
Having been given a test instance of an MS SQL database made it possible
to quickly fix a series of assorted bugs related to schema handling of
MS SQL databases. As it's the only source with a proper notion of schema
that pgloader currently supports, it's no surprise we had them.
Fix#343. Fix#349. Fix#354.
The new ALTER TABLE facility allows acting on tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes#224.
In order to be able to provide the same option for MS SQL users, we will
have to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO
...) and modify the internal schema-struct so that the schema slot of
our table instances is a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
We target CURRENT_TIMESTAMP as the PostgreSQL default value for columns
even when it was different before, on the grounds that the type casting
in PostgreSQL does the job, as in the following example:
pgloader# create table test_ts(ts timestamptz(6) not null default CURRENT_TIMESTAMP);
CREATE TABLE
pgloader# insert into test_ts VALUES(DEFAULT);
INSERT 0 1
pgloader# table test_ts;
              ts
-------------------------------
 2016-02-24 18:32:22.820477+01
(1 row)
pgloader# drop table test_ts;
DROP TABLE
pgloader# create table test_ts(ts timestamptz(0) not null default CURRENT_TIMESTAMP);
CREATE TABLE
pgloader# insert into test_ts VALUES(DEFAULT);
INSERT 0 1
pgloader# table test_ts;
           ts
------------------------
 2016-02-24 18:32:44+01
(1 row)
Fix#341.
It turns out that the MySQL catalog always stores default values as
strings, even when the column itself is of type bytea. In some cases,
it's then impossible to transform the expected bytea from a string.
In passing, move some code around to fix dependencies and make it
possible to issue log warnings from the default value printing code.
The decision to use lots of different packages in pgloader has quite
strong downsides at times, and the manual management of dependencies is
one of them, in particular how to avoid circular ones.
On the theory that it's a better service to the user to refuse to do
anything at all rather than ignore his/her commands, print out FATAL
errors when options are used that are incompatible with a load command
file.
See #327 for a case where this did happen.
In passing, tweak our report code to avoid printing the footer when we
didn't print anything at all previously.
In issue #328 the --debug level output is not helpful because of an
encoding error in the logfile. Let's see about forcing the log file
external format to utf-8 then.
Thanks to a reproducible test case we can see that MySQL's default for a
varbinary column is an empty string, so tweak the transform function
byte-vector-to-bytea in order to cope with that.
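A minimal sketch of the kind of tweak involved, not pgloader's actual
transform: render the value in PostgreSQL's hex format for bytea and
accept MySQL's empty-string default instead of erroring on it.

    (defun byte-vector-to-bytea (value)
      "Render VALUE as a PostgreSQL hex-format bytea value; an empty
       string (MySQL's default for varbinary columns) gives an empty bytea."
      (when value
        (let ((bytes (if (stringp value)
                         (map 'vector #'char-code value)
                         value)))
          (with-output-to-string (s)
            (write-string "\\x" s)
            (loop :for byte :across bytes
                  :do (format s "~2,'0x" byte))))))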