When the intarray extension is installed, our PostgreSQL catalog query fails
because there is now more than one operator matching smallint[] <@ smallint[].
It is easy to avoid that problem by casting to integer[]; smallint is an
implementation detail here anyway.
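A minimal sketch of the disambiguating cast, not the actual pgloader query
(catalog column and values are illustrative):
    -- with intarray installed, <@ over smallint[] operands finds more than
    -- one candidate operator; casting to integer[] picks exactly one
    SELECT conname
      FROM pg_constraint
     WHERE conkey::integer[] <@ ARRAY[1, 2, 3];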
Fix #532.
Force double-quoting of object names in DROP INDEX commands by using the
format directive ~s. The names of the objects we are dropping usually come
from a PostgreSQL catalog, but might still meet conditions that force
quoting, such as starting with a number, as shown in #530.
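For instance, with a hypothetical index whose name begins with a digit:
    DROP INDEX "1_idx_user_tags";   -- a syntax error without the quotes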
This fix certainly means we will have to review all the DDL formatting we do
in pgloader and apply a single quoting method throughout. The simplest one is
of course to force-quote every object name in "", but it might not be the
smartest (what if some sources send already-quoted object names? that needs a
check), and it's certainly not the prettiest way to go about it: people
usually like to avoid unnecessary quotes, calling them clutter.
Fix #530.
Many options are now available to pgloader users, including shortcuts that
were not defined clearly enough. That could result in stupid things being
done at times.
In particular, when picking the "data only" option, indexes are not to be
dropped before loading the data, but pgloader would still try to create them
again at the end of the load, because the option that controls that behavior
defaults to true and was not impacted by the "data only" choice.
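For instance, with the following in the WITH clause of the load command,
pgloader should neither drop indexes before the load nor try to create them
again afterwards:
    with data only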
In this patch we review the logic and ensure it's applied in the same
fashion in the different phases of the database migration: preparation,
copying, rebuilding of indexes and completion of the database model.
See also 96b2af6b2a2b163f7e9e3c0ba744da1733b23979 where we began fixing
oddities but didn't go far enough.
In PostgreSQL it is possible at CREATE TABLE time to set some extra storage
parameters, the most useful of them in the context of pgloader being
FILLFACTOR. For the setting to be useful, it needs to be set at CREATE TABLE
time, before we load the data.
The BEFORE LOAD clause of the pgloader command allows running SQL scripts
before the load, and even before the creation of the target schema when
pgloader handles it, which is nice for other use cases but doesn't help here.
Here we implement a new `ALTER TABLE` rule that one can set in the pgloader
command in order to change storage parameters at CREATE TABLE time:
ALTER TABLE NAMES MATCHING ~/\./ SET (fillfactor='40')
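With such a rule in place, the generated DDL would look along these lines (a
sketch; table and columns hypothetical):
    CREATE TABLE public.foo
     (
       id   bigint,
       name text
     )
     WITH (fillfactor='40');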
Fix #516.
We cast the MS SQL "int" type to "integer" in PostgreSQL, so add an entry in
our type name mapping where the two are known to be equivalent, to avoid
WARNINGs about the situation in DATA ONLY loads.
When reading table names from PostgreSQL, we might find some that need
systematic quoting (such as names that begin with a digit). In that case,
when later comparing the catalogs to match source database table names
against PostgreSQL catalog table names, we need to unquote the PostgreSQL
table name we are using.
In passing, force the *identifier-case* to :none when reading object names
from the PostgreSQL catalogs.
Avoid double quoting the schema names when used in PostgreSQL catalog
queries, where the identifiers are used as literal values and need to be
single-quoted.
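That is, in a catalog query the schema name sits in literal position, not
identifier position (a sketch, schema name hypothetical):
    -- wrong: double quotes denote an identifier, not a literal
    --   WHERE nspname = "MySchema"
    -- right: literal values are single-quoted
    SELECT nspname FROM pg_namespace WHERE nspname = 'MySchema';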
Fix #476, again.
See #476 where it would have been helpful to see the PostgreSQL catalog
queries with `--log-min-messages sql` in the bug report. Also more
generally useful.
This sits between NOTICE and INFO, allowing for a complete log of the SQL
queries sent to the server while avoiding the very verbose traffic of the
DEBUG log level.
See #498.
A PostgreSQL index is always created in the same schema as the table it
is defined against, and the CREATE INDEX command doesn't accept schema
qualified index names.
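For example (names hypothetical):
    DROP INDEX public.foo_idx;                -- qualification accepted here
    CREATE INDEX foo_idx ON public.foo (bar); -- but not here: the index is
                                              -- created in foo's schema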
When the option "drop indexes" is in use in loading data from a file, we
collect the indexes from the PostgreSQL catalogs and then issue DROP
commands against them before the load, then CREATE commands when it's
done.
The CREATE is done in parallel, and we create an lparallel kernel for that.
The kernel must have a worker-count of at least 1, and we were not
considering the case of 0 indexes on the target table.
Fix #484.
We added some confusion about who's responsible for quoting the SQL object
names between src/utils/quoting.lisp and src/pgsql/pgsql-ddl.lisp, and as a
result some migrations from MySQL with the identifier case set to quote were
broken, as in #439.
To fix, remove any use of the format directive ~s in the PostgreSQL DDL
output methods: we consider that the quoting done by ~s is to be decided in
apply-identifier-case. We then use ~a instead of ~s.
Fix #439.
In cases where we have a WITH include drop option, we are generating lots of
SQL DROP statements. We may be running against an empty target database or in
other situations where the target object of the DROP command might not exist.
Add support for that case.
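One way to support that, sketched here with a hypothetical table name, is the
IF EXISTS spelling:
    DROP TABLE IF EXISTS foo CASCADE;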
Calling a -with-timing from within a with-stats-collection macro is redundant
and would have the numbers counted twice. That didn't happen in this case
because the stats label was manually copied, but one of the copies was then
borked by a typo.
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better performance of the data loading. Some of the indexes
are UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on
them in the PostgreSQL dependency tracking of the catalog.
We used to use the CASCADE option when dropping the indexes, which hides
a bug: if we exclude from the load tables with foreign keys pointing to
tables we target, then we would DROP those foreign keys because of the
CASCADE option, but fail to install them again at the end of the load.
To prevent that from happening, pgloader now queries the PostgreSQL pg_depend
system catalog to list the “missing” foreign keys and adds them to our
internal catalog representation, from which we know to DROP then CREATE the
SQL objects at the proper times.
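A much simplified sketch of that kind of pg_depend lookup (the real query has
more to it; the index name is hypothetical):
    -- list foreign key constraints that depend on a given unique index
    SELECT c.conname
      FROM pg_depend d
      JOIN pg_constraint c ON c.oid = d.objid
     WHERE d.classid = 'pg_constraint'::regclass
       AND d.refclassid = 'pg_class'::regclass
       AND d.refobjid = 'public.foo_pkey'::regclass
       AND c.contype = 'f';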
See #400 as this was an oversight in fixing this issue.
When we do have a condef (a constraint definition, in PostgreSQL catalog
slang), use it rather than trying to reinvent it from bits and pieces.
See #400, which it actually fixes now...
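The catalogs readily provide that definition, along these lines (a sketch):
    SELECT conname, pg_get_constraintdef(oid) AS condef
      FROM pg_constraint
     WHERE contype = 'f';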
Also known as the ORM case: it happens that other tools are used to create
the target schema. In that case pgloader's job is to fill in the existing
target tables with the data from the source tables.
We still focus on load speed and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before
running the COPY statements, and re-install the schema it found in the
target database once the data load is done.
This behavior is activated when using the “create no tables” option as
in the following test-case setup:
with create no tables, include drop, truncate
Fixes #400, for which I got a test-case to play with!
Replace the ad-hoc code that was used before in the load-from-file code path
with our full internal catalog representation, and adjust APIs to that end.
The goal is to use catalogs everywhere in the PostgreSQL target API, allowing
us to reason explicitly about source and target catalogs; see #400 for the
main use case.
First, add indexes and foreign keys to the list of objects supported by the
shared catalog facility, where they were previously only found in the pgsql
schema-specific package for historical reasons.
Then also add to our catalog internal structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.
Now that we have a proper and complete catalog, review the pgsql module DDL
output functions in terms of the catalog, and rewrite the schema creation
support so that it takes direct benefit of our internal catalog
representation.
In passing, clean up the code organisation of the pgsql target support module
to be easier to work with.
The next step consists of getting rid of src/pgsql/queries.lisp: that
facility should be replaced by fetching a target catalog the usual way,
thanks to the new src/pgsql/pgsql-schema.lisp file and its list-all-*
functions.
That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by other tools than pgloader,
that is when migrating with the help of an ORM. See #400 for details.
The other user-provided names (schema and table) were already quoted using
the quote_ident() PostgreSQL function, but the column names (attname in the
catalogs) were not.
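A sketch of the fix in catalog-query terms (simplified; table name
hypothetical):
    SELECT quote_ident(attname)    -- was a bare attname before
      FROM pg_attribute
     WHERE attrelid = 'public.foo'::regclass
       AND attnum > 0 AND NOT attisdropped;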
Blind attempt to fix #425.
By default, pgloader will start as many parallel CREATE INDEX commands
as the maximum number of indexes you have on any single table that takes
part in the load.
As this number might be so great as to exhaust the target PostgreSQL server
(e.g. maintenance_work_mem), we add an option to limit it to something
reasonable when the source schema isn't.
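The limit is exposed as a load-command option along these lines (a sketch):
    with max parallel create index = 4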
Fix #386, in which 150 indexes are found on a single source table.
It's always been possible to set application_name to anything really,
making it easier to follow the PostgreSQL queries made by pgloader.
Force that setting to 'pgloader' by default.
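The effect is the same as issuing the following on the connection (a sketch):
    SET application_name TO 'pgloader';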
Fix #387.
It turns out recent changes broke the SQLite index support (from adding
support for MS SQL partial/filtered indexes), so fix it by using the
pgsql-index structure rather than the specific sqlite-idx one.
In passing, improve detection of PRIMARY KEY indexes, which was still
lacking. This work showed that the introspection done by pgloader was wrong:
it's way crazier than we thought, so adjust the code to loop over PRAGMA
calls for each object we inspect.
While adding PRAGMA calls, add support for foreign keys too: we now have the
code infrastructure that makes it easy.
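For reference, the PRAGMA calls involved look like this (table and index
names hypothetical):
    PRAGMA index_list(foo);        -- indexes defined on table foo
    PRAGMA index_info(foo_idx);    -- columns of index foo_idx
    PRAGMA foreign_key_list(foo);  -- foreign keys declared on table foo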
Make it work on the second run, when the triggers and functions have already
been deployed, by doing the DROP of the function and trigger before we CREATE
the table, then CREATE them again: we need to split the list again.
The first error of a batch was lost somewhere in the recent changes. My
current best guess is that the rewrite of the copy-batch function made the
handler-bind form set up by the handling-pgsql-notices macro ineffective, but
I can't see why that is.
See #85.
We should not block any processing just because we can't parse an index. The
best we can do just tonight is to try creating the index without the filter;
ideally we would skip building the index entirely. That's for a later effort
though, it's running late here.
See #365.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.
The WHERE clause in both cases is limited to simple expressions over a single
row of the base table at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the clause
in PostgreSQL slang.
In passing, we transform the filter constants using the same
transformation functions as in the CAST rules, so that e.g. a MS SQL
bit(1) value that got transformed into a PostgreSQL boolean is properly
translated, as in the following example:
MS SQL: "([deleted]=(0))" (that's from the catalogs)
PostgreSQL: deleted = 'f'
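The rewritten clause then ends up in a partial index definition along these
lines (a sketch; index, table, and column names hypothetical):
    CREATE INDEX idx_foo_deleted ON foo (bar) WHERE deleted = 'f';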
Of course the parser is still very badly tested, let's see what happens
in the wild now.
(Should) Fix #365.
The implementation uses the dynamic binding *on-error-stop*, so it's also
available when pgloader is used as a Common Lisp library.
The (not-all-that-) recent changes made to the error handling make that
implementation straightforward enough, so let's finally do it!
Fix #85.
The PostgreSQL search_path allows multiple schemas, and one might even need
that to be able to reference types and other tables. Allow setting more than
one schema by using the fact that PostgreSQL schema names don't need to be
individually quoted, and by passing the exact content of the SET search_path
value down to PostgreSQL.
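The whole value then ends up in a single command, as in (schema names
hypothetical):
    SET search_path TO utils, public;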
Fix #359.
Have a pretty-print option where we try to be nice for the reader, and
don't use it in the CAST debug messages. Also allow working with the
real maximum length of column names rather than hardcoding 22 cols...
Having been given a test instance of a MS SQL database allows quickly fixing
a series of assorted bugs related to the schema handling of MS SQL databases.
As it's the only source with a proper notion of schema that pgloader
currently supports, it's no surprise we had them.
Fix #343. Fix #349. Fix #354.
The new ALTER TABLE facility allows acting on the tables found in the MySQL
database before the migration happens. In this patch the only actions
provided are RENAME TO and SET SCHEMA, which fixes #224.
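For instance (a sketch in the syntax of the new facility; names
hypothetical):
    ALTER TABLE NAMES MATCHING 'foo' RENAME TO 'bar',
    ALTER TABLE NAMES MATCHING ~/./ SET SCHEMA 'archive'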
In order to be able to provide the same option for MS SQL users, we will have
to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO ...) and
modify the internal schema-struct so that the schema slot of our table
instances holds a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
The PostgreSQL COPY protocol requires an explicit initialization phase
that may fail, and in this case the Postmodern driver transaction is
already dead, so there's no way we can even send ABORT to it.
Review the error handling of our copy-batch function to cope with that
fact, and add some logging of non-retryable errors we may have.
Also improve the thread error reporting when using a binary image, from where
it might be difficult to open an interactive debugger, while still keeping
the full-blown Common Lisp debugging experience for the project developers.
Add a test case for a missing column as in issue #339.
Fix #339, see #337.
The decision to use lots of different packages in pgloader has quite strong
downsides at times, and the manual management of dependencies is one of them,
in particular how to avoid circular ones.
Rather than trying hard to have PostgreSQL fully qualify the index name with
tricks around the search_path setting at the time ::regclass is executed,
simply join against pg_namespace to retrieve the schema into a new slot of
our pgsql-index structure, so that we can then reuse it when needed.
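A simplified sketch of that join:
    -- fetch each index together with the schema it lives in
    SELECT n.nspname, c.relname
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind = 'i';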
Also add a test case for the scenario, including both a UNIQUE
constraint and a classic index, because the DROP and CREATE/ALTER
instructions differ.
More than the syntax and API tweaks, this patch also makes it so that a
multi-file specification (using e.g. ALL FILENAMES IN DIRECTORY) can be
loaded with several files of the group in parallel.
To that effect, tweak again the md-connection and md-copy
implementations.
Have PostgreSQL always fully qualify the index related objects and SQL
definition statements when fetching the list of indexes of a table, by
playing with an empty search_path.
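One way to get that effect is the trick pg_dump itself uses (a sketch):
    SELECT pg_catalog.set_config('search_path', '', false);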
Also improve the whole index creation by passing the table object as the
context from which to derive the table-name, so that schema-qualified tables
are taken into account properly.
Thanks to the Common Lisp character data type, it's easy for pgloader to
enforce always speaking to PostgreSQL in utf-8, and that's what has been done
from the beginning, actually.
Now, without any good reason, the first example of a SET clause added to the
docs was about how to set client_encoding, which should NOT be done.
Fix that at the user level by removing the bad example from the docs and
adding a WARNING whenever client_encoding is set to a known bad value. It's a
WARNING because we then simply force 'utf-8' anyway.
Also, completely review the format-vector-row function to avoid doing double
work with the Postmodern facilities we piggyback on: that was only done
halfway through, and the utf-8 conversion was actually done twice.
In order to share more code between the different source types, finally have
a go at the quite horrible mess of anonymous data structures floating around.
Having catalog and schema instances not only allows for code cleanup, but
will also allow implementing some bug fixes and wishlist items, such as
mapping tables from one schema to another.
Also, supporting database sources with a notion of "schema" (in between
"catalog" and "table") should get easier, including getting on par with MySQL
in the MS SQL support (materialized views have been asked for already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-databases methods for the database source
types, so that they all use a fetch-metadata generic function, a
prepare-pgsql-database generic function, and a complete-pgsql-database
generic function. Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn calls the specialized list-all-columns and friends
implementations. Once the catalog has been fetched, an explicit CAST call is
then needed before we can continue.
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
Now that we can have several threads doing COPY, each of them needs to know
about the *pgsql-reserved-keywords* list. Make sure that's the case, and in
passing fix some call sites of apply-identifier-case.
Also, more disturbingly, fix the code so that TRUNCATE is called from the
main thread before giving control to the COPY threads, rather than having two
concurrent threads doing the TRUNCATE twice. It's rather strange that we got
no complaints from the field on that part...
It turns out the summary write times included time spent waiting for batches
to be ready, which isn't fair to the PostgreSQL COPY implementation, and
moreover doesn't help in figuring out the bottlenecks...
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.
Also review the organisation of responsibilities in the code, allowing us to
move queue.lisp into utils/batch.lisp, renaming it as its scope has been
reduced to only caring about preparing batches.
This came out of trying to have multiple workers concurrently process the
batches from the reader and feed the hardcoded 2 COPY workers, but that
failed for multiple reasons. All that's left as of now is this cleanup, which
seems to be on the faster side of things, which is always good.
It's been proven by Andres Freund's benchmarks that the best number of
parallel COPY threads concurrently active against a single table is 2, as of
PostgreSQL's current development version (up to 9.5 stable; it might still
apply to 9.6 depending on how we solve the problem).
Henceforth, hardcode 2 COPY threads in pgloader. This also has the advantage
that in the presence of lots of bad rows, we should sustain a better
throughput and not stall completely.
Finally, also improve the MySQL setup to use 8 threads by default, that is,
to be able to load two tables concurrently, each with 2 COPY workers, a
reader, and a transformer thread.
It's all still experimental as far as performance goes; the next patches
should bring the capability to configure the desired parallelism from the
command line and the load command, though.
Also, other source types will want to benefit from the same capabilities; it
just happens that it's easier to play with MySQL first for various reasons.