Implement a generic-function API to discover the source database schema and
populate pgloader's internal version of the catalogs. Cut three copies of
roughly the same code path down to a single shared one, thanks to applying
some amount of OOP to the code.
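A minimal sketch of the idea (the class and helper names are illustrative,
not the exact ones from the tree):

(defgeneric fetch-metadata (copy catalog)
  (:documentation
   "Populate CATALOG with the schema discovered in the COPY source."))

(defmethod fetch-metadata ((copy copy-mysql) catalog)
  ;; each source type only provides its specific catalog queries,
  ;; the shared code path drives the rest of the discovery
  (list-all-columns copy catalog)
  (list-all-indexes copy catalog)
  catalog)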
Fixed file formats might contain a header line with column names and a hint
about the size of each column. While it might be a long shot that we can
actually use that as a proper fixed-format specification, this patch
implements a guess mode that also outputs the parsed header.
When the parsing is wrong in some level of detail, that output might still
be a good start: copy/paste it from the command output and go from there.
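As a hedged illustration (the header line and offsets are invented), given a
file whose first line is

id         name       created_at

the guess mode would print a fixed-format specification to copy/paste and
adjust, along the lines of

(
   id         from  0 for 11,
   name       from 11 for 11,
   created_at from 22 for 10
)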
Fixes #958.
The casting support for DB3 was hand-crafted and didn't get upgraded to use
the current CAST grammar and facilities, for no other reason than lack of
time and interest. It so happens that implementing it now fixes two bug
reports.
Bug #938 is about the conversion defaulting to "not null" columns, and
that's due to the usage of the internal pgloader catalogs, where the target
column's nullable field is NIL by default, which doesn't make much sense.
With support for user-defined casting rules, the default is nullable
columns, so that's kind of a free fix.
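As a hedged example, a user-defined casting rule for a DB3 source could now
read (the transform function name is illustrative):

CAST type logical to boolean,
     type date to date using db3-date-to-pgsql-date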
Fixes #927.
Fixes #938.
The idea is for pgloader to tweak the schema from a description of the
sharding model, the distribute clause. Here's an example of such a clause:
distribute company using id
distribute campaign using company_id
distribute ads using company_id from campaign
distribute clicks using company_id from ads, campaign
Given such commands, pgloader adds the distribution key to the table when
needed, to the primary key definition of the table, and also to the foreign
keys that point to the changed primary key.
Then when SELECTing the data from the source database, the idea is for
pgloader to automatically JOIN the base table with the table where the
distribution key is to be found, in case it was just added to the schema.
Finally, pgloader also calls the following Citus commands:
SELECT create_distributed_table('company', 'id');
SELECT create_distributed_table('campaign', 'company_id');
SELECT create_distributed_table('ads', 'company_id');
SELECT create_distributed_table('clicks', 'company_id');
It's now possible to use pgloader to migrate from PostgreSQL to PostgreSQL.
That might be useful for several reasons, including applying user-defined
cast rules at COPY time, or just moving from one hosted solution to another.
Also add some schema-level support by disabling our usual index and
constraint support when the target is Redshift, because it doesn't support
those parts of SQL.
The S3 parameters are read from either the process environment variables or
from the AWS configuration files in ~/.aws.
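For instance, assuming the standard AWS settings apply, either from the
environment:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

or from the ~/.aws/credentials file:

[default]
aws_access_key_id = ...
aws_secret_access_key = ...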
Redshift looks like a very old PostgreSQL (8.0.2) with some extra features
and a very limited selection of data types. In this patch we parse the
PostgreSQL version() function output and automatically determine if we're
connected to Redshift.
When connected to Redshift, we then dumb-down our target catalogs to the
subset of data types that Redshift actually does support.
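A minimal sketch of the detection, given the output of SELECT version() as a
string:

(defun redshift-connection-p (version-string)
  "Redshift advertises itself in the version() output."
  (and (search "Redshift" version-string) t))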
Also, some catalog queries can't be done in Redshift, and 8.0 didn't have a
fully compliant VALUES statement, so we use a temporary table in places
where pgloader used to use SELECT ... FROM (VALUES(...)).
COPYing data to Redshift isn't possible with just this set of changes,
because Redshift also doesn't support the COPY FROM STDIN form. COPY sources
are limited, and another patch will have to be cooked to prepare the data
from pgloader in a format and location that Redshift knows how to handle.
At least, it's possible to migrate a database schema to Redshift already.
In a previous commit we re-used the package name pgloader.copy for the now
separated implementation of the COPY protocol, but this package was already
in use for the implementation of the COPY file format as a pgloader source.
Oops.
And CCL was happily doing its magic anyway, so I had been blind to the
problem.
To fix this, rename the new package pgloader.pgcopy, and to avoid having to
deal with other problems of the same kind in the future, rename every source
package pgloader.source.<format>, so that we now have pgloader.source.copy
and pgloader.pgcopy, two visibly different packages to deal with.
This light refactoring came with a challenge though. The split between the
pgloader.sources API and the rest of the code involved some circular
dependencies in the namespaces. CL is pretty flexible here because it can
reload code definitions at runtime, but it was still a mess. To untangle it,
implement a new namespace, the pgloader.load package, where we can use the
pgloader.sources API and the pgloader.connection and pgloader.pgsql APIs
too.
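A minimal sketch of the new namespace (the exported symbol is illustrative):

(defpackage #:pgloader.load
  (:use #:cl
        #:pgloader.sources      ; the common source API
        #:pgloader.connection   ; connection classes and protocol
        #:pgloader.pgsql)       ; the PostgreSQL target support
  (:export #:copy-database))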
A little problem gave birth to quite a massive patch. As it happens when
refactoring and cleaning-up the dirt in any large enough project, right?
See #748.
Refactor the file organisation further to allow for adding a “direct stream”
option when the on-error-stop behavior has been selected. This currently
happens by default for database sources.
Introduce the new WITH option “on error resume next”, which forces the
classic behavior of pgloader. The option “on error stop” already existed;
its implementation is new.
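For example (connection details are illustrative):

load database from mysql://user@localhost/dbname
     into pgsql://user@localhost/dbname
     with on error resume next;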
When this new behavior is activated, the data is sent to PostgreSQL
directly, without intermediate batches being built. It means that the whole
operation fails at the first error, and we don't have any information in
memory to try replaying any COPY of the data. It's gone.
This behavior should be fine for database migrations as you don't usually
want to fix the data manually in intermediate files, you want to fix the
problem at the source database and do the whole dance all-over again, up
until your casting rules are perfect.
This patch might also bring some performance benefits in terms of both
timing and memory usage, though local testing didn't show much of anything
for the moment.
Copy some code over from cl-postgres-trivial-utf-8 and add support for
PostgreSQL COPY escaping right at the same place, allowing us to allocate
our formatted utf-8 buffer only once, with the escaping already applied.
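A minimal sketch of the escaping rules involved, leaving out the buffer
management:

(defun write-copy-escaped (char stream)
  "Write CHAR to STREAM, escaped as the COPY text format expects."
  (case char
    (#\\       (write-string "\\\\" stream))
    (#\Newline (write-string "\\n"  stream))
    (#\Tab     (write-string "\\t"  stream))
    (#\Return  (write-string "\\r"  stream))
    (t         (write-char char stream))))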
This patch was expected to be more about performance, but it seems to be
only about code cleanup, as it doesn't make a big difference in the testing
I could do here.
That said, getting rid of one intermediate buffer should be nice in terms of
memory management.
The COPY format and batch facilities are no longer the meat of our
PostgreSQL support in the src/pgsql directory, so have them live in their
own space.
The website is moving to pgloader.org and readthedocs.io is going to be
integrated. Let's see what happens. The docs build fine locally with the
sphinx tools and the docs/Makefile.
Having separate files for the documentation should ease maintenance and make
it easier to add new topics, such as Common Lisp hackers level docs, which
are currently missing.
SQLite being very liberal with type names (I think it actually accepts
anything and everything), our simple approach of tokenizing the input and
discarding noise words is not enough.
In this patch, we implement a new light parser for the SQLite type names to
better cope with noise words and random spacing in the catalog values that
SQLite failed to normalize. Well, apparently it didn't even attempt to.
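A few made-up examples of the kind of input the new parser copes with:

integer  (11)               ->  integer
DOUBLE   PRECISION          ->  double precision
character  varying ( 15 )   ->  varchar(15)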
Fix #548.
In passing, add the identifiers case option to SQLite support, which makes
it easier to test here, and add a table named "TableName" to our local test
database.
Fix #631.
Due to errors in regression testing when using CCL, review this part of
pgloader. It turns out that cl-log:stop-messenger on a text-stream-messenger
closes the stream, which isn't a good idea when given *standard-output*.
At least it makes CCL choke when it then wants to output something of its
own, such as when running in --batch mode (which is nice because it outputs
more diagnostic information).
To solve that problem, initialize the text-stream-messenger with a broadcast
stream made from *standard-output*, which we may now close at will.
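A minimal sketch of the trick (the cl-log messenger setup is simplified
here):

(let ((log-stream (make-broadcast-stream *standard-output*)))
  ;; closing LOG-STREAM later leaves *standard-output* itself intact
  (cl-log:start-messenger 'cl-log:text-stream-messenger
                          :stream log-stream))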
The parser files don't depend on the sources; it's the other way round
nowadays. Also, the responsibility to decipher the *sections* context should
be restricted to the monitor.lisp file, which is now the case.
And this time, fix #628 for real.
The implementation follows the PostgreSQL specifications as closely as
possible, with the escaping rules and the matching rules. The default paths
where to find the .pgpass file (or pgpass.conf on Windows) are as documented
in PostgreSQL too. Only the file permissions check is missing.
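For reference, each .pgpass line holds five colon-separated fields, and a
literal colon or backslash in a field is escaped with a backslash:

hostname:port:database:username:password
localhost:5432:mydb:dbuser:sup\:ersecret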
Fix #460.
It used to be that you would give the target table name as an option to the
PostgreSQL connection string, which is distasteful:
load ... into pgsql://user@host/dbname?tablename=foo.bar ...
Or even, for backwards compatibility:
load ... into pgsql://user@host/dbname?foo.bar ...
The new syntax makes provision for a separate clause for the target table
name, possibly schema-qualified:
load ... into pgsql://user@host/dbname target table foo.bar ...
Which is much better, in particular when used together with the target
columns clause.
Implementing this seemingly quite small feature had an impact on many
parsing related features of pgloader, such as the regression testing
facility. So much so that some extra refactoring made its way in here,
around the lisp-code-for-loading-from-<source> functions and their usage in
`load-data'.
While at it, this patch greatly simplifies the `load-data' function by
making good use of &allow-other-keys and :allow-other-keys t.
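A hedged sketch of the trick (loader-for-type is an illustrative stand-in
for the lisp-code-for-loading-from-<source> machinery):

(defun load-data (&rest args &key source target type &allow-other-keys)
  (declare (ignore source target))
  ;; pass every keyword argument through: each source-specific loader
  ;; picks the keys it knows, :allow-other-keys t silences the rest
  (apply (loader-for-type type) :allow-other-keys t args))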
Finally, this patch splits main.lisp into main.lisp and api.lisp, with the
latter intended to contain functions for Common Lisp programs wanting to use
pgloader as a library. The API itself is still the same as before this
patch, though. Just in another file for clarity.
This allows bypassing SSL when you don't need it, for instance over
localhost. It takes the same syntax as the PostgreSQL sslmode connection
string parameter.
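For example:

load ... into pgsql://user@localhost/dbname?sslmode=disable ...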
Use a generic function protocol to implement the human readable, verbose,
csv, copy and json reporting output formats. This is much cleaner and more
extensible than the previous way.
Use that new power to implement a real JSON output from the internal state
object.
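A minimal sketch of the protocol (the state accessors are illustrative):

(defgeneric report-summary (format state stream)
  (:documentation "Render the load summary found in STATE to STREAM."))

(defmethod report-summary ((format (eql :csv)) state stream)
  ;; one line per table: label, rows, errors, seconds
  (loop :for entry :in (state-entries state)
        :do (format stream "~a;~a;~a;~a~%"
                    (entry-label entry) (entry-rows entry)
                    (entry-errors entry) (entry-seconds entry))))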
This feature has been asked for several times, and I can't see any way to
fix the GETENV parsing mess that we have. In this patch the GETENV support
is retired and replaced with a templating system, using the Mustache syntax.
To get back the GETENV feature, our implementation of the Mustache template
system adds support for fetching the template variable values from the OS
environment.
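As a hedged example (the variable name is invented), a command template can
now read:

load csv from '{{CSVDIR}}/data.csv' ...

with {{CSVDIR}} resolved from the provided template variables or, to cover
the GETENV use case, from the OS environment.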
Fixes #555, fixes #609.
See #500, #477, #278.
This change was long overdue. Ideally we would use something like the YeSQL
library for Clojure, but it seems like the cl-yesql equivalent is not ready
yet, and it depends on an experimental build system...
So this patch introduces a URL abstraction built on top of a hash table.
You can then reference src/pgsql/sql/list-all-columns.sql as
(sql "pgsql/list-all-columns.sql")
in the source code directly.
So for now the templating system is CL's format language. It is still an
improvement over embedded strings. Again, one step at a time.
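A minimal sketch of the abstraction (names are illustrative):

(defparameter *sql-queries* (make-hash-table :test 'equal)
  "Query strings, keyed by their url-like names, registered at build time.")

(defun sql (url &rest args)
  "Return the query registered under URL, formatted with ARGS."
  (apply #'format nil (gethash url *sql-queries*) args))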
When loading data into an existing PostgreSQL catalog, we DROP the
indexes for better performance of the data loading. Some of the indexes
are UNIQUE or even PRIMARY KEYS, and some FOREIGN KEYS might depend on
them in the PostgreSQL dependency tracking of the catalog.
We used to use the CASCADE option when dropping the indexes, which hid a
bug: if we exclude from the load tables with foreign keys pointing to tables
we target, then we would DROP those foreign keys because of the CASCADE
option, but fail to install them again at the end of the load.
To prevent that from happening, pgloader now queries the PostgreSQL
pg_depend system catalog to list the “missing” foreign keys and adds them to
our internal catalog representation, from which we know to DROP then CREATE
the SQL objects at the proper times.
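A hedged sketch of the kind of pg_depend query involved (the index name is
illustrative, and the actual query in pgloader differs):

SELECT c.conname, c.conrelid::regclass, pg_get_constraintdef(c.oid)
  FROM pg_depend d
       JOIN pg_constraint c ON c.oid = d.objid
 WHERE d.classid = 'pg_constraint'::regclass
   AND d.refclassid = 'pg_class'::regclass
   AND d.refobjid = 'public.company_pkey'::regclass
   AND c.contype = 'f';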
See #400 as this was an oversight in fixing this issue.
Also known as the ORM case: it happens that other tools are used to create
the target schema. In that case pgloader's job is to fill in the existing
target tables with the data from the source tables.
We still focus on load speed, and pgloader will now DROP the
constraints (Primary Key, Unique, Foreign Keys) and indexes before running
the COPY statements, and re-install the schema it found in the target
database once the data load is done.
This behavior is activated when using the “create no tables” option as
in the following test-case setup:
with create no tables, include drop, truncate
Fixes #400, for which I got a test-case to play with!
First, add indexes and foreign keys to the list of objects supported by the
shared catalog facility, where they were previously only found in the pgsql
schema specific package, for historical reasons.
Then also add to our catalog internal structures the notion of a trigger
and a stored procedure, allowing for cleaner advanced default values
support in the MySQL cast functions.
Now that we have a proper and complete catalog, review the pgsql module's
DDL output functions in terms of the catalog and rewrite the schema creation
support so that it takes direct benefit from our internal catalogs
representation.
In passing, clean up the code organisation of the pgsql target support
module to be easier to work with.
Next step consists of getting rid of src/pgsql/queries.lisp: this
facility should be replaced by the usage of a target catalog that we
fetch the usual way, thanks to the new src/pgsql/pgsql-schema.lisp file
and list-all-* functions.
That will in turn allow for an explicit step of merging the pre-existing
PostgreSQL catalog when it's been created by other tools than pgloader,
that is when migrating with the help of an ORM. See #400 for details.
MS SQL has a notion of a "filtered index" that matches the notion of a
PostgreSQL partial index: the index only applies to the rows matching
the index WHERE clause, or filter.
The WHERE clauses in both cases are limited to simple expressions over a
single row of a base table at a time, so we implement a limited WHERE clause
parser for MS SQL filters and a transformation routine to rewrite the clause
in PostgreSQL slang.
In passing, we transform the filter constants using the same transformation
functions as in the CAST rules, so that e.g. an MS SQL bit(1) value that got
transformed into a PostgreSQL boolean is properly translated, as in the
following example:
MS SQL: "([deleted]=(0))" (that's from the catalogs)
PostgreSQL: deleted = 'f'
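The translated filter then ends up in a partial index definition, along
these lines (table and column names invented):

CREATE INDEX idx_document ON document (customer_id) WHERE deleted = 'f';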
Of course the parser is still very badly tested, let's see what happens
in the wild now.
(Should) Fix #365.
The new ALTER TABLE facility allows acting on the tables found in the MySQL
database before the migration happens. In this patch the only provided
actions are RENAME TO and SET SCHEMA, which fixes #224.
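For example (table names invented):

ALTER TABLE NAMES MATCHING 'billing' RENAME TO 'invoicing',
ALTER TABLE NAMES MATCHING ~/audit/ SET SCHEMA 'archive'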
In order to be able to provide the same option to MS SQL users, we will have
to make it work at the SCHEMA level (ALTER SCHEMA ... RENAME TO ...) and
modify the internal schema struct so that the schema slot of our table
instances is a schema instance rather than its name.
Lacking an MS SQL test database and instance, the facility is not yet
provided for that source type.
It turns out that the MySQL catalog always stores default values as strings,
even when the column itself is of type bytea. In some cases, it's then
impossible to transform the expected bytea from a string.
In passing, move some code around to fix dependencies and make it
possible to issue log warnings from the default value printing code.
In the recent refactoring and improvements of parallelism, the index
creation would kick in before we knew that the data was done being copied
over to the target table.
Fix that by maintaining a writers-count hashtable and only starting to
create indexes when that count reaches zero, meaning all the concurrent
tasks started to handle the COPY of the data are now done.
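A hedged sketch of the bookkeeping (create-indexes stands in for the real
entry point, and the locking is elided):

(defun register-writer (table counts)
  (incf (gethash table counts 0)))

(defun writer-done (table counts)
  ;; the last COPY task to finish triggers the index creation
  (when (zerop (decf (gethash table counts)))
    (create-indexes table)))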
In order to share more code in between the different source types,
finally have a go at the quite horrible mess of anonymous data
structures floating around.
Having catalog and schema instances not only allows for code cleanup, but
will also allow implementing some bug fixes and wishlist items, such as
mapping tables from one schema to another.
Also, supporting database sources having a notion of "schema" (in between
"catalog" and "table") should get easier, including getting on par with
MySQL in the MS SQL support (materialized views have been asked for
already).
See #320, #316, #224 for references and a notion of progress being made.
In passing, also clean up the copy-databases methods for database source
types, so that they all use a fetch-metadata generic function and a
prepare-pgsql-database and a complete-pgsql-database generic function.
Actually, a single method does the job here.
The responsibility of introspecting the source to populate the internal
catalog/schema representation is now held by the fetch-metadata generic
function, which in turn will call the specialized versions of
list-all-columns and friends implementations. Once the catalog has been
fetched, an explicit CAST call is then needed before we can continue.
Finally, the fields/columns/transforms slots in the copy objects are
still being used by the operative code, so the internal catalog
representation is only used up to starting the data copy step, where the
copy class instances are then all that's used.
This might be refactored again in a follow-up patch.
Make batches of raw data straight from the reader output (map-rows) and
have the transformation worker focus on changing the batch content from
raw rows to copy strings.
Also review the organisation of responsibilities in the code, allowing
queue.lisp to move into utils/batch.lisp, renamed as its scope has been
reduced to only caring about preparing batches.
This came out of trying to have multiple workers concurrently process the
batches from the reader and feed the hardcoded 2 COPY workers, but that
failed for multiple reasons. All that's left as of now is this cleanup,
which seems to be on the faster side of things, which is always good.
Filter the list of tables we migrate directly in the SQLite query, avoiding
returning useless data. To do that, use the LIKE pattern matching supported
by SQLite, as the REGEXP operator is apparently only available when extra
extensions are loaded.
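The filtering boils down to something like the following query against
sqlite_master (a sketch, not the exact query):

SELECT tbl_name
  FROM sqlite_master
 WHERE type = 'table'
   AND tbl_name NOT LIKE 'sqlite_%';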
See #310 where filtering out the view still caused errors in the
loading.
When devising the common API, the first step was to implement specific
methods for each generic function of the protocol. It now appears that in
some cases we don't need the extra level of flexibility: each change of the
API has been systematically reported to all the specific methods, so just
use a single generic definition where possible.
In particular, introduce a new intermediate class for COPY subclasses,
allowing more common code to be shared in the method implementations rather
than having to copy/paste and maintain several versions of the same code.
It would be good to be able to centralize more code for the database
sources and how they are organized around metadata/import-data/complete
schema, but it doesn't look obvious how to do it just now.
After all, it's shared between the CSV command parsing and the Cast Rules
parsing. src/parsers/command-csv.lisp still contains lots of facilities
shared between the file based sources, and will need another series of
splits.
In order to later be able to have more worker threads sharing the
load (multiple readers and/or writers, maybe more specialized threads too),
have all the stats be managed centrally by a single thread. We already have
a "monitor" thread that gets passed log messages so that the output buffer
is not subject to race conditions; extend its use to also deal with
statistics messages.
In the current code, we send a message each time we read a row. In some
future commits we should probably reduce the messaging here to something
like one message per batch in the common case.
Also, as a nice side effect of the code simplification and refactoring, this
fixes #283, wherein the before/after sections of individual CSV files within
an ARCHIVE command were not counted in the reporting.
When parsing table names in the target URI, we are careful to split the
table and schema names and store them into a cons in that case. Not all
source methods got the memo; clean that up.
See #182 and #186, a pull request I am now going to be able to accept.
Also see #287, which should be helped by being able to apply #186.
As reported by the clisp maintainer (thanks jackdaniel!) when trying to load
pgloader, we had redundant labels function names in places. Get rid of those
by pushing newly found columns directly at the end of the list, avoiding the
bulky code that then reverses the complex anonymous data structure.
The Real Fix™ would be to define proper structures to hold all those
database catalog representations, but that's an invasive patch and now isn't
a good time to write it.
At least pgloader should load and run with clisp now.
As per the PostgreSQL documentation on connection strings, allow overriding
the main URI components in the options part, with a percent-encoded syntax
for parameters. This allows bypassing the main URI parser limitations as
seen in #199 (how to have a password start with a colon?).
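For example, assuming the option names mirror the main URI components, a
password starting with a colon can be spelled with the percent-encoded form
of the colon (%3A):

load ... into pgsql://host/dbname?user=dbuser&password=%3Asecret ...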
See:
http://www.postgresql.org/docs/9.3/interactive/libpq-connect.html#LIBPQ-CONNSTRING
It appears that db3 files are not limited to the ASCII character encoding
that they were designed with, so let's clue pgloader in about that.
This commit builds on cl-db3 commit 770cbe3526, and the pgloader Makefile
has been updated to momentarily fetch cl-db3 from GitHub rather than
Quicklisp, so that it's possible to enjoy the new feature immediately.
The PostgreSQL COPY format is not really CSV but something way easier to
parse. Funnily enough, parsing it as CSV is not that easy, so we add here a
special simple parser for the COPY format.
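For illustration (a made-up row set): columns are separated by tabs, NULL is
spelled \N, and the delicate characters are backslash-escaped (\t, \n, \\)
rather than quoted:

1	Hello, World!	\N
2	tab becomes \t	42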
It should also be quite useful to try loading reject data files from
pgloader again after manual fixing. It's still missing some documentation,
without any good excuse for that; will add soon.
That's the big refactoring patch I've been sitting on for too long.
First, refactor connection handling to use a uniform "connection"
concept (class and generic functions API) everywhere, so that the COPY
derived objects just use that in their :source-db and :target-db slots.
Given that, we don't need any messing around with *pgconn* and *myconn-* and
other special variables anywhere in the tree.
Second, clean up some oddities accumulated over time, where some parts
of the code didn't get the memo when new API got into place.
Third, fix any other oddity or missing part found while doing those
first two activities, it was long overdue anyway...
Now it's possible to parse a command to load data from MS SQL. The parser
was until now parsing all database URIs within the same common rule, and
that isn't possible anymore if we want to distinguish between source
databases right from the parser, which we actually want to do.
This patch also implements, in passing, fixes all over the place, including
to the transformation function float-to-string, which only happened to work
on double-float data.
Converting the table definitions (with type casting) seems to work. Also did
experiment a little with actually fetching some data... and had to edit the
cl-mssql driver, which is temporarily monkey patched.