The previous patch made regression failures obvious that had been hidden by
strange bugs with CCL.
One such regression was introduced in commit
ab7e77c2d00decce64ab739d0eb3d2ca5bdb6a7e, where we played with the complex
code generation for field projection; the following two cases weren't
cleanly processed anymore:
column text using "constant"
column text using "field-name"
In the first case we want to load a user-defined constant into the column,
in the second case we want to load the value of the field "field-name" into
the column: we just have different source and target names.
Another regression was introduced in the recent commit
01e5c2376390749c2b7041b17b9a974ee8efb6b2, where the create-table function
was called too early, before *pgsql-reserved-keywords* had been fetched. As
a consequence, table names weren't always properly quoted, as shown in the
test/csv-header.load file which targets a table named "group".
Finally, skip the test/dbf.load regression test when using CCL as this
environment doesn't have the necessary CP850 code page / encoding.
Due to errors in regression testing when using CCL, review this part of
pgloader. It turns out that cl-log:stop-messenger on a text-stream-messenger
closes the stream, which isn't a good idea when it was given
*standard-output*. At the least it makes CCL choke when it then wants to
output something of its own, such as when running in --batch mode (which is
nice because it outputs more diagnostic information).
To solve that problem, initialize the text-stream-messenger with a broadcast
stream made from *standard-output*, which we now may close at will.
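A minimal sketch of the idea (the messenger name and setup details here are
illustrative, not pgloader's actual code):

(let* ((stream (make-broadcast-stream *standard-output*))
       (messenger (cl-log:start-messenger 'cl-log:text-stream-messenger
                                          :name :console
                                          :stream stream)))
  ;; stopping the messenger closes STREAM, leaving *standard-output* intact
  (cl-log:stop-messenger messenger))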
The parser files don't depend on the sources; it's the other way around
nowadays. Also, the responsibility to decipher the *sections* context should
be restricted to the monitor.lisp file, which is now the case.
And this time, fix #628 for real.
It seems that when compiling with CCL in “batch” mode, that is, using
buildapp, the local symbol exporting facility didn't work at all. It needs
to be run at load time so that the compiler sees the symbols.
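A minimal, hypothetical sketch of the idea, with the symbol and package
names only for illustration: wrap the export so that it also runs at load
time, not only when the file is compiled.

(eval-when (:compile-toplevel :load-toplevel :execute)
  (export (intern "LOAD-DATA" :pgloader) :pgloader))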
Fix #628.
Google Cloud SQL instances now use the following format for the name of
their socket: <PROJECT-ID>:<REGION>:<INSTANCE_NAME>. We handle that by
allowing a colon in the socket directory name to be escaped by doubling it,
as in the username field. This also allows any character to be accepted in
the socket directory name, which is a good cleanup.
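Purely as an illustration (project, region and instance names made up), a
socket directory such as
myproject:us-central1:prod-instance
is then spelled with each colon doubled:
myproject::us-central1::prod-instance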
Fix #621.
The implementation follows the PostgreSQL specification as closely as
possible, with the escaping rules and the matching rules. The default paths
where to look for .pgpass (or pgpass.conf on Windows) are as documented by
PostgreSQL too. The only missing piece is the file permissions check.
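For reference, the documented line format, where a backslash escapes ':' or
'\' in any field:
hostname:port:database:username:password
db.example.com:5432:*:alice:s3cret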
Fix #460.
Some code was pasted twice in src/api.lisp, and a defstruct with no slots
isn't spelled the way I did it in previous patches. We use a defstruct with
no slots to define a hierarchy on which to dispatch our pretty-print
function.
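A minimal sketch of that spelling (type names hypothetical): slot-less
structures whose only purpose is to form a dispatch hierarchy.

(defstruct report)
(defstruct (text-report (:include report)))
(defstruct (json-report (:include report)))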
It used to be that you would give the target table name as an option to the
PostgreSQL connection string, which is untasteful:
load ... into pgsql://user@host/dbname?tablename=foo.bar ...
Or even, for backwards compatibility:
load ... into pgsql://user@host/dbname?foo.bar ...
The new syntax makes provision for a separate clause for the target table
name, possibly schema-qualified:
load ... into pgsql://user@host/dbname target table foo.bar ...
This is much better, in particular when used together with the target
columns clause.
Implementing this seemingly small feature had an impact on many
parsing-related features of pgloader, such as the regression testing
facility. So much so that some extra refactoring found its way in here,
around the lisp-code-for-loading-from-<source> functions and their usage in
`load-data'.
While at it, this patch simplifies the `load-data' function a lot by making
good use of &allow-other-keys and :allow-other-keys t.
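A minimal sketch of the keyword-forwarding pattern (function and keyword
names hypothetical):

(defun create-indexes (&key (concurrency 1))
  (format t "building indexes, concurrency ~d~%" concurrency))

(defun load-data (&rest args &key &allow-other-keys)
  ;; forward every keyword; :allow-other-keys t lets the callee simply
  ;; ignore the keys it doesn't declare
  (apply #'create-indexes :allow-other-keys t args))

(load-data :concurrency 4 :truncate t)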
Finally, this patch splits main.lisp into main.lisp and api.lisp, the latter
intended to contain functions for Common Lisp programs wanting to use
pgloader as a library. The API itself is still the same as before this
patch, though, just in another file for clarity.
In the previous commit we introduced support for database names including
spaces, which means that by default pgloader creates a target schema in
PostgreSQL with a space in its name. That works well as long as you always
double-quote the schema name, which pgloader does.
Now, in our internal catalogs, we keep the schema name double-quoted. And
when comparing that quoted schema name to the raw schema name from
PostgreSQL, they won't match, and pgloader tries to create the schema again:
ERROR Database error 42P06: schema "my sql" already exists
Fix the comparison to use the unquoted schema name, and fix #614 again: the
previous fix would only work the first time.
This allows SSL to be bypassed when you don't need it, for instance over
localhost. It takes the same syntax as the PostgreSQL sslmode connection
string parameter.
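For instance, reusing the connection string option syntax shown elsewhere in
this log, with the usual PostgreSQL values (disable, allow, prefer, require,
...):
load ... into pgsql://user@localhost/dbname?sslmode=disable ...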
In this commit we fail the guess faster, allowing a much larger sample to be
tested. The sample size is still hard-coded, but this time to 1000 lines.
Also add a test case, see #618.
Understanding the timings requires not only the number of rows copied into
each table but also how many bytes they represent. We now add that
information to the output.
The number of bytes presented is computed from the unicode representation we
prepare in pgloader for each row before sending it down to PostgreSQL.
Use a generic function protocol in order to implement the human-readable,
verbose, csv, copy and json reporting output formats. This is much cleaner
and more extensible than the previous way.
Use that new power to implement a real JSON output from the internal state
object.
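A minimal sketch of such a protocol (names hypothetical), dispatching on
slot-less structures like the ones described earlier:

(defstruct report)
(defstruct (text-report (:include report)))
(defstruct (json-report (:include report)))

(defgeneric report-summary (kind state)
  (:documentation "Print STATE in the output format selected by KIND."))

(defmethod report-summary ((kind text-report) state)
  (format t "total rows: ~a~%" state))

(defmethod report-summary ((kind json-report) state)
  (format t "{\"total-rows\": ~a}~%" state))

;; (report-summary (make-json-report) 42) prints {"total-rows": 42}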
It seems that SBCL still needs some help in deciding when to GC with very
large values. In a test case with a “data” column averaging 375 kB (up to
about 3 MB per datum), that help allows much larger batch size and prefetch
rows settings without entering lldb.
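Whether pgloader does exactly this is an assumption, but the kind of help
meant here is an explicit full collection between batches, for instance:
#+sbcl (sb-ext:gc :full t)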
The pgstate infrastructure already had lots of details about what's going
on; add to it the information about how many bytes are sent in every batch,
and use this information in the monitor when something long is happening to
display how many rows we have sent so far for this (supposedly) huge table,
along with bytes and speed (bytes per second).
Add a message every 20 batches so that the user knows it's still going on.
Also, in passing, fix some messages: the present tense is not precise enough
to tell whether the log refers to an event that is already done or one that
is about to start.
The verbosity is not that easy to adjust. Remove useless messages and add a
new one telling when the COPY of a table is done. As we might have to wait
for some time for indexes to be built, keep the CREATE INDEX lines. Also
keep the ALTER TABLE lines both for primary keys and foreign keys, again
because the user might have to wait for quite some time.
This feature has been asked for several times, and I can't see any way to
fix the GETENV parsing mess that we have. In this patch the GETENV support
is retired and replaced with a templating system, using the Mustache syntax.
To get back the GETENV feature, our implementation of the Mustache template
system adds support for fetching the template variable values from the OS
environment.
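For instance (the variable names here are hypothetical; their values are
looked up in the OS environment):

load database
     from mysql://root@localhost/{{MYSQL_DBNAME}}
     into pgsql:///{{PGDATABASE}}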
Fixes #555, Fixes #609.
See #500, #477, #278.
It is sometimes necessary to tweak MS SQL server parameters, such as the
textsize parameter, which allows fetching the whole set of bytes of a text
or binary column (not kidding).
Now it's possible to add such a line in the load file:
set mssql parameters textsize to '104857600'
Fixes #603.
This code path is exercised from the command line only, which means I don't
get to run it that often. And it's a pain to debug. So make it easier to run
`process-source-and-target` from the REPL.
Startup log messages could be lost because the monitor would be started but
not yet ready to process messages. Fix that by “warming up” the monitoring
thread, having it execute a small computation and, more importantly, waiting
for the result to be received back, blocking.
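A minimal sketch of the warm-up idea, assuming an lparallel channel like the
ones pgloader uses for its threads (details illustrative):

(let ((lparallel:*kernel* (lparallel:make-kernel 1)))
  (let ((channel (lparallel:make-channel)))
    (lparallel:submit-task channel (lambda () :ready))
    ;; block until the thread has answered: it is now ready for messages
    (lparallel:receive-result channel)))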
See #599 where parsing errors from a wrong URL were missed in the command
line output, quite disturbingly.
The previous patch fixed CREATE TYPE so that ENUM types are created in the
same schema as the table using them, but failed to update the DROP TYPE
statements to also target this schema...
Prior to this patch, pgloader wouldn't care about which schema it created
extra types in. Extra types are mainly ENUM and SET support from MySQL. Now,
pgloader creates those extra PostgreSQL ENUM types in the same schema as the
table using them, which is a sounder default.
MySQL enums are cast to PostgreSQL enum types just fine, but sometimes
that's not what the user wants. When we have a CAST rule for an ENUM column,
recognize that fact and respect the user's choice.
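A hypothetical cast clause for such a column (table and column names made
up) could read:
cast column film.rating to text drop typemod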
Fixes #608.
The spelling in SQLite for the default value is "current_date"; instruct
pgloader about that. This commit also adds a test case to our sqlite.db
unit tests database.
Fixes #607.
Now when building a bundle file for source distribution of pgloader, always
test it by building a binary image from the bundle tarball in a test
directory. Also make it easy to target "latest" Quicklisp distribution with
the following spelling:
make BUNDLEDIST=latest bundle
When distributing a pgloader bundle we're using the ql-dist facility. In a
recent commit we hand-picked the last known working distribution of
Quicklisp for pgloader. Make it easy to target the "latest" known
distribution or hard-code one from the Makefile or the bundle/ql.lisp file.
We use a CSV parser for the MySQL enum values, but the quote escaping wasn't
properly set up: MySQL quotes ENUM values with a single quote (') and uses
two of them ('') to escape a single quote found in the ENUM value itself.
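For instance (values made up), such a column definition uses '' for the
embedded quote:
enum('yes','no','don''t know')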
Fixes #597.
When the table is empty we get nil for min and max values of the id column.
In that case we don't compute a set of ranges and “cancel” concurrency
support for the empty table.
Fixes #596.
The with option “include drop” used to also apply to schemas, which is not
that useful, and is problematic when trying to DROP SCHEMA public, because
you might not be connected as the owner of that schema.
Even if we don't target the public schema by default, users can choose to do
so thanks to our ALTER SCHEMA ... RENAME TO ... command.
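For instance, to target the public schema anyway (source schema name made
up):
alter schema 'sakila' rename to 'public'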
Fixes #594.
As we know how many columns we expect from the input file, it's possible to
read a sample (10 lines as of this patch) and try many different
combinations of CSV reader parameters until we find one that works: it
returns the right number of fields.
It is still possible of course to specify parameters on the command line or
in a load file if necessary, but it makes the simple case even simpler. As
simple as:
pgloader file.csv pgsql:///pgloader?tablename=target
From a load file, as soon as pgloader can retrieve the schema of the target
table the source field list defaults to the target column list. Let's apply
the same rules to the command line.
It was only offered for SQLite without good reason really, and tests show
that it works as well with MySQL of course. Offer the option there too.
See 3eab88b1440a8166786e90b95f563d153e2ba4dc for details.